Seeing Through Clouds: Improving Robustness of SatML For Land Use Monitoring - AI for Climate Impact
Becky Xu
Final project for 6.7960 | Fall 2024 MIT
Outline

Introduction

Background

Methods

Results

Conclusion

Introduction

The integration of machine learning (ML) and satellite imagery, often termed Satellite Machine Learning (SatML), is rapidly transforming our ability to understand the Earth’s surface and address urgent global challenges. From deforestation monitoring and agricultural productivity assessments to disaster response and urban planning, SatML provides a geospatial lens through which we can observe dynamic patterns at unprecedented scales and frequencies. High-resolution satellite data, now increasingly available from platforms like Sentinel-2 and Landsat, enables automated, near-real-time analysis. Such capabilities are crucial for stakeholders who must act swiftly to manage resources, preserve ecosystems, and respond to environmental crises (Zhu et al., 2017; Hansen et al., 2013).

Side-by-side panels to show land use/land cover (LULC) change from 2014-2018. Screenshot from Dynamic LULC Change, Chesapeake Conservancy taken 25-October-2022.

Optical satellite imagery, however, is frequently obscured by cloud cover, which leaves gaps in the observations these applications depend on. In response, researchers are exploring numerous strategies. Some turn to radar sensors (e.g., Sentinel-1) that can penetrate clouds, ensuring continuous data availability. Others develop advanced time-series analysis methods, cloud removal algorithms, or data fusion approaches that integrate optical and radar imagery.

As for other challenges in remote sensing, improvements in machine learning are also promising. Self-supervised and semi-supervised learning methods can help overcome limited annotated datasets, while domain adaptation techniques tackle the problem of distribution shifts caused by varying imaging conditions. Transfer learning—where a model is pre-trained on a large, generic dataset and then fine-tuned for a specific task—has emerged as a particularly powerful technique. It mirrors human learning processes, where prior knowledge accelerates mastery of new tasks. For satellite imagery, leveraging a pretrained model can significantly reduce the training time and data requirements needed to achieve robust performance, especially when dealing with noisy, complex, and incomplete data.

This project explores how strategic masking—simulating cloud cover and other occlusions—can enhance the robustness of pretrained models for land cover classification tasks through fine-tuning.

Background

Beyond RGB: The Complexity of Satellite Machine Learning

While the successes of deep learning on natural images have inspired progress in SatML, direct transfers of these methods are rarely sufficient. Satellite imagery often extends beyond the familiar visible bands (red, green, blue) to include infrared and other spectral bands that provide valuable insights into vegetation health, water content, and mineral composition. This multimodal data adds complexity and richness, but it also means that models must handle inputs with fundamentally different spectral and spatial characteristics than those found in standard benchmarks like ImageNet. The sheer scale of satellite data, often reaching petabytes, is matched by its diversity, as images differ across seasons, sensor types, geographic regions, and atmospheric conditions (Rolf et al., 2024).

Moreover, the resulting distribution shifts challenge models to generalize beyond their training domains. Conventional evaluation metrics, designed for non-spatial tasks, may be insufficient to fully capture the complexity of geospatial predictions and their spatial dependencies (Rolf et al., 2024).

Data scarcity further complicates the scenario. While unlabeled satellite data is abundant, the acquisition of high-quality ground truth labels is expensive and time-consuming, often requiring field surveys or manual annotation by experts. This has fueled research into self-supervised and semi-supervised learning methods, as well as transfer learning approaches that leverage pre-trained models for downstream tasks. Transfer learning has shown promise in enabling rapid adaptation of models to new conditions or tasks, especially when annotated data is limited or heterogeneous (He et al., 2022; Berman et al., 2023).

Within this context, cloud occlusions represent a critical bottleneck. While radar sensors like Sentinel-1 can penetrate cloud cover, optical data remains essential for detailed spectral analysis. Developing robust methods to handle missing or corrupted optical data is therefore paramount. Approaches include multi-temporal composites (stacking multiple time points to fill in cloudy gaps), cloud detection and removal algorithms, data fusion from radar and optical imagery, and synthetic data augmentation to simulate cloud effects. Each strategy aims to maintain data quality and continuity, enabling the model to extract meaningful features without being derailed by irregularities or missing pixels.

Model and Data: PRESTO and EuroSAT

To study how strategic masking and other data simulation techniques can improve robustness, this work leverages the PRESTO model. PRESTO is a transformer-based architecture pre-trained on satellite data with a self-supervised strategy based on masked autoencoding (MAE). This approach helps the model learn meaningful representations of multispectral, multitemporal data without relying heavily on labeled examples. Although PRESTO is geared toward pixel time series, it can also be applied to single-timestep imagery, an essential quality for use cases constrained by limited temporal observations (Tseng et al., 2023).

This project focuses on land cover classification using the EuroSAT dataset (Helber et al., 2019). EuroSAT is a labeled collection of Sentinel-2 image patches covering a range of land use and land cover classes across Europe. These classes include agricultural areas, forests, urban zones, and water bodies, offering a well-rounded testbed for evaluating model performance under challenging conditions. The dataset has 10 classes, each containing between 2,000 and 3,000 images of 64×64 pixels. With approximately 27,000 images in total (across train, validation, and test splits), the model must learn to generalize across different landscapes, reflectance conditions, and environmental settings.

EuroSAT images are Sentinel-2 imagery with multiple spectral bands, each capturing reflectance at specific wavelengths. While the Red, Green, and Blue (RGB) bands approximate human vision, Sentinel-2 also offers Near-Infrared (NIR) and Shortwave-Infrared (SWIR) bands, along with red-edge and coastal aerosol bands. These extra bands provide insights into aspects of the landscape that are invisible in standard RGB. See the appendix for a list of different satellites, bands, and their applications.

The primary question driving this investigation is how different cloud-masking strategies affect the robustness of fine-tuned models. By simulating cloud occlusions and other forms of data degradation in the training data, the project aims to encourage the model to learn more generalized and resilient representations. Ultimately, this approach can yield improved performance on real-world tasks, where perfect imaging conditions are rarely guaranteed. In doing so, the project sheds light on practical methods for accelerating the adoption of SatML in time-sensitive applications, from deforestation alerts that inform conservation efforts to rapid assessments of flood extent that guide disaster response.

In summary, by merging advanced pretrained architectures like PRESTO with datasets like EuroSAT and carefully curated masking strategies, this project seeks to investigate what is possible in satellite-based land cover monitoring.

Methods

Objectives and Hypotheses

The primary objective of this work is to investigate how different masking strategies, designed to simulate cloud cover, affect a pretrained model’s ability to generalize and remain robust under noisy conditions. Specifically, the project aims to determine the masking percentage and mask type that push the model toward more effective feature learning, enabling it to handle occlusions and atmospheric artifacts commonly found in real-world satellite imagery. By systematically varying the masking parameters and evaluating the model’s performance on a land cover classification task, the project aims to uncover insights that can inform the design of more resilient SatML systems.

The central hypothesis is that a moderate level of masking (approximately 10%-30%) will yield the best results, and that a high reflectance mask will perform better than a data removal (null) mask. At this level, the model encounters enough partial occlusions to learn more generalizable representations without being overwhelmed by missing information. In other words, a modest amount of masking should force the model to develop a more holistic understanding of the input data, leading to improved robustness on subsequent test scenarios with varying degrees of cloud cover.

Pre-Training and Fine-Tuning Setup

Before experimenting with different masking strategies, I established a baseline fine-tuning configuration on the PRESTO pretrained model. Since my ultimate goal is to incorporate masking that simulates cloud cover conditions, I first defined a set of architectural and hyperparameter decisions based on preliminary experiments:

  1. Hyperparameters:
    • Loss function: A standard cross-entropy loss function for the land cover classification task. This choice is motivated by the simplicity and effectiveness of cross-entropy in supervised learning settings.
    • Optimization: Employed the Adam optimizer with a learning rate of 0.0003. Adam is a popular choice for deep learning tasks due to its adaptive learning rate properties and efficient convergence behavior.
    • Training schedule: Trained the model for 20 epochs on the EuroSAT dataset, with early stopping, using a batch size of 64. This configuration achieves a balance between training time and model convergence, ensuring that the model has sufficient exposure to the data.
    • Patch size: Chose a 16×16 patch size as a compromise between spatial granularity and computational efficiency.


  2. Fine-Tuning Head Complexity: I evaluated a one-layer versus a three-layer Multi-Layer Perceptron (MLP) as the classification head. The three-layer MLP head consistently outperformed the single-layer variant, providing a richer parameterization and improved capacity to translate learned representations into accurate class predictions.

  3. Mask Patterns: I compared Gaussian masks with random masks to determine which masking pattern better simulates realistic cloud cover. Gaussian masks, characterized by smoother transitions and shape continuity, more closely resembled natural cloud formations and yielded substantially better results than purely random masking patterns. This choice ensures that the occlusions introduced during training align more closely with the actual phenomena I aim to handle.

    Left: Random mask pattern | Right: Gaussian mask pattern


  4. Mask Values (Data Removal vs Reflectance Representation): To simulate cloud coverage, I tested two primary masking value strategies. First, a simple approach set masked pixels to an extreme value (RGB=-9999), effectively nullifying those regions. However, this abrupt contrast was less effective at simulating realistic cloud conditions, where light scattering typically renders clouded areas as brighter regions. A second approach, setting masked areas to near-white values, produced results more consistent with real-world cloud appearances and improved model performance. High reflectance masking is achieved by setting RGB values to a range between 0.7 and 1.0 to emulate varying degrees of cloud brightness. Both masking strategies were evaluated across different masking percentages to determine the optimal balance between occlusion and information retention.
    Left: High Reflectance Mask (value: 0.7~1) | Right: Null Mask (value: -9999)


  5. Freezing vs. Non-Freezing Pretrained Weights: I tested whether to freeze the pretrained model’s weights or allow them to update during fine-tuning. Unfreezing the pretrained weights and allowing the entire model to adapt to the masked training scenarios resulted in significantly better performance. This suggests that the pretrained features, while already robust, still benefit from fine-grained adjustments when exposed to my specialized data augmentations.
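
The two mask patterns compared in item 3 can be sketched as follows. This is a minimal illustration, not the project's actual code: building the Gaussian pattern by thresholding Gaussian-smoothed noise, the `sigma` value, and the function names `random_mask` and `gaussian_mask` are all assumptions made here for the example.

```python
import numpy as np


def random_mask(height, width, fraction, rng):
    """Mask independently chosen pixels until `fraction` of the image is covered."""
    n_masked = int(round(fraction * height * width))
    flat = np.zeros(height * width, dtype=bool)
    flat[rng.choice(height * width, size=n_masked, replace=False)] = True
    return flat.reshape(height, width)


def gaussian_mask(height, width, fraction, rng, sigma=6.0):
    """Mask the largest values of Gaussian-smoothed noise, giving blob-like regions.

    `sigma` (in pixels) is an assumed, tunable smoothing width.
    """
    noise = rng.standard_normal((height, width))
    # Separable 1-D Gaussian kernel, clamped so it never exceeds the image size.
    radius = min(int(3 * sigma), (min(height, width) - 1) // 2)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    # Smooth along columns, then along rows.
    smooth = np.apply_along_axis(np.convolve, 0, noise, kernel, mode="same")
    smooth = np.apply_along_axis(np.convolve, 1, smooth, kernel, mode="same")
    n_masked = int(round(fraction * height * width))
    mask = np.zeros(height * width, dtype=bool)
    if n_masked > 0:
        # Keep exactly n_masked pixels: the ones with the largest smoothed values.
        mask[np.argsort(smooth, axis=None)[-n_masked:]] = True
    return mask.reshape(height, width)


rng = np.random.default_rng(0)
cloud_like = gaussian_mask(64, 64, 0.3, rng)
speckle = random_mask(64, 64, 0.3, rng)
```

Both functions cover the same share of pixels, so any difference in downstream performance comes from the spatial structure of the occlusion rather than from the amount of missing data.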

Experiment 1: Multispectral Masking and Robustness

In my first experiment, I assessed how masking affects model robustness when using all available Sentinel-2 Bands (except channel 1 and 8A), resulting in 11 bands. This multispectral setup provides the model with a rich feature space that captures various environmental and atmospheric properties beyond the visible spectrum.

Mask Value Strategy: For this full-channel experiment, I set masked pixels to an invalid value (e.g., -9999) to indicate complete data removal. Brightness alteration was also tested. This approach allowed me to simulate not only cloud occlusions but also sensor dropout or data corruption.

Mask Percentages: I trained separate models with 0%, 10%, 30%, 50%, and 60% masked areas. Additionally, I introduced a random masking strategy (varying between 0% and 70%) during training. Each trained model was then tested on evaluation sets containing between 0% and 80% masked regions. This systematic approach enabled me to thoroughly analyze how varying levels of masking during training influence performance when confronted with a wide spectrum of occlusion intensities at test time.
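
The train/test sweep described above can be laid out as a small grid. The sketch below is illustrative only: the 10% step between test levels and drawing the random fraction once per training image are assumptions, and `training_fraction` and `evaluation_grid` are hypothetical helper names.

```python
import random

TRAIN_FRACTIONS = [0.0, 0.1, 0.3, 0.5, 0.6]    # fixed-percentage models from the text
RANDOM_RANGE = (0.0, 0.7)                       # range used by the random-masking model
TEST_FRACTIONS = [i / 10 for i in range(9)]     # 0% to 80%; the 10% step is an assumption


def training_fraction(config, rng):
    """Return the mask fraction to apply to one training image."""
    if config == "random":
        # Assumed here: a fresh fraction is drawn for every image.
        return rng.uniform(*RANDOM_RANGE)
    return config


def evaluation_grid():
    """Pair every training configuration with every test mask level."""
    configs = TRAIN_FRACTIONS + ["random"]
    return [(config, test) for config in configs for test in TEST_FRACTIONS]


grid = evaluation_grid()
print(len(grid), "train/test combinations")
```

Each pair in the grid corresponds to one point on the accuracy plots: a model trained under one masking configuration, evaluated at one test occlusion level.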

Experiment 2: RGB-Only Scenario

Although Sentinel-2 provides multispectral data, many legacy remote sensing models and operational systems rely on simple RGB inputs due to their broad availability and ease of interpretation. To evaluate how well PRESTO adapts to such constraints, I conducted a second experiment using only the RGB bands.

Mask Value Strategy for RGB: As in the multispectral experiment, I simulated cloud coverage with a high reflectance mask, sampling random values between 0.7 and 1.0 for the masked pixels. This creates a more realistic and varied occlusion pattern than a single high reflectance value. Setting masked pixels to an invalid value (e.g., -9999) was also tested to assess the impact of complete data removal.
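
A minimal sketch of the two mask value strategies is given below. It assumes the image is a `(bands, height, width)` float array already scaled to [0, 1] and that each masked pixel receives one random brightness shared by all bands; the function names are hypothetical and not taken from the project code.

```python
import numpy as np

NULL_VALUE = -9999.0  # invalid value used for the null mask


def apply_high_reflectance(image, mask, rng, low=0.7, high=1.0):
    """Replace masked pixels with bright values drawn from [low, high).

    image: (bands, H, W) float array, assumed scaled to [0, 1]
    mask:  (H, W) boolean array, True where the pixel is occluded
    """
    out = image.copy()
    # One brightness per masked pixel, broadcast across every band.
    values = rng.uniform(low, high, size=int(mask.sum()))
    out[:, mask] = values
    return out


def apply_null(image, mask):
    """Replace masked pixels with the invalid value in every band."""
    out = image.copy()
    out[:, mask] = NULL_VALUE
    return out


rng = np.random.default_rng(0)
rgb = rng.uniform(0.0, 0.5, size=(3, 64, 64))   # placeholder image, not real data
mask = np.zeros((64, 64), dtype=bool)
mask[:16, :16] = True                           # simple square occlusion for the demo
cloudy = apply_high_reflectance(rgb, mask, rng)
nulled = apply_null(rgb, mask)
```

Both functions return a copy, so the same clean image can be reused to build several masked variants at different occlusion levels.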

Training and Testing Under Occlusion: Similar to the multispectral experiment, I trained on varying levels of masked data and tested the resulting models against increasingly severe occlusions. By comparing the performance of the RGB-only model to the multispectral counterpart, I gained insights into how additional spectral information influences robustness and what trade-offs exist when limiting the data to visible bands.

Results

Overview of the Experiment

This experiment evaluates the performance of fine-tuned models using different masking strategies on Sentinel-2 satellite imagery. The models were trained on either all bands (analysis Part A) or only the RGB channels (analysis Part B), each with two masking approaches:

  • High Reflectance Mask: setting the masked pixels to a random number between 0.7 and 1 (same value across all bands).
  • Null Mask: setting the masked pixel values to -9999.

In total, 4 scenarios are studied.

Result Analysis Part A - Masking Strategies when all bands from Sentinel-2 are available

Overall Performance

The results show that both masking strategies yield relatively high performance, with mean F1-scores and accuracies above 0.7 for both approaches. However, the High Reflectance Mask consistently outperforms the Null Mask:

  • High Reflectance Mask: Mean F1-Score = 0.78, Mean Accuracy = 0.79
  • Null Mask: Mean F1-Score = 0.71, Mean Accuracy = 0.73



Here are the accuracy and F1-Score results for each trained model evaluated by the range of percentages of masked pixels:

Accuracy by Different Percent of Masking on Training - All Bands - High Reflectance Mask


Accuracy by Different Percent of Masking on Training - All Bands - Null Mask


Null masking likely introduces discontinuities in the data, reducing the model's ability to interpolate across masked regions and lowering overall accuracy relative to the high reflectance mask. Setting masked pixels to reflectance values (0.7 to 1) instead mimics the spectral properties of highly reflective surfaces (e.g., water, urban materials). This helps maintain continuity in the spectral data, enabling models to learn patterns in the context of realistic reflectance values, which may explain the consistently higher accuracy. In addition, the High Reflectance Mask may act as a form of augmentation, simulating reflective surfaces like snow or water. The Null Mask lacks this advantage.

Each plotted line is a model trained on a different percentage of masked pixels

Accuracy by Different Percent of Masking on Training - All Bands - High Reflectance Mask


Accuracy by Different Percent of Masking on Training - All Bands - Null Mask


This suggests that preserving some information in the masked areas (by using high reflectance values) is more productive for the model's learning process than completely nullifying the masked pixels.

Performance Across Classes

To understand how masking strategies affect class-specific performance, I analyzed the Accuracy and F1-scores for each class. The results reveal distinct patterns:

Accuracy by class - All Bands - High Reflectance Mask


Accuracy by class - All Bands - Null Mask


High Reflectance Classes (e.g., SeaLake, River): Both strategies perform well on these classes, with the High Reflectance Mask slightly ahead. High reflectance values align naturally with these classes, making the artificial reflectance of the High Reflectance Mask a better approximation than the Null Mask.

Low Reflectance Classes (e.g., Forest, Pasture): The Null Mask leads to greater inaccuracies in low-reflectance classes. Extreme outliers (-9999) disrupt learned feature distributions, making models less effective in predicting low-reflectance areas.

Transitional Classes (e.g., Herbaceous Vegetation, Residential): These classes are sensitive to both masking strategies, with the High Reflectance Mask showing better robustness. The High Reflectance Mask maintains spectral patterns for these mixed-use or heterogeneous classes, whereas the Null Mask’s outliers create ambiguities.

Highway Class: Both accuracy and F1 scores are consistently lower than for other classes. Highways are narrow, linear features often surrounded by heterogeneous land types, and the extreme outlier values (-9999) of the Null Mask make them even harder to predict. The improvement under the High Reflectance Mask is likely due to the continuity provided by reflectance-like values (0.7–1), which are closer to the spectral characteristics of concrete or asphalt.
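
For reference, class-wise scores of this kind can be computed from predictions as sketched below. This is an illustrative helper, not the project's evaluation code: it assumes "accuracy by class" means the share of each class's samples that were predicted correctly (i.e. recall), and `per_class_metrics` is a hypothetical name.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: {"accuracy": recall, "f1": F1}} from two label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1   # the true class was missed
            fp[pred] += 1    # the predicted class gained a false positive
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        support = tp[label] + fn[label]
        predicted = tp[label] + fp[label]
        recall = tp[label] / support if support else 0.0
        precision = tp[label] / predicted if predicted else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"accuracy": recall, "f1": f1}
    return metrics


# Toy labels for illustration only; these are not results from the experiments.
truth = ["Forest", "Forest", "River", "Highway"]
preds = ["Forest", "River", "River", "River"]
for label, scores in per_class_metrics(truth, preds).items():
    print(label, scores)
```

Averaging the per-class F1 values without weighting gives a macro F1, which treats rare classes such as Highway the same as common ones.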

Result Analysis Part B - Masking Strategies when only RGB bands are available

Overall Performance

The results show that high reflectance masking consistently outperforms null masking across all masking percentages. At 0% training mask (baseline), both strategies achieve similar accuracy (84.7%), but the performance gap widens as masking increases. Models trained with high reflectance masking retain higher robustness and accuracy compared to those trained with null masking, especially as test masking percentages grow.

Here are the accuracy and F1-Score results for each trained model evaluated by the range of percentages of masked pixels:

Accuracy by Different Percent of Masking on Training - RGB - High Reflectance Mask


Accuracy by Different Percent of Masking on Training - RGB - Null Mask


For High Reflectance Masking, performance degradation is gradual and controlled with increasing test masking percentages. Models maintain reasonable accuracy (>70%) even with 30% masked pixels in the test data if trained with at least 10% masking. When models are trained with up to 60% masked data, they still achieve >60% accuracy when tested under similar masking conditions, showcasing resilience. For Null Masking, performance drops steeply with increasing test masking percentages, reflecting a lack of adaptability. Accuracy falls below 50% when test masking exceeds 40%, indicating poor handling of masked data. This rapid deterioration suggests that the extreme negative values (-9999) used in null masking disrupt the model’s generalization ability.
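
Statements such as "accuracy stays above 70% up to 30% test masking" can be read off an accuracy curve with a small helper like the one below. The function name and the numbers in `example_curve` are placeholders invented for illustration; they are not measurements from this project.

```python
def max_tolerated_masking(curve, threshold):
    """Largest test mask fraction reached before accuracy first drops below threshold.

    curve: iterable of (test_mask_fraction, accuracy) pairs
    Returns None if accuracy is already below the threshold at the lowest fraction.
    """
    tolerated = None
    for fraction, accuracy in sorted(curve):
        if accuracy < threshold:
            break
        tolerated = fraction
    return tolerated


# Placeholder curve for illustration only; NOT results from the experiments.
example_curve = [(0.0, 0.80), (0.2, 0.75), (0.4, 0.65), (0.6, 0.45)]
print(max_tolerated_masking(example_curve, 0.70))
```

Applying the same threshold to every training configuration gives a single number per model, which makes gradual and steep degradation easy to compare.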

Each plotted line is a model trained on a different percentage of masked pixels

Accuracy by Different Percent of Masking on Training - RGB - High Reflectance Mask


Accuracy by Different Percent of Masking on Training - RGB - Null Mask


High reflectance masking enables better generalization to higher masking percentages, showing adaptability across varying test conditions. Among the fixed masking levels, training with 30-50% masked pixels offers the best balance between performance and robustness, allowing models to adapt to both low and high levels of test masking. Null masking, on the other hand, fails to generalize effectively to test conditions beyond the training mask percentage. The best performance overall is achieved when models are trained with a random percentage of masking. Exposure to a range of occlusion levels helps the model handle missing data more effectively, leading to improved generalization.

Performance Across Classes

As in Part A, to understand how masking strategies affect class-specific performance when only RGB bands are available, I analyzed the accuracy and F1-scores for each class. The results reveal distinct patterns:

Accuracy by class - RGB Bands - High Reflectance Mask


Accuracy by class - RGB Bands - Null Mask


High reflectance masking demonstrates superior class-wise performance, achieving better F1 scores across most land cover categories.

It is particularly effective for complex and nuanced classes like Residential and Industrial areas, which are challenging to classify.

In challenging categories such as Rivers and Highways, high reflectance masking maintains more stable accuracy and F1 scores, indicating its robustness in diverse scenarios.

The results strongly favor high reflectance masking as the superior strategy for handling masked pixels in satellite imagery classification tasks. It degrades gradually as masking increases, generalizes well to unseen test conditions, and maintains robust class-wise performance, which makes it a highly effective and practical choice compared to null masking.

The findings emphasize the importance of realistic masking strategies, such as high reflectance masking, for improving the robustness and accuracy of machine learning models in remote sensing applications.

Conclusion, Limitations & Discussions

The experiments suggest that a High Reflectance Masking strategy offers notable benefits in model robustness and generalization when fine-tuning a pretrained SatML model for a land use classification task. By preserving some spectral information in occluded regions (mimicking realistic cloud reflectance patterns rather than nullifying those pixels completely), the model can maintain higher accuracy across various training and testing conditions. Contrary to the initial hypothesis that 10-30% masking would work best, the results show that moderate to substantial masking levels (e.g., around 50%) or training with a randomly sampled range of masking intensities (0–70%) can outperform minimal masking (0–10%). This indicates that controlled exposure to a range of occlusions during training encourages the model to develop more flexible and generalizable features.

The results from the RGB-only experiments further reinforce the advantages of high reflectance masking over null masking. While both strategies performed similarly in the absence of masked pixels, their differences became clear as occlusions increased. High reflectance masking provided a smoother, more controlled degradation in accuracy and consistently outperformed null masking across a range of training and testing mask percentages. Models trained with this approach proved more adaptable, maintaining reasonable accuracy even at high masking levels, and exhibited stronger class-wise performance across diverse land cover types, including challenging categories like Residential, Industrial, Rivers, and Highways.

Overall, these findings underscore the importance of choosing more realistic, high reflectance masking strategies to improve model robustness, adaptability, and overall accuracy in remote sensing tasks. By employing masking methods that closely mimic real-world conditions, practitioners can enhance model performance and reliably scale satellite-based land cover classification to more challenging environments.

Limitations and topics for further study

Class-Specific Adaptations - Performance differences among classes, particularly the ongoing challenges in classifying Residential and Permanent Crop areas, highlight opportunities for class-specific strategies. Tailoring the masking levels or data augmentation techniques per class—or incorporating additional spatial and temporal context—may help overcome subtle spectral similarities and improve class-specific accuracy.

Enhancing Highway Detection - The model’s difficulty in accurately identifying highways, which often appear as narrow, linear features spanning only a few pixels, underscores a common limitation in satellite-based classification. Future work could reduce patch sizes, introduce edge-detection preprocessing steps, or integrate spatial convolutional layers to better capture linear patterns. These refinements may also address spectral overlaps and improve the model’s capacity to distinguish highways from urban or impervious surfaces.

Further Directions

Incorporating Real Cloud Data - While simulated masking provides controlled experimentation, integrating actual cloud-covered imagery could offer deeper insights. Using real-world conditions for training and testing might improve realism and better reflect operational constraints.

Advanced Cloud Handling Techniques - Beyond masking, a variety of additional strategies—such as multi-temporal composites, probabilistic cloud detection, and interpolation methods—merit exploration. Combining these techniques with High Reflectance Masking may yield models that are even more resilient to atmospheric interference.

Interpreting Model Decisions - Investigating why High Reflectance Masking outperforms null masking (potentially through attention maps, feature importance analyses, or saliency methods) could reveal how the model leverages subtle spectral cues. Such understanding can guide the design of future models and data augmentation strategies.

Refining the Modeling Pipeline

Finally, this project made me realize how important a well-organized and clearly documented codebase is for an open-source project. I spent a significant amount of time understanding the model and its code base. Streamlining the model’s implementation, clarifying code structures, and adding thorough documentation can make these advanced techniques more accessible to practitioners and researchers.

Through continued experimentation, more refined masking strategies, and careful attention to model interpretability and code clarity, I hope to advance towards more robust, scalable, and operationally useful satellite-based land cover classification systems.

References:

  • D. Kim, N. M. Rahman and S. Mukhopadhyay, "PRESTO: A Processing-in-Memory-Based k-SAT Solver Using Recurrent Stochastic Neural Network With Unsupervised Learning," in IEEE Journal of Solid-State Circuits, vol. 59, no. 7, pp. 2310-2320, July 2024, doi: 10.1109/JSSC.2024.3352585.
  • Rolf, E., Basu, S., Beery, S., Brandt, C., Choi, D., Efremova, N., ... & Yosinski, J. (2024). Mission Critical -- Satellite Data is a Distinct Modality in Machine Learning. arXiv preprint arXiv:2402.01444.
  • Zhu, X. X., Tuia, D., Mou, L., Xia, G. S., Zhang, L., Xu, F., & Fraundorfer, F. (2017). Deep Learning in Remote Sensing: A Comprehensive Review and List of Resources. IEEE Geoscience and Remote Sensing Magazine, 5(4), 8–36.
  • Hansen, M. C., Potapov, P. V., Moore, R., Hancher, M., Turubanova, S. A., Tyukavina, A., … & Townshend, J. R. G. (2013). High-Resolution Global Maps of 21st-Century Forest Cover Change. Science, 342(6160), 850–853.
  • Zhu, Z., & Woodcock, C. E. (2012). Object-based cloud and cloud shadow detection in Landsat imagery. Remote Sensing of Environment, 118, 83–94.
  • Helber, P., Bischke, B., Dengel, A., & Borth, D. (2019). EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7), 2217–2226.
  • He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
  • Berman, R. C. S., Li, W., Makarau, A., & Ghamisi, P. (2023). Self-Supervised Learning for Remote Sensing: An Introduction and Review of State-of-the-Art Methods. IEEE Geoscience and Remote Sensing Magazine (Early Access).
  • Tseng, G., Cartuyvels, R., Zvonkov, I., Purohit, M., Rolnick, D., & Kerner, H. (2023). Lightweight, Pre-trained Transformers for Remote Sensing Timeseries. arXiv preprint arXiv:2304.14065.

Appendix