Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data
Journal
arXiv preprint arXiv:2410.00469
Date Issued
2024-10-01
Author(s)
Abstract
Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation
applications, such as land cover mapping, urban planning, and environmental monitoring. However,
individual data sources often present limitations for this task. Very High Resolution (VHR) aerial
imagery provides rich spatial details but cannot capture temporal information about land cover
changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal
variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale
objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation
that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed
model consists of two independent deep learning branches. One branch extracts detailed textures
from the aerial imagery using a UNetFormer with a Multi-Axis Vision Transformer (MaxViT)
backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite
image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to
state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation
using multi-source optical imagery. The findings highlight the importance of multi-modality fusion
in improving the accuracy and robustness of semantic segmentation in remote sensing applications.
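To make the late-fusion design described above concrete, the following is a minimal sketch of the two-branch scheme. The branch modules are placeholders (the paper uses UNetFormer with a MaxViT backbone for aerial imagery and U-TAE for the Sentinel-2 time series); the class count, learnable fusion weights, and bilinear upsampling are illustrative assumptions, not details confirmed by the abstract.

```python
# Hedged sketch of the two-branch late-fusion idea from the abstract.
# aerial_branch / sits_branch are placeholder modules standing in for
# UNetFormer (MaxViT backbone) and U-TAE; fusion details are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionSegmenter(nn.Module):
    def __init__(self, aerial_branch: nn.Module, sits_branch: nn.Module,
                 num_classes: int = 13):
        super().__init__()
        self.aerial_branch = aerial_branch  # VHR aerial branch (e.g. UNetFormer)
        self.sits_branch = sits_branch      # SITS branch (e.g. U-TAE)
        # Learnable scalar weights for combining the two logit maps:
        # one simple late-fusion choice among several possible.
        self.w = nn.Parameter(torch.ones(2))

    def forward(self, aerial: torch.Tensor, sits: torch.Tensor) -> torch.Tensor:
        # aerial: (B, C, H, W) VHR image; sits: (B, T, C', h, w) time series
        logits_a = self.aerial_branch(aerial)  # (B, K, H, W)
        logits_s = self.sits_branch(sits)      # (B, K, h, w), coarser grid
        # Upsample the coarse SITS logits to the VHR grid before fusing.
        logits_s = F.interpolate(logits_s, size=logits_a.shape[-2:],
                                 mode="bilinear", align_corners=False)
        w = torch.softmax(self.w, dim=0)
        # Weighted average of per-pixel class logits (late fusion).
        return w[0] * logits_a + w[1] * logits_s
```

Because the branches are trained and run independently until the final logit combination, each can keep the input format and resolution natural to its modality, which is the complementarity the abstract emphasizes.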
Subjects
File(s)
Name
2410.00469v1.pdf
Size
2.08 MB
Format
Adobe PDF
Checksum (MD5)
d4cb39bf8031e4abf1750eff6c023f14
