Wei Huo1 | Xiaodan Zhang1,2 | xxxxx1 |
xxxxxx1 | xxxxxxx1 | xxxxxxx1 |
1 School of Computer Technology and Applications, Qinghai University, Xining 810016, China |
2 Qinghai Provincial Laboratory for Intelligent Computing and Application, Xining 810016, China |
📝 arXiv | ⚙️ code
Remote sensing image super-resolution (SR) plays a critical role in recovering the information missing from low-resolution observations and reconstructing high-resolution (HR) imagery. However, traditional methods often struggle to balance the preservation of fine-grained local details with global structural consistency, and the effective fusion of multi-scale features remains a challenging problem.
To address these limitations, this paper proposes a novel multi-path generative adversarial network (SMGAN) tailored for remote sensing image super-resolution reconstruction. SMGAN integrates three heterogeneous feature extraction branches (a deep residual convolutional network, a constant-scale Swin Transformer, and an enhanced Mamba module) to model image representations at the local, regional, and global levels, respectively. This design enables comprehensive characterization of fine details, spatial structures, and contextual semantics. To further improve the quality of multi-branch feature fusion, we introduce a Residual Attention Module (RAM), which employs a two-stage mechanism to achieve effective coupling and selective enhancement between the main feature stream and the fused stream. Given the critical importance of edge structures and textural details in remote sensing imagery, we design a dual-discriminator architecture operating in both the image and gradient domains, and propose a structure-aware gradient loss function to better preserve edge sharpness and fine textures during reconstruction.
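As a rough illustration of the gradient-domain idea, a structure-aware gradient loss can be approximated as an L1 penalty between the Sobel gradient magnitudes of the reconstruction and the ground truth. The sketch below is a minimal assumption about its general shape, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def image_gradients(x: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude of a batch of images (B, C, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = x.shape[1]
    gx = F.conv2d(x, kx.expand(c, 1, 3, 3), padding=1, groups=c)  # horizontal gradients
    gy = F.conv2d(x, ky.expand(c, 1, 3, 3), padding=1, groups=c)  # vertical gradients
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def gradient_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """L1 distance between gradient maps of the SR output and the HR target."""
    return F.l1_loss(image_gradients(sr), image_gradients(hr))
```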
Extensive experiments on the self-built high-resolution remote sensing SR dataset RS-SR19 and the public land-use classification dataset AID demonstrate that SMGAN surpasses a range of state-of-the-art SR methods in both traditional quantitative metrics (e.g., PSNR and SSIM) and subjective visual quality. Notably, the model achieves mean LPIPS scores of approximately 0.344 on RS-SR19 and 0.357 on AID, indicating superior perceptual fidelity and detail restoration. Furthermore, on real-world remote sensing data from the complex terrain of the Sanjiangyuan region in Qinghai, SMGAN exhibits remarkable structural consistency and textural continuity, remains robust under cloud occlusion, and reaches a peak PSNR of around 36 dB, highlighting its strong generalization and resilience.
This flowchart shows the overall architecture of SMGAN, from low-resolution input images to high-resolution output images.
The generator of SMGAN consists of three parts. The first is the three-branch feature extraction module, which models the input image and its gradient map at the local (DRSE), regional (CSSM), and global (M4X) levels; the three branches complement one another and effectively enhance the overall quality of the reconstruction. The second is the two-stage residual enhancement module, which uses the RAM mechanism to enable feature interaction across modalities and scales, improving the consistency and discriminability of the representation. The third is the feature fusion and image reconstruction module: multi-branch information is first aggregated through MERGE, and the final high-resolution image is then generated through ERT.
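As a rough guide to this three-part data flow, the following PyTorch skeleton is a sketch only: the module names (DRSE, CSSM, M4X, RAM, MERGE, ERT) come from the description above, while their internals, signatures, and exact wiring are assumptions.

```python
import torch.nn as nn

class SMGANGenerator(nn.Module):
    """Sketch of the three-part generator; branch internals are placeholders."""
    def __init__(self, drse, cssm, m4x, ram, merge, ert):
        super().__init__()
        self.drse, self.cssm, self.m4x = drse, cssm, m4x  # local / regional / global branches
        self.ram = ram      # two-stage residual enhancement (RAM)
        self.merge = merge  # multi-branch aggregation
        self.ert = ert      # reconstruction head producing the HR image

    def forward(self, lr_image, lr_gradient):
        # Part 1: three-branch feature extraction on the image and its gradient map.
        local_f = self.drse(lr_image, lr_gradient)
        regional_f = self.cssm(lr_image, lr_gradient)
        global_f = self.m4x(lr_image, lr_gradient)
        # Part 2: two-stage residual enhancement couples the feature streams.
        enhanced = self.ram(local_f, regional_f, global_f)
        # Part 3: MERGE aggregates the branches, ERT reconstructs the HR image.
        return self.ert(self.merge(enhanced))
```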
The three feature extraction modules of the first part (DRSE, CSSM, and M4X).
The Residual Attention Module (RAM).
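As an illustration of the two-stage mechanism described above (coupling of the main and fused streams, then selective enhancement), a RAM-style block might look like the following sketch; the channel-attention gating is an assumed design choice, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class RAM(nn.Module):
    """Illustrative two-stage residual attention block."""
    def __init__(self, channels: int):
        super().__init__()
        self.couple = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.attn = nn.Sequential(              # assumed channel-attention gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, main: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        coupled = self.couple(torch.cat([main, fused], dim=1))  # stage 1: coupling
        enhanced = coupled * self.attn(coupled)                 # stage 2: selective enhancement
        return main + enhanced                                  # residual connection
```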
The third part: feature fusion and image reconstruction, including pixel rearrangement.
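Pixel rearrangement is commonly realized as sub-pixel convolution (PixelShuffle); the minimal example below assumes a x4 scale factor and 64 feature channels, which may differ from the paper's settings.

```python
import torch
import torch.nn as nn

# A convolution expands the channels by r^2, then nn.PixelShuffle rearranges
# them into an r-times larger spatial grid.
r = 4  # assumed upscaling factor
upsample = nn.Sequential(
    nn.Conv2d(64, 3 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)

x = torch.randn(1, 64, 32, 32)  # fused feature map
print(upsample(x).shape)        # torch.Size([1, 3, 128, 128])
```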
Two datasets are used in the experiments: a self-constructed high-quality remote sensing image dataset named RS-SR19, and the widely used public remote sensing scene classification dataset AID (Aerial Image Dataset). Five mainstream evaluation metrics are adopted (PSNR, SSIM, LPIPS, RMSE, and SAM) to assess the model along multiple dimensions: reconstruction accuracy, structural consistency, perceptual quality, and spectral fidelity.
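Of these metrics, PSNR, RMSE, and SAM are simple enough to compute directly; the sketch below shows one common definition of each (SSIM and LPIPS are typically taken from existing implementations such as scikit-image and the lpips package, and the [0, 1] value range is an assumption).

```python
import torch

def rmse(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Root-mean-square error between SR output and HR target (values in [0, 1])."""
    return torch.sqrt(torch.mean((sr - hr) ** 2))

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB."""
    return 20 * torch.log10(max_val / rmse(sr, hr))

def sam(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """Mean spectral angle mapper (radians) over pixels of (B, C, H, W) tensors."""
    dot = (sr * hr).sum(dim=1)
    denom = sr.norm(dim=1) * hr.norm(dim=1) + 1e-8
    return torch.acos((dot / denom).clamp(-1, 1)).mean()
```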
(a) Visual comparison on the self-built dataset RS-SR19; (b) visual comparison on the public dataset AID; (c) visual results on the Qinghai Sanjiangyuan dataset.
(a) Ablation of the Residual Attention Module (RAM); (b) ablation of the weighting hyperparameters in the loss function; (c) ablation of the reconstruction-channel output factor; (d) ablation of the individual gradient loss terms.