IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2025)
Spatial–Spectral Hierarchical Multiscale Transformer-Based Masked Autoencoder for Hyperspectral Image Classification
Abstract
Due to its excellent feature extraction capabilities, deep learning has become the mainstream method for hyperspectral image (HSI) classification. The transformer, with its powerful long-range relationship modeling ability, has become a popular model; however, it usually requires a large amount of labeled data for parameter training, which may be costly and impractical for HSI classification. Accordingly, based on self-supervised learning, this article proposes a spatial–spectral hierarchical multiscale transformer-based masked autoencoder (SSHMT-MAE) for HSI classification. First, after spatial–spectral feature embedding with a spatial–spectral feature extraction module, a grouped window attention module is introduced to process only the visible patches of HSIs during spatial–spectral reconstruction, addressing the increased computational complexity caused by filling in invisible patches in the traditional masked autoencoder (MAE) and avoiding unnecessary computation on masked patches. After that, a spatial–spectral hierarchical transformer is designed to build a hierarchical MAE structure, followed by a cross-feature fusion module that extracts multiscale spatial–spectral fusion features. This design not only helps the whole model learn fine-grained local spatial–spectral features within each region but also captures the long-range dependencies between different regions, generating rich multiscale spatial–spectral features with high-level semantic and low-level detail information for HSI classification. Extensive experiments on five public HSI datasets demonstrate the superiority of the proposed SSHMT-MAE model over several state-of-the-art methods.
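The efficiency claim above rests on the standard MAE idea of encoding only visible tokens. As a minimal illustrative sketch (not the authors' implementation; the function name `random_masking` and the ViT-style patch-embedding input are assumptions), the following PyTorch snippet shows how a random subset of patch tokens is kept and passed to the encoder while masked tokens are dropped:

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens, MAE-style (hypothetical sketch).

    patches: (B, N, D) sequence of patch embeddings.
    Returns the visible tokens, a binary mask (1 = masked), and the
    indices needed to restore the original token order for decoding.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)  # random score per patch
    ids_shuffle = torch.argsort(noise, dim=1)        # lowest scores are kept
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :n_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )

    mask = torch.ones(B, N, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)        # back to original order
    return visible, mask, ids_restore
```

Because the encoder attends only over `visible` (a fraction 1 − mask_ratio of the tokens), its self-attention cost shrinks roughly quadratically with the mask ratio, which is the motivation for processing visible patches alone during reconstruction.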
Keywords