IEEE Access (Jan 2025)
DBTU-Net: A Dual Branch Network Fusing Transformer and U-Net for Skin Lesion Segmentation
Abstract
Dermoscopy, as a non-invasive diagnostic tool, plays a significant role in the early diagnosis of skin cancer and in improving patient survival rates. However, the complexity of skin lesion regions, the ambiguity of their boundaries, and issues such as hair occlusion pose challenges for skin lesion segmentation. Models based on convolutional neural networks (CNNs) and Transformers are currently widely used for segmenting skin lesion regions. However, CNN-based models struggle to model long-range dependencies, while Transformer-based models tend to pay less attention to local information, resulting in lower boundary accuracy. To address these issues, we propose a dual branch network fusing Transformer and U-Net (DBTU-Net). DBTU-Net uses an attention dense U-Net to capture local features and a vision Transformer (ViT) to model long-range dependencies and the local contribution score of the image, thereby extracting both local and global features comprehensively. The attention dense U-Net includes a triple fusion attention module that extracts features across the height, width, and channel dimensions, helping the U-Net capture interdependencies between channel and spatial locations. In addition, we apply channel and spatial fusion attention after the attention dense U-Net to fuse channel and spatial information, further enhancing the CNN branch's ability to capture long-range dependencies. DBTU-Net achieves accuracies of 0.9680, 0.9647, and 0.9623 on the ISIC-2017, ISIC-2018, and PH2 datasets, respectively, demonstrating strong generalization capability and superior performance in segmenting skin lesion regions.
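The dual-branch idea described above — a CNN path for local features and a Transformer path for long-range dependencies, fused into one representation — can be illustrated with a minimal, framework-free sketch. This is not the DBTU-Net architecture itself (the paper's attention dense U-Net, triple fusion attention, and channel/spatial fusion attention are far richer); it is a hypothetical NumPy toy showing only the generic pattern of concatenating a local (sliding-window) branch with a global (self-attention) branch over a sequence of patch tokens. All function names and shapes here are illustrative assumptions.

```python
import numpy as np

def self_attention(x):
    # Global branch: single-head scaled dot-product self-attention.
    # x: (n_tokens, d). Every token attends to every other token,
    # which is how the Transformer branch models long-range dependencies.
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def local_window(x, k=3):
    # Local branch: a sliding-window average as a stand-in for convolution.
    # Each output token only sees its k-neighborhood, mimicking a CNN's
    # limited receptive field.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(x.shape[0])])

def dual_branch(x):
    # Fuse local and global features along the channel axis.
    return np.concatenate([local_window(x), self_attention(x)], axis=-1)

tokens = np.random.default_rng(0).normal(size=(16, 8))  # 16 patch tokens, 8 channels
fused = dual_branch(tokens)
print(fused.shape)  # (16, 16): 8 local + 8 global channels per token
```

In a real segmentation network, the concatenated features would then pass through further attention and decoder stages to produce a per-pixel lesion mask; here the sketch stops at the fusion step.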
Keywords