IEEE Access (Jan 2025)
SVPDSA: Selective View Perception Data Synthesis With Annotations Using Lightweight Diffusion Network
Abstract
The generation of high-quality annotated image datasets with low computational cost and automated labeling is essential for advancing computer vision systems. However, manual labeling of real images is labor intensive and expensive. To overcome these challenges, we propose SVPDSA, a generic dataset generation model that incorporates residual and attention block pruning, reduced sampling steps, and network quantization. The model comprises two main training phases: refined LDM (Latent Diffusion Model) retraining and P-Decoder (Perception Decoder) training. A compressed UNet-based diffusion model, pre-trained on the LAION-5B dataset, serves as the foundation for efficient text-to-image synthesis. The model is trained for approximately 50,000 iterations with a learning rate of 0.0001, yielding lightweight yet effective generation. SVPDSA efficiently generates diverse synthetic images with high-quality perception annotations: it extends text-guided image synthesis to perception data generation, preserving the quality of the generated datasets while offering a flexible solution for label generation. A decoder module is introduced to expand the latent code features and generate labeled annotations for tasks such as semantic segmentation, instance segmentation, and depth estimation. Training the decoder requires fewer than 100 manually labeled images, enabling the creation of an arbitrarily large annotated dataset. Evaluation on the Cityscapes dataset demonstrates that SVPDSA matches or surpasses existing methods such as Mask2Former and DatasetDM on key object classes, including cars, buses, and bicycles. It achieves a mean IoU of 42.7 with ResNet-50 and 41.4 with Swin-B using only 9 real images and 38k synthetic samples, demonstrating its efficiency in generating high-quality annotations with minimal real data. Deploying the proposed models on edge devices results in an inference time of less than 5 seconds. This research contributes toward building resource-efficient data generation systems suitable for constrained training environments.
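The abstract describes a decoder that expands latent features from the compressed diffusion UNet into dense perception annotations. The paper's exact architecture is not reproduced here; the following is a minimal PyTorch sketch of that idea, assuming the backbone exposes multi-scale latent features. All module names, channel widths, and the 19-class Cityscapes segmentation head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a perception-decoder head in the spirit of SVPDSA's
# P-Decoder phase. Shapes, names, and hyperparameters are assumptions
# for illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDecoderSketch(nn.Module):
    """Fuses multi-scale diffusion latents into dense prediction maps."""
    def __init__(self, in_channels=(320, 640, 1280), num_classes=19):
        super().__init__()
        # 1x1 convolutions project each latent scale to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in in_channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(256 * len(in_channels), 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.SiLU(),
        )
        self.seg_head = nn.Conv2d(256, num_classes, 1)  # semantic segmentation
        self.depth_head = nn.Conv2d(256, 1, 1)          # monocular depth

    def forward(self, feats, out_size):
        # Upsample every scale to the finest feature resolution, then fuse.
        target = feats[0].shape[-2:]
        up = [F.interpolate(p(f), size=target, mode="bilinear",
                            align_corners=False)
              for p, f in zip(self.proj, feats)]
        x = self.fuse(torch.cat(up, dim=1))
        seg = F.interpolate(self.seg_head(x), size=out_size,
                            mode="bilinear", align_corners=False)
        depth = F.interpolate(self.depth_head(x), size=out_size,
                              mode="bilinear", align_corners=False)
        return seg, depth

# Toy usage: latents at 1/8, 1/16, and 1/32 of a 512x512 image.
feats = [torch.randn(1, 320, 64, 64),
         torch.randn(1, 640, 32, 32),
         torch.randn(1, 1280, 16, 16)]
decoder = PDecoderSketch()
seg_logits, depth = decoder(feats, out_size=(512, 512))
print(seg_logits.shape, depth.shape)  # [1, 19, 512, 512], [1, 1, 512, 512]
```

In the setup the abstract outlines, such a head would be trained on fewer than 100 labeled images while the frozen diffusion backbone supplies the latent features, after which every synthesized image comes with annotations for free.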
Keywords