IEEE Access (Jan 2025)
SVPDSA: Selective View Perception Data Synthesis With Annotations Using Lightweight Diffusion Network
Abstract
The generation of high-quality annotated image datasets with low computational cost and automated labeling is essential for advancing computer vision systems. However, manual labeling of real images is labor intensive and expensive. To overcome these challenges, we propose SVPDSA, a generic dataset generation model that incorporates residual and attention block pruning, reduced sampling steps, and network quantization. The model comprises two main training phases: refined LDM (Latent Diffusion Model) retraining and P-Decoder (Perception Decoder) training. A compressed UNet-based diffusion model, pre-trained on the LAION-5B dataset, serves as the foundation for efficient text-to-image synthesis. The model is trained for approximately 50,000 iterations with a learning rate of 0.0001, yielding lightweight yet effective generation. SVPDSA efficiently generates diverse synthetic images with high-quality perception annotations: it extends text-guided image synthesis to perception data generation, preserving the quality of the generated datasets while offering a flexible solution for label generation. A decoder module is introduced to expand the latent code features and generate labeled annotations for tasks such as semantic segmentation, instance segmentation, and depth estimation. Training the decoder requires fewer than 100 manually labeled images, enabling the creation of an arbitrarily large annotated dataset. Evaluation on the Cityscapes dataset demonstrates that SVPDSA matches or surpasses existing methods such as Mask2Former and DatasetDM on key object classes, including cars, buses, and bicycles. It achieves a mean IoU of 42.7 with ResNet-50 and 41.4 with Swin-B using only 9 real images and 38k synthetic samples, demonstrating its efficiency in generating high-quality annotations with minimal real data. Deploying the proposed models on edge devices results in an inference time of less than 5 seconds. This research contributes toward building resource-efficient data generation systems suitable for constrained training environments.
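The abstract describes a decoder that expands latent features from the compressed diffusion UNet into dense perception annotations. The paper's exact architecture is not reproduced here; the following is a minimal PyTorch sketch of that idea, assuming the backbone exposes multi-scale latent features. All module names, channel widths, and the 19-class Cityscapes segmentation head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a perception-decoder head in the spirit of SVPDSA's
# P-Decoder phase. Shapes, names, and hyperparameters are assumptions
# for illustration; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDecoderSketch(nn.Module):
    """Fuses multi-scale diffusion latents into dense prediction maps."""
    def __init__(self, in_channels=(320, 640, 1280), num_classes=19):
        super().__init__()
        # 1x1 convolutions project each latent scale to a common width.
        self.proj = nn.ModuleList(nn.Conv2d(c, 256, 1) for c in in_channels)
        self.fuse = nn.Sequential(
            nn.Conv2d(256 * len(in_channels), 256, 3, padding=1),
            nn.GroupNorm(32, 256),
            nn.SiLU(),
        )
        self.seg_head = nn.Conv2d(256, num_classes, 1)  # semantic segmentation
        self.depth_head = nn.Conv2d(256, 1, 1)          # monocular depth

    def forward(self, feats, out_size):
        # Upsample every scale to the finest feature resolution, then fuse.
        target = feats[0].shape[-2:]
        up = [F.interpolate(p(f), size=target, mode="bilinear",
                            align_corners=False)
              for p, f in zip(self.proj, feats)]
        x = self.fuse(torch.cat(up, dim=1))
        seg = F.interpolate(self.seg_head(x), size=out_size,
                            mode="bilinear", align_corners=False)
        depth = F.interpolate(self.depth_head(x), size=out_size,
                              mode="bilinear", align_corners=False)
        return seg, depth

# Toy usage: latents at 1/8, 1/16, and 1/32 of a 512x512 image.
feats = [torch.randn(1, 320, 64, 64),
         torch.randn(1, 640, 32, 32),
         torch.randn(1, 1280, 16, 16)]
decoder = PDecoderSketch()
seg_logits, depth = decoder(feats, out_size=(512, 512))
print(seg_logits.shape, depth.shape)  # [1, 19, 512, 512], [1, 1, 512, 512]
```

In the setup the abstract outlines, such a head would be trained on fewer than 100 labeled images while the frozen diffusion backbone supplies the latent features, after which every synthesized image comes with annotations for free.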
Keywords