IEEE Access (Jan 2025)

Wi-Fi-Enabled Vision via Spatially-Variant Pose Estimation Based on Convolutional Transformer Network

  • Hyeon-Ju Lee,
  • Seok-Jun Buu

DOI
https://doi.org/10.1109/access.2025.3568505
Journal volume & issue
Vol. 13
pp. 84855–84869

Abstract

Wi-Fi-enabled vision offers a transformative paradigm for non-optical pose estimation, particularly in occluded or privacy-sensitive environments where traditional visual systems falter. Despite its promise, extracting reliable pose information from Wi-Fi Channel State Information (CSI) remains a formidable challenge due to spatial variability in torso localization, cross-view discrepancies, and inherent signal perturbations caused by multipath propagation and environmental noise. To address these challenges, we propose a Convolutional Transformer Network that integrates convolutional layers for localized spatial feature extraction with transformer layers for global temporal dependency modeling. This integrative design captures the spatiotemporal dynamics of CSI signals, enabling robust pose estimation under cross-view and spatially-variant conditions. Evaluated on the benchmark WIDAR 3.0 dataset, the proposed model outperforms the structural and sequential learning baseline CNN-GRU by 1.72% in accuracy, and it surpasses sequential models (RNN, GRU, LSTM) and image models (CNN, ViT) across all key metrics, demonstrating robust spatiotemporal modeling capability. These results underscore its advance in non-optical pose estimation and its practical applicability in real-world scenarios.
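
The CNN-then-Transformer design summarized above lends itself to a compact sketch. The PyTorch code below is a minimal illustration of that idea, not the authors' implementation: the class name ConvTransformerPoseNet, all layer sizes, the 3-antenna x 30-subcarrier CSI frame shape, and the classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvTransformerPoseNet(nn.Module):
    """Hypothetical sketch: a CNN front-end extracts localized spatial
    features from each CSI frame, and a Transformer encoder models
    global temporal dependencies across the frame sequence."""

    def __init__(self, n_antennas=3, n_subcarriers=30, d_model=128,
                 n_heads=4, n_layers=2, n_classes=6):
        super().__init__()
        # Convolutional layers: local spatial features per CSI frame
        # (antennas x subcarriers treated as a 2-D map).
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        self.proj = nn.Linear(64, d_model)
        # Transformer encoder: global temporal dependency modeling
        # over the sequence of per-frame feature vectors.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, csi):
        # csi: (batch, time, antennas, subcarriers)
        b, t, a, s = csi.shape
        x = csi.reshape(b * t, 1, a, s)      # fold time into batch
        x = self.conv(x).flatten(1)          # (b*t, 64) local features
        x = self.proj(x).reshape(b, t, -1)   # (b, t, d_model) sequence
        x = self.encoder(x)                  # temporal self-attention
        return self.head(x.mean(dim=1))      # pooled pose logits

# Usage: a batch of 8 CSI sequences, 100 time steps each.
model = ConvTransformerPoseNet()
logits = model(torch.randn(8, 100, 3, 30))
print(logits.shape)  # torch.Size([8, 6])
```

The key design point this sketch captures is the division of labor: convolution handles the fine-grained antenna/subcarrier structure within a frame, while self-attention relates frames across the whole sequence, which is what enables the cross-view and spatially-variant robustness the abstract claims.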

Keywords