
rized by the frequency bands with varying characteristics.
Zhao et al. [30] adopt Wi-Fi signals (2.4 GHz), which have
the unique ability to capture a person’s pose even when he
or she is standing behind a wall. Although [30] shows
excellent results on 2D pose estimation, the Wi-Fi sensors
it uses are proprietary designs.
On the other hand, Sengupta et al. [19] introduce another
type of RF signal, Frequency Modulated Continuous Wave
(FMCW) radar, whose frequency periodically sweeps from
77 GHz to 81 GHz and which can precisely detect the
depth (range) and the velocity of an object. Compared to
the Wi-Fi sensor used by [30], the mmWave radar sensor is
more economical and accessible, and it is commercially
available from many instrument providers [8]. The 3D HPE
results shown in [19] seem promising; however, [19] ignores
human body keypoints with high uncertainty, such as
wrists, because of their low prediction accuracy, which
indicates a limited capability to capture human poses using radar.
Most importantly, the datasets of both [30, 19] remain
inaccessible to the public, restricting further development
of RF-based HPE.
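As a rough illustration of why an FMCW sweep can resolve range and velocity finely, the sketch below evaluates the standard textbook relations: range resolution c / (2B) for a swept bandwidth B, and radial velocity f_d λ / 2 for a Doppler shift f_d. It assumes a single chirp covers the full 77 GHz to 81 GHz band; the function names are hypothetical and the numbers are not taken from [19] or from any specific sensor configuration.

```python
C = 299_792_458.0  # speed of light in m/s


def range_resolution(bandwidth_hz: float) -> float:
    """Smallest range separation a chirp of this bandwidth can resolve: c / (2B)."""
    return C / (2.0 * bandwidth_hz)


def radial_velocity(doppler_hz: float, carrier_hz: float) -> float:
    """Radial velocity implied by a Doppler shift: v = f_d * wavelength / 2."""
    wavelength = C / carrier_hz
    return doppler_hz * wavelength / 2.0


# Assumption: one chirp sweeps the whole 77-81 GHz band (4 GHz of bandwidth).
bandwidth = 81e9 - 77e9
print(f"range resolution: {range_resolution(bandwidth) * 100:.2f} cm")
```

Under that assumption the resolvable range step is a few centimetres, which is consistent with the claim that such a radar can measure depth precisely.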
To overcome the challenging issues encountered in
RGB-based and RF-based HPE, we introduce a new bench-
mark, Human Pose with mmWave Radar (HuPR). Unlike
[19], we additionally incorporate velocity information in
our dataset, since radar sensors can provide highly precise
velocity information. Meanwhile, we propose a Cross-
and Self-Attention Module (CSAM) to better fuse the multi-
scale features from horizontal and vertical radars and a 2D
pose confidence refinement network based on Graph Con-
volutional Network (PRGCN) to refine the confidence in the
output pose heatmaps. Our framework consists of two cascaded
components: 1) Multi-Scale Spatio-Temporal Radar Feature
Fusion with CSAM, which contains two branches that encode
temporal range-azimuth and range-elevation information
respectively, followed by a decoder that decodes the fused
features at every scale and predicts 2D pose heatmaps; and
2) PRGCN, which is applied to the output heatmaps to refine
the confidence of each keypoint based on a pre-defined
graph of the human skeleton. Our contributions are
threefold:
• We introduce a novel RF-based HPE benchmark,
HuPR, which features privacy-preserving data, eco-
nomical and accessible radar sensors, and handy hard-
ware setup. The dataset and implementation code will
be released upon paper acceptance.
• We propose a new radar pre-processing method that
better extracts velocity information from radar signals
to help RF-based HPE.
• We propose CSAM to relate the features from two different
radars for better feature fusion and PRGCN to
refine the confidence of each keypoint, especially to
improve the precision of the faster moving edge keypoints,
such as wrists. Experimental results and ablation
studies show that our proposed method makes significant
improvement over RF-based 2D HPE methods
and 3D pointcloud-based methods.
Figure 2: Examples of actions in our dataset, including
standing with fixed actions, standing with waving hands,
and walking with waving hands.
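To make the idea of relating features from two radars more concrete, the following is a minimal NumPy sketch of generic scaled dot-product cross-attention followed by self-attention. It is not the CSAM architecture itself: the learned projections, multi-scale structure, and tensor layout are omitted, and the names `attention` and `fuse` as well as the token and channel sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query, key, value):
    """Scaled dot-product attention on (tokens, channels) arrays."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value


def fuse(horizontal, vertical):
    """Cross-attend each radar branch to the other, then self-attend the result."""
    # Cross-attention: each branch uses its own features as queries and
    # the other branch's features as keys and values.
    h_cross = attention(horizontal, vertical, vertical)
    v_cross = attention(vertical, horizontal, horizontal)
    # Concatenate along channels (both branches have the same token count here).
    fused = np.concatenate([h_cross, v_cross], axis=-1)
    # Self-attention over the fused tokens.
    return attention(fused, fused, fused)


rng = np.random.default_rng(0)
h = rng.standard_normal((16, 8))  # assumed: 16 tokens, 8 channels (horizontal radar)
v = rng.standard_normal((16, 8))  # assumed: 16 tokens, 8 channels (vertical radar)
out = fuse(h, v)
print(out.shape)  # (16, 16)
```

A trained module would place learned linear projections before each query, key, and value; they are left out here so that the data flow between the two branches stays visible.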
2. Related Work
2.1. RGB-based HPE
There have been extensive studies on RGB-based HPE.
In general, these works can be split into two categories:
regression-based methods and heatmap-based methods.
Traditional regression-based methods [15, 27, 10] map input
sources to the coordinates of body keypoints via an end-to-end
neural network. Regression-based solutions are
straightforward but less attractive, since it is more difficult
for a neural network to map image features to just a few
keypoint coordinates. On the other hand, heatmap-based
methods [22, 11, 14, 6, 3] generally outperform
regression-based methods and dominate the field of HPE.
Heatmap-based HPEs produce likelihood heatmaps for each
keypoint as the target of pose estimation.
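The heatmap formulation can be sketched as follows, assuming the common choice of a 2D Gaussian centred on the ground-truth keypoint as the training target and a simple argmax for decoding. The heatmap size, `sigma`, and the function names are illustrative assumptions and are not taken from any of the cited methods.

```python
import numpy as np


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Likelihood map that peaks at the keypoint location (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def decode_keypoint(heatmap):
    """Return (x, y, confidence) at the location of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row), float(heatmap[row, col])


# One heatmap per keypoint; here a single keypoint at x=20, y=41.
hm = gaussian_heatmap(64, 64, cx=20, cy=41)
x, y, conf = decode_keypoint(hm)
print(x, y, conf)  # 20 41 1.0
```

The peak value doubles as a per-keypoint confidence, which is the quantity a later refinement stage can adjust.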
2.2. RF-based HPE
RF-based data are often used to deal with simpler hu-
man sensing tasks, such as activity recognition [20, 21],
gesture recognition [24, 13] and human object detection
[5, 23]. Channel State Information (CSI) data were the main
RF sources in the early days, but they do not provide range or
distance information. With the development of economical
radio sensors, the estimation of range and angle of arrival
becomes feasible with affordable devices, allowing more
detailed and complicated tasks like HPE to be conducted on
RF-based data [30, 19, 18, 4]. Zhao et al. [30] utilize WiFi-
ranged FMCW signals with the ability to generate the 2D