cube to wrap the object and aggregates information from
each surface of the cube to estimate the 3D position of human
joints, as shown in Fig. 1. More specifically, a fixed-size cube wrapping the human body in the main view is created first, with a fixed number of points distributed uniformly on each surface. The points on the same surface constitute an attention matrix. Our network then fuses the feature information from all views to compute the weight matrix of each attention matrix w.r.t. each joint. Finally, the joint positions are deduced as the sum of the element-wise products of all the attention matrices and their corresponding weight matrices. Within the model, feature maps are first extracted from the depth images by a two-phase backbone network; a multi-view fusion module then integrates the feature maps from different views using dynamic weights derived from a cross-similarity mechanism. Next, a weight distribution module simultaneously computes, for each surface, the weight matrix corresponding to its attention matrix for the final regression. The contributions of different attention points are not equal across joints; hence, for each joint, informative points (those with high weights) are used to regress its position, while non-informative points (those with low weights) are discarded.
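As an illustration of this final regression step, the following is a minimal sketch, assuming hypothetical tensor shapes (J joints, S cube surfaces, P points per surface) and that each attention matrix stores the 3D coordinates of its surface points; it is not the exact ACRNet implementation.

import torch

def regress_joints(anchor_xyz, weights):
    # anchor_xyz: (S, P, 3) 3D coordinates of the P uniformly distributed
    #             points on each of the S cube surfaces (attention matrices).
    # weights:    (J, S, P) per-joint weights from the weight distribution
    #             module, assumed normalized over all S * P points per joint.
    # Returns:    (J, 3) estimated joint positions, i.e. the sum of the
    #             element-wise products of attention and weight matrices.
    return torch.einsum('jsp,spc->jc', weights, anchor_xyz)

# Toy usage: 15 joints, 6 cube surfaces, a 9 x 9 grid of points per surface.
J, S, P = 15, 6, 81
anchor_xyz = torch.rand(S, P, 3)
weights = torch.softmax(torch.rand(J, S * P), dim=-1).view(J, S, P)
joints = regress_joints(anchor_xyz, weights)  # shape (15, 3)

In this reading, a point with near-zero weight for a joint contributes almost nothing to that joint's estimate, which is what makes it non-informative and effectively discarded.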
To validate our method, ACRNet is first tested on the ITOP
dataset. The results demonstrate that our method outperforms
state-of-the-art methods in the front-view setting and is on par with the best state-of-the-art method in the top-view setting. Moreover, ACRNet runs at 92.3 FPS on a single NVIDIA Tesla V100 GPU, enabling real-time operation. Furthermore, to verify the capability of our model in real rehabilitation scenarios, and thereby provide a technical foundation for the telemedicine platform,
we collect a new medical multi-view upper body movement
dataset (UBM) from 16 healthy subjects on the trunk support
trainer (TruST) [4], labeled by a Vicon infrared system.
Our model consistently outperforms the baseline [5] on this
dataset. Overall, the contributions of this manuscript are:
• ACRNet: A fully differentiable multi-view regression network based on depth images that estimates 3D human joint positions for telemedicine use.
• A new backbone structure and a dynamic multi-view fusion module, both of which improve the representation ability of our model.
• UBM: A Vicon-labeled multi-view upper body movement dataset for rehabilitation use, consisting of depth images collected from 16 healthy subjects.
II. RELATED WORKS
A. 3D HPE with Sensor-based Methods
Currently, clinical diagnosis and treatment using motion capture and pose estimation depend on the Vicon system because of its precision, but this system is unsuitable for telemedicine due to its expensive components and the difficulty of transporting it. Thus, sensor-based wearable equipment is used in telemedicine to capture patients' motion data. Li et al. [6]
use multiple inertial sensors attached to the lower limbs of
children with cerebral palsy to evaluate their motor abilities
and validate therapy effectiveness. Sarker et al. [7] infer the
complete upper body kinematics for rehabilitation applica-
tions based on three standalone IMUs mounted on wrists
and pelvis. Nguyen et al. [8] propose using optical linear
encoders and accelerometers to capture the goniometric
data of limb joints. As these methods interfere with patients' movement and some of their components are hard to calibrate, they can lead to inaccurate diagnoses, weakening their application value in telemedicine.
B. 3D HPE with Learning-based Methods
Learning-based HPE methods can be divided into machine learning and deep learning approaches. The former [9]–[12] usually transform the estimation problem into a classification problem by calculating the probability of the location of each joint. A serious drawback of these methods is their limited representation ability when the estimation task is complex. As a result, deep learning methods utilizing RGB or depth images have become mainstream in this field. RGB-image-based methods [5], [13]–[15] are intuitive and convenient. Nevertheless, their accuracy is relatively low due to the lack of spatial information.
With the popularity of depth cameras, depth-image-based
methods address this shortcoming. Guo et al. [16] propose
a tree-structured Region Ensemble Network to aggregate the
depth information. Kim et al. [17] estimate human pose by
projecting the depth and ridge data in various directions.
Qiu et al. [18] tackle the core problems of monocular HPE, such as self-occlusion and joint ambiguity, with an embedded fusion layer that merges features from different views. He et al. [19] extend this method with a Transformer that matches the given view with neighboring views along the epipolar line by calculating feature similarity to obtain the
final 3D features. Further, Moon et al. [20] and Zhou et al. [21] convert depth images into point clouds, which carry more explicit 3D information, to acquire precise 3D positions of the human body. Although point-cloud-based methods are accurate, they generate a plethora of parameters during execution, consuming more time and memory, which prevents them from running in real time. Consequently, considering the pros and cons of different data types, our work directly adopts depth maps as the model's input.
Inspired by the work [22], which exploits the global-local
spatial information from 2D anchor points, we introduce the
3D attention cube. Our attention points are created following
their method; however, we enhance the correlation between the three principal directions by facilitating the interaction of different cube surfaces, which eliminates the estimation bias of each individual surface.
This mutually constrained property improves the robustness
of our method.
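For concreteness, a minimal sketch of how such a cube of uniformly distributed attention points could be constructed is given below; the cube size, the grid resolution, and the cube_anchor_points helper are illustrative assumptions rather than the exact construction used in our method.

import torch

def cube_anchor_points(center, size=2.0, pts_per_side=9):
    # Uniformly distribute anchor points on the 6 surfaces of a fixed-size
    # cube centered at `center` (hypothetical parameters).
    # Returns a tensor of shape (6, pts_per_side**2, 3): one point grid
    # per cube surface.
    half = size / 2.0
    grid = torch.linspace(-half, half, pts_per_side)
    u, v = torch.meshgrid(grid, grid, indexing='ij')
    u, v = u.reshape(-1), v.reshape(-1)
    surfaces = []
    for axis in range(3):                  # x, y, z
        for sign in (-half, half):         # two opposite faces per axis
            pts = torch.empty(u.numel(), 3)
            pts[:, axis] = sign            # fixed coordinate of this face
            other = [a for a in range(3) if a != axis]
            pts[:, other[0]] = u
            pts[:, other[1]] = v
            surfaces.append(pts + torch.tensor(center))
    return torch.stack(surfaces)           # (6, P, 3)

# e.g. a cube centered roughly on the subject, about 2 m in front of the camera
anchors = cube_anchor_points(center=(0.0, 0.0, 2.0))

Because every joint is regressed from points on all six surfaces at once, the three principal directions constrain one another, which is the mutual-constraint property referred to above.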
III. METHODOLOGY
The workflow of our ACRNet is shown in Fig. 2. Given images captured simultaneously by two depth cameras, ACRNet first extracts a feature map for each view using the backbone network and then merges the feature maps from two