cube to wrap the object and aggregates information from
each surface of the cube to estimate the 3D position of human
joints, as shown in Fig. 1. More specifically, a fixed-size cube wrapping the human body in the main view is created first, with a fixed number of points distributed uniformly on each surface. The points on the same surface constitute an attention matrix. Our network then fuses the feature information from all views to compute the weight matrix of each attention matrix w.r.t. each joint. Finally, the joint positions are deduced as the sum of the element-wise products of all the attention matrices and their corresponding weight matrices. Within the model, feature maps are first extracted from the depth images by a two-phase backbone network; a multi-view fusion module then integrates the feature maps from different views using dynamic weights derived from a cross-similarity mechanism. Next, a weight distribution module simultaneously computes, for each surface, the weight matrix corresponding to its attention matrix for the final regression. The contributions of different attention points are not equal across joints; hence, for each joint, informative points (those with high weights) are used to regress its position, while non-informative points (those with low weights) are discarded.
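As an illustration of this final regression step, the following is a minimal sketch, assuming hypothetical tensor shapes (J joints, S cube surfaces, P points per surface) and that each attention matrix stores the 3D coordinates of its surface points; it is not the exact ACRNet implementation.

import torch

def regress_joints(anchor_xyz, weights):
    # anchor_xyz: (S, P, 3) 3D coordinates of the P uniformly distributed
    #             points on each of the S cube surfaces (attention matrices).
    # weights:    (J, S, P) per-joint weights from the weight distribution
    #             module, assumed normalized over all S * P points per joint.
    # Returns:    (J, 3) estimated joint positions, i.e. the sum of the
    #             element-wise products of attention and weight matrices.
    return torch.einsum('jsp,spc->jc', weights, anchor_xyz)

# Toy usage: 15 joints, 6 cube surfaces, a 9 x 9 grid of points per surface.
J, S, P = 15, 6, 81
anchor_xyz = torch.rand(S, P, 3)
weights = torch.softmax(torch.rand(J, S * P), dim=-1).view(J, S, P)
joints = regress_joints(anchor_xyz, weights)  # shape (15, 3)

In this reading, a point with near-zero weight for a joint contributes almost nothing to that joint's estimate, which is what makes it non-informative and effectively discarded.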
To validate our method, ACRNet is first tested on the ITOP
dataset. The results demonstrate that our method outperforms
state-of-the-art methods in the front-view setting and is on par with the best state-of-the-art method in the top-view setting. Moreover, ACRNet runs at 92.3 FPS on a single NVIDIA Tesla V100 GPU, enabling real-time operation. Furthermore, to verify the capability of our model in real rehabilitation scenarios, and thereby provide a technical foundation for the telemedicine platform,
we collect a new medical multi-view upper body movement
dataset (UBM) from 16 healthy subjects on the trunk support
trainer (TruST) [4], labeled by a Vicon infrared system.
Our model consistently outperforms the baseline [5] on this
dataset. Overall, the contributions of this manuscript are:
• ACRNet: A fully differentiable multi-view regression network based on depth images that estimates 3D human joint positions for telemedicine use.
• A new backbone structure and a dynamic multi-view fusion module, both of which improve the representation ability of our model.
• UBM: A Vicon-labeled multi-view upper body movement dataset for rehabilitation use, consisting of depth images collected from 16 healthy subjects.
II. RELATED WORKS
A. 3D HPE with Sensor-based Methods
Currently, clinical diagnosis and treatment using motion capture and pose estimation depend on the Vicon system because of its precision, but this system is unsuitable for telemedicine due to its expensive components and the difficulty of transporting it. Thus, sensor-based wearable equipment is used in telemedicine to capture patients' motion data. Li et al. [6]
use multiple inertial sensors attached to the lower limbs of
children with cerebral palsy to evaluate their motor abilities
and validate therapy effectiveness. Sarker et al. [7] infer the
complete upper body kinematics for rehabilitation applica-
tions based on three standalone IMUs mounted on wrists
and pelvis. Nguyen et al. [8] propose using optical linear
encoders and accelerometers to capture the goniometric
data of limb joints. As these methods interfere with patients' movement and some of their components are hard to calibrate, they can lead to inaccurate diagnoses, weakening their application value in telemedicine.
B. 3D HPE with Learning-based Methods
Learning-based HPE methods can be divided into machine learning and deep learning approaches. The former [9]–[12] usually transform the estimation problem into a classification problem by calculating the probability of the location of each joint. A serious drawback of these methods is their limited representation ability when the estimation task is complex. As a result, deep learning methods utilizing RGB or depth images have become mainstream in this field. RGB-image-based methods [5], [13]–[15] are intuitive and convenient. Nevertheless, their accuracy is relatively low due to the lack of spatial information.
With the popularity of depth cameras, depth-image-based
methods address this shortcoming. Guo et al. [16] propose
a tree-structured Region Ensemble Network to aggregate the
depth information. Kim et al. [17] estimate human pose by
projecting the depth and ridge data in various directions.
Qiu et al. [18] tackle the core problems of monocular HPE, such as self-occlusion and joint ambiguity, with an embedded fusion layer that merges features from different views. He et al. [19] extend this method with a Transformer that matches the given view with neighboring views along the epipolar line by calculating feature similarity to obtain the
final 3D features. Further, Moon et al. [20] and Zhou et al. [21] convert depth images into point clouds, which carry more explicit 3D information, to acquire precise 3D positions of the human body. Although point-cloud-based methods are accurate, they generate a plethora of parameters during execution, consuming more time and memory, which prevents them from running in real time. Consequently, considering the pros and cons of different data types, our work directly adopts depth maps as the model's input.
Inspired by the work [22], which exploits the global-local
spatial information from 2D anchor points, we introduce the
3D attention cube. Our attention points are created following
their method; however, we enhance the correlation between the three principal directions by facilitating the interaction of different cube surfaces, which eliminates the estimation bias of each individual surface.
This mutually constrained property improves the robustness
of our method.
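For concreteness, a minimal sketch of how such a cube of uniformly distributed attention points could be constructed is given below; the cube size, the grid resolution, and the cube_anchor_points helper are illustrative assumptions rather than the exact construction used in our method.

import torch

def cube_anchor_points(center, size=2.0, pts_per_side=9):
    # Uniformly distribute anchor points on the 6 surfaces of a fixed-size
    # cube centered at `center` (hypothetical parameters).
    # Returns a tensor of shape (6, pts_per_side**2, 3): one point grid
    # per cube surface.
    half = size / 2.0
    grid = torch.linspace(-half, half, pts_per_side)
    u, v = torch.meshgrid(grid, grid, indexing='ij')
    u, v = u.reshape(-1), v.reshape(-1)
    surfaces = []
    for axis in range(3):                  # x, y, z
        for sign in (-half, half):         # two opposite faces per axis
            pts = torch.empty(u.numel(), 3)
            pts[:, axis] = sign            # fixed coordinate of this face
            other = [a for a in range(3) if a != axis]
            pts[:, other[0]] = u
            pts[:, other[1]] = v
            surfaces.append(pts + torch.tensor(center))
    return torch.stack(surfaces)           # (6, P, 3)

# e.g. a cube centered roughly on the subject, about 2 m in front of the camera
anchors = cube_anchor_points(center=(0.0, 0.0, 2.0))

Because every joint is regressed from points on all six surfaces at once, the three principal directions constrain one another, which is the mutual-constraint property referred to above.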
III. METHODOLOGY
The workflow of our ACRNet is shown in Fig. 2. Given images captured simultaneously by two depth cameras, ACRNet first extracts a feature map for each view using the backbone network and then merges the feature maps from two