1 Fast Beam Alignment via Pure Exploration in Multi-armed Bandits

2025-04-30 0 0 2.88MB 16 页 10玖币
侵权投诉
1
Fast Beam Alignment via Pure Exploration in
Multi-armed Bandits
Yi Wei, Zixin Zhong, and Vincent Y. F. Tan, Senior Member, IEEE
Abstract—The beam alignment (BA) problem consists in accu-
rately aligning the transmitter and receiver beams to establish a
reliable communication link in wireless communication systems.
Existing BA methods search the entire beam space to identify
the optimal transmit-receive beam pair. This incurs a significant
latency when the number of antennas is large. In this work, we
develop a bandit-based fast BA algorithm to reduce BA latency
for millimeter-wave (mmWave) communications. Our algorithm
is named Two-Phase Heteroscedastic Track-and-Stop (2PHT&S).
We first formulate the BA problem as a pure exploration problem
in multi-armed bandits in which the objective is to minimize the
required number of time steps given a certain fixed confidence
level. By taking advantage of the correlation structure among
beams that the information from nearby beams is similar and
the heteroscedastic property that the variance of the reward of an
arm (beam) is related to its mean, the proposed algorithm groups
all beams into several beam sets such that the optimal beam set
is first selected and the optimal beam is identified in this set after
that. Theoretical analysis and simulation results on synthetic and
semi-practical channel data demonstrate the clear superiority of
the proposed algorithm vis-`
a-vis other baseline competitors.
Index Terms—Beam alignment, beam selection, mmWave com-
munication, multi-armed bandits.
I. INTRODUCTION
In millimeter-wave (mmWave) communications, the beams
at both the transmitter and receiver are narrow directional.
The beam alignment (BA) problem consists in ensuring that
the transmitter and receiver beams are accurately aligned to
establish a reliable communication link [2], [3]. An optimal
transmitter (receiver) beam or a pair of transmitter and receiver
beam is required to be selected to maximize the overall
capacity, throughput, SNR or diversity. To achieve this goal, a
na¨
ıve exhaustive search method scans all the beam space and
hence causes significant BA latency. Specifically, there are
two fundamental challenges to implement BA: (1) the amount
of additional channel state information (CSI) corresponding
to each beam (pair) is large; (2) the frequency of the channel
measurement [4], [5] is high. These challenges become even
Y. Wei is with the College of Information Science and Electronic
Engineering, Zhejiang University and the Zhejiang Provincial Key Lab-
oratory of Information Processing, Communication and Networking (IP-
CAN), Hangzhou 310027, China (email: 21731133@zju.edu.cn). Z. Zhong
is with the Department of Computing Science, University of Alberta (email:
zzhong10@ualberta.ca). V. Y. F. Tan is with Department of Mathematics and
the Department of Electrical and Computer Engineering, National University
of Singapore (email: vtan@nus.edu.sg).
This work is funded by a Singapore Ministry of Education Tier 1 grant
(A-0009042-01-00), the Fundamental Research Funds for the Central Univer-
sities 226-2022-00195, and the Defense Industrial Technology Development
Program under Grant JCKY2020210B021.
This paper was presented in part at the 2022 International Symposium on
Information Theory [1].
more difficult to overcome when a large number of anten-
nas are employed at both of the transmitter and receiver.
Moreover, the change of antenna orientation, the effect of
transmitter/reciever mobility, and the dynamic nature of the
wireless channel will result in previously found optimal beam
pairs to be suboptimal over time, which further exacerbates
the latency problem [6], [7].
Therefore, the design of fast BA algorithms is of paramount
importance, and has stimulated intense research interest. By
utilizing the sparsity of the mmWave channel, Marzi et al.
[8] incorporated compressed sensing techniques in the BA
problem to reduce the beam measurement complexity. Wang
et al. [9] developed a fast-discovery multi-resolution beam
search method, which first probes the wide beam and continues
to narrow beams until the optimal arm is identified. However,
the beam resolution needs to be adjusted at each step. Besides,
various forms of side information, e.g., location information
[10], out-of-band measurements [11], and dedicated short-
range communication [12], have also been used to reduce the
required number of probing beams.
Due to its inherent ability to balance between exploration
and exploitation, multi-armed bandit (MAB) theory has been
recently utilized in wireless communication field, such as
channel access in cognitive network [13], channel allocation
[14] and adaptive modulation/coding [15], etc., and harnessed
to address the BA problems; see [16]–[22]. In a single-user
MIMO system, the work [16] employed the canonical upper
confidence bound (UCB) bandit algorithm in beam selection,
where the instantaneous full CSI is not required. The work [17]
applied linear Thompson sampling (LTS) to address the beam
selection problem. For mmWave vehicular systems, Sim et
al. [18] developed an online learning algorithm to address the
problem of beam selection with environment-awareness based
on contextual bandit theory. The proposed method explores
different beams over time while accounting for contextual
information (i.e., vehicles’ direction of arrival). Similarly,
Hashemi et al. [19] proposed to maximize the directivity gain
(i.e., received energy) of the BA policy within a time period
using the contextual information (i.e., the inherent correlation
and unimodality properties), and formulate the BA problem
as contextual bandits. Wu et al. [20] provided a method to
quickly and accurately align beams in multi-path channels for
a point-to-point mmwave system, which takes advantage of the
correlation structure among beams such that the information
from nearby beams is extracted to identify the optimal beam,
instead of searching the entire beam space. In [21], Va
et al. used a UCB-based framework to develop the online
learning algorithm for beam pair selection and refinement,
arXiv:2210.12625v1 [cs.IT] 23 Oct 2022
2
User
BS
Fig. 1: A mmWave massive MISO system system.
where the algorithm first learns coarse beam directions in some
predefined beam codebook, and then fine-tunes the identified
directions to match the peak of the power angular spectrum
at that position. Hussain et al. [22] also designed a novel
beam pair alignment scheme based on Bayesian multi-armed
bandits, with the goal of maximizing the alignment probability
and the data-communication throughput.
In this paper, different from existing methods, we first
formulate the BA problem as a pure exploration problem in the
theory of MABs, called bandit BA problem, with the objective
of minimizing the required time steps to find the optimal
beam pair with a prescribed probability of success (confidence
level). Second, we derive a lower bound on the expected
time complexity of any algorithm designed to solve the BA
problem. Third, we present a bandit-based fast BA algorithm,
which we name Two-Phase Heteroscedastic Track-and-Stop
(2PHT&S). We do so by exploiting the correlation among
beams and the heteroscedastic property that the variance of
the reward of an arm (beam) is linearly related to its mean.
Instead of searching over the entire beam space, the proposed
algorithm groups all beams into several beam sets such that
the optimal beam set is first selected and the optimal beam is
identified within this set during the second phase. We derive an
upper bound on the time complexity of the proposed 2PHT&S
algorithm. Finally, extensive numerical results demonstrate
that 2PHT&S significantly reduces the required number of
time steps to identify the optimal beam compared to existing
algorithms in both simulated and semi-practical channel data
scenarios.
The remainder of this paper is organized as follows. The
system model and problem setup are presented in Section II.
Section III proposes our bandit-based fast BA algorithm and
its time complexity upper bound. Simulation results are given
in Section IV, and Section V concludes this paper.
II. PROBLEM FORMULATION
In this section, we first describe the system model and
propose our problem setup, after that we provide a generic
lower bound of the time complexity of the problem.
A. System Model
As shown in Fig. 1, we consider a massive mmWave MISO
system, where a base station (BS) equipped with Ntransmit
antennas serves a single-antenna user. According to [23],
each transmitting beam is selected from a predefined analog
beamforming codebook Cof size K, which can be defined as
C,{fk=a(1+2k/K)|k= 0,1,2, . . . , K 1}
with a(·)denoting the array response vector. For a
typical uniform linear array, it holds that a(x) =
1
N1, ej2π
λdx, ej2π
λ2dx, . . . , ej2π
λ(N1)dxHCN×1,where
λis the wavelength and dis the antenna spacing. Note that if
KN, the beams in the codebook Care linearly independent
of one another. Based on channel measurement studies done
in [24], [25], the mmWave channel follows a multipath channel
model, and the number of propagation paths is small.
We consider a quasi-static channel, where the channel stays
unchanged for a period covariance interval which consists of
Ttime slots. As shown in Fig. 2, each covariance interval can
be divided into two phases, the BA phase (which consists of
TBtime slots) and the data transmission phase (which consists
of TD=TTBtime slots). With the selected beam f, the
effective achievable rate (EAR) of each covariance interval
Reff := 1TB
TDlog 1 + p|hHf|2
σ2
is widely accepted as a metric to measure the throughput
performance, where pand σ2represent the transmit power and
noise variance respectively. We observe that the throughput
performance increases with the decreasing TB/TDand in-
creasing |hHf|2. As such, the time allocated for selecting the
optimal beam f= arg maxf∈C |hHf|2should be minimized
for higher system throughput, which is the motivation of this
work.
In the BA phase, the BS selects one beam from the
codebook Cand transmits pilot signals to the user with this
beam. Without loss of generality, we set the pilot signal to
be 1. Then the received power can be expressed as
R(fk) = |phHfk+n|2=p|hHfk|2+|n|2+2p<(hHfkn),
where nrepresents the complex conjugate of the complex
Gaussian noise n∼ CN(0, σ2). Note that the BS can observe
the received power from the user through a feedback link, and
this assumption is also adopted in other studies such as [20],
[21]. We can see that R(fk)is a random variable which is
the sum of a Gamma random variable |n|2Γ(1,12)and
a Gaussian random variable p|hHfk|2+ 2p<(hHfkn)
N(p|hHfk|2,2p|hHfk|2σ2). Fig. 3 shows the average received
power associated with different beams over the codebook
space when K= 120,N= 64,p= 26 dBm and the power
of LoS path is around 3 dB larger than those of the NLoS
paths. It can be observed that all the received powers become
larger as the increase of the noise variance, and the improved
value is the difference between the noise variances. Since the
noise variance is much smaller than the transmit power, the
variable R(fk)is approximately a Gaussian random variable
with mean µk=p|hHfk|2and variance σ2
k= 2p|hHfk|2σ2,
i.e.,
rk=p|hHfk|2+ 2p<(hHfkn).(1)
The optimal beam is defined as the beam that has the largest
3
Coherence time: Channel stays constant
(𝑇time slots)
Beam Alignment
Phase Data Transmission
Phase Beam Alignment
Phase Data Transmission
Phase
(𝑇
𝐵time slots) (𝑇
𝐷time slots)
Fig. 2: Transmission scheme.
0 20 40 60 80 100 120
Beam index
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Average Reward/Recieved signal power (W)
2 = 0 dBm
2 = 10 dBm
Fig. 3: The average received power associated different beams over
the codebook space in a three-path channel (K= 120,N= 64,
p= 26 dBm and the power of line-of-sight (LoS) path is around 3
dB larger than that of the non-line-of-sight (NLoS) path.)
value of rk, i.e., k= arg maxkp|hHfk|2.
Let Jbe the correlation length of arms, which is related
to the number of beams Nand the size of the codebook K
as follows: J= 2dK
Ne − 1. Then the received power using
the beam fkand the nearby beam fi,|ki| ≤ J/2are
similar. A new 1
J-lower resolution beam codebook C(J)can
be constructed by grouping the nearby beams in the codebook
C, i.e.,
C(J),nbg=PJg
k=J(g1)+1 fk
|PJg
k=J(g1)+1 fk|:g= 0,1, . . . , G 1o,
where the beams bgs are normalized (to unit `2norm) to
mimic reality in a phased array. If we choose a beam bgin
the codebook C(J), the received power can be expressed as
Rg=p|hHbg|2+ 2p<(hHbgn).
Since the received powers associated with the nearby beams
in the codebook Chave means that are close to one another,
Rghas large mean if rk, k ∈ {(g1)J+ 1, . . . , gJ}have
large means.
B. Problem Setup
To introduce the idea of bandit to address the BA problem,
we consider a bandit model where the beam fkis regarded
as the base arm k, and the received power using this beam is
regarded as the reward of the base arm k. The means of the
rewards when arms are pulled have the following property:
Property 1. Let µ= (µ1, . . . , µK), and let µ(1) µ(2)
µ(3) . . . µ(K)be the sorted sequence of means. Then we
assume that these means have the following properties:
1. The means of the reward associated with arms kand i,
where |ik| ≤ J/2, are close.
2. The rewards are sparse. In particular, there are only LJ
arms that have high rewards; the other KLJ arms
have rewards that are close to zero.
3. The variance of the reward of an arm is related to its
mean as follows: σ2
k= 2µkσ2.
As shown in Fig. 3, Property 1.1 holds because the beam
codebook is designed such that beams that are nearby have
rewards that are close. Property 1.2 is due to the fact that there
are only a few main paths in the considered channel model,
which has been corroborated by studies in channel measure-
ments [26]. Finally, Property 1.3 holds because the mean of
reward associated with base arm kfollows a heteroscedastic
Gaussian distribution N(µk,2µkσ2)(see (1)). Furthermore, in
this work, we focus on a challenging case with large noise,
which satisfies that
σ2>max
k[K]
(µ(1) µ(k))2
4µ(k)2µ(1) 2µ(k)ln µ(k)
µ(1) ,(2)
and this condition is assumed to hold in our theoretical
analyses. In the simulations, we also numerically check that
this condition holds.
Since a large number of base arms have mean rewards that
are close to zero (Property 1.3), it is clearly a waste of BA
overhead to search for the optimal arm among all the base
arms. To overcome this problem, we propose to group the
base arms by utilizing Property 1.1 which says that the means
of the reward associated with nearby arms are similar. As such,
we can formulate the BA problem as a novel MAB problem,
called the bandit BA problem.
In a bandit BA problem with Kbase arms, the base arm
kis associated with the beam fk. We let [K] = {1, . . . , K}
and [K]
Jconsec be the set of all non-empty consecutive tuples
of length Jwhose element belongs to the set [K], e.g., if
J= 2 and K= 6,
[6]
2consec
=n{1},{1,2},{2},{2,3},{3},
{3,4},{4},{4,5},{5},{5,6},{6}o,
4
and we consider each tuple as a (K, J)-super arm. Moreover,
the super arm associated with bg∈ C(J)is a subset of
[K]
Jconsec, e.g., {{1,2},{3,4},{5,6}} ⊂ [6]
2consec. At
time step t, we choose an action (or a (K, J)-super arm)
A(t)[K]
Jconsec, and observe the reward R(t)which is
related to the base arms included in A(t), i.e.,
R(t) = FPkA(t)fk
|PkA(t)fk|, p, h, nt,(3)
where F(f, p, h, n) = p|hHf|2+ 2p<(hHfn),nis a com-
plex Gaussian random variable with mean zero and variance
σ2. We can conclude that for an arbitrary (K, J)-super arm
A[K]
Jconsec, the empirical mean RAalso follows a het-
eroscedastic Gaussian distribution, i.e., RA∼ N(µA,2µAσ2),
where µA=p|hHPkAfk|2.
To identify the optimal base arm, an agent uses an algorithm
πthat decides which super arms to pull, the time τπto stop
pulling, and which base arm kπto choose eventually. An
algorithm consists of a triple π:= {(πt)t, τπ, ψπ, J}in which:
a sampling rule πtdetermining the (K, J)-super arm
A(t)[K]
Jconsec to pull at time step tbased
on the observation history and the arm history
{A(1), R(1), A(2), R(2), . . . , A(t1), R(t1)};
a stopping rule leading to a stopping time τπsatisfying
P(τπ<)=1;
a recommendation rule ψπthat outputs a base arm kπ
[K].
We define the time complexity of πas τπ. In the fixed-
confidence setting, a maximum risk δ(0,1) is fixed. We
say an algorithm πis (δ, J)-probably approximately correct
(PAC) if P(kπ=k)1δ. The purpose is to design a
(δ, J)-PAC algorithm πthat minimizes the expected stopping
time E[τπ].
C. Lower Bound on the Time Complexity
Denote by νaK-armed heteroscedastic Gaussian bandit
instance, ν=N(µν
1,2µν
1σ2), . . . , N(µν
K,2µν
Kσ2). Let V
represent the set of K-armed heteroscedastic Gaussian bandit
instances such that each bandit instance ν∈ V has a unique
optimal arm: for each ν∈ V, there exists an arm A(ν)such
that µν
A(ν)>max{µν
k:k6=A(ν)}. Define WK=w
RK
+:PK
k=1 wk= 1,Alt(ν) := {u∈ V :A(u) 6=A(ν)},
and Tk(t)as the times the base arm kis pulled until time step
t. Moreover, the KL divergence between two heteroscedastic
Gaussian distributions, i.e., N(µi,2µiσ2)and N(µj,2µjσ2),
can be calculated as
DHG(µi, µj) = 1
2ln µj
µi+µi
2µj
+(µjµi)2
4µjσ21
2.
According to Theorem 1 of [27], we can obtain the general
lower bound of this problem as follows.
Theorem 1. For any (δ, J)PAC algorithm where δ(0,1),
it holds that
Eπ[τ]c(ν) ln 1
4δ,
where
c(ν)1= sup
w∈WK
inf
uAlt(ν)K
X
k=1
wkDHG(µν
k, µu
k).(4)
III. OUR ALGORITHM: 2PHT&S
In this section, we first state the performance baseline of this
work, i.e., the original Track-and-Stop (T&S) algorithm [27],
then we elaborate the design of the proposed 2PHT&S. We
then upper bound its time complexity in Theorem 2.
A. A Baseline Algorithm
Note that the original Track-and-Stop (T&S) [27] is the
state-of-the-art best arm identification algorithm with fixed
confidence for exponential bandits, which can be modified
to solve the bandit BA problem with low sample complexity,
resulting in a fast BA solution.
Let νrepresent a K-armed Gaussian bandit instance, ν=
Nµν
1,(σν
1)2,Nµν
2,(σν
2)2,..., Nµν
K,(σν
K)2. Given
the fixed δ, the time complexity of T&S, EτT&S, satisfies
lim
δ0
EτT&S
log(1)=T(ν),
where
T(ν)1
:= sup
w∈WK
inf
uAlt(ν)K
X
k=1
wkDNµν
k,(σν
k)2,Nµu
k,(σu
k)2.
and the KL-divergence DNµν
k,(σν
k)2,Nµu
k,(σu
k)2 =
ln σu
k
σν
k+(σν
k)2+(µν
kµu
k)2
2(σu
k)21
2.
Furthermore, when we apply T&S to the K-armed het-
eroscedastic Gaussian bandit instance ν∈ V under consid-
eration, we have
lim sup
δ0
EτT&S
log(1)T
u(ν),(5)
where
T
u(ν) = 8µν
A(ν)σ2
2
ν,min
+X
k6=A(ν)
8µν
A(ν)σ2
2
ν,k
,
and the gaps are ν,k := µν
A(ν)µν
k, k 6=A(ν)and
ν,min := mink6=A(ν)µν
A(ν)µν
k.
B. 2PHT&S
For our bandit BA problem, we propose the 2PHT&S
algorithm inspired by the original T&S algorithm [27] to
achieve a smaller time complexity by leveraging the structure
of the BA problem. The main idea of the proposed algorithm is
to exploit the prior knowledge which have not been considered
by existing algorithms, i.e., Property 1 and the fact that
neighboring beams can be grouped and the corresponding
“grouped” reward can be observed in one time step.
As shown in Fig. 4, 2PHT&S consists of two phases. We
describe them in the following.
摘要:

1FastBeamAlignmentviaPureExplorationinMulti-armedBanditsYiWei,ZixinZhong,andVincentY.F.Tan,SeniorMember,IEEEAbstract—Thebeamalignment(BA)problemconsistsinaccu-ratelyaligningthetransmitterandreceiverbeamstoestablishareliablecommunicationlinkinwirelesscommunicationsystems.ExistingBAmethodssearchtheent...

展开>> 收起<<
1 Fast Beam Alignment via Pure Exploration in Multi-armed Bandits.pdf

共16页,预览4页

还剩页未读, 继续阅读

声明:本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知玖贝云文库,我们立即给予删除!
分类:图书资源 价格:10玖币 属性:16 页 大小:2.88MB 格式:PDF 时间:2025-04-30

开通VIP享超值会员特权

  • 多端同步记录
  • 高速下载文档
  • 免费文档工具
  • 分享文档赚钱
  • 每日登录抽奖
  • 优质衍生服务
/ 16
客服
关注