1 Fast Beam Alignment via Pure Exploration in Multi-armed Bandits

2025-04-30 0 0 2.88MB 16 页 10玖币

侵权投诉

Fast Beam Alignment via Pure Exploration in

Multi-armed Bandits

Yi Wei, Zixin Zhong, and Vincent Y. F. Tan, Senior Member, IEEE

Abstract—The beam alignment (BA) problem consists in accu-

rately aligning the transmitter and receiver beams to establish a

reliable communication link in wireless communication systems.

Existing BA methods search the entire beam space to identify

the optimal transmit-receive beam pair. This incurs a signiﬁcant

latency when the number of antennas is large. In this work, we

develop a bandit-based fast BA algorithm to reduce BA latency

for millimeter-wave (mmWave) communications. Our algorithm

is named Two-Phase Heteroscedastic Track-and-Stop (2PHT&S).

We ﬁrst formulate the BA problem as a pure exploration problem

in multi-armed bandits in which the objective is to minimize the

required number of time steps given a certain ﬁxed conﬁdence

level. By taking advantage of the correlation structure among

beams that the information from nearby beams is similar and

the heteroscedastic property that the variance of the reward of an

arm (beam) is related to its mean, the proposed algorithm groups

all beams into several beam sets such that the optimal beam set

is ﬁrst selected and the optimal beam is identiﬁed in this set after

that. Theoretical analysis and simulation results on synthetic and

semi-practical channel data demonstrate the clear superiority of

the proposed algorithm vis-`

a-vis other baseline competitors.

Index Terms—Beam alignment, beam selection, mmWave com-

munication, multi-armed bandits.

I. INTRODUCTION

In millimeter-wave (mmWave) communications, the beams

at both the transmitter and receiver are narrow directional.

The beam alignment (BA) problem consists in ensuring that

the transmitter and receiver beams are accurately aligned to

establish a reliable communication link [2], [3]. An optimal

transmitter (receiver) beam or a pair of transmitter and receiver

beam is required to be selected to maximize the overall

capacity, throughput, SNR or diversity. To achieve this goal, a

na¨

ıve exhaustive search method scans all the beam space and

hence causes signiﬁcant BA latency. Speciﬁcally, there are

two fundamental challenges to implement BA: (1) the amount

of additional channel state information (CSI) corresponding

to each beam (pair) is large; (2) the frequency of the channel

measurement [4], [5] is high. These challenges become even

Y. Wei is with the College of Information Science and Electronic

Engineering, Zhejiang University and the Zhejiang Provincial Key Lab-

oratory of Information Processing, Communication and Networking (IP-

CAN), Hangzhou 310027, China (email: 21731133@zju.edu.cn). Z. Zhong

is with the Department of Computing Science, University of Alberta (email:

zzhong10@ualberta.ca). V. Y. F. Tan is with Department of Mathematics and

the Department of Electrical and Computer Engineering, National University

of Singapore (email: vtan@nus.edu.sg).

This work is funded by a Singapore Ministry of Education Tier 1 grant

(A-0009042-01-00), the Fundamental Research Funds for the Central Univer-

sities 226-2022-00195, and the Defense Industrial Technology Development

Program under Grant JCKY2020210B021.

This paper was presented in part at the 2022 International Symposium on

Information Theory [1].

more difﬁcult to overcome when a large number of anten-

nas are employed at both of the transmitter and receiver.

Moreover, the change of antenna orientation, the effect of

transmitter/reciever mobility, and the dynamic nature of the

wireless channel will result in previously found optimal beam

pairs to be suboptimal over time, which further exacerbates

the latency problem [6], [7].

Therefore, the design of fast BA algorithms is of paramount

importance, and has stimulated intense research interest. By

utilizing the sparsity of the mmWave channel, Marzi et al.

[8] incorporated compressed sensing techniques in the BA

problem to reduce the beam measurement complexity. Wang

et al. [9] developed a fast-discovery multi-resolution beam

search method, which ﬁrst probes the wide beam and continues

to narrow beams until the optimal arm is identiﬁed. However,

the beam resolution needs to be adjusted at each step. Besides,

various forms of side information, e.g., location information

[10], out-of-band measurements [11], and dedicated short-

range communication [12], have also been used to reduce the

required number of probing beams.

Due to its inherent ability to balance between exploration

and exploitation, multi-armed bandit (MAB) theory has been

recently utilized in wireless communication ﬁeld, such as

channel access in cognitive network [13], channel allocation

[14] and adaptive modulation/coding [15], etc., and harnessed

to address the BA problems; see [16]–[22]. In a single-user

MIMO system, the work [16] employed the canonical upper

conﬁdence bound (UCB) bandit algorithm in beam selection,

where the instantaneous full CSI is not required. The work [17]

applied linear Thompson sampling (LTS) to address the beam

selection problem. For mmWave vehicular systems, Sim et

al. [18] developed an online learning algorithm to address the

problem of beam selection with environment-awareness based

on contextual bandit theory. The proposed method explores

different beams over time while accounting for contextual

information (i.e., vehicles’ direction of arrival). Similarly,

Hashemi et al. [19] proposed to maximize the directivity gain

(i.e., received energy) of the BA policy within a time period

using the contextual information (i.e., the inherent correlation

and unimodality properties), and formulate the BA problem

as contextual bandits. Wu et al. [20] provided a method to

quickly and accurately align beams in multi-path channels for

a point-to-point mmwave system, which takes advantage of the

correlation structure among beams such that the information

from nearby beams is extracted to identify the optimal beam,

instead of searching the entire beam space. In [21], Va

et al. used a UCB-based framework to develop the online

learning algorithm for beam pair selection and reﬁnement,

arXiv:2210.12625v1 [cs.IT] 23 Oct 2022

User

Fig. 1: A mmWave massive MISO system system.

where the algorithm ﬁrst learns coarse beam directions in some

predeﬁned beam codebook, and then ﬁne-tunes the identiﬁed

directions to match the peak of the power angular spectrum

at that position. Hussain et al. [22] also designed a novel

beam pair alignment scheme based on Bayesian multi-armed

bandits, with the goal of maximizing the alignment probability

and the data-communication throughput.

In this paper, different from existing methods, we ﬁrst

formulate the BA problem as a pure exploration problem in the

theory of MABs, called bandit BA problem, with the objective

of minimizing the required time steps to ﬁnd the optimal

beam pair with a prescribed probability of success (conﬁdence

level). Second, we derive a lower bound on the expected

time complexity of any algorithm designed to solve the BA

problem. Third, we present a bandit-based fast BA algorithm,

which we name Two-Phase Heteroscedastic Track-and-Stop

(2PHT&S). We do so by exploiting the correlation among

beams and the heteroscedastic property that the variance of

the reward of an arm (beam) is linearly related to its mean.

Instead of searching over the entire beam space, the proposed

algorithm groups all beams into several beam sets such that

the optimal beam set is ﬁrst selected and the optimal beam is

identiﬁed within this set during the second phase. We derive an

upper bound on the time complexity of the proposed 2PHT&S

algorithm. Finally, extensive numerical results demonstrate

that 2PHT&S signiﬁcantly reduces the required number of

time steps to identify the optimal beam compared to existing

algorithms in both simulated and semi-practical channel data

scenarios.

The remainder of this paper is organized as follows. The

system model and problem setup are presented in Section II.

Section III proposes our bandit-based fast BA algorithm and

its time complexity upper bound. Simulation results are given

in Section IV, and Section V concludes this paper.

II. PROBLEM FORMULATION

In this section, we ﬁrst describe the system model and

propose our problem setup, after that we provide a generic

lower bound of the time complexity of the problem.

A. System Model

As shown in Fig. 1, we consider a massive mmWave MISO

system, where a base station (BS) equipped with Ntransmit

antennas serves a single-antenna user. According to [23],

each transmitting beam is selected from a predeﬁned analog

beamforming codebook Cof size K, which can be deﬁned as

C,{fk=a(−1+2k/K)|k= 0,1,2, . . . , K −1}

with a(·)denoting the array response vector. For a

typical uniform linear array, it holds that a(x) =

√N1, ej2π

λdx, ej2π

λ2dx, . . . , ej2π

λ(N−1)dxH∈CN×1,where

λis the wavelength and dis the antenna spacing. Note that if

K≤N, the beams in the codebook Care linearly independent

of one another. Based on channel measurement studies done

in [24], [25], the mmWave channel follows a multipath channel

model, and the number of propagation paths is small.

We consider a quasi-static channel, where the channel stays

unchanged for a period covariance interval which consists of

Ttime slots. As shown in Fig. 2, each covariance interval can

be divided into two phases, the BA phase (which consists of

TBtime slots) and the data transmission phase (which consists

of TD=T−TBtime slots). With the selected beam f, the

effective achievable rate (EAR) of each covariance interval

Reff := 1−TB

TDlog 1 + p|hHf|2

σ2

is widely accepted as a metric to measure the throughput

performance, where pand σ2represent the transmit power and

noise variance respectively. We observe that the throughput

performance increases with the decreasing TB/TDand in-

creasing |hHf∗|2. As such, the time allocated for selecting the

optimal beam f∗= arg maxf∈C |hHf|2should be minimized

for higher system throughput, which is the motivation of this

work.

In the BA phase, the BS selects one beam from the

codebook Cand transmits pilot signals to the user with this

beam. Without loss of generality, we set the pilot signal to

be 1. Then the received power can be expressed as

R(fk) = |√phHfk+n|2=p|hHfk|2+|n|2+2√p<(hHfkn∗),

where n∗represents the complex conjugate of the complex

Gaussian noise n∼ CN(0, σ2). Note that the BS can observe

the received power from the user through a feedback link, and

this assumption is also adopted in other studies such as [20],

[21]. We can see that R(fk)is a random variable which is

the sum of a Gamma random variable |n|2∼Γ(1,1/σ2)and

a Gaussian random variable p|hHfk|2+ 2√p<(hHfkn∗)∼

N(p|hHfk|2,2p|hHfk|2σ2). Fig. 3 shows the average received

power associated with different beams over the codebook

space when K= 120,N= 64,p= 26 dBm and the power

of LoS path is around 3 dB larger than those of the NLoS

paths. It can be observed that all the received powers become

larger as the increase of the noise variance, and the improved

value is the difference between the noise variances. Since the

noise variance is much smaller than the transmit power, the

variable R(fk)is approximately a Gaussian random variable

with mean µk=p|hHfk|2and variance σ2

k= 2p|hHfk|2σ2,

i.e.,

rk=p|hHfk|2+ 2√p<(hHfkn∗).(1)

The optimal beam is deﬁned as the beam that has the largest

Coherence time: Channel stays constant

(𝑇time slots)

Beam Alignment

Phase Data Transmission

Phase Beam Alignment

Phase Data Transmission

Phase

(𝑇

𝐵time slots) (𝑇

𝐷time slots)

……

Fig. 2: Transmission scheme.

0 20 40 60 80 100 120

Beam index

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Average Reward/Recieved signal power (W)

2 = 0 dBm

2 = 10 dBm

Fig. 3: The average received power associated different beams over

the codebook space in a three-path channel (K= 120,N= 64,

p= 26 dBm and the power of line-of-sight (LoS) path is around 3

dB larger than that of the non-line-of-sight (NLoS) path.)

value of rk, i.e., k∗= arg maxkp|hHfk|2.

Let Jbe the correlation length of arms, which is related

to the number of beams Nand the size of the codebook K

as follows: J= 2dK

Ne − 1. Then the received power using

the beam fkand the nearby beam fi,|k−i| ≤ J/2are

similar. A new 1

J-lower resolution beam codebook C(J)can

be constructed by grouping the nearby beams in the codebook

C, i.e.,

C(J),nbg=PJg

k=J(g−1)+1 fk

|PJg

k=J(g−1)+1 fk|:g= 0,1, . . . , G −1o,

where the beams bg’s are normalized (to unit `2norm) to

mimic reality in a phased array. If we choose a beam bgin

the codebook C(J), the received power can be expressed as

Rg=p|hHbg|2+ 2√p<(hHbgn∗).

Since the received powers associated with the nearby beams

in the codebook Chave means that are close to one another,

Rghas large mean if rk, k ∈ {(g−1)J+ 1, . . . , gJ}have

large means.

B. Problem Setup

To introduce the idea of bandit to address the BA problem,

we consider a bandit model where the beam fkis regarded

as the base arm k, and the received power using this beam is

regarded as the reward of the base arm k. The means of the

rewards when arms are pulled have the following property:

Property 1. Let µ= (µ1, . . . , µK), and let µ(1) ≥µ(2) ≥

µ(3) ≥. . . ≥µ(K)be the sorted sequence of means. Then we

assume that these means have the following properties:

1. The means of the reward associated with arms kand i,

where |i−k| ≤ J/2, are close.

2. The rewards are sparse. In particular, there are only LJ

arms that have high rewards; the other K−LJ arms

have rewards that are close to zero.

3. The variance of the reward of an arm is related to its

mean as follows: σ2

k= 2µkσ2.

As shown in Fig. 3, Property 1.1 holds because the beam

codebook is designed such that beams that are nearby have

rewards that are close. Property 1.2 is due to the fact that there

are only a few main paths in the considered channel model,

which has been corroborated by studies in channel measure-

ments [26]. Finally, Property 1.3 holds because the mean of

reward associated with base arm kfollows a heteroscedastic

Gaussian distribution N(µk,2µkσ2)(see (1)). Furthermore, in

this work, we focus on a challenging case with large noise,

which satisﬁes that

σ2>max

k∈[K]

(µ(1) −µ(k))2

4µ(k)−2µ(1) −2µ(k)ln µ(k)

µ(1) ,(2)

and this condition is assumed to hold in our theoretical

analyses. In the simulations, we also numerically check that

this condition holds.

Since a large number of base arms have mean rewards that

are close to zero (Property 1.3), it is clearly a waste of BA

overhead to search for the optimal arm among all the base

arms. To overcome this problem, we propose to group the

base arms by utilizing Property 1.1 which says that the means

of the reward associated with nearby arms are similar. As such,

we can formulate the BA problem as a novel MAB problem,

called the bandit BA problem.

In a bandit BA problem with Kbase arms, the base arm

kis associated with the beam fk. We let [K] = {1, . . . , K}

and [K]

≤Jconsec be the set of all non-empty consecutive tuples

of length ≤Jwhose element belongs to the set [K], e.g., if

J= 2 and K= 6,

[6]

≤2consec

=n{1},{1,2},{2},{2,3},{3},

{3,4},{4},{4,5},{5},{5,6},{6}o,

and we consider each tuple as a (K, J)-super arm. Moreover,

the super arm associated with bg∈ C(J)is a subset of

[K]

≤Jconsec, e.g., {{1,2},{3,4},{5,6}} ⊂ [6]

≤2consec. At

time step t, we choose an action (or a (K, J)-super arm)

A(t)∈[K]

≤Jconsec, and observe the reward R(t)which is

related to the base arms included in A(t), i.e.,

R(t) = FPk∈A(t)fk

|Pk∈A(t)fk|, p, h, nt,(3)

where F(f, p, h, n) = p|hHf|2+ 2√p<(hHfn∗),nis a com-

plex Gaussian random variable with mean zero and variance

σ2. We can conclude that for an arbitrary (K, J)-super arm

A∈[K]

≤Jconsec, the empirical mean RAalso follows a het-

eroscedastic Gaussian distribution, i.e., RA∼ N(µA,2µAσ2),

where µA=p|hHPk∈Afk|2.

To identify the optimal base arm, an agent uses an algorithm

πthat decides which super arms to pull, the time τπto stop

pulling, and which base arm kπto choose eventually. An

algorithm consists of a triple π:= {(πt)t, τπ, ψπ, J}in which:

•a sampling rule πtdetermining the (K, J)-super arm

A(t)∈[K]

≤Jconsec to pull at time step tbased

on the observation history and the arm history

{A(1), R(1), A(2), R(2), . . . , A(t−1), R(t−1)};

•a stopping rule leading to a stopping time τπsatisfying

P(τπ<∞)=1;

•a recommendation rule ψπthat outputs a base arm kπ∈

[K].

We deﬁne the time complexity of πas τπ. In the ﬁxed-

conﬁdence setting, a maximum risk δ∈(0,1) is ﬁxed. We

say an algorithm πis (δ, J)-probably approximately correct

(PAC) if P(kπ=k∗)≥1−δ. The purpose is to design a

(δ, J)-PAC algorithm πthat minimizes the expected stopping

time E[τπ].

C. Lower Bound on the Time Complexity

Denote by νaK-armed heteroscedastic Gaussian bandit

instance, ν=N(µν

1,2µν

1σ2), . . . , N(µν

K,2µν

Kσ2). Let V

represent the set of K-armed heteroscedastic Gaussian bandit

instances such that each bandit instance ν∈ V has a unique

optimal arm: for each ν∈ V, there exists an arm A∗(ν)such

that µν

A∗(ν)>max{µν

k:k6=A∗(ν)}. Deﬁne WK=w∈

+:PK

k=1 wk= 1,Alt(ν) := {u∈ V :A∗(u) 6=A∗(ν)},

and Tk(t)as the times the base arm kis pulled until time step

t. Moreover, the KL divergence between two heteroscedastic

Gaussian distributions, i.e., N(µi,2µiσ2)and N(µj,2µjσ2),

can be calculated as

DHG(µi, µj) = 1

2ln µj

µi+µi

2µj

+(µj−µi)2

4µjσ2−1

According to Theorem 1 of [27], we can obtain the general

lower bound of this problem as follows.

Theorem 1. For any (δ, J)−PAC algorithm where δ∈(0,1),

it holds that

Eπ[τ]≥c∗(ν) ln 1

4δ,

where

c∗(ν)−1= sup

w∈WK

inf

u∈Alt(ν)K

k=1

wkDHG(µν

k, µu

k).(4)

III. OUR ALGORITHM: 2PHT&S

In this section, we ﬁrst state the performance baseline of this

work, i.e., the original Track-and-Stop (T&S) algorithm [27],

then we elaborate the design of the proposed 2PHT&S. We

then upper bound its time complexity in Theorem 2.

A. A Baseline Algorithm

Note that the original Track-and-Stop (T&S) [27] is the

state-of-the-art best arm identiﬁcation algorithm with ﬁxed

conﬁdence for exponential bandits, which can be modiﬁed

to solve the bandit BA problem with low sample complexity,

resulting in a fast BA solution.

Let νrepresent a K-armed Gaussian bandit instance, ν=

Nµν

1,(σν

1)2,Nµν

2,(σν

2)2,..., Nµν

K,(σν

K)2. Given

the ﬁxed δ, the time complexity of T&S, EτT&S, satisﬁes

lim

δ→0

EτT&S

log(1/δ)=T∗(ν),

where

T∗(ν)−1

:= sup

w∈WK

inf

u∈Alt(ν)K

k=1

wkDNµν

k,(σν

k)2,Nµu

k,(σu

k)2.

and the KL-divergence DNµν

k,(σν

k)2,Nµu

k,(σu

k)2 =

ln σu

σν

k+(σν

k)2+(µν

k−µu

k)2

2(σu

k)2−1

Furthermore, when we apply T&S to the K-armed het-

eroscedastic Gaussian bandit instance ν∈ V under consid-

eration, we have

lim sup

δ→0

EτT&S

log(1/δ)≤T∗

u(ν),(5)

where

T∗

u(ν) = 8µν

A∗(ν)σ2

∆2

ν,min

k6=A∗(ν)

8µν

A∗(ν)σ2

∆2

ν,k

and the gaps are ∆ν,k := µν

A∗(ν)−µν

k, k 6=A∗(ν)and

∆ν,min := mink6=A∗(ν)µν

A∗(ν)−µν

B. 2PHT&S

For our bandit BA problem, we propose the 2PHT&S

algorithm inspired by the original T&S algorithm [27] to

achieve a smaller time complexity by leveraging the structure

of the BA problem. The main idea of the proposed algorithm is

to exploit the prior knowledge which have not been considered

by existing algorithms, i.e., Property 1 and the fact that

neighboring beams can be grouped and the corresponding

“grouped” reward can be observed in one time step.

As shown in Fig. 4, 2PHT&S consists of two phases. We

describe them in the following.

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

10 玖币 0人已下载

立即下载

摘要：

1FastBeamAlignmentviaPureExplorationinMulti-armedBanditsYiWei,ZixinZhong,andVincentY.F.Tan,SeniorMember,IEEEAbstractThebeamalignment(BA)problemconsistsinaccu-ratelyaligningthetransmitterandreceiverbeamstoestablishareliablecommunicationlinkinwirelesscommunicationsystems.ExistingBAmethodssearchtheent...

展开>> 收起<<

1 Fast Beam Alignment via Pure Exploration in Multi-armed Bandits.pdf

共16页,预览4页

还剩页未读，继续阅读

声明：本站为文档C2C交易模式，即用户上传的文档直接被用户下载，本站只是中间服务平台，本站所有文档下载所得的收益归上传人(含作者)所有。玖贝云文库仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私，请立即通知玖贝云文库，我们立即给予删除！

1 Fast Beam Alignment via Pure Exploration in Multi-armed Bandits

相关推荐

开通VIP享超值会员特权

作者详情

相关内容

热门标签

举报选择: