process, i.e., exploration and exploitation. To solve this problem,
a contextual bandit-based approach is seen as a principled way in
which a learning algorithm sequentially recommends an item to
users based on contextual information about the user and item, while
simultaneously adapting its recommendation strategy based on his-
torical user-click feedback to maximize total user clicks in the long
run. Theoretically, seminal contextual bandit algorithms, including
LinUCB [1] and LinTS [3], have been proven to achieve the optimal balance
between exploration and exploitation; empirically, these
approaches have shown remarkable performance and have been
deployed in various industrial scenarios to deal with cold-start
recommendation [18, 20, 29].
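To make the exploration–exploitation mechanism concrete, the following is a minimal sketch of a LinUCB-style round (the function names, the toy contexts, and the regularizer lambda = 1 are illustrative, not the paper's implementation): maintain the design matrix A and response vector b, score each candidate arm by its ridge estimate plus an exploration bonus, and update the statistics after observing click feedback.

```python
import numpy as np

def linucb_choose(contexts, A, b, alpha=1.0):
    """Pick the arm whose upper confidence bound is highest."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b  # ridge-regression point estimate of the reward model
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in contexts]
    return int(np.argmax(scores))

def linucb_update(A, b, x, reward):
    """Rank-one update of the sufficient statistics after click feedback."""
    return A + np.outer(x, x), b + reward * x

# Toy run: arm 0's context earns a click, arm 1's does not.
d = 2
A, b = np.eye(d), np.zeros(d)             # A starts as lambda * I with lambda = 1
arms = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
A, b = linucb_update(A, b, arms[0], 1.0)  # observed click on arm 0
A, b = linucb_update(A, b, arms[1], 0.0)  # no click on arm 1
chosen = linucb_choose(arms, A, b, alpha=0.1)
```

The exploration bonus `alpha * sqrt(x @ A_inv @ x)` shrinks for arms whose contexts have been observed often, which is what drives the balance between exploring uncertain items and exploiting known good ones.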
Despite this remarkable progress, such approaches cannot be
directly employed in many real-world scenarios where the recom-
mendation service provider holds only some portions of the users'
profiles while the remaining user features belong to different depart-
ments and cannot be directly shared with the recommendation
provider due to privacy concerns or data regulations. For example,
in the e-commerce recommendation setting, the service provider
(e.g., Amazon or Taobao) holds only shopping-related informa-
tion, while financial information such as deposits is held by other
departments. This setting is documented as the vertical federated
setting in the literature [4]. A question naturally arises: how can we
design contextual bandits in the vertical federated setting?
To solve this problem, a natural idea is to apply modern privacy-
enhancing techniques such as secure multi-party computation
(MPC) [10] and differential privacy (DP) [8, 9] to existing contex-
tual bandit approaches. However, doing so typically leads to
either high computation/communication costs or degraded
recommendation performance. For example, in multi-party
computation, the multiplication operator is roughly 380 times
slower, and the comparison operator as much as 34,000 times
slower, than its plaintext counterpart [12]. Directly applying these cryptographic techniques
to contextual bandits is therefore intractable, since they involve large
numbers of multiplication and comparison operations. For differential
privacy, injecting sufficient noise into the contextual information
leads to a significant loss of statistical utility and worse recom-
mendation performance [11, 34].
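The utility loss from DP-style perturbation can be seen in a small simulation. This sketch is illustrative only: the Gaussian noise scales are not calibrated to any formal (epsilon, delta) budget, and the ridge regression stands in for the bandit's reward estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 5
theta_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))                 # clean contexts
y = X @ theta_true + 0.1 * rng.standard_normal(n)

def ridge(X, y, lam=1.0):
    """Ridge-regression estimate, the workhorse inside linear bandits."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Perturb the contexts with increasing noise and track estimation error.
errs = {}
for sigma in [0.0, 1.0, 3.0]:
    X_noisy = X + sigma * rng.standard_normal(X.shape)
    errs[sigma] = float(np.linalg.norm(ridge(X_noisy, y) - theta_true))
    print(f"sigma={sigma}: estimation error {errs[sigma]:.3f}")
```

As the noise scale grows, the estimate of the reward parameter degrades, which translates directly into worse recommendations for a bandit that relies on it.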
All these cryptographic techniques were originally designed for
general-purpose computation. In this paper, we note that the core
computational operators in contextual bandits come with unique
properties, thus providing the opportunity to design customized
encryption schemes with better runtime efficiency and statistical
utility. In general, the exploration mechanism is at the core of pre-
vious bandit algorithms, which typically use the empirical sample
covariance matrix to construct the quantity needed by the explo-
ration mechanism. Based on this property of the sample covariance
matrix, we propose a technique named the orthogonal matrix based
mask mechanism (O3M) for computing this quantity in a private
and lossless manner, motivated by prior work on fed-
erated PCA [4]. O3M is used by the data provider for masking
the local contextual information (i.e., user features) before sending
it to the service provider. Specifically, O3M encodes the raw
contextual information by multiplying it with an orthogonal matrix
(i.e., a mask); the data provider then sends the masked data to
the service provider. Since the service provider has no access to
the mask, it cannot infer the original information even though it
knows the masked data. However, owing to the mathematical properties
of orthogonal matrices, the mask's effect cancels out when the masked data are used in
the construction of the key statistic in the bandit algorithms, and
thus we obtain "lossless" performance in a privacy-preserving
manner (see proof details in Section 4). We further apply the O3M
technique to two commonly used bandit algorithms, LinUCB and
LinTS, and implement two practical protocols named VFUCB and
VFTS for real-world recommendation tasks. For simplicity, in the
following discussions we refer to these protocols collectively as VFCB (Vertical
Federated Contextual Bandit) when there is no ambiguity.
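The lossless property rests on a simple identity: for an orthogonal mask Q, (QX)^T (QX) = X^T Q^T Q X = X^T X, so the Gram/covariance statistic the bandit needs survives masking unchanged while individual rows are mixed. The sketch below demonstrates this identity numerically; the choice of multiplying Q on the left (mixing user rows) and drawing Q via a QR decomposition are our assumptions for illustration, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4

# Data provider's local user features (one row per user).
X = rng.standard_normal((n, d))

# Random orthogonal mask, kept secret by the data provider.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Masked data sent to the service provider; rows no longer match any user.
X_masked = Q @ X

# The key statistic is unchanged: (QX)^T (QX) = X^T Q^T Q X = X^T X.
cov_plain = X.T @ X
cov_masked = X_masked.T @ X_masked
print(np.allclose(cov_plain, cov_masked))
```

The service provider can therefore build exactly the same exploration quantity it would build from the raw features, without ever seeing them.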
In addition, to help the reader better understand our
protocols, we conduct a comprehensive complexity analysis and pro-
vide formal utility and security proofs in Section 4. Through
extensive empirical studies on both synthetic and real-world
datasets, we show that the proposed protocols achieve similar
runtime efficiency and the same recommendation performance as
the centralized bandit, with a theoretical privacy-protection guaran-
tee. We also show that incorporating more user features indeed
improves the performance of the contextual bandit in the vertical
setting, which provides insight into improving recommendation
quality in an online and federated manner.
2 RELATED WORK
2.1 Federated Contextual Bandits
Recently, a few works have explored the concept of federated ban-
dits [2, 21, 27, 35]. Among the federated bandits with linear
reward structures, Huang et al. consider the case where individual
clients face different K-armed stochastic bandits coupled through
common global parameters [16]. They propose a collaborative al-
gorithm to cope with the heterogeneity across clients without ex-
changing local feature vectors or raw data. Dubey and Pentland
design differentially private mechanisms for tackling centralized
and decentralized (peer-to-peer) federated learning [7]. Tewari and
Murphy consider a fully decentralized federated bandit learning
setting with gossiping, where an individual agent only has access
to biased rewards, and agents can only exchange their observed
rewards with their neighbor agents [30]. Li and Wang study the
federated linear contextual bandit problem with heterogeneous
clients, i.e., clients that have various response times
and may even be occasionally unavailable in practice. They propose an asyn-
chronous event-triggered communication framework to improve
the method's robustness against possible delays and the temporary
unavailability of clients [19]. Shi et al. consider a setting similar to
Tewari and Murphy's, but the final payoff is a mixture of the
local and global models [28]. All the above-mentioned papers belong
to the horizontal federated bandit setting, i.e., each client has heteroge-
neous users and all clients jointly train a central bandit model
without exchanging their own raw user features. Since each user's
features are stored entirely by one client, this horizontal federated
setting stands in sharp contrast to the VFL setting. To the best of our knowledge,
our paper is the first to consider the contextual bandit problem in
the VFL setting.
2.2 Vertical Federated Learning
Vertical Federated Learning (VFL), aiming at promoting collabora-
tions among non-competing organizations/entities with vertically