
As an exemplifying case study, we consider a multi-access
PT system consisting of a radio access network (RAN) similar
to that studied in [8]–[10]. It is emphasized that, unlike [8]–
[10], our goal here is not to address a particular task via
MARL, but rather to introduce a general framework supporting
the implementation of multiple functionalities at the DT, including control via MARL and monitoring, despite the limited
data transfer from the PT to the DT.
B. Related Work
DT-aided control of PT systems is often formulated as a
model-based reinforcement learning (RL) problem in which
the DT is leveraged as simulation platform to optimize the PT
policy [11]–[13]. In [14], a Bayesian model is deployed by the DT to enable the quantification of epistemic uncertainty. Existing
work on DT platforms for wireless systems has investigated
mechanisms for DT-PT synchronization [15] and DT-aided
network optimization and monitoring [16], as well as DT-
based control for computation offloading via model-based RL
[11]. To the best of our knowledge, applications of DT relying
on Bayesian learning for communication systems have not
been reported in the literature.
C. Main Contributions
The main contributions of this paper are as follows:
• We propose a Bayesian framework to control and monitor an AI-native wireless system, using a multi-access RAN as an exemplifying case study [8]–[10]. In the proposed approach, as illustrated in Fig. 1, a Bayesian model learning phase is followed by policy optimization and monitoring phases, which leverage the uncertainty quantification capacity of Bayesian models.
• A key challenge in the definition of a Bayesian model
is the choice of a domain-specific factorization of the joint
distribution of all variables of interest. This paper elucidates
this design choice for a multi-access system consisting of
sensing devices with correlated packet arrivals reporting to
a common receiver through a shared multi-packet reception
(MPR) channel [17]. This case study is relevant for Internet-
of-Things (IoT) and machine-type communications.
• Experimental results confirm the advantages of the proposed Bayesian framework as compared to conventional frequentist model-based approaches in terms of metrics such as throughput and buffer overflow, as well as the area under the receiver operating characteristic (ROC) curve for anomaly detection at the DT.
II. BAYESIAN DT FRAMEWORK
In this section, we formally define a PT system comprising multiple network elements, such as mobile devices or infrastructure nodes, referred to as agents. We then present a Bayesian DT framework to estimate the PT dynamics, optimize the agents' decisions, and monitor possible anomalies in the PT.
Fig. 2: Example of factorization in (1), excluding the variables
corresponding to previous time steps t−1.
A. Multi-Agent PT System
The PT system of interest consists of $K$ agents, e.g., $K$ sensing devices, indexed by $k \in \mathcal{K} = \{1, \dots, K\}$, that operate over a discrete time index $t = 1, 2, \dots$, e.g., over time slots. At each time $t$, each agent takes an action $a_t^k$, e.g., a decision on whether to transmit a packet from its queue or not. The action is selected by following a policy that leverages information collected by the agent regarding the current state $s_t$ of the overall system, which may include, e.g., packet queue lengths. The state $s_t$ evolves according to some ground-truth transition probability $T(s_{t+1} \mid s_t, a_t)$, such that the probability distribution of the next state $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$ depends on the current state $s_t$ and the joint action $a_t = (a_t^1, \dots, a_t^K)$.

We restrict our framework to the case of jointly observable states [18], in which the state $s_t$ can be identified if one has access to all observations $o_t^k$ made by all agents $k \in \mathcal{K}$ at time $t$, i.e., in which the state is a function of the collection $o_t = (o_t^1, \dots, o_t^K)$ for all times $t$.
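As a toy illustration of these definitions, the following sketch shows a ground-truth transition kernel $T(s_{t+1} \mid s_t, a_t)$ and a jointly observable state. The two-agent queueing model, the arrival probability, and all function names are illustrative assumptions, not the paper's case study:

```python
import random

random.seed(0)

K = 2  # number of agents (illustrative assumption)

def transition(state, joint_action):
    """Sample s_{t+1} ~ T(. | s_t, a_t) for a toy model: agent k's queue
    loses one packet when the agent transmits (a_t^k = 1) and gains one
    arrival with probability 0.5 (hypothetical arrival process)."""
    next_state = []
    for k in range(K):
        departures = joint_action[k] if state[k] > 0 else 0
        arrivals = 1 if random.random() < 0.5 else 0
        next_state.append(state[k] - departures + arrivals)
    return tuple(next_state)

def observe(state, k):
    """Each agent observes its own queue length; joint observability
    means s_t is recoverable from the collection o_t = (o_t^1, ..., o_t^K)."""
    return state[k]

state = (3, 1)
observations = tuple(observe(state, k) for k in range(K))
state = transition(state, joint_action=(1, 0))
```

Here the state is exactly the tuple of per-agent observations, which is the strongest form of joint observability; in general the state need only be some function of $o_t$.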
Agents in the PT cannot communicate, and hence the overall information available at agent $k$ up to time $t$ is contained in its action-observation history $h_t^k = (o_1^k, a_1^k, o_2^k, \dots, a_{t-1}^k, o_t^k)$. Accordingly, the behaviour of agent $k$ is defined by its policy $\pi^k(a_t^k \mid h_t^k)$, which defines the probability of each possible action $a_t^k$ based on the available information $h_t^k$.
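A decentralized policy $\pi^k(a_t^k \mid h_t^k)$ maps the history to an action distribution; the deterministic threshold rule below is a minimal hypothetical example of such a mapping, not the policy studied in the paper:

```python
def policy(history):
    """history is h_t^k = [o_1, a_1, o_2, a_2, ..., o_t], so the last
    entry is the latest observation o_t. This illustrative rule
    transmits (action 1) only when the observed queue is non-empty."""
    latest_obs = history[-1]
    return 1 if latest_obs > 0 else 0

h = [2, 1, 2, 1, 1]  # o_1, a_1, o_2, a_2, o_3 for one agent
action = policy(h)
```

Because agents cannot communicate, each agent applies its policy to its own history only, even though the system state depends on all agents' observations.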
The state of a communication system typically comprises several substates, describing, e.g., the current traffic conditions or the quality of the wireless channel. As a result, one can typically partition the state variables $s_t$ into $M$ operationally distinct subsets $\{s_t^i\}_{i=1}^{M}$ indexed by $i$. To describe the interactions among these subsets, we introduce a Bayesian network defined by a directed acyclic graph, as illustrated in Fig. 2, in which each subset $s_t^i$ is directly affected by the subset of "parent" variables $s_t^{P(i)} \subseteq s_t$ (see, e.g., [19]). Accordingly, the transition probability is assumed to factorize as
$$T(s_{t+1} \mid s_t, a_t) = \prod_{i=1}^{M} T_i\big(s_{t+1}^i \,\big|\, s_{t+1}^{P(i)}, s_t, a_t\big), \qquad (1)$$
where the conditional distribution $T_i(s_{t+1}^i \mid s_{t+1}^{P(i)}, s_t, a_t)$ describes the evolution of the next state variables $s_{t+1}^i$ given the current state $s_t$, action $a_t$, and parent variables $s_{t+1}^{P(i)}$. In general, the distribution $T_i(s_{t+1}^i \mid s_{t+1}^{P(i)}, s_t, a_t)$ depends on some sufficient statistic of the variables $s_t$ and $a_t$, which may be a function of subsets of such variables. We refer to Sec. III-A for an instance of model (1).
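Sampling from a factorization of the form (1) amounts to visiting the substates in a topological order of the DAG, so that each factor $T_i$ conditions on parent values already drawn. The sketch below uses $M = 2$ hypothetical substates (a channel and a queue) with parent set $P(2) = \{1\}$; the kernels and probabilities are our assumptions, not the paper's model:

```python
import random

random.seed(0)

def sample_channel(state, action):
    # T_1(s_{t+1}^1 | s_t, a_t): the channel substate has no parents
    # among the time-(t+1) variables (illustrative two-state channel)
    return "good" if random.random() < 0.7 else "bad"

def sample_queue(channel_next, state, action):
    # T_2(s_{t+1}^2 | s_{t+1}^{P(2)}, s_t, a_t): the queue depends on
    # the *new* channel substate, since a transmitted packet departs
    # only when the channel is good
    queue = state["queue"]
    if action == 1 and channel_next == "good":
        queue = max(queue - 1, 0)
    return queue

def transition(state, action):
    """Sample s_{t+1} ~ prod_i T_i(s_{t+1}^i | s_{t+1}^{P(i)}, s_t, a_t)
    by drawing the substates in an order consistent with the DAG."""
    channel_next = sample_channel(state, action)
    queue_next = sample_queue(channel_next, state, action)
    return {"channel": channel_next, "queue": queue_next}

s_next = transition({"channel": "good", "queue": 2}, action=1)
```

The benefit of the factorization is that each factor $T_i$ can be learned and parameterized separately, conditioning only on the typically small set of variables that actually influence substate $i$.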
B. Model Learning
The goal of the model learning phase (phase 1 in Fig. 1) at the DT is to obtain an estimate of the PT system dynamics