BIT ERROR AND BLOCK ERROR RATE TRAINING FOR ML-ASSISTED COMMUNICATION
Reinhard Wiesmayr*,1, Gian Marti*,1, Chris Dick2, Haochuan Song3, and Christoph Studer1
*equal contribution; 1ETH Zurich, 2NVIDIA, 3Southeast University
E-mail: wiesmayr@iis.ee.ethz.ch, marti@iis.ee.ethz.ch, cdick@nvidia.com, hcsong@seu.edu.cn, studer@ethz.ch
ABSTRACT
Even though machine learning (ML) techniques are being widely used in communications, the question of how to train communication systems has received surprisingly little attention. In this paper, we show that the commonly used binary cross-entropy (BCE) loss is a sensible choice in uncoded systems, e.g., for training ML-assisted data detectors, but may not be optimal in coded systems. We propose new loss functions targeted at minimizing the block error rate, as well as SNR deweighting, a novel method that trains communication systems for optimal performance over a range of signal-to-noise ratios. The utility of the proposed loss functions as well as of SNR deweighting is shown through simulations in NVIDIA Sionna.
1. INTRODUCTION
Machine learning (ML) has revolutionized a large number of fields, including communications. The availability of software frameworks, such as TensorFlow [1] and, recently, NVIDIA Sionna [2], has made implementation and training of ML-assisted communication systems convenient. Existing results in ML-assisted communication systems range from the atomistic improvement of data detectors (e.g., using deep unfolding) [3–6] to model-free learning of end-to-end communication systems [7–9]. Quite surprisingly, little attention has been devoted to the question of how ML-assisted communication systems should be trained. In particular, the choice of the cost function is seldom discussed (see, e.g., the recent overview papers [10, 11]) and—given the similarity between communication and classification—one usually resorts to an empirical cross-entropy (CE) loss [12–17]. The question of training a communication system for good performance over a range of signal-to-noise ratios (SNRs) is another issue that has not been seriously investigated. Systems are usually trained on samples from only one SNR [3, 8], or on samples uniformly drawn from the targeted SNR range [4, 14, 16], apparently without questioning how this may affect performance at different SNRs.
In this paper, we investigate how ML-assisted communication systems should be trained. We first consider the case where the intended goal is to minimize the uncoded bit error rate (BER) and discuss why the empirical binary cross-entropy (BCE) loss is indeed a sensible choice in uncoded systems, e.g., for data detectors in isolation. However, in most practical communication applications, the relevant figure of merit is the (coded) block error rate (BLER), as opposed to the BER, since block errors cause undesirable retransmissions [18, Sec. 9.2], whereas (coded) bit errors themselves are irrelevant.¹ We underpin that minimizing the (coded) BER is not equivalent to minimizing the BLER. This observation calls into question the common practice of training coded systems with loss functions that penalize individual bit errors (such as the empirical BCE), and thus optimize for the (irrelevant) coded BER instead of the BLER. In response, we propose a range of novel loss functions that aim at minimizing the BLER by penalizing bit errors jointly. We also show that training on samples that are uniformly drawn from a target SNR range will focus primarily on the low-SNR region while neglecting high-SNR performance. As a remedy, we propose a new technique called SNR deweighting. We evaluate the impact of the different loss functions as well as of SNR deweighting through simulations in NVIDIA Sionna [2].

A shorter version of this paper has been submitted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). All code and simulation scripts to reproduce the results of this paper are available on GitHub: https://github.com/IIP-Group/BLER_Training. The authors thank Oscar Castañeda for comments and suggestions.
2. TRAINING FOR BIT ERROR RATE
ML-assisted communication systems are typically trained with a focus on minimizing the (uncoded) BER [4, 16], under a tacit assumption that the learned system could then be used in combination with a forward error correction (FEC) scheme to ensure reliable communication.² Due to the similarity between detection and classification, the strategy typically consists of (approximately) minimizing the empirical BCE³ on a training set $\mathcal{D}=\{(\mathbf{b}^{(n)},\mathbf{y}^{(n)})\}_{n=1}^{N}$, where $\mathbf{b}=(b_1,\ldots,b_K)$ is the vector of bits of interest (even in uncoded systems, one is interested in multiple bits, e.g., when using higher-order constellations, multiple OFDM subcarriers, or multi-user transmission), $\mathbf{y}\in\mathcal{Y}$ is the channel output, and $n$ is the sample index. In fact, this strategy appears to be so obvious that it is often not motivated—let alone questioned—at all.
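To make this standard recipe concrete, the sketch below trains a small neural-network detector by minimizing an empirical BCE loss on simulated BPSK-over-AWGN data. It is a minimal illustration only: the toy channel model, network architecture, and training hyperparameters are our own assumptions and are not taken from the paper or from Sionna.

```python
import numpy as np
import tensorflow as tf

K = 4          # number of bits per channel use (toy choice)
N = 100_000    # number of training samples
SNR_DB = 5.0   # single training SNR for this illustration

# Toy channel: one BPSK symbol per bit, transmitted over AWGN.
rng = np.random.default_rng(0)
b = rng.integers(0, 2, size=(N, K)).astype(np.float32)          # bits b^(n)
x = 1.0 - 2.0 * b                                               # BPSK symbols
sigma = np.float32(np.sqrt(0.5 * 10.0 ** (-SNR_DB / 10.0)))     # noise standard deviation
y = x + sigma * rng.standard_normal((N, K)).astype(np.float32)  # channel outputs y^(n)

# Detector f: Y -> [0,1]^K; the sigmoid outputs approximate p(b_k = 1 | y).
detector = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(K, activation="sigmoid"),
])

# Minimize the empirical BCE over the training set via stochastic gradient descent.
detector.compile(optimizer="adam", loss=tf.keras.losses.BinaryCrossentropy())
detector.fit(y, b, batch_size=256, epochs=2, verbose=0)

# Hard decisions by thresholding the learned marginals at 1/2.
b_hat = (detector.predict(y[:1000], verbose=0) > 0.5).astype(np.float32)
print("BER estimate on 1000 training samples:", np.mean(b_hat != b[:1000]))
```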
¹ For this reason, physical layer (PHY) quality-of-service is assessed only in terms of BLER (not BER) in 3GPP LTE and other standards. Reference [19] notes that the relation between BER and BLER can be inconsistent.
² The discussion also applies to systems that already include FEC, but we argue in Secs. 1 and 3 that minimizing the coded BER is a category mistake.
³ When we speak of the BCE between vectors, we mean the sum of binary CEs between the individual components as defined in (3), and not the categorical CE between the bit-vector and its estimate (as used, e.g., in [7–9]).
2.1. Minimizing the BCE Learns the Posterior Marginals
An “ML style” justification is to note that the expected BCE between the bit vector $\mathbf{b}$ and its estimate $\mathbf{f}(\mathbf{y})=(f_1,\ldots,f_K)$ can be written as $\sum_{k}\big(H(b_k\,|\,\mathbf{y}) + \mathbb{E}_{\mathbf{y}}\,D(p_{b_k|\mathbf{y}}\,\|\,f_k)\big)$, where $H(\cdot\,|\,\cdot)$ and $D(\cdot\,\|\,\cdot)$ denote the conditional and the relative entropy, respectively. The expected BCE is thus minimized when the estimates $f_k(\mathbf{y})$ equal the true posterior marginals $p_{b_k|\mathbf{y}}$.⁴ Once the posterior is learned, simple thresholding (at $1/2$) results in BER-optimal data detection. The expected BCE is not available, but resorting to an empirical proxy through stochastic gradient descent is so common by now that it is often not even mentioned anymore.
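As a quick numerical sanity check of this claim (our own illustration, with an arbitrarily chosen marginal probability), the expected BCE for a single bit with $p_{b|\mathbf{y}}(1\,|\,\mathbf{y})=0.3$ is indeed minimized when the estimate equals 0.3:

```python
import numpy as np

p = 0.3                                  # true posterior marginal p(b = 1 | y)
f = np.linspace(1e-3, 1 - 1e-3, 999)     # candidate estimates f(y)

# Expected BCE for one bit: -(p*log f + (1-p)*log(1-f)) = H(b|y) + D(p || f).
expected_bce = -(p * np.log(f) + (1 - p) * np.log(1 - f))
print("minimizer:", f[np.argmin(expected_bce)])   # ~0.3, i.e., the true marginal
```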
We now argue explicitly—using the framework of empirical risk minimization (ERM)—that minimizing the empirical (as opposed to the expected) BCE can learn the true posterior marginals. We do not claim that this result is “novel,” but an explicit derivation seems unavailable in the literature. In the ERM framework, one learns a function
$$\hat{\mathbf{f}} = \operatorname*{arg\,min}_{\mathbf{f}\in\mathcal{F}} \, L(\mathbf{f},\mathcal{D}), \qquad (1)$$
where $\mathcal{F} \subseteq \{\mathbf{f}:\mathcal{Y}\to[0,1]^K\}$ is the set of admissible functions $\mathbf{f}=(f_1,\ldots,f_K)$ and
$$L(\mathbf{f},\mathcal{D}) = \sum_{n=1}^{N} \ell_{\mathrm{BCE}}\big(\mathbf{b}^{(n)},\mathbf{f}(\mathbf{y}^{(n)})\big) \qquad (2)$$
is the empirical risk, which here is induced by the BCE loss
$$\ell_{\mathrm{BCE}}(\mathbf{b},\mathbf{f}) = -\sum_{k=1}^{K} \big( b_k\log(f_k) + (1-b_k)\log(1-f_k) \big). \qquad (3)$$
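A direct NumPy transcription of (2) and (3) might look as follows (a minimal sketch; the array shapes and the clipping constant are our own choices for numerical safety):

```python
import numpy as np

def bce_loss(b, f, eps=1e-12):
    """BCE loss (3) between a bit vector b in {0,1}^K and an estimate f in [0,1]^K."""
    f = np.clip(f, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(b * np.log(f) + (1.0 - b) * np.log(1.0 - f))

def empirical_risk(B, F):
    """Empirical risk (2): sum of the BCE losses over all N training samples.

    B has shape (N, K) and holds the bit vectors b^(n); F has shape (N, K)
    and holds the detector outputs f(y^(n)).
    """
    return sum(bce_loss(b, f) for b, f in zip(B, F))
```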
In principle, the empirical risk would be minimal if
$$\mathbf{f}(\mathbf{y}^{(n)}) = \mathbf{b}^{(n)}, \quad n = 1,\ldots,N. \qquad (4)$$
The optimal $\mathbf{f}$ would therefore make hard decisions on the training data set that—with hindsight—are always right. However, there are a priori no restrictions on how such a function $\mathbf{f}$ responds to an input $\mathbf{y}$ that is not contained in $\mathcal{D}$: we are in danger of overfitting. ERM with a BCE loss may therefore be a reasonable strategy primarily in one of the following two settings: Either $\mathcal{F}$ is “inflexible” or the range $\mathcal{Y}\ni\mathbf{y}$ is “small” compared to $\mathcal{D}$. In either case, (4) cannot be satisfied and overfitting is prevented.⁵ The first case is more relevant in practice but more difficult to analyze. We therefore focus on the second case, which we formalize through the following assumption:
Assumption 1. We assume that $\mathcal{D}$ is large and representative of the underlying posterior marginals $p_{b_k|\mathbf{y}}$ in the sense that, for some $0<\varepsilon<1$ and for all $k$ and all $(b,\mathbf{y})\in\{0,1\}\times\mathcal{Y}$,
$$\bigg|\, p_{b_k|\mathbf{y}}(b=1\,|\,\mathbf{y}) - \frac{1}{|\mathcal{N}(\mathbf{y})|} \sum_{n\in\mathcal{N}(\mathbf{y})} b_k^{(n)} \,\bigg| \le \varepsilon, \qquad (5)$$
where $\mathcal{N}(\mathbf{y}) = \{n\in\{1,\ldots,N\} : \mathbf{y}^{(n)}=\mathbf{y}\}$.
⁴ This assumes that the transmitter is not trainable, so that $H(\mathbf{b}\,|\,\mathbf{y})$ is a constant. See [20] for a discussion that includes trainable transmitters.
⁵ It has been argued that learned systems may also generalize to new inputs even when they achieve perfect accuracy on the training dataset [21, 22]. An investigation of such settings is, however, beyond the scope of this paper.
Proposition 1. Under Ass. 1, ERM with $\mathcal{F}=\{\mathbf{f}:\mathcal{Y}\to[0,1]^K\}$ and BCE loss learns the posterior marginals up to precision $\varepsilon$,
$$\big|\, p_{b_k|\mathbf{y}}(b=1\,|\,\mathbf{y}) - \hat{f}_k(\mathbf{y}) \,\big| \le \varepsilon, \quad \forall\, \mathbf{y}\in\mathcal{Y},\ k=1,\ldots,K. \qquad (6)$$
The proof of this proposition (as well as of all following propositions) is shown in Sec. 7.1.

It should be interesting to translate this result to the case where $\mathcal{Y}$ is uncountable but $\mathcal{F}$ is “inflexible,” or even to the interpolating case described in [21]. We also note that, while the BCE is the most natural and probably most widely used loss in this context, it is by no means the only option. In fact, an analogous version of Prop. 1 holds for the mean square error (MSE) loss $\ell_{\mathrm{MSE}}:\{0,1\}^K\times[0,1]^K\to[0,1],\ (\mathbf{b},\mathbf{f})\mapsto\|\mathbf{b}-\mathbf{f}\|_2^2/K$.

Proposition 2. Under Ass. 1, ERM with $\mathcal{F}=\{\mathbf{f}:\mathcal{Y}\to[0,1]^K\}$ and MSE loss learns the posterior marginals up to precision $\varepsilon$,
$$\big|\, p_{b_k|\mathbf{y}}(b=1\,|\,\mathbf{y}) - \hat{f}_k(\mathbf{y}) \,\big| \le \varepsilon, \quad \forall\, \mathbf{y}\in\mathcal{Y},\ k=1,\ldots,K. \qquad (7)$$
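To see the mechanism behind Prop. 1 at work, the following sketch (our own toy setup with a small discrete observation alphabet, not from the paper) minimizes the empirical BCE separately for each observation value and recovers the empirical bit frequencies, which under Assumption 1 are within $\varepsilon$ of the true marginals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint distribution: observation y in {0, 1, 2}, a single bit b with
# true posterior marginals p(b = 1 | y) chosen arbitrarily for illustration.
p_y = np.array([0.5, 0.3, 0.2])
p_b1_given_y = np.array([0.1, 0.6, 0.9])

N = 200_000
y = rng.choice(3, size=N, p=p_y)
b = (rng.random(N) < p_b1_given_y[y]).astype(float)

for yv in range(3):
    idx = (y == yv)
    # Unconstrained ERM: the minimizer of the per-observation empirical BCE
    # (and likewise of the empirical MSE) is the empirical bit frequency at y.
    grid = np.linspace(1e-3, 1 - 1e-3, 999)
    emp_bce = -(b[idx].mean() * np.log(grid) + (1 - b[idx].mean()) * np.log(1 - grid))
    f_hat = grid[np.argmin(emp_bce)]
    print(f"y={yv}: true p(b=1|y)={p_b1_given_y[yv]:.2f}, "
          f"empirical freq={b[idx].mean():.3f}, ERM minimizer={f_hat:.3f}")
```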
2.2. Posterior vs. Posterior Marginals
We now draw attention to a subtle but conceptually important point: The loss in (3) considers the sum of empirical BCEs between the individual components of $\mathbf{b}$ and $\mathbf{f}$, and we have shown that this loss can be used to learn the posterior marginals $p_{b_k|\mathbf{y}}$, $k=1,\ldots,K$. But this is not equivalent to learning the joint posterior $p_{\mathbf{b}|\mathbf{y}}$, since we do not learn the conditional dependencies between the different bits $b_k$. As a consequence of the summation of the component BCEs, $\mathbf{f}$ approximates the posterior as a product of independent distributions. For an information-theoretic perspective, see also Sec. 7.2.
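A tiny illustration of this point (our own example, not from the paper): for two perfectly correlated bits, the product of the marginals assigns probability mass to bit patterns that are impossible under the joint posterior.

```python
from itertools import product

# Joint posterior over (b1, b2) given some y: the two bits are perfectly correlated.
joint = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

# Marginals p(b1 = 1 | y) and p(b2 = 1 | y), which is all a per-bit BCE can learn.
p1 = sum(p for (b1, _), p in joint.items() if b1 == 1)   # 0.5
p2 = sum(p for (_, b2), p in joint.items() if b2 == 1)   # 0.5

for b1, b2 in product([0, 1], repeat=2):
    prod = (p1 if b1 else 1 - p1) * (p2 if b2 else 1 - p2)
    print(f"({b1},{b2}): joint={joint[(b1, b2)]:.2f}, product of marginals={prod:.2f}")
# The product assigns 0.25 to (0,1) and (1,0), which have zero posterior probability.
```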
3. TRAINING FOR BLOCK ERROR RATE
3.1. The Difference Between BER and BLER Optimality
Learning to minimize the BLER in (block-)coded systems is not tantamount to learning to minimize the BER in those systems. To see this, consider a (block-)coded system in which the bits $\mathbf{b}=(b_1,\ldots,b_K)$ are encoded into codewords $\mathbf{c}=\mathrm{enc}(\mathbf{b})\in\mathcal{C}$ for reliable data transmission. (In contrast to Sec. 2, we now look at multiple bits from the same data stream.) Optimal (coded) BER is obtained when we decode on the basis of the posterior probabilities $p(b_k\,|\,\mathbf{y})$, which—as we have seen—can be learned, e.g., with a BCE loss function:
$$\hat{b}_k = \operatorname*{arg\,max}_{b_k\in\{0,1\}} \, p_{b_k|\mathbf{y}}(b_k\,|\,\mathbf{y}), \quad k=1,\ldots,K. \qquad (8)$$
Perhaps surprisingly, this need not coincide with BLER-optimal decoding, which is achieved by the decoding rule
$$\hat{\mathbf{b}} = \mathrm{dec}\Big(\operatorname*{arg\,max}_{\mathbf{c}\in\mathcal{C}} \, p_{\mathbf{c}|\mathbf{y}}(\mathbf{c}\,|\,\mathbf{y})\Big), \qquad (9)$$
where $\mathrm{dec}=\mathrm{enc}^{-1}$ is the inverse mapping of the encoder.
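The following sketch makes the gap concrete with a hand-picked toy posterior (our own illustrative numbers, not from the paper): the bitwise rule (8) and the blockwise rule (9) disagree, and the bitwise decision is not even a pattern with nonzero posterior probability.

```python
# Toy joint posterior over 3-bit patterns; only three patterns have nonzero
# posterior probability (e.g., because of a code constraint). Numbers are hand-picked.
posterior = {(0, 0, 0): 0.4, (0, 1, 1): 0.3, (1, 0, 1): 0.3}

# Blockwise MAP, as in rule (9): pick the pattern with the largest joint posterior.
block_map = max(posterior, key=posterior.get)

# Bitwise MAP, as in rule (8): threshold each marginal p(b_k = 1 | y) at 1/2.
marginals = [sum(p for c, p in posterior.items() if c[k] == 1) for k in range(3)]
bit_map = tuple(int(m > 0.5) for m in marginals)

print("blockwise-MAP decision:", block_map)    # (0, 0, 0), BLER-optimal
print("bit marginals p(b_k=1|y):", marginals)  # [0.3, 0.3, 0.6]
print("bitwise-MAP decision:  ", bit_map)      # (0, 0, 1), has zero posterior probability
```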
The reason is as follows: Even though the data bits $\mathbf{b}$ may be independent a priori, their conditional distribution given the channel output, $p_{\mathbf{b}|\mathbf{y}}(\mathbf{b}\,|\,\mathbf{y})$, is in general no longer so,
$$p_{\mathbf{b}|\mathbf{y}}(\mathbf{b}\,|\,\mathbf{y}) \ne \prod_{k=1}^{K} p_{b_k|\mathbf{y}}(b_k\,|\,\mathbf{y}).$$
We have the following result: