Efficient Parallelization of 5G-PUSCH on a
Scalable RISC-V Many-core Processor
Marco Bertuletti
ETH Zürich
Zürich, Switzerland
mbertuletti@iis.ee.ethz.ch
Yichao Zhang
ETH Zürich
Zürich, Switzerland
yiczhang@iis.ee.ethz.ch
Alessandro Vanelli-Coralli
ETH Zürich
Zürich, Switzerland
Università di Bologna
Bologna, Italy
avanelli@iis.ee.ethz.ch
Luca Benini
ETH Zürich
Zürich, Switzerland
Università di Bologna
Bologna, Italy
lbenini@iis.ee.ethz.ch
Abstract—5G radio access network (RAN) disaggregation and softwarization place heavy computational demands on the processing units. At the physical layer, the baseband processing effort is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization, and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two many-core systems with, respectively, 256 and 1024 small and efficient RISC-V cores sharing a large L1 data memory. PUSCH processing is demanding and strictly time-constrained, which makes it a challenge for baseband processors, and it is also common to most of the uplink channels. Our analysis thus generalizes to the entire lower PHY of the uplink receiver at the gNodeB (gNB). Based on an evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels, namely the fast Fourier transform, matrix-matrix multiplication, and the matrix decomposition kernels for the solution of linear systems. Compared to a single-core serial execution, our optimized parallel kernels achieve speedups of 211, 225, and 158 on MemPool and of 762, 880, and 722 on TeraPool, at high utilization (0.81, 0.89, 0.71 and 0.74, 0.88, 0.71, respectively), moving a step closer toward a full-software PUSCH implementation.
Index Terms—Many-core, RISC-V, 5G, OFDM, MIMO
I. INTRODUCTION
To provide increased flexibility, performance, and efficiency,
the 5G standard foresees the introduction of novel features
in its air-interface, known as new radio (NR), such as larger
bandwidths, higher spectrum frequencies, increased massive
multi-user multiple-input multiple-output (MIMO), beamform-
ing, etc. [1]. These enhancements require the processing of
high-dimensional signals in a fraction of milliseconds. Over
the last few years, a wide range of baseband processing
application-specific integrated circuits (ASICs) [2]–[4] have
been proposed. Industry stakeholders are, however, moving
towards more flexible solutions based on radio access network
(RAN) disaggregation and softwarization [5] to improve the
time-to-market in diverse deployment scenarios.
A key direction in RAN softwarization and disaggregation
is to exploit open software and hardware platforms, to ensure
long-term scalability, to speed up the adoption of innovative
community-developed solutions, and to reduce vendor captiv-
ity issues. The RISC-V instruction set architecture (ISA) plays
a strategic role in this context by enabling open software and
hardware architectures and designs, without the constraints
imposed by proprietary instruction sets. In this paper, we focus
on the PUSCH lower PHY of the uplink receiver at the gNB by
exploring the feasibility of implementing it on MemPool [6]
and its scaled-up version TeraPool, two clusters of respectively
256 and 1024 fully programmable RISC-V cores with a shared
low latency access L1 memory. The PUSCH lower PHY is
indeed one of the most challenging processing parts of the
entire receiving chain. The main contributions of this paper
are:
• the identification of the most computationally complex kernels of PUSCH lower PHY;
• a local memory access parallel implementation of these key kernels, reducing the memory-related stalls to less than 10% of the execution time, in MemPool and TeraPool;
• a flexible scheduling policy that enables executing kernels on subsets of the cluster's cores, supported by the implementation of barriers for partial group synchronization;
• the evaluation of the speedup of our parallel software-defined PUSCH chain, compared to a single-core serial execution, and of the achievable efficiency in terms of processor utilization and stall reduction.
The implemented parallel kernels achieve speedups of 211, 225, and 158 on MemPool and of 762, 880, and 722 on TeraPool, at utilizations of 0.81, 0.89, 0.71 and 0.74, 0.88, 0.71, respectively. The speedup obtained on the whole processing chain is 871. The execution time at a realistic clock frequency of 1 GHz is 0.785 ms, which is close to the 0.5 ms per transmission specified by the 5G PUSCH standard. Our analysis thus shows that a RISC-V-based "pool of processors" architecture, whose implementation feasibility was demonstrated in [6], is a promising candidate for a parallel software implementation of PUSCH on programmable cores.
II. 5G PUSCH KERNELS COMPLEXITY
This section reviews the key kernels in PUSCH processing. Fig. 1 represents the reference PUSCH lower PHY receiving chain. PUSCH transmission is based on orthogonal frequency division multiple access (OFDMA) [7]. User equipments (UEs) are multiplexed on a time and frequency grid (Fig. 2). Each orthogonal frequency division multiplexing (OFDM) symbol consists of $N_{SC}$ orthogonal sub-carriers, and $N_{symb}$ OFDM symbols are sent during one slot transmission. PUSCH may be interleaved in time and frequency with other channels; however, in the worst case for PUSCH computational complexity, the whole spectrum is allocated to this channel. OFDM symbols are received by a set of $N_R$ antennas.
arXiv:2210.09196v1 [cs.DC] 17 Oct 2022
[Figure 1: block diagram. Input: antennas. Stages and their key kernels: OFDM dem. (fast Fourier transform), BF (matrix-matrix multiplication), CHE (element-wise division), NE (autocorrelation), MIMO (linear system solver).]
Fig. 1. PUSCH processing chain steps: OFDM demodulation, beamforming (BF), MIMO, channel estimation (CHE) and noise estimation (NE). The steps involving pilot symbols are reported in blue.
[Figure 2: OFDM time-frequency grid with axes antennas, sub-carriers, and time (OFDM symbols 0 to 13 over 0.5 ms, 30 kHz sub-carrier spacing), marking data symbols and pilot symbols; beamforming panel with beams B0 to B31, users UE0 to UE2, and channel H.]
Fig. 2. Time-frequency grid of an OFDM system and Beamforming.
At the beginning of the baseband digital signal processing (DSP) chain, the signal received by each antenna is translated to the frequency domain via a fast Fourier transform (FFT). The complexity of this stage can be estimated as $N_{SC} \times \log(N_{SC})$ complex multiply and accumulate operations (MACs), and the kernel is run for each antenna and each OFDM symbol. As shown in Fig. 2, beamforming linearly combines the signals received by different antennas and creates $N_B$ receiving beams. This results in a matrix-matrix multiplication (MMM) with known coefficients that requires $N_R \times N_B \times N_{SC}$ complex MACs for each OFDM symbol.
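The beamforming MAC count above can be checked with a minimal sketch. It is a plain serial reference, not the parallel kernel of the paper; the function name `beamform`, the matrix layouts, and the toy sizes are assumptions made for illustration only.

```python
def beamform(B, Y):
    """Beamform one OFDM symbol with a serial matrix-matrix multiplication.

    B: N_B x N_R beamforming coefficients (assumed known), list of lists.
    Y: N_R x N_SC frequency-domain samples, one row per antenna.
    Returns (Z, macs): Z is the N_B x N_SC beam-domain signal and
    macs is the number of complex multiply-accumulate operations performed.
    """
    n_b, n_r, n_sc = len(B), len(Y), len(Y[0])
    Z = [[0j] * n_sc for _ in range(n_b)]
    macs = 0
    for b in range(n_b):            # one output row per beam
        for k in range(n_sc):       # one output column per sub-carrier
            acc = 0j
            for r in range(n_r):    # combine the antennas linearly
                acc += B[b][r] * Y[r][k]
                macs += 1
            Z[b][k] = acc
    return Z, macs


# Toy sizes (hypothetical): N_B = 3 beams, N_R = 2 antennas, N_SC = 4 sub-carriers.
B = [[1 + 0j, 1j], [1 + 0j, -1j], [0.5 + 0j, 0.5 + 0j]]
Y = [[1 + 1j, 2 + 0j, 0j, 1j], [1 - 1j, 0j, 3 + 0j, 1 + 0j]]
Z, macs = beamform(B, Y)
```

For these toy sizes the counter equals $N_R \times N_B \times N_{SC} = 2 \times 3 \times 4 = 24$, matching the estimate in the text.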
After beamforming, a signal $y \in \mathbb{C}^{N_B}$ is obtained for each sub-carrier. The relation between this signal and the signal $x \in \mathbb{C}^{N_L}$ transmitted by $N_L$ UEs can be modeled as:
$$y = Hx + n \qquad (1)$$
where $H \in \mathbb{C}^{N_B \times N_L}$ is the channel matrix and $n \in \mathbb{C}^{N_B}$ is additive white Gaussian noise. In the MIMO stage, the transmitted signal is extracted from the received signal through linear minimum mean squared error (MMSE) estimation. Before this step, the channel matrix and the noise variance are estimated. Introducing the variance of the Gaussian noise $\sigma^2$, the identity matrix $I$, the estimated channel matrix $\hat{H}$, its Hermitian $\hat{H}^H$, and the Gramian matrix $G$, the MIMO stage of PUSCH consists of the following:
$$x = \left(\hat{H}^H \hat{H} + \sigma^2 I\right)^{-1} \hat{H}^H y = G^{-1} \hat{H}^H y \qquad (2)$$
As suggested in [8], the computationally intensive matrix inversion required by MIMO can be avoided by resorting to a Cholesky decomposition of matrix $G$, followed by the solution of two triangular systems. The complexity of these steps is, respectively, $N_L^3/3$ and $2N_L^2$, for each sub-carrier and each data OFDM symbol. The channel matrix and the noise variance used in (2) are pilot-based estimates.
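As an illustration of how (2) can be evaluated without an explicit inverse, the following is a minimal serial sketch for a single sub-carrier. It is not the optimized parallel kernel of the paper; the function names (`gramian`, `cholesky`, `solve_mimo`) and the toy channel are hypothetical, and $G$ is assumed to be Hermitian positive definite.

```python
import math


def gramian(H, sigma2):
    """G = H^H H + sigma2 * I for an N_B x N_L channel estimate H."""
    n_b, n_l = len(H), len(H[0])
    G = [[sum(H[b][i].conjugate() * H[b][j] for b in range(n_b))
          for j in range(n_l)] for i in range(n_l)]
    for i in range(n_l):
        G[i][i] += sigma2
    return G


def cholesky(G):
    """Lower-triangular L with G = L L^H (G assumed Hermitian positive definite)."""
    n = len(G)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        s = G[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(math.sqrt(s))
        for i in range(j + 1, n):
            t = G[i][j] - sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = t / L[j][j]
    return L


def solve_mimo(H, y, sigma2):
    """Solve G x = H^H y via Cholesky and two triangular systems."""
    n_b, n_l = len(H), len(H[0])
    L = cholesky(gramian(H, sigma2))
    rhs = [sum(H[r][i].conjugate() * y[r] for r in range(n_b)) for i in range(n_l)]
    z = [0j] * n_l
    for i in range(n_l):                      # forward substitution: L z = H^H y
        z[i] = (rhs[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0j] * n_l
    for i in reversed(range(n_l)):            # backward substitution: L^H x = z
        acc = sum(L[k][i].conjugate() * x[k] for k in range(i + 1, n_l))
        x[i] = (z[i] - acc) / L[i][i]
    return x


# Toy noiseless case (hypothetical values): N_B = 3 beams, N_L = 2 UEs.
H = [[1 + 1j, 0.5j], [2 + 0j, 1 - 1j], [0j, 3 + 0j]]
x_true = [1 - 1j, 2 + 0.5j]
y = [sum(H[r][c] * x_true[c] for c in range(2)) for r in range(3)]
x_hat = solve_mimo(H, y, sigma2=0.0)
max_err = max(abs(a - b) for a, b in zip(x_hat, x_true))
```

With no noise and $\sigma^2 = 0$, the solver should recover the transmitted vector up to rounding error, which makes this a convenient sanity check before moving to noisy inputs.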
[Figure 3: plot titled "MACs per stage in PUSCH chain".]
Fig. 3. Complex MACs allocated to each PUSCH stage for different numbers of UEs transmitting at the same frequency.
TABLE I
PUSCH KERNELS AND COMPUTATIONAL COMPLEXITY
PUSCH stage | Key kernel | Complex MACs
OFDM dem. | Fast Fourier transform | $N_{symb} \times N_R \times N_{SC} \times \log(N_{SC})$
BF | Matrix-matrix multiplication | $N_{symb} \times N_{SC} \times N_R \times N_B$
MIMO | Cholesky decomposition | $N_{datasymb} \times N_{SC} \times (N_L^3/3 + 2N_L^2)$
CHE | Element-wise division | $N_{pilotsymb} \times N_{SC} \times N_B \times N_L$
NE | Autocorrelation | $N_{pilotsymb} \times N_{SC} \times 2N_B \times N_L$
In this paper, the block-type pilot arrangement described in [9] is assumed, and pilots are allocated to a whole OFDM symbol, as shown in Fig. 2. The channel estimation block is based on least squares estimation and consists of an element-wise matrix division. The computational cost of this kernel is $N_B \times N_L$ complex MACs for each sub-carrier and each pilot OFDM symbol. The noise variance is estimated by computing the autocorrelation of the difference between the received signal and the expected transmission output, obtained from the estimated channel and the pilots. The complexity of this kernel is $2N_B \times N_L$ complex MACs for each sub-carrier and pilot symbol. Tab. I reports the kernels of the PUSCH chain and the number of complex MACs required for each of them.
Let us consider a typical NR use case. According to the 3GPP NR numerology, we consider a bandwidth of 100 MHz with a sub-carrier spacing of 30 kHz, corresponding to 3276 sub-carriers. We assume 14 symbols per transmission and 2 pilot symbols, 64 receiving antennas, and 32 beams. Fig. 3 represents the complexity allocated to each kernel of the PUSCH processing chain as a percentage of the total. Most of the effort is in the OFDM demodulation and beamforming stages, while the impact of the MIMO stage depends on the number of UEs involved. According to Amdahl's law, this analysis shows that the throughput of the chain would greatly benefit from the speedup of FFT, MMM, and Cholesky decomposition.
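The per-stage shares and the Amdahl's law argument can be reproduced with a short sketch built on the expressions of Tab. I. Several points are assumptions rather than statements from the paper: the logarithm is taken in base 2, the number of data symbols is taken as the total minus the pilot symbols, the MIMO cost is taken as the sum of the Cholesky and triangular-solve terms, and the number of UEs ($N_L = 4$) and the acceleration factor (100) are hypothetical. The names `pusch_macs` and `amdahl` are illustrative.

```python
import math


def pusch_macs(n_sc, n_symb, n_pilot, n_r, n_b, n_l):
    """Complex MACs per PUSCH stage, following the expressions of Tab. I."""
    n_data = n_symb - n_pilot  # assumption: pilot symbols are part of the 14
    return {
        "OFDM dem. (FFT)": n_symb * n_r * n_sc * math.log2(n_sc),  # base 2 assumed
        "BF (MMM)": n_symb * n_sc * n_r * n_b,
        "MIMO (Cholesky)": n_data * n_sc * (n_l ** 3 / 3 + 2 * n_l ** 2),
        "CHE": n_pilot * n_sc * n_b * n_l,
        "NE": n_pilot * n_sc * 2 * n_b * n_l,
    }


def amdahl(p, s):
    """Overall speedup when a fraction p of the work is accelerated by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


# Use case from the text; n_l = 4 UEs is a hypothetical choice.
macs = pusch_macs(n_sc=3276, n_symb=14, n_pilot=2, n_r=64, n_b=32, n_l=4)
total = sum(macs.values())
shares = {stage: count / total for stage, count in macs.items()}

# Fraction of the work covered by the three dominant kernels.
p = shares["OFDM dem. (FFT)"] + shares["BF (MMM)"] + shares["MIMO (Cholesky)"]
bound = amdahl(p, s=100.0)  # s = 100 is an arbitrary illustrative factor

for stage, share in shares.items():
    print(f"{stage:18s} {100 * share:6.2f} %")
print(f"accelerated fraction p = {p:.4f}, Amdahl speedup at s = 100: {bound:.1f}")
```

Changing `n_l` shows how the MIMO share grows with the number of UEs, which is the trend Fig. 3 describes.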