Efficient Parallelization of 5G-PUSCH on a
Scalable RISC-V Many-core Processor
Marco Bertuletti
ETH Zürich
Zürich, Switzerland
mbertuletti@iis.ee.ethz.ch
Yichao Zhang
ETH Zürich
Zürich, Switzerland
yiczhang@iis.ee.ethz.ch
Alessandro Vanelli-Coralli
ETH Zürich
Zürich, Switzerland
Università di Bologna
Bologna, Italy
avanelli@iis.ee.ethz.ch
Luca Benini
ETH Zürich
Zürich, Switzerland
Università di Bologna
Bologna, Italy
lbenini@iis.ee.ethz.ch
Abstract—5G Radio access network disaggregation and soft-
warization pose challenges in terms of computational perfor-
mance to the processing units. At the physical layer level, the
baseband processing computational effort is typically offloaded
to specialized hardware accelerators. However, the trend toward
software-defined radio-access networks demands flexible, pro-
grammable architectures. In this paper, we explore the software
design, parallelization and optimization of the key kernels of the
lower physical layer (PHY) for physical uplink shared channel
(PUSCH) reception on MemPool and TeraPool, two manycore
systems having respectively 256 and 1024 small and efficient
RISC-V cores with a large shared L1 data memory. PUSCH processing is
demanding and strictly time-constrained; it represents a challenge for
baseband processors and is common to most of the uplink channels. Our
analysis thus generalizes
to the entire lower PHY of the uplink receiver at gNodeB
(gNB). Based on the evaluation of the computational effort
(in multiply-accumulate operations) required by the PUSCH
algorithmic stages, we focus on the parallel implementation of
the dominant kernels, namely fast Fourier transform, matrix-
matrix multiplication, and matrix decomposition kernels for the
solution of linear systems. Our optimized parallel kernels achieve
respectively on MemPool and TeraPool speedups of 211, 225, 158,
and 762, 880, 722, at high utilization (0.81, 0.89, 0.71, and 0.74,
0.88, 0.71), compared to a single-core serial execution, moving a
step closer toward a full-software PUSCH implementation.
Index Terms—Many-core, RISC-V, 5G, OFDM, MIMO
I. INTRODUCTION
To provide increased flexibility, performance, and efficiency,
the 5G standard foresees the introduction of novel features
in its air-interface, known as new radio (NR), such as larger
bandwidths, higher spectrum frequencies, increased massive
multi-user multiple-input multiple-output (MIMO), beamform-
ing, etc. [1]. These enhancements require the processing of
high-dimensional signals in a fraction of milliseconds. Over
the last few years, a wide range of baseband processing
application-specific integrated circuits (ASICs) [2]–[4] have
been proposed. Industry stakeholders are, however, moving
towards more flexible solutions based on radio access network
(RAN) disaggregation and softwarization [5] to improve the
time-to-market in diverse deployment scenarios.
A key direction in RAN softwarization and disaggregation
is to exploit open software and hardware platforms, to ensure
long-term scalability, to speed up the adoption of innovative
community-developed solutions, and to reduce vendor captiv-
ity issues. The RISC-V instruction set architecture (ISA) plays
a strategic role in this context by enabling open software and
hardware architectures and designs, without the constraints
imposed by proprietary instruction sets. In this paper, we focus
on the PUSCH lower PHY of the uplink receiver at the gNB by
exploring the feasibility of implementing it on MemPool [6]
and its scaled-up version TeraPool, two clusters of respectively
256 and 1024 fully programmable RISC-V cores with a shared
low-latency L1 memory. The PUSCH lower PHY is
indeed one of the most challenging processing parts of the
entire receiving chain. The main contributions of this paper
are:
• the identification of the most computationally complex
kernels of the PUSCH lower PHY;
• a local-memory-access parallel implementation of these
key kernels in MemPool and TeraPool, reducing the
memory-related stalls to less than 10% of the execution
time;
• a flexible scheduling policy that enables executing kernels
on subsets of the cluster's cores, supported by the imple-
mentation of barriers for partial group synchronization;
• the evaluation of the speedup of our parallel software-
defined PUSCH chain, compared to a single-core serial
execution, and of the achievable efficiency in terms of
processor utilization and stall reduction.
The implemented parallel kernels achieve respectively on
MemPool and TeraPool speedups of 211, 225, 158, and 762,
880, 722, at utilizations 0.81, 0.89, 0.71, and 0.74, 0.88, 0.71.
The speedup obtained on the whole processing chain is 871.
The execution time, assuming a realistic clock frequency
of 1 GHz, is 0.785 ms, which is close to the 0.5 ms per trans-
mission specified by the 5G PUSCH standard. Our analysis thus
shows that a RISC-V-based "pool of processors" architecture,
whose implementation feasibility was demonstrated in [6], is
a promising candidate for a parallel software implementation
of PUSCH on programmable cores.
II. 5G PUSCH KERNELS COMPLEXITY
This section reviews the key kernels in PUSCH processing.
Fig. 1 represents the reference PUSCH lower PHY receiving
chain. PUSCH transmission is based on orthogonal frequency
division multiple access (OFDMA) [7]. User equipments
arXiv:2210.09196v1 [cs.DC] 17 Oct 2022