

Efficient Parallelization of 5G-PUSCH on a Scalable RISC-V Many-core Processor

Marco Bertuletti
ETH Zürich, Zürich, Switzerland
mbertuletti@iis.ee.ethz.ch

Yichao Zhang
ETH Zürich, Zürich, Switzerland
yiczhang@iis.ee.ethz.ch

Alessandro Vanelli-Coralli
ETH Zürich, Zürich, Switzerland
Università di Bologna, Bologna, Italy
avanelli@iis.ee.ethz.ch

Luca Benini
ETH Zürich, Zürich, Switzerland
Università di Bologna, Bologna, Italy
lbenini@iis.ee.ethz.ch

Abstract—5G radio access network disaggregation and softwarization pose challenges in terms of computational performance to the processing units. At the physical layer level, the baseband processing computational effort is typically offloaded to specialized hardware accelerators. However, the trend toward software-defined radio access networks demands flexible, programmable architectures. In this paper, we explore the software design, parallelization, and optimization of the key kernels of the lower physical layer (PHY) for physical uplink shared channel (PUSCH) reception on MemPool and TeraPool, two many-core systems with 256 and 1024 small and efficient RISC-V cores, respectively, and a large shared L1 data memory. PUSCH processing is demanding and strictly time-constrained; it represents a challenge for baseband processors, and it is also common to most of the uplink channels. Our analysis thus generalizes to the entire lower PHY of the uplink receiver at the gNodeB (gNB). Based on the evaluation of the computational effort (in multiply-accumulate operations) required by the PUSCH algorithmic stages, we focus on the parallel implementation of the dominant kernels, namely fast Fourier transform, matrix-matrix multiplication, and matrix decomposition kernels for the solution of linear systems. Compared to a single-core serial execution, our optimized parallel kernels achieve speedups of 211, 225, and 158 on MemPool and of 762, 880, and 722 on TeraPool, at high utilization (0.81, 0.89, 0.71 and 0.74, 0.88, 0.71, respectively), moving a step closer toward a full-software PUSCH implementation.

Index Terms—Many-core, RISC-V, 5G, OFDM, MIMO

I. INTRODUCTION

To provide increased flexibility, performance, and efficiency, the 5G standard foresees the introduction of novel features in its air interface, known as new radio (NR), such as larger bandwidths, higher spectrum frequencies, increased massive multi-user multiple-input multiple-output (MIMO), beamforming, etc. [1]. These enhancements require the processing of high-dimensional signals in a fraction of a millisecond. Over the last few years, a wide range of baseband processing application-specific integrated circuits (ASICs) [2]–[4] have been proposed. Industry stakeholders are, however, moving towards more flexible solutions based on radio access network (RAN) disaggregation and softwarization [5] to improve the time-to-market in diverse deployment scenarios.

A key direction in RAN softwarization and disaggregation is to exploit open software and hardware platforms, to ensure long-term scalability, to speed up the adoption of innovative community-developed solutions, and to reduce vendor captivity issues. The RISC-V instruction set architecture (ISA) plays a strategic role in this context by enabling open software and hardware architectures and designs, without the constraints imposed by proprietary instruction sets. In this paper, we focus on the PUSCH lower PHY of the uplink receiver at the gNB by exploring the feasibility of implementing it on MemPool [6] and its scaled-up version TeraPool, two clusters of respectively 256 and 1024 fully programmable RISC-V cores with a shared, low-latency L1 memory. The PUSCH lower PHY is indeed one of the most challenging processing parts of the entire receiving chain. The main contributions of this paper are:

• the identification of the most computationally complex kernels of the PUSCH lower PHY;

• a local memory access parallel implementation of these key kernels, reducing the memory-related stalls to less than 10% of the execution time, in MemPool and TeraPool;

• a flexible scheduling policy that enables executing kernels on subsets of the cluster's cores, supported by the implementation of barriers for partial group synchronization;

• the evaluation of the speedup of our parallel software-defined PUSCH chain, compared to a single-core serial execution, and of the achievable efficiency in terms of processor utilization and stall reduction.

The implemented parallel kernels achieve speedups of 211, 225, and 158 on MemPool and of 762, 880, and 722 on TeraPool, at utilizations of 0.81, 0.89, 0.71 and 0.74, 0.88, 0.71, respectively. The speedup obtained on the whole processing chain is 871. The execution time, constrained to a realistic clock frequency of 1 GHz, is 0.785 ms, which is close to the 0.5 ms per transmission specified by the 5G PUSCH standard. Our analysis thus shows that a RISC-V-based "pool of processors" architecture, whose implementation feasibility was demonstrated in [6], is a promising candidate for a parallel software implementation of PUSCH on programmable cores.

II. 5G PUSCH KERNELS COMPLEXITY

This section reviews the key kernels in PUSCH processing. Fig. 1 represents the reference PUSCH lower PHY receiving chain. PUSCH transmission is based on orthogonal frequency division multiple access (OFDMA) [7]. User equipments (UEs) are multiplexed on a time and frequency grid (Fig. 2). Each orthogonal frequency division multiplexing (OFDM) symbol consists of $N_{SC}$ orthogonal sub-carriers, and $N_{symb}$ OFDM symbols are sent during one slot transmission. PUSCH may be interleaved in time and frequency with other channels; however, in the worst case for PUSCH computational complexity, the whole spectrum is allocated to this channel. OFDM symbols are received by a set of $N_R$ antennas.

arXiv:2210.09196v1 [cs.DC] 17 Oct 2022

[Figure 1: block diagram with the labels Antennas, OFDM dem. (Fast Fourier Transform), Matrix-Matrix Multiplication, CHE (Element-wise division), Autocorrelation, and MIMO (Linear System Solver).]

Fig. 1. PUSCH processing chain steps: OFDM demodulation, beamforming (BF), MIMO, channel estimation (CHE) and noise estimation (NE). The steps involving pilot symbols are reported in blue.

[Figure 2: OFDM time-frequency grid with axes for time (OFDM symbols 0 to 13 within 0.5 ms), sub-carriers (30 kHz), and antennas, marking data symbols and pilot symbols; a beamforming panel shows beams B0, B1, ..., B31 and users UE0, UE1, UE2.]

Fig. 2. Time-frequency grid of an OFDM system and Beamforming.

At the beginning of the baseband digital signal processing (DSP) chain, the signal received by each antenna is translated to the frequency domain via a Fast Fourier Transform (FFT). The complexity of this stage can be estimated as $N_{SC} \times \log(N_{SC})$ complex multiply and accumulate operations (MACs), and the kernel is run for each antenna and each OFDM symbol. As shown in Fig. 2, beamforming linearly combines the signals received by different antennas and creates $N_B$ receiving beams. This results in a matrix-matrix multiplication (MMM) with known coefficients, which requires $N_R \times N_B \times N_{SC}$ complex MACs for each OFDM symbol. After beamforming, a signal $y \in \mathbb{C}^{N_B}$ is obtained for each sub-carrier. The relation between this signal and the signal $x \in \mathbb{C}^{N_L}$ transmitted by $N_L$ UEs can be modeled as:

$$ y = Hx + n \qquad (1) $$

where $H \in \mathbb{C}^{N_B \times N_L}$ is the channel matrix and $n \in \mathbb{C}^{N_B}$ is additive white Gaussian noise. In the MIMO stage, the transmitted signal is extracted from the received signal through linear minimum mean squared error (MMSE) estimation. Before this step, the channel matrix and the noise variance are estimated. Introducing the variance of the Gaussian noise $\sigma^2$, the identity matrix $I$, the estimated channel matrix $\hat{H}$, its Hermitian $\hat{H}^H$, and the Gramian matrix $G$, the MIMO stage of PUSCH consists of the following:

x=ˆ

HHˆ

H+σ2I−1ˆ

HHy=G−1ˆ

HHy(2)

As suggested in [8], the computationally intensive matrix inversion required by MIMO can be avoided by resorting to a Cholesky decomposition of matrix $G$, followed by the solution of two triangular systems. The complexity of these steps is respectively $N_L^3/3$ and $2N_L^2$, for each sub-carrier and each data OFDM symbol. The channel matrix and the variance of noise used in (2) are pilot-based estimates.
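The inversion-free solution of (2) can be illustrated with a minimal, serial Python sketch. It is only an illustration under stated assumptions, not the parallel implementation evaluated in this paper: the function names (`cholesky`, `solve_cholesky`, `mmse_detect`) and the list-of-lists matrix layout are hypothetical, and $G$ is assumed to be Hermitian positive-definite.

```python
def cholesky(G):
    """Return lower-triangular L such that G = L L^H.

    Assumes G is Hermitian positive-definite (list of lists of complex).
    """
    n = len(G)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: real and positive for a positive-definite G.
        s = G[j][j].real - sum(abs(L[j][k]) ** 2 for k in range(j))
        L[j][j] = complex(s ** 0.5)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            acc = G[i][j] - sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = acc / L[j][j]
    return L


def solve_cholesky(G, b):
    """Solve G x = b with one decomposition and two triangular solves."""
    L = cholesky(G)
    n = len(b)
    # Forward substitution: L z = b.
    z = [0j] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^H x = z, with (L^H)[i][k] = conj(L[k][i]).
    x = [0j] * n
    for i in reversed(range(n)):
        acc = sum(L[k][i].conjugate() * x[k] for k in range(i + 1, n))
        x[i] = (z[i] - acc) / L[i][i]
    return x


def mmse_detect(H, y, sigma2):
    """Evaluate (2) for one sub-carrier: x = (H^H H + sigma2 I)^-1 H^H y."""
    nb, nl = len(H), len(H[0])
    G = [[sum(H[b][i].conjugate() * H[b][j] for b in range(nb))
          + (sigma2 if i == j else 0.0)
          for j in range(nl)] for i in range(nl)]
    rhs = [sum(H[b][i].conjugate() * y[b] for b in range(nb)) for i in range(nl)]
    return solve_cholesky(G, rhs)


if __name__ == "__main__":
    # Toy example: 3 beams, 2 UEs, noiseless observation.
    H = [[1 + 1j, 0.5j], [0.2 + 0j, 1 - 1j], [0.3j, 0.1 + 0j]]
    x = [1 + 0j, -1j]
    y = [sum(H[b][l] * x[l] for l in range(2)) for b in range(3)]
    print(mmse_detect(H, y, 0.0))
```

The decomposition contains the cubic-cost loops, while each triangular solve is quadratic in $N_L$, which mirrors the split between the two complexity terms quoted above.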

[Figure 3: plot titled "MACs per stage in PUSCH chain".]

Fig. 3. Complex MACs allocated to each PUSCH stage for different numbers of UEs transmitting at the same frequency.

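TABLE I
PUSCH KERNELS AND COMPUTATIONAL COMPLEXITY

| PUSCH stage | Key kernel | Complex MACs |
| OFDM dem. | Fast Fourier transform | $N_{symb} \times N_R \times N_{SC} \times \log(N_{SC})$ |
| BF | Matrix-matrix multiplication | $N_{symb} \times N_{SC} \times N_R \times N_B$ |
| MIMO | Cholesky decomposition | $N_{data-symb} \times N_{SC} \times (N_L^3/3 + 2N_L^2)$ |
| CHE | Element-wise division | $N_{pilot-symb} \times N_{SC} \times N_B \times N_L$ |
| NE | Autocorrelation | $N_{pilot-symb} \times N_{SC} \times 2N_B \times N_L$ |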

In this paper, the block-type arrangement described in [9] is assumed, and pilots are allocated to a whole OFDM symbol, as shown in Fig. 2. The channel estimation block is based on least squares estimation and consists of an element-wise matrix division. The computational cost of this kernel is $N_B \times N_L$ complex MACs for each sub-carrier and each pilot OFDM symbol. The noise variance is estimated by computing the autocorrelation of the difference between the received signal and the expected transmission output, obtained from the estimated channel and the pilots. The complexity of this kernel is $2N_B \times N_L$ complex MACs for each sub-carrier and pilot symbol. Tab. I reports the kernels of the PUSCH chain and the number of complex MACs required by each of them.
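As a rough illustration of the element-wise division behind least squares channel estimation, the Python sketch below divides each pilot observation by the corresponding known pilot symbol. The data layout is an assumption made here for illustration and is not taken from the paper: `y_pilot[b][l]` is taken to be the observation on beam `b` already attributed to UE `l`, so that one division per (beam, UE) pair matches the $N_B \times N_L$ count. The function name `ls_channel_estimate` is hypothetical.

```python
def ls_channel_estimate(y_pilot, pilots):
    """Least squares channel estimate for one sub-carrier.

    y_pilot: N_B x N_L list of lists of complex pilot observations
             (assumed already separated per UE; hypothetical layout).
    pilots:  list of N_L known, non-zero complex pilot symbols.
    Returns the N_B x N_L estimated channel matrix.
    """
    return [[y_pilot[b][l] / pilots[l] for l in range(len(pilots))]
            for b in range(len(y_pilot))]


if __name__ == "__main__":
    # Toy example: 2 beams, 2 UEs, noiseless pilot observations.
    H = [[1 + 2j, 3j], [0.5 + 0j, -1 + 1j]]
    pilots = [1j, 2 + 0j]
    y_pilot = [[H[b][l] * pilots[l] for l in range(2)] for b in range(2)]
    print(ls_channel_estimate(y_pilot, pilots))
```

Every output entry depends on a single input pair, so the kernel has no data dependencies between elements.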

Let us consider a typical NR use case. According to the 3GPP NR numerology, we consider a bandwidth of 100 MHz with a sub-carrier spacing of 30 kHz, corresponding to 3276 sub-carriers. We assume 14 symbols per transmission and 2 pilot symbols, 64 receiving antennas, and 32 beams. Fig. 3 represents the complexity allocated to each kernel of the PUSCH processing chain as a percentage of the total. Most of the effort is in the OFDM demodulation and beamforming stages, while the impact of the MIMO stage depends on the number of UEs involved. According to Amdahl's law, this analysis shows that the throughput of the chain would greatly benefit from the speedup of FFT, MMM, and Cholesky decomposition.
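The per-stage counts for this use case can be reproduced with a short Python sketch of the complexity expressions. Several choices here are assumptions rather than statements from the paper: the logarithm is taken in base 2, the number of data symbols is taken as the total symbols minus the pilot symbols, and the MIMO cost is taken as the sum of the Cholesky and triangular-solve costs given in the text. The function names `pusch_macs` and `shares` are hypothetical.

```python
from math import log2


def pusch_macs(n_sc, n_symb, n_pilot, n_r, n_b, n_l):
    """Complex MACs per PUSCH stage for one slot transmission."""
    n_data = n_symb - n_pilot  # assumption: all non-pilot symbols carry data
    return {
        "OFDM dem. (FFT)": n_symb * n_r * n_sc * log2(n_sc),  # base 2 assumed
        "BF (MMM)": n_symb * n_sc * n_r * n_b,
        "MIMO (Cholesky + solves)": n_data * n_sc * (n_l ** 3 / 3 + 2 * n_l ** 2),
        "CHE (element-wise division)": n_pilot * n_sc * n_b * n_l,
        "NE (autocorrelation)": n_pilot * n_sc * 2 * n_b * n_l,
    }


def shares(macs):
    """Percentage of the total MACs spent in each stage."""
    total = sum(macs.values())
    return {stage: 100.0 * count / total for stage, count in macs.items()}


if __name__ == "__main__":
    # Use case from the text: 3276 sub-carriers, 14 symbols, 2 pilot symbols,
    # 64 antennas, 32 beams; the number of UEs is swept as a free parameter.
    for n_l in (1, 2, 4, 8, 16, 32):
        pct = shares(pusch_macs(3276, 14, 2, 64, 32, n_l))
        print(n_l, {stage: round(p, 1) for stage, p in pct.items()})
```

Sweeping `n_l` shows how the cubic term in the MIMO stage grows relative to the stages whose cost does not depend on the number of UEs.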
