Memory-Efficient Recursive Evaluation of
3-Center Gaussian Integrals
Andrey Asadchev and Edward F. Valeev
Department of Chemistry, Virginia Tech, Blacksburg, VA 24061
E-mail: efv@vt.edu
January 18, 2023
Abstract
To improve the efficiency of Gaussian integral evaluation on modern accelerated
architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are
optimized for the memory footprint. For the 3-center 2-particle integrals that are key
for the evaluation of Coulomb and other 2-particle interactions in the density-fitting
approximation, the use of multi-quantal recurrences (in which multiple quanta are
created or transferred at once) is shown to produce significant memory savings. Other
innovations include leveraging register memory for a reduced memory footprint and
direct compile-time generation of optimized kernels (instead of custom code generation)
with compile-time features of modern C++/CUDA. Performance of conventional and
CUDA-based implementations of the proposed schemes is illustrated both for individual
batches of integrals involving Gaussians with low and high angular momenta
(up to L = 6) and contraction degrees, as well as for the density-fitting-based
evaluation of the Coulomb potential. The computer implementation is available
in the open-source LibintX library.
arXiv:2210.03192v2 [physics.comp-ph] 17 Jan 2023
1 Introduction
Evaluation of Gaussian integrals [1-3] accounts for a significant or a dominant portion of the
total cost of many key tasks in Gaussian LCAO electronic structure computations of molecules
and solids. Therefore efficient evaluation of various operators in Gaussian AO bases — and
in particular, 2-body Coulomb integrals (i.e., the electron repulsion integrals) — has been
the focus of much attention of the electronic structure community [1,4-32], with important
developments continuing unabated [33-44].
A particular challenge for the electronic structure community has been the greatly
expanded importance of data parallelism for the performance of modern processors.
Compared to the other key kernels of the electronic structure, namely, the linear and tensor
algebra, evaluation of Gaussian integrals is difficult to optimize due to many factors; among
the most important are: (1) the relatively low arithmetic intensity of the Gaussian integral
kernels, (2) their irregular computation and data access patterns, and (3) significant
dependence of the distributions of shell-set costs and sizes on the AO basis set family and
cardinal rank (such as X in the correlation-consistent basis set family cc-pVXZ). All of these
factors make it especially challenging to port Gaussian integral kernels onto accelerated
coprocessors, such as general-purpose graphical processing units (GPGPUs, or, simply, GPUs),
that have become the norm on both commodity and high-end platforms. Hence there
has been an intense effort to address these challenges, both on modern central
processing units (CPUs) with wide single-instruction-multiple-data (SIMD) instructions [36]
and on GPUs [35,38,40,42,45-53].
In this work we design an efficient approach for evaluation of 3-center 2-body Gaussian
integrals on massively-data-parallel devices like modern GPUs. The decision to focus on
3-center 2-body integrals is due to their foundational role in the density fitting
technology [54-56] that is crucial for efficient evaluation of many-body operators in
electronic structure [57-65]. The density fitting technology is especially valuable for
electronic structure on GPUs because it trades floating-point operations (FLOPs) for a
reduced memory footprint; this makes DF a
perfect companion for the modern memory-limited FLOP-rich compute devices. While our
work is specific to 3-center evaluation strategies [28], the main ideas apply directly to 4-center
Gaussian integrals. Lastly, while some implementation details of our work are specific to
the particular programming model of GPUs considered here (CUDA), the key algorithmic
innovations can be exploited on other data-parallel devices like modern SIMD-capable CPUs.
The rest of the manuscript is organized as follows. Section 2 discusses the 3-center
integral evaluation in the context of modern GPU architectures and their programming
models; the conclusion is that efficient recursive evaluation of 3-center Gaussian integrals
is possible on modern GPUs by reducing the memory footprint to fit entirely inside the
“fast” (shared) memory. Section 3 discusses crucial implementation details, such as how
the Gaussian integral recurrences can be implemented entirely in modern C++, without
the need for a specialized code generator, as well as brief details about the user API of
LibintX. Section 4 reports the performance of our integral engine on conventional CPUs
and NVIDIA’s V100 devices for evaluation of individual integrals as well as for evaluation
of the Coulomb potential matrix. Section 5 summarizes our findings and outlines the next
steps. The notation used throughout this paper is defined in Appendix A.
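As a flavor of the compile-time C++ approach previewed above for Section 3, the sketch below evaluates Cartesian shell sizes (a quantity the recurrences depend on for every angular momentum L) entirely at compile time. The helper names `ncart` and `ncart_sum` are illustrative only and are not part of the LibintX API:

```cpp
#include <cassert>
#include <cstddef>

// Number of Cartesian components of a Gaussian shell with angular
// momentum L: (L+1)(L+2)/2, evaluated at compile time.
constexpr std::size_t ncart(std::size_t L) { return (L + 1) * (L + 2) / 2; }

// Total number of components of all shells with angular momenta 0..L,
// e.g. the size of an intermediate spanning s through L.
constexpr std::size_t ncart_sum(std::size_t L) {
  std::size_t n = 0;
  for (std::size_t l = 0; l <= L; ++l) n += ncart(l);
  return n;
}

// Sizes are known to the compiler, so buffers can be statically allocated
// in registers/shared memory rather than computed and sized at runtime.
static_assert(ncart(0) == 1 && ncart(1) == 3 && ncart(2) == 6, "");
static_assert(ncart(6) == 28, "i-shell");
static_assert(ncart_sum(6) == 84, "s..i block");
```

Because such quantities are `constexpr`, the compiler can fully unroll loops over shell components and allocate every intermediate statically, which is what makes code generation without an external generator possible.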
2 Analysis
Our objective is to design a single evaluation strategy capable of competitive (even if not
optimal) performance for integral classes with L up to 6 and varying contraction degrees
and optimized for modern and future heterogeneous platforms. To motivate the choice of
a particular evaluation method we first must review the basics of the relevant aspects of
the GPU architecture and programming models (Section 2.1); due to the space limitations
the reader is referred to the respective hardware and programming model manuals for more
details. Evaluation, design and implementation strategies are then discussed in Sections 2.2
to 2.4.
2.1 Overview of GPU programming models and architecture
Although there exist several models for programming GPUs and other accelerators, our work
focuses on NVIDIA’s CUDA programming model, as it is the most established programming
model based on the C++ programming language (the importance of C++ for our purposes
will become clear later). Other vendors’ programming models for data-parallel processors
(HIP, DPC++), as well as the multi-vendor SYCL programming model, are modeled closely
after CUDA. Thus porting CUDA code to other accelerator architectures should be relatively
straightforward.
A single-process CUDA program consists of one or more threads of execution on the
host inserting device code (CUDA kernel) invocations into one or more CUDA streams.
Each stream executes kernels sequentially (in-order) but kernels from multiple streams can
execute at the same time; thus CUDA streams are analogous to the threads of a thread
pool on the host. Inserting and scheduling a CUDA kernel invocation involves substantial
overhead even when a single host thread is involved, on the order of a few microseconds; in
such a short period of time a modern device can execute on the order of 100 MFLOPs, thus the
amount of work per device kernel invocation must substantially exceed 1 GFLOP. For the
sake of managing the code complexity the device kernels can include calls to other device
functions (to avoid confusion with CUDA kernels that are “invoked” from the host code,
we will refer to them here as subkernels) which are often inlined by the compiler, hence the
effective cost of subkernel calls is negligible.
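The launch-overhead argument above can be made concrete with a back-of-the-envelope calculation. The overhead and throughput figures below are illustrative assumptions consistent with the text ("a few microseconds", "up to 100 MFLOPs"), not measured values:

```cpp
#include <cassert>

// Work "lost" to one kernel launch = launch overhead x device FLOP rate.
// Integer units (ns, FLOP/ns) keep the compile-time arithmetic exact.
constexpr long long launch_overhead_ns = 5000;    // ~5 us, illustrative
constexpr long long device_flops_per_ns = 20000;  // ~20 TFLOP/s, illustrative
constexpr long long breakeven_flops =
    launch_overhead_ns * device_flops_per_ns;

// ~100 MFLOPs per launch; hence each kernel should do >~ 1 GFLOP of work
// (i.e., ~10x the break-even amount) for the overhead to be negligible.
static_assert(breakeven_flops == 100'000'000, "");
```

This is why the evaluation strategy must batch many shell-sets into a single kernel invocation rather than launching one kernel per integral class instance.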
Key concepts of the CUDA execution model are threads and thread blocks. Execution of
an instruction by a thread is analogous to executing a single scalar component of a vector
(SIMD) instruction by a CPU. Threads are scheduled in blocks, with each thread block
further internally partitioned into a set of atomic groups of threads (warps); the warps of
a single thread block are typically executed concurrently. Each thread block is bound to a
streaming multiprocessor (SM), which is analogous to a single CPU core. Each SM may be
executing warps from one or more thread blocks concurrently. Having multiple thread blocks
resident on an SM makes it possible to hide the latency of certain operations, such as main
memory accesses.
The CUDA memory hierarchy [66] includes registers (private to a thread), shared memory
(private to a thread block), and global memory (accessible from any thread). These memory
spaces correspond to hardware memories located on each SM (registers, shared memory, L1
cache of the global memory) and shared by all SMs (L2 and optionally higher level caches
and DRAM).
A distinctive feature of modern GPUs is the availability of per-SM shared memory, also
known as local data store (LDS) in ROCm/HIP and local memory in SYCL; scratchpad
memory is a general term that is often used to describe these types of memory. Several
properties of shared memory make it the optimal location for non-register data in a
high-performance code on a GPU: (a) its low latency (up to 50× lower than that of a
main-memory location missing from the Translation Lookaside Buffer (TLB) [66]), (b) usually
fast nonsequential access from consecutive threads (nonsequential accesses to main memory
can be hindered by coalescing constraints), and (c) fast reads/writes relative to main memory.
Although the shared memory must be managed explicitly, that is an advantage for the
high-performance code.
These favorable features of the shared memory thus motivate the central objective of the
current work: to design an integral evaluation strategy that ensures that the entire data can
fit into the registers and/or shared memory. Although the total size of registers and shared
memory varies between devices and architectures, the typical amount is on the order of a
few hundred KiB. For example, each SM on the NVIDIA V100 GPU has 256 KiB of
registers and up to 96 KiB of shared memory, while the newer NVIDIA A100 GPU has up to
160 KiB of shared memory per SM. These figures are in line with the corresponding hardware
characteristics of high-end GPUs from other vendors. Also note that these numbers are per
SM, not per thread block: to allow SM concurrency each thread block must use at most half
of the shared memory and registers.
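The per-SM budget described above translates into a simple occupancy constraint, sketched below at compile time. The V100 shared-memory capacity is taken from the text; the per-block footprint is a hypothetical kernel chosen to respect the "at most half" rule, not a measured LibintX figure:

```cpp
#include <cassert>

// Thread blocks co-resident on one SM, as limited by shared memory alone.
constexpr unsigned smem_per_sm = 96 * 1024;     // V100: up to 96 KiB per SM
constexpr unsigned smem_per_block = 40 * 1024;  // hypothetical kernel footprint
constexpr unsigned resident_blocks = smem_per_sm / smem_per_block;

// Two blocks fit, so the SM can overlap one block's memory stalls with
// another block's arithmetic; a 96 KiB-per-block kernel would forfeit this.
static_assert(resident_blocks == 2, "");
```

This is the quantitative reason each thread block should use at most half of the SM's shared memory and registers: a footprint above half serializes the SM to a single resident block and removes the latency-hiding concurrency.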