Memory-Efficient Recursive Evaluation of
3-Center Gaussian Integrals
Andrey Asadchev and Edward F. Valeev
Department of Chemistry, Virginia Tech, Blacksburg, VA 24061
E-mail: efv@vt.edu
January 18, 2023
Abstract
To improve the efficiency of Gaussian integral evaluation on modern accelerated
architectures, FLOP-efficient Obara-Saika-based recursive evaluation schemes are
optimized for the memory footprint. For the 3-center 2-particle integrals that are key
for the evaluation of Coulomb and other 2-particle interactions in the density-fitting
approximation, the use of multi-quantal recurrences (in which multiple quanta are
created or transferred at once) is shown to produce significant memory savings. Other
innovations include leveraging register memory for a reduced memory footprint and
direct compile-time generation of optimized kernels (instead of custom code generation)
with compile-time features of modern C++/CUDA. Performance of conventional and
CUDA-based implementations of the proposed schemes is illustrated both for individual
batches of integrals involving Gaussians with low and high angular momenta
(up to L = 6) and contraction degrees, as well as for the density-fitting-based
evaluation of the Coulomb potential. The computer implementation is available
in the open-source LibintX library.
arXiv:2210.03192v2 [physics.comp-ph] 17 Jan 2023
1 Introduction
Evaluation of Gaussian integrals [1-3] accounts for a significant or a dominant portion of the
total cost of many key tasks in Gaussian LCAO electronic structure computations of molecules
and solids. Therefore efficient evaluation of various operators in Gaussian AO bases — and
in particular, 2-body Coulomb integrals (i.e., the electron repulsion integrals) — has been
the focus of much attention of the electronic structure community [1,4-32], with important
developments continuing unabated [33-44].
A particular challenge for the electronic structure community has been the greatly
expanded importance of data parallelism for the performance of modern processors.
Compared to the other key kernels of the electronic structure, namely, the linear and tensor
algebra, evaluation of Gaussian integrals is difficult to optimize due to many factors; among
the most important are: (1) the relatively low arithmetic intensity of the Gaussian integral
kernels, (2) their irregular computation and data access patterns, and (3) significant
dependence of the distributions of shell-set costs and sizes on the AO basis set family and
cardinal rank (such as X in the correlation-consistent basis set family cc-pVXZ). All of these
factors make it especially challenging to port Gaussian integral kernels onto accelerated
coprocessors, such as general-purpose graphical processing units (GPGPUs, or, simply, GPUs),
that have become the norm on both commodity and high-end platforms. Hence there
has been an intense effort to address these challenges, both on modern central
processing units (CPUs) with wide single-instruction-multiple-data (SIMD) instructions [36]
and on GPUs [35,38,40,42,45-53].
In this work we design an efficient approach for evaluation of 3-center 2-body Gaussian
integrals on massively-data-parallel devices like modern GPUs. The decision to focus on
3-center 2-body integrals is due to their foundational role in the density fitting
technology [54-56] that is crucial for efficient evaluation of many-body operators in
electronic structure [57-65]. The density fitting technology is especially valuable for
electronic structure on GPUs because it trades floating-point operations (FLOPs) for a
reduced memory footprint; this makes DF a
perfect companion for the modern memory-limited FLOP-rich compute devices. While our
work is specific to 3-center evaluation strategies [28], the main ideas apply directly to 4-center
Gaussian integrals. Lastly, while some implementation details of our work are specific to
the particular programming model of GPUs considered here (CUDA), the key algorithmic
innovations can be exploited on other data-parallel devices like modern SIMD-capable CPUs.
The rest of the manuscript is organized as follows. Section 2 discusses the 3-center
integral evaluation in the context of modern GPU architectures and their programming
models; the conclusion is that efficient recursive evaluation of 3-center Gaussian integrals
is possible on modern GPUs by reducing the memory footprint to fit entirely inside the
“fast” (shared) memory. Section 3 discusses crucial implementation details, such as how
the Gaussian integral recurrences can be implemented entirely in modern C++, without
the need for a specialized code generator, as well as brief details about the user API of
LibintX. Section 4 reports the performance of our integral engine on conventional CPUs
and NVIDIA’s V100 devices for evaluation of individual integrals as well as for evaluation
of the Coulomb potential matrix. Section 5 summarizes our findings and outlines the next
steps. The notation used throughout this paper is defined in Appendix A.
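As a flavor of the compile-time C++ approach previewed above for Section 3, the sketch below evaluates Cartesian shell sizes (a quantity the recurrences depend on for every angular momentum L) entirely at compile time. The helper names `ncart` and `ncart_sum` are illustrative only and are not part of the LibintX API:

```cpp
#include <cassert>
#include <cstddef>

// Number of Cartesian components of a Gaussian shell with angular
// momentum L: (L+1)(L+2)/2, evaluated at compile time.
constexpr std::size_t ncart(std::size_t L) { return (L + 1) * (L + 2) / 2; }

// Total number of components of all shells with angular momenta 0..L,
// e.g. the size of an intermediate spanning s through L.
constexpr std::size_t ncart_sum(std::size_t L) {
  std::size_t n = 0;
  for (std::size_t l = 0; l <= L; ++l) n += ncart(l);
  return n;
}

// Sizes are known to the compiler, so buffers can be statically allocated
// in registers/shared memory rather than computed and sized at runtime.
static_assert(ncart(0) == 1 && ncart(1) == 3 && ncart(2) == 6, "");
static_assert(ncart(6) == 28, "i-shell");
static_assert(ncart_sum(6) == 84, "s..i block");
```

Because such quantities are `constexpr`, the compiler can fully unroll loops over shell components and allocate every intermediate statically, which is what makes code generation without an external generator possible.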
2 Analysis
Our objective is to design a single evaluation strategy capable of competitive (even if not
optimal) performance for integral classes with L up to 6 and varying contraction degrees
and optimized for modern and future heterogeneous platforms. To motivate the choice of
a particular evaluation method we first must review the basics of the relevant aspects of
the GPU architecture and programming models (Section 2.1); due to the space limitations
the reader is referred to the respective hardware and programming model manuals for more
details. Evaluation, design and implementation strategies are then discussed in Sections 2.2
to 2.4.
2.1 Overview of GPU programming models and architecture
Although there exist several models for programming GPUs and other accelerators, our work
focuses on NVIDIA’s CUDA programming model, as it is the most established programming
model based on the C++ programming language (the importance of C++ for our purposes
will become clear later). Other vendors’ programming models for data-parallel processors
(HIP, DPC++), as well as the multi-vendor SYCL programming model, are modeled closely
after CUDA. Thus porting CUDA code to other accelerator architectures should be relatively
straightforward.
A single-process CUDA program consists of one or more threads of execution on the
host inserting device code (CUDA kernel) invocations into one or more CUDA streams.
Each stream executes kernels sequentially (in-order) but kernels from multiple streams can
execute at the same time; thus CUDA streams are analogous to the threads of a thread
pool on the host. Inserting and scheduling a CUDA kernel invocation involves substantial
overhead even when a single host thread is involved, on the order of a few microseconds; in
such a short period of time a modern device can execute on the order of 100 MFLOPs, thus the
amount of work per device kernel invocation must substantially exceed 1 GFLOP. For the
sake of managing the code complexity the device kernels can include calls to other device
functions (to avoid confusion with CUDA kernels that are “invoked” from the host code,
we will refer to them here as subkernels) which are often inlined by the compiler, hence the
effective cost of subkernel calls is negligible.
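The launch-overhead argument above can be made concrete with a back-of-the-envelope calculation. The overhead and throughput figures below are illustrative assumptions consistent with the text ("a few microseconds", "up to 100 MFLOPs"), not measured values:

```cpp
#include <cassert>

// Work "lost" to one kernel launch = launch overhead x device FLOP rate.
// Integer units (ns, FLOP/ns) keep the compile-time arithmetic exact.
constexpr long long launch_overhead_ns = 5000;    // ~5 us, illustrative
constexpr long long device_flops_per_ns = 20000;  // ~20 TFLOP/s, illustrative
constexpr long long breakeven_flops =
    launch_overhead_ns * device_flops_per_ns;

// ~100 MFLOPs per launch; hence each kernel should do >~ 1 GFLOP of work
// (i.e., ~10x the break-even amount) for the overhead to be negligible.
static_assert(breakeven_flops == 100'000'000, "");
```

This is why the evaluation strategy must batch many shell-sets into a single kernel invocation rather than launching one kernel per integral class instance.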
Key concepts of the CUDA execution model are threads and thread blocks. Execution of
an instruction by a thread is analogous to executing a single scalar component of a vector
(SIMD) instruction by a CPU. Threads are scheduled in blocks, with each thread block
further internally partitioned into a set of atomic groups of threads (warps); the warps of
a single thread block are typically executed concurrently. Each thread block is bound to a
streaming multiprocessor (SM), which is analogous to a single CPU core. Each SM may be
executing warps from one or more thread blocks concurrently. Having multiple thread blocks
resident on an SM makes it possible to hide the latency of certain operations, such as main
memory accesses.
The CUDA memory hierarchy [66] includes registers (private to a thread), shared memory
(private to a thread block), and global memory (accessible from any thread). These memory
spaces correspond to hardware memories located on each SM (registers, shared memory, L1
cache of the global memory) and shared by all SMs (L2 and optionally higher level caches
and DRAM).
A distinctive feature of modern GPUs is the availability of per-SM shared memory, also
known as local data store (LDS) in ROCm/HIP and local memory in SYCL; scratchpad
memory is a general term that is often used to describe these types of memory. Several
properties of shared memory make it the optimal location for non-register data in a
high-performance code on a GPU: (a) its low latency (up to 50× lower than that of a
main-memory location missing from the Translation Lookaside Buffer (TLB) [66]), (b) usually
fast nonsequential access from consecutive threads (nonsequential accesses to main memory
can be hindered by coalescing constraints), and (c) fast reads/writes relative to main memory.
Although the shared memory must be managed explicitly, that is an advantage for the
high-performance code.
These favorable features of the shared memory thus motivate the central objective of the
current work: to design an integral evaluation strategy that ensures that the entire data can
fit into the registers and/or shared memory. Although the total size of registers and shared
memory varies between devices and architectures, the typical amount is on the order of a
few hundred KiB. For example, each SM on the NVIDIA V100 GPU has 256 KiB of
registers and up to 96 KiB of shared memory, while the newer NVIDIA A100 GPU has up to
160 KiB of shared memory per SM. These figures are in line with the corresponding hardware
characteristics of high-end GPUs from other vendors. Also note that these numbers are per
SM, not per thread block: to allow SM concurrency each thread block must use at most half
of the shared memory and registers.
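The per-SM budget described above translates into a simple occupancy constraint, sketched below at compile time. The V100 shared-memory capacity is taken from the text; the per-block footprint is a hypothetical kernel chosen to respect the "at most half" rule, not a measured LibintX figure:

```cpp
#include <cassert>

// Thread blocks co-resident on one SM, as limited by shared memory alone.
constexpr unsigned smem_per_sm = 96 * 1024;     // V100: up to 96 KiB per SM
constexpr unsigned smem_per_block = 40 * 1024;  // hypothetical kernel footprint
constexpr unsigned resident_blocks = smem_per_sm / smem_per_block;

// Two blocks fit, so the SM can overlap one block's memory stalls with
// another block's arithmetic; a 96 KiB-per-block kernel would forfeit this.
static_assert(resident_blocks == 2, "");
```

This is the quantitative reason each thread block should use at most half of the SM's shared memory and registers: a footprint above half serializes the SM to a single resident block and removes the latency-hiding concurrency.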