Abstract
Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions ($\mathcal{O}(10^{12})$) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters.
We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution, and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOP/s when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine and to more than $40\%$ of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, performs slightly worse in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
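The headline figures can be put into per-GPU terms with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; the variable names are illustrative, and because the "about 3%" figure is rounded, the implied machine peak is only a rough estimate.

```python
# Figures quoted in the abstract.
achieved_flops = 0.78e15   # sustained rate of the CG solver, in FLOP/s
unknowns = 0.55e12         # number of unknowns in the linear system
gpus = 16384               # number of GPUs used
peak_fraction = 0.03       # "about 3%" of theoretical peak (rounded in the source)

# Derived quantities; rough, since peak_fraction is itself approximate.
implied_peak = achieved_flops / peak_fraction
unknowns_per_gpu = unknowns / gpus
flops_per_gpu = achieved_flops / gpus

print(f"implied machine peak : {implied_peak / 1e15:.0f} PFLOP/s")
print(f"unknowns per GPU     : {unknowns_per_gpu / 1e6:.1f} million")
print(f"sustained per GPU    : {flops_per_gpu / 1e9:.1f} GFLOP/s")
```

Under these assumptions each GPU holds roughly 34 million unknowns and sustains just under 50 GFLOP/s, which is consistent with a solver limited by memory bandwidth rather than by arithmetic throughput.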
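For readers unfamiliar with the Conjugate Gradient method named in the abstract, the following is a minimal, unpreconditioned sketch in plain Python. It is not the authors' GPU implementation: the 1D operator, the parameter `lam` (loosely standing in for the strength of the implicit coupling) and all function names are assumptions chosen only to keep the example small and self-contained.

```python
import math

def apply_operator(u, lam):
    """Matrix-free product y = A u for a toy 1D operator (an assumption, not the
    paper's operator): (A u)_i = (1 + 2*lam)*u_i - lam*(u_{i-1} + u_{i+1}),
    with zero values outside the grid."""
    n = len(u)
    y = [0.0] * n
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        y[i] = (1.0 + 2.0 * lam) * u[i] - lam * (left + right)
    return y

def dot(a, b):
    """Euclidean inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(b, lam, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for A u = b; returns (solution, iterations used)."""
    u = [0.0] * len(b)
    r = list(b)                      # residual b - A u for the initial guess u = 0
    p = list(r)                      # first search direction
    rr = dot(r, r)
    if rr == 0.0:                    # zero right-hand side: u = 0 is exact
        return u, 0
    b_norm = math.sqrt(rr)
    for k in range(1, max_iter + 1):
        Ap = apply_operator(p, lam)
        alpha = rr / dot(p, Ap)      # step length along p
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if math.sqrt(rr_new) <= tol * b_norm:
            return u, k
        beta = rr_new / rr           # makes the new direction A-conjugate to p
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return u, max_iter

if __name__ == "__main__":
    b = [1.0] * 64
    for lam in (1.0, 100.0):
        u, iters = conjugate_gradient(b, lam)
        res = [bi - yi for bi, yi in zip(b, apply_operator(u, lam))]
        print(f"lam={lam:6.1f}  iterations={iters}  |r|={math.sqrt(dot(res, res)):.2e}")
```

Each iteration needs one operator application and a handful of vector updates and reductions, all of which stream through memory with little arithmetic per value loaded. In this toy setting a larger `lam` raises the condition number and typically the iteration count, which mirrors, in a very simplified way, why a solver whose convergence is robust to such a constant can win overall despite a lower FLOP rate.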
Original language  English 

Pages (from-to)  53-69 
Number of pages  20 
Journal  Parallel Computing 
Volume  50 
Early online date  28 Oct 2015 
Publication status  Published - 1 Dec 2015 
Keywords
 iterative solver
 multigrid
 Graphics Processing Unit
 massively parallel
 atmospheric modelling
Title  Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters 
Projects
 2 Finished

Phase 2 Scalability of Elliptic Solvers in Weather and Climate Modelling
Natural Environment Research Council
24/06/13 → 30/06/16
Project: Research council

Scalability of Elliptic Solvers in Weather and Climate Modelling
Natural Environment Research Council
7/09/11 → 6/11/13
Project: Research council
Equipment

High Performance Computing (HPC) Facility
Steven Chapman (Manager)
University of Bath
Facility/equipment: Facility