Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters

Research output: Contribution to journal (Article)

Abstract

Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions ($\mathcal{O}(10^{12})$) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters.

We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine and more than $40\%$ of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.
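
The figures quoted in the abstract imply a few derived quantities that help put the result in context. The following is a minimal sketch of that arithmetic in Python; the function and variable names are illustrative assumptions, and because the abstract only says "about" 3%, the implied peak is a rough estimate rather than a reported number.

```python
# Back-of-envelope arithmetic using only numbers quoted in the abstract.
# All names below are hypothetical; none of them come from the paper.

ACHIEVED_PFLOPS = 0.78   # sustained CG performance quoted in the abstract
PEAK_FRACTION = 0.03     # "about 3%" of theoretical peak (approximate)
UNKNOWNS = 0.55e12       # number of unknowns in the linear system
GPUS = 16384             # number of GPUs used


def implied_peak_pflops(achieved: float, fraction: float) -> float:
    """Theoretical peak implied by an achieved rate and its fraction of peak."""
    return achieved / fraction


def unknowns_per_gpu(unknowns: float, gpus: int) -> float:
    """Average number of unknowns each GPU has to hold."""
    return unknowns / gpus


def gflops_per_gpu(achieved_pflops: float, gpus: int) -> float:
    """Average sustained rate per GPU in GFLOPs (1 PFLOP = 1e6 GFLOP)."""
    return achieved_pflops * 1.0e6 / gpus


if __name__ == "__main__":
    print(f"implied peak:     {implied_peak_pflops(ACHIEVED_PFLOPS, PEAK_FRACTION):.1f} PFLOPs")
    print(f"unknowns per GPU: {unknowns_per_gpu(UNKNOWNS, GPUS):.3e}")
    print(f"rate per GPU:     {gflops_per_gpu(ACHIEVED_PFLOPS, GPUS):.1f} GFLOPs")
```

Under these assumptions each GPU holds roughly 34 million unknowns and sustains a little under 50 GFLOPs, which is consistent with the abstract's point that the solver is limited by memory bandwidth rather than by floating-point throughput.
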
Original language: English
Pages (from-to): 53-69
Number of pages: 20
Journal: Parallel Computing
Volume: 50
Early online date: 28 Oct 2015
DOIs: 10.1016/j.parco.2015.10.007
Publication status: Published - 1 Dec 2015

Keywords

  • iterative solver
  • multigrid
  • Graphics Processing Unit
  • massively parallel
  • atmospheric modelling

Cite this

Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters. / Mueller, Eike; Scheichl, Robert; Vainikko, Eero.

In: Parallel Computing, Vol. 50, 01.12.2015, p. 53-69.

@article{40d7aba0676e475791bafd936c9c027a,
title = "Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters",
abstract = "Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions ($\mathcal{O}(10^{12})$) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters. We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine and more than $40\%$ of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.",
keywords = "iterative solver, multigrid, Graphics Processing Unit, massively parallel, atmospheric modelling",
author = "Eike Mueller and Robert Scheichl and Eero Vainikko",
year = "2015",
month = "12",
day = "1",
doi = "10.1016/j.parco.2015.10.007",
language = "English",
volume = "50",
pages = "53--69",
journal = "Parallel Computing",
issn = "0167-8191",
publisher = "Elsevier",

}

TY - JOUR

T1 - Petascale solvers for anisotropic PDEs in atmospheric modelling on GPU clusters

AU - Mueller, Eike

AU - Scheichl, Robert

AU - Vainikko, Eero

PY - 2015/12/1

Y1 - 2015/12/1

N2 - Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions ($\mathcal{O}(10^{12})$) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters. We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine and more than $40\%$ of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.

AB - Memory-bound applications such as solvers for large sparse systems of equations remain a challenge for GPUs. Fast solvers should be based on numerically efficient algorithms and implemented such that global memory access is minimised. To solve systems with trillions ($\mathcal{O}(10^{12})$) of unknowns, the code has to make efficient use of several million individual processor cores on large GPU clusters. We describe the multi-GPU implementation of two algorithmically optimal iterative solvers for anisotropic PDEs which are encountered in (semi-)implicit time stepping procedures in atmospheric modelling. In this application the condition number is large but independent of the grid resolution and both methods are asymptotically optimal, albeit with different absolute performance. In particular, an important constant in the discretisation is the CFL number; only the multigrid solver is robust to changes in this constant. We parallelise the solvers and adapt them to the specific features of GPU architectures, paying particular attention to efficient global memory access. With a Conjugate Gradient (CG) solver we achieve a performance of up to 0.78 PFLOPs when solving an equation with $0.55\cdot 10^{12}$ unknowns on 16384 GPUs; this corresponds to about $3\%$ of the theoretical peak performance of the machine and more than $40\%$ of the peak memory bandwidth. Although the other solver, a geometric multigrid algorithm, has a slightly worse performance in terms of FLOPs per second, overall it is faster as it needs fewer iterations to converge; the multigrid algorithm can solve a linear PDE with half a trillion unknowns in about one second.

KW - iterative solver

KW - multigrid

KW - Graphics Processing Unit

KW - massively parallel

KW - atmospheric modelling

U2 - 10.1016/j.parco.2015.10.007

DO - 10.1016/j.parco.2015.10.007

M3 - Article

VL - 50

SP - 53

EP - 69

JO - Parallel Computing

JF - Parallel Computing

SN - 0167-8191

ER -