Performance of Density Functional Theory codes on Cray XE6 Zhengji - PowerPoint PPT Presentation

Performance of Density Functional Theory codes on Cray XE6
Zhengji Zhao and Nicholas Wright
National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory

Outline
• Motivation
• Introduction to DFT codes
• Threads and performance of VASP
• OpenMP threads and performance of Quantum Espresso
• Conclusion

Motivation
• Challenges from the multi-core trend
– Address reduced per-core memory
– Make use of faster intra-node memory access
• Recommended path forward is to use threads/OpenMP
• The majority of NERSC application codes are still flat MPI
• Examine the performance implications of using threads in real user applications

Why DFT codes
• Materials and Chemistry applications account for 1/3 of the NERSC workload.
• 75% of them run various DFT codes.
• Among 500 application code instances at NERSC, VASP consumes the most computing cycles (~8%).
• VASP is pure MPI, which is the current status of the majority of user codes.
• Quantum Espresso, an OpenMP/MPI hybrid code, is the #8 code at NERSC.

Density Functional Theory
• What it solves: the Kohn-Sham equation
$$\left\{ -\tfrac{1}{2}\nabla^2 + V(r)[\rho] \right\} \psi_i(r) = E_i\, \psi_i(r)$$
$$\int \psi_i^*(r)\, \psi_j(r)\, dr = \delta_{ij}, \qquad \{\psi_i\}_{i=1,\dots,N}$$
• Local Density Approximation:
$$V(r)[\rho] = \sum_R \frac{Z_R}{|r-R|} + \int \frac{\rho(r')}{|r-r'|}\, d^3r' + \mu(\rho(r))$$
• Planewave expansion:
$$\psi_i(r) = \sum_G C_{i,G}\, e^{i(k+G)\cdot r}$$
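Inserting the planewave expansion into the Kohn-Sham equation turns it into a matrix eigenvalue problem for the coefficients $C_{i,G}$. The following is a sketch under two assumptions that are not on the slide: the potential $V(r)$ is purely local, and $\Omega$ denotes the volume of the periodic cell. The symbol $\tilde V$ is introduced here only for illustration.

```latex
\sum_{G'} \left[ \tfrac{1}{2}\,|k+G|^2\, \delta_{G,G'} + \tilde V(G-G') \right] C_{i,G'} = E_i\, C_{i,G},
\qquad
\tilde V(G) = \frac{1}{\Omega} \int_{\Omega} V(r)\, e^{-iG\cdot r}\, d^3r
```

The kinetic term is diagonal in G-space while a local potential is diagonal in real space, which is why applying the Hamiltonian to a wavefunction is usually done with a pair of FFTs instead of an explicit matrix product.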

Flow chart of DFT codes
• Start: N electrons, N wave functions; trial charge density $\rho(r)$ and trial wavefunctions $\{\psi_i\}_{i=1,\dots,N}$
• Solve $\left\{ -\tfrac{1}{2}\nabla^2 + V_{in}(r) \right\} \psi_i(r) = E_i\, \psi_i(r)$ (CG, RMM-DIIS, Davidson); $H\psi$, 2 FFTs
• Orthogonalization: $\int \psi_i^*(r)\, \psi_j(r)\, dr = \delta_{ij}$
• Subspace diagonalization: $\langle \psi_i | H | \psi_j \rangle$
• Charge density: $\rho(r) = \sum_i f_i\, |\psi_i(r)|^2$
• Potential generation: solve the Poisson equation and use the density functional formula to obtain $V_{out}(r)$
• Potential mixing: $V_{in}, V_{out} \rightarrow$ new $V_{in}$
• If $\Delta E < \varepsilon$: yes, break; no, repeat with the new $V_{in}$ and $\{\psi_i\}_{i=1,\dots,N}$
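The control flow of this chart (solve, mix the potential, test the energy change, repeat) can be sketched in a few lines. This is only a toy illustration: `toy_solve` is a hypothetical stand-in for the expensive eigensolver, density, and potential-generation steps, the "potential" is a single number, and linear mixing with a fixed factor is an assumption, not necessarily what VASP or Quantum Espresso do.

```python
import math


def toy_solve(v_in):
    """Hypothetical stand-in for one SCF step (eigensolver, density, Poisson).

    Maps an input "potential" to an output "potential" and an "energy".
    The real codes do this with N wavefunctions and FFTs; here it is a scalar.
    """
    v_out = math.cos(v_in)
    energy = 0.5 * v_out * v_out
    return energy, v_out


def scf_loop(v_in, mix=0.5, eps=1e-10, max_iter=200):
    """Iterate until the energy change between two steps drops below eps."""
    e_old = None
    for iteration in range(1, max_iter + 1):
        energy, v_out = toy_solve(v_in)
        # Convergence test from the flow chart: Delta E < epsilon -> break.
        if e_old is not None and abs(energy - e_old) < eps:
            return v_in, energy, iteration
        # Potential mixing: combine V_in and V_out into the new V_in.
        v_in = (1.0 - mix) * v_in + mix * v_out
        e_old = energy
    raise RuntimeError("SCF loop did not converge within max_iter steps")
```

At convergence the input and output potentials agree, which is the self-consistency condition the loop is driving toward.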

Parallelization in DFT codes
Level 1: Parallel over k-points
• The total number of processors, $N_{tot}$, is divided into $n_{kg}$ groups, each with $N_k$ processors ($N_{tot} = n_{kg} \times N_k$)
• Each group of processors deals with $nk_{tot}/n_{kg}$ k-points
$$\{\psi_{i,k}\}, \qquad i = 1,\dots,N; \quad k = 1,\dots,nk_{tot}$$
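The k-point level can be illustrated with a small helper that splits ranks into groups and hands each group its share of k-points. This is a sketch, not code from either package: the function name `kpoint_groups`, the 0-based rank numbering, and the round-robin assignment of k-points are all assumptions made for illustration.

```python
def kpoint_groups(n_tot, n_kg, nk_tot):
    """Split n_tot ranks into n_kg groups and assign k-points to each group.

    Returns a list of (ranks, kpoints) pairs, one per group.
    Ranks and k-point indices are 0-based in this sketch.
    """
    if n_tot % n_kg != 0:
        raise ValueError("n_tot must be divisible by n_kg (N_tot = n_kg * N_k)")
    n_k = n_tot // n_kg  # processors per k-point group
    groups = []
    for g in range(n_kg):
        ranks = list(range(g * n_k, (g + 1) * n_k))
        # Round-robin k-point assignment (an assumption): group g takes
        # k-points g, g + n_kg, g + 2*n_kg, ...
        kpoints = list(range(g, nk_tot, n_kg))
        groups.append((ranks, kpoints))
    return groups
```

For example, `kpoint_groups(8, 4, 4)` gives four groups of two ranks, each owning exactly one k-point, so the groups can work on their k-points with no communication between them until the charge density is summed.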

Parallelization in DFT codes
Level 2: Parallel over bands
• The $N_k$ processors are divided into $N_g$ groups, each with $N_p$ processors ($N_k = N_g \times N_p$)
• The N wavefunctions $\{\psi_{i,k}\}_{i=1,\dots,N}$ are also divided into $N_g$ groups, each with m wavefunctions
• One group of processors deals with one group of wavefunctions:
– Group 1: processors $1$ to $N_p$; wavefunctions $\{\psi_{i,k}\}$, $i = 1,\dots,m$
– Group 2: processors $N_p+1$ to $2N_p$; wavefunctions $\{\psi_{i,k}\}$, $i = m+1,\dots,2m$
– ……
– Group $N_g$: processors $N_k-N_p+1$ to $N_k$; wavefunctions $\{\psi_{i,k}\}$, $i = m(N_g-1)+1,\dots,N$
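The index ranges on this slide can be generated mechanically. The sketch below uses 1-based processor and band indices to match the slide; the function name `band_groups` is hypothetical, and letting the last group absorb any leftover bands when N is not a multiple of the group count is an assumption consistent with the last group ending at N.

```python
def band_groups(n_k, n_g, n_bands):
    """Return (proc_first, proc_last, band_first, band_last) for each group.

    n_k processors are split into n_g groups of n_p processors each, and
    n_bands wavefunctions are split into n_g groups of m bands each.
    All indices are 1-based and inclusive, as on the slide.
    """
    if n_k % n_g != 0:
        raise ValueError("n_k must be divisible by n_g (N_k = N_g * N_p)")
    if n_bands < n_g:
        raise ValueError("need at least one band per group")
    n_p = n_k // n_g      # processors per band group
    m = n_bands // n_g    # bands per group (last group takes the remainder)
    groups = []
    for g in range(1, n_g + 1):
        proc_first = (g - 1) * n_p + 1
        proc_last = g * n_p
        band_first = m * (g - 1) + 1
        band_last = n_bands if g == n_g else m * g
        groups.append((proc_first, proc_last, band_first, band_last))
    return groups
```

With 8 processors, 4 groups, and 10 bands, the first three groups each get 2 processors and 2 bands, and the last group gets processors 7 to 8 and bands 7 to 10.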

Parallelization in DFT codes
Level 3: Parallel over planewave basis set
• Within each group of processors, the planewave basis is divided among the $N_p$ processors:
$$\psi_{i,k}(r) = \sum_G C_{i,G}\, e^{i(k+G)\cdot r}$$
• Divide the G-space into columns and distribute them to the $N_p$ processors
[Figure: G-space columns and the real-space grid, connected by FFT]
Figures from http://hpcrd.lbl.gov/~linwang/PEtot/PEtot_parallel.html
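One way to picture the column distribution: the planewaves inside the cutoff sphere form columns along z of unequal length, and the columns are dealt out so that each processor holds a similar number of G-vectors. The sketch below is an assumption-laden illustration, not the scheme of any particular code: it uses an integer grid with an integer cutoff radius `g_max`, and a greedy "longest column to the least-loaded processor" rule chosen here for simplicity.

```python
import heapq
import math


def gspace_columns(g_max):
    """Map each (gx, gy) column to its number of G-vectors with |G| <= g_max.

    G-vectors live on an integer grid in this toy model.
    """
    cols = {}
    for gx in range(-g_max, g_max + 1):
        for gy in range(-g_max, g_max + 1):
            rem = g_max * g_max - gx * gx - gy * gy
            if rem < 0:
                continue  # column lies entirely outside the cutoff sphere
            gz_max = math.isqrt(rem)
            cols[(gx, gy)] = 2 * gz_max + 1  # gz runs from -gz_max to gz_max
    return cols


def distribute_columns(cols, n_p):
    """Greedy balancing: give the longest remaining column to the least-loaded processor."""
    heap = [(0, p) for p in range(n_p)]  # (current load, processor id)
    heapq.heapify(heap)
    owner = {}
    for col, length in sorted(cols.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        owner[col] = p
        heapq.heappush(heap, (load + length, p))
    loads = [0] * n_p
    for col, p in owner.items():
        loads[p] += cols[col]
    return owner, loads
```

Keeping whole columns on one processor means the one-dimensional FFTs along z need no communication; only the transposition to the other two directions does.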

VASP
• A planewave pseudopotential code
– A commercial code from Univ. of Vienna
• Libraries used
– BLAS, FFT
• Parallel implementations
– Over planewave basis set and bands
– >1 proc/atom scale
– Flops 20-50% of peak (in real calculations)
• VASP use at NERSC
– Used by 83 projects, 200 active users
http://cmp.univie.ac.at/vasp

VASP: Performance vs threads
Test case A154: 154 atoms; 998 electrons; Zn48O48C22S2H34; 80x70x140 real-space grids; 160x140x280 FFT grids; 4 kpoints
[Figure: Time (s) vs number of threads/MPI tasks (1/144, 2/72, 3/48, 6/24, 12/12, 24/6). Series: A154_Davidson, A154_RMM_DIIS, Flat MPI_RMM_DIIS on unpacked nodes, Flat MPI_Davidson on unpacked nodes]
• When the number of threads increases, there is little or no performance gain; the code runs slower.
• However, at threads=3, VASP runs 20-25% faster than the flat MPI on unpacked nodes.
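The threads/MPI-task pairs in this kind of scan come from holding the core count fixed and trading MPI tasks for threads. The sketch below enumerates such layouts and builds an illustrative launch line. Several things here are assumptions, not statements from the slides: the 144-core budget used in the example, 24 cores per node, the executable name `./vasp`, and the helper names `hybrid_layouts` and `aprun_command`. The `aprun` options `-n` (total tasks), `-N` (tasks per node), and `-d` (cores reserved per task) are used in their usual Cray ALPS meaning.

```python
def hybrid_layouts(total_cores, thread_counts):
    """Return (threads, mpi_tasks) pairs that use exactly total_cores cores."""
    layouts = []
    for threads in thread_counts:
        if total_cores % threads != 0:
            continue  # this thread count cannot tile the core budget evenly
        layouts.append((threads, total_cores // threads))
    return layouts


def aprun_command(threads, tasks, cores_per_node=24, exe="./vasp"):
    """Build an illustrative launch line for one layout (hypothetical executable)."""
    tasks_per_node = cores_per_node // threads
    return (
        f"OMP_NUM_THREADS={threads} "
        f"aprun -n {tasks} -N {tasks_per_node} -d {threads} {exe}"
    )
```

For instance, `hybrid_layouts(144, [1, 2, 3, 6, 12, 24])` yields the pairs from 1 thread with 144 tasks down to 24 threads with 6 tasks, and `aprun_command(3, 48)` produces a line that places 8 tasks of 3 threads on each 24-core node.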

VASP: Memory usage vs threads
Test case A154
[Figure: Memory per core (GB) vs number of threads/MPI tasks (1/144, 2/72, 3/48, 6/24, 12/12, 24/6). Series: A154_RMM_DIIS, A154_Davidson]
• Memory usage is reduced when the number of threads increases
• At threads=3, the memory usage is reduced by 10% compared to that of threads=2

VASP: VASP runs slower when the number of threads increases
Test case A660: 660 atoms; 2220 electrons; C200H230N70Na20O120P20; 240x240x486 real-space grids; 480x380x972 FFT grids; 1 kpoint (Gamma point); Gamma-kpoint-only VASP
[Figure: Time (s) vs number of threads/MPI tasks (1/768, 2/384, 3/256, 6/128, 12/64, 24/32). Series: A660_Davidson, A660_RMM_DIIS]
• Threaded VASP at best (threads=2) is slightly slower (~12%) than the flat MPI

VASP: Memory usage vs threads
Test case A660
[Figure: Memory per core (GB) vs number of threads/MPI tasks (1/768, 2/384, 3/256, 6/128, 12/64, 24/32). Series: A660_Davidson, A660_RMM_DIIS]
Comparing the memory usage for threads=2 and the flat MPI:
• For RMM-DIIS: there is a slight memory saving
• For Davidson: no memory saving at threads=2; memory use is slightly higher (<3%)

Quantum Espresso
• A planewave pseudopotential code
– Open-source software from the DEMOCRITOS National Simulation Center and SISSA, in collaboration with many other institutes
• Libraries used
– BLAS, FFT
• Parallel implementations
– Over k-points, planewave basis and bands
– >1 proc/atom scale
• QE use at NERSC
– Used by 21 projects
http://www.quantum-espresso.org

QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI
Test case GRIR686: 686 atoms; 5174 electrons; C200Ir486; 180x180x216 FFT grids; 2 kpoints
[Figure: Time (s) vs number of threads/MPI tasks (1/1440, 2/720, 3/480, 6/240, 12/120, 24/60). Series: GRIR686, Flat MPI on half-packed nodes]
• At threads=2, QE runs 38% faster than the flat MPI on half-packed nodes

QE: The OpenMP+MPI code uses less memory than the flat MPI
Test case GRIR686: 686 atoms; 5174 electrons; C200Ir486; 180x180x216 FFT grids; 2 kpoints
[Figure: Memory per core (GB) vs number of threads/MPI tasks (1/1440, 2/720, 3/480, 6/240, 12/120, 24/60). Series: GRIR686, Flat MPI on half-packed nodes]
• At threads=2, the memory usage is reduced by 64% compared to the flat MPI

QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI
Test case CNT10POR8: 1532 atoms; 5232 electrons; C200Ir486; 540x540x540 FFT grids; 1 kpoint (Gamma point)
[Figure: Time (s) vs number of threads/MPI tasks (1/1632, 2/816, 3/544, 6/272, 12/136, 24/68). Series: CNT10POR8, Flat MPI on half-packed nodes]
• At threads=2, QE runs 28% faster than the flat MPI on half-packed nodes

QE: The OpenMP+MPI code uses less memory than the flat MPI
Test case CNT10POR8
[Figure: Memory per core (GB) vs number of threads/MPI tasks (1/1632, 2/816, 3/544, 6/272, 12/136, 24/68). Series: CNT10POR8]
• At threads=2, the memory usage is reduced by 30%

QE: The Hybrid OpenMP+MPI code runs faster than the flat MPI
Test case AUSURF112: 112 atoms; 5232 electrons; C200Ir486; 125x64x200 FFT grids; 80x90x288 smooth grids; 2 k-points
[Figure: Time (s) vs number of threads/MPI tasks (1/288, 2/144, 3/96, 6/48). Series: AUSURF112, Flat MPI on half-packed nodes]
• At threads=2, QE runs 22% faster than the flat MPI on half-packed nodes
