HECToR, the CoE and Large-Scale Application Performance on CLE
David Tanqueray, Jason Beech-Brandt, Kevin Roy*, Martyn Foster
Cray Centre of Excellence for HECToR
Topics
HECToR
The Centre of Excellence
Activities
CASINO
SBLI
DLPOLY
May 08 Cray Inc. Proprietary Slide 2
Weekly Usage Figures
Figure: percentage of capacity utilization by date, 01/01/2008 to 01/04/2008
Users were running on 4k cores within one hour
Allowed users to do simulations not possible on HPCx
Have enough data from early access time for a journal publication
Post-processing of this early-access data is ongoing
Figure illustrates instantaneous u-velocity contours of flow over a Delery bump
USER / deriv_d1eta_2_
Time%  12.4%
Time  22.654139  39.8
Imb.Time  3.048877
Imb.Time%  12.1%
Calls  2854
PAPI_L1_DCA  910.346M/sec  14907115715 refs
DATA_CACHE_REFILLS:SYSTEM  2.024M/sec  33136218 fills
DATA_CACHE_REFILLS:L2_ALL  39.088M/sec  640067739 fills
REQUESTS_TO_L2:DATA  63.320M/sec  1036880831 req
Cycles  16.375 secs  42575593125 cycles
User time (approx)  16.375 secs  42575593125 cycles
Utilization rate  72.3%
L1 Data cache misses  41.111M/sec  673203957 misses
LD & ST per D1 miss  22.14 refs/miss
D1 cache hit ratio  95.5%  89.8%
LD & ST per D2 miss  449.87 refs/miss
D2 cache hit ratio  96.8%  90.2%
L2 cache hit ratio  95.1%  87.5%
Total cache hit ratio  99.8%
Significantly better cache behaviour, and much less time is being spent doing these derivative calculations
+43%
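The derived cache metrics in the profile above can be recomputed from the raw counter totals. The sketch below is not the profiling tool's own code: the formulas are assumptions, chosen because they reproduce the reported values. It also assumes that the second Time value (39.8) is the figure from before the optimisation and that the "+43%" annotation refers to the resulting reduction in time.

```python
# Raw counter totals copied from the profile of deriv_d1eta_2_.
l1_refs = 14_907_115_715      # PAPI_L1_DCA: L1 data cache accesses
fills_system = 33_136_218     # DATA_CACHE_REFILLS:SYSTEM: refills from memory
fills_l2 = 640_067_739        # DATA_CACHE_REFILLS:L2_ALL: refills from L2
l2_requests = 1_036_880_831   # REQUESTS_TO_L2:DATA

# Assumed relationships (inferred, not taken from tool documentation).
l1_misses = fills_system + fills_l2           # matches the reported 673203957
refs_per_d1_miss = l1_refs / l1_misses        # reported: 22.14 refs/miss
d1_hit = 1 - l1_misses / l1_refs              # reported: 95.5%
refs_per_d2_miss = l1_refs / fills_system     # reported: 449.87 refs/miss
d2_hit = 1 - fills_system / l2_requests       # reported: 96.8%
l2_hit = fills_l2 / l1_misses                 # reported: 95.1%
total_hit = 1 - fills_system / l1_refs        # reported: 99.8%

# Assumption: 39.8 is the routine's time before optimisation.
time_optimised = 22.654139
time_original = 39.8
time_reduction = (time_original - time_optimised) / time_original

print(f"L1 misses            {l1_misses}")
print(f"LD & ST per D1 miss  {refs_per_d1_miss:.2f}")
print(f"D1 hit ratio         {d1_hit:.1%}")
print(f"LD & ST per D2 miss  {refs_per_d2_miss:.2f}")
print(f"D2 hit ratio         {d2_hit:.1%}")
print(f"L2 hit ratio         {l2_hit:.1%}")
print(f"Total hit ratio      {total_hit:.1%}")
print(f"Time reduction       {time_reduction:.0%}")
```

Under these assumptions every derived line of the profile is recovered from four counters, which makes it easy to see that the improvement comes from fewer refills from memory per reference.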
SBLI
Figure: cost, time (s) * cores, against core count from 64 to 8192
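The cost measure used for SBLI, time (s) * cores, stays flat under perfect scaling and rises as parallel efficiency drops. A minimal sketch of that calculation follows. The timings are hypothetical and are not read from the slide, and the helper names `cost` and `efficiency` are illustrative only.

```python
def cost(cores: int, seconds: float) -> float:
    """Cost in core-seconds: wall-clock time multiplied by cores used."""
    return cores * seconds


def efficiency(base: tuple, run: tuple) -> float:
    """Parallel efficiency of `run` relative to `base` (ratio of their costs)."""
    return cost(*base) / cost(*run)


# Hypothetical (cores, seconds) pairs, NOT measurements from the presentation.
timings = [(64, 400.0), (128, 200.0), (256, 110.0), (512, 60.0)]

base = timings[0]
for cores, seconds in timings:
    print(f"{cores:5d} cores  cost {cost(cores, seconds):9.0f} core-s  "
          f"efficiency {efficiency(base, (cores, seconds)):.2f}")
```

Plotting cost rather than raw time makes scaling losses visible directly: any upward slope is extra resource spent for the same result.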
Performance of DLPOLY 3.09+
3.7 million particles
Figure: TFlops against number of processors. Series: optimised code (3.09 + fixes); original (3.07); linear from 32p original.
DLPOLY 3.09+ Speedup vs original 64P time
70000 atoms (water/calcium + protein)
Figure: speedup against number of processors (NP). Series: speedup (original); speedup (new code v1.0).
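Measuring both code versions against the original code's 64-processor time puts them on a single scale. The sketch below shows one way to compute such a speedup. The run times are hypothetical and are not taken from the slide, and the function name `speedup` is illustrative.

```python
def speedup(baseline_seconds: float, seconds: float) -> float:
    """Speedup of a run relative to a fixed baseline run time."""
    return baseline_seconds / seconds


# Hypothetical wall-clock times in seconds, keyed by processor count.
# These are NOT measurements from the presentation.
original = {64: 1000.0, 128: 600.0, 256: 450.0}
new_code = {64: 800.0, 128: 420.0, 256: 230.0}

# Both series share one baseline: the original code on 64 processors.
baseline = original[64]
for procs in sorted(original):
    print(f"{procs:4d}P  original {speedup(baseline, original[procs]):.2f}  "
          f"new {speedup(baseline, new_code[procs]):.2f}")
```

Because the baseline is shared, the new code can show a speedup above 1 already at 64 processors, which separates single-run gains from scaling gains.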
“Of course the downside of all these speedups is that at the end of the project myself and Colin are going to have to analyse about three times more data that we originally planned for! :-)”
David Quigley (HECToR's 3rd largest user)
Aggregate Performance - Large Dataset
Figure: MLSUP/s against cores. Series: Fluid only & vr & iso; Ideal.
Aggregate Performance - Largest Dataset
Figure: MLSUP/s against core count. Series: Dataset#4; Ideal.
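The aggregate performance plots use MLSUP/s, read here as millions of lattice site updates per second. That reading is an assumption, since the slides do not define the unit. Under it, the metric and an "Ideal" line extrapolated linearly from a reference run could be computed as below. All numbers and function names are hypothetical.

```python
def mlsups(lattice_sites: int, timesteps: int, seconds: float) -> float:
    """Millions of lattice site updates per second (assumed definition)."""
    return lattice_sites * timesteps / seconds / 1e6


def ideal_rate(base_cores: int, base_rate: float, cores: int) -> float:
    """Rate expected at `cores` if the base run scaled perfectly linearly."""
    return base_rate * cores / base_cores


# Hypothetical run, NOT from the presentation: a 512^3 lattice advanced
# 100 timesteps in 50 seconds on 1024 cores.
rate = mlsups(512 ** 3, 100, 50.0)
print(f"measured rate:       {rate:.1f} MLSUP/s")
print(f"ideal at 4096 cores: {ideal_rate(1024, rate, 4096):.1f} MLSUP/s")
```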
Initialisation Phase Improvements
Figure: initialisation time in seconds against cores, 64 to 8192. Series: OPT; ORIG.
Six months were spent re-engineering the code specifically for this platform
50% of time is spent on communication
5% of time is spent on communication
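To see what the two communication shares above imply for run time, the sketch below compares them under one strong assumption that the slides do not state: the amount of computation is the same in both cases and only the communication share differs. The function name and the 100-second compute time are hypothetical.

```python
def total_runtime(compute_seconds: float, comm_fraction: float) -> float:
    """Total run time when communication takes `comm_fraction` of the total."""
    if not 0.0 <= comm_fraction < 1.0:
        raise ValueError("comm_fraction must be in [0, 1)")
    return compute_seconds / (1.0 - comm_fraction)


compute = 100.0  # hypothetical seconds of pure computation
t_half = total_runtime(compute, 0.50)   # 50% of time spent on communication
t_small = total_runtime(compute, 0.05)  # 5% of time spent on communication
ratio = t_half / t_small

print(f"run time at 50% communication: {t_half:.1f} s")
print(f"run time at  5% communication: {t_small:.1f} s")
print(f"ratio: {ratio:.2f}x")
```

Under that assumption the difference in communication share alone accounts for slightly less than a factor of two in run time.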
Figure: quantum-mechanical state of helium prepared by a short intense laser pulse (axes: k1 (au), k2).
“I haven't seen anything this nice since the Cray T3D/E.”
Jonathan Parker (Owner of one of HECToR’s highest scaling codes)
CFD Performance
Figure: runtime (mins) on X2, SX8, X1E and XT
Scalability of a Lattice Boltzmann Code
Figure: speedup against cores, 1024 to 8192. Series: Linear; Scalability.
Current Job Mix
Figure: relative share against CPU count, 1 to 8192. Series: # jobs; jobs*cores.
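The job mix is shown two ways: by number of jobs and by jobs*cores, which weights each job by the cores it occupies. A minimal sketch of both relative shares follows. The job counts are hypothetical and are not read from the chart, and the function name `job_mix_shares` is illustrative.

```python
from typing import Dict, Tuple


def job_mix_shares(job_counts: Dict[int, int]) -> Dict[int, Tuple[float, float]]:
    """Map CPU count -> (% of jobs, % of jobs*cores)."""
    total_jobs = sum(job_counts.values())
    total_weighted = sum(cpus * n for cpus, n in job_counts.items())
    return {
        cpus: (100.0 * n / total_jobs, 100.0 * cpus * n / total_weighted)
        for cpus, n in job_counts.items()
    }


# Hypothetical job counts per CPU count, NOT taken from the presentation.
counts = {64: 60, 1024: 30, 4096: 10}

mix = job_mix_shares(counts)
for cpus in sorted(mix):
    by_jobs, by_cores = mix[cpus]
    print(f"{cpus:5d} CPUs  {by_jobs:5.1f}% of jobs  {by_cores:5.1f}% of jobs*cores")
```

In this made-up mix the largest jobs are a small minority by count but dominate once weighted by cores, which is why the two series can look so different.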