HPC Node Performance and Power Simulation with the Sniper Multi-Core Simulator
Trevor E. Carlson, Wim Heirman, Lieven Eeckhout
http://www.snipersim.org
Saturday, February 1st, 2014
FOSDEM 2014 – HPC Devroom – Brussels, Belgium

Major Goals of Sniper
• What will node performance look like for next-generation systems?
  – Intel Xeon, Xeon Phi, etc.
• What optimizations can we make for these systems?
  – Software optimizations
  – Hardware/software co-design
• How is my application performing?
  – Detailed insight into application performance on today's systems

Optimizing Tomorrow's Software
• Design tomorrow's processor using today's hardware
• Optimize tomorrow's software for tomorrow's processors
• Simulation is one promising solution
  – Obtain performance characteristics for new architectures
  – Architectural exploration
  – Early software optimization

Why Can't I Just …
• use performance counters?
  – perf stat, perf record
• use Cachegrind?
It can be difficult to see exactly where the problems are
  – Not all cache misses are alike – latency matters
  – Modern out-of-order processors can overlap misses
  – Both core and cache performance matter
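To make the point about overlapping misses concrete, here is a minimal sketch of why a raw miss count can mislead. It is a toy model, not how Sniper or any profiler computes stalls. The 200-cycle latency, the function name `stall_cycles`, and the assumption that independent misses overlap completely are all illustrative.

```python
MISS_LATENCY = 200  # cycles; illustrative value, not a measured number


def stall_cycles(depends_on_previous, latency=MISS_LATENCY):
    """Estimate stall cycles for a sequence of long-latency misses.

    depends_on_previous[i] is True when miss i needs the data returned by
    miss i-1 (for example pointer chasing), so it cannot be issued until
    that miss has completed. Independent misses are assumed to overlap
    fully with the miss that is already outstanding.
    """
    cycles = 0
    for i, dependent in enumerate(depends_on_previous):
        if i == 0 or dependent:
            cycles += latency  # a new, serialized memory round trip
    return cycles


# Two access patterns with the same miss count (8) but different behaviour.
streaming = [False] * 8               # independent misses, can overlap
pointer_chase = [False] + [True] * 7  # each miss waits for the previous one

print(stall_cycles(streaming))      # 200
print(stall_cycles(pointer_chase))  # 1600
```

Both patterns would report eight misses to a counter or to Cachegrind, yet under these assumptions the second one stalls eight times longer.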

Node Complexity Is Increasing
• Significant HPC node architecture changes
  – Increases in core counts
    • More, lower-power cores (for energy efficiency)
  – Increases in thread (SMT) counts
  – Cache-coherent NUMA
• Optimizing for efficiency
  – How do we analyze our current software?
  – How do we design our next-generation software?
[Image source: Wikimedia Commons]

Trends in Processor Design: Cores
Number of cores per node is increasing
  – 2001: Dual-core POWER4
  – 2005: Dual-core AMD Opteron
  – 2011: 10-core Intel Xeon Westmere-EX
  – 2012: Intel MIC Knights Corner (60+ cores)
  – 2013: Intel MIC Knights Landing announced [1]
[Images: Xeon Phi and Westmere-EX, source: Intel]
[1] http://newsroom.intel.com/community/intel_newsroom/blog/2013/06/17/intel-powers-the-worlds-fastest-supercomputer-reveals-new-and-future-high-performance-computing-technologies

Many Architecture Options
[Figure: several multi-core configurations built from L1I/L1D, L2, and L3 caches, connected through a NoC to DRAM]

Upcoming Challenges
• Future systems will be diverse
  – Varying processor speeds
  – Varying failure rates for different components
  – Homogeneous applications show heterogeneous performance
• Software and hardware solutions are needed to solve these challenges
  – Handle heterogeneity (reactive load balancing)
  – Handle fault tolerance
  – Improve power efficiency at the algorithmic level (extreme data locality)
• Hard to model accurately with analytical models

Fast and Accurate Simulation Is Needed
• Evaluating current software on current hardware is difficult
  – Performance counters do not provide enough insight
• Simulation use cases
  – Pre-silicon software optimization
  – Architecture exploration
• Cycle-accurate simulation is too slow for exploring multi/many-core design space and software
• Key questions
  – Can we raise the level of abstraction?
  – What is the right level of abstraction?
  – When to use these abstraction models?

Sniper: A Fast and Accurate Simulator
• Hybrid simulation approach
  – Analytical interval core model
  – Micro-architecture structure simulation
    • branch predictors, caches (incl. coherency), NoC, etc.
• Hardware-validated, Pin-based
• Models multi/many-cores running multi-threaded and multi-program workloads
• Parallel simulator scales with the number of simulated cores
• Available at http://snipersim.org
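The idea behind an analytical interval core model can be sketched in a few lines. This is a simplified first-order form of interval analysis, not Sniper's implementation: the core is assumed to dispatch at its full width between miss events, and each miss event adds a penalty. The names `MissEvent` and `interval_cycles`, and the penalty values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MissEvent:
    kind: str       # e.g. "branch", "icache", "l2" (labels are illustrative)
    penalty: float  # cycles lost to this event (assumed known)


def interval_cycles(num_instructions, dispatch_width, miss_events):
    """First-order interval estimate of execution time in cycles.

    Base term: instructions flow through the core at the dispatch width.
    Penalty term: every miss event interrupts that flow for `penalty` cycles.
    """
    base = num_instructions / dispatch_width
    return base + sum(event.penalty for event in miss_events)


events = [MissEvent("branch", 15), MissEvent("l2", 200)]
print(interval_cycles(4000, 4, events))  # 1215.0
```

In a hybrid approach such as the one the slide describes, the penalties would come from simulated structures (branch predictor, caches, NoC) rather than from constants as in this sketch.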

Top Sniper Features
• Interval Model
• Multi-threaded Application Sampling
• CPI Stacks and Interactive Visualization
• Parallel Multithreaded Simulator
• x86-64 and SSE2 support
• Validated against Core2, Nehalem
• Thread scheduling and migration
• Full DVFS support
• Shared and private caches
• Modern branch predictor
• Supports pthreads and OpenMP, TBB, OpenCL, MPI, …
• SimAPI and Python interfaces to the simulator
• Many flavors of Linux supported (Redhat, Ubuntu, etc.)

Sniper Limitations
• User-level
  – Not the best match for workloads with significant OS involvement
• Functional-directed
  – No simulation / cache accesses along false paths
• High-abstraction core model
  – Not suited to model all effects of core-level changes
  – Perfect for memory subsystem or NoC work
• x86 only
• But … is a perfect match for HPC evaluation

Sniper History
• November 2011: SC'11 paper, first public release
• March 2012, version 2.0: Multi-program workloads
• May 2012, version 3.0: Heterogeneous architectures
• November 2012, version 4.0: Thread scheduling and migration
• April 2013, version 5.0: Multi-threaded application sampling
• June 2013, version 5.1: Suggestions for optimization visualization
• September 2013, version 5.2: MESI/F, 2-level TLBs, Python scheduling
• Today: 700+ downloads from 60 countries

The Sniper Multi-Core Simulator: Visualization
http://www.snipersim.org
Saturday, February 1st, 2014
FOSDEM 2014 – HPC Devroom – Brussels, Belgium

Visualization
Sniper generates quite a few statistics, but from text output alone it is difficult to understand performance details
[Figure: text output from Sniper (sim.stats)]

Cycle Stacks
• Where did my cycles go?
• CPI stack
  – Cycles per instruction
  – Broken up in components
• Normalize by either
  – Number of instructions (CPI stack)
  – Execution time (time stack)
• Different from miss rates: cycle stacks directly quantify the effect on performance
[Figure: CPI stack with components Base, Branch, I-cache, L2 cache]
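The two normalizations can be shown with a small worked example. This is a sketch with made-up cycle counts, not output from Sniper. The component names follow the figure legend, while the helper names `cpi_stack` and `time_stack` are assumptions introduced here.

```python
def cpi_stack(cycles_by_component, num_instructions):
    """Normalize per-component cycles by the number of instructions."""
    return {name: cycles / num_instructions
            for name, cycles in cycles_by_component.items()}


def time_stack(cycles_by_component):
    """Normalize per-component cycles by total execution time (in cycles)."""
    total = sum(cycles_by_component.values())
    return {name: cycles / total
            for name, cycles in cycles_by_component.items()}


# Illustrative numbers only: cycles attributed to each component.
cycles = {"base": 500, "branch": 100, "icache": 50, "l2": 350}
instructions = 2000

cpi = cpi_stack(cycles, instructions)
fractions = time_stack(cycles)

for name in cycles:
    print(f"{name:8s} CPI {cpi[name]:.3f}   time share {fractions[name]:.0%}")
print(f"total CPI {sum(cpi.values()):.3f}")
```

With these numbers the L2 component contributes 0.175 of the total CPI of 0.5, or 35% of execution time, which is the kind of direct performance attribution that a miss rate alone does not give.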
