Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development

Official Use Only
Designing the Future: How Successful Codesign Helps Shape Hardware and Software Development
Christian Trott
SAND2014-19833 C
Unclassified, Unlimited release
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
11/18/14

CoDesign at Sandia
- Mantevo: MiniApps
- Post CMOS: New technological base
- SST: Architecture Simulation
- Testbeds: Early Access Hardware
- Runtimes: Portals, QThreads
- Kokkos: Programming Model

CoDesign at Sandia
Testbeds: Early Access Hardware
Provide comprehensive platform coverage
- test codes and algorithms on all platforms
- helps develop portable code
- typically 16-64 nodes
Access to pre-production hardware and software
- investigate potential issues with new products
- early feedback for vendors
- find issues in software before release

CoDesign at Sandia
SST: Architecture Simulation
Complex parallel hardware simulator
- used by many organisations
- can run on clusters
Capabilities for a wide range of fidelity
- cores at instruction level
- memory subsystem
- full system network
Modular design
- add new capabilities

CoDesign at Sandia
Mantevo: MiniApps
Provide small, representative codes
- few or no dependencies
- can be used with simulators
Allow rapid modifications
- test new programming models
- test new algorithms

CoDesign at Sandia
Kokkos: Programming Model
Programming model for hardware abstraction
- Memory abstraction: spaces, access traits, layouts
- Execution abstraction: spaces, policies
Design influenced by information about future architectures
- interaction with all vendors allows for future-safe, generally applicable abstractions
- concepts in place to handle platforms in 2020
Influence hardware design for better programmability
- what concepts work well for app developers
- which capabilities are missing in architectures
Influencing C++ standard to adopt successful concepts
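To make the "layouts" item of the memory abstraction concrete, here is a minimal sketch in plain C++. It is not the Kokkos API: `RowMajor`, `ColMajor`, `Array2D`, and `fill` are hypothetical names chosen for illustration. The point it tries to show is that user code indexes with `a(i, j)` while the mapping to memory is a compile-time parameter that can be swapped per architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout tags (illustrative names, not the Kokkos API).
struct RowMajor {  // last index is contiguous in memory
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t /*rows*/, std::size_t cols) {
    return i * cols + j;
  }
};

struct ColMajor {  // first index is contiguous in memory
  static std::size_t index(std::size_t i, std::size_t j,
                           std::size_t rows, std::size_t /*cols*/) {
    return i + j * rows;
  }
};

// Minimal 2D array whose memory layout is a compile-time parameter.
template <class Layout>
class Array2D {
 public:
  Array2D(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

  // Element access; the layout decides where (i, j) lives in storage.
  double& operator()(std::size_t i, std::size_t j) {
    assert(i < rows_ && j < cols_);
    return data_[Layout::index(i, j, rows_, cols_)];
  }

  // Position of element (i, j) in the underlying storage.
  std::size_t offset(std::size_t i, std::size_t j) const {
    return Layout::index(i, j, rows_, cols_);
  }

  std::size_t rows() const { return rows_; }
  std::size_t cols() const { return cols_; }

 private:
  std::size_t rows_, cols_;
  std::vector<double> data_;
};

// Layout-agnostic user code: sets a(i, j) = 10 * i + j.
template <class Layout>
void fill(Array2D<Layout>& a) {
  for (std::size_t i = 0; i < a.rows(); ++i)
    for (std::size_t j = 0; j < a.cols(); ++j)
      a(i, j) = 10.0 * i + j;
}
```

Calling `fill` on an `Array2D<RowMajor>` and on an `Array2D<ColMajor>` produces the same logical contents, but `offset(0, 1)` differs between the two, which is the property a portable programming model can exploit to match each architecture's preferred access pattern.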

Testbeds: Shannon
Primary GPU Testbed
• 32 Dual Sandy-Bridge nodes
• QDR Infiniband
• 128 GB RAM: experiment with RAMDisk
• November 2012: 64 K20x
• November 2013: K40s
• November 2014: 8 nodes with 2xK80s
K80 properties:
• mostly two K40s on a single board
• increased register count 2x
• increased L1/shared memory 2x
• power limit 150W per GPU
[Figure: runtime of MiniFE, Lennard Jones, and SNA Potential on K40 vs. K80]

A closer look at NVIDIA's K80
Power consumption:
• on previous GPUs most applications pull significantly less than TDP
• use that knowledge to design dual GPU with no performance penalty
• K40 TDP of 230W, K80 TDP of 150W (single GPU)
[Figure: power consumption and frequency of K40 vs. K80 for miniFE, Lennard Jones, and SNA Potential]

IBM Power 8 & NVIDIA K20x
Hardware:
• 8 nodes of dual socket Power 8
• 2x K20 per node
Cluster is running:
• CUDA 5.5 + GCC toolchain works
• a lot of other software expected on HPC platforms is in early stages
  -> e.g. no CUDA-aware MPI
• getting CUDA applications to run is relatively painless
• performance as expected (i.e. the same as on x86-based systems with K20x)
  -> this is for apps running exclusively on GPUs
Goal:
• shake out problems with software stack now
  -> ready for Power-based system with NVLink in 2016

OpenACC and C++
C++ situation 2013:
• no support for class member access
• not able to call class member functions inside kernels
• replace all members with temporaries / explicit inlining
• can’t copy up class instances

    class SomeClass {
      int a;
      int *array;
      int n;

      void compute() {
        // copy members into locals before entering the kernel
        const int n_tmp = n;
        const int a_tmp = a;
        int* const array_tmp = array;
        #pragma acc parallel loop pcopy(array_tmp[0:n_tmp])
        for(int i = 0; i < n_tmp; i++) {
          array_tmp[i] = a_tmp + i;
        }
      }
    };

OpenACC and C++
C++ situation 2013 (continued):
Temporaries are needed in the code above since the “this” pointer is not valid in the kernel.

OpenACC and C++
C++ situation now:
• worked with PGI to address issues
• possibility to “attach” arrays to classes
• class member access and inline functions work
• nested classes still problematic
• looking at C++11 now

    class SomeClass {
      int a;
      int *array;
      int n;

      void compute() {
        #pragma acc parallel loop pcopy(array[0:n])
        for(int i = 0; i < n; i++) {
          array[i] = a + i;
        }
      }
    };
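The two versions of `compute()` from the 2013 and current slides can be placed side by side in one class to check that they produce the same result. The following is a sketch only: the constructor, destructor, `get()`, and the method names `compute_with_temporaries` and `compute_direct` are illustrative additions that do not appear on the slides. A compiler without OpenACC support ignores the pragmas and runs the loops serially, so the comparison can also be tried on a host-only toolchain.

```cpp
#include <cassert>

class SomeClass {
 public:
  // Illustrative constructor: owns a zero-initialized array of n_in ints.
  SomeClass(int a_in, int n_in) : a(a_in), array(new int[n_in]()), n(n_in) {}
  ~SomeClass() { delete[] array; }
  SomeClass(const SomeClass&) = delete;
  SomeClass& operator=(const SomeClass&) = delete;

  // 2013 style: members are copied into locals because the "this"
  // pointer was not valid inside the kernel.
  void compute_with_temporaries() {
    const int n_tmp = n;
    const int a_tmp = a;
    int* const array_tmp = array;
    #pragma acc parallel loop pcopy(array_tmp[0:n_tmp])
    for (int i = 0; i < n_tmp; i++) {
      array_tmp[i] = a_tmp + i;
    }
  }

  // Current style: the loop reads and writes class members directly.
  void compute_direct() {
    #pragma acc parallel loop pcopy(array[0:n])
    for (int i = 0; i < n; i++) {
      array[i] = a + i;
    }
  }

  // Illustrative accessor used to inspect the result on the host.
  int get(int i) const {
    assert(i >= 0 && i < n);
    return array[i];
  }

 private:
  int a;
  int* array;
  int n;
};
```

Constructing two instances with the same `a` and `n`, calling one method on each, and comparing `get(i)` for every index should give identical values (`a + i`), which is the behavior both slide snippets describe.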

CUDA and C++11
Experimental, undocumented support in CUDA 6.5
• lambdas inside of kernels
• auto, decltype
• variadic templates
• other miscellaneous features
Official support in CUDA 7.0
Enables simpler code, faster porting
• particular benefits for heavily templated codes
• deducing types automatically simplifies user interface
• lambda support enables more abstractions
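The C++11 features listed on this slide can be illustrated with a small host-only sketch. It is written in standard C++ rather than CUDA, so nothing here runs on a GPU: `for_each_index` is a hypothetical serial stand-in for a kernel launch, and `sum_all` and `scale` are likewise illustrative names. The intent is only to show how lambdas, `auto`, `decltype`, and variadic templates shorten heavily templated code.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Serial stand-in for a kernel launch: calls body(i) for i in [0, n).
// Hypothetical helper; a device version would run body inside a kernel.
template <class Body>
void for_each_index(std::size_t n, Body body) {
  for (std::size_t i = 0; i < n; ++i) body(i);
}

// Variadic template, base case: a single argument sums to itself.
template <class T>
T sum_all(T x) {
  return x;
}

// Variadic template, recursive case: the result type is deduced
// from all argument types instead of being spelled out by the user.
template <class T, class... Rest>
typename std::common_type<T, Rest...>::type sum_all(T first, Rest... rest) {
  return first + sum_all(rest...);
}

// Scales every element of v by factor. The element type is deduced
// with decltype, the pointer type with auto, and the loop body is a lambda.
template <class Vec, class Scalar>
void scale(Vec& v, Scalar factor) {
  using value_type = typename std::decay<decltype(v[0])>::type;
  auto* data = v.data();
  assert(data != nullptr || v.size() == 0);
  for_each_index(v.size(), [=](std::size_t i) {
    data[i] = static_cast<value_type>(data[i] * factor);
  });
}
```

With this sketch, `scale(v, 2)` on a `std::vector<double>` doubles each element without the caller naming any type, and `sum_all(1, 2.5)` returns a `double` because the common type of the arguments is deduced.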

Kokkos: hierarchical parallelism

    parallel_for(TeamVectorPolicy<16>(n_bins,8), Functor());

    struct Functor {
      KOKKOS_INLINE_FUNCTION
      void operator() (TeamMember t) {
        …
        // thread level: one item of bin k per thread
        parallel_for(TeamRange(t,n_items_k), [&] (int i) {
          auto item_i = load_item(bin_k,i);
          double sum_i;
          // vector level: reduce over the items of bin l into sum_i
          parallel_for(VectorRange(t,n_items_l), [&] (int j, double& sum) {
            sum += Calculation(item_i,load_item(bin_l,j));
          },sum_i);
          // executed once per thread, not once per vector lane
          VectorSingle([&] () {
            accumulate(item_i,sum_i);
          });
        });
      }
    };
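The three nesting levels in the slide code can be read as an ordinary serial loop nest. The sketch below is plain C++ and uses no Kokkos calls. It rests on stated assumptions: `calculation` is a hypothetical pair interaction (a product) standing in for `Calculation()`, the partner bin `bin_l` is taken to be the same bin as `bin_k` because the slide elides how it is chosen, and `interact` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair interaction; stands in for Calculation() on the slide.
inline double calculation(double a, double b) { return a * b; }

// Serial loop nest with the same three levels as the slide code:
//   level 1 (team):   one bin per team
//   level 2 (thread): one item of the bin per thread
//   level 3 (vector): reduction over the partner items
// Assumption: the partner bin is the bin itself.
std::vector<std::vector<double>> interact(
    const std::vector<std::vector<double>>& bins) {
  std::vector<std::vector<double>> result(bins.size());
  for (std::size_t k = 0; k < bins.size(); ++k) {      // team level
    const std::vector<double>& bin_k = bins[k];
    result[k].assign(bin_k.size(), 0.0);
    assert(result[k].size() == bin_k.size());
    for (std::size_t i = 0; i < bin_k.size(); ++i) {   // thread level
      const double item_i = bin_k[i];
      double sum_i = 0.0;
      for (std::size_t j = 0; j < bin_k.size(); ++j) { // vector level
        sum_i += calculation(item_i, bin_k[j]);
      }
      // Stored once per item, mirroring the VectorSingle block.
      result[k][i] = sum_i;
    }
  }
  return result;
}
```

In the hierarchical version each loop level is mapped to a different hardware level (teams, threads within a team, vector lanes), and the innermost accumulation becomes a reduction, which is why the slide passes `sum_i` to the vector-level loop instead of updating it directly.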
