Computing in Space PRACE Keynote Oskar Mencer, April 2014 - PowerPoint PPT Presentation

Computing in Space: PRACE Keynote, Oskar Mencer, April 2014

Thinking Fast and Slow
Daniel Kahneman, Nobel Prize in Economics, 2002
17 × 24 = ?
Kahneman splits thinking into:
• System 1: fast, hard to control ... 400
• System 2: slow, easier to control ... 408

Assembly-line computing in action
[Diagram labels: System 2 (x86 cores, low-latency memory system); System 1 (flexible memory plus logic, high-throughput memory); optimal encoding]
Minimize data movement

Temporal Computing (1D)
• A program is a sequence of instructions
• Performance is dominated by:
– Memory latency
– ALU availability
[Diagram: CPU and memory. Timeline of three instructions, each passing in sequence through Get Inst., Read data, COMP and Write Result; only the COMP step is actual computation time]

Spatial Computing (2D)
Synchronous data movement
[Diagram: data in, a grid of ALUs with control and a buffer, data out. Timeline: Read data [1..N], Computation, Write results [1..N]]
Throughput dominated
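
The difference between the temporal and spatial timelines can be illustrated with a toy cost model. This is only a sketch: the per-step cycle counts and the pipeline depth are illustrative assumptions, not figures from the talk.

```python
def temporal_cycles(n_items, get_inst=1, read=1, compute=1, write=1):
    """Sequential (1D) model: every item pays fetch, read, compute and write in turn."""
    return n_items * (get_inst + read + compute + write)


def spatial_cycles(n_items, pipeline_depth=4):
    """Pipelined (2D) model: once the pipeline is full, one result leaves per tick."""
    if n_items == 0:
        return 0
    return pipeline_depth + n_items - 1


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, temporal_cycles(n), spatial_cycles(n))
```

Under these assumptions the sequential cost grows as four cycles per item, while the pipelined cost approaches one tick per item as N grows, which is what "throughput dominated" refers to.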

OpenSPL in Practice
New CME Electronic Trading Gateway will be going live in March 2014!
Webinar page: http://www.cmegroup.com/education/new-ilink-architecture-webinar.html
CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. ... [from Wikipedia]

Maxeler Seismic Imaging Platform
• Maxeler provides hardware plus application software for seismic modeling
• MaxSkins allow access to Ultrafast Modelling and RTM for research and development of RTM and Full Waveform Inversion (FWI) from MATLAB, Python, R, C/C++ and Fortran.
• Bonus: MaxGenFD is a MaxCompiler plugin that allows the user to specify any 3D Finite Difference problem, including the PDE, coefficients, boundary conditions, etc., and automatically generate a fully parallelized implementation for a whole rack of Maxeler MPC nodes.
Application areas:
• O&G
• Weather
• 3D PDE Solvers
• High Energy Physics
• Medical Imaging

Example: data flow graph generated by MaxCompiler
4866 static dataflow cores in 1 chip

Star versus Cube Stencil
When computing in space, computing more can be faster than computing less (if it reduces the amount of data that needs to be moved around)!
• Star stencil: 19 MADDs, local buffer = 6 slices
• Cube stencil: 27 MADDs, local buffer = 3 slices
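
The MADD counts can be reproduced by enumerating stencil offsets. The sketch below assumes a radius-3 star and a 3×3×3 cube; the slide does not state the radii, and these were chosen only because they give 19 and 27 points. The slide's convention for counting buffer slices is also not stated, so `z_reach` is shown purely to illustrate that the cube spans fewer slices along the slowest axis.

```python
from itertools import product


def star_offsets(radius):
    """Points on the three axes within `radius` of the centre (centre included)."""
    offsets = {(0, 0, 0)}
    for r in range(1, radius + 1):
        for sign in (-1, 1):
            offsets.add((sign * r, 0, 0))
            offsets.add((0, sign * r, 0))
            offsets.add((0, 0, sign * r))
    return offsets


def cube_offsets(radius):
    """All points in the (2*radius + 1)^3 cube around the centre."""
    span = range(-radius, radius + 1)
    return set(product(span, span, span))


def z_reach(offsets):
    """Distance between the furthest slices touched along the slowest (z) axis."""
    zs = [dz for (_, _, dz) in offsets]
    return max(zs) - min(zs)


if __name__ == "__main__":
    star, cube = star_offsets(3), cube_offsets(1)
    print(len(star), z_reach(star))
    print(len(cube), z_reach(cube))
```

With these assumed shapes the cube performs more multiply-adds per output point but touches a narrower band of slices, so less data has to be held and moved.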

Computing in Space - Why Now?
• Semiconductor technology is ready
– Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5Bln (Xeon Phi)
• Memory performance isn't keeping up
– Memory density has followed the trend set by Moore's law
– But memory latency has increased from 10s to 100s of CPU clock cycles
– As a result, on-die cache % of die area increased from 15% (1um) to 40% (32nm)
– Memory latency gap could eliminate most of the benefits of CPU improvements
• Petascale challenges (10^15 FLOPS)
– Clock frequencies stagnated in the few GHz range
– Energy usage and power wastage of modern HPC systems are becoming a huge economic burden that cannot be ignored any longer
– Requirements for annual performance improvements grow steadily
– Programmers continue to rely on sequential execution (1D approach)
• For affordable petascale systems → novel approach is needed

OpenSPL Models
• Memory:
– Fast Memory (FMEM): many, small in size, low latency
– Large Memory (LMEM): few, large in size, high latency
– Scalars: many, tiny, lowest latency, fixed during exec.
• Execution:
– datasets + scalar settings sent as atomic "actions"
– all data flows through the system synchronously in "ticks"
• Programming:
– API allows construction of a graph computation
– meta-programming allows complex construction

OpenSPL Machine
• A spatial computing machine system consists of:
– appropriate hardware technology, i.e. the Spatial Computing Substrate (SCS) and flexible arithmetic/computation units and interconnect
– an SCS-specific compilation tool-chain
– CPU-based runtime for control of SCS
• Computation divided into discrete kernels interconnected by data flow streams to form bigger entities
• In a spatial system one or more SCS engines exist, each executing a single action at any moment in time

OpenSPL Example: x^2 + 30
[Dataflow graph: input x feeds both inputs of a multiplier; the product is added to the constant 30 to give output y]
    SCSVar x = io.input("x", scsInt(32));
    SCSVar result = x * x + 30;
    io.output("y", result, scsInt(32));

OpenSPL Example: Moving Average
Y_n = (X_{n-1} + X_n + X_{n+1}) / 3
    SCSVar x = io.input("x", scsFloat(7,17));
    SCSVar prev = stream.offset(x, -1);
    SCSVar next = stream.offset(x, 1);
    SCSVar sum = prev + x + next;
    SCSVar result = sum / 3;
    io.output("y", result, scsFloat(7,17));
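
What the stream offsets compute can be checked with a plain software model. The following Python sketch is not OpenSPL and makes two assumptions: reads outside the stream return 0.0 (the slide does not say how boundaries are handled), and ordinary double precision stands in for scsFloat(7,17). The names `stream_offset` and `moving_average` are hypothetical helpers.

```python
def stream_offset(stream, offset, fill=0.0):
    """Element n of the result is stream[n + offset]; out-of-range reads give `fill` (assumed)."""
    n = len(stream)
    return [stream[i + offset] if 0 <= i + offset < n else fill for i in range(n)]


def moving_average(xs):
    """Three-point average built from the previous, current and next stream elements."""
    prev = stream_offset(xs, -1)
    nxt = stream_offset(xs, 1)
    return [(p + x + q) / 3 for p, x, q in zip(prev, xs, nxt)]


if __name__ == "__main__":
    print(moving_average([3.0, 6.0, 9.0, 12.0]))
```

Each output element depends only on a fixed window around the current stream position, which is why the kernel can produce one result per tick once the window is buffered.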

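OpenSPL Example: Choices
[Dataflow graph: input x feeds an adder (x + 1), a subtractor (x - 1) and a comparator (x > 10); the comparator result selects which of the two values becomes output y]
    SCSVar x = io.input("x", scsUInt(24));
    SCSVar result = (x>10) ? x+1 : x-1;
    io.output("y", result, scsUInt(24));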
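
In a dataflow graph a conditional does not skip work: both branches are computed every tick and a multiplexer picks one. The Python model below sketches that behaviour. Wrapping to 24 bits on overflow or underflow is an assumption made for illustration; the slide does not say how scsUInt(24) handles it. `choice_kernel` is a hypothetical name.

```python
MASK24 = (1 << 24) - 1  # a 24-bit unsigned value lies in 0 .. 2**24 - 1


def choice_kernel(xs):
    """Model of (x > 10) ? x + 1 : x - 1 with both branches evaluated for every element."""
    out = []
    for x in xs:
        plus = (x + 1) & MASK24   # adder output, computed regardless of the condition
        minus = (x - 1) & MASK24  # subtractor output, computed regardless of the condition
        select = x > 10           # comparator output drives the multiplexer
        out.append(plus if select else minus)
    return out


if __name__ == "__main__":
    print(choice_kernel([5, 10, 11, 0]))
```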

Maxeler Dataflow Engine Platforms
• High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• The Dataflow Appliance: Dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
• The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections

Bringing Scalability and Efficiency to the Datacenter

Fixed-point bit-width exploration
[Images: 8-bit fixed-point; 10-bit fixed-point; 'true' image: single-precision floating-point]
[Plot: Difference Indicator Values of the Generated Images (log scale, 1.00E+01 to 1.00E+06) versus SQRT bit-width (4 to 22)]
• Different parts are explored separately, i.e., when we investigate one part, we keep the bit-widths in other parts at a constant high value.
• Similarly, we observe a significant drop of the error when the SQRT bit-width increases from 8 to 10.
• Similar precision thresholds are observed in both synthetic and field results. This behavior enables an automatic tool to determine the minimum precision that still keeps the result good enough.
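
The kind of automatic search described here can be sketched as a sweep over candidate bit-widths. Everything in the block is an assumption made for illustration: the error measure (sum of absolute differences against a double-precision reference), the synthetic input samples, the tolerance, and treating the bit-width as the number of fractional bits. It is not the tool or the difference indicator used in the talk.

```python
import math


def quantize(value, frac_bits):
    """Round `value` to a fixed-point grid with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def sqrt_error(samples, frac_bits):
    """Sum of absolute differences between quantized and reference square roots."""
    return sum(abs(quantize(math.sqrt(s), frac_bits) - math.sqrt(s)) for s in samples)


def minimum_bits(samples, tolerance, candidates=range(4, 24, 2)):
    """Smallest candidate bit-width whose error stays within `tolerance`, or None."""
    for bits in candidates:
        if sqrt_error(samples, bits) <= tolerance:
            return bits
    return None


if __name__ == "__main__":
    samples = [i / 100 for i in range(1, 101)]  # synthetic inputs in (0, 1]
    for bits in range(4, 24, 2):
        print(bits, sqrt_error(samples, bits))
    print("minimum bits for tolerance 1e-3:", minimum_bits(samples, 1e-3))
```

Only one stage (the square root) is swept while everything else stays at full precision, mirroring the slide's approach of exploring each part separately.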
