

SLIDE 1

From Classical to Runtime Aware Architectures

Prof. Mateo Valero, BSC Director

Barcelona, July 4, 2018

SLIDE 2

Professor Tomas Lang

SLIDE 3

SLIDE 4

Once upon a time …

SLIDE 5

Our Origins…

Timeline 1985-2010:
• Convex C3800
• Connection Machine CM-200: 0.64 Gflop/s
• Parsys Multiprocessor Parsytec CCi-8D: 4.45 Gflop/s
• Research prototypes: Transputer cluster
• SGI Origin 2000: 32 Gflop/s
• Compaq GS-140: 12.5 Gflop/s
• Compaq GS-160: 23.4 Gflop/s
• IBM RS-6000 SP & IBM p630: 192+144 Gflop/s
• BULL NovaScale 5160: 48 Gflop/s
• IBM PP970 / Myrinet (MareNostrum): 42.35 and 94.21 Tflop/s
• SGI Altix 4700: 819.2 Gflops
• SL8500: 6 Petabytes
• Maricel: 14.4 Tflops, 20 KW

SLIDE 6

Barcelona Supercomputing Center
Centro Nacional de Supercomputación

BSC-CNS is a consortium that includes:
• Spanish Government: 60%
• Catalan Government: 30%
• Univ. Politècnica de Catalunya (UPC): 10%

BSC-CNS objectives:
• Supercomputing services to Spanish and EU researchers
• R&D in Computer, Life, Earth and Engineering Sciences
• PhD programme, technology transfer, public engagement

SLIDE 7

Mission of BSC Scientific Departments

• Computer Sciences: to influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency
• Earth Sciences: to develop and implement global and regional state-of-the-art models for short-term air quality forecast and long-term climate applications
• Life Sciences: to understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics)
• CASE: to develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations)

SLIDE 8

MareNostrum4

Total peak performance: 13.7 Pflops
• General Purpose Cluster: 11.15 Pflops (1.07.2017)
• CTE1-P9+Volta: 1.57 Pflops (1.03.2018)
• CTE2-Arm V8: 0.5 Pflops (????)
• CTE3-KNH?: 0.5 Pflops (????)

MareNostrum 1: 2004, 42.3 Tflops; 1st in Europe / 4th in the world; new technologies

MareNostrum 2: 2006, 94.2 Tflops; 1st in Europe / 5th in the world; new technologies

MareNostrum 3: 2012, 1.1 Pflops; 12th in Europe / 36th in the world

MareNostrum 4: 2017, 11.1 Pflops; 2nd in Europe / 13th in the world; new technologies

SLIDE 9

MareNostrum 4

SLIDE 10

From MN3 to MN4

SLIDE 11

BSC & The Global IT Industry 2018

SLIDE 12

Collaborations with Industry

• Research into advanced technologies for the exploration of hydrocarbons, subterranean and subsea reserve modelling, and fluid flows
• Research on wind farm optimization and wind energy production forecasts
• Collaboration agreement for the development of advanced deep learning systems with applications to banking services
• BSC's dust storm forecast system licensed to improve the safety of business flights
• Research on the protein-drug mechanism of action in Nuclear Hormone receptors, and developments on the PELE method to perform protein energy landscape explorations
• Simulation of fluid-structure interaction problems with the multi-physics software Alya

SLIDE 13

Design of Superscalar Processors

Simple interface: a sequential program

[Layer diagram: Applications sit on top of the ILP ISA; programs are "decoupled" from hardware, and applications are decoupled from the software stack]

SLIDE 14

Latency Has Been a Problem from the Beginning... 

  • Feeding the pipeline with the right instructions:
  • Software Trace Cache (ICS'99)
  • Prophet/Critic Hybrid Branch Predictor (ISCA'04)
  • Locality/reuse:
  • Cache Memory with Hybrid Mapping (IASTED'87): victim cache
  • Dual Data Cache (ICS'95)
  • A novel renaming mechanism that boosts software prefetching (ICS'01)
  • Virtual-Physical Registers (HPCA'98)
  • Kilo-Instruction Processors (ISHPC'03, HPCA'06, ISCA'08)

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache, Register Write, Commit]

SLIDE 15

… and the Power Wall Appeared Later 

  • Better technologies
  • Two-level organization (locality exploitation):
  • Register file for superscalar (ISCA'00)
  • Instruction queues (ICCD'05)
  • Load/store queues (ISCA'08)
  • Direct Wakeup, pointer-based instruction queue design (ICCD'04, ICCD'05)
  • Content-aware register file (ISCA'09)
  • Fuzzy computation (ICS'01, IEEE CAL'02, IEEE-TC'05), currently known as Approximate Computing

[Pipeline diagram: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache, Register Write, Commit]

SLIDE 16

Fuzzy computation

Trading accuracy for size and performance @ low power

[Image comparison: binary systems (BMP), compression protocols (JPEG), fuzzy computation. One image is the original; the fuzzy-computed one required ~85% of the time while consuming ~75% of the power.]
SLIDE 17

SMT and Memory Latency … 

  • Simultaneous Multithreading (SMT)
  • Benefits of SMT processors:
  • Increased core resource utilization
  • Basic pipeline unchanged:
  • Few resources replicated, the others shared
  • Some of our contributions:
  • Dynamically Controlled Resource Allocation (MICRO 2004)
  • Quality of Service (QoS) in SMTs (IEEE TC 2006)
  • Runahead Threads for SMTs (HPCA 2008)

[Pipeline diagram, threads 1..N: Fetch, Decode, Rename, Instruction Window, Wakeup+Select, Register File, Bypass, Data Cache, Register Write, Commit]

SLIDE 18

Time Predictability (in multicore and SMT processors)

  • Where is it required:
  • Increasingly required in handheld/desktop devices
  • Also in embedded hard real-time systems (cars, planes, trains, …)
  • How to achieve it:
  • Controlling how resources are assigned to co-running tasks
  • Soft real-time systems
  • SMT: DCRA resource allocation policy (MICRO 2004, IEEE Micro 2004)
  • Multicores: Cache partitioning (ACM OSR 2009, IEEE Micro 2009)
  • Hard real-time systems
  • Deterministic resource ‘securing’ (ISCA 2009)
  • Time-Randomised designs (DAC 2014 best paper award)

QoS space. Definition:

  • The ability to provide a minimum performance to a task
  • Requires biasing processor resource allocation
SLIDE 19

Statically scheduled VLIW architectures

  • Power-efficient FUs:
  • Clustering
  • Widening (MICRO-98)
  • μSIMD and multimedia vector units (ICPP-05)
  • Locality-aware RF:
  • Sacks (CONPAR-94)
  • Non-consistent (HPCA-95)
  • Two-level hierarchical (MICRO-00)
  • Integrated modulo scheduling techniques, register allocation and spilling (MICRO-95, PACT-96, MICRO-96, MICRO-01)

SLIDE 20

Vector Architectures… Memory Latency and Power 

  • Out-of-Order Access to Vectors (ISCA 1992, ISCA 1995)
  • Command Memory Vector (PACT 1998): in-memory computation
  • Decoupling Vector Architectures (HPCA 1996): Cray SX1
  • Out-of-order Vector Architectures (MICRO 1996)
  • Multithreaded Vector Architectures (HPCA 1997)
  • SMT Vector Architectures (HICS 1997, IEEE MICRO J. 1997)
  • Vector register-file organization (PACT 1997)
  • Vector Microprocessors (ICS 1999, SPAA 2001)
  • Architectures with Short Vectors (PACT 1997, ICS 1998): Tarantula (ISCA 2002), Knights Corner
  • Vector Architectures for Multimedia (HPCA 2001, MICRO 2002)
  • High-Speed Buffers for Routers (MICRO 2003, IEEE TC 2006)
  • Vector Architectures for Databases (MICRO 2012, HPCA 2015, ISCA 2016)
SLIDE 21

Awards in Computer Architecture

Charles Babbage Award (IEEE Computer Society), April 2017: "For contributions to parallel computation through brilliant technical work, mentoring PhD students, and building an incredibly productive European research environment."

Seymour Cray Award (IEEE Computer Society), November 2015: "In recognition of seminal contributions to vector, out-of-order, multithreaded, and VLIW architectures."

Eckert-Mauchly Award (IEEE Computer Society and ACM), June 2007: "For extraordinary leadership in building a world class computer architecture research center, for seminal contributions in the areas of vector computing and multithreading, and for pioneering basic new approaches to instruction-level parallelism."

SLIDE 22

The MultiCore Era

Moore's Law + Memory Wall + Power Wall → Chip MultiProcessors (CMPs)

Examples: POWER4 (2001), Intel Xeon 7100 (2006), UltraSPARC T2 (2007)

SLIDE 23

How Were Multicores Designed at the Beginning?

IBM Power4 (2001):
  • 2 cores, ST
  • 0.7 MB/core L2, 16 MB/core L3 (off-chip)
  • 115 W TDP
  • 10 GB/s mem BW

IBM Power7 (2010):
  • 8 cores, SMT4
  • 256 KB/core L2, 16 MB/core L3 (on-chip)
  • 170 W TDP
  • 100 GB/s mem BW

IBM Power8 (2014):
  • 12 cores, SMT8
  • 512 KB/core L2, 8 MB/core L3 (on-chip)
  • 250 W TDP
  • 410 GB/s mem BW
SLIDE 24

How To Parallelize Future Applications?

  • From sequential to parallel codes
  • Efficient runs on manycore processors imply handling:
  • A massive number of cores and plenty of available parallelism
  • Heterogeneous systems: same or multiple ISAs; accelerators, specialization
  • A deep and heterogeneous memory hierarchy: Non-Uniform Memory Access (NUMA), multiple address spaces
  • A stringent energy budget
  • Load balancing

A really fuzzy space

[Diagram: clusters of cores (C) and accelerators (A) on a cluster interconnect, with L2/L3, MRAM, a memory controller and DRAM]

SLIDE 25

Living in the Programming Revolution

Multicores made the interface leak…

ISA / API

Below the interface: parallel hardware with multiple address spaces (hierarchy, transfers), control flows, …

Above it, Applications: parallel application logic + platform specificities

SLIDE 26

Vision in the Programming Revolution

Today the efforts are focused on efficiently using the underlying hardware. We need to decouple again:

Applications: general purpose, single address space, application logic, architecture independent

PM: a high-level, clean, abstract interface. Power to the runtime

ISA / API

SLIDE 27

History / Strategy

DDT @ Parascope ~1992
PERMPAR ~1994
NANOS ~1996
GridSs ~2002
CellSs ~2006
COMPSs ~2007 → COMPSs ServiceSs ~2010 → COMPSs ServiceSs PyCOMPSs ~2013
SMPSs V1 ~2007 → SMPSs V2 ~2009
StarSs ~2008
OmpSs ~2008 (forerunner of OpenMP) → OpenMP … 3.0 (2008) … 4.0 (2013) …
GPUSs ~2009

SLIDE 28

OmpSs: data-flow execution of sequential programs

#pragma omp task inout([TS][TS]A)
void spotrf(float *A);
#pragma omp task input([TS][TS]A) inout([TS][TS]C)
void ssyrk(float *A, float *C);
#pragma omp task input([TS][TS]A, [TS][TS]B) inout([TS][TS]C)
void sgemm(float *A, float *B, float *C);
#pragma omp task input([TS][TS]T) inout([TS][TS]B)
void strsm(float *T, float *B);

void Cholesky(float **A)        // A: NT x NT array of TS x TS tiles
{
    int i, j, k;
    for (k = 0; k < NT; k++) {
        spotrf(A[k*NT+k]);
        for (i = k+1; i < NT; i++)
            strsm(A[k*NT+k], A[k*NT+i]);
        // update trailing submatrix
        for (i = k+1; i < NT; i++) {
            for (j = k+1; j < i; j++)
                sgemm(A[k*NT+i], A[k*NT+j], A[j*NT+i]);
            ssyrk(A[k*NT+i], A[i*NT+i]);
        }
    }
}

Decouple how we write applications from how they are executed (Write vs. Execute). Clean offloading hides architectural complexities.

SLIDE 29

OmpSs: …Taskified…

#pragma css task input(A, B) output(C)
void vadd3(float A[BS], float B[BS], float C[BS]);
#pragma css task input(sum, A) inout(B)
void scale_add(float sum, float A[BS], float B[BS]);
#pragma css task input(A) inout(sum)
void accum(float A[BS], float *sum);

for (i = 0; i < N; i += BS)    // C = A + B
    vadd3(&A[i], &B[i], &C[i]);
...
for (i = 0; i < N; i += BS)    // sum(C[i])
    accum(&C[i], &sum);
...
for (i = 0; i < N; i += BS)    // B = sum * A
    scale_add(sum, &E[i], &B[i]);
...
for (i = 0; i < N; i += BS)    // A = C + D
    vadd3(&C[i], &D[i], &A[i]);
...
for (i = 0; i < N; i += BS)    // E = G + F
    vadd3(&G[i], &F[i], &E[i]);

[Task graph; color/number: order of task instantiation. Some antidependences covered by flow dependences are not drawn.]

Write

SLIDE 30

Decouple how we write from how it is executed

… and Executed in a Data-Flow Model

(Same taskified code as in SLIDE 29, now executed out of order as dependences allow.)

[Task graph; color/number: a possible order of task execution]

Write → Execute

SLIDE 31

OmpSs: Potential of Data Access Info

  • Flat global address space seen by the programmer
  • Flexibility to dynamically traverse the dataflow graph, "optimizing":
  • Concurrency: the critical path
  • Memory accesses: data transfers performed by the runtime
  • Opportunities for automatic:
  • Prefetch
  • Reuse
  • Elimination of antidependences (renaming)
  • Replication management
  • Coherency/consistency handled by the runtime
  • Layout changes

A sketch of these runtime opportunities follows below.

[Diagram: CPU with on-chip cache and off-chip bandwidth to main memory]
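To make these opportunities concrete, here is a minimal, hedged sketch of what a task-based runtime can do once every task declares its data accesses; task_t and all helpers below are hypothetical stand-ins, not the actual OmpSs/Nanos++ API:

    /* Hypothetical task descriptor: the in/out annotations hand the runtime
       the dataflow graph before any task runs. */
    typedef struct task {
        void **in;  int n_in;     /* data read by the task    */
        void **out; int n_out;    /* data written by the task */
    } task_t;

    /* Illustrative runtime hooks (assumed, not a real API). */
    void prefetch(void *data);
    int  has_pending_readers(void *data);
    void *rename_to_fresh_storage(void *data);
    void enqueue_for_execution(task_t *t);

    void on_dependences_satisfied(task_t *t)
    {
        /* Prefetch: inputs are known ahead of execution. */
        for (int i = 0; i < t->n_in; i++)
            prefetch(t->in[i]);

        /* Renaming: if an earlier task still reads an output buffer, give
           the new writer a fresh copy and the anti-dependence disappears. */
        for (int i = 0; i < t->n_out; i++)
            if (has_pending_readers(t->out[i]))
                t->out[i] = rename_to_fresh_storage(t->out[i]);

        enqueue_for_execution(t);
    }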

SLIDE 32

CellSs implementation

[Figure: CellSs runtime on the Cell/B.E. The main thread on the PPU runs the user program (CellSs PPU lib) and instantiates tasks; a helper thread manages user data, renaming, the task graph, synchronization, data dependences and scheduling, and assigns work to the SPE threads (SPU0, SPU1, SPU2). Each SPE stages data in/out via DMA, executes the original task code (CellSs SPU lib), and signals finalization.]

  • P. Bellens, et al, “CellSs: A Programming Model for the Cell BE Architecture” SC’06.
  • P. Bellens, et al, “CellSs: Programming the Cell/B.E. made easier” IBM JR&D 2007
SLIDE 33

Renaming @ Cell

  • Experiments on the CellSs (predecessor of OmpSs)
  • Renaming to avoid anti-dependences:
  • Eager (as done in superscalar designs): at task instantiation time
  • Lazy (similar to virtual registers): just before task execution
  • P. Bellens, et al, "CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy", Sci. Prog. 2009

[Charts: main memory transfers (cold and capacity) and killed transfers. SMPSs: Stream benchmark, reduction in execution time; SMPSs: Jacobi, reduction in # renamings.]

SLIDE 34

Data Reuse @ Cell

  • Experiments on the CellSs
  • Data reuse:
  • Locality arcs in the dependence graph
  • Good locality but high overhead → no overall time improvement
  • P. Bellens, et al, "CellSs: Scheduling Techniques to Better Exploit Memory Hierarchy", Sci. Prog. 2009

[Chart: matrix-matrix multiply]
SLIDE 35

Reducing Data Movement @ Cell

  • Experiments on the CellSs (predecessor of OmpSs)
  • Bypassing / global software cache:
  • Distributed implementation @ each SPE
  • Using object descriptors managed atomically with specific hardware support (line-level LL-SC)
  • P. Bellens et al, "Making the Best of Temporal Locality: Just-In-Time Renaming and Lazy Write-Back on the Cell/B.E.", IJHPC 2010

[Chart: DMA reads broken down into main memory (cold), main memory (capacity), global software cache and local software cache]

SLIDE 36

GPUSs implementation

  • Architecture implications:
  • Large local store, O(GB) → large task granularity → good
  • Data transfers: slow, not overlapped → bad
  • Cache management:
  • Write-through
  • Write-back
  • Runtime implementation:
  • Powerful main processor and multiple cores
  • Dumb accelerator (not able to perform data transfers, implement a software cache, …)

[Figure: main thread and helper thread on the host pipeline (IFU, DEC, REN, IQ, ISS, REG, RET, FUs); slave threads drive the accelerators]

  • E. Ayguade, et al, "An Extension of the StarSs Programming Model for Platforms with Multiple GPUs", Europar 2009
SLIDE 37

Prefetching @ multiple GPUs

  • Improvements in runtime mechanisms (OmpSs + CUDA):
  • Use of multiple streams
  • High asynchrony and overlap (transfers and kernels)
  • Overlapping kernels
  • Taking overheads out of the critical path
  • Improvements in schedulers:
  • Late binding of locality-aware decisions
  • Priority propagation
  • J. Planas et al, "Optimizing Task-based Execution Support on Asynchronous Devices." Submitted

[Charts: Nbody, Cholesky]

SLIDE 38

History / Strategy

(Same history/strategy timeline as in SLIDE 27.)

SLIDE 39

OmpSs: a forerunner for OpenMP

+ Prototype of tasking
+ Task dependences
+ Task priorities
+ Taskloop prototyping
+ Task reductions
+ Dependences on taskwaits
+ OMPT implementation
+ Multidependences
+ Commutative
+ Dependences on taskloops

… today

A minimal sketch of several of these features in standard OpenMP follows below.
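As a rough illustration (not from the slides), most of the features listed above now exist in standard OpenMP; the following minimal sketch, assuming an OpenMP 5.0 compiler, exercises task dependences, priorities, and a taskloop with a reduction:

    #include <stdio.h>

    int main(void)
    {
        int x = 0, y = 0, sum = 0;
        #pragma omp parallel
        #pragma omp single
        {
            #pragma omp task depend(out: x) priority(10)  /* task priorities  */
            x = 42;

            #pragma omp task depend(in: x) depend(out: y) /* task dependences */
            y = x + 1;

            #pragma omp taskloop reduction(+: sum)   /* taskloop + reduction  */
            for (int i = 0; i < 100; i++)
                sum += i;

            #pragma omp taskwait                     /* wait for x and y      */
            printf("y = %d, sum = %d\n", y, sum);
        }
        return 0;
    }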

SLIDE 40

Runtime Aware Architectures

The runtime drives the hardware design: tight collaboration between the software and hardware layers.

Applications → Runtime → ISA / API

PM: a high-level, clean, abstract interface. A task-based PM annotated by the user; data dependencies detected at runtime; dynamic scheduling; "reuse" of architectural ideas under new constraints

SLIDE 41

Superscalar vision at Multicore level

Walls: Programmability, Resilience, Memory, Power

Superscalar world:
• Out-of-order, kilo-instruction processors, distant parallelism
• Branch prediction, speculation
• Fuzzy computation
• Dual data cache, sacks for VLIW
• Register renaming, virtual registers
• Cache reuse, prefetching, victim caches
• In-memory computation
• Accelerators, different ISAs, SMT
• Critical path exploitation
• Resilience

Multicore world:
• Task-based, data-flow graph, dynamic parallelism
• Task output prediction, speculation
• Hybrid memory hierarchy, NVM
• Late task memory allocation
• Data reuse, prefetching
• In-memory FUs
• Heterogeneity of tasks and HW
• Task criticality
• Resilience
• Load balancing and scheduling
• Interconnection network, data movement

SLIDE 42

RoMoL Research Lines

  • Management of hybrid memory hierarchies with scratchpad memories (ISCA'15, PACT'15) and stacked DRAMs (ICS'18)
  • Runtime exploitation of data locality (PACT'16, TPDS'18)
  • Exploiting the Task Dependency Graph (TDG) to reduce data movements (ICS'18)
  • Architectural support for task-dependence management (IPDPS'17, HPCA'18)
  • Vector extensions to optimize DBMS (MICRO'12, HPCA'15, ISCA'16)
  • Criticality-aware task scheduling (ICS'15) and acceleration (IPDPS'16)
  • Approximate Task Memoization (IPDPS'17)
  • Dealing with variation due to hardware manufacturing (ICS'16)
SLIDE 43

Runtime Aware Architectures (RAA): the Memory Wall

Re-design the memory hierarchy:
– Hybrid (cache + local memory)
– Non-volatile memory, 3D stacking
– Simplified coherence protocols, non-coherent islands of cores

Exploitation of data locality:
– Reuse, prefetching, in-memory computation

SLIDE 44

Transparent Management of Local Memories

Hybrid memory hierarchy:
– L1 cache + local memories (LM)

More difficult to manage, but:
– More energy efficient
– Less coherence traffic

LM management in OpenMP (SC'12, ISCA'15):
– Strided accesses served by the LM
– Irregular accesses served by the L1 cache
– HW support for coherence and consistency

[Chart: speedup of Hybrid vs. Cache on NAS benchmarks CG, EP, FT, IS, MG, SP]
[Diagram: clustered manycore; each core with L1 + LM, shared L2/L3, DRAM]

  • Ll. Alvarez et al. Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories. SC 2012.
  • Ll. Alvarez et al. Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures. ISCA 2015.

SLIDE 45

Transparent Management of Local Memories

LM management in task-based programming models (PACT'15):
– Inputs and outputs mapped to the LMs
– Runtime manages DMA transfers (see the sketch below):
  • Locality-aware task scheduling
  • Overlap with the runtime
  • Double buffering between tasks
– Coherence and consistency ensured by programming model semantics

[Diagram: clustered manycore; each core with L1 + LM, shared L2/L3, DRAM]
[Chart: speedup of Hybrid vs. Cache]

Results: 8.7% speedup in execution time, 14% reduction in power, 20% reduction in network-on-chip traffic

  • Ll. Alvarez et al. Runtime-Guided Management of Hybrid Memory Hierarchies in Multicore Architectures. PACT 2015.
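As a hedged sketch of the double buffering mentioned above (stage_in, stage_out and dma_wait are hypothetical stand-ins for the runtime's asynchronous DMA primitives, not the PACT'15 implementation):

    typedef struct task task_t;

    /* Assumed asynchronous DMA helpers. */
    void stage_in(task_t *t);    /* main memory -> local memory (async) */
    void stage_out(task_t *t);   /* local memory -> main memory (async) */
    void dma_wait(task_t *t);    /* block until t's inputs are resident */
    void execute(task_t *t);     /* run the task body out of the LM     */

    /* While task i executes from the local memory, the DMA engine is
       already staging in the inputs of task i+1. */
    void run_window(task_t *tasks, int n)
    {
        if (n > 0)
            stage_in(&tasks[0]);
        for (int i = 0; i < n; i++) {
            if (i + 1 < n)
                stage_in(&tasks[i + 1]);  /* overlap next task's transfers */
            dma_wait(&tasks[i]);
            execute(&tasks[i]);
            stage_out(&tasks[i]);         /* results written back lazily   */
        }
    }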

SLIDE 46

Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic

To reduce coherence traffic, the state-of-the-art applies round-robin mechanisms at the runtime level. Exploiting the information contained in the TDG is effective to:
– improve performance (3.16x wrt FIFO)
– dramatically reduce coherence traffic (2.26x reduction wrt the state-of-the-art)

[Figure: state-of-the-art partition (DEP) of a Gauss-Seidel TDG. DEP requires ~200 GB of data transfer across a 288-core system.]

SLIDE 47

Exploiting the Task Dependency Graph (TDG) to Reduce Coherence Traffic

[Figure: graph-algorithms-driven partition (RIP-DEP) of the Gauss-Seidel TDG. RIP-DEP requires ~90 GB of data transfer across the same 288-core system.]

  • I. Sánchez et al, "Reducing Data Movements on Shared Memory Architectures", ICS 2018
SLIDE 48

Transparent Management of Stacked DRAMs

Heterogeneous memory system:
– 3D-stacked HBM + off-chip DDR4

Very high bandwidth, but:
– Difficult to manage
– Part of memory (PoM) or cache?

Runtime-managed stacked DRAM:
– Map task data to the stacked DRAM
– Parallelize data copies to reduce copy overheads
– Reuse-aware bypass to avoid unworthy copies
– 14% average performance benefit on an Intel Knights Landing

[Diagrams: CPU with stacked DRAM used as a cache vs. as part of memory (NUMA 1 / NUMA 2) in front of external DRAM; chart comparing Cache, PoM and Runtime configurations]

  • Ll. Alvarez et al. "Runtime-Guided Management of Stacked DRAM Memories in Task Parallel Programs." ICS 2018.

SLIDE 49

Runtime Aware Architectures (RAA): the Power Wall and the Memory Wall

Heterogeneity of tasks and hardware:
– Critical path exploitation
– Manufacturing variability

Management of shared resources

Re-design the memory hierarchy:
– Hybrid (cache + local memory)
– Non-volatile memory, 3D stacking
– Simplified coherence protocols, non-coherent islands of cores

Exploitation of data locality:
– Reuse, prefetching, in-memory computation

SLIDE 50

OmpSs in Heterogeneous Systems

Heterogeneous systems:
  • Big-little processors
  • Accelerators
  • Hard to program

[Diagram: big.LITTLE cluster of big and little cores]

Task-based programming models can adapt to these scenarios (see the sketch below):
  • Detect tasks on the critical path and run them on fast cores
  • Non-critical tasks can run on slower cores
  • Assign tasks to the most energy-efficient HW component
  • The runtime takes care of balancing the load
  • Same performance with less power consumption
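A minimal sketch of the idea, in the spirit of CATS (the queues, the bottom_level heuristic and the threshold below are illustrative assumptions, not the actual Nanos++ scheduler):

    struct queue;                      /* opaque per-core-type ready queue */
    typedef struct task task_t;

    /* Assumed helpers. */
    int  bottom_level(task_t *t);      /* longest path from t to a TDG leaf */
    void push(struct queue *q, task_t *t);

    extern struct queue *big_cores_q, *little_cores_q;
    extern int critical_threshold;     /* tuned per application */

    void assign_ready_task(task_t *t)
    {
        if (bottom_level(t) >= critical_threshold)
            push(big_cores_q, t);      /* critical path -> fast cores       */
        else
            push(little_cores_q, t);   /* remaining work -> efficient cores */
    }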
SLIDE 51

Criticality-Aware Task Scheduler

  • CATS on a big.LITTLE processor (ICS'15):
  • 4 Cortex-A15 @ 2 GHz
  • 4 Cortex-A7 @ 1.4 GHz
  • Effectively solves the problem of blind assignment of tasks
  • Higher speedups for double precision-intensive benchmarks
  • But still suffers from priority inversion and static assignment
  • K. Chronaki et al. Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures. ICS 2015.

[Chart: speedup of CATS vs. Original on Cholesky, Int. Hist, QR, Heat, AVG]

SLIDE 52

Criticality-Aware Task Acceleration

CATA: accelerating critical tasks (IPDPS'16):
– The runtime drives per-core DVFS reconfigurations meeting a global power budget
– Solves the priority inversion and static assignment issues
– Reconfiguration overhead grows with the number of cores
– A hardware Runtime Support Unit (RSU) reconfigures DVFS

[Chart: performance and EDP improvement of Original, CATS, CATA and CATA+RSU on a 32-core system with 16 fast cores]

  • E. Castillo et al., CATA: Criticality Aware Task Acceleration for Multicore Processors. IPDPS 2016.

SLIDE 53

Runtime Aware Architectures (RAA): the Programmability, Resilience, Memory and Power Walls

Hardware acceleration of the runtime system:
– Task dependency graph management

Task memoization and approximation

Heterogeneity of tasks and hardware:
– Critical path exploitation
– Manufacturing variability

Management of shared resources

Re-design the memory hierarchy:
– Hybrid (cache + local memory)
– Non-volatile memory, 3D stacking
– Simplified coherence protocols, non-coherent islands of cores

Exploitation of data locality:
– Reuse, prefetching, in-memory computation

Resilience:
– Task-based checkpointing
– Algorithmic-based fault tolerance

SLIDE 54

Approximate Task Memoization (ATM)

ATM aims to eliminate redundant tasks (IPDPS'17). ATM detects correlations between task inputs and outputs to memoize similar tasks (see the sketch below).

– Static ATM achieves a 1.4x average speedup when only applying memoization techniques
– With task approximation, Dynamic ATM achieves a 2.5x average speedup with an average 0.7% accuracy loss, competitive with an off-line Oracle approach

  • I. Brumar et al, "ATM: Approximate Task Memoization in the Runtime System". IPDPS 2017
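As a hedged sketch of plain (non-approximate) task memoization, the core of the idea is to hash a task's inputs and skip execution on a hit; the memo table and the compute body below are illustrative assumptions, not the ATM implementation:

    #include <stddef.h>
    #include <stdint.h>

    /* FNV-1a hash over the task's input bytes. */
    static uint64_t hash_bytes(const void *p, size_t n)
    {
        const unsigned char *b = p;
        uint64_t h = 1469598103934665603ull;
        for (size_t i = 0; i < n; i++) { h ^= b[i]; h *= 1099511628211ull; }
        return h;
    }

    /* Assumed memo table and task body. */
    int  memo_lookup(uint64_t key, float *out, size_t n);   /* 1 on a hit */
    void memo_insert(uint64_t key, const float *out, size_t n);
    void compute(const float *in, float *out, size_t n);

    void run_task(const float *in, float *out, size_t n)
    {
        uint64_t key = hash_bytes(in, n * sizeof *in);
        if (memo_lookup(key, out, n))
            return;                  /* redundant task eliminated         */
        compute(in, out, n);         /* the original task body            */
        memo_insert(key, out, n);    /* remember outputs for future reuse */
    }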

SLIDE 55

TaskSuperscalar (TaskSs) Pipeline

Hardware design for a distributed task superscalar pipeline frontend (MICRO'10):
– Can be embedded into any manycore fabric
– Drives hundreds of threads
– Work windows of thousands of tasks
– Fine-grain task parallelism

TaskSs components (a software analogy is sketched below):
– Gateway (GW): allocates resources for task meta-data
– Object Renaming Table (ORT): maps memory objects to producer tasks
– Object Versioning Table (OVT): maintains multiple object versions
– Task Reservation Stations (TRS): store and track in-flight task meta-data

Implementing TaskSs @ Xilinx Zynq (ISPASS'16, IPDPS'17)

[Diagram: TaskSs pipeline (GW, ORT, OVT, TRS, Ready Queue, Scheduler) attached to a multicore fabric]

  • Y. Etsion et al, "Task Superscalar: An Out-of-Order Task Pipeline", MICRO 2010
  • X. Tan et al, "General Purpose Task-Dependence Management Hardware for Task-based Dataflow Programming Models", IPDPS 2017
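As a software analogy of how the ORT tracks producers (a hedged sketch; the struct layout and names are illustrative, not the TaskSs hardware):

    /* ORT entry: which in-flight task last produces a memory object.
       A new writer replaces the entry, so later readers chain to the
       newest producer, much like register renaming. */
    typedef struct { void *obj; int producer_task; } ort_entry_t;

    /* A reader asks the ORT who produces its input: a hit adds a true (RAW)
       dependence edge; a miss means the data is already in memory. */
    int find_producer(const ort_entry_t *ort, int n, const void *obj)
    {
        for (int i = 0; i < n; i++)
            if (ort[i].obj == obj)
                return ort[i].producer_task;
        return -1;   /* no in-flight producer */
    }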

SLIDE 56

Architectural Support for Task Dependence Management (TDM) with Flexible Software Scheduling

Task creation is a bottleneck since it involves dependence tracking. Our hardware proposal (TDM):
– takes care of dependence tracking
– exposes scheduling to the SW

Our results demonstrate that this flexibility allows TDM to beat the state-of-the-art.

  • E. Castillo et al, Architectural Support for Task Dependence Management with Flexible Software Scheduling (HPCA'18)

SLIDE 57

Hash Join, Sorting, Aggregation, DBMS

  • Goal: vector acceleration of databases
  • "Real vector" extensions to x86:
  • Pipeline operands to the functional unit (like Cray machines, not like SSE/AVX)
  • Scatter/gather, masking, vector length register
  • Implemented in PTLSim + DRAMSim2
  • Hash join work published in MICRO 2012:
  • 1.94x (large data sets) and 4.56x (cache-resident data sets) speedup for TPC-H
  • Memory bandwidth is the bottleneck
  • Sorting paper published in HPCA 2015:
  • Compares existing vectorized quicksort, bitonic mergesort and radix sort on a consistent platform
  • Proposes a novel approach (VSR) for vectorizing radix sort with 2 new instructions
  • Similar to the AVX512-CD instructions (but cannot use Intel's instructions because the algorithm requires strict ordering)
  • Small CAM
  • 3.4x speedup over the next-best vectorised algorithm with the same hardware configuration due to:
  • Transforming strided accesses to unit-stride
  • Eliminating replicated data structures
  • Ongoing work on aggregations (ISCA 2016):
  • Reduction to a group of values, not a single scalar value
  • Building from the VSR work

An illustrative gather/masking sketch follows below.

[Chart: speedup over a scalar baseline for quicksort, bitonic, radix and VSR at mvl-8/16/32/64, with 1, 2 and 4 lanes]
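For flavor, here is what scatter/gather and masking look like in today's AVX-512 for a toy hash-join probe. This is purely illustrative, assumes n is a multiple of 16 and a table of HASH_MASK+1 int slots, and is not the "real vector" ISA the slide proposes:

    #include <immintrin.h>

    #define HASH_MASK 1023   /* assumed table of 1024 int slots */

    /* For each key, gather table[hash(key)] and store the key on a match. */
    void probe(const int *keys, const int *table, int *match, int n)
    {
        for (int i = 0; i < n; i += 16) {
            __m512i k = _mm512_loadu_si512(keys + i);
            __m512i h = _mm512_and_si512(k, _mm512_set1_epi32(HASH_MASK));
            __m512i v = _mm512_i32gather_epi32(h, table, 4);   /* gather       */
            __mmask16 m = _mm512_cmpeq_epi32_mask(v, k);       /* masking      */
            _mm512_mask_storeu_epi32(match + i, m, k);         /* masked store */
        }
    }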

SLIDE 58

RoMoL Infrastructure

Applications: building a representative set to show OmpSs' strengths:
  • HPC codes (OmpSs, MPI+OmpSs)
  • PARSECSs (OmpSs)

Simulation infrastructure: we have built a MUlti-scale Simulation Approach (MUSA):
  • Dimemas: off-socket communications (MPI)
  • TaskSim: on-socket parallelism (OmpSs), coarse memory hierarchy
  • Gem5: detailed memory hierarchy, fine-grain parallelism (OmpSs), processor pipeline

Analysis tools: we are using a solid tool kit to analyze our codes on real HW:
  • Extrae (tracing OmpSs and MPI+OmpSs codes)
  • Paraver (analysis of OmpSs and MPI+OmpSs codes)

SLIDE 59

Related Work

  • Rigel Architecture (ISCA 2009):
  • No L1D; non-coherent L2; read-only, private and cluster-shared data
  • Global accesses bypass the L2 and go directly to L3
  • SARC Architecture (IEEE MICRO 2010):
  • Throughput-aware architecture
  • TLBs used to access remote LMs and migrate data across LMs
  • Runnemede Architecture (HPCA 2013):
  • Coherence islands (SW managed) + a hierarchy of LMs
  • Dataflow execution (codelets)
  • Carbon (ISCA 2007):
  • Hardware scheduling for task-based programs
  • Holistic run-time parallelism management (ICS 2013)
  • Runtime-guided coherence protocols (IPDPS 2014)
SLIDE 60

RoMoL … papers

  • V. Marjanovic et al., "Effective communication and computation overlap with hybrid MPI/SMPSs." PPoPP 2010
  • Y. Etsion et al., "Task Superscalar: An Out-of-Order Task Pipeline." MICRO 2010
  • N. Vujic et al., "Automatic Prefetch and Modulo Scheduling Transformations for the Cell BE Architecture." IEEE TPDS 2010
  • V. Marjanovic et al., "Overlapping communication and computation by using a hybrid MPI/SMPSs approach." ICS 2010
  • T. Hayes et al., "Vector Extensions for Decision Support DBMS Acceleration." MICRO 2012
  • L. Alvarez et al., "Hardware-software coherence protocol for the coexistence of caches and local memories." SC 2012
  • M. Valero et al., "Runtime-Aware Architectures: A First Approach." SuperFRI 2014
  • L. Alvarez et al., "Hardware-Software Coherence Protocol for the Coexistence of Caches and Local Memories." IEEE TC 2015

SLIDE 61

RoMoL … papers

  • M. Casas et al., "Runtime-Aware Architectures." Euro-Par 2015
  • T. Hayes et al., "VSR sort: A novel vectorised sorting algorithm & architecture extensions for future microprocessors." HPCA 2015
  • K. Chronaki et al., "Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures." ICS 2015
  • L. Alvarez et al., "Coherence Protocol for Transparent Management of Scratchpad Memories in Shared Memory Manycore Architectures." ISCA 2015
  • L. Alvarez et al., "Run-Time Guided Management of Scratchpad Memories in Multicore Architectures." PACT 2015
  • L. Jaulmes et al., "Exploiting Asynchrony from Exact Forward Recoveries for DUE in Iterative Solvers." SC 2015
  • D. Chasapis et al., "PARSECSs: Evaluating the Impact of Task Parallelism in the PARSEC Benchmark Suite." ACM TACO 2016
  • E. Castillo et al., "CATA: Criticality Aware Task Acceleration for Multicore Processors." IPDPS 2016

SLIDE 62

RoMoL … papers

  • T. Hayes et al., "Future Vector Microprocessor Extensions for Data Aggregations." ISCA 2016
  • D. Chasapis et al., "Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes." ICS 2016
  • P. Caheny et al., "Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling." PACT 2016
  • T. Grass et al., "MUSA: A multi-level simulation approach for next-generation HPC machines." SC 2016
  • I. Brumar et al., "ATM: Approximate Task Memoization in the Runtime System." IPDPS 2017
  • K. Chronaki et al., "Task Scheduling Techniques for Asymmetric Multi-Core Systems." IEEE TPDS 2017
  • C. Ortega et al., "libPRISM: An Intelligent Adaptation of Prefetch and SMT Levels." ICS 2017
  • V. Dimic et al., "Runtime-Assisted Shared Cache Insertion Policies Based on Re-Reference Intervals." EuroPar 2017

SLIDE 63
RoMoL Team

  • Riding on Moore's Law (RoMoL, http://www.bsc.es/romol)
  • ERC Advanced Grant: 5-year project, 2013-2018
  • Our team: CS Department @ BSC
  • PI:
  • Project Coordinators:
  • Researchers:
  • Postdocs:
  • Students:
  • Open for collaborations!

SLIDE 64

High-level Overview of the Proposal and Goals

We propose a management agent able to dynamically adapt hardware and system software:

– At the HW level, it will monitor the architecture status, evaluate it, and adapt it.
– At the system SW level, it will monitor the program state and adapt the OS and runtime system policies (e.g. scheduling).

The goal is to predict hardware and software states as many cycles in advance as possible to enable optimizations (e.g. prefetching).

[Diagram: Management Agent fed by SW models and HW models]

SLIDE 65

Mare Nostrum RISC-V inauguration 202X

MN-RISC-V

SLIDE 66

www.bsc.es

THANK YOU!