Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand - PowerPoint PPT Presentation

Spring 2015 :: CSE 502 – Computer Architecture Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand

Spring 2015 :: CSE 502 – Computer Architecture Getting More Performance • OoO superscalars extract ILP from sequential programs – Hardly more than 1-2 IPC on real workloads – Although some studies suggest ILP degrees of 10’s - 100’s • In practice, IPC is limited by: – Limited BW • From memory and cache • Fetch/commit bandwidth • Renaming (must find dependences among all insns dispatched in a cycle) – Limited HW resources • # renaming registers, ROB, RS and LSQ entries, functional units – True data dependences • Coming from algorithm and compiler – Branch prediction accuracy – Imperfect memory disambiguation

Spring 2015 :: CSE 502 – Computer Architecture Getting More Performance • Keep pushing IPC and/or frequency – Design complexity (time to market) – Cooling (cost) – Power delivery (cost) – … • Possible, but too costly

Spring 2015 :: CSE 502 – Computer Architecture Bridging the Gap Watts / IPC Power has been growing exponentially as well 100 10 1 Diminishing returns w.r.t. larger instruction window, higher issue-width Single-Issue Limits Superscalar Superscalar Pipelined Out-of-Order Out-of-Order (Today) (Hypothetical- Aggressive)

Spring 2015 :: CSE 502 – Computer Architecture Higher Complexity not Worth Effort Performance Made sense to go Superscalar/OOO: good ROI Very little gain for substantial effort “Effort” Scalar Moderate-Pipe Very-Deep-Pipe In-Order Superscalar/OOO Aggressive Superscalar/OOO

Spring 2015 :: CSE 502 – Computer Architecture User Visible/Invisible (1/2) • Problem: HW is in charge of finding parallelism → User-invisible parallelism – Most of what of what we discussed in the class so far! • Users got “free” performance just by buying a new chip – No change needed to the program (same ISA) – Higher frequency & higher IPC (different micro-arch) – But this was not sustainable…

Spring 2015 :: CSE 502 – Computer Architecture User Visible/Invisible (2/2) • Alternative: User-visible parallelism – User (developer) responsible for finding and expressing parallelism – HW does not need to find parallelism → Simpler, more efficient HW • Common forms – Data-Level Parallelism (DLP) : Vector processors, SIMD extensions, GPUs – Thread-Level Parallelism (TLP) : Multiprocessors, Hardware Multithreading – Request-Level Parallelism (RLP) : Data centers CSE 610 (Parallel Computer Architectures) next semester will cover these and other related subjects comprehensively

Spring 2015 :: CSE 502 – Computer Architecture Thread-Level Parallelism (TLP)

Spring 2015 :: CSE 502 – Computer Architecture Sources of TLP • Different applications – MP3 player in background while you work in Office – Other background tasks: OS/kernel, virus check, etc… – Piped applications • gunzip -c foo.gz | grep bar | perl some-script.pl • Threads within the same application – Explicitly coded multi-threading • pthreads – Parallel languages and libraries • OpenMP, Cilk , TBB, etc…

Spring 2015 :: CSE 502 – Computer Architecture Architectures to Exploit TLP • Multiprocessors (MP): Different threads run on different processors – Symmetric Multiprocessors (SMP) – Chip Multiprocessors (CMP) • Hardware Multithreading (MT) : Multiple threads share the same processor pipeline – Coarse-grained MT (CGMT) – Fine-grained MT (FMT) – Simultaneous MT (SMT)

Spring 2015 :: CSE 502 – Computer Architecture Multiprocessors (MP)

Spring 2015 :: CSE 502 – Computer Architecture SMP Machines • SMP = Symmetric Multi-Processing – Symmetric = All CPUs are the same and have “equal” access to memory – All CPUs are treated as similar by the OS • E.g.: no master/slave, no bigger or smaller CPUs, … • OS sees multiple CPUs – Runs one process (or thread) on each CPU CPU 0 CPU 1 CPU 2 CPU 3

Spring 2015 :: CSE 502 – Computer Architecture Chip-Multiprocessing ( CMP ) • Simple SMP on the same chip – CPUs now called “cores” by hardware designers – OS designers still call these “CPUs” Intel “Smithfield” (Pentium D) Block Diagram AMD Dual-Core Athlon FX

Spring 2015 :: CSE 502 – Computer Architecture Benefits of CMP • Cheaper than multi-chip SMP – All/most interface logic integrated on chip • Fewer chips • Single CPU socket • Single interface to memory – Less power than multi-chip SMP • Communication on die uses less power than chip to chip • Efficiency – Use transistors for multiple cores (instead of wider/more aggressive OoO) – Potentially better use of hardware resources

Spring 2015 :: CSE 502 – Computer Architecture CMP Performance vs. Power • 2x CPUs not necessarily equal to 2x performance • 2x CPUs  ½ power for each – Maybe a little better than ½ if resources can be shared • Back-of-the-Envelope calculation: – 3.8 GHz CPU at 100W – Dual-core: 50W per Core 3 = 100W/50W  V CMP = 0.8 V orig – P  V 3 : V orig 3 /V CMP – f  V: f CMP = 3.0GHz

Spring 2015 :: CSE 502 – Computer Architecture Shared-Memory Multiprocessors • Multiple threads use shared memory (address space) – “System V Shared Memory” or “Threads” in software • Communication implicit via loads and stores – Opposite of explicit message-passing multiprocessors P 1 P 2 P 3 P 4 Memory System

Spring 2015 :: CSE 502 – Computer Architecture Why Shared Memory? • Pluses + Programmers don’t need to learn about explicit communications • Because communication is implicit (through memory) + Applications similar to the case of multitasking uniprocessor • Programmers already know about synchronization + OS needs only evolutionary extensions • Minuses – Communication is hard to optimize • Because it is implicit • Not easy to get good performance out of shared-memory programs – Synchronization is complex • Over-synchronization → bad performance • Under-synchronization → incorrect programs • Very difficult to debug – Hard to implement in hardware Result: the most popular form of parallel programming

Spring 2015 :: CSE 502 – Computer Architecture Paired vs. Separate Processor/Memory? • Separate CPU/memory • Paired CPU/memory – Uniform memory access – Non-uniform memory access ( UMA ) ( NUMA ) • Equal latency to memory • Faster local memory • Data placement matters – Lower peak performance – Higher peak performance CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) Mem R Mem R Mem R Mem R Mem Mem Mem Mem

Spring 2015 :: CSE 502 – Computer Architecture Shared vs. Point-to-Point Networks • Shared network • Point-to-point network: – Example: bus – Example: mesh, ring – Low latency – High latency (many “ hops ”) – Low bandwidth – Higher bandwidth • Doesn’t scale > ~16 cores • Scales to 1000s of cores – Simpler cache coherence – Complex cache coherence CPU($) CPU($) CPU($) CPU($) CPU($) CPU($) Mem R Mem R Mem R Mem R Mem R R Mem Mem R R Mem CPU($) CPU($)

Spring 2015 :: CSE 502 – Computer Architecture Organizing Point-To-Point Networks • Network topology : organization of network – Trade off perf. (connectivity, latency, bandwidth)  cost • Router chips – Networks w/separate router chips are indirect – Networks w/ processor/memory/router in chip are direct • Fewer components, “ Glueless MP ” R CPU($) CPU($) Mem R R Mem R R Mem R Mem R Mem R Mem R Mem R R Mem CPU($) CPU($) CPU($) CPU($) CPU($) CPU($)

Spring 2015 :: CSE 502 – Computer Architecture Issues for Shared Memory Systems • Two big ones – Cache coherence – Memory consistency model • Closely related – But often confused • Will talk about these a lot more in CSE 610

Spring 2015 :: CSE 502 – Computer Architecture Cache Coherence

Spring 2015 :: CSE 502 – Computer Architecture Cache Coherence: The Problem (1/3) • Multiple copies of each cache block – One in main memory – Up to one in each cache • Multiple copies can get inconsistent when writes happen – Should make sure all processors have a consistent view of memory P 1 P 2 P 3 P 4 P 1 P 2 P 3 P 4 $ $ $ $ Memory System Memory Logical View Reality (more or less!) Should propagate one processor’s write to others

Spring 2015 :: CSE 502 – Computer Architecture Cache Coherence: The Problem (2/3) • Variable A initially has value 0 • P1 stores value 1 into A • P2 loads A from memory and sees old value 0 P1 P2 t1: Store A=1 t2: Load A? A: 0 1 A: 0 A: 0 L1 L1 Bus A: 0 Main Memory Need to do something to keep P2’s cache coherent

Spring 2015 :: CSE 502 – Computer Architecture Cache Coherence: The Problem (3/3) • P1 and P2 both have variable A (value 0) in their caches • P1 stores value 1 into A • P2 loads A from its cache and sees old value 0 P1 P2 t1: Store A=1 t2: Load A? A: 0 1 A: 0 A: 0 L1 L1 Bus A: 0 Main Memory Need to do something to keep P2’s cache coherent

Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand - PowerPoint PPT Presentation

Spring 2015 :: CSE 502 Computer Architecture Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand Spring 2015 :: CSE 502 Computer Architecture Getting More Performance OoO superscalars extract ILP from sequential

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Exploiting More ILP ILP = __________ _ ________

MLP yes! Definitions ILP no ! MLP ILP = Instruction Level = Memory Level Parallelism Work

Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer

Optimal ILP and Register Tiling: Analytical Model and Optimization Framework Lakshminarayanan.

FEDERAL ENERGY REGULATORY COMMISSION Multi-Stakeholder ILP Effectiveness Technical Conference

COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides ILP

Instruction-Level Parallelism (ILP) Fine-grained parallelism Obtained by: instruction

SAT based Abstraction-Refinement using ILP and Machine Learning Techniques Edmund Clarke Anubhav

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal

The ILP approach to the layered graph drawing Ago Kuusik Veskisilla Teooriapevad 1-3.10.2004

Logical reduction of metarules Andrew Cropper & Sophie Tourret ILP Examples Learner

Parameterized Complexity of Integer Linear Programming (ILP) Sebastian Ordyniak Parameterized

Looking Beyond the Knob Looking Beyond the Knob Looking Beyond the Knob Looking Beyond the Knob

MEDIA DISRUPTION SEEING BEYOND SEEING BEYOND SEEING BEYOND SEEING BEYOND LED BY THE BLIND

International Stakeholder Forum Ofcom Riverside House July 2019 PROMOTING CHOICE

Secure hashing, authen/ca/on root@topi:/etc# more shadow

360 VIDEO CLOUD STREAMING & HTMLVIDEOELEMENT EXTENSIONS Louay Bassbouss | Fraunhofer FOKUS

Recent results of the NA64 experiment at the CERN SPS . Mikhail Kirsanov INR RAS (Moscow) (on

QUALITY COMES AS STANDARD Part of Link Group | Corporate Markets 1 Agenda Introduction to

Semi-parametric estimation Large-Time Scaling (LRD). Fourier vs Wavelets E. Moulines (ENST) C.

Principles of Program Analysis: A Sampler of Approaches Transparencies based on Chapter 1 of the

Completeness for Mosss Coalgebraic Logic (Boolean version) Clemens Kupke, Alexander Kurz , Yde

Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand - PowerPoint PPT Presentation

Spring 2015 :: CSE 502 Computer Architecture Beyond ILP In Search of More Parallelism Instructor: Nima Honarmand Spring 2015 :: CSE 502 Computer Architecture Getting More Performance OoO superscalars extract ILP from sequential

1 ILP Ferrara sept 2018 Games 2 ILP Ferrara sept 2018 Interest of games for AI Excellent

Exploiting More ILP ILP = __________________ _________________ ________________

MLP yes! Definitions ILP no ! MLP ILP = Instruction Level = Memory Level Parallelism Work

Limits to ILP Conflicting studies of amount Benchmarks (vectorized Fortran FP vs. integer C

263-2810: Advanced Compiler Design 8.2 Scheduling for ILP processors Thomas R. Gross Computer

Optimal ILP and Register Tiling: Analytical Model and Optimization Framework Lakshminarayanan.

FEDERAL ENERGY REGULATORY COMMISSION Multi-Stakeholder ILP Effectiveness Technical Conference

COSC 5351 Advanced Computer Architecture Slides modified from Hennessy CS252 course slides ILP

Instruction-Level Parallelism (ILP) Fine-grained parallelism Obtained by: instruction

SAT based Abstraction-Refinement using ILP and Machine Learning Techniques Edmund Clarke Anubhav

Chapter 3: Instruction Level Parallelism (ILP) and its exploitation Pipeline CPI = Ideal

The ILP approach to the layered graph drawing Ago Kuusik Veskisilla Teooriapevad 1-3.10.2004

Logical reduction of metarules Andrew Cropper &amp; Sophie Tourret ILP Examples Learner

Parameterized Complexity of Integer Linear Programming (ILP) Sebastian Ordyniak Parameterized

Looking Beyond the Knob Looking Beyond the Knob Looking Beyond the Knob Looking Beyond the Knob

MEDIA DISRUPTION SEEING BEYOND SEEING BEYOND SEEING BEYOND SEEING BEYOND LED BY THE BLIND

International Stakeholder Forum Ofcom Riverside House July 2019 PROMOTING CHOICE

Secure hashing, authen/ca/on root@topi:/etc# more shadow

360 VIDEO CLOUD STREAMING &amp; HTMLVIDEOELEMENT EXTENSIONS Louay Bassbouss | Fraunhofer FOKUS

Recent results of the NA64 experiment at the CERN SPS . Mikhail Kirsanov INR RAS (Moscow) (on

QUALITY COMES AS STANDARD Part of Link Group | Corporate Markets 1 Agenda Introduction to

Semi-parametric estimation Large-Time Scaling (LRD). Fourier vs Wavelets E. Moulines (ENST) C.

Principles of Program Analysis: A Sampler of Approaches Transparencies based on Chapter 1 of the

Completeness for Mosss Coalgebraic Logic (Boolean version) Clemens Kupke, Alexander Kurz , Yde

Exploiting More ILP ILP = __________ _ ________

Logical reduction of metarules Andrew Cropper & Sophie Tourret ILP Examples Learner

360 VIDEO CLOUD STREAMING & HTMLVIDEOELEMENT EXTENSIONS Louay Bassbouss | Fraunhofer FOKUS