System on Chip C (SoC-C)

Efficient programming abstractions for heterogeneous multicore Systems on Chip

Alastair Reid (ARM Ltd), Yuan Lin (University of Michigan), Krisztian Flautner (ARM Ltd), Edmund Grimley-Evans (ARM Ltd)


Mobile Consumer Electronics Trends

Mobile Application Requirements Still Growing Rapidly

§ Still cameras: 2 Mpixel → 10 Mpixel
§ Video cameras: VGA → HD 1080p → …
§ Video players: MPEG-2 → H.264
§ 2D Graphics: QVGA → HVGA → VGA → FWVGA → …
§ 3D Gaming: > 30 Mtriangle/s, antialiasing, …
§ Bandwidth: HSDPA (14.4 Mbps) → WiMax (70 Mbps) → LTE (326 Mbps)

Feature Convergence

§ Phone
§ + graphics + UI + games
§ + still camera + video camera
§ + music
§ + WiFi + Bluetooth + 3.5G + 3.9G + WiMax + GPS
§ + …


Pocket Supercomputers

The challenge is not processing power. The challenge is energy efficiency.


Different Requirements

Desktop/Laptop/Server

§ 1-10 Gop/s
§ 10-100 W

Consumer Electronics

§ 10-100 Gop/s
§ 100 mW-1 W

10x the performance at 1/100 the power consumption = 1000x the energy efficiency


… leading to Different Hardware

Drop Frequency 10x

§ Desktop: 2-4 GHz
§ Pocket: 200-400 MHz

Increase Parallelism 100x

§ Desktop: 1-2 cores
§ Pocket: 32-way SIMD instruction set, 4-8 cores

Match Processor Type to Task

§ Desktop: homogeneous, general purpose
§ Pocket: heterogeneous, specialised

Keep Memory Local

§ Desktop: coherent, shared memory
§ Pocket: processor-memory clusters linked by DMA


Example Architecture

[Diagram, artist’s impression: a control processor, SIMD-instruction-set data engines, accelerators, and distributed memories.]


What’s wrong with plain C?

C doesn’t provide language features to support

§ Multiple processors (or multi-ISA systems)
§ Distributed memory
§ Multiple threads


Use Indirection (Strawman #1)

Add a layer of indirection

§ Operating System
§ Layer of middleware
§ Device drivers
§ Hardware support

All impose a cost in Power/Performance/Area


Raise Pain Threshold (Strawman #2)

Write efficient code at a very low level of abstraction.

Problems

§ Hard, slow and expensive to write, test, debug and maintain
§ Design intent drowns in a sea of low-level detail
§ Not portable across different architectures
§ Expensive to try different points in design space


Our Response

Extend C

§ Support Asymmetric Multiprocessors
§ SoC-C language raises level of abstraction
§ … but take care not to hide expensive operations

Use (simple) compiler technology

§ Explicit design intent allows error checking
§ High-level compiler optimizations
§ Compiler takes care of low-level details


Overview

Pocket-Sized Supercomputers

§ Energy efficient hardware is “lumpy”
§ … and unsupported by C
§ … but supported by SoC-C

How SoC-C tackles the underlying hardware issues

§ Using SoC-C
§ Compiling SoC-C

Conclusion


3 steps in mapping an application

  • 1. Decide how to parallelize
  • 2. Choose processors for each pipeline stage
  • 3. Resolve distributed memory issues

A Simple Program

int x[100]; int y[100]; int z[100];

while (1) {
  get(x);
  foo(y,x);
  bar(z,y);
  baz(z);
  put(z);
}


Step 1: Decide how to parallelize

int x[100]; int y[100]; int z[100];

while (1) {
  get(x);
  foo(y,x);   // 50% of work
  bar(z,y);   // 50% of work
  baz(z);
  put(z);
}


Step 1: Decide how to parallelize

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}

PIPELINE indicates the region to parallelize.
FIFO indicates the boundaries between pipeline stages.


SoC-C Feature #1: Pipeline Parallelism

Annotations express coarse-grained pipeline parallelism

§ PIPELINE indicates scope of parallelism
§ FIFO indicates boundaries between pipeline stages

Compiler splits into threads communicating through FIFOs

§ Uses IN/OUT annotations on functions for dataflow analysis (see the sketch below)

FIFO

§ passes ownership of data
§ does not copy data
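
A minimal sketch of what the IN/OUT annotations might look like on the running example’s functions. The idea that parameter directions drive the dataflow analysis is from the slide; the exact keyword spelling here is an assumption, not the SoC-C grammar.

  /* Hypothetical declarations: parameter directions let the compiler
     build use-def chains across calls. */
  void get(OUT int x[100]);                 /* produces x        */
  void foo(OUT int y[100], IN int x[100]);  /* reads x, writes y */
  void bar(OUT int z[100], IN int y[100]);  /* reads y, writes z */
  void put(IN int z[100]);                  /* consumes z        */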


Step 2: Choose Processors

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x);
    FIFO(y);
    bar(z,y);
    baz(z);
    put(z);
  }
}


Step 2: Choose Processors

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

@ P indicates the processor that executes the function.


SoC-C Feature #2: RPC Annotations

Annotations express where code is to execute

§ Behaves like Synchronous Remote Procedure Call

§ Migrating thread model
§ Does not change meaning of program

§ Bulk data is not implicitly copied to processor’s local memory
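
To make the semantics concrete, a minimal sketch using the running example (all names are from these slides):

  /* Synchronous RPC: same meaning as a plain call to foo(y, x),
     but the body runs on processor P0 while the calling thread
     logically migrates there and waits for it to finish.
     The annotation does NOT copy x or y into P0's local memory;
     data placement is handled separately (see SYNC below). */
  foo(y, x) @ P0;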


Step 3: Resolve Memory Issues

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

P0 uses x → x must be in M0
P1 uses z → z must be in M1
P0 uses y → y must be in M0
P1 uses y → y must be in M1

Conflict?!


Hardware Cache Coherency

[Diagram: processors P0 and P1, each with a cache ($0, $1). A write to x by P0 invalidates the copy in $1; a read of x by P1 copies x into $1; a write to x by P1 invalidates the copy in $0.]


Step 3: Resolve Memory Issues

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

SYNC(x) @ P copies data from one version of x to another, using processor P.

y has two coherent versions: one in M0, one in M1.


SoC-C Feature #3: Compile Time Coherency

Variables can have multiple coherent versions

§ Compiler uses memory topology to determine which version is being accessed

Compiler applies cache coherency protocol

§ Writing to a version makes it valid and other versions invalid
§ Dataflow analysis propagates validity
§ Reading from an invalid version is an error
§ SYNC(x) copies from valid version to invalid version
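
A minimal sketch of these rules on the running example (names from the slides; the exact compiler diagnostic is assumed):

  foo(y, x) @ P0;   /* writes y: the M0 version becomes valid,
                       the M1 version becomes invalid              */
  bar(z, y) @ P1;   /* reads y's M1 version while it is invalid:
                       reported as a coherence error               */

  foo(y, x) @ P0;
  SYNC(y) @ DMA;    /* copy the valid M0 version to the M1 version */
  bar(z, y) @ P1;   /* OK: the M1 version is now valid             */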


What SoC-C Provides

SoC-C language features

§ Pipeline to support parallelism
§ Coherence to support distributed memory
§ RPC to support multiple processors/ISAs

Non-features

§ Does not choose boundary between pipeline stages
§ Does not resolve coherence problems
§ Does not allocate processors

SoC-C is a concise notation for expressing mapping decisions (not a tool for making them on your behalf).


Compiling SoC-C

  • 1. Data Placement

a) Infer data placement
b) Propagate coherence
c) Split variables with multiple placement

  • 2. Pipeline Parallelism

a) Identify maximal threads
b) Split into multiple threads
c) Apply zero-copy optimization

  • 3. RPC (see paper for details)

Step 1a: Infer Data Placement

int x[100]; int y[100]; int z[100];

PIPELINE {
  while (1) {
    get(x);
    foo(y,x) @ P0;
    SYNC(y) @ DMA;
    FIFO(y);
    bar(z,y) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

§ Memory topology constrains where variables could live


Step 1a: Infer Data Placement

§ Memory topology constrains where variables could live

int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,?) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@?);
  }
}


Step 1b: Propagate Coherence

§ Memory topology constrains where variables could live
§ Forwards dataflow propagates availability of valid versions

int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,?) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@?);
  }
}


Step 1b: Propagate Coherence

§ Memory topology constrains where variables could live
§ Forwards dataflow propagates availability of valid versions

int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,M0) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}


Step 1b: Propagate Coherence

§ Memory topology constrains where variables could live
§ Forwards dataflow propagates availability of valid versions
§ Backwards dataflow propagates need for valid versions

int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x@?);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,?,M0) @ DMA;
    FIFO(y@?);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}


Step 1b: Propagate Coherence

§ Memory topology constrains where variables could live
§ Forwards dataflow propagates availability of valid versions
§ Backwards dataflow propagates need for valid versions (can use unification+constraints instead)

int x[100] @ {M0};
int y[100] @ {M0,M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x@M0);
    foo(y@M0, x@M0) @ P0;
    SYNC(y,M1,M0) @ DMA;
    FIFO(y@M1);
    bar(z@M1, y@M1) @ P1;
    baz(z@M1) @ P1;
    put(z@M1);
  }
}


Step 1c: Split Variables

int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

Split variables with multiple locations.
Replace SYNC with memcpy.


Step 2: Implement Pipeline Annotation

int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

Dependency Analysis


Step 2a: Identify Dependent Operations

int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

Dependency analysis
Split use-def chains at FIFOs


Step 2b: Identify Maximal Threads

int x[100] @ {M0};
int y0[100] @ {M0};
int y1[100] @ {M1};
int z[100] @ {M1};

PIPELINE {
  while (1) {
    get(x);
    foo(y0, x) @ P0;
    memcpy(y1,y0,…) @ DMA;
    FIFO(y1);
    bar(z, y1) @ P1;
    baz(z) @ P1;
    put(z);
  }
}

Dependency analysis
Split use-def chains at FIFOs
Identify thread operations


Step 2b: Split Into Multiple Threads

int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}

Perform dataflow analysis
Split use-def chains at FIFOs
Identify thread operations
Split into threads


Step 2c: Zero Copy Optimization

int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}

Generate data → copy into FIFO → copy out of FIFO → consume data


Step 2c: Zero Copy Optimization

int x[100] @ {M0};
int y0[100] @ {M0};
int y1a[100] @ {M1};
int y1b[100] @ {M1};
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      memcpy(y1a,y0,…) @ DMA;
      fifo_put(&f, y1a);
    }
  }
  SECTION {
    while (1) {
      fifo_get(&f, y1b);
      bar(z, y1b) @ P1;
      baz(z) @ P1;
      put(z);
    }
  }
}

Calculate the live range of variables passed through FIFOs: the live range of y1a and the live range of y1b.


Step 2c: Zero Copy Optimization

int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      foo(y0, x) @ P0;
      fifo_acquireRoom(&f, &py1a);
      memcpy(py1a,y0,…) @ DMA;
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      bar(z, py1b) @ P1;
      fifo_releaseRoom(&f, py1b);
      baz(z) @ P1;
      put(z);
    }
  }
}

Calculate the live range of variables passed through FIFOs.
Transform FIFO operations to pass pointers instead of copying data.

Thread 1: acquire empty buffer → generate data directly into buffer → pass full buffer to thread 2
Thread 2: acquire full buffer from thread 1 → consume data directly from buffer → release empty buffer


Order of transformations

Dataflow-sensitive transformations go first

§ Inferring data placement
§ Coherence checking within threads
§ Dependency analysis for parallelism

Parallelism transformations

§ Obscures data and control flow

Thread-local optimizations go last

§ Zero-copy optimization of FIFO operations
§ Continuation-passing thread implementation


Related Work

Language

§ OpenMP: SMP data parallelism using ‘C plus annotations’
§ StreamIt: pipeline parallelism using a dataflow language

Pipeline parallelism

§ J.E. Smith, “Decoupled access/execute computer architectures,” Trans. Computer Systems, 2(4), 1984
§ Multiple independent reinventions

Hardware

§ Woh et al., “From SODA to Scotch: The Evolution of a Wireless Baseband Processor,” Proc. MICRO-41, Nov. 2008 (not cited by paper)


The SoC-C Model

Program as though using SMP system

§ Single multithreaded processor: RPCs provide a “Migrating Thread Model”
§ Single memory: Compiler-Managed Coherence handles “bookkeeping”

Use Implicit Parallelism to avoid restructuring code

§ Pipeline parallelism
§ Data parallelism

Compiler Does Low-Level “Bookkeeping”

§ Inter-thread communication → Zero-copy optimization
§ Thread programming model → Efficient event-driven execution

Efficiency

§ Avoid abstracting expensive operations
§ 90-10 rule: lower-level interfaces can be mixed with high-level abstractions


Fin


Language Design Meta Issues

Compiler only uses simple analyses

§ Easier to maintain consistency between different compiler versions/implementations

Programmer makes the high-level decisions

§ Code and data placement
§ Inserting SYNC
§ Load balancing

Implementation by many source-to-source transforms

§ Programmer can mix high- and low-level features
§ 90-10 rule: use high-level features when you can, low-level features when you need to


SoC-C’s Overall Goal

Let hardware teams design efficient hardware by enabling software teams to handle the resulting complexity.


Step 3a: Resolve Overloaded RPC

int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      DE32_foo(0, y0, x);
      fifo_acquireRoom(&f, &py1a);
      DMA_memcpy(py1a,y0,…);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      DE32_bar(1, z, py1b);
      fifo_releaseRoom(&f, py1b);
      DE32_baz(1, z);
      put(z);
    }
  }
}

Replace each RPC with an architecture-specific call:

bar(…) @ P1 → DE32_bar(1,…)


Step 3b: Split RPCs

int x[100] @ {M0};
int y0[100] @ {M0};
int *py1a;
int *py1b;
int z[100] @ {M1};

PARALLEL {
  SECTION {
    while (1) {
      get(x);
      start_DE32_foo(0, y0, x);
      wait(semaphore_DE32[0]);
      fifo_acquireRoom(&f, &py1a);
      start_DMA_memcpy(py1a,y0,…);
      wait(semaphore_DMA);
      fifo_releaseData(&f, py1a);
    }
  }
  SECTION {
    while (1) {
      fifo_acquireData(&f, &py1b);
      start_DE32_bar(1, z, py1b);
      wait(semaphore_DE32[1]);
      fifo_releaseRoom(&f, py1b);
      start_DE32_baz(1, z);
      wait(semaphore_DE32[1]);
      put(z);
    }
  }
}

RPCs have two phases

§ start RPC
§ wait for RPC to complete

DE32_foo(0,…); → start_DE32_foo(0,…); wait(semaphore_DE32[0]);
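
One plausible motivation for the two-phase form (an assumption here, not stated on this slide): separating start from wait lets a later transformation or the event-driven runtime schedule independent work between the two calls.

  start_DE32_foo(0, y0, x);
  /* … work that does not depend on y0 could be placed here … */
  wait(semaphore_DE32[0]);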


Two Ways to Exploit Parallelism

Perform twice as much work

§ 2 cores can perform 2x more work

Perform same work for less energy

§ DVFS (reduce current frequency)

§ halving frequency and doubling #cores saves ~50% energy/op (rough arithmetic below)

§ Shorter pipeline (reduce peak frequency)

§ halving frequency and doubling #cores saves ~30% energy/op

§ Techniques can be combined to give wider range of scaling
§ Energy savings require performance almost linear with #cores
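
Rough arithmetic behind the ~50% DVFS figure above, under the usual CMOS assumption that dynamic energy per operation scales as E_op ∝ C·V_dd²; the specific voltage reduction (about 30% when the clock frequency is halved) is an assumption for illustration, not a number from the slides:

\[
\frac{E'_{\mathrm{op}}}{E_{\mathrm{op}}} \;\approx\; \left(\frac{0.7\,V_{dd}}{V_{dd}}\right)^{2} \;\approx\; 0.49
\]

Halving the frequency thus saves roughly half the energy per operation, and doubling the number of cores at the halved frequency restores the original throughput.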


Parallel Speedup

Efficient

§ Same performance as hand-written code

Near Linear Speedup

§ Very efficient use of parallel hardware

DVB-T Inner Receiver

§ More realistic OFDM receiver
§ 20 tasks, 500-7000 cycles per function, 29000 total

[Chart: measured speedup vs. number of cores (1-4), scaling near-linearly up to ~400%.]


Summary of SoC-C Extensions

Small extensions to C to tackle

  • 1. Multiple processors / Heterogeneity

§ Mapping tasks to engines
§ Event-based programming

  • 2. Distributed memory

§ Coherence
§ Inference

  • 3. Parallelism

§ Pipelining
§ Interthread FIFOs

Raises level of abstraction
Allows compiler to optimize code
No need to restructure code/data as hardware changes


Benefits of SoC-C

Raises level of abstraction

→ Programmer is free to focus on high-level goals
→ Compiler detects coherency errors in programmer annotations
→ Reduced development time and cost

Allows compiler to optimize code

→ Higher-level programming with no performance penalty
→ Compiler reduces amount of data copying
→ Compiler generates same code programmer wrote by hand

No need to restructure code/data as hardware changes

→ memory topology
→ number and relative speed of engines


What SoC-C Gives You

Efficient

§ Compiler generates the same code that a programmer writes
§ Neither more nor less efficient than hand-written code
§ Doesn’t require programmer with brain the size of a planet

Allows rapid design space exploration

§ Programmer controls the mapping
§ Changing mapping requires small number of changes
§ Compiler checks changes for consistency

Allows rapid porting of code

§ Add annotations, don’t restructure
§ Structure of code reflects application, not hardware