
Task Superscalar: Using Processors as Functional Units
• Yoav Etsion, Alex Ramirez, Rosa M. Badia, Eduard Ayguade, Jesus Labarta, Mateo Valero
• Yoav Etsion, Senior Researcher
• HotPar, June 2010

Parallel Programming is Hard
• A few key problems of parallel programming:
  1. Exposing operations that can execute in parallel
  2. Managing data synchronization
  3. Managing data transfers
• None of these exist in sequential programming…
• …but they do exist in processors executing sequential programs
Hot Topics in Parallelism, June 2010

Sequential Program, Parallel Execution Engine
• Out-of-order pipelines automatically manage a parallel substrate
  • A heterogeneous parallel substrate (FP, ALU, BR…)
  • Yet, the input instruction stream is sequential [Tomasulo’67][Patt’85]
• The obvious questions:
  1. How do out-of-order processors manage parallelism?
  2. Why can’t ILP out-of-order pipelines scale?
  3. Can we apply the same principles to tasks?
  4. Can task pipelines scale?

Outline
• Recap: How do OoO processors uncover parallelism?
• The StarSs programming model
• A high-level view of the task superscalar pipeline
• Can a task pipeline scale?
• Conclusions and Future Work

How Do Out-of-Order Processors Do It?
• Exposing parallelism
  • Register renaming tables map consumers to producers
  • Observing an instruction window to find independent instructions
• Data synchronization
  • Data transfers act as synchronization tokens
  • Dataflow scheduling prevents data conflicts
• Data transfers
  • Broadcasts tagged data
• Input is a sequential stream: complexities are hidden from the programmer
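The renaming step above can be sketched in a few lines of C. This is only an illustration, not a description of any particular processor: the names `RenameTable`, `rename_instr`, and `is_independent` are hypothetical, instructions are assumed to have one destination and two source registers, and retiring completed producers from the table is left out.

```c
#include <assert.h>

#define NUM_ARCH_REGS 8
#define NO_PRODUCER (-1)

/* Hypothetical renaming table: for each architectural register, the id of
   the in-flight instruction that will produce its next value. */
typedef struct {
    int producer[NUM_ARCH_REGS];
} RenameTable;

/* Result of renaming: which in-flight instructions the sources wait for. */
typedef struct {
    int src1_producer;   /* producer id, or NO_PRODUCER if the value is ready */
    int src2_producer;
} Renamed;

void rename_init(RenameTable *t) {
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        t->producer[r] = NO_PRODUCER;
}

/* Rename one instruction "dst = op(src1, src2)" with sequence number id.
   Sources are looked up before dst is updated, so "r1 = r1 + r2" works. */
Renamed rename_instr(RenameTable *t, int id, int dst, int src1, int src2) {
    assert(dst >= 0 && dst < NUM_ARCH_REGS);
    assert(src1 >= 0 && src1 < NUM_ARCH_REGS);
    assert(src2 >= 0 && src2 < NUM_ARCH_REGS);
    Renamed r = { t->producer[src1], t->producer[src2] };
    t->producer[dst] = id;   /* later consumers of dst now map to this instruction */
    return r;
}

/* An instruction with no pending producers can issue immediately. */
int is_independent(Renamed r) {
    return r.src1_producer == NO_PRODUCER && r.src2_producer == NO_PRODUCER;
}
```

Because a new write to a register simply replaces the table entry, a later instruction that overwrites the same register does not have to wait for the earlier writer; only true producer-to-consumer links remain.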

Can We Scale Out-of-Order Processors?
• Building a large instruction window is difficult (latency related)
  • Timing constraints require a global clock
  • Broadcast does not scale, but the latency budget cannot tolerate switched networks
  • Broadcasting tags yields a large associative lookup in the reservation stations
• Utilizing a large instruction window
  • Control path speculation is a real problem, as most in-flight instructions are speculated (not latency related!)
  • Most available parallelism is used to overcome the memory wall, not to exploit parallel resources (back to latency…)
• But what happens if we operate on tasks rather than instructions?

Outline
• Recap: How do OoO processors uncover parallelism?
• The StarSs programming model: Tasks as abstract instructions
• High-level view of the task superscalar pipeline
• Can a task pipeline scale?
• Conclusions and Future Work

The StarSs Programming Model
• Tasks as the basic work unit
• Operational flow: a master thread spawns tasks, which are dispatched to multiple worker processors (aka the functional units)
• Runtime system dynamically resolves dependencies, constructs the task graph, and schedules tasks
• Programmers annotate the directionality of operands
  • input, output, or inout
• Operands can consist of memory regions, not only scalar values
  • Further extends the pipeline capabilities
• Shameless plug: StarSs versions for SMPs and the Cell are freely available

The StarSs Programming Model: Intuitively Annotated Kernel Functions
• Simple annotations
• All effects on shared state are explicitly expressed
• Kernels can be compiled for different processors

    #pragma css task input(a, b) inout(c)
    void sgemm_t(float a[M][M], float b[M][M], float c[M][M]);

    #pragma css task inout(a)
    void spotrf_t(float a[M][M]);

    #pragma css task input(a) inout(b)
    void strsm_t(float a[M][M], float b[M][M]);

    #pragma css task input(a) inout(b)
    void ssyrk_t(float a[M][M], float b[M][M]);

Example: Cholesky Decomposition

The StarSs Programming Model: Seemingly Sequential Code
• Code is seemingly sequential, and executes on the master thread
• Invoking kernel functions generates tasks, which are sent to the runtime
  • A source-to-source (s2s) filter injects the necessary code
• Runtime dynamically constructs the task dependency graph
• Easier to debug, since execution is similar to sequential execution

    for (int j = 0; j<N; j++) {
        for (int k = 0; k<j; k++)
            for (int i = j+1; i<N; i++)
                sgemm_t(A[i][k], A[j][k], A[i][j]);
        for (int i = 0; i<j; i++)
            ssyrk_t(A[j][i], A[j][j]);
        spotrf_t(A[j][j]);
        for (int i = j+1; i<N; i++)
            strsm_t(A[j][j], A[i][j]);
    }

Example: Cholesky Decomposition
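To make the graph construction concrete, here is a rough C sketch that replays the Cholesky loop nest and derives dependency edges from the operand directions. It is not the StarSs runtime: the `spawn`, `add_dep`, and `last_writer` names are hypothetical, the matrix is assumed to be 5x5 blocks, and only read-after-write dependencies are tracked (write-after-read hazards, which do not arise in this loop nest, are ignored).

```c
#include <assert.h>

#define NB 5                 /* blocks per matrix dimension (assumed) */
#define MAX_TASKS 256
#define MAX_DEPS 3           /* at most three operands per kernel */
#define NONE (-1)

typedef struct {
    int deps[MAX_DEPS];      /* ids of producer tasks this task waits for */
    int ndeps;
} Task;

Task tasks[MAX_TASKS];
int ntasks;
int last_writer[NB][NB];     /* task that last wrote block (i, j), or NONE */

void graph_reset(void) {
    ntasks = 0;
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            last_writer[i][j] = NONE;
}

/* Add a read-after-write edge from the last writer of block (i, j), if any. */
void add_dep(Task *t, int i, int j) {
    int w = last_writer[i][j];
    if (w == NONE) return;
    for (int d = 0; d < t->ndeps; d++)
        if (t->deps[d] == w) return;      /* edge already recorded */
    assert(t->ndeps < MAX_DEPS);
    t->deps[t->ndeps++] = w;
}

/* Spawn a task with up to two input blocks and one inout block.
   Pass NONE as the row index of an unused input. Returns the task id. */
int spawn(int in1_i, int in1_j, int in2_i, int in2_j, int io_i, int io_j) {
    assert(ntasks < MAX_TASKS);
    Task *t = &tasks[ntasks];
    t->ndeps = 0;
    if (in1_i != NONE) add_dep(t, in1_i, in1_j);
    if (in2_i != NONE) add_dep(t, in2_i, in2_j);
    add_dep(t, io_i, io_j);               /* inout reads the old value... */
    last_writer[io_i][io_j] = ntasks;     /* ...and produces a new one */
    return ntasks++;
}

/* Replay the blocked Cholesky loop nest, spawning one task per kernel call. */
void cholesky_spawn(void) {
    for (int j = 0; j < NB; j++) {
        for (int k = 0; k < j; k++)
            for (int i = j + 1; i < NB; i++)
                spawn(i, k, j, k, i, j);            /* sgemm_t */
        for (int i = 0; i < j; i++)
            spawn(j, i, NONE, NONE, j, j);          /* ssyrk_t */
        spawn(NONE, NONE, NONE, NONE, j, j);        /* spotrf_t */
        for (int i = j + 1; i < NB; i++)
            spawn(j, j, NONE, NONE, i, j);          /* strsm_t */
    }
}
```

Calling `graph_reset()` followed by `cholesky_spawn()` fills `tasks` in program order. The master thread never blocks here: each call only records edges, which is what lets the seemingly sequential code run ahead of the workers.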

The StarSs Programming Model: Resulting Task Graph (5x5 matrix)
• It is not feasible to have a programmer express such a graph…
• Out-of-order execution
  • No loop-level barriers (à la OpenMP)
  • Facilitates distant parallelism
  • Tasks 6 and 23 execute in parallel
• Availability of data dependencies supports relaxed memory models
  • DAG consistency [Blumofe’96]
  • Bulk consistency [Torrellas’09]

So Why Move to Hardware?
• Problem: software runtime does not scale beyond 32-64 processors
  • Software decode rate is 700 ns to 2.5 µs per task
  • The range reflects the difference between an Intel Xeon and the Cell PPU
• Scaling therefore implies much longer tasks
  • Longer tasks imply larger datasets that do not fit in the cache
• Hardware offers inherent parallelism
  • Vertical: pipelining
  • Horizontal: distributing load over multiple units
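A back-of-the-envelope version of the "longer tasks" argument, under an assumed model that the slides do not spell out: a single master decodes tasks serially and releases one task every `decode_ns`, so `workers` processors stay busy only if each task runs for at least `decode_ns * workers`. The function name `min_task_ns` is hypothetical.

```c
#include <assert.h>

/* Minimum task duration (in ns) needed to keep `workers` processors busy
   when a single serial decoder spends decode_ns per task (assumed model). */
long min_task_ns(long decode_ns, int workers) {
    assert(decode_ns > 0 && workers > 0);
    return decode_ns * (long)workers;
}
```

With the decode rates quoted above, `min_task_ns(700, 64)` gives 44800 ns and `min_task_ns(2500, 64)` gives 160000 ns, so under this model 64 workers already need tasks of roughly 45 µs to 160 µs, and the bound grows linearly with the worker count.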

Outline
• Recap: How do OoO processors uncover parallelism?
• The StarSs programming model: Tasks as abstract instructions
• A high-level view of the task superscalar pipeline
• Can a task pipeline scale?
• Conclusions and Future Work

Task Superscalar: A High-Level View
• Master processors send tasks to the pipeline
• Object versioning tables (OVTs) are used to map data consumers to producers
  • Combination of a register file and a renaming table
• Task dependency graph is stored in multiplexed reservation stations
• Heterogeneous backend
  • GPUs become equivalent to a vector unit found in many processors
[Figure: pipeline block diagram. Fetch: master processors. Decode: task decode unit, task dependency decode with OVTs, nested task generation. Dispatch: multiplexed reservation stations (MRS) and a HW/SW scheduler. Functional units: worker processors and a GPU.]

Result: Uncovering Large Amounts of Parallelism
[Figure: number of ready tasks (0 to 1000) versus normalized execution time (0 to 1) for MatMul 4Kx4K, Cholesky 4Kx4K, Jacobi 4Kx4K, FFT 256K, Knn 50K samples, and H264Dec 1 HD frame.]
• The figure shows the number of ready tasks throughout the execution
• Parallelism can be found even in complex dependency structures
  • Cholesky, H264, Jacobi

Overcoming the Limitations of ILP Pipelines
• Task window timing
  • No need for a global clock: we can afford crossing clock domains
• Building a large pipeline
  • Multiplex reservation stations into a single structure
  • Relaxed timing constraints on decode facilitate the creation of explicit graph edges
  • Eliminates the need for associative MRS lookups
  • We estimate that tens of thousands of in-flight tasks can be stored
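The point about explicit graph edges can be illustrated with a small C sketch. It is not the actual MRS design: `Entry`, `add_edge`, and `complete` are hypothetical names, and the fixed table sizes are arbitrary. Each producer stores direct links to its consumers, so a completion walks those links rather than broadcasting a tag that every entry must compare against.

```c
#include <assert.h>

#define MAX_TASKS 16
#define MAX_CONSUMERS 4

/* Hypothetical reservation-station entry for one in-flight task. */
typedef struct {
    int pending;                    /* number of unfinished producers */
    int consumers[MAX_CONSUMERS];   /* explicit edges to dependent tasks */
    int nconsumers;
} Entry;

Entry station[MAX_TASKS];

void station_reset(void) {
    for (int i = 0; i < MAX_TASKS; i++) {
        station[i].pending = 0;
        station[i].nconsumers = 0;
    }
}

/* Record at decode time that `consumer` needs data from `producer`. */
void add_edge(int producer, int consumer) {
    assert(producer >= 0 && producer < MAX_TASKS);
    assert(consumer >= 0 && consumer < MAX_TASKS);
    Entry *p = &station[producer];
    assert(p->nconsumers < MAX_CONSUMERS);
    p->consumers[p->nconsumers++] = consumer;
    station[consumer].pending++;
}

/* On completion, follow the stored edges instead of broadcasting a tag.
   Writes newly ready task ids to `ready` and returns how many there are. */
int complete(int producer, int ready[MAX_CONSUMERS]) {
    assert(producer >= 0 && producer < MAX_TASKS);
    int n = 0;
    Entry *p = &station[producer];
    for (int i = 0; i < p->nconsumers; i++) {
        int c = p->consumers[i];
        if (--station[c].pending == 0)
            ready[n++] = c;         /* last producer finished: task is ready */
    }
    return n;
}
```

The cost of a completion is proportional to the number of consumers of that one task, not to the size of the window, which is why this style of wake-up can tolerate a much larger window than an associative lookup.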

Overcoming the Limitations of ILP Pipelines
• Utilizing a large window
  • Tasks are non-speculative
  • We can afford to wait for branches to be resolved
• Overcoming the memory wall
  • Explicit data dependency graph facilitates data scheduling
  • Overlap computation with communications
  • Schedule work to exploit locality
  • Already done on the Cell B.E. version of StarSs
  • StarSs runtime tolerates memory latencies of thousands of cycles

Conclusions
• Dynamically scheduled out-of-order pipelines are very effective in managing parallelism
  • The mechanism is effective, but limited by timing constraints
• Task-based dataflow programming models uncover runtime parallelism
  • Utilize an intuitive task abstraction
  • Intel RapidMind [McCool’06], Sequoia [Fatahalian’06], OoOJava [Jenista’10]
• Combine the two: Task Superscalar
  • A task-level out-of-order pipeline using cores as functional units
  • We are currently implementing a task superscalar simulator
  • Execution engine for high-level models: Ct, CnC, MapReduce
