SLIDE 1

Many-core Computing

Can compilers and tools do the heavy lifting?

Wen-mei Hwu

FCRP GSRC, Illinois UPCRC, Illinois CUDA CoE, IACAT, IMPACT
University of Illinois, Urbana-Champaign

MPSoc, August 3, 2009

SLIDE 2

Outline

  • Parallel application outlook
  • Heavy lifting in “simple” parallel applications
  • Promising tool strategies and early evidence
  • Challenges and opportunities
  • SoC-specific opportunities and challenges?

SLIDE 3

The Energy Behind the Parallel Revolution

  • GPU in every PC – massive volume and potential impact

[Chart courtesy of John Owens, showing a 3-year shift]

SLIDE 4

My Predictions

  • Mass-market parallel apps will focus on many-core GPUs in the next three to four years
  • NVIDIA GeForce, ATI Radeon, Intel Larrabee
  • “Simple” (vector) parallelism
  • Dense matrix, single/multi-grids, stencils, etc.
  • Even “simple” parallelism can be challenging
  • Memory bandwidth limitation
  • Portability and scalability
  • Heterogeneity and data affinity

SLIDE 5

DRAM Bandwidth Trends

  • Random access bandwidth is only 1.2% of peak for DDR3-1600 and 0.8% for GDDR4-1600 (and falling)
  • 3D stacking and optical interconnects are unlikely to help.

SLIDE 6

Dense Matrix Multiplication Example (G80)

[Chart: GFLOPS (20–140 scale) across the optimization space: 8x8 vs. 16x16 tiles, 1x1/1x2/1x4 register tiling, unroll factors 1/2/4/complete, with and without prefetching; some configurations cannot run]

  • Some configurations are memory bandwidth limited, others instruction throughput limited
  • Register tiling allows ~200 GFLOPS (Volkov and Demmel, SC’08; Ryoo et al., PPoPP 2008)
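As a rough illustration of the shared-memory tiling these results refer to (not the exact code measured in the chart), a minimal 16x16-tiled matrix multiplication kernel could look like the sketch below; the TILE size, the assumption that n is a multiple of TILE, and the omission of register tiling and unrolling are simplifications.

#define TILE 16

// Minimal shared-memory tiled SGEMM sketch: C = A * B for n x n matrices.
// Assumes n is a multiple of TILE; register tiling and unrolling (which the
// slide credits for ~200 GFLOPS) are omitted for brevity.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread stages one element of the A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

Each input element of A and B is read from DRAM once per tile rather than once per output element, which is what moves the kernel from the bandwidth-limited to the instruction-throughput-limited region of the chart.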

SLIDE 7

Example: Convolution – Base Parallel Code

  • Each parallel task calculates an output element
  • Figure shows a 1D convolution with a K=5 kernel and the calculation of 3 output elements
  • Highly parallel but memory bandwidth inefficient
  • Uses massive threading to tolerate memory latency
  • Each input element loaded up to K times

[Figure: input elements in main memory]
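A minimal CUDA sketch of this base parallel formulation, assuming one thread per output element and the filter taps in constant memory (names are illustrative, not taken from the slides):

#define K 5                         // filter width from the slide's example

__constant__ float kernel_c[K];     // filter taps in constant memory

// Base parallel 1D convolution: one thread per output element.  Every thread
// reads its K inputs directly from global memory, so each input element is
// fetched up to K times across neighboring threads.
__global__ void conv1d_naive(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        int idx = i + k - K / 2;                 // centered K-wide window
        if (idx >= 0 && idx < n)
            acc += in[idx] * kernel_c[k];
    }
    out[i] = acc;
}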

SLIDE 8

Example: Convolution Using On-chip Caching

  • Output elements calculated from cache contents
  • Each input element loaded only once
  • Cache pressure – (K-1+N) input elements needed for N output elements
  • With K=5 and N=3: 7/3 ≈ 2.3 inputs per output in 1D, 7²/3² ≈ 5.4 in 2D, 7³/3³ ≈ 12.7 in 3D
  • For small caches, the benefit can be significantly reduced due to the high ratio of additional elements loaded.

[Figure: input elements first loaded into cache]
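A sketch of the cached variant, here using CUDA shared memory as the on-chip cache; the tile size, halo handling, and zero padding at the boundaries are assumptions made for illustration:

#define TILE 256
#define K    5

__constant__ float kernel_c[K];

// Tiled 1D convolution: each block first stages TILE + K - 1 input elements
// (its N outputs plus the halo) into shared memory, then every thread computes
// one output from the cached data, so each input element is read from DRAM
// only once per block.
__global__ void conv1d_tiled(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + K - 1];
    int i = blockIdx.x * TILE + threadIdx.x;

    // Cooperative load of the tile plus halo (strided so all elements are covered).
    for (int j = threadIdx.x; j < TILE + K - 1; j += blockDim.x) {
        int idx = blockIdx.x * TILE + j - K / 2;
        tile[j] = (idx >= 0 && idx < n) ? in[idx] : 0.0f;
    }
    __syncthreads();

    if (i < n) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += tile[threadIdx.x + k] * kernel_c[k];
        out[i] = acc;
    }
}

The (K-1) halo elements loaded per tile are exactly the “additional elements” the slide warns about: for small tiles (small on-chip caches), they dominate and erode the reuse benefit.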

SLIDE 9

Example: Streaming for Reduced Cache Pressure

  • Each input element is loaded into cache in turn
  • Or an (n-1)-D slice at a time in an n-D convolution
  • All threads consume that input element
  • “Loop skewing” is needed to align the consumption of input elements
  • This stretches the effective size of the on-chip cache
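One way to picture the streaming pattern, as a simplified sketch rather than the measured code: for a 2D convolution, a block keeps only the K most recent input rows of its tile in shared memory and slides that window down the image, so each newly loaded row is consumed by all threads before it is evicted. The tile width, the assumption that blockDim.x equals TILE_X, and the skipped boundary rows are illustrative choices.

#define K      5
#define TILE_X 64

__constant__ float kernel2d_c[K][K];

// Streaming 2D convolution sketch: the block walks down the image one row at a
// time, keeping a rolling window of K input rows in shared memory.  Each new
// row is loaded once and consumed by all threads; the "loop skewing" is the
// offset between the row being loaded and the output row that can be finished.
// Launch with blockDim.x == TILE_X.
__global__ void conv2d_streamed(const float *in, float *out, int width, int height)
{
    __shared__ float rows[K][TILE_X + K - 1];
    int x0 = blockIdx.x * TILE_X;                // leftmost output column of this block
    int tx = threadIdx.x;

    for (int y = 0; y < height; ++y) {
        // Load one new input row (plus halo columns) into the rolling buffer.
        for (int j = tx; j < TILE_X + K - 1; j += blockDim.x) {
            int gx = x0 + j - K / 2;
            rows[y % K][j] = (gx >= 0 && gx < width) ? in[y * width + gx] : 0.0f;
        }
        __syncthreads();

        // Once K rows are resident, the output row K/2 rows behind can be finished.
        int oy = y - K / 2;
        if (oy >= K / 2 && oy < height - K / 2 && x0 + tx < width) {
            float acc = 0.0f;
            for (int dy = 0; dy < K; ++dy)
                for (int dx = 0; dx < K; ++dx)
                    acc += rows[(oy - K / 2 + dy) % K][tx + dx] * kernel2d_c[dy][dx];
            out[oy * width + x0 + tx] = acc;
        }
        __syncthreads();
    }
}

Only K rows of the tile are ever resident, so the shared memory footprint no longer grows with the tile height: this is the sense in which streaming “stretches” the effective cache size.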

SLIDE 10

Many-core GPU Timing Results

  • Time to compute a 3D k³-kernel convolution on 4 frames of a 720x560 video sequence
  • All times are in milliseconds
  • Timed on a Tesla S1070 using one G280 GPU

SLIDE 11

Multi-core CPU Timing Results

  • Time to compute a 3D k³-kernel convolution on 4 frames of a 720x560 video sequence
  • All times are in milliseconds
  • Timed on a dual-socket, dual-core 2.4 GHz Opteron system, all four cores used

SLIDE 12

Application Example: Up-resolution of Video

  • Nearest-neighbor & bilinear interpolation: fast but low quality
  • Bicubic interpolation: higher quality but computationally intensive

SLIDE 13

Implementation Overview

  • Step 1: Find the coefficients of the shifted B-splines
  • Two single-pole IIR filters along each dimension
  • Implemented with recursion along scan lines (see the sketch below)
  • Step 2: Use the coefficients to interpolate the image
  • FIR filter for bicubic interpolation implemented as a k=4 2D convolution with (2+16+2)² input tiles with halos
  • Streaming not required due to the small 2D kernel; the on-chip cache works well as is
  • Step 3: DirectX displays from the GPU
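For Step 1, a minimal sketch of the per-scan-line single-pole IIR recursion (one thread per row) is shown below. The pole value, the simplified boundary term, and the omission of the usual B-spline gain and initialization refinements are assumptions for illustration, not the authors' exact implementation.

// One thread per scan line: causal + anti-causal single-pole IIR passes that
// turn image samples into shifted-B-spline coefficients.  The column pass
// would be analogous, with one thread per column.
__global__ void bspline_iir_rows(float *img, int width, int height, float pole)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= height) return;
    float *row = img + (size_t)y * width;

    // Causal (left-to-right) recursion: c+[x] = s[x] + pole * c+[x-1]
    for (int x = 1; x < width; ++x)
        row[x] += pole * row[x - 1];

    // Simplified boundary term for the anti-causal pass (illustrative).
    row[width - 1] *= pole / (pole - 1.0f);

    // Anti-causal (right-to-left) recursion: c[x] = pole * (c[x+1] - c+[x])
    for (int x = width - 2; x >= 0; --x)
        row[x] = pole * (row[x + 1] - row[x]);
}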

SLIDE 14

Upconversion Results

  • Parallelize bicubic B-spline interpolation
  • Interpolate QCIF (176x144) to nearly HDTV (1232x1008)
  • Improved quality over typical bilinear interpolation
  • Improved speed over typical CPU implementations
  • Measured 350x speedup over un-optimized CPU code
  • Estimated 50x speedup over optimized CPU code, from inspection of the CPU code
  • Real-time!

    Hardware                          IIR     FIR
    CPU: Intel Pentium D              5 ms    1689 ms
    GPU: NVIDIA GeForce 8800 GTX      1 ms    4 ms

SLIDE 15

Application Example: Depth-Image-Based Rendering

  • Three main steps:
  • Depth propagation
  • Color-based depth enhancement
  • Rendering

SLIDE 16

Color-based Depth Enhancement

[Figure: enhancement pipeline from the propagated depth image at the color view to the enhanced depth image, comprising occlusion removal, depth-color bilateral filtering, depth edge enhancement, and directional disocclusion filling; before/after insets contrast naïve vs. directional disocclusion filling]

SLIDE 17

Depth-Color Bilateral Filtering

The filter weight between pixels A and B combines a spatial Gaussian and a color-range Gaussian:

    w(A, B) = G_σs(|x_A − x_B|²) · G_σr(|I_A − I_B|²)

where x is pixel position, I is color intensity, and σs, σr are the spatial and range spreads; the propagated depth values are averaged with these normalized weights, so depth edges follow color edges.
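A compact CUDA sketch of such a depth-color bilateral filter; the window radius, parameter names, and the single-channel (grayscale) color guide are illustrative assumptions:

#define R 3   // filter window radius (7x7 window)

// Depth-color bilateral filter sketch: each thread smooths one depth pixel,
// weighting neighbors by both spatial distance and color similarity so that
// depth discontinuities stay aligned with color edges.
__global__ void dcbf(const float *depth, const float *color, float *out,
                     int width, int height, float sigma_s, float sigma_r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float ic = color[y * width + x];
    float sum = 0.0f, wsum = 0.0f;

    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int nx = min(max(x + dx, 0), width - 1);     // clamp at borders
            int ny = min(max(y + dy, 0), height - 1);
            float ds = (float)(dx * dx + dy * dy);       // spatial distance^2
            float dr = ic - color[ny * width + nx];      // color difference
            float w  = __expf(-ds / (2.0f * sigma_s * sigma_s))
                     * __expf(-dr * dr / (2.0f * sigma_r * sigma_r));
            sum  += w * depth[ny * width + nx];
            wsum += w;
        }
    out[y * width + x] = sum / wsum;
}

Like the convolution examples earlier, the same input pixels are reused by neighboring threads, so the same shared-memory tiling ideas apply when bandwidth becomes the limiter.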

SLIDE 18

DIBR Visual Results

[Figure: left view, right view, middle view, and rendered view]

SLIDE 19

DIBR Timing Results

  • Depth propagation
  • Not computationally intensive but hard to parallelize
  • Each pixel in the depth view is copied to the corresponding pixel in a different color view
  • 3D-to-2D projection, many-to-one mapping
  • Atomic functions are used; current work aims to improve this with sort-scan and binning algorithms (a scatter sketch follows the table)
  • Depth-color bilateral filter (DCBF)
  • Computationally expensive
  • Similar to 2D convolution; similar parallelization techniques work well

    Hardware                               Depth propagation   DCBF
    CPU: Intel Core 2 Duo E8400 3.0 GHz    38 ms               1041 ms
    GPU: NVIDIA GeForce 9800 GT            24 ms               14 ms
    Speedup                                1.6x                74.4x
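The atomic-based scatter could look roughly like the following sketch. The warping function is a hypothetical stand-in (the slides cite the McMillan warping equation but do not reproduce it), and the integer depth encoding exists only so that atomicMin can resolve the many-to-one collisions.

// Hypothetical placeholder for the 3D warping step; the real mapping follows
// the McMillan (1997) warping equation, which is not reproduced on the slides.
__device__ void warp_to_color_view(int xd, int yd, unsigned int depth,
                                   int *xc, int *yc)
{
    *xc = xd;   // identity mapping as a stand-in
    *yc = yd;
}

// Depth-propagation scatter sketch: each thread warps one depth-camera pixel
// into the color view; when several source pixels land on the same target
// pixel (many-to-one projection), atomicMin keeps the closest depth.
// color_view_depth must be initialized to 0xFFFFFFFF before the launch.
__global__ void propagate_depth(const unsigned int *depth_view,
                                unsigned int *color_view_depth,
                                int dw, int dh, int cw, int ch)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dw || y >= dh) return;

    unsigned int d = depth_view[y * dw + x];
    int xc, yc;
    warp_to_color_view(x, y, d, &xc, &yc);
    if (xc >= 0 && xc < cw && yc >= 0 && yc < ch)
        atomicMin(&color_view_depth[yc * cw + xc], d);  // nearest depth wins
}

The serialization of colliding atomics is what limits the speedup to 1.6x, which motivates the sort-scan and binning alternatives mentioned above.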

SLIDE 20

Some Upcoming Tools

SLIDE 21

Gluon – Specification Information Enables Robust Co-parallelization (Illinois)

  • Developers specify pivotal information at function boundaries
  • Heap data object shapes and sizes
  • Object access guarantees
  • Some can be derived from global analyses, but others can be practically infeasible to extract from source code
  • Compilers leverage the information to
  • Expose and transform parallelism
  • Perform code and layout transformations for locality

SLIDE 22

Gluon Parallelism Exposure Example

struct data { float x; float y; float z; };

int cal_bin(struct data *a, struct data *b) {
    __spec(*a: r, (data)[1]);         /* *a: read-only, one data element    */
    __spec(*b: r, (data)[1]);         /* *b: read-only, one data element    */
    __spec(ret_v: range(0, SZ));      /* return value is a valid bin index  */
    int bin = ...;                    /* computed from *a and *b            */
    return bin;
}

int *tpacf(int len, struct data *d) {
    __spec(d: r, (data)[len]);        /* d: read-only array of len elements */
    int *hist = malloc(SZ * sizeof(int));
    __spec(hist: (int)[SZ]);          /* hist has SZ elements               */
    for (int i = 0; i < len; i++) {
        for (int j = 0; j < len; j++) {
            int bin = cal_bin(&d[i], &d[j]);
            hist[bin] += 1;
        }
    }
    return hist;
}

With these specifications the compiler can conclude that there is no side effect on the elements of d, that hist is safe to privatize, and that data layout transformation can be done safely.

SLIDE 23

Program Dependence Graph Based Application Performance Prediction (Illinois)

  • Predicting the performance effect of compiler transformations
  • Baghsorkhi and Hwu, EPHAM 2009

[Figure: program dependence graph fragment of a kernel loop region, annotated with work estimates (W values), branch conditions, a syncthreads() node, and a shared-memory update]

SLIDE 24

[Charts: predicted vs. measured execution times (seconds) for FFT (radix 2/4/16, global vs. shared memory), matrix multiplication, and prefix scan (Init/Init_Bank/Div/Div_Bank variants at sizes 64/128/256/512)]

SLIDE 25

Automating Memory Coalescing Using Gluon and PDG Prediction

SLIDE 26

Memory Layout Transformation: Lattice-Boltzmann Method Example

  • Array of Structure: [z][y][x][e]

        F(z, y, x, e) = z * |Y| * |X| * |E| + y * |X| * |E| + x * |E| + e

  • Structure of Array: [e][z][y][x]

        F(z, y, x, e) = e * |Z| * |Y| * |X| + z * |Y| * |X| + y * |X| + x

  • SoA is 4x faster than AoS on GTX280

[Figure: element traversal order for the two layouts]

SLIDE 27

The Best Layout Is Neither SoA nor AoS

  • Tiled Array of Structure, using the lower bits of the x and y indices (x[3:0] and y[3:0]) as the lowest dimensions: [z][y[31:4]][x[31:4]][e][y[3:0]][x[3:0]]

        F(z, y, x, e) = z * ⌈|Y|/2^4⌉ * ⌈|X|/2^4⌉ * |E| * 2^4 * 2^4
                      + y[31:4] * ⌈|X|/2^4⌉ * |E| * 2^4 * 2^4
                      + x[31:4] * |E| * 2^4 * 2^4
                      + e * 2^4 * 2^4
                      + y[3:0] * 2^4
                      + x[3:0]

  • 6.4x faster than AoS and 1.6x faster than SoA on GTX280:
  • Better utilization of data by neighboring cells
  • This is a scalable layout: the same layout works for very large objects.
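To make the three index functions concrete, here is a small C sketch of the AoS, SoA, and tiled-AoS address calculations; the array extents and the fixed 2^4 = 16 tile size follow the slides, while the function names and everything else are illustrative.

#include <stddef.h>

#define TB 4                       /* tile bits: tile size is 2^4 = 16 */
#define TS (1 << TB)

/* Array of Structure: [z][y][x][e] */
static size_t idx_aos(size_t z, size_t y, size_t x, size_t e,
                      size_t Y, size_t X, size_t E)
{
    return ((z * Y + y) * X + x) * E + e;
}

/* Structure of Array: [e][z][y][x] */
static size_t idx_soa(size_t z, size_t y, size_t x, size_t e,
                      size_t Z, size_t Y, size_t X)
{
    return ((e * Z + z) * Y + y) * X + x;
}

/* Tiled Array of Structure: [z][y>>4][x>>4][e][y&15][x&15].
   Neighboring (x, y) cells fall in the same 16x16 tile, so one coalesced
   memory transaction serves many nearby lattice cells. */
static size_t idx_tiled_aos(size_t z, size_t y, size_t x, size_t e,
                            size_t Y, size_t X, size_t E)
{
    size_t tilesX = (X + TS - 1) / TS;       /* ceil(|X| / 16) */
    size_t tilesY = (Y + TS - 1) / TS;       /* ceil(|Y| / 16) */
    size_t tile   = (z * tilesY + (y >> TB)) * tilesX + (x >> TB);
    return ((tile * E + e) << (2 * TB))
         + ((y & (TS - 1)) << TB)
         + (x & (TS - 1));
}

Expanding idx_tiled_aos reproduces the F(z, y, x, e) expression above: the tile index carries the high bits of y and x, and the low 4+4 bits address the 16x16 block that neighboring cells share.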

SLIDE 28

Summary

  • Tools must understand and manage data accesses
  • Partnership between developers and tools
  • Key to “good” parallelism
  • Must balance between developer specification and program analysis
  • Key to portability and productivity
  • “Simple” many-core programming tools are within reach
  • Memory bandwidth optimizations
  • Parallel execution granularity adjustments
  • Well-known algorithm changes
  • Heterogeneous computing mapping and data transfers
  • Haves and Have-Nots of many-core computing
  • http://www.parallel.illinois.edu/
  • Courses, seminars, publications, tools, …
  • UPCRC, CUDA Center of Excellence, IACAT, …

SLIDE 29

Current Challenges

  • Execution models
  • Currently single-kernel execution
  • Moving to multiple-kernel streaming
  • Irregular algorithms and data structures
  • Data layout and tiling transformations for sparse matrices and spatial data structures need to be developed and automated
  • Graph algorithms lack a conceptual foundation for locality
  • Usability
  • Tools and interfaces may still be too tedious and confusing for application developers

SLIDE 30

Thank you! Any questions?

SLIDE 31

Applications Entry Timeframes

[Chart: projected multi-core (2/4/8/16 cores, 50–400 GF) vs. many-core (16 cores / 500 GF through 128 cores / 4 TF; G80, G280, G380, Larrabee) trajectories over time, with 24-month generations and application entry points in 2008 and 2011]

  • App developers want at least 3x-5x for user-perceived value-add

SLIDE 32

FIR Implementation

Cubic interpolation for the 1D case:

    k = x - ⌊x/R⌋ * R
    g[x] = c[x-1] * w0[k] + c[x] * w1[k] + c[x+1] * w2[k] + c[x+2] * w3[k]

[Figure: comparison with linear interpolation for the 1D case]
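A rough CUDA rendering of this FIR step for a 1D signal is given below. It reads the slide's formula with the coefficient array indexed by the input-sample position ⌊x/R⌋; the upscaling factor R, the per-phase weight tables, and the clamped boundary handling are illustrative assumptions.

// 1D bicubic FIR interpolation sketch: each thread produces one upsampled
// output g[x] from four B-spline coefficients around x/R, weighted by
// precomputed per-phase weight tables w0..w3 (one entry per sub-pixel phase k).
__global__ void bicubic_fir_1d(const float *c,      // B-spline coefficients (from the IIR pass)
                               float *g,            // upsampled output, n_out = n_in * R
                               const float *w0, const float *w1,
                               const float *w2, const float *w3,
                               int n_out, int n_in, int R)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x >= n_out) return;

    int base = x / R;            // nearest input sample at or below x
    int k    = x - base * R;     // sub-pixel phase, k = x - floor(x/R)*R

    // Clamp the 4-tap neighborhood to the valid coefficient range.
    int im1 = max(base - 1, 0);
    int ip1 = min(base + 1, n_in - 1);
    int ip2 = min(base + 2, n_in - 1);

    g[x] = c[im1] * w0[k] + c[base] * w1[k] + c[ip1] * w2[k] + c[ip2] * w3[k];
}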

SLIDE 33

Depth Propagation

  • Propagate depth information from the depth camera to each color camera
  • 2D point to 3D ray mapping relation
  • Warping equation (L. McMillan, 1997)
  • Compute new depth values
  • A form of 2D “histogram”, challenging for GPUs

SLIDE 34

Illinois Vision Video (ViVid) Framework

  • M. Dikman et al., University of Illinois, Urbana-Champaign
  • Constructed by vision experts with parallel programming expertise
  • For video analysis, enhancement, and synthesis apps
  • Python module bindings for seamless CPU/GPU deployment
  • MPEG2 video decoder and file I/O - C++ (through OpenCV)
  • 2D convolution - C++, Python, CUDA
  • 3D convolution - C++, Python, CUDA
  • 2D Fourier transform - C++, Python, CUDA
  • 3D Fourier transform - C++, Python, CUDA
  • Optical flow computation - C++ (through OpenCV)
  • Motion feature extraction - C++, Python, CUDA
  • Pairwise distance between two collections of vectors - C++, Python, CUDA
  • Domain knowledge capture for optimization and auto-tuning

SLIDE 35

GMAC Heterogeneous Computing Runtime (UPC/Illinois)

  • Software-based unified CPU/GPU address space
  • Same address/pointer used by CPU and GPU
  • No explicit data transfers
  • Data reside mainly in GPU memory
  • Close to compute power
  • Occasional CPU access for legacy libraries and I/O
  • Customizable automatic data transfers:
  • Transfer everything (safe mode)
  • Transfer dirty data before kernel execution
  • Transfer data as it is produced (default)
  • Multi-process / multi-thread support
  • CUDA compatible; Linux alpha version available soon.