CS 61C: Great Ideas in Computer Architecture
Lecture 18: Parallel Processing – SIMD
Bernhard Boser & Randy Katz
http://inst.eecs.berkeley.edu/~cs61c

Reference Problem
• Matrix multiplication
  − Basic operation in many engineering, data, and image processing tasks
  − Image filtering, noise reduction, …
  − Many closely related operations
    § E.g., stereo vision (project 4)
• dgemm
  − double-precision floating-point matrix multiplication

Application Example: Deep Learning
• Image classification (cats …)
• Pick “best” vacation photos
• Machine translation
• Clean up accent
• Fingerprint verification
• Automatic game playing

Matrices
• Square (or rectangular) N x N array of numbers
  − Dimension N

$$C = A \cdot B, \qquad c_{ij} = \sum_{k=0}^{N-1} a_{ik}\, b_{kj}$$

Matrix Multiplication

$$\mathbf{C} = \mathbf{A} \cdot \mathbf{B}, \qquad c_{ij} = \sum_{k=0}^{N-1} a_{ik}\, b_{kj}$$
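The element formula is a dot product of row i of A with column j of B. A minimal C sketch, assuming (for illustration only, not necessarily the lecture's layout) that each N x N matrix is a flat array in column-major order, so element (row, col) sits at index row + col*N; `c_elem` is a hypothetical name.

```c
#include <assert.h>

/* Hypothetical helper: returns c_ij = sum over k = 0..N-1 of a_ik * b_kj.
   Assumes column-major storage: element (row, col) is at index row + col*N. */
double c_elem(int N, const double *a, const double *b, int i, int j) {
    assert(0 <= i && i < N && 0 <= j && j < N);
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += a[i + k * N] * b[k + j * N]; /* a_ik * b_kj */
    return sum;
}
```

With A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] stored column-major, `c_elem` yields the entries 19, 22, 43, 50 of A · B.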

Reference: Python
• Matrix multiplication in Python
• 1 Mflop = 1 million floating-point operations per second (fadd, fmul)
• dgemm(N …) takes 2·N³ flops

  N     Python [Mflops]
  32    5.4
  160   5.5
  480   5.4
  960   5.3
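The 2·N³ count comes from N³ multiply-add pairs (one fmul and one fadd per term of each c_ij). A small C sketch that converts a measured run time into a rate; `dgemm_flops` and `dgemm_gflops` are hypothetical helper names.

```c
#include <assert.h>

/* Floating-point operations in one N x N dgemm: N^3 multiplies + N^3 adds. */
double dgemm_flops(int N) {
    double n = (double)N;
    return 2.0 * n * n * n;
}

/* Rate in Gflops for one dgemm call that took `seconds` to run. */
double dgemm_gflops(int N, double seconds) {
    assert(seconds > 0.0);
    return dgemm_flops(N) / seconds / 1e9;
}
```

For example, at the measured 5.3 Mflops, N = 960 (2 · 960³ ≈ 1.77 × 10⁹ flops) would take roughly 334 seconds.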

C
• c = a x b
• a, b, c are N x N matrices
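A scalar dgemm along these lines might look as follows. This is a sketch, not necessarily the code on the slide: it assumes column-major storage (element (row, col) at index row + col*N) and accumulates into c, so c must start at zero to obtain c = a x b; `dgemm_scalar` is a hypothetical name.

```c
#include <assert.h>

/* Scalar dgemm sketch: c += a * b for N x N column-major matrices. */
void dgemm_scalar(int N, const double *a, const double *b, double *c) {
    assert(N > 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double cij = c[i + j * N];              /* running value of c_ij */
            for (int k = 0; k < N; k++)
                cij += a[i + k * N] * b[k + j * N]; /* += a_ik * b_kj */
            c[i + j * N] = cij;
        }
}
```

Keeping the running sum in the local `cij` lets the compiler hold it in a register instead of reloading c[i + j*N] on every k iteration.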

Timing Program Execution
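One common way to time a call in C on POSIX systems is `clock_gettime` with `CLOCK_MONOTONIC`, which is not affected by wall-clock adjustments. A minimal sketch; `now_seconds` is a hypothetical helper, and the lecture's own timing code may differ.

```c
#define _POSIX_C_SOURCE 199309L /* expose clock_gettime in strict C modes */
#include <assert.h>
#include <time.h>

/* Current time in seconds from a monotonic clock, for measuring intervals.
   Usage (dgemm stands for whichever variant is being measured):
       double t0 = now_seconds();
       dgemm(N, a, b, c);
       double seconds = now_seconds() - t0;                                  */
double now_seconds(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}
```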

C versus Python

  N     C [Gflops]   Python [Gflops]
  32    1.30         0.0054
  160   1.30         0.0055
  480   1.32         0.0054
  960   0.91         0.0053

C is about 240x faster!
Which class gives you this kind of power? We could stop here … but why? Let’s do better!

New-School Machine Structures (It’s a bit more complicated!)
Software: harness parallelism & achieve high performance
• Parallel Requests
  Assigned to computer, e.g., Search “Katz”
• Parallel Threads
  Assigned to core, e.g., Lookup, Ads
• Parallel Instructions
  >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data (today’s lecture)
  >1 data item @ one time, e.g., Add of 4 pairs of words
• Hardware descriptions
  All gates @ one time
• Programming Languages
Hardware: Warehouse Scale Computer, Smart Phone, Computer, Core, Memory (Cache), Input/Output, Instruction Unit(s), Functional Unit(s) (A0+B0, A1+B1, A2+B2, A3+B3), Cache Memory, Logic Gates
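The “Parallel Data” example, four additions A0+B0 … A3+B3 carried out by one instruction, can be written with x86 SSE2 intrinsics. A minimal sketch, assuming an x86-64 compiler and taking “word” to mean 32 bits; `add4_words` is a hypothetical name.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics (x86/x86-64 only) */
#include <stdint.h>

/* out[n] = x[n] + y[n] for n = 0..3, using one SIMD add (paddd). */
void add4_words(const int32_t *x, const int32_t *y, int32_t *out) {
    assert(x && y && out);
    __m128i vx = _mm_loadu_si128((const __m128i *)x); /* load A0..A3 */
    __m128i vy = _mm_loadu_si128((const __m128i *)y); /* load B0..B3 */
    __m128i vs = _mm_add_epi32(vx, vy);               /* 4 adds at once */
    _mm_storeu_si128((__m128i *)out, vs);             /* store 4 sums */
}
```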

Multiple-Instruction/Single-Data Stream (MISD)
• A multiple-instruction, single-data stream computer exploits multiple instruction streams against a single data stream.
• Mainly of historical significance: this organization has few applications. Not covered in 61C.

SIMD Applications & Implementations
• Applications
  − Scientific computing
    § Matlab, NumPy
  − Graphics and video processing
    § Photoshop, …
  − Big Data
    § Deep learning
  − Gaming
  − …
• Implementations
  − x86
  − ARM
  − …

Raw Double Precision Throughput (Bernhard’s Powerbook Pro)

  Characteristic                        Value
  CPU                                   i7-5557U
  Clock rate (sustained)                3.1 GHz
  Instructions per clock (mul_pd)       2
  Parallel multiplies per instruction   4
  Peak double flops                     24.8 Gflops

https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
Actual performance is lower because of overhead.
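The peak figure is the product of the three rows above it:

$$\text{Peak} = 3.1\,\text{GHz} \times 2\,\frac{\text{instructions}}{\text{clock}} \times 4\,\frac{\text{multiplies}}{\text{instruction}} = 24.8\ \text{Gflops}$$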

Vectorized Matrix Multiplication
for i …; i += 4
  for j …
    Inner loop: over k

“Vectorized” dgemm
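The loop structure (i stepping by 4, then j, then an inner loop over k) maps onto AVX intrinsics roughly as follows. A minimal sketch, not the lecture's exact code: it assumes column-major storage, N divisible by 4, an x86-64 CPU with AVX, and GCC or Clang (for `__attribute__((target("avx")))`); `dgemm_avx` is a hypothetical name. Unaligned loads (`_mm256_loadu_pd`) avoid any alignment requirement on the arrays.

```c
#include <assert.h>
#include <immintrin.h> /* AVX intrinsics (x86-64) */

/* AVX dgemm sketch: c += a * b, N x N, column-major. One 256-bit register
   holds 4 consecutive rows of c, so N must be a multiple of 4.
   target("avx") lets GCC/Clang compile this without -mavx; the CPU must
   still support AVX at run time. */
__attribute__((target("avx")))
void dgemm_avx(int N, const double *a, const double *b, double *c) {
    assert(N > 0 && N % 4 == 0);
    for (int i = 0; i < N; i += 4)
        for (int j = 0; j < N; j++) {
            __m256d c0 = _mm256_loadu_pd(c + i + j * N);        /* c_ij..c_(i+3)j */
            for (int k = 0; k < N; k++) {
                __m256d av = _mm256_loadu_pd(a + i + k * N);    /* a_ik..a_(i+3)k */
                __m256d bv = _mm256_broadcast_sd(b + k + j * N); /* b_kj, 4 lanes */
                c0 = _mm256_add_pd(c0, _mm256_mul_pd(av, bv));
            }
            _mm256_storeu_pd(c + i + j * N, c0);
        }
}
```

Each `_mm256_mul_pd`/`_mm256_add_pd` pair performs 4 multiplies and 4 adds, which is where a roughly 4x gain over a scalar loop would be expected to come from.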

Performance (Gflops)

  N     scalar   avx
  32    1.30     4.56
  160   1.30     5.47
  480   1.32     5.27
  960   0.91     3.64

• About 4x faster than scalar
• But still << theoretical 25 Gflops!

Pipeline Hazards – dgemm

Loop Unrolling
• 4 registers
• Compiler does the unrolling
• How do you verify that the generated code is actually unrolled?
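A sketch of an unrolled version that keeps 4 independent AVX accumulators, so consecutive adds need not wait on one another's results. It assumes column-major storage, N divisible by 16, an x86-64 CPU with AVX, and GCC or Clang (for `__attribute__((target("avx")))`); `dgemm_avx_unrolled` and `UNROLL` are hypothetical names. The inner `x` loop is written as a loop; with a constant trip count the compiler is expected to unroll it.

```c
#include <assert.h>
#include <immintrin.h> /* AVX intrinsics (x86-64) */

#define UNROLL 4 /* number of independent accumulator registers */

/* Unrolled AVX dgemm sketch: c += a * b, N x N, column-major.
   Each (i, j) step updates 4 * UNROLL = 16 rows of c. */
__attribute__((target("avx")))
void dgemm_avx_unrolled(int N, const double *a, const double *b, double *c) {
    assert(N > 0 && N % (4 * UNROLL) == 0);
    for (int i = 0; i < N; i += 4 * UNROLL)
        for (int j = 0; j < N; j++) {
            __m256d acc[UNROLL];
            for (int x = 0; x < UNROLL; x++)
                acc[x] = _mm256_loadu_pd(c + i + 4 * x + j * N);
            for (int k = 0; k < N; k++) {
                __m256d bv = _mm256_broadcast_sd(b + k + j * N); /* b_kj, 4 lanes */
                /* Constant trip count: candidate for full unrolling, with
                   acc[0..3] kept in 4 separate registers. */
                for (int x = 0; x < UNROLL; x++)
                    acc[x] = _mm256_add_pd(acc[x],
                        _mm256_mul_pd(_mm256_loadu_pd(a + i + 4 * x + k * N), bv));
            }
            for (int x = 0; x < UNROLL; x++)
                _mm256_storeu_pd(c + i + 4 * x + j * N, acc[x]);
        }
}
```

One way to answer the verification question is to inspect the generated assembly (for example with `gcc -O3 -S`, or `objdump -d` on the object file): if the loop was unrolled, each k iteration should show four separate `vmulpd`/`vaddpd` pairs on different registers rather than a small inner loop.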

Performance (Gflops)

  N     scalar   avx    unroll
  32    1.30     4.56   12.95
  160   1.30     5.47   19.70
  480   1.32     5.27   14.50
  960   0.91     3.64    6.91
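Putting the N = 160 results next to the 24.8 Gflops peak of this CPU and the scalar baseline (rounded):

$$\frac{5.47}{24.8} \approx 22\%\ \text{of peak (avx)}, \qquad \frac{19.70}{24.8} \approx 79\%\ \text{of peak (unroll)}, \qquad \frac{19.70}{1.30} \approx 15\times\ \text{over scalar}$$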
