
CS 61C: Great Ideas in Computer Architecture Lecture 18: Parallel Processing – SIMD



  1. CS 61C: Great Ideas in Computer Architecture
     Lecture 18: Parallel Processing – SIMD
     Bernhard Boser & Randy Katz
     http://inst.eecs.berkeley.edu/~cs61c

  2. Reference Problem
     • Matrix multiplication
       − Basic operation in many engineering, data, and image processing tasks
       − Image filtering, noise reduction, …
       − Many closely related operations
         § E.g. stereo vision (project 4)
     • dgemm
       − double precision floating point matrix multiplication

  3. Application Example: Deep Learning
     • Image classification (cats …)
     • Pick “best” vacation photos
     • Machine translation
     • Clean up accent
     • Fingerprint verification
     • Automatic game playing

  4. Matrices
     • Square (or rectangular) N x N array of numbers
       − Dimension N
     • $C = A \cdot B$, with $c_{ij} = \sum_{k} a_{ik} b_{kj}$
     [Figure: N x N matrix with row index i and column index j, each running from 0 to N-1]

  5. Matrix Multiplication
     $C = A \cdot B$, with $c_{ij} = \sum_{k} a_{ik} b_{kj}$
     [Figure: matrix multiplication diagram labeled with the indices i, j, and k]

  6. Reference: Python
     • Matrix multiplication in Python
     • 1 Mflop = 1 million floating point operations per second (fadd, fmul)
     • dgemm(N …) takes 2*N^3 flops

     N      Python [Mflops]
     32     5.4
     160    5.5
     480    5.4
     960    5.3
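The slide's own Python listing did not survive extraction. As a rough sketch of what such a reference measurement could look like, the triple loop below accumulates `c += a * b` and converts the 2*N^3 flop count into Mflops. The names `dgemm` and `measure_mflops`, the list-of-rows layout, and the test data are assumptions for illustration, not the lecture's code.

```python
import time


def dgemm(n, a, b, c):
    """Accumulate c += a * b for n x n matrices stored as lists of rows."""
    for i in range(n):
        for j in range(n):
            cij = c[i][j]
            for k in range(n):
                cij += a[i][k] * b[k][j]  # one fmul and one fadd per iteration
            c[i][j] = cij


def measure_mflops(n):
    """Time one dgemm call and convert its 2*n**3 flops into Mflops."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    c = [[0.0] * n for _ in range(n)]
    start = time.perf_counter()
    dgemm(n, a, b, c)
    elapsed = time.perf_counter() - start
    return 2 * n**3 / elapsed / 1e6


if __name__ == "__main__":
    print(f"N=32: {measure_mflops(32):.1f} Mflops")
```

The measured rate depends on the machine and interpreter version, so it will not necessarily match the table above.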

  7. C
     • c = a x b
     • a, b, c are N x N matrices

  8. Timing Program Execution

  9. C versus Python

     N      C [Gflops]   Python [Gflops]
     32     1.30         0.0054
     160    1.30         0.0055
     480    1.32         0.0054
     960    0.91         0.0053

     Speedup of C over Python: 240x!
     Which class gives you this kind of power? We could stop here … but why? Let’s do better!
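The 240x figure can be checked from the N = 32 row of the table, using only the two measured rates quoted on the slide:

```latex
\frac{1.30\ \text{Gflops}}{0.0054\ \text{Gflops}} \approx 241 \approx 240\times
```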

  10. New-School Machine Structures (It’s a bit more complicated!)
      Harness Parallelism & Achieve High Performance
      Software
      • Parallel Requests: assigned to computer, e.g., Search “Katz”
      • Parallel Threads: assigned to core, e.g., Lookup, Ads
      • Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
      • Parallel Data (today’s lecture): >1 data item @ one time, e.g., Add of 4 pairs of words
      • Hardware descriptions: all gates @ one time
      • Programming Languages
      Hardware
      [Figure: hierarchy from Warehouse Scale Computer and Smart Phone, to Computer (Cores, Memory (Cache), Input/Output), to Core (Instruction Unit(s), Functional Unit(s) computing A0+B0, A1+B1, A2+B2, A3+B3), to Cache Memory and Logic Gates]

  11. Multiple-Instruction/Single-Data Stream (MISD)
      • Multiple-Instruction, Single-Data stream computer that exploits multiple instruction streams against a single data stream.
      • Historical significance
      This has few applications. Not covered in 61C.

  12. SIMD Applications & Implementations
      • Applications
        − Scientific computing
          § Matlab, NumPy
        − Graphics and video processing
          § Photoshop, …
        − Big Data
          § Deep learning
        − Gaming
        − …
      • Implementations
        − x86
        − ARM
        − …

  13. Raw Double Precision Throughput (Bernhard’s Powerbook Pro)

      Characteristic                        Value
      CPU                                   i7-5557U
      Clock rate (sustained)                3.1 GHz
      Instructions per clock (mul_pd)       2
      Parallel multiplies per instruction   4
      Peak double flops                     24.8 Gflops

      https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
      Actual performance is lower because of overhead
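The peak figure in the table is the product of the three rows above it: clock rate, instructions per clock, and parallel multiplies per instruction.

```latex
3.1 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}
\times 2\ \tfrac{\text{instructions}}{\text{cycle}}
\times 4\ \tfrac{\text{flops}}{\text{instruction}}
= 24.8 \times 10^{9}\ \tfrac{\text{flops}}{\text{s}}
= 24.8\ \text{Gflops}
```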

  14. Vectorized Matrix Multiplication
      for i …; i+=4
        for j ...
          Inner Loop: …
      [Figure: matrix diagram labeled with the indices i, j, and k; i advances in steps of 4 (i += 4)]
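The loop outline above can be sketched in plain Python as a scalar emulation of the 4-wide structure: `i` advances by 4, and the inner loop updates four elements of one column of `c` together. This only illustrates the loop shape; it uses no SIMD hardware and gains no speed. The names `dgemm_4wide` and `LANES` and the list-of-rows layout are assumptions, and a real SIMD version would need the four elements to be adjacent in memory.

```python
LANES = 4  # assumed vector width: four doubles per register


def dgemm_4wide(n, a, b, c):
    """Accumulate c += a * b, updating LANES rows of one column of c at a time.

    Emulation of the vectorized loop structure; n must be a multiple of LANES.
    """
    assert n % LANES == 0
    for i in range(0, n, LANES):  # i += 4
        for j in range(n):
            # "load" c[i..i+3][j] into one emulated vector accumulator
            acc = [c[i + lane][j] for lane in range(LANES)]
            for k in range(n):  # inner loop
                bkj = b[k][j]  # one scalar of b, broadcast to every lane
                # In hardware, these LANES updates are a single SIMD multiply
                # followed by a single SIMD add.
                for lane in range(LANES):
                    acc[lane] += a[i + lane][k] * bkj
            # "store" the four results back into c
            for lane in range(LANES):
                c[i + lane][j] = acc[lane]
```

Each pass of the `k` loop stands for one vector multiply and one vector add, which is where the factor of 4 in arithmetic throughput would come from.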

  15. “Vectorized” dgemm

  16. Performance

      Gflops
      N      scalar   avx
      32     1.30     4.56
      160    1.30     5.47
      480    1.32     5.27
      960    0.91     3.64

      • 4x faster
      • But still << theoretical 25 Gflops!

  17. Pipeline Hazards – dgemm

  18. Loop Unrolling
      • 4 registers
      • Compiler does the unrolling
      • How do you verify that the generated code is actually unrolled?
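To make the "4 registers" idea concrete, here is a hand-unrolled emulation in plain Python: four independent accumulators `c0` to `c3`, each standing in for one vector register, are updated inside the same `k` loop. The helpers `load4`, `store4`, and `muladd4` are hypothetical stand-ins for vector load, store, and multiply-add, not a real library API. In C the compiler can generate this form from a short fixed-count loop; it is written out by hand here only so the four accumulators are visible.

```python
LANES = 4  # assumed vector width: four doubles per emulated register


def load4(m, i, j):
    """Emulated vector load of m[i..i+3][j]."""
    return [m[i + lane][j] for lane in range(LANES)]


def store4(m, i, j, v):
    """Emulated vector store of v into m[i..i+3][j]."""
    for lane in range(LANES):
        m[i + lane][j] = v[lane]


def muladd4(acc, a, i, k, s):
    """Emulated acc + a[i..i+3][k] * s, with scalar s broadcast to all lanes."""
    return [acc[lane] + a[i + lane][k] * s for lane in range(LANES)]


def dgemm_unrolled(n, a, b, c):
    """Accumulate c += a * b with the 4-wide body unrolled four times by hand.

    n must be a multiple of 4 * LANES = 16.
    """
    assert n % (4 * LANES) == 0
    for i in range(0, n, 4 * LANES):
        for j in range(n):
            c0 = load4(c, i, j)
            c1 = load4(c, i + 4, j)
            c2 = load4(c, i + 8, j)
            c3 = load4(c, i + 12, j)
            for k in range(n):
                s = b[k][j]
                # Four independent updates: none reads another's result, so a
                # pipelined CPU could overlap them rather than wait on one chain.
                c0 = muladd4(c0, a, i, k, s)
                c1 = muladd4(c1, a, i + 4, k, s)
                c2 = muladd4(c2, a, i + 8, k, s)
                c3 = muladd4(c3, a, i + 12, k, s)
            store4(c, i, j, c0)
            store4(c, i + 4, j, c1)
            store4(c, i + 8, j, c2)
            store4(c, i + 12, j, c3)
```

In Python this brings no speedup; the point is only that the four accumulator chains carry no dependences on one another.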

  19. Performance

      Gflops
      N      scalar   avx     unroll
      32     1.30     4.56    12.95
      160    1.30     5.47    19.70
      480    1.32     5.27    14.50
      960    0.91     3.64    6.91
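Set against the 24.8 Gflops peak quoted earlier in the deck, the best and worst unrolled results in this table work out to roughly:

```latex
\frac{19.70}{24.8} \approx 0.79 \quad (N = 160), \qquad
\frac{6.91}{24.8} \approx 0.28 \quad (N = 960)
```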
