
High-Performance Execution of Multithreaded Workloads on CMPs

M. Aater Suleman. Advisor: Yale Patt. HPS Research Group, The University of Texas at Austin.


  1. High-Performance Execution of Multithreaded Workloads on CMPs
     M. Aater Suleman
     Advisor: Yale Patt
     HPS Research Group, The University of Texas at Austin

  2. How do we use the transistors?
     • More transistors → Higher-performance core
       – Performance increases without programmer effort
       – Larger cores are complex and consume more power
     • More transistors → Bigger cache
       – Assists the core by reducing memory accesses
       – Easier to design and consumes less power
       – Pentium M: 50M of the 77M transistors were cache
     • More transistors → More cores
       – Chip Multiprocessors (CMPs)
       – Less complex
       – Run at lower frequency (Power ∝ frequency³)
     But do CMPs improve performance?

  3. Multithreading
     [Figure label: Single-Threaded]
     To leverage CMPs, applications must be split into threads.
     But can we do this for all applications?

  4. Easy-to-parallelize Kernels
     Kernel from ImageMagick:
       GrayscaleToMonochrome(picture)
         foreach (OldPixel in picture)
           if (OldPixel > Threshold)
             NewPixel = 1
           else
             NewPixel = 0
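
The thresholding kernel above is easy to parallelize because each output pixel depends only on its own input pixel. A minimal Python sketch of that decomposition follows. The names `to_monochrome` and `to_monochrome_parallel`, the threshold value of 128, and the flat-list image representation are assumptions for illustration, not taken from ImageMagick. CPython threads show the decomposition only; they are not expected to give a real speedup for this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 128  # hypothetical cutoff; the slide does not give a value


def to_monochrome(pixels, threshold=THRESHOLD):
    """Map each grayscale pixel to 1 if it exceeds the threshold, else 0."""
    return [1 if p > threshold else 0 for p in pixels]


def to_monochrome_parallel(pixels, workers=4, threshold=THRESHOLD):
    """Threshold contiguous chunks independently, then concatenate the results."""
    if not pixels:
        return []
    chunk = (len(pixels) + workers - 1) // workers  # ceiling division
    chunks = [pixels[i:i + chunk] for i in range(0, len(pixels), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the parts line up with the input
        parts = pool.map(lambda c: to_monochrome(c, threshold), chunks)
        return [bit for part in parts for bit in part]


picture = [0, 50, 129, 200, 128, 255, 10, 130]
# No chunk reads another chunk's data, so both versions agree.
assert to_monochrome_parallel(picture) == to_monochrome(picture)
```

Because no chunk reads or writes another chunk's pixels, the chunks need no synchronization beyond the final concatenation.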

  5. Serial Kernels
     Kernel from ImageMagick:
       Smooth(Picture)
         for i = 1 to N
           Pixel[i] = (Pixel[i-1] + Pixel[i])/2
     [Figure: old pixels 1 2 3 4 → avg → new pixels]
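
The smoothing kernel is serial because iteration i reads `Pixel[i-1]` after iteration i-1 has already overwritten it. The sketch below contrasts the in-place loop with a version that reads only the old values, which is what a naive per-pixel split would compute. It assumes the slide's loop updates the array in place and uses 0-based indexing. The function names and sample values are illustrative.

```python
def smooth_serial(pixels):
    """In-place smoothing: each step reads the already-updated left neighbor."""
    out = list(pixels)
    for i in range(1, len(out)):
        out[i] = (out[i - 1] + out[i]) / 2
    return out


def smooth_independent(pixels):
    """Per-pixel version that reads only OLD values (a different computation)."""
    if not pixels:
        return []
    return [pixels[0]] + [
        (pixels[i - 1] + pixels[i]) / 2 for i in range(1, len(pixels))
    ]


old = [8.0, 4.0, 2.0, 6.0]
print(smooth_serial(old))       # [8.0, 6.0, 4.0, 5.0]
print(smooth_independent(old))  # [8.0, 6.0, 3.0, 4.0]
```

The two results diverge from the third pixel onward, so the loop cannot simply be split across threads without changing its output.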

  6. Amdahl’s Law
     As the number of cores increases, even a small serial part can have a significant impact on overall performance.
     Future CMPs must improve the performance of both the parallel and the serial parts.
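
Amdahl's Law can be written as speedup = 1 / ((1 - p) + p / n), where p is the parallel fraction and n is the number of cores. The short sketch below evaluates it for a workload that is 95% parallel. The function name and the 95% figure are chosen for illustration only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup over one core when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 5% serial work, speedup can never exceed 1 / 0.05 = 20,
# no matter how many cores are added.
for cores in (4, 16, 64, 256):
    print(cores, round(amdahl_speedup(0.95, cores), 2))
```

The serial term does not shrink as cores are added, so it sets a hard ceiling on the achievable speedup.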

  7. Outline
     • Background
     • Speeding up serial part
       – Asymmetric Chip Multiprocessor (ACMP)
     • Speeding up parallel part
       – Feedback-Driven Threading (FDT)
     • Summary

  8. Current CMP Architectures

  9. Current CMP Architectures
     [Figure: “Niagara” approach, a chip tiled with 16 Niagara-like cores]
     “Niagara” Approach
     • Tile many small cores
     • Sun Niagara Processor
     • High throughput on the parallel part
     • Low performance on the serial part

  10. Current CMP Architectures
      [Figure: “Niagara” approach, a chip tiled with 16 Niagara-like cores]

  11. Current CMP Architectures
      [Figure: “Niagara” approach (16 Niagara-like cores) beside “Tile-Large” approach (4 large cores)]
      “Tile-Large” Approach
      • Tile a few large cores
      • IBM Power 5, AMD Barcelona, Intel Core2Quad
      • High performance on the serial part
      • Low throughput on the parallel part

  12. Current CMP Architectures
      [Figure: “Niagara” approach (16 Niagara-like cores) beside “Tile-Large” approach (4 large cores)]

  13. The Asymmetric Chip Multiprocessor (ACMP)
      [Figure: “Niagara” approach (16 Niagara-like cores), “Tile-Large” approach (4 large cores), and ACMP approach (1 large core, 12 Niagara-like cores)]
      ACMP Approach
      • Provide one large core and many small cores
      • Accelerate the serial part using the large core
      • Execute the parallel part on the small cores for high throughput

  14. The Asymmetric Chip Multiprocessor (ACMP)
      [Figure: “Niagara” approach (16 Niagara-like cores), “Tile-Large” approach (4 large cores), and ACMP approach (1 large core, 12 Niagara-like cores)]

  15. The Asymmetric Chip Multiprocessor (ACMP)
      [Figure: “Niagara” approach (16 Niagara-like cores), “Tile-Large” approach (4 large cores), and ACMP approach (1 large core, 12 Niagara-like cores)]
      • Analytical experiment details
        – One large core replaces four small cores
        – Large core provides 2x performance

  16. Performance vs. Parallelism
      [Chart: speedup vs. 1 large core (y-axis, 0 to 9) against degree of parallelism (x-axis, 0 to 1) for Niagara, Tile-Large, and ACMP]
      • Both ACMP and Tile-Large outperform Niagara
      • At medium parallelism, ACMP wins
      • At high parallelism, Niagara outperforms ACMP
      • Niagara beats ACMP at 97% parallelism
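
The analytical comparison can be approximated with an Amdahl-style model. The sketch below assumes a small core delivers half the performance of a large core (from "large core provides 2x performance"), that the serial part runs on the fastest available core, and that the chips hold 16 small cores, 4 large cores, or 1 large plus 12 small cores, as listed in the deck's methodology. Whether the large core also helps with the parallel part is not stated on the slides, so it is exposed as the hypothetical flag `large_helps_parallel`. All function names are illustrative.

```python
SMALL_PERF = 0.5  # small-core performance relative to a large core (assumed)
LARGE_PERF = 1.0  # baseline: one large core


def speedup(p, serial_perf, parallel_throughput):
    """Speedup over one large core for parallel fraction p."""
    return 1.0 / ((1.0 - p) / serial_perf + p / parallel_throughput)


def niagara(p):
    """16 small cores; the serial part runs on one small core."""
    return speedup(p, SMALL_PERF, 16 * SMALL_PERF)


def tile_large(p):
    """4 large cores; the serial part runs on one large core."""
    return speedup(p, LARGE_PERF, 4 * LARGE_PERF)


def acmp(p, large_helps_parallel=False):
    """1 large + 12 small cores; the serial part runs on the large core."""
    throughput = 12 * SMALL_PERF
    if large_helps_parallel:  # assumption: not specified on the slides
        throughput += LARGE_PERF
    return speedup(p, LARGE_PERF, throughput)


for p in (0.5, 0.9, 0.99):
    print(p, round(niagara(p), 2), round(tile_large(p), 2), round(acmp(p), 2))
```

Under these assumptions the model reproduces the qualitative ordering on the chart: ACMP and Tile-Large lead at low parallelism, ACMP leads at medium parallelism, and Niagara leads only near full parallelism. The Niagara/ACMP crossover lands at p = 0.96 without the large core in the parallel part and at about p = 0.98 with it, which brackets the 97% quoted on the slide. The exact figure depends on details the slides do not give.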

  17. Throughput of ACMP vs. Niagara
      [Figure: a chip with one large core and 60 Niagara-like cores]

  18. ACMP Scheduling
      [Figure: ACMP approach, 1 large core and 12 Niagara-like cores]

  19. Data Transfers in ACMP
      • Data is transferred if the serial part requires data generated by the parallel part, or vice versa
      • ACMP
        – Data is transferred from all small cores
      • Niagara/Tile-Large
        – Data is transferred from all but one core
      • Number of data transfers increases by only 3.8%

  20. Experimental Methodology
      • Configurations:
        – Niagara: 16 small cores
        – Tile-Large: 4 large cores
        – ACMP: 1 large core, 12 small cores
      • Simulated existing multithreaded applications without modification
      • Simulation parameters:
        – Cycle-accurate x86 processor simulator
        – Large core: 2 GHz, out-of-order, 128-entry window, 4-wide issue, 12-stage pipeline
        – Small core: 2 GHz, in-order, 2-wide, 5-stage pipeline
        – Private 32 KB L1, private 256 KB L2
        – On-chip interconnect: bidirectional ring

  21. Performance Results
      [Chart: speedup over Niagara (y-axis, 0 to 1.4) for Tile-Large and ACMP, with benchmarks grouped into low, medium, and high parallelism]
