
Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications



  1. Thread Tailor: Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications • Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

  2. Motivation • Hardware trends – Put more cores in a single chip (figure labels: 2009, 201X) • CPU-intensive programs – Exploit thread-level parallelism • Do more threads always win? NO!

  3. Optimal Number of Threads • Too many threads – More synchronization – More contention for system resources • Too few threads – Resource underutilization • Who can decide the number? – Not a programmer

  4. Why NOT? • Input changes – Various working-set sizes • The system changes – Various available resources • Hardware changes – Various L2/L3 cache structures, sizes, etc. • Decision must be made at runtime

  5. Proposal • Programmer: "OK. I will create lots of threads" (> 128 threads) – Compile to a binary and distribute it • Thread Tailor – Combine threads into a new binary (16 threads in the figure) • Combining Threads – Group several threads into a single thread • Threads in the same group are executed serially • Executed on the SAME core

  6. Details [Figure: Thread Tailor workflow spanning development and distribution. Labels: binary with > 128 threads; Profiler (instrument, run instrumented code, profile info); Graphs; Collect system info; Combine algorithm (result); Code generator (combined code)]

  7. Graph Construction [Figure: graph with nodes Thread 1 and Thread 2. Node labels: Cycles = 10M (cycles), Working-set = 10K. Edge labels: Synchronization Cost, Communication Cost]

  8. Communication Cost • Intuition: a STORE instruction causes a coherence miss in the cache • Log memory accesses (LD, ST) per thread
     Thread 1 | Address | LD Count | ST Count
     … | … | … | …
     Thread 1 | 0x00001234 | 5 | 10
     Thread 1 | 0x00001338 | 4 | 9
     Thread 1 | 0x00004000 | 7 | 7
     Thread 2 | Address | LD Count | ST Count
     … | … | … | …
     Thread 2 | 0x00001234 | 0 | 7
     Thread 2 | 0x00002000 | 4 | 4
     Thread 2 | 0x00004000 | 3 | 8
     0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
     0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
     Total Communication Cost: 12 + 17 = 29 (the weight of the graph edge between Thread 1 and Thread 2)

  9. Combining Algorithm • Kernighan-Lin (KL) Graph Partitioning Heuristic – Goal: minimize execution cycles – Precondition: combined threads ≤ cores
     [Figure: example graph with nodes A to H to be split across 2 cores; visible labels include weights of 60 and 10 and the legend "= 100 Cycles"]
     Partition 1 | Partition 2 | Partition 1 Cycle Estimation | Partition 2 Cycle Estimation | Move From | Move Node
     B C D G | A E F H | 210 | 220 | 2 | A
     A B C D G | E F H | 130 | 120 | 1 | G
     A B C D | E F G H | 40 | 40 | 1 | D
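To make the idea of moving one node at a time between partitions concrete, here is a minimal Python sketch. It is a simplified greedy refinement in the spirit of KL, not the full Kernighan-Lin heuristic and not the authors' implementation. The cost model (a partition's estimate is the sum of its node cycles plus the weight of every edge leaving it) and the names `estimate`, `slowest`, and `refine` are assumptions made for illustration, and the example graph is invented rather than taken from the slide.

```python
from typing import Dict, FrozenSet, List, Set

Cycles = Dict[str, int]            # node -> execution cycles
Edges = Dict[FrozenSet[str], int]  # {u, v} -> communication/synchronization cost


def estimate(part: Set[str], cycles: Cycles, edges: Edges) -> int:
    """Assumed cost model: node cycles plus the weight of every edge leaving the partition."""
    work = sum(cycles[n] for n in part)
    cut = sum(w for e, w in edges.items() if len(e & part) == 1)
    return work + cut


def slowest(parts: List[Set[str]], cycles: Cycles, edges: Edges) -> int:
    """Estimated cycles of the slowest partition."""
    return max(estimate(p, cycles, edges) for p in parts)


def refine(parts: List[Set[str]], cycles: Cycles, edges: Edges) -> List[Set[str]]:
    """Repeatedly apply the single-node move that most lowers the slowest partition."""
    parts = [set(p) for p in parts]
    while True:
        current = slowest(parts, cycles, edges)
        best_move = None
        for src in range(len(parts)):
            if len(parts[src]) == 1:
                continue  # never empty a partition
            for node in sorted(parts[src]):
                for dst in range(len(parts)):
                    if dst == src:
                        continue
                    # Try the move, measure it, then undo it.
                    parts[src].remove(node)
                    parts[dst].add(node)
                    cost = slowest(parts, cycles, edges)
                    parts[dst].remove(node)
                    parts[src].add(node)
                    if cost < current:
                        current, best_move = cost, (node, src, dst)
        if best_move is None:
            return parts  # no single move improves the estimate
        node, src, dst = best_move
        parts[src].remove(node)
        parts[dst].add(node)


# Invented example: A-B and C-D communicate heavily; the start splits both pairs.
cycles = {"A": 60, "B": 60, "C": 60, "D": 60}
edges = {frozenset({"A", "B"}): 100, frozenset({"C", "D"}): 100}
result = refine([{"A", "C"}, {"B", "D"}], cycles, edges)
print(result, slowest(result, cycles, edges))
```

One partition per core keeps the slide's precondition (combined threads ≤ cores). Unlike full KL, this sketch never accepts a temporarily worse move, so it can stop in a local minimum.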

  10. Thread Combining • Application – Replace thread APIs with wrapper functions • Dynamic compiler – Translation – Code cache • Wrapper function for thread creation: vm_thread_create() – Target to combine? No: create a normal thread. Yes: create a user thread • User threads are context switched by the dynamic compiler • User threads are executed serially in a real thread
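The decision made in the thread-creation wrapper can be sketched in a few lines of Python. Only the name `vm_thread_create` comes from the slide; the `group` parameter, the `_groups` table, and `start_combined` are hypothetical. The sketch also simplifies heavily: each user thread runs to completion before the next one starts, whereas the slide describes context switching by a dynamic compiler, and nothing here pins the real thread to a core.

```python
import threading
from typing import Callable, Dict, List, Optional

# Hypothetical bookkeeping: group id -> user threads waiting to run serially.
_groups: Dict[int, List[Callable[[], None]]] = {}


def vm_thread_create(fn: Callable[[], None],
                     group: Optional[int] = None) -> Optional[threading.Thread]:
    """Wrapper for thread creation.

    group is None -> not a target to combine: start a normal thread.
    group is set  -> target to combine: record fn as a user thread.
    """
    if group is None:
        thread = threading.Thread(target=fn)
        thread.start()
        return thread
    _groups.setdefault(group, []).append(fn)
    return None


def _run_serially(fns: List[Callable[[], None]]) -> None:
    # Simplification: run each user thread to completion, in creation order.
    for fn in fns:
        fn()


def start_combined() -> List[threading.Thread]:
    """Start one real thread per group and hand it that group's user threads."""
    batches = list(_groups.values())
    _groups.clear()
    threads = [threading.Thread(target=_run_serially, args=(batch,))
               for batch in batches]
    for thread in threads:
        thread.start()
    return threads


order: List[int] = []
for i in range(3):
    vm_thread_create(lambda i=i: order.append(i), group=0)
for real_thread in start_combined():
    real_thread.join()
print(order)  # [0, 1, 2]
```

Running user threads to completion would deadlock if threads in one group wait on each other, which is one reason a real system needs the context switching the slide mentions.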

  11. Experimental Setup • 2 cores – Intel Core 2 Duo 6600 (2.4 GHz) • 4 cores – Intel Core 2 Quad Q6600 (2.4 GHz) • 8 cores – 2 quad-core CPUs with SMT – Intel Xeon E5520 (2.26 GHz) • 16 cores (logical) – 2 quad-core CPUs with SMT and Hyper-Threading – Intel Xeon E5520 (2.26 GHz)

  12. Results [Figure: speedup (y-axis, ticks 0.9 to 1.2) versus core number (2, 4, 8, 16) for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions; data labels 1.31, 1.66, 2.36, and 1.83 appear above the axis range]

  13. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Parallel Transpose – Thread 1, Thread 2, … – Input matrix: 128 cols distance – Output matrix: 128 rows distance

  14. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K

  15. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • 64 Byte block

  16. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K

  17. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • 512 Byte distance • 128 rows distance

  18. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • Two 8KB regions (128 * 64byte each), each labeled "iterates 128 times"

  19. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • Two 8KB regions (128 * 64byte each)

  20. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • Two 8KB regions (128 * 64byte each) • WRITE HIT!

  21. Result Analysis - Transpose • Transpose m * n matrix to n * m (rows 1 2 3 / 4 5 6 become rows 1 4 / 2 5 / 3 6) • Intel Nehalem, Core 0 – L1 private (32K) – L2 private (256K) – L3 shared (8M) • Input matrix 16K x 16K, output matrix 16K x 16K • Two 8KB regions (128 * 64byte each) • WRITE HIT! • Working set fits into L1 cache (no capacity miss!)
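The sizes quoted across the transpose slides can be checked with a short calculation. The sketch below uses only numbers shown on the slides, plus two assumptions: that the working set is the sum of the two 8KB regions, and that the "512 Byte distance" corresponds to the "128 cols distance", which would imply 4-byte matrix elements. All variable names are illustrative.

```python
LINE_BYTES = 64          # "64 Byte Block"
LINES_PER_REGION = 128   # "128 * 64byte"
L1_BYTES = 32 * 1024     # "L1 private (32K)"
STRIDE_BYTES = 512       # "512 Byte distance"
STRIDE_COLS = 128        # "128 cols distance"

# One region is 128 cache blocks of 64 bytes each.
region_bytes = LINES_PER_REGION * LINE_BYTES

# Assumption: the working set is one such region in the input matrix
# plus one in the output matrix.
working_set_bytes = 2 * region_bytes

# Assumption: the 512-byte stride spans 128 columns, giving the element size.
element_bytes = STRIDE_BYTES // STRIDE_COLS

fits_in_l1 = working_set_bytes <= L1_BYTES

print(region_bytes, working_set_bytes, element_bytes, fits_in_l1)
# 8192 16384 4 True
```

Under these assumptions the 16KB working set takes half of the 32K L1 cache, consistent with the slide's "No Capacity Miss!" note.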

  22. Summary • Choosing the optimal number of threads is hard • Thread Tailor eases the pain – Graph representation – Combine threads at runtime

  23. Thank you
