Thread Tailor
Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications
Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark
Motivation
Hardware Trends: Put more cores in a single chip
NO! More …
[Figure: timeline from 2009 to 201X (lots of threads). Development: Compile → Binary (… > 128 Thr.). Distribution: Distribute → Thread Tailor (Combine Threads) → New Binary (… 16 Thr.).]
[Figure: Thread Tailor framework. Input: Binary (… > 128 Thr.). Profiler: Instrument → Instrumented Codes → Profile Info. → Graphs. Thread Tailor: Collect System Info. → Run Combine Algorithm → Code Generator → Combined Codes. Output: Result.]
Communication graph
Node (one per thread): Cycles = 10M, Working-set = 10K
Edge (Thread 1, Thread 2): Synchronization Cost (cycles), Communication Cost

Thread 1
Address       LD Count   ST Count
…             …          …
0x00001234    5          10
0x00001338    4          9
…             …          …
0x00004000    7          7
…             …          …

Thread 2
Address       LD Count   ST Count
…             …          …
0x00001234    0          7
0x00002000    4          4
…             …          …
0x00004000    3          8
…             …          …

0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
Total Communication Cost: 12 + 17 = 29

Graph: node 1 and node 2 are connected by an edge with communication cost 29.
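The communication cost arithmetic above can be reproduced with a short sketch. The pairing of terms (loads of one thread against stores of the other, plus stores against stores) is inferred from the slide's numbers rather than stated on it, and the load count of 0 for Thread 2 at 0x00001234 is read off the MIN(10, 0) term. The function name `communication_cost` and the dictionary layout are illustrative assumptions.

```python
def communication_cost(profile_a, profile_b):
    """Sum the per-address cost over addresses touched by both threads.

    Each profile maps address -> (ld_count, st_count).
    Assumed per-address cost, inferred from the slide's numbers:
        MIN(ld_a, st_b) + MIN(st_a, ld_b) + MIN(st_a, st_b)
    """
    total = 0
    for addr in profile_a.keys() & profile_b.keys():
        ld_a, st_a = profile_a[addr]
        ld_b, st_b = profile_b[addr]
        total += min(ld_a, st_b) + min(st_a, ld_b) + min(st_a, st_b)
    return total


# Profiles from the slide: address -> (LD Count, ST Count).
thread1 = {0x00001234: (5, 10), 0x00001338: (4, 9), 0x00004000: (7, 7)}
thread2 = {0x00001234: (0, 7), 0x00002000: (4, 4), 0x00004000: (3, 8)}

print(communication_cost(thread1, thread2))  # 12 + 17 = 29
```

Addresses touched by only one thread (0x00001338 and 0x00002000 here) contribute nothing, which matches the slide's total of 29.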
Partition 1    Partition 2    Partition 1 Cycle Estimation    Partition 2 Cycle Estimation    Move From    Move Node
B C D G        A E F H        210                             220                             2            A
A B C D G      E F H          130                             120                             1            G
A B C D        E F G H        40                              40                              1            D

[Figure: graph with nodes A B C D E F G H and twelve edges, each labeled 60. Legend: 10 = 100 Cycles. 2 Cores.]
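The table above moves one node at a time between two partitions until the cycle estimates stop improving. The sketch below shows one way such a greedy refinement could look. It is not the authors' algorithm: the cost model (node cycles plus the weight of every cut edge, charged to both sides), the example graph (two 4-node cliques with node weight 10 and edge weight 60), and the names `estimate`, `refine`, and `two_cliques` are all assumptions, so the intermediate estimates differ from the slide's 210/220 and 130/120.

```python
from itertools import combinations


def estimate(parts, node_cycles, edges):
    """Assumed model: node cycles plus weights of cut edges, per partition."""
    est = [sum(node_cycles[n] for n in p) for p in parts]
    for (u, v), weight in edges.items():
        if (u in parts[0]) != (v in parts[0]):  # edge crosses the cut
            est[0] += weight
            est[1] += weight
    return est


def refine(parts, node_cycles, edges):
    """Greedily move single nodes while the larger estimate keeps shrinking."""
    parts = [set(parts[0]), set(parts[1])]
    moves = []  # (node, partition it was moved from, 1-based)
    while True:
        best = max(estimate(parts, node_cycles, edges))
        best_move = None
        for node in sorted(parts[0] | parts[1]):
            src = 0 if node in parts[0] else 1
            trial = [set(parts[0]), set(parts[1])]
            trial[src].remove(node)
            trial[1 - src].add(node)
            cost = max(estimate(trial, node_cycles, edges))
            if cost < best:  # strict improvement only, so the loop terminates
                best, best_move = cost, (node, src)
        if best_move is None:
            return parts, moves
        node, src = best_move
        parts[src].remove(node)
        parts[1 - src].add(node)
        moves.append((node, src + 1))


def two_cliques():
    """Hypothetical graph: cliques ABCD and EFGH, 12 edges of weight 60."""
    node_cycles = {n: 10 for n in "ABCDEFGH"}
    edges = {}
    for group in ("ABCD", "EFGH"):
        for u, v in combinations(group, 2):
            edges[(u, v)] = 60
    return node_cycles, edges


node_cycles, edges = two_cliques()
final_parts, moves = refine([set("BCDG"), set("AEFH")], node_cycles, edges)
print(sorted(final_parts[0]), sorted(final_parts[1]), moves)
```

Under these assumptions the sketch moves A out of partition 2 and then G out of partition 1, ending at A B C D and E F G H with an estimate of 40 on each side.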
[Figure: code generation. Application → Dynamic Compiler (Translation, Code Cache) → vm_thread_create() → Target to combine? Yes: Create User Thread. No: Create Normal Thread. User Thread, User Thread, … run inside one Thread.]
– Replace Thread APIs with Wrapper Functions
– Wrapper Function for Thread Creation
– Serially Execute User Threads in Real Thread
– Context Switched by Dynamic Compiler
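The wrapper idea above can be illustrated with a small sketch. Only the name `vm_thread_create` comes from the slide. Its signature, the `CombinedThread` class, and the use of Python generators (where each `yield` stands in for a point at which the dynamic compiler would switch context) are assumptions made for illustration, not the actual implementation.

```python
import threading
from collections import deque


class CombinedThread:
    """One real thread that serially executes several user threads."""

    def __init__(self):
        self._ready = deque()

    def add_user_thread(self, body, *args):
        # body is a generator function; each yield marks a possible
        # context-switch point in this sketch.
        self._ready.append(body(*args))

    def _schedule(self):
        # Round-robin: run a user thread up to its next yield, then requeue.
        while self._ready:
            user_thread = self._ready.popleft()
            try:
                next(user_thread)
                self._ready.append(user_thread)
            except StopIteration:
                pass  # this user thread has finished

    def start(self):
        real = threading.Thread(target=self._schedule)
        real.start()
        return real


def vm_thread_create(body, args, combine_into=None):
    """Hypothetical wrapper that replaces the thread-creation API."""
    if combine_into is not None:  # "Target to combine?" Yes: user thread
        combine_into.add_user_thread(body, *args)
        return None

    def run_to_completion():  # No: normal thread
        for _ in body(*args):
            pass

    real = threading.Thread(target=run_to_completion)
    real.start()
    return real


def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i))
        yield  # stand-in for a context-switch point


def demo():
    log = []
    combined = CombinedThread()
    vm_thread_create(worker, ("A", 2, log), combine_into=combined)
    vm_thread_create(worker, ("B", 2, log), combine_into=combined)
    combined.start().join()
    return log


print(demo())  # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

Both workers run inside a single real thread, so the application still sees two threads while the system schedules only one.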
– Intel Core 2 Duo 6600 (2.4 GHz)
– Intel Core 2 Quad Q6600 (2.4 GHz)
– 2 Quad-core CPUs with SMT: Intel Xeon E5520 (2.26 GHz)
– 2 Quad-core CPUs with SMT and HyperThreading: Intel Xeon E5520 (2.26 GHz)
[Figure: Speedup (y-axis, 0.9 to 1.2) by Core Number (2, 4, 8, 16) for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions. Bars that exceed the axis are labeled 1.31, 1.66, 2.36, and 1.83.]
[Figure: transpose. Input Matrix (16K x 16K) with elements 1 2 3 4 5 6 becomes Output Matrix (16K x 16K) with elements 1 4 2 5 3 6. Thread 1 and Thread 2 are 128 cols distance apart in the input and 128 rows distance apart in the output.]
[Figure: Intel Nehalem, Core 0. L1 private (32K), L2 private (256K), L3 Shared (8M).]
– 64 Byte Block
– 512 Byte distance; 128 rows distance
– … iterates 128 times: 8KB (128 * 64byte)
– WRITE HIT!
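The cache arithmetic on these slides can be checked with a few lines. The 64-byte block, the 128 iterations, and the 32K L1 come from the slides. The 4-byte element size is an assumption, chosen because it makes 128 columns equal the 512 Byte distance, and counting one 8KB footprint each for the input and output matrices is likewise an interpretation. All names are illustrative.

```python
CACHE_LINE_BYTES = 64       # "64 Byte Block" on the slide
L1_BYTES = 32 * 1024        # "L1 private (32K)"
ELEMENT_BYTES = 4           # assumption: 128 cols * 4 bytes = 512 bytes


def column_distance_bytes(cols, element_bytes=ELEMENT_BYTES):
    """Byte distance between two threads that start `cols` columns apart."""
    return cols * element_bytes


def touched_bytes(iterations, line_bytes=CACHE_LINE_BYTES):
    """Footprint when each iteration touches one distinct cache line."""
    return iterations * line_bytes


distance = column_distance_bytes(128)   # 512 bytes
footprint = touched_bytes(128)          # 8192 bytes = 8KB
fits_in_l1 = 2 * footprint <= L1_BYTES  # input and output footprints together

print(distance, footprint, fits_in_l1)  # 512 8192 True
```

Under these assumptions the combined footprint is 16KB, which stays within the private L1 and is consistent with the slide's WRITE HIT! annotation.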