

SLIDE 1

Thread Tailor

Dynamically Weaving Threads Together for Efficient, Adaptive Parallel Applications

Janghaeng Lee, Haicheng Wu, Madhumitha Ravichandran, Nathan Clark

SLIDE 2

Motivation

  • Hardware trends

– Put more cores on a single chip

  • CPU-intensive programs

– Exploit thread-level parallelism


Do more threads always win?

NO!

SLIDE 3

Optimal Number of Threads

  • Too many threads

– More synchronization
– More contention for system resources

  • Too few threads

– Resource underutilization

  • Who can decide the number?

– Not the programmer

SLIDE 4

Why NOT?

  • The input changes

– Varying working-set sizes

  • The system changes

– Varying available resources

  • The hardware changes

– Varying L2/L3 cache structures, sizes, etc.

The decision must be made at runtime
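As a concrete illustration of why this is a runtime decision, a program can query its actual environment only at startup. A minimal sketch in C: sysconf(_SC_NPROCESSORS_ONLN) is POSIX, while the _SC_LEVEL*_CACHE_SIZE queries are a glibc extension and may be unsupported elsewhere.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Cores actually online right now; this varies across machines
         * and can even vary across runs of the same binary. */
        long cores = sysconf(_SC_NPROCESSORS_ONLN);

        /* Cache sizes vary with the hardware (glibc extension; may
         * return 0 or -1 where unsupported). */
        long l2 = sysconf(_SC_LEVEL2_CACHE_SIZE);
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);

        printf("cores: %ld  L2: %ld B  L3: %ld B\n", cores, l2, l3);
        return 0;
    }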

SLIDE 5

Proposal

  • Combining Threads

– Group Several Threads into a Single Thread

  • Threads in the same group execute serially
  • Executed on the SAME core

[Diagram: the programmer says "OK, I will create lots of threads"; the compiled binary spawns > 128 threads; after distribution, Thread Tailor combines threads, producing a new binary that runs 16 threads.]
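A minimal sketch of the grouping idea, using hypothetical names (struct group, run_group) rather than Thread Tailor's actual runtime: several logical thread bodies execute serially inside one real thread, and therefore on one core.

    #include <pthread.h>
    #include <stdio.h>

    /* A group of logical threads that combining would fuse: their
     * bodies run one after another inside a single real thread. */
    typedef void *(*thread_fn)(void *);

    struct group {
        thread_fn fns[8];
        void     *args[8];
        int       count;
    };

    /* The single real thread: executes every grouped body serially. */
    static void *run_group(void *p) {
        struct group *g = p;
        for (int i = 0; i < g->count; i++)
            g->fns[i](g->args[i]);
        return NULL;
    }

    static void *worker(void *arg) {
        printf("logical thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void) {
        struct group g = { .fns  = { worker, worker },
                           .args = { (void *)1, (void *)2 },
                           .count = 2 };
        pthread_t t;
        pthread_create(&t, NULL, run_group, &g);  /* one kernel thread for the group */
        pthread_join(t, NULL);
        return 0;
    }

Running each body to completion like this only works when grouped threads never wait on each other; Thread Tailor instead context-switches user-level threads (SLIDE 10) so grouped threads that synchronize still make progress.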

SLIDE 6

Details

[Pipeline diagram — Development: the application is instrumented, and running the instrumented code under the profiler produces profile info and the thread graphs. Distribution: the binary (> 128 threads) ships together with the graphs; on the user's machine, Thread Tailor collects system info, runs the combining algorithm, and its code generator emits the combined code.]

SLIDE 7

Graph Construction

[Graph: each thread is a node annotated with its profiled cycles (e.g., Cycles = 10M) and working-set size (e.g., Working-set = 10K); the edge between Thread 1 and Thread 2 carries a synchronization cost (in cycles) and a communication cost.]
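A plausible in-memory form of this graph, with field names that are assumptions for illustration rather than the paper's code:

    /* Node and edge records for the thread graph. */
    struct thread_node {
        unsigned long cycles;       /* profiled execution cycles, e.g. 10M */
        unsigned long working_set;  /* bytes touched, e.g. 10K             */
    };

    struct thread_edge {
        int thread_a, thread_b;     /* endpoints (thread indices)          */
        unsigned long sync_cost;    /* synchronization cost, in cycles     */
        unsigned long comm_cost;    /* communication (coherence) cost      */
    };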

SLIDE 8

Communication Cost

  • Intuition: a STORE instruction in one thread causes coherence misses in the other thread's cache
  • Log memory accesses per thread

Thread 1 log:                            Thread 2 log:

Address       LD Count   ST Count       Address       LD Count   ST Count
…             …          …              …             …          …
0x00001234    5          10             0x00001234    0          7
0x00001338    4          9              0x00002000    4          4
…             …          …              …             …          …
0x00004000    7          7              0x00004000    3          8
…             …          …              …             …          …

Shared address 0x00001234: MIN(5, 7) + MIN(10, 0) + MIN(10, 7) = 12
Shared address 0x00004000: MIN(7, 8) + MIN(7, 3) + MIN(7, 8) = 17
Total communication cost (edge weight between threads 1 and 2): 12 + 17 = 29
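The arithmetic above is consistent with a per-address rule of MIN(LD1, ST2) + MIN(ST1, LD2) + MIN(ST1, ST2): every store by one thread can invalidate a line the other thread loads or stores. A sketch inferred from the example numbers, not the paper's implementation:

    /* Per-address communication cost between two threads. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    unsigned long comm_cost(unsigned long ld1, unsigned long st1,
                            unsigned long ld2, unsigned long st2) {
        return MIN(ld1, st2) + MIN(st1, ld2) + MIN(st1, st2);
    }

    /* comm_cost(5, 10, 0, 7) == 5 + 0 + 7 == 12   (0x00001234)
     * comm_cost(7,  7, 3, 8) == 7 + 3 + 7 == 17   (0x00004000)
     * edge weight: 12 + 17 == 29                               */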

SLIDE 9

Combining Algorithm

  • Kernighan-Lin (KL) graph partitioning heuristic

– Goal: minimize execution cycles
– Precondition: combined threads ≤ cores

[Example graph: eight thread nodes A–H joined by edges weighted 60, annotated "10 = 100 Cycles", to be partitioned across 2 cores.]

KL iteration trace:

Partition 1   Partition 2   P1 cycle est.   P2 cycle est.   Move from   Move node
B C D G       A E F H       210             220             2           A
A B C D G     E F H         130             120             1           G
A B C D       E F G H       40              40              1           D
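A greedy simplification of one KL-style pass, under stated assumptions: `estimate` stands in for the cycle-estimation model behind the trace above, and the balance and threads-per-core constraints of the real heuristic are omitted for brevity.

    /* Tentatively move each thread node to the other partition and keep
     * the single move that lowers the estimated execution cycles most. */
    int best_move(int n, int part[],
                  unsigned long (*estimate)(const int part[], int n)) {
        unsigned long best_cost = estimate(part, n);
        int best = -1;
        for (int v = 0; v < n; v++) {
            part[v] ^= 1;                         /* tentative move of node v */
            unsigned long c = estimate(part, n);
            if (c < best_cost) { best_cost = c; best = v; }
            part[v] ^= 1;                         /* undo the tentative move  */
        }
        if (best >= 0) part[best] ^= 1;           /* commit the best move     */
        return best;                              /* -1: no improving move    */
    }

Repeating this until best_move returns -1 mirrors the trace: the 210/220 split improves to 130/120 and finally to the balanced 40/40 partition.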

SLIDE 10

Thread Combining

[Diagram: the application's thread-creation calls are translated through a code cache into the dynamic compiler, whose vm_thread_create() asks: target to combine? If yes, it creates a user thread inside an existing real thread; if no, it creates a normal thread.]

  • Replace thread APIs with wrapper functions
  • Wrapper function for thread creation
  • Serially execute user threads inside a real thread
  • Context-switched by the dynamic compiler
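A sketch of the wrapper idea in pthreads terms; `tailor_should_combine` and `tailor_enqueue_user_thread` are hypothetical hooks standing in for the dynamic compiler's machinery, not Thread Tailor's real API.

    #include <pthread.h>
    #include <stdbool.h>

    /* Hypothetical hooks into the dynamic compiler. */
    extern bool tailor_should_combine(void *(*fn)(void *));
    extern int  tailor_enqueue_user_thread(pthread_t *tid,
                                           void *(*fn)(void *), void *arg);

    /* Wrapper substituted for pthread_create: threads the combining
     * algorithm grouped become user-level threads multiplexed onto one
     * real carrier thread; everything else is created normally. */
    int tailor_thread_create(pthread_t *tid, const pthread_attr_t *attr,
                             void *(*fn)(void *), void *arg) {
        if (tailor_should_combine(fn))
            return tailor_enqueue_user_thread(tid, fn, arg);
        return pthread_create(tid, attr, fn, arg);
    }

Because the substitution happens in the code cache at translation time, the application binary itself never changes.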

SLIDE 11

Experimental Setup

  • 2 cores

– Intel Core 2 Duo E6600 (2.4 GHz)

  • 4 cores

– Intel Core 2 Quad Q6600 (2.4 GHz)

  • 8 cores

– 2 quad-core CPUs – Intel Xeon E5520 (2.26 GHz)

  • 16 cores (logical)

– 2 quad-core CPUs with SMT (HyperThreading) – Intel Xeon E5520 (2.26 GHz)

SLIDE 12

Results

[Chart: Speedup (y-axis, 0.9 to 1.2) versus Core Number (2, 4, 8, 16) for fluidanimate, transpose, blackscholes, twister, water_n^2, and swaptions; bars that exceed the axis are labeled 1.31, 1.66, 2.36, and 1.83.]

SLIDE 13

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: input matrix (1 2 3 / 4 5 6) becomes output matrix (1 4 / 2 5 / 3 6).]

  • Parallel transpose

[Figure: Thread 1 and Thread 2 work 128 columns apart in the input matrix and 128 rows apart in the output matrix.]
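For reference, a simplified two-thread transpose in the spirit of this benchmark: it splits the input into two contiguous row bands rather than the 128-column/128-row interleaving on the slide, and uses a smaller matrix than the 16K × 16K case analyzed next.

    #include <pthread.h>

    #define N 2048                        /* illustrative; the slides analyze 16K x 16K */
    static float in[N][N], out[N][N];

    struct band { int row0, row1; };

    /* Each thread transposes one contiguous band of input rows. */
    static void *transpose_band(void *p) {
        struct band *b = p;
        for (int i = b->row0; i < b->row1; i++)
            for (int j = 0; j < N; j++)
                out[j][i] = in[i][j];     /* reads stream along a row, writes stride down a column */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        struct band lo = { 0, N / 2 }, hi = { N / 2, N };
        pthread_create(&t1, NULL, transpose_band, &lo);
        pthread_create(&t2, NULL, transpose_band, &hi);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }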

SLIDE 14

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: the 16K × 16K input and output matrices shown against Core 0's cache hierarchy on Intel Nehalem — L1 private (32K), L2 private (256K), L3 shared (8M).]

SLIDE 15

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: as above, now highlighting one 64-byte cache block.]

SLIDE 16

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: the same matrices and Nehalem cache hierarchy as the previous slide.]

SLIDE 17

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: the two threads' regions are 512 bytes apart in the input matrix (128 columns of 4-byte elements) and 128 rows apart in the output matrix, against the same cache hierarchy.]

SLIDE 18

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: the inner loop iterates 128 times over each matrix, touching 8 KB (128 × 64-byte blocks) in the input and 8 KB in the output.]

SLIDE 19

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: the two 8 KB (128 × 64-byte) regions now resident in Core 0's caches.]

SLIDE 20

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: a subsequent store finds its block already in Core 0's cache — WRITE HIT!]

SLIDE 21

Result Analysis - Transpose

  • Transpose an m × n matrix to n × m

[Figure: WRITE HIT! again on Core 0.]

Working set fits into the L1 cache (no capacity misses!)
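Worked through with the slide's numbers (and assuming the 4-byte elements implied by the 512-byte column distance): each pass touches 128 blocks × 64 B = 8 KB in the input plus another 8 KB in the output, 16 KB in total, comfortably under the 32 KB private L1. Once Thread Tailor runs the two threads serially on Core 0, that working set stays resident, so later accesses hit in the L1 (the WRITE HIT above) instead of missing.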

SLIDE 22

Summary

  • Choosing the optimal number of threads is hard
  • Thread Tailor eases the pain

– Graph representation of threads
– Combines threads at runtime

SLIDE 23

Thank you