Liberty Queues for EPIC Architectures
Thomas Jablin, Yun Zhang, James A. Jablin, Jialu Huang, Hanjun Kim, & David I. August
The Liberty Research Group, Princeton University
Comparison: IMT, PMT, CMT
[Figure: schedules on Core 1 and Core 2 over cycles 1-5 for IMT, PMT, and CMT]
lat(comm) = 1:
IMT: 1 iter/cycle
PMT: 1 iter/cycle
CMT: 1 iter/cycle
Comparison: IMT, PMT, CMT
[Figure: schedules on Core 1 and Core 2 over cycles 1-5 for IMT, PMT, and CMT; the CMT schedule shows only iterations 1-3]
lat(comm) = 1:
IMT: 1 iter/cycle
PMT: 1 iter/cycle
CMT: 0.5 iter/cycle
Bandwidth: 40 MB/s, 430 MB/s, 2 GB/s
Technology: Lamport '83, DBLS '07, FastForward '08, MCRB '09, Liberty '10
Applications: DSWP, Transactional Memory, StreamIt, Line-Rate Network Traffic Monitoring, SRMT, Multithreaded Assertions
Lamport Queues
Tail: CPU 0 Shared, CPU 1 Shared
4 3 2 1
Head: CPU 0 Exclusive, CPU 1 Invalid
Bandwidth: 40 MB/s
Lamport Queues
Tail: CPU 0 Invalid, CPU 1 Exclusive
4 3 2 1
Head: CPU 0 Shared, CPU 1 Shared
Every producer-consumer pair produces a cache ping-pong!
Bandwidth: 40 MB/s
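The ping-pong noted above can be seen in a minimal C++ sketch of a Lamport-style single-producer/single-consumer ring buffer. This is an illustration under stated assumptions, not code from the presentation: the names `LamportQueue`, `produce`, `consume`, and `lamport_demo` are hypothetical, and the memory orderings are one reasonable choice. The point to notice is that every `produce` reads the consumer-written head and every `consume` reads the producer-written tail, so both index cache lines keep moving between the two CPUs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of a Lamport-style SPSC ring buffer.
// One slot is always left empty, so the usable capacity is N - 1.
template <typename T, std::size_t N>
class LamportQueue {
    static_assert(N >= 2, "need at least two slots");

public:
    // Call from the producer thread only.
    bool produce(const T& value) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        // Reads the index the consumer writes: this line bounces between CPUs.
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = value;
        tail_.store(next, std::memory_order_release);  // publish the element
        return true;
    }

    // Call from the consumer thread only.
    bool consume(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        // Reads the index the producer writes on every call.
        if (h == tail_.load(std::memory_order_acquire)) return false;  // empty
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);  // free the slot
        return true;
    }

private:
    T buf_[N];
    std::atomic<std::size_t> head_{0};  // written by the consumer
    std::atomic<std::size_t> tail_{0};  // written by the producer
};

// Single-threaded smoke check of FIFO behavior; a real use would run
// produce and consume on different threads.
inline void lamport_demo() {
    LamportQueue<int, 4> q;
    int v = 0;
    assert(q.produce(7));
    assert(q.consume(v));
    assert(v == 7);
}
```

The sketch shows only the sharing pattern; it does not attempt to reproduce the 40 MB/s figure, which depends on the measured hardware.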
MCRB and DBLS
Tail: CPU 0 Invalid, CPU 1 Exclusive
4 3 2 1
Head: CPU 0 Exclusive, CPU 1 Invalid
Cached Head: CPU 1 Exclusive
Cached Tail: CPU 0 Exclusive
Bandwidth: 430 MB/s
MCRB and DBLS
Tail: CPU 0 Invalid, CPU 1 Exclusive
5 4 3 2 1
Head: CPU 0 Exclusive, CPU 1 Invalid
Cached Head: CPU 1 Exclusive
Cached Tail: CPU 0 Exclusive
Bandwidth: 430 MB/s
MCRB and DBLS
Tail: CPU 0 Invalid, CPU 1 Exclusive
7 6 5 4 3 2 1
Head: CPU 0 Exclusive, CPU 1 Invalid
Cached Head: CPU 1 Exclusive
Cached Tail: CPU 0 Exclusive
Bandwidth: 430 MB/s
MCRB and DBLS
Tail: CPU 0 Invalid, CPU 1 Exclusive
7 6 5 4 3 2 1
Head: CPU 0 Shared, CPU 1 Shared
Cached Head: CPU 1 Exclusive
Cached Tail: CPU 0 Exclusive
Caching eliminates ping-ponging!
Bandwidth: 430 MB/s
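The effect of the cached head and cached tail can be sketched in C++ as follows. This is a hedged illustration of the general idea, not the MCRB or DBLS implementation: the class name `CachedIndexQueue`, the counters `head_reads` and `tail_reads`, and the 64-byte cache-line alignment are all assumptions introduced here. Each side keeps a private copy of the other side's index and re-reads the shared index only when its private copy suggests the queue is full (producer) or empty (consumer).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical sketch of an SPSC ring buffer with privately cached indices.
// The counters exist only to make the number of shared-index reads visible.
template <typename T, std::size_t N>
class CachedIndexQueue {
    static_assert(N >= 2, "need at least two slots");

public:
    // Call from the producer thread only.
    bool produce(const T& value) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == cached_head_) {
            // Private copy says "full": refresh it from the shared head.
            cached_head_ = head_.load(std::memory_order_acquire);
            ++head_reads_;
            if (next == cached_head_) return false;  // really full
        }
        buf_[t] = value;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Call from the consumer thread only.
    bool consume(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == cached_tail_) {
            // Private copy says "empty": refresh it from the shared tail.
            cached_tail_ = tail_.load(std::memory_order_acquire);
            ++tail_reads_;
            if (h == cached_tail_) return false;  // really empty
        }
        out = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return true;
    }

    std::size_t head_reads() const { return head_reads_; }
    std::size_t tail_reads() const { return tail_reads_; }

private:
    T buf_[N];
    // 64 bytes is an assumed cache-line size, used to keep the groups apart.
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the producer
    alignas(64) std::size_t cached_head_ = 0;       // producer-private
    std::size_t head_reads_ = 0;                    // producer-private
    alignas(64) std::size_t cached_tail_ = 0;       // consumer-private
    std::size_t tail_reads_ = 0;                    // consumer-private
};
```

With `N = 8`, filling the empty queue with seven elements triggers no read of the shared head, and draining those seven elements reads the shared tail once, compared with one shared-index read per operation in the Lamport-style queue.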
Liberty Queues
Tail: 4 3 2 1
Head: 7 6 5 4 3 2 1
Bandwidth: 500 MB/s
Liberty Queues
Tail: 5 4 3 2 1
Head: 7 6 5 4 3 2 1
Bandwidth: 500 MB/s
Liberty Queues
Tail: 5 4 3 2 1
Head: 7 6 5 4 3 2 5
Bandwidth: 500 MB/s
Liberty Queues
Tail: 5 4 3 2 1
Head: 7 6 5 5
Prefetching doesn’t help
Bandwidth: 490 MB/s
Liberty Queues
Tail: 8 7 6 5 4 3 2 1
Head: 4 3 2 5
Bandwidth: 690 MB/s
Liberty Queues
Tail: 8 7 6 5 4 3 2 1
Head: 8 7 6 5
Bandwidth: 2170 MB/s
Liberty Queue Bandwidth
[Chart: bandwidth in GB/s, axis from 0.5 to 4.5, with series labeled 64-bit and 128-bit]