Liberty Queues for EPIC Architectures



SLIDE 1

Liberty Queues for EPIC Architectures

Thomas Jablin, Yun Zhang, James A. Jablin, Jialu Huang, Hanjun Kim, & David I. August

The Liberty Research Group

Princeton University

SLIDE 2

Comparison: IMT, PMT, CMT

[Figure: per-cycle schedules (cycles 1 to 5) of operations C, LD, and X on Core 1 and Core 2 under each scheme; lat(comm) = 1]

IMT: 1 iter/cycle

PMT: 1 iter/cycle

CMT: 1 iter/cycle

SLIDE 3

Comparison: IMT, PMT, CMT

[Figure: per-cycle schedules (cycles 1 to 5) of operations C, LD, and X on Core 1 and Core 2 under each scheme; labeled lat(comm) = 1 as on the previous slide]

IMT: 1 iter/cycle

PMT: 1 iter/cycle

CMT: 0.5 iter/cycle
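
A toy model of this comparison, offered as a hedged sketch rather than the presenters' own analysis. It assumes that IMT and PMT sustain 1 iteration per cycle on two cores regardless of communication latency (IMT has no cross-core communication, and PMT only communicates forward along the pipeline), while CMT pays the communication latency once per iteration. Under that assumption, the 0.5 iter/cycle shown for CMT corresponds to a latency of 2 cycles; that latency value is an inference, not something the slide text states. The function name `steady_state_rate` is hypothetical.

```python
def steady_state_rate(scheme, comm_latency):
    """Iterations per cycle on two cores in a simplified steady-state model.

    Assumptions (not from the slides): each core can start one operation per
    cycle, and comm_latency is the number of cycles before a value produced
    on one core can be used on the other.
    """
    if comm_latency < 1:
        raise ValueError("comm_latency must be at least 1 cycle")
    if scheme in ("IMT", "PMT"):
        # No cross-core dependence (IMT) or a one-way dependence (PMT):
        # latency delays the first result but not the steady-state rate.
        return 1.0
    if scheme == "CMT":
        # The cross-core dependence is on the critical path of every iteration.
        return 1.0 / comm_latency
    raise ValueError("unknown scheme: " + scheme)


for latency in (1, 2):
    rates = {s: steady_state_rate(s, latency) for s in ("IMT", "PMT", "CMT")}
    print("lat(comm) =", latency, rates)
```

In this model only CMT is sensitive to communication latency, which is the contrast the two comparison slides draw.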


SLIDE 4

Bandwidth: 40 MB/s, 430 MB/s, 2 GB/s

Technology: Lamport '83, DBLS '07, FastForward '08, MCRB '09, Liberty '10

Applications: DSWP, Transactional Memory, StreamIt, Line-Rate Network Traffic Monitoring, SRMT, Multithreaded Assertions

SLIDE 5

Lamport Queues

Tail CPU 0: Shared CPU 1: Shared

4 3 2 1

Head CPU 0: Exclusive CPU 1: Invalid

Bandwidth: 40 MB/s

SLIDE 6

Lamport Queues

Tail CPU 0: Invalid CPU 1: Exclusive

4 3 2 1

Head CPU 0: Shared CPU 1: Shared

Every produce-consume pair causes a cache-line ping-pong!

Bandwidth: 40 MB/s
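
A minimal single-threaded sketch of a Lamport-style single-producer/single-consumer ring buffer, to show why the head and tail lines bounce between caches. It is a model of the index logic only, not a concurrent implementation: it counts how often each side has to read the index owned by the other side. The class name `LamportQueue` and the counter `remote_index_reads` are hypothetical, introduced for illustration.

```python
class LamportQueue:
    """Model of a single-producer/single-consumer ring buffer (Lamport style).

    Runs on one thread and only counts reads of the other side's index;
    on real hardware each such read can pull the cache line away from
    the core that last wrote it.
    """

    def __init__(self, capacity):
        # One slot is left unused so that full and empty can be told apart.
        self.buf = [None] * (capacity + 1)
        self.head = 0  # next slot to read; written only by the consumer
        self.tail = 0  # next slot to write; written only by the producer
        self.remote_index_reads = 0

    def produce(self, value):
        nxt = (self.tail + 1) % len(self.buf)
        self.remote_index_reads += 1  # producer reads the consumer's head
        if nxt == self.head:
            return False  # queue is full
        self.buf[self.tail] = value
        self.tail = nxt  # publish the new element
        return True

    def consume(self):
        self.remote_index_reads += 1  # consumer reads the producer's tail
        if self.head == self.tail:
            return None  # queue is empty (so None cannot be a payload here)
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return value


q = LamportQueue(4)
for item in (1, 2, 3, 4):
    q.produce(item)
drained = [q.consume() for _ in range(4)]
print(drained, q.remote_index_reads)
```

Every produce and every consume touches the other side's index, so the number of remote index reads grows one-for-one with the number of operations.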

SLIDE 8

MCRB and DBLS

Tail CPU 0: Invalid CPU 1: Exclusive

4 3 2 1

Head CPU 0: Exclusive CPU 1: Invalid

Cached Head CPU 1: Exclusive

Cached Tail CPU 0: Exclusive

Bandwidth: 430 MB/s

SLIDE 9

MCRB and DBLS

Tail CPU 0: Invalid CPU 1: Exclusive

5 4 3 2 1

Head CPU 0: Exclusive CPU 1: Invalid

Cached Head CPU 1: Exclusive

Cached Tail CPU 0: Exclusive

Bandwidth: 430 MB/s

SLIDE 10

MCRB and DBLS

Tail CPU 0: Invalid CPU 1: Exclusive

7 6 5 4 3 2 1

Head CPU 0: Exclusive CPU 1: Invalid

Cached Head CPU 1: Exclusive

Cached Tail CPU 0: Exclusive

Bandwidth: 430 MB/s

SLIDE 11

MCRB and DBLS

Tail CPU 0: Invalid CPU 1: Exclusive

7 6 5 4 3 2 1

Head CPU 0: Shared CPU 1: Shared

Cached Head CPU 1: Exclusive

Cached Tail CPU 0: Exclusive

Bandwidth: 430 MB/s

SLIDE 12

Tail CPU 0: Invalid CPU 1: Exclusive

7 6 5 4 3 2 1

Head CPU 0: Shared CPU 1: Shared

Cached Head CPU 1: Exclusive

Cached Tail CPU 0: Exclusive

Caching eliminates ping-ponging!

Bandwidth: 430 MB/s
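
A hedged sketch of the cached-index idea suggested by the "Cached Head" and "Cached Tail" labels: each side keeps a private copy of the other side's index and re-reads the shared index only when the private copy says the queue looks full or empty. MCRB and DBLS differ in their details, which these slides do not spell out; this single-threaded model shows only the common idea, and the names `CachedIndexQueue` and `remote_index_reads` are hypothetical.

```python
class CachedIndexQueue:
    """Model of a single-producer/single-consumer ring buffer with cached indices.

    Single-threaded; it counts reads of the other side's shared index to
    show how rarely they are needed compared with reading it on every call.
    """

    def __init__(self, capacity):
        self.buf = [None] * (capacity + 1)  # one slot stays unused
        self.head = 0         # shared; written only by the consumer
        self.tail = 0         # shared; written only by the producer
        self.cached_head = 0  # producer's private, possibly stale copy of head
        self.cached_tail = 0  # consumer's private, possibly stale copy of tail
        self.remote_index_reads = 0

    def produce(self, value):
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.cached_head:
            # Looks full according to the stale copy: refresh it once.
            self.cached_head = self.head
            self.remote_index_reads += 1
            if nxt == self.cached_head:
                return False  # really full
        self.buf[self.tail] = value
        self.tail = nxt
        return True

    def consume(self):
        if self.head == self.cached_tail:
            # Looks empty according to the stale copy: refresh it once.
            self.cached_tail = self.tail
            self.remote_index_reads += 1
            if self.head == self.cached_tail:
                return None  # really empty (so None cannot be a payload here)
        value = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return value


q = CachedIndexQueue(7)
for item in range(1, 8):
    q.produce(item)
drained = [q.consume() for _ in range(7)]
print(drained, q.remote_index_reads)
```

A stale copy is safe because each index only moves forward, so the private copy can only underestimate the free space (for the producer) or the available data (for the consumer).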

SLIDE 13

Tail

4 3 2 1

Head

7 6 5 4 3 2 1

Liberty Queues

Bandwidth: 500 MB/s

SLIDE 14

Tail

5 4 3 2 1

Head

7 6 5 4 3 2 1

Liberty Queues

Bandwidth: 500 MB/s

SLIDE 15

Tail

5 4 3 2 1

Head

7 6 5 4 3 2 5

Liberty Queues

Bandwidth: 500 MB/s

SLIDE 16

Tail

5 4 3 2 1

Head

7 6 5 5

Liberty Queues

Prefetching doesn't help

Bandwidth: 490 MB/s

SLIDE 17

Tail

8 7 6 5 4 3 2 1

Head

4 3 2 5

Liberty Queues

Bandwidth: 690 MB/s

SLIDE 18

Tail

8 7 6 5 4 3 2 1

Head

8 7 6 5

Liberty Queues

Bandwidth: 2170 MB/s
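
The bandwidth figures quoted on the slides can be compared directly. The short calculation below is only arithmetic on those quoted numbers; it assumes they were measured under comparable conditions, which the slides do not state. The dictionary labels and the helper `speedup` are hypothetical names.

```python
# Bandwidths in MB/s as quoted on the slides.
bandwidth_mb_s = {
    "Lamport": 40,
    "MCRB/DBLS": 430,
    "Liberty (first figure)": 500,
    "Liberty (with prefetching)": 490,
    "Liberty (next figure)": 690,
    "Liberty (final figure)": 2170,
}


def speedup(new, old):
    """Ratio of two quoted bandwidths, looked up by label."""
    return bandwidth_mb_s[new] / bandwidth_mb_s[old]


for label in bandwidth_mb_s:
    print(label, round(speedup(label, "Lamport"), 2))
```

On these numbers the final Liberty figure is about 54 times the Lamport figure and about 5 times the MCRB/DBLS figure.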

SLIDE 19

Liberty Queue Bandwidth

[Chart: bandwidth in GB/s (axis ticks from 0.5 to 4.5) for two series, 64-bit and 128-bit]