slide-1
SLIDE 1

Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures

Jeff Brown Rakesh Kumar Dean Tullsen

UC San Diego ● University of Illinois at Urbana-Champaign

SPAA19 ● June 9, 2007

slide-2
SLIDE 2

Introduction

  • The chip multiprocessor (CMP) era is upon us!
  • Caching complicates writes
  • Cache Coherence ensures caching is done safely
  • Multi-core designs offer new tradeoffs
slide-3
SLIDE 3

Introduction

  • The chip multiprocessor (CMP) era is upon us!
  • Caching complicates writes
  • Cache Coherence ensures caching is done safely
  • Multi-core designs offer new tradeoffs

P M P M

slide-4
SLIDE 4

Introduction

  • The chip multiprocessor (CMP) era is upon us!
  • Caching complicates writes
  • Cache Coherence ensures caching is done safely
  • Multi-core designs offer new tradeoffs

P M P M P P M M

slide-5
SLIDE 5

Background: Directory-based Cache Coherence

  • Directory-based; explicit per-block accounting

– Doesn't rely on broadcasts

  • Directory operation: client/server
slide-6
SLIDE 6

Background: Directory-based Cache Coherence

  • Directory-based; explicit per-block accounting

– Doesn't rely on broadcasts

  • Directory operation: client/server

– Processors request data, permissions

P

slide-7
SLIDE 7

Background: Directory-based Cache Coherence

  • Directory-based; explicit per-block accounting

– Doesn't rely on broadcasts

  • Directory operation: client/server

– Processors request data, permissions
– Directory controllers manage memory access

P Dir

slide-8
SLIDE 8

Background: Directory-based Cache Coherence

  • Directory-based; explicit per-block accounting

– Doesn't rely on broadcasts

  • Directory operation: client/server

– Processors request data, permissions
– Directory controllers manage memory access

P M Dir

slide-10
SLIDE 10

Background: Directory-based Cache Coherence

  • Directory-based; explicit per-block accounting

– Doesn't rely on broadcasts

  • Directory operation: client/server

– Processors request data, permissions
– Directory controllers manage memory access

  • Updates, conflicts

P M P Dir
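
The client/server exchange described on these slides can be sketched roughly as follows. This is a minimal illustration, not the protocol from the talk: the class names, message names, and the simple sharer-set/owner state are all assumptions chosen to show per-block accounting without broadcasts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Message = Tuple[str, int]  # (hypothetical message name, destination node)


@dataclass
class DirectoryEntry:
    """Per-block accounting kept by the directory (assumed layout)."""
    sharers: Set[int] = field(default_factory=set)  # nodes with a read-only copy
    owner: Optional[int] = None                     # node with a writable copy


class Directory:
    """Toy directory controller: answers requests, tracks who caches what."""

    def __init__(self) -> None:
        self.entries: Dict[int, DirectoryEntry] = {}

    def _entry(self, block: int) -> DirectoryEntry:
        return self.entries.setdefault(block, DirectoryEntry())

    def read(self, block: int, requester: int) -> List[Message]:
        """Grant read permission and return the messages the directory sends."""
        entry = self._entry(block)
        if entry.owner == requester:
            return []  # requester already holds the block with write permission
        msgs: List[Message] = []
        if entry.owner is not None:
            # A writer exists: downgrade it to a sharer before granting the read.
            msgs.append(("downgrade", entry.owner))
            entry.sharers.add(entry.owner)
            entry.owner = None
        entry.sharers.add(requester)
        msgs.append(("data", requester))
        return msgs

    def write(self, block: int, requester: int) -> List[Message]:
        """Grant write permission after invalidating every other copy."""
        entry = self._entry(block)
        # Only the recorded sharers are contacted; nothing is broadcast.
        msgs: List[Message] = [
            ("invalidate", s) for s in sorted(entry.sharers) if s != requester
        ]
        if entry.owner is not None and entry.owner != requester:
            msgs.append(("invalidate", entry.owner))
        entry.sharers.clear()
        entry.owner = requester
        msgs.append(("data_exclusive", requester))
        return msgs
```

A write to a block with two recorded sharers produces exactly two invalidations plus the data reply, which is the point of explicit per-block accounting.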

slide-11
SLIDE 11

Background: Historical MP Cache Coherence

  • Distributed directory, memory

P M P M P M P M

slide-12
SLIDE 12

Background: Historical MP Cache Coherence

  • Distributed directory, memory

P M P M P M P M Cache Miss

slide-13
SLIDE 13

Background: Historical MP Cache Coherence

  • Distributed directory, memory

P M P M P M P M Cache Miss "Home Node"

slide-15
SLIDE 15

Background: Historical MP Cache Coherence

  • Distributed directory, memory

P M P M P M P M Cache Miss "Home Node" Data Request

slide-16
SLIDE 16

Background: Historical MP Cache Coherence

  • Distributed directory, memory

P M P M P M P M Cache Miss "Home Node" Data Request Reply
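
As a rough sketch of the sequence in this diagram (cache miss, home node lookup, data request, reply), the following assumes a block-interleaved mapping from address to home node. The slides do not say how the home node is chosen, so `BLOCK_BYTES`, `NUM_NODES`, and the modulo mapping are illustrative assumptions; the four nodes simply mirror the four P/M pairs drawn.

```python
BLOCK_BYTES = 64  # assumed cache-block size
NUM_NODES = 4     # matches the four P/M pairs in the diagram


def home_node(addr: int, num_nodes: int = NUM_NODES,
              block_bytes: int = BLOCK_BYTES) -> int:
    """Map an address to the node whose directory and memory own its block."""
    return (addr // block_bytes) % num_nodes


def service_miss(addr: int, requester: int):
    """Return the (message, source, destination) hops for a simple read miss."""
    home = home_node(addr)
    if home == requester:
        return []  # the block's home is local; no network traffic needed
    return [
        ("data_request", requester, home),
        ("reply", home, requester),
    ]
```

In this model every remote miss costs one round trip to the home node, regardless of whether a closer node already caches the block.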

slide-17
SLIDE 17

Motivation: Multi-core Cache Coherence

M M P M P P P M

slide-18
SLIDE 18

Motivation: Multi-core Cache Coherence

M M P M P P P M Cache Miss

slide-20
SLIDE 20

Motivation: Multi-core Cache Coherence

"Home Node" M M P M P P P M Cache Miss

slide-21
SLIDE 21

Motivation: Multi-core Cache Coherence

"Home Node" Data Request M M P M P P P M Cache Miss

slide-23
SLIDE 23

Motivation: Multi-core Cache Coherence

"Home Node" Reply M M P M P P P M Cache Miss

slide-24
SLIDE 24

Motivation: Multi-core Cache Coherence

M M P M P P P M Additional Sharer

slide-25
SLIDE 25

Motivation: Multi-core Cache Coherence

M M P M P P P M Additional Sharer

  • Multi-core designs present radically different relative latency & bandwidth

slide-26
SLIDE 26

Outline

  • Introduction & Background
  • System Architecture
  • Proximity-Aware Coherence
  • Results
  • Conclusion
slide-27
SLIDE 27

Directory-based Cache Coherence

  • Directory structures
slide-28
SLIDE 28

Directory-based Cache Coherence

  • Directory structures

Main Memory

slide-30
SLIDE 30

Directory-based Cache Coherence

  • Directory structures

– Directory Memory

Main Memory Directory Memory

slide-31
SLIDE 31

Directory-based Cache Coherence

  • Directory structures

– Directory Memory
– Directory Entries

Main Memory Directory Memory

slide-32
SLIDE 32

Directory-based Cache Coherence

  • Directory structures

– Directory Memory
– Directory Entries
– Directory Controller

Main Memory Directory Memory Controller
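
One common way to lay out a directory entry is a full-map bit vector with one presence bit per node. The slides do not state the entry format used in this work, so the sketch below is an assumption for illustration only; `FullMapEntry` and its fields are hypothetical names.

```python
class FullMapEntry:
    """One directory entry: a presence bit per node plus a dirty bit (assumed layout)."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.presence = 0    # bit n set means node n caches the block
        self.dirty = False   # True when a single node holds a modified copy

    def _check(self, node: int) -> None:
        if not 0 <= node < self.num_nodes:
            raise ValueError(f"node {node} out of range")

    def add_sharer(self, node: int) -> None:
        self._check(node)
        self.presence |= 1 << node

    def remove_sharer(self, node: int) -> None:
        self._check(node)
        self.presence &= ~(1 << node)

    def sharers(self):
        """List the nodes whose presence bit is set, in ascending order."""
        return [n for n in range(self.num_nodes) if (self.presence >> n) & 1]
```

The directory controller would read and update entries like this in directory memory as it services requests for the corresponding blocks in main memory.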

slide-33
SLIDE 33

A Traditional Multiprocessor

Core L2 $ Dir Mem Interconnect Core L2 $ Dir Mem

slide-34
SLIDE 34

A Traditional Multiprocessor

Core L2 $ Dir Mem Interconnect Core L2 $ Dir Mem

(Chassis, board, etc.)

slide-36
SLIDE 36

Our 16-Core Chip Multiprocessor

Core L2 $ Bus

Dir control

Net. switch

Dir $

Mem. channel

Tile Tile 1 Tile 15 ...

slide-42
SLIDE 42

Outline

  • Introduction & Background
  • System Architecture
  • Proximity-Aware Coherence
  • Results
  • Conclusion
slide-43
SLIDE 43

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

slide-44
SLIDE 44

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible

slide-45
SLIDE 45

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

slide-46
SLIDE 46

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

M M P M P P P M

slide-47
SLIDE 47

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

"Home Node" Data Request Cache Miss M M P M P P P M

slide-48
SLIDE 48

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

"Home Node" M M P M P P P M Additional Sharer

slide-49
SLIDE 49

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

"Home Node" M M P M P P P M Additional Sharer Forward Request

slide-50
SLIDE 50

Proximity-Aware Coherence

  • Idea: home node asks sharer nearest requester to forward its cached copy

– Stay on-chip when possible
– Minimize transit of large data-carrying replies

Reply M M P M P P P M
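
The message flow in this build sequence (data request to the home node, forward request to a nearby sharer, reply from that sharer) can be sketched as below. The 4x4 mesh, row-major tile numbering, and Manhattan hop count are assumptions made for illustration; the slides show 16 tiles with network switches but do not spell out the topology here. Message names are hypothetical.

```python
MESH_WIDTH = 4  # assumed 4x4 mesh for 16 tiles, numbered row-major


def coords(tile: int):
    """Return (row, column) of a tile on the assumed mesh."""
    return divmod(tile, MESH_WIDTH)


def hops(a: int, b: int) -> int:
    """Manhattan distance between two tiles."""
    (ar, ac), (br, bc) = coords(a), coords(b)
    return abs(ar - br) + abs(ac - bc)


def read_miss(requester: int, home: int, sharers):
    """Return (message, source, destination) tuples for a read miss."""
    msgs = [("data_request", requester, home)]
    candidates = sorted(s for s in sharers if s != requester)
    if not candidates:
        # Nobody on chip has the line: fall back to main memory at the home node.
        msgs.append(("reply_from_memory", home, requester))
        return msgs
    # Pick the sharer nearest the requester (ties go to the lowest tile id).
    nearest = min(candidates, key=lambda s: hops(s, requester))
    if nearest == home:
        msgs.append(("reply", home, requester))
    else:
        msgs.append(("forward_request", home, nearest))  # small control message
        msgs.append(("reply", nearest, requester))       # large data-carrying reply
    return msgs
```

Only the short forward request travels from the home node to the sharer; the data-carrying reply covers just the distance between the chosen sharer and the requester.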

slide-51
SLIDE 51

Proximity-Aware Coherence

  • To service read misses for shared data, traditional protocols use main memory
  • Other nodes may hold copies
  • On the CMP landscape, inter-node latency is much less than memory latency
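
A back-of-the-envelope model makes the tradeoff concrete. The cycle counts used in the example call are placeholders chosen purely for illustration; they are not measurements from the talk or the paper, and the two functions are hypothetical helpers.

```python
def memory_path_cycles(req_to_home_hops: int, hop_cycles: int,
                       memory_cycles: int) -> int:
    """Request travels to the home node, memory is read, data travels back."""
    return 2 * req_to_home_hops * hop_cycles + memory_cycles


def forward_path_cycles(req_to_home_hops: int, home_to_sharer_hops: int,
                        sharer_to_req_hops: int, hop_cycles: int,
                        cache_cycles: int) -> int:
    """Request goes to home, is forwarded to a sharer, and the sharer replies."""
    total_hops = req_to_home_hops + home_to_sharer_hops + sharer_to_req_hops
    return total_hops * hop_cycles + cache_cycles


# Illustrative placeholder values only (not from the source):
# 4 cycles per hop, 300-cycle memory access, 20-cycle remote cache access.
via_memory = memory_path_cycles(3, 4, 300)          # 2*3*4 + 300 = 324
via_sharer = forward_path_cycles(3, 2, 1, 4, 20)    # (3+2+1)*4 + 20 = 44
```

Under these assumed numbers the extra forwarding hop is cheap compared with an off-chip memory access, which is the situation the last bullet describes.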

slide-52
SLIDE 52

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

slide-53
SLIDE 53

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

Miss

Home

slide-54
SLIDE 54

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

– rand

Miss

Home

slide-55
SLIDE 55

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

– rand
– near1

Miss

Home

slide-56
SLIDE 56

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

– rand
– near1
– via1

Miss

Home

slide-57
SLIDE 57

Sharer Selection

  • When the home node lacks a cached copy, it selects a sharer to ask

– rand
– near1
– via1

  • Retries didn't prove beneficial

Miss

Home
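
The slides name three selection policies (rand, near1, via1) without defining them. The sketch below is one plausible reading, stated here as an assumption: rand picks any sharer, near1 picks the sharer closest to the requester, and via1 picks the sharer that minimizes the combined home-to-sharer and sharer-to-requester distance. The 4x4 mesh and Manhattan distance are likewise assumed.

```python
import random

MESH_WIDTH = 4  # assumed 4x4 mesh, tiles numbered row-major


def hops(a: int, b: int) -> int:
    """Manhattan distance between two tiles on the assumed mesh."""
    (ar, ac), (br, bc) = divmod(a, MESH_WIDTH), divmod(b, MESH_WIDTH)
    return abs(ar - br) + abs(ac - bc)


def select_sharer(policy: str, requester: int, home: int, sharers, rng=random):
    """Pick the sharer the home node asks to forward its copy.

    The policy meanings are assumptions; the slides give only the names.
    """
    candidates = sorted(sharers)
    if not candidates:
        return None
    if policy == "rand":
        return rng.choice(candidates)
    if policy == "near1":
        # Assumed: shortest path for the data-carrying reply.
        return min(candidates, key=lambda s: hops(s, requester))
    if policy == "via1":
        # Assumed: shortest total path home -> sharer -> requester.
        return min(candidates, key=lambda s: hops(home, s) + hops(s, requester))
    raise ValueError(f"unknown policy: {policy}")
```

With the requester at tile 0, the home node at tile 3, and sharers at tiles 2 and 4, this reading of near1 chooses tile 4 (one hop from the requester), while via1 chooses tile 2 (three hops in total via the sharer, versus five via tile 4).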

slide-58
SLIDE 58

Outline

  • Introduction & Background
  • System Architecture
  • Proximity-Aware Coherence
  • Results
  • Conclusion
slide-59
SLIDE 59

Methodology

  • Detailed, execution-driven processor and network simulation

  • "RSIM" simulator, adapted to our CMP model
  • Parallel workloads from several suites
  • Hardware, benchmark details in paper
slide-60
SLIDE 60

Proximity-Aware: Potential Coverage

[Chart: fraction of read misses to shared lines (axis 0.1 to 1) for appbt, fft, lu, mp3d, ocean, quicksort, and unstruct; legend entries 1 through 6]

slide-62
SLIDE 62

Proximity-Aware: Potential Coverage

[Chart: fraction of read misses to shared lines (axis 0.1 to 1) for appbt, fft, lu, mp3d, ocean, quicksort, and unstruct; legend entries 1 through 6]

Overall x=43%

slide-64
SLIDE 64

Proximity-Aware: Potential Coverage

[Chart: fraction of read misses to shared lines (axis 0.1 to 1) for appbt, fft, lu, mp3d, ocean, quicksort, and unstruct; legend entries 1 through 6]

Overall x=43%

dist 1 x=75%

slide-65
SLIDE 65

Proximity-Aware: Latency Benefit

[Chart: normalized L2 miss latency (axis 0.1 to 1.1) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

slide-67
SLIDE 67

Proximity-Aware: Latency Benefit

[Chart: normalized L2 miss latency (axis 0.1 to 1.1) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

Latency: down 25%
slide-69
SLIDE 69

Proximity-Aware: Latency Benefit

[Chart: normalized L2 miss latency (axis 0.1 to 1.1) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

Latency: down 25%

Reply traffic: 6%
slide-71
SLIDE 71

Proximity-Aware: Speedup

[Chart: speedup (axis 0.00 to 1.80) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

slide-73
SLIDE 73

Proximity-Aware: Speedup

[Chart: speedup (axis 0.00 to 1.80) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

Speedup 16%

slide-74
SLIDE 74

Proximity-Aware: Speedup

  • L2 latency sensitivity of workloads

[Chart: speedup (axis 0.00 to 1.80) under rand, near1, and via1 for appbt, fft, lu, mp3d, ocean, quicksort, unstruct, and the mean]

Speedup 16%

slide-75
SLIDE 75

Conclusion

  • The latency/bandwidth aspects of CMPs motivate multicore-aware coherence redesign
  • One such change: Proximity-Aware Coherence

– Ideas: stay on-chip, decrease "bulk" transit
– Mean speedup 16%, mean L2 latency down 25%

  • More aggressive techniques are under study
slide-76
SLIDE 76

Conclusion

  • The latency/bandwidth aspects of CMPs motivate multicore-aware coherence redesign
  • One such change: Proximity-Aware Coherence

– Ideas: stay on-chip, decrease "bulk" transit
– Mean speedup 16%, mean L2 latency down 25%

  • More aggressive techniques are under study
  • Questions?