

slide-1
SLIDE 1

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

NATHAN BECKMANN AND DANIEL SANCHEZ
MIT CSAIL

PACT’13 - EDINBURGH, SCOTLAND
SEP 11, 2013

slide-2
SLIDE 2

Summary

¨ NUCA is giving us more capacity, but further away
¨ Applications have widely varying cache behavior
¨ Cache organization should adapt to application
¨ Jigsaw uses physical cache resources as building blocks of virtual caches, or shares

[Figure: MPKI versus cache size for libquantum, zeusmp, and sphinx3; axis marks: 40 MPKI, 16MB]

slide-3
SLIDE 3

Approach

¨ Jigsaw uses physical cache resources as building blocks of virtual caches, or shares

[Figure: MPKI versus cache size for libquantum, zeusmp, and sphinx3 (axis marks: 40 MPKI, 16MB), next to a tiled multicore with its cache banks]

slide-4
SLIDE 4

Agenda

¨ Introduction
¨ Background
  ¤ Goals
  ¤ Existing Approaches
¨ Jigsaw Design
¨ Evaluation

slide-5
SLIDE 5

Goals

¨ Make effective use of cache capacity
¨ Place data for low latency
¨ Provide capacity isolation for performance
¨ Have a simple implementation

slide-6
SLIDE 6

Existing Approaches: S-NUCA

Spread lines evenly across banks

¨ High Capacity
¨ High Latency
¨ No Isolation
¨ Simple

slide-7
SLIDE 7

Existing Approaches: Partitioning

Isolate regions of cache between applications.

¨ High Capacity
¨ High Latency
¨ Isolation
¨ Simple
¨ Jigsaw needs partitioning; uses Vantage to get strong guarantees with no loss in associativity

slide-8
SLIDE 8

Existing Approaches: Private

Place lines in local bank

¨ Low Capacity
¨ Low Latency
¨ Isolation
¨ Complex – LLC directory

slide-9
SLIDE 9

Existing Approaches: D-NUCA

Placement, migration, and replication heuristics

¨ High Capacity
  ¤ But beware of over-replication and restrictive mappings
¨ Low Latency
  ¤ Don’t fully exploit capacity vs. latency tradeoff
¨ No Isolation
¨ Complexity Varies
  ¤ Private-baseline schemes require LLC directory

slide-10
SLIDE 10

Existing Approaches: Summary

                S-NUCA   Partitioning   Private   D-NUCA
High Capacity   Yes      Yes            No        Yes
Low Latency     No       No             Yes       Yes
Isolation       No       Yes            Yes       No
Simple          Yes      Yes            No        Depends

slide-11
SLIDE 11

Jigsaw

¨ High Capacity – Any share can take full capacity, no replication
¨ Low Latency – Shares allocated near cores that use them
¨ Isolation – Partitions within each bank
¨ Simple – Low overhead hardware, no LLC directory, software-managed

slide-12
SLIDE 12

Agenda

¨ Introduction
¨ Background
¨ Jigsaw Design
  ¤ Operation
  ¤ Monitoring
  ¤ Configuration
¨ Evaluation

slide-13
SLIDE 13

Jigsaw Components

[Figure: the three components (Operation, Monitoring, Configuration) and the information passed between them; arrow labels: Accesses, Miss Curves, Size & Placement]

slide-16
SLIDE 16

Operation: Access

¨ Data → shares, so no LLC coherence required

[Figure: path of the access LD 0x5CA1AB1E from the core through L1I/L1D, L2, and the LLC; labels: TLB, Classifier, STB, Share 1, Share 2, Share 3, ..., Share N]

slide-17
SLIDE 17

Data Classification

¨ Jigsaw classifies data based on access pattern
  ¤ Thread, Process, Global, and Kernel
¨ Data lazily re-classified on TLB miss
  ¤ Similar to R-NUCA but…
    • R-NUCA: Classification → Location
    • Jigsaw: Classification → Share (sized & placed dynamically)
  ¤ Negligible overhead

[Figure: operating system view of the shares in an example system]
  • 6 thread shares
  • 2 process shares
  • 1 global share
  • 1 kernel share
slide-18
SLIDE 18

Operation: Share-bank Translation Buffer

¨ Gives unique location of the line in the LLC
¨ Address, Share → Bank, Partition
¨ Hash lines proportionally
¨ 400 bytes; low overhead

[Figure: STB with 4 entries, associative, exception on miss. Inputs are the address (from L1 miss) and the share id (from TLB). Each STB entry holds a share config of bank/partition pairs, Bank/Part 0 through Bank/Part 63 (e.g. 1/3, 3/5, 1/3, ..., 0/8), selected by a hash H of the address. Example values shown: 2706 and 0x5CA1AB1E; address 0x5CA1AB1E maps to bank 3, partition 5. Inset: a share and its STB entry, filled as A A B A A B.]
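
The lookup on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: `StbEntry`, `stb_lookup`, the 64-byte line size, and the use of CRC32 as the hash are hypothetical stand-ins (the slides do not specify the hardware hash), and the slot values simply reuse bank/partition pairs from the figure.

```python
import zlib
from dataclasses import dataclass
from typing import List, Tuple

LINE_BITS = 6  # assumption: 64-byte cache lines


@dataclass
class StbEntry:
    """One STB entry: a share id plus a list of (bank, partition) slots."""
    share_id: int
    # A pair that appears in k of the n slots receives about k/n of the lines.
    slots: List[Tuple[int, int]]


def stb_lookup(entry: StbEntry, address: int) -> Tuple[int, int]:
    """Map an address to the single (bank, partition) that may hold its line."""
    line = address >> LINE_BITS
    # Stand-in hash; any well-mixed hash of the line address would do here.
    h = zlib.crc32(line.to_bytes(8, "little"))
    return entry.slots[h % len(entry.slots)]


# Slots filled as "A A B A A B" with A = bank 1/partition 3, B = bank 3/partition 5.
entry = StbEntry(share_id=1, slots=[(1, 3), (1, 3), (3, 5), (1, 3), (1, 3), (3, 5)])
location = stb_lookup(entry, 0x5CA1AB1E)
```

Because every line hashes to exactly one slot, each address has a single possible location in the LLC, which is the property the slide relies on to avoid LLC coherence.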

slide-20
SLIDE 20

Monitoring

¨ Software requires miss curves for each share
¨ Add utility monitors (UMONs) per tile to produce miss curves
¨ Dynamic sampling to model full LLC at each bank; see paper

[Figure: UMON tag array (Way 0 to Way N-1) with per-way hit counters, and the resulting curve of misses versus cache size]

slide-21
SLIDE 21

Configuration

¨ Software decides share configuration
¨ Approach: Size → Place
  ¤ Solving independently is simple
  ¤ Sizing is hard, placing is easy

[Figure: SIZE step (miss curves, misses versus size) followed by PLACE step (LLC)]

slide-22
SLIDE 22

Configuration: Sizing

¨ Partitioning problem: Divide cache capacity of S among P partitions/shares to maximize hits
¨ Use miss curves to describe partition behavior
¨ NP-complete in general
¨ Existing approaches:
  ¤ Hill climbing is fast but gets stuck in local optima
  ¤ UCP Lookahead is good but scales quadratically: O(P × S²)
    (Utility-based Cache Partitioning, Qureshi and Patt, MICRO’06)

Can we scale Lookahead?

slide-23
SLIDE 23

Configuration: Lookahead

¨ UCP Lookahead:
  ¤ Scan miss curves to find allocation that maximizes average cache utility (hits per byte)

[Figure: miss curve, misses versus size up to the LLC size]
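
To make the scan concrete, here is a minimal sketch of a Lookahead-style allocator. It is written from the one-line description on this slide, not taken from the UCP paper's pseudocode; the function name `lookahead`, the list-of-lists curve format, and the tie-breaking rule (first best candidate wins) are assumptions.

```python
from typing import List


def lookahead(miss_curves: List[List[float]], capacity: int) -> List[int]:
    """Greedy allocation by maximum average utility (misses saved per unit).

    miss_curves[p][s] is the miss count of partition p when given s units;
    each curve is assumed to have at least capacity + 1 points.
    """
    alloc = [0] * len(miss_curves)
    remaining = capacity
    while remaining > 0:
        best = None  # (utility, partition, step)
        for p, curve in enumerate(miss_curves):
            # Try every step size that still fits: this inner scan is what
            # makes the whole procedure quadratic in the cache size.
            for step in range(1, remaining + 1):
                saved = curve[alloc[p]] - curve[alloc[p] + step]
                utility = saved / step
                if best is None or utility > best[0]:
                    best = (utility, p, step)
        if best is None or best[0] <= 0:
            break  # no partition benefits from more space
        _, p, step = best
        alloc[p] += step
        remaining -= step
    return alloc


# Partition 0 has a cliff at 3 units; partition 1 improves steadily.
curves = [[100, 100, 100, 10, 10], [100, 80, 60, 40, 20]]
result = lookahead(curves, 4)
```

In this example a one-unit-at-a-time hill climber never sees past the flat start of partition 0's curve, whereas the scan over larger steps finds the cliff.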

slide-37
SLIDE 37

Configuration: Lookahead

¨ UCP Lookahead:
  ¤ Scan miss curves to find allocation that maximizes average cache utility (hits per byte)

[Figure: miss curve, misses versus size up to the LLC size, with the point of maximum cache utility marked]

slide-40
SLIDE 40

Configuration: Lookahead

¨ Observation: Lookahead traces the convex hull of the miss curve

[Figure: miss curve, misses versus size up to the LLC size, with the point of maximum cache utility marked]

slide-41
SLIDE 41

Convex Hulls

¨ The convex hull of a curve is the set containing all line segments between any two points on the curve; for a miss curve, the relevant part is “the curve connecting the points along the bottom”

[Figure: two miss curves, misses versus size up to the LLC size, each with its convex hull]
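
As a sketch of "the curve connecting the points along the bottom", the lower hull of a miss curve can be computed with a standard single-pass stack scan (the monotone-chain idea). The function name `lower_hull` and the choice to drop collinear points are assumptions made for this illustration.

```python
from typing import List


def lower_hull(curve: List[float]) -> List[int]:
    """Return the sizes (indices) whose points lie on the lower convex hull.

    curve[s] is the miss count at size s; points are visited in increasing s,
    so one left-to-right pass with a stack is enough (linear time).
    """
    hull: List[int] = []
    for x, y in enumerate(curve):
        while len(hull) >= 2:
            x1, y1 = hull[-2], curve[hull[-2]]
            x2, y2 = hull[-1], curve[hull[-1]]
            # Drop the middle point if it lies on or above the segment
            # joining its neighbours (slope comparison via cross-multiplying).
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(x)
    return hull


cliff = lower_hull([100, 100, 100, 10, 10])
line = lower_hull([100, 80, 60, 40, 20])
```

For the curve with a cliff, the flat points before the cliff fall off the hull, which is why a step straight to the cliff looks attractive to Lookahead.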

slide-42
SLIDE 42

Configuration: Peekahead

¨ There are well-known linear algorithms to compute convex hulls
¨ Peekahead algorithm is an exact, linear-time implementation of UCP Lookahead

[Figure: two miss curves, misses versus size up to the LLC size, each with its convex hull]

slide-43
SLIDE 43

Configuration: Peekahead

¨ Peekahead computes all convex hulls encountered during allocation in linear time
  ¤ Starting from every possible allocation
  ¤ Up to any remaining cache capacity

[Figure: two miss curves, misses versus size up to the LLC size, each with its convex hull]

slide-44
SLIDE 44

Configuration: Peekahead

¨ Knowing the convex hull, each allocation step is O(log P)
  ¤ Convex hulls have decreasing slope → decreasing average cache utility → only consider next point on hull
  ¤ Use max-heap to compare between partitions

[Figure: animation marking the current allocation and the best next step on the partitions' hulls]
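
The heap-based step can be sketched as follows. This is a simplified illustration and not the paper's Peekahead: it precomputes one hull per partition and, when the next hull point no longer fits in the remaining capacity, simply rebuilds that partition's hull over the reachable range, so it does not achieve the linear-time bound. The names `lower_hull` and `allocate_on_hulls` are hypothetical.

```python
import heapq
from typing import List


def lower_hull(curve: List[float]) -> List[int]:
    """Indices of the points on the lower convex hull of a miss curve."""
    hull: List[int] = []
    for x, y in enumerate(curve):
        while len(hull) >= 2:
            x1, y1 = hull[-2], curve[hull[-2]]
            x2, y2 = hull[-1], curve[hull[-1]]
            if (y2 - y1) * (x - x1) >= (y - y1) * (x2 - x1):
                hull.pop()
            else:
                break
        hull.append(x)
    return hull


def allocate_on_hulls(miss_curves: List[List[float]], capacity: int) -> List[int]:
    n = len(miss_curves)
    alloc = [0] * n
    remaining = capacity
    hulls = [lower_hull(c[:capacity + 1]) for c in miss_curves]
    nxt = [1] * n  # index of each partition's next hull point
    heap: list = []  # entries are (-utility, partition): a max-heap on utility

    def push(p: int) -> None:
        if nxt[p] < len(hulls[p]):
            a, b = alloc[p], hulls[p][nxt[p]]
            utility = (miss_curves[p][a] - miss_curves[p][b]) / (b - a)
            if utility > 0:
                heapq.heappush(heap, (-utility, p))

    for p in range(n):
        push(p)
    while heap and remaining > 0:
        _, p = heapq.heappop(heap)  # O(log P) per step
        step = hulls[p][nxt[p]] - alloc[p]
        if step > remaining:
            # Simplification: rebuild this hull over the space still available.
            lo, hi = alloc[p], alloc[p] + remaining
            hulls[p] = [lo + i for i in lower_hull(miss_curves[p][lo:hi + 1])]
            nxt[p] = 1
            push(p)
            continue
        alloc[p] += step
        remaining -= step
        nxt[p] += 1
        push(p)
    return alloc


result = allocate_on_hulls([[100, 100, 100, 10, 10], [100, 80, 60, 40, 20]], 4)
```

Only the next hull point of each partition is ever considered, which is the property the slide derives from the hull's decreasing slope.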

slide-52
SLIDE 52

Configuration: Peekahead

¨ Full runtime is O(P × S)
  ¤ P – number of partitions
  ¤ S – cache size
¨ See paper for additional examples, algorithm, and corner cases
¨ See technical report for additional detail, proofs, and run-time analysis
  ¤ Jigsaw: Scalable Software-Defined Caches (Extended Version), Nathan Beckmann and Daniel Sanchez, Technical Report MIT-CSAIL-TR-2013-017, Massachusetts Institute of Technology, July 2013

slide-53
SLIDE 53

Re-configuration

¨ When STB changes, some addresses hash to different banks
¨ Selective invalidation hardware walks the LLC and invalidates lines that have moved
¨ Heavy-handed but infrequent and avoids directory
  ¤ Maximum of 300K cycles / 50M cycles = 0.6% overhead

[Figure: STB entry before (1/3 1/3 3/5 1/3 1/3 3/5) and after (1/3 4/9 3/5 1/3 1/3 3/5) re-configuration; address 0x5CA1AB1E is hashed (H) into the entry in both cases]
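
A rough software model of "invalidate lines that have moved" is sketched below. It assumes, for illustration only, that a line's slot is its hash modulo the number of slots and that the old and new entries have the same length; `changed_slots` and `line_moved` are hypothetical names, and the slot values are the ones shown in the figure.

```python
from typing import List, Set, Tuple

Slot = Tuple[int, int]  # (bank, partition)


def changed_slots(old: List[Slot], new: List[Slot]) -> Set[int]:
    """Slot indices whose (bank, partition) differs after re-configuration."""
    return {i for i, (o, n) in enumerate(zip(old, new)) if o != n}


def line_moved(line_hash: int, old: List[Slot], new: List[Slot]) -> bool:
    """A cached line must be invalidated if its slot now names another place."""
    i = line_hash % len(old)
    return old[i] != new[i]


old_entry = [(1, 3), (1, 3), (3, 5), (1, 3), (1, 3), (3, 5)]
new_entry = [(1, 3), (4, 9), (3, 5), (1, 3), (1, 3), (3, 5)]
moved = changed_slots(old_entry, new_entry)

# The slide's overhead bound: 300K cycles of walking per 50M-cycle interval.
overhead = 300_000 / 50_000_000
```

Only lines that hash to a changed slot are affected, so a re-configuration that touches one slot out of six leaves most of the share's lines in place.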

slide-54
SLIDE 54

Design: Hardware Summary

¨ Operation:
  ¤ Share-bank translation buffer (STB) handles accesses
  ¤ TLB augmented with share id
¨ Monitoring HW: produces miss curves
¨ Configuration: invalidation HW
¨ Partitioning HW (Vantage)

[Figure: tile organization. Core with L1I, L1D, L2, TLBs, and STB; NoC router; Jigsaw L3 bank with bank partitioning HW (Vantage), invalidation HW, and monitoring HW. Legend: modified structures, new/added structures]

slide-55
SLIDE 55

Agenda

¨ Introduction
¨ Background
¨ Jigsaw Design
¨ Evaluation
  ¤ Methodology
  ¤ Performance
  ¤ Energy

slide-56
SLIDE 56

Methodology

¨ Execution-driven simulation using zsim
¨ Workloads:
  ¤ 16-core single-threaded mixes of SPECCPU2006 workloads
  ¤ 64-core multithreaded (4x16-thread) mixes of PARSEC
¨ Cache organizations
  ¤ LRU – shared S-NUCA cache with LRU replacement; baseline
  ¤ Vantage – S-NUCA with Vantage and UCP Lookahead
  ¤ R-NUCA – state-of-the-art shared-baseline D-NUCA organization
  ¤ IdealSPD (“shared-private D-NUCA”) – private L3 + shared L4
    • 2x capacity of other schemes
    • Upper bound for private-baseline D-NUCA organizations
  ¤ Jigsaw

slide-57
SLIDE 57

Evaluation: Performance

¨ 16-core multiprogrammed mixes of SPECCPU2006
¨ Jigsaw achieves best performance
  ¤ Up to 50% improved throughput, 2.2x improved weighted speedup
  ¤ Gmean +14% throughput, +18% weighted speedup
¨ Jigsaw does even better on the most memory-intensive mixes
  ¤ Top 20% of LRU MPKI
  ¤ Gmean +21% throughput, +29% weighted speedup

slide-58
SLIDE 58

Evaluation: Performance

¨ 64-core multithreaded mixes of PARSEC
¨ Jigsaw achieves best performance
  ¤ Gmean +9% throughput, +9% weighted speedup
¨ Remember IdealSPD is an upper bound with 2x capacity

slide-59
SLIDE 59

Evaluation: Performance Breakdown

¨ 16-core multiprogrammed mixes of SPECCPU2006
¨ Break down memory stalls into network and DRAM
  ¤ Normalized to LRU
¨ R-NUCA is limited by capacity in these workloads (private data → local bank)
¨ Vantage only benefits DRAM
¨ IdealSPD acts as either a private organization (benefit network) or a shared organization (benefit DRAM)
¨ Jigsaw is the only scheme to simultaneously benefit network and DRAM latency

[Figure: network versus DRAM stall breakdown per scheme, with the optimum marked]

slide-60
SLIDE 60

Evaluation: Energy

¨ 16-core multiprogrammed mixes
¨ McPAT models of full-system energy (chip + DRAM)
¨ Jigsaw achieves best energy reduction
  ¤ Up to 72%, gmean of 11%
  ¤ Reduces both network and DRAM energy

slide-61
SLIDE 61

Conclusion

¨ NUCA is giving us more capacity, but further away
¨ Applications have widely varying cache behavior
¨ Cache organization should adapt to meet application needs
¨ Jigsaw uses physical cache resources as building blocks of virtual caches, or shares
  ¤ Sized to fit working set
  ¤ Placed near application for low latency
¨ Jigsaw improves performance up to 2.2x and reduces energy up to 72%

[Figure: MPKI versus cache size; axis marks: 40 MPKI, 16MB]

slide-62
SLIDE 62

QUESTIONS


slide-63
SLIDE 63

Placement

¨ Greedy algorithm
¨ Each share is allocated budget
¨ Shares take turns grabbing space in “nearby” banks
  ¤ Banks ordered by distance from “center of mass” of cores accessing share
¨ Repeat until budget & banks exhausted
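
The greedy loop above might look like the sketch below. Several details are assumptions, since the slide does not give them: banks and cores sit on a 2D grid, distance is Manhattan distance, each turn grabs one fixed-size chunk, and ties are broken by bank coordinates. The function name `place` and its argument format are hypothetical.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def place(shares: Dict[str, Tuple[int, List[Coord]]],
          banks: List[Coord], bank_capacity: int,
          chunk: int = 1) -> Dict[str, Dict[Coord, int]]:
    """shares maps name -> (budget, coordinates of the cores that access it)."""
    free = {b: bank_capacity for b in banks}
    left = {name: budget for name, (budget, _) in shares.items()}
    placement: Dict[str, Dict[Coord, int]] = {name: {} for name in shares}

    # Order banks for each share by distance from its cores' center of mass.
    order: Dict[str, List[Coord]] = {}
    for name, (_, cores) in shares.items():
        cx = sum(x for x, _ in cores) / len(cores)
        cy = sum(y for _, y in cores) / len(cores)
        order[name] = sorted(
            banks, key=lambda b: (abs(b[0] - cx) + abs(b[1] - cy), b))

    progress = True
    while progress:  # repeat until budgets or banks are exhausted
        progress = False
        for name in shares:  # shares take turns
            if left[name] == 0:
                continue
            for b in order[name]:  # nearest bank with free space
                if free[b] > 0:
                    take = min(chunk, left[name], free[b])
                    free[b] -= take
                    left[name] -= take
                    placement[name][b] = placement[name].get(b, 0) + take
                    progress = True
                    break
    return placement


result = place({"a": (2, [(0, 0)]), "b": (2, [(1, 0)])},
               banks=[(0, 0), (1, 0)], bank_capacity=2)
```

Taking turns in small chunks keeps one share from claiming all the banks near a contested region before its neighbours get a chance.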

slide-64
SLIDE 64

Monitoring

¨ Software requires miss curves for each share
¨ Add UMONs per tile
  ¤ Small tag array that models LRU on sample of accesses
  ¤ Tracks # hits per way, # misses → miss curve
¨ Changing sampling rate models a larger cache

$$\text{Sampling Rate} = \frac{\text{UMON Lines}}{\text{Modeled Cache Lines}}$$

¨ STB spreads lines proportionally to partition size, so sampling rate must compensate

$$\text{Sampling Rate} = \frac{\text{Share size}}{\text{Partition size}} \times \frac{\text{UMON Lines}}{\text{Modeled Cache Lines}}$$
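
A small sketch of both ideas on this slide: turning per-way hit counters into a miss curve, and computing the compensated sampling rate. The function names are hypothetical, the counter values in the example are made up, and the miss-curve construction relies on the usual LRU stack property (a hit in way w would still hit in any cache with more than w ways).

```python
from typing import List


def miss_curve(way_hits: List[int], misses: int) -> List[int]:
    """curve[w] = misses the monitored stream would see with w ways.

    way_hits[i] counts hits at LRU stack position i in the UMON tag array.
    """
    curve = [sum(way_hits) + misses]  # zero ways: every access misses
    for hits in way_hits:
        curve.append(curve[-1] - hits)  # one more way converts these hits
    return curve


def sampling_rate(umon_lines: int, modeled_lines: int,
                  share_size: float, partition_size: float) -> float:
    """Rate that lets a UMON seeing only one partition's traffic model the share."""
    return (share_size / partition_size) * (umon_lines / modeled_lines)


curve = miss_curve([50, 30, 15, 5], misses=20)
rate = sampling_rate(umon_lines=1024, modeled_lines=262144,
                     share_size=4, partition_size=1)
```

The first factor in `sampling_rate` grows as the local partition shrinks relative to the share, which offsets the smaller fraction of the share's accesses that the bank sees.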

slide-65
SLIDE 65

Monitoring

¨ STB spreads addresses unevenly → change sampling rate to compensate
¨ Augment UMON with hash (shared with STB) and 32-bit limit register that gives fine control over sampling rate
¨ UMON now models full LLC capacity exactly
  ¤ Shares require only one UMON
  ¤ Max four shares / bank → four UMONs / bank → 1.4% overhead

[Figure: UMON tag array (Way 0 to Way N-1) with hit counters; an access is sampled when H3(address) < limit]
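
The limit-register comparison can be modeled as below. This is a sketch under assumptions: the hash is treated as a uniformly distributed 32-bit value, and the helper names `limit_for_rate` and `should_sample` are hypothetical rather than taken from the paper.

```python
def limit_for_rate(rate: float) -> int:
    """Largest-fitting 32-bit limit such that P(hash < limit) is about `rate`."""
    return min(int(rate * 2**32), 2**32 - 1)


def should_sample(hash32: int, limit: int) -> bool:
    """The UMON observes an access only when its address hash is below the limit."""
    return hash32 < limit


half = limit_for_rate(0.5)
```

With a 32-bit limit the sampling rate can be set in steps of 2^-32, which is the fine control the slide refers to.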

slide-66
SLIDE 66

Evaluation: Extra

¨ See paper for:
  ¤ Out-of-order results
  ¤ Execution time breakdown
  ¤ Peekahead performance
  ¤ Sensitivity studies