SLIDE 1

4/3/2016 BPOE 7 @ ASPLOS 2016

When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads

Jason Lowe-Power || Mark D. Hill || David A. Wood

SLIDE 2

Big Data == Big Memory

Low latency → Real-time

Can we execute complex queries in 10 ms?
What's the best performance for 100 kW?
What is the performance for a 16 TB system?

SLIDE 3

Best performance! Lowest power! Highest capacity!

Which is best? It depends.

SLIDE 4

Big Memory Machines

Dell PowerEdge R930

Memory capacity: 3 TB (3,072 GB)
Memory bandwidth: 408 GB/s
Processors: 64 cores

SLIDE 5

[Figure: DRAM (per socket), amount accessible per second, and amount accessible in 10 ms; labeled value: 1 GB]
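The 1 GB label is consistent with simple bandwidth arithmetic. The following is a minimal sketch, assuming the 408 GB/s of the Dell PowerEdge R930 is shared evenly by four sockets (the socket count is an assumption, not stated on the slides) and decimal units throughout.

```python
# Sketch: how much data one socket can stream within a fixed time window.
# Assumption (not from the slides): 4 sockets share the 408 GB/s evenly.

SYSTEM_BANDWIDTH_GBPS = 408.0   # Dell PowerEdge R930, from the slides
SYSTEM_CAPACITY_GB = 3072.0     # 3 TB, from the slides
SOCKETS = 4                     # assumed socket count


def accessible_gb(bandwidth_gbps: float, window_s: float) -> float:
    """Data (GB) readable at `bandwidth_gbps` within `window_s` seconds."""
    return bandwidth_gbps * window_s


per_socket_gbps = SYSTEM_BANDWIDTH_GBPS / SOCKETS
in_10_ms = accessible_gb(per_socket_gbps, 0.010)
full_scan_s = SYSTEM_CAPACITY_GB / SYSTEM_BANDWIDTH_GBPS

print(f"per-socket bandwidth: {per_socket_gbps:.0f} GB/s")
print(f"accessible in 10 ms:  {in_10_ms:.2f} GB")
print(f"full 3 TB scan:       {full_scan_s:.1f} s")
```

Under these assumptions a socket reaches roughly 1 GB of its DRAM within a 10 ms window, while scanning the full 3 TB takes several seconds.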

SLIDE 6

[Figure: amount accessible per second and in 10 ms, compared with CPU processing in 10 ms and GPU processing in 10 ms]

Processing is 2x–10x faster than data supply.

SLIDE 7

3D Die-Stacking

[Figure: DRAM (per socket), amount accessible per second, and amount accessible in 10 ms]

Ratio of data supply to data processing ≈ 1

SLIDE 8

Traditional Server

Big-Memory Server (compared to traditional)
↑ Higher bandwidth
↑↑ Higher capacity

Die-Stacked Server

SLIDE 9

Model and Workload
Model results
Discussion

SLIDE 10

Evaluation

Option 1: Build the hardware
Option 2: Simulation
Option 3: Analytical Model!

SLIDE 11

Model Example

Provisioning: 10 ms response time

Data to read: 16,384 GB × 0.20 = 3,276.8 GB
Bandwidth: 3,276.8 GB ÷ 0.010 s = 327.680 TB/s
Chips needed: 327.680 TB/s ÷ 102 GB/s/chip = 3213 chips ≈ 800 blades

For traditional server
Power: 458 kW
Capacity: 800 TB
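The provisioning arithmetic of the model example can be sketched in a few lines. This is an illustration of the steps shown on the slide, not the authors' model code; the function and parameter names are hypothetical.

```python
import math


def chips_for_sla(corpus_gb, fraction_accessed, sla_s, chip_bandwidth_gbps):
    """Return (data to read in GB, required bandwidth in GB/s, chips needed).

    Follows the slide's steps: data = corpus * fraction,
    bandwidth = data / response time, chips = bandwidth / per-chip bandwidth.
    """
    data_gb = corpus_gb * fraction_accessed
    bandwidth_gbps = data_gb / sla_s
    chips = math.ceil(bandwidth_gbps / chip_bandwidth_gbps)
    return data_gb, bandwidth_gbps, chips


# Values from the slide: 16,384 GB corpus, 20% accessed, 10 ms SLA, 102 GB/s/chip.
data_gb, bandwidth_gbps, chips = chips_for_sla(16_384, 0.20, 0.010, 102)

print(f"data to read: {data_gb:,.1f} GB")
print(f"bandwidth:    {bandwidth_gbps / 1000:,.2f} TB/s")
print(f"chips needed: {chips}")
```

Rounding up to whole chips reproduces the 3213 chips on the slide. The conversion from chips to blades depends on the number of chips per blade, which the slide does not state, so it is left out of the sketch.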

SLIDE 12

Model details

From the paper

Online: research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/

SLIDE 13

Workload Assumptions

▪ 16 TB data corpus
▪ Each request accesses 20% of the data corpus (3.2 TB)
▪ One core can process 6 GB/s
▪ No communication between cores

https://xkcd.com/1339/
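Because the workload assumes no communication between cores, the work divides evenly, and the compute side can be sized the same way as the bandwidth side. The sketch below is derived from the assumptions listed above and is not taken from the authors' model; the function name `cores_needed` is hypothetical.

```python
import math

CORPUS_GB = 16 * 1024        # 16 TB data corpus
FRACTION_ACCESSED = 0.20     # each request touches 20% of the corpus
CORE_RATE_GBPS = 6.0         # one core can process 6 GB/s


def cores_needed(sla_s: float) -> int:
    """Cores required to process one request's data within `sla_s` seconds.

    Relies on the no-communication assumption: each core works on an
    independent slice of the data.
    """
    data_gb = CORPUS_GB * FRACTION_ACCESSED
    return math.ceil(data_gb / (CORE_RATE_GBPS * sla_s))


for sla_s in (0.010, 0.100, 1.0):
    print(f"SLA {sla_s * 1000:6.0f} ms -> {cores_needed(sla_s):,} cores")
```

Relaxing the response time by 10x reduces the required core count by roughly the same factor, which is why the choice of SLA matters so much in the results that follow.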

SLIDE 14

Model and Workload
Model results
Discussion

SLIDE 15

Metrics

Performance: Response time (SLA)
Power: Major component of datacenter cost
Data capacity: Workload size

SLIDE 16

Performance Provisioning

Goal: Design cluster to meet a service level agreement (SLA)

[Figure: request timeline; stage labels: Get matches, Sort, Ads, . . .; time labels: 500 ms, 50 ms, 50 ms, 10 ms, 50 ms, 100 ms]

SLIDE 17

Performance Provisioning

10 ms SLA

[Figure: capacity and power of each system under a 10 ms SLA]

Current systems require memory over provisioning: 50✕, 213✕, 1✕
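The 50✕ label matches the traditional-server numbers from the Model Example slide, where meeting the 10 ms SLA requires 800 TB of memory for a 16 TB corpus. A minimal sketch of that ratio follows; the function name is hypothetical, and attributing the 50✕ label to the traditional server is an inference from those numbers.

```python
def overprovisioning_factor(provisioned_tb: float, workload_tb: float) -> float:
    """Ratio of memory that must be bought to memory the workload needs."""
    return provisioned_tb / workload_tb


# Traditional server, 10 ms SLA: 800 TB provisioned (Model Example slide)
# for a 16 TB data corpus (Workload Assumptions slide).
factor = overprovisioning_factor(800, 16)
print(f"over provisioning: {factor:.0f}x")
```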

SLIDE 18

Memory Over Provisioning

50% Wasted

SLIDE 19

Performance Provisioning

10 ms SLA

[Figure: capacity and power of each system under a 10 ms SLA]

Die-stacking: 2–5✕ less power

SLIDE 20

Performance Provisioning

Power for relaxed SLAs

Traditional needs less over provisioned memory

SLIDE 21

Power Provisioning

Goal: Design cluster to not exceed some power constraint

10–20 kW
100 kW–1 MW
10–100 MW

SLIDE 22

Power Provisioning

1 MW Power

[Figure: response time and capacity of each system under a 1 MW power budget]

Die-stacking: 3–5✕ faster
Die-stacking: Less capacity for power budget

SLIDE 23

Data Capacity Provisioning

Goal: Design cluster capacity for workload

Search: Inverted Index
Graph: Friends lists
Database: Purchases

SLIDE 24

Data Capacity Provisioning

16 TB Database

[Figure: response time and power of each system for a 16 TB database]

Die-stacking: 60–256✕ faster
Die-stacking: 25–50✕ more power

SLIDE 25

Systems: Traditional, Big Memory, Die-Stacked
Metrics: Performance, Power, Data capacity

2–5x less power for 10 ms SLA
Over provisioned memory
Best for SLA 60+ ms
2x faster with 50 kW
3–4x faster with 1 MW
3x memory capacity
2–50x less power
60–250x faster
Somewhere between

SLIDE 26

Model and Workload
Model results
Discussion

SLIDE 27

Model deficiencies

You chose the wrong number!
See research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/

Communication between cores
This makes 2048 die-stacked systems worse
How to move data between stacks?

Compute energy or data energy?

Cost?

SLIDE 28

In Memory Big Data Workloads

Which is best?
Today: It depends…
Tomorrow: Die-stacked?

SLIDE 29

Questions‽

research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/
bit.ly/bpoe-interactive
powerjg@cs.wisc.edu

SLIDE 30

Systems

                    Traditional   Big memory   Die-stacked
Bandwidth           102 GB/s      196 GB/s     256 GB/s
Capacity            256 GB        2 TB         8 GB
Blades (16 TB)      16            8            228
Cluster bandwidth   6.4 TB/s      1.5 TB/s     512 TB/s
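The cluster bandwidth figures can be approximated from the per-unit bandwidth and capacity values. The sketch below assumes that each capacity value describes one bandwidth-providing unit (a chip, blade, or stack), that just enough units are provisioned to hold 16 TB, and that 1 TB = 1024 GB. These are assumptions for illustration, not details confirmed by the slides.

```python
CORPUS_GB = 16 * 1024  # 16 TB workload

# Per-unit (bandwidth in GB/s, capacity in GB), taken from the slide.
SYSTEMS = {
    "traditional": (102, 256),
    "big memory": (196, 2 * 1024),
    "die-stacked": (256, 8),
}


def cluster_bandwidth(bandwidth_gbps, capacity_gb, corpus_gb=CORPUS_GB):
    """Return (units needed to hold the corpus, aggregate bandwidth in TB/s)."""
    units = -(-corpus_gb // capacity_gb)  # ceiling division on integers
    return units, units * bandwidth_gbps / 1024


results = {name: cluster_bandwidth(bw, cap) for name, (bw, cap) in SYSTEMS.items()}
for name, (units, tb_per_s) in results.items():
    print(f"{name:12s} {units:5d} units  {tb_per_s:8.2f} TB/s")
```

Under these assumptions the aggregate bandwidths come out close to the slide's 6.4, 1.5, and 512 TB/s, and the die-stacked case needs 2048 units, the same count mentioned under model deficiencies. The blade counts on the slide are not reproduced here because the number of units per blade is not stated.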

SLIDE 31

Power Breakdown

Compute power dominates die-stacked

SLIDE 32

Decreased Compute Power

[Figure panels: 10 ms SLA, 100 kW Power, 16 TB Capacity]

SLIDE 33

Increased Memory Density

[Figure panels: 100 ms SLA, 100 kW Power, 16 TB Capacity]