
When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads

Jason Lowe-Power, Mark D. Hill, David A. Wood. BPOE 7 @ ASPLOS 2016, 4/3/2016.


  1. When to use 3D Die-Stacked Memory for Bandwidth-Constrained Big Data Workloads Jason Lowe-Power || Mark D. Hill || David A. Wood 4/3/2016 BPOE 7 @ ASPLOS 2016

  2. Big Data == Big Memory. Low latency → Real-time. What is the performance for 100kW? Can we execute complex queries in 10 ms? What’s the best performance for 16 TB system?

  3. Which is best? Lowest power! Highest capacity! Best performance! It depends.

  4. Big Memory Machines: Dell PowerEdge R930. Memory capacity: 3 TB (3,072 GB). Memory bandwidth: 408 GB/s. Processors: 64 cores.

  5. DRAM (per socket). Figure labels: amount accessible per second; amount accessible in 10 ms: 1 GB.

  6. Processing 2x–10x faster than data supply. Figure labels: amount accessible per second; amount accessible in 10 ms; CPU processing in 10 ms; GPU processing in 10 ms.

  7. 3D Die-Stacking. DRAM (per socket). Figure labels: amount accessible per second; amount accessible in 10 ms. Data supply to data processing ≈ 1.

  8. Traditional Server, Big-Memory Server, Die-Stacked Server. ↑ Higher bandwidth, ↑↑ Higher capacity (compared to traditional).

  9. Model and Workload; Model results; Discussion.

  10. Evaluation. Option 1: Build the hardware. Option 2: Simulation. Option 3: Analytical Model!

  11. Model Example (for traditional server). Provisioning: 10 ms response time. Data to read: 16,384 GB × 0.20 = 3,276.8 GB. Bandwidth: 3,276.8 GB ÷ 0.010 s = 327.680 TB/s. Chips needed: 327.680 TB/s ÷ 102 GB/s/chip = 3213 chips = 800 blades. Power: 458 kW. Capacity: 800 TB.

  12. Model details: from the paper. Online: research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/

  13. Workload Assumptions: ▪ 16 TB data corpus ▪ Each request accesses 20% of data corpus (3.2 TB) ▪ One core can process 6 GB/s ▪ No communication between cores. (Image: https://xkcd.com/1339/)

  14. Model and Workload; Model results; Discussion.

  15. Metrics. Performance: response time (SLA). Power: major component of datacenter cost. Data capacity: workload size.

  16. Performance Provisioning. Goal: Design cluster to meet a service level agreement (SLA). Figure labels: Get matches 10 ms; Sort 50 ms; Ads 50 ms; . . . 50 ms; 100 ms; 500 ms.

  17. Performance Provisioning: 10 ms SLA. Current systems require memory over provisioning. Chart panels: Power; Capacity (213 ✕, 50 ✕, 1 ✕).

  18. Memory Over Provisioning: 50% Wasted.

  19. Performance Provisioning: 10 ms SLA. Die-stacking: 2–5 ✕ less power. Chart panels: Power; Capacity.

  20. Performance Provisioning: Power for relaxed SLAs. Traditional needs less over provisioned memory.

  21. Power Provisioning. Goal: Design cluster to not exceed some power constraint. Figure labels: 10–20 kW; 100 kW–1 MW; 10–100 MW.

  22. Power Provisioning: 1 MW Power. Response time: Die-stacking: 3–5 ✕ faster. Capacity: Die-stacking: Less capacity for power budget.

  23. Data Capacity Provisioning. Goal: Design cluster capacity for workload. Search: Inverted Index. Graph: Friends lists. Database: Purchases.

  24. Data Capacity Provisioning: 16 TB Database. Response time: Die-stacking: 60–256 ✕ faster. Power: Die-stacking: 25–50 ✕ more power.

  25. Summary by provisioning goal. Performance: Traditional: Best for SLA 60+ms; Big Memory: Over provisioned memory; Die-Stacked: 2–5x less power for 10ms SLA. Power: Traditional: 2x faster with 50 kW; Big Memory: 3x memory capacity; Die-Stacked: 3–4x faster with 1 MW. Data capacity: Traditional: Somewhere between; Big Memory: 2–50x less power; Die-Stacked: 60–250x faster.

  26. Model and Workload; Model results; Discussion.

  27. Model deficiencies. You chose the wrong number! See research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/. Communication between cores: This makes 2048 die-stacked systems worse. How to move data between stacks? Compute energy or data energy? Cost?

  28. In Memory Big Data Workloads: Which is best? Today: It depends… Today: It depends… Tomorrow: Die-stacked?

  29. Questions ‽ research.cs.wisc.edu/multifacet/bpoe16_3d_bandwidth_model/ bit.ly/bpoe-interactive powerjg@cs.wisc.edu

  30. Systems. Traditional: 102 GB/s bandwidth, 256 GB capacity, 16 blades (16 TB), 6.4 TB/s cluster bandwidth. Big memory: 196 GB/s bandwidth, 2 TB capacity, 8 blades (16 TB), 1.5 TB/s cluster bandwidth. Die-stacked: 256 GB/s bandwidth, 8 GB capacity, 228 blades (16 TB), 512 TB/s cluster bandwidth.

  31. Power Breakdown: Compute power dominates die-stacked.

  32. Decreased Compute Power. Chart panels: 10 ms SLA; 100 kW Power; 16 TB Capacity.

  33. Increased Memory Density. Chart panels: 100 ms SLA; 100 kW Power; 16 TB Capacity.
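
The provisioning arithmetic on slide 11 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' model (which is linked from slide 12): the helper `provision_for_sla` and the `MemorySystem` class are hypothetical names, the per-chip figures are read from slide 30 on the assumption that 102 GB/s and 256 GB belong to the traditional system and 256 GB/s and 8 GB to the die-stacked one, and the core count is derived only from the 6 GB/s per core assumption on slide 13. Power is left out because the slides do not give a per-chip power figure.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class MemorySystem:
    name: str
    bandwidth_gb_s: int  # GB/s per chip or stack (slide 30; column mapping assumed)
    capacity_gb: int     # GB per chip or stack (slide 30; column mapping assumed)


def provision_for_sla(system, corpus_gb, fraction_read, sla_s, core_gb_s=6):
    """Size a cluster so one request finishes within the SLA (hypothetical helper).

    fraction_read and sla_s are Fractions so the arithmetic stays exact.
    """
    data_gb = corpus_gb * fraction_read      # data each request must read
    required_gb_s = data_gb / sla_s          # aggregate bandwidth needed
    units_for_bandwidth = math.ceil(required_gb_s / system.bandwidth_gb_s)
    units_for_capacity = math.ceil(Fraction(corpus_gb, system.capacity_gb))
    # The cluster must satisfy both the bandwidth and the capacity requirement.
    units = max(units_for_bandwidth, units_for_capacity)
    provisioned_gb = units * system.capacity_gb
    return {
        "required_gb_s": required_gb_s,
        "units": units,
        "provisioned_gb": provisioned_gb,
        "overprovision": Fraction(provisioned_gb, corpus_gb),
        "cores": math.ceil(required_gb_s / core_gb_s),
    }


CORPUS_GB = 16 * 1024  # 16 TB corpus, as on slide 13
traditional = MemorySystem("traditional", 102, 256)
die_stacked = MemorySystem("die-stacked", 256, 8)

for system in (traditional, die_stacked):
    result = provision_for_sla(system, CORPUS_GB, Fraction(1, 5), Fraction(1, 100))
    print(system.name, result["units"], float(result["overprovision"]))
```

With these inputs the traditional configuration comes out at 3213 chips, matching slide 11, with roughly 803 TB of provisioned memory, in line with the rounded 800 TB on the slide. The die-stacked configuration is limited by capacity rather than bandwidth and lands on 2048 stacks, the figure slide 27 mentions.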
