Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC)

Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, and Hyesoon Kim. IISWC’17 talk.


  1. Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC). Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, and Hyesoon Kim. IISWC’17 talk.

  2. Memory Evolution

  3. 3D-Stacking Technology: provides opportunities and novel features. 3D-DRAMs provide higher bandwidth and density, enable lower power consumption, and motivate processing-in-memory. HMC is an example of such memories.

  4. New Considerations: new internal organization; new thermal behavior; new latency and bandwidth hierarchy; new packet-switched interface.

  5. Contributions: we evaluate a real system with HMC 1.1 to study the new memory organization; present the relationships among bandwidth, power, and HMC temperature; investigate the required cooling power; and explore the factors that contribute to latency. The goal is to realize the full-system impact of 3D-stacked memories, and of HMC in particular. [Figure labels: FPGA, AC510.]

  6. Hybrid Memory Cube (HMC): HMC 1.1 (Gen2), 4 GB. [Figure labels: bank, TSV, partition, vault, vault controller, logic layer, DRAM layer.]

  7. Hybrid Memory Cube (HMC): HMC 1.1 (Gen2), 4 GB; 16 banks per vault; total number of banks = 256; size of each bank = 16 MB. [Figure labels: bank, TSV, partition, vault, vault controller, logic layer, DRAM layer.]
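
The organization numbers on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes "4GB" and "16 MB" are binary units (GiB and MiB); the variable names are illustrative only.

```python
# Cross-check of the HMC 1.1 organization figures quoted on the slide.
# ASSUMPTION: capacities are binary units (4 GiB total, MiB per bank).
GIB = 1024 ** 3
MIB = 1024 ** 2

capacity_bytes = 4 * GIB      # "4GB size"
total_banks = 256             # "Total Number of Banks = 256"
banks_per_vault = 16          # "16 Banks/Vault"

vaults = total_banks // banks_per_vault
bank_size_mib = capacity_bytes // total_banks // MIB

print(f"vaults = {vaults}, bank size = {bank_size_mib} MiB")
```

The derived vault count matches the "16 vaults" label used later in the closed-page policy slides.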

  8. HMC Communication I: follows a serialized packet-switched protocol; packets are partitioned into 16-byte flits; each transfer incurs 1 flit of overhead.
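
The one-flit overhead means small payloads use the link less efficiently than large ones. The sketch below computes the fraction of transferred flits that carry data for a data-bearing packet; `link_efficiency` is a hypothetical helper, and the payload sizes are the 16 B to 128 B steps that appear later in the deck.

```python
FLIT_BYTES = 16       # flit size from the slide
OVERHEAD_FLITS = 1    # "Each transfer incurs 1 flit of overhead"


def link_efficiency(payload_bytes):
    """Fraction of flits in a data-carrying packet that hold payload."""
    if payload_bytes <= 0 or payload_bytes % FLIT_BYTES:
        raise ValueError("payload must be a positive multiple of 16 bytes")
    data_flits = payload_bytes // FLIT_BYTES
    return data_flits / (data_flits + OVERHEAD_FLITS)


for size in range(16, 129, 16):
    print(f"{size:3d} B payload: {link_efficiency(size):.1%} of flits carry data")
```

Under this model a 16 B payload spends half of its flits on overhead, which is one way to read the later advice to concatenate requests.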

  10. HMC Communication II: two or four full-duplex external links; width of 8 or 16 lanes; configurable speeds of 10, 12.5, and 15 Gbps. Our evaluated system: 2 external links, 8 lanes each.
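
For the evaluated configuration (2 links, 8 lanes each), the raw link bandwidth follows directly from the lane count and lane speed. The sketch below tabulates it for all three configurable speeds, since this slide does not say which speed the evaluated system uses. It assumes the quoted speeds are per lane and ignores packet overhead; `raw_link_bandwidth_gbytes` is an illustrative name.

```python
def raw_link_bandwidth_gbytes(links, lanes_per_link, gbps_per_lane):
    """Raw bandwidth in GB/s for one direction, before packet overhead."""
    return links * lanes_per_link * gbps_per_lane / 8  # 8 bits per byte


LINKS, LANES = 2, 8  # evaluated system from the slide

for speed in (10, 12.5, 15):
    one_way = raw_link_bandwidth_gbytes(LINKS, LANES, speed)
    # Full-duplex links carry the same raw rate in each direction.
    print(f"{speed:>4} Gbps/lane: {one_way:.0f} GB/s per direction, "
          f"{2 * one_way:.0f} GB/s both directions")
```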

  11. Experimental Setup I: Pico SC6 Mini; EX700 backplane; AC510 module (FPGA and 4 GB HMC 1.1). [Figure labels: DC power supply (fan speed control), power measurement, 15 W fan, 45°, distances of 45 cm, 90 cm, and 135 cm from the AC510.]

  13. Experimental Setup II: apply different masks to addresses; modified GUPS (giga updates per second) benchmark; FPGA frequency: 187.5 MHz. [Figure labels: software host (Pico API, PCIe driver); PCIe 3.0 x8; Pico EX700 PCIe switch; AC-510 FPGA with ports (9x), each with an address generator, monitoring, read tag pool, write request FIFO, data generator, and arbitration; HMC controller; transceivers; HMC.]
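
Masking addresses is how a GUPS-style random address stream can be confined to a chosen number of vaults and banks. The sketch below illustrates the idea only: the bit positions of the vault and bank fields are assumptions made for illustration, not values given in the slides, and `make_mask`, `vault_of`, and `bank_of` are hypothetical helpers.

```python
import random

# ASSUMPTION: illustrative field positions, not taken from the slides.
VAULT_SHIFT, VAULT_BITS = 7, 4    # 16 vaults
BANK_SHIFT, BANK_BITS = 11, 4     # 16 banks per vault
ADDR_BITS = 32                    # 4 GB address space


def make_mask(n_vaults, n_banks):
    """AND-mask confining addresses to the first n_vaults vaults and the
    first n_banks banks in each vault (counts must be powers of two)."""
    mask = (1 << ADDR_BITS) - 1
    for shift, bits, n in ((VAULT_SHIFT, VAULT_BITS, n_vaults),
                           (BANK_SHIFT, BANK_BITS, n_banks)):
        if n <= 0 or n & (n - 1) or n > (1 << bits):
            raise ValueError("count must be a power of two within the field")
        keep = n.bit_length() - 1  # number of low field bits left free
        mask &= ~(((1 << bits) - (1 << keep)) << shift)
    return mask


def vault_of(addr):
    return (addr >> VAULT_SHIFT) & ((1 << VAULT_BITS) - 1)


def bank_of(addr):
    return (addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)


rng = random.Random(0)
mask = make_mask(n_vaults=1, n_banks=4)
addresses = [rng.getrandbits(ADDR_BITS) & mask for _ in range(1000)]
print("vaults touched:", sorted({vault_of(a) for a in addresses}))
print("banks touched: ", sorted({bank_of(a) for a in addresses}))
```

Sweeping `n_vaults` and `n_banks` over powers of two gives a family of access patterns that range from all banks in all vaults down to a single bank.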

  14. Access Patterns [Figure: access patterns, ordered toward accessing fewer banks.]

  18. Bandwidth [Figure: bandwidth (GB/s, 0 to 30) versus access pattern for three types of accesses: ro, wo, and rw (dependent).]

  19. Bandwidth: accessing 4 banks saturates the bandwidth of 1 vault; external bandwidth is saturated at 4 vaults. [Figure: bandwidth (GB/s, 0 to 30) versus access pattern for three types of accesses: ro, wo, and rw (dependent).]

  20. Thermal/Power Experiments [Figure labels: Pico SC6 Mini, EX700, AC-510, DC power supply (fan speed control), power measurement, 15 W fan, 45°, distances of 45 cm, 90 cm, and 135 cm.]

  21. Temperature (read only) [Figure: temperature (°C, 30 to 80) and bandwidth (GB/s, 0 to 30) versus access pattern for thermal configurations Cfg1 to Cfg4.]

  22. Temperature (read only): access patterns affect temperature. [Figure: temperature (°C, 30 to 80) and bandwidth (GB/s, 0 to 30) versus access pattern for thermal configurations Cfg1 to Cfg4.]

  23. Temperature & Bandwidth [Figure: temperature (°C, 48 to 60) versus bandwidth (GB/s, 0 to 30) for ro, wo, and rw accesses.]

  24. Temperature & Bandwidth: a bandwidth increment of 15 GB/s corresponds to an increment of about 4 °C in temperature. [Figure: temperature (°C, 48 to 60) versus bandwidth (GB/s, 0 to 30) for ro, wo, and rw accesses.]

  25. Temperature & Bandwidth: greater slope for writes; writes are more sensitive to temperature. [Figure: temperature (°C, 48 to 60) versus bandwidth (GB/s, 0 to 30) for ro, wo, and rw accesses.]

  26. Device Power Consumption (read only) [Figure: average power (W, 4 to 18) and bandwidth (GB/s, 0 to 30) versus access pattern for thermal configurations Cfg1 to Cfg4.]

  27. Device Power & Bandwidth [Figure: power (W, 6 to 14) versus bandwidth (GB/s, 0 to 30) for ro, wo, and rw accesses.]

  28. Device Power & Bandwidth: a bandwidth increment of 15 GB/s corresponds to an increment of about 2 W in device power. [Figure: power (W, 6 to 14) versus bandwidth (GB/s, 0 to 30) for ro, wo, and rw accesses.]

  29. Cooling Power Consumption (read only) [Figure: cooling power (W, 12 to 20) versus bandwidth (GB/s, 5 to 25) required to keep the temperature at 50, 55, 60, 65, and 70 °C.]

  30. Cooling Power Consumption (read only): a bandwidth increment of 15 GB/s corresponds to an increment of about 1.5 W in cooling power. [Figure: cooling power (W, 12 to 20) versus bandwidth (GB/s, 5 to 25) required to keep the temperature at 50, 55, 60, 65, and 70 °C.]
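
The deck quotes three approximate increments for the same 15 GB/s bandwidth step: about 4 °C in temperature, about 2 W in device power, and about 1.5 W in cooling power. A rough per-GB/s reading is sketched below. It assumes each relationship is linear over the measured range, which the slides suggest but do not state, and the dictionary keys are illustrative names.

```python
BANDWIDTH_STEP_GBYTES = 15.0  # bandwidth increment quoted on the slides

# Approximate increments reported for that step ("about" in the source).
increments = {
    "temperature_C": 4.0,
    "device_power_W": 2.0,
    "cooling_power_W": 1.5,
}

# ASSUMPTION: linear behaviour, so increment / step approximates a slope.
per_gbyte = {name: value / BANDWIDTH_STEP_GBYTES
             for name, value in increments.items()}

for name, slope in per_gbyte.items():
    print(f"{name}: about {slope:.3f} per GB/s")
```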

  31. Closed-Page Policy [Figure: bandwidth (GB/s, 0 to 25) for linear and random access patterns on 16 vaults and on 1 vault, for payload sizes from 16 B to 128 B in 16 B steps.]

  32. Closed-Page Policy: applications benefit from bank-level parallelism, not from spatial locality. [Figure: bandwidth (GB/s, 0 to 25) for linear and random access patterns on 16 vaults and on 1 vault, for payload sizes from 16 B to 128 B in 16 B steps.]

  33. Achieving High Bandwidth: promote bank-level parallelism; remap data to avoid internal organization bottlenecks; concatenate requests to use bandwidth effectively.

  34. Latency Deconstruction [Figure labels: software host (Pico API, PCIe driver); PCIe 3.0 x8; Pico EX700 PCIe switch; AC-510 FPGA with ports (9x), each with an address generator, monitoring, read tag pool, write request FIFO, data generator, and arbitration; HMC controller; transceivers; HMC.]

  36. Latency Deconstruction Summary: TX path: 287 ns; RX path: 260 ns; total: 547 ns. Stages: conversion to flits and buffering, 10 cycles; round-robin arbitration among ports, 2 to 9 cycles; adding packet fields and flow control, 10 cycles; serialization, 10 cycles; transmission (128 B), 15 cycles. Frequency: 187.5 MHz; cycle: 5.3 ns.
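
The cycle counts on this slide can be turned back into nanoseconds as a sanity check. The sketch below assumes that all five listed stages belong to the TX path, which the slide layout suggests but does not state explicitly; the stage names are shortened for readability.

```python
FPGA_FREQ_MHZ = 187.5
CYCLE_NS = 1e3 / FPGA_FREQ_MHZ  # about 5.33 ns, rounded to 5.3 ns on the slide

# (stage, min cycles, max cycles) as listed on the slide.
# ASSUMPTION: every stage here is part of the TX path.
STAGES = [
    ("conversion to flits & buffering", 10, 10),
    ("round-robin arbitration among ports", 2, 9),
    ("add packet fields & flow control", 10, 10),
    ("serialization", 10, 10),
    ("transmission (128B)", 15, 15),
]

min_cycles = sum(lo for _, lo, _ in STAGES)
max_cycles = sum(hi for _, _, hi in STAGES)
min_ns = min_cycles * CYCLE_NS
max_ns = max_cycles * CYCLE_NS

TX_NS, RX_NS = 287, 260         # path latencies reported on the slide
total_ns = TX_NS + RX_NS

print(f"stage sum: {min_cycles} to {max_cycles} cycles "
      f"({min_ns:.0f} to {max_ns:.0f} ns)")
print(f"reported TX + RX = {total_ns} ns")
```

With the largest arbitration delay, the stage sum lands within about a nanosecond of the reported 287 ns TX path, and the two reported paths add up to the 547 ns total.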

  37. Low-Load Latency [Figure: minimum, average, and maximum latency (µs) versus number of read requests (2 to 28), in four panels for request sizes of 16 B, 32 B, 64 B, and 128 B.]

  38. Low-Load Latency: larger request size, faster latency increment. [Figure: minimum, average, and maximum latency (µs) versus number of read requests (2 to 28), in four panels for request sizes of 16 B, 32 B, 64 B, and 128 B.]
