Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC)
Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, and Hyesoon Kim
IISWC’17 Talk
Memory Evolution
IISWC’17
3D-Stacking Technology
Provides opportunities & novel features
3D-DRAMs:
- Provide higher bandwidth and density
- Enable lower power consumption
- Motivate processing-in-memory
HMC is an example of such memories.
New Considerations
- New internal organization
- New thermal behavior
- New latency and bandwidth hierarchy
- New packet-switched interface
[Figure: operational bounds. Labels: Bandwidth and Temperature; Bandwidth and Power; Bandwidth and Latency; Device Cooling]
Contributions
- We evaluate a real system with HMC 1.1 to:
- Study new memory organization
- Present bandwidth, power, and temperature relationships
- Investigate required cooling power
- Explore contributing factors to latency
- To realize the full-system impact of 3D-stacked memories and HMC in particular.
[Figure: AC510 module with HMC and FPGA]
Hybrid Memory Cube (HMC)
HMC 1.1 (Gen2): 4GB size
[Figure: HMC structure. Labels: Logic Layer, Vault Controller, DRAM Layer, TSV, Partition, Vault, Bank]
16 Banks/Vault
Total Number of Banks = 256
Size of Each Bank = 16 MB
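The organization numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the slide's figures (4GB capacity, 16 banks per vault, 256 banks) and assumes binary units (1 GB = 1024^3 bytes), which the slide does not state explicitly.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

capacity_bytes = 4 * GIB   # HMC 1.1 (Gen2) capacity, from the slide
banks_per_vault = 16       # from the slide
total_banks = 256          # from the slide

# Derived quantities (binary units are an assumption)
vaults = total_banks // banks_per_vault
bank_size_mib = capacity_bytes // total_banks // MIB
vault_size_mib = bank_size_mib * banks_per_vault

print("vaults:", vaults)
print("bank size (MB):", bank_size_mib)
print("vault size (MB):", vault_size_mib)
```

Under these assumptions the derived bank size agrees with the 16 MB stated on the slide, and the device works out to 16 vaults.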
HMC Communication I
- Follows a serialized packet-switched protocol
- Packets are partitioned into 16-byte flits
- Each transfer incurs 1 flit of overhead
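To see why the one-flit overhead matters more for small requests, the following sketch estimates what fraction of transferred bytes is payload. It assumes the simple model stated on the slide (16-byte flits, exactly one overhead flit per transfer); the function names are illustrative, not part of any HMC tooling.

```python
FLIT_BYTES = 16      # flit size, from the slide
OVERHEAD_FLITS = 1   # per-transfer overhead, from the slide

def flits_on_link(payload_bytes):
    """Total flits for one transfer: data flits plus the overhead flit."""
    data_flits = -(-payload_bytes // FLIT_BYTES)  # ceiling division
    return data_flits + OVERHEAD_FLITS

def payload_efficiency(payload_bytes):
    """Fraction of transferred bytes that carry payload."""
    return payload_bytes / (flits_on_link(payload_bytes) * FLIT_BYTES)

for size in (16, 32, 64, 128):
    print(size, flits_on_link(size), round(payload_efficiency(size), 3))
```

Under this model a 16B transfer spends half of its flits on overhead, while a 128B transfer spends one flit in nine, which is consistent with the deck's later advice to use large request sizes.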
HMC Communication II
- Two or four full-duplex external links:
- Width of 8 or 16 lanes
- Configurable speeds of 10, 12.5, and 15 Gbps
Our evaluated system: 2 external links, 8 lanes each
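As a rough guide to what this link configuration can carry, the sketch below computes the raw peak link bandwidth for each configurable lane speed. It assumes both directions of a full-duplex link are counted and ignores packet overhead and encoding. The slide does not say which speed the evaluated system uses, so all three are shown, and the function name is illustrative.

```python
def peak_link_bandwidth_gbytes(links, lanes_per_link, lane_gbps, full_duplex=True):
    """Raw peak bandwidth in GB/s (gigabits summed over lanes, divided by 8)."""
    directions = 2 if full_duplex else 1
    return links * lanes_per_link * lane_gbps * directions / 8

# Evaluated system: 2 links, 8 lanes each (from the slide)
for speed in (10, 12.5, 15):
    print(speed, "Gbps per lane ->", peak_link_bandwidth_gbytes(2, 8, speed), "GB/s")
```

These are upper bounds on signaling capacity; measured bandwidth will be lower once request and response overhead is included.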
Experimental Setup I
- Pico SC6 Mini
- EX700 Backplane
- AC510 Module
- 4GB HMC 1.1
[Figure: experimental setup. Labels: Pico SC6 Mini; EX700; AC-510 (HMC, FPGA); 15 W fan at 45°; distances of 45 cm, 90 cm, and 135 cm; power measurement; DC power supply for fan speed control]
Experimental Setup II
- FPGA frequency: 187.5 MHz
- Modified GUPS (giga updates per second) benchmark
- Apply different masks to addresses
[Figure: system block diagram. Host (software, Pico API, Pico PCIe driver) connected over PCIe 3.0 x8 through the EX700 PCIe switch to the AC-510 FPGA. The FPGA contains 9 ports (each with Add. Gen., Monitoring, Wr. Req. FIFO, Rd. Tag Pool, Data Gen., and Arbitration), the HMC controller, and two transceivers connected to the HMC]
Access Patterns
[Figure: access patterns, arranged in order of accessing fewer banks]
Bandwidth
[Figure: bandwidth (GB/s) versus access pattern for read-only (ro), write-only (wo), and read-write (rw) accesses]
Accessing 4 banks saturates the bandwidth of 1 vault.
External bandwidth saturates at 4 vaults.
Thermal/Power Experiments
[Figure: thermal and power measurement setup. Labels: Pico SC6 Mini; EX700; AC-510; 15 W fan at 45°; distances of 45 cm, 90 cm, and 135 cm; power measurement; DC power supply for fan speed control]
Temperature (read only)
[Figure: temperature (°C) and bandwidth (GB/s) versus access pattern for thermal configurations Cfg1 to Cfg4]
Access patterns affect temperature.
Temperature & Bandwidth
[Figure: temperature (°C) versus bandwidth (GB/s) for ro, wo, and rw accesses]
A bandwidth increase of 15 GB/s corresponds to a temperature increase of about 4°C.
The slope is greater for writes: writes are more sensitive to temperature.
Device Power Consumption (read only)
[Figure: average device power (W) and bandwidth (GB/s) versus access pattern for thermal configurations Cfg1 to Cfg4]
Device Power & Bandwidth
[Figure: device power (W) versus bandwidth (GB/s) for ro, wo, and rw accesses]
A bandwidth increase of 15 GB/s corresponds to an increase of about 2 W in device power.
Cooling Power Consumption (read only)
[Figure: cooling power (W) required to keep the temperature at a given level, versus bandwidth (GB/s); legend values: 50, 55, 60, 65, 70 °C]
A bandwidth increase of 15 GB/s corresponds to an increase of about 1.5 W in cooling power.
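The three "per 15 GB/s" observations from the temperature, device power, and cooling power slides can be turned into rough per-GB/s slopes. The sketch below assumes the trends are linear over the measured range, which the slides suggest but do not state. The constant and function names are illustrative.

```python
# Slopes derived from the slides' "per 15 GB/s" observations.
# Linearity is an assumption, not a claim made by the slides.
TEMP_C_PER_GBPS = 4.0 / 15      # about 4 °C per 15 GB/s (fixed cooling)
DEVICE_W_PER_GBPS = 2.0 / 15    # about 2 W device power per 15 GB/s
COOLING_W_PER_GBPS = 1.5 / 15   # about 1.5 W cooling power per 15 GB/s
                                # (to hold temperature constant)

def estimated_increments(delta_bandwidth_gbps):
    """Rough increments for a given bandwidth increase, in GB/s."""
    return {
        "temperature_c": TEMP_C_PER_GBPS * delta_bandwidth_gbps,
        "device_power_w": DEVICE_W_PER_GBPS * delta_bandwidth_gbps,
        "cooling_power_w": COOLING_W_PER_GBPS * delta_bandwidth_gbps,
    }

print(estimated_increments(15))
print(estimated_increments(5))
```

The temperature and cooling figures describe alternatives rather than simultaneous effects: either the device warms up at fixed cooling, or extra cooling power is spent to hold the temperature.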
Closed-Page Policy
[Figure: bandwidth (GB/s) for linear and random access patterns over 16 vaults and over 1 vault, for payload sizes from 16B to 128B]
Applications benefit from bank-level parallelism, not from spatial locality.
Achieving High Bandwidth
- Promote bank-level parallelism
- Remap data to avoid internal organization bottlenecks
- Concatenate requests to use bandwidth effectively
Latency Deconstruction
[Figure: system block diagram used for latency deconstruction. Host (software, Pico API, Pico PCIe driver), PCIe 3.0 x8, EX700 PCIe switch, AC-510 FPGA with 9 ports, HMC controller, transceivers, HMC]
Latency Deconstruction Summary
Stage: Latency
Conversion to flits & buffering: 10 cycles
Round-robin arbitration among ports: 2-9 cycles
Add packet fields & flow control: 10 cycles
Serialization: 10 cycles
Transmission (128B): 15 cycles
Freq.: 187.5 MHz, Cycle: 5.3 ns
TX path and RX path together: 260 ns + 287 ns = 547 ns total
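The per-stage cycle counts can be summed and converted to time at the stated FPGA frequency. This is a minimal sketch that uses only the numbers on the slide; treating the listed stages as one direction of the path is an assumption.

```python
FPGA_FREQ_MHZ = 187.5              # from the slide
CYCLE_NS = 1000 / FPGA_FREQ_MHZ    # about 5.33 ns (the slide rounds to 5.3 ns)

# (stage, min cycles, max cycles), taken from the slide
STAGES = [
    ("Conversion to flits & buffering", 10, 10),
    ("Round-robin arbitration among ports", 2, 9),
    ("Add packet fields & flow control", 10, 10),
    ("Serialization", 10, 10),
    ("Transmission (128B)", 15, 15),
]

min_cycles = sum(lo for _, lo, _ in STAGES)
max_cycles = sum(hi for _, _, hi in STAGES)

print("cycles:", min_cycles, "to", max_cycles)
print("time (ns):", round(min_cycles * CYCLE_NS, 1), "to", round(max_cycles * CYCLE_NS, 1))
```

The upper end of this range lands close to the 287 ns figure on the slide, which suggests (but does not confirm) that the listed stages describe one direction of the path. The only variable term is arbitration among ports.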
Low-Load Latency
[Figure: minimum, average, and maximum latency (us) versus number of read requests (2 to 28) for request sizes of 16B, 32B, 64B, and 128B]
Larger request sizes lead to a faster increase in latency.
Average latency increases because maximum latency increases.
125 ns is spent in the HMC.
High-Load Latency
[Figure: latency (us) and bandwidth (GB/s) versus access pattern for request sizes of 32B, 64B, and 128B]
Latency-Bandwidth
[Figure: latency (us) versus bandwidth (GB/s), from lowest to highest request rate, for request sizes of 16B, 32B, 64B, and 128B; panels for 4-bank and 2-bank access patterns]
Each layer/bank has a queue.
The limiting factor can be the queue size.
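One way to reason about queue occupancy from a latency-bandwidth point is Little's law (items in flight = throughput × latency). The slide does not mention Little's law, so it is used here as an assumed model. The sketch below takes 1 GB as 10^9 bytes, and the input values in the example are hypothetical rather than read from the figure.

```python
def outstanding_requests(bandwidth_gbytes_per_s, latency_us, request_bytes):
    """Estimate requests in flight via Little's law (an assumed model)."""
    bytes_in_flight = bandwidth_gbytes_per_s * 1e9 * latency_us * 1e-6
    return bytes_in_flight / request_bytes

# Hypothetical operating point: 10 GB/s at 2 us latency with 128B requests
print(outstanding_requests(10, 2, 128))
```

If latency keeps rising while bandwidth stays flat, the estimated number of requests in flight grows, which is the behavior expected when a queue has filled.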
Conclusions
- Mixing read and write requests and using large request sizes lead to effective use of bi-directional bandwidth.
- Distributing accesses prevents internal bottlenecks and exploits bank-level parallelism.
- Controlling the request rate avoids high latency.
- Employing fault-tolerant mechanisms and using proper cooling solutions enable temperature-sensitive operations to reach a higher bandwidth.
- Reducing the latency overhead of the infrastructure will greatly benefit latency.
Backup Slides
HMC Memory Addressing
- Closed-page policy (page size = 256 B)
- Low-order-interleaving address mapping policy
- 34-bit address field:
Bits 0-3: Ignored
Bits 4-6: Block Address
Bits 7-8: Vault ID in a Quadrant
Bits 9-10: Quadrant ID
Bits 11-14: Bank ID
Bits 15-31: …
Bits 32-33: Ignored
(A 4K OS page spans bits 0-11.)
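The address field explains how masking address bits restricts which vaults and banks a benchmark can reach. The sketch below assumes one reading of the slide's bit boundaries (4, 7, 9, 11, 15): vault selection in bits 7-10 (vault-in-quadrant plus quadrant) and bank ID in bits 11-14. The function names are illustrative and not part of any benchmark or vendor API.

```python
# Field positions are an assumed reading of the slide's address-field figure.
VAULT_BITS = range(7, 11)   # vault ID in quadrant (7-8) + quadrant ID (9-10)
BANK_BITS = range(11, 15)   # bank ID (11-14)

def apply_mask(addr, lo, hi):
    """Force address bits lo..hi (inclusive) to zero."""
    mask = ((1 << (hi - lo + 1)) - 1) << lo
    return addr & ~mask

def reachable(lo, hi):
    """(vaults, banks per vault) still reachable with bits lo..hi zeroed."""
    zeroed = set(range(lo, hi + 1))
    free_vault_bits = [b for b in VAULT_BITS if b not in zeroed]
    free_bank_bits = [b for b in BANK_BITS if b not in zeroed]
    return 2 ** len(free_vault_bits), 2 ** len(free_bank_bits)

for lo, hi in [(24, 31), (7, 14), (3, 10), (2, 9), (0, 7)]:
    print(f"bits {lo}-{hi} zeroed ->", reachable(lo, hi))
```

Under this reading, zeroing bits 7-14 confines accesses to a single bank of a single vault, while zeroing bits 3-10 confines them to one vault with all of its banks still reachable.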
Experimental Setup III
Full-scale GUPS: addresses random with configurable mask; request rate maximum; experiments: bandwidth, power, temperature, high-load latency
Small-scale GUPS: addresses random with configurable mask; request rate configurable; experiments: latency-bandwidth
Stream GUPS: addresses defined by user; request rate minimum; experiments: integrity check, low-load latency
[Figure: system block diagram with GUPS running on the FPGA. Host (software, Pico API, Pico PCIe driver), PCIe 3.0 x8, EX700 PCIe switch, AC-510 FPGA with 9 ports, HMC controller, transceivers, HMC]
Thermal Configurations
[Figure: thermal configurations. Labels: Pico SC6 Mini; EX700; AC-510; 15 W fan at 45°; distances of 45 cm, 90 cm, and 135 cm; power measurement; DC power supply for fan speed control]
Cooling Power
Configuration: Cooling Power
cfg1: 19.32 W
cfg2: 15.90 W
cfg3: 13.90 W
cfg4: 10.78 W
Address Mapping
[Figure: bandwidth (GB/s) for ro, rw, and wo accesses when address bits 24-31, 10-17, 7-14, 3-10, 2-9, 1-8, or 0-7 are forced to zero; annotated with the resulting access patterns (1 bank, 1 vault, 2 vaults, 8 vaults) and the address field layout (Bank ID, Quadrant ID, Vault ID in a Quadrant, Block Address)]
Bandwidth II
[Figure: bandwidth (GB/s) and millions of requests per second (MRPS) versus access pattern for request sizes of 32B, 64B, and 128B]
Latency-Bandwidth II
[Figure: latency (us) versus bandwidth (GB/s) for access patterns from 1 bank to 16 vaults; panels for request sizes of 32B and 128B]
Latency-Bandwidth III
[Figure: latency (us) versus bandwidth (GB/s) for access patterns from 1 bank to 16 vaults; panels for request sizes of 16B, 32B, and 64B]