Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC)



SLIDE 1

Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC)

Bahar Asgari, Burhan Ahmad Mudassar, Saibal Mukhopadhyay, Sudhakar Yalamanchili, and Hyesoon Kim

IISWC’17 Talk

SLIDE 2

Memory Evolution

SLIDE 3

3D-Stacking Technology

Provides opportunities and novel features.

3D-DRAMs:

  • Provide higher bandwidth and density
  • Enable lower power consumption
  • Motivate processing-in-memory

HMC is an example of such memories.

SLIDE 4

New Considerations

  • New internal organization
  • New thermal behavior
  • New latency and bandwidth hierarchy
  • New packet-switched interface

[Figure labels: Operational Bound; Bandwidth and Temperature; Bandwidth and Power; Bandwidth and Latency; Device; Cooling]

SLIDE 5

Contributions

  • We evaluate a real system with HMC 1.1 to:
      • Study the new memory organization
      • Present bandwidth, power, and temperature relationships
      • Investigate the required cooling power
      • Explore contributing factors to latency
  • To realize the full-system impact of 3D-stacked memories, and HMC in particular.

[Figure: AC510 module; labels: HMC, FPGA]


SLIDE 7

Hybrid Memory Cube (HMC)

HMC 1.1 (Gen2): 4 GB size

[Figure: HMC structure; labels: Logic Layer, Vault Controller, DRAM Layer, TSV, Partition, Vault, Bank]

  • 16 banks per vault
  • Total number of banks = 256
  • Size of each bank = 16 MB
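The three figures on this slide are enough to derive the vault count and to cross-check the 4 GB capacity. The following is a minimal sketch of that arithmetic; the function and key names are illustrative and not taken from the talk.

```python
# Figures quoted on the slide for HMC 1.1 (Gen2).
BANKS_PER_VAULT = 16
TOTAL_BANKS = 256
BANK_SIZE_MB = 16


def hmc_geometry(banks_per_vault, total_banks, bank_size_mb):
    """Derive vault count, vault size, and total capacity from per-bank figures."""
    return {
        "vaults": total_banks // banks_per_vault,
        "vault_size_mb": banks_per_vault * bank_size_mb,
        "capacity_gb": total_banks * bank_size_mb / 1024,
    }


geometry = hmc_geometry(BANKS_PER_VAULT, TOTAL_BANKS, BANK_SIZE_MB)
print(geometry)
```

The derived capacity (256 banks of 16 MB each) agrees with the 4 GB size stated for the device.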

SLIDE 8

HMC Communication I

  • Follows a serialized packet-switched protocol
  • Packets are partitioned into 16-byte flits
  • Each transfer incurs 1 flit of overhead
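The one-flit overhead means small payloads use the links less efficiently than large ones. The sketch below works that out for a packet that carries a payload; it assumes the payload is a multiple of the 16-byte flit size, and the function names are illustrative.

```python
FLIT_BYTES = 16      # flit size from the slide
OVERHEAD_FLITS = 1   # per-transfer overhead from the slide


def packet_flits(payload_bytes):
    """Total flits for a packet carrying payload_bytes of data."""
    if payload_bytes <= 0 or payload_bytes % FLIT_BYTES != 0:
        raise ValueError("payload must be a positive multiple of the flit size")
    return payload_bytes // FLIT_BYTES + OVERHEAD_FLITS


def payload_efficiency(payload_bytes):
    """Fraction of the transferred bytes that is payload."""
    return payload_bytes / (packet_flits(payload_bytes) * FLIT_BYTES)


for size in (16, 32, 64, 128):
    print(size, packet_flits(size), round(payload_efficiency(size), 3))
```

Under these assumptions a 16 B payload spends half of its packet on overhead, while a 128 B payload spends one flit in nine.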


SLIDE 10

HMC Communication II

  • Two or four full-duplex external links:
      • Width of 8 or 16 lanes
      • Configurable speeds of 10, 12.5, and 15 Gbps

Our evaluated system: 2 external links, 8 lanes each
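The raw link bandwidth follows from the number of links, the lanes per link, and the lane speed. The sketch below computes it for the evaluated configuration (2 links, 8 lanes each) at each of the three listed lane speeds, since this slide does not say which speed was used. It ignores packet overhead and any line encoding, so the results are upper bounds; the function name is illustrative.

```python
def raw_link_bandwidth_gbytes(links, lanes_per_link, lane_gbps):
    """Raw bandwidth in GB/s for one direction, ignoring protocol overhead."""
    return links * lanes_per_link * lane_gbps / 8  # 8 bits per byte


for gbps in (10, 12.5, 15):
    one_way = raw_link_bandwidth_gbytes(2, 8, gbps)
    # Links are full duplex, so both directions can carry data at once.
    print(f"{gbps} Gbps per lane: {one_way} GB/s per direction, "
          f"{2 * one_way} GB/s in both directions")
```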

SLIDE 11

Experimental Setup I

  • Pico SC6 Mini
  • EX700 Backplane
  • AC510 Module
  • 4GB HMC 1.1

[Figure: measurement setup; labels: Pico SC6 Mini, EX700, AC-510 (HMC, FPGA), 15 W fan at 45°, distances of 45 cm, 90 cm, and 135 cm, power measurement (W), DC power supply for fan speed control]


SLIDE 13

Experimental Setup II

  • FPGA frequency: 187.5 MHz
  • Modified GUPS (giga updates per second) benchmark
  • Apply different masks to addresses

[Figure: system block diagram; labels: Host, Software, Pico API, Pico PCIe Driver, EX700, PCIe Switch, PCIe 3.0 x8, AC-510, FPGA, Ports (9x) each with Add. Gen., Monitoring, Wr. Req. FIFO, Rd. Tag Pool, Data Gen., and Arbitration, HMC Controller, two Transceivers, HMC]
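Applying a mask to randomly generated addresses restricts which parts of the memory a GUPS-style benchmark can reach. The following is a minimal software sketch of that idea, not the FPGA implementation used in the talk. It assumes the 34-bit address width given on the HMC Memory Addressing backup slide, and the function names are hypothetical.

```python
import random

ADDRESS_BITS = 34  # address field width from the HMC Memory Addressing slide


def zero_mask(low_bit, high_bit):
    """Bit mask with ones at positions low_bit..high_bit (inclusive)."""
    width = high_bit - low_bit + 1
    return ((1 << width) - 1) << low_bit


def masked_addresses(count, forced_zero_mask, seed=0):
    """Yield random addresses whose masked bits are forced to zero."""
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.getrandbits(ADDRESS_BITS) & ~forced_zero_mask


mask = zero_mask(7, 14)  # example: force bits 7 through 14 to zero
addresses = list(masked_addresses(1000, mask))
print(hex(mask), all(address & mask == 0 for address in addresses))
```

Forcing a contiguous bit range to zero is the form of masking that the Address Mapping backup slide reports results for.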

SLIDE 14

Access Patterns

[Figure: access patterns, ordered toward accessing fewer banks]



SLIDE 19

Bandwidth

[Figure: bandwidth (GB/s, ticks 5 to 30) per access pattern for three types of accesses: ro (read only), wo (write only), and rw (read and write); one legend entry is marked "(dependent)"]

  • Accessing 4 banks saturates the bandwidth of 1 vault.
  • External bandwidth is saturated at 4 vaults.

SLIDE 20

Thermal/Power Experiments

[Figure: measurement setup; labels: Pico SC6 Mini, EX700, AC-510, 15 W fan at 45°, distances of 45 cm, 90 cm, and 135 cm, power measurement (W), DC power supply for fan speed control]


SLIDE 22

Temperature (read only)

[Figure: temperature (°C, ticks 30 to 80) and bandwidth (GB/s, ticks 5 to 30) per access pattern for thermal configurations Cfg1 to Cfg4]

Access patterns affect temperature.


SLIDE 24

Temperature & Bandwidth

[Figure: temperature (°C, ticks 48 to 60) and bandwidth (GB/s, ticks 5 to 30) for ro, wo, and rw accesses]

A bandwidth increment of 15 GB/s corresponds to an increment of about 4°C in temperature.

SLIDE 25

Temperature & Bandwidth

[Figure: temperature (°C, ticks 48 to 60) and bandwidth (GB/s, ticks 5 to 30) for ro, wo, and rw accesses]

The slope is greater for writes: writes are more sensitive to temperature.

SLIDE 26

Device Power Consumption (read only)

[Figure: average power (W, ticks 4 to 18) and bandwidth (GB/s, ticks 5 to 30) per access pattern for thermal configurations Cfg1 to Cfg4]


SLIDE 28

Device Power & Bandwidth

[Figure: power (W, ticks 6 to 14) and bandwidth (GB/s, ticks 5 to 30) for ro, wo, and rw accesses]

A bandwidth increment of 15 GB/s corresponds to an increment of about 2 W in device power.


SLIDE 30

Cooling Power Consumption (read only)

[Figure: required cooling power (W, ticks 12 to 20) to keep the temperature at 50, 55, 60, 65, or 70 °C, plotted against bandwidth (GB/s, ticks 5 to 25)]

A bandwidth increment of 15 GB/s corresponds to an increment of about 1.5 W in cooling power.
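The talk quotes three increments for the same 15 GB/s bandwidth step: about 4°C in temperature, about 2 W in device power, and about 1.5 W in cooling power. The sketch below converts them to per-GB/s rates. It assumes the relationships are roughly linear over the measured range, which the slides suggest but do not state, and the names are illustrative.

```python
BANDWIDTH_STEP_GBS = 15.0

# Approximate increments quoted in the talk for a 15 GB/s bandwidth step.
INCREMENTS = {
    "temperature_c": 4.0,
    "device_power_w": 2.0,
    "cooling_power_w": 1.5,
}


def per_gbs(increment, step=BANDWIDTH_STEP_GBS):
    """Average change per GB/s, assuming a linear relationship."""
    return increment / step


def estimated_increment(metric, bandwidth_delta_gbs):
    """Linear estimate of the change in a metric for a bandwidth change."""
    return per_gbs(INCREMENTS[metric]) * bandwidth_delta_gbs


slopes = {name: per_gbs(value) for name, value in INCREMENTS.items()}
for name, slope in slopes.items():
    print(f"{name}: {slope:.3f} per GB/s")
```

These are rough rates for orientation only; the quoted increments come from different plots (the cooling figure is for read-only accesses), so they should not be combined without checking the measurement conditions.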


SLIDE 32

Closed-Page Policy

[Figure: bandwidth (GB/s, ticks 5 to 25) for linear and random access patterns over 16 vaults and over 1 vault, with payload sizes from 16B to 128B in 16B steps]

Applications benefit from bank-level parallelism, not from spatial locality.

SLIDE 33

Achieving High Bandwidth

  • Promote bank-level parallelism
  • Remap data to avoid internal organization bottlenecks
  • Concatenate requests to use bandwidth effectively

SLIDE 34

Latency Deconstruction

[Figure: system block diagram; labels: Host, Software, Pico API, Pico PCIe Driver, EX700, PCIe Switch, PCIe 3.0 x8, AC-510, FPGA, Ports (9x) each with Add. Gen., Monitoring, Wr. Req. FIFO, Rd. Tag Pool, Data Gen., and Arbitration, HMC Controller, two Transceivers, HMC]


SLIDE 36

Latency Deconstruction Summary

Stage                                 | Latency
Conversion to flits & buffering       | 10 cycles
Round-robin arbitration among ports   | 2-9 cycles
Add packet fields & flow control      | 10 cycles
Serialization                         | 10 cycles
Transmission (128B)                   | 15 cycles

Freq.: 187.5 MHz
Cycle: 5.3 ns

TX Path: 287 ns
RX Path: 260 ns
Total: 547 ns
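The cycle counts on this slide can be converted to time using the 187.5 MHz FPGA clock. The sketch below sums the listed stages for the best and worst arbitration cases. Treating the five stages as one sequential path is an assumption, and the names are illustrative.

```python
FPGA_FREQ_MHZ = 187.5

# (minimum cycles, maximum cycles) per stage, as listed on the slide.
STAGE_CYCLES = {
    "conversion to flits & buffering": (10, 10),
    "round-robin arbitration among ports": (2, 9),
    "add packet fields & flow control": (10, 10),
    "serialization": (10, 10),
    "transmission (128B)": (15, 15),
}


def cycle_ns(freq_mhz):
    """Clock period in nanoseconds."""
    return 1000.0 / freq_mhz


def stage_total_ns(stages, freq_mhz):
    """Best-case and worst-case total time for the listed stages."""
    low = sum(minimum for minimum, _ in stages.values())
    high = sum(maximum for _, maximum in stages.values())
    period = cycle_ns(freq_mhz)
    return low * period, high * period


best_ns, worst_ns = stage_total_ns(STAGE_CYCLES, FPGA_FREQ_MHZ)
print(round(cycle_ns(FPGA_FREQ_MHZ), 2), round(best_ns, 1), round(worst_ns, 1))
```

The stages add up to between 47 and 54 cycles. The 54-cycle case comes to roughly 288 ns, which is close to the 287 ns figure that appears on the slide.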


SLIDE 38

Low-Load Latency

[Figure: four panels (sizes 16B, 32B, 64B, and 128B) showing maximum, average, and minimum latency (us) versus number of read requests (ticks 2 to 28)]

A larger request size leads to a faster latency increment.

SLIDE 39

Low-Load Latency

[Figure: four panels (sizes 16B, 32B, 64B, and 128B) showing maximum, average, and minimum latency (us) versus number of read requests (ticks 2 to 28)]

Average latency increases because the maximum latency increases.

SLIDE 40

Low-Load Latency

[Figure: four panels (sizes 16B, 32B, 64B, and 128B) showing maximum, average, and minimum latency (us) versus number of read requests (ticks 2 to 28)]

125 ns is spent in the HMC.
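To put the 125 ns in context, it can be compared with the 547 ns total from the Latency Deconstruction Summary slide. The sketch below assumes the two figures are additive components of a single low-load request, which the slides imply but do not state explicitly; the names are illustrative.

```python
INFRASTRUCTURE_NS = 547.0  # "Total" on the Latency Deconstruction Summary slide
HMC_NS = 125.0             # time spent in the HMC, from this slide


def latency_shares(infrastructure_ns, hmc_ns):
    """Return the combined latency and each component's share of it."""
    total = infrastructure_ns + hmc_ns
    return total, infrastructure_ns / total, hmc_ns / total


total_ns, infra_share, hmc_share = latency_shares(INFRASTRUCTURE_NS, HMC_NS)
print(f"total {total_ns:.0f} ns, infrastructure {infra_share:.1%}, HMC {hmc_share:.1%}")
```

Under that assumption, most of the low-load latency is spent outside the memory device.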

SLIDE 41

High-Load Latency

[Figure: latency (us) and bandwidth (GB/s) per access pattern for request sizes 32B, 64B, and 128B]


SLIDE 43

Latency-Bandwidth

[Figure: latency (us) versus bandwidth (GB/s) from the lowest to the highest request rate, for sizes 16B, 32B, 64B, and 128B; panels: 4 banks and 2 banks]

  • Each layer/bank has a queue.
  • The limiting factor can be the queue size.
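One way to see why a queue size can cap bandwidth is Little's law, which relates the number of requests in flight to throughput and latency. The sketch below applies it in both directions. The queue depth and the operating point in the example are hypothetical values chosen for illustration, not measurements from the talk.

```python
def outstanding_requests(bandwidth_gbs, latency_us, request_bytes):
    """Little's law: requests in flight = request rate (1/s) x latency (s)."""
    requests_per_second = bandwidth_gbs * 1e9 / request_bytes
    return requests_per_second * latency_us * 1e-6


def bandwidth_ceiling_gbs(queue_slots, latency_us, request_bytes):
    """Highest bandwidth a queue with queue_slots entries can sustain."""
    return queue_slots * request_bytes / (latency_us * 1e-6) / 1e9


# Hypothetical operating point: 10 GB/s of 128 B requests at 2 us latency.
print(outstanding_requests(10.0, 2.0, 128))
# Hypothetical queue of 64 entries at the same latency and request size.
print(bandwidth_ceiling_gbs(64, 2.0, 128))
```

If latency stays fixed, a deeper queue or larger requests raise the ceiling, which is consistent with the talk's point that the queue size can be the limiting factor.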

SLIDE 44

Conclusions

  • Mixing read and write requests and using large request sizes lead to effective use of bi-directional bandwidth.
  • Distributing accesses prevents internal bottlenecks and exploits bank-level parallelism.
  • Controlling the request rate avoids high latency.
  • Employing fault-tolerant mechanisms and using proper cooling solutions enable temperature-sensitive operations to reach a higher bandwidth.
  • Reducing the latency overhead of the infrastructure will greatly benefit overall latency.

SLIDE 45

Backup Slides

SLIDE 46

HMC Memory Addressing

  • Closed-page policy; page size = 256 B
  • Low-order-interleaving address mapping policy
  • 34-bit address field:

[Figure: address field layout; field labels: Bank ID, Quadrant ID, Vault ID in a Quadrant, Block Address, Ignored; bit positions marked: 4, 7, 9, 11, 15, 32, 33; a 4K OS page is also marked]
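The address field can be decoded with shifts and masks. The sketch below assumes one reading of the figure labels: bits 0 to 3 ignored, block address in bits 4 to 6, vault ID within a quadrant in bits 7 to 8, quadrant ID in bits 9 to 10, and bank ID in bits 11 to 14. That layout is an interpretation of the bit positions marked on the slide, and the field and function names are hypothetical.

```python
# Assumed field positions (inclusive bit ranges), read off the slide's labels.
FIELDS = {
    "block_address": (4, 6),
    "vault_in_quadrant": (7, 8),
    "quadrant": (9, 10),
    "bank": (11, 14),
}


def extract(address, low, high):
    """Return the value of bits low..high (inclusive) of address."""
    return (address >> low) & ((1 << (high - low + 1)) - 1)


def decode(address):
    """Split an address into the assumed fields plus a flat vault number."""
    fields = {name: extract(address, low, high) for name, (low, high) in FIELDS.items()}
    fields["vault"] = fields["quadrant"] * 4 + fields["vault_in_quadrant"]
    return fields


# Walk one 4 KB OS page in 128 B steps and count the (vault, bank) pairs touched.
touched = {(decode(a)["vault"], decode(a)["bank"]) for a in range(0, 4096, 128)}
print(decode(128), len(touched))
```

Under this assumed layout, consecutive 128 B blocks fall in different vaults, which is the effect a low-order-interleaving policy is meant to have, and a single 4 KB page is spread over many banks.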

SLIDE 47

Experimental Setup III

             | Full-scale GUPS                                      | Small-scale GUPS          | Stream GUPS
Addresses    | Random, configurable mask                            | Random, configurable mask | Defined by user
Request rate | Maximum                                              | Configurable              | Minimum
Experiment   | Bandwidth, power, temperature, and high-load latency | Latency-bandwidth         | Integrity check and low-load latency

[Figure: system block diagram; labels: Host, Software, Pico API, Pico PCIe Driver, EX700, PCIe Switch, PCIe 3.0 x8, AC-510, FPGA (GUPS), Ports (9x) each with Add. Gen., Monitoring, Wr. Req. FIFO, Rd. Tag Pool, Data Gen., and Arbitration, HMC Controller, two Transceivers, HMC]

SLIDE 48

Thermal Configurations

[Figure: measurement setup; labels: Pico SC6 Mini, EX700, AC-510, 15 W fan at 45°, distances of 45 cm, 90 cm, and 135 cm, power measurement (W), DC power supply for fan speed control]

SLIDE 49

Cooling Power

[Figure: measurement setup; labels: Pico SC6 Mini, EX700, AC-510, 15 W fan at 45°, distances of 45 cm, 90 cm, and 135 cm, power measurement (W), DC power supply for fan speed control]

Configuration | Cooling Power
cfg1          | 19.32 W
cfg2          | 15.90 W
cfg3          | 13.90 W
cfg4          | 10.78 W

SLIDE 50

HMC Communication II

  • Two or four full-duplex external links:
      • Width of 16 or 8 lanes
      • Configurable speeds of 10, 12.5, and 15 Gbps

SLIDE 51

Address Mapping

[Figure: bandwidth (GB/s, ticks 5 to 25) for ro, rw, and wo accesses when address bits are forced to zero; bit ranges: 24-31, 10-17, 7-14, 3-10, 2-9, 1-8, 0-7; annotations: 1 bank, 1 vault, 2 vaults, 8 vaults; address field labels: Bank ID, Quadrant ID, Vault ID in a Quadrant, Block Address, Ignored; bit positions marked: 4, 7, 9, 11, 15, 32, 33]
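Forcing a range of address bits to zero limits how many vaults and banks remain reachable. The sketch below counts them for each bit range on this slide. It assumes the vault ID occupies bits 7 to 10 (vault within quadrant plus quadrant) and the bank ID occupies bits 11 to 14, which is one reading of the address field figure; the names are hypothetical.

```python
VAULT_BITS = range(7, 11)   # assumed: vault-in-quadrant (7-8) and quadrant (9-10)
BANK_BITS = range(11, 15)   # assumed: bank ID (11-14)


def reachable(forced_low, forced_high):
    """Return (vaults, banks per vault) still reachable under the mask."""
    forced = set(range(forced_low, forced_high + 1))
    free_vault_bits = sum(1 for bit in VAULT_BITS if bit not in forced)
    free_bank_bits = sum(1 for bit in BANK_BITS if bit not in forced)
    return 2 ** free_vault_bits, 2 ** free_bank_bits


RANGES = [(24, 31), (10, 17), (7, 14), (3, 10), (2, 9), (1, 8), (0, 7)]
for low, high in RANGES:
    vaults, banks = reachable(low, high)
    print(f"bits {low}-{high} forced to zero: {vaults} vault(s), {banks} bank(s) per vault")
```

Under this assumed layout, the 7-14 range leaves a single bank, 3-10 a single vault, and 2-9 two vaults, which lines up with the "1 bank", "1 vault", and "2 vaults" annotations on the slide.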

SLIDE 52

Bandwidth II

[Figure: bandwidth (GB/s, ticks 5 to 25) and millions of requests per second (MRPS, ticks 50 to 350) per access pattern for request sizes 128B, 64B, and 32B]

SLIDE 53

Latency-Bandwidth II

[Figure: latency (us) versus bandwidth (GB/s) for size 32B; series: 1, 2, 4, and 8 banks, and 1, 2, 4, 8, and 16 vaults]

SLIDE 54

Latency-Bandwidth III

[Figure: latency (us) versus bandwidth (GB/s); panels: sizes 16B, 32B, 64B, and 128B; series: 1, 2, 4, and 8 banks, and 1, 2, 4, 8, and 16 vaults]