SLIDE 1

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube (HMC)

Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, and Hyesoon Kim

ISPASS 2018

SLIDE 2


Introduction to HMC

• Hybrid Memory Cube (HMC) vs. High-Bandwidth Memory (HBM)

  • HMC: serial, packet-based interface
  • HBM: wide bus, standard DRAM protocol

• Found in high-end GPUs and Intel’s Knights Landing

Illustration credits: AMD and Micron

SLIDE 3


Why is HMC Interesting?

• Serialized, high-speed links address the pin-limitation issues seen with DRAM and HBM

• The abstracted packet interface provides opportunities for novel memory technologies and addressing schemes

  • Can be used with DRAM, PCM, STT-RAM, NVM, etc.

• The memory controller sits on top of a “routing” layer

  • Allows for more interesting connections between processors and memory elements

• This study addresses the impact of the network-on-chip (NoC) for architects and application developers

Illustration credits: Micron

SLIDE 4


This Study’s Contributions

[Figure: Test setup — host with driver software and configs/memory traces, connected over PCIe 3.0 x16 to an EX700 board carrying an AC-510 module (FPGA plus HMC); the FPGA reaches the HMC vaults through the logic layer and its NoC.]

We examine the NoC of the HMC using an FPGA-based prototype to answer the following:

1) How does the NoC behave under low- and high-load conditions?
2) Can we relate QoS concepts to 3D-stacked memories?
3) How does the NoC affect latency within the HMC?
4) What potential bottlenecks are there, and how can we avoid them?

SLIDE 5


Hybrid Memory Cube (HMC)


[Figure: HMC 1.1 (Gen2), 4 GB — DRAM layers partitioned into vaults of banks, stacked above a logic layer of vault controllers and connected by TSVs.]

SLIDE 6


Hybrid Memory Cube (HMC)


[Figure: Same HMC 1.1 (Gen2) structure as the previous slide — vaults of banks over a logic layer of vault controllers, connected by TSVs.]

16 banks/vault; total number of banks = 256; size of each bank = 16 MB
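The geometry above can be cross-checked with a few lines of arithmetic; the sketch below derives the vault count and total capacity purely from the numbers on this slide (nothing here comes from the HMC specification itself).

#include <stdio.h>

int main(void) {
    /* Numbers taken from this slide: HMC 1.1 (Gen2), 4 GB part. */
    const int banks_per_vault = 16;
    const int total_banks     = 256;
    const int bank_size_mb    = 16;

    int vaults      = total_banks / banks_per_vault;      /* 256 / 16 = 16 vaults  */
    int capacity_gb = total_banks * bank_size_mb / 1024;  /* 256 * 16 MB = 4 GB    */
    printf("%d vaults, %d GB total\n", vaults, capacity_gb);
    return 0;
}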

SLIDE 7


HMC Memory Addressing


  • Closed-page policy; page size = 256 B
  • Low-order-interleaving address mapping policy
  • 34-bit address field:

[Figure: 34-bit address layout (MSB to LSB): ignored bits, bank ID, quadrant ID, vault ID within a quadrant, block address, with field boundaries marked at bits 4, 7, 9, 11, 15, 32, and 33; the low-order fields fall within a 4 KB OS page.]
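To make the mapping concrete, here is a small decoding sketch. The field order follows the figure above (block address in the low bits, then vault-within-quadrant, quadrant ID, and bank ID, with the top bits ignored); the exact bit widths are assumptions chosen to match the geometry elsewhere in this deck (16 vaults in 4 quadrants, 16 banks per vault), not values taken from the HMC specification.

#include <stdint.h>
#include <stdio.h>

/* Assumed field layout (low-order interleaving, LSB first):
   [6:0]   block offset         (7 bits, assumption)
   [8:7]   vault in quadrant    (2 bits: 4 vaults per quadrant)
   [10:9]  quadrant             (2 bits: 4 quadrants -> 16 vaults total)
   [14:11] bank                 (4 bits: 16 banks per vault)
   above:  remaining DRAM address bits; the top two bits are ignored. */
typedef struct { unsigned vault_in_quad, quadrant, bank, vault; } hmc_loc_t;

static hmc_loc_t decode(uint64_t addr) {
    hmc_loc_t loc;
    loc.vault_in_quad = (addr >> 7)  & 0x3;
    loc.quadrant      = (addr >> 9)  & 0x3;
    loc.bank          = (addr >> 11) & 0xF;
    loc.vault         = (addr >> 7)  & 0xF;   /* quadrant + vault-in-quadrant */
    return loc;
}

int main(void) {
    /* With low-order interleaving, consecutive blocks land in consecutive vaults. */
    for (uint64_t addr = 0; addr < 8 * 128; addr += 128) {
        hmc_loc_t l = decode(addr);
        printf("addr 0x%05llx -> quadrant %u, vault %2u, bank %u\n",
               (unsigned long long)addr, l.quadrant, l.vault, l.bank);
    }
    return 0;
}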

SLIDE 8


HMC Communication I


  • Follows a serialized, packet-switched protocol
  • Packets are partitioned into 16-byte flits
  • Each transfer incurs one flit of overhead
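A back-of-the-envelope sketch of what those two bullets imply for link efficiency; the accounting below assumes exactly one overhead flit per transfer and payloads that are multiples of a flit, as stated above.

#include <stdio.h>

#define FLIT_BYTES 16  /* flit size from the slide */

/* Flits needed for a transfer carrying `data_bytes` of payload:
   one overhead flit plus the payload rounded up to whole flits. */
static int flits_for(int data_bytes) {
    return 1 + (data_bytes + FLIT_BYTES - 1) / FLIT_BYTES;
}

int main(void) {
    int sizes[] = {16, 32, 64, 128};   /* request sizes studied in this deck */
    for (int i = 0; i < 4; i++) {
        int f = flits_for(sizes[i]);
        double eff = (double)sizes[i] / (f * FLIT_BYTES);
        printf("%3d B payload: %d flits, %.0f%% of link bytes carry data\n",
               sizes[i], f, 100.0 * eff);
    }
    return 0;
}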
SLIDE 9


HMC Communication II


[Figure: Flow control for request and response packets.]

SLIDE 10


Our HMC Test Infrastructure

[Figure: Test infrastructure — host (driver software, configs/memory traces), PCIe 3.0 x16, EX700 board, and AC-510 module with FPGA and the HMC's logic layer, NoC, and vaults (same diagram as the contributions slide).]

  • Micron’s AC-510 module contains a Xilinx Kintex FPGA and an HMC 1.1 4 GB part
  • Two half-width links provide a total of 60 GB/s of bandwidth
  • Host software communicates over PCIe with FPGA-based queues
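The 60 GB/s figure can be reproduced from the link parameters shown in the methodology diagrams (two links, eight lanes each, 15 Gb/s per lane); the sketch below assumes the quoted total aggregates both transmit and receive directions.

#include <stdio.h>

int main(void) {
    /* Link parameters from the methodology diagrams: 2x half-width links,
       8 lanes each, 15 Gb/s per lane (assumption: the quoted 60 GB/s
       counts both directions of each full-duplex link). */
    const int links = 2, lanes_per_link = 8, directions = 2;
    const double gbit_per_lane = 15.0;

    double total_gbit = links * lanes_per_link * directions * gbit_per_lane;
    printf("Aggregate raw link bandwidth: %.0f GB/s\n", total_gbit / 8.0); /* 60 GB/s */
    return 0;
}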
SLIDE 11


Methodology (GUPS)

[Figure: GUPS test design — GUPS software on the host uses the Pico API and Pico PCIe driver, crossing the EX700's PCIe switch (PCIe 3.0 x16 to the board, x8 into the FPGA) to reach the AC-510. The GUPS firmware instantiates 9 ports, each with an address generator, write-request FIFO, read-tag pool, data generator, arbitration, and monitoring; the ports feed the HMC controller over AXI-4 and two 8-lane, 15 Gb/s links into the HMC's transceivers, NoC, and vaults.]
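For a feel of the traffic these ports generate, below is a host-side software analogue of a GUPS-style random-access kernel. The real measurements come from the FPGA firmware's address generators and read-tag pools, not from host code, and the table size and update count here are arbitrary placeholders.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Software analogue of the GUPS-style traffic the firmware ports generate:
   random read-modify-write updates over a large table (sizes are placeholders). */
#define TABLE_WORDS (1u << 22)   /* 4M 8-byte words = 32 MB (assumption) */
#define NUM_UPDATES (1u << 22)

static uint64_t xorshift64(uint64_t *s) {
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17;
    return *s;
}

int main(void) {
    uint64_t *table = malloc((size_t)TABLE_WORDS * sizeof *table);
    if (!table) return 1;
    for (uint32_t i = 0; i < TABLE_WORDS; i++) table[i] = i;

    uint64_t seed = 0x2545F4914F6CDD1DULL;
    for (uint32_t i = 0; i < NUM_UPDATES; i++) {
        uint64_t r = xorshift64(&seed);
        table[r & (TABLE_WORDS - 1)] ^= r;   /* random update, GUPS-style */
    }
    printf("done: %u updates\n", NUM_UPDATES);
    free(table);
    return 0;
}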

SLIDE 12


Methodology (multi-port stream)

[Figure: Multi-port stream test design — same host/EX700/AC-510 path as the GUPS design, but the firmware replays memory traces from the host: glue logic feeds per-port command, write-data, read-data, and read-address FIFOs alongside the read-tag pools, data generators, arbitration, and monitoring, driving the HMC controller, NoC, and vaults over the same links.]

SLIDE 13


Experiments

[1] High-contention latency analysis (GUPS design)
[2] Low-contention latency analysis (multi-port stream)
[3] Quality-of-service analysis (multi-port stream)
[4] High-contention latency histograms per vault (multi-port stream)
[5] Requested and response bandwidth analysis (GUPS)

SLIDE 14


[1] Read-only Latency vs. Bandwidth

[Plot: Read-only latency (μs) vs. bandwidth (GB/s) for request sizes of 16 B, 32 B, 64 B, and 128 B, sweeping parallelism from 1, 2, 4, and 8 banks up to 1, 2, 4, 8, and 16 vaults.]

SLIDE 15


[2] Average Latency vs Requests

[Plot: Average latency (μs) vs. number of outstanding read requests (up to about 55) for 16 B, 32 B, 64 B, and 128 B requests.]

SLIDE 16


[2] Average Latency vs Requests II

[Plot: Average latency (μs) vs. number of outstanding read requests (up to 300) for 16 B, 32 B, 64 B, and 128 B requests; a linear increase is marked.]

SLIDE 17


[3] QoS for 4 Vaults

[Plots: Maximum latency (μs) per vault number (1–15) for 16 B, 32 B, 64 B, and 128 B requests in the 4-vault QoS experiment.]

SLIDE 18


[4] Latency vs. Request Size

[Plots: Per-vault latency histograms for 16 B, 32 B, 64 B, and 128 B requests across vaults 1–15; latency bins range from roughly 1617–1675 ns for 16 B requests, 1931–2135 ns for 32 B, 2573–3114 ns for 64 B, and 3894–4300 ns for 128 B.]

SLIDE 19


[4] Latency vs. Request Size

[Plots: The same per-vault latency histograms for 16 B, 32 B, 64 B, and 128 B requests, shown vault by vault (1–15) over the same latency ranges as the previous slide.]

SLIDE 20


[4] Average Latency – 4 Vaults

[Plot: Average latency (μs) and latency standard deviation (σ, ns) vs. request size (16 B, 32 B, 64 B, 128 B) for the 4-vault experiment.]

SLIDE 21


[5] GUPS – Bandwidth vs. Active Ports

[Plots: Bandwidth (GB/s) vs. number of active ports (1–9, request bandwidth) for (a) 16 B and (b) 32 B requests, across configurations from 1, 2, 4, and 8 banks up to 1, 2, 4, 8, and 16 vaults.]

SLIDE 22


[5] GUPS – Bandwidth vs. Active Ports II

[Plots: Bandwidth (GB/s) vs. number of active ports (1–9, request bandwidth) for (c) 64 B and (d) 128 B requests, across the same 1-bank to 16-vault configurations.]

SLIDE 23


[6] GUPS – Outstanding Requests

[Plot: Number of outstanding requests vs. request size (16–128 B, plus the average) for 2-bank and 4-bank configurations.]

SLIDE 24


Takeaways

• Large and small requests allow tuning for bandwidth- or latency-optimized applications better than DRAM does

• Vault- and bank-level parallelism are key to achieving higher bandwidth

• Vault latencies are more correlated with access patterns and traffic than with physical vault location

• Queuing delays will continue to be a concern with NoCs in the HMC

  • Address these via host-side queuing/scheduling or by distributing accesses across vaults (data structures or compiler passes); see the sketch after this list

• The HMC’s NoC complicates QoS due to variability

  • However, trade-offs in packet size and “private” vaults can improve QoS
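As a concrete illustration of the host-side scheduling idea above, the sketch below bins a batch of request addresses by the vault they map to and drains the bins round-robin, so consecutive issues do not pile onto one vault. The vault-extraction bit positions are the same assumptions used in the earlier addressing sketch, not the exact HMC mapping, and this is only one way to spread load, not the paper's mechanism.

#include <stdint.h>
#include <stdio.h>

#define NUM_VAULTS    16   /* 16 vaults, as on the earlier slides */
#define MAX_PER_VAULT 64   /* per-vault bin capacity for this sketch */

/* Hypothetical vault extraction: same assumed bit positions as the
   addressing sketch (7-bit block offset, 4 vault bits above it). */
static unsigned vault_of(uint64_t addr) {
    return (addr >> 7) & (NUM_VAULTS - 1);
}

/* Reorder a batch of request addresses so consecutive issues target
   different vaults: bin by destination vault, then drain round-robin. */
static size_t spread_across_vaults(const uint64_t *reqs, size_t n, uint64_t *out) {
    uint64_t bins[NUM_VAULTS][MAX_PER_VAULT];
    size_t   fill[NUM_VAULTS] = {0}, emitted = 0;

    for (size_t i = 0; i < n; i++) {              /* bin by vault */
        unsigned v = vault_of(reqs[i]);
        bins[v][fill[v]++] = reqs[i];
    }
    for (size_t round = 0; emitted < n; round++)  /* drain round-robin */
        for (unsigned v = 0; v < NUM_VAULTS; v++)
            if (round < fill[v])
                out[emitted++] = bins[v][round];
    return emitted;
}

int main(void) {
    /* A small batch in which several addresses map to the same vault. */
    uint64_t reqs[] = {0, 2048, 128, 4096, 256, 2176}, out[6];
    size_t n = spread_across_vaults(reqs, 6, out);
    for (size_t i = 0; i < n; i++)
        printf("issue 0x%04llx -> vault %u\n",
               (unsigned long long)out[i], vault_of(out[i]));
    return 0;
}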

SLIDE 25


Questions?

Thanks to Micron for helping to support our HMC testbed!