Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube (HMC)
Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, and Hyesoon Kim
ISPASS 2018
Introduction to HMC
} Hybrid Memory Cube (HMC) vs. High-Bandwidth Memory (HBM)
} HMC: serial, packet-based interface
} HBM: wide bus, standard DRAM protocol
} Found in high-end GPUs and Intel's Knights Landing
Illustration credits: AMD and Micron
Why is HMC Interesting?
} Serialized, high-speed link addresses the pin-limitation issues of DRAM and HBM
} Abstracted packet interface provides opportunities for novel memory technologies and addressing schemes
} Can be used with DRAM, PCM, STT-RAM, NVM, etc.
} Memory controller sits on top of a “routing” layer
} Allows for more interesting connections between processors and memory elements
} This study addresses the impacts of the network-on-chip (NoC) for architects and application developers
Illustration credits: Micron
This Study’s Contributions
[Diagram: host with driver software and memory-trace configs, connected over PCIe 3.0 x16 to the EX700 PCIe board's AC-510 module (FPGA), which drives the NoC and vaults in the HMC's logic layer]
We examine the NoC of the HMC using an FPGA-based prototype to answer the following:
1) How does the NoC behave under low- and high-load conditions?
2) Can we relate QoS concepts to 3D-stacked memories?
3) How does the NoC affect latency within the HMC?
4) What potential bottlenecks are there and how can we avoid them?
Hybrid Memory Cube (HMC)
[Diagram: HMC 1.1 (Gen2), 4 GB: DRAM layers are split into partitions that stack vertically into vaults, each vault containing banks and connecting by TSVs to a vault controller in the logic layer]
16 banks/vault
Total number of banks = 256
Size of each bank = 16 MB
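These figures are self-consistent with the 4 GB part; a quick sanity check (the 16-vault count is taken from the 4-quadrant, 4-vaults-per-quadrant layout implied by the address mapping on a later slide):

```python
# Sanity check of the HMC 1.1 (Gen2) geometry from the slides:
# 16 vaults x 16 banks/vault x 16 MB/bank should equal 4 GB.
VAULTS = 16
BANKS_PER_VAULT = 16
BANK_MB = 16

total_banks = VAULTS * BANKS_PER_VAULT
total_gb = total_banks * BANK_MB / 1024

print(total_banks, total_gb)  # 256 4.0
```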
HMC Memory Addressing
- Closed-page policy; page size = 256 B
- Low-order-interleaving address-mapping policy
- 34-bit address field (field boundaries at bits 4, 7, 9, 11, 15, 32, 33):
  [6:4] block address | [8:7] vault ID in a quadrant | [10:9] quadrant ID | [14:11] bank ID | [31:15] remaining DRAM address | [33:32] ignored
- The vault- and bank-select bits fall within a 4K OS page
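A minimal sketch of this mapping (field widths inferred from the slide's boundary marks at bits 4, 7, 9, 11, and 15; the function and field names are my own, and the DRAM row/column split above bit 15 is not shown on the slide, so it is omitted):

```python
def decode_hmc_address(addr: int) -> dict:
    """Decode the low-order fields of a 34-bit HMC 1.1 address under
    the low-order-interleaved mapping sketched on the slide."""
    return {
        "block":             (addr >> 4)  & 0x7,  # bits [6:4]
        "vault_in_quadrant": (addr >> 7)  & 0x3,  # bits [8:7]
        "quadrant":          (addr >> 9)  & 0x3,  # bits [10:9]
        "bank":              (addr >> 11) & 0xF,  # bits [14:11]
    }

# With low-order interleaving, consecutive 128 B regions land in
# different vaults, spreading a streaming access pattern across the NoC.
print(decode_hmc_address(0x000)["vault_in_quadrant"])  # 0
print(decode_hmc_address(0x080)["vault_in_quadrant"])  # 1
```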
HMC Communication I
- Follows a serialized, packet-switched protocol
- Packets are partitioned into 16-byte flits
- Each transfer incurs 1 flit of overhead
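Under these rules, link efficiency grows with request size, which foreshadows the bandwidth results later in the deck. A rough model, assuming the 1-flit overhead applies once per packet:

```python
import math

FLIT_BYTES = 16

def flits_for_transfer(payload_bytes: int) -> int:
    # ceil(payload / 16) data flits plus 1 flit of packet overhead
    return math.ceil(payload_bytes / FLIT_BYTES) + 1

def link_efficiency(payload_bytes: int) -> float:
    # Fraction of transferred flit bytes that are useful payload
    return payload_bytes / (flits_for_transfer(payload_bytes) * FLIT_BYTES)

for size in (16, 32, 64, 128):
    print(size, flits_for_transfer(size), round(link_efficiency(size), 3))
# 16 B -> 2 flits (50% efficient) ... 128 B -> 9 flits (~89% efficient)
```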
HMC Communication II
[Diagram: flow control and request/response packet exchange between the host and the HMC]
Our HMC Test Infrastructure
[Diagram: host with driver software and memory-trace configs, connected over PCIe 3.0 x16 to the EX700 PCIe board's AC-510 module (FPGA), which drives the NoC and vaults in the HMC's logic layer]
- Micron’s AC-510 module contains a Xilinx Kintex FPGA and an HMC 1.1 4 GB part
- 2 half-width links for a total of 60 GB/s of bandwidth
- Host software communicates over PCIe to FPGA-based queues
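The 60 GB/s figure is consistent with the link configuration shown on the methodology slides (2x links of 8 lanes at 15 Gbps each). A back-of-envelope check, assuming the total counts both directions of each full-duplex link:

```python
# Assumption: "60 GB/s" aggregates both directions of two half-width
# (8-lane) links running at 15 Gbps per lane, as labeled on the
# methodology diagrams ("2x 15Gbps 8x links").
LANES_PER_LINK = 8      # half-width link
LANE_RATE_GBPS = 15     # per lane, per direction
NUM_LINKS = 2
DIRECTIONS = 2          # full duplex

per_link_per_dir_GBps = LANES_PER_LINK * LANE_RATE_GBPS / 8  # bits -> bytes
total_GBps = per_link_per_dir_GBps * NUM_LINKS * DIRECTIONS

print(total_GBps)  # 60.0
```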
Methodology (GUPS)
[Diagram: GUPS setup: the host runs GUPS software over the Pico API and Pico PCIe driver, through the EX700's PCIe 3.0 x16 switch to the AC-510 FPGA running the GUPS firmware; each of the firmware's nine AXI-4 ports contains an address generator, write-request FIFO, read-tag pool, data generator, arbitration, and monitoring logic, feeding the HMC controller and transceivers over 2x 15 Gbps 8x links into the HMC's NoC and vaults]
Methodology (multi-port stream)
[Diagram: multi-port stream setup: the host runs multi-port stream software over the Pico API and Pico PCIe driver, through the EX700's PCIe 3.0 x16 switch to the AC-510 FPGA running the multi-port stream firmware; memory traces are fed over PCIe 3.0 x8 through glue logic into per-port command, write-data, read-data, and read-address FIFOs, with the nine AXI-4 ports driving the HMC controller and transceivers (2x 15 Gbps 8x links) to the HMC vaults]
Experiments
[1] High-contention latency analysis (GUPS design)
[2] Low-contention latency analysis (multi-port stream)
[3] Quality-of-service analysis (multi-port)
[4] High-contention latency histograms per vault (multi-port)
[5] Requested and response bandwidth analysis (GUPS)
[1] Read-only Latency vs. Bandwidth
[Chart: latency (μs) vs. bandwidth (GB/s) for 16B, 32B, 64B, and 128B request sizes, sweeping parallelism from 1, 2, 4, and 8 banks up to 1, 2, 4, 8, and 16 vaults]
[2] Average Latency vs Requests
[Chart: average latency (μs) vs. number of outstanding read requests for 16B, 32B, 64B, and 128B request sizes]
[2] Average Latency vs Requests II
[Chart: average latency (μs) vs. number of read requests at higher loads; latency grows linearly with request count for all request sizes (16B-128B)]
[3] QoS for 4 Vaults
[Charts: maximum latency (μs) by vault number (1-15) for 16B, 32B, 64B, and 128B request sizes]
[4] Latency vs. Request Size
[Histograms: per-vault latency distributions across vaults 1-15 for 16B, 32B, 64B, and 128B requests; observed latency ranges span roughly 1617-1675 ns (16B), 1931-2135 ns (32B), 2573-3114 ns (64B), and 3894-4300 ns (128B)]
[4] Average Latency – 4 Vaults
[Chart: average latency (μs) and latency standard deviation (σ, ns) vs. request size (16B, 32B, 64B, 128B)]
[5] GUPS – Bandwidth vs. Active Ports
[Charts (a) 16B and (b) 32B: bandwidth (GB/s) vs. number of active ports (request bandwidth), sweeping parallelism from 1 bank to 16 vaults]
[5] GUPS – Bandwidth vs. Active Ports II
[Charts (c) 64B and (d) 128B: bandwidth (GB/s) vs. number of active ports (request bandwidth), sweeping parallelism from 1 bank to 16 vaults]
[6] GUPS – Outstanding Requests
[Chart: average number of outstanding requests vs. request size (16-128 B), including 2-bank and 4-bank configurations and the average]
Takeaways
} Large and small requests allow tuning for bandwidth- or latency-optimized applications better than DRAM
} Vault- and bank-level parallelism are key to achieving higher BW
} Vault latencies are more correlated with access patterns and traffic than with physical vault location
} Queuing delays will continue to be a concern with NoCs in the HMC
} Address via host-side queuing/scheduling or by distributing accesses across vaults (data structures or compiler passes)
} The HMC’s NoC complicates QoS due to variability
} However, trade-offs in packet size and “private” vaults can improve QoS