SLIDE 1

Application-Transparent Near-Memory Processing Architecture with Memory Channel Network

Mohammad Alian1, Seung Won Min1, Hadi Asgharimoghaddam1, Ashutosh Dhar1, Dong Kai Wang1, Thomas Roewer2, Adam McPadden2, Oliver O'Halloran2, Deming Chen1, Jinjun Xiong2, Daehoon Kim1, Wen-mei Hwu1, and Nam Sung Kim1,3

1University of Illinois Urbana-Champaign 2IBM Research and Systems 3Samsung Electronics

SLIDE 2

Executive Summary

  • Processing In Memory (PIM), Near Memory Processing (NMP), …

✓EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, …

  • Question: why haven’t they been commercialized yet?

✓They demand changes in the application code and/or the host processor's memory subsystem

[Figure: example PIM/NMP chips (IRAM'97, ISCA'15, SmartMemories'00): vector units, a CPU + 3MB $, and quad network-processing tiles or DRAM blocks integrated w/ 48MB on-chip memories]

SLIDE 3

Executive Summary

  • Processing In Memory (PIM), Near Memory Processing (NMP), …

✓EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, …

  • Question: why haven’t they been commercialized yet?

✓They demand changes in the application code and/or the host processor's memory subsystem

  • Solution: memory module based NMP + Memory Channel Network (MCN)

✓Recognize NMP memory modules as distributed computing nodes over Ethernet → no change in application code or memory subsystem of host processors
✓Seamlessly integrate NMP w/ distributed computing frameworks for better scalability

SLIDE 4

Executive Summary

  • Processing In Memory (PIM), Near Memory Processing (NMP), …

✓EXECUBE’94, IRAM’97, ActivePages’98, FlexRAM’99, DIVA’99, SmartMemories’00, …

  • Question: why haven’t they been commercialized yet?

✓They demand changes in the application code and/or the host processor's memory subsystem

  • Solution: memory module based NMP + Memory Channel Network (MCN)

✓Recognize NMP memory modules as distributed computing nodes over Ethernet → no change in application code or memory subsystem of host processors
✓Seamlessly integrate NMP w/ distributed computing frameworks for better scalability

  • Feasibility & Performance:

✓Demonstrate the feasibility w/ an IBM POWER8 + experimental memory module
✓Improve the performance and processing bandwidth by 43% and 4×, respectively

SLIDE 5

Overview of MCN-based NMP

[Diagram: the host CPU's memory controllers MC-0 and MC-1 drive DDR4 DIMMs and MCN DIMMs over local channels; each MCN DIMM packages DRAM devices w/ an MCN processor running its own OS, so every MCN DIMM acts as an MCN node alongside regular nodes reachable over a global channel]

  • Buffered DIMM w/ a low-power but powerful AP* in a buffer device

*Application Processor

SLIDE 6

Overview of MCN-based NMP


  • Buffered DIMM w/ a low-power but powerful AP* in a buffer device

✓An MCN processor runs its own lightweight OS including a minimal network stack


SLIDE 7

Overview of MCN-based NMP


  • Buffered DIMM w/ a low-power but powerful AP* in a buffer device
  • Special driver faking memory channels as Ethernet connections (see the sketch below)

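To make the "fake Ethernet" idea concrete, here is a minimal sketch of how such a driver could register a Linux network device whose transmit path writes into a memory-mapped SRAM window instead of a NIC ring. This is not the authors' driver: the names (mcn_xmit, MCN_SRAM_BASE), the address constants, and the single-queue setup are illustrative assumptions.

```c
/* Hypothetical sketch: a net_device backed by the DIMM's SRAM window. */
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/io.h>

#define MCN_SRAM_BASE 0x100000000ULL /* assumed physical window of the DIMM SRAM */
#define MCN_SRAM_SIZE (96 * 1024)    /* control region + Tx/Rx circular buffers */

static void __iomem *mcn_sram;

static netdev_tx_t mcn_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* Copy the frame into the DIMM's SRAM buffer over the memory channel. */
	memcpy_toio(mcn_sram /* + current ring offset */, skb->data, skb->len);
	dev_kfree_skb(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops mcn_netdev_ops = {
	.ndo_start_xmit = mcn_xmit,
};

static int __init mcn_init(void)
{
	struct net_device *dev = alloc_etherdev(0);

	if (!dev)
		return -ENOMEM;
	mcn_sram = ioremap(MCN_SRAM_BASE, MCN_SRAM_SIZE);
	if (!mcn_sram) {
		free_netdev(dev);
		return -ENOMEM;
	}
	dev->netdev_ops = &mcn_netdev_ops;
	eth_hw_addr_random(dev);     /* the real driver would use per-DIMM MACs */
	return register_netdev(dev); /* shows up like any Ethernet interface */
}
module_init(mcn_init);
MODULE_LICENSE("GPL");
```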

SLIDE 8

Higher Processing BW* w/ Commodity DRAM

  • Conventional memory system

✓More DIMMs → larger capacity but the same bandwidth

[Diagram: a memory controller driving two DDR4 DIMMs (DRAM devices + data buffers) over one global/shared channel]

*bandwidth


SLIDE 9

Higher Processing BW* w/ Commodity DRAM

  • Conventional memory system w/ near memory processing DIMMs

✓An MCN processor accesses its local DRAM devices through private channels → the aggregate processing memory bandwidth scales w/ the # of MCN DIMMs (rough arithmetic below)

[Diagram: alongside DDR4 DIMMs on the global/shared channel, each MCN DIMM adds an MCN processor w/ local/private channels to its on-DIMM DRAM]
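As a rough illustration w/ assumed numbers (the slides do not give these): a DDR4-3200 channel peaks at 25.6 GB/s, so a host w/ two shared channels tops out at 2 × 25.6 = 51.2 GB/s no matter how many DIMMs it holds, while eight MCN DIMMs, each w/ its own private local channel, would together expose 8 × 25.6 ≈ 205 GB/s of processing bandwidth, a 4× gain in line w/ the aggregate-bandwidth results later in the deck.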

SLIDE 10

MCN DIMM Architecture

[Photo: IBM Centaur DIMM: 80 DDR DRAM chips + a buffer chip, w/ a tall form factor]

SLIDE 11

MCN DIMM Architecture

[Diagram: the MCN processor (cores 0-3, an LLC/interconnect, and an MC to local DRAM) sits behind a dual-port SRAM and TX/RX/IRQ control logic facing the DDR interface on the global DDR channel; it replaces the buffer device of the IBM Centaur DIMM w/ a near-memory processor]

The Centaur buffer device has a ~20W TDP at ~10mm×10mm; a Snapdragon AP w/ 4 ARM A57 cores + 2MB LLC, GPU, 2 MCs, etc. fits in ~5W at ~8×8mm² (1.8W & ~2×2mm²).

SLIDE 12


MCN DIMM Architecture: Interface Logic

Serving as a fake network interface card (NIC)

[Diagram: the MCN interface, i.e., the dual-port SRAM and TX/RX/IRQ control logic, highlighted]

SLIDE 13


MCN DIMM Architecture: Interface Logic

Serving as a fake network interface card (NIC)

MCN buffer layout: mapped to a range of physical memory space directly accessed by the MC like normal DRAM. A 64-byte control region (Tx-head, Tx-tail, and Tx-poll at byte offsets 4, 8, and 12; Rx-head, Rx-tail, and Rx-poll; remaining bytes through offset 63 reserved) is followed by a 48KB Tx circular buffer and a 48KB Rx circular buffer (see the struct sketch below).
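A C rendering of the layout above may help. The Tx-head/Tx-tail/Tx-poll offsets (4, 8, 12) come from the slide; the Rx pointer offsets and all field names are assumptions, and Tx/Rx are named from the DIMM's point of view (the DIMM fills the Tx ring and sets Tx-poll for the host to drain).

```c
#include <stdint.h>

#define MCN_RING_BYTES (48 * 1024) /* each circular buffer is 48KB */

struct mcn_ctrl {                  /* 64-byte control region (offsets 0-63) */
	uint32_t reserved0;        /* offset 0 */
	uint32_t tx_head;          /* offset 4 */
	uint32_t tx_tail;          /* offset 8 */
	uint32_t tx_poll;          /* offset 12: polled by the host agent */
	uint32_t rx_head;          /* assumed offsets for the Rx pointers */
	uint32_t rx_tail;
	uint32_t rx_poll;
	uint8_t  reserved1[64 - 7 * sizeof(uint32_t)];
};

/* Mapped into host physical memory and accessed by the MC like normal DRAM. */
struct mcn_buffer {
	struct mcn_ctrl ctrl;
	uint8_t tx_ring[MCN_RING_BYTES]; /* DIMM -> host packets */
	uint8_t rx_ring[MCN_RING_BYTES]; /* host -> DIMM packets */
};
```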

SLIDE 14

[Repeat of Slide 13's interface-logic diagram and buffer layout]

SLIDE 15

MCN Driver

[Diagram: in kernel space, the MCN driver sits next to the regular NIC driver under the Linux network stack; a forwarding engine and a polling agent memcpy packets b/w the host's DDR memory and the SRAM TX/RX buffers of the MCN DIMMs behind MC-0 and MC-1, so user-space applications see ordinary network I/O while MCN accesses share the memory channels w/ regular accesses]

SLIDE 16


MCN Packet Routing

  • Host → MCN

  • 1. A packet is passed from the host network stack and, based on its destination (IP: X.X.X.X), goes to the corresponding MCN DIMM or the NIC


SLIDE 17


MCN Packet Routing

  • Host → MCN

  • 2. If the packet needs to be sent to an MCN DIMM, the forwarding engine checks the packet's destination MAC (MAC: AA.AA.AA.AA.AA.AA) while the packet is stored in main memory


SLIDE 18


MCN Packet Routing

  • Host → MCN

  • 3. If the MAC matches that of an MCN DIMM, the packet is copied to that DIMM's SRAM buffer


SLIDE 19


MCN Packet Routing

  • Host → MCN

  • 4. The data copy triggers an IRQ from the MCN DIMM, so the MCN processor knows a packet has arrived (the full Host → MCN path is sketched below)
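Taken together, steps 1-4 could look roughly like the following on the host side. The real driver operates on kernel sk_buffs; mcn_forward, struct mcn_dimm, and the field names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct mcn_dimm {
	uint8_t mac[6];           /* MAC address assigned to this MCN DIMM */
	volatile uint8_t *sram;   /* mapped SRAM window (the DIMM's Rx buffer) */
};

/* Step 1: the network stack hands down an egress Ethernet frame. */
static bool mcn_forward(struct mcn_dimm *dimms, int ndimms,
                        const uint8_t *frame, size_t len)
{
	for (int i = 0; i < ndimms; i++) {
		/* Step 2: compare the frame's destination MAC (bytes 0-5). */
		if (memcmp(frame, dimms[i].mac, 6) != 0)
			continue;
		/* Step 3: copy the packet into that DIMM's SRAM buffer. */
		memcpy((void *)dimms[i].sram, frame, len);
		/* Step 4: the SRAM write raises an IRQ on the MCN processor,
		 * which then pulls the packet into its own network stack. */
		return true;
	}
	return false; /* no match: the packet goes to the real NIC instead */
}
```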

SLIDE 20


MCN Packet Routing

  • MCN → Host

  • 1. The MCN DIMM writes a packet to its SRAM buffer and sets the corresponding TX-poll bit


SLIDE 21


MCN Packet Routing

  • MCN → Host

  • 2. The polling agent on the host recognizes that the TX-poll bit is set and reads the packet from the SRAM buffer


SLIDE 22


MCN Packet Routing

  • MCN → Host

  • 3. The forwarding engine checks the destination MAC (MAC: HH.HH.HH.HH.HH.HH) and determines the destination


SLIDE 23


MCN Packet Routing

  • MCN → Host

  • 4. If the packet is destined for the host, it is copied to the host skb*, which resides in the host main memory

*network socket buffer

SLIDE 24


MCN Packet Routing

  • MCN → Host

  • 4. If the packet is destined for the host, it is copied to the host skb*, which resides in the host main memory

*network socket buffer

SLIDE 25


MCN Packet Routing

  • MCN → Host

  • 5. Finally, the packet is passed to the host network stack (e.g., TCP/IP); the full MCN → Host path is sketched below
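Correspondingly, a hedged sketch of steps 1-5 on the host side: TX_POLL_OFFSET matches the buffer-layout slide, while the fixed ring offset, the names, and the length handling are assumptions (a real driver would read the packet length from ring metadata and build an skb).

```c
#include <stdint.h>
#include <string.h>

#define TX_POLL_OFFSET 12 /* Tx-poll word in the 64-byte control region */
#define TX_RING_OFFSET 64 /* assumed: the Tx circular buffer follows it */

/* Step 2: the host polling agent periodically scans each MCN DIMM. */
static size_t mcn_poll(volatile uint8_t *sram, uint8_t *out, size_t len)
{
	volatile uint32_t *tx_poll =
		(volatile uint32_t *)(sram + TX_POLL_OFFSET);

	/* Step 1 happened on the DIMM: it wrote a packet and set Tx-poll. */
	if (*tx_poll == 0)
		return 0; /* nothing pending from this DIMM */

	/* Read the packet out of the SRAM Tx circular buffer. */
	memcpy(out, (const void *)(sram + TX_RING_OFFSET), len);
	*tx_poll = 0; /* acknowledge the slot */

	/* Steps 3-5: the forwarding engine then checks the destination MAC
	 * and, if it is the host's, copies the data into an skb and hands
	 * it to the Linux network stack (e.g., TCP/IP). */
	return len;
}
```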

SLIDE 26

Optimizations

  • Adopt optimizations from conventional Ethernet interfaces

✓Offload (or remove) packet fragmentation (e.g. TCP Segmentation Offload (TSO))

[Diagram: w/o TSO, TCP/IP on the host splits Data into many small Pkts before the Ethernet NIC puts them on the wire]


SLIDE 27

Optimizations

  • Adopt optimizations from conventional Ethernet interfaces

✓Offload (or remove) packet fragmentation (e.g. TCP Segmentation Offload (TSO))

[Diagram: w/ TSO, the stack passes large Data blocks down and the NIC segments them into Pkts in hardware before the wire]


SLIDE 28

Optimizations

  • Adopt optimizations from conventional Ethernet interfaces

✓Offload (or remove) packet fragmentation (e.g. TCP Segmentation Offload (TSO))

[Diagram: w/ MCN TSO, large Data blocks cross the memory channel as a single Pkt, reducing the overhead of sending many small packets; see the snippet below]

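In Linux netdev terms, removing fragmentation on the MCN link could amount to advertising offload features and a large MTU on the fake device, continuing the hypothetical driver sketch from Slide 7; the exact flag set and MTU value here are assumptions, not taken from the slides.

```c
#include <linux/netdevice.h>

/* Hypothetical: since the memory channel is not a lossy wire, the fake
 * MCN netdev can advertise offloads so the stack stops pre-fragmenting. */
static void mcn_setup_offloads(struct net_device *dev)
{
	/* Let TCP hand down large buffers (TSO needs SG + checksum offload). */
	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_TSO;
	dev->mtu = 9000; /* assumed jumbo-style MTU for the memory channel */
}
```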

SLIDE 29

Optimizations

  • Adopt optimizations from conventional Ethernet interfaces

✓Offload (or remove) packet fragmentation (e.g. TCP Segmentation Offload (TSO))

  • MCN-side DMA

✓The baseline MCN processor manually copies data b/w SRAM and DRAM
✓SRAM-to-DRAM DMA eliminates the CPU memcpy overhead, similar to NIC-to-DRAM DMA in conventional systems (see the sketch below)

[Diagram: in the MCN baseline the CPU itself copies b/w SRAM and DRAM; w/ DMA, a DMA controller moves the data directly]

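A minimal sketch contrasting the two copy paths; struct dma_desc and post_dma() are hypothetical, and the stub only models what a real SRAM-to-DRAM DMA engine would do asynchronously.

```c
#include <stdint.h>
#include <string.h>

struct dma_desc { const void *src; void *dst; uint32_t len; };

static void post_dma(const struct dma_desc *d)
{
	/* Stub: a real DMA controller would copy in the background and
	 * raise a completion interrupt, leaving the cores free. */
	memcpy(d->dst, d->src, d->len);
}

static uint8_t sram_buf[2048]; /* packet staged in the dual-port SRAM */
static uint8_t dram_buf[2048]; /* destination in the DIMM's local DRAM */

int main(void)
{
	/* Baseline: an MCN core burns cycles in memcpy for every packet. */
	memcpy(dram_buf, sram_buf, sizeof(sram_buf));

	/* w/ MCN-side DMA: the core only posts a descriptor and moves on. */
	post_dma(&(struct dma_desc){ sram_buf, dram_buf, sizeof(sram_buf) });
	return 0;
}
```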

SLIDE 30

Proof of Concept HW/SW Demonstration

[Photos: the ConTutto board (top view, built around a Stratix V FPGA) plugged into an IBM POWER8 server]


SLIDE 31

Proof of Concept HW/SW Demonstration

[Demo: an MPI application running through MCN; the POWER8 server acts as the host (w/ DDR4 DIMMs on MC-0 and MC-1) and the ConTutto board stands in as the MCN DIMM w/ its SRAM interface]

SLIDE 32

Evaluation Methodology

  • Simulation → dist-gem5 (ISPASS'17)
  • Network performance evaluation → iperf and ping
  • Application performance evaluation → CORAL, BigDataBench, and NPB

MCN System Configuration
  CPU     ARMv8 quad-core @ 2.45GHz
  Caches  L1I: 32KB, L1D: 32KB, L2: 1MB
  Memory  DDR4-3200
  OS      Ubuntu 14.04

Host System Configuration
  CPU     ARMv8 octa-core @ 3.4GHz
  Caches  L1I: 32KB, L1D: 32KB, L2: 256KB, L3: 8MB
  Memory  DDR4-3200
  NIC     10GbE (1µs link latency)
  OS      Ubuntu 14.04


SLIDE 33

Evaluation – Network Bandwidth (iPerf)

[Chart: network bandwidth normalized to 10GbE for mcn baseline, TSO, and mcn dma, in host-mcn and mcn-mcn settings; callouts: 1.30×, 1.08×]


SLIDE 34

Evaluation – Network Bandwidth (iPerf)

[Chart: same bandwidth comparison as Slide 33]


SLIDE 35

Evaluation – Network Bandwidth (iPerf)

[Chart: same bandwidth comparison as Slide 33; callout: 3.50×]


SLIDE 36

Evaluation – Network Bandwidth (iPerf)

[Chart: same bandwidth comparison as Slide 33; callout: 4.56×]


SLIDE 37

Evaluation – Network Latency (Ping)

[Charts: latency normalized to 10GbE vs. packet size (16B to 8KB) for mcn baseline, TSO, and mcn dma, w/ Host ↔ MCN and MCN ↔ MCN panels; callout: 19.7%]

SLIDE 38

Evaluation – Aggregate Processing Bandwidth

[Chart: aggregate memory bandwidth normalized to the baseline for 2, 4, 6, and 8 MCN DIMMs: 1.76×, 2.54×, 3.29×, and 3.93×, respectively]

SLIDE 39

Scale-up versus MCN: Application Performance

[Chart: execution time of NPB benchmarks (is, ep, cg, mg, ft, bt, sp, lu, avg) normalized to a scale-up server, for 1-3 MCN DIMMs per channel, w/ the same number of cores in the scale-up server as in the server w/ MCN-enabled near-memory processing modules; callouts: 45.3%, 42.9%, 27.2%]

SLIDE 40

Conclusion

  • MCN is an innovative near-memory processing concept

✓No change in host hardware, OS, and user applications
✓Seamless integration w/ traditional distributed computing and better scalability
✓Feasibility proven w/ a commercial hardware system

  • MCN can provide:

✓4.6× higher network bandwidth
✓5.3× lower network latency
✓3.9× higher processing bandwidth
✓45% higher performance

than a conventional system

[Diagram: server nodes, each w/ MCN DIMMs on its memory channels, exchanging data over the network]

SLIDE 41

Application-Transparent Near-Memory Processing Architecture with Memory Channel Network

Mohammad Alian1, Seung Won Min1, Hadi Asgharimoghaddam1, Ashutosh Dhar1, Dong Kai Wang1, Thomas Roewer2, Adam McPadden2, Oliver O'Halloran2, Deming Chen1, Jinjun Xiong2, Daehoon Kim1, Wen-mei Hwu1, and Nam Sung Kim1,3

1University of Illinois Urbana-Champaign 2IBM Research and Systems 3Samsung Electronics

SLIDE 42


Any Questions?