An Initial Characterization of the Emu Chick Eric Hein, Tom Conte, - - PowerPoint PPT Presentation



SLIDE 1

An Initial Characterization of the Emu Chick

Eric Hein, Tom Conte, (ECE) Jeff Young, (CS) Srinivas Eswar, Jiajia Li, Patrick Lavin, Richard Vuduc, Jason Riedy (CSE)

5/21/2018

SLIDE 2

Migratory Memory-side Processing

  • Main innovation: Thread contexts migrate to the data
  • Threads always read from local memory
  • Migration is hardware-controlled, triggered on a remote read

  • CRNCH Rogue’s Gallery
  • Early access to Emu Chick hardware prototype
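
The migration rule above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the Emu hardware logic: the address map `nodelet_of`, the count of 8 nodelets, and the striping of consecutive 8-byte words across nodelets are all hypothetical choices made for illustration.

```python
def nodelet_of(address: int, num_nodelets: int = 8, stripe_bytes: int = 8) -> int:
    """Hypothetical address map: consecutive 8-byte words are striped across nodelets."""
    return (address // stripe_bytes) % num_nodelets


def count_migrations(addresses, start_nodelet: int = 0, num_nodelets: int = 8) -> int:
    """Count how often a thread would migrate while reading `addresses` in order.

    The thread moves to the nodelet that owns the data whenever a read is
    remote, so every read is ultimately served from local memory.
    """
    current = start_nodelet
    migrations = 0
    for address in addresses:
        owner = nodelet_of(address, num_nodelets)
        if owner != current:  # remote read: the thread context migrates
            current = owner
            migrations += 1
    return migrations
```

Under this assumed map, a thread that walks eight consecutive words (`range(0, 64, 8)`) crosses every nodelet once, while a thread that reads addresses 0, 64 and 128 stays on nodelet 0 and never migrates.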

SLIDE 3

Outline

  • Emu Architecture Description
  • Data Allocation and Thread Spawning
  • Benchmark Results
      • STREAM
      • Sparse Matrix Vector Multiply (SpMV)
      • Pointer Chasing
  • Simulator Validation
  • Conclusion

SLIDE 4

Emu Architecture

SLIDE 5

Fine-grained Memory Accesses

  • Narrow-channel DRAM (NCDRAM)
      • 8-bit bus allows access at 8-byte granularity without waste
      • Many narrow channels instead of few wide channels
  • Remote Writes
      • Write to a remote nodelet without migrating
      • Proceed directly to the Memory Front-End (MFE), bypassing the Gossamer Core (GC)
  • Remote Atomics
      • Performed in the MFE, near memory

SLIDE 6

Emu Cilk

  • cilk_spawn: create a child thread to execute a function in parallel
      • Creates an actual thread, not just a continuation stack frame
      • No work-stealing across nodelets
  • cilk_sync: wait for all child threads to complete
      • Threads die instead of waiting; the last thread to arrive continues
  • “Remote Spawn”: create a thread on a remote nodelet
      • Determines location of cactus stack frame
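
The sync behavior described above (early arrivals exit, the last arrival continues) can be modeled with a shared counter. The sketch below uses ordinary Python threads and a hypothetical `SyncPoint` class; it is an analogy for the described semantics, not Emu Cilk code, and it does not model the parent thread or nodelet placement.

```python
import threading


class SyncPoint:
    """Toy model of a sync at which early arrivals exit and the last one continues."""

    def __init__(self, expected: int):
        self._remaining = expected
        self._lock = threading.Lock()

    def arrive(self) -> bool:
        """Return True only for the last thread to arrive."""
        with self._lock:
            self._remaining -= 1
            return self._remaining == 0


def run(num_threads: int = 8) -> int:
    """Start `num_threads` workers and count how many continue past the sync."""
    sync = SyncPoint(num_threads)
    continued = []

    def worker():
        if sync.arrive():
            # Only the last arrival carries on with the code after the sync.
            continued.append(threading.get_ident())
        # Every other worker simply returns, i.e. it "dies" at the sync.

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(continued)
```

In this model no thread ever blocks at the sync: each one either exits or becomes the continuation, so `run(8)` reports exactly one continuing thread.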

SLIDE 7

                           Emu Nodelet                 Emu Node Card (8 nodelets)  Emu Chick (8 nodes)         Emu1 Rack (256 nodes)
                           Current       Future        Current       Future        Current       Future
# of cores                 1 core        4 cores       8 cores       32 cores      64 cores      256 cores     8192 cores
# of threads               64            256           512           2048          4096          16384         > 2 million
Memory capacity            2 GiB         8 GiB         16 GiB        64 GiB        128 GiB       512 GiB       16 TiB
# of 8-bit DDR4 channels   1 channel     1 channel     8 channels    8 channels    64 channels   64 channels   2048
Memory bandwidth           120 MB/s      2.5 GB/s      1.2 GB/s      20 GB/s       8 GB/s        160 GB/s      5.12 TB/s

Images and data from www.emutechnology.com

SLIDE 8

STREAM: Thread spawning

https://www.cilkplus.org/tutorial-cilk-plus-keywords#cilk_for

[Figure: Recursive Spawn vs. Serial Spawn]
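
The difference between the two spawn strategies can be sketched as a simple cost model. The functions below are hypothetical and count only sequential spawn steps on the critical path, assuming every spawn costs one step; they ignore migration, remote spawns, and the work itself.

```python
def serial_spawn_steps(num_workers: int) -> int:
    """One parent spawns a child per chunk in a loop and runs the last chunk itself."""
    return max(num_workers - 1, 0)


def recursive_spawn_steps(num_workers: int) -> int:
    """Each thread hands half of its range to a child until one worker remains."""
    if num_workers <= 1:
        return 0
    half = num_workers // 2
    # Both halves proceed in parallel, so the deeper half sets the critical path.
    return 1 + max(recursive_spawn_steps(half),
                   recursive_spawn_steps(num_workers - half))
```

For 64 workers this model gives 63 sequential spawn steps for the serial loop and 6 for the recursive tree, which is the usual argument for divide-and-conquer spawning.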

SLIDE 9

Emu Memory Layouts

SLIDE 10

[Figure: thread placement on Nodelets 0-3 for Serial Spawn, Recursive Remote Spawn, and Serial Remote Spawn]

SLIDE 11

STREAM: Emu hardware results (single-node)

~140 MB/s per nodelet; ~1.2 GB/s per node (8 nodelets)

  • Surprisingly, serial spawn performance matches recursive spawn
  • Remote spawn is necessary to saturate global bandwidth

SLIDE 12

STREAM: Emu hardware results (multi-node)

SLIDE 13

Proxy applications

  • Pointer chasing -> streaming graphs
      • Designed to mimic the traversal of an edge list in a streaming graph data structure
      • Using results to tune streaming graph engine for Emu
  • SpMV -> sparse tensor analysis
      • Exploring data layout options with a sparse matrix
      • Will transfer knowledge to design of sparse tensor library

SLIDE 14

Sparse matrix-vector multiply (SpMV)

Data layout is very important on Emu. We experimented with three layouts for the sparse matrix. In each case, the vector X was replicated onto all nodelets.
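
As an illustration of how a layout choice interacts with the replicated vector, here is a minimal CSR SpMV sketch. The row-cyclic assignment of rows to nodelets and the function names are assumptions made for this example; they are not necessarily any of the three layouts evaluated in the talk.

```python
def spmv_csr_rows(row_ptr, col_idx, values, x, rows):
    """Compute y[r] for the given rows of a CSR matrix; return {row: value}."""
    y = {}
    for r in rows:
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y[r] = acc
    return y


def spmv_partitioned(row_ptr, col_idx, values, x, num_nodelets=4):
    """Assumed layout: nodelet p owns rows p, p + P, p + 2P, ... and a private copy of x."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for p in range(num_nodelets):
        x_local = list(x)  # replicated vector: reads of x never leave the nodelet
        rows = range(p, num_rows, num_nodelets)
        for r, v in spmv_csr_rows(row_ptr, col_idx, values, x_local, rows).items():
            y[r] = v
    return y
```

Because each modeled nodelet reads only its own rows and its own copy of `x`, the only remaining cross-nodelet traffic in such a scheme would come from how the matrix arrays themselves are placed, which is the part the layouts vary.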

SLIDE 15

SpMV results: Emu simulator vs Haswell Xeon

  • “2D” layout outperforms “1D” layout
  • Spurious thread migrations are limiting performance on Emu

SLIDE 16

Pointer Chasing Microbenchmark

  • Designed to mimic the access pattern of dynamically allocated data structures (i.e., streaming graphs)
  • Data-dependent loads: Memory-level parallelism is severely limited, since each thread must wait for one pointer dereference to complete before accessing the next pointer.
  • Fine-grained accesses: Spatial locality is restricted, since all accesses are at a 16 B granularity. This is smaller than a 64 B cache line on x86 platforms, and much smaller than a typical DRAM page (around 8 KB).
  • Random access pattern: Since each block of memory is read exactly once in random order, caching and prefetching are mostly ineffective.

SLIDE 17

Pointer chasing: Initialization

  • 1. Create a linked list of elements

“Ordered”: Access pattern is sequential and predictable. Plenty of spatial locality available.

SLIDE 18

Pointer chasing: Intra-block shuffle

  • 2. Randomize traversal order of elements within each block

“Intra-block shuffle”: Creates small contiguous blocks of memory that are accessed in random order. Overall access pattern is still sequential.

SLIDE 19

Pointer chasing: Block shuffle

  • 3. Randomize the order in which the blocks are traversed

“Block shuffle”: Overall access pattern is now random. Small chunks of sequential locality are still available.
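
The three access patterns (ordered, intra-block shuffle, block shuffle) can be sketched with an index-based linked list. This is a minimal model, not the benchmark's actual code: `build_chain`, `chase`, the use of list indices in place of pointers, and the fixed seed are all assumptions for illustration.

```python
import random


def build_chain(num_elements, block_size, intra_block_shuffle=False,
                block_shuffle=False, seed=0):
    """Return (head, next_index) for a list of num_elements >= 1 elements.

    Elements are grouped into contiguous blocks of `block_size`; following
    next_index from head visits every element exactly once (-1 ends the list).
    """
    rng = random.Random(seed)
    blocks = [list(range(start, min(start + block_size, num_elements)))
              for start in range(0, num_elements, block_size)]
    if intra_block_shuffle:
        for block in blocks:
            rng.shuffle(block)  # step 2: random order inside each block
    if block_shuffle:
        rng.shuffle(blocks)     # step 3: random order of the blocks themselves
    order = [i for block in blocks for i in block]
    next_index = [-1] * num_elements
    for here, there in zip(order, order[1:]):
        next_index[here] = there
    return order[0], next_index


def chase(head, next_index):
    """Traverse the list; each step depends on the value loaded by the previous one."""
    visited = []
    node = head
    while node != -1:
        visited.append(node)
        node = next_index[node]  # data-dependent load
    return visited
```

With both flags off, `chase` visits indices in ascending order. With only `block_shuffle=True`, each run of `block_size` visits is still ascending but the runs appear in random order, matching the "small chunks of sequential locality" described above.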

SLIDE 20

Pointer Chasing: Sandy Bridge Xeon Results

SLIDE 21

Pointer Chasing: Emu Hardware Results (single-node)

SLIDE 22

Pointer Chasing: Bandwidth Utilization

SLIDE 23

Simulator Validation: STREAM

When configured to match the current hardware specifications, the simulator results match closely for local stream and global stream.

SLIDE 24

Simulator Validation: Migrations

  • Pointer chasing performs 2x better in the simulator
  • Simulator is over-estimating migration throughput
  • Updated simulator matches more closely

SLIDE 25

Conclusions

  • First independent evaluation of Emu Chick prototype
  • STREAM bandwidth is low, but scales well
  • Memory layout and thread management decisions are critical to achieving scalability in SpMV
  • Pointer chasing maintains 80% memory bandwidth utilization even in a worst-case scenario
  • Future work: Applying these lessons to streaming graph analytics and sparse tensor processing