Processing in Storage Class Memory Joel Nider Craig Mustard - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Processing in Storage Class Memory

Joel Nider Craig Mustard Andrada Zoltan Alexandra Fedorova

slide-2
SLIDE 2

Embedding Processors in SCM

CPU Non-volatile RAM

slide-3
SLIDE 3

Storage Latency Is Decreasing

slide-4
SLIDE 4

Scaling Compute with Storage

Volatile: CPU + registers, Smart Caches, PIM in RAM
SCM
Persistent: Smart Disks / SSD, Storage Arrays
Axis: Latency

slide-6
SLIDE 6

Benefits of PIM on SCM

CPU

Memory bus

DPU SCM DRAM

slide-7
SLIDE 7

Benefits of PIM on SCM

CPU

Memory bus

slide-8
SLIDE 8

Benefits of PIM on SCM

CPU

Memory bus

slide-9
SLIDE 9

Benefits of PIM on SCM

CPU

DPU SCM

Memory bus

slide-10
SLIDE 10

Benefits of PIM on SCM

CPU

Memory bus

DPU Count: 64
SCM Capacity: 4 GB
Ratio: 1 DPU per 64 MB

Core Density

slide-11
SLIDE 11

Benefits of PIM on SCM

CPU

Memory bus

DPU Count: 128
SCM Capacity: 8 GB
Ratio: 1 DPU per 64 MB

slide-12
SLIDE 12

Benefits of PIM on SCM

CPU

Memory bus

DPU Count: 256
SCM Capacity: 16 GB
Ratio: 1 DPU per 64 MB

slide-13
SLIDE 13

Benefits of PIM on SCM

CPU

Memory bus

DPU Count: 512
SCM Capacity: 32 GB
Ratio: 1 DPU per 64 MB
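
The scaling shown across these slides can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes binary units (1 GB = 1024 MB), and the helper name `dpu_count` is hypothetical, not part of any UPMEM tooling.

```python
MB = 1024 ** 2
GB = 1024 ** 3

# Fixed ratio from the slides: one DPU per 64 MB of memory.
BYTES_PER_DPU = 64 * MB


def dpu_count(capacity_bytes: int, bytes_per_dpu: int = BYTES_PER_DPU) -> int:
    """Number of DPUs that accompany a given capacity at a fixed ratio."""
    return capacity_bytes // bytes_per_dpu


# Capacities taken from the slides: 4, 8, 16 and 32 GB.
for capacity_gb in (4, 8, 16, 32):
    print(f"{capacity_gb:2d} GB -> {dpu_count(capacity_gb * GB)} DPUs")
```

Because the ratio is fixed, doubling the capacity doubles the number of cores, which matches the 64, 128, 256 and 512 DPU counts on the slides.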

slide-14
SLIDE 14

Benefits of PIM on SCM

CPU

Memory bus

slide-15
SLIDE 15

PIM Design Points

Inter-PIM Communication
Core Density
Instruction Set
Address Translation

slide-16
SLIDE 16

UPMEM Architecture and Limitations

DPU DRAM

slide-17
SLIDE 17

UPMEM Architecture and Limitations

DPU DRAM DDR Interface Control SRAM External Bus

slide-18
SLIDE 18

Interleaved Multithreading

slide-19
SLIDE 19

UPMEM Architecture and Limitations

Input data: ABCDEFGHIJKLMNOPQRSTUV
Memory bus
DPU 0
DPU 1
DPU 2

slide-20
SLIDE 20

UPMEM Architecture and Limitations

Input data: IJKLMNOPQRSTUVWXYZabcd
Memory bus: A B C D E F G H
DPU 0: A
DPU 1: B
DPU 2: C

slide-21
SLIDE 21

UPMEM Architecture and Limitations

Input data: QRSTUVWXYZabcdefghijkl
Memory bus: A I B J C K D L E M F N G O H P
DPU 0: AI
DPU 1: BJ
DPU 2: CK
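
The animation on these slides suggests that consecutive bytes on the memory bus land in different DPUs, so each DPU only sees every eighth byte of the input. The sketch below models that behaviour in software. It assumes 8 byte lanes (inferred from the eight bytes A to H shown per transfer) and an input whose length is a multiple of the lane count; the function names are hypothetical.

```python
def interleave(data: bytes, lanes: int = 8) -> list[bytes]:
    """Split a byte stream so that lane i receives bytes i, i+lanes, i+2*lanes, ..."""
    return [data[i::lanes] for i in range(lanes)]


def deinterleave(chunks: list[bytes]) -> bytes:
    """Inverse of interleave(); assumes all chunks have the same length."""
    return b"".join(bytes(column) for column in zip(*chunks))


stream = b"ABCDEFGHIJKLMNOP"  # two 8-byte transfers, as on the slides
per_dpu = interleave(stream)

# Show the first three lanes, matching DPU 0..2 in the diagram.
for dpu, chunk in enumerate(per_dpu[:3]):
    print(f"DPU {dpu}: {chunk.decode()}")
```

A host that wants each DPU to hold a contiguous block of a file would therefore have to rearrange the data before writing it, which is one way to read the "limitations" in the slide title.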

slide-22
SLIDE 22

Raw Performance: Throughput

Per DPU: 64 MB DRAM, 64 KB SRAM
16 threads @ 2 KB per transfer

9 ranks x 64 DPUs = 576 DPUs
576 DPUs x 64 MB = 36 GB DRAM
36 GB in 0.16 s = 252 GB/s

Top speed of DDR4-2400 channel: 19 GB/s
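
The figures on this slide can be re-derived with simple arithmetic. The sketch below uses only the numbers shown, plus the standard peak rate of a 64-bit DDR4-2400 channel (2400 MT/s times 8 bytes). It assumes binary units for the capacity. With the rounded inputs as printed, the quotient comes out near 225 GB/s rather than the 252 GB/s on the slide, so the slide's inputs may be rounded; the sketch does not try to resolve that.

```python
RANKS = 9
DPUS_PER_RANK = 64
MB_PER_DPU = 64
ELAPSED_S = 0.16  # time to process all DPU memory, as shown on the slide

total_dpus = RANKS * DPUS_PER_RANK          # 576 DPUs
total_gb = total_dpus * MB_PER_DPU / 1024   # 36 GB of DPU-attached DRAM
aggregate_gb_s = total_gb / ELAPSED_S       # aggregate throughput inside the DPUs

# Peak rate of one 64-bit DDR4-2400 channel: 2400 MT/s * 8 bytes per transfer.
ddr4_2400_gb_s = 2400e6 * 8 / 1e9

print(f"DPUs: {total_dpus}, memory: {total_gb:.0f} GB")
print(f"Aggregate DPU throughput: {aggregate_gb_s:.0f} GB/s")
print(f"DDR4-2400 channel peak:   {ddr4_2400_gb_s:.1f} GB/s")
print(f"Ratio: {aggregate_gb_s / ddr4_2400_gb_s:.1f}x")
```

Either way, the aggregate rate is roughly an order of magnitude above what a single memory channel can deliver to the CPU.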

slide-23
SLIDE 23

Use Case: Compression

File        Size      DPUs
spamfile    84 MB     172
mozilla     50 MB     105
nci         30 MB     64
dickens     10 MB     35
sao         7 MB      21
xml         5 MB      15
world192    1 MB      4
plrabn12    0.5 MB    2
terror2     0.1 MB    1
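
To get a feel for the table, the sketch below computes how much input each DPU would handle if a file were split evenly across its DPUs. The even split is an assumption made for illustration; the slides do not say how the input is partitioned, and `kb_per_dpu` is a hypothetical helper.

```python
# (file, size in MB, DPUs) as listed on the slide.
TABLE = [
    ("spamfile", 84, 172),
    ("mozilla", 50, 105),
    ("nci", 30, 64),
    ("dickens", 10, 35),
    ("sao", 7, 21),
    ("xml", 5, 15),
    ("world192", 1, 4),
    ("plrabn12", 0.5, 2),
    ("terror2", 0.1, 1),
]


def kb_per_dpu(size_mb: float, dpus: int) -> float:
    """Average input per DPU in KB, assuming an even split (1 MB = 1024 KB)."""
    return size_mb * 1024 / dpus


for name, size_mb, dpus in TABLE:
    print(f"{name:10s} {size_mb:5.1f} MB / {dpus:3d} DPUs = {kb_per_dpu(size_mb, dpus):6.1f} KB per DPU")
```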

slide-24
SLIDE 24

Wishlist

Concurrent Memory Access
Data Triggered Functions
Mix Of Memory Types
Tuning For Performance

slide-25
SLIDE 25

Future Directions

Hyperdimensional Computing
Regular Expression Search
?

slide-26
SLIDE 26

Thank you for watching

Joel Nider joel@ece.ubc.ca
Craig Mustard craigm@ece.ubc.ca
Andrada Zoltan zoltandrada@gmail.com
Alexandra Fedorova sasha@ece.ubc.ca