SLIDE 1

Near-Data Processing for Differentiable Machine Learning Models

Hyeokjun Choe¹, Seil Lee¹, Hyunha Nam¹, Seongsik Park¹, Seijoon Kim¹, Eui-Young Chung² and Sungroh Yoon¹,³∗

¹Electrical and Computer Engineering, Seoul National University
²Electrical and Electronic Engineering, Yonsei University
³Neurology and Neurological Sciences, Stanford University
∗Correspondence: sryoon@snu.ac.kr

Homepage: http://dsl.snu.ac.kr

May 19th, 2017

SLIDE 2

Outline

1 Introduction
2 Background
3 Proposed Methodology
4 Experimental Results
5 Discussion and Conclusion

SLIDE 4

Machine Learning’s Success

Big data + powerful parallel processors ⇒ sophisticated models

SLIDE 5

Issues in the Conventional Memory Hierarchy

Data movement in the memory hierarchy:

Computational efficiency ⇓
Power consumption ⇑

SLIDE 6

Near-data Processing (NDP)

Memory or storage with intelligence (i.e., computing power)
Processes the data where it is stored
Reduces data movement and offloads the CPU

SLIDE 7

ISP-ML

ISP-ML: a full-fledged ISP-supporting SSD platform (ISP: in-storage processing)
Easy to implement machine learning algorithms in C/C++
For validation, three SGD algorithms were implemented and evaluated on ISP-ML

[Figure: overall architecture — a host (CPU, main memory, OS, user application) attached to an ISP-supporting SSD; the SSD controller comprises the host interface, an embedded (ARM) processor running the ISP SW, cache and channel controllers with ISP HW, DRAM and SRAM, and multiple NAND flash channels.]

SLIDE 8

Outline

1 Introduction
2 Background
3 Proposed Methodology
4 Experimental Results
5 Discussion and Conclusion

SLIDE 9

Machine Learning as an Optimization Problem

Machine learning categories

Supervised learning, unsupervised learning, reinforcement learning

The main purpose of supervised machine learning

Find the optimal θ that minimizes F(D; θ)

F(D; θ) = L(D; θ) + r(θ) (1)

[Figure: a neural network with an input layer and an output layer.]

D: input data
θ: model parameters
L: loss function
r: regularization term
F: objective function
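
As a concrete illustration of Eq. (1) (our example, not one from the talk), take L2-regularized least squares: the loss L is the squared prediction error and the regularizer r penalizes large parameters with strength λ:

```latex
% Illustrative instance of F(D; \theta) = L(D; \theta) + r(\theta):
% least-squares loss plus an L2 (ridge) penalty of strength \lambda.
F(D; \theta) = \sum_{i} \bigl( y_i - \theta^{\top} x_i \bigr)^2
             + \lambda \lVert \theta \rVert_2^2
```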

SLIDE 10

Gradient Descent

θ_{t+1} = θ_t − η ∇F(D; θ_t) (2)
        = θ_t − η ∑_i ∇F(D_i; θ_t) (3)

η: learning rate
t: iteration index
i: data sample index

First-order iterative optimization algorithm
Uses all samples per iteration

Stochastic gradient descent (SGD)
  Uses only one sample per iteration

Minibatch stochastic gradient descent
  Between gradient descent and SGD: uses multiple samples per iteration (see the sketch below)
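
To make the three variants concrete, here is a minimal C++ sketch of one update step on a squared-error loss; it is our illustration (placeholder data layout and gradient), not the ISP-ML implementation:

```cpp
#include <cstddef>
#include <vector>

// One parameter update: theta <- theta - eta * (batch-averaged gradient).
// batch_size == X.size() -> gradient descent (all samples)
// batch_size == 1        -> stochastic gradient descent
// otherwise              -> minibatch SGD
void sgd_step(std::vector<double>& theta,
              const std::vector<std::vector<double>>& X,  // input samples
              const std::vector<double>& y,               // targets
              double eta, std::size_t start, std::size_t batch_size) {
    std::vector<double> grad(theta.size(), 0.0);
    for (std::size_t i = start; i < start + batch_size && i < X.size(); ++i) {
        // Gradient of the squared error 0.5 * (theta^T x_i - y_i)^2.
        double pred = 0.0;
        for (std::size_t j = 0; j < theta.size(); ++j) pred += theta[j] * X[i][j];
        for (std::size_t j = 0; j < theta.size(); ++j)
            grad[j] += (pred - y[i]) * X[i][j];
    }
    for (std::size_t j = 0; j < theta.size(); ++j)
        theta[j] -= eta * grad[j] / static_cast<double>(batch_size);
}
```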

Source: https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html

SLIDE 11

Parallel and Distributed SGD

Synchronous SGD
  The parameter server aggregates ∇θ_slave synchronously.

Downpour SGD
  Workers communicate with the parameter server asynchronously.

Elastic Averaging SGD (EASGD)
  Each worker has its own parameters.
  Workers transfer (θ_slave − θ_master), not ∇θ_slave (update rule sketched below).
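
A minimal sketch of one EASGD round, following Zhang et al.'s formulation with elasticity coefficient α and communication period τ; the grad callback and all names are our placeholders, not ISP-ML code:

```cpp
#include <cstddef>
#include <vector>

// One EASGD round for a single worker. Between communications the worker
// runs plain SGD on its own replica; every tau local steps, worker and
// master pull toward each other by alpha * (theta_w - theta_m).
void easgd_round(std::vector<double>& theta_w,  // worker (slave) parameters
                 std::vector<double>& theta_m,  // master parameters
                 double eta, double alpha, int tau,
                 std::vector<double> (*grad)(const std::vector<double>&)) {
    for (int step = 0; step < tau; ++step) {
        std::vector<double> g = grad(theta_w);
        for (std::size_t j = 0; j < theta_w.size(); ++j)
            theta_w[j] -= eta * g[j];            // local SGD step
    }
    for (std::size_t j = 0; j < theta_w.size(); ++j) {
        double diff = theta_w[j] - theta_m[j];   // (theta_slave - theta_master)
        theta_w[j] -= alpha * diff;              // worker moves toward master
        theta_m[j] += alpha * diff;              // master moves toward worker
    }
}
```

Note that what crosses the link is the elastic difference, not a gradient, which is why EASGD tolerates infrequent communication.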

SLIDE 12

Fundamentals of Solid-State Drives (SSDs)

SSD controller
  Embedded processor running the FTL (flash translation layer): HDD emulation, wear leveling, garbage collection, etc. (a toy mapping sketch follows this list)
  Cache controller and channel controllers

DRAM
  Cache and buffer, 512MB - 2GB

NAND flash arrays
  Simultaneously accessible across channels

Host interface logic
  SATA, PCIe
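
To illustrate what the FTL's HDD emulation involves: NAND pages cannot be overwritten in place, so the FTL redirects every write to a fresh physical page and keeps a logical-to-physical map. A toy page-level sketch of ours, omitting the wear leveling and garbage collection a real FTL (such as DFTL) performs:

```cpp
#include <cstdint>
#include <unordered_map>

// Toy page-level FTL: maps logical pages to physical pages.
class ToyFTL {
    std::unordered_map<uint64_t, uint64_t> map_;  // logical -> physical
    uint64_t next_free_ = 0;                      // naive free-page allocator
public:
    // A write never rewrites in place: it claims a fresh physical page and
    // remaps; the old page becomes stale (reclaimed later by GC).
    uint64_t write(uint64_t logical_page) {
        uint64_t physical = next_free_++;
        map_[logical_page] = physical;
        return physical;
    }
    // A read follows the current mapping (-1 if the page was never written).
    int64_t read(uint64_t logical_page) const {
        auto it = map_.find(logical_page);
        return it == map_.end() ? -1 : static_cast<int64_t>(it->second);
    }
};
```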

SLIDE 13

Previous Work on Near-Data Processing: PIM (Processing in Memory)

Performs computation inside the main memory
3D-stacked memory (e.g., HMC) has recently been used for PIM

Processing units are implemented in the logic layer

Applications: sorting, string matching, CNNs, matrix multiplication, etc.

SLIDE 14

Previous Work on Near-Data Processing: ISP (In-Storage Processing)

Performs computation inside the storage device

ISP with an embedded processor
  Pros: easy to implement, flexible
  Cons: no parallelism

ISP with dedicated hardware logic
  Pros: channel parallelism, hardware acceleration
  Cons: hard to implement and change

Applications: DB queries (scan, join), linear regression, k-means, string matching, etc.

SLIDE 15

Outline

1 Introduction
2 Background
3 Proposed Methodology
4 Experimental Results
5 Discussion and Conclusion

SLIDE 16

ISP-ML: ISP Platform for Machine Learning on SSDs

ISP-supporting SSD simulator
  Implemented in SystemC on the Synopsys Platform Architect (a minimal module skeleton follows)
  Software/hardware co-simulation
  Easily executes various machine learning algorithms written in C/C++

Transaction-level simulator
  For reasonable simulation speed

ISP components
  ISP SW and ISP HW
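
For flavor, a minimal clocked SystemC module skeleton of our own making; the actual ISP-ML components, interfaces, and timing models are far richer than this sketch:

```cpp
#include <systemc.h>

// Skeleton of a clocked channel-controller model.
SC_MODULE(ChannelController) {
    sc_in<bool> clk;

    void on_clock() {
        // Per cycle: fetch a NAND page (a timed transaction at this
        // abstraction level) and apply an ISP primitive to it.
    }

    SC_CTOR(ChannelController) {
        SC_METHOD(on_clock);
        sensitive << clk.pos();
        dont_initialize();
    }
};

int sc_main(int, char*[]) {
    sc_clock clk("clk", 10, SC_NS);  // 100 MHz model clock
    ChannelController cc("cc");
    cc.clk(clk);
    sc_start(100, SC_NS);            // run the model briefly
    return 0;
}
```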

[Figure: the simulated architecture — the host/SSD block diagram of slide 7, with clk/rst distribution to the modeled SSD-controller components (host I/F, embedded processor, cache controller, channel controllers, SRAM, DRAM) and the NAND flash channels.]

SLIDE 17

ISP-ML: ISP Platform for Machine Learning on SSDs

We implemented two types of ISP hardware components:

Channel controllers: perform primitive operations on the stored data.
Cache controller: collects the results from each channel controller.

They form a master-slave architecture (the cache controller as master, the channel controllers as slaves) and communicate with each other (the aggregation pattern is sketched below).
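
Conceptually, the channel controllers act as parallel reducers over their channel-local data while the cache controller aggregates the partial results. A threaded C++ sketch of that master-slave pattern (our illustration; the real components are hardware blocks, not threads):

```cpp
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each "channel controller" reduces its local shard in parallel; the
// "cache controller" (the calling thread) collects the partial results.
double channel_parallel_sum(const std::vector<std::vector<double>>& shards) {
    std::vector<double> partial(shards.size(), 0.0);
    std::vector<std::thread> channels;
    for (std::size_t c = 0; c < shards.size(); ++c)
        channels.emplace_back([&partial, &shards, c] {
            // Slave: primitive operation on the data local to channel c.
            partial[c] = std::accumulate(shards[c].begin(), shards[c].end(), 0.0);
        });
    for (auto& t : channels) t.join();
    // Master: combine the per-channel results.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```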

[Figure: the host/SSD block diagram again, showing the ISP HW attached to the cache and channel controllers and the ISP SW on the embedded processor.]

SLIDE 18

Parallel SGD Implementation on ISP-ML

[Slides 18-34: an animated, step-by-step illustration of the parallel SGD implementation on ISP-ML; the figures were not captured in this extraction.]

SLIDE 35

Methodology for IHP-ISP Performance Comparison

Ideal ways to fairly compare ISP and IHP (in-host processing):

1 Implementing ISP-ML in a real semiconductor chip
  High chip-manufacturing costs

2 Simulating IHP in the ISP-ML framework
  Prohibitively long simulation time for IHP

3 Implementing both ISP and IHP using FPGAs
  Requires another significant development effort

⇒ Hard to fairly compare the performance of ISP and IHP
⇒ We propose a practical comparison methodology

SLIDE 36

Methodology for IHP-ISP Performance Comparison

[Figure: comparison flow — (a) real system: the application runs on the host with real storage while an IO trace is extracted; (b) simulator: the IO trace and ISP commands are replayed on ISP-ML (baseline and ISP-implemented configurations).]

In the host (real system):
  Measure the observed IHP execution time (T_total)
  Measure the data IO time (T_IO)
  Extract the IO trace while executing the application

In the SSD (simulator):
  Measure the baseline simulation time with the IO trace (T_IOsim)

Observed IHP execution time: T_total = T_nonIO + T_IO
Expected IHP simulation time: T_nonIO + T_IOsim = T_total − T_IO + T_IOsim (a worked example follows)

T_IO: data IO latency of the storage
T_nonIO: non-IO execution time
T_IOsim: data IO time of the baseline SSD in ISP-ML
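
A worked example with hypothetical numbers (ours, not measurements from the talk): if a real run takes T_total = 10 s, of which T_IO = 4 s is storage IO, and replaying the IO trace on the baseline simulated SSD takes T_IOsim = 6 s, the expected IHP simulation time is

```latex
% Hypothetical numbers, for illustration only.
T_{nonIO} + T_{IOsim} = T_{total} - T_{IO} + T_{IOsim}
                      = 10\,\mathrm{s} - 4\,\mathrm{s} + 6\,\mathrm{s} = 12\,\mathrm{s}
```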

SLIDE 37

Outline

1 Introduction
2 Background
3 Proposed Methodology
4 Experimental Results
5 Discussion and Conclusion

SLIDE 38

Setup and Implementation

Host specifications
  CPU: 8-core Intel(R) Core i7-3770K (3.50GHz)
  Main memory: 32GB DDR3 RAM
  Storage: Samsung SSD 840 Pro
  OS: Ubuntu 14.04 LTS

ISP-ML specifications
  Embedded processor: ARM 926EJ-S (400MHz)
  FTL: DFTL
  Page size: 8KB
  tprog / tread / tblockerase: 300µs / 75µs / 5ms
  FPU: 0.5 instructions/cycle (pipelined)

Dataset: 10x-amplified MNIST (handwritten digits)

SLIDE 39

Performance Comparison: ISP-Based Optimization

EASGD showed the best performance in this experiment:
  2.96x faster than synchronous SGD on average
  1.41x faster than Downpour SGD on average

For 4 and 8 channels, synchronous SGD was slower than Downpour SGD; for 16 channels, it was faster.

[Figure: test accuracy (0.82-0.94) vs. time (2-12 s) for (a) 4-channel, (b) 8-channel, and (c) 16-channel configurations, comparing synchronous SGD, Downpour SGD, and EASGD.]

SLIDE 40

Performance Comparison: IHP versus ISP

Compared IHP under memory shortage with ISP
  In large-scale machine learning, the computing system may suffer from memory shortage.
  Assumption: the host had already loaded all the data into main memory for IHP.

ISP-based EASGD with 16 channels achieved the best performance in our experiments.

[Figure: test accuracy (0.80-0.92) vs. time (4-20 s) for IHP with 2/4/8/16/32GB of main memory and ISP-based EASGD with 4/8/16 channels.]

SLIDE 41

Channel Parallelism

The speed-up tends to be proportional to the number of channels because the communication overhead within the SSD is negligible.

In distributed computing systems, by contrast, communication bottlenecks commonly occur.

[Figure: test accuracy (0.82-0.94) vs. time (2-12 s) for (a) synchronous SGD, (b) Downpour SGD, and (c) EASGD with 4/8/16 channels; (d) speed-up vs. number of channels for the three algorithms.]

SLIDE 42

Effects of Communication Period in Async. SGD

Downpour SGD
  High speed for a low communication period (τ = 1, 4)
  Unstable for a high communication period (τ = 16, 64)

EASGD
  As the communication period increases, convergence speed decreases.
  This is the opposite of distributed computing systems, owing to the low communication overhead in ISP.

[Figure: test accuracy vs. time for (a) Downpour SGD (0.50-0.90) and (b) EASGD (0.86-0.92) with communication periods τ = 1, 4, 16, 64.]

SLIDE 43

Experimental Results Summary

1 EASGD shows the best performance in our ISP-ML environment.

2 ISP is more efficient than IHP when the host suffers from insufficient main memory.
  ISP may be useful in large-scale machine learning.

3 The speed-up from parallelization is proportional to the number of channels.
  Because of the ultra-fast on-chip communication.

4 The performance of EASGD decreases as the communication period increases, unlike in conventional distributed systems.

SLIDE 44

Outline

1 Introduction
2 Background
3 Proposed Methodology
4 Experimental Results
5 Discussion and Conclusion

SLIDE 45

Parallelism in ISP

ISP can provide various advantages for the data processing involved in machine learning.
  E.g., ultra-fast on-chip communication
  ⇒ increased energy efficiency, security, and reliability

A high degree of parallelism can be achieved by increasing the number of channels inside an SSD.

Exploiting a hierarchy of parallelism
  Distributed systems + ISP-based SSDs

SLIDE 46

Opportunities for Future Research

1 Implementing deep neural networks in the ISP-ML framework
2 Implementing adaptive optimization algorithms (e.g., Adagrad and Adadelta)
3 Pre-computing metadata during data writes
4 Implementing data-shuffling functionality
5 Investigating the effect of NAND flash design on performance

SLIDE 47

Conclusion

Created a full-fledged ISP-supporting SSD simulator for machine learning
Implemented and compared multiple versions of parallel SGD
Proposed a fair methodology for comparing IHP and ISP
Opened future research opportunities for exploiting channel parallelism

SLIDE 48

Acknowledgments

SLIDE 49

Q/A
