

SLIDE 1

A Powerful, Flexible, and Intuitive Deep Learning Framework

@ NVIDIA GTC, April 6th, 2016

Shohei Hido, Chief Research Officer, Preferred Networks, Inc.

SLIDE 2

Overview

• Chainer is a Python-based deep learning framework
• Chainer v1.0 was released as open source in June 2015
• It DOESN'T rely on Theano, unlike other Python frameworks
• Chainer uses a unique scheme named Define-by-Run

http://chainer.org/

• Why do users still need another framework?
• How different and effective is Chainer?

SLIDE 3

Preferred Networks (PFN): A startup that applies deep learning to industrial IoT

• Founded: March 2014
• Headquarters: Tokyo, Japan
• U.S. Subsidiary: San Mateo, California
• Company size: 35 engineers & researchers
• Investors: Toyota, FANUC, NTT

(Diagram: deep learning × industrial IoT, applied to manufacturing, automotive, and healthcare)

SLIDE 4

Partnering with world-leading companies using Chainer

• R&D collaboration on industrial problems with real-world data
  – Specific requirements, modified algorithms, many trials and errors, etc.
  – Different from building a general-purpose recognition system

Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA

SLIDE 5

Two types of background behind DL frameworks

1. Scalability-oriented

• Use cases in mind
  – Image/speech recognition systems
  – Fast DL as a service in the cloud
• Problem type
  – A few general applications
  – 10+ million training samples
  – 10+ node cluster w/ fast network
• Possible bottleneck
  – Tuning of well-known algorithms
  – Distributed computation for model/data-parallel training

2. Flexibility-oriented

• Use cases in mind
  – Algorithm research
  – R&D projects for new products
• Problem type
  – Various specific applications
  – 10+ k training samples
  – 1 node with multiple GPUs
• Possible bottleneck
  – Trial-and-error in prototyping
  – Debugging, profiling & refactoring
  – (wait time during compilation)

SLIDE 6

Designed for efficient research & development

• Flexible: new kinds of complex models for various applications
• Intuitive: rapid prototyping and efficient trial-and-error
• Powerful: comparable performance for 1 node & multi-GPUs

(Diagram: scalability-oriented ↔ flexibility-oriented spectrum)

SLIDE 7

Agenda

• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

SLIDE 8

Neural network and computation

(Diagram: forward computation from input units x1…xN through hidden units h1…hH and k1…kM to output units y1…yM, and backward computation (backpropagation) in the reverse direction; example inputs: text, image, sensor data; example outputs: "Object: tulip", "Anomaly score: 0.35", "Category: sports")

SLIDE 9

Chainer focuses on network representation/training

• Design choices for deep learning frameworks
  – How to build neural networks?
  – How to train neural networks?
  – Which text format/language for modeling?
  – Which language for computing?
  – Run with a GPU?
  – Run on multiple GPUs?
  – Run on multiple compute nodes?

SLIDE 10

Building and training neural networks: Computational graph construction is the key

1. Construct a computational graph
   – Based on the network definition given by users
   – Chains of functions and operations on input variables
2. Compute loss and gradients
   – Forward computation to calculate the loss for a minibatch
   – Backpropagation gives gradients for all parameters
3. Optimize the model
   – Update each parameter with its gradient
   – Repeat until convergence

Step 1 is the most important, and there are many approaches to it.
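As a rough orientation (my own illustration, not from the slides), the loop below carries out steps 2 and 3 by hand for a single linear layer in plain NumPy; step 1, building the graph and deriving these gradients automatically, is exactly what frameworks such as Chainer take care of:

import numpy as np

# Hand-written steps 2 and 3 for y = x.dot(W) + b with a squared-error loss
rng = np.random.RandomState(0)
W = 0.1 * rng.randn(3, 2).astype(np.float32)
b = np.zeros(2, dtype=np.float32)
lr = 0.1

x = rng.randn(8, 3).astype(np.float32)   # one minibatch of 8 samples
t = rng.randn(8, 2).astype(np.float32)   # targets

for step in range(100):
    # Step 2a: forward computation of the loss for the minibatch
    y = x.dot(W) + b
    diff = y - t
    loss = 0.5 * (diff ** 2).sum() / len(x)

    # Step 2b: backpropagation gives gradients for all parameters
    gy = diff / len(x)
    gW = x.T.dot(gy)
    gb = gy.sum(axis=0)

    # Step 3: update each parameter with its gradient; repeat until convergence
    W -= lr * gW
    b -= lr * gb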

SLIDE 11

Building blocks

• These functionalities are very similar between frameworks
• But the structure, abstraction level, and interface are different
• It comes down to the design of a domain-specific language for NNs

Building blocks: array data structure (vector/matrix/tensor), operations & functions, network (computational graph), optimizer (SGD/AdaGrad/Adam)

SLIDE 12

Types of domain-specific language for neural networks

• Text DSL
  – Ex. Caffe (prototxt)
  – Ex. CNTK (NDL)

• Symbolic program
  – Operations on symbols
  – Ex. Theano
  – Ex. TensorFlow

• Imperative program
  – Direct computations on raw data arrays
  – Ex. Torch.nn
  – Ex. Chainer

# Symbolic definition
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# Compile
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10) * 2)

# Imperative declaration
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1

%% Definition in text
f: {
  "A": "Variable",
  "B": "Variable",
  "C": ["B", "*", "A"],
  "ret": ["C", "+", 1]
}
# Compile
f = compile("f.txt")
d = f(A=np.ones(10), B=np.ones(10) * 2)

• Ex. MXNet (provides both symbolic and imperative APIs)
SLIDE 13

Comparison of DSL types

• Text DSL
  – Pros: human-readable definition; non-programmers can easily edit the network
  – Cons: users must study the format; the format might have to be extended for new algorithms
• Internal DSL, symbolic
  – Pros: static analysis at compile time; optimization before training; easy to parallelize
  – Cons: users must study special syntax; may need more effort to implement new algorithms
• Internal DSL, imperative
  – Pros: less effort to learn syntax; easy debugging and profiling; suitable for new algorithms with complex logic
  – Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of the imperative style, for high flexibility.

SLIDE 14

Agenda

• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

SLIDE 15

Chainer as an open-source project

• https://github.com/pfnet/chainer
• 50 contributors
• 1,277 stars & 255 forks
• 3,708 commits
• Active development & releases for the last 10 months
  – v1.0.0 (June 2015) to v1.7.2 (March 2016)

Original developer: Seiya Tokui

SLIDE 16

Chainer software stack

• Chainer is built on top of NumPy and CUDA
• CuPy is also introduced as an equivalent of NumPy on the GPU

(Diagram: Chainer sits on NumPy and CuPy; NumPy runs on the CPU via BLAS, while CuPy runs on NVIDIA GPUs via CUDA and cuDNN)

SLIDE 17

Graph build scheme (1/2) – Define-and-Run: most frameworks use this scheme (Chainer does not)

• Define: build a computational graph based on the definition
• Run: update the model (parameters) using the training dataset

(Diagram: the network definition is turned by automatic differentiation into a computational graph, gradient functions, and parameters; during Run, training data flows through the graph to produce loss & gradients, which update the parameters)

SLIDE 18

Graph build scheme (2/2) – Define-by-Run: computational graph construction on the fly

• No graph is constructed before training
• Instead, the graph is built at each forward computation
• The computational graph can be modified dynamically for each iteration/sample or depending on some conditions

(Diagram: the model definition and training data produce the computational graph, gradient functions, and parameters on the fly; conditions can dynamically change the graph, and parameters are updated at each iteration)
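To make this concrete, here is a minimal sketch (my own illustration, not code from the slides) of a forward pass whose graph changes per call; it uses the Chainer v1 API style shown in the later code samples, and the random branch is a hypothetical condition:

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class DynamicNet(chainer.Chain):
    """The graph is (re)built by whatever Python code runs in __call__."""

    def __init__(self):
        super(DynamicNet, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 100),
            l3=L.Linear(100, 10),
        )

    def __call__(self, x):
        h = F.relu(self.l1(x))
        # Hypothetical per-iteration condition: on some calls the extra
        # layer is skipped entirely, so the recorded graph differs each time
        if np.random.rand() < 0.5:
            h = F.relu(self.l2(h))
        return self.l3(h)

model = DynamicNet()
x = chainer.Variable(np.random.randn(32, 784).astype(np.float32))
y = model(x)  # the computational graph for this call is built right here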

SLIDE 19

Define-by-Run example: MLP for MNIST

• Only the transformations between units are set before training
• The connection is given as the forward computation

l1 = Linear(784, n_units)
l2 = Linear(n_units, 10)

def forward(x):
    h1 = ReLU(l1(x))
    return l2(h1)

(Diagram: x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y, classifying MNIST digits 0–9)

SLIDE 20

Define-by-Run: An interpreted language for neural networks

• Idea
  – Forward computation actually goes through the computational graph
  – By remembering the history, the actual graph can be obtained
• Advantages
  – Flexibility for new algorithms with complex components
    (e.g. recurrent, recursive, attention, memory, adversarial, etc.)
  – Intuitive coding with a highly imperative network definition
    (e.g. a stochastic network whose graph changes for each iteration)
• Current drawbacks
  – The graph is regenerated every time, even for fixed networks
  – No optimization, even for static parts of graphs
    (JIT-like analysis and subgraph caching might be useful)

SLIDE 21

Basic components (1/2): Variable and Function

• Variable
  – Variable wraps arrays (.data)
  – It remembers its parent function (.creator)
  – It will be assigned a gradient (.grad)
  – It keeps track of not only data but also computations
• Function
  – A transformation between Variables
  – Stateless
  – e.g. sigmoid, tanh, ReLU, max pooling, dropout

(Diagram: Functions map Variables x → h1 → y in the MNIST example, digits 0–9)
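As a small, hedged illustration of these attributes (my own example, not from the slides), the snippet below wraps a NumPy array in a Variable, applies a stateless Function, and inspects .data, .creator, and .grad after backward, using the Chainer v1 API described in this talk:

import numpy as np
import chainer
import chainer.functions as F

x = chainer.Variable(np.array([[0.5, -1.0, 2.0]], dtype=np.float32))
h = F.relu(x)          # a stateless Function applied to a Variable
y = F.sum(h)           # scalar output so that backward() can be called

print(h.data)          # the wrapped array (.data)
print(h.creator)       # the parent Function that produced h (.creator)

y.backward()           # backpropagation assigns gradients
print(x.grad)          # gradient of y with respect to x (.grad)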

SLIDE 22

Basic components (2/2): Link and Chain

• Link = a function with state
  – Parameters are also Variables, and gradients will be assigned to them
  – e.g. Linear (fully-connected), LSTM, Convolution2D, word embedding
• Chain = a network
  – A Chain has a set of child Links
  – Forward computation is defined in .__call__()
  – e.g. MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq

(Diagram: a Link (Linear) computes y = f(W*x + b); a Chain (MLP2) composes Linear l1 → ReLU → Linear l2, each Linear with its own W and bias)

SLIDE 23

Backpropagation through computational graph

• Consider an objective (Link.Linear): L = f(x * W + b)
• This computes the value of L in the forward computation, and simultaneously builds the following computational graph
• The gradient of L can be computed with respect to any variable by backpropagation
• Then the optimizer updates the values of the parameters

(Diagram: the computational graph x, W → * → + (with b) → f → L; the legend distinguishes Variables from Functions)
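A small, hedged example of this graph in Chainer v1 (my own sketch, not code from the slides): a Linear link builds the x*W + b part, f is taken to be a sum of squares purely for illustration, and backward() fills in gradients for W, b, and x:

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

linear = L.Linear(3, 2)                      # holds the parameters W and b
x = chainer.Variable(np.random.randn(4, 3).astype(np.float32))

y = linear(x)                                # forward: builds x*W + b in the graph
loss = F.sum(y * y)                          # "f": an illustrative scalar objective

loss.backward()                              # backpropagation through the graph
print(linear.W.grad)                         # gradient of the loss w.r.t. W
print(linear.b.grad)                         # gradient w.r.t. b
print(x.grad)                                # gradient w.r.t. the input as well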

SLIDE 24

Code sample (1/4): Multi-layer perceptron

class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )

    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y


class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)

    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy


# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()

(Diagram: the Chain MLP2 — x → Linear l1 (W, bias) → ReLU → h1 → Linear l2 (W, bias) → y)

SLIDE 25

Code sample (2/4): Convolutional neural network

class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )

    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y

* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

(Diagram: five conv2d layers followed by three linear layers)

SLIDE 26

Code sample (3/4): Recurrent neural network

class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),
        )
        self.h = None

    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y

    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h

(Diagram: the recurrent state h is unrolled over input words x_1…x_4, producing outputs y_1…y_4; BPTT length = 3)

# Truncated BPTT (length=3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()
        optimizer.update()

SLIDE 27

Code sample (4/4): Deep Networks with Stochastic Depth
(G. Huang et al., a paper published on arXiv, March 30, 2016; taken from http://arxiv.org/abs/1603.09382v2)

• A variant of Residual Net that skips connections stochastically
  – Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  – Stochastic skip: h = ReLU(b * f(h) + h), where b is Bernoulli with survival probability p

# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, ...):
        super(StochasticResNet, self).__init__(
            # Define f[i] the same way as for Residual Net
        )
        self.size = size
        self.p = prob  # survival probabilities

    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h

SLIDE 28

Miscellaneous

• Other features
  – Install with pip in one line: $ pip install chainer
  – Multi-GPU support by explicitly selecting the device ID to use
  – Pre-trained Caffe model import from the Model Zoo
  – Model serialization & save & load: HDF5 or NumPy npz (see the sketch after this list)
• Future directions (not only for Chainer)
  – JIT-like optimization during Define-by-Run
  – Memory consumption reduction (GPU memory is still small)
  – Handling variable-length inputs without minibatches
  – Maximizing performance in multi-node & multi-GPU environments
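As a hedged illustration of the serialization bullet above (my own minimal sketch, not from the slides), Chainer v1 ships chainer.serializers for saving and loading a Chain; this assumes the Classifier/MLP2 classes from code sample (1/4) are in scope and that h5py is installed:

from chainer import serializers

# Assuming `model` is a trained Chain, e.g. the Classifier(MLP2()) built earlier
serializers.save_hdf5('mlp2.model', model)      # save parameters to an HDF5 file

# Later: rebuild the same architecture and load the saved parameters into it
model2 = Classifier(MLP2())
serializers.load_hdf5('mlp2.model', model2)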

SLIDE 29

Agenda

• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

SLIDE 30

CuPy: (partially-)NumPy-compatible GPU library

• Motivation: NumPy + CUDA = CuPy
  – NumPy is the standard library for numerical computation in Python
  – CUDA is the standard API for using GPUs for high performance
  – Unfortunately, NumPy does NOT work with CUDA
• CuPy supports:
  – Fast computation using NVIDIA's cuBLAS and cuDNN
  – Array indexing, slicing, transpose, and reshape
  – Most of the operations/functions in NumPy
    (Chainer v1.7.2 already supports more than 170 functions)
  – User-defined functions and kernels
  – All dtypes, broadcasting, memory pool, etc.

SLIDE 31

How to use CuPy

• Usage of CuPy: just replace NumPy with CuPy
• Conversion between numpy.ndarray and cupy.ndarray
• Ex. a CPU/GPU-agnostic logsumexp function

def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)  # get CuPy or NumPy depending on the input
    x_max = x.max(axis)
    exp_sum = xp.exp(x - x_max).sum(axis)
    return x_max + xp.log(exp_sum)

import numpy, cupy
enable_cupy = True
xp = cupy if enable_cupy else numpy

w_c = cupy.asarray(numpy.ones(10))   # cupy.ndarray
w_n = cupy.asnumpy(cupy.ones(10))    # numpy.ndarray
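As a small usage note (my own example, not on the slide), the same logsumexp call then works unchanged whether its argument lives on the CPU or the GPU; this assumes the definition above and `from chainer import cuda` for get_array_module:

import numpy, cupy
from chainer import cuda

x_cpu = numpy.random.randn(32, 10).astype(numpy.float32)
x_gpu = cupy.asarray(x_cpu)                  # copy the same data to the GPU

print(logsumexp(x_cpu))                      # computed with NumPy on the CPU
print(cupy.asnumpy(logsumexp(x_gpu)))        # computed with CuPy on the GPU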

SLIDE 32

CuPy implementation: Optimized for performance & NumPy-compatibility

• Use Cython for cupy.core & cupy.cuda
• Dynamic code generation & compilation
  – CUDA code is generated for the specific tensor dimensions & data types
  – On-the-fly compilation by nvcc, with a binary cache (faster after first use)

(Diagram: cupy provides tensor operations & functions on top of cupy.core (ndarray, ufunc, elementwise, reduction) and cupy.cuda (CUDA Python wrapper), which call the CUDA libraries cuBLAS, cuRAND, and cuDNN)

SLIDE 33

CuPy performance on linear algebra: 5 to 25 times faster than NumPy

def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)

                     msec    speed-up
NumPy                2,929   1.0
CuPy                 585     5.0
CuPy + Memory Pool   123     23.8

(Intel Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970)

SLIDE 34

Use CuPy for GPU-based computation

• Three patterns are supported as wrappers
  – ElementwiseKernel: for element-wise computation
  – ReductionKernel: for reduce operations along an axis
  – ufunc: universal functions as in NumPy
• Ex. definition of an element-wise function, and its usage (automatic broadcasting and type checking are supported):

squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',      # input
    'float32 z',                 # output
    'z = (x - y) * (x - y)',     # operation
    'squared_diff')              # name

squared_diff(cupy.arange(10, dtype='float32'), 10)
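The slide only shows an ElementwiseKernel; for completeness, here is a hedged sketch of the ReductionKernel pattern mentioned above (adapted from the standard CuPy documentation example, not from the slides), computing an L2 norm along an axis:

import cupy

# ReductionKernel: map each element, reduce pairs, then post-process the result
l2norm = cupy.ReductionKernel(
    'T x',          # input params
    'T y',          # output params
    'x * x',        # map: applied to each element
    'a + b',        # reduce: how two mapped values are combined
    'y = sqrt(a)',  # post-reduction map
    '0',            # identity value of the reduction
    'l2norm')       # kernel name

a = cupy.arange(10, dtype='float32').reshape(2, 5)
print(l2norm(a, axis=1))  # one L2 norm per row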

SLIDE 35

Agenda

• Deep learning framework basics
• Introduction to Chainer
• CuPy: NumPy-compatible GPU library
• Performance and applications

SLIDE 36

Public benchmark results (CNN): Chainer shows comparable performance

• Forward computation is almost the same as TensorFlow
• Training with backward computation is slower, but this can be offset by having no compilation time while debugging/tuning

(Bar charts: forward and backward computation times (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat, comparing Torch, TensorFlow, Chainer, and Caffe (native))

Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe

SLIDE 37

Chainer can benefit from the latest CUDA libraries:
Ex. the Winograd algorithm in cuDNN v5

• 3x3 convolutions are common in CNNs & are now computed with Winograd
• State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated up to 2.0x at test time (forward only)

(Bar charts: forward and backward computation times (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5)

Independently measured with a modified version of soumith/convnet-benchmarks; cuDNN v5 can be used in Chainer v1.8.0

SLIDE 38

Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)

• https://github.com/mattya/chainer-gogh

(Images: content image (cat) + style image = new artistic image; the main code is 45 lines)

SLIDE 39

Chainer in industry: Used in demonstrations & being commercialized

• Many collaborations are ongoing w/ Chainer-based computer vision, deep reinforcement learning, etc.
• Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016 (http://tinyurl.com/pfn-ces16)
• Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015 (http://tinyurl.com/pfn-irex15)
  – 8 hours of training to reach expert level; commercialization by the end of 2016

SLIDE 40

Summary

• Chainer is a Python-based deep learning framework with a dynamic network construction scheme and CuPy
• It is designed for efficient research and prototyping while keeping comparable performance thanks to NVIDIA GPUs
• Official web: http://chainer.org/
• GitHub: https://github.com/pfnet/chainer

Your contributions will be appreciated & we are hiring!