A Powerful, Flexible, and Intuitive Deep Learning Framework
Shohei Hido, Chief Research Officer, Preferred Networks, Inc.
@ NVIDIA GTC, April 6th, 2016
Overview
- Chainer is a Python-based deep learning framework
- Chainer v1.0 was released as open source in June 2015
- It does NOT rely on Theano, unlike other Python frameworks
- Chainer uses a unique scheme named Define-by-Run
http://chainer.org/
- Why do users still need another framework?
- How is Chainer different and effective?
Preferred Networks (PFN): A startup that applies deep learning to industrial IoT
- Founded: March 2014
- Headquarters: Tokyo, Japan
- U.S. subsidiary: San Mateo, California
- Company size: 35 engineers & researchers
- Investors: Toyota, FANUC, NTT
Partnering with world-leading companies using Chainer
- Industries: Manufacturing, Automotive, Healthcare
- Partners: Toyota, FANUC, Panasonic, NTT, Cisco, NVIDIA
- R&D collaboration on industrial problems with real-world data
  - Specific requirements, modified algorithms, many trials and errors, etc.
  - Different from building a general-purpose recognition system
Two types of background behind DL frameworks
1. Scalability-oriented
- Use cases in mind
  - Image/speech recognition systems
  - Fast DL as a service in the cloud
- Problem type
  - A few general applications
  - 10+ million training samples
  - 10+ node cluster w/ fast network
- Possible bottlenecks
  - Tuning of well-known algorithms
  - Distributed computation for model/data-parallel training

2. Flexibility-oriented
- Use cases in mind
  - Algorithm research
  - R&D projects for new products
- Problem type
  - Various specific applications
  - 10+ k training samples
  - 1 node with multiple GPUs
- Possible bottlenecks
  - Trial-and-error in prototyping
  - Debugging, profiling & refactoring
  - (wait time during compilation)
Designed for efficient research & development
- Flexible: new kinds of complex models for various applications
- Intuitive: rapid prototyping and efficient trial-and-error
- Powerful: comparable performance on a single node with multiple GPUs
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Neural network and computation
[Figure: a feed-forward neural network with inputs x1...xN (text, image, or sensor data), hidden units h1...hH and k1...kM, and outputs y1...yM (e.g., object: tulip; anomaly score: 0.35; category: sports). Forward computation maps inputs to outputs; backward computation (backpropagation) propagates gradients from the outputs back toward the inputs.]
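To make the forward/backward picture concrete, here is a minimal NumPy-only sketch (not Chainer code; the layer sizes and squared-error loss are arbitrary choices for illustration) of one hidden layer and its gradients via the chain rule:

import numpy as np

# Toy data: 4 inputs -> 5 hidden units -> 3 outputs
x = np.random.randn(1, 4)                     # input
t = np.random.randn(1, 3)                     # target
W1, b1 = np.random.randn(4, 5), np.zeros(5)
W2, b2 = np.random.randn(5, 3), np.zeros(3)

# Forward computation: input -> hidden -> output -> loss
h = np.tanh(x.dot(W1) + b1)
y = h.dot(W2) + b2
loss = 0.5 * ((y - t) ** 2).sum()

# Backward computation (backpropagation): gradients of the loss
gy = y - t                                    # dloss/dy
gW2, gb2 = h.T.dot(gy), gy.sum(axis=0)
gh = gy.dot(W2.T) * (1 - h ** 2)              # back through tanh
gW1, gb1 = x.T.dot(gh), gh.sum(axis=0)

A deep learning framework automates exactly this bookkeeping for arbitrarily deep and branched networks.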
Chainer focuses on network representation/training
- Design choices for deep learning frameworks
  - How to build neural networks?
  - How to train neural networks?
  - Which text format/language for modeling?
  - Which language for computing?
  - Run with a GPU?
  - Run on multiple GPUs?
  - Run on multiple compute nodes?
Building and training neural networks: Computational graph construction is the key
1. Construct a computational graph
   - Based on the network definition given by users
   - Chains of functions and operations on input variables
2. Compute loss and gradients
   - Forward computation calculates the loss for a minibatch
   - Backpropagation gives gradients for all parameters
3. Optimize the model
   - Update each parameter with its gradient
   - Repeat until convergence

Step 1 is the most important, and there are many approaches (a rough sketch of the resulting training loop follows).
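In Chainer-flavored pseudocode (a rough sketch only; the concrete API appears in the code samples later, and the minibatch iterator here is hypothetical), steps 2 and 3 map onto a loop like this:

for epoch in range(n_epochs):
    for x, t in minibatches(dataset):   # hypothetical minibatch iterator
        loss = model(x, t)              # forward computation: loss for a minibatch
        model.zerograds()               # clear previous gradients
        loss.backward()                 # backpropagation: gradients for all parameters
        optimizer.update()              # update each parameter with its gradient

How step 1, the graph construction, is handled is where frameworks differ the most, as the next slides discuss.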
Building blocks
- These functionalities are very similar between frameworks
- But the structure, abstraction level, and interface are different
- It comes down to the design of a domain-specific language for neural networks

Building blocks: array data structure (vector/matrix/tensor), operations & functions, network (computational graph), optimizer (SGD/AdaGrad/Adam)
Types of domain-specific language for neural networks
- Text DSL
  - Ex. Caffe (prototxt)
  - Ex. CNTK (NDL)
- Symbolic program
  - Operations on symbols
  - Ex. Theano
  - Ex. TensorFlow
- Imperative program
  - Direct computations on raw data arrays
  - Ex. Torch.nn
  - Ex. Chainer
- Ex. MXNet (offers both symbolic and imperative APIs)

%% Definition in text (f.txt)
f: {
  "A": "Variable",
  "B": "Variable",
  "C": ["B", "*", "A"],
  "ret": ["C", "+", 1]
}
# Compile
f = compile("f.txt")
d = f(A=np.ones(10), B=np.ones(10) * 2)

# Symbolic definition
A = Variable('A')
B = Variable('B')
C = B * A
D = C + Constant(1)
# Compile
f = compile(D)
d = f(A=np.ones(10), B=np.ones(10) * 2)

# Imperative declaration
a = np.ones(10)
b = np.ones(10) * 2
c = b * a
d = c + 1
Comparison of DSL types

Text DSL
- Pros: human-readable definition; non-programmers can easily edit the network
- Cons: users must study the format; the format might have to be extended for new algorithms

Internal DSL, symbolic
- Pros: static analysis at compile time; optimization before training; easy to parallelize
- Cons: users must study a special syntax; may need more effort to implement new algorithms

Internal DSL, imperative
- Pros: less effort to learn the syntax; easy debugging and profiling; suitable for new algorithms with complex logic
- Cons: hard to optimize in advance; less efficient in memory allocation and parallelization

Chainer is at the extreme end of the imperative style for high flexibility.
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Chainer as an open-source project
- https://github.com/pfnet/chainer
- 50 contributors
- 1,277 stars & 255 forks
- 3,708 commits
- Active development & releases for the last 10 months
  - v1.0.0 (June 2015) to v1.7.2 (March 2016)

Original developer: Seiya Tokui
Chainer software stack
- Chainer is built on top of NumPy and CUDA
- CuPy is also introduced as an equivalent of NumPy on the GPU

[Stack diagram: Chainer runs on NumPy (CPU, via BLAS) and CuPy (NVIDIA GPU, via CUDA and cuDNN).]
Graph build scheme (1/2) - Define-and-Run: the scheme most frameworks use (Chainer does not)
- Define: build a computational graph based on the network definition
- Run: update the model (parameters) using the training dataset

[Diagram: automatic differentiation turns the network definition into a computational graph, gradient functions, and parameters (Define); training data is then fed repeatedly to compute the loss & gradients and update the parameters (Run).]
Graph build scheme (2/2) - Define-by-Run: computational graph construction on the fly
- No graph is constructed before training
- Instead, the graph is built during each forward computation
- The computational graph can be modified dynamically for each iteration/sample, or depending on some conditions (see the short sketch below)

[Diagram: the model definition and training data produce the computational graph, gradient functions, and parameters during the forward pass; conditions can dynamically change the graph, and the parameters are updated.]
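To illustrate the idea with a small hypothetical sketch (the layer sizes and the branching condition are made up for illustration): because the graph is traced during forward execution, ordinary Python control flow can change the network structure on every call.

import numpy as np
from chainer import Variable
import chainer.functions as F
import chainer.links as L

l1 = L.Linear(10, 10)
l2 = L.Linear(10, 10)

def forward(x, use_extra_layer):
    h = F.relu(l1(x))
    if use_extra_layer:        # ordinary Python branching decides this call's graph
        h = F.relu(l2(h))
    return h

x = Variable(np.random.randn(4, 10).astype(np.float32))
y1 = forward(x, True)          # the recorded graph contains l1 and l2
y2 = forward(x, False)         # the recorded graph contains only l1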
Define-by-Run example: MLP for MNIST
- Only the transformations between units are set up before training
- The connections are given as forward computation

l1 = Linear(784, n_units)
l2 = Linear(n_units, 10)

def forward(x):
    h1 = ReLU(l1(x))
    return l2(h1)

[Diagram: x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y (digits 0-9).]
Define-by-Run: An interpreted language for neural networks
- Idea
  - Forward computation actually goes through the computational graph
  - By remembering this history, the actual graph can be obtained
- Advantages
  - Flexibility for new algorithms with complex components
    - e.g., recurrent, recursive, attention, memory, adversarial networks
  - Intuitive coding with a highly imperative network definition
    - e.g., stochastic networks whose graph changes at each iteration
- Current drawbacks
  - The graph is regenerated every time, even for fixed networks
  - No optimization, even for the static parts of a graph
    - JIT-like analysis and subgraph caching might be useful
Basic components (1/2): Variable and Function
- Variable
  - Variable wraps arrays (.data)
  - It remembers its parent function (.creator)
  - It will be assigned a gradient (.grad)
  - It keeps track of not only the data but also the computations (a code sketch follows)
- Function
  - A transformation between Variables
  - Stateless
  - e.g., sigmoid, tanh, ReLU, max pooling, dropout
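A small sketch of how these pieces appear in code (the array values are arbitrary): applying a Function to a Variable produces a new Variable that remembers its creator, and backward() fills in .grad.

import numpy as np
from chainer import Variable
import chainer.functions as F

x = Variable(np.array([[0.5, -1.0]], dtype=np.float32))
y = F.tanh(x)                   # y is a Variable produced by the tanh Function

print(y.data)                   # the wrapped array
print(y.creator)                # the Function that created y

y.grad = np.ones_like(y.data)   # seed gradient (needed for a non-scalar output)
y.backward()
print(x.grad)                   # gradient w.r.t. x, filled in by backpropagation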
Basic components (2/2): Link and Chain
- Link = a function with state
  - Its parameters are also Variables, and gradients will be assigned to them
  - e.g., Linear (fully connected), LSTM, Convolution2D, word embedding
- Chain = a network
  - A Chain has a set of child Links
  - Forward computation is defined in __call__()
  - e.g., MLP2, AlexNet, GoogLeNet, RNNLM, seq2seq

[Diagram: a Link (Linear) computes y = f(W*x + b) with parameters W and b; the Chain (MLP2) composes Linear l1, ReLU, and Linear l2 to map x through h1 to y.]
Backpropagation through computational graph
- Consider an objective (using Link.Linear): L = f(x * W + b)
- Forward computation computes the value of L and simultaneously builds the computational graph below
- The gradient of L can then be computed with respect to any variable by backpropagation (a sketch in code follows)
- The optimizer then updates the values of the parameters

[Computational graph: x, W -> * -> (+ b) -> f -> L, where x, W, b, and L are Variables and *, +, and f are Functions.]
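A hedged sketch of this slide in code (the concrete choice of f, the shapes, and the values are illustrative): compute L in a forward pass, then call backward() to obtain the gradients of x, W, and b.

import numpy as np
from chainer import Variable
import chainer.functions as F
import chainer.links as L

linear = L.Linear(3, 2)              # holds the parameters W and b as Variables
x = Variable(np.random.randn(1, 3).astype(np.float32))

h = linear(x)                        # x * W + b
L_obj = F.sum(h * h)                 # an arbitrary scalar objective playing the role of f

L_obj.backward()                     # backpropagation through the recorded graph
print(x.grad)                        # dL/dx
print(linear.W.grad, linear.b.grad)  # dL/dW and dL/db

An optimizer (e.g., optimizers.SGD) would then use these gradients to update W and b, as in the full training loop on the next slide.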
Code sample (1/4): Multi-layer perceptron
class MLP2(Chain):
    def __init__(self):
        super(MLP2, self).__init__(
            l1=L.Linear(784, 100),
            l2=L.Linear(100, 10),
        )
    def __call__(self, x):
        h1 = F.relu(self.l1(x))
        y = self.l2(h1)
        return y

class Classifier(Chain):
    def __init__(self, predictor):
        super(Classifier, self).__init__(predictor=predictor)
    def __call__(self, x, t):
        y = self.predictor(x)
        self.accuracy = F.accuracy(y, t)
        self.loss = F.softmax_cross_entropy(y, t)
        return self.loss, self.accuracy

# Model and optimizer setup
model = Classifier(MLP2())
optimizer = optimizers.SGD()
optimizer.setup(model)

# Training loop with minibatches
for i in range(0, datasize, batchsize):
    x = Variable(x_tr[i:i+batchsize])
    t = Variable(y_tr[i:i+batchsize])
    model.zerograds()
    loss, acc = model(x, t)
    loss.backward()
    optimizer.update()

[Diagram: the MLP2 Chain, x -> Linear l1 (W, bias) -> ReLU -> h1 -> Linear l2 (W, bias) -> y.]
Code sample (2/4): Convolutional neural network
class AlexNet(Chain):
    def __init__(self):
        super(AlexNet, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )
    def __call__(self, x, t):
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv1(x))), 3, stride=2)
        h = F.max_pooling_2d(F.relu(
            F.local_response_normalization(self.conv2(h))), 3, stride=2)
        h = F.relu(self.conv3(h))
        h = F.relu(self.conv4(h))
        h = F.max_pooling_2d(F.relu(self.conv5(h)), 3, stride=2)
        h = F.dropout(F.relu(self.fc6(h)), train=self.train)
        h = F.dropout(F.relu(self.fc7(h)), train=self.train)
        y = self.fc8(h)
        return y

* ImageNet Classification with Deep Convolutional Neural Networks
  http://www.image-net.org/challenges/LSVRC/2012/supervision.pdf

[Diagram: five conv2d layers followed by three linear layers.]
Code sample (3/4): Recurrent neural network
class SimpleRNN(Chain):
    def __init__(self, n_vocab, n_units):
        super(SimpleRNN, self).__init__(
            embed=L.EmbedID(n_vocab, n_units),
            x2h=L.Linear(n_units, n_units),
            h2h=L.Linear(n_units, n_units),
            h2y=L.Linear(n_units, n_vocab),
        )
        self.h = None

    def __call__(self, x):
        y, h_new = self.fwd_one_step(x, self.h)
        self.h = h_new
        return y

    def fwd_one_step(self, x, h):
        x = F.tanh(self.embed(x))
        if h is None:
            h = F.tanh(self.x2h(x))
        else:
            h = F.tanh(self.x2h(x) + self.h2h(h))
        y = F.softmax(self.h2y(h))
        return y, h

[Diagram: input words x_1...x_4 update the recurrent state h and produce outputs y_1...y_4; BPTT length = 3.]

# Truncated BPTT (length = 3)
for i in range(0, datasize, batchsize):
    ...
    accum_loss += model(x, t)
    if i % bptt_length == 0:
        model.zerograds()
        accum_loss.backward()
        accum_loss.unchain_backward()
        optimizer.update()
Code sample (4/4): Deep Networks with Stochastic Depth
A paper published on arXiv, March 30, 2016 (G. Huang et al., http://arxiv.org/abs/1603.09382v2)
- A variant of Residual Net that skips connections stochastically
  - Outperformed the original Residual Net (ImageNet 2015 winner, MSR)
  - Stochastic skip: each residual function is applied with a survival probability

# Mock code in Chainer
class StochasticResNet(Chain):
    def __init__(self, prob, size, ...):
        super(StochasticResNet, self).__init__(
            # Define f[i] in the same way as for Residual Net
        )
        self.p = prob  # Survival probabilities

    def __call__(self, h):
        for i in range(self.size):
            b = numpy.random.binomial(1, self.p[i])
            c = self.f[i](h) + h if b == 1 else h
            h = F.relu(c)
        return h
Miscellaneous
- Other features
  - Install with pip in one line: $ pip install chainer
  - Multi-GPU support by explicitly selecting the device ID to use
  - Pre-trained Caffe model import from the Model Zoo
  - Model serialization, save & load: HDF5 or NumPy npz (a short sketch follows)
- Future directions (not only for Chainer)
  - JIT-like optimization during Define-by-Run
  - Memory consumption reduction (GPU memory is still small)
  - Handling variable-length inputs without minibatches
  - Maximizing performance in multi-node & multi-GPU environments
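A hedged sketch of the GPU selection and serialization features above (the file name is a placeholder, model is assumed to be a Chain such as the Classifier defined earlier, and the exact serializer functions available depend on the Chainer version):

from chainer import cuda, serializers

# Select a GPU explicitly by ID and move the model onto it
cuda.get_device(0).use()
model.to_gpu()

# Save / load model parameters (NumPy npz shown; HDF5 is also supported)
serializers.save_npz('mlp2.npz', model)
serializers.load_npz('mlp2.npz', model)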
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
CuPy: (partially-)NumPy-compatible GPU library
- Motivation: NumPy + CUDA = CuPy
  - NumPy is the standard library for numerical computation in Python
  - CUDA is the standard API for using GPUs for high performance
  - Unfortunately, NumPy does NOT work with CUDA
- CuPy supports:
  - Fast computation using NVIDIA's cuBLAS and cuDNN
  - Array indexing, slicing, transpose, and reshape (a brief sketch follows)
  - Most of the operations/functions in NumPy
    - Chainer v1.7.2 already supports more than 170 functions
  - User-defined functions and kernels
  - All dtypes, broadcasting, memory pools, etc.
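A minimal sketch of that NumPy-compatible surface (shapes and values are arbitrary): the familiar indexing, reshape, and reduction calls work directly on cupy arrays and execute on the GPU.

import cupy

a = cupy.arange(12, dtype=cupy.float32).reshape(3, 4)
b = a[:, 1:3]               # slicing, exactly as in NumPy
c = a.T.dot(a)              # transpose and matrix product on the GPU
s = cupy.sum(a, axis=0)     # reduction along an axis
print(cupy.asnumpy(s))      # copy back to a numpy.ndarray for printing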
How to use CuPy
- Usage of CuPy: just replace NumPy with CuPy
- Conversion between numpy.ndarray and cupy.ndarray
- Ex. a CPU/GPU-agnostic logsumexp function

import numpy, cupy

enable_cupy = True
xp = cupy if enable_cupy else numpy

w_c = cupy.asarray(numpy.ones(10))  # numpy.ndarray -> cupy.ndarray
w_n = cupy.asnumpy(cupy.ones(10))   # cupy.ndarray -> numpy.ndarray

def logsumexp(x, axis=None):
    xp = cuda.get_array_module(x)   # Get CuPy or NumPy, depending on where x lives
    x_max = x.max(axis)
    exp_sum = xp.exp(x - x_max).sum(axis)
    return x_max + xp.log(exp_sum)
CuPy implementation: Optimized for performance & NumPy-compatibility
- Use Cython for cupy.core & cupy.cuda
- Dynamic code generation & compilation
  - CUDA code is generated for the specific tensor dimensions & data types
  - On-the-fly compilation by nvcc with a binary cache (faster after the 1st use)

[Stack diagram: the cupy package provides tensor operations & functions on top of cupy.core (ndarray, ufunc, elementwise, reduction) and cupy.cuda (a Python wrapper for CUDA), which call into the CUDA libraries (cuBLAS, cuRAND, cuDNN).]
CuPy performance on linear algebra: 5 to 25 times faster than NumPy
def test(xp):
    a = xp.arange(1000000).reshape(1000, -1)
    return a.T * 2

test(numpy)   # warm-up
t1 = datetime.datetime.now()
for i in range(1000):
    test(numpy)
t2 = datetime.datetime.now()
print(t2 - t1)

test(cupy)    # warm-up (also triggers kernel compilation)
t1 = datetime.datetime.now()
for i in range(1000):
    test(cupy)
t2 = datetime.datetime.now()
print(t2 - t1)

Results over 1,000 runs:
                       msec    speed-up
  NumPy                2,929   1.0x
  CuPy                   585   5.0x
  CuPy + Memory Pool     123   23.8x

(Intel Core i7-4790 @ 3.60 GHz, 32 GB RAM, GeForce GTX 970)
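The "CuPy + Memory Pool" row presumably corresponds to enabling CuPy's pooled GPU allocator, so that repeated allocations inside the loop reuse memory instead of calling cudaMalloc every time. A hedged sketch (the exact location of this API may differ between CuPy versions):

import cupy
from cupy import cuda

pool = cuda.MemoryPool()         # pooled GPU memory allocator
cuda.set_allocator(pool.malloc)  # route all CuPy allocations through the pool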
Use CuPy for GPU-based computation
- Three patterns are supported as wrappers
  - ElementwiseKernel: for element-wise computation
  - ReductionKernel: for reduce operations along an axis (a sketch follows the example below)
  - ufunc: universal functions as in NumPy
- Ex. definition of an element-wise function and its usage (automatic broadcasting and type checking are supported)

squared_diff = cupy.ElementwiseKernel(
    'float32 x, float32 y',    # Input
    'float32 z',               # Output
    'z = (x - y) * (x - y)',   # Operation
    'squared_diff')            # Name

squared_diff(cupy.arange(10, dtype=cupy.float32), 10)
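For the ReductionKernel pattern (not shown on the slide), a hedged sketch in the same style, computing per-row L2 norms; the argument order (map expression, reduce expression, post-reduction map, identity, name) follows CuPy's documented convention:

import cupy

l2norm = cupy.ReductionKernel(
    'float32 x',      # Input
    'float32 y',      # Output
    'x * x',          # Map: applied to each element
    'a + b',          # Reduce: how mapped values are combined
    'y = sqrt(a)',    # Post-reduction map
    '0',              # Identity value of the reduction
    'l2norm')         # Name

a = cupy.arange(10, dtype=cupy.float32).reshape(2, 5)
print(l2norm(a, axis=1))  # L2 norm of each row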
Agenda
- Deep learning framework basics
- Introduction to Chainer
- CuPy: NumPy-compatible GPU library
- Performance and applications
Public benchmark results (CNN): Chainer shows comparable performance
- Forward computation is almost the same as TensorFlow
- Training with backward computation is slower, but this can be offset by having no compilation time while debugging/tuning

[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with Torch, TensorFlow, Chainer, and Caffe (native).]

Taken from https://github.com/soumith/convnet-benchmarks, using cuDNN except for Caffe
Chainer can benefit from the latest CUDA libraries: Ex. the Winograd algorithm in cuDNN v5
- Conv3x3 is common in CNNs & is now computed with Winograd
- State-of-the-art CNN models (e.g., GoogLeNet, VGG-A) can be accelerated by up to 2.0x at test time (forward only)

[Charts: forward and backward computation time (msec) for AlexNet, GoogLeNet, VGG-A, and OverFeat with cuDNN v4 vs. cuDNN v5.]

Independently measured with a modified version of soumith/convnet-benchmarks; cuDNN v5 can be used in Chainer v1.8.0
Algorithm implementation in Chainer: A Neural Algorithm of Artistic Style (Gatys et al., 2015)
- https://github.com/mattya/chainer-gogh
- Content image (cat) + style image = new artistic image
- Main code: 45 lines
Chainer in industry: Used in demonstrations & being commercialized
- Many collaborations are ongoing with Chainer-based computer vision, deep reinforcement learning, etc.
- Ex. 1: Chainer-controlled toy cars in the Toyota booth at CES 2016
- Ex. 2: FANUC's highly accurate bin-picking robot at IREX 2015
  - 8 hours of training to reach expert level; commercialization by the end of 2016
  - http://tinyurl.com/pfn-irex15