

SLIDE 1

Mocha.jl

Deep Learning in Julia
Chiyuan Zhang (@pluskid), CSAIL, MIT

SLIDE 2

Deep Learning

  • Learning with multi-layer (3~30) neural networks, on a huge training set.
  • State-of-the-art on many AI tasks
  • Computer Vision: Image Classification, Object Detection, Semantic Segmentation, etc.
  • Speech Recognition & Natural Language Processing: Acoustic Modeling, Language Modeling, Word / Sentence Embedding

SLIDE 3

[Figure: GoogLeNet network diagram: stacked Inception modules of 1x1, 3x3, and 5x5 convolutions with max pooling and DepthConcat merges, average pooling, and three softmax classifier heads (softmax0, softmax1, softmax2)]

GoogLeNet (Inception)

Winner of ILSVRC 2014, 27 layers, ~7 million parameters

SLIDE 4

ILSVRC on ImageNet

Deep learning has dominated since 2012, surpassing “human performance” in 2015.

[Figure: ILSVRC top-5 error (%) by year, 2010 through 2014 plus an arXiv 2015 result, with the human-performance level marked]

SLIDE 5

Deep Learning in Speech Recognition

Image source: Li Deng and Dong Yu. Deep Learning: Methods and Applications.

SLIDE 6

Deep Neural Networks

  • A network that consists of computation units (layers, or nodes) connected via a specific architecture (see the sketch below).

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS 2012.
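
Illustratively, in plain Julia (no Mocha), a network is just stacked computation units, each an affine map followed by a nonlinearity; all names and sizes below are made up for the sketch:

relu(x) = max(x, zero(x))

function network(x, W1, b1, W2, b2)
    h = map(relu, W1 * x .+ b1)   # hidden layer: computation unit 1
    return W2 * h .+ b2           # output layer: computation unit 2
end

x  = rand(784)                        # e.g. a flattened 28x28 image
W1 = randn(128, 784); b1 = zeros(128)
W2 = randn(10, 128);  b2 = zeros(10)
scores = network(x, W1, b1, W2, b2)   # 10 class scores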

SLIDE 7

Deep Learning Made Easy

A deep learning toolkit provides common layers, easy ways to define network architectures, and a transparent interface to high-performance computation backends (BLAS, GPUs, etc.)

  • C++: Caffe (widely used in academia), dmlc/cxxnet, cuda-convnet, etc.
  • Python: Theano (auto-differentiation) and its wrappers, NervanaSystems/neon, etc.
  • Lua: Torch7 (Facebook); Matlab: MatConvNet (VGG)
  • Julia: pluskid/Mocha.jl, dfdx/Boltzmann.jl
  • Julia: pluskid/Mocha.jl, dfdx/Boltzmann.jl
SLIDE 8

Why Mocha.jl?

  • Written in Julia, for Julia: easily reuses data pre/post-processing and visualization tools from the Julia ecosystem.
  • Minimal dependencies: the pure-Julia backend runs out of the box, convenient for fast prototyping.
  • Multiple backends: easily switch to the CUDA + cuDNN backend for highly efficient deep net training.
  • Correctness: all computation layers are unit-tested.
  • Modular architecture: layers, activation functions, network topology, etc. are easily extendable.

SLIDE 9

Mocha.jl

  • Deep learning framework for (and written in) Julia; inspired by Caffe; focused on easy prototyping, customization, and efficiency (switchable computation backends)

> Pkg.add("Mocha")
> Pkg.test("Mocha")

  • or, for the latest dev version:

> Pkg.checkout("Mocha")
SLIDE 10

IJulia Example

Image classification example in an IJulia notebook, using a pre-trained ImageNet model.
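
The flow of that notebook, compressed into a runnable toy sketch: the single randomly initialized inner-product layer below stands in for the full pre-trained ImageNet architecture, and the blob accessors (net.output_blobs, copy! from blob to array) are assumptions from this era's Mocha API:

using Mocha

backend = CPUBackend()
init(backend)

input = rand(Float32, 227, 227, 3, 1)   # assumed ImageNet-style input tensor
data  = MemoryDataLayer(name="data", tops=[:data], batch_size=1,
                        data=Array[input])
ip    = InnerProductLayer(name="ip", output_dim=1000,   # 1000 ImageNet classes
                          tops=[:ip], bottoms=[:data])
prob  = SoftmaxLayer(name="prob", tops=[:prob], bottoms=[:ip])

net = Net("toy-classifier", backend, [data, ip, prob])
init(net)
forward(net)                            # one mini-batch through the net

probs = zeros(Float32, 1000, 1)
copy!(probs, net.output_blobs[:prob])   # read predictions back to an Array
println(indmax(probs))                  # top-1 class index

destroy(net)
shutdown(backend)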

SLIDE 11

Mini-tutorial: CNN on MNIST

  • MNIST: handwritten digits
  • Data preparation:
  • Image data in Mocha is represented as a 4D tensor: width-by-height-by-channels-by-batch
  • MNIST: 28-by-28-by-1-by-64
  • Mocha supports ND-tensors for general data
  • HDF5 file: a general format for tensor data, also supported by numpy, Matlab, etc. (see the sketch below)
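
A minimal sketch of preparing such an HDF5 file with HDF5.jl; the dummy arrays, paths, and the dataset names data / label (chosen to match the data layer's blob names on the next slide) are assumptions:

using HDF5

# Dummy stand-ins for the real MNIST arrays
n      = 1000
images = rand(Float32, 28, 28, 1, n)              # width x height x channels x count
labels = convert(Array{Float32}, rand(0:9, n))

isdir("data") || mkdir("data")
h5open("data/train.hdf5", "w") do h5
    h5["data"]  = images                          # dataset names match the blob names
    h5["label"] = reshape(labels, 1, n)
end

# The data layer's source file simply lists HDF5 files, one per line
open("data/train.txt", "w") do io
    println(io, "data/train.hdf5")
end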

SLIDE 12

Mini-tutorial: CNN on MNIST

  • Data layer


data_layer = AsyncHDF5DataLayer(name="train-data", source="data/train.txt", batch_size=64, shuffle=true)

  • data/train.txt lists the HDF5 files for the training set
  • 64 images are provided in each mini-batch
  • the data is shuffled to improve convergence
  • the async data layer uses Julia’s @async to pre-read data while waiting for computation on the CPU / GPU

SLIDE 13

Convolution layer

conv_layer = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5), bottoms=[:data], tops=[:conv])

[Figure: LeNet-5 architecture: INPUT 32x32 -> convolutions -> C1: feature maps 6@28x28 -> subsampling -> S2: 6@14x14 -> C3: 16@10x10 -> S4: 16@5x5 -> C5: 120 -> F6: 84 (full connections) -> OUTPUT 10 (Gaussian connections)]

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

SLIDE 14

Pooling Layer

pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2), bottoms=[:conv], tops=[:pool])

  • The pooling layer operates on the output of the convolution layer
  • By default, MAX pooling is performed; switch to MEAN pooling by specifying pooling=Pooling.Mean(), as shown below
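
For instance, switching the same layer to mean pooling changes only the pooling argument:

pool_layer = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2),
                          pooling=Pooling.Mean(),
                          bottoms=[:conv], tops=[:pool])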

SLIDE 15

Blobs & Net Architecture

  • Network architecture is determined by connecting tops (output) blobs to bottoms (input) blobs with matching blob names.
  • Layers are automatically sorted and connected as a directed acyclic graph (DAG).

SLIDE 16

Rest of the layers

conv2_layer = ConvolutionLayer(name="conv2", n_filter=50, kernel=(5,5),
                               bottoms=[:pool], tops=[:conv2])
pool2_layer = PoolingLayer(name="pool2", kernel=(2,2), stride=(2,2),
                           bottoms=[:conv2], tops=[:pool2])
fc1_layer = InnerProductLayer(name="ip1", output_dim=500,
                              neuron=Neurons.ReLU(), bottoms=[:pool2], tops=[:ip1])
fc2_layer = InnerProductLayer(name="ip2", output_dim=10,
                              bottoms=[:ip1], tops=[:ip2])
loss_layer = SoftmaxLossLayer(name="loss", bottoms=[:ip2, :label])
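
With every layer defined, they can be assembled into a Net. A minimal sketch; the net name and the common_layers grouping (reused for the validation net later) follow the Mocha MNIST tutorial:

backend = CPUBackend()   # or GPUBackend(); see the GPU-vs-CPU demo below
init(backend)

common_layers = [conv_layer, pool_layer, conv2_layer, pool2_layer,
                 fc1_layer, fc2_layer]
net = Net("MNIST-train", backend, [data_layer, common_layers..., loss_layer])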

SLIDE 17

SGD Solver

params = SolverParameters(max_iter=10000, regu_coef=0.0005,
                          mom_policy=MomPolicy.Fixed(0.9),
                          lr_policy=LRPolicy.Inv(0.01, 0.0001, 0.75),
                          load_from=exp_dir)
solver = SGD(params)

SLIDE 18

Coffee Breaks…

… for the solver

setup_coffee_lounge(solver, save_into="$exp_dir/statistics.jld", every_n_iter=1000)

# report training progress every 100 iterations
add_coffee_break(solver, TrainingSummary(), every_n_iter=100)

# save snapshots every 5000 iterations
add_coffee_break(solver, Snapshot(exp_dir), every_n_iter=5000)
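
With the solver configured and coffee breaks registered, training runs (or resumes from the last snapshot in exp_dir) with a single call; the cleanup calls are the usual end of a Mocha session:

solve(solver, net)   # run SGD on the training net for up to max_iter iterations

destroy(net)         # release layer state
shutdown(backend)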

SLIDE 19

Solver Statistics

Solver statistics are saved automatically once the coffee lounge is set up. Snapshots record training progress periodically, so training can resume from the last snapshot after an interruption.
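
Since the statistics file is a standard JLD file, it can be loaded back for inspection or plotting. A sketch; the internal key layout is an assumption that may vary across Mocha versions, so list the keys first:

using JLD   # the format used by setup_coffee_lounge

stats = load("$exp_dir/statistics.jld")   # the save_into path from the previous slide
println(keys(stats))                      # discover what was recorded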

SLIDE 20

Demo: GPU vs CPU

backend = use_gpu ? GPUBackend() : CPUBackend()
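
One caveat: the CUDA backend must be enabled through an environment variable before the package is loaded, so the full switch looks roughly like this sketch:

use_gpu = true                       # flip to false for the CPU run
if use_gpu
    ENV["MOCHA_USE_CUDA"] = "true"   # must be set before `using Mocha`
end
using Mocha

backend = use_gpu ? GPUBackend() : CPUBackend()
init(backend)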

SLIDE 21

Parameter Sharing

  • When a layer has trainable parameters (e.g. convolution or inner-product layers), those parameters are registered under the layer name and shared by all layers with the same name
  • Use cases:
  • Validation network during training (see the sketch below)
  • Pre-training, fine-tuning
  • Advanced architectures, time-delayed nodes
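
The first use case is the one the coffee-break machinery makes convenient: a validation net that reuses the training layer names, and therefore their parameters. A sketch in the spirit of the Mocha MNIST tutorial (the test file list and batch size are assumptions; common_layers is the grouping from the net-assembly sketch above):

data_test = AsyncHDF5DataLayer(name="test-data", source="data/test.txt",
                               batch_size=100)
acc_layer = AccuracyLayer(name="test-accuracy", bottoms=[:ip2, :label])

# Same layer names as the training net, so parameters are shared
test_net = Net("MNIST-test", backend, [data_test, common_layers..., acc_layer])
add_coffee_break(solver, ValidationPerformance(test_net), every_n_iter=1000)
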
SLIDE 22

Parameter Sharing

SLIDE 23

The 3rd most-starred Julia package. Contributions are very welcome!