Neural networks across space & time - Dave Snowdon @davesnowdon - PowerPoint PPT Presentation



SLIDE 1

Neural networks across space & time

Dave Snowdon @davesnowdon https://www.linkedin.com/in/davesnowdon/

SLIDE 2

About me

  • Java & JavaScript by day
  • Python & Clojure by night
  • Amateur social roboticist
  • Been learning about deep learning for 18 months
SLIDE 3

Agenda

  • Why neural networks?
  • How do neural networks work?
  • Convolutional neural networks
  • Recurrent neural networks
SLIDE 4

Why neural networks?

SLIDE 5

Why care about deep learning?

  • Impressive results in a wide range of domains
  • image classification, text descriptions of images, language translation, speech generation, speech recognition…

  • Predictable execution (inference) time
  • Amenable to hardware acceleration
  • Automatic feature extraction
SLIDE 6

What are features?

10 PRINT “Hello QCon London”
20 GOTO 10

Example features: average statement length, number of statements, number of variables, cyclomatic complexity

SLIDE 7

Feature extraction

Traditional machine learning process:
Data -> Pre-process -> Extract features -> Model -> Results

Deep learning process:
Data -> Pre-process -> Model -> Results

SLIDE 8

Neural network downsides

  • Need to define the model and its training parameters
  • Large models can take days or weeks to train
  • May need a lot of data (often >10K examples)
SLIDE 9

How neural networks work

SLIDE 10

NOT YOUR NEURAL NETWORK

Deep learning != your brain

SLIDE 11

Neuron model

output = F(x0·w0 + x1·w1 + … + xN·wN + b)

Inputs x0 … xN are multiplied by weights w0 … wN and summed together with a bias b (a weight for a fixed input of 1); an activation function F is applied to the sum.
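The diagrammed neuron can be sketched in a few lines of plain Java (an illustrative sketch, not DL4J; the class and method names here are invented):

```java
// Sketch of the neuron model: output = F(sum of input*weight terms + bias).
public class Neuron {
    static double weightedSum(double[] inputs, double[] weights, double bias) {
        double sum = bias; // bias acts as a weight for a fixed input of 1
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }

    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    static double relu(double z) { return Math.max(0.0, z); }

    public static void main(String[] args) {
        double z = -1.65; // the sum from the slides' worked example
        System.out.println("identity: " + z);            // -1.65
        System.out.println("sigmoid:  " + sigmoid(z));   // ~0.1611
        System.out.println("tanh:     " + Math.tanh(z)); // ~-0.9289
        System.out.println("relu:     " + relu(z));      // 0.0
    }
}
```

The same -1.65 sum gives very different outputs depending on the activation function, which is the point of the next few slides.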

SLIDE 12

Neuron model (diagram): a worked example in which the weighted sum of the inputs comes to -1.65.
SLIDE 13

Neuron model with the identity activation (diagram): F(-1.65) = -1.65.

SLIDE 14

Neuron model with the sigmoid activation (diagram): F(-1.65) = 0.1611.

SLIDE 15

Neuron model with the tanh activation (diagram): F(-1.65) = -0.9289.

SLIDE 16

Neuron model with the ReLU activation (diagram): F(-1.65) = 0.

SLIDE 17

Neural networks are not graphs

SLIDE 18

Neural networks are like onions

(they have layers and can make you cry)

Input layer Output layer Hidden layer

SLIDE 19

Why layers?

(Diagram: Layer 1 and Layer 2 decision boundaries over data points marked x.)

SLIDE 20

Neural networks are like onions

(they have layers and can make you cry)

Input layer Output layer Hidden layer

Output = f(W2 · f(W1 · Input + B1) + B2)

W1 is a 4×3 weight matrix (W11 … W43); W2 is a 2×4 weight matrix (W11 … W24).
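The formula above can be written out with small arrays; this is a hand-rolled sketch (not DL4J), with arbitrary example weights:

```java
// Output = f(W2 . f(W1 . Input + B1) + B2), with f = sigmoid, in plain Java.
public class TwoLayerForward {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Computes f(W x + b): W has rows x cols entries, x has cols entries,
    // and the result has one entry per row of W.
    static double[] layer(double[][] w, double[] x, double[] b) {
        double[] out = new double[w.length];
        for (int i = 0; i < w.length; i++) {
            double sum = b[i];
            for (int j = 0; j < x.length; j++) sum += w[i][j] * x[j];
            out[i] = sigmoid(sum);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] input = {1.0, 0.5, -0.5};       // 3 inputs
        double[][] w1 = {{0.1, 0.2, 0.3},        // 4x3: hidden-layer weights
                         {0.0, -0.1, 0.2},
                         {0.3, 0.1, 0.0},
                         {-0.2, 0.2, 0.1}};
        double[] b1 = {0.0, 0.1, 0.0, -0.1};
        double[][] w2 = {{0.2, -0.1, 0.3, 0.1},  // 2x4: output-layer weights
                         {-0.3, 0.2, 0.1, 0.0}};
        double[] b2 = {0.05, -0.05};

        double[] hidden = layer(w1, input, b1);  // f(W1 . Input + B1)
        double[] output = layer(w2, hidden, b2); // f(W2 . hidden + B2)
        System.out.println(output.length + " outputs"); // 2 outputs
    }
}
```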

SLIDE 21

Going deeper

Input layer Output layer Hidden layer Hidden layer

SLIDE 22

What do the layers do?

Successive layers model higher level features

SLIDE 23

What input can a network accept?

  • Anything you like as long as it’s a tensor
  • Tensor = general multi-dimensional numeric quantity
  • scalar = tensor of 0 dimensions (AKA rank 0)
  • vector = 1 dimensional tensor (rank 1)
  • matrix = 2 dimensional tensor (rank 2)
  • tensor = N dimensional tensor (rank > 2)
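In Java terms the ranks map naturally onto nested arrays (shapes only; the values are arbitrary):

```java
// Tensor ranks as nested Java arrays.
public class TensorRanks {
    public static void main(String[] args) {
        double scalar = 3.0;                        // rank 0
        double[] vector = {1.0, 2.0, 3.0};          // rank 1, shape [3]
        double[][] matrix = {{1, 2}, {3, 4}};       // rank 2, shape [2, 2]
        double[][][] image = new double[22][75][3]; // rank 3, e.g. height x width x channels

        System.out.println(vector.length);                                        // 3
        System.out.println(matrix.length * matrix[0].length);                     // 4
        System.out.println(image.length * image[0].length * image[0][0].length);  // 4950
    }
}
```

The rank-3 example uses the 75×22×3 image shape from the logo-detection slides later in the deck.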
SLIDE 24

Images

Can represent image as tensor of rank 3

Source: https://www.slideshare.net/BertonEarnshaw/a-brief-survey-of-tensors

SLIDE 25

One-hot encoding: input

(Table: each person's favourite programming language - Java, Clojure, Python or JavaScript - is encoded as a vector with a single 1 in that language's column and 0s elsewhere; one row each for Barry, Bruce and Russel. One-hot vectors act like “enums”.)
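A one-hot encoder for the example above can be sketched in a few lines of Java (the enum and method names are invented for illustration):

```java
// One-hot encoding: category i of N categories -> vector of N zeros with a 1 at i.
public class OneHot {
    enum Language { JAVA, CLOJURE, PYTHON, JAVASCRIPT }

    static double[] encode(Language favourite) {
        double[] v = new double[Language.values().length];
        v[favourite.ordinal()] = 1.0; // single 1 in the chosen category's position
        return v;
    }

    public static void main(String[] args) {
        double[] v = encode(Language.CLOJURE);
        System.out.println(java.util.Arrays.toString(v)); // [0.0, 1.0, 0.0, 0.0]
    }
}
```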

SLIDE 26

One-hot encoding: output

            JAVA    CLOJURE   PYTHON   JAVASCRIPT
BARRY       0.6     0.1       0.1      0.2
BRUCE       0.15    0.75      0.05     0.05
RUSSEL      0.34    0.05      0.6      0.01

One-hot encoding is also useful for output: the network produces a probability distribution over the classes.

SLIDE 27

Back propagation

(Diagram: an input example from a training example is fed forward through the network's weight matrices to produce an output; the cost function compares this output with the expected output to give the error, also known as the cost or loss, which is used to adjust the weights.)

SLIDE 28

More on back propagation

SLIDE 29

Frameworks

SLIDE 30

Summary so far

  • Neural networks are NOT like your brain
  • Networks are arranged as layers
  • Forward pass computes the output of the network
  • Backward pass computes gradients & adjusts the weights
  • Frameworks take care of the math for you
  • but still good to understand what’s going on
SLIDE 31

A request from marketing

SLIDE 32

Images that mention VMware

SLIDE 33

First we need a dataset

SLIDE 34

Highlight the parts for training

SLIDE 35

Creating the dataset

  • Grab images from Google image search
  • PyImageSearch: “How to create a deep learning dataset using Google Images”
  • Use the dlib imglab tool to draw bounding boxes around logos / not-logos
  • https://github.com/davisking/dlib/tree/master/tools/imglab
  • Wrote a Python script to read the imglab XML and produce cropped images using OpenCV

SLIDE 36

Sliding windows

SLIDE 37

Multiple scales
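The sliding-window / multiple-scale search amounts to two nested loops over window positions, repeated on shrunken copies of the image; a sketch (the `classifyWindow` call mentioned in the comment is a hypothetical stand-in for the trained network):

```java
// Sliding windows at multiple scales: slide a fixed-size window over the image,
// then shrink the image and repeat, so larger logos fit the window at some scale.
public class SlidingWindows {
    static int countWindows(int width, int height, int winW, int winH, int stride) {
        int count = 0;
        for (int y = 0; y + winH <= height; y += stride) {
            for (int x = 0; x + winW <= width; x += stride) {
                count++; // here you would call a hypothetical classifyWindow(img, x, y)
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // e.g. a 300x100 image searched with a 75x22 window and stride 25,
        // then searched again at half scale (150x50)
        System.out.println(countWindows(300, 100, 75, 22, 25)); // 40 windows
        System.out.println(countWindows(150, 50, 75, 22, 25));  // 8 windows
    }
}
```

The window count (and hence inference cost) grows quickly with image size and the number of scales, which motivates the more efficient detectors mentioned later.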

SLIDE 38

How it all adds up

  • 5501 images total
  • 883 VMware
  • 4318 not VMware
  • Scaled to 75×22×3 -> 4950 inputs
  • Easily 4,950,000 weights in the first layer alone (4950 inputs × 1000 hidden units, say)
  • Maybe we need another neural network architecture
SLIDE 39

Convolutional Neural Networks

SLIDE 40

Convolution

SLIDE 41

Convolution example(s)

(Diagram: example 3×3 convolution kernels, including an edge-detection kernel with 8 at the centre surrounded by -1s.)
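A minimal valid-mode 2D convolution (strictly, cross-correlation, as most deep learning frameworks implement it) in plain Java, using an edge-detection kernel like the one on the slide:

```java
// Valid-mode 2D convolution: slide the kernel over the image and output the
// weighted sums. With the -1/8/-1 edge kernel, flat regions give 0.
public class Convolution {
    static double[][] convolve(double[][] img, double[][] k) {
        int outH = img.length - k.length + 1;
        int outW = img[0].length - k[0].length + 1;
        double[][] out = new double[outH][outW];
        for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++)
                for (int ky = 0; ky < k.length; ky++)
                    for (int kx = 0; kx < k[0].length; kx++)
                        out[y][x] += img[y + ky][x + kx] * k[ky][kx];
        return out;
    }

    public static void main(String[] args) {
        double[][] edge = {{-1, -1, -1}, {-1, 8, -1}, {-1, -1, -1}};
        double[][] flat = {{5, 5, 5}, {5, 5, 5}, {5, 5, 5}};
        // A flat patch has no edges, so the edge kernel responds with 0.
        System.out.println(convolve(flat, edge)[0][0]); // 0.0
    }
}
```

In a convolutional layer the kernel values are not hand-picked like this; they are learnt during training.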
SLIDE 42

Convolutional layer

SLIDE 43

Max Pooling layer
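A 2×2 max-pooling layer can be sketched the same way (plain Java; even input dimensions assumed for brevity):

```java
// 2x2 max pooling with stride 2: keep the largest value in each 2x2 block,
// halving each spatial dimension.
public class MaxPool {
    static double[][] pool2x2(double[][] in) {
        double[][] out = new double[in.length / 2][in[0].length / 2];
        for (int y = 0; y < out.length; y++)
            for (int x = 0; x < out[0].length; x++)
                out[y][x] = Math.max(
                    Math.max(in[2 * y][2 * x], in[2 * y][2 * x + 1]),
                    Math.max(in[2 * y + 1][2 * x], in[2 * y + 1][2 * x + 1]));
        return out;
    }

    public static void main(String[] args) {
        double[][] in = {{1, 2, 5, 6},
                         {3, 4, 7, 8},
                         {0, 0, 0, 9},
                         {0, 0, 1, 0}};
        double[][] out = pool2x2(in); // 4x4 input becomes 2x2 output
        System.out.println(out[0][0] + " " + out[0][1]); // 4.0 8.0
    }
}
```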

SLIDE 44

Convolutional network

SLIDE 45

DL4J model structure

Input Convolution Pooling Convolution Pooling Fully connected Softmax

SLIDE 46

Define the model

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .cacheMode(CacheMode.DEVICE)
    .updater(Updater.ADAM)
    .iterations(iterations)
    .gradientNormalization(GradientNormalization.RenormalizeL2PerLayer) // normalize to prevent vanishing or exploding gradients
    .optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
    .l1(1e-4)
    .regularization(true)
    .l2(5 * 1e-4)
    .list()
    .layer(0, new ConvolutionLayer.Builder(new int[]{4, 4}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn1").convolutionMode(ConvolutionMode.Same)
        .nIn(3).nOut(64).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU) //.learningRateDecayPolicy(LearningRatePolicy.Step)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(1, new ConvolutionLayer.Builder(new int[]{4, 4}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn2").convolutionMode(ConvolutionMode.Same)
        .nOut(64).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(2, new SubsamplingLayer.Builder(PoolingType.MAX, new int[]{2, 2}).name("maxpool2").build())
    .layer(3, new ConvolutionLayer.Builder(new int[]{4, 4}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn3").convolutionMode(ConvolutionMode.Same)
        .nOut(96).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(4, new ConvolutionLayer.Builder(new int[]{4, 4}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn4").convolutionMode(ConvolutionMode.Same)
        .nOut(96).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(5, new ConvolutionLayer.Builder(new int[]{3, 3}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn5").convolutionMode(ConvolutionMode.Same)
        .nOut(128).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(6, new ConvolutionLayer.Builder(new int[]{3, 3}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn6").convolutionMode(ConvolutionMode.Same)
        .nOut(128).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(7, new ConvolutionLayer.Builder(new int[]{2, 2}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn7").convolutionMode(ConvolutionMode.Same)
        .nOut(256).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(8, new ConvolutionLayer.Builder(new int[]{2, 2}, new int[]{1, 1}, new int[]{0, 0})
        .name("cnn8").convolutionMode(ConvolutionMode.Same)
        .nOut(256).weightInit(WeightInit.XAVIER_UNIFORM).activation(Activation.RELU)
        .learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(9, new SubsamplingLayer.Builder(PoolingType.MAX, new int[]{2, 2}).name("maxpool8").build())
    .layer(10, new DenseLayer.Builder().name("ffn1").nOut(1024).learningRate(1e-3).biasInit(1e-3).biasLearningRate(1e-3 * 2).build())
    .layer(11, new DropoutLayer.Builder().name("dropout1").dropOut(0.2).build())
    .layer(12, new DenseLayer.Builder().name("ffn2").nOut(1024).learningRate(1e-2).biasInit(1e-2).biasLearningRate(1e-2 * 2).build())
    .layer(13, new DropoutLayer.Builder().name("dropout2").dropOut(0.2).build())
    .layer(14, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .name("output")
        .nOut(numLabels)
        .activation(Activation.SOFTMAX)
        .build())
    .backprop(true)
    .pretrain(false)
    .setInputType(InputType.convolutional(height, width, channels))
    .build();
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();

SLIDE 47
SLIDE 48

Results

  • 10 epochs, 16 minutes to train on NVIDIA GTX 1080
  • Inference time: ~20ms
  • Precision 0.9661
  • Recall 0.8829
  • (both macro-averaged over the two classes)

Confusion matrix (rows: actual, columns: predicted):

                   VMWARE    NOT-VMWARE
  VMWARE             140         42
  NOT-VMWARE           3        855

SLIDE 49

Results

VMware not-VMware

SLIDE 50

More efficient object detection

  • You Only Look Once (YOLO)
  • Single Shot Multibox Detector (SSD)
  • Faster R-CNN

“Building a Production Grade Object Detection System with SKIL and YOLO” https://blog.skymind.ai/building-a-production-grade-object-detection-system-with-skil-and-yolo/

SLIDE 51

Summary so far

Convolutional networks

  • Used mostly for image processing
  • A convolution layer applies learnt filters to its input
  • A pooling layer reduces the size of its input
  • Fewer weights (parameters) to train compared to fully connected networks
SLIDE 52

But, are they saying nice things about us?

SLIDE 53

Variable length input

“VMware workstation rocks!”

“I’ve used vSphere for a number of years …”

SLIDE 54

Feed forward networks don’t remember state

(Diagram: the words “VMware”, “workstation”, “rocks!” fed to a feed-forward network one at a time, with no state kept between them.)

SLIDE 55

Recurrent neural networks

SLIDE 56

Recurrent network

(Diagram: the words “VMware”, “workstation”, “rocks!” fed to the recurrent network one at a time, with state carried between steps.)

SLIDE 57

Vanishing/exploding gradients

(Chart: over ~19 steps, a value repeatedly multiplied by 1.1 grows exponentially, while one repeatedly multiplied by 0.9 shrinks towards zero.)
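The chart's point is easy to reproduce: a gradient scaled by a factor slightly above or below 1 at every time step compounds exponentially (illustrative numbers only):

```java
// Repeatedly scaling a gradient by 1.1 explodes it; by 0.9 it vanishes.
public class GradientScale {
    static double afterSteps(double factor, int steps) {
        double g = 1.0;
        for (int i = 0; i < steps; i++) g *= factor;
        return g;
    }

    public static void main(String[] args) {
        System.out.println(afterSteps(1.1, 19)); // ~6.1: exploding
        System.out.println(afterSteps(0.9, 19)); // ~0.135: vanishing
    }
}
```

Back-propagation through time multiplies gradients across every time step, so even modest per-step scaling compounds quickly; LSTM cells were designed to counter this.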

SLIDE 58

Long Short-Term Memory (LSTM) cell

SLIDE 59

Stanford large movie review dataset

Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, … This is the biggest Flop of 2008. I don know what Director has is his mind of creating such a big disaster. The songs have been added without situations, the story have been stretched to fill the 3 hrs gap …

http://ai.stanford.edu/~amaas/data/sentiment/

SLIDE 60

Words -> vectors

“cat” as a one-hot vector: a single 1 among ~3 million values.
“cat” as a word2vec embedding: ~300 dense values (e.g. 0.67, 5.4, 0.2, 0.46, 15, 69, 23, …)

SLIDE 61

DL4J model

LSTM

SLIDE 62

The code

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .updater(Updater.ADAM) // To configure: .updater(Adam.builder().beta1(0.9).beta2(0.999).build())
    .regularization(true).l2(1e-5)
    .weightInit(WeightInit.XAVIER)
    .gradientNormalization(GradientNormalization.ClipElementWiseAbsoluteValue)
    .gradientNormalizationThreshold(1.0)
    .learningRate(2e-2)
    .trainingWorkspaceMode(WorkspaceMode.SEPARATE)
    .inferenceWorkspaceMode(WorkspaceMode.SEPARATE)
    .list()
    .layer(0, new GravesLSTM.Builder().nIn(vectorSize).nOut(256)
        .activation(Activation.TANH).build())
    .layer(1, new RnnOutputLayer.Builder().activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT).nIn(256).nOut(2).build())
    .pretrain(false).backprop(true).build();
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();

SLIDE 63
SLIDE 64

Results

0.976

VMware Horizon rocks!

0.976

I don't like it at all. It does not work the way I think it should

0.0317

It crashes. It's buggy. Don't waste your time on this

0.976

This software is market leading and will change the world

SLIDE 65

RNN Summary

  • Use recurrent neural networks (RNNs) for time series or sequential data
  • RNNs can consume & generate sequences
  • RNNs back-propagate through time as well as through layers
  • effectively very deep networks, so training time increases
  • Use LSTM (or GRU) layers
SLIDE 66

Summary

  • Each layer learns features composed of features from previous layer
  • Convolutional neural networks (CNN) well suited for images
  • Recurrent Neural Networks (RNN) used for time series data
  • Can have networks combining both convolutional & recurrent layers
SLIDE 67

Further info #1

SLIDE 68

Further info #2

  • Andrew Ng Coursera : https://www.coursera.org/specializations/deep-learning

  • Udacity : https://www.udacity.com/course/deep-learning--ud730
  • CS231n Winter 2016 lecture videos
  • Andrej Karpathy’s blog : http://karpathy.github.io/
  • Andrew Trask’s blog : https://iamtrask.github.io/
SLIDE 69

Backup / bonus slides

SLIDE 70

Back propagation

(Diagram: an input example from a training example is fed forward through the network's weight matrices to produce an output; the cost function compares this output with the expected output to give the error, also known as the cost or loss, which is used to adjust the weights.)

SLIDE 71

Gradient descent

Update delta = -gradient × error × learning rate

(Chart: cost plotted against a model parameter (weight).)

SLIDE 72

The Chain Rule

(Diagram: inputs x0, x1, x2 with weights w0, w1, w2 feeding a single neuron's output.)

We want the error gradient with respect to each weight:

∂error/∂weight = (∂error/∂activation) × (∂activation/∂weight)
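The chain rule is easy to sanity-check numerically: for a single sigmoid neuron with a squared-error cost, the analytic gradient should match a finite-difference estimate (plain Java; the input, weight and target values are made up for illustration):

```java
// Chain rule check: dError/dWeight = dError/dActivation * dActivation/dWeight,
// verified against a finite-difference estimate.
public class ChainRuleCheck {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // Squared error for one input x, weight w and target t.
    static double error(double w, double x, double t) {
        double s = sigmoid(w * x);
        return 0.5 * (t - s) * (t - s);
    }

    // Analytic gradient via the chain rule:
    // dE/ds = -(t - s), ds/dz = s(1 - s), dz/dw = x.
    static double gradient(double w, double x, double t) {
        double s = sigmoid(w * x);
        return -(t - s) * s * (1 - s) * x;
    }

    public static void main(String[] args) {
        double w = 0.7, x = 1.3, t = 1.0, eps = 1e-6;
        double numeric = (error(w + eps, x, t) - error(w - eps, x, t)) / (2 * eps);
        System.out.println(Math.abs(gradient(w, x, t) - numeric) < 1e-8); // true
    }
}
```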

SLIDE 73

Time for some java

SLIDE 74

Backpropagation

SLIDE 75

Sigmoid & derivative

private INDArray sigmoid(INDArray input) {
    return Nd4j.ones(input.shape()).div(exp(input.neg()).add(1));
}

private INDArray sigmoidDerivative(INDArray input) {
    // sigmoid derivative in terms of the sigmoid output s: s * (1 - s)
    return input.mul(Nd4j.ones(input.shape()).sub(input));
}

SLIDE 76

Inputs & Weights

final double[][] inputsArray = {
    {0, 0, 1},
    {0, 1, 1},
    {1, 0, 1},
    {1, 1, 1}
};

SLIDE 77

Forward pass

for (int i = 0; i < numIterations; ++i) {
    // forward pass
    INDArray layer1 = sigmoid(x.mmul(weights1));
    INDArray layer2 = sigmoid(layer1.mmul(weights2));

SLIDE 78

Backward pass

// backward pass
INDArray layer2Error = y.sub(layer2); // error against the expected outputs y
INDArray delta2 = layer2Error.mul(sigmoidDerivative(layer2));
INDArray layer1Error = delta2.mmul(weights2.transpose());

SLIDE 79

Backward pass

weights2 = weights2.add( layer1.transpose() // chain rule .mmul(delta2) // error value .mul(learningRate)); // update scale factor