SLIDE 1

Practical Deep Learning

micha.codes / fastforwardlabs.com

SLIDE 2

deep learning can seem mysterious

SLIDE 3

let's find a way to just build a function

SLIDE 4

Feed Forward Layer

# X.shape == (512,)
# output.shape == (4,)
# weights.shape == (512, 4)   # == 2048 parameters
# biases.shape == (4,)
def feed_forward(activation, X, weights, biases):
    return activation(X @ weights + biases)

IE: f(X) = σ(X × W + b)

SLIDE 5

What's so special?

SLIDE 6

Composable

# Just like a Logistic Regression
result = feed_forward(
    softmax,
    X,
    outer_weights,
    outer_biases
)

SLIDE 7

Composable

# Just like a Logistic Regression with learned features?
result = feed_forward(
    softmax,
    feed_forward(
        tanh,
        X,
        inner_weights,
        inner_biases
    ),
    outer_weights,
    outer_biases
)

SLIDE 8

nonlinear
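The nonlinearity is the whole point of stacking: without it, composed layers collapse into a single linear map. A minimal numpy sketch of why (my illustration, not from the slides):

import numpy as np

# Two stacked linear layers with no activation in between...
W1 = np.random.random((512, 128))
W2 = np.random.random((128, 4))
X = np.random.random(512)

# ...compute exactly the same function as one linear layer W1 @ W2:
assert np.allclose((X @ W1) @ W2, X @ (W1 @ W2))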

SLIDE 9

UNIVERSAL APPROXIMATION THEOREM

neural networks can approximate arbitrary functions
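A hedged illustration of the theorem in miniature, using the keras 1.x API the deck uses elsewhere; the target function, layer width, and epoch count are my choices, not the talk's:

import numpy as np
from keras.models import Sequential
from keras.layers.core import Dense

# Fit sin(x) on [-3, 3] with a single hidden layer of tanh units.
X = np.linspace(-3, 3, 1000).reshape(-1, 1)
y = np.sin(X)

model = Sequential()
model.add(Dense(64, activation='tanh', input_shape=[1]))
model.add(Dense(1))
model.compile('sgd', 'mean_squared_error')
model.fit(X, y, nb_epoch=50)   # more units / epochs -> closer approximation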

SLIDE 10

differentiable ➡ SGD

Iteratively learn the values for the weights and biases given training data
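A minimal sketch of that loop in numpy (the toy data, squared-error loss, and learning rate are my assumptions, not the talk's):

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(256, 512)
y = X @ rng.randn(512, 4)          # toy targets from a random linear map

W = np.zeros((512, 4))
b = np.zeros(4)
learning_rate = 0.001
for step in range(100):
    pred = X @ W + b               # forward pass
    grad = (pred - y) / len(X)     # gradient of 0.5 * mean squared error
    W -= learning_rate * X.T @ grad        # step the weights downhill
    b -= learning_rate * grad.sum(axis=0)  # step the biases downhill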

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

Convolutional Layer

import numpy as np
from scipy.signal import convolve

# X.shape == (800, 600, 3)
# filters.shape == (16, 8, 8, 3)   # 16 filters of 8x8 over 3 channels
# biases.shape == (16,)
# output.shape == (793, 593, 16)   # 'valid' convolution shrinks the edges
def convnet(activation, X, filters, biases):
    return activation(
        np.stack([convolve(X, f, mode='valid')[..., 0] for f in filters],
                 axis=-1)
        + biases
    )

IE: f(X) = σ(X ∗ f + b)

SLIDE 15

SLIDE 16

Recurrent Layer

import numpy as np

# X_sequence.shape == (None, 512)
# output.shape == (None, 4)
# W.shape == (512, 4)
# U.shape == (4, 4)
# biases.shape == (4,)
def RNN(activation, X_sequence, W, U, biases):
    output = np.zeros(4)   # state starts at zero instead of None
    for X in X_sequence:
        output = activation(X @ W + output @ U + biases)
        yield output

IE: f(Xₜ) = σ(Xₜ × W + f(Xₜ₋₁) × U + b)

SLIDE 17

GRU Layer

import numpy as np

def GRU(activation_in, activation_out, X_sequence, W, U, biases):
    output = np.zeros(U[0].shape[0])   # state starts at zero instead of None
    for X in X_sequence:
        z = activation_in(W[0] @ X + U[0] @ output + biases[0])
        r = activation_in(W[1] @ X + U[1] @ output + biases[1])
        # the reset gate r scales the previous state elementwise
        o = activation_out(W[2] @ X + U[2] @ (r * output) + biases[2])
        output = z * output + (1 - z) * o
        yield output

SLIDE 18

What about theano/tensorflow/mxnet?

SLIDE 19

what happens here?

import numpy as np

a = np.random.random(100) - 0.5
a[a < 0] = 10

SLIDE 20

[Diagram: an abstract syntax tree for a while loop — a condition (compare, op: ≠, variable b against constant 0), a branch, a statement-sequence body, and a return of variable a. Graph-based frameworks have to represent control flow like this explicitly instead of just executing it the way numpy does.]

SLIDE 21

library     widely used  auto-diff  gpu/cpu  mobile  frontend  models  multi-gpu  speed
numpy       ✔            ✖          ✖        ✖       ✖         ✖       ✖          slow
theano      ✔            ✔          ✔        ✖       ✖         ✖       ✖          fast
mx-net      ✖            ✔          ✔        ✔       ✔         ✔       ✔          fast
tensorflow  ✔            ✔          ✔        ✖       ✔         ✔       ➖         slow

SLIDE 22

Which should I use?

SLIDE 23

keras makes Deep Learning simple

(http://keras.io/)

SLIDE 24

$ cat ~/.keras/keras.json
{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}

or

$ cat ~/.keras/keras.json
{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

SLIDE 25

(coming soon... hopefully)

$ cat ~/.keras/keras.json
{
    "image_dim_ordering": "mx",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "mxnet"
}

SLIDE 26

simple!

from keras.models import Sequential
from keras.layers.core import Dense

# Same as our Logistic Regression above with:
#   weights_outer.shape == (512, 4)
#   biases_outer.shape == (4,)
model_lr = Sequential()
model_lr.add(Dense(4, activation='softmax', input_shape=[512]))
model_lr.compile('sgd', 'categorical_crossentropy')
model_lr.fit(X, y)

SLIDE 27

extendible!

from keras.models import Sequential
from keras.layers.core import Dense

# Same as our "deep" Logistic Regression
model = Sequential()
model.add(Dense(128, activation='tanh', input_shape=[512]))
model.add(Dense(4, activation='softmax'))
model.compile('sgd', 'categorical_crossentropy')
model.fit(X, y)

SLIDE 28

model_lr.summary()
# __________________________________________________________________________
# Layer (type)      Output Shape      Param #     Connected to
# ==========================================================================
# dense_1 (Dense)   (None, 4)         2052        dense_input_1[0][0]
# ==========================================================================
# Total params: 2,052
# Trainable params: 2,052
# Non-trainable params: 0
# __________________________________________________________________________

model.summary()
# ___________________________________________________________________
# Layer (type)      Output Shape      Param #     Connected to
# ===================================================================
# dense_2 (Dense)   (None, 128)       65664       dense_input_2[0][0]
# ___________________________________________________________________
# dense_3 (Dense)   (None, 4)         516         dense_2[0][0]
# ===================================================================
# Total params: 66,180
# Trainable params: 66,180
# Non-trainable params: 0
# ___________________________________________________________________

SLIDE 29

let's build something

SLIDE 30

SLIDE 31

SLIDE 32

SLIDE 33

fastforwardlabs.com/luhn/

SLIDE 34

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

SLIDE 40

SLIDE 41

SLIDE 42

SLIDE 43

SLIDE 44

def skipgram(words):
    for i in range(1, len(words) - 1):
        yield words[i], (words[i-1], words[i+1])
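For example, on a four-word sentence (my example, not the slide's):

>>> list(skipgram("the quick brown fox".split()))
[('quick', ('the', 'brown')), ('brown', ('quick', 'fox'))]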

SLIDE 45

SLIDE 46

SLIDE 47

from keras.models import Model
from keras.layers import (Input, Embedding, merge,
                          Lambda, Activation)

vector_size = 300

word_index = Input(shape=(1,))
word_point = Input(shape=(1,))

syn0 = Embedding(len(vocab), vector_size)(word_index)
syn1 = Embedding(len(vocab), vector_size)(word_point)

merged = merge([syn0, syn1], mode='mul')
merged_sum = Lambda(lambda x: x.sum(axis=-1))(merged)
context = Activation('sigmoid')(merged_sum)

model = Model(input=[word_index, word_point], output=context)
model.compile(loss='binary_crossentropy', optimizer='adam')
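A hedged usage sketch: the model trains on (word index, context index) pairs labeled 1 for true skipgram pairs and 0 for randomly sampled negatives. `word_ids`, `context_ids`, and `labels` are hypothetical arrays built from pairs like the skipgram generator above produces.

# word_ids, context_ids: int arrays of shape (n_pairs, 1); labels: 0/1
model.fit([word_ids, context_ids], labels, nb_epoch=5)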

SLIDE 48

Feed Forward vs Recurrent Network

SLIDE 49

SLIDE 50
RNN Summarization Sketch for Articles

1. Find article summaries that are heavy on quotes (http://thebrowser.com/)
2. Score every sentence in the articles based on its "closeness" to a quote
3. Use skip-thoughts to encode every sentence in the article
4. Train an RNN to predict these scores given the sentence vector
5. Evaluate the trained model on new things!

SLIDE 51

keras makes RNNs simple

(http://keras.io/)

SLIDE 52

Example: preprocess

from skipthoughts import skipthoughts
from .utils import load_data

(articles, scores), (articles_test, scores_test) = load_data()
articles_vectors = skipthoughts.encode(articles)
articles_vectors_test = skipthoughts.encode(articles_test)

SLIDE 53

Example: model def and training

from keras.models import Sequential
from keras.layers.recurrent import LSTM
from keras.layers.core import Dense
from keras.layers.wrappers import TimeDistributed

model = Sequential()   # Sequential, since layers are stacked with .add()
model.add(LSTM(512, input_shape=(None, 4800),
               return_sequences=True,   # one output per sentence
               dropout_W=0.3, dropout_U=0.3))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_absolute_error', optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(articles_vectors, scores, validation_split=0.10)

loss, acc = model.evaluate(articles_vectors_test, scores_test)
print('Test loss / test accuracy = {:.4f} / {:.4f}'.format(loss, acc))
model.save("models/new_model.h5")

SLIDE 54

[Diagram: an article is split into six sentences; the skip-thought preprocess turns each into a vector, so the keras model sees input of shape (6, 4800); the LSTM maps that to (6, 512) and the TimeDistributed Dense layer to (6, 1) — one score per sentence.]

SLIDE 55

Example: model evaluation

from keras.models import load_model
from skipthoughts import skipthoughts
from flask import Flask, request
import nltk

app = Flask(__name__)
model = load_model("models/new_model.h5")

@app.route('/api/evaluate', methods=['POST'])
def evaluate():
    article = request.data
    sentences = nltk.sent_tokenize(article)
    sentence_vectors = skipthoughts.encode(sentences)
    return str(model.predict(sentence_vectors))
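Once the app is running, the endpoint can be exercised with a POST (assuming flask's default port; `article.txt` is a hypothetical file):

$ curl -X POST --data-binary @article.txt http://localhost:5000/api/evaluate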

SLIDE 56

SLIDE 57

SLIDE 58

Thoughts on doing this method

Scoring function used is SUPER important
Hope you have a GPU
Hyper-parameters for all!
Structure of model can change where it's applicable
SGD means random initialization... may need to fit multiple times

SLIDE 59

SLIDE 60

REGULARIZE!

dropout: only parts of the NN participate in every round
l1/l2: add penalty term for large weights
batchnormalization: unit mean/std for each batch of data
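A minimal keras 1.x sketch wiring all three into the earlier model; the penalty strength and dropout fraction are my choices, not the talk's:

from keras.models import Sequential
from keras.layers.core import Dense, Dropout
from keras.layers.normalization import BatchNormalization
from keras.regularizers import l2

model = Sequential()
model.add(Dense(128, activation='tanh', input_shape=[512],
                W_regularizer=l2(0.01)))   # l2: penalize large weights
model.add(BatchNormalization())            # unit mean/std per batch
model.add(Dropout(0.5))                    # drop half the units each round
model.add(Dense(4, activation='softmax'))
model.compile('sgd', 'categorical_crossentropy')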

SLIDE 61

VALIDATE AND TEST!

lots of parameters == potential for overfitting

SLIDE 62

deploy?

SLIDE 63

SLIDE 64

[Diagram: internet → front-end API → load balancer → a pool of GPU-backed backend-API nodes (p2.xlarge and p2.8xlarge instances), all pulling from a shared model store.]

SLIDE 65

careful: data drift

set monitoring on the distribution of results
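A minimal sketch of what that monitoring might look like, assuming a baseline mean/std captured on the test set at deploy time (the statistics and threshold here are made up):

import logging
import numpy as np

BASELINE_MEAN, BASELINE_STD = 0.31, 0.12   # captured at deploy time

def check_drift(recent_predictions, tolerance=3.0):
    # Alert when the live prediction distribution wanders too far
    # from the distribution the model was validated against.
    mean = np.mean(recent_predictions)
    if abs(mean - BASELINE_MEAN) / BASELINE_STD > tolerance:
        logging.warning("prediction drift: mean=%.3f", mean)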

SLIDE 66

SLIDE 67

SLIDE 68

SLIDE 69

Emerging Applications

Text Generation
Audio Generation
Event Detection
Intent Identification
Decision Making

SLIDE 70