SLIDE 1

RNNs for Timeseries Analysis

Bruno Gonçalves


www.data4sci.com github.com/DataForScience/RNN

SLIDE 2

Disclaimer

The views and opinions expressed in this tutorial are those of the authors and do not necessarily reflect the official policy or position of my employer. The examples provided with this tutorial were chosen for their didactic value and are not meant to be representative of my day-to-day work.

SLIDE 3

References

SLIDE 4

How the Brain “Works” (Cartoon version)

SLIDE 5

How the Brain “Works” (Cartoon version)

  • Each neuron receives input from other neurons
  • 10¹¹ neurons, each with 10⁴ weights
  • Weights can be positive or negative
  • Weights adapt during the learning process
  • “neurons that fire together wire together” (Hebb)
  • Different areas perform different functions using the same structure (Modularity)

SLIDE 6

How the Brain “Works” (Cartoon version)

[Diagram] Inputs → f(Inputs) → Output

SLIDE 7

Optimization Problem

  • (Machine) Learning can be thought of as an optimization problem.
  • Optimization problems have 3 distinct pieces:
  • The constraints
  • The function to optimize
  • The optimization algorithm

For a neural network: the network defines the constraints, the prediction error is the function to optimize, and gradient descent is the optimization algorithm.

SLIDE 8

Artificial Neuron

[Diagram] Inputs x1…xN are multiplied by weights w1j…wNj and summed together with a bias w0j to give zj = wᵀx; the output is aj = φ(zj), where φ is the activation function.
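A minimal Python/NumPy sketch of this computation (the inputs, weights and bias below are made-up illustrative values):

    import numpy as np

    def neuron(x, w, w0, phi):
        """Artificial neuron: weighted sum of inputs plus bias, passed through activation phi."""
        z = np.dot(w, x) + w0      # z = w^T x + bias
        return phi(z)              # a = phi(z)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.0, 2.0])     # illustrative inputs x1..x3
    w = np.array([0.1, 0.4, -0.2])     # illustrative weights w1j..w3j
    print(neuron(x, w, w0=0.05, phi=sigmoid))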

SLIDE 9

Activation Function - Sigmoid

φ(z) = 1 / (1 + e⁻ᶻ)

  • Non-linear function
  • Differentiable
  • Non-decreasing
  • Computes new sets of features
  • Each layer builds up a more abstract representation of the data
  • Perhaps the most common activation function

http://github.com/bmtgoncalves/Neural-Networks

SLIDE 10

Activation Function - tanh

φ(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)

  • Non-linear function
  • Differentiable
  • Non-decreasing
  • Computes new sets of features
  • Each layer builds up a more abstract representation of the data

http://github.com/bmtgoncalves/Neural-Networks
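A small NumPy sketch of the two activation functions defined on this and the previous slide:

    import numpy as np

    def sigmoid(z):
        # phi(z) = 1 / (1 + e^-z): squashes any real input into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # phi(z) = (e^z - e^-z) / (e^z + e^-z): squashes any real input into (-1, 1)
        return np.tanh(z)

    z = np.linspace(-5, 5, 11)
    print(sigmoid(z))
    print(tanh(z))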

SLIDE 11

Forward Propagation

  • The output of a perceptron is determined by a sequence of steps:
  • obtain the inputs
  • multiply the inputs by the respective weights
  • calculate the output using the activation function
  • To create a multi-layer perceptron, you can simply use the output of one layer as the input to the next one (see the sketch below).


[Diagram] First layer: inputs x1…xN with weights w0j…wNj produce aj = φ(wᵀx); second layer: the activations a1…aN with weights w0k…wNk produce ak = φ(wᵀa).
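A minimal NumPy sketch of forward propagation through two layers (layer sizes and weights are made-up illustrative values); each layer's output is used as the next layer's input:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b):
        # One layer: multiply inputs by weights, add the bias, apply the activation
        return sigmoid(W @ x + b)

    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                            # N = 4 inputs
    W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)     # first layer: 4 -> 3
    W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)     # second layer: 3 -> 1

    a = layer(x, W1, b1)        # hidden activations a_j
    output = layer(a, W2, b2)   # output a_k uses a as its input
    print(output)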

SLIDE 12

Backward Propagation of Errors (BackProp)

  • BackProp operates in two phases:
  • Forward propagate the inputs and calculate the deltas
  • Update the weights
  • The error at the output is a weighted average difference between the predicted output and the observed one.
  • For inner layers there is no “real output”!
SLIDE 13

Loss Functions

  • For learning to occur, we must quantify how far off we are from the desired output. There are two common ways of doing this:
  • Quadratic error function: E = (1/N) Σn |yn − an|²
  • Cross entropy: J = −(1/N) Σn [ynᵀ log an + (1 − yn)ᵀ log(1 − an)]

The cross entropy is complementary to the sigmoid activation in the output layer and improves its stability.
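A NumPy sketch of both loss functions, with made-up illustrative targets y and outputs a:

    import numpy as np

    def quadratic_error(y, a):
        # E = (1/N) * sum_n |y_n - a_n|^2
        return np.mean(np.sum((y - a) ** 2, axis=-1))

    def cross_entropy(y, a, eps=1e-12):
        # J = -(1/N) * sum_n [ y_n log a_n + (1 - y_n) log(1 - a_n) ]
        a = np.clip(a, eps, 1 - eps)   # avoid log(0)
        return -np.mean(np.sum(y * np.log(a) + (1 - y) * np.log(1 - a), axis=-1))

    y = np.array([[1.0], [0.0], [1.0]])   # illustrative targets
    a = np.array([[0.9], [0.2], [0.7]])   # illustrative network outputs
    print(quadratic_error(y, a), cross_entropy(y, a))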

SLIDE 14

Gradient Descent

  • Find the gradient of the loss H for each training batch
  • Take a step downhill along the direction of the gradient:

    θmn ← θmn − α ∂H/∂θmn

  • where α is the step size.
  • Repeat until “convergence”.
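A sketch of the update rule in NumPy, using a made-up quadratic loss H = |θ|² purely for illustration:

    import numpy as np

    alpha = 0.1                           # step size
    theta = np.array([2.0, -3.0])         # parameters theta_mn

    for step in range(100):
        grad_H = 2 * theta                # gradient of the toy loss H = |theta|^2
        theta = theta - alpha * grad_H    # step downhill along the gradient
    print(theta)                          # approaches the minimum at 0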

SLIDE 15

SLIDE 16

Feed Forward Networks

[Diagram] Input xt → Output ht = f(xt)

SLIDE 17

Feed Forward Networks

[Diagram] Information flows in one direction, from the input xt to the output ht = f(xt).

SLIDE 18

Recurrent Neural Network (RNN)

[Diagram] The previous output ht−1 is fed back in as an extra input, so the output becomes ht = f(xt, ht−1) instead of ht = f(xt).

SLIDE 19

Recurrent Neural Network (RNN)

[Diagram] RNN unrolled in time: each step combines xt with the previous output ht−1 to produce ht.

  • Each output depends (implicitly) on all previous outputs.
  • Input sequences generate output sequences (seq2seq)
SLIDE 20

Recurrent Neural Network (RNN)

[Diagram] The RNN cell concatenates both inputs, ht−1 and xt, and applies a tanh: ht = tanh(W ht−1 + U xt).
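A minimal NumPy sketch of this recurrence applied over a short made-up sequence (weight shapes are illustrative):

    import numpy as np

    def rnn_step(x_t, h_prev, W, U):
        # h_t = tanh(W h_{t-1} + U x_t)
        return np.tanh(W @ h_prev + U @ x_t)

    rng = np.random.default_rng(0)
    hidden, features = 3, 2
    W = rng.normal(scale=0.5, size=(hidden, hidden))
    U = rng.normal(scale=0.5, size=(hidden, features))

    h = np.zeros(hidden)
    for x_t in rng.normal(size=(5, features)):   # 5 time steps
        h = rnn_step(x_t, h, W, U)               # each output depends on all previous inputs
    print(h)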

SLIDE 21

Timeseries

  • Temporal sequence of data points
  • Consecutive points are strongly correlated
  • Common in statistics, signal processing, econometrics, mathematical finance, earthquake prediction, etc.
  • Numeric (real or discrete) or symbolic data
  • Supervised learning problem (see the windowing sketch below):

Xt = f (Xt−1, ⋯, Xt−n)
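One common way to set up the supervised problem is to slide a window of the n previous values over the series; a sketch with made-up values (the helper make_windows is illustrative, not part of any library):

    import numpy as np

    def make_windows(series, n):
        """Build (X, y) pairs where X[i] = series[i:i+n] and y[i] = series[i+n]."""
        X = np.array([series[i:i + n] for i in range(len(series) - n)])
        y = np.array([series[i + n] for i in range(len(series) - n)])
        return X, y

    series = np.sin(np.linspace(0, 10, 50))   # illustrative timeseries
    X, y = make_windows(series, n=5)
    print(X.shape, y.shape)                   # (45, 5) inputs, (45,) targets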

SLIDE 22

github.com/DataForScience/RNN

SLIDE 23

Long Short-Term Memory (LSTM)

[Diagram] LSTM unrolled in time: each step takes xt, the previous output ht−1 and the previous cell state ct−1 and produces ht and ct.

  • What if we want to keep explicit information about previous states (memory)?
  • How much information is kept can be controlled through gates.
  • LSTMs were first introduced in 1997 by Hochreiter and Schmidhuber
SLIDE 24

Long Short-Term Memory (LSTM)

[Cell diagram] Inputs ht−1, xt and ct−1; outputs ht and ct. Gates and updates:

i = σ(Wi ht−1 + Ui xt)      (input gate)
f = σ(Wf ht−1 + Uf xt)      (forget gate)
o = σ(Wo ht−1 + Uo xt)      (output gate)
g = tanh(Wg ht−1 + Ug xt)   (candidate state)
ct = (ct−1 ⊗ f) + (g ⊗ i)
ht = tanh(ct) ⊗ o

Legend: + element-wise addition, ⊗ element-wise multiplication, 1− one minus the input.
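A NumPy sketch of a single LSTM step following the equations above (weight shapes are made up and biases are omitted, as on the slide):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U):
        # W and U are dicts of weight matrices for the gates i, f, o and candidate g
        i = sigmoid(W['i'] @ h_prev + U['i'] @ x_t)   # input gate
        f = sigmoid(W['f'] @ h_prev + U['f'] @ x_t)   # forget gate
        o = sigmoid(W['o'] @ h_prev + U['o'] @ x_t)   # output gate
        g = np.tanh(W['g'] @ h_prev + U['g'] @ x_t)   # candidate state
        c_t = c_prev * f + g * i                      # new cell state
        h_t = np.tanh(c_t) * o                        # new output
        return h_t, c_t

    rng = np.random.default_rng(0)
    hidden, features = 4, 2
    W = {k: rng.normal(scale=0.5, size=(hidden, hidden)) for k in 'ifog'}
    U = {k: rng.normal(scale=0.5, size=(hidden, features)) for k in 'ifog'}

    h, c = np.zeros(hidden), np.zeros(hidden)
    for x_t in rng.normal(size=(5, features)):
        h, c = lstm_step(x_t, h, c, W, U)
    print(h)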

SLIDE 25

Long Short-Term Memory (LSTM)

Forget gate: how much of the previous state should be kept?   f = σ(Wf ht−1 + Uf xt)

SLIDE 26

Long Short-Term Memory (LSTM)

Input gate: how much of the previous output should be remembered?   i = σ(Wi ht−1 + Ui xt)

SLIDE 27

Long Short-Term Memory (LSTM)

Output gate: how much of the previous output should contribute?   o = σ(Wo ht−1 + Uo xt)

All gates use the same inputs and activation functions, but different weights.

SLIDE 28

Long Short-Term Memory (LSTM)

Output gate: how much of the previous output should contribute?   o = σ(Wo ht−1 + Uo xt)

SLIDE 29

Long Short-Term Memory (LSTM)

State: update the current state.   ct = (ct−1 ⊗ f) + (g ⊗ i)

SLIDE 30

Long Short-Term Memory (LSTM)

Output: combine all available information.   ht = tanh(ct) ⊗ o

SLIDE 31

Using LSTMs

[Diagram] A sequence of inputs (sequence length × #features) is fed, step by step, into a layer of LSTM cells that share the same weights W1 (#LSTM cells); the final LSTM output is passed through weights W2 to a single σ neuron.
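A sketch of this layout in Keras with made-up sizes: input_shape carries (sequence length, #features) and units sets the number of LSTM cells; a final Dense layer plays the role of the σ neuron:

    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    sequence_length, n_features, n_cells = 10, 1, 32   # illustrative sizes

    model = Sequential()
    # input_shape = (sequence length, #features); units = #LSTM cells
    model.add(LSTM(n_cells, input_shape=(sequence_length, n_features)))
    model.add(Dense(1, activation='sigmoid'))          # the final sigma neuron
    model.summary()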

SLIDE 32

github.com/DataForScience/RNN

SLIDE 33

Applications

  • Language Modeling and Prediction
  • Speech Recognition
  • Machine Translation
  • Part-of-Speech Tagging
  • Sentiment Analysis
  • Summarization
  • Time series forecasting
SLIDE 34

Neural Networks?

[LSTM cell diagram]

SLIDE 35

Or legos?

https://keras.io

SLIDE 36

Keras

  • Open source neural network library written in Python
  • TensorFlow, Microsoft Cognitive Toolkit or Theano backends
  • Enables fast experimentation
  • Created and maintained by François Chollet, a Google engineer.
  • Implements Layers, Objective/Loss functions, Activation functions, Optimizers, etc.

https://keras.io

SLIDE 37

Keras

  • keras.models.Sequential(layers=None, name=None) - is the workhorse. You use it to build a model layer by layer. Returns the object that we will use to build the model.
  • keras.layers
  • Dense(units, activation=None, use_bias=True) - None means linear activation. Other options are ’tanh’, ’sigmoid’, ’softmax’, ’relu’, etc.
  • Dropout(rate, seed=None)
  • Activation(activation) - Same as the activation option to Dense, can also be used to pass TensorFlow or Theano operations directly.
  • SimpleRNN(units, input_shape, activation='tanh', use_bias=True, dropout=0.0, return_sequences=False)
  • GRU(units, input_shape, activation='tanh', use_bias=True, dropout=0.0, return_sequences=False)

https://keras.io
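A sketch of how these pieces fit together, building a small recurrent model layer by layer (the sizes are arbitrary illustrative choices):

    from keras.models import Sequential
    from keras.layers import SimpleRNN, Dense, Dropout

    model = Sequential()
    model.add(SimpleRNN(16, input_shape=(10, 1), activation='tanh'))  # 10 time steps, 1 feature
    model.add(Dropout(0.2))
    model.add(Dense(1, activation=None))   # linear output for regression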

SLIDE 38

Keras

  • model = Sequential()
  • model.add(layer) - Add a layer to the top of the model
  • model.compile(optimizer, loss) - We have to compile the model before we can use it
  • optimizer - ‘adam’, ‘sgd’, ‘rmsprop’, etc.
  • loss - ‘mean_squared_error’, ‘categorical_crossentropy’, ‘kullback_leibler_divergence’, etc.
  • model.fit(x=None, y=None, batch_size=None, epochs=1, verbose=1, validation_split=0.0, validation_data=None, shuffle=True)
  • model.predict(x, batch_size=32, verbose=0) - fit/predict interface

https://keras.io
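A sketch of the full compile/fit/predict cycle on made-up windowed data with shape (samples, sequence length, #features):

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    # Illustrative data: 100 windows of 10 steps with 1 feature each
    X_train = np.random.rand(100, 10, 1)
    y_train = np.random.rand(100)

    model = Sequential()
    model.add(LSTM(32, input_shape=(10, 1)))
    model.add(Dense(1))

    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(X_train, y_train, batch_size=16, epochs=5, validation_split=0.1, verbose=0)
    y_pred = model.predict(X_train, batch_size=32, verbose=0)
    print(y_pred.shape)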

SLIDE 39

Gated Recurrent Unit (GRU)

  • Introduced in 2014 by K. Cho
  • Meant to solve the Vanishing Gradient Problem
  • Can be considered as a simplification of LSTMs
  • Similar performance to LSTM in some applications, better performance for smaller datasets.

SLIDE 40

Gated Recurrent Unit (GRU)

[Cell diagram] Inputs ht−1 and xt; output ht. Gates and updates:

z = σ(Wz ht−1 + Uz xt)            (update gate)
r = σ(Wr ht−1 + Ur xt)            (reset gate)
c = tanh(Wc (ht−1 ⊗ r) + Uc xt)   (current memory)
ht = (z ⊗ c) + ((1 − z) ⊗ ht−1)

Legend: + element-wise addition, ⊗ element-wise multiplication, 1− one minus the input.
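A NumPy sketch of a single GRU step following the equations above (weight shapes are made up and biases are omitted, as on the slide):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gru_step(x_t, h_prev, W, U):
        z = sigmoid(W['z'] @ h_prev + U['z'] @ x_t)        # update gate
        r = sigmoid(W['r'] @ h_prev + U['r'] @ x_t)        # reset gate
        c = np.tanh(W['c'] @ (h_prev * r) + U['c'] @ x_t)  # current memory
        return z * c + (1 - z) * h_prev                    # new output h_t

    rng = np.random.default_rng(0)
    hidden, features = 4, 2
    W = {k: rng.normal(scale=0.5, size=(hidden, hidden)) for k in 'zrc'}
    U = {k: rng.normal(scale=0.5, size=(hidden, features)) for k in 'zrc'}

    h = np.zeros(hidden)
    for x_t in rng.normal(size=(5, features)):
        h = gru_step(x_t, h, W, U)
    print(h)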

SLIDE 41

Gated Recurrent Unit (GRU)

Update gate: how much of the previous state should be kept?   z = σ(Wz ht−1 + Uz xt)

SLIDE 42

Gated Recurrent Unit (GRU)

Reset gate: how much of the previous output should be removed?   r = σ(Wr ht−1 + Ur xt)

SLIDE 43

Gated Recurrent Unit (GRU)

Current memory: what information do we remember right now?   c = tanh(Wc (ht−1 ⊗ r) + Uc xt)

SLIDE 44

Gated Recurrent Unit (GRU)

Output: combine all available information.   ht = (z ⊗ c) + ((1 − z) ⊗ ht−1)

SLIDE 45

Webinars

@bgoncalves www.data4sci.com

Deep Learning from Scratch

  • Apr 19, 2019 8am-12pm (PST)

Natural Language Processing (NLP) from Scratch

  • May 6, 2019 5am-9am (PST)


Data Visualization with matplotlib and seaborn

  • Jun 4, 2019 8am-12pm (PST)