Neural Networks. Janos Borst, July 23, 2019. University of Leipzig.



SLIDE 1

Neural Networks

Janos Borst, July 23, 2019

University of Leipzig - NLP Group

SLIDE 2

Machine Learning as a Way of Modeling Data

SLIDE 3

Perceptron - Pt. 1

SLIDE 4

Data Modelling

(scatter plot of the data over features f1 and f2)

Setting

  • Data set
  • Two features and two classes
  • 1. No: blue
  • 2. Yes: red

We want to find a description of the data, such that we can classify unseen examples.


SLIDE 7

Data Modelling

(scatter plot of the data over features f1 and f2)

Decision Tree:

  • 1. Which feature and threshold make a good split?
  • 2. After finding the best splits:
  • 3. Take f1 and check if it is larger than some x
  • 3.1 No: blue
  • 3.2 Yes: red

SLIDE 8

Data Modelling

(scatter plot of the data over features f1 and f2)

k-Nearest Neighbours:

  • 1. Just save all the examples.
  • 2. Check the labels of all the nearest neighbours.
  • 3. The new point probably has the same class.

SLIDE 9

Perceptron - Pt. 1

(scatter plot of the data over features f1 and f2)

Perceptron:

  • What if we just weight and combine the features?
  • Imagine a two-feature space
  • features: f1 and f2

SLIDE 10

Neural Network Perspective

  • Feature pairs (f1, f2)
  • Associated label l1
  • Weights: w1, w2 and b

n(f1, f2) = w1 · f1 + w2 · f2 + b

  • Prediction:

y = sgn(n(f1, f2))

SLIDE 13

Neural Network Perspective

Example I:

  • Take weights w1 = 2, w2 = 2 and b = 0
  • We say red is 1 and blue is -1
  • Take (0, 2)
  • n(0, 2) = 2 · 0 + 2 · 2 + 0 = 4
  • Prediction: y = sgn(4) = 1, i.e. red
  • Take (−1, 0)
  • n(−1, 0) = 2 · (−1) + 2 · 0 + 0 = −2
  • Prediction: y = sgn(−2) = −1, i.e. blue
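The arithmetic in Example I can be checked with a short script. This is a minimal sketch; the `neuron` and `sgn` helper names are illustrative, with the weights and data points taken from the slide:

```python
def sgn(x):
    # sign function: +1 for positive values, -1 otherwise
    return 1 if x > 0 else -1

def neuron(f1, f2, w1=2, w2=2, b=0):
    # weighted combination of the two features, as on the slide
    return w1 * f1 + w2 * f2 + b

# point (0, 2): n = 2*0 + 2*2 + 0 = 4, sgn(4) = 1 -> red
print(neuron(0, 2), sgn(neuron(0, 2)))    # 4 1
# point (-1, 0): n = 2*(-1) + 2*0 + 0 = -2, sgn(-2) = -1 -> blue
print(neuron(-1, 0), sgn(neuron(-1, 0)))  # -2 -1
```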

SLIDE 18

The Perceptron Algorithm - Informal

How do we find the "right" weights?

  • 1. Initialize the weights randomly
  • 2. Take an example from the data set
  • 3. Predict a class based on the current function
  • 4. Prediction is
  • correct: then go back to 2.
  • false: adjust weights towards the misclassified point and go back to 2.

Do this until all the examples are classified correctly.


SLIDE 25

The Perceptron Algorithm

Data: Features and Labels
Function neuron(f1, f2): return w1 · f1 + w2 · f2 + b
for f1, f2, label in Data do
    output ← neuron(f1, f2)
    prediction ← sgn(output)
    if prediction = 1 and label = −1 then
        w1 ← w1 − f1
        w2 ← w2 − f2
        b ← b − 1
    end
    if prediction = −1 and label = 1 then
        w1 ← w1 + f1
        w2 ← w2 + f2
        b ← b + 1
    end
end
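The pseudocode above can be sketched as a runnable loop. The toy data set below is illustrative (any linearly separable points work); misclassified points pull the weights towards their own label:

```python
# Toy, linearly separable data as (f1, f2, label) triples (illustrative values)
data = [(0, 2, 1), (1, 2, 1), (2, 1, 1), (-1, 0, -1), (-2, -1, -1), (0, -2, -1)]

w1, w2, b = 0.0, 0.0, 0.0

def sgn(x):
    return 1 if x > 0 else -1

# Repeat epochs until every example is classified correctly
for _ in range(100):
    errors = 0
    for f1, f2, label in data:
        prediction = sgn(w1 * f1 + w2 * f2 + b)
        if prediction != label:
            # adjust the weights towards the misclassified point
            w1 += label * f1
            w2 += label * f2
            b += label
            errors += 1
    if errors == 0:
        break

# all examples are now classified correctly
assert all(sgn(w1 * f1 + w2 * f2 + b) == label for f1, f2, label in data)
```

The update `w += label * f` covers both branches of the pseudocode: it adds the point for a missed positive and subtracts it for a missed negative.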

SLIDE 26

The Perceptron

The Perceptron1:

  • The first neural network-like machine learning algorithm
  • A detailed description of the Perceptron here

1 The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain, F. Rosenblatt (1958)

SLIDE 27

showcase: The Perceptron

SLIDE 28

Schematic

(diagram: inputs f1 and f2 are weighted by w1 and w2, summed (∑) and passed through an activation α to produce the output y)

Detailed comparison with a biological neuron (image: Wiki Commons):

  • Input | Dendrite
  • Summation | Soma
  • Activation | Potential
  • Forwarding the Signal | Axon

SLIDE 29

Neural Networks

SLIDE 30

Neural Networks

Generalizing this idea

(diagram: Input, Layer, Output)

SLIDE 34

Dense Layer

Dense: This type of layer is called Dense Layer, Densely Connected Layer, or Fully Connected Layer. For input x the output o can be calculated by:

o = W · x + b

with weight matrix W and bias vector b as the trainable parameters.
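The formula o = W · x + b can be sketched in plain Python without a framework. The weight matrix and input below are illustrative values, mapping 2 inputs to 3 outputs:

```python
# A dense layer computes o = W . x + b: one output per row of W
def dense(W, b, x):
    # each output is the dot product of a weight row with the input, plus a bias
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]          # 3x2 weight matrix
b = [0.0, 0.5, -1.0]      # bias vector
x = [2.0, 3.0]            # input vector

print(dense(W, b, x))     # [2.0, 3.5, 4.0]
```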

SLIDE 36

Layers

Layers - The Building Blocks of Neural Networks

  • An arrangement of neurons (trainable parameters)
  • A mathematical transformation of the input
  • Determine how the information flows through the network
  • Contain a method to update the parameters
  • The abstraction allows stacking layers on top of each other

SLIDE 43

The Picture So Far

(diagram: Input, Layer, Layer, Output, Activation)

Layers

  • We can transform the input by using a layer
  • We can stack layers
  • Layers other than input/output are called Hidden Layers
  • The arrangement of the layers is called an Architecture

SLIDE 44

The Picture So Far

(diagram: Input, Layer, Layer, Output, Activation)

Activation Functions

  • Non-linear functions
  • Applied to the output of a layer
  • They make neural networks powerful
  • Correspond to the "firing" of neurons

SLIDE 45

Activation Functions

What makes neural networks so powerful?

  • Non-linearity
  • Scaling the network
  • Various activations
  • A short guide
  • or this
  • We will learn and use mainly the softmax activation
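As a preview of the softmax activation mentioned above, here is a minimal sketch (the max-subtraction trick is a common numerical-stability measure, not something the slide specifies):

```python
import math

def softmax(z):
    # subtract the max for numerical stability, then normalize the exponentials
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)        # roughly [0.66, 0.24, 0.10]
print(sum(probs))   # 1.0 (up to floating point)
```

Softmax turns arbitrary scores into positive values that sum to one, which is why it is used to produce class probabilities.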

SLIDE 46

The Big Picture

(diagram: Input, Layer, Layer, Output, Activation, Metric)

Metrics

  • Measure the quality of the prediction on a data sample
  • Describe the desired performance

SLIDE 49

Metrics

Accuracy - a very common and simple metric

  • The ratio of correct predictions to the number of test examples
  • For a set S of examples:

accuracy = |{s ∈ S : prediction(s) is correct}| / |S|

  • This is what we want to be high!

Unfortunately: accuracy is not differentiable. (Remember: the training will rely on derivatives.)
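The accuracy formula above translates directly into code (a sketch with made-up predictions and labels):

```python
# accuracy = |{s in S : prediction(s) is correct}| / |S|
def accuracy(predictions, labels):
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct / len(labels)

# 3 of 4 predictions match the labels
print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))  # 0.75
```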

SLIDE 55

The Picture So Far

(diagram: Input, Layer, Layer, Activation, Output, Metric, Loss)

Loss Functions

  • Loss functions measure the quality of the prediction
  • Differentiable!
  • A proxy for the metric
  • Also called: cost function or error function

SLIDE 59

The Picture So Far

(diagram: Input, Layer, Layer, Activation, Output, Metric, Loss)

Loss Functions

  • Depend on the task we want to train
  • We will learn the corresponding loss functions by example

SLIDE 61

Loss - Intuition

Measures the "deviation from the ideal prediction". Suppose an image classifier predicts:

  human: 0.48 | cat: 0.01 | dog: 0.51

  • Results in the correct decision: dog
  • The ideal prediction would be p(dog) = 1, p(cat) = 0, p(human) = 0

SLIDE 65

Loss - Intuition

  • prediction p = (0.48, 0.01, 0.51)
  • truth t = (0, 0, 1)

For example, the Euclidean distance:

l = ∥t − p∥ = √((0 − 0.48)² + (0 − 0.01)² + (1 − 0.51)²) ≈ 0.69

A high loss, because the prediction is uncertain.
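The distance above can be recomputed in a few lines (using the prediction and truth vectors from the slide):

```python
import math

# Euclidean distance between the truth vector and the prediction
def euclidean_loss(t, p):
    return math.sqrt(sum((ti - pi) ** 2 for ti, pi in zip(t, p)))

p = [0.48, 0.01, 0.51]   # (human, cat, dog) prediction from the slide
t = [0.0, 0.0, 1.0]      # ideal prediction: dog
print(round(euclidean_loss(t, p), 2))  # 0.69
```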

SLIDE 67

The Picture So Far

(diagram: Data, Input, Layer, Layer, Activation, Output, Metric, Loss, Performance)

Performance

  • Metrics and losses are used to train the net
  • How do we measure the real performance of the model?
  • Train on a set of examples (training set)
  • Evaluate on unseen data (test set)

SLIDE 69

The Picture So Far

(diagram: Data, Input, Layer, Layer, Activation, Output, Metric, Loss, Performance)

Data

  • Before there is input there is data
  • How do we represent language data for input and output?
  • Next chapter

SLIDE 72

Outlook

  • A very coarse-grained view of the structure of neural networks
  • Learning by examples
  • With every example we will:
  • learn new layers
  • learn new activations
  • learn new loss functions
  • And directly use them in Keras

SLIDE 75

Training a Neural Network

SLIDE 76

Training

The process of finding the best parameters by looking at the data. How do we update the weights?

The Backpropagation Algorithm2

2 Learning representations by back-propagating errors. David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams (1988)

SLIDE 78

Terminology

  • batch: a small subset drawn from the data
  • batch size: the number of examples in a batch
  • example: one element of the data
  • epoch: one iteration over all available examples (in batches)

SLIDE 81

Epoch

The idea of an epoch (similar to the Perceptron Algorithm):

  • 1. Pick a few examples of data at random (a batch)
  • 2. Calculate the output of the net
  • 3. Loss: calculate the loss/error of the output
  • 4. Determine the gradients (derivatives)
  • 5. Update the weights accordingly
  • 6. Do that until every example has been seen once
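The batching part of an epoch (steps 1 and 6) can be sketched as follows; the `batches` helper and the fixed shuffle seed are illustrative choices, not from the slides:

```python
import random

# One epoch: shuffle the examples, then visit each exactly once, in batches
def batches(data, batch_size, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)   # "pick at random" within the epoch
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

examples = list(range(10))
epoch = list(batches(examples, batch_size=3))
print([len(b) for b in epoch])   # [3, 3, 3, 1] - every example seen once
```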

SLIDE 87

Backpropagation - A Visual Intuition

Update the weights accordingly?

(diagram: a network predicting 0.9 cat, 0.1 dog, followed by the loss; the build steps annotate:)

  • Input example
  • Transform to network input
  • Calculate a dense transformation
  • Calculate the output of the network
  • Calculate the loss function
  • Update the weights, s.t. the probability for Cat decreases.
  • Update the weights, s.t. the probability for Dog increases.

SLIDE 97

Backpropagation - Reading

A neural network is trained by a combination of Gradient Descent and Backpropagation.

  • A good video for intuition
  • A very mathematical treatment

SLIDE 98

Backpropagation - A Mathematical Intuition

The neural network is a parametrized function, e.g.:

pred(i) = α(W · i + b)

with parameters W and b, and a loss function loss(pred(i), truth).

How does a specific weight wij influence the error made on the example?

∂loss/∂wij = (∂loss/∂α) · (∂α/∂wij)
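The chain rule above can be verified numerically on a one-weight network. This sketch assumes α = tanh and a squared-error loss, with illustrative values for x, b and the truth; the analytic derivative is compared against a finite-difference estimate:

```python
import math

# Tiny network: pred = alpha(w * x + b) with alpha = tanh,
# and loss = (pred - truth)^2. All concrete values are illustrative.
x, b, truth = 0.5, 0.1, 1.0
w = 0.8

def loss(w):
    pred = math.tanh(w * x + b)
    return (pred - truth) ** 2

# Chain rule: dloss/dw = dloss/dalpha * dalpha/dw
#           = 2 * (pred - truth) * (1 - tanh(z)^2) * x
z = w * x + b
pred = math.tanh(z)
analytic = 2 * (pred - truth) * (1 - pred ** 2) * x

# Central finite difference of the same derivative
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True
```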

SLIDE 103

Backpropagation - A Mathematical Intuition

This is done vectorized, by calculating the gradient:

∇loss = [ ∂loss/∂w11  ∂loss/∂w12  … ]
        [ ∂loss/∂w21  ∂loss/∂w22  … ]
        [ ∂loss/∂w31  …           ⋱ ]

which is then used to update the weights by adding the negative of the gradient. Remember: the gradient points in the direction of steepest ascent. This is called Gradient Descent.

SLIDE 105

Stochastic Gradient Descent

Gradient Descent:

  • 1. Calculate the loss for every example
  • 2. Determine the gradient
  • 3. Update the weights in the direction of the negative gradient
  • 4. Iterate

Stochastic Gradient Descent (SGD):

  • 1. Calculate the loss for a random subsample of the data set
  • 2. Determine the gradient
  • 3. Update the weights in the direction of the negative gradient
  • 4. Iterate

SLIDE 107

Optimizer

Stochastic Gradient Descent (SGD):

W_{t+1} = W_t − η · grad[loss(i; W_t)]

with η the learning rate and batch i.

Optimizers:

  • In neural networks this update rule is called an optimizer
  • SGD is the most basic neural network optimizer
  • There have been many advancements in optimizers
  • A very comprehensive article
  • We use the (more or less) state-of-the-art default: Adam
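The SGD update rule can be demonstrated on a toy one-parameter problem. The loss (w − 3)² and its gradient 2(w − 3) are made up for illustration; the update line is exactly the formula above:

```python
# SGD update: w_{t+1} = w_t - eta * grad(loss)(w_t),
# on the toy loss(w) = (w - 3)^2 with gradient 2 * (w - 3)
eta = 0.1   # learning rate
w = 0.0
for _ in range(100):
    grad = 2 * (w - 3)
    w = w - eta * grad   # step in the direction of the negative gradient

print(round(w, 4))  # 3.0 - converged to the minimum
```

Each step shrinks the distance to the minimum by a constant factor here; in a real network the gradient comes from backpropagation over a batch rather than a closed-form derivative.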

SLIDE 112

Summary

  • To train a neural network, we choose a loss function and an optimizer
  • We train by:
  • iterating over the data
  • calculating the loss
  • letting the optimizer update the weights, s.t. the loss decreases
  • We stop training when the loss does not get lower anymore

SLIDE 116

showcase: Programming MNIST

SLIDE 117

Caveats and More Things to Worry About

Generalization

How well does the model perform on data not seen during training? Training means Fitting the training set - low loss What about the loss on unseen data (test set)? We hope by minimizing the error on training data the model can generalize to unseen data

35

slide-119
SLIDE 119

Generalization

How well does the model perform on data not seen during training? Training means Fitting the training set - low loss What about the loss on unseen data (test set)? We hope by minimizing the error on training data the model can generalize to unseen data

35

slide-120
SLIDE 120

Generalization

How well does the model perform on data not seen during training? Training means Fitting the training set - low loss | What about the loss on unseen data (test set)? We hope by minimizing the error on training data the model can generalize to unseen data

35

slide-121
SLIDE 121

Generalization

How well does the model perform on data not seen during training? Training means Fitting the training set - low loss | What about the loss on unseen data (test set)? | We hope by minimizing the error on training data the model can generalize to unseen data

35

slide-122
SLIDE 122

Generalization

When good performance on training data indicates good performance on test data we say: the model generalizes well. The bad cases? Underfjtting: Fails to catch the underlying trend Overfjtting: Memorizes the training data and noise Both related to bad generalization!



slide-125
SLIDE 125

Over-, Underfitting, Variance, Bias

[Figure: regression and categorization examples of underfitted, just-right, and overfitted models]


slide-130
SLIDE 130

Overfitting

Observing the loss on training and test data during training lets us detect overfitting.

[Figure: training and test loss over time; the test loss turns upward while the training loss keeps falling, marking the onset of overfitting]
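This early-warning signal can be sketched in code. The `detect_overfitting` helper and the loss curves below are hypothetical, not part of the lecture; the idea is simply to flag the epoch where the test loss starts rising while the training loss keeps falling.

```python
def detect_overfitting(train_losses, test_losses, patience=2):
    """Return the first epoch after which the test loss has risen for
    `patience` consecutive epochs while the training loss kept falling,
    or None if no such point exists."""
    rising = 0
    for epoch in range(1, len(test_losses)):
        test_up = test_losses[epoch] > test_losses[epoch - 1]
        train_down = train_losses[epoch] < train_losses[epoch - 1]
        rising = rising + 1 if (test_up and train_down) else 0
        if rising >= patience:
            return epoch - patience + 1
    return None

# Made-up loss curves: training keeps improving, test turns around.
train = [1.0, 0.6, 0.4, 0.3, 0.25, 0.22]
test = [1.1, 0.8, 0.7, 0.75, 0.85, 0.95]
print(detect_overfitting(train, test))  # -> 3: overfitting sets in at epoch 3
```

In practice this is the basis of early stopping: keep the weights from the epoch just before the test (or validation) loss turned around.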



slide-134
SLIDE 134

Underfitting

Different ways it can manifest:

  • the loss on the training set does not decrease
  • the loss on the training set increases
  • the loss on the training set behaves erratically


slide-135
SLIDE 135

Validation Split

  • Take a small portion of the data
  • do not use it for training
  • track the loss on it

Three sets:

  • training set: used to update the weights
  • validation set: used to track generalization
  • test set: used to evaluate performance on held-out data
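A minimal sketch of such a three-way split, assuming the data fits in memory. The `three_way_split` helper and the 80/10/10 fractions are illustrative choices, not a prescription from the lecture.

```python
import random


def three_way_split(data, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle and split data into training, validation, and test sets."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_val = int(len(items) * val_frac)
    n_test = int(len(items) * test_frac)
    val = items[:n_val]
    test = items[n_val:n_val + n_test]
    train = items[n_val + n_test:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # -> 80 10 10
```

Shuffling before splitting matters: if the data is ordered (e.g. by class), an unshuffled split would give the three sets different distributions.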



slide-137
SLIDE 137

Hyperparameters

  • learning rate
  • batch size
  • epochs
  • number of layers
  • number of parameters in layers
  • . . .

Lots of little knobs to tune...
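One common, if brute-force, way to turn these knobs is a grid search over a small set of candidate values. The sketch below is a toy: `train_and_validate` is a made-up placeholder standing in for actually training a network and returning its validation loss, and the search space is invented.

```python
from itertools import product

# Hypothetical search space over a few of the knobs listed above.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [16, 32],
    "epochs": [5, 10],
}


def train_and_validate(learning_rate, batch_size, epochs):
    # Placeholder metric: pretends smaller learning rates and more
    # gradient steps yield a lower validation loss.
    return learning_rate + 1.0 / (batch_size * epochs)


best_loss, best_config = float("inf"), None
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    loss = train_and_validate(**config)
    if loss < best_loss:
        best_loss, best_config = loss, config

print(best_config)  # the configuration with the lowest validation loss
```

Note that each grid point means one full training run, so the cost grows multiplicatively with every knob added.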



slide-139
SLIDE 139

Runtime Performance

The time it takes to train depends on the task, the data, and the depth of the network. Training takes longer:

  • the deeper the network
  • the larger the data set
  • the higher the complexity of the task

Often a GPU can help.
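A quick way to see the data-size effect is to time a stand-in for one training pass on differently sized inputs. This is a toy: `train_step` is just a placeholder computation, not a real forward/backward pass.

```python
import time


def train_step(data):
    # Stand-in for one full pass over the data (forward + backward).
    return sum(x * x for x in data)


timings = []
for n in (10_000, 100_000):
    start = time.perf_counter()
    for _ in range(10):  # a few "epochs"
        train_step(range(n))
    timings.append(time.perf_counter() - start)
    print(f"{n} examples: {timings[-1]:.4f}s")
```

The same scaling is why a GPU helps: it parallelizes the per-example arithmetic rather than shrinking the data.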



slide-142
SLIDE 142

Important

Main Caveats:

  • Data: the more, the better
  • Generalization
  • Lots of hyperparameters
  • Runtime

