
SLIDE 1

Deconstructing Data Science

David Bamman, UC Berkeley
Info 290
Lecture 16: Neural networks
Mar 16, 2017

SLIDE 2

https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful

SLIDE 3

Neural network libraries
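The slide's list of libraries did not survive extraction. As one illustration (not necessarily a library shown on the slide), a small feedforward classifier like the toy network developed in the following slides might be sketched in Keras as:

```python
# Minimal sketch in Keras (illustrative only; layer sizes match the toy
# network in the following slides: 3 inputs, 2 hidden nodes, 1 output).
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(2, activation="sigmoid", input_shape=(3,)),  # hidden layer
    keras.layers.Dense(1),                                          # linear output
])
model.compile(optimizer="sgd", loss="mean_squared_error")
# model.fit(X_train, y_train, epochs=10)   # given training <x, y> pairs
```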

SLIDE 4

The perceptron, again

ŷ = 1 if Σ_{i=1}^{F} x_i β_i ≥ 0, and −1 otherwise

Example ("not bad movie"): x = [1, 1, 1], β = [−0.5, −1.7, 0.3]

SLIDE 5

The perceptron, again

[Figure: the same example drawn as a network: inputs x1, x2, x3 connect directly to the output y through weights β1, β2, β3.]

Example ("not bad movie"): x = [1, 1, 1], β = [−0.5, −1.7, 0.3]

ŷ = 1 if Σ_{i=1}^{F} x_i β_i ≥ 0, and −1 otherwise
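A minimal sketch of this decision rule in Python, using the example's feature vector and the weights as reconstructed from the slide figure (treat the specific values as illustrative):

```python
import numpy as np

def perceptron_predict(x, beta):
    """Return +1 if the weighted sum of the features is >= 0, else -1."""
    return 1 if np.dot(x, beta) >= 0 else -1

# One binary feature per word in "not bad movie"; weights reconstructed from the slide
x = np.array([1.0, 1.0, 1.0])
beta = np.array([-0.5, -1.7, 0.3])
print(perceptron_predict(x, beta))   # weighted sum is -1.9, so this prints -1
```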
SLIDE 6

Neural networks

  • Two core ideas:
  • Non-linear activation functions
  • Multiple layers
SLIDE 7

[Figure: a feedforward network: an input layer (x1, x2, x3), a "hidden" layer (h1, h2), and an output y. W holds the input-to-hidden weights (W1,1 through W3,2) and V the hidden-to-output weights (V1, V2).]

SLIDE 8

[Figure: the same network evaluated on the input "not bad movie" (x = 1 for each word), with concrete values filled in for the weights W (−0.5, 1.3, 0.4, 0.08, 1.7, 3.1) and V (4.1, −0.9), and the resulting output y.]
SLIDE 9

[Figure: the feedforward network with weights W and V.]

h_j = f( Σ_{i=1}^{F} x_i W_{i,j} )

The hidden nodes are completely determined by the input and the weights.
SLIDE 10

[Figure: the feedforward network, highlighting hidden node h1.]

h_1 = f( Σ_{i=1}^{F} x_i W_{i,1} )

SLIDE 11

Activation functions

σ(z) = 1 / (1 + exp(−z))

[Figure: plot of the logistic (sigmoid) function, which squashes z into the range (0, 1).]

SLIDE 12

Activation functions

tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))

[Figure: plot of tanh, which squashes z into the range (−1, 1).]

SLIDE 13

Activation functions

rectifier(z) = max(0, z)

[Figure: plot of the rectifier, which is 0 for negative z and z otherwise.]
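For reference, the three activation functions written out in numpy (a direct transcription of the formulas above):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent: squashes z into (-1, 1)."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def rectifier(z):
    """Rectified linear unit: 0 for negative z, z otherwise."""
    return np.maximum(0.0, z)
```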

SLIDE 14

[Figure: the feedforward network with weights W and V.]

h_1 = σ( Σ_{i=1}^{F} x_i W_{i,1} )
h_2 = σ( Σ_{i=1}^{F} x_i W_{i,2} )

ŷ = V_1 h_1 + V_2 h_2
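A minimal numpy sketch of this forward pass, using the weight values from slide 8 (their arrangement inside W is an assumption made for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 1.0, 1.0])       # one binary feature per word in "not bad movie"
W = np.array([[-0.5, 1.3],          # input-to-hidden weights (arrangement assumed)
              [ 0.4, 0.08],
              [ 1.7, 3.1]])
V = np.array([4.1, -0.9])           # hidden-to-output weights

h = sigmoid(x @ W)                  # h_j = sigmoid(sum_i x_i * W[i, j])
y_hat = h @ V                       # y_hat = V_1*h_1 + V_2*h_2
```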

SLIDE 15

ŷ = V_1 σ( Σ_{i=1}^{F} x_i W_{i,1} ) + V_2 σ( Σ_{i=1}^{F} x_i W_{i,2} )

We can express ŷ as a function only of the input x and the weights W and V.

[Figure: the feedforward network with weights W and V.]

SLIDE 16

ŷ = V_1 σ( Σ_{i=1}^{F} x_i W_{i,1} ) + V_2 σ( Σ_{i=1}^{F} x_i W_{i,2} )

This is hairy, but differentiable.

Backpropagation: given training samples of <x, y> pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.
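A minimal sketch of one such update for this two-hidden-node network, assuming a squared-error loss and a random initialization (both are illustrative choices, not specified on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))   # input-to-hidden weights
V = rng.normal(scale=0.1, size=2)        # hidden-to-output weights
alpha = 0.1                              # learning rate

def sgd_step(x, y):
    """One stochastic gradient descent update on a single <x, y> pair."""
    global W, V
    # forward pass
    h = sigmoid(x @ W)
    y_hat = h @ V
    # backward pass for the squared-error loss (y - y_hat)^2
    d_yhat = -2.0 * (y - y_hat)
    d_V = d_yhat * h                        # gradient w.r.t. V
    d_h = d_yhat * V
    d_W = np.outer(x, d_h * h * (1 - h))    # gradient w.r.t. W, through the sigmoid
    # follow the negative gradient
    V -= alpha * d_V
    W -= alpha * d_W
```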

SLIDE 17

[Figure: plot of −x² for x from −10 to 10.]

We can get to the maximum value of this function by following the gradient:

d/dx (−x²) = −2x

Starting at x = 8 and repeatedly updating x ← x + α(−2x), with α = 0.1:

x      α(−2x)
8.00   −1.60
6.40   −1.28
5.12   −1.02
4.10   −0.82
3.28   −0.66
2.62   −0.52
2.10   −0.42
1.68   −0.34
1.34   −0.27
1.07   −0.21
0.86   −0.17
0.69   −0.14
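A few lines of Python reproduce the table above:

```python
alpha = 0.1
x = 8.0
for _ in range(12):
    step = alpha * (-2 * x)        # alpha times the gradient of -x^2
    print(f"{x:.2f}  {step:.2f}")
    x = x + step                   # move x in the direction of the gradient
```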

SLIDE 18

Neural network structures

[Figure: the feedforward network with a single output node y.]

Output one real value.

SLIDE 19

Neural network structures

[Figure: the network with three output nodes.]

Multiclass: output 3 values; only one = 1 in the training data.
SLIDE 20

Neural network structures

[Figure: the network with three output nodes.]

Multilabel: output 3 values; several may = 1 in the training data.

SLIDE 21

Regularization

  • Increasing the number of parameters = increasing the possibility for overfitting to training data

SLIDE 22

Regularization

  • L2 regularization: penalize W and V for being too large
  • Dropout: when training on an <x, y> pair, randomly remove some nodes and weights
  • Early stopping: stop backpropagation before the training error gets too small
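A rough sketch of how the first two ideas might be wired into the earlier SGD sketch; the penalty strength lam and keep probability p are assumed hyperparameters, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lam = 0.01   # assumed L2 penalty strength
p = 0.5      # assumed probability of keeping each hidden node during training
rng = np.random.default_rng(0)

def forward_with_dropout(x, W, V):
    h = sigmoid(x @ W)
    h = h * (rng.random(h.shape) < p)   # dropout: randomly zero out hidden nodes
    return h, h @ V

# L2 regularization: add lam * weight to each gradient so large weights shrink, e.g.
#   W -= alpha * (d_W + lam * W)
#   V -= alpha * (d_V + lam * V)
```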

SLIDE 23

Deeper networks

[Figure: a deeper network: the inputs x1, x2, x3 feed a first hidden layer through weights W1, that layer feeds a second hidden layer through weights W2, and the second hidden layer feeds the output y through weights V.]

SLIDE 24

http://neuralnetworksanddeeplearning.com/chap1.html

SLIDE 25

Higher order features learned for image recognition
 Lee et al. 2009 (ICML)

SLIDE 26

Autoencoder

[Figure: a network whose inputs x1, x2, x3 pass through a smaller hidden layer (h1, h2) and are reconstructed as outputs x1, x2, x3.]

  • Unsupervised neural network, where y = x
  • Learns a low-dimensional representation of x
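A minimal sketch of the idea, with a sigmoid bottleneck and a linear reconstruction (the dimensions and initialization are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 2))   # encoder: 3 inputs -> 2 hidden nodes (the bottleneck)
W_dec = rng.normal(size=(2, 3))   # decoder: 2 hidden nodes -> 3 reconstructed inputs

x = np.array([1.0, 0.0, 1.0])
h = sigmoid(x @ W_enc)            # the low-dimensional representation of x
x_hat = h @ W_dec                 # reconstruction; training minimizes (x - x_hat)^2
```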
SLIDE 27

Feedforward networks

[Figure: the feedforward network: inputs x1, x2, x3, hidden nodes h1, h2, output y.]

SLIDE 28

Recurrent networks

[Figure: a recurrent network: an input x, a hidden layer h, and an output label y, with the hidden layer also feeding back into itself across time steps.]
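A minimal sketch of the recurrence, assuming a tanh hidden layer (the weight names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 2))   # input -> hidden
W_hh = rng.normal(size=(2, 2))   # previous hidden state -> hidden (the recurrence)
V = rng.normal(size=2)           # hidden -> output label

h = np.zeros(2)                  # hidden state carried across time steps
for x_t in [np.array([1.0, 0.0, 0.0]),
            np.array([0.0, 1.0, 0.0]),
            np.array([0.0, 0.0, 1.0])]:
    h = np.tanh(x_t @ W_xh + h @ W_hh)
    y_t = h @ V
```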

SLIDE 29

Interpretability

[Figure: the single-layer model: inputs x1, x2, x3 connect directly to the output y through weights β1, β2, β3.]

Example ("not bad movie"): x = [1, 1, 1], β = [−0.5, −1.7, 0.3]

P(y = 1 | x, β) = exp(xβ) / (1 + exp(xβ))

With a single-layer linear model (logistic/linear regression, perceptron), there's an immediate relationship between x and y apparent in β.
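For the example above (using the weights as reconstructed from the slide figure, so treat the numbers as illustrative), the probability works out to roughly 0.13:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
beta = np.array([-0.5, -1.7, 0.3])   # assumed weights, for illustration only
z = x @ beta                          # xb = -1.9
p = np.exp(z) / (1 + np.exp(z))       # P(y = 1 | x, beta), about 0.13
```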

SLIDE 30

Interpretability

[Figure: the feedforward network with a hidden layer.]

Non-linear activation functions induce dependencies between the inputs.