Deconstructing Data Science
David Bamman, UC Berkeley Info 290 Lecture 16: Neural networks Mar 16, 2017
https://www.forbes.com/sites/kevinmurnane/2016/04/01/what-is-deep-learning-and-how-is-it-useful
ŷ = +1 if Σᵢ xᵢβᵢ ≥ 0, and −1 otherwise (the perceptron decision rule).
[Table: binary features x (1 for each of "not", "bad", "movie") and their weights β, e.g. 0.3.]
[Diagram: inputs x₁, x₂, x₃ connect directly to the output y through weights β₁, β₂, β₃.]
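A minimal sketch of this decision rule in Python (numpy assumed; the weight values below are invented for illustration, apart from the 0.3 shown on the slide):

```python
import numpy as np

def predict(x, beta):
    # perceptron decision rule: +1 if the weighted sum is >= 0, else -1
    return 1 if np.dot(x, beta) >= 0 else -1

x = np.array([1, 1, 1])              # binary features: "not", "bad", "movie"
beta = np.array([-0.5, -1.7, 0.3])   # illustrative weights
print(predict(x, beta))              # -> -1
```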
[Diagram: a feedforward network — inputs x₁, x₂, x₃ (Input), hidden layer h₁, h₂ ("Hidden" Layer), and output y (Output). W (elements W₁,₁ … W₃,₂) connects input to hidden; V (V₁, V₂) connects hidden to output.]
[Example: x = 1 for each of "not", "bad", "movie", with concrete values filled in for the weights W (e.g., 1.3, 0.4, 0.08, 1.7, 3.1) and V (e.g., 4.1).]
hⱼ = f(Σᵢ xᵢ Wᵢ,ⱼ)

The hidden nodes are completely determined by the input and the weights.
For example, h₁ = f(Σᵢ xᵢ Wᵢ,₁), where f is a non-linear activation function such as:
sigmoid: σ(z) = 1 / (1 + exp(−z))

[Plot: the sigmoid, which maps z into (0, 1).]

tanh(z) = (exp(z) − exp(−z)) / (exp(z) + exp(−z))
[Plot: tanh(z), which maps z into (−1, 1).]
rectifier(z) = max(0, z)
[Plot: rectifier(z), zero for negative inputs.]
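A sketch of these three activation functions in Python; numpy already provides np.tanh, but writing them out follows the formulas above:

```python
import numpy as np

def sigmoid(z):
    # sigma(z): maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # maps any real z into (-1, 1); equivalent to np.tanh(z)
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def rectifier(z):
    # zero for negative z, identity for positive z
    return np.maximum(0, z)
```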
h₁ = σ(Σᵢ xᵢ Wᵢ,₁)
h₂ = σ(Σᵢ xᵢ Wᵢ,₂)
y = V₁h₁ + V₂h₂
Substituting: ŷ = V₁ σ(Σᵢ xᵢ Wᵢ,₁) + V₂ σ(Σᵢ xᵢ Wᵢ,₂)
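As a sketch, the whole forward pass for this 3-2-1 network takes a few lines (random weights stand in for learned ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, V):
    h = sigmoid(x @ W)   # h_j = sigma(sum_i x_i W[i, j])
    return h @ V         # y-hat = V_1 h_1 + V_2 h_2

x = np.array([1.0, 1.0, 1.0])   # "not bad movie"
W = np.random.randn(3, 2)       # input-to-hidden weights
V = np.random.randn(2)          # hidden-to-output weights
print(forward(x, W, V))
```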
We can express ŷ as a function only of the input x and the weights W and V.
This is hairy, but differentiable.

Backpropagation: given training samples of ⟨x, y⟩ pairs, we can use stochastic gradient descent to find the values of W and V that minimize the loss.
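A minimal sketch of that training loop with numpy, assuming a squared-error loss and toy data (not the lecture's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy training data: 4 examples, 3 binary features, real-valued labels
X = rng.integers(0, 2, size=(4, 3)).astype(float)
y = rng.normal(size=4)

W = rng.normal(scale=0.1, size=(3, 2))   # input-to-hidden weights
V = rng.normal(scale=0.1, size=2)        # hidden-to-output weights
alpha = 0.1                              # learning rate

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        h = sigmoid(x_i @ W)    # forward pass
        y_hat = h @ V
        err = y_hat - y_i       # dLoss/dy_hat for loss 0.5*(y_hat - y)^2
        grad_V = err * h        # backpropagate the error to V...
        grad_W = np.outer(x_i, err * V * h * (1 - h))  # ...and through sigma to W
        V -= alpha * grad_V     # stochastic gradient descent updates
        W -= alpha * grad_W
```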
[Plot: f(x) = −x².]

We can get to the maximum of this function by following the gradient.
d/dx (−x²) = −2x

Repeatedly update x ← x + α(−2x), with α = 0.1:

8.00 → 6.40 → 5.12 → 4.10 → 3.28 → 2.62 → 2.10 → 1.68 → 1.34 → 1.07 → 0.86 → 0.69
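The sequence on the slide is just this update applied twelve times; a quick check in Python:

```python
x, alpha = 8.0, 0.1
for step in range(12):
    print(f"{x:.2f}")         # 8.00, 6.40, 5.12, ..., 0.69
    x = x + alpha * (-2 * x)  # step in the direction of the gradient of -x^2
```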
Output: one real value y.
Multiclass: output 3 values, only one of which is 1 (a one-hot encoding of the label).
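For instance, one-hot label vectors for three classes (a standard encoding, sketched here):

```python
import numpy as np

one_hot = np.eye(3)   # row i is the one-hot vector for class i
print(one_hot[1])     # -> [0. 1. 0.]
```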
With many hidden nodes and weights comes the possibility of overfitting to the training data; a warning sign is that training error is too small relative to error on held-out data. When the network is large, one remedy is dropout: during training, randomly remove some nodes and weights.
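A sketch of that idea (dropout) at training time; the 0.5 drop probability is an assumption, not a value from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    # during training, zero out each hidden activation with probability p
    mask = rng.random(h.shape) >= p
    return h * mask

print(dropout(np.array([0.7, 0.2, 0.9, 0.1])))
```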
[Diagram: a deeper network with two hidden layers, weight matrices W₁ and W₂ between layers, and V at the output.]
http://neuralnetworksanddeeplearning.com/chap1.html
Higher-order features learned for image recognition (Lee et al. 2009, ICML).
[Diagram: an autoencoder: the network maps inputs x₁, x₂, x₃ through hidden nodes h₁, h₂ back to reconstructions of x₁, x₂, x₃.]
x: input; y: label; h: hidden layer.
[Diagram: the single-layer model again, with inputs x₁, x₂, x₃ connected to y by weights β₁, β₂, β₃.]
P(y = 1 | x, β) = exp(xβ) / (1 + exp(xβ))
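In code, this probability is one line (the same sigmoid shape as before; weights again illustrative):

```python
import numpy as np

def p_positive(x, beta):
    # P(y = 1 | x, beta) for logistic regression
    return np.exp(x @ beta) / (1 + np.exp(x @ beta))

x = np.array([1.0, 1.0, 1.0])        # "not bad movie"
beta = np.array([-0.5, -1.7, 0.3])   # illustrative weights
print(p_positive(x, beta))           # ~0.13
```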
With a single-layer linear model (logistic/linear regression, perceptron), there is an immediate relationship between x and y, apparent in the weights β.
In a neural network, by contrast, non-linear activation functions induce dependencies between the inputs, so no single weight captures the relationship between an input and y.