Course setup 9 ec course examination based on computer exercises - - PowerPoint PPT Presentation

SLIDE 1

Course setup

  • 9 ec course
  • examination based on computer exercises
  • weekly exercises discussed in tutorial class
  • All course materials (slides, exercises) and schedule via http://www.snn.ru.nl/~bertk/machinelearning/

Bert Kappen ML 1

SLIDE 2

Handout Perceptrons

The Perceptron

Relevant in history of pattern recognition and neural networks.

  • Perceptron learning rule + convergence, Rosenblatt (1962)
  • Perceptron critique (Minsky and Papert, 1969) → "Dark ages of neural networks"
  • Revival in the 80's: Backpropagation and Hopfield model. Statistical physics entered.
  • 1995: Bayesian methods take over. Start of modern machine learning. NN out of fashion.
  • 2006: Deep learning, big data.

SLIDE 3

Handout Perceptrons

The Perceptron

$$y(x) = \mathrm{sign}(w^T \phi(x))$$

where

$$\mathrm{sign}(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$$

and $\phi(x)$ is a feature vector (e.g. hard wired neural network).

SLIDE 4

Handout Perceptrons

The Perceptron

Ignore $\phi$, i.e. consider inputs $x^\mu$ and outputs $t^\mu = \pm 1$. Define $w^T x = \sum_{j=1}^{n} w_j x_j + w_0$. Then, the learning condition becomes

$$\mathrm{sign}(w^T x^\mu) = t^\mu, \qquad \mu = 1, \ldots, P$$

We have $\mathrm{sign}(w^T x^\mu t^\mu) = 1$, or

$$w^T z^\mu > 0$$

with $z^\mu_j = x^\mu_j t^\mu$.

SLIDE 5

Handout Perceptrons

Linear separation

Classification depends on the sign of $w^T x$. Thus, the decision boundary is a hyperplane:

$$0 = w^T x = \sum_{j=1}^{n} w_j x_j + w_0$$

The perceptron can solve linearly separable problems. The AND problem is linearly separable. The XOR problem is not linearly separable, and neither, in general, are problems with linearly dependent inputs.

SLIDE 6

Handout Perceptrons

Perceptron learning rule

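Learning is successful when

$$w^T z^\mu > 0 \quad \text{for all patterns } \mu$$

The learning rule is 'Hebbian':

$$w_j^{new} = w_j^{old} + \Delta w_j$$

$$\Delta w_j = \eta \Theta(-w^T z^\mu) x^\mu_j t^\mu = \eta \Theta(-w^T z^\mu) z^\mu_j$$

$\eta$ is the learning rate.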
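The learning rule can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `train_perceptron`, the `max_epochs` cut-off and the choice to also update when $w^T z^\mu = 0$ (so that learning can start from $w = 0$) are assumptions made here.

```python
def train_perceptron(patterns, targets, eta=1.0, max_epochs=100):
    """Perceptron rule for inputs x^mu and targets t^mu = +/-1.

    A constant input 1 is appended to each pattern so that the last weight
    plays the role of the bias w_0.
    """
    w = [0.0] * (len(patterns[0]) + 1)
    for _ in range(max_epochs):
        n_updates = 0
        for x, t in zip(patterns, targets):
            z = [xi * t for xi in x] + [t]                    # z^mu_j = x^mu_j t^mu
            if sum(wj * zj for wj, zj in zip(w, z)) <= 0:     # pattern not yet learned
                w = [wj + eta * zj for wj, zj in zip(w, z)]   # Delta w_j = eta z^mu_j
                n_updates += 1
        if n_updates == 0:                                    # w.z > 0 for all patterns
            return w, True
    return w, False


inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
w_and, ok_and = train_perceptron(inputs, [-1, -1, -1, 1])     # AND: linearly separable
w_xor, ok_xor = train_perceptron(inputs, [-1, 1, 1, -1])      # XOR: not separable
print(ok_and, ok_xor)                                         # prints: True False
```

On AND the rule stops after a single weight update; on XOR it keeps cycling until the epoch limit is reached.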

SLIDE 7

Handout Perceptrons

Depending on the data, there may be many or few solutions to the learning problem (or none at all). The quality of the solution is determined by the worst pattern. Since the solution does not depend on the size of w, define

$$D(w) = \frac{1}{|w|} \min_\mu w^T z^\mu$$

Acceptable solutions have $D(w) > 0$. The best solution is given by $D_{max} = \max_w D(w)$.

SLIDE 8

Handout Perceptrons

Dmax > 0 iff the problem is linearly separable.

SLIDE 9

Handout Perceptrons

Convergence of Perceptron rule

Assume that the problem is linearly separable, so that there is a solution $w^*$ with $D(w^*) > 0$.

At each iteration, w is updated only if $w \cdot z^\mu < 0$. Let $M^\mu$ denote the number of times pattern $\mu$ has been used to update w. Thus, starting from $w = 0$,

$$w = \eta \sum_\mu M^\mu z^\mu$$

Consider the quantity

$$-1 \le \frac{w \cdot w^*}{|w||w^*|} \le 1$$

We will show that

$$\frac{w \cdot w^*}{|w||w^*|} \ge O(\sqrt{M}),$$

with $M = \sum_\mu M^\mu$ the total number of iterations.

Therefore, M cannot grow indefinitely. Thus, the perceptron learning rule converges in a finite number of steps when the problem is linearly separable.

SLIDE 10

Handout Perceptrons

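Proof:

$$w \cdot w^* = \eta \sum_\mu M^\mu z^\mu \cdot w^* \ge \eta M \min_\mu z^\mu \cdot w^* = \eta M D(w^*) |w^*|$$

$$\Delta |w|^2 = |w + \eta z^\mu|^2 - |w|^2 = 2 \eta w \cdot z^\mu + \eta^2 |z^\mu|^2 \le \eta^2 |z^\mu|^2 = \eta^2 N$$

$$|w| \le \eta \sqrt{NM}$$

Thus,

$$1 \ge \frac{w \cdot w^*}{|w||w^*|} \ge \frac{\sqrt{M} D(w^*)}{\sqrt{N}}$$

Number of weight updates:

$$M \le \frac{N}{D^2(w^*)}$$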
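The bound can be checked numerically on a small problem. The sketch below is an illustration under assumptions made here: the AND problem with ±1 inputs and a constant bias input (so $|z^\mu|^2 = N = 3$), the learning rate `eta = 0.5`, and a hand-picked solution `w_star` that is not claimed to be the optimal one.

```python
import math

# AND problem with +/-1 inputs; the third component is the constant bias input.
xs = [(-1, -1, 1), (-1, 1, 1), (1, -1, 1), (1, 1, 1)]
ts = [-1, -1, -1, 1]
zs = [[xj * t for xj in x] for x, t in zip(xs, ts)]   # z^mu = x^mu t^mu


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def D(w):
    """Quality of w: margin of the worst pattern, normalised by |w|."""
    return min(dot(w, z) for z in zs) / math.sqrt(dot(w, w))


# Run the perceptron rule from w = 0 and count the weight updates M.
eta, w, M = 0.5, [0.0, 0.0, 0.0], 0
changed = True
while changed:
    changed = False
    for z in zs:
        if dot(w, z) <= 0:
            w = [wj + eta * zj for wj, zj in zip(w, z)]
            M += 1
            changed = True

w_star = [1.0, 1.0, -1.0]      # one known solution (assumed, not necessarily optimal)
N = len(w_star)                # |z^mu|^2 = N for +/-1 components
bound = N / D(w_star) ** 2
print(M, bound)                # number of updates and the upper bound N / D^2(w*)
```

Any solution $w^*$ gives a valid bound; a solution with larger $D(w^*)$ gives a tighter one.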

SLIDE 11

Handout Perceptrons

Capacity of the Perceptron

Consider P patterns in N dimensions in general position:

  • no subset of size less than N is linearly dependent.
  • general position is necessary for linear separability

Question: What is the probability that a problem of P samples in N dimensions is linearly separable?

SLIDE 12

Handout Perceptrons

Define $C(P, N)$ as the number of linearly separable colorings of P points in N dimensions, with the separating plane through the origin. Then (Cover 1966):

$$C(P, N) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i}$$

When P is small ($P \le N$), then

$$C(P, N) = 2 \sum_{i=0}^{P-1} \binom{P-1}{i} = 2 (1 + 1)^{P-1} = 2^P$$

When $P = 2N$, then 50 % is linearly separable:

$$C(P, N) = 2 \sum_{i=0}^{N-1} \binom{2N-1}{i} = \sum_{i=0}^{2N-1} \binom{2N-1}{i} = 2^{2N-1} = 2^{P-1}$$

SLIDE 13

Handout Perceptrons

Proof by induction. Add one point X. The set C(P, N) consists of

  • colorings with separator through X (A)
  • rest (B)

Thus,

C(P + 1, N) = 2A + B = C(P, N) + A = C(P, N) + C(P, N − 1)

Yields

$$C(P, N) = 2 \sum_{i=0}^{N-1} \binom{P-1}{i}$$
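Cover's counting function is easy to evaluate directly. The following sketch (the function name `cover_count` and the choice N = 5 are illustrative assumptions) prints the fraction $C(P, N) / 2^P$ of separable colorings and can be used to check the two special cases and the recursion $C(P+1, N) = C(P, N) + C(P, N-1)$.

```python
from math import comb


def cover_count(P, N):
    """C(P, N) = 2 * sum_{i=0}^{N-1} binom(P-1, i), for P >= 1 and N >= 1."""
    return 2 * sum(comb(P - 1, i) for i in range(N))


N = 5
# Fraction of all 2^P colorings that is linearly separable:
# 1 for P <= N, exactly 1/2 at P = 2N, and small for P much larger than 2N.
print([cover_count(P, N) / 2 ** P for P in (3, 5, 10, 20)])
```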

SLIDE 14

5.2

Network training

Regression: $t_n$ continuous valued, $h_2(x) = x$ and one usually minimizes the squared error (one output)

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} (y(x_n, w) - t_n)^2 = -\log \prod_{n=1}^{N} \mathcal{N}(t_n | y(x_n, w), \beta^{-1}) + \ldots$$

Classification: $t_n \in \{0, 1\}$, $h_2(x) = \sigma(x)$, $y(x_n, w)$ is the probability to belong to class 1.

$$E(w) = -\sum_{n=1}^{N} \{ t_n \log y(x_n, w) + (1 - t_n) \log(1 - y(x_n, w)) \} = -\log \prod_{n=1}^{N} y(x_n, w)^{t_n} (1 - y(x_n, w))^{1 - t_n}$$

SLIDE 15

5.2

Network training

More than two classes: consider a network with K outputs. $t_{nk} = 1$ if $x_n$ belongs to class k and zero otherwise. $y_k(x_n, w)$ is the network output.

$$E(w) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \log p_k(x_n, w)$$

$$p_k(x, w) = \frac{\exp(y_k(x, w))}{\sum_{k'=1}^{K} \exp(y_{k'}(x, w))}$$
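As an illustration, the multi-class error can be computed as follows. The network outputs and targets are made-up numbers, and subtracting the maximum before exponentiating is a common numerical precaution that is not part of the slide.

```python
import math


def softmax(y):
    """p_k = exp(y_k) / sum_k' exp(y_k'); the max is subtracted for numerical stability."""
    m = max(y)
    e = [math.exp(v - m) for v in y]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(outputs, targets):
    """E = - sum_n sum_k t_nk log p_k(x_n, w), with one-hot targets t_nk."""
    return -sum(t * math.log(p)
                for y, tn in zip(outputs, targets)
                for p, t in zip(softmax(y), tn))


outputs = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # hypothetical network outputs y_k(x_n, w)
targets = [[1, 0, 0], [0, 1, 0]]               # one-hot class labels t_nk
print(softmax(outputs[1]), cross_entropy(outputs, targets))
```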

SLIDE 16

5.2

Parameter optimization

[Figure: error surface E(w) over weights w1, w2, with points wA, wB, wC and the gradient ∇E]

If E is minimal then ∇E(w) = 0, but not vice versa!

As a consequence, gradient based methods find a local minimum, not necessarily the global minimum.

SLIDE 17

5.2

Gradient descent optimization

The simplest procedure to optimize E is to start with a random w and iterate

$$w^{\tau+1} = w^{\tau} - \eta \nabla E(w^{\tau})$$

This is called batch learning, where all training data are included in the computation of ∇E.

Does this algorithm converge? Yes, if η is "sufficiently small" and E is bounded from below. Proof: Denote $\Delta w = -\eta \nabla E$.

$$E(w + \Delta w) \approx E(w) + (\Delta w)^T \nabla E = E(w) - \eta \sum_i \left( \frac{\partial E}{\partial w_i} \right)^2 \le E(w)$$

In each gradient descent step the value of E is lowered. Since E is bounded from below, the procedure must converge asymptotically.

SLIDE 18

Handouts Ch. Perceptrons

Convergence of gradient descent in a quadratic well

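$$E(w) = \frac{1}{2} \sum_i \lambda_i w_i^2$$

$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = -\eta \lambda_i w_i$$

$$w_i^{new} = w_i^{old} + \Delta w_i = (1 - \eta \lambda_i) w_i$$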

Convergence when |1 − ηλi| < 1. Oscillations when 1 − ηλi < 0. Optimal learning parameter depends on curvature of each dimension.
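The three regimes can be seen in a short experiment. This is a sketch with assumed curvatures `lams = [1.0, 10.0]` and assumed learning rates; the function name `descend` is introduced here for illustration.

```python
def descend(lams, eta, steps=50, w0=1.0):
    """Gradient descent on E(w) = 1/2 sum_i lam_i w_i^2, starting from w_i = w0."""
    w = [w0] * len(lams)
    for _ in range(steps):
        w = [wi - eta * lam * wi for wi, lam in zip(w, lams)]   # w_i <- (1 - eta lam_i) w_i
    return w


lams = [1.0, 10.0]               # hypothetical curvatures: one shallow, one steep direction
print(descend(lams, eta=0.05))   # both factors in (0, 1): monotone convergence
print(descend(lams, eta=0.15))   # factor -0.5 in the steep direction: oscillates, converges
print(descend(lams, eta=0.25))   # factor -1.5 in the steep direction: diverges
```

A learning rate small enough for the steep direction makes progress in the shallow direction slow, which is why the optimal value depends on the curvature of each dimension.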

SLIDE 19

Handouts Ch. Perceptrons

Learning with momentum

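One solution is adding a momentum term:

$$\Delta w_{t+1} = -\eta \nabla E(w_t) + \alpha \Delta w_t = -\eta \nabla E(w_t) + \alpha \left( -\eta \nabla E(w_{t-1}) + \alpha \left( -\eta \nabla E(w_{t-2}) + \ldots \right) \right) = -\eta \sum_{k=0}^{t} \alpha^k \nabla E(w_{t-k})$$

Consider two extremes. No oscillations, all derivatives are equal:

$$\Delta w_{t+1} \approx -\eta \nabla E \sum_{k=0}^{t} \alpha^k = -\frac{\eta}{1 - \alpha} \frac{\partial E}{\partial w}$$

results in acceleration.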

SLIDE 20

Handouts Ch. Perceptrons

Oscillations, all derivatives are equal in magnitude but have alternating sign:

$$\Delta w_{t+1} \approx -\eta \nabla E \sum_{k=0}^{t} (-\alpha)^k = -\frac{\eta}{1 + \alpha} \frac{\partial E}{\partial w}$$

results in deceleration.
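Both limits can be reproduced by iterating the momentum recursion on a prescribed gradient sequence. The values of `eta`, `alpha` and the sequence length are assumptions chosen for illustration.

```python
def momentum_step_size(grads, eta, alpha):
    """Iterate Delta w_{t+1} = -eta * g_t + alpha * Delta w_t and return the last step."""
    dw = 0.0
    for g in grads:
        dw = -eta * g + alpha * dw
    return dw


eta, alpha, T = 0.1, 0.9, 500
constant = [1.0] * T                              # no oscillations: all gradients equal
alternating = [(-1.0) ** t for t in range(T)]     # oscillations: sign flips every step

# Effective step eta / (1 - alpha) = 1.0, ten times larger than eta.
print(momentum_step_size(constant, eta, alpha), -eta / (1 - alpha))
# Effective step eta / (1 + alpha), roughly half of eta.
print(abs(momentum_step_size(alternating, eta, alpha)), eta / (1 + alpha))
```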

SLIDE 21

Newton's method

One can also use Hessian information for optimization. As an example, consider a quadratic approximation to E around $w_0$:

$$E(w) = E(w_0) + b^T (w - w_0) + \frac{1}{2} (w - w_0)^T H (w - w_0)$$

$$b_i = \frac{\partial E(w_0)}{\partial w_i}, \qquad H_{ij} = \frac{\partial^2 E(w_0)}{\partial w_i \partial w_j}$$

$$\nabla E(w) = b + H (w - w_0)$$

We can solve ∇E(w) = 0 and obtain

$$w = w_0 - H^{-1} \nabla E(w_0)$$

This is called Newton's method. The quadratic approximation is exact when E is quadratic, so convergence in one step. Quasi-Newton: Consider only the diagonal of H.
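A minimal sketch of one Newton step on a two-dimensional quadratic shows the one-step convergence. The matrix `H`, the vector `q` and the starting point are made-up values, and the explicit 2x2 inverse is used only to keep the example free of dependencies.

```python
def newton_step(w0, grad, hess):
    """One Newton update w = w0 - H^{-1} grad E(w0) for a 2-dimensional problem."""
    (a, b), (c, d) = hess
    det = a * d - b * c
    g0, g1 = grad
    # Explicit inverse of the 2x2 Hessian applied to the gradient.
    step = ((d * g0 - b * g1) / det, (-c * g0 + a * g1) / det)
    return (w0[0] - step[0], w0[1] - step[1])


# Hypothetical quadratic E(w) = 1/2 w^T H w - q^T w, so grad E(w) = H w - q.
H = ((3.0, 1.0), (1.0, 2.0))
q = (1.0, 1.0)


def grad_E(w):
    return (H[0][0] * w[0] + H[0][1] * w[1] - q[0],
            H[1][0] * w[0] + H[1][1] * w[1] - q[1])


w1 = newton_step((5.0, -4.0), grad_E((5.0, -4.0)), H)
print(w1, grad_E(w1))   # the gradient vanishes (up to rounding) after a single step
```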

SLIDE 22

Line search

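Another solution is line optimisation:

$$w_1 = w_0 + \lambda_0 d_0, \qquad d_0 = -\nabla E(w_0)$$

$\lambda_0 > 0$ is found by a one-dimensional optimisation:

$$0 = \frac{\partial}{\partial \lambda_0} E(w_0 + \lambda_0 d_0) = d_0 \cdot \nabla E(w_1) = -d_0 \cdot d_1$$

with $d_1 = -\nabla E(w_1)$ the next search direction. Therefore, subsequent search directions are orthogonal.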
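The orthogonality of successive directions can be verified on a quadratic, for which the line optimisation has a closed form. The quadratic (matrix `A`, vector `q`) and the starting point below are assumptions for illustration.

```python
A = ((3.0, 1.0), (1.0, 2.0))       # hypothetical positive definite Hessian
q = (1.0, 1.0)


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def matvec(M, v):
    return (dot(M[0], v), dot(M[1], v))


def grad_E(w):
    """Gradient of E(w) = 1/2 w^T A w - q^T w."""
    Aw = matvec(A, w)
    return (Aw[0] - q[0], Aw[1] - q[1])


w = (5.0, -4.0)
directions = []
for _ in range(3):
    g = grad_E(w)
    d = (-g[0], -g[1])                          # d = -grad E(w)
    lam = dot(g, g) / dot(d, matvec(A, d))      # exact minimiser of E(w + lam d) for a quadratic
    w = (w[0] + lam * d[0], w[1] + lam * d[1])
    directions.append(d)

print([dot(directions[i], directions[i + 1]) for i in range(2)])   # approximately zero
```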

SLIDE 23

Conjugate gradient descent

We choose as new direction a combination of the gradient and the old direction

$$d'_1 = -\nabla E(w_1) + \beta d_0$$

Line optimisation $w_2 = w_1 + \lambda_1 d'_1$ yields $\lambda_1 > 0$ such that $d'_1 \cdot \nabla E(w_2) = 0$.

The direction $d'_1$ is found by demanding that $\nabla E(w_2) \approx 0$ also in the 'old' direction $d_0$:

$$0 = d_0 \cdot \nabla E(w_2) \approx d_0 \cdot \left( \nabla E(w_1) + \lambda_1 H(w_1) d'_1 \right)$$

or

$$d_0^T H(w_1) d'_1 = 0$$

The subsequent search directions $d_0, d'_1$ are said to be conjugate.

SLIDE 24

Polak-Ribiere rule

The conjugate directions can be computed without computing the Hessian matrix, for instance using the Polak-Ribiere rule [1]:

$$\beta = \frac{(\nabla E(w_1) - \nabla E(w_0)) \cdot \nabla E(w_1)}{|\nabla E(w_0)|^2}$$

For quadratic problems, it can be proven that this rule keeps the last n directions all mutually conjugate [Press et al., 1996]

$$d_i^T H d_j = 0, \qquad i \ne j, \qquad i, j = 1, \ldots, n$$

[1] We need $0 = d_0^T H(w_1) d'_1$. We use $\nabla E(w_0) \approx \nabla E(w_1) + (w_0 - w_1)^T H(w_1) = \nabla E(w_1) - \lambda_0 d_0^T H(w_1)$ and $d'_1 = -\nabla E(w_1) + \beta d_0$. Then

$$0 = \lambda_0 d_0^T H(w_1) d'_1 = (\nabla E(w_1) - \nabla E(w_0)) \cdot (-\nabla E(w_1) + \beta d_0) = -(\nabla E(w_1) - \nabla E(w_0)) \cdot \nabla E(w_1) + \beta |\nabla E(w_0)|^2$$

where in the last step we used that $d_0 \cdot \nabla E(w_1) = 0$.
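A sketch of two conjugate gradient steps with the Polak-Ribiere rule on a two-dimensional quadratic is given below. The quadratic (`A`, `q`), the starting point and the closed-form line optimisation are assumptions for illustration; for a general E the line optimisation would be done numerically.

```python
A = ((3.0, 1.0), (1.0, 2.0))       # hypothetical positive definite Hessian H
q = (1.0, 1.0)


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def matvec(M, v):
    return (dot(M[0], v), dot(M[1], v))


def grad_E(w):
    """Gradient of E(w) = 1/2 w^T A w - q^T w."""
    Aw = matvec(A, w)
    return (Aw[0] - q[0], Aw[1] - q[1])


def line_minimise(w, d):
    """Exact line optimisation along d (closed form, valid for a quadratic E only)."""
    lam = -dot(d, grad_E(w)) / dot(d, matvec(A, d))
    return (w[0] + lam * d[0], w[1] + lam * d[1])


w0 = (5.0, -4.0)
g0 = grad_E(w0)
d0 = (-g0[0], -g0[1])
w1 = line_minimise(w0, d0)

g1 = grad_E(w1)
beta = dot((g1[0] - g0[0], g1[1] - g0[1]), g1) / dot(g0, g0)   # Polak-Ribiere rule
d1 = (-g1[0] + beta * d0[0], -g1[1] + beta * d0[1])
w2 = line_minimise(w1, d1)

print(dot(d0, matvec(A, d1)))   # conjugacy d0^T H d1, approximately zero
print(w2, grad_E(w2))           # in 2 dimensions the minimum is reached after 2 line searches
```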

SLIDE 25

Stochastic gradient descent

One can also consider on-line learning, where only one or a subset of training patterns is considered for computing ∇E.

$$E(w) = \sum_n E_n(w), \qquad w_{t+1} = w_t - \alpha_t \nabla E_n(w_t)$$

May be efficient for large data sets. This results in a stochastic dynamics in w that can help to escape local minima.

SLIDE 26

Robbins Monro

Consider the problem to find x such that

$$M(x) = a, \qquad M_i(x) = \langle N_i(x, \xi) \rangle = \int d\xi \, p(\xi) N_i(x, \xi)$$

x, a, M, N are vectors. $N_i(x, \xi)$ is some non-linear function, $p(\xi)$ is a probability distribution and $a_i$ a constant. Method of stochastic approximation originally due to Robbins and Monro 1951:

  • Initialize $x_0$ randomly
  • For t = 0, . . .: choose $\xi_t \sim p(\xi)$; update $x_{t+1} = x_t + \alpha_t (a - N(x_t, \xi_t))$

If $M_i(x) = \nabla_i E$ and E is convex and $x^*$ the unique solution, then one can prove that $\| x_t - x^* \|^2 \to 0$, provided that

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

For instance $\alpha_t = 1/t$.
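A simple instance, chosen here for illustration, is estimating a mean: take $N(x, \xi) = x - \xi$ and $a = 0$, so that $M(x) = x - \langle \xi \rangle$ and $x^* = \langle \xi \rangle$. With $\alpha_t = 1/t$ (steps counted from 1) the iteration reproduces the running sample mean. The Gaussian samples and the seed are assumptions.

```python
import random

random.seed(0)
samples = [random.gauss(3.0, 1.0) for _ in range(1000)]   # hypothetical draws xi_t ~ p(xi)

# Solve M(x) = <N(x, xi)> = a with N(x, xi) = x - xi and a = 0, i.e. x* = <xi>.
a = 0.0
x = 10.0                                  # arbitrary initial value x_0
for t, xi in enumerate(samples):
    alpha = 1.0 / (t + 1)                 # alpha_t = 1/t, counting steps from 1
    x = x + alpha * (a - (x - xi))        # x_{t+1} = x_t + alpha_t (a - N(x_t, xi_t))

print(x, sum(samples) / len(samples))     # the iterate equals the running sample mean
```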

SLIDE 27

Stochastic gradient descent

Denote training error

$$E(w) = \frac{1}{P} \sum_\mu E^\mu(w)$$

we wish to find solution of

$$\nabla E(w) = \frac{1}{P} \sum_\mu \nabla E^\mu(w) = 0$$

This is an instance of the Robbins-Monro problem with $\xi = \mu = 1, \ldots, P$ and

$$p(\mu) = \frac{1}{P}, \qquad a_i = 0, \qquad N(w, \mu) = \nabla E^\mu(w)$$

The SGD method is

  • Choose a random pattern $\mu \in \{1, \ldots, P\}$
  • Update $w_{t+1} = w_t - \eta_t \nabla E^\mu(w_t)$
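The two steps can be sketched for a one-parameter least-squares problem. The data, the seed and the step-size schedule `eta_t = 0.1 / (1 + t / 100)` are assumptions for illustration; the schedule is one choice that satisfies the Robbins-Monro conditions.

```python
import random

random.seed(1)
xs = [1.0, 2.0, 3.0]
ts = [2.0 * x for x in xs]                 # hypothetical noise-free targets, true weight w = 2


def grad_E_mu(w, mu):
    """Gradient of the per-pattern error E^mu(w) = 1/2 (w x^mu - t^mu)^2."""
    return (w * xs[mu] - ts[mu]) * xs[mu]


w = 0.0
for t in range(2000):
    mu = random.randrange(len(xs))         # choose a random pattern mu
    eta_t = 0.1 / (1.0 + t / 100.0)        # sum eta_t diverges, sum eta_t^2 converges
    w = w - eta_t * grad_E_mu(w, mu)       # w_{t+1} = w_t - eta_t grad E^mu(w_t)

print(w)                                   # close to 2.0
```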

SLIDE 28

For extensions of SGD and comparisons, see [Sohl-Dickstein et al., 2013].

SLIDE 29

5.1

Feed-forward Network functions

We extend the previous regression model with fixed basis functions

$$y(x, w) = f\left( \sum_{j=1}^{M} w_j \phi_j(x) \right)$$

to a model where $\phi_j$ is adaptive:

$$\phi_j(x) = h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right)$$

SLIDE 30

5.1

Feed-forward Network functions

In the case of K outputs

$$y_k(x, w) = h_2\left( \sum_{j=1}^{M} w^{(2)}_{kj} h_1\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)$$

$h_2(x)$ is $\sigma(x)$ or x depending on the problem. $h_1(x)$ is $\sigma(x)$ or $\tanh(x)$.

[Figure: Left) Two layer architecture with inputs x0 ... xD, hidden units z0 ... zM, outputs y1 ... yK and weights w(1), w(2). Right) general feed-forward network with skip-layer connections.]

If $h_1, h_2$ are linear, the model is linear. If M < D, K it computes principal components (Bishop section 12.4.2).
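The network function can be written out directly. This is a minimal sketch with made-up weights; the function name `forward`, the list-of-rows weight layout and the default choices `h1 = tanh`, `h2 = identity` are assumptions made here.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, W2, h1=math.tanh, h2=lambda a: a):
    """Two-layer network y_k = h2(sum_j W2[k][j] * h1(sum_i W1[j][i] * x_i)).

    x includes the constant input x_0 = 1 as its first component, so the
    first column of W1 holds the hidden biases. Output biases are omitted,
    matching the sum over j = 1..M in the formula on this slide.
    """
    z = [h1(sum(wji * xi for wji, xi in zip(row, x))) for row in W1]     # hidden units
    return [h2(sum(wkj * zj for wkj, zj in zip(row, z))) for row in W2]  # outputs


# Hypothetical weights: D = 2 inputs (plus x_0), M = 3 hidden units, K = 2 outputs.
W1 = [[0.1, 0.5, -0.3], [0.0, -1.0, 0.8], [0.2, 0.3, 0.3]]
W2 = [[1.0, -2.0, 0.5], [0.3, 0.3, -1.0]]
x = [1.0, 0.4, -0.7]

print(forward(x, W1, W2))                  # linear outputs (regression)
print(forward(x, W1, W2, h2=sigmoid))      # sigmoid outputs (classification)
```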

SLIDE 31

5.1

Feed-forward Network functions

[Figure: four function approximation panels] Two layer NN with 3 'tanh' hidden units and linear output can approximate many functions. x ∈ [−1, 1], 50 equally spaced points. From left to right: f(x) = x², sin(x), |x|, Θ(x). Dashed lines are outputs of the 3 hidden units.

[Figure: two-class classification example] Two layer NN with two inputs and 2 'tanh' hidden units and sigmoid output for classification. Dashed lines are hidden unit activities.

Feed-forward neural networks have good approximation properties.

SLIDE 32

5.1.1

Weight space symmetries

For any solution of the weights, there are many equivalent solutions due to symmetry:

  • for any hidden unit j with tanh activation function, change $w_{ji} \to -w_{ji}$ and $w_{kj} \to -w_{kj}$: $2^M$ solutions
  • rename the hidden unit labels: M! solutions

Thus a total of $M! \, 2^M$ equivalent solutions, not only for tanh activation functions.
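Both symmetries can be checked numerically. The small network below (tanh hidden units, one linear output, no biases) and its weights are assumptions for illustration.

```python
import math


def forward(x, W1, W2):
    """Linear-output network with tanh hidden units (no biases, for brevity)."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [sum(w * zj for w, zj in zip(row, z)) for row in W2]


W1 = [[0.5, -0.3], [-1.0, 0.8], [0.3, 0.3]]     # hypothetical weights w_ji, M = 3 hidden units
W2 = [[1.0, -2.0, 0.5]]                         # hypothetical weights w_kj, one output
x = [0.4, -0.7]

# Sign flip of hidden unit j = 1: w_ji -> -w_ji and w_kj -> -w_kj.
W1_flip = [[-w for w in row] if j == 1 else row for j, row in enumerate(W1)]
W2_flip = [[-w if j == 1 else w for j, w in enumerate(row)] for row in W2]

# Relabelling: swap hidden units 0 and 2 in both layers.
perm = [2, 1, 0]
W1_perm = [W1[j] for j in perm]
W2_perm = [[row[j] for j in perm] for row in W2]

# All three networks give the same output (up to rounding).
print(forward(x, W1, W2), forward(x, W1_flip, W2_flip), forward(x, W1_perm, W2_perm))
```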

SLIDE 33

5.3.1

Error backpropagation

Error is sum of error per pattern

$$E(w) = \sum_n E_n(w), \qquad E_n(w) = \frac{1}{2} \| y(x^n, w) - t^n \|^2$$

$$y_k(x, w) = h_2\left( w_{k0} + \sum_{j=1}^{M} w_{kj} h_1\left( w_{j0} + \sum_{i=1}^{D} w_{ji} x_i \right) \right) = h_2(a_k)$$

$$a_k = w_{k0} + \sum_{j=1}^{M} w_{kj} h_1(a_j) = \sum_{j=0}^{M} w_{kj} h_1(a_j), \qquad h_1(a_0) = 1$$

$$a_j = w_{j0} + \sum_{i=1}^{D} w_{ji} x_i = \sum_{i=0}^{D} w_{ji} x_i, \qquad x_0 = 1$$

i labels inputs, j labels hiddens, k labels outputs.

SLIDE 34

5.3.1

Error backpropagation

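We do each pattern separately, so we consider $E_n$

$$y_k(x^n, w) = h_2(a^n_k) = h_2\left( \sum_{j=0}^{M} w_{kj} h_1(a^n_j) \right) = h_2\left( \sum_{j=0}^{M} w_{kj} h_1\left( \sum_{i=0}^{D} w_{ji} x^n_i \right) \right)$$

$$\frac{\partial E_n}{\partial w_{kj}} = (y^n_k - t^n_k) \frac{\partial y^n_k}{\partial w_{kj}} = (y^n_k - t^n_k) h_2'(a^n_k) \frac{\partial a^n_k}{\partial w_{kj}} = (y^n_k - t^n_k) h_2'(a^n_k) h_1(a^n_j) = \delta^n_k h_1(a^n_j)$$

$$\delta^n_k = (y^n_k - t^n_k) h_2'(a^n_k)$$

$$\frac{\partial E_n}{\partial w_{ji}} = \sum_{k=1}^{K} (y^n_k - t^n_k) \frac{\partial y^n_k}{\partial w_{ji}} = \sum_{k=1}^{K} (y^n_k - t^n_k) h_2'(a^n_k) \frac{\partial a^n_k}{\partial w_{ji}} = \sum_{k=1}^{K} \delta^n_k w_{kj} h_1'(a^n_j) \frac{\partial a^n_j}{\partial w_{ji}} = \sum_{k=1}^{K} \delta^n_k w_{kj} h_1'(a^n_j) x^n_i = \delta^n_j x^n_i$$

$$\delta^n_j = h_1'(a^n_j) \sum_{k=1}^{K} \delta^n_k w_{kj}$$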

SLIDE 35

5.3.1

Error backpropagation

[Figure: units z_i and z_j connected by weights w_ji and w_kj; the errors δ_1, ..., δ_k are propagated back to δ_j]

The back propagation extends to arbitrary layers:

  • 1. Set $z^n_i = x^n_i$ and forward propagate all activations $z^n_j = h_1(a^n_j)$ and $z^n_k = h_2(a^n_k)$, etc.
  • 2. Compute the $\delta^n_k$ for the output units, and back-propagate the δ to obtain $\delta^n_j$ for each hidden unit j
  • 3. $\partial E_n / \partial w_{kj} = \delta^n_k z^n_j$ and $\partial E_n / \partial w_{ji} = \delta^n_j z^n_i$
  • 4. for batch mode, $\partial E / \partial w_{ji} = \sum_n \partial E_n / \partial w_{ji}$

E is a function of O(|w|) variables. In general, the computation of E requires O(|w|) operations. Computing ∇E naively, one partial derivative at a time, would thus require O(|w|²) operations.

The backpropagation method allows one to compute ∇E efficiently, in O(|w|) operations.
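The steps can be sketched for a single pattern and compared with a finite-difference estimate. The network below (tanh hidden units, linear outputs, no bias unit in the hidden layer) and its weights are assumptions for illustration, not the course's reference implementation.

```python
import math


def forward(x, W1, W2):
    """tanh hidden layer, linear outputs; returns hidden activities z and outputs y."""
    z = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sum(w * zj for w, zj in zip(row, z)) for row in W2]
    return z, y


def error(x, t, W1, W2):
    """E_n = 1/2 |y - t|^2 for a single pattern."""
    _, y = forward(x, W1, W2)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))


def backprop(x, t, W1, W2):
    """Gradients of E_n with respect to W1 (w_ji) and W2 (w_kj)."""
    z, y = forward(x, W1, W2)
    delta_k = [yk - tk for yk, tk in zip(y, t)]                 # h2 linear, so h2' = 1
    delta_j = [(1.0 - zj ** 2) * sum(dk * W2[k][j] for k, dk in enumerate(delta_k))
               for j, zj in enumerate(z)]                       # tanh' = 1 - tanh^2
    gW2 = [[dk * zj for zj in z] for dk in delta_k]             # dE_n/dw_kj = delta_k z_j
    gW1 = [[dj * xi for xi in x] for dj in delta_j]             # dE_n/dw_ji = delta_j z_i
    return gW1, gW2


# Hypothetical pattern and weights; x[0] = 1 is the bias input.
x, t = [1.0, 0.4, -0.7], [0.5, -1.0]
W1 = [[0.1, 0.5, -0.3], [0.0, -1.0, 0.8], [0.2, 0.3, 0.3]]
W2 = [[1.0, -2.0, 0.5], [0.3, 0.3, -1.0]]
gW1, gW2 = backprop(x, t, W1, W2)

# Finite-difference check of one hidden-layer weight, w_ji with j = 1, i = 2.
eps = 1e-6
Wp = [row[:] for row in W1]
Wm = [row[:] for row in W1]
Wp[1][2] += eps
Wm[1][2] -= eps
numeric = (error(x, t, Wp, W2) - error(x, t, Wm, W2)) / (2 * eps)
print(gW1[1][2], numeric)   # the two values agree up to the finite-difference error
```

The finite-difference check needs two evaluations of E per weight, which is the O(|w|²) cost mentioned above; `backprop` obtains all derivatives from one forward and one backward pass.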

SLIDE 36

5.5

Regularization

[Figure: network fits for M = 1, M = 3 and M = 10 hidden units]

Complexity of neural network solution is controlled by number of hidden units

[Figure: sum squared test error versus number of hidden units] Sum squared test error for different number of hidden units and different weight initializations. Error is also affected by local minima.

SLIDE 37

Part of the cause of local minima is the saturation of the sigmoid functions $\tanh(\sum_j w_{ij} x_j)$. When $w_{ij}$ becomes large, any change in its value hardly affects the output, implying $\nabla_{ij} E \approx 0$. One can partly prevent this from happening by

  • choosing tanh instead of σ transfer functions and scaling inputs and outputs to mean zero and standard deviation one
  • proper initialisation of $w_{ij}$ with mean zero and standard deviation of order $1/\sqrt{n_1}$, with $n_1$ the number of inputs to neuron i.
  • adding a regularizer such as $\sum_i w_i^2$ to the cost, which keeps weights small
  • dropout, other transfer functions, adding noise, ....

SLIDE 38

MLPs are universal approximators

Consider $2^n$ binary patterns in n dimensions and two classes:

$$x^\mu \to c^\mu = \pm 1, \qquad x^\mu_i = \pm 1$$

Use $2^n$ hidden units, labeled $j = 0, \ldots, 2^n - 1$; i labels the inputs. Set

$$w_{ji} = \begin{cases} b & \text{if the ith digit in the binary representation of } j \text{ is 1} \\ -b & \text{else} \end{cases}$$

For n = 2:

j    binary    w_j1    w_j2
0    00        −b      −b
1    01        −b      b
2    10        b       −b
3    11        b       b

x1    x2    Σ_i w_0i x_i    Σ_i w_1i x_i    Σ_i w_2i x_i    Σ_i w_3i x_i
−1    −1    2b              0               0               −2b
−1    1     0               2b              −2b             0
1     −1    0               −2b             2b              0
1     1     −2b             0               0               2b

SLIDE 39

MLPs are universal approximators

Use a threshold of (n − 1)b at each hidden unit: $z_j = \Theta[\sum_i w_{ji} x_i - (n - 1) b]$. The remaining problem has $p = 2^n$ patterns in $2^n$ dimensions and is linearly separable. Define $c = \mathrm{sign}[\sum_{j=0}^{3} w_j z_j]$.

x1    x2    z0    z1    z2    z3    c
−1    −1    1     0     0     0     sign[w0]
−1    1     0     1     0     0     sign[w1]
1     −1    0     0     1     0     sign[w2]
1     1     0     0     0     1     sign[w3]

The combination of linear summation and non-linear functions can create many different functions.

  • The MLP with a single hidden layer can approximate any continuous function [Cybenko, 1989, Hornik et al., 1989]
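The construction can be checked for arbitrary n. In the sketch below the function name `build_and_classify`, the value b = 1 and the parity coloring used as a test case are assumptions; the output weight of hidden unit j is set to the class of the pattern that unit detects.

```python
from itertools import product


def build_and_classify(n, colouring, b=1.0):
    """Check the 2^n hidden unit construction on all +/-1 patterns in n dimensions.

    colouring maps each pattern (a tuple of +/-1) to its class c = +/-1.
    """
    patterns = list(product((-1, 1), repeat=n))
    # Hidden unit j is tuned to pattern j: w_ji = b where the pattern has +1, else -b.
    W = [[b * xi for xi in p] for p in patterns]
    # Output weight w_j is the class of the pattern that hidden unit j detects.
    w_out = [colouring[p] for p in patterns]
    results = {}
    for x in patterns:
        # z_j = Theta(sum_i w_ji x_i - (n - 1) b): only the matching unit exceeds the threshold.
        z = [1 if sum(wji * xi for wji, xi in zip(row, x)) - (n - 1) * b > 0 else 0
             for row in W]
        s = sum(wj * zj for wj, zj in zip(w_out, z))
        results[x] = (sum(z), 1 if s >= 0 else -1)
    return results


# Parity (XOR for n = 2) is not linearly separable in the inputs, yet is reproduced.
n = 3
parity = {p: (1 if sum(xi == 1 for xi in p) % 2 else -1)
          for p in product((-1, 1), repeat=n)}
out = build_and_classify(n, parity)
print(all(active == 1 and c == parity[p] for p, (active, c) in out.items()))   # True
```

The price of this generality is the exponential number $2^n$ of hidden units.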

SLIDE 40

Convolution networks

A special architecture for images using 4 ideas:

  • local connections: do not fully connect the bipartite graph between two layers.
  • weight sharing: use the same parameters for different neurons that detect the same feature (feature layer)
  • (max) pooling: down sample each feature map (not adaptive)
  • many layers: repeat many times.

Fukushima 1982, LeCun 1990
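The first three ideas can be illustrated with a tiny example in plain Python. The 5x5 image, the 2x2 edge-detecting kernel and the function names `conv2d` and `max_pool` are assumptions for illustration; practical networks learn the kernels and use optimized libraries.

```python
def conv2d(image, kernel):
    """Valid cross-correlation: every output unit sees only a local patch (local
    connections) and all output units use the same kernel (weight sharing)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(len(image[0]) - kw + 1)]
            for r in range(len(image) - kh + 1)]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; has no adaptive parameters."""
    return [[max(fmap[r + a][c + b] for a in range(size) for b in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


# Hypothetical 5x5 image with a vertical edge, and a 2x2 vertical-edge detector.
image = [[0, 0, 1, 1, 1]] * 5
kernel = [[-1, 1], [-1, 1]]
feature_map = conv2d(image, kernel)      # 4x4 feature map, responds only at the edge
pooled = max_pool(feature_map)           # 2x2 after pooling
print(feature_map[0], pooled)            # [0, 2, 0, 0] [[2, 0], [2, 0]]
```

Stacking several such convolution and pooling stages gives the "many layers" part of the architecture.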

SLIDE 41

Imagenet 2012 competition

Imagenet is an annual computer vision competition. Classify images in 1000 classes. 1.2 million training images, 50,000 validation images, and 150,000 testing images.

SLIDE 42

Interpreting images

Vinyals et al. 2014

SLIDE 43

Deep learning papers

Fukushima Neocognitron 1982:

https://www.sciencedirect.com/science/article/pii/0031320382900243

LeCun Convolutional networks 1990:

http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation- pdf

Krizhevsky Imagenet 2012:

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional- pdf

Vinyals Interpreting images 2014:

http://arxiv.org/abs/1502.03044

LeCun, Bengio, Hinton. Deep learning 2015:

https://www.nature.com/articles/nature14539.pdf
