Class #21: Back-Propagation; Tuning Hyper-Parameters

Machine Learning (COMP 135): M. Allen, 06 Apr. 20


Learning in Neural Networks

} A neural network can learn a classification function by adjusting its weights to compute different responses

} This process is another version of gradient descent: the algorithm moves through a complex space of partial solutions, always seeking to minimize overall error

Monday, 6 Apr. 2020 Machine Learning (COMP 135) 2

[Network diagram: inputs 1 and 2 feed hidden nodes 3 and 4 via weights w1,3, w1,4, w2,3, w2,4; hidden nodes feed outputs 5 and 6 via weights w3,5, w3,6, w4,5, w4,6.]

Back-Propagation (Hinton, et al.)

function BACK-PROP-LEARNING(examples, network) returns a neural network
   inputs: examples, a set of examples, each with input vector x and output vector y
           network, a multilayer network with L layers, weights w_i,j, activation function g
   local variables: Δ, a vector of errors, indexed by network node
   repeat
      for each weight w_i,j in network do
         w_i,j ← a small random number
      for each example (x, y) in examples do
         /* Propagate the inputs forward to compute the outputs */
         for each node i in the input layer do a_i ← x_i
         for ℓ = 2 to L do
            for each node j in layer ℓ do
               in_j ← Σ_i w_i,j a_i
               a_j ← g(in_j)
         /* Propagate deltas backward from output layer to input layer */
         for each node j in the output layer do
            Δ[j] ← g′(in_j) × (y_j − a_j)
         for ℓ = L − 1 to 1 do
            for each node i in layer ℓ do
               Δ[i] ← g′(in_i) Σ_j w_i,j Δ[j]
         /* Update every weight in network using deltas */
         for each weight w_i,j in network do
            w_i,j ← w_i,j + α × a_i × Δ[j]
   until some stopping criterion is satisfied
   return network


  • Initial random weights
  • Loop over all training examples, generating the output, and then updating weights based on error
  • Stop when weights converge, or error is minimized

Source: Russell & Norvig, AI: A Modern Approach (Prentice Hall, 2010)
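The pseudocode above can be turned into a short, self-contained Python sketch. This is a hedged illustration, not Russell & Norvig's reference code: the weight layout (`w[l][j]` holds a bias followed by the incoming weights for node j of layer l+1), the uniform(−0.5, 0.5) initialization, the learning-rate and epoch defaults, and the `predict` helper are all assumptions of this sketch, and weights are initialized once rather than looping until a convergence test fires.

```python
import math
import random

def g(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Derivative of g, written in terms of the weighted input z."""
    a = g(z)
    return a * (1.0 - a)

def back_prop_learning(examples, sizes, alpha=0.5, epochs=2000, seed=0):
    """Train a fully connected network by back-propagation.

    examples: list of (x, y) pairs; sizes: layer sizes, e.g. [2, 2, 1].
    """
    rng = random.Random(seed)
    # w[l][j] = [bias, weight from node 0, weight from node 1, ...]
    w = [[[rng.uniform(-0.5, 0.5) for _ in range(sizes[l] + 1)]
          for _ in range(sizes[l + 1])]
         for l in range(len(sizes) - 1)]
    for _ in range(epochs):
        for x, y in examples:
            # propagate the inputs forward to compute the outputs
            acts, ins = [list(x)], []
            for layer in w:
                ins.append([ws[0] + sum(wi * ai for wi, ai in zip(ws[1:], acts[-1]))
                            for ws in layer])
                acts.append([g(v) for v in ins[-1]])
            # propagate deltas backward from output layer to input layer
            deltas = [[g_prime(in_j) * (yj - aj)
                       for in_j, yj, aj in zip(ins[-1], y, acts[-1])]]
            for l in range(len(w) - 1, 0, -1):
                deltas.insert(0, [g_prime(ins[l - 1][i]) *
                                  sum(w[l][j][i + 1] * deltas[0][j]
                                      for j in range(len(w[l])))
                                  for i in range(len(ins[l - 1]))])
            # update every weight in network using deltas (bias input is 1)
            for l, layer in enumerate(w):
                for j, ws in enumerate(layer):
                    ws[0] += alpha * deltas[l][j]
                    for i, ai in enumerate(acts[l]):
                        ws[i + 1] += alpha * ai * deltas[l][j]
    return w

def predict(w, x):
    """Forward pass with trained weights; returns the output activations."""
    a = list(x)
    for layer in w:
        a = [g(ws[0] + sum(wi * ai for wi, ai in zip(ws[1:], a))) for ws in layer]
    return a
```

On a trivially separable toy task (map (0, 0) to 0 and (1, 1) to 1), a few thousand epochs drive the two outputs apart, which is enough to see the gradient-descent behavior the slides describe.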


Propagating Output Values Forward

for each example (x, y) in examples do
   /* Propagate the inputs forward to compute the outputs */
   for each node i in the input layer do a_i ← x_i
   for ℓ = 2 to L do
      for each node j in layer ℓ do
         in_j ← Σ_i w_i,j a_i
         a_j ← g(in_j)


At the first ("top") layer, each neuron input is set to the corresponding feature value. Go down layer-by-layer, calculating the weighted input sum for each neuron, and computing its output via the activation function g.


Propagating Error Backward

/* Propagate deltas backward from output layer to input layer */
for each node j in the output layer do
   Δ[j] ← g′(in_j) × (y_j − a_j)
for ℓ = L − 1 to 1 do
   for each node i in layer ℓ do
      Δ[i] ← g′(in_i) Σ_j w_i,j Δ[j]
/* Update every weight in network using deltas */
for each weight w_i,j in network do
   w_i,j ← w_i,j + α × a_i × Δ[j]
until some stopping criterion is satisfied


At the output ("bottom") layer, each delta-value is set to the error on that neuron, multiplied by the derivative of function g. Go bottom-up, setting each delta to the derivative value multiplied by the sum of the deltas at the next layer down (weighting each such value appropriately). After all the delta values are computed, update the weights on every node in the network.


A Back-Propagation Example

} Consider the following simple network, with:
  1. Two inputs
  2. A single hidden layer, consisting of one neuron
  3. Two output neurons
  4. Initial weights as shown

} Suppose we have the following data-point:

(x, y) = ((0.5, 0.4), (1, 0))

[Diagram: inputs I1 and I2 feed hidden neuron H; H feeds outputs O1 and O2; every weight, including each bias weight, is initially 0.1.]

A Back-Propagation Example

} For this data, we start by computing the output of H

} We have the weighted linear sum:

} And, assuming the logistic activation function, we get output:


in_H = 0.1 + (0.1 × 0.5) + (0.1 × 0.4) = 0.19

a_H = 1 / (1 + e^−0.19) ≈ 0.547


A Back-Propagation Example

} Next, we compute the output of each of the two output neurons

} Since each has identical weights, initial outputs will be the same:

in_O = 0.1 + (0.1 × 0.547) = 0.1547

a_O = 1 / (1 + e^−0.1547) ≈ 0.539

a_O1 = a_O2 = 0.539

A Back-Propagation Example

} Given the output vector y = (1, 0), we can compute the error terms for the two output neurons:

e_O1 = 1 − 0.539 = 0.461

e_O2 = 0 − 0.539 = −0.539

A Back-Propagation Example

} The derivative of the activation function at each output neuron is:

g′(in_O) = a_O × (1 − a_O) = 0.539 × 0.461 ≈ 0.2485

} And so we have our Δ values:

Δ[O1] = g′(in_O) × e_O1 = 0.2485 × 0.461 ≈ 0.115

Δ[O2] = g′(in_O) × e_O2 = 0.2485 × (−0.539) ≈ −0.134

A Back-Propagation Example

} Similarly, we can compute the derivative of the activation function and the Δ value for the hidden-layer neuron, H:

g′(in_H) = a_H × (1 − a_H) = 0.547 × 0.453 ≈ 0.248

Δ[H] = g′(in_H) × ( Σ_O w_H,O × Δ[O] )
     = 0.248 × ((0.1 × 0.115) + (0.1 × −0.134))
     = 0.248 × −0.0019 ≈ −0.0005

A Back-Propagation Example

} Finally, all weights in the network can be updated, using the Δ values (here, we will assume that the learning rate α = 1):

w_i,j ← w_i,j + α × a_i × Δ[j]

} The output-layer weights change noticeably: into O1, the bias weight becomes 0.1 + 0.115 = 0.215 and w_H,O1 = 0.1 + (0.547 × 0.115) ≈ 0.1629; into O2, the bias weight becomes 0.1 − 0.134 = −0.034 and w_H,O2 = 0.1 + (0.547 × −0.134) ≈ 0.0267

} Since Δ[H] ≈ −0.0005, the weights into H barely move: bias ≈ 0.0995, w_I1,H = 0.1 + (0.5 × −0.0005) ≈ 0.0998, w_I2,H = 0.1 + (0.4 × −0.0005) ≈ 0.0998

} Exercise for the reader: now that the weights are updated, re-compute the outputs of the network, along with the error values for each output neuron. What do we see?
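The hand computations in this example can be checked numerically. A minimal sketch, assuming the logistic activation throughout and carrying full precision rather than the slides' rounded values (so last digits can differ slightly; note that Δ[H] comes out slightly negative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = (0.5, 0.4), (1.0, 0.0)    # the slides' data-point
w = 0.1                          # every initial weight, including biases

# forward pass
in_H = w + w * x[0] + w * x[1]                 # 0.19
a_H = sigmoid(in_H)                            # ~0.547
a_O1 = a_O2 = sigmoid(w + w * a_H)             # ~0.539

# backward pass: deltas at the outputs, then at H
d_O1 = a_O1 * (1 - a_O1) * (y[0] - a_O1)       # ~ +0.115
d_O2 = a_O2 * (1 - a_O2) * (y[1] - a_O2)       # ~ -0.134
d_H = a_H * (1 - a_H) * (w * d_O1 + w * d_O2)  # ~ -0.0005

# weight updates with learning rate alpha = 1
alpha = 1.0
w_b_O1, w_H_O1 = w + alpha * d_O1, w + alpha * a_H * d_O1
w_b_O2, w_H_O2 = w + alpha * d_O2, w + alpha * a_H * d_O2
w_b_H = w + alpha * d_H
w_I1_H, w_I2_H = w + alpha * x[0] * d_H, w + alpha * x[1] * d_H

# the exercise: re-run the forward pass with the updated weights
a_H2 = sigmoid(w_b_H + w_I1_H * x[0] + w_I2_H * x[1])
a_O1_new = sigmoid(w_b_O1 + w_H_O1 * a_H2)
a_O2_new = sigmoid(w_b_O2 + w_H_O2 * a_H2)
```

Comparing the new output errors against the old ones answers the "what do we see?" question directly.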


Hyperparameters for Neural Networks

} Multi-layer (deep) neural networks involve a number of different possible design choices, each of which can affect classifier accuracy:
  } Number of hidden layers
  } Size of each hidden layer
  } Activation function employed
  } Regularization term (controls over-fitting)

} This is not unique to neural networks
  } Logistic regression: regularization (C parameter in sklearn), class weights, etc.
  } SVM: kernel type, kernel parameters (like polynomial degree), error penalty (C again), etc.

} The question is often how we can tune these model control parameters effectively to find the best combinations


Heldout Cross-Validation

} We can use k-fold cross-validation techniques to estimate the real effectiveness of various parameter settings:
  1. Divide labeled data into k folds, each of size 1/k
  2. Repeat k times:
     a. Hold aside one of the folds; train on the remaining (k − 1); test on the heldout data
     b. Record classification error for both training and heldout data
  3. Average over the k trials

} This can give us a more robust estimate of real effectiveness

} It can also allow us to better detect over-fitting: when average heldout error is significantly worse than average training error, the model has grown too complex or is otherwise problematic
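The steps above can be sketched in a few lines. This is a minimal illustration, not a production harness: the `train_and_test` callback and its `(train_error, test_error)` return shape are assumptions of the sketch (in practice a library routine such as scikit-learn's cross-validation utilities would be used).

```python
import random

def k_fold_error(data, k, train_and_test, seed=0):
    """Average training and held-out error over k folds.

    data: list of labeled examples.
    train_and_test(train, heldout): hypothetical callback that fits the
    model being tuned and returns (train_error, heldout_error).
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k folds of roughly equal size
    train_errs, test_errs = [], []
    for i in range(k):
        heldout = folds[i]                   # hold aside one fold
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        tr, te = train_and_test(train, heldout)
        train_errs.append(tr)
        test_errs.append(te)
    # average over the k trials
    return sum(train_errs) / k, sum(test_errs) / k
```

Comparing the two returned averages is exactly the over-fitting check described above: a held-out average much worse than the training average signals trouble.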


Modifying Model Parameters

} Using heldout validation techniques, we can begin to explore various parts of the hyperparameter space
  } In each case, we try to maximize average performance on the heldout validation data

} For example: the number of layers in a neural network can be explored iteratively, starting with one layer and increasing one at a time (up to some reasonable limit) until over-fitting is detected

} Similarly, we can explore a range of layer sizes, starting with hidden layers of size equal to the number of input features, and increasing in some logarithmic manner until over-fitting occurs, or some practical limit is reached


Using Grid Search for Tuning

} One basic technique is to list out the different values of each parameter that we want to test, and systematically try different combinations of those values
  } For P distinct tuning parameters, this defines a P-dimensional space (or "grid") that we can explore, one combination at a time

} In many cases, since building, training, and testing the models for each combination all take some time, we may find that there are far too many such combinations to try
  } One possibility: many such models can be explored in parallel, allowing large numbers of combinations to be compared at the same time, given sufficient resources
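A minimal sketch of grid search, assuming a caller-supplied `evaluate` function that returns a held-out score to maximize; the parameter names and candidate values here are hypothetical stand-ins:

```python
from itertools import product

# hypothetical parameter grid -- any names and values could be substituted
grid = {
    "hidden_layers": [1, 2, 3],
    "layer_size": [16, 64, 256],
    "activation": ["logistic", "relu", "tanh"],
}

def grid_search(grid, evaluate):
    """Try every combination in the grid; keep the best-scoring one."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    # product walks the P-dimensional grid one combination at a time
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)          # e.g. average held-out accuracy
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Because each `evaluate` call is independent, the loop body is exactly the unit of work that can be farmed out in parallel, as noted above.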


Costs of Grid Search

} When we have large numbers of combinations of possible parameters, we may decide to limit the range of some of the parts of our "grid" for feasibility

} For example, we might try:
  1. # Hidden layers: 1, 2, …, 10
  2. Layer size: N, 2N, 5N, 10N, 20N (N: # input features)
  3. Activation: Sigmoid, ReLU, tanh
  4. Regularization (alpha): 10^−5, 10^−3, 10^−1, 10^1, 10^3

} Produces (10 × 5 × 3 × 5) = 750 different models

} If we are doing 10-fold validation, we need to run 7,500 total tests

} Still only a small fragment of the possible parameter-space
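The cost arithmetic above is easy to sketch directly:

```python
from math import prod

# values tried per hyperparameter in the slide's example grid
counts = {"hidden_layers": 10, "layer_size": 5, "activation": 3, "alpha": 5}
models = prod(counts.values())   # 10 * 5 * 3 * 5 = 750 models
tests = models * 10              # 7,500 training runs under 10-fold validation
```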


Random Search

} Instead of limiting our grid even further, or trying to spend even more time on more combinations, we might try to randomize the process

} Instead of limiting values, we choose randomly from any of a (larger) range of values:
  1. # Hidden layers: [1, 20]
  2. Layer size: [8, 1024]
  3. Activation: [Sigmoid, ReLU, tanh]
  4. Regularization (alpha): [10^−7, 10^7]

} For each of these, we assign a probability distribution over its values (uniform or otherwise)
  } We may presume these distributions are independent of one another

} For T tests, we sample each of the ranges for one possible value, giving us T different combinations of those values
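Sampling T combinations can be sketched as follows. The parameter names mirror the list above; the distributions (uniform over each range, log-uniform for alpha) and the `evaluate` scoring callback are assumptions of the sketch:

```python
import random

def sample_params(rng):
    """Draw one combination, sampling each range independently."""
    return {
        "hidden_layers": rng.randint(1, 20),
        "layer_size": rng.randint(8, 1024),
        "activation": rng.choice(["logistic", "relu", "tanh"]),
        # log-uniform over [1e-7, 1e7]: uniform in the exponent
        "alpha": 10.0 ** rng.uniform(-7, 7),
    }

def random_search(T, evaluate, seed=0):
    """Evaluate T independently sampled combinations; keep the best."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(T):
        params = sample_params(rng)
        score = evaluate(params)          # e.g. average held-out accuracy
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

Unlike a grid, every trial here draws a fresh value for every parameter, which is why random search rarely wastes trials re-testing the same value of an important parameter.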


Performance of Random Search

} This technique can sometimes out-perform grid search

} When using a grid, it is sometimes possible that we just miss some intermediate, and important, value completely

} The random approach can often hit upon the better combinations with the same (or far less) testing involved

[Figure 1 (two panels, "Grid Layout" and "Random Layout"; axes: important parameter vs. unimportant parameter): Grid and random search of nine trials for optimizing a function f(x, y) = g(x) + h(y) ≈ g(x) with low effective dimensionality. Above each square g(x) is shown in green, and left of each square h(y) is shown in yellow. With grid search, nine trials only test g(x) in three distinct places. With random search, all nine trials explore distinct values of g. This failure of grid search is the rule rather than the exception in high-dimensional hyper-parameter optimization. From: J. Bergstra & Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research 13 (2012).]


Performance of Random Search

[Figure: accuracy on rotated MNIST (y-axis, 0.3–1.0) vs. experiment size in trials (x-axis, 1–32). From: J. Bergstra & Y. Bengio, "Random search for hyper-parameter optimization," Journal of Machine Learning Research 13 (2012).]

} Performance for grid search over 100 different neural network parameter combinations

} Statistically significant improvement for as few as 8 randomly chosen combination models


This Week & Upcoming

} Topics: Neural Networks, Reinforcement Learning

} Homework 04: due Monday, 13 April, 5:00 PM

} Project 02: due Monday, 27 April, 5:00 PM
  } Sentiment analysis in review text
  } Uses two different models of textual data

} Office Hours:
  } Hours and Zoom links can be found on Piazza and Canvas
