Machine Learning Tricks
Philipp Koehn, 13 October 2020

Machine Learning

Myth of machine learning:
– given: real world examples
– automatically build model
– make predictions
– do not worry about specific properties of the problem
– deep learning automatically discovers the features
Why are you telling us all this madness?
[Figure: error(λ) as a function of a parameter λ]
A learning rate that is too high may lead to overly drastic parameter updates → overshooting the optimum
[Figure: error(λ) curve with a long plateau]
Bad initialization may require many updates to escape a plateau
[Figure: error(λ) curve with a local optimum next to the global optimum]
Local optima can trap training
Adjust the learning rate over time:
– start with a larger value (big changes at the beginning)
– reduce it over time (minor adjustments to refine the model)
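A minimal sketch of such a decaying schedule in Python (the initial rate, decay factor, and interval are illustrative assumptions, not values from the lecture):

    def learning_rate(epoch, initial=0.1, decay=0.5, interval=10):
        """Start with a large learning rate, halve it every few epochs."""
        return initial * decay ** (epoch // interval)

    for epoch in [0, 10, 20, 30]:
        print(epoch, learning_rate(epoch))  # 0.1, 0.05, 0.025, 0.0125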
[Figure: derivative of the sigmoid]
The derivative of the sigmoid is near zero for large positive and negative values
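A quick check in Python, using the standard identity sigmoid′(x) = sigmoid(x)(1 − sigmoid(x)) (the identity is basic calculus, not stated on the slide):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def sigmoid_derivative(x):
        s = sigmoid(x)
        return s * (1.0 - s)  # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

    for x in [-10, -2, 0, 2, 10]:
        print(x, sigmoid_derivative(x))  # near 0 at the extremes, 0.25 at x = 0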
[Figure: derivative of the ReLU]
The derivative of the ReLU is flat over a large interval: the gradient is 0
"Dead cells": elements in the output that are always 0, no matter the input
[Figure: error(λ) curve with local and global optima]
The error surface is:
– high-dimensional
– shaped by complex interactions between individual parameter changes
– "bumpy"
[Figure: unrolled recurrent neural network (RNN) over a sequence]
[Figure: under-fitting vs. good fit vs. over-fitting]
Assumption: training examples are independent and identically distributed (i.i.d.)
– avoid undue structure in the training data
– avoid undue structure in the initial weight setting
– fit properties of the training data
– otherwise, the model should be as random as possible (i.e., have maximum entropy)
– different types of corpora
  ∗ European Parliament proceedings
  ∗ collection of movie subtitles
– temporal structure in each corpus
– similar sentences next to each other (e.g., same story / debate)
→ a stretch of hard examples following easy examples may stop training prematurely
⇒ randomly shuffle the training data (maybe before each epoch)
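A minimal sketch of per-epoch shuffling (`sentence_pairs` and `train_on` are illustrative placeholders, not names from the lecture):

    import random

    def train(sentence_pairs, epochs, train_on):
        for epoch in range(epochs):
            random.shuffle(sentence_pairs)       # new random order every epoch
            for source, target in sentence_pairs:
                train_on(source, target)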
For sigmoid nodes, input values in the range [−1, 1]
⇒ output values in the range [0.269; 0.731]

Initialize weights uniformly in the range

    [ −1/√n, 1/√n ]

where n is the size of the previous layer. Xavier initialization:

    [ −√6 / √(n_j + n_{j+1}), √6 / √(n_j + n_{j+1}) ]

– n_j: size of the previous layer
– n_{j+1}: size of the next layer
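A sketch of Xavier initialization in Python (the layer sizes are illustrative assumptions):

    import math, random

    def xavier_uniform(n_in, n_out):
        """Sample an n_out x n_in weight matrix from the Xavier range."""
        limit = math.sqrt(6.0) / math.sqrt(n_in + n_out)
        return [[random.uniform(-limit, limit) for _ in range(n_in)]
                for _ in range(n_out)]

    weights = xavier_uniform(512, 512)  # e.g., weights between two hidden layers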
Models may become too confident in their predictions
(word prediction probabilities of over 99%)
– decoding: sensible alternatives get low scores, bad for beam search
– training: overfitting is more likely
– in classification tasks, we predict a label
– "label" is a jargon term for any output
→ here, we smooth the word predictions
– the prediction layer produces a number s_i for each word
– these are converted into probabilities using the softmax

    p(y_i) = exp(s_i) / Σ_j exp(s_j)

– smoothing with a temperature T

    p(y_i) = exp(s_i/T) / Σ_j exp(s_j/T)

– a higher temperature gives a smoother distribution
  (i.e., less probability is given to the most likely choice)
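A minimal sketch of softmax with temperature (the scores are illustrative):

    import math

    def softmax(scores, temperature=1.0):
        exps = [math.exp(s / temperature) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    scores = [5.0, 2.0, 1.0]               # illustrative prediction-layer scores
    print(softmax(scores))                 # sharp: most mass on the top word
    print(softmax(scores, temperature=2))  # smoother distribution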
– the truth gives some probability mass to other words (say, 10% of it)
– either uniformly distributed over all words
– or relative to unigram word probabilities
  (relative counts of each word in the target side of the training data)
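A sketch of the uniform variant (the vocabulary size and index are illustrative; the 10% mass follows the example above):

    def smoothed_targets(correct_index, vocab_size, mass=0.1):
        """Distribute `mass` uniformly over all words, the rest to the truth."""
        targets = [mass / vocab_size] * vocab_size
        targets[correct_index] += 1.0 - mass
        return targets

    print(smoothed_targets(correct_index=2, vocab_size=5))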
The learning rate µ is a low number, e.g., µ = 0.001
Reasons to adjust it during training:
– starting with larger updates
– refining weights with smaller updates
– adjusting for other reasons
– accumulate weight updates at each time step t
– some decay rate for the sum (e.g., 0.9)
– combine the momentum term m_{t−1} with the weight update value ∆w_t:

    m_t = 0.9 m_{t−1} + ∆w_t
    w_t = w_{t−1} − µ m_t
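A sketch of this update for a single weight (the learning rate and update sequence are illustrative):

    def momentum_updates(updates, w=0.0, mu=0.001, decay=0.9):
        m = 0.0
        for dw in updates:       # dw plays the role of the update value
            m = decay * m + dw   # m_t = 0.9 m_{t-1} + dw_t
            w = w - mu * m       # w_t = w_{t-1} - mu * m_t
        return w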
→ different learning rate for each parameter
– record gradients for each parameter
– accumulate their squared values over time
– use this sum to reduce the learning rate
– gradient g_t = dE_t/dw of the error E with respect to the weight w
– divide the learning rate µ by the accumulated sum:

    ∆w_t = µ / √( Σ_{τ=1..t} g_τ² ) · g_t

→ large accumulated gradients reduce the learning rate of that weight parameter
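A sketch of this per-parameter scheme (this is the Adagrad update; the small `eps` guarding against division by zero is an added assumption, not in the formula above):

    import math

    def adagrad_updates(gradients, w=0.0, mu=0.1, eps=1e-8):
        accumulated = 0.0
        for g in gradients:
            accumulated += g * g      # sum of squared gradients
            w -= mu / (math.sqrt(accumulated) + eps) * g
        return w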
Combine momentum and adaptive learning rates (as in the Adam optimizer):
– momentum term:

    m_t = β_1 m_{t−1} + (1 − β_1) g_t

– accumulated squared change:

    v_t = β_2 v_{t−1} + (1 − β_2) g_t²
Since m_t and v_t start out at 0, correct for the bias toward 0:

    m̂_t = m_t / (1 − β_1^t),   v̂_t = v_t / (1 − β_2^t)

The correction vanishes over time: lim_{t→∞} 1 / (1 − β^t) = 1
Putting it together:
– learning rate µ
– momentum m̂_t
– accumulated change v̂_t

    ∆w_t = µ / ( √v̂_t + ε ) · m̂_t
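A sketch of the full update for a single weight (the hyperparameter values are common defaults, assumed here):

    import math

    def adam_updates(gradients, w=0.0, mu=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
        m = v = 0.0
        for t, g in enumerate(gradients, start=1):
            m = beta1 * m + (1 - beta1) * g      # momentum term
            v = beta2 * v + (1 - beta2) * g * g  # accumulated squared change
            m_hat = m / (1 - beta1 ** t)         # bias correction
            v_hat = v / (1 - beta2 ** t)
            w -= mu / (math.sqrt(v_hat) + eps) * m_hat
        return w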
– updating after processing the entire training set? (converges slowly)
– updating after each training example? (quicker convergence, but the last training examples have a disproportionately higher impact)
⇒ Compromise: process batches of training examples
– compute all their gradients for individual word prediction errors
– use the sum over each batch to update parameters
→ better parallelization on GPUs
– batch processing may take different amounts of time
– asynchronous training: apply updates when they arrive
– the mismatch between original weights and updates may not matter much
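A sketch of batched updates for a single weight (`gradient(w, example)` and the batch size are illustrative placeholders):

    def train_batched(examples, w, gradient, mu=0.001, batch_size=64):
        for start in range(0, len(examples), batch_size):
            batch = examples[start:start + batch_size]
            total = sum(gradient(w, ex) for ex in batch)  # sum over the batch
            w -= mu * total                               # one update per batch
        return w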
– the goal is not the exact optimum on the training data, but results close to this optimum on unseen test data
– still, training should avoid getting stuck in local optima
– 100s of millions of parameters
– 100s of millions of training examples (individual word predictions)
– adjust the training objective
– add a cost for any non-zero parameter
– typically done with the L2 norm
– the derivative of the L2 norm is the value of the parameter
– if there is no signal from training: reduce the value of the parameter
– also called weight decay
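A one-line sketch of an update with weight decay (`lam` is an assumed regularization strength):

    def update_with_weight_decay(w, gradient, mu=0.001, lam=0.01):
        # derivative of the L2 term (lam/2) * w^2 is lam * w:
        # without a training signal (gradient = 0), w shrinks toward zero
        return w - mu * (gradient + lam * w)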
– learn simple concepts first
– learn more complex material later
– first epochs: only short sentences
– create artificial data by extracting smaller segments
  (similar to phrase pair extraction in statistical machine translation)
– later epochs: all training data
– some properties of the task have been learned
– discovering other properties would take the model too far out of its comfort zone
– the model learned the language model aspects
– but cannot figure out the role of the input sentence
– for each batch, a different random set of nodes is removed
– their values are set to 0 and their weights are not updated
– 10%, 20%, or even 50% of all the nodes
– robustness: redundant nodes play similar roles
– ensemble learning: different subnetworks are different models
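A sketch of dropout applied to one layer's node values (the 20% rate is illustrative; rescaling at test time is omitted):

    import random

    def dropout(values, rate=0.2):
        # zero out a random subset of node values for this batch
        return [0.0 if random.random() < rate else v for v in values]

    hidden = [0.3, 1.2, -0.7, 0.9, 0.1]
    print(dropout(hidden))  # some entries randomly set to 0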
⇒ Limit the total value of the gradients g for a layer to a threshold τ
    L2(g) = √( Σ_j g_j² )

    g′_i = g_i · τ / max(τ, L2(g))
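A sketch of this clipping rule (the threshold value is illustrative):

    import math

    def clip_gradients(g, tau=1.0):
        norm = math.sqrt(sum(x * x for x in g))  # L2(g)
        scale = tau / max(tau, norm)             # 1.0 if the norm is within tau
        return [x * scale for x in g]

    print(clip_gradients([3.0, 4.0]))  # norm 5 -> rescaled to norm 1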
⇒ Normalize node values
Raw values and activations of a layer with H nodes:

    s^l = W h^{l−1}
    h^l = sigmoid(s^l)

Mean and standard deviation of the raw values:

    µ^l = (1/H) Σ_{i=1..H} s_i^l

    σ^l = √( (1/H) Σ_{i=1..H} (s_i^l − µ^l)² )
Normalize:

    ŝ^l = (1/σ^l) (s^l − µ^l)

With a learnable gain g and bias b:

    ŝ^l = (g/σ^l) (s^l − µ^l) + b
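A sketch of this normalization (this is layer normalization; the gain, bias, and small `eps` values are illustrative):

    import math

    def layer_norm(s, gain=1.0, bias=0.0, eps=1e-6):
        H = len(s)
        mu = sum(s) / H                                       # mean
        sigma = math.sqrt(sum((x - mu) ** 2 for x in s) / H)  # std deviation
        return [gain / (sigma + eps) * (x - mu) + bias for x in s]

    print(layer_norm([1.0, 2.0, 3.0, 4.0]))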
⇒ Error propagation has to travel farther
Add connections that skip layers, known as:
– shortcuts
– residual connections
– skip connections
A basic layer:

    y = f(x)

With a residual connection:

    y = f(x) + x

The gradient now always has a direct path:

    y′ = f′(x) + 1
Highway networks: a learnable gate t(x) interpolates between the layer and its input:

    y = t(x) f(x) + (1 − t(x)) x
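A sketch of both connection types around a placeholder layer f (the layer function and gate parameters are illustrative assumptions):

    import math

    def f(x):
        return math.tanh(x)                  # stand-in for a real layer

    def residual(x):
        return f(x) + x                      # y = f(x) + x

    def highway(x, w_t=1.0, b_t=-1.0):
        t = 1.0 / (1.0 + math.exp(-(w_t * x + b_t)))  # gate t(x), a sigmoid
        return t * f(x) + (1.0 - t) * x      # y = t(x) f(x) + (1 - t(x)) x

    print(residual(1.0), highway(1.0))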
[Figure: feed-forward (FF) layers combined with Add (residual) and Gate (highway) connections]
memory_t = gate_input × input_t + gate_forget × memory_{t−1}
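A sketch of this memory update (the gate values here are fixed constants for illustration; in an LSTM they are computed by learned sigmoid gates):

    def memory_update(memory_prev, input_t, gate_input=0.3, gate_forget=0.9):
        # the forget gate lets the previous memory pass through,
        # much like a gated residual connection through time
        return gate_input * input_t + gate_forget * memory_prev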
Word-level training:
– predict one word at a time
– compare against the correct word
– proceed training with the correct word
Sentence-level training:
– predict the entire sequence
– measure the translation with a sentence-level metric (e.g., BLEU)
– a generator proposes a translation
– a discriminator distinguishes between the generator's translation and a human translation
– the generator tries to fool the discriminator
Generator:
– a traditional neural machine translation model
– generates a full sentence translation t for each input sentence x
Discriminator:
– is trained to classify (x, y) as a correct example
– is trained to classify (x, t) as a generated example
Training alternates between:
– the generator, with an additional objective to fool the discriminator
– the discriminator, trained to detect the generator's output as such
– chess playing: the quality of a move is only revealed at the end of the game
– walking through a maze to avoid monsters and find gold
– agent (here: the generator, a traditional neural machine translation model)
– reward (here: the ability to fool the discriminator)
– exploration of possible actions (here: Monte Carlo decoding)