SLIDE 1

Machine Learning Tricks

Philipp Koehn 13 October 2020

SLIDE 2

Machine Learning

  • Myth of machine learning

– given: real world examples
– automatically build model
– make predictions

  • Promise of deep learning

– do not worry about specific properties of the problem
– deep learning automatically discovers the features

  • Reality: bag of tricks

SLIDE 3

Today’s Agenda

  • No new translation model
  • Discussion of failures in machine learning
  • Various tricks to address them

SLIDE 4

Fair Warning

  • At some point, you will think:

Why are you telling us all this madness?

  • Because pretty much all of it is commonly used

SLIDE 5

failures in machine learning

SLIDE 6

Failures in Machine Learning

[Figure: error(λ) as a function of parameter λ]

Too high learning rate may lead to too drastic parameter updates → overshooting the optimum

SLIDE 7

Failures in Machine Learning

[Figure: error(λ) as a function of parameter λ]

Bad initialization may require many updates to escape a plateau

SLIDE 8

Failures in Machine Learning

[Figure: error(λ) curve with a local optimum and the global optimum marked]

Local optima trap training

SLIDE 9

Learning Rate

  • Gradient computation gives direction of change
  • Scaled by learning rate
  • Weight updates
  • Simplest form: fixed value
  • Annealing

– start with larger value (big changes at beginning)
– reduce over time (minor adjustments to refine model; see the sketch below)
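
A minimal sketch of annealing in plain Python (the function name and the exponential-decay schedule are illustrative assumptions, not something prescribed on the slides):

```python
def annealed_learning_rate(initial_rate, decay, step):
    """Exponentially decaying learning rate: large updates early, small ones later."""
    return initial_rate * (decay ** step)

# the rate shrinks from 0.1 towards 0 as training progresses
for step in range(5):
    print(step, annealed_learning_rate(initial_rate=0.1, decay=0.5, step=step))
```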

SLIDE 10

Initialization of Weights

  • Initialize weights to random values
  • But: range of possible values matters

[Figure: error(λ) as a function of parameter λ]

SLIDE 11

Sigmoid Activation Function

[Figure: sigmoid activation function and its derivative; the derivative is near zero for large positive and negative values of x]

SLIDE 12

Rectified Linear Unit

[Figure: ReLU activation function and its derivative; flat for a large interval where the gradient is 0]

"Dead cells": elements in the output that are always 0, no matter the input

SLIDE 13

Local Optima

  • Cartoon depiction

[Figure: error(λ) curve with a local optimum and the global optimum marked]

  • Reality

– high-dimensional space
– complex interaction between individual parameter changes
– "bumpy"

SLIDE 14

Vanishing and Exploding Gradients

[Figure: recurrent neural network unrolled over a sequence of time steps]

  • Repeated multiplication with same values
  • If gradients are too low → 0
  • If gradients are too big → ∞

SLIDE 15

Overfitting and Underfitting

[Figure: under-fitting, good fit, over-fitting]

  • Complexity of the problem has to match the capacity of the model
  • Capacity ≃ number of trainable parameters

SLIDE 16

ensuring randomness

SLIDE 17

Ensuring Randomness

  • Typical theoretical assumption

independent and identically distributed training examples

  • Approximate this ideal

– avoid undue structure in the training data
– avoid undue structure in the initial weight setting

  • ML approach: Maximum entropy training

– Fit properties of training data
– Otherwise, model should be as random as possible (i.e., has maximum entropy)

SLIDE 18

Shuffling the Training Data

  • Typical training data in machine translation

– different types of corpora
  ∗ European Parliament Proceedings
  ∗ collection of movie subtitles
– temporal structure in each corpus
– similar sentences next to each other (e.g., same story / debate)

  • Online updating: last examples matter more
  • Convergence criterion: no improvement recently

→ stretch of hard examples following easy examples: prematurely stopped
⇒ randomly shuffle the training data (maybe each epoch; see the sketch below)
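
A minimal sketch of per-epoch shuffling (the `update` callback is a hypothetical stand-in for whatever per-example training step the model uses):

```python
import random

def train(examples, num_epochs, update):
    """Reshuffle the training data at the start of every epoch to break
    corpus-level and temporal ordering before online updates."""
    for epoch in range(num_epochs):
        random.shuffle(examples)   # in-place random permutation
        for example in examples:
            update(example)        # one online weight update per example
```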

SLIDE 19

Weight Initialization

  • Initialize weights to random values
  • Values are chosen from a uniform distribution
  • Ideal weights lead to node values in the transition area of the activation function

SLIDE 20

For Example: Sigmoid

  • Input values in range [−1; 1]

⇒ Output values in range [0.269;0.731]

  • Magic formula (n = size of the previous layer): draw weights from the range

[ −1/√n , 1/√n ]

  • Magic formula for hidden layers

[ −√6/√(nj + nj+1) , √6/√(nj + nj+1) ]

– nj is the size of the previous layer
– nj+1 is the size of the next layer
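
A sketch of the hidden-layer formula with NumPy (assuming the weights are drawn from a uniform distribution over the stated range; the function name is illustrative):

```python
import numpy as np

def init_hidden_weights(n_prev, n_next):
    """Uniform initialization in [-sqrt(6)/sqrt(n_j + n_j+1), +sqrt(6)/sqrt(n_j + n_j+1)]."""
    limit = np.sqrt(6.0) / np.sqrt(n_prev + n_next)
    return np.random.uniform(-limit, limit, size=(n_next, n_prev))

W = init_hidden_weights(512, 512)   # weight matrix mapping a 512-node layer to a 512-node layer
```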

SLIDE 21

Problem: Overconfident Models

  • Predictions of the neural machine translation models are surprisingly confident
  • Often almost all the probability mass is assigned to a single word

(word prediction probabilities of over 99%)

  • Problem for decoding and training

– decoding: sensible alternatives get low scores, bad for beam search
– training: overfitting is more likely

  • Solution: label smoothing
  • Jargon notice

– in classification tasks, we predict a label
– jargon term for any output
→ here, we smooth the word predictions

SLIDE 22

Label Smoothing during Decoding

  • Common strategy to combat peaked distributions: smooth them
  • Recall

– prediction layer produces numbers si for each word
– converted into probabilities using the softmax

p(yi) = exp(si) / ∑j exp(sj)

  • Softmax calculation can be smoothed with a so-called temperature T

p(yi) = exp(si/T) / ∑j exp(sj/T)

  • Higher temperature → smoother distribution

(i.e., less probability is given to the most likely choice)
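
A small NumPy sketch of the temperature-smoothed softmax (the max-subtraction for numerical stability is an added implementation detail, not part of the formula above):

```python
import numpy as np

def softmax_with_temperature(scores, T=1.0):
    """p(y_i) = exp(s_i / T) / sum_j exp(s_j / T); higher T flattens the distribution."""
    s = np.asarray(scores, dtype=float) / T
    s -= s.max()                      # numerical stability, does not change the result
    e = np.exp(s)
    return e / e.sum()

print(softmax_with_temperature([5.0, 1.0, 0.5], T=1.0))  # peaked
print(softmax_with_temperature([5.0, 1.0, 0.5], T=2.0))  # smoother
```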

SLIDE 23

Label Smoothing during Training

  • Root of problem: training
  • Training objective: assign all probability mass to the single correct word
  • Label smoothing

– truth gives some probability mass to other words (say, 10% of it)
– either uniformly distributed over all words
– or relative to unigram word probabilities
  (relative counts of each word in the target side of the training data)
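
A sketch of the uniform variant, building the smoothed target distribution for one word prediction (names and the 10% mass are illustrative):

```python
import numpy as np

def smoothed_targets(correct_index, vocab_size, epsilon=0.1):
    """The correct word keeps 1 - epsilon of the probability mass;
    the remaining epsilon is spread uniformly over the vocabulary."""
    targets = np.full(vocab_size, epsilon / vocab_size)
    targets[correct_index] += 1.0 - epsilon
    return targets

print(smoothed_targets(correct_index=2, vocab_size=5))
```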

SLIDE 24

adjusting the learning rate

SLIDE 25

Adjusting the Learning Rate

  • Gradient descent training: weight update follows the gradient downhill
  • Actual gradients have fairly large values, scale with a learning rate

(low number, e.g., µ = 0.001)

  • Change the learning rate over time

– starting with larger updates
– refining weights with smaller updates
– adjust for other reasons

  • Learning rate schedule

SLIDE 26

Momentum Term

  • Consider case where weight value far from optimum
  • Most training examples push the weight value in the same direction
  • Small updates take long to accumulate
  • Solution: momentum term mt

– accumulate weight updates at each time step t
– some decay rate for the sum (e.g., 0.9)
– combine momentum term mt−1 with weight update value ∆wt

mt = 0.9 mt−1 + ∆wt
wt = wt−1 − µ mt
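
The momentum update above as a small sketch (scalar values for simplicity; with NumPy arrays the same code works element-wise):

```python
def momentum_step(weight, gradient, momentum, learning_rate=0.001, decay=0.9):
    """m_t = decay * m_{t-1} + gradient;  w_t = w_{t-1} - mu * m_t."""
    momentum = decay * momentum + gradient
    weight = weight - learning_rate * momentum
    return weight, momentum
```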

SLIDE 27

Adapting Learning Rate per Parameter

  • Common strategy: reduce the learning rate µ over time
  • Initially parameters are far away from optimum → change a lot
  • Later nuanced refinements needed → change little
  • Now: different learning rate for each parameter

SLIDE 28

Adagrad

  • Different parameters at different stages of training

→ different learning rate for each parameter

  • Adagrad

– record gradients for each parameter
– accumulate their square values over time
– use this sum to reduce learning rate

  • Update formula

– gradient gt = dEt/dw of error E with respect to weight w
– divide the learning rate µ by the accumulated sum of squared gradients

∆wt = µ / √( ∑τ=1..t gτ² ) × gt

  • Big changes in the parameter value (corresponding to big gradients gt)

→ reduction of the learning rate of the weight parameter
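
A NumPy sketch of one Adagrad step (the small `eps` term to avoid division by zero is a common addition that is not shown in the formula above):

```python
import numpy as np

def adagrad_step(weight, gradient, sum_sq_grad, learning_rate=0.1, eps=1e-8):
    """Accumulate squared gradients and divide the learning rate by their square root."""
    sum_sq_grad = sum_sq_grad + gradient ** 2
    weight = weight - learning_rate / (np.sqrt(sum_sq_grad) + eps) * gradient
    return weight, sum_sq_grad
```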

SLIDE 29

Adam: Elements

  • Combine idea of momentum term and reduce parameter update by accumulated change

  • Momentum term idea (e.g., β1 = 0.9)

mt = β1mt−1 + (1 − β1)gt

  • Accumulated gradients (decay with β2 = 0.999)

vt = β2 vt−1 + (1 − β2) gt²

SLIDE 30

Adam: Technical Correction

  • Initially, values for mt and vt are close to initial value of 0
  • Adjustment

m̂t = mt / (1 − β1^t) ,  v̂t = vt / (1 − β2^t)

  • With t → ∞ this correction goes away

limt→∞ 1 / (1 − β^t) → 1

SLIDE 31

Adam

  • Given

– learning rate µ
– momentum m̂t
– accumulated change v̂t

  • Weight update per Adam (e.g., ε = 10^−8)

∆wt = µ / ( √v̂t + ε ) × m̂t
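
Putting the three Adam slides together, a minimal NumPy sketch of one update step (assuming the time step t starts at 1 so the bias correction is well defined):

```python
import numpy as np

def adam_step(w, g, m, v, t, mu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum term m, accumulated change v, bias correction, weight update."""
    m = beta1 * m + (1 - beta1) * g            # momentum term
    v = beta2 * v + (1 - beta2) * g ** 2       # accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - mu / (np.sqrt(v_hat) + eps) * m_hat
    return w, m, v
```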

SLIDE 32

Batched Gradient Updates

  • Accumulate all weight updates for all the training examples → update

(converges slowly)

  • Process each training example → update (stochastic gradient descent)

(quicker convergence, but the last training examples have a disproportionately higher impact)

  • Process data in batches

– compute all their gradients for individual word prediction errors
– use the sum over each batch to update parameters
→ better parallelization on GPUs

  • Process data on multiple compute cores

– batch processing may take different amounts of time
– asynchronous training: apply updates when they arrive
– mismatch between original weights and updates may not matter much
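
A sketch of batched updates (the `compute_gradient` and `apply_update` callbacks are hypothetical stand-ins for the model's forward/backward pass and its optimizer step):

```python
def train_in_batches(examples, batch_size, compute_gradient, apply_update):
    """Sum the gradients over each batch, then update the parameters once per batch."""
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        gradient_sum = None
        for example in batch:
            g = compute_gradient(example)
            gradient_sum = g if gradient_sum is None else gradient_sum + g
        apply_update(gradient_sum)
```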

SLIDE 33

avoiding local optima

SLIDE 34

Avoiding Local Optima

  • One of the hardest problems when designing neural network architectures and optimization methods
  • Ensure that the model converges at least to a set of parameter values that give results close to the optimum on unseen test data
  • There is no real solution to this problem
  • It requires experimentation and analysis that is more craft than science
  • Still, this section presents a number of methods that generally help to avoid getting stuck in local optima

SLIDE 35

Overfitting and Underfitting

  • Neural machine translation models

– 100s of millions of parameters
– 100s of millions of training examples (individual word predictions)

  • No hard rules for relationship between these two numbers
  • Too many parameters and too few training examples → overfitting
  • Too few parameters and many training examples → underfitting

SLIDE 36

Regularization

  • Motivation: prefer as few parameters as possible
  • Strategy: set un-needed parameters to a value of 0
  • Method

– adjust training objective
– add cost for any non-zero parameter
– typically done with the L2 norm

  • Practical impact

– derivative of the L2 norm is the value of the parameter
– if no signal from training: reduce the value of the parameter
– also called weight decay

  • Not common in deep learning, but other methods can be understood as regularization

SLIDE 37

Curriculum Learning

  • Human learning

– learn simple concepts first – learn more complex material later

  • Early epochs: only easy training examples

– only short sentences
– create artificial data by extracting smaller segments
  (similar to phrase pair extraction in statistical machine translation)

  • Later epochs: all training data

  • Not easy to calibrate

SLIDE 38

Dropout

  • Training may get stuck in local optima

– some properties of the task have been learned
– discovery of other properties would take it too far out of its comfort zone

  • Machine translation example

– model learned the language model aspects
– but cannot figure out role of input sentence

  • Dropout: for each batch, eliminate some nodes

SLIDE 39

Dropout

  • Dropout

– For each batch, different random set of nodes is removed
– Their values are set to 0 and their weights are not updated
– 10%, 20% or even 50% of all the nodes

  • Why does this work?

– robustness: redundant nodes play similar roles
– ensemble learning: different subnetworks are different models
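
A NumPy sketch of dropout on one layer's node values (the rescaling by 1/(1 − rate), "inverted dropout", is an implementation choice not spelled out on the slide):

```python
import numpy as np

def dropout(values, rate=0.2, training=True):
    """Randomly zero out a fraction `rate` of the nodes during training."""
    if not training or rate == 0.0:
        return values
    mask = np.random.rand(*values.shape) >= rate   # keep each node with probability 1 - rate
    return values * mask / (1.0 - rate)            # rescale so the expected value is unchanged
```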

SLIDE 40

Gradient Clipping

  • Exploding gradients: gradients become too large during backward pass

⇒ Limit total value of gradients for a layer to threshold (τ)

  • Use the L2 norm of the gradient values g

L2(g) = √( ∑j gj² )

  • Adjust each gradient value gi for each element i in the vector

g′i = gi × τ / max(τ, L2(g))
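
A direct NumPy sketch of the clipping formula above:

```python
import numpy as np

def clip_gradients(g, threshold):
    """Rescale g so its L2 norm does not exceed threshold: g_i * tau / max(tau, L2(g))."""
    g = np.asarray(g, dtype=float)
    norm = np.sqrt(np.sum(g ** 2))
    return g * threshold / max(threshold, norm)

print(clip_gradients([3.0, 4.0], threshold=1.0))   # norm 5.0 -> rescaled to norm 1.0
```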

SLIDE 41

Layer Normalization

  • During inference, average node values may become too large or too small
  • Also has an impact on training (gradients are multiplied with node values)

⇒ Normalize node values

  • During training, learn bias layer

SLIDE 42

Layer Normalization: Math

  • Feed-forward layer hl, weights W, computed sum sl, activation function

sl = W hl−1
hl = sigmoid(sl)

  • Compute mean µl and variance σl of the sum vector sl (H = size of the layer)

µl = 1/H ∑i=1..H sli

σl = √( 1/H ∑i=1..H (sli − µl)² )

SLIDE 43

Layer Normalization: Math

  • Normalize sl

ŝl = 1/σl (sl − µl)

  • Learnable bias vectors g and b

ŝl = g/σl (sl − µl) + b
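
A NumPy sketch of layer normalization for one layer's summed inputs (the small `eps` guard against division by zero is an added implementation detail):

```python
import numpy as np

def layer_norm(s, g, b, eps=1e-6):
    """Normalize s to zero mean and unit variance, then apply learnable gain g and bias b."""
    mu = s.mean()
    sigma = np.sqrt(((s - mu) ** 2).mean())
    return g / (sigma + eps) * (s - mu) + b

s = np.array([1.0, 2.0, 6.0, 3.0])
print(layer_norm(s, g=np.ones(4), b=np.zeros(4)))
```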

SLIDE 44

Shortcuts and Highways

  • Deep learning: many layers of processing

⇒ Error propagation has to travel farther

  • All parameters in the processing chain have to be adjusted
  • Instead of always passing through all layers, add connections from first to last
  • Jargon alert

– shortcuts
– residual connections
– skip connections

SLIDE 45

Shortcuts

  • Feed-forward layer

y = f(x)

  • Pass through input x

y = f(x) + x

  • Note: gradient is

y′ = f ′(x) + 1

  • Constant 1 → gradient is passed through unchanged

SLIDE 46

Highways

  • Regulate how much information from f(x) and x should impact the output y
  • Gate t(x) (typically computed by a feed-forward layer)

y = t(x) f(x) + (1 − t(x)) x
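
A NumPy sketch contrasting the two connection types (the gate parameters W_t and b_t stand in for the feed-forward layer that computes t(x); all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def residual(x, f):
    """Skip connection: y = f(x) + x, so gradients always have a direct +1 path."""
    return f(x) + x

def highway(x, f, W_t, b_t):
    """Highway connection: a learned gate t(x) mixes f(x) and the untouched input x."""
    t = sigmoid(W_t @ x + b_t)
    return t * f(x) + (1.0 - t) * x
```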

SLIDE 47

Shortcuts and Highways

[Figure: a basic feed-forward layer, a skip connection (FF output added to the input), and a highway network (gated combination of FF output and input)]

SLIDE 48

LSTM and Vanishing Gradients

  • Recall: Long short term memory (LSTM) cells
  • Pass through of memory

memoryt = gateinput × inputt + gateforget × memoryt−1

  • Forget gate has values close to 1 → gradient passed through nearly unchanged

SLIDE 49

generative adversarial training

SLIDE 50

Sequence-Level Training

  • Traditional training

– predict one word at a time
– compare against correct word
– proceed training with correct word

  • Sequence-level training

– predict entire sequence
– measure translation with sentence-level metric (e.g., BLEU)

  • May use n-best translations, beam search, etc.

SLIDE 51

Generative Adversarial Networks (GAN)

  • Game between two players

– generator proposes a translation
– discriminator distinguishes between generator's translation and human translation
– generator tries to fool discriminator

  • Training example: input sentence x and output sentence y
  • Generator

– traditional neural machine translation model
– generates full sentence translations t for each input sentence

  • Discriminator

– is trained to classify (x, y) as correct example
– is trained to classify (x, t) as generated example

SLIDE 52

Generative Adversarial Networks (GAN)

  1. First train generator to some maturity
  2. Train discriminator on generator predictions and human reference translations
  3. Train jointly

– generator with additional objective to fool discriminator
– discriminator to do well on detecting generator's output as such

  • In practice, this is hard to calibrate correctly

SLIDE 53

Relationship to Reinforcement Learning

  • No immediate feedback

– chess playing: quality of move only revealed at end of game
– walk through maze to avoid monsters and find gold

  • Policy: decision process for which steps to take

(here: generator, traditional neural machine translation model)

  • Reward: end result

(here: ability to fool discriminator)

  • Popular technique: Monte Carlo search

(here: Monte Carlo decoding)

  • Training is called policy search
