Review: Supervised Learning (CS 6355: Structured Prediction)



slide-1
SLIDE 1

CS 6355: Structured Prediction

Review: Supervised Learning

1

slide-2
SLIDE 2

Previous lecture

  • A broad overview of structured prediction
  • The different aspects of the area

– Basically the syllabus of the class

  • Questions?

2

slide-3
SLIDE 3

Supervised learning, Binary classification

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

3

slide-4
SLIDE 4

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

4

slide-5
SLIDE 5

Supervised learning: General setting

  • Given: Training examples of the form ⟨x, f(x)⟩

– The function f is an unknown function

  • The input x is represented in a feature space

– Typically x ∈ {0,1}^n or x ∈ ℝ^n

  • For a training example x, the value of f(x) is called its label
  • Goal: Find a good approximation for f
  • Different kinds of problems

– Binary classification: f(x) ∈ {βˆ’1, 1}
– Multiclass classification: f(x) ∈ {1, 2, β‹―, k}
– Regression: f(x) ∈ ℝ

5

slide-6
SLIDE 6

Nature of applications

  • There is no human expert

– E.g.: Identify DNA binding sites

  • Humans can perform a task, but can’t describe how they do it

– E.g.: Object detection in images

  • The desired function is hard to obtain in closed form

– E.g.: Stock market

6

slide-7
SLIDE 7

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

7

slide-8
SLIDE 8

Linear Classifiers

  • Input is an n-dimensional vector x
  • Output is a label y ∈ {βˆ’1, 1}
  • Linear threshold units classify an example x using the classification rule

sgn(b + wTx) = sgn(b + Ξ£ wi xi)

  • b + wTx β‰₯ 0 β‡’ predict y = 1
  • b + wTx < 0 β‡’ predict y = βˆ’1

8

For now
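As an illustration of the classification rule above, here is a minimal Python sketch of a linear threshold unit; the weights, bias, and test point are made-up values, not from the slides.

import numpy as np

def predict(w, b, x):
    """Linear threshold unit: sgn(b + w^T x), mapping a score of 0 to the positive class."""
    return 1 if b + np.dot(w, x) >= 0 else -1

# Hypothetical weights and a test point, for illustration only
w = np.array([2.0, -1.0, 0.5])
b = -0.25
x = np.array([1.0, 0.0, 1.0])
print(predict(w, b, x))  # prints 1, since b + w.x = 2.25 >= 0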

slide-9
SLIDE 9

The geometry of a linear classifier

9

sgn(b + w1x1 + w2x2)

[Figure: positively and negatively labeled points in the (x1, x2) plane, separated by the line b + w1x1 + w2x2 = 0; the weight vector [w1 w2] is normal to that line.]

In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.

slide-10
SLIDE 10

XOR is not linearly separable

10

[Figure: points in the (x1, x2) plane arranged as in XOR, with positive and negative clusters at opposite corners.]

No line can be drawn to separate the two classes.

slide-11
SLIDE 11

Even these functions can be made linear

The trick: Change the representation

11

These points are not separable in 1 dimension by a line. (What is a one-dimensional line, by the way?)

Not all functions are linearly separable

slide-12
SLIDE 12

Even these functions can be made linear

The trick: Use feature conjunctions

12

Transform the points: represent each point x in 2 dimensions as (x, xΒ²). Now the data is linearly separable in this space!

Not all functions are linearly separable
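A tiny sketch (with made-up points) of the representation trick described above: data that cannot be separated by a threshold on the real line becomes linearly separable after mapping each x to (x, xΒ²).

import numpy as np

# Hypothetical 1-D points: the negatives lie between the positives,
# so no single threshold on x separates the classes.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = np.array([+1, -1, -1, -1, +1])

# Map x -> (x, x^2). In this 2-D space the rule "x^2 >= 2" separates the classes,
# i.e. a linear classifier with w = (0, 1) and b = -2.
phi = np.stack([xs, xs ** 2], axis=1)
preds = np.sign(phi @ np.array([0.0, 1.0]) - 2.0)
print(preds)  # [ 1. -1. -1. -1.  1.] matches ys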

slide-13
SLIDE 13

Linear classifiers are an expressive hypothesis class

  • Many functions are linear

– Conjunctions, disjunctions
– At least m-of-n functions

  • Often a good guess for a hypothesis space

– If we know a good feature representation

  • Some functions are not linear

– The XOR function
– Non-trivial Boolean functions

13

We will see later in the class that many structured predictors are linear functions too

slide-14
SLIDE 14

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

14

slide-15
SLIDE 15

The Perceptron algorithm

  • Rosenblatt 1958
  • The goal is to find a separating hyperplane

– For separable data, guaranteed to find one

  • An online algorithm

– Processes one example at a time

  • Several variants exist

15

slide-17
SLIDE 17

The algorithm

Given a training set D = {(x, y)}, x ∈ ℝ^n, y ∈ {βˆ’1, 1}

  • 1. Initialize w = 0 ∈ ℝ^n
  • 2. For epoch = 1 … T:

1. Shuffle the data
2. For each training example (x, y) in D:

1. Predict y’ = sgn(wTx)
2. If y β‰  y’, update w ← w + y x

  • 3. Return w

Prediction: sgn(wTx)

17

Update only on an error: the Perceptron is a mistake-driven algorithm. T is a hyperparameter of the algorithm.
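A minimal Python (NumPy) sketch of the algorithm above; the toy dataset, epoch count, and random seed are illustrative assumptions rather than part of the slide.

import numpy as np

def perceptron(X, y, epochs=10, seed=0):
    """Train a linear classifier with the Perceptron update w <- w + y*x on mistakes."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_examples)   # shuffle the data each epoch
        for i in order:
            if np.sign(w @ X[i]) != y[i]:     # mistake-driven: update only on an error
                w = w + y[i] * X[i]
    return w

# Illustrative, linearly separable toy data
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.3], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))  # should reproduce y for this separable toy set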

slide-18
SLIDE 18

Convergence theorem

If there exists a set of weights that is consistent with the data (i.e. the data is linearly separable), the perceptron algorithm will converge after a finite number of updates.

– [Novikoff 1962]

18
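The slide does not spell the bound out; a standard statement of the guarantee (given here as a reconstruction, not the slide's own wording) is:

\[
\text{If } \|\mathbf{x}_i\| \le R \text{ and } y_i\,\mathbf{w}^{*\top}\mathbf{x}_i \ge \gamma > 0 \text{ for all } i \text{, for some } \|\mathbf{w}^*\| = 1,
\text{ then the number of mistakes is at most } \left(\frac{R}{\gamma}\right)^2.
\]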

slide-19
SLIDE 19

Beyond the separable case

  • The good news

– Perceptron makes no assumption about the data distribution
– Even adversarial
– After a fixed number of mistakes, you are done. Don’t even need to see any more data

  • The bad news: The real world is not linearly separable

– Can’t expect to never make mistakes again
– What can we do: add more features, try to make the data linearly separable if you can

19

slide-20
SLIDE 20

Variants of the algorithm

  • The original version: Return the final weight vector
  • Averaged perceptron

– Returns the average weight vector over the entire training run (i.e. longer-surviving weight vectors get more say)
– Widely used
– A practical approximation of the Voted Perceptron

20
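A sketch of how the averaged variant modifies the Perceptron loop shown earlier (same illustrative assumptions as before):

import numpy as np

def averaged_perceptron(X, y, epochs=10, seed=0):
    """Averaged Perceptron: return the average of all intermediate weight vectors."""
    rng = np.random.default_rng(seed)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    count = 0
    for _ in range(epochs):
        for i in rng.permutation(n_examples):
            if np.sign(w @ X[i]) != y[i]:
                w = w + y[i] * X[i]
            w_sum += w          # longer-surviving weight vectors contribute more to the average
            count += 1
    return w_sum / count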

slide-21
SLIDE 21

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization

1. The general idea
2. Stochastic gradient descent
3. Loss functions

  • 5. Support vector machines
  • 6. Logistic Regression

21

slide-22
SLIDE 22

Learning as loss minimization

  • Collect some annotated data. More is generally better
  • Pick a hypothesis class (also called a model)

– E.g.: linear classifiers, deep neural networks
– Also decide on how to impose a preference over hypotheses

  • Choose a loss function

– E.g.: negative log-likelihood, hinge loss
– Decide on how to penalize incorrect decisions

  • Minimize the expected loss

– E.g.: set the derivative to zero and solve on paper; typically requires a more complex algorithm

22

slide-23
SLIDE 23

Learning as loss minimization

  • The setup

– Examples x are drawn from a fixed, unknown distribution D
– A hidden oracle classifier f labels the examples
– We wish to find a hypothesis h that mimics f

  • The ideal situation

– Define a function L that penalizes bad hypotheses
– Learning: Pick a function h ∈ H to minimize the expected loss

  • Instead, minimize empirical loss on the training set

23

But distribution D is unknown

slide-26
SLIDE 26

Empirical loss minimization

Learning = minimize empirical loss on the training set

Is there a problem here? Overfitting!

We need something that biases the learner towards simpler hypotheses

  • Achieved using a regularizer, which penalizes complex hypotheses
  • Capacity control for better generalization

26

slide-28
SLIDE 28

Regularized loss minimization

  • Learning:  min_w  regularizer(w) + (C/n) Ξ£i L(h(xi), yi)

  • With L2 regularization:  min_w  Β½ wTw + C Ξ£i L(F(xi, w), yi)

  • What is a loss function?

– Loss functions should penalize mistakes
– We are minimizing the average loss over the training data

28
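An illustrative Python sketch of this objective, using the hinge loss as the per-example loss, an L2 regularizer, and a linear score F(x, w) = wTx; the data and constants are made up.

import numpy as np

def regularized_loss(w, X, y, C=1.0):
    """L2-regularized empirical risk: 0.5*w'w + C * average hinge loss over the data."""
    scores = X @ w                               # F(x, w) = w'x for a linear model
    hinge = np.maximum(0.0, 1.0 - y * scores)    # per-example hinge loss
    return 0.5 * np.dot(w, w) + C * hinge.mean()

# Illustrative data and weights
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.5, -1.5]])
y = np.array([1, -1, -1])
print(regularized_loss(np.array([0.5, 0.5]), X, y))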

slide-30
SLIDE 30

How do we train in such a regime?

  • Suppose we have a predictor F that maps an input x to a score F(x, w), which is thresholded to get a label

– Here w are the parameters that define the function
– Say F is a differentiable function

  • How do we use a labeled training set to learn the weights, i.e. solve this minimization problem?

min_w Ξ£i L(F(xi, w), yi)

  • We could compute the gradient of the loss and descend along that direction to minimize it

30

slide-37
SLIDE 37

Stochastic gradient descent

Given a training set S = {(xi, yi)}, x ∈ ℝ^d

Goal: min_w Ξ£i L(F(xi, w), yi)

  • 1. Initialize parameters w
  • 2. For epoch = 1 … T:

1. Shuffle the training set
2. For each training example (xi, yi) ∈ S:

  • Treat this example as the entire dataset
  • Compute the gradient of the loss, βˆ‡L(F(xi, w), yi)
  • Update: w ← w βˆ’ Ξ³t βˆ‡L(F(xi, w), yi)

  • 3. Return w

37

Ξ³t: learning rate; many tweaks possible. If the objective is not convex, initialization can be important.
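A minimal Python sketch of the loop above, using the logistic loss for a linear model as an illustrative choice; the learning rate, epoch count, and toy data are assumptions, not fixed by the slide.

import numpy as np

def sgd_logistic(X, y, epochs=20, lr=0.1, seed=0):
    """Stochastic gradient descent on the logistic loss log(1 + exp(-y * w'x))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):           # shuffle each epoch
            margin = y[i] * (w @ X[i])
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))   # gradient of this example's loss
            w = w - lr * grad                  # step against the gradient
    return w

X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(np.sign(X @ sgd_logistic(X, y)))  # should recover the labels for this toy data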
slide-38
SLIDE 38

A more general form

Suppose we want to minimize a function that is the sum of other functions:

f(x) = Ξ£i=1..n fi(x)

  • Initialize x
  • Loop till convergence:

– Pick i randomly from {1, 2, β‹―, n}
– Update x ← x βˆ’ (step size) Β· βˆ‡fi(x)

  • Return x

38

slide-39
SLIDE 39

In practice…

  • There are many variants of this idea
  • Several named learning algorithms

– AdaGrad, AdaDelta, RMSProp, Adam

  • But the key components are the same. We need to…
  • 1. …sample a tiny subset of the data at each step
  • 2. …compute the gradient of the loss using this subset
  • 3. …take a step in the negative direction of the gradient

39

slide-40
SLIDE 40

Standard loss functions

We need to think about the problem we have at hand. Is it a…

1. Binary classification problem?
2. Regression problem?
3. Multi-class classification problem?
4. Or something else?

Each case is naturally paired with a different loss function

40

slide-41
SLIDE 41

The ideal case for binary classification: The 0-1 loss

Penalize classification mistakes between true label y and prediction y’

L0-1(y, y’) = 1 if y β‰  y’, 0 if y = y’

More generally, suppose we have a prediction function of the form sgn(F(x, w))
– Note that F need not be linear

L0-1(y, y’) = 1 if yF(x, w) ≀ 0, 0 if yF(x, w) > 0

Minimizing 0-1 loss is intractable. Need surrogates

41

slide-42
SLIDE 42

The loss function zoo

Many loss functions exist

42

For binary classification:

min_w  regularizer(w) + (C/n) Ξ£i L(F(xi, w), yi)

Perceptron:               Lperceptron(y, x, w) = max(0, βˆ’yF(x, w))
Hinge (SVM):              Lhinge(y, x, w) = max(0, 1 βˆ’ yF(x, w))
Exponential (AdaBoost):   Lexponential(y, x, w) = exp(βˆ’yF(x, w))
Logistic loss:            Llogistic(y, x, w) = log(1 + exp(βˆ’yF(x, w)))
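For reference, a small Python sketch of these surrogate losses written as functions of the margin yF(x, w); the function names are illustrative.

import numpy as np

# Each surrogate loss is written as a function of the margin m = y * F(x, w).
def perceptron_loss(m):  return np.maximum(0.0, -m)
def hinge_loss(m):       return np.maximum(0.0, 1.0 - m)
def exponential_loss(m): return np.exp(-m)
def logistic_loss(m):    return np.log1p(np.exp(-m))
def zero_one_loss(m):    return (m <= 0).astype(float)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("0-1", zero_one_loss), ("perceptron", perceptron_loss),
                 ("hinge", hinge_loss), ("exponential", exponential_loss),
                 ("logistic", logistic_loss)]:
    print(name, fn(margins))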

slide-48
SLIDE 48

The loss function zoo

48

[Figure: each loss plotted as a function of the margin yF(x, w): zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression.]

slide-49
SLIDE 49

What if we have a regression task?

Real-valued outputs

– That is, our model is a function F(x, w) that maps an input x to a real number
– Parameterized by w
– The ground truth y is also a real number

A natural loss function for this situation is the squared loss:

L(y, x, w) = (y βˆ’ F(x, w))Β²

49
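For use with the gradient-based training described earlier, the per-example gradient of the squared loss (a standard derivation, not shown on the slide) is:

\[
\nabla_{\mathbf{w}}\,\bigl(y - F(\mathbf{x}, \mathbf{w})\bigr)^2 \;=\; -2\,\bigl(y - F(\mathbf{x}, \mathbf{w})\bigr)\,\nabla_{\mathbf{w}} F(\mathbf{x}, \mathbf{w})
\]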

slide-50
SLIDE 50

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

50

slide-51
SLIDE 51

Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

51

[Figure: positively and negatively labeled points in the plane, illustrating the margin of a separating hyperplane.]
slide-52
SLIDE 52

Learning strategy

Find the linear separator that maximizes the margin

52

slide-53
SLIDE 53

Maximizing margin and minimizing loss

53

Maximize margin Penalty for the prediction: The Hinge loss

Find the linear separator that maximizes the margin

slide-54
SLIDE 54

SVM objective function

54

Regularization term:

  • Maximize the margin
  • Imposes a preference over the hypothesis space and pushes for better generalization

Empirical loss:

  • Hinge loss
  • Penalizes weight vectors that make mistakes

A hyper-parameter controls the tradeoff between a large margin and a small hinge loss
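The objective itself appears only as an image on the original slide; a standard form consistent with the description above (an assumed notation) is:

\[
\min_{\mathbf{w}} \;\; \frac{1}{2}\,\mathbf{w}^\top \mathbf{w} \;+\; C \sum_{i} \max\bigl(0,\; 1 - y_i\,\mathbf{w}^\top \mathbf{x}_i\bigr)
\]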

slide-55
SLIDE 55

Where are we?

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Learning as optimization
  • 5. Support vector machines
  • 6. Logistic Regression

55

slide-56
SLIDE 56

Regularized loss minimization: Logistic regression

  • Learning:
  • With linear classifiers:
  • SVM uses the hinge loss
  • Another loss function: The logistic loss

56

slide-57
SLIDE 57

The probabilistic interpretation

Suppose we believe that the labels are distributed as follows given the input:

Predict label = 1 if P(1 | x, w) > P(βˆ’1 | x, w)

– Equivalent to predicting 1 if wTx β‰₯ 0

57
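The label distribution referred to above is shown only as an image on the slide; the standard logistic model it presumably denotes (an assumption on my part) is:

\[
P(y \mid \mathbf{x}, \mathbf{w}) \;=\; \frac{1}{1 + \exp(-y\,\mathbf{w}^\top \mathbf{x})}, \qquad y \in \{-1, +1\}
\]

Under this model, P(1 | x, w) > P(βˆ’1 | x, w) exactly when wTx > 0, which matches the decision rule above.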

slide-58
SLIDE 58

The probabilistic interpretation

Suppose we believe that the labels are distributed as follows given the input: The log-likelihood of seeing a dataset D = {(xi, yi)} if the true weight vector was w:

58
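Under that logistic model, the log-likelihood referred to here (again a reconstruction, since the slide shows it as an image) is:

\[
\log P(D \mid \mathbf{w}) \;=\; \sum_{i} \log P(y_i \mid \mathbf{x}_i, \mathbf{w}) \;=\; -\sum_{i} \log\bigl(1 + \exp(-y_i\,\mathbf{w}^\top \mathbf{x}_i)\bigr)
\]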

slide-59
SLIDE 59

Regularized logistic regression

What is the probability of weights w being the true ones for a dataset D = {<xi, yi>}?

P(w | D) ∝ P(w, D) = P(D | w) P(w)

59

slide-60
SLIDE 60

Prior distribution over the weight vectors

A prior balances the tradeoff between the likelihood of the data and existing belief about the parameters

– Suppose each weight wi is drawn independently from the normal distribution centered at zero with variance σ²

  • Bias towards smaller weights

– Probability of the entire weight vector:

60 Source: Wikipedia
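The probability of the entire weight vector, under the stated independence assumption (the slide shows this formula as an image; this is the standard expression):

\[
P(\mathbf{w}) \;=\; \prod_{i} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{w_i^2}{2\sigma^2}\right)
\]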

slide-61
SLIDE 61

Regularized logistic regression

What is the probability of weights w being the true ones for a dataset D = {<xi, yi>}?

P(w | D) ∝ P(w, D) = P(D | w) P(w)

Learning: Find weights by maximizing the posterior distribution P(w | D):

βˆ’log P(w | D) = (1 / 2σ²) wTw + Ξ£i log(1 + exp(βˆ’yi wTxi)) + constants

Once again, regularized loss minimization! This is the Bayesian interpretation of regularization.

61

slide-62
SLIDE 62

Regularized loss minimization

Learning objective for both SVM & logistic regression:

β€œloss over training data + regularizer” – Different loss functions

  • Hinge loss vs. logistic loss

– Same regularizer, but different interpretation

  • Margin vs prior

– A hyper-parameter controls the tradeoff between the loss and the regularizer
– Other regularizers/loss functions are also possible

62

Questions?

slide-63
SLIDE 63

Review of supervised binary classification

  • 1. Supervised learning: The general setting
  • 2. Linear classifiers
  • 3. The Perceptron algorithm
  • 4. Support vector machine
  • 5. Learning as optimization
  • 6. Logistic Regression

63

slide-64
SLIDE 64

What if we have more than two labels?

64

slide-65
SLIDE 65

Reading for next lecture:

– Erin L. Allwein, Robert E. Schapire, Yoram Singer, Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, ICML 2000.

65