SLIDE 1

Lecture 2: Model-based classification

Felix Held, Mathematical Sciences

MSA220/MVE440 Statistical Learning for Big Data 28th March 2019

SLIDE 2

Reprise: Statistical Learning (I)

Regression

▶ Theoretically best regression function for squared error loss

  $\hat{g}(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}[y]$

▶ Approximate (1) or make model assumptions (2)

  1. k-nearest neighbour regression

     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} y_i$

  2. linear regression (viewpoint: generalized linear models (GLM))

     $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol\beta$

SLIDE 3

Reprise: Statistical Learning (II)

Classification

▶ Theoretically best classification rule for 0-1 loss and $K$ possible classes

  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} p(j \mid \mathbf{x})$

▶ Approximate (1) or make model assumptions (2)

  1. k-nearest neighbour classification (see the code sketch below)

     $p(j \mid \mathbf{x}) \approx \frac{1}{k} \sum_{\mathbf{x}_i \in N_k(\mathbf{x})} \mathbf{1}(c_i = j)$

  2. Instead of approximating $p(j \mid \mathbf{x})$ from data, can we make sensible model assumptions?

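To make the kNN approximation above concrete, here is a minimal NumPy sketch (added here, not part of the original slides); the function name `knn_classify` and the toy Gaussian data are illustrative assumptions.

```python
import numpy as np

def knn_classify(x_new, X, c, k=5):
    """Estimate p(j | x_new) by the fraction of the k nearest training
    points (Euclidean metric) in each class and return the arg max."""
    dists = np.linalg.norm(X - x_new, axis=1)   # distances to all training points
    neighbours = np.argsort(dists)[:k]          # indices of the k closest points
    classes, counts = np.unique(c[neighbours], return_counts=True)
    return classes[np.argmax(counts)]           # most frequent class among the neighbours

# toy example: two Gaussian point clouds (illustrative data only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
c = np.repeat([0, 1], 50)
print(knn_classify(np.array([2.5, 2.5]), X, c, k=7))
```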

SLIDE 4

Amendment: kNN methods

There are two choices to make when implementing a kNN method

  • 1. The metric to determine a neighbourhood

▶ e.g. Euclidean/$\ell_2$ norm, Manhattan/$\ell_1$ norm, max norm, …

  • 2. The number of neighbours, i.e. $k$

The choice of metric changes the underlying local model of the method, while $k$ is a tuning parameter.

SLIDE 5

Model-based classification

SLIDE 6

Classification as regression

▶ Consider a two-class problem, with $c_i = 0$ or $c_i = 1$

▶ Instead of 0-1 loss, use squared error loss, i.e.

  $\mathbb{E}_{p(c|\mathbf{x})}[c] = 0 \cdot p(0 \mid \mathbf{x}) + 1 \cdot p(1 \mid \mathbf{x}) = p(1 \mid \mathbf{x})$

  Note that $c$ has a discrete distribution.

▶ Linear regression model assumption

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] \approx \mathbf{x}^T \boldsymbol\beta$

▶ Since we are approximating $p(1 \mid \mathbf{x})$ and $p(0 \mid \mathbf{x}) = 1 - p(1 \mid \mathbf{x}) \approx 1 - \mathbf{x}^T \boldsymbol\beta$, we have indirectly specified a model approximation for Bayes' rule as well:

  $\hat{c}(\mathbf{x}) = \begin{cases} 0 & \mathbf{x}^T \boldsymbol\beta \le \frac{1}{2} \\ 1 & \text{otherwise} \end{cases}$

  Note that $\mathbf{x}^T \boldsymbol\beta = \frac{1}{2}$ defines the decision boundary.
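
As a hedged illustration of this slide (added here, not from the original deck), the following NumPy sketch fits a least-squares line to 0/1 labels and classifies by thresholding $\mathbf{x}^T \boldsymbol\beta$ at 1/2; the simulated data are made up.

```python
import numpy as np

# 0-1 regression sketch: least-squares fit to 0/1 labels,
# classification by thresholding x^T beta at 1/2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
c = np.repeat([0.0, 1.0], 50)

X1 = np.column_stack([np.ones(len(X)), X])      # add an intercept column
beta, *_ = np.linalg.lstsq(X1, c, rcond=None)   # least-squares estimate of beta

scores = X1 @ beta                              # approximates p(1 | x)
pred = (scores > 0.5).astype(int)               # decision boundary at x^T beta = 1/2
print("training accuracy:", np.mean(pred == c))
```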

SLIDE 7

0-1 regression

[Figure: 0-1 regression on the iris data (species setosa and versicolor): one panel shows the 0-1 coding against Sepal Length, the other shows Sepal Width against Sepal Length.]

The solid black lines show the decision boundary.

SLIDE 8

0-1 regressions and outliers

[Figure: decision boundaries from 0-1 regression in the (x1, x2) plane for two cases: without and with an added outlier.]

SLIDE 9

Dummy encoding for categorical variables

In regression, when a predictor $x$ is categorical, i.e. takes one of $K$ values, it is common to use a dummy encoding.

Example: $x = 1 \to \mathbf{z} = (1, 0, 0)$, $x = 2 \to \mathbf{z} = (0, 1, 0)$, $x = 3 \to \mathbf{z} = (0, 0, 1)$

Idea: Turn a classification problem into a regression problem by representing the class outcomes $c_i$ in the training data $(c_i, \mathbf{x}_i)$ as vectors in dummy encoding.

SLIDE 10

Multiple classes

▶ This creates a sequence of 0-1 regressions (see blackboard). If there are $K$ classes then

  $z^{(1)}_i := \mathbf{1}(c_i = 1) \;\to\; p(z^{(1)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(1)}$

  $\vdots$

  $z^{(K)}_i := \mathbf{1}(c_i = K) \;\to\; p(z^{(K)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(K)}$

▶ Note that

  $p(j \mid \mathbf{x}) = p(z^{(j)} = 1 \mid \mathbf{x}) \approx \mathbf{x}^T \boldsymbol\beta^{(j)}$

▶ Classification rule (a code sketch follows below)

  $\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} p(j \mid \mathbf{x}) \approx \arg\max_{1 \le j \le K} \mathbf{x}^T \boldsymbol\beta^{(j)}$

  Decision boundaries are defined by $\mathbf{x}^T \boldsymbol\beta^{(j)} = \mathbf{x}^T \boldsymbol\beta^{(m)}$ for $j \ne m$.

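The classification rule above can be sketched in a few lines of NumPy (an illustration added here, not part of the slides): each dummy column is regressed on the features and new points are assigned by the arg max of the fitted scores.

```python
import numpy as np

# One least-squares fit per class: regress each dummy column 1(c_i = j)
# on the features, then classify by arg max_j of x^T beta^(j).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in (0, 3, 6)])
c = np.repeat([0, 1, 2], 40)

Z = (c[:, None] == np.arange(3)).astype(float)   # dummy encoding, one column per class
X1 = np.column_stack([np.ones(len(X)), X])       # intercept column
B, *_ = np.linalg.lstsq(X1, Z, rcond=None)       # columns of B are the beta^(j)

pred = np.argmax(X1 @ B, axis=1)                 # arg max over the fitted class scores
print("training accuracy:", np.mean(pred == c))
```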

SLIDE 11

Multiple 0-1 regressions

[Figure: multiple 0-1 regressions for a one-dimensional predictor and three classes; the 0-1 coding of each class is plotted against the predictor together with the fitted lines.]

SLIDE 12

Problems with 0-1 regression

Observations:

  • 1. $\mathbf{x}^T \boldsymbol\beta$ is unbounded but models a probability $p(j \mid \mathbf{x}) \in [0, 1]$
  • 2. Only values of $\mathbf{x}^T \boldsymbol\beta$ around 0.5 (for binary classification) or close to the maximal value (for multiple classes) are really of interest.
  • 3. Sensitive to points far away from the boundary (outliers)
  • 4. Masking: Classes can get buried among other classes (adding polynomial predictors can sometimes help, but this is arbitrary and data dependent)

Inspiration from GLM: Can we transform $\mathbf{x}^T \boldsymbol\beta$ such that the transformed values are in $[0, 1]$, are similar to the original values when close to 0.5, and are insensitive to outliers far away from the boundary?

SLIDE 13

Logistic function and Normal Distribution CDF

[Figure: the logistic function and the standard Normal CDF plotted for x between −4 and 4; both are S-shaped curves from 0 to 1.]

Logistic (sigmoid) function: $\sigma(x) = \dfrac{\exp(x)}{1 + \exp(x)}$

Standard Normal CDF: $\Phi(x) = \displaystyle\int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{z^2}{2}\right) \mathrm{d}z$

SLIDE 14

Logistic and probit regression

▶ We arrive at logistic regression when assuming

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \sigma(\mathbf{x}^T \boldsymbol\beta)$

  or probit regression when assuming

  $p(1 \mid \mathbf{x}) = \mathbb{E}_{p(c|\mathbf{x})}[c] = \Phi(\mathbf{x}^T \boldsymbol\beta)$

▶ Parameters can be estimated by iteratively reweighted least squares (details in ESL Ch. 4.4.1; a code sketch follows below)

▶ A warning: a problematic situation in the two-class case (occurs seldom in practice)

  ▶ Assume the two classes can be separated perfectly in one or more predictors

  ▶ Logistic regression then tries to fit a step-like function, which forces the intercept towards $-\infty$ and the corresponding predictor coefficient towards $+\infty$.

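The IRLS/Newton iteration mentioned above can be sketched as follows (an added illustration assuming a plain Newton-Raphson update, not the slides' own code); ESL Ch. 4.4.1 gives the derivation.

```python
import numpy as np

def fit_logistic_irls(X, c, n_iter=20):
    """Binary logistic regression via iteratively reweighted least squares
    (Newton-Raphson); X is assumed to already contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))               # current p(1 | x_i)
        W = p * (1.0 - p)                                 # diagonal of the weight matrix
        H = X.T @ (W[:, None] * X) + 1e-10 * np.eye(X.shape[1])  # X^T W X, tiny ridge for stability
        beta = beta + np.linalg.solve(H, X.T @ (c - p))   # Newton step
    return beta

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
X1 = np.column_stack([np.ones(len(X)), X])
c = np.repeat([0.0, 1.0], 50)
beta_hat = fit_logistic_irls(X1, c)
pred = (X1 @ beta_hat > 0).astype(int)   # sigma(x^T beta) > 1/2  <=>  x^T beta > 0
print("training accuracy:", np.mean(pred == c))
```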

SLIDE 15

Logistic regression and outliers

[Figure: decision boundaries in the (x1, x2) plane for three cases: 0-1 regression without the outlier, 0-1 regression with the outlier, and logistic regression with the outlier.]

SLIDE 16

Multi-class logistic regression

▶ In case of $K > 2$ classes, using dummy encoding for the outcome leads again to a series of regression problems.

▶ Requirement: probabilities should be modelled, i.e. $p(j \mid \mathbf{x}) \in [0, 1]$ for each class and $\sum_j p(j \mid \mathbf{x}) = 1$

▶ Softmax function $\boldsymbol\sigma : \mathbb{R}^K \mapsto [0, 1]^K$

  $\sigma_j(\mathbf{z}) = \dfrac{e^{z_j}}{\sum_{m=1}^{K} e^{z_m}} \quad\Leftrightarrow\quad \sigma_j(\mathbf{z}) = \dfrac{e^{z_j - z_K}}{1 + \sum_{m=1}^{K-1} e^{z_m - z_K}}$

▶ Model now (a code sketch of the softmax map follows below)

  $p(j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^T \boldsymbol\beta^{(j)}}}{\sum_{m=1}^{K} e^{\mathbf{x}^T \boldsymbol\beta^{(m)}}} \quad\text{or}\quad p(j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^T (\boldsymbol\beta^{(j)} - \boldsymbol\beta^{(K)})}}{1 + \sum_{m=1}^{K-1} e^{\mathbf{x}^T (\boldsymbol\beta^{(m)} - \boldsymbol\beta^{(K)})}}$

▶ This method has many names: softmax regression, multinomial logistic regression, maximum entropy classifier, …

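A small sketch of the softmax map itself (added for illustration; fitting the coefficient vectors, e.g. by Newton or gradient methods, is not shown):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax: maps K real scores to K probabilities in [0, 1]
    summing to one; subtracting the row maximum avoids overflow."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

# made-up scores x^T beta^(j) for two observations and three classes
scores = np.array([[1.0, 0.2, -0.5],
                   [0.1, 2.3,  0.4]])
probs = softmax(scores)
print(probs)                  # class probabilities per row
print(probs.sum(axis=1))      # each row sums to 1
print(probs.argmax(axis=1))   # predicted classes
```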

SLIDE 17

Multi-class logistic regression: An example

[Figure: fitted class probabilities for setosa, versicolor and virginica as functions of Sepal Length in the iris data.]

SLIDE 18

Classification with focus on the feature/predictor space

SLIDE 19

Motivation for a different viewpoint: Nearest centroids

[Figure: iris data, Sepal Width against Sepal Length, coloured by species (setosa, versicolor, virginica).]

Determine the mean predictor vector per class,

$\hat{\boldsymbol\mu}_j = \frac{1}{n_j} \sum_{i:\, c_i = j} \mathbf{x}_i \quad\text{where}\quad n_j = \sum_{i=1}^{n} \mathbf{1}(c_i = j),$

and classify points to the class whose mean is closest (a code sketch follows below).

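The nearest-centroid rule above in a few lines of NumPy (an added sketch with made-up toy data):

```python
import numpy as np

def nearest_centroid(x_new, X, c):
    """Classify x_new to the class whose mean feature vector (centroid) is closest."""
    classes = np.unique(c)
    centroids = np.array([X[c == j].mean(axis=0) for j in classes])  # mu_hat_j per class
    dists = np.linalg.norm(centroids - x_new, axis=1)                # Euclidean distances
    return classes[np.argmin(dists)]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1.0, (30, 2)) for m in (0, 3, 6)])
c = np.repeat([0, 1, 2], 30)
print(nearest_centroid(np.array([2.7, 2.9]), X, c))
```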

SLIDE 20

A change of scenery

Summary

▶ Classification can be approached through regression and approximation of $\mathbb{E}_{p(c|\mathbf{x})}[c]$

▶ Indirectly we approximated $p(j \mid \mathbf{x})$ and were able to use Bayes' rule

Observation: Good predictors group by class in feature space.

Change of focus: Let's model the density of $\mathbf{x}$ conditional on the class $j$ instead! How? Bayes' law.

SLIDE 21

The setting of Discriminant Analysis

Apply Bayes' law:

$p(j \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid j)\, p(j)}{\sum_{m=1}^{K} p(\mathbf{x} \mid m)\, p(m)}$

Instead of specifying $p(j \mid \mathbf{x})$ we can specify $p(\mathbf{x} \mid j)$ and $p(j)$.

The main assumption of Discriminant Analysis (DA) is

$p(\mathbf{x} \mid j) \sim \mathcal{N}(\boldsymbol\mu_j, \boldsymbol\Sigma_j)$

where $\boldsymbol\mu_j \in \mathbb{R}^p$ is the mean vector for class $j$ and $\boldsymbol\Sigma_j \in \mathbb{R}^{p \times p}$ the corresponding covariance matrix.

SLIDE 22

Finding the parameters of DA

▶ Notation: Write $p(j) = \pi_j$ and consider these as unknown parameters

▶ Given data $(c_i, \mathbf{x}_i)$, the likelihood maximization problem is

  $\arg\max_{\boldsymbol\mu, \boldsymbol\Sigma, \boldsymbol\pi} \prod_{i=1}^{n} \mathcal{N}(\mathbf{x}_i \mid \boldsymbol\mu_{c_i}, \boldsymbol\Sigma_{c_i})\, \pi_{c_i} \quad\text{subject to}\quad \sum_{j=1}^{K} \pi_j = 1.$

▶ Can be solved using a Lagrange multiplier (try it!) and leads to (a code sketch follows below)

  $\hat{\pi}_j = \dfrac{n_j}{n}, \quad\text{with}\quad n_j = \sum_{i=1}^{n} \mathbf{1}(c_i = j)$

  $\hat{\boldsymbol\mu}_j = \dfrac{1}{n_j} \sum_{i:\, c_i = j} \mathbf{x}_i$

  $\hat{\boldsymbol\Sigma}_j = \dfrac{1}{n_j - 1} \sum_{i:\, c_i = j} (\mathbf{x}_i - \hat{\boldsymbol\mu}_j)(\mathbf{x}_i - \hat{\boldsymbol\mu}_j)^T$

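A minimal sketch of these estimates in NumPy (added for illustration; `np.cov` with `rowvar=False` uses the $n_j - 1$ denominator from the slide):

```python
import numpy as np

def da_estimates(X, c):
    """Maximum-likelihood-style estimates for discriminant analysis:
    class priors pi_hat_j = n_j / n, class means mu_hat_j and
    class covariance matrices Sigma_hat_j (denominator n_j - 1)."""
    classes = np.unique(c)
    pi_hat = np.array([np.mean(c == j) for j in classes])
    mu_hat = np.array([X[c == j].mean(axis=0) for j in classes])
    Sigma_hat = np.array([np.cov(X[c == j], rowvar=False) for j in classes])
    return classes, pi_hat, mu_hat, Sigma_hat

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
c = np.repeat([0, 1], 40)
classes, pi_hat, mu_hat, Sigma_hat = da_estimates(X, c)
print(pi_hat, mu_hat, Sigma_hat.shape)
```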

SLIDE 23

Performing classification in DA

Bayes' rule implies the classification rule

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j)\, \pi_j$

Note that since log is strictly increasing, this is equivalent to

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \delta_j(\mathbf{x})$

where

$\delta_j(\mathbf{x}) = \log \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j) + \log \pi_j = \log \pi_j - \frac{1}{2}(\mathbf{x} - \boldsymbol\mu_j)^T \boldsymbol\Sigma_j^{-1} (\mathbf{x} - \boldsymbol\mu_j) - \frac{1}{2} \log |\boldsymbol\Sigma_j| \; (+\, C)$

This is a quadratic function in $\mathbf{x}$ (a code sketch follows below).

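A sketch of evaluating these quadratic discriminants and classifying by arg max (an added illustration; the estimates could come from the `da_estimates` sketch above):

```python
import numpy as np

def qda_discriminants(x, pi_hat, mu_hat, Sigma_hat):
    """delta_j(x) = log pi_j - 0.5 (x - mu_j)^T Sigma_j^{-1} (x - mu_j)
    - 0.5 log|Sigma_j|, up to the constant C; classify by arg max."""
    deltas = []
    for pi_j, mu_j, S_j in zip(pi_hat, mu_hat, Sigma_hat):
        diff = x - mu_j
        deltas.append(np.log(pi_j)
                      - 0.5 * diff @ np.linalg.solve(S_j, diff)
                      - 0.5 * np.log(np.linalg.det(S_j)))
    return np.array(deltas)

# hypothetical estimates for two classes in two dimensions
pi_hat = np.array([0.5, 0.5])
mu_hat = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigma_hat = np.array([np.eye(2), 2.0 * np.eye(2)])
deltas = qda_discriminants(np.array([1.0, 1.2]), pi_hat, mu_hat, Sigma_hat)
print(deltas, deltas.argmax())
```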

SLIDE 24

Different levels of complexity

▶ This method is called Quadratic Discriminant Analysis (QDA)

▶ Problem: many parameters, and they grow quickly with the dimension

  ▶ $K - 1$ for all $\pi_j$
  ▶ $p \cdot K$ for all $\boldsymbol\mu_j$
  ▶ $p(p + 1)/2 \cdot K$ for all $\boldsymbol\Sigma_j$ (most costly)

▶ Solution: Replace the covariance matrices $\boldsymbol\Sigma_j$ by a pooled estimate

  $\hat{\boldsymbol\Sigma} = \sum_{j=1}^{K} \hat{\boldsymbol\Sigma}_j \dfrac{n_j - 1}{n - K} = \dfrac{1}{n - K} \sum_{j=1}^{K} \sum_{i:\, c_i = j} (\mathbf{x}_i - \hat{\boldsymbol\mu}_j)(\mathbf{x}_i - \hat{\boldsymbol\mu}_j)^T$

▶ Simpler correlation and variance structure: all classes are assumed to have the same correlation structure between features

SLIDE 25

Performing classification in the simplified case

As before, consider

$\hat{c}(\mathbf{x}) = \arg\max_{1 \le j \le K} \delta_j(\mathbf{x})$

where now

$\delta_j(\mathbf{x}) = \log \pi_j + \mathbf{x}^T \boldsymbol\Sigma^{-1} \boldsymbol\mu_j - \frac{1}{2} \boldsymbol\mu_j^T \boldsymbol\Sigma^{-1} \boldsymbol\mu_j \; (+\, C)$

This is a linear function in $\mathbf{x}$. The method is therefore called Linear Discriminant Analysis (LDA). A code sketch using the pooled covariance estimate follows below.

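A compact LDA sketch combining the pooled covariance estimate from the previous slide with the linear discriminant above (an added illustration; the toy data are made up):

```python
import numpy as np

def lda_fit_predict(X, c, X_new):
    """Pooled-covariance LDA: delta_j(x) = log pi_j + x^T Sigma^{-1} mu_j
    - 0.5 mu_j^T Sigma^{-1} mu_j, evaluated for all rows of X_new."""
    classes = np.unique(c)
    n, K = len(X), len(classes)
    pi_hat = np.array([np.mean(c == j) for j in classes])
    mu_hat = np.array([X[c == j].mean(axis=0) for j in classes])
    Sigma = sum((X[c == j] - mu_hat[i]).T @ (X[c == j] - mu_hat[i])
                for i, j in enumerate(classes)) / (n - K)      # pooled estimate
    Sinv_mu = np.linalg.solve(Sigma, mu_hat.T)                 # Sigma^{-1} mu_j as columns
    deltas = np.log(pi_hat) + X_new @ Sinv_mu - 0.5 * np.sum(mu_hat.T * Sinv_mu, axis=0)
    return classes[np.argmax(deltas, axis=1)]

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
c = np.repeat([0, 1], 40)
print(lda_fit_predict(X, c, np.array([[1.4, 1.6], [3.1, 2.8]])))
```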

SLIDE 26

Even more simplifications

Other simplifications of the correlation structure are possible (a code sketch follows below):

▶ Ignore all correlations between features but allow different variances, i.e. $\boldsymbol\Sigma_j = \boldsymbol\Lambda_j$ for a diagonal matrix $\boldsymbol\Lambda_j$ (Diagonal QDA or Naive Bayes classifier)

▶ Ignore all correlations and make feature variances equal, i.e. $\boldsymbol\Sigma_j = \boldsymbol\Lambda$ for a diagonal matrix $\boldsymbol\Lambda$ (Diagonal LDA)

▶ Ignore correlations and variances, i.e. $\boldsymbol\Sigma_j = \sigma^2 \mathbf{I}_{p \times p}$ (Nearest Centroids adjusted for class frequencies $\pi_j$)

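To make the three restrictions concrete, here is an added sketch that maps per-class covariance estimates to the simpler structures; note that the unweighted average used as a pooled stand-in is a simplification of the weighted pooled estimate shown two slides back.

```python
import numpy as np

def restrict_covariances(Sigma_hat, kind):
    """Restrict per-class covariance estimates (array of shape (K, p, p)):
    'diag_qda'  - keep per-class variances, drop correlations (Lambda_j)
    'diag_lda'  - one shared diagonal matrix (Lambda)
    'spherical' - sigma^2 * I shared across classes (nearest centroids)."""
    K, p, _ = Sigma_hat.shape
    if kind == "diag_qda":
        return np.array([np.diag(np.diag(S)) for S in Sigma_hat])
    pooled = Sigma_hat.mean(axis=0)            # unweighted average as a pooled stand-in
    if kind == "diag_lda":
        return np.repeat(np.diag(np.diag(pooled))[None], K, axis=0)
    if kind == "spherical":
        sigma2 = np.mean(np.diag(pooled))      # single shared variance
        return np.repeat((sigma2 * np.eye(p))[None], K, axis=0)
    raise ValueError(f"unknown kind: {kind}")

Sigma_hat = np.array([[[2.0, 0.5], [0.5, 1.0]],
                      [[1.0, -0.3], [-0.3, 1.5]]])
print(restrict_covariances(Sigma_hat, "spherical")[0])
```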

SLIDE 27

Examples of LDA and QDA

[Figure: three panels showing Sepal Width against Sepal Length on the iris data (setosa, versicolor, virginica) with the decision regions of Nearest Centroids, LDA and QDA.]

Decision boundaries can be found from $\mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_j, \boldsymbol\Sigma_j)\, \pi_j = \mathcal{N}(\mathbf{x} \mid \boldsymbol\mu_m, \boldsymbol\Sigma_m)\, \pi_m$ for $j \ne m$, with $\boldsymbol\Sigma_j = \boldsymbol\Sigma$ for LDA and $\boldsymbol\Sigma_j = \sigma^2 \mathbf{I}_{p \times p}$ for Nearest Centroids.

SLIDE 28

Take-home message

▶ Classification can be achieved through the point of view of regression

▶ Modelling the conditional densities of the features instead of the classes leads to Discriminant Analysis (DA)

▶ There is a range of assumptions in DA about the correlation structure in feature space → a trade-off between stability and flexibility
