
Lecture 16: Summary and outlook (Felix Held, Mathematical Sciences)

  1. Lecture 16: Summary and outlook Felix Held, Mathematical Sciences MSA220/MVE440 Statistical Learning for Big Data 24th May 2019

  2. The big topics
  1. Statistical Learning
  2. Supervised learning ▶ Classification ▶ Regression
  3. Unsupervised learning ▶ Clustering
  4. Data representations and dimension reduction
  5. Large scale methods

  3. The big data paradigms
  ▶ Small to medium sized data: “Good old stats”. Typical methods: k-nearest neighbour (kNN), linear and quadratic discriminant analysis (LDA and QDA), Gaussian mixture models, …
  ▶ High-dimensional data: big-p paradigm; curse of dimensionality. Typical methods: Feature selection, penalized regression and classification (Lasso, ridge regression, shrunken centroids, …), low-rank approximations (SVD, NMF), …
  ▶ Large scale data: big-n paradigm (sometimes in combination with big-p). Typical methods: Random forests (with its big-n extensions), subspace clustering, low-rank approximations (randomized SVD), …

  4. Statistical Learning

  5. What is Statistical Learning? Learn a model from data by minimizing expected prediction error determined by a loss function (predictive modelling).
  ▶ Model: Find a model that is suitable for the data
  ▶ Data: Data with known outcomes is needed
  ▶ Expected prediction error: Focus on quality of prediction
  ▶ Loss function: Quantifies the discrepancy between observed data and predictions

  6. Statistical Learning for Regression
  ▶ Theoretically best regression function for squared error loss: $\hat{f}(\mathbf{x}) = \mathbb{E}_{p(y|\mathbf{x})}[y]$
  ▶ Approximate (1) or make model-assumptions (2)
  1. k-nearest neighbour regression: $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \frac{1}{k} \sum_{\mathbf{x}_l \in N_k(\mathbf{x})} y_l$
  2. linear regression (viewpoint: generalized linear models (GLM)): $\mathbb{E}_{p(y|\mathbf{x})}[y] \approx \mathbf{x}^T \boldsymbol{\beta}$
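
As a concrete illustration of the two approximations, here is a minimal NumPy sketch (the helper names and toy data are illustrative, not from the lecture):

```python
import numpy as np

def knn_regression(X_train, y_train, x, k=5):
    """Estimate E[y|x] by averaging y over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dists)[:k]               # indices of the k nearest neighbours
    return y_train[nearest].mean()

def linear_regression_fit(X_train, y_train):
    """Least-squares fit of E[y|x] ~ x^T beta."""
    beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return beta

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.1, size=100)
print(knn_regression(X, y, X[0], k=5))
print(linear_regression_fit(X, y))
```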

  7. Statistical Learning for Classification
  ▶ Theoretically best classification rule for 0-1 loss and $K$ possible classes (Bayes rule): $\hat{c}(\mathbf{x}) = \arg\max_{1 \le i \le K} p(i|\mathbf{x})$
  ▶ Approximate (1) or make model-assumptions (2)
  1. k-nearest neighbour classification: $p(i|\mathbf{x}) \approx \frac{1}{k} \sum_{\mathbf{x}_l \in N_k(\mathbf{x})} 1(i_l = i)$
  2. Multi-class logistic regression: $p(i|\mathbf{x}) = \frac{\exp(\mathbf{x}^T \boldsymbol{\beta}^{(i)})}{1 + \sum_{l=1}^{K-1} \exp(\mathbf{x}^T \boldsymbol{\beta}^{(l)})}$ for $i < K$ and $p(K|\mathbf{x}) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\mathbf{x}^T \boldsymbol{\beta}^{(l)})}$
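
A minimal sketch of how the multi-class logistic probabilities are evaluated, assuming the coefficient vectors $\boldsymbol{\beta}^{(1)}, \dots, \boldsymbol{\beta}^{(K-1)}$ have already been fitted (function name and toy numbers are illustrative):

```python
import numpy as np

def multiclass_logistic_probs(x, betas):
    """Class probabilities p(i|x), i = 1..K, with class K as reference.

    betas: array of shape (K-1, p) holding beta^(1), ..., beta^(K-1).
    """
    scores = np.exp(betas @ x)            # exp(x^T beta^(i)) for i = 1..K-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom # last entry is p(K|x)

# toy usage: 3 classes, 2 features
betas = np.array([[0.5, -1.0], [2.0, 0.3]])
x = np.array([1.0, 2.0])
p = multiclass_logistic_probs(x, betas)
print(p, p.sum())                         # probabilities sum to 1
```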

  8. Empirical error rates (I)
  ▶ Training error $R_{tr} = \frac{1}{n} \sum_{l=1}^{n} L(y_l, \hat{f}(\mathbf{x}_l|\mathcal{T}))$ where $\mathcal{T} = \{(y_l, \mathbf{x}_l) : 1 \le l \le n\}$
  ▶ Test error $R_{te} = \frac{1}{m} \sum_{l=1}^{m} L(\tilde{y}_l, \hat{f}(\tilde{\mathbf{x}}_l|\mathcal{T}))$ where $(\tilde{y}_l, \tilde{\mathbf{x}}_l)$ for $1 \le l \le m$ are new samples from the same distribution as $\mathcal{T}$, i.e. $p(\mathcal{T})$.
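
A small NumPy sketch of computing $R_{tr}$ and $R_{te}$ for a least-squares fit under squared-error loss (the setup and names are illustrative, not from the lecture):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_error(X, y, predict, loss=squared_error_loss):
    """Average loss of a fitted predictor over the samples (X, y)."""
    preds = np.array([predict(x) for x in X])
    return loss(y, preds).mean()

# toy usage: fit on a training set, evaluate on fresh test samples
rng = np.random.default_rng(1)
X_tr, X_te = rng.normal(size=(80, 3)), rng.normal(size=(40, 3))
beta_true = np.array([1.0, 0.0, -1.0])
y_tr = X_tr @ beta_true + rng.normal(scale=0.2, size=80)
y_te = X_te @ beta_true + rng.normal(scale=0.2, size=40)

beta_hat, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
predict = lambda x: x @ beta_hat
print("R_tr:", empirical_error(X_tr, y_tr, predict))  # training error
print("R_te:", empirical_error(X_te, y_te, predict))  # test error on new samples
```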

  9. Splitting up the data
  ▶ Holdout method: If we have a lot of samples, randomly split available data into training set and test set
  ▶ c-fold cross-validation: If we have few samples
  1. Randomly split available data into c equally large subsets, so-called folds.
  2. By taking turns, use c − 1 folds as the training set and the last fold as the test set
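
A minimal sketch of the random holdout split; fold creation for cross-validation works analogously by cutting a random permutation of the indices into c pieces (names are illustrative):

```python
import numpy as np

def holdout_split(n, test_fraction=0.25, seed=0):
    """Randomly split sample indices 0..n-1 into a training set and a test set."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    return perm[n_test:], perm[:n_test]   # (train indices, test indices)

train_idx, test_idx = holdout_split(n=10, test_fraction=0.3)
print("train:", train_idx)
print("test: ", test_idx)
```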

  10. Approximations of expected prediction error
  ▶ Use test error for hold-out method, i.e. $R_{te} = \frac{1}{m} \sum_{l=1}^{m} L(\tilde{y}_l, \hat{f}(\tilde{\mathbf{x}}_l|\mathcal{T}))$ where $(\tilde{y}_l, \tilde{\mathbf{x}}_l)$ for $1 \le l \le m$ are the elements in the test set.
  ▶ Use average test error for c-fold cross-validation, i.e. $R_{cv} = \frac{1}{n} \sum_{j=1}^{c} \sum_{(y_l, \mathbf{x}_l) \in \mathcal{F}_j} L(y_l, \hat{f}(\mathbf{x}_l|\mathcal{F}_{-j}))$ where $\mathcal{F}_j$ is the $j$-th fold and $\mathcal{F}_{-j}$ is all data except fold $j$. Note: For $c = n$ this is called leave-one-out cross validation (LOOCV).
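
A sketch of the c-fold cross-validation estimate $R_{cv}$, here with ordinary least squares as a stand-in model (all names and the toy data are illustrative):

```python
import numpy as np

def cv_error(X, y, c, fit, predict, loss, seed=0):
    """c-fold cross-validation estimate R_cv of the expected prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), c)
    total = 0.0
    for j, fold in enumerate(folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != j])
        model = fit(X[train], y[train])     # fit on all data except fold j
        preds = predict(model, X[fold])     # predict the held-out fold
        total += loss(y[fold], preds).sum()
    return total / len(y)                   # average over all n samples

# toy usage with least squares; c = len(y) would give LOOCV
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=60)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
loss = lambda y, p: (y - p) ** 2
print("R_cv:", cv_error(X, y, c=5, fit=fit, predict=predict, loss=loss))
```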

  11. Careful data splitting
  ▶ Note: For the approximations to be justifiable, test and training sets need to be identically distributed
  ▶ Splitting has to be done randomly
  ▶ If data is unbalanced, then stratification is necessary. Examples:
  ▶ Class imbalance
  ▶ Continuous outcome is observed more often in some intervals than others (e.g. high values more often than low values)
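
A sketch of a stratified holdout split for the class-imbalance case: splitting within each class separately so that the class proportions carry over to training and test set (illustrative code, not from the lecture):

```python
import numpy as np

def stratified_holdout(labels, test_fraction=0.25, seed=0):
    """Holdout split that keeps the class proportions roughly equal in train and test."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):                 # split within each class separately
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

labels = np.array([0] * 90 + [1] * 10)            # imbalanced classes
tr, te = stratified_holdout(labels, test_fraction=0.2)
print("test class proportions:", np.bincount(labels[te]) / len(te))
```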

  12. Bias-Variance Tradeoff
  Bias-Variance Decomposition:
  $\mathbb{E}_{p(\mathcal{T},\mathbf{x},y)}[(y - \hat{f}(\mathbf{x}))^2] = \mathbb{E}_{p(\mathbf{x})}[(f(\mathbf{x}) - \mathbb{E}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})])^2] + \mathbb{E}_{p(\mathbf{x})}[\mathrm{Var}_{p(\mathcal{T})}[\hat{f}(\mathbf{x})]] + \sigma^2$
  i.e. Total expected prediction error = Bias² averaged over $\mathbf{x}$ + Variance of $\hat{f}$ averaged over $\mathbf{x}$ + Irreducible Error.
  [Figure: error as a function of model complexity; bias² decreases and variance increases with complexity, giving an underfit region on the left and an overfit region on the right.]
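
The decomposition can be checked empirically at a single point $x_0$ by repeatedly drawing training sets. A rough Monte Carlo sketch under an assumed toy model ($f(x) = \sin x$, Gaussian noise, kNN regression as the estimator; all choices are my own for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin
sigma = 0.3                        # irreducible noise standard deviation
x0, k, n, reps = 1.0, 10, 50, 2000

preds = np.empty(reps)
for r in range(reps):
    X = rng.uniform(0, 3, size=n)             # a fresh training set each repetition
    y = f(X) + rng.normal(scale=sigma, size=n)
    nearest = np.argsort(np.abs(X - x0))[:k]  # k nearest neighbours of x0
    preds[r] = y[nearest].mean()              # kNN prediction at x0

bias2 = (preds.mean() - f(x0)) ** 2           # squared bias at x0
variance = preds.var()                        # variance over training sets at x0
print("bias^2:", bias2, "variance:", variance, "irreducible:", sigma**2)
```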

  13. Classification

  14. Overview
  1. k-nearest neighbours (Lecture 1)
  2. 0-1 regression (Lecture 2) ▶ just an academic example - do not use in practice
  3. Logistic regression (Lecture 2, both binary and multi-class; Lecture 11 for sparse case)
  4. Nearest Centroids (Lecture 2) and shrunken centroids (Lecture 10)
  5. Discriminant analysis (Lecture 2) ▶ Many variants: linear (LDA), quadratic (QDA), diagonal/Naive Bayes, regularized (RDA; Lecture 5), Fisher’s LDA/reduced-rank LDA (Lecture 6), mixture DA (Lecture 8)
  6. Classification and Regression trees (CART) (Lecture 4)
  7. Random Forests (Lecture 5 & 15)

  15. Multiple angles on the same problem
  1. Bayes rule: Approximate $p(i|\mathbf{x})$ and choose largest ▶ e.g. kNN or logistic regression
  2. Model of the feature space: Assume models for $p(\mathbf{x}|i)$ and $p(i)$ separately ▶ e.g. discriminant analysis
  3. Partitioning methods: Create explicit partitions of the feature space and assign each a class ▶ e.g. CART or Random Forests

  16. Finding the parameters of DA
  ▶ Notation: Write $p(i) = \pi_i$ and consider them as unknown parameters
  ▶ Given data $(i_l, \mathbf{x}_l)$ the likelihood maximization problem is $\arg\max_{\boldsymbol{\mu},\boldsymbol{\Sigma},\boldsymbol{\pi}} \prod_{l=1}^{n} N(\mathbf{x}_l|\boldsymbol{\mu}_{i_l}, \boldsymbol{\Sigma}_{i_l}) \pi_{i_l}$ subject to $\sum_{i=1}^{K} \pi_i = 1$.
  ▶ Can be solved using a Lagrange multiplier (try it!) and leads to
  $\hat{\pi}_i = \frac{n_i}{n}$ with $n_i = \sum_{l=1}^{n} 1(i_l = i)$, $\quad \hat{\boldsymbol{\mu}}_i = \frac{1}{n_i} \sum_{i_l = i} \mathbf{x}_l$, $\quad \hat{\boldsymbol{\Sigma}}_i = \frac{1}{n_i - 1} \sum_{i_l = i} (\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)^T$
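
A direct NumPy sketch of these maximum-likelihood estimates (function name and toy data are illustrative):

```python
import numpy as np

def da_parameter_estimates(X, labels):
    """Per-class priors, means and covariances as in the DA likelihood maximization."""
    priors, means, covs = {}, {}, {}
    n = len(labels)
    for cls in np.unique(labels):
        Xc = X[labels == cls]
        priors[cls] = len(Xc) / n             # pi_hat_i = n_i / n
        means[cls] = Xc.mean(axis=0)          # mu_hat_i
        covs[cls] = np.cov(Xc, rowvar=False)  # Sigma_hat_i (1/(n_i - 1) normalization)
    return priors, means, covs

# toy usage with two Gaussian classes
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(60, 2)), rng.normal(2, 1, size=(40, 2))])
labels = np.array([0] * 60 + [1] * 40)
priors, means, covs = da_parameter_estimates(X, labels)
print(priors[0], means[1], covs[1])
```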

  17. Performing classification in DA
  Bayes’ rule implies the classification rule $\hat{c}(\mathbf{x}) = \arg\max_{1 \le i \le K} N(\mathbf{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \pi_i$.
  Note that since log is strictly increasing this is equivalent to $\hat{c}(\mathbf{x}) = \arg\max_{1 \le i \le K} \delta_i(\mathbf{x})$ where
  $\delta_i(\mathbf{x}) = \log N(\mathbf{x}|\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) + \log \pi_i = \log \pi_i - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{1}{2} \log |\boldsymbol{\Sigma}_i| \; (+ C)$
  This is a quadratic function in $\mathbf{x}$.
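
A minimal sketch of evaluating $\delta_i(\mathbf{x})$ and applying the arg-max rule; the parameter layout is my own choice for illustration:

```python
import numpy as np

def qda_discriminant(x, mu, Sigma, prior):
    """delta_i(x) = log pi_i - 0.5 (x-mu)^T Sigma^{-1} (x-mu) - 0.5 log|Sigma|."""
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return np.log(prior) - 0.5 * diff @ np.linalg.solve(Sigma, diff) - 0.5 * logdet

def qda_classify(x, params):
    """Assign x to the class with the largest discriminant; params maps class -> (mu, Sigma, prior)."""
    return max(params, key=lambda cls: qda_discriminant(x, *params[cls]))

# toy usage with two hand-specified classes
params = {
    0: (np.array([0.0, 0.0]), np.eye(2), 0.6),
    1: (np.array([2.0, 2.0]), 0.5 * np.eye(2), 0.4),
}
print(qda_classify(np.array([1.8, 1.9]), params))  # expected: class 1
```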

  18. Different levels of complexity
  ▶ This method is called Quadratic Discriminant Analysis (QDA)
  ▶ Problem: Many parameters that grow quickly with the dimension $p$
  ▶ $K - 1$ for all $\pi_i$
  ▶ $p \cdot K$ for all $\boldsymbol{\mu}_i$
  ▶ $p(p+1)/2 \cdot K$ for all $\boldsymbol{\Sigma}_i$ (most costly)
  ▶ Solution: Replace covariance matrices $\boldsymbol{\Sigma}_i$ by a pooled estimate
  $\hat{\boldsymbol{\Sigma}} = \sum_{i=1}^{K} \frac{n_i - 1}{n - K} \hat{\boldsymbol{\Sigma}}_i = \frac{1}{n - K} \sum_{i=1}^{K} \sum_{i_l = i} (\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)(\mathbf{x}_l - \hat{\boldsymbol{\mu}}_i)^T$
  ▶ Simpler correlation and variance structure: All classes are assumed to have the same correlation structure between features
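
A sketch of the pooled covariance estimate (illustrative helper, not from the lecture):

```python
import numpy as np

def pooled_covariance(X, labels):
    """Pooled estimate: sum the within-class scatter matrices and divide by n - K."""
    classes = np.unique(labels)
    n, p = X.shape
    Sigma = np.zeros((p, p))
    for cls in classes:
        Xc = X[labels == cls]
        diff = Xc - Xc.mean(axis=0)
        Sigma += diff.T @ diff         # sum of (x_l - mu_i)(x_l - mu_i)^T within class i
    return Sigma / (n - len(classes))  # divide by n - K

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, size=(50, 3)), rng.normal(3, 1, size=(50, 3))])
labels = np.array([0] * 50 + [1] * 50)
print(pooled_covariance(X, labels))
```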

  19. Performing classification in the simplified case
  As before, consider $\hat{c}(\mathbf{x}) = \arg\max_{1 \le i \le K} \delta_i(\mathbf{x})$ where
  $\delta_i(\mathbf{x}) = \log \pi_i + \mathbf{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \boldsymbol{\mu}_i^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_i \; (+ C)$
  This is a linear function in $\mathbf{x}$. The method is therefore called Linear Discriminant Analysis (LDA).
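
A minimal sketch of the linear discriminant and the corresponding classification rule, assuming class means, priors and a pooled covariance have already been estimated (names are illustrative):

```python
import numpy as np

def lda_discriminant(x, mu, prior, Sigma_pooled):
    """delta_i(x) = log pi_i + x^T Sigma^{-1} mu_i - 0.5 mu_i^T Sigma^{-1} mu_i."""
    Sinv_mu = np.linalg.solve(Sigma_pooled, mu)
    return np.log(prior) + x @ Sinv_mu - 0.5 * mu @ Sinv_mu

def lda_classify(x, params, Sigma_pooled):
    """params maps class -> (mu, prior); all classes share the pooled covariance."""
    return max(params, key=lambda cls: lda_discriminant(x, *params[cls], Sigma_pooled))

# toy usage: two classes with a shared identity covariance
params = {0: (np.array([0.0, 0.0]), 0.5), 1: (np.array([2.0, 2.0]), 0.5)}
print(lda_classify(np.array([0.4, 0.3]), params, np.eye(2)))  # expected: class 0
```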

  20. Even more simplifications
  Other simplifications of the correlation structure are possible:
  ▶ Ignore all correlations between features but allow different variances, i.e. $\boldsymbol{\Sigma}_i = \boldsymbol{\Lambda}_i$ for a diagonal matrix $\boldsymbol{\Lambda}_i$ (Diagonal QDA or Naive Bayes’ Classifier)
  ▶ Ignore all correlations and make feature variances equal, i.e. $\boldsymbol{\Sigma}_i = \boldsymbol{\Lambda}$ for a diagonal matrix $\boldsymbol{\Lambda}$ (Diagonal LDA)
  ▶ Ignore correlations and variances, i.e. $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}_{p \times p}$ (Nearest Centroids adjusted for class frequencies $\pi_i$)
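
A sketch of the nearest-centroids rule as it falls out of $\boldsymbol{\Sigma}_i = \sigma^2 \mathbf{I}$, with an optional adjustment for the class frequencies $\pi_i$ (illustrative code, not from the lecture):

```python
import numpy as np

def nearest_centroid_classify(x, centroids, priors=None, sigma2=1.0):
    """With Sigma_i = sigma^2 I: delta_i(x) = log pi_i - ||x - mu_i||^2 / (2 sigma^2);
    without priors this simply picks the closest centroid."""
    scores = {}
    for cls, mu in centroids.items():
        score = -np.sum((x - mu) ** 2) / (2 * sigma2)
        if priors is not None:
            score += np.log(priors[cls])   # adjustment for class frequencies
        scores[cls] = score
    return max(scores, key=scores.get)

centroids = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
print(nearest_centroid_classify(np.array([0.5, 0.2]), centroids))  # expected: class 0
```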

  21. Classification and Regression Trees (CART)
  ▶ Complexity of partitioning: Arbitrary Partition > Rectangular Partition > Partition from a sequence of binary splits
  ▶ Classification and Regression Trees create a sequence of binary axis-parallel splits in order to reduce variability of values/classes in each region
  [Figure: a two-class scatter plot in (x1, x2) partitioned by the splits x2 >= 2.2 and x1 >= 3.5, shown together with the corresponding classification tree.]
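
A short example of fitting such a sequence of binary axis-parallel splits. This uses scikit-learn's DecisionTreeClassifier as a stand-in CART implementation (my choice, not something the lecture prescribes), on toy data roughly mimicking the slide's two splits:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Two features, two classes, separable by axis-parallel splits.
rng = np.random.default_rng(6)
X = rng.uniform(0, 6, size=(200, 2))
y = ((X[:, 1] < 2.2) & (X[:, 0] < 3.5)).astype(int)  # class 1 in the lower-left rectangle

tree = DecisionTreeClassifier(max_depth=2)            # at most two levels of binary splits
tree.fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"])) # prints the sequence of axis-parallel splits
```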
