  1. Applied Machine Learning Some important concepts Siamak Ravanbakhsh COMP 551 (fall 2020)

  2. Admin Weekly quiz: the practice quiz was released yesterday; you have 24 hrs to submit your answers; correct answers are released afterward; no extension is possible; your lowest score across all quizzes is ignored. Mini-project 1: we are still working on it; in the meantime, a mini-project from last year is released to give you an idea. Math tutorial: this Friday at noon.

  3. Learning objectives Understanding the following concepts: overfitting & generalization, validation and cross-validation, the curse of dimensionality, no free lunch, and the inductive bias of a learning algorithm.

  4. Model selection Many ML algorithms have hyper-parameters (e.g., K in K-nearest neighbors, the max depth of a decision tree, etc.). How should we select the best hyper-parameter? Example: performance of KNN regression on the California Housing dataset. Underfitting: the model could fit the training data more closely and still get a good test error. Overfitting to the training data: bad performance on unseen data. The best model sits between the two.
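A minimal sketch of the kind of sweep behind this plot, assuming scikit-learn is available; the split proportions and the range of K are illustrative choices, not taken from the slide:

```python
# Sweep K for KNN regression on California Housing and compare training vs.
# held-out error (illustrative split and K values).
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 3, 10, 30, 100]:
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:3d}  train MSE={mean_squared_error(y_tr, knn.predict(X_tr)):.3f}"
          f"  test MSE={mean_squared_error(y_te, knn.predict(X_te)):.3f}")
# small K: near-zero training error but worse test error (overfitting);
# very large K: both errors are high (underfitting).
```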

  5. Model selection What if the unseen data is completely different from the training data? Then there is no point in learning! Assumption: training data points are independent, identically distributed (IID) samples from an unknown distribution, (x^(n), y^(n)) ∼ p(x, y), and unseen data comes from the same distribution.

  6. Loss, cost and generalization Assume we have a model f : x ↦ ŷ, for example f : R → R, and a loss function that measures the error of our prediction, ℓ : (y, ŷ) → R, for example ℓ(y, ŷ) = (y − ŷ)² for regression or ℓ(y, ŷ) = I(y ≠ ŷ) for classification. We train our models to minimize the cost function J = (1/|D_train|) Σ_{(x,y) ∈ D_train} ℓ(y, f(x)). What we really care about is the generalization error E_{x,y∼p}[ℓ(y, f(x))]; how can we estimate it? We can set aside part of the training data and use it to estimate the generalization error.
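A minimal sketch of these definitions in numpy; the stand-in model f and the arrays passed to it are hypothetical:

```python
import numpy as np

def squared_loss(y, y_hat):      # l(y, y_hat) = (y - y_hat)^2, for regression
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):     # l(y, y_hat) = I(y != y_hat), for classification
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def cost(f, X, y, loss):         # J = (1/|D|) * sum_{(x,y) in D} l(y, f(x))
    return np.mean(loss(y, f(X)))

# The generalization error E_{x,y~p}[l(y, f(x))] cannot be computed directly;
# the cost on held-out data drawn from the same p serves as an estimate.
```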

  7. Validation set What we really care about is the generalization error E_{x,y∼p}[ℓ(y, f(x))]; how can we estimate it? We can set aside part of the training data and use it to estimate the generalization error, splitting the data into training, validation, and unseen (test) sets. Pick the hyper-parameter that gives us the best validation error; at the very end, report the error on the test set. Validation and test error could be different because they each use a limited amount of data.
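A sketch of hyper-parameter selection with a validation split, assuming scikit-learn and using KNN classification as a stand-in model; the split proportions and candidate K values are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidate_ks=(1, 3, 5, 11, 21)):
    # carve out the test set first, then a validation set from the remainder
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # pick the hyper-parameter with the lowest validation error
    val_err = {}
    for k in candidate_ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        val_err[k] = 1 - model.score(X_val, y_val)
    best_k = min(val_err, key=val_err.get)

    # only at the very end: refit and report the error on the untouched test set
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_rest, y_rest)
    return best_k, 1 - final.score(X_test, y_test)
```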

  8. Cross validation How can we get a better estimate of the generalization error? Increasing the size of the validation set would help, but it reduces the training set. Cross-validation helps us get better estimates plus an uncertainty measure: divide the (training + validation) data into L parts, and use one part for validation and the remaining L-1 parts for training (e.g., L = 5), always keeping the test set aside.

  9. Cross validation Divide the (training + validation) data into L parts; in each run, use one part for validation and the other L-1 parts for training (run 1 validates on fold 1, run 2 on fold 2, and so on, with the test set always held out). Use the average validation error and its variance (uncertainty) to pick the best model, and report the test error for the final model. This is called L-fold cross-validation; in leave-one-out cross-validation, L = N (only one instance is used for validation in each run).
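A from-scratch sketch of L-fold cross-validation, assuming make_model() returns a fresh scikit-learn-style classifier and X, y are numpy arrays:

```python
import numpy as np

def cross_validate(make_model, X, y, L=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, L)
    errors = []
    for i in range(L):                       # fold i is validation, the rest is training
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(L) if j != i])
        model = make_model().fit(X[train], y[train])
        errors.append(1 - model.score(X[val], y[val]))   # error = 1 - accuracy
    # average validation error and its spread (uncertainty) across the L runs
    return np.mean(errors), np.std(errors)
```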

  10. Cross validation example The plot shows the mean and standard deviation of the validation error in 10-fold cross-validation. The test error is plotted only to show its agreement with the validation error; in practice we don't look at the test set for hyper-parameter tuning. A rule of thumb: pick the simplest model within one standard deviation of the model with the lowest validation error.
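A sketch of this rule of thumb, assuming the candidate models are already ordered from simplest to most complex and their cross-validation means and standard deviations are given:

```python
import numpy as np

def one_std_rule(mean_errors, std_errors):
    # models are assumed ordered from simplest to most complex
    mean_errors, std_errors = np.asarray(mean_errors), np.asarray(std_errors)
    best = np.argmin(mean_errors)
    threshold = mean_errors[best] + std_errors[best]
    # first (i.e., simplest) model whose mean CV error is within one std of the best
    return int(np.argmax(mean_errors <= threshold))
```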

  11. Decision tree example A decision tree for the Iris dataset (D = 2): the figure shows the dataset, the decision tree, and its decision boundaries. The decision boundaries suggest overfitting, which is confirmed using a validation set: training accuracy ~ 85%, validation accuracy ~ 70%.
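A sketch along the lines of this example, assuming scikit-learn; the choice of the two features and the split are illustrative, so the exact accuracies will differ from the slide:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, :2]                                    # keep two features, so D = 2
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)    # grown without limits
print("train accuracy:     ", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_val, y_val))          # noticeably lower
```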

  12. Decision tree: overfitting A decision tree can fit any Boolean function (binary classification with binary features); the figure shows an example of a decision tree representation of a Boolean function (D = 3). There are 2^(2^D) such functions. Why? There are 2^D possible binary inputs, and each of them can be labeled 0 or 1. So a decision tree can perfectly fit our training data. How do we solve the problem of overfitting in large decision trees? Idea 1: grow a small tree. Problem: a substantial reduction in cost may only happen after a few more steps, and by stopping early we cannot know this; in the example, the cost drops only after the second node. Image credit: https://www.wikiwand.com/en/Binary_decision_diagram

  13. Decision tree: overfitting & pruning Idea 2: grow a large tree and then prune it. Idea 3: random forests (later!). Pruning: greedily turn an internal node into a leaf node, choosing the node that gives the lowest increase in the cost; repeat this until only the root node is left; then pick the best among the resulting models using a validation set. In the example, cross-validation is used to pick the best tree size (comparing the tree before and after pruning).
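scikit-learn does not expose the exact greedy procedure above, but it implements a related strategy, cost-complexity pruning; a sketch of combining it with a validation set:

```python
from sklearn.tree import DecisionTreeClassifier

def prune_with_validation(X_tr, y_tr, X_val, y_val):
    # candidate pruning strengths, from the full tree down to the root alone
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
    best_tree, best_acc = None, -1.0
    for alpha in path.ccp_alphas:               # larger alpha -> smaller (more pruned) tree
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
        acc = tree.score(X_val, y_val)          # the validation set picks among the models
        if acc > best_acc:
            best_tree, best_acc = tree, acc
    return best_tree
```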

  14. Evaluation metrics When evaluating a classifier it is useful to look at the confusion matrix: a C×C table that shows how many samples of each class are classified as belonging to each other class (sample images from the CIFAR-10 dataset are shown). The classifier's accuracy is the sum of the diagonal divided by the sum-total of the matrix.
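A minimal sketch of building the C×C confusion matrix from integer class labels; the labels in the example call are hypothetical:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    cm = np.zeros((num_classes, num_classes), dtype=int)   # rows: true, columns: predicted
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 1, 2, 2, 1], [0, 2, 2, 2, 1], num_classes=3)
accuracy = np.trace(cm) / cm.sum()      # sum of the diagonal over the sum-total
```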

  15. Evaluation metrics For binary classification the elements of the confusion matrix are TP, TN, FP, FN (false positives are type I errors, false negatives are type II errors). Some other evaluation metrics based on the confusion table: Accuracy = (TP + TN) / (P + N); Error rate = (FP + FN) / (P + N); Precision = TP / (TP + FP); Recall = TP / P; F1 score = 2 × Precision × Recall / (Precision + Recall).
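The same formulas written out as a small helper that takes the binary confusion-matrix counts as inputs (no guards against zero denominators, since this is only a sketch):

```python
def binary_metrics(tp, fp, tn, fn):
    p, n = tp + fn, tn + fp                       # actual positives and negatives
    accuracy   = (tp + tn) / (p + n)
    error_rate = (fp + fn) / (p + n)
    precision  = tp / (tp + fp)                   # among predicted positives
    recall     = tp / p                           # among actual positives (TPR)
    f1         = 2 * precision * recall / (precision + recall)
    return accuracy, error_rate, precision, recall, f1
```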

  16. Evaluation metrics If an ML algorithm produces a class score or probability p(y = 1 | x), we can trade off between type I and type II errors by moving the decision threshold between 0 and 1. Goal: evaluate class scores/probabilities independently of the choice of threshold. The Receiver Operating Characteristic (ROC) curve plots TPR = TP/P (recall, sensitivity) against FPR = FP/N (fallout, false alarm rate) as the threshold varies. The Area Under the Curve (AUC) is sometimes used as a threshold-independent measure of the quality of the classifier.
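A numpy sketch of the ROC curve and AUC, obtained by sweeping the threshold over the sorted scores; score ties are ignored for simplicity:

```python
import numpy as np

def roc_curve(y_true, scores):
    y = np.asarray(y_true)[np.argsort(-np.asarray(scores))]          # sort by decreasing score
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])            # TP / P as the threshold drops
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])  # FP / N
    return fpr, tpr

def auc(fpr, tpr):
    # trapezoid rule: area under the ROC curve
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
```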

  17. Curse of dimensionality Learning in high dimensions can be difficult. Suppose our data is uniformly distributed in some range, say x ∈ [0, 3]^D, and we predict the label by counting labels in the same unit cell of the grid (similar to KNN). To have at least one example per unit cell, we need 3^D training examples; for D = 180 we need more training examples than the number of particles in the universe.
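A quick numerical check of this claim, taking ~10^80 as a commonly quoted estimate for the number of particles in the observable universe:

```python
cells = 3 ** 180                 # one training example per unit cell of [0, 3]^180
print(len(str(cells)))           # 86 digits, i.e. roughly 7.6e85
print(cells > 10 ** 80)          # True: more cells than particles in the universe
```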

  18. Curse of dimensionality In high dimensions most points have similar distances! The figure shows histograms of the pairwise distances of 1000 random points: as we increase the dimension, the distances become "similar".
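A small simulation of this effect, assuming scipy is available; the sample size and dimensions are illustrative:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for D in [2, 10, 100, 1000]:
    X = rng.uniform(size=(1000, D))             # 1000 random points in [0, 1]^D
    d = pdist(X)                                # all pairwise Euclidean distances
    print(f"D={D:5d}  mean={d.mean():.2f}  std/mean={d.std() / d.mean():.3f}")
# the spread relative to the mean shrinks as D grows: distances become "similar"
```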

  19. Curse of dimensionality Q. Why are most distances similar? A. In high dimensions most of the volume is close to the corners! lim_{D→∞} volume(ball_D) / volume(cube_D) = lim_{D→∞} 2 π^(D/2) r^D / (D Γ(D/2) (2r)^D) = 0. The figure gives a "conceptual" visualization of the same idea for D = 3: the number of corners and the mass in the corners grow quickly with D. Image: Zaki's book on Data Mining and Analysis
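A numerical check of this limit with r = 1, computed in log space with scipy's gammaln to avoid overflow:

```python
import numpy as np
from scipy.special import gammaln

# vol(ball_D) / vol(cube_D) = 2 pi^(D/2) r^D / (D Gamma(D/2) (2r)^D), with r = 1
for D in [2, 3, 10, 20, 50]:
    log_ratio = np.log(2) + (D / 2) * np.log(np.pi) - np.log(D) - gammaln(D / 2) - D * np.log(2)
    print(f"D={D:3d}  ball/cube volume ratio = {np.exp(log_ratio):.2e}")
# D=2 gives pi/4, D=3 gives pi/6, and the ratio then collapses towards zero:
# almost all of the cube's volume ends up in the corners.
```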

  20. Real-world vs. randomly generated data How come ML methods work for image data (D = number of pixels)? The figures compare the pairwise distances for random data with the pairwise distances for the D pixels of MNIST digits. In fact, KNN works well for image classification: the statistics of real images do not match those of random high-dimensional data!

  21. Manifold hypothesis Real-world data is often far from uniformly random. Manifold hypothesis: real data lies close to the surface of a manifold. Example: data dimension D = 3, manifold dimension D̂ = 2. Example: data dimension D = number of pixels, manifold dimension D̂ = 2.

  22. No free lunch Consider a binary classification task with inputs in {0, 1}^3 and suppose this is our dataset. There are 2^4 = 16 binary functions f : {0, 1}^3 → {0, 1} that perfectly fit our dataset. Why? The dataset fixes the labels on 4 of the 2^3 = 8 possible inputs, leaving the remaining 4 labels free. Our learning algorithm can produce one of these as our classifier. The same algorithm cannot perform well for all possible classes of problems (choices of f): no free lunch. Each ML algorithm is biased to perform well on some class of problems.
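A brute-force check of the counting argument; the 4-point dataset here is hypothetical, since only its size matters:

```python
from itertools import product

dataset = {(0, 0, 0): 0, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 0): 0}   # hypothetical labels
inputs = list(product([0, 1], repeat=3))                              # the 2^3 = 8 inputs

# enumerate every f: {0,1}^3 -> {0,1} and count those consistent with the dataset
consistent = sum(
    all(dict(zip(inputs, labels))[x] == y for x, y in dataset.items())
    for labels in product([0, 1], repeat=len(inputs))
)
print(consistent)    # 16 = 2^4: the 4 unseen inputs can be labeled freely
```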

  23. Inductive bias Learning algorithms make implicit assumptions, their learning or inductive bias. E.g., we are often biased towards the simplest explanations of our data (Occam's razor: between two models (explanations) we should prefer the simpler one). Example: both of the following models perfectly fit the data, f̂(x) = x_2 and f̂(x) = x_2 ∧ x_1; the first one is simpler. Why does it make sense for learning algorithms to be biased? The world is not random; there are regularities, and induction is possible (why do you think the sun will rise in the east tomorrow morning?). What are some of the inductive biases in using K-NN?

  24. Summary Curse of dimensionality: exponentially more data is needed in higher dimensions; the manifold hypothesis comes to the rescue! What we care about is the generalization of ML algorithms: overfitting means good performance on the training set doesn't carry over to the test set; underfitting means we don't even have good performance on the training set. Generalization is estimated using a validation set, or better, cross-validation. No algorithm can perform well on all problems, or: there ain't no such thing as a free lunch. Learning algorithms make assumptions about the data (inductive biases); the strength and correctness of those assumptions affect their performance.
