Machine Learning with MATLAB -- Classification
Stanley Liang, PhD, York University
7/27/2017


SLIDE 1

Machine Learning with MATLAB -- Classification

Stanley Liang, PhD, York University

Classification: the definition

  • In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

  • Steps for classification
    1. Data preparation – preprocessing, creating the training / test sets
    2. Training
    3. Cross validation
    4. Model deployment

SLIDE 2

Our data sets

  • Titanic disaster dataset
  • 891 rows
  • Binary classification
  • Features / predictors
    – Class: cabin class
    – Sex: gender of the passenger
    – Age
    – Fare
  • Label / response
    – Survived: 0 = dead, 1 = survived

  • Iris dataset
  • 150 rows
  • Multi-class (3) classification
  • Features / predictors
    – Sepal Length
    – Sepal Width
    – Petal Length
    – Petal Width
  • Label / response
    – Species – string

Our data sets

  • Pima Indians Diabetes Data (NIDDK)
  • 768 rows
  • Binary classification – diabetes or not
  • Features / predictors – 8
    – preg: number of times pregnant
    – plas: plasma glucose concentration
    – pres: diastolic blood pressure (mmHg)
    – skin: triceps skinfold thickness (mm)
    – test: 2-hour serum insulin (mu U/ml)
    – mass: body mass index
    – pedi: diabetes pedigree function (numeric)
    – age
  • Label / response: 1 = diabetes, 0 = no diabetes

  • Wholesale Customers
  • 440 rows
  • Binary / multiclass (2 categorical variables)
  • Continuous variables (6): the monetary units (m.u.) spent on the products
    – Fresh – fresh products
    – Milk – dairy products
    – Grocery – grocery products
    – Frozen – frozen products
    – Detergents_Paper – detergents and paper products
    – Delicatessen – delicatessen products
  • Categorical variables (2)
    – Channel: 1 = Horeca, 2 = Retail
    – Region: 1 = Lisbon, 2 = Oporto, 3 = Other

SLIDE 3

The workflow of classification: optimizing a model

  • Because of prior knowledge you have about the data, or after looking at the classification results, you may want to customize the classifier.
  • You can update and customize the model by setting different options using the fitting functions.
  • Set the options by providing additional inputs for the option name and the option value.
  • model = fitc*(tbl,'response','optionName',optionValue)
    – 'optionName' – name of the option, e.g., 'Cost'
    – optionValue – value to be set for the option, e.g., [0 10; 2 0] changes the cost matrix
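As a concrete illustration of this option-setting pattern, the sketch below raises the misclassification cost when fitting a tree. The table name `dataTrain` and the response name `'Survived'` are assumed for illustration (matching the Titanic data described earlier), not taken from the slides.

```matlab
% Sketch: fit a classification tree with a custom cost matrix.
% Cost(i,j) is the cost of predicting class j when the true class is i,
% so [0 10; 2 0] makes mistaking class 1 for class 2 ten times costlier.
mdl = fitctree(dataTrain, 'Survived', 'Cost', [0 10; 2 0]);
```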

SLIDE 4

k-Nearest Neighbors: Overview

  • Function – fitcknn
  • Performance
    – Fit time: fast
    – Prediction time: fast, ∝ (data size)^2
    – Memory overhead: small
  • Common Properties
    – 'NumNeighbors' – number of neighbors used for classification
    – 'Distance' – metric used for calculating distances between neighbors
    – 'DistanceWeight' – weighting given to different neighbors
  • Special Notes
    – To normalize the data, use the 'Standardize' option.
    – The cosine distance metric works well for "wide" data (more predictors than observations) and data with many predictors.
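The properties above can be combined in a single call; a minimal sketch, assuming training and test tables `dataTrain` / `dataTest` with response `'Survived'`:

```matlab
% Sketch: k-NN classifier with the common properties set explicitly.
knnModel = fitcknn(dataTrain, 'Survived', ...
    'NumNeighbors', 5, ...        % vote among the 5 nearest neighbors
    'Distance', 'euclidean', ...  % distance metric between observations
    'Standardize', true);         % normalize predictors before fitting
predicted = predict(knnModel, dataTest);  % classify new observations
```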

Decision Trees

  • Function – fitctree
  • Performance
    – Fit time: ∝ size of the data
    – Prediction time: fast
    – Memory overhead: small
  • Common Properties
    – 'SplitCriterion' – formula used to determine optimal splits at each level
    – 'MinLeafSize' – minimum number of observations in each leaf node
    – 'MaxNumSplits' – maximum number of splits allowed in the decision tree
  • Special Notes
    – Trees are a good choice when there is a significant amount of missing data.
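A sketch of how these tree properties look in practice; `dataTrain` and `'Survived'` are assumed placeholder names, and the numeric values are illustrative only:

```matlab
% Sketch: a constrained decision tree to limit overfitting.
treeModel = fitctree(dataTrain, 'Survived', ...
    'SplitCriterion', 'gdi', ...  % Gini diversity index (the default)
    'MinLeafSize', 10, ...        % at least 10 observations per leaf
    'MaxNumSplits', 20);          % cap tree growth via the split count
view(treeModel, 'Mode', 'graph')  % visualize the fitted tree
```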

SLIDE 5

Naïve Bayes

  • k-NN and decision trees do not make any assumptions about the distribution of the underlying data.
  • If we assume that the data comes from a certain underlying distribution, we can treat the data as a statistical sample. This can reduce the influence of outliers on our model.
  • A naïve Bayes classifier assumes the independence of the predictors within each class. This classifier is a good choice for relatively simple problems.

  • Function – fitcnb
  • Performance
    – Fit time: normal dist. – fast; kernel dist. – slow
    – Prediction time: normal dist. – fast; kernel dist. – slow
    – Memory overhead: normal dist. – small; kernel dist. – moderate to large
  • Common Properties
    – 'DistributionNames' – distribution used to calculate probabilities
    – 'Width' – width of the smoothing window (when 'DistributionNames' is set to 'kernel')
    – 'Kernel' – type of kernel to use (when 'DistributionNames' is set to 'kernel')
  • Special Notes
    – Naïve Bayes is a good choice when there is a significant amount of missing data.
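A sketch of the kernel-distribution variant discussed above; `dataTrain`, `'Survived'`, and the window width 0.5 are all assumed for illustration:

```matlab
% Sketch: naive Bayes with a nonparametric kernel density per predictor.
nbModel = fitcnb(dataTrain, 'Survived', ...
    'DistributionNames', 'kernel', ...  % kernel estimate instead of normal
    'Kernel', 'normal', ...             % Gaussian smoothing kernel
    'Width', 0.5);                      % smoothing-window width (assumed)
```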

Discriminant Analysis

  • Similar to naïve Bayes, discriminant analysis works by assuming that the observations in each prediction class can be modeled with a normal probability distribution.
  • There is no assumption of independence in each predictor.
  • A multivariate normal distribution is fitted to each class.
  • Performance
    – Fit time: fast; ∝ size of the data
    – Prediction time: fast; ∝ size of the data
    – Memory overhead: linear DA – small; quadratic DA – moderate to large; ∝ number of predictors
  • Common Properties
    – 'DiscrimType' – type of boundary used
    – 'Delta' – coefficient threshold for including predictors in a linear boundary (default 0)
    – 'Gamma' – regularization to use when estimating the covariance matrix for linear DA
  • Linear discriminant analysis works well for "wide" data (more predictors than observations).

  • Linear Discriminant Analysis
    – The default classification assumes that the covariance for each response class is the same, which results in linear boundaries between classes.
    – daModel = fitcdiscr(dataTrain,'response');
  • Quadratic Discriminant Analysis
    – Giving up the equal-covariance assumption lets a quadratic boundary be drawn between classes.
    – daModel = fitcdiscr(dataTrain,'response','DiscrimType','quadratic');

SLIDE 6

Support Vector Machines

  • An SVM calculates the boundary that correctly separates different groups of data with the widest margin.
  • Performance
    – Fit time: fast; ∝ square of the size of the data
    – Prediction time: very fast; ∝ square of the size of the data
    – Memory overhead: moderate
  • Common Properties
    – 'KernelFunction' – variable transformation to apply
    – 'KernelScale' – scaling applied before the kernel transformation
    – 'BoxConstraint' – regularization parameter controlling the misclassification penalty
  • SVMs use a distance-based algorithm. If the data is not normalized, use the 'Standardize' option.
  • Linear SVMs work well for "wide" data (more predictors than observations). Gaussian SVMs often work better on "tall" data (more observations than predictors).

  • Multiclass Support Vector Machines
    – The underlying calculations for classification with support vector machines are binary by nature. You can perform multiclass SVM classification by creating an error-correcting output codes (ECOC) classifier.
    – First, create a template for a binary classifier.
    – Second, create the multiclass SVM classifier: use the function fitcecoc.
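The two-step recipe above can be sketched on the Iris-style data described earlier; the table name `irisTrain` is an assumed placeholder:

```matlab
% Sketch: multiclass SVM via error-correcting output codes (ECOC).
% Step 1: a template describing each binary SVM learner.
t = templateSVM('KernelFunction', 'gaussian', 'Standardize', true);
% Step 2: fitcecoc trains one binary SVM per class pair and combines them.
ecocModel = fitcecoc(irisTrain, 'Species', 'Learners', t);
```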

Cross Validation

  • To compare model performance, we can calculate the loss for each method and pick the method with minimum loss.
  • The loss is calculated on a specific test set. It is possible that a learning algorithm performs well on that particular test set but does not generalize well to other data.
  • The general idea of cross validation is to repeat the above process by creating different training and test sets, fitting the model to each training set, and calculating the loss on the corresponding test set.

SLIDE 7

Keyword–value pairs for cross validation

  • mdl = fitcknn(data,'responseVarName','optionName','optionValue')
    – 'CrossVal' : 'on' – 10-fold cross validation
    – 'Holdout' : scalar from 0 to 1 – holdout with the given fraction reserved for validation
    – 'KFold' : k (scalar) – k-fold cross validation
    – 'Leaveout' : 'on' – leave-one-out cross validation
  • If you already have a partition created using the cvpartition function, you can also provide that to the fitting function:
    >> part = cvpartition(y,'KFold',k);
    >> mdl = fitcknn(data,'responseVarName','CVPartition',part);
  • To evaluate a cross-validated model, use the kfoldLoss function to compute the loss:
    >> kfoldLoss(mdl)
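Putting these pieces together, the sketch below compares two cross-validated models on the same folds; `data` and the response name `'Survived'` are assumed placeholders:

```matlab
% Sketch: compare two methods by k-fold loss on a shared partition,
% so both models see exactly the same training/test splits.
part   = cvpartition(data.Survived, 'KFold', 5);   % stratified 5-fold
knnCV  = fitcknn(data,  'Survived', 'CVPartition', part);
treeCV = fitctree(data, 'Survived', 'CVPartition', part);
losses = [kfoldLoss(knnCV), kfoldLoss(treeCV)];    % pick the smaller loss
```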

Strategies to reduce predictors

  • High-dimensional data
    – Machine learning problems often involve high-dimensional data with hundreds or thousands of predictors, e.g. facial recognition, predicting weather.
  • Learning algorithms are often computation intensive, and reducing the number of predictors can have significant benefits in calculation time and memory consumption.
  • Reducing the number of predictors results in simpler models which generalize better and are easier to interpret.
  • Two common ways:
    – Feature transformation – transform the coordinate space of the observed variables.
    – Feature selection – choose a subset of the observed variables.

SLIDE 8

Feature Transformation

  • Principal Component Analysis (PCA) transforms an n-dimensional feature space into a new n-dimensional space of orthogonal components. The components are ordered by the variation explained in the data.
  • PCA can therefore be used for dimensionality reduction by discarding the components beyond a chosen threshold of explained variance.
  • In the following example, the input X has 11 columns but the first 9 principal components explain more than 95% of the variance.
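The thresholding step can be sketched as follows, assuming a numeric predictor matrix `X` (observations in rows):

```matlab
% Sketch: keep only enough principal components to explain >95% variance.
[coeff, score, ~, ~, explained] = pca(X);  % `explained` is % variance per component
nKeep    = find(cumsum(explained) > 95, 1);  % smallest count crossing 95%
Xreduced = score(:, 1:nKeep);                % transformed, reduced data
```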

Feature Selection

  • The data often contains predictors which do not have any relationship with the response. These predictors should not be included in a model. For example, the patient-id in the heart health data does not have any relationship with the risk of heart disease.
  • For a decision tree model, the predictorImportance method can be used to identify the predictor variables that are important for creating an accurate model.
  • Sequential Feature Selection
    – Incrementally add predictors to the model as long as there is a reduction in the prediction error.
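Sequential feature selection can be sketched with the sequentialfs function; the predictor matrix `X`, label vector `y`, and the tree-based criterion are illustrative assumptions:

```matlab
% Sketch: forward selection, adding predictors while test error drops.
% The criterion function receives train/test folds and returns the
% number of misclassified test observations for a tree fit on the fold.
errFun = @(XT, yT, Xt, yt) nnz(yt ~= predict(fitctree(XT, yT), Xt));
selected = sequentialfs(errFun, X, y);  % logical mask of chosen predictors
```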

SLIDE 9

Ensemble Learning

  • Classification trees are considered weak learners, meaning that they are highly sensitive to the data used to train them. Thus, two slightly different sets of training data can produce two completely different trees and, consequently, different predictions.
  • However, this weakness can be harnessed as a strength by creating several trees (or, following the analogous naming, a forest). New observations can then be applied to all the trees and the resulting predictions can be compared.
  • To improve the classifier, we can use ensemble learning methods.
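The forest idea above can be sketched with a bagged tree ensemble; `dataTrain` and `'Survived'` remain assumed placeholder names, and 100 trees is an illustrative choice:

```matlab
% Sketch: bootstrap-aggregated ("bagged") trees, random-forest style.
bagModel = fitcensemble(dataTrain, 'Survived', ...
    'Method', 'Bag', ...         % train each tree on a bootstrap sample
    'NumLearningCycles', 100);   % number of trees in the forest
```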