Basic Skill of Machine Learning with MATLAB

7/26/2017
Stanley Liang, PhD
York University

MATLAB (matrix laboratory) is a multi-paradigm numerical computing environment and fourth-generation programming language. A proprietary programming language developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, C#, Java, Fortran and Python.

Basic Data Preprocessing
• Importing data to MATLAB
  – Import external data
  – readtable()
• Using logical indexing
  – create a logical idx variable
  – use the idx variable to get the subset
• Creating categorical data
  – for nominal data
  – creating dummy variables
• Grouping data
• Merging data
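The preprocessing steps listed above can be sketched as follows. This is a minimal illustration, not the slides' own example: the table contents, the variable names (ID, Gender, Systolic, Diastolic, Age) and the temporary CSV file are all hypothetical. dummyvar() requires the Statistics and Machine Learning Toolbox, and findgroups()/splitapply() assume a reasonably recent MATLAB release.

```matlab
% Hypothetical patient records, written to a temporary CSV so that
% readtable() has an external file to import.
T0 = table([1;2;3;4], {'F';'M';'F';'M'}, [118;135;122;110], [78;92;85;70], ...
    'VariableNames', {'ID','Gender','Systolic','Diastolic'});
csvFile = [tempname '.csv'];
writetable(T0, csvFile);

% Import external data
T = readtable(csvFile);

% Logical indexing: build a logical idx variable, then use it to subset
idx = T.Diastolic > 90;
highDia = T(idx, :);

% Categorical data for a nominal variable, and its dummy variables
T.Gender = categorical(T.Gender);
D = dummyvar(T.Gender);          % one column per category (F, M)

% Grouping: mean systolic pressure per gender
[G, genderNames] = findgroups(T.Gender);
meanSys = splitapply(@mean, T.Systolic, G);

% Merging: join a second (hypothetical) table on the shared key ID
T2 = table([1;2;3;4], [30;45;52;28], 'VariableNames', {'ID','Age'});
merged = join(T, T2);

delete(csvFile);
```

join() matches rows on the variable the two tables have in common (here ID) and expects every key in the left table to appear exactly once in the right table.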

Visualize Data for First Impression
• Boxplot
• Scatter

Normalizing Data
• Many clustering methods use the distance between observations as a similarity measure. Smaller distances indicate more similar observations.
• In the patient dataset, the systolic pressure is likely to be higher than the diastolic pressure. However, a diastolic pressure > 90 mmHg is hypertension, while a systolic pressure > 120 mmHg is still normal.
• Each statistic has different units and scales. In a distance measure, statistics with wider scales are given more importance, so the data should be normalized before clustering.
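A minimal sketch of the first-impression plots and of one common normalization (z-scores). The blood pressure values are hypothetical and only serve to show two variables on different scales; zscore() and boxplot() come from the Statistics and Machine Learning Toolbox, and z-scoring is just one possible choice of normalization.

```matlab
% Hypothetical systolic and diastolic readings (mmHg)
Systolic  = [118; 135; 122; 110; 128; 141];
Diastolic = [ 78;  92;  85;  70;  81;  95];
X = [Systolic Diastolic];

% First impression: one box per variable, then a scatter plot
figure; boxplot(X, 'Labels', {'Systolic','Diastolic'});
figure; scatter(Systolic, Diastolic);
xlabel('Systolic (mmHg)'); ylabel('Diastolic (mmHg)');

% Normalize each column to zero mean and unit standard deviation so that
% no variable dominates a distance measure because of its scale.
Z = zscore(X);
```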

Unsupervised Learning

Identifying Groups by Visualizing the Data
• A quick way to group data is to visualize it and see whether there are any obvious patterns and groups.
• To effectively visualize data containing more than three predictors, we can use dimensionality reduction techniques such as multidimensional scaling and principal component analysis (PCA).

Multidimensional Scaling
1. Calculate pairwise distances
  – d = pdist(measurements, distance)
  – d -- A distance or dissimilarity vector containing the distance between each pair of observations.
  – measurements -- A numeric matrix containing the data. Each row is considered an observation.
  – distance -- An optional input that indicates the method of calculating the dissimilarity or distance. Commonly used methods are 'euclidean' (default), 'cityblock', and 'correlation'.
2. Perform multidimensional scaling
  – [x, e] = cmdscale(d)
  – x -- The m-by-q matrix of the reconstructed coordinates in q-dimensional space. q is the minimum number of dimensions needed to achieve the given pairwise distances.
  – e -- Eigenvalues of the matrix x*x'
  – d -- A dissimilarity or distance vector

Principal Component Analysis
• Principal component analysis (PCA) is a popular method for dimensionality reduction.
• MATLAB provides the pca() function for PCA.
  – [pcs, scrs, ~, ~, pexp] = pca(measurements)
  – pcs -- An n-by-n matrix of principal components.
  – scrs -- A matrix containing the data transformed using the linear coordinate transformation matrix pcs.
  – pexp -- A vector of the percentage of variance explained by each principal component.
  – measurements -- A numeric matrix containing n columns corresponding to the observed variables. Each row corresponds to an observation.
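The two dimensionality reduction routes above can be sketched as follows. The data are synthetic (two hypothetical groups of 20 observations with 4 predictors each), generated only so the example is self-contained; pdist(), cmdscale(), pca() and zscore() require the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                  % reproducible synthetic data
X = [randn(20,4); randn(20,4) + 4];      % two hypothetical groups, 4 predictors
Z = zscore(X);                           % normalize before measuring distances

% Multidimensional scaling
d = pdist(Z, 'euclidean');               % vector of pairwise distances
[Y, e] = cmdscale(d);                    % reconstructed coordinates, eigenvalues
figure; scatter(Y(:,1), Y(:,2));
title('MDS: first two coordinates');

% Principal component analysis
[pcs, scrs, ~, ~, pexp] = pca(Z);
figure; scatter(scrs(:,1), scrs(:,2));
title('PCA: first two principal component scores');
figure; pareto(pexp);                    % variance explained per component
```

For m observations, pdist() returns m*(m-1)/2 distances, one per pair; squareform(d) converts that vector into a full m-by-m matrix if one is needed.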

k-means Clustering
• Based on the results of PCA and multidimensional scaling, we get the initial impression that the data can be grouped into 2 clusters.
• We can then use the k-means clustering method to divide the observations into groups or clusters.
• MATLAB function for k-means clustering
  – idx = kmeans(X,k)
  – idx -- Cluster indices, returned as a numeric column vector.
  – X -- Data, specified as a numeric matrix.
  – k -- Number of clusters.
• There are several options to tune the clustering; the default distance metric is the squared Euclidean distance.
• Another way to get an optimum clustering is to perform the analysis multiple times with different starting centroids and then choose the clustering scheme that minimizes the sum of distances between the centroids and the observations (sumd).

Clustering by Gaussian Mixture Model
• Gaussian Mixture Model (GMM) clustering involves fitting several n-dimensional normal distributions to the data.
1. Fit a Gaussian Mixture Model -- fitgmdist
  – it fits several multidimensional Gaussian (normal) distributions
  – gmdl = fitgmdist(data,4) -- fit 4 distributions
2. Identify clusters -- calculate each observation's posterior probability for each component
  – grps = cluster(gmdl,data);
  – [grps,~,p] = cluster(gmdl,data); -- also get the posterior probabilities
3. Visualize the result
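A minimal sketch of both clustering approaches on synthetic data. The two well-separated hypothetical groups and the choice of 2 clusters are assumptions made for illustration; the 'Replicates' option is one way to rerun k-means from several starting centroids and keep the solution with the lowest total sumd. All functions used here belong to the Statistics and Machine Learning Toolbox.

```matlab
rng(1);                                  % reproducible synthetic data
X = [randn(50,2); randn(50,2) + 6];      % two hypothetical groups

% k-means, repeated 5 times from different starting centroids;
% kmeans keeps the replicate with the smallest total within-cluster distance.
[idx, C, sumd] = kmeans(X, 2, 'Replicates', 5);

% Gaussian mixture model with 2 components
gmdl = fitgmdist(X, 2);
[grps, ~, p] = cluster(gmdl, X);         % p(i,j): posterior of component j for row i

% Visualize the result
figure; gscatter(X(:,1), X(:,2), idx);  title('k-means clusters');
figure; gscatter(X(:,1), X(:,2), grps); title('GMM clusters');
```

Each row of p sums to 1, so observations whose largest posterior is well below 1 are the ones the mixture model considers ambiguous.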

Interpreting the Clusters
• Visualizing Observations in Clusters
  – With high-dimensional data, it is difficult to visualize the groups as points in space.
  – Use the parallelcoords() function instead.
  – parallelcoords(X,'Group',g)
  – X -- Data, specified as a numeric matrix.
  – 'Group' -- Property name.
  – g -- A vector containing the cluster identifiers.

Evaluating Cluster Quality
• When using clustering techniques such as k-means and Gaussian mixture models, you have to specify the number of clusters.
• However, for high-dimensional data, it is difficult to determine the optimum number of clusters.
• An observation's silhouette value is a normalized (between -1 and +1) measure of how close that observation is to others in the same cluster, compared to the observations in different clusters.
• Silhouette Plots
  – show the silhouette value of each observation, grouped by cluster
  – Clustering schemes in which most of the observations have a high silhouette value are desirable.
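A minimal sketch of the two interpretation tools above, again on synthetic data: 4 hypothetical predictors and a 2-cluster k-means solution are assumed purely for illustration. parallelcoords() and silhouette() require the Statistics and Machine Learning Toolbox.

```matlab
rng(2);                                  % reproducible synthetic data
X = [randn(30,4); randn(30,4) + 5];      % two hypothetical groups, 4 predictors
g = kmeans(X, 2, 'Replicates', 5);       % cluster identifiers

% One line per observation across all 4 predictors, colored by cluster
figure; parallelcoords(X, 'Group', g);

% Silhouette value per observation, then the silhouette plot by cluster
s = silhouette(X, g);
figure; silhouette(X, g);
```

A quick numeric summary such as mean(s) can complement the plot when several clustering schemes are being compared.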

Automate Cluster Quality Evaluation
• Instead of manually experimenting with silhouette plots for different numbers of clusters, you can automate the process with the evalclusters() function.
• The evalclusters() function
  – creates clustering schemes with m to n clusters using a specified method
  – computes the silhouette value for each clustering scheme
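A minimal sketch of the automated evaluation, assuming k-means as the clustering method, the silhouette criterion, and a candidate range of 2 to 5 clusters. The data are synthetic, and the range is an arbitrary choice for illustration. evalclusters() requires the Statistics and Machine Learning Toolbox.

```matlab
rng(3);                                  % reproducible synthetic data
X = [randn(40,3); randn(40,3) + 5];      % two hypothetical groups

% Try k = 2..5 with k-means and score each scheme by its silhouette values
eva = evalclusters(X, 'kmeans', 'silhouette', 'KList', 2:5);

bestK  = eva.OptimalK;                   % number of clusters with the best score
scores = eva.CriterionValues;            % one criterion value per candidate k
figure; plot(eva);                       % criterion value versus number of clusters
```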
