SLIDE 1

Machine Learning

Dimensionality Reduction

Hamid R. Rabiee

Jafar Muhammadi, Alireza Ghasemi Spring 2015 http://ce.sharif.edu/courses/93-94/2/ce717-1/

SLIDE 2

Agenda

- Dimensionality Reduction
- Feature Extraction
- Feature Extraction Approaches
- Linear Methods
  - Principal Component Analysis (PCA)
  - Linear Discriminant Analysis (LDA)
  - Multiple Discriminant Analysis (MDA)
  - PCA vs LDA
  - Linear Methods Drawbacks
- Nonlinear Dimensionality Reduction
  - ISOMAP
  - Local Linear Embedding (LLE)
  - ISOMAP vs. LLE

SLIDE 3

Dimensionality Reduction

- Feature Selection (discussed last time)
  - Select the best subset from a given feature set
- Feature Extraction (will be discussed today)
  - Create new features based on the original feature set
  - Transforms are usually involved

SLIDE 4

Why Dimensionality Reduction?

- Most machine learning and data mining techniques may not be effective for high-dimensional data
  - Curse of dimensionality: query accuracy and efficiency degrade rapidly as the dimension increases
- The intrinsic dimension may be small
  - For example, the number of genes responsible for a certain type of disease may be small
- Visualization: projection of high-dimensional data onto 2D or 3D
- Data compression: efficient storage and retrieval
- Noise removal: positive effect on query accuracy

Adopted from slides of Arizona State University

SLIDE 5

Feature Extraction

- A feature extractor maps each d-dimensional sample X_i to an m-dimensional sample Y_i:

  X_i = (x_{i1}, x_{i2}, \dots, x_{id})^T
  Y_i = f(X_i) = (y_{i1}, y_{i2}, \dots, y_{im})^T

  Feature Extractor: X_i -> Y_i, with m << d, usually

- For example, a 4-dimensional sample X = (x_1, x_2, x_3, x_4)^T may be mapped to a 2-dimensional Y whose components each combine a pair of the original features, e.g. (x_1, x_2) and (x_3, x_4)

SLIDE 6

Feature Extraction Approaches

- The best f(x) is most likely a nonlinear function, but linear functions are easier to find
- Linear approaches:
  - Principal Component Analysis (PCA), also known as the Karhunen-Loeve Expansion (KLE) (will be discussed)
  - Linear Discriminant Analysis (LDA) (will be discussed)
  - Multiple Discriminant Analysis (MDA) (will be discussed)
  - Independent Component Analysis (ICA)
  - Projection Pursuit
  - Factor Analysis
  - Multidimensional Scaling (MDS)

SLIDE 7

Feature Extraction Approaches

- Nonlinear approaches:
  - Kernel PCA
  - ISOMAP
  - Locally Linear Embedding (LLE)
  - Neural Networks
    - Feed-Forward Neural Networks: high-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors.
      - Ref: Hinton, G. E. and Salakhutdinov, R. R. (2006) "Reducing the dimensionality of data with neural networks." Science, Vol. 313, no. 5786, pp. 504-507, 28 July 2006.
    - Self-Organizing Map: a clustering approach to dimensionality reduction; transforms the data to a lower-dimensional lattice

SLIDE 8

Feature Extraction Approaches

- Another view:
  - Unsupervised approaches: PCA, LLE, Self-Organizing Map
  - Supervised approaches: LDA, MDA

SLIDE 9

Principal Component Analysis (PCA)

- Main idea: seek the most accurate data representation in a lower-dimensional space
- Example in 2-D: project the data onto a 1-D subspace (a line) that minimizes the projection error
- Notice that the good line to project onto lies in the direction of largest variance

(Figure: a line with large projection errors is a bad line to project onto; a line with small projection errors is a good one.)

SLIDE 10

Principal Component Analysis (PCA)

- Preserves the largest variances in the data
- What is the direction of largest variance in the data?
  - Hint: if x has a multivariate Gaussian distribution N(μ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ

SLIDE 11

Principal Component Analysis (PCA)

- We can derive the following algorithm (details in the next slides)
- PCA algorithm:
  - X ← input n x d data matrix (each row is a d-dimensional sample)
  - X ← subtract the mean of X from each row of X (the new data has zero mean)
  - Σ ← covariance matrix of X
  - Find the eigenvectors and eigenvalues of Σ
  - C ← the M eigenvectors with the largest eigenvalues, one per column (a d x M matrix); the eigenvalues give the importance of each component
  - Y (transformed data) ← transform X using C (Y = X * C)
- The number of new dimensions is M (M << d)
- Q: How much of the data energy is lost? (See the sketch below.)
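A minimal NumPy sketch of the algorithm above (the function name and variables are illustrative, not from the slides); the retained fraction of the eigenvalue sum is one way to answer the energy-loss question:

```python
import numpy as np

def pca(X, M):
    """Sketch of the PCA algorithm above: project n x d data onto the top-M components."""
    Xc = X - X.mean(axis=0)                 # zero-mean the data
    cov = np.cov(Xc, rowvar=False)          # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]   # indices of the M largest eigenvalues
    C = eigvecs[:, order]                   # d x M matrix of principal directions
    Y = Xc @ C                              # n x M transformed data
    kept = eigvals[order].sum() / eigvals.sum()  # fraction of "energy" (variance) retained
    return Y, C, kept

# Example: reduce 5-D data to 2-D
# Y, C, kept = pca(np.random.default_rng(0).normal(size=(200, 5)), M=2)
```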

SLIDE 12

Principal Component Analysis (PCA)

- Illustration:

(Figure: a scatter of data points shown with the original axes and with the first and second principal components drawn through the directions of largest and second-largest variance.)

SLIDE 13

Principal Component Analysis (PCA)

- Example: (figure)

SLIDE 14

Principal Component Analysis (PCA)

Adopted from lectures of Duncan Fyfe Gillies

SLIDE 15

Principal Component Analysis (PCA)

- Drawbacks
  - PCA was designed for accurate data representation, not for data classification
  - It preserves as much variance in the data as possible
  - If the directions of maximum variance are important for classification, it will work (can you give an example?)
  - However, the direction of maximum variance may be useless for classification

SLIDE 16

PCA Derivation

- PCA can be derived from several viewpoints:
  - Minimum projection error (least squares error)
  - Maximum information gain (maximum variance)
  - Or by neural nets
- The result is the same! Least squares error == maximum variance:
  - This follows from the Pythagorean theorem applied to each point and its projection (see the figure and the identity below)
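A one-line version of that equivalence, written in my own notation (centered data x_j, unit direction e): for each point,

  \|x_j\|^2 = (e^{t} x_j)^2 + \big\| x_j - (e^{t} x_j)\, e \big\|^2

Summing over j, the left side is fixed by the data, so minimizing the total squared projection error (second term) is the same as maximizing the variance of the projections (first term).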

SLIDE 17

PCA Derivation

- We want to find the most accurate representation of the d-dimensional data D = {x_1, x_2, ..., x_n} in some subspace W of dimension k < d
- Let {e_1, e_2, ..., e_k} be an orthonormal basis for W (the e_i are d-dimensional vectors in the original space). Any vector in W can be written as

  \sum_{i=1}^{k} \alpha_i e_i

- Thus x_1 will be represented by some vector in W: \sum_{i=1}^{k} \alpha_{1i} e_i
- The error of this representation is

  error_1 = \Big\| x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i \Big\|^2

- Then the total error is

  J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} \Big\| x_j - \sum_{i=1}^{k} \alpha_{ji} e_i \Big\|^2
    = \sum_{j=1}^{n} \|x_j\|^2 - 2 \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}\, e_i^{t} x_j + \sum_{j=1}^{n} \sum_{i=1}^{k} \alpha_{ji}^2

SLIDE 18

PCA Derivation

- To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, ..., e_k} are orthonormal
- First take the partial derivative with respect to \alpha_{ml}:

  \frac{\partial}{\partial \alpha_{ml}} J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = -2\, e_l^{t} x_m + 2\, \alpha_{ml}

- Thus the optimal value is \alpha_{ml} = e_l^{t} x_m
- Plugging the optimal \alpha_{ml} back into J:

  J(e_1, ..., e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{j=1}^{n} \sum_{i=1}^{k} (e_i^{t} x_j)^2
    = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^{t} S e_i , \qquad \text{where } S = \sum_{j=1}^{n} x_j x_j^{t}

SLIDE 19

PCA Derivation

- So J is

  J(e_1, ..., e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^{t} S e_i

- Minimizing J is equivalent to maximizing

  J' = \sum_{i=1}^{k} e_i^{t} S e_i

- The new problem is to maximize J' while enforcing the constraints e_i^{t} e_i = 1 for all i
- Use the method of Lagrange multipliers: incorporate the constraints with undetermined \lambda_1, ..., \lambda_k and maximize the new function

  u(e_1, ..., e_k) = \sum_{i=1}^{k} e_i^{t} S e_i - \sum_{j=1}^{k} \lambda_j \big( e_j^{t} e_j - 1 \big)

- Compute the partial derivative with respect to e_m:

  \frac{\partial u}{\partial e_m} = 2 S e_m - 2 \lambda_m e_m = 0 \;\Rightarrow\; S e_m = \lambda_m e_m

- Thus \lambda_m and e_m are eigenvalues and eigenvectors of the scatter matrix S

SLIDE 20

PCA Derivation

- Plug e_m back into J and use S e_m = \lambda_m e_m:

  J(e_1, ..., e_k) = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} e_i^{t} S e_i = \sum_{j=1}^{n} \|x_j\|^2 - \sum_{i=1}^{k} \lambda_i

- The first part is constant; thus, to minimize J, take as the basis of W the k eigenvectors of S corresponding to the k largest eigenvalues
- The larger the eigenvalue of S, the larger the variance in the direction of the corresponding eigenvector
- This is exactly what we expected: project x onto the subspace of dimension k which has the largest variance
- This is very intuitive: restrict attention to the directions where the scatter is greatest
- Thus PCA can be thought of as finding a new orthogonal basis by rotating the old axes until the directions of maximum variance are found


SLIDE 22

Kernel PCA

- Linear projections will not detect the pattern. (Figure)

Adopted from slides of Arizona State University

SLIDE 23

Kernel PCA

- The assumption behind PCA is that the data points x are multivariate Gaussian
- Often this assumption does not hold
- However, it may still be possible that a transformation φ(x) is Gaussian; then we can perform PCA in the space of φ(x)
- Kernel PCA performs this PCA; however, because of the "kernel trick," it never computes the mapping φ(x) explicitly!
- Kernel methods will be discussed later => rewrite PCA in terms of dot products (a small sketch follows below)
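A minimal sketch of that idea, assuming an RBF kernel (the kernel choice, function names, and the gamma parameter are illustrative; kernel methods are covered later in the course):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), i.e. dot products in an implicit feature space."""
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_pca(X, M, gamma=1.0):
    """PCA on the implicitly mapped data phi(x), using only the kernel matrix."""
    n = X.shape[0]
    K = rbf_kernel(X, gamma)
    one_n = np.ones((n, n)) / n
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # center phi(x) in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)                # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:M]
    alphas, lambdas = eigvecs[:, order], eigvals[order]
    # projection of training point i onto component m is sqrt(lambda_m) * alpha[i, m]
    return alphas * np.sqrt(np.maximum(lambdas, 0))
```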

SLIDE 24

Linear Discriminant Analysis (LDA)

- LDA is also known as Fisher's Linear Discriminant (FLD)
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible

SLIDE 25

Linear Discriminant Analysis (LDA)

- Main idea: find a projection onto a line such that samples from different classes are well separated
- Example in 2-D: project the data onto a 1-D subspace (a line) that best separates the classes

(Figure: on a bad line to project onto, the classes are mixed up; on a good line, the classes are well separated.)

SLIDE 26

Linear Discriminant Analysis (LDA)

- We can derive the following algorithm (details in the next slides)
- LDA algorithm:
  - X1, X2 ← input n1 x d and n2 x d data matrices belonging to class 1 and class 2
  - μ1, μ2 ← the means of X1 and X2
  - S1, S2 ← scatter matrices of X1 and X2 (scatter = n * Σ; n: size of the data)
  - Sw ← within-class scatter matrix (Sw = S1 + S2)
  - v ← the direction of the new 1-D space, obtained from v = Sw^{-1}(μ1 - μ2)
  - The decision border would be a point in the new space, and a hyperplane in the original space (why?)
  - Y (transformed data) ← project the old data onto the new line (see the sketch below)
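A minimal NumPy sketch of the two-class algorithm above (the function and variable names are illustrative):

```python
import numpy as np

def lda_direction(X1, X2):
    """Sketch of two-class LDA: return the direction v = Sw^{-1} (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1 (n1 * covariance)
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    Sw = S1 + S2                          # within-class scatter matrix
    v = np.linalg.solve(Sw, mu1 - mu2)    # solve Sw v = mu1 - mu2 (assumes Sw is full rank)
    return v / np.linalg.norm(v)

# Project both classes onto the 1-D discriminant direction:
# v = lda_direction(X1, X2); y1, y2 = X1 @ v, X2 @ v
```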

SLIDE 27

LDA Derivation

- Suppose we have 2 classes and d-dimensional samples x_1, ..., x_n, where
  - n1 samples come from the first class
  - n2 samples come from the second class
- The projection of a sample x_i onto a line in direction v is given by v^t x_i
- How do we measure the separation between the projections of the different classes?
  - If μ'1 and μ'2 are the means of the projections of classes 1 and 2, then |μ'1 - μ'2| seems like a good measure
  - The problem with this measure is that it does not consider the variances of the classes
- We need to normalize it by a factor proportional to the variance; we use the scatter (S) of the data

SLIDE 28

LDA Derivation

- The means and scatters of the data (for feature vectors x_i) are:

  \mu_1 = \frac{1}{n_1} \sum_{x_i \in C_1} x_i , \qquad \mu_2 = \frac{1}{n_2} \sum_{x_i \in C_2} x_i

  S_1 = \sum_{x_i \in C_1} (x_i - \mu_1)(x_i - \mu_1)^{T} , \qquad S_2 = \sum_{x_i \in C_2} (x_i - \mu_2)(x_i - \mu_2)^{T}

- The means and scatters of the projected data are (why?):

  \mu'_1 = v^{t} \mu_1 , \qquad \mu'_2 = v^{t} \mu_2

  S'^2_1 = \sum_{x_i \in C_1} (v^{t} x_i - v^{t} \mu_1)^2 , \qquad S'^2_2 = \sum_{x_i \in C_2} (v^{t} x_i - v^{t} \mu_2)^2

- Then we must maximize the following objective function:

  J(v) = \frac{(\mu'_1 - \mu'_2)^2}{S'^2_1 + S'^2_2}

- We could consider other objective functions, too

SLIDE 29

LDA Derivation

- All we need to do now is express J explicitly as a function of v and maximize it
- It is straightforward to see that S'^2_1 = v^{t} S_1 v and S'^2_2 = v^{t} S_2 v
  - Therefore S'^2_1 + S'^2_2 = v^{t} S_W v, where S_W = S_1 + S_2
- It is also straightforward to see that (\mu'_1 - \mu'_2)^2 = v^{t} S_B v, where S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{t}
- Then

  J(v) = \frac{(\mu'_1 - \mu'_2)^2}{S'^2_1 + S'^2_2} = \frac{v^{t} S_B v}{v^{t} S_W v}

- Maximize J(v) by taking the derivative w.r.t. v and setting it to 0:

  \frac{d}{dv} J(v) = \frac{2 S_B v \,(v^{t} S_W v) - 2 S_W v \,(v^{t} S_B v)}{(v^{t} S_W v)^2} = 0

SLIDE 30

LDA Derivation

- We need to solve v^{t} S_W v \,(S_B v) - v^{t} S_B v \,(S_W v) = 0. Dividing by v^{t} S_W v:

  S_B v - \frac{v^{t} S_B v}{v^{t} S_W v}\, S_W v = 0 \;\Rightarrow\; S_B v = J(v)\, S_W v = \alpha\, S_W v

- S_B v, for any vector v, points in the same direction as (\mu_1 - \mu_2):

  S_B v = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{t} v = (\mu_1 - \mu_2)\big((\mu_1 - \mu_2)^{t} v\big) = \beta (\mu_1 - \mu_2)

- Then \beta (\mu_1 - \mu_2) = \alpha S_W v
- If S_W has full rank (the inverse exists), then

  v = \gamma\, S_W^{-1} (\mu_1 - \mu_2)

SLIDE 31

Linear Discriminant Analysis (LDA)

- LDA drawbacks
  - Reduces the dimension only to k = c - 1 (unlike PCA), where c is the number of classes (why?)
  - For complex data, projection onto even the best hyperplane may result in inseparable projected samples
  - Will fail:
    - If J(v) is always 0: this happens if μ1 = μ2 (the discriminatory information is not in the means but rather in the variance of the data)
    - If the classes have large overlap when projected onto any line

SLIDE 32

Multiple Discriminant Analysis (MDA)

- LDA can be generalized to multiple classes (how?)
  - Refer to the Persian notes on the course page
- In the case of c classes, we can reduce the dimensionality to 1, 2, 3, ..., c-1 dimensions (how and why?)

SLIDE 33

Multiple Discriminant Analysis (MDA)

- Within-class scatter:

  S_W = S_1 + \dots + S_c

- Between-class scatter: the scatter of the class means around the total mean \mu of all classes,

  S_B = \sum_{i=1}^{c} n_i\, (\mu_i - \mu)(\mu_i - \mu)^{t}

- Seek vectors w_i, i = 1, ..., c-1, and project the samples onto the (c-1)-dimensional space:

  y = (w_1^{t} x, \; \dots, \; w_{c-1}^{t} x)^{T}

- The criterion is the ratio of between-class to within-class scatter:

  J(W) = \frac{|W^{t} S_B W|}{|W^{t} S_W W|}

- The solution is the set of eigenvectors whose eigenvalues are the c-1 largest in

  S_B w = \lambda\, S_W w

(A sketch of this generalized eigenproblem follows below.)
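A minimal sketch of the multi-class case (the function names are illustrative; it assumes S_W is full rank and uses SciPy's generalized symmetric eigensolver for S_B w = lambda S_W w):

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigenproblem

def mda(X, labels, out_dim=None):
    """Sketch of MDA: project X onto up to c-1 discriminant directions."""
    classes = np.unique(labels)
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                 # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)    # between-class scatter
    k = out_dim or (len(classes) - 1)
    eigvals, eigvecs = eigh(Sb, Sw)                       # solves Sb w = lambda Sw w
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]         # directions with the largest eigenvalues
    return X @ W
```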

SLIDE 34

PCA vs LDA

- PCA (unsupervised)
  - Uses the total scatter matrix
- LDA (supervised)
  - Uses |between-class scatter matrix| / |within-class scatter matrix|
- PCA might outperform LDA when the number of samples per class is small or when the training data sample the underlying distribution non-uniformly
  - With few data, the number of samples per class may be too low for a reasonable estimate of the covariance matrix, while the total number of samples may still be sufficient
  - We never know the underlying distributions of the different classes in advance

(Figure: the PCA and LDA projection directions compared on the same data set.)

SLIDE 35

Linear Methods Drawbacks

- Nonlinear manifolds
  - PCA uses the Euclidean distance
  - Sometimes the Euclidean distance is not appropriate

(Figure: a point A on a curved manifold; what is important is the geodesic distance along the manifold, so we want to "unroll" the manifold.)

- A manifold is a topological space which is locally Euclidean

SLIDE 36

Linear Methods Drawbacks

- Deficiencies of linear methods
  - Data may not be best summarized by a linear combination of features
  - Example: PCA cannot discover the 1-D structure of a helix
  - Question: can a nonlinear method discover a perfect 1-D structure for the helix? (How?)
- Did you realize what nonlinear dimensionality reduction means? (See the small illustration below.)

(Figure: a helix embedded in 3-D.)
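A small hedged illustration of this point (the helix, the neighbor count, and the use of scikit-learn's PCA/Isomap are my choices, not the slides'): the single ISOMAP coordinate should follow the helix parameter closely, while the single PCA coordinate generally does not.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

# A 1-D curve (helix) embedded in 3-D
t = np.linspace(0, 6 * np.pi, 500)
X = np.column_stack([np.cos(t), np.sin(t), 0.05 * t])

pca_1d = PCA(n_components=1).fit_transform(X)                      # linear projection
iso_1d = Isomap(n_neighbors=10, n_components=1).fit_transform(X)   # geodesic-based embedding

# Correlation of each 1-D embedding with the true parameter t
print(abs(np.corrcoef(t, iso_1d[:, 0])[0, 1]), abs(np.corrcoef(t, pca_1d[:, 0])[0, 1]))
```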

SLIDE 37

Nonlinear Dimensionality Reduction

- Many data sets contain essential nonlinear structures that are invisible to PCA and LDA
- So we resort to nonlinear dimensionality reduction approaches:
  - Kernel methods (like kernel PCA)
    - Depend on the kernel; most kernels are not data dependent
  - Manifold-based methods
    - ISOMAP (will be discussed here!)
    - Locally Linear Embedding (LLE) (will be discussed here!)

SLIDE 38

ISOMAP

- A nonlinear approach to manifold learning (dimensionality reduction)
- Estimate the geodesic distance between points by finding shortest paths in a graph whose edges connect neighboring data points
- Look for new data points in a low-dimensional (d-dimensional) space that preserve those geodesic distances

SLIDE 39

ISOMAP

- Construct the neighborhood graph G
  - In the neighborhood graph, each sample is connected to its K nearest neighbors
  - Steps to form the neighborhood graph matrix (DG):
    - Create a binary N x N adjacency matrix in which each sample is connected to its K nearest neighbors
    - Compute all-pairs shortest paths in DG
  - Now DG is the N x N matrix of geodesic distances between any two points along the manifold
- Use DG as the distance matrix in MDS (a sketch of these graph steps follows below)
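A minimal sketch of the graph and shortest-path steps (function names and the neighbor count are illustrative); the MDS step that turns DG into coordinates is sketched two slides below.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, n_neighbors=10):
    """ISOMAP steps 1-2: build the K-NN graph and compute all-pairs geodesic distances."""
    D = squareform(pdist(X))                       # N x N Euclidean distance matrix
    N = D.shape[0]
    G = np.full((N, N), np.inf)                    # inf = no edge in the neighborhood graph
    for i in range(N):
        nn = np.argsort(D[i])[1:n_neighbors + 1]   # K nearest neighbors (skip the point itself)
        G[i, nn] = D[i, nn]
    G = np.minimum(G, G.T)                         # make the graph undirected
    return shortest_path(G, method="D")            # Dijkstra: N x N geodesic distance matrix DG
```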

SLIDE 40

ISOMAP

- Multidimensional Scaling (MDS)
  - MDS attempts to find an embedding of the N objects such that the pairwise distances are preserved
  - Use DG as the distance matrix in MDS
  - The result of MDS is a set of points x_1, ..., x_N in a Euclidean space that minimizes the cost function

    \min_{x_1, \dots, x_N} \; \sum_{i < j} \big( \|x_i - x_j\| - D_{ij} \big)^2

(Figure: an example of multidimensional scaling.)

SLIDE 41

ISOMAP

- Multidimensional Scaling (MDS)
  - The top d eigenvectors of the (double-centered) dissimilarity matrix give the coordinates in the new d-dimensional Euclidean space (see the classical MDS sketch below)
- For more information visit the ISOMAP home page: http://isomap.stanford.edu/
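A minimal sketch of that classical MDS step, completing the ISOMAP pipeline started above (again, the names are illustrative):

```python
import numpy as np

def classical_mds(DG, d=2):
    """Classical MDS: embed N points so Euclidean distances approximate the distances in DG."""
    N = DG.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N            # centering matrix
    B = -0.5 * J @ (DG ** 2) @ J                   # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)           # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:d]            # keep the top-d components
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

# ISOMAP embedding of X:
# Y = classical_mds(geodesic_distances(X, n_neighbors=10), d=2)
```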

SLIDE 42

ISOMAP

- Advantages
  - Nonlinear
  - Globally optimal
    - Still produces a globally optimal low-dimensional Euclidean representation even when the input space is highly folded, twisted, or curved
  - Guaranteed asymptotically to recover the true dimensionality
- Disadvantages
  - May not be stable; depends on the topology of the data
  - Sensitive to noise (short circuits)
  - Although guaranteed asymptotically to recover the geometric structure of nonlinear manifolds:
    - As N increases, pairwise distances provide better approximations to geodesics, but cost more computation
    - If N is small, the geodesic distances will be very inaccurate

SLIDE 43

Local Linear Embedding (LLE)

- ISOMAP is a global approach
  - It uses geodesic distances and needs a graph traversal to compute them
  - Can we get the same functionality with a local approach?
- Local Linear Embedding (LLE)
  - A local approach to dimensionality reduction
  - LLE doesn't use geodesic distances

SLIDE 44

Local Linear Embedding (LLE)

- Main idea: find a nonlinear manifold by stitching together small linear neighborhoods
  - Assumption: the manifold is approximately "linear" when viewed locally, that is, in a small neighborhood
- ISOMAP handles this by doing a graph traversal

SLIDE 45

Local Linear Embedding (LLE)

- LLE procedure
  1) Compute the k nearest neighbors of each sample
  2) Reconstruct each point as a linear combination of its neighbors
  3) Find a low-dimensional embedding which minimizes the reconstruction loss

SLIDE 46

Local Linear Embedding (LLE)

- Each data point is reconstructed from its K neighbors (Step 2):

  \hat{X}_i = \sum_{j=1}^{K} W_{ij} X_j , \qquad \sum_{j=1}^{K} W_{ij} = 1

  - W_{ij} summarizes the contribution of the j-th data point to the reconstruction of the i-th data point
  - To obtain the weights we must solve the following optimization problem:

    \varepsilon(W) = \min_{W} \sum_{i} \Big\| X_i - \sum_{j=1}^{K} W_{ij} X_j \Big\|^2 , \qquad \text{s.t. } \sum_{j=1}^{K} W_{ij} = 1

- Find a low-dimensional embedding which minimizes the reconstruction loss (Step 3):

  \Phi(Y) = \min_{Y} \sum_{i} \Big\| Y_i - \sum_{j=1}^{K} W_{ij} Y_j \Big\|^2

SLIDE 47

Local Linear Embedding (LLE)

- The weights that minimize the reconstruction errors are invariant to rotation, rescaling and translation of the data points
  - Invariance to translation is enforced by the constraint that the weights sum to one (why?)
  - The weights characterize the intrinsic geometric properties of each neighborhood
- The same weights that reconstruct a data point in D dimensions should also reconstruct it on the manifold in d dimensions (d < D)
  - The local geometry is preserved

SLIDE 48

Local Linear Embedding (LLE)

- Meaning of W: a linear representation of every data point by its neighbors
  - This is an intrinsic geometric property of the manifold
  - A good projection should preserve this geometric property as much as possible
- In LLE, we must solve two optimization problems:
  - First optimization problem: finding W
    - A "constrained least squares" problem, and a convex optimization problem
  - Second optimization problem: finding the vectors Y
    - A "least squares" problem, and also a convex optimization problem

SLIDE 49

Local Linear Embedding (LLE)

- Optimization problem 1: obtaining W
  - Compute the optimal weights for each point individually:

    \varepsilon_i = \Big| x_i - \sum_{j} w_{ij} x_j \Big|^2 = \Big| \sum_{j} w_{ij} (x_i - x_j) \Big|^2 = \sum_{j} \sum_{k} w_{ij} w_{ik} C_{jk} ,
    \qquad C_{jk} = (x_i - x_j)^{T} (x_i - x_k)

    (the sums run over the neighbors of x_i)

  - This error can be minimized in closed form, using a Lagrange multiplier to enforce the constraint \sum_{j} w_{ij} = 1. In terms of C, the optimal weights are given by:

    w_{ij} = \frac{\sum_{k} C^{-1}_{jk}}{\sum_{p} \sum_{q} C^{-1}_{pq}}

  - W_{ij} is zero for all non-neighbors of x_i

(A sketch of this step follows below.)
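A minimal sketch of this step (the names, the neighbor count, and the regularization term are my additions; solving C w = 1 and then normalizing gives the same weights as the closed form above):

```python
import numpy as np

def lle_weights(X, n_neighbors=10, reg=1e-3):
    """LLE step 2: reconstruct each point from its K nearest neighbors."""
    N = X.shape[0]
    W = np.zeros((N, N))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    for i in range(N):
        nn = np.argsort(D[i])[1:n_neighbors + 1]   # neighbor indices (exclude the point itself)
        Z = X[nn] - X[i]                           # neighbors centered on x_i
        C = Z @ Z.T                                # local Gram matrix C_jk
        C += reg * np.trace(C) * np.eye(len(nn))   # regularize: C can be singular if K > d
        w = np.linalg.solve(C, np.ones(len(nn)))   # solve C w = 1
        W[i, nn] = w / w.sum()                     # normalize so the weights sum to one
    return W
```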

SLIDE 50

Local Linear Embedding (LLE)

- Optimization problem 2: obtaining Y
  - A more direct and simpler derivation for Y (with one embedded point per row of Y):

    \Phi(Y) = \sum_{i} \Big\| Y_i - \sum_{j} W_{ij} Y_j \Big\|^2 = \| Y - W Y \|_F^2 = \| (I - W) Y \|_F^2
      = \mathrm{trace}\big( Y^{T} (I - W)^{T} (I - W)\, Y \big) = \mathrm{trace}\big( Y^{T} M Y \big) ,
    \qquad M = (I - W)^{T} (I - W)

    where \| \cdot \|_F denotes the Frobenius norm, i.e.

    \| A \|_F^2 = \sum_{i} \sum_{j} a_{ij}^2 = \mathrm{trace}(A A^{T})

  - Y is given by the eigenvectors of the d lowest non-zero eigenvalues of the matrix M (a sketch follows below)
  - For more information visit the LLE home page: http://cs.nyu.edu/~roweis/lle/
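A minimal sketch of this final step (names are illustrative; it assumes W comes from the weight sketch above):

```python
import numpy as np

def lle_embedding(W, d=2):
    """LLE step 3: minimize trace(Y^T M Y) with M = (I - W)^T (I - W)."""
    N = W.shape[0]
    I_W = np.eye(N) - W
    M = I_W.T @ I_W
    eigvals, eigvecs = np.linalg.eigh(M)   # ascending eigenvalues
    # drop the first eigenvector (eigenvalue ~ 0, the constant direction),
    # keep the next d eigenvectors as the embedding coordinates
    return eigvecs[:, 1:d + 1]

# Full LLE pipeline:
# Y = lle_embedding(lle_weights(X, n_neighbors=10), d=2)
```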

SLIDE 51

Local Linear Embedding (LLE)

- Some limitations of LLE
  - Requires dense data points on the manifold for good estimation
  - A good neighborhood seems essential to its success
    - How to choose k?
    - Too few neighbors: results in a rank-deficient tangent space and leads to over-fitting
    - Too many neighbors: the tangent space will not match the local geometry well

SLIDE 52

ISOMAP vs. LLE

- ISOMAP preserves the neighborhoods and their geometric relations better than LLE
- LLE requires massive input data sets, and every point must use the same weight dimension (number of neighbors)
- A merit of ISOMAP is its fast processing time with Dijkstra's algorithm
- ISOMAP is more practical than LLE

SLIDE 53

Any Questions?

End of Lecture 4. Thank you!

Spring 2015

http://ce.sharif.edu/courses/93-94/2/ce717-1/