On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the - PowerPoint PPT Presentation

On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the Matrix and the Generalisation Generalisation Error Error Matrix and the of Kernel PCA ( of Kernel PCA (Shawe Shawe- -Taylor, et al. 2005) Taylor, et al. 2005) Ameet Talwalkar Ameet Talwalkar 02/13/07 02/13/07

Outline Outline � Background Background � � Motivation Motivation � � PCA, MDS PCA, MDS � � ( (Isomap Isomap) ) � � Kernel PCA Kernel PCA � � Generalisation Generalisation Error of Kernel PCA Error of Kernel PCA �

Dimensional Reduction: Motivation Dimensional Reduction: Motivation Lossy Lossy � Computational efficiency Computational efficiency � � Visualization of data requires 2D or 3D representations Visualization of data requires 2D or 3D representations � � Curse of Dimensionality : Learning algorithms require Curse of Dimensionality : Learning algorithms require � “reasonably” good sampling “reasonably” good sampling Intractable learning A(x) problem Tractable learning A(x’) problem Dim Red x -> x’ Lossless – – “Manifold Learning” “Manifold Learning” Lossless � Assumes existence of “intrinsic dimension,” or a Assumes existence of “intrinsic dimension,” or a � reduced representation containing all independent reduced representation containing all independent variables variables

Linear Dimensional Reduction Linear Dimensional Reduction � Assumes input data is a linear function of the Assumes input data is a linear function of the � independent variables independent variables � Common Methods: Common Methods: � � Principal Component Analysis (PCA) Principal Component Analysis (PCA) � � Multidimensional Scaling (MDS) Multidimensional Scaling (MDS) �

PCA – – Big Picture Big Picture PCA � Linearly transform input data in a way Linearly transform input data in a way � that: that: � Maximizes signal (variance) Maximizes signal (variance) � � Minimizes redundancy of signal (covariance) Minimizes redundancy of signal (covariance) �

PCA – – Simple Example Simple Example PCA � Original Data Points Original Data Points � � E.g. shoe size E.g. shoe size � measured in ft, cm measured in ft, cm � y = x provides a y = x provides a � good approx of data good approx of data

PCA – – Simple Example (cont) Simple Example (cont) PCA � Original data Original data � restored using only restored using only first principal first principal component component

PCA – – Covariance Covariance PCA � Covariance is a measure of how much two Covariance is a measure of how much two � variables vary together variables vary together = − − cov( x , y ) E [( x x )( y y )] � � = cov( x , x ) var( x ) � � � If x and y are independent, then If x and y are independent, then cov(x,y cov(x,y) = 0 ) = 0 �

PCA – – Covariance Matrix Covariance Matrix PCA � Stores Stores pairwise pairwise covariance of variables covariance of variables � � Diagonals are variances Diagonals are variances � � Symmetric, Positive Semi Symmetric, Positive Semi- -definite definite � � Start with m column vector observations of n variables Start with m column vector observations of n variables � � Covariance is an n x n matrix Covariance is an n x n matrix � [ ] [ ] [ ] = − − T C E ( X E X )( X E X ) X 1 1 m ∑ T = = T C XX x x X i i m m = i 1

Eigendecomposition Eigendecomposition � Eigenvectors (v) and eigenvalues ( Eigenvectors (v) and eigenvalues ( λ ) for an n x n λ ) for an n x n � matrix, A, are pairs (v, λ ) such that: matrix, A, are pairs (v, λ ) such that: = λ Av v � � � If A is a real symmetric matrix, it can be If A is a real symmetric matrix, it can be � diagonalized into A = E DE into A = E DE T diagonalized T � E = A’s E = A’s orthonormal orthonormal eigenvectors eigenvectors � � D = diagonal matrix of A’s eigenvalues D = diagonal matrix of A’s eigenvalues � � A is positive semi A is positive semi- -definite => eigenvalues non definite => eigenvalues non- -negative negative �

PCA – – Goal (x3) Goal (x3) PCA � Linearly transform input data in a way that: Linearly transform input data in a way that: � � Maximizes signal (variance) Maximizes signal (variance) � � Minimizes redundancy of signal (covariance) Minimizes redundancy of signal (covariance) � � Algorithm: Algorithm: � � Select variance maximizing direction input space Select variance maximizing direction input space � � Find next variance maximizing direction that is orthogonal Find next variance maximizing direction that is orthogonal � to all previously selected directions to all previously selected directions � Repeat k Repeat k- -1 times 1 times � � Find a transformation, P, such that Y = PX and C Find a transformation, P, such that Y = PX and C Y is Y is � diagonalized diagonalized � Solution: project data onto eigenvectors of Solution: project data onto eigenvectors of C C x � x

PCA – – Algorithm Algorithm PCA Goal: Find P where Y = PX s.t s.t. C . C Y is diagonalized diagonalized Goal: Find P where Y = PX Y is � � Select P = E T , or a matrix where each Select P = E T , or a matrix where each 1 � � = T C YY row is an eigenvector of C row is an eigenvector of C x Y m x 1 = T C PAP = T ( PX )( PX ) Y m = T T P ( P DP ) P 1 = T T PXX P m = D = T PAP Inverse = Transpose for Inverse = Transpose for orthonormal orthonormal � � = 1 matrix matrix = T T where A XX EDE m C Y is diagonalized diagonalized C Y is � � PCs are the eigenvectors of C C x PCs are the eigenvectors of � � x th diagonal value of C note: eigenvectors of E are note: eigenvectors of E are i th diagonal value of C Y is the variance of i Y is the variance of � � � � orthonormal orthonormal X along p i X along p i

Gram Matrix (Kernel Matrix) Gram Matrix (Kernel Matrix) � Given X, a collection of m column vector observations Given X, a collection of m column vector observations � of n variables of n variables � Gram Matrix of M: matrix of dot products of inputs Gram Matrix of M: matrix of dot products of inputs � � m x m, real, symmetric m x m, real, symmetric � � Positive semi Positive semi- -definite definite � � “similarity matrix” “similarity matrix” � = T K X X = ⋅ K x x ij i j

Classical Multidimensional Scaling Classical Multidimensional Scaling � Given m objects and dissimilarity Given m objects and dissimilarity δ δ ij for each pair, ij for each pair, � find space in which δ δ ij ≈ Euclidean distance Euclidean distance find space in which ij ≈ 2 = Euclidean Distance: � If If δ δ ij = Euclidean Distance: ij 2 � � Can convert Dissimilarity matrix to Gram Matrix (or we Can convert Dissimilarity matrix to Gram Matrix (or we � can just start with Gram Matrix) can just start with Gram Matrix) � MDS yields same answer as PCA MDS yields same answer as PCA �

Classical Multidimensional Scaling Classical Multidimensional Scaling � Convert Dissimilarity Matrix to Gram Matrix (K) Convert Dissimilarity Matrix to Gram Matrix (K) � � Eigendecomposition Eigendecomposition of K of K � � K = EDE K = EDE T = ED ED 1/2 D 1/2 E T = (ED (ED 1/2 ) (ED 1/2 ) T T = 1/2 D 1/2 E T = 1/2 ) (ED 1/2 ) T � T X � K = K = X X T X � � X X = (ED = (ED 1/2 ) T 1/2 ) T � � Reduce Dimension Reduce Dimension � � Construct Construct X X from subset of eigenvectors/ from subset of eigenvectors/eigenvalues eigenvalues � � Identical to PCA Identical to PCA �

Limitations of Linear Methods Limitations of Linear Methods � Cannot account for non Cannot account for non- - � Small linear relationship of data in linear relationship of data in Euclidean input space input space distance � Data may still have linear Data may still have linear � relationship in some feature relationship in some feature space space � Isomap Isomap: use geodesic : use geodesic � distance to recover manifold distance to recover manifold Large geodesic � Length of shortest curve on Length of shortest curve on � distance a manifold connecting two a manifold connecting two points on the manifold points on the manifold

Local Estimation of Manifolds Local Estimation of Manifolds � Small patches on a non Small patches on a non- -linear manifold look linear linear manifold look linear � � Locally linear neighborhoods defined in two ways Locally linear neighborhoods defined in two ways � � k k - -nearest neighbors: find the k nearest points to a given point nearest neighbors: find the k nearest points to a given point � � ε ε - - ball: find all points that lie within ball: find all points that lie within ε ε of a given point of a given point �

On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the - PowerPoint PPT Presentation

On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the Matrix and the Generalisation Generalisation Error Error Matrix and the of Kernel PCA ( of Kernel PCA (Shawe Shawe- -Taylor, et al. 2005) Taylor, et al. 2005) Ameet

21 st Century Antibiotics Gram Negative Antibiotic Gram Positive Antibiotic Plasmid Library

More microscopic slides of bacteria Gram stain Good example of bacilli gram stain that is

N-gram models Unsmoothed n-gram models (finish slides from last class) Smoothing

N-Gram Model Formulas Estimating Probabilities N-gram conditional probabilities can be

GOLD/SILVER/PLATINUM BARS & COINS RSBL 0.5 Gram 999 Purity Platinum Bar/Coin More Details

Joshua Hartigan Supervisor: Judy-anne Osborn Heres a matrix And heres its Gram

Many words share the same root word This week we are focusing on words with the root gram.

Anaerobes Veillonella Gram positive bacilli Clostridium perfringens, tetani,

N-grams & Language ID If N-gram models represent language models, can we use N-gram

Language as an Interface Spencer Kelly introduction The pope is catholic. language as data

Gram-Schmidt Finding Orthonormal Basis The famous Gram-Schmidt process is used to produce an

Nosocomial Vaccine Corporation Announces Presentation on Multivalent Vaccine for Gram-Negative

Approximate search in misuse detection-based IDS by using the q-gram distance Sverre Bakke

Hi High S School ool I Intern Progr gram am TREPs request for TRMA Sponsorship Propose

Gram-Schmidt algorithm Aim lecture: We use the theory of last lecture to give an algorithm for

SI485i : NLP Set 4 Smoothing Language Models Fall 2013 : Chambers Review: evaluating n-gram

Accelerated Natural Language Processing Lecture 5 N-gram models, entropy Sharon Goldwater (some

The ENCOPLOT Similarity Measure for Automatic Detection of Plagiarism Cristian Grozea 1 Marius

Machine Translation Evaluation Sara Stymne Partly based on Philipp Koehns slides for chapter 8

Text Reuse Detection Using a Composition of Text Similarity Measures Br, Zesch, Gurevych 2012

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

11 Practicalities 2: Evaluating MT Systems Now that weve talked about how to create machine

Corpus Linguistics Statistical Measures in Information Retrieval Niko Schenk Institut f ur

Image Captioning Image Captioning Image Captioning A survey of recent deep-learning approaches

On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the - PowerPoint PPT Presentation

On the Eigenspectrum Eigenspectrum of the Gram of the Gram On the Matrix and the Generalisation Generalisation Error Error Matrix and the of Kernel PCA ( of Kernel PCA (Shawe Shawe- -Taylor, et al. 2005) Taylor, et al. 2005) Ameet

21 st Century Antibiotics Gram Negative Antibiotic Gram Positive Antibiotic Plasmid Library

More microscopic slides of bacteria Gram stain Good example of bacilli gram stain that is

N-gram models Unsmoothed n-gram models (finish slides from last class) Smoothing

N-Gram Model Formulas Estimating Probabilities N-gram conditional probabilities can be

GOLD/SILVER/PLATINUM BARS &amp; COINS RSBL 0.5 Gram 999 Purity Platinum Bar/Coin More Details

Joshua Hartigan Supervisor: Judy-anne Osborn Heres a matrix And heres its Gram

Many words share the same root word This week we are focusing on words with the root gram.

Anaerobes Veillonella Gram positive bacilli Clostridium perfringens, tetani,

N-grams &amp; Language ID If N-gram models represent language models, can we use N-gram

Language as an Interface Spencer Kelly introduction The pope is catholic. language as data

Gram-Schmidt Finding Orthonormal Basis The famous Gram-Schmidt process is used to produce an

Nosocomial Vaccine Corporation Announces Presentation on Multivalent Vaccine for Gram-Negative

Approximate search in misuse detection-based IDS by using the q-gram distance Sverre Bakke

Hi High S School ool I Intern Progr gram am TREPs request for TRMA Sponsorship Propose

Gram-Schmidt algorithm Aim lecture: We use the theory of last lecture to give an algorithm for

SI485i : NLP Set 4 Smoothing Language Models Fall 2013 : Chambers Review: evaluating n-gram

Accelerated Natural Language Processing Lecture 5 N-gram models, entropy Sharon Goldwater (some

The ENCOPLOT Similarity Measure for Automatic Detection of Plagiarism Cristian Grozea 1 Marius

Machine Translation Evaluation Sara Stymne Partly based on Philipp Koehns slides for chapter 8

Text Reuse Detection Using a Composition of Text Similarity Measures Br, Zesch, Gurevych 2012

Word Embeddings Tutorial HILA GONEN PHD STUDENT AT YOAV GOLDBERGS LAB BAR ILAN UNIVERSITY

11 Practicalities 2: Evaluating MT Systems Now that weve talked about how to create machine

Corpus Linguistics Statistical Measures in Information Retrieval Niko Schenk Institut f ur

Image Captioning Image Captioning Image Captioning A survey of recent deep-learning approaches

GOLD/SILVER/PLATINUM BARS & COINS RSBL 0.5 Gram 999 Purity Platinum Bar/Coin More Details

N-grams & Language ID If N-gram models represent language models, can we use N-gram