
Linear Dimensionality Reduction
Practical Machine Learning (CS294-34), September 24, 2009
Percy Liang

Lots of high-dimensional data...


  1. PCA objective 2: projected variance
     Empirical distribution: uniform over x_1, ..., x_n
     Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i)
     Variance (think sum of squares if centered): v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)²
     Assume the data is centered: Ê[x] = 0 (what's Ê[U⊤x]?)
     Objective: maximize the variance of the projected data:
         max_{U ∈ ℝ^{d×k}, U⊤U = I} Ê[‖U⊤x‖²]

  2. Equivalence in two objectives
     Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small)
     Pythagorean decomposition: x = UU⊤x + (I − UU⊤)x
     (Picture: ‖x‖ is the hypotenuse of a right triangle with legs ‖UU⊤x‖ and ‖(I − UU⊤)x‖.)
     Take expectations; note that the rotation U doesn't affect lengths:
         Ê[‖x‖²] = Ê[‖U⊤x‖²] + Ê[‖x − UU⊤x‖²]
     Minimize reconstruction error ↔ Maximize captured variance
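
To make the equivalence concrete, here is a small numpy sketch (an editor's illustration, not part of the original deck) that checks Ê[‖x‖²] = Ê[‖U⊤x‖²] + Ê[‖x − UU⊤x‖²] on random centered data, with U taken as the top-k eigenvectors of the empirical covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 5, 200, 2

# Random data, centered so that the empirical mean is zero.
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Orthonormal U: top-k eigenvectors of the covariance C = (1/n) X X^T.
C = X @ X.T / n
eigvals, eigvecs = np.linalg.eigh(C)      # eigenvalues in ascending order
U = eigvecs[:, ::-1][:, :k]               # top-k eigenvectors as columns

total_var    = np.mean(np.sum(X ** 2, axis=0))           # E[||x||^2]
captured_var = np.mean(np.sum((U.T @ X) ** 2, axis=0))   # E[||U^T x||^2]
reconstr_err = np.mean(np.sum((X - U @ (U.T @ X)) ** 2, axis=0))

print(total_var, captured_var + reconstr_err)  # agree up to floating-point error
```

The two printed numbers agree, matching the captured-variance / reconstruction-error split.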

  3. Finding one principal component
     Input data: X = (x_1 ... x_n)
     Objective: maximize the variance of the projected data:
         max_{‖u‖=1} Ê[(u⊤x)²]
       = max_{‖u‖=1} (1/n) Σ_{i=1}^n (u⊤x_i)²
       = max_{‖u‖=1} (1/n) ‖u⊤X‖²
       = max_{‖u‖=1} u⊤ ((1/n) XX⊤) u
       = largest eigenvalue of C, where C := (1/n) XX⊤ (the covariance matrix of the data)
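
A minimal sketch of this derivation (my own illustration on randomly generated centered data): the top eigenvector of C = (1/n) XX⊤ attains the maximum projected variance, which equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 500))
X -= X.mean(axis=1, keepdims=True)       # center the data

C = X @ X.T / X.shape[1]                 # covariance matrix C = (1/n) X X^T
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
u = eigvecs[:, -1]                       # unit eigenvector of the largest eigenvalue

projected_variance = np.mean((u @ X) ** 2)
print(projected_variance, eigvals[-1])   # both equal the largest eigenvalue of C
```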

  4. How many principal components?
     • Similar to the question of "How many clusters?"
     • The magnitude of the eigenvalues indicates the fraction of variance captured.
     • Eigenvalues on a face image dataset (plot of λ_i for i = 2, ..., 11, dropping from 1353.2 to 287.1).
     • Eigenvalues typically drop off sharply, so you don't need that many components.
     • Of course, variance isn't everything...
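
One common heuristic, shown below as a sketch (the 95% default and the function name choose_k are my choices, not from the slides), is to keep the smallest k whose eigenvalues capture a target fraction of the total variance:

```python
import numpy as np

def choose_k(X, threshold=0.95):
    """Smallest k whose top-k eigenvalues capture `threshold` of the total variance.

    X is a d x n data matrix (columns are points); it is centered here.
    The 95% default is a common heuristic, not a rule from the slides.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    eigvals = np.linalg.eigvalsh(Xc @ Xc.T / Xc.shape[1])[::-1]  # descending
    captured = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(captured, threshold) + 1)
```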

  5. Computing PCA
     Method 1: eigendecomposition
       U are the eigenvectors of the covariance matrix C = (1/n) XX⊤
       Computing C already takes O(nd²) time (very expensive)
     Method 2: singular value decomposition (SVD)
       Find X = U_{d×d} Σ_{d×n} V⊤_{n×n} where U⊤U = I_{d×d}, V⊤V = I_{n×n}, and Σ is diagonal
       Computing the top k singular vectors takes only O(ndk)
     Relationship between eigendecomposition and SVD:
       the left singular vectors are the principal components (C = UΣ²U⊤)
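
The sketch below (illustrative; it uses numpy's full SVD rather than a truncated O(ndk) solver) checks that the two routes agree: the left singular vectors of the centered X match the eigenvectors of C up to sign.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, k = 50, 1000, 5
X = rng.normal(size=(d, n))
X -= X.mean(axis=1, keepdims=True)

# Method 1: eigendecomposition of C = (1/n) X X^T (forming C costs O(n d^2)).
C = X @ X.T / n
w, V = np.linalg.eigh(C)
U_eig = V[:, ::-1][:, :k]

# Method 2: SVD of X. (Full SVD here for simplicity; iterative truncated
# solvers can get the top k singular vectors in roughly O(n d k) time.)
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)
U_svd = U_svd[:, :k]

# The two bases agree column by column, up to sign.
print(np.allclose(np.abs(U_eig.T @ U_svd), np.eye(k), atol=1e-6))
```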

  6. Roadmap
     • Principal component analysis (PCA)
       – Basic principles
       – Case studies
       – Kernel PCA
       – Probabilistic PCA
     • Canonical correlation analysis (CCA)
     • Fisher discriminant analysis (FDA)
     • Summary

  7. Eigen-faces [Turk and Pentland, 1991]
     • d = number of pixels
     • Each x_i ∈ ℝ^d is a face image
     • x_{ji} = intensity of the j-th pixel in image i
     • X_{d×n} ≅ U_{d×k} Z_{k×n}, with Z = (z_1 ... z_n)
     Idea: z_i is a more "meaningful" representation of the i-th face than x_i
     Can use z_i for nearest-neighbor classification
     Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k
     Why no time savings for a linear classifier?
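
A rough sketch of the eigen-faces pipeline, assuming hypothetical arrays train_X, train_y, test_X of pixel intensities and labels (the names and k = 40 are illustrative, not from the paper):

```python
import numpy as np

def eigenfaces_nn(train_X, train_y, test_X, k=40):
    """Nearest-neighbor face classification in a k-dimensional eigen-face space.

    train_X: d x n_train pixel-intensity matrix (one image per column);
    train_y: length-n_train label array; test_X: d x n_test matrix.
    All names and the choice k = 40 are illustrative.
    """
    mean = train_X.mean(axis=1, keepdims=True)
    Xc = train_X - mean
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    U = U[:, :k]                                   # top-k eigen-faces

    Z_train = U.T @ Xc                             # k x n_train codes
    Z_test = U.T @ (test_X - mean)                 # k x n_test codes

    # Distances in the k-dimensional code space: O(k) per pair instead of O(d).
    d2 = ((Z_test[:, None, :] - Z_train[:, :, None]) ** 2).sum(axis=0)
    return np.asarray(train_y)[np.argmin(d2, axis=0)]
```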

  8. Latent Semantic Analysis [Deerwester et al., 1990]
     • d = number of words in the vocabulary
     • Each x_i ∈ ℝ^d is a vector of word counts
     • x_{ji} = frequency of word j in document i
     • X_{d×n} ≅ U_{d×k} Z_{k×n}, with Z = (z_1 ... z_n)
       (Picture: a sparse term-document count matrix, with rows for words such as "game",
       "stocks", "chairman", "the", ..., "wins", factored into dense k-dimensional word and
       document representations.)
     How to measure similarity between two documents? z_1⊤z_2 is probably better than x_1⊤x_2
     Applications: information retrieval
     Note: no computational savings; the original x is already sparse
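
A minimal LSA sketch (illustrative only; real systems usually reweight counts, e.g. with tf-idf, before the SVD) that maps documents to k-dimensional representations and compares them by cosine similarity:

```python
import numpy as np

def lsa_document_similarity(X, k=100):
    """Cosine similarities between documents in a k-dimensional LSA space.

    X is a d x n term-document count matrix (rows = words, columns = documents).
    Reweighting of raw counts is omitted here for brevity.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = s[:k, None] * Vt[:k, :]                       # k x n document vectors z_i
    Z = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    return Z.T @ Z                                    # entry (i, j) is z_i^T z_j
```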

  9. Network anomaly detection [Lakhina et al., '05]
     x_{ji} = amount of traffic on link j in the network during time interval i
     Model assumption: total traffic is the sum of flows along a few "paths"
     Apply PCA: each principal component intuitively represents a "path"
     Anomaly when traffic deviates from the first few principal components
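
A sketch of the basic idea, not the exact procedure from the paper: score each time interval by the norm of its residual after projecting onto the top-k principal components, and flag large scores as anomalies. The choice of k and the plain residual-norm score are my simplifications.

```python
import numpy as np

def pca_anomaly_scores(X, k=4):
    """Score each time interval by how far its traffic vector falls outside the
    subspace spanned by the top-k principal components.

    X is a d x n matrix (d links, n time intervals); k is illustrative.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    U = U[:, :k]                              # "normal" traffic subspace
    residual = Xc - U @ (U.T @ Xc)            # traffic not explained by the k "paths"
    return np.linalg.norm(residual, axis=0)   # large scores suggest anomalies
```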

  10. Unsupervised POS tagging [Schütze, '95]
      Part-of-speech (POS) tagging task:
        Input:  I like reducing the dimensionality of data .
        Output: NOUN VERB VERB(-ING) DET NOUN PREP NOUN .
      Each x_i is (the context distribution of) a word; x_{ji} is the number of times word i
      appeared in context j
      Key idea: words appearing in similar contexts tend to have the same POS tags, so cluster
      word types using their contexts
      Problem: contexts are too sparse
      Solution: run PCA first, then cluster using the new representation
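
A sketch of the two-step recipe, assuming scikit-learn is available for the clustering step; the matrix layout, k, and the number of induced tags are illustrative choices, not Schütze's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def induce_tags(X, k=50, n_tags=10):
    """Reduce sparse context-count vectors with PCA, then cluster word types.

    X is a d x n matrix: X[j, i] = number of times word i appeared in context j.
    k, n_tags, and the use of scikit-learn's KMeans are illustrative.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    Z = U[:, :k].T @ Xc                       # dense k-dimensional word representations
    # Each cluster of word types plays the role of an induced POS tag.
    return KMeans(n_clusters=n_tags, n_init=10, random_state=0).fit_predict(Z.T)
```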

  11. Multi-task learning [Ando & Zhang, '05]
      • Have n related tasks (e.g., classify documents for various users)
      • Each task has a linear classifier with weights x_i
      • Want to share structure between the classifiers
      One step of their procedure: given n linear classifiers x_1, ..., x_n, run PCA to identify
      shared structure:
          X = (x_1 ... x_n) ≅ UZ
      Each principal component is an eigen-classifier
      The other step of their procedure: retrain the classifiers, regularizing towards the subspace U
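
A sketch of the PCA step only (the retraining step, which pulls each classifier toward the shared subspace, is not shown); the function name and k are my own.

```python
import numpy as np

def shared_subspace(weight_vectors, k=5):
    """Stack per-task classifier weights and extract a shared k-dimensional subspace.

    weight_vectors: list of n weight vectors, each of length d. Returns the d x k
    matrix U whose columns are the "eigen-classifiers".
    """
    X = np.stack(weight_vectors, axis=1)              # d x n matrix X = (x_1 ... x_n)
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]
```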

  12. PCA summary
      • Intuition: capture the variance of the data, or minimize the reconstruction error
      • Algorithm: eigendecomposition of the covariance matrix, or SVD
      • Impact: reduce storage (from O(nd) to O(nk)), reduce time complexity
      • Advantages: simple, fast
      • Applications: eigen-faces, eigen-documents, network anomaly detection, etc.

  13. Roadmap
      • Principal component analysis (PCA)
        – Basic principles
        – Case studies
        – Kernel PCA
        – Probabilistic PCA
      • Canonical correlation analysis (CCA)
      • Fisher discriminant analysis (FDA)
      • Summary

  14. Limitations of linearity
      (Figures: one dataset where PCA is effective, one where PCA is ineffective.)
      The problem is that the PCA subspace is linear: S = {x = Uz : z ∈ ℝ^k}
      In this example: S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1}

  15. Going beyond linearity: quick solution
      (Figures: broken solution vs. desired solution.)
      We want the desired solution: S = {(x_1, x_2) : x_2 = (u_2/u_1) x_1²}
      We can get this: S = {φ(x) = Uz} with φ(x) = (x_1², x_2)⊤
      Linear dimensionality reduction in φ(x) space ⇔ nonlinear dimensionality reduction in x space
      In general, can set φ(x) = (x_1, x_2, x_1², x_1x_2, sin(x_1), ...)⊤
      Problems: (1) ad hoc and tedious; (2) φ(x) is large, so computation is expensive
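
A small sketch of this "quick solution", using data near a parabola and the hand-crafted feature map φ(x) = (x_1², x_2); the data-generating choices are mine, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(3)

# Points near the parabola x2 = 0.5 * x1^2: a single line cannot capture them.
x1 = rng.uniform(-2, 2, size=300)
x2 = 0.5 * x1 ** 2 + 0.05 * rng.normal(size=300)

# Hand-crafted feature map phi(x) = (x1^2, x2): the curve becomes a line there.
Phi = np.vstack([x1 ** 2, x2])
Phi -= Phi.mean(axis=1, keepdims=True)

C = Phi @ Phi.T / Phi.shape[1]
w, V = np.linalg.eigh(C)
print(w[-1] / w.sum())   # close to 1: one direction in phi-space explains the data
```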

  16. Towards kernels
      Representer theorem: the PCA solution is a linear combination of the x_i's
      Why? Recall the PCA eigenvalue problem: XX⊤u = λu
      Notice that u = Xα = Σ_{i=1}^n α_i x_i for some weights α (indeed, u = (1/λ)XX⊤u = X((1/λ)X⊤u))
      Analogy with SVMs: weight vector w = Xα
      Key fact: PCA only needs inner products K = X⊤X
      Why? Use the representer theorem in the PCA objective:
          max_{‖u‖=1} u⊤XX⊤u = max_{α⊤X⊤Xα=1} α⊤(X⊤X)(X⊤X)α

  17. Kernel PCA
      Kernel function: k(x_1, x_2) such that K, the kernel matrix formed by K_{ij} = k(x_i, x_j),
      is positive semi-definite
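
Here is a compact kernel PCA sketch that works from such a kernel matrix K; the RBF kernel, the value of gamma, and the normalization details are illustrative assumptions, not taken from the deck.

```python
import numpy as np

def kernel_pca(X, k=2, gamma=1.0):
    """Kernel PCA with an RBF kernel k(x, x') = exp(-gamma * ||x - x'||^2).

    X is a d x n data matrix; returns the k x n matrix of projected coordinates.
    The RBF kernel, gamma, and the scaling convention are illustrative choices.
    """
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X.T @ X))

    # Center the kernel matrix (equivalent to centering phi(x) in feature space).
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J

    w, A = np.linalg.eigh(Kc)                    # ascending eigenvalues
    w, A = w[::-1][:k], A[:, ::-1][:, :k]        # keep the top-k eigenpairs
    A = A / np.sqrt(np.maximum(w, 1e-12))        # scale so each component has unit norm
    return (Kc @ A).T                            # k x n projected coordinates
```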
