Random Projections, Margins, Kernels and Feature Selection

Adithya Pediredla, Rice University (PowerPoint PPT presentation)



  1. Random Projections, Margins, Kernels and Feature Selection. Adithya Pediredla, Rice University, Electrical and Computer Engineering.

  2-6. SVM: Revision. The classifier is $f(x_i) = w^T x_i + b$, with $N$ training points $x_i$ and $w \in \mathbb{R}^d$.
       Primal: $\min_{w \in \mathbb{R}^d} \frac{\|w\|^2}{2} + C \sum_i \max(0,\, 1 - y_i f(x_i))$;  cost $O(nd^2 + d^3)$.
       Dual: $\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j^T x_k)$,  subject to $0 \le \alpha_i \le C$ for all $i$ and $\sum_i \alpha_i y_i = 0$;  cost $O(dn^2 + n^3)$.
       In the dual, only inner products matter.
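The fact that the dual touches the data only through inner products is what the rest of the talk builds on. As a minimal numpy sketch (my own illustration with made-up toy data, not code from the presentation), the dual objective can be evaluated from the Gram matrix $G_{jk} = x_j^T x_k$ alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                        # n points in d dimensions (arbitrary toy sizes)
X = rng.standard_normal((n, d))        # toy data
y = rng.choice([-1.0, 1.0], size=n)    # toy labels
alpha = rng.uniform(0.0, 1.0, size=n)  # some candidate dual variables

G = X @ X.T                            # Gram matrix of inner products x_j^T x_k

# Dual objective: sum_i alpha_i - 1/2 * sum_{j,k} alpha_j alpha_k y_j y_k (x_j^T x_k).
# X enters only through G, so anything that approximates the inner products
# (a kernel, or a random projection) can be swapped in here.
dual_obj = alpha.sum() - 0.5 * (alpha * y) @ G @ (alpha * y)
print(dual_obj)
```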

  7-11. Decreasing computations. Only inner products matter. Can we approximate $x_i$ with $z_i$ so that $\dim(z_i) \ll \dim(x_i)$ and $x_i^T x_j \approx z_i^T z_j$? One way: $z_i = A x_i$. (Any comment on the rows vs. columns of $A$?) It turns out that a random $A$ is good!
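A minimal numpy sketch of the idea (my own toy example; the data, the sizes, and the choice of a Gaussian $A$ are assumptions, and the target dimension here is picked arbitrarily; the next slides quantify how small it can be):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d_new = 200, 10_000, 200            # project from D = 10,000 down to d_new = 200 dims

X = rng.standard_normal((n, D))           # toy high-dimensional data
A = rng.standard_normal((d_new, D)) / np.sqrt(d_new)  # random Gaussian A (scaled)
Z = X @ A.T                               # z_i = A x_i

# Inner products are approximately preserved: for each pair, the additive error
# is typically on the order of ||x_i|| * ||x_j|| / sqrt(d_new).
err = np.abs(Z @ Z.T - X @ X.T)
norms = np.linalg.norm(X, axis=1)
print(np.median(err / np.outer(norms, norms)))   # roughly 1/sqrt(d_new) in scale
```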

  12-13. Johnson-Lindenstrauss Lemma. If $d_{\mathrm{new}} = \omega\!\left(\frac{1}{\gamma^2} \log n\right)$, relative angles are preserved up to a factor of $1 \pm \gamma$. How big can $\gamma$ be?
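For a rough sense of scale (my own numbers, not from the slides): with $n = 10{,}000$ points and $\gamma = 0.1$, the bound asks for on the order of $\frac{1}{\gamma^2}\log n \approx 100 \times 9.2 \approx 920$ dimensions, independent of the original dimension of the data.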

  14-16. Which data set can have a higher $\gamma$? [Figures: a sequence of scatter plots of two-dimensional labelled data sets, each drawn on axes from -20 to 20.]

  17-18. How else can a big margin help? A simple weak learner whose speed is proportional to the margin:
         Step 1: Pick a random hypothesis $h$.
         Step 2: Evaluate the error of $h$. If the error is less than $\frac{1}{2} - \frac{\gamma}{4}$, stop; otherwise go to step 1.
         The bigger the margin, the fewer the iterations.
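A minimal numpy sketch of this loop (my own construction; the toy data, the way the margin is enforced, and the use of random unit vectors as hypotheses are assumptions for illustration):

```python
import numpy as np

def random_weak_learner(X, y, gamma, rng, max_iters=10_000):
    """Draw random hyperplanes until one has error below 1/2 - gamma/4."""
    dim = X.shape[1]
    for t in range(1, max_iters + 1):
        h = rng.standard_normal(dim)
        h /= np.linalg.norm(h)                  # step 1: a random unit-norm hypothesis
        err = np.mean(np.sign(X @ h) != y)      # step 2: its 0/1 error on the sample
        if err < 0.5 - gamma / 4:               # accept if noticeably better than chance
            return h, t
    raise RuntimeError("no weak hypothesis found")

# Toy separable data with a (rough) margin around a true direction w_true.
rng = np.random.default_rng(0)
w_true = np.array([1.0, 0.0, 0.0])
X = rng.standard_normal((500, 3))
X = X[np.abs(X @ w_true) > 0.5]                 # keep only points well away from the boundary
y = np.sign(X @ w_true)
h, iters = random_weak_learner(X, y, gamma=0.2, rng=rng)
print(iters)                                    # larger margins need fewer draws on average
```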

  19-23. Dimensionality reduction: random projection. Coming back to random projection with $A \in \mathbb{R}^{d \times D}$, three choices for $A$:
         1. Choose the columns of $A$ to be $D$ random orthogonal unit-length vectors.
         2. Choose each entry of $A$ independently from a standard Gaussian.
         3. Choose each entry of $A$ to be $+1$ or $-1$ independently at random.
         For (2) and (3), writing $u'$, $v'$ for the projections of $u$, $v$:
         $\Pr_A\big[(1-\gamma)\|u-v\|^2 \le \|u'-v'\|^2 \le (1+\gamma)\|u-v\|^2\big] \ge 1 - 2e^{-(\gamma^2 - \gamma^3)d/4}$.
         Can we do better?
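A minimal numpy check of constructions (2) and (3) against the stated distance bound (my own toy sizes; I apply the projection as $u \mapsto Au/\sqrt{d}$, which is the usual scaling for these two constructions):

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, gamma = 2_000, 400, 0.2
n_pairs = 1_000

A_gauss = rng.standard_normal((d, D))              # construction (2): iid N(0, 1) entries
A_sign = rng.choice([-1.0, 1.0], size=(d, D))      # construction (3): iid +/-1 entries

bound = 1 - 2 * np.exp(-(gamma**2 - gamma**3) * d / 4)   # stated per-pair success probability

for A in (A_gauss, A_sign):
    U = rng.standard_normal((n_pairs, D))                     # random u vectors (rows)
    V = rng.standard_normal((n_pairs, D))                     # random v vectors (rows)
    before = np.sum((U - V) ** 2, axis=1)                     # ||u - v||^2
    after = np.sum(((U - V) @ A.T / np.sqrt(d)) ** 2, axis=1) # ||u' - v'||^2
    ok = ((1 - gamma) * before <= after) & (after <= (1 + gamma) * before)
    # Fraction of pairs whose squared distance is preserved within 1 +/- gamma;
    # in expectation this fraction is at least the per-pair bound above.
    print(ok.mean(), bound)
```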

  24-25. Can we do better? If it is enough that, with probability at least $1 - \delta$, the resulting error is below $\epsilon$, then $d = O\!\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta}\right)$ is sufficient.
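Note the contrast with the earlier $\frac{1}{\gamma^2}\log n$ requirement: once some error is tolerated, the target dimension no longer depends on the number of points. As a rough worked example (my own numbers): $\gamma = 0.1$ and $\epsilon = \delta = 0.01$ give on the order of $\frac{1}{\gamma^2}\log\frac{1}{\epsilon\delta} \approx 100 \times 9.2 \approx 920$ dimensions, whether there are a thousand points or a billion.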

  26-29. Kernel functions. What if we know that $K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)$? What if we do not? Finding inner products approximately is enough, but we do need access to the distribution of the data set.

  30-34. Mapping-1. Lemma: Consider any distribution over labelled data, and assume there exists $w$ that separates it with margin $\gamma$, i.e. $P[\,y\,(w \cdot x) < \gamma\,] = 0$. If we draw $z_1, z_2, \ldots, z_d$ i.i.d. from the distribution with $d \ge \frac{8}{\gamma^2}\left[\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right]$, then with probability $\ge 1 - \delta$ there exists $w' \in \mathrm{span}(z_1, z_2, \ldots, z_d)$ with $P[\,y\,(w' \cdot x) < \gamma/2\,] < \epsilon$.
         Therefore, if such a $w$ exists in $\phi$-space, then by sampling $x_1, x_2, \ldots, x_d$ we are guaranteed (with high probability) a separator of the form $w' = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \cdots + \alpha_d \phi(x_d)$.
         Hence $w' \cdot \phi(x) = \alpha_1 K(x, x_1) + \alpha_2 K(x, x_2) + \cdots + \alpha_d K(x, x_d)$.
         So if we define $F_1(x) = (K(x, x_1), \ldots, K(x, x_d))$, then with high probability the vector $(\alpha_1, \ldots, \alpha_d)$ is an approximate linear separator in this new feature space.
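A minimal numpy sketch of Mapping-1 (entirely my own toy example: the two-ring data, the RBF kernel with $\sigma = 1$, the number of landmarks $d = 50$, and the use of ordinary least squares in place of a proper linear SVM are all assumptions for illustration):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)); any valid kernel would do here."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)

# Toy labelled data: two noisy concentric rings, not linearly separable in R^2.
n = 600
radius = np.where(rng.random(n) < 0.5, 1.0, 2.0)
theta = rng.uniform(0.0, 2.0 * np.pi, n)
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
X += 0.05 * rng.standard_normal((n, 2))
y = np.where(radius == 1.0, -1.0, 1.0)

# Mapping-1: draw d "landmark" points from the data and use kernel values as features.
d = 50
landmarks = X[rng.choice(n, size=d, replace=False)]
F1 = rbf_kernel(X, landmarks)             # row i is F1(x_i) = (K(x_i, x_1), ..., K(x_i, x_d))

# Any linear learner can now be run on F1; ordinary least squares is enough for a sketch.
alpha, *_ = np.linalg.lstsq(F1, y, rcond=None)
print(np.mean(np.sign(F1 @ alpha) != y))  # training error of the linear separator in F1-space
```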

  35-38. Mapping-2. We can normalize the $K(x, x_i)$ features and get better bounds: compute $K = U^T U$ (where $K$ is the kernel matrix on the sampled points $x_1, \ldots, x_d$), then set $F_2(x) = F_1(x)\, U^{-1}$. Then $F_2$ is linearly separable with error at most $\epsilon$ at margin $\gamma/2$.
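Continuing the sketch above (again my own illustration, reusing the `rbf_kernel`, `landmarks`, `F1`, `y`, and `d` defined there, and adding a tiny diagonal jitter for numerical safety):

```python
# Mapping-2: normalize the kernel features from the previous sketch.
K = rbf_kernel(landmarks, landmarks)                 # d x d kernel matrix on the landmark points
U = np.linalg.cholesky(K + 1e-10 * np.eye(d)).T      # K ~= U^T U, with U upper-triangular
F2 = F1 @ np.linalg.inv(U)                           # F2(x) = F1(x) U^{-1}

alpha2, *_ = np.linalg.lstsq(F2, y, rcond=None)
print(np.mean(np.sign(F2 @ alpha2) != y))            # same fitted predictions as F1; the gain is the margin guarantee
```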

  39. Key takeaways. Inner products are enough. Random projections are good. The higher the margin, the lower the dimension we need. If we are okay with some error, we can project to a much lower dimension. When using kernels, randomly drawn data points act as good features.
