SLIDE 1

Combining Kernels for Classification

Doctoral Thesis Seminar

Darrin P. Lewis

dplewis@cs.columbia.edu

SLIDE 2

Outline

Summary of Contribution

Stationary kernel combination
Nonstationary kernel combination
Sequential minimal optimization
Results
Conclusion

SLIDE 3

Summary of Contribution

Empirical study of kernel averaging versus SDP weighted kernel combination
Nonstationary kernel combination
Double Jensen bound for latent MED
Efficient iterative optimization
Implementation

SLIDE 4

Outline

Summary of Contribution

Stationary kernel combination

Nonstationary kernel combination
Sequential minimal optimization
Results
Conclusion

SLIDE 5

Example Kernel One

  

$$K_1 = \begin{pmatrix} 1 & 4 & 2.75 & 3 \\ 4 & 16 & 11 & 12 \\ 2.75 & 11 & 7.5625 & 8.25 \\ 3 & 12 & 8.25 & 9 \end{pmatrix}$$

[Figure: PCA basis for Kernel 1, with examples plotted on axes X1 and X2.]

SLIDE 6

Example Kernel Two

  

$$K_2 = \begin{pmatrix} 9 & 12 & 8.25 & 3 \\ 12 & 16 & 11 & 4 \\ 8.25 & 11 & 7.5625 & 2.75 \\ 3 & 4 & 2.75 & 1 \end{pmatrix}$$

[Figure: PCA basis for Kernel 2, with examples plotted on axes X1 and X2.]

SLIDE 7

Example Kernel Combination

  

$$K_C = \begin{pmatrix} 10 & 16 & 11 & 6 \\ 16 & 32 & 22 & 16 \\ 11 & 22 & 15.125 & 11 \\ 6 & 16 & 11 & 10 \end{pmatrix}$$

[Figure: PCA basis for the combined kernel, with examples plotted on axes X1 and X2.]

SLIDE 8

Effect of Combination

$$K_C(x, z) = K_1(x, z) + K_2(x, z) = \langle \phi_1(x), \phi_1(z) \rangle + \langle \phi_2(x), \phi_2(z) \rangle = \big\langle [\phi_1(x); \phi_2(x)],\, [\phi_1(z); \phi_2(z)] \big\rangle$$

The implicit feature space of the combined kernel is a concatenation of the feature spaces of the individual kernels. A basis in the combined feature space may be lower dimensional than the sum of the dimensions of the individual feature spaces.
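A minimal sketch (mine, not from the thesis) of this point: summing Gram matrices is the same as taking inner products of concatenated feature vectors. The one-dimensional feature values below are an assumption chosen to reproduce the example matrices on the preceding slides.

```python
# Sketch: summing kernel matrices equals taking inner products in the
# concatenated feature space. The feature values are illustrative only.
import numpy as np

phi1 = np.array([[1.0], [4.0], [2.75], [3.0]])   # explicit features for kernel 1
phi2 = np.array([[3.0], [4.0], [2.75], [1.0]])   # explicit features for kernel 2

K1 = phi1 @ phi1.T              # K1(x, z) = <phi1(x), phi1(z)>
K2 = phi2 @ phi2.T              # K2(x, z) = <phi2(x), phi2(z)>
KC = K1 + K2                    # combined kernel

phiC = np.hstack([phi1, phi2])  # concatenated feature map [phi1(x); phi2(x)]
assert np.allclose(KC, phiC @ phiC.T)  # identical Gram matrices
print(KC)
```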

SLIDE 9

Combination Weights

There are several ways in which the combination weights can be determined:

Equal weight: an unweighted combination, which is essentially kernel averaging [14].

Optimized weight: SDP-weighted combination [6]. Weights and SVM Lagrange multipliers are determined in a single optimization. To regularize the kernel weights, a constraint is enforced to keep the trace of the combined kernel constant.
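A minimal sketch of the equal-weight option, assuming each Gram matrix is rescaled to a fixed trace before averaging, mirroring the trace constraint used to regularize the SDP weights; the function name and toy data are mine, not the thesis implementation.

```python
# Sketch: equal-weight kernel combination with per-kernel trace normalization.
import numpy as np

def average_kernels(kernels, target_trace=None):
    """Average a list of Gram matrices after scaling each to a fixed trace."""
    n = kernels[0].shape[0]
    if target_trace is None:
        target_trace = float(n)
    scaled = [K * (target_trace / np.trace(K)) for K in kernels]
    return sum(scaled) / len(scaled)

# toy usage with two random PSD matrices
rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
K_avg = average_kernels([A @ A.T, B @ B.T])
print(np.trace(K_avg))  # equals the target trace (5.0 here)
```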

SLIDE 10

Sequence/Structure

We compare [10] the state-of-the-art SDP and simple averaging for conic combinations of kernels.
Drawbacks of SDP include optimization time and the lack of a free implementation.
We determined the cases in which averaging is preferable and those in which SDP is required.
Our experiments predict Gene Ontology [2] (GO) terms using a combination of amino acid sequence and protein structural information.
We use the (4,1)-Mismatch sequence kernel [8] and the MAMMOTH (sequence-independent) structure kernel [13].

SLIDE 11

Cumulative ROC AUC

[Figure: Cumulative ROC AUC. Number of GO terms with a given mean ROC plotted against mean ROC, for the Sequence, Structure, Average, and SDP kernels.]

SLIDE 12

Mean ROC AUC Top 10 GO Terms

GO term       Structure        Sequence         Average          SDP
GO:0008168    0.941 ± 0.014    0.709 ± 0.020    0.937 ± 0.016    0.938 ± 0.015
GO:0005506    0.934 ± 0.008    0.747 ± 0.015    0.927 ± 0.012    0.927 ± 0.012
GO:0006260    0.885 ± 0.014    0.707 ± 0.020    0.878 ± 0.016    0.870 ± 0.015
GO:0048037    0.916 ± 0.015    0.738 ± 0.025    0.911 ± 0.016    0.909 ± 0.016
GO:0046483    0.949 ± 0.007    0.787 ± 0.011    0.937 ± 0.008    0.940 ± 0.008
GO:0044255    0.891 ± 0.012    0.732 ± 0.012    0.874 ± 0.015    0.864 ± 0.013
GO:0016853    0.855 ± 0.014    0.706 ± 0.029    0.837 ± 0.017    0.810 ± 0.019
GO:0044262    0.912 ± 0.007    0.764 ± 0.018    0.908 ± 0.006    0.897 ± 0.006
GO:0009117    0.892 ± 0.015    0.748 ± 0.016    0.890 ± 0.012    0.880 ± 0.012
GO:0016829    0.935 ± 0.006    0.791 ± 0.013    0.931 ± 0.008    0.926 ± 0.007
GO:0006732    0.823 ± 0.011    0.781 ± 0.013    0.845 ± 0.011    0.828 ± 0.013
GO:0007242    0.898 ± 0.011    0.859 ± 0.014    0.903 ± 0.010    0.900 ± 0.011
GO:0005525    0.923 ± 0.008    0.884 ± 0.015    0.931 ± 0.009    0.931 ± 0.009
GO:0004252    0.937 ± 0.011    0.907 ± 0.012    0.932 ± 0.012    0.931 ± 0.012
GO:0005198    0.809 ± 0.010    0.795 ± 0.014    0.828 ± 0.010    0.824 ± 0.011

SLIDE 13

Varying Ratio Top 10 GO Terms

[Figure: Mean ROC plotted against the log2 ratio of kernel weights (from −Inf to +Inf), for the top 10 GO terms.]

SLIDE 14

Noisy Kernels 56 GO Terms

[Figure: Scatter of mean ROC for SDP versus mean ROC for averaging over 56 GO terms, shown with no noise, 1 noise kernel, and 2 noise kernels.]

SLIDE 15

Missing Data Typical GO Term

[Figure: Mean ROC versus percent missing structures for GO:0046483, comparing All/None/Self SDP, All/None/Self Average, and Structure alone.]

SLIDE 16

Outline

Summary of Contribution
Stationary kernel combination

Nonstationary kernel combination

Sequential minimal optimization
Results
Conclusion

SLIDE 17

Kernelized Discriminants

Single:

$$f(x) = \sum_t y_t \lambda_t\, k(x_t, x) + b$$

Linear combination:

$$f(x) = \sum_t y_t \lambda_t \sum_m \nu_m\, k_m(x_t, x) + b$$

Nonstationary combination [9]:

$$f(x) = \sum_t y_t \lambda_t \sum_m \nu_{m,t}(x)\, k_m(x_t, x) + b$$
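A minimal sketch of evaluating the linear-combination discriminant above with fixed weights $\nu_m$; the kernels, multipliers, weights, and data are illustrative placeholders rather than trained values from the thesis.

```python
# Sketch: evaluating f(x) = sum_t y_t * lambda_t * sum_m nu_m * k_m(x_t, x) + b.
import numpy as np

def linear_k(x, z):
    return float(x @ z)

def quadratic_k(x, z):
    return float((1.0 + x @ z) ** 2)

def combined_discriminant(x, X_train, y, lam, nu, kernels, b=0.0):
    f = b
    for x_t, y_t, lam_t in zip(X_train, y, lam):
        f += y_t * lam_t * sum(nu_m * k(x_t, x) for nu_m, k in zip(nu, kernels))
    return f

X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([+1, +1, -1])
lam = np.array([0.5, 0.3, 0.8])        # Lagrange multipliers (placeholder)
nu = np.array([0.6, 0.4])              # fixed kernel weights (placeholder)
print(combined_discriminant(np.array([0.5, 0.5]), X_train, y, lam, nu,
                            [linear_k, quadratic_k]))
```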

SLIDE 18

Parabola-Line Data

SLIDE 19

Parabola-Line SDP

SLIDE 20

Ratio of Gaussian Mixtures

$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^{N} \beta_n\, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$

$\mu^+_m$, $\mu^-_n$: Gaussian means
$\alpha$, $\beta$: mixing proportions
$b$: scalar bias

For now, maximum likelihood parameters are estimated independently for each model. Note explicit feature maps, φ+, φ−.
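A minimal sketch of this log-ratio discriminant with spherical, identity-covariance components, evaluated with a stable log-sum-exp; the mixture parameters below are placeholders, not ML estimates from real data.

```python
# Sketch: L(x) = ln( sum_m alpha_m N(x | mu+_m, I) / sum_n beta_n N(x | mu-_n, I) ) + b.
import numpy as np

def log_gauss(x, mu):
    d = x.shape[0]
    return -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

def logsumexp(v):
    v = np.asarray(v, dtype=float)
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def mixture_log_ratio(x, alphas, mus_pos, betas, mus_neg, b=0.0):
    pos = logsumexp([np.log(a) + log_gauss(x, mu) for a, mu in zip(alphas, mus_pos)])
    neg = logsumexp([np.log(c) + log_gauss(x, mu) for c, mu in zip(betas, mus_neg)])
    return pos - neg + b

x = np.array([0.2, 0.1])
print(mixture_log_ratio(
    x,
    alphas=[0.5, 0.5], mus_pos=[np.array([0.0, 0.0]), np.array([1.0, 1.0])],
    betas=[1.0],       mus_neg=[np.array([2.0, 2.0])]))
```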

SLIDE 21

Parabola-Line ML

SLIDE 22

Ratio of Generative Models

$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} P(m, \phi^+_m(X_t) \mid \theta^+_m)}{\sum_{n=1}^{N} P(n, \phi^-_n(X_t) \mid \theta^-_n)} + b$$

Find a distribution $P(\Theta)$ rather than a specific $\Theta^*$. Classify using

$$\hat{y} = \operatorname{sign} \int_\Theta P(\Theta)\, L(X_t; \Theta)\, d\Theta$$
SLIDE 23

Max Ent Parameter Estimation

Find P(Θ) to satisfy “moment” constraints:

$$\int_\Theta P(\Theta)\, y_t L(X_t; \Theta)\, d\Theta \ge \gamma_t \quad \forall t \in T$$

while assuming nothing additional. Minimize Shannon relative entropy:

$$D(P \,\|\, P^{(0)}) = \int_\Theta P(\Theta) \ln \frac{P(\Theta)}{P^{(0)}(\Theta)}\, d\Theta$$

to allow the use of a prior $P^{(0)}(\Theta)$. The classic ME solution [3] is:

$$P(\Theta) = \frac{1}{Z(\lambda)}\, P^{(0)}(\Theta)\, e^{\sum_{t \in T} \lambda_t \left[ y_t L(X_t \mid \Theta) - \gamma_t \right]}$$

$\lambda$ fully specifies $P(\Theta)$.

Maximize the log-concave objective $J(\lambda) = -\log Z(\lambda)$.

SLIDE 24

Tractable Partition

$$\ddot{Z}(\lambda, Q \mid q) = \int_\Theta P^{(0)}(\Theta)\, \exp\!\Bigg\{ \sum_{t \in T^+} \lambda_t \Big( \sum_m q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) + H(q_t) - \sum_n Q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) - H(Q_t) + b - \gamma_t \Big) \Bigg\} \exp\!\Bigg\{ \sum_{t \in T^-} \lambda_t \Big( \sum_n q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) + H(q_t) - \sum_m Q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) - H(Q_t) - b - \gamma_t \Big) \Bigg\}\, d\Theta$$

Introduce variational distributions $q_t$ over the correct class log-sums and $Q_t$ over the incorrect class log-sums to replace them with upper and lower bounds, respectively.

$$\operatorname*{argmin}_Q\, \operatorname*{argmax}_q\, \ddot{Z}(\lambda, Q \mid q) = Z(\lambda)$$

Iterative optimization is required.

SLIDE 25

MED Gaussian Mixtures

$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} \alpha_m\, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^{N} \beta_n\, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$

Gaussian priors $\mathcal{N}(0, I)$ on $\mu^+_m$, $\mu^-_n$
Non-informative Dirichlet priors on $\alpha$, $\beta$
Non-informative Gaussian $\mathcal{N}(0, \infty)$ prior on $b$

These assumptions simplify the objective and result in a set of linear equality constraints on the convex optimization.

SLIDE 26

Convex Objective

$$\ddot{J}(\lambda, Q \mid q) = \sum_{t \in T} \lambda_t \big( H(Q_t) - H(q_t) \big) + \sum_{t \in T} \lambda_t \gamma_t - \frac{1}{2} \sum_{t, t' \in T^+} \lambda_t \lambda_{t'} \Big[ \sum_m q_t(m)\, q_{t'}(m)\, k^+_m(t, t') + \sum_n Q_t(n)\, Q_{t'}(n)\, k^-_n(t, t') \Big] - \frac{1}{2} \sum_{t, t' \in T^-} \lambda_t \lambda_{t'} \Big[ \sum_m Q_t(m)\, Q_{t'}(m)\, k^+_m(t, t') + \sum_n q_t(n)\, q_{t'}(n)\, k^-_n(t, t') \Big] + \sum_{t \in T^+} \sum_{t' \in T^-} \lambda_t \lambda_{t'} \Big[ \sum_m q_t(m)\, Q_{t'}(m)\, k^+_m(t, t') + \sum_n Q_t(n)\, q_{t'}(n)\, k^-_n(t, t') \Big]$$
SLIDE 27

Optimization

For now, we discard the $H(Q_t)$ entropy terms. We redefine $\lambda \leftarrow Q\lambda$ and optimize with a quadratic program. This subsumes the SVM (M = N = 1); a sketch of that special case follows the constraints below. The following constraints must be satisfied:

$$\sum_{t \in T^-} \lambda_t Q_t(m) = \sum_{t \in T^+} \lambda_t q_t(m) \quad \forall m = 1 \ldots M$$

$$\sum_{t \in T^+} \lambda_t Q_t(n) = \sum_{t \in T^-} \lambda_t q_t(n) \quad \forall n = 1 \ldots N$$

$$0 \le \lambda_t \le c \quad \forall t = 1 \ldots T$$
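As a hedged illustration of the SVM special case (M = N = 1), the sketch below solves the reduced dual QP with a generic solver (SciPy's SLSQP) on toy data; the thesis uses SMO (and quadprog/Mosek) instead, and the data, box constant, and solver choice here are assumptions.

```python
# Sketch: for M = N = 1 the problem reduces to the standard SVM dual,
#   min_lambda  -1^T lambda + 0.5 * lambda^T H lambda,  H = (y y^T) * K,
#   s.t. sum_t y_t lambda_t = 0,  0 <= lambda_t <= c.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1, 0.5, (10, 2)), rng.normal(-1, 0.5, (10, 2))])
y = np.array([+1.0] * 10 + [-1.0] * 10)
K = X @ X.T                                   # linear kernel
H = (y[:, None] * y[None, :]) * K
c_box = 10.0                                  # box constant (placeholder)

def J(lam):
    return -lam.sum() + 0.5 * lam @ H @ lam

res = minimize(J, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, c_box)] * len(y),
               constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
print("dual objective:", res.fun)
```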

SLIDE 28

Expected Gaussian LL

$$\begin{aligned} E\{\ln \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m)\} = {} & -\frac{D}{2} \ln(2\pi) - \frac{1}{2} - \frac{1}{2}\, k^+_m(X_t, X_t) \\ & + \sum_{\tau \in T^+} \lambda_\tau q_\tau(m)\, k^+_m(X_\tau, X_t) - \sum_{\tau \in T^-} \lambda_\tau Q_\tau(m)\, k^+_m(X_\tau, X_t) \\ & - \frac{1}{2} \sum_{\tau, \tau' \in T^+} \lambda_\tau \lambda_{\tau'}\, q_\tau(m)\, q_{\tau'}(m)\, k^+_m(X_\tau, X_{\tau'}) \\ & - \frac{1}{2} \sum_{\tau, \tau' \in T^-} \lambda_\tau \lambda_{\tau'}\, Q_\tau(m)\, Q_{\tau'}(m)\, k^+_m(X_\tau, X_{\tau'}) \\ & + \sum_{\tau \in T^+} \sum_{\tau' \in T^-} \lambda_\tau \lambda_{\tau'}\, q_\tau(m)\, Q_{\tau'}(m)\, k^+_m(X_\tau, X_{\tau'}) \end{aligned}$$

SLIDE 29

Expected Mixing/Bias LL

$$a_m = E\{\ln \alpha_m\} + \tfrac{1}{2} E\{b\} \quad \forall m = 1..M \qquad\qquad b_n = E\{\ln \beta_n\} - \tfrac{1}{2} E\{b\} \quad \forall n = 1..N$$

When $\lambda_t \in (0, c)$ we must achieve the following with equality:

$$\sum_m q_t(m)\big(a_m + E\{\ln \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m)\}\big) + H(q_t) = \sum_n Q_t(n)\big(b_n + E\{\ln \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n)\}\big) + H(Q_t) + \gamma_t \quad \forall t \in T^+$$

$$\sum_n q_t(n)\big(b_n + E\{\ln \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n)\}\big) + H(q_t) = \sum_m Q_t(m)\big(a_m + E\{\ln \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m)\}\big) + H(Q_t) + \gamma_t \quad \forall t \in T^-$$

We solve for $a_m$ for $m = 1..M$ and $b_n$ for $n = 1..N$ in this (over-constrained) linear system, obtaining the expected bias and mixing proportions.
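A minimal sketch of that last step, assuming the over-constrained system is solved in the least-squares sense; the matrix and right-hand side are random placeholders, not the actual rows built from the expected log-likelihoods above.

```python
# Sketch: least-squares solve of an over-constrained linear system A [a; b] = r
# for the expected mixing terms a_m and bias terms b_n. A and r are placeholders.
import numpy as np

rng = np.random.default_rng(1)
M, N = 2, 3                       # number of +/- mixture components (placeholder)
n_active = 8                      # examples with 0 < lambda_t < c
A = rng.standard_normal((n_active, M + N))   # one row per active constraint
r = rng.standard_normal(n_active)

coeffs, residuals, rank, _ = np.linalg.lstsq(A, r, rcond=None)
a, b = coeffs[:M], coeffs[M:]
print("a_m:", a, "b_n:", b)
```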

SLIDE 30

Tractable Prediction

$$\hat{y} = \ln \frac{\sum_m \exp\big( E\{\ln \mathcal{N}(\phi^+_m(X) \mid \mu^+_m)\} + a_m \big)}{\sum_n \exp\big( E\{\ln \mathcal{N}(\phi^-_n(X) \mid \mu^-_n)\} + b_n \big)}$$

SLIDE 31

Nonstationary Weights

Recall the nonstationary kernelized discriminant:

$$f(x) = \sum_t y_t \lambda_t \sum_m \nu_{m,t}(x)\, k_m(x_t, x) + b.$$

To view a MED Gaussian mixture as nonstationary kernel combination, we choose weight functions of the form:

$$\nu^+_m(X) = \frac{\exp\big( E\{\ln \mathcal{N}(\phi^+_m(X) \mid \mu^+_m)\} + a_m \big)}{\sum_{m'} \exp\big( E\{\ln \mathcal{N}(\phi^+_{m'}(X) \mid \mu^+_{m'})\} + a_{m'} \big)}.$$

Note how the kernel weight depends on the Gaussian components.
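A minimal sketch of these weights as a softmax over per-component scores $E\{\ln \mathcal{N}(\phi^+_m(X) \mid \mu^+_m)\} + a_m$, computed stably in log space; the score values below are placeholders rather than quantities from a trained model.

```python
# Sketch: nonstationary weights nu+_m(X) as a softmax over per-component scores.
import numpy as np

def nonstationary_weights(scores):
    """Softmax of per-component scores, shifted for numerical stability."""
    s = np.asarray(scores, dtype=float)
    s = s - s.max()
    w = np.exp(s)
    return w / w.sum()

scores = [-3.1, -1.2, -7.8]       # E{ln N(.)} + a_m for m = 1..3 (placeholder)
nu = nonstationary_weights(scores)
print(nu, nu.sum())               # weights sum to one; the nearest component dominates
```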

SLIDE 32

NSKC Prediction

$$\begin{aligned} \hat{y} = {} & \sum_{\tau \in T^+} \sum_m \lambda_\tau Q_\tau(m)\, \nu^+_m(X)\, k^+_m(X_\tau, X) - \sum_{\tau \in T^-} \sum_m \lambda_\tau Q_\tau(m)\, \nu^+_m(X)\, k^+_m(X_\tau, X) \\ & - \sum_{\tau \in T^-} \sum_n \lambda_\tau Q_\tau(n)\, \nu^-_n(X)\, k^-_n(X_\tau, X) + \sum_{\tau \in T^+} \sum_n \lambda_\tau Q_\tau(n)\, \nu^-_n(X)\, k^-_n(X_\tau, X) \\ & + \sum_m \nu^+_m(X)\, k^+_m(X, X) - \sum_n \nu^-_n(X)\, k^-_n(X, X) + \text{constant}. \end{aligned}$$

SLIDE 33

Parabola-Line NSKC

SLIDE 34

Parabola-Line NSKC Weight

SLIDE 35

Outline

Summary of Contribution
Stationary kernel combination
Nonstationary kernel combination

Sequential minimal optimization

Results
Conclusion

SLIDE 36

SMO

$$\operatorname*{argmin}_\lambda\; J(\lambda) = c^T \lambda + \tfrac{1}{2} \lambda^T H \lambda \quad \text{subject to:}$$

[Matrix form of the linear equality constraints: a constraint matrix whose rows contain the variational responsibilities $q_u$, $q_v$, $q_w$ and $\pm 1$ entries, multiplying the stacked multiplier blocks $\lambda_u$, $\lambda_v$, $\lambda_w$.]

SLIDE 37

Inter-class

We can maintain the constraints using the following equalities in vector form:

$$q_u (\hat{\lambda}_u^T \mathbf{1}) - \hat{\lambda}_v = q_u (\lambda_u^T \mathbf{1}) - \lambda_v$$

$$\hat{\lambda}_u - q_v (\hat{\lambda}_v^T \mathbf{1}) = \lambda_u - q_v (\lambda_v^T \mathbf{1}).$$

Then, we can write

$$\Delta\lambda_v = (\Delta\lambda_u^T \mathbf{1})\, q_u$$

$$\Delta\lambda_u = (\Delta\lambda_v^T \mathbf{1})\, q_v = \big(((\Delta\lambda_u^T \mathbf{1})\, q_u)^T \mathbf{1}\big)\, q_v = (\Delta\lambda_u^T \mathbf{1})\, q_v.$$

SLIDE 38

Analytic Update

$(\Delta\lambda_u^T \mathbf{1}) = (\Delta\lambda_v^T \mathbf{1}) = \Delta s$. We have $\Delta\lambda_v = \Delta s\, q_u$ and $\Delta\lambda_u = \Delta s\, q_v$. The change in the quadratic objective function for the axes $u$ and $v$ is

$$\Delta J_{uv}(\Delta\lambda) = c_u^T \Delta\lambda_u + c_v^T \Delta\lambda_v + \tfrac{1}{2} \Delta\lambda_u^T H_{uu} \Delta\lambda_u + \Delta\lambda_u^T H_{uv} \Delta\lambda_v + \tfrac{1}{2} \Delta\lambda_v^T H_{vv} \Delta\lambda_v + \sum_{t \ne u,v} \big( \Delta\lambda_t^T H_{tu} \Delta\lambda_u + \Delta\lambda_t^T H_{tv} \Delta\lambda_v \big).$$

We must express the change in the objective, $\Delta J_{uv}(\Delta\lambda)$, as a function of $\Delta s$. The resulting one-dimensional quadratic objective function, $\Delta J_{uv}(\Delta s)$, can be analytically optimized by finding the root of the derivative under the box constraint.
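A minimal sketch of that final analytic step: minimize a one-dimensional quadratic by taking the root of its derivative and clipping to the feasible interval. The coefficients and interval are placeholders, and the reduction of $\Delta J_{uv}$ to these coefficients is not shown here.

```python
# Sketch: the analytic SMO step minimizes g(ds) = a*ds + 0.5*h*ds**2 over an
# interval [lo, hi] implied by the box constraints 0 <= lambda_t <= c.
import numpy as np

def analytic_step(a, h, lo, hi):
    """Minimize a*ds + 0.5*h*ds^2 over [lo, hi]."""
    if h > 0:
        ds = -a / h                      # unconstrained minimizer (root of derivative)
        return float(np.clip(ds, lo, hi))
    # degenerate curvature: the minimum lies at one of the interval endpoints
    g = lambda ds: a * ds + 0.5 * h * ds ** 2
    return lo if g(lo) <= g(hi) else hi

print(analytic_step(a=1.5, h=2.0, lo=-1.0, hi=1.0))   # -> -0.75
```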

SLIDE 39

Other Cases

Intra-class:

$$q_u (\hat{\lambda}_u^T \mathbf{1}) + q_w (\hat{\lambda}_w^T \mathbf{1}) = q_u (\lambda_u^T \mathbf{1}) + q_w (\lambda_w^T \mathbf{1})$$

$$\hat{\lambda}_u + \hat{\lambda}_w = \lambda_u + \lambda_w$$

Newton step: occasionally interleave a second-order step [1] over a larger set of axes. We discovered that SMO can get trapped in a local plateau in the objective function. Though the objective and constraints are convex, choosing a minimal set of axes to update results in slow convergence.

SLIDE 40

SMO Timing

[Figure: Optimization time in seconds (log scale) versus number of examples (100 to 500) for QUADPROG and SMO.]

SLIDE 41

Outline

Summary of Contribution
Stationary kernel combination
Nonstationary kernel combination
Sequential minimal optimization

Results

Conclusion

SLIDE 42

Benchmark data sets

We validate NSKC on the UCI [11] Breast Cancer, Sonar, and Heart data sets. We use a quadratic kernel $k_1(x_1, x_2) = (1 + x_1^T x_2)^2$, an RBF kernel $k_2(x_1, x_2) = \exp(-0.5\,(x_1 - x_2)^T (x_1 - x_2)/\sigma)$, and a linear kernel $k_3(x_1, x_2) = x_1^T x_2$.

All three kernels are normalized so that their features lie on the surface of a unit hypersphere.

As in Lanckriet et al. [6], we use a hard margin (c = 10,000). The RBF width parameter σ is set to 0.5 (Cancer), 0.1 (Sonar), and 0.5 (Heart).
SLIDE 43

Breast Cancer

Algorithm    Mean ROC
quadratic    0.5486 ± 0.091
RBF          0.6275 ± 0.019
linear       0.5433 ± 0.087
SDP          0.8155 ± 0.015
ML           0.5573 ± 0.03
NSKC         0.8313 ± 0.014

SLIDE 44

Sonar

Algorithm    Mean ROC
quadratic    0.8145 ± 0.01
RBF          0.8595 ± 0.009
linear       0.7297 ± 0.01
SDP          0.8595 ± 0.009
ML           0.6817 ± 0.022
NSKC         0.8634 ± 0.008

SLIDE 45

Heart

Algorithm    Mean ROC
quadratic    0.6141 ± 0.032
RBF          0.5556 ± 0.01
linear       0.5237 ± 0.02
SDP          0.5556 ± 0.01
ML           0.5361 ± 0.024
NSKC         0.6052 ± 0.016

SLIDE 46

Yeast Experiment

We compare NSKC against three single-kernel SVMs and against an SDP combination of the three kernels. This is the data set used for the original SDP experiments [7, 5].
Gene expression kernel
Protein domain kernel
Sequence kernel
MIPS MYGD labels
500 randomly sampled genes in a 5x3cv experiment

SLIDE 47

Protein Function Annotation

Class    Exp      Dom      Seq      SDP      NSKC
1        0.630    0.717    0.750    0.745    0.747
2        0.657    0.664    0.718    0.751    0.755
3        0.668    0.706    0.729    0.768    0.774
4        0.596    0.756    0.752    0.766    0.778
5        0.810    0.773    0.789    0.834    0.836
6        0.617    0.690    0.668    0.698    0.717
7        0.554    0.715    0.740    0.720    0.738
8        0.594    0.636    0.680    0.697    0.699
9        0.535    0.564    0.603    0.582    0.576
10       0.554    0.616    0.706    0.697    0.687
11       0.506    0.470    0.480    0.524    0.526
12       0.682    0.896    0.883    0.916    0.918

SLIDE 48

Sequence/Structure Revisited

GO term       Average          SDP              NSKC
GO:0008168    0.937 ± 0.016    0.938 ± 0.015    0.944 ± 0.014
GO:0005506    0.927 ± 0.012    0.927 ± 0.012    0.926 ± 0.013
GO:0006260    0.878 ± 0.016    0.870 ± 0.015    0.880 ± 0.015
GO:0048037    0.911 ± 0.016    0.909 ± 0.016    0.918 ± 0.015
GO:0046483    0.937 ± 0.008    0.940 ± 0.008    0.941 ± 0.008
GO:0044255    0.874 ± 0.015    0.864 ± 0.013    0.874 ± 0.012
GO:0016853    0.837 ± 0.017    0.810 ± 0.019    0.823 ± 0.018
GO:0044262    0.908 ± 0.006    0.897 ± 0.006    0.906 ± 0.007
GO:0009117    0.890 ± 0.012    0.880 ± 0.012    0.887 ± 0.012
GO:0016829    0.931 ± 0.008    0.926 ± 0.007    0.928 ± 0.008

NSKC and averaging are in a statistical tie.
NSKC is significantly better than SDP.

SLIDE 49

Outline

Summary of Contribution
Stationary kernel combination
Nonstationary kernel combination
Sequential minimal optimization
Results

Conclusion

SLIDE 50

Conclusion

Prior work
Contributions
Future directions

SLIDE 51

Prior work

To complete this research we built upon an impressive foundation of prior work in:

Kernel methods [16]
Support vector machines [18]
Multi-kernel learning [14, 6, 12, 17]
Maximum entropy discrimination [3]
Protein function annotation from heterogeneous data sets [7, 17]
Optimization [15, 1]

In particular, this thesis extends the work of Jebara [4]. William Noble and Tony Jebara are my advisors and co-authors and greatly influenced the work.

SLIDE 52

Contributions

Empirical study of averaging versus SDP
Nonstationary kernel combination
Double Jensen bound for latent MED
Efficient optimization
Implementation

SLIDE 53

Averaging vs. SDP

We present a comparison of SDP and averaging for combining protein sequence and structure kernels for the prediction of function. We analyze the outcomes and suggest when each approach is appropriate. We conclude that in all practical cases, averaging is worthwhile. This result is significant to practitioners because it indicates that a simple, fast, free technique is also very effective.

SLIDE 54

Nonstationary kernel combination

We propose a novel way to combine kernels that generalizes the state of the art. NSKC allows the kernel combination weights to depend on the input space. We demonstrate our technique on a synthetic problem that existing techniques cannot solve. We validate NSKC on several common benchmark data sets and two real-world problems. NSKC usually outperforms existing techniques.

SLIDE 55

Double Jensen, SMO, Implementation

The new double Jensen variational bound is tight and assures that latent MED optimization will converge to a local optimum. Sequential minimal optimization for MED Gaussian mixtures improves optimization speed and helps to make the technique practical. SMO is faster than the quadprog standard QP solver and matches the speed of the highly optimized commercial Mosek optimization software. Our C++ SMO implementation and our Matlab classes for kernels, learning algorithms, and cross validation experiments will be freely available for academic use.

SLIDE 56

Future directions

Saddle-point optimization of indefinite objective
Entropy terms for Q
Transduction
Other latent variable models

SLIDE 57

References

[1] S. Boyd and L. Vandenberghe. Convex Optimization. Prentice-Hall, 2003. To appear. Available at http://www.stanford.edu/~boyd/cvxbook.html.
[2] Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat Genet, 25(1):25–9, 2000.
[3] T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Advances in Neural Information Processing Systems, volume 12, December 1999.
[4] T. Jebara. Machine Learning: Discriminative and Generative. Kluwer Academic, Boston, MA, 2004.
[5] G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A statistical framework for genomic data fusion. Bioinformatics, 20(16):2626–2635, 2004.
[6] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. In C. Sammut and A. Hoffman, editors, Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, 2002. Morgan Kauffman.
[7] G. R. G. Lanckriet, M. Deng, N. Cristianini, M. I. Jordan, and W. S. Noble. Kernel-based data fusion and its application to protein function prediction in yeast. In R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, and T. E. Klein, editors, Proceedings of the Pacific Symposium on Biocomputing, pages 300–311. World Scientific, 2004.
[8] C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems, pages 1441–1448, Cambridge, MA, 2003. MIT Press.
[9] D. Lewis, T. Jebara, and W. S. Noble. Nonstationary kernel combination. In 23rd International Conference on Machine Learning (ICML), 2006.
[10] D. Lewis, T. Jebara, and W. S. Noble. Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Submitted, April 2006.
[11] P. M. Murphy and D. W. Aha. UCI repository of machine learning databases. Dept. of Information and Computer Science, UC Irvine, 1995.
[12] C. S. Ong, A. J. Smola, and R. C. Williamson. Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6:1043–1071, 2005.
[13] A. R. Ortiz, C. E. M. Strauss, and O. Olmea. MAMMOTH (Matching molecular models obtained from theory): An automated method for model comparison. Protein Science, 11:2606–2621, 2002.
[14] P. Pavlidis, J. Weston, J. Cai, and W. S. Noble. Learning gene functional classifications from multiple data types. Journal of Computational Biology, 9(2):401–411, 2002.
[15] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods. MIT Press, 1999.
[16] B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors. Advances in Kernel Methods: Support Vector Learning. MIT Press, Cambridge, MA, 1999.
[17] K. Tsuda, H. J. Shin, and B. Schölkopf. Fast protein classification with multiple networks. In ECCB, 2005.
[18] V. N. Vapnik. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.