
Combining Kernels for Classification: Doctoral Thesis Seminar, Darrin P. Lewis



  1. Combining Kernels for Classification. Doctoral Thesis Seminar. Darrin P. Lewis, dplewis@cs.columbia.edu

  2. Outline: Summary of Contribution; Stationary kernel combination; Nonstationary kernel combination; Sequential minimal optimization; Results; Conclusion

  3. Summary of Contribution: Empirical study of kernel averaging versus SDP-weighted kernel combination; Nonstationary kernel combination; Double Jensen bound for latent MED; Efficient iterative optimization; Implementation

  4. Outline: Summary of Contribution; Stationary kernel combination; Nonstationary kernel combination; Sequential minimal optimization; Results; Conclusion

  5. Example Kernel One:
$$K_1 = \begin{pmatrix} 1 & 4 & 2.75 & 3 \\ 4 & 16 & 11 & 12 \\ 2.75 & 11 & 7.5625 & 8.25 \\ 3 & 12 & 8.25 & 9 \end{pmatrix}$$
[Figure: PCA basis for Kernel 1]

  6. Example Kernel Two:
$$K_2 = \begin{pmatrix} 9 & 12 & 8.25 & 3 \\ 12 & 16 & 11 & 4 \\ 8.25 & 11 & 7.5625 & 2.75 \\ 3 & 4 & 2.75 & 1 \end{pmatrix}$$
[Figure: PCA basis for Kernel 2]

  7. Example Kernel Combination:
$$K_C = K_1 + K_2 = \begin{pmatrix} 10 & 16 & 11 & 6 \\ 16 & 32 & 22 & 16 \\ 11 & 22 & 15.125 & 11 \\ 6 & 16 & 11 & 10 \end{pmatrix}$$
[Figure: PCA basis for the combined kernel]
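The arithmetic on these three slides is easy to verify. A minimal numpy check (my own illustration, not part of the slides) that the combined Gram matrix is the entrywise sum of the two example kernels, and that its eigendecomposition yields the PCA basis plotted above:

```python
import numpy as np

# The two example Gram matrices from the preceding slides.
K1 = np.array([[1.0,   4.0,  2.75,   3.0],
               [4.0,  16.0, 11.0,   12.0],
               [2.75, 11.0,  7.5625, 8.25],
               [3.0,  12.0,  8.25,   9.0]])
K2 = np.array([[9.0,  12.0,  8.25,   3.0],
               [12.0, 16.0, 11.0,    4.0],
               [8.25, 11.0,  7.5625, 2.75],
               [3.0,   4.0,  2.75,   1.0]])

KC = K1 + K2                       # the combined kernel on slide 7
print(np.linalg.matrix_rank(K1),   # 1: K1 = v v^T with v = (1, 4, 2.75, 3)
      np.linalg.matrix_rank(K2),   # 1: K2 = w w^T with w = (3, 4, 2.75, 1)
      np.linalg.matrix_rank(KC))   # 2: the combined feature space is 2-dimensional

# Eigenvectors of the Gram matrix give the kernel PCA basis.
eigvals, eigvecs = np.linalg.eigh(KC)
```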

  8. Effect of Combination:
$$K_C(x, z) = K_1(x, z) + K_2(x, z) = \langle \phi_1(x), \phi_1(z) \rangle + \langle \phi_2(x), \phi_2(z) \rangle = \langle \phi_1(x) {:} \phi_2(x), \; \phi_1(z) {:} \phi_2(z) \rangle$$
where ":" denotes feature-vector concatenation. The implicit feature space of the combined kernel is a concatenation of the feature spaces of the individual kernels. A basis in the combined feature space may be lower dimensional than the sum of the dimensions of the individual feature spaces.
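To make the concatenation property concrete, here is a short sketch (mine, with arbitrary toy feature maps) confirming that summing two kernels is the same as taking inner products of concatenated features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # five points in R^3

phi1 = lambda x: x ** 2                        # toy explicit feature map for K1
phi2 = lambda x: np.concatenate([x, [1.0]])    # toy explicit feature map for K2

def gram(phi, X):
    """Gram matrix of inner products in the feature space of phi."""
    F = np.array([phi(x) for x in X])
    return F @ F.T

K_sum = gram(phi1, X) + gram(phi2, X)          # K1(x, z) + K2(x, z)
K_cat = gram(lambda x: np.concatenate([phi1(x), phi2(x)]), X)  # <phi1:phi2, phi1:phi2>
assert np.allclose(K_sum, K_cat)
```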

  9. Combination Weights. There are several ways in which the combination weights can be determined: equal weight, an unweighted combination, which is essentially kernel averaging [14]; or optimized weight, the SDP-weighted combination [6], in which the kernel weights and the SVM Lagrange multipliers are determined in a single optimization. To regularize the kernel weights, a constraint is enforced to keep the trace of the combined kernel constant.
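The equal-weight scheme and the trace constraint are straightforward to express in code; a minimal sketch follows (the SDP of [6] itself is not reproduced here, and the function name and signature are my own):

```python
import numpy as np

def combine_kernels(kernels, weights=None, target_trace=None):
    """Conic combination sum_m w_m K_m of Gram matrices.

    weights=None gives the unweighted (averaging) combination; with
    target_trace set, the result is rescaled so that trace(K) stays
    constant, the regularization used for the learned kernel weights.
    """
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    K = sum(w * Km for w, Km in zip(weights, kernels))
    if target_trace is not None:
        K *= target_trace / np.trace(K)
    return K
```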

  10. Sequence/Structure. We compare [10] the state-of-the-art SDP and simple averaging for conic combinations of kernels. Drawbacks of SDP include optimization time and the lack of a free implementation. We determined the cases in which averaging is preferable and those in which SDP is required. Our experiments predict Gene Ontology (GO) terms [2] using a combination of amino acid sequence and protein structural information. We use the (4,1)-Mismatch sequence kernel [8] and the MAMMOTH (sequence-independent) structure kernel [13].

  11. Cumulative ROC AUC. [Figure: number of GO terms with a given mean ROC, plotted against mean ROC, for the Sequence, Structure, Average, and SDP kernels]

  12. Mean ROC AUC, Top 10 GO Terms
      GO term      Structure        Sequence         Average          SDP
      GO:0008168   0.941 ± 0.014    0.709 ± 0.020    0.937 ± 0.016    0.938 ± 0.015
      GO:0005506   0.934 ± 0.008    0.747 ± 0.015    0.927 ± 0.012    0.927 ± 0.012
      GO:0006260   0.885 ± 0.014    0.707 ± 0.020    0.878 ± 0.016    0.870 ± 0.015
      GO:0048037   0.916 ± 0.015    0.738 ± 0.025    0.911 ± 0.016    0.909 ± 0.016
      GO:0046483   0.949 ± 0.007    0.787 ± 0.011    0.937 ± 0.008    0.940 ± 0.008
      GO:0044255   0.891 ± 0.012    0.732 ± 0.012    0.874 ± 0.015    0.864 ± 0.013
      GO:0016853   0.855 ± 0.014    0.706 ± 0.029    0.837 ± 0.017    0.810 ± 0.019
      GO:0044262   0.912 ± 0.007    0.764 ± 0.018    0.908 ± 0.006    0.897 ± 0.006
      GO:0009117   0.892 ± 0.015    0.748 ± 0.016    0.890 ± 0.012    0.880 ± 0.012
      GO:0016829   0.935 ± 0.006    0.791 ± 0.013    0.931 ± 0.008    0.926 ± 0.007
      GO:0006732   0.823 ± 0.011    0.781 ± 0.013    0.845 ± 0.011    0.828 ± 0.013
      GO:0007242   0.898 ± 0.011    0.859 ± 0.014    0.903 ± 0.010    0.900 ± 0.011
      GO:0005525   0.923 ± 0.008    0.884 ± 0.015    0.931 ± 0.009    0.931 ± 0.009
      GO:0004252   0.937 ± 0.011    0.907 ± 0.012    0.932 ± 0.012    0.931 ± 0.012
      GO:0005198   0.809 ± 0.010    0.795 ± 0.014    0.828 ± 0.010    0.824 ± 0.011

  13. Varying Ratio, Top 10 GO Terms. [Figure: mean ROC versus log2 ratio of kernel weights, from -Inf to Inf]

  14. Noisy Kernels, 56 GO Terms. [Figure: mean ROC of SDP versus mean ROC of averaging, with no noise, one noise kernel, and two noise kernels]

  15. Missing Data, Typical GO Term (GO:0046483). [Figure: mean ROC versus percent missing structures, for All/None/Self SDP, All/None/Self Average, and Structure]

  16. Outline: Summary of Contribution; Stationary kernel combination; Nonstationary kernel combination; Sequential minimal optimization; Results; Conclusion

  17. Kernelized Discriminants.
Single: $f(x) = \sum_t y_t \lambda_t \, k(x_t, x) + b$
Linear combination: $f(x) = \sum_t y_t \lambda_t \sum_m \nu_m k_m(x_t, x) + b$
Nonstationary combination [9]: $f(x) = \sum_t y_t \lambda_t \sum_m \nu_{m,t}(x) \, k_m(x_t, x) + b$
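As a sketch of the three discriminants (a direct transcription under my own naming: lam are the Lagrange multipliers, kms a list of kernel functions, nu the stationary weights, and nu_mt an input-dependent weight function for the nonstationary case):

```python
def f_single(x, X, y, lam, k, b):
    """Single-kernel discriminant: sum_t y_t lam_t k(x_t, x) + b."""
    return sum(y[t] * lam[t] * k(X[t], x) for t in range(len(X))) + b

def f_stationary(x, X, y, lam, kms, nu, b):
    """Linear combination: one fixed weight nu[m] per kernel."""
    return sum(y[t] * lam[t] *
               sum(nu[m] * km(X[t], x) for m, km in enumerate(kms))
               for t in range(len(X))) + b

def f_nonstationary(x, X, y, lam, kms, nu_mt, b):
    """Nonstationary combination: weight nu_mt(m, t, x) varies per example and input."""
    return sum(y[t] * lam[t] *
               sum(nu_mt(m, t, x) * km(X[t], x) for m, km in enumerate(kms))
               for t in range(len(X))) + b
```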

  18. Parabola-Line Data. [Figure]

  19. Parabola-Line SDP. [Figure]

  20. Ratio of Gaussian Mixtures.
$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} \alpha_m \, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^{N} \beta_n \, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$
$\mu^+_m, \mu^-_n$: Gaussian means; $\alpha, \beta$: mixing proportions; $b$: scalar bias. For now, maximum likelihood parameters are estimated independently for each model. Note the explicit feature maps $\phi^+, \phi^-$.
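This discriminant transcribes directly into code; a sketch under my own naming, using scipy's multivariate normal with identity covariance as on the slide:

```python
import numpy as np
from scipy.stats import multivariate_normal

def L(x, phis_pos, mus_pos, alphas, phis_neg, mus_neg, betas, b):
    """Log-ratio of two Gaussian mixtures in explicit feature spaces, plus bias."""
    num = sum(a * multivariate_normal.pdf(phi(x), mean=mu)   # cov defaults to I
              for a, phi, mu in zip(alphas, phis_pos, mus_pos))
    den = sum(bn * multivariate_normal.pdf(phi(x), mean=mu)
              for bn, phi, mu in zip(betas, phis_neg, mus_neg))
    return np.log(num / den) + b
```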

  21. Parabola-Line ML. [Figure]

  22. Ratio of Generative Models.
$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} P(m, \phi^+_m(X_t) \mid \theta^+_m)}{\sum_{n=1}^{N} P(n, \phi^-_n(X_t) \mid \theta^-_n)} + b$$
Find a distribution $P(\Theta)$ rather than a specific $\Theta^*$. Classify using $\hat{y} = \operatorname{sign}\left( \int_\Theta P(\Theta) \, L(X_t; \Theta) \, d\Theta \right)$.

  23. Max Ent Parameter Estimation. Find $P(\Theta)$ to satisfy "moment" constraints:
$$\int_\Theta P(\Theta) \, y_t \, L(X_t; \Theta) \, d\Theta \ge \gamma_t \quad \forall t \in \mathcal{T}$$
while assuming nothing additional. Minimize the Shannon relative entropy
$$D(P \| P^{(0)}) = \int_\Theta P(\Theta) \ln \frac{P(\Theta)}{P^{(0)}(\Theta)} \, d\Theta$$
to allow the use of a prior $P^{(0)}(\Theta)$. The classic ME solution [3] is
$$P(\Theta) = \frac{1}{Z(\lambda)} P^{(0)}(\Theta) \, e^{\sum_{t \in \mathcal{T}} \lambda_t \left[ y_t L(X_t \mid \Theta) - \gamma_t \right]}$$
so $\lambda$ fully specifies $P(\Theta)$. Maximize the log-concave objective $J(\lambda) = -\log Z(\lambda)$.
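On a finite grid of candidate parameters, this solution and objective can be evaluated directly; a toy discretization (entirely my own, for intuition only, not the thesis's optimizer):

```python
import numpy as np

def med_posterior(lam, prior, L_vals, y, gamma):
    """MaxEnt posterior over a grid of G candidate Thetas.

    prior:  P^(0) on the grid, shape (G,)
    L_vals: L_vals[t, g] = L(X_t; Theta_g), shape (T, G)
    lam, y, gamma: shape (T,)
    Returns the posterior P(Theta) on the grid and the objective J(lam).
    """
    margins = y[:, None] * L_vals - gamma[:, None]  # y_t L(X_t|Theta) - gamma_t
    log_w = np.log(prior) + lam @ margins           # unnormalized log-posterior
    Z = np.exp(log_w).sum()                         # partition function Z(lam)
    J = -np.log(Z)                                  # log-concave objective to maximize
    return np.exp(log_w) / Z, J
```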

  24. Tractable Partition.
$$\ddot{Z}(\lambda, Q \mid q) = \int_\Theta P^{(0)}(\Theta) \, \exp\!\Bigg( \sum_{t \in \mathcal{T}^+} \lambda_t \Big( \sum_m q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) + H(q_t) - \sum_n Q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) - H(Q_t) + b - \gamma_t \Big) \Bigg) \exp\!\Bigg( \sum_{t \in \mathcal{T}^-} \lambda_t \Big( \sum_n q_t(n) \ln P(n, \phi^-_n(X_t) \mid \theta^-_n) + H(q_t) - \sum_m Q_t(m) \ln P(m, \phi^+_m(X_t) \mid \theta^+_m) - H(Q_t) - b - \gamma_t \Big) \Bigg) \, d\Theta$$
Introduce variational distributions $q_t$ over the correct-class log-sums and $Q_t$ over the incorrect-class log-sums to replace them with upper and lower bounds, respectively. Then $\min_Q \max_q \ddot{Z}(\lambda, Q \mid q) = Z(\lambda)$, and iterative optimization is required.
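A quick numeric check (my own, not from the thesis) of the Jensen step behind the $q_t$ bound: for any distribution $q$ over mixture components, $\ln \sum_m p_m \ge \sum_m q(m) \ln p_m + H(q)$, with equality when $q(m) \propto p_m$:

```python
import numpy as np

p = np.array([0.5, 1.5, 2.0])            # positive mixture-component terms
q = np.array([0.2, 0.3, 0.5])            # arbitrary variational distribution

entropy = lambda d: -np.sum(d * np.log(d))           # Shannon entropy H(q)
assert q @ np.log(p) + entropy(q) <= np.log(p.sum()) # Jensen lower bound holds

q_star = p / p.sum()                                 # tight choice of q
assert np.isclose(q_star @ np.log(p) + entropy(q_star), np.log(p.sum()))
```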

  25. MED Gaussian Mixtures.
$$L(X_t; \Theta) = \ln \frac{\sum_{m=1}^{M} \alpha_m \, \mathcal{N}(\phi^+_m(X_t) \mid \mu^+_m, I)}{\sum_{n=1}^{N} \beta_n \, \mathcal{N}(\phi^-_n(X_t) \mid \mu^-_n, I)} + b$$
Gaussian priors $\mathcal{N}(0, I)$ on $\mu^+_m, \mu^-_n$; non-informative Dirichlet priors on $\alpha, \beta$; non-informative Gaussian $\mathcal{N}(0, \infty)$ prior on $b$. These assumptions simplify the objective and result in a set of linear equality constraints on the convex optimization.

  26. Convex Objective.
$$\begin{aligned} \ddot{J}(\lambda, Q \mid q) ={}& \sum_{t \in \mathcal{T}} \lambda_t \big( H(Q_t) - H(q_t) \big) + \sum_{t \in \mathcal{T}} \lambda_t \gamma_t \\ &- \frac{1}{2} \sum_{t, t' \in \mathcal{T}^+} \lambda_t \lambda_{t'} \Big( \sum_m q_t(m) q_{t'}(m) k^+_m(t, t') + \sum_n Q_t(n) Q_{t'}(n) k^-_n(t, t') \Big) \\ &- \frac{1}{2} \sum_{t, t' \in \mathcal{T}^-} \lambda_t \lambda_{t'} \Big( \sum_m Q_t(m) Q_{t'}(m) k^+_m(t, t') + \sum_n q_t(n) q_{t'}(n) k^-_n(t, t') \Big) \\ &+ \sum_{t \in \mathcal{T}^+} \sum_{t' \in \mathcal{T}^-} \lambda_t \lambda_{t'} \Big( \sum_m q_t(m) Q_{t'}(m) k^+_m(t, t') + \sum_n Q_t(n) q_{t'}(n) k^-_n(t, t') \Big) \end{aligned}$$
