Combining Kernels for Classification
Doctoral Thesis Seminar
Darrin P. Lewis
dplewis@cs.columbia.edu
Combining Kernels for Classification – p.
Outline
- Summary of Contribution
- Stationary kernel combination
- Nonstationary kernel combination
Summary of Contribution
Stationary kernel combination
Kernel 1:

    1      4      2.75    3
    4     16     11      12
    2.75  11      7.5625  8.25
    3     12      8.25    9

[Figure: PCA basis for Kernel 1 (axes X1, X2)]

Kernel 2:

    9     12      8.25    3
   12     16     11       4
    8.25  11      7.5625  2.75
    3      4      2.75    1

[Figure: PCA basis for Kernel 2 (axes X1, X2)]

Combined kernel (the matrix sum):

   10     16     11       6
   16     32     22      16
   11     22     15.125  11
    6     16     11      10

[Figure: PCA basis for combined kernel (axes X1, X2)]
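The unweighted combination above is just the matrix sum of the two Gram matrices, and the sum of valid kernels is again a valid (positive semidefinite) kernel. A quick numpy check, with the matrices copied from the slides (the rank-one observation is something the numbers happen to satisfy, not a claim from the slides):

```python
import numpy as np

# Gram matrices as shown on the slides
K1 = np.array([[1.0, 4.0, 2.75, 3.0],
               [4.0, 16.0, 11.0, 12.0],
               [2.75, 11.0, 7.5625, 8.25],
               [3.0, 12.0, 8.25, 9.0]])
K2 = np.array([[9.0, 12.0, 8.25, 3.0],
               [12.0, 16.0, 11.0, 4.0],
               [8.25, 11.0, 7.5625, 2.75],
               [3.0, 4.0, 2.75, 1.0]])

# The combined kernel on the slide is the elementwise (matrix) sum
K = K1 + K2

# Each input kernel happens to be a rank-one outer product v v^T
v1 = np.array([1.0, 4.0, 2.75, 3.0])
v2 = np.array([3.0, 4.0, 2.75, 1.0])
assert np.allclose(K1, np.outer(v1, v1))
assert np.allclose(K2, np.outer(v2, v2))

# A sum of PSD matrices is PSD, so the combination is still a kernel
assert np.linalg.eigvalsh(K).min() > -1e-9

# "PCA basis" of each kernel = eigenvectors of its Gram matrix
eigvals, eigvecs = np.linalg.eigh(K)
print(np.round(eigvals, 4))
```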
[Figure: Mean ROC for the Sequence, Structure, Average, and SDP kernels]
GO term      Structure       Sequence        Average         SDP
GO:0008168   0.941 ± 0.014   0.709 ± 0.020   0.937 ± 0.016   0.938 ± 0.015
GO:0005506   0.934 ± 0.008   0.747 ± 0.015   0.927 ± 0.012   0.927 ± 0.012
GO:0006260   0.885 ± 0.014   0.707 ± 0.020   0.878 ± 0.016   0.870 ± 0.015
GO:0048037   0.916 ± 0.015   0.738 ± 0.025   0.911 ± 0.016   0.909 ± 0.016
GO:0046483   0.949 ± 0.007   0.787 ± 0.011   0.937 ± 0.008   0.940 ± 0.008
GO:0044255   0.891 ± 0.012   0.732 ± 0.012   0.874 ± 0.015   0.864 ± 0.013
GO:0016853   0.855 ± 0.014   0.706 ± 0.029   0.837 ± 0.017   0.810 ± 0.019
GO:0044262   0.912 ± 0.007   0.764 ± 0.018   0.908 ± 0.006   0.897 ± 0.006
GO:0009117   0.892 ± 0.015   0.748 ± 0.016   0.890 ± 0.012   0.880 ± 0.012
GO:0016829   0.935 ± 0.006   0.791 ± 0.013   0.931 ± 0.008   0.926 ± 0.007
GO:0006732   0.823 ± 0.011   0.781 ± 0.013   0.845 ± 0.011   0.828 ± 0.013
GO:0007242   0.898 ± 0.011   0.859 ± 0.014   0.903 ± 0.010   0.900 ± 0.011
GO:0005525   0.923 ± 0.008   0.884 ± 0.015   0.931 ± 0.009   0.931 ± 0.009
GO:0004252   0.937 ± 0.011   0.907 ± 0.012   0.932 ± 0.012   0.931 ± 0.012
GO:0005198   0.809 ± 0.010   0.795 ± 0.014   0.828 ± 0.010   0.824 ± 0.011
[Figure: Mean ROC vs. log2 ratio of kernel weights (ratios 1-8 and Inf)]

[Figure: Mean ROC, SDP vs. Average, with no noise, 1 noise kernel, and 2 noise kernels]

[Figure: GO:0046483, Mean ROC vs. percent missing structures (10-100%); series: All SDP, None SDP, Self SDP, All Ave, None Ave, Self Ave, Structure]
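The weight-ratio sweep above varies the relative weight of two kernels; any nonnegative combination remains a valid kernel, which is what licenses the whole sweep. A small numpy sketch (the two RBF kernels and the random data here are illustrative, not the thesis data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

def rbf(X, gamma):
    # Gaussian RBF Gram matrix over the rows of X
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K1, K2 = rbf(X, 0.5), rbf(X, 5.0)

# Sweep the log2 ratio of the two kernel weights, as in the plot
for log2_ratio in range(-8, 9):
    w1 = 2.0 ** log2_ratio
    K = (w1 * K1 + 1.0 * K2) / (w1 + 1.0)
    # nonnegative combinations of PSD matrices stay PSD
    assert np.linalg.eigvalsh(K).min() > -1e-9
print("all combinations PSD")
```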
Nonstationary kernel combination
L(X_t | Θ) = ln Σ_{m=1}^{M} α_m N(φ⁺_m(X_t) | μ⁺_m, I) − ln Σ_{n=1}^{N} β_n N(φ⁻_n(X_t) | μ⁻_n, I) + b

where μ⁺_m, μ⁻_n are the Gaussian means.
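The discriminant is the log-ratio of two Gaussian mixtures, one per class, each component living in its own feature space. A toy numpy version with shared identity covariance (all feature values, means, and mixing weights below are invented for illustration):

```python
import numpy as np

def log_gauss(x, mu):
    # ln N(x | mu, I) with identity covariance
    return -0.5 * x.size * np.log(2 * np.pi) - 0.5 * np.sum((x - mu) ** 2)

def logsumexp(v):
    v = np.asarray(v)
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def discriminant(phi_pos, phi_neg, alpha, beta, mu_pos, mu_neg, b):
    # L(X|Theta) = ln sum_m alpha_m N(phi+_m(X) | mu+_m, I)
    #            - ln sum_n beta_n  N(phi-_n(X) | mu-_n, I) + b
    pos = [np.log(a) + log_gauss(p, m) for a, p, m in zip(alpha, phi_pos, mu_pos)]
    neg = [np.log(bn) + log_gauss(p, m) for bn, p, m in zip(beta, phi_neg, mu_neg)]
    return logsumexp(pos) - logsumexp(neg) + b

# One test point mapped under two feature maps per class (toy numbers):
# close to the positive-class means, far from the negative-class means.
phi_p = [np.array([0.1, 0.0]), np.array([0.2])]   # phi+_1(X), phi+_2(X)
phi_n = [np.array([0.0, 0.0]), np.array([0.0])]   # phi-_1(X), phi-_2(X)
mu_p = [np.zeros(2), np.zeros(1)]
mu_n = [np.full(2, 2.0), np.array([3.0])]

score = discriminant(phi_p, phi_n, [0.5, 0.5], [0.5, 0.5], mu_p, mu_n, b=0.0)
print(score > 0)  # the point is classified as positive
```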
More generally, with mixture components P:

L(X_t | Θ) = ln Σ_{m=1}^{M} P(m, φ⁺_m(X_t) | θ⁺_m) − ln Σ_{n=1}^{N} P(n, φ⁻_n(X_t) | θ⁻_n) + b

Maximum entropy discrimination: find the posterior P(Θ) closest, in KL divergence ∫ P(Θ) ln [ P(Θ) / P⁽⁰⁾(Θ) ] dΘ, to the prior P⁽⁰⁾(Θ), subject to the margin constraints. The solution is the tilted prior

P(Θ) = (1 / Z(λ)) P⁽⁰⁾(Θ) exp{ Σ_{t∈T} λ_t [ y_t L(X_t | Θ) − γ_t ] }
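The MED solution tilts the prior by an exponential of the margin terms. For a Gaussian prior and a linear discriminant this tilting just shifts the mean, which a 1-D numerical sketch makes concrete (grid integration; the particular λ, y, x values are purely illustrative):

```python
import numpy as np

theta = np.linspace(-10, 10, 20001)
d_theta = theta[1] - theta[0]

prior = np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)   # P0 = N(0, 1)
lam, y, x = 0.7, 1.0, 2.0                               # one active constraint

# MED posterior: P(theta) proportional to P0(theta) * exp(lam * y * theta * x)
post = prior * np.exp(lam * y * theta * x)
post /= post.sum() * d_theta

mean = (theta * post).sum() * d_theta
print(mean)  # close to lam * y * x = 1.4: tilting a Gaussian shifts its mean
```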
Z̈(λ, Q | q) = ∫ P⁽⁰⁾(Θ) exp{
    Σ_{t∈T⁺} λ_t [ Σ_m q_t(m) ln P(m, φ⁺_m(X_t) | θ⁺_m) + H(q_t)
                   − Σ_n Q_t(n) ln P(n, φ⁻_n(X_t) | θ⁻_n) − H(Q_t) + b − γ_t ]
  + Σ_{t∈T⁻} λ_t [ Σ_n q_t(n) ln P(n, φ⁻_n(X_t) | θ⁻_n) + H(q_t)
                   − Σ_m Q_t(m) ln P(m, φ⁺_m(X_t) | θ⁺_m) − H(Q_t) − b − γ_t ]
} dΘ

We introduce variational distributions q_t over the correct-class log-sums and Q_t over the incorrect-class log-sums, replacing them with upper and lower bounds respectively, so that argmin_Q argmax_q Z̈(λ, Q | q) = Z(λ). Iterative optimization is required.
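The q_t bound is the usual Jensen/EM inequality: ln Σ_m α_m p_m ≥ Σ_m q(m) ln(α_m p_m) + H(q), with equality exactly when q is the posterior over components. A quick numerical check (the mixing proportions and likelihood values are arbitrary):

```python
import numpy as np

alpha = np.array([0.3, 0.7])          # mixing proportions
p = np.array([0.05, 0.4])             # component likelihoods at some point

log_sum = np.log((alpha * p).sum())   # the exact log-mixture

def bound(q):
    # variational lower bound: E_q[ln alpha_m p_m] + H(q)
    return (q * np.log(alpha * p)).sum() - (q * np.log(q)).sum()

q_post = alpha * p / (alpha * p).sum()       # posterior responsibilities
q_bad = np.array([0.5, 0.5])

assert bound(q_bad) <= log_sum + 1e-12       # any q gives a lower bound
assert abs(bound(q_post) - log_sum) < 1e-12  # tight at the posterior
print(log_sum, bound(q_post))
```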
With Gaussian components:

P(m, φ⁺_m(X_t) | θ⁺_m) = α_m N(φ⁺_m(X_t) | μ⁺_m, I),   P(n, φ⁻_n(X_t) | θ⁻_n) = β_n N(φ⁻_n(X_t) | μ⁻_n, I)

where α_m, β_n are mixing proportions and μ⁺_m, μ⁻_n are the Gaussian means.
J̈(λ, Q | q) = Σ_{t∈T} λ_t [ H(Q_t) − H(q_t) ] + Σ_{t∈T} λ_t γ_t
  − ½ Σ_{t,t′∈T⁺} λ_t λ_{t′} [ Σ_m q_t(m) q_{t′}(m) k⁺_m(t, t′) + Σ_n Q_t(n) Q_{t′}(n) k⁻_n(t, t′) ]
  − ½ Σ_{t,t′∈T⁻} λ_t λ_{t′} [ Σ_m Q_t(m) Q_{t′}(m) k⁺_m(t, t′) + Σ_n q_t(n) q_{t′}(n) k⁻_n(t, t′) ]
  + Σ_{t∈T⁺} Σ_{t′∈T⁻} λ_t λ_{t′} [ Σ_m q_t(m) Q_{t′}(m) k⁺_m(t, t′) + Σ_n Q_t(n) q_{t′}(n) k⁻_n(t, t′) ]
E{ln N(φ⁺_m(X_t) | μ⁺_m)} = − (D/2) ln(2π) − ½ − ½ k⁺_m(X_t, X_t)
  + Σ_{τ∈T⁺} λ_τ q_τ(m) k⁺_m(X_τ, X_t) − Σ_{τ∈T⁻} λ_τ Q_τ(m) k⁺_m(X_τ, X_t)
  − ½ Σ_{τ,τ′∈T⁺} λ_τ λ_{τ′} q_τ(m) q_{τ′}(m) k⁺_m(X_τ, X_τ′)
  − ½ Σ_{τ,τ′∈T⁻} λ_τ λ_{τ′} Q_τ(m) Q_{τ′}(m) k⁺_m(X_τ, X_τ′)
  + Σ_{τ∈T⁺} Σ_{τ′∈T⁻} λ_τ λ_{τ′} q_τ(m) Q_{τ′}(m) k⁺_m(X_τ, X_τ′)

(and analogously for the negative-class components)
a_m = E{ln α_m} + ½ E{b},  ∀m = 1..M
b_n = E{ln β_n} − ½ E{b},  ∀n = 1..N

When λ_t ∈ (0, c) we must achieve the following with equality:

Σ_m q_t(m) [ a_m + E{ln N(φ⁺_m(X_t) | μ⁺_m)} ] + H(q_t)
  = Σ_n Q_t(n) [ b_n + E{ln N(φ⁻_n(X_t) | μ⁻_n)} ] + H(Q_t) + γ_t,  ∀t ∈ T⁺

Σ_n q_t(n) [ b_n + E{ln N(φ⁻_n(X_t) | μ⁻_n)} ] + H(q_t)
  = Σ_m Q_t(m) [ a_m + E{ln N(φ⁺_m(X_t) | μ⁺_m)} ] + H(Q_t) + γ_t,  ∀t ∈ T⁻

We solve this (over-constrained) linear system for a_m, m = 1..M, and b_n, n = 1..N, obtaining the expected bias and mixing proportions.
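Solving an over-constrained linear system for the a_m and b_n is an ordinary least-squares problem. A sketch with a hypothetical coefficient matrix (the matrix A, right-hand side, and sizes below are invented for illustration, not the system's actual coefficients):

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, T = 2, 2, 10               # mixture sizes and number of active constraints
A = rng.normal(size=(T, M + N))  # one row per active example (lambda_t in (0, c))
rhs = rng.normal(size=T)

# Least-squares solution of the over-constrained system A [a; b] = rhs
ab, residuals, rank, _ = np.linalg.lstsq(A, rhs, rcond=None)
a, b = ab[:M], ab[M:]
print(a, b)

# The least-squares solution satisfies the normal equations A^T (A x - rhs) = 0
assert np.allclose(A.T @ (A @ ab - rhs), 0.0, atol=1e-8)
```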
Prediction compares the two expected class scores E{ln N(φ⁺_m(X) | μ⁺_m)} + a_m and E{ln N(φ⁻_n(X) | μ⁻_n)} + b_n, with component responsibilities

r⁺_{m,t}(X) = exp( E{ln N(φ⁺_m(X) | μ⁺_m)} + a_m ) / Σ_{m′} exp( E{ln N(φ⁺_{m′}(X) | μ⁺_{m′})} + a_{m′} ).
Expanding the expectations, the decision rule involves only the component kernels: cross terms of the form Σ_τ λ_τ q_τ(m) r⁺_m(X) k⁺_m(X_τ, X) and Σ_τ λ_τ Q_τ(n) r⁻_n(X) k⁻_n(X_τ, X), plus self-similarity terms r⁺_m(X) k⁺_m(X, X) − r⁻_n(X) k⁻_n(X, X).
Sequential minimal optimization
The dual is a QP with quadratic term ½ λᵀ H λ subject to linear equality constraints A λ = 0 (and box constraints on λ), where each row of A couples a working group's multipliers through its responsibilities:

A = [ …  q_u1  q_u1  …   −1   …  q_w1  q_w1  …
      …  q_u2  q_u2  …   −1   …  q_w2  q_w2  …
      …   1   …  −q_v1  −q_v1  …   1   …
      …   1   …  −q_v2  −q_v2  …   1   … ]

λ = ( …, λ_u1, λ_u2, …, λ_v1, λ_v2, …, λ_w1, λ_w2, … )ᵀ
Because each variational distribution sums to one (qᵀ1 = 1):

(∆λ_vᵀ 1) q_v = ( ( (∆λ_uᵀ 1) q_u )ᵀ 1 ) q_v = (∆λ_uᵀ 1) q_v.

Let ∆s = (∆λ_uᵀ 1) = (∆λ_vᵀ 1). We have ∆λ_v = ∆s q_u, and the change in the objective restricted to the working pair (u, v) is

∆J = c_uᵀ ∆λ_u + c_vᵀ ∆λ_v − ½ ∆λ_uᵀ H_uu ∆λ_u − ∆λ_uᵀ H_uv ∆λ_v − ½ ∆λ_vᵀ H_vv ∆λ_v − Σ_t ( λ_tᵀ H_tu ∆λ_u + λ_tᵀ H_tv ∆λ_v ).
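SMO changes only a working pair of blocks at a time, so the change in the quadratic objective can be computed from just the rows and columns of H that touch that pair. The block decomposition can be verified numerically (random PSD H, random c, and the working indices below are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 8
B = rng.normal(size=(T, T))
H = B @ B.T                      # a random PSD "Hessian"
c = rng.normal(size=T)

def J(lam):
    # quadratic dual objective: c^T lam - 1/2 lam^T H lam
    return c @ lam - 0.5 * lam @ H @ lam

lam = rng.uniform(0, 1, size=T)
u, v = 2, 5                      # the working pair
dlam = np.zeros(T)
dlam[u], dlam[v] = 0.3, -0.2     # a step touching only u and v

# Full recomputation of the objective change...
dJ_full = J(lam + dlam) - J(lam)

# ...equals the cheap pairwise expression using only rows/columns u and v of H
dJ_pair = (c[u] * dlam[u] + c[v] * dlam[v]
           - 0.5 * H[u, u] * dlam[u] ** 2
           - 0.5 * H[v, v] * dlam[v] ** 2
           - H[u, v] * dlam[u] * dlam[v]
           - lam @ (H[:, u] * dlam[u] + H[:, v] * dlam[v]))

assert np.isclose(dJ_full, dJ_pair)
print(dJ_full)
```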
[Figure: Training time (seconds, log scale) vs. number of examples (100-500), QUADPROG vs. SMO]
Results
Kernels: a quadratic polynomial kernel (x₁ᵀx₂)², an RBF kernel, and a linear kernel x₁ᵀx₂.
0.8313 ± 0.014

0.8634 ± 0.008

0.6141 ± 0.032
Class   Exp     Dom     Seq     SDP     NSKC
1       0.630   0.717   0.750*  0.745   0.747
2       0.657   0.664   0.718   0.751   0.755*
3       0.668   0.706   0.729   0.768   0.774*
4       0.596   0.756   0.752   0.766   0.778*
5       0.810   0.773   0.789   0.834   0.836*
6       0.617   0.690   0.668   0.698   0.717*
7       0.554   0.715   0.740*  0.720   0.738
8       0.594   0.636   0.680   0.697   0.699*
9       0.535   0.564   0.603*  0.582   0.576
10      0.554   0.616   0.706*  0.697   0.687
11      0.506   0.470   0.480   0.524   0.526*
12      0.682   0.896   0.883   0.916   0.918*

(* best result per class)
GO term      Average         SDP             NSKC
GO:0008168   0.937 ± 0.016   0.938 ± 0.015   0.944 ± 0.014
GO:0005506   0.927 ± 0.012   0.927 ± 0.012   0.926 ± 0.013
GO:0006260   0.878 ± 0.016   0.870 ± 0.015   0.880 ± 0.015
GO:0048037   0.911 ± 0.016   0.909 ± 0.016   0.918 ± 0.015
GO:0046483   0.937 ± 0.008   0.940 ± 0.008   0.941 ± 0.008
GO:0044255   0.874 ± 0.015   0.864 ± 0.013   0.874 ± 0.012
GO:0016853   0.837 ± 0.017   0.810 ± 0.019   0.823 ± 0.018
GO:0044262   0.908 ± 0.006   0.897 ± 0.006   0.906 ± 0.007
GO:0009117   0.890 ± 0.012   0.880 ± 0.012   0.887 ± 0.012
GO:0016829   0.931 ± 0.008   0.926 ± 0.007   0.928 ± 0.008
Conclusion