GRAPHLET KERNELS FOR VERTEX CLASSIFICATION
Presenter: José Lugo-Martínez
PhD Candidate, jlugomar@indiana.edu, Jun 12, 2015
OUTLINE
– Overview of classification problems on graphs
– Graphlet kernels for vertex classification
– Case
– Inferring molecular mechanisms of disease (if time permits)
Task: Classify graph as +1 or -1
Human lymphocyte kinase
Task: Classify node (or edge) as +1 or -1
Y394 of human lymphocyte kinase
Depth-3 graph neighborhood for Y394
[Figure: labeled training data (+1 examples shown)]
Objective: Predict class label for each unlabeled node
[Figure: neighborhood graph with vertex labels A and B]
Research Question: How to measure similarity between rooted neighborhoods?
[Figure: each data point (vertex neighborhood) is represented as a vector of counts; an SVM is trained on the training data and applied to the test data]
Decision rule: if Pr(+1|v) > 0.5 then t(v) = +1, else t(v) = -1
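The decision rule can be sketched in a few lines of Python. The slides state only the 0.5 threshold; the sigmoid that turns a raw classifier score into Pr(+1|v), the function names `posterior_from_score` and `classify_vertex`, the parameters `a` and `b`, and the example scores are all illustrative assumptions.

```python
import math

def posterior_from_score(score, a=-1.0, b=0.0):
    """Map a raw classifier score to Pr(+1|v) with a sigmoid.

    The sigmoid and its parameters a, b are assumptions for illustration
    (in the style of Platt scaling); they are not taken from the slides.
    """
    return 1.0 / (1.0 + math.exp(a * score + b))

def classify_vertex(prob_pos, threshold=0.5):
    """Decision rule: t(v) = +1 if Pr(+1|v) > threshold, else -1."""
    return 1 if prob_pos > threshold else -1

# Hypothetical classifier scores for three unlabeled vertices.
scores = {"v1": 2.0, "v2": -0.3, "v3": 0.0}
labels = {v: classify_vertex(posterior_from_score(s)) for v, s in scores.items()}
```

Because the comparison is strict, a vertex with Pr(+1|v) exactly 0.5 (here `v3`) is assigned -1.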
Image from Hido, S. & Kashima, H. ICDM 2009.
– Kondor & Lafferty (2002)
– walks and paths
– subtree patterns
– subgraphs
How about other factors?
Count non-isomorphic labeled n-graphlets
Undirected: Vacic, V. et al. J Computational Biology 17(1): 55 (2010).
Directed: Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).
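Counting labeled graphlets around a root vertex can be sketched as follows for undirected graphs and graphlets of up to 3 vertices. This is a minimal illustration, not the authors' implementation: the function name, the shape names (`vertex`, `edge`, `path_end`, `path_mid`, `triangle`) and the toy graph are hypothetical.

```python
from collections import Counter
from itertools import combinations

def rooted_graphlet_counts(adj, labels, root):
    """Count labeled rooted graphlets with up to 3 vertices around `root`.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    labels: dict mapping each vertex to a one-character label.
    Keys are (shape, labeling); the labeling starts with the root label.
    """
    counts = Counter()
    r = labels[root]
    counts[("vertex", r)] += 1
    nbrs = adj[root]
    for a in nbrs:
        # 2-graphlet: the edge root-a.
        counts[("edge", r + labels[a])] += 1
        # Path with the root at one end: root-a-b, with b not adjacent to root.
        for b in adj[a] - nbrs - {root}:
            counts[("path_end", r + labels[a] + labels[b])] += 1
    for a, b in combinations(sorted(nbrs), 2):
        # The two non-root positions are symmetric, so their labels are sorted.
        pair = "".join(sorted(labels[a] + labels[b]))
        shape = "triangle" if b in adj[a] else "path_mid"
        counts[(shape, r + pair)] += 1
    return counts

# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
labels = {0: "A", 1: "A", 2: "B", 3: "A"}
phi = rooted_graphlet_counts(adj, labels, root=0)
```

Sorting the labels of symmetric positions is what makes labelings in the same symmetry class map to a single key.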
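[Figure: counting labeled 3-graphlets over the vertex label alphabet {A, B}; labelings in the same symmetry class are counted together]
Labeled graphlets for the base graphlets 3_1, 3_2, 3_3:
3_1: AAA AAB ABA ABB BAA BAB BBA BBB
3_2: AAA AAB ABB BAA BAB BBB
3_3: AAA AAB ABB BAA BAB BBB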
Graphlet kernel, N=4
Number of base graphlets (|Σ| = 1) and labeled graphlets (|Σ| = 20):

       Undirected                  Directed
n      |Σ| = 1    |Σ| = 20         |Σ| = 1    |Σ| = 20
1      1          20               1          20
2      1          400              3          1,200
3      3          16,400           30         217,200
4      11         1,045,600        697        102,673,600
5      58         100,168,400      44,907     137,252,234,400
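The undirected n = 3 entries can be checked by brute force. The sketch below enumerates labelings of the three rooted undirected 3-graphlets (a path rooted at an end, a path rooted at the middle, and a triangle) and merges labelings that differ only by swapping the two symmetric non-root vertices. The function and shape names are hypothetical.

```python
from itertools import product

def count_labeled_3_graphlets(alphabet_size):
    """Count distinct labelings of the three rooted undirected 3-graphlets.

    Labels are integers 0..alphabet_size-1. In the path rooted at an end,
    all three positions are distinguishable; in the path rooted at the
    middle and in the triangle, the two non-root vertices are swappable.
    """
    sigma = range(alphabet_size)
    path_end = set(product(sigma, repeat=3))
    # Canonical form for the symmetric shapes: sort the two non-root labels.
    symmetric = {(r,) + tuple(sorted((a, b)))
                 for r, a, b in product(sigma, repeat=3)}
    return {
        "path_end": len(path_end),
        "path_mid": len(symmetric),
        "triangle": len(symmetric),
    }

binary = count_labeled_3_graphlets(2)
amino = count_labeled_3_graphlets(20)
```

For a two-letter alphabet this gives 8, 6 and 6 labelings, and for |Σ| = 20 the three counts sum to 16,400, matching the table.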
Goal: Design robust kernels in the presence of noisy and incomplete data
– Generalize the concept of counting graphlets
– Incorporate flexibility in counting via edit distance
Definition (Graph Edit Distance): Given two vertex- and/or edge-labeled graphs G and H, the edit distance between them is the minimum number of edit operations necessary to transform G into H.
– Edit operations are insertions or deletions of vertices or edges, or in the case of labeled graphs, substitutions of vertex and edge labels
– A sequence of edit operations that transforms G into H is referred to as an edit path
– The edit distance between G and H is the length of the shortest edit path between them
Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).
Incorporate flexibility in counting via edit distance
– Vertex label substitutions
– Edge insertions or deletions
[Figure: step-by-step updates of the count vectors over the labeled 3-graphlets of base graphlets 3_1, 3_2, 3_3 (alphabet {A, B}). Panels: 1-label substitution; 1-edge indel]
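One plausible reading of counting with 1-label substitution is that each observed labeled graphlet also contributes to every labeling within m substitutions of it. The sketch below implements that reading for a single asymmetric base graphlet, where the distance is a plain Hamming distance. The equal weighting of exact and inexact matches, the function names and the example counts are assumptions, not values taken from the slides.

```python
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(s, t))

def substitution_counts(counts, alphabet, m=1):
    """Spread each observed count to all labelings within m substitutions.

    counts: dict mapping label strings (one asymmetric base graphlet) to counts.
    Returns a dict over every labeling of that length with a nonzero total.
    """
    n = len(next(iter(counts)))
    expanded = {}
    for target in map("".join, product(alphabet, repeat=n)):
        total = sum(c for s, c in counts.items() if hamming(s, target) <= m)
        if total:
            expanded[target] = total
    return expanded

# Hypothetical exact-match counts for one root vertex.
observed = {"AAA": 2, "AAB": 1}
expanded = substitution_counts(observed, "AB", m=1)
```

Setting `m=0` recovers the exact-match counts, so the standard count vector is a special case of this sketch.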
Edit distance graphlet kernel
Normalized edit distance kernel
Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).
# of edit distance operations
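A kernel over count vectors can be sketched as follows, assuming the kernel value is the inner product of two vertices' count vectors and that normalization is the usual cosine form k(u, v) / sqrt(k(u, u) k(v, v)). Both choices are assumptions made for illustration and may differ from the exact formulas in the cited paper; the function names and example vectors are hypothetical.

```python
import math

def graphlet_kernel(phi_u, phi_v):
    """Inner product of two sparse count vectors stored as dicts."""
    if len(phi_u) > len(phi_v):
        # Iterate over the shorter vector; the product is symmetric.
        phi_u, phi_v = phi_v, phi_u
    return sum(c * phi_v.get(key, 0) for key, c in phi_u.items())

def normalized_kernel(phi_u, phi_v):
    """Cosine-style normalization: k(u, v) / sqrt(k(u, u) * k(v, v))."""
    denom = math.sqrt(graphlet_kernel(phi_u, phi_u) * graphlet_kernel(phi_v, phi_v))
    return graphlet_kernel(phi_u, phi_v) / denom if denom else 0.0

# Hypothetical count vectors for two vertices.
phi_u = {"AAA": 2, "AAB": 1}
phi_v = {"AAA": 1, "ABB": 3}
k_uv = graphlet_kernel(phi_u, phi_v)
k_norm = normalized_kernel(phi_u, phi_v)
```

Storing the vectors as dicts keeps the computation proportional to the number of graphlets actually observed, which matters given how quickly the number of labeled graphlets grows with n and |Σ|.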
Joint work with Vikas Pejaver, Matthew Mort, David N. Cooper, Sean D. Mooney and Predrag Radivojac
Xin & Radivojac. Curr Prot Pept Sci 12: 456 (2011).
Copper (Cu) Iron (Fe)
– e.g. HGMD, dbSNP
– e.g. MutPred, SIFT, PolyPhen, SNPs3D, SNAP
Pauling, L. et al. Science (1949) 110: 543-548; Chui, D.H. and Dover, G.J. Curr Opin Pediatr (2001) 13: 22-27.
Sickle Cell Disease: E6V
[Figure: hemoglobin structures, PDB entries 4hhb and 2hbs]
http://gingi.uchicago.edu/hbs2.html
– In particular, molecular function alterations
…LAGDKMGMGQSCVGALFNDVQ…
i = 45: phosphorylation site (s); j = 46: variant position (C46W)
Radivojac et al. Bioinformatics 24: i241 (2008). Image from Capriotti and Altman. BMC Bioinformatics 12 (Suppl4): S3 (2011).
[Plot: density versus probability of loss of property, with the fpr cutoff marked]
Data set   Total # of AAS   # of AAS mapped to PDB   # of genes   # of PDB entries   # of chains
Neutral    282,625          8,049                    2,095        3,047              3,500
Disease    52,406           10,629                   583          1,177              1,387
– 3,356 amino acid substitutions mapped to PDB (880 distinct proteins)
Joyce, P.I. et al. Hum. Mol. Genet. (2014); Seetharaman, S.V. et al. Arch. Biochem. Biophys. (2010); Krishnan, U. et al. Mol Cell Biochem (2006)
D83G Amyotrophic lateral sclerosis (ALS)
– Available at http://sourceforge.net/projects/graphletkernels/