

SLIDE 1

GRAPHLET KERNELS FOR VERTEX CLASSIFICATION

Presenter: José Lugo-Martínez
PhD Candidate
jlugomar@indiana.edu
Jun 12, 2015

SLIDE 2

OUTLINE

  • Overview of classification problems on graphs
  • Graphlet kernels for vertex classification
  • Case study: Structure-based functional residue prediction
      – Inferring molecular mechanisms of disease (if time permits)

SLIDE 3

CLASSIFICATION PROBLEMS ON GRAPHS

  • Graph Classification
  • Vertex Classification
  • Edge Classification
  • Link Prediction

Graph Classification

Task: Classify graph as +1 or -1

[Figure: Human lymphocyte kinase]

SLIDE 4

CLASSIFICATION PROBLEMS ON GRAPHS

Vertex (or Edge) Classification

Task: Classify node (or edge) as +1 or -1

[Figure: Y394 of human lymphocyte kinase; depth-3 graph neighborhood for Y394]

SLIDE 5

SEMI-SUPERVISED LEARNING SCENARIO

Training Data: nodes labeled +1 or -1

Objective: Predict class label for each unlabeled node

[Figure: neighborhood graph with vertex labels A and B]

Research Question: How to measure similarity between rooted neighborhoods?

SLIDE 6

PROBLEM STATEMENT

Task: Design meaningful similarity measures between vertex neighborhoods

Given two neighborhood graphs $N(u)$ and $N(v)$ from a space of graphs $\mathcal{G}$, the problem of rooted neighborhood comparison is to find a mapping $k: \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ such that $k(N(u), N(v))$ quantifies the similarity of $N(u)$ and $N(v)$.

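SLIDE 7

GRAPH KERNELS

  • Define kernel functions on pairs of graphs $G$ and $G'$
  • $k(G, G') = \langle \phi(G), \phi(G') \rangle$: measure of similarity between $G$ and $G'$, where $\phi(G)$ is a vector of counts
  • Kernel matrix $K$ such that $K_{ij} = k(G_i, G_j)$ for $i, j = 1, \dots, n$, where $n$ is the # of data points
  • Properties of $K$:
      I. Symmetric
      II. Positive semi-definite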
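The two properties listed above can be checked on a toy example. The following is a minimal sketch, assuming each graph is already summarized by a vector of counts and that the kernel is the plain inner product of those vectors; the count vectors and all function names are hypothetical, not taken from the talk.

```python
from itertools import product

def linear_kernel(x, y):
    """Inner product of two count vectors."""
    return sum(a * b for a, b in zip(x, y))

def kernel_matrix(vectors, kernel=linear_kernel):
    """Kernel matrix with entry (i, j) equal to kernel(vectors[i], vectors[j])."""
    return [[kernel(x, y) for y in vectors] for x in vectors]

def is_symmetric(K):
    n = len(K)
    return all(K[i][j] == K[j][i] for i in range(n) for j in range(n))

def quadratic_form(K, c):
    """c^T K c; non-negative for every c when K is positive semi-definite."""
    n = len(K)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))

# Hypothetical count vectors for three graphs (illustration only).
counts = [[2, 1, 0], [1, 1, 1], [0, 0, 3]]
K = kernel_matrix(counts)

# Spot-check the quadratic form on a small grid of coefficient vectors.
spot_check = all(quadratic_form(K, c) >= 0 for c in product([-1, 0, 1], repeat=3))
```

The grid check is only a spot check, not a proof; for an inner-product kernel the quadratic form equals the squared norm of a weighted sum of count vectors, which is why it can never be negative.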

SLIDE 8

METHODOLOGY OVERVIEW

[Figure: training data (vertices labeled +1 or -1) and test data (vertex v) are passed to an SVM]

Decision rule: if Pr(+1|v) > 0.5 then t(v) = +1, else t(v) = -1
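The decision rule on this slide can be written as a one-line function. This is a minimal sketch, assuming the classifier already supplies a calibrated estimate of Pr(+1|v); the function name and the `threshold` parameter are hypothetical.

```python
def predict_label(prob_positive, threshold=0.5):
    """Return +1 if Pr(+1|v) exceeds the threshold, otherwise -1."""
    if not 0.0 <= prob_positive <= 1.0:
        raise ValueError("prob_positive must lie in [0, 1]")
    return 1 if prob_positive > threshold else -1
```

Note that the comparison is strict, so a vertex with Pr(+1|v) exactly equal to 0.5 is assigned -1, matching the rule as stated.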

SLIDE 9

GRAPH KERNELS RESEARCH IN A NUTSHELL

  • Diffusion kernels
      – Kondor & Lafferty (2002)
  • Focus on counting graph substructures
  • Three categories based on
      – walks and paths: Kashima et al. (2003), Borgwardt & Kriegel (2005)
      – subtree patterns: Hido & Kashima (2009), Shervashidze et al. (2011)
      – subgraphs: Shervashidze et al. (2009), Vacic et al. (2010)

How about other factors?

Image from Hido, S. & Kashima, H. ICDM 2009.

SLIDE 10

Graphlet Kernels

SLIDE 11

GRAPHLET KERNEL

An n-graphlet is a small (n ≤ 5) connected rooted subgraph

Count non-isomorphic labeled n-graphlets

[Figure: 3-graphlets]

Vacic, V. et al. J Computational Biology 17(1): 55 (2010).
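Counting labeled 3-graphlets rooted at a vertex can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: graphlets are taken to be induced subgraphs, the graph is undirected and stored as a dictionary of neighbor sets, vertex labels are single characters, and the type names (`path_end`, `path_center`, `triangle`) and the toy graph are hypothetical.

```python
from collections import Counter
from itertools import combinations

def rooted_3_graphlets(adj, labels, root):
    """Count labeled 3-graphlets (induced, connected, containing root)."""
    counts = Counter()
    # Any vertex in a connected 3-vertex subgraph lies within distance 2 of root.
    near = set(adj[root])
    for w in adj[root]:
        near |= adj[w]
    near.discard(root)
    for a, b in combinations(sorted(near), 2):
        ra, rb, ab = a in adj[root], b in adj[root], b in adj[a]
        # The two non-root vertices are interchangeable in these two types,
        # so their labels are sorted to get one key per symmetry class.
        pair = "".join(sorted(labels[a] + labels[b]))
        if ra and rb and ab:
            key = ("triangle", labels[root] + pair)
        elif ra and rb:
            key = ("path_center", labels[root] + pair)
        elif ra and ab:  # root - a - b
            key = ("path_end", labels[root] + labels[a] + labels[b])
        elif rb and ab:  # root - b - a
            key = ("path_end", labels[root] + labels[b] + labels[a])
        else:
            continue  # the three vertices are not connected
        counts[key] += 1
    return counts

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
labels = {0: "A", 1: "A", 2: "B", 3: "A"}
result = rooted_3_graphlets(adj, labels, root=0)
```

Sorting the labels of interchangeable vertices is what makes the count "non-isomorphic": the labelings AAB and ABA of a path rooted at its center map to the same key.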

SLIDE 12

BASE GRAPHLETS

[Figure: undirected and directed base graphlets]

Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).

SLIDE 13

LABELED GRAPHLETS

[Figure: labeled graphlets; annotations: vertex labels alphabet, same symmetry class]

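SLIDE 14

GRAPHLET KERNEL EXAMPLE

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and their count vectors over labeled 3-graphlets; count rows as shown: 2 1 and 1 1 1]

3-graphlets and their labelings over the alphabet {A, B}:

  • 3_1: AAA AAB ABA ABB BAA BAB BBA BBB
  • 3_2: AAA AAB ABB BAA BAB BBB
  • 3_3: AAA AAB ABB BAA BAB BBB

Graphlet kernel, N=4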

SLIDE 15

HOW MANY LABELED GRAPHLETS?

        Undirected                  Directed
n       |Σ| = 1    |Σ| = 20         |Σ| = 1    |Σ| = 20
1       1          20               1          20
2       1          400              3          1,200
3       3          16,400           30         217,200
4       11         1,045,600        697        102,673,600
5       58         100,168,400      44,907     137,252,234,400

|Σ| = 1: base graphlets; |Σ| = 20: labeled graphlets
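The undirected entries for n ≤ 3 can be reproduced with a short counting argument. The sketch below assumes that each base graphlet is described only by its groups of mutually interchangeable non-root vertices, which is enough for n ≤ 3; larger graphlets need a full treatment of automorphisms and are not covered. The table `UNDIRECTED_BASE` and the function names are hypothetical.

```python
from math import comb

def labelings(alphabet_size, symmetric_group_sizes, n_vertices):
    """Number of distinct labelings of one rooted base graphlet.

    symmetric_group_sizes lists the sizes of groups of interchangeable
    non-root vertices; every other vertex is labeled independently.
    """
    free = n_vertices - sum(symmetric_group_sizes)
    total = alphabet_size ** free
    for k in symmetric_group_sizes:
        # Labels on k interchangeable vertices form a multiset of size k.
        total *= comb(alphabet_size + k - 1, k)
    return total

# Undirected base graphlets for n <= 3, given by their symmetric groups.
UNDIRECTED_BASE = {
    1: [[]],            # the root alone
    2: [[]],            # root joined to one neighbor
    3: [[], [2], [2]],  # path rooted at an end, path rooted at center, triangle
}

def count_labeled(n, alphabet_size):
    return sum(labelings(alphabet_size, g, n) for g in UNDIRECTED_BASE[n])
```

With an alphabet of size 20, `count_labeled(3, 20)` gives 8,000 + 4,200 + 4,200 = 16,400, and with an alphabet of size 2 it gives 8 + 6 + 6 = 20 labelings.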

SLIDE 16

LIMITATIONS OF GRAPHLET KERNEL

  • Exact matches less likely as alphabet size increases
  • Can’t handle misannotated labels or missing edges
      – e.g. protein 3D structures can be noisy and incomplete
  • Ineffective for evolving graph neighborhoods
      – e.g. closely related protein structures

Goal: Design robust kernels in the presence of noisy and incomplete data

SLIDE 17

EDIT DISTANCE GRAPHLET KERNELS

Generalize the concept of counting graphlets

Incorporate flexibility in counting via edit distance

Definition (Graph Edit Distance): Given two vertex- and/or edge-labeled graphs G and H, the edit distance between them is the minimum number of edit operations necessary to transform G into H.

  • Allowed edit operations include insertion or deletion of vertices and edges or, in the case of labeled graphs, substitutions of vertex and edge labels
  • Any sequence of edit operations that transforms G into H is referred to as an edit path
  • Thus, the graph edit distance between G and H corresponds to the length of the shortest edit path between them

Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).
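For two labelings of the same base graphlet, the edit distance restricted to vertex label substitutions reduces to a Hamming distance minimized over symmetric positions. The sketch below covers only that restricted case (no vertex or edge insertions or deletions), and assumes the positions listed in `symmetric_positions` are fully interchangeable, as they are for 3-graphlets; the function name and parameters are hypothetical.

```python
from itertools import permutations

def substitution_distance(g, h, symmetric_positions=()):
    """Minimum number of label substitutions turning labeling g into h.

    g and h are label strings over the same base graphlet; positions in
    symmetric_positions may be permuted freely before comparing.
    """
    if len(g) != len(h):
        raise ValueError("labelings must have the same number of vertices")
    fixed = [i for i in range(len(g)) if i not in symmetric_positions]
    best = len(g)
    for perm in permutations(symmetric_positions):
        mapping = dict(zip(symmetric_positions, perm))
        d = sum(g[i] != h[i] for i in fixed)
        d += sum(g[i] != h[mapping[i]] for i in symmetric_positions)
        best = min(best, d)
    return best
```

For example, AAB and ABA are two substitutions apart on a path rooted at an end, but zero apart when positions 1 and 2 are interchangeable, as on a path rooted at its center.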

SLIDE 18

EDIT DISTANCE OPERATIONS

Incorporate flexibility in counting via edit distance

  • Vertex label substitutions
  • Edge insertions or deletions

[Figure: labeled graphlets (labels A and B) illustrating each operation; annotation: symmetric]

SLIDE 19

EXAMPLE REVISITED

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); graphlets A A A, B A A and A B A shown under "1-label substitution"; count rows as shown: 2 1 and 1 1 1]

SLIDE 20

LABEL SUBSTITUTION KERNEL

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); graphlets A A A, B A A and A B A shown under "1-label substitution"; count rows as shown: 2 1 1 1 and 1 1 1]

SLIDE 21

LABEL SUBSTITUTION KERNEL

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); graphlets A A A, B A A and A B A shown under "1-label substitution"; count rows as shown: 2 2 2 2 1 1 1 and 2 2 1 1 1 1 1 1 1]
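One way to read the label substitution example is that each observed labeling also contributes its count to every labeling one substitution away. The sketch below implements that reading only; it is an assumption, not the published definition. Labelings are plain strings with no symmetry handling, every position (including the root) may be substituted, neighbors receive the full count with no weighting, and both function names and the count vectors are hypothetical.

```python
from collections import Counter

def expand_by_substitution(counts, alphabet):
    """Add each labeling's count to all labelings one substitution away."""
    expanded = Counter(counts)
    for labeling, c in counts.items():
        for i, old in enumerate(labeling):
            for new in alphabet:
                if new != old:
                    neighbor = labeling[:i] + new + labeling[i + 1:]
                    expanded[neighbor] += c
    return expanded

def count_kernel(cu, cv):
    """Inner product of two sparse count vectors."""
    return sum(cu[g] * cv[g] for g in cu.keys() & cv.keys())

# Hypothetical count vectors for two vertices u and v.
u = Counter({"AAA": 2})
v = Counter({"AAB": 1})

exact = count_kernel(u, v)
relaxed = count_kernel(expand_by_substitution(u, "AB"),
                       expand_by_substitution(v, "AB"))
```

Here the exact-match kernel is 0 because u and v share no labeling, while the relaxed kernel is 4: after expansion both vectors are nonzero on AAA and AAB.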

SLIDE 22

EDGE INDELS KERNEL

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); three graphlets labeled A A A shown under "1-edge indel"; count rows as shown: 2 1 and 1 1 1]

SLIDE 23

EDGE INDELS KERNEL

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); three graphlets labeled A A A shown under "1-edge indel"; count rows as shown: 3 1 1 and 1 1 1]

SLIDE 24

EDGE INDELS KERNEL

[Figure: vertices u and v, with neighborhoods labeled A A A A A and A A A A B, and count vectors over the labeled 3-graphlets 3_1, 3_2, 3_3 (alphabet {A, B}); three graphlets labeled A A A shown under "1-edge indel"; count rows as shown: 3 1 1 and 2 1 1 2 1]

SLIDE 25

EDIT DISTANCE KERNELS

  • Edit distance graphlet kernel
  • Normalized edit distance kernel

[Figure: kernel definitions; annotation: # of edit distance operations]

Lugo-Martinez J. and Radivojac P. Network Science, 2(2), 254-276 (2014).
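The equations on this slide did not survive extraction. A common way to normalize a kernel, assumed here rather than taken from the slide, is to divide each entry by the geometric mean of the two self-similarities, k(u, v) / sqrt(k(u, u) k(v, v)). A minimal sketch with a hypothetical function name:

```python
import math

def normalize_kernel_matrix(K):
    """Cosine-style normalization: K[i][j] / sqrt(K[i][i] * K[j][j])."""
    n = len(K)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denom = math.sqrt(K[i][i] * K[j][j])
            # A zero self-similarity means an empty count vector; leave 0.0.
            out[i][j] = K[i][j] / denom if denom > 0 else 0.0
    return out
```

After this normalization every vertex with a nonzero count vector has self-similarity 1, so vertices with large neighborhoods no longer dominate the kernel matrix.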

SLIDE 26

Case Study: Structure-based functional residue prediction

Joint work with Vikas Pejaver, Matthew Mort, David N. Cooper, Sean D. Mooney and Predrag Radivojac

SLIDE 27

PREDICTION OF FUNCTIONAL SITES FROM PROTEIN STRUCTURES

Xin & Radivojac. Curr Prot Pept Sci 12: 456 (2011).

SLIDE 28

RESULTS: METAL BINDING RESIDUES

[Figure: results for copper (Cu) and iron (Fe)]

SLIDE 29

MULTIPLE FUNCTIONAL RESIDUE PREDICTORS

AUC measured via per-chain 10-fold cross-validation
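Per-chain cross-validation keeps all residues of a protein chain in the same fold, so that residues from one chain never appear in both training and test data. The sketch below shows one simple way to do this and a rank-based AUC; the round-robin fold assignment, the chain identifiers, and the function names are assumptions for illustration, not the procedure used in the study.

```python
def per_chain_folds(chain_ids, n_folds=10):
    """Assign a fold index to each residue so that chains are never split."""
    chains = sorted(set(chain_ids))
    fold_of_chain = {c: i % n_folds for i, c in enumerate(chains)}
    return [fold_of_chain[c] for c in chain_ids]

def auc(scores, labels):
    """Probability that a random +1 example outscores a random -1 example."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == -1]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical chain identifiers, one per residue.
residue_chains = ["1abcA", "1abcA", "2xyzB", "1abcA", "3defC"]
folds = per_chain_folds(residue_chains, n_folds=2)
```

In practice the chains would usually be shuffled before assignment; the deterministic round-robin order is used here only to keep the example reproducible.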

SLIDE 30

QUICK DIGRESSION

  • Unprecedented growth of human genetic variant data
      – e.g. HGMD, dbSNP
  • In particular, amino acid substitutions (AAS)
  • Focus on tools that predict effects of AAS (deleterious vs neutral)
      – e.g. MutPred, SIFT, PolyPhen, SNPs3D, SNAP

SLIDE 31

MOTIVATION: MOLECULAR MECHANISMS OF DISEASE

Sickle Cell Disease

  • Autosomal recessive disorder
  • E6V in HBB causes interaction w/ F85 and L88
  • Formation of amyloid fibrils
  • Abnormally shaped red blood cells, leading to sickle cell anemia
  • Manifestation of disease varies widely across patients

[Figure: PDB structures 4hhb and 2hbs, with the E6V substitution; http://gingi.uchicago.edu/hbs2.html]

Pauling, L. et al. Science (1949) 110: 543-548; Chui, D.H. and Dover, G.J. Curr Opin Pediatr (2001) 13: 22-27.

SLIDE 32

INFERRING MOLECULAR MECHANISMS OF DISEASE

  • Most of these tools do not predict biochemical cause of disease
      – In particular, molecular function alterations
  • Lack of comprehensive studies using protein 3D structure data

Goal: Exploit the structural environment of a residue of interest to hypothesize specific molecular effects of AAS and to statistically attribute these effects to genetic disease

Idea:

  • Develop methods to predict specific function
      – e.g. zinc-binding site or phosphorylation site
  • Apply to amino acid substitution data
  • Provide probabilistic estimates of molecular mechanisms of disease

SLIDE 33

APPROACH

Consider:

  • phosphorylation in structure s occurs at position i
  • residue x is mutated to y, at position j (xjy)

Loss of phosphorylation: [equation]

Gain of phosphorylation: [equation]

[Figure: sequence …LAGDKMGMGQSCVGALFNDVQ… of structure s, with phosphorylation site at i = 45 and variant position j = 46 (C46W)]

Radivojac et al. Bioinformatics 24: i241 (2008). Image from Capriotti and Altman. BMC Bioinformatics 12 (Suppl4): S3 (2011).
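The loss and gain equations were not recovered from this slide. One common formulation, assumed here and not confirmed by the slide, scores loss as the probability that the site is functional in the wild type and not functional in the mutant, and gain as the reverse, treating the two predictions as independent. The function and parameter names are hypothetical.

```python
def loss_and_gain(p_wild_type, p_mutant):
    """Return (P(loss), P(gain)) of a functional site at position i.

    p_wild_type: predicted probability of the site in the wild-type structure.
    p_mutant:    predicted probability of the site after the substitution.
    Assumes the two predictions can be combined as independent events.
    """
    for p in (p_wild_type, p_mutant):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    p_loss = p_wild_type * (1.0 - p_mutant)
    p_gain = (1.0 - p_wild_type) * p_mutant
    return p_loss, p_gain
```

For instance, a site predicted with probability 0.9 in the wild type and 0.2 in the mutant gives a loss score of 0.72 and a gain score of 0.02.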

SLIDE 34

IDENTIFYING ACTIVE MECHANISMS OF DISEASE

[Figure: density of the probability of loss of property, with an fpr cutoff]

Data set   Total # of AAS   # of AAS mapped to PDB   # of genes   # of PDB entries   # of chains
Neutral    282,625          8,049                    2,095        3,047              3,500
Disease    52,406           10,629                   583          1,177              1,387
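The "fpr cutoff" in the figure suggests that a score threshold is chosen from the neutral variants so that only a fixed fraction of them exceeds it. The sketch below implements that reading as an assumption; the scores, the function name, and the choice of a strict "greater than" rule are hypothetical.

```python
import math

def threshold_at_fpr(neutral_scores, fpr):
    """Threshold t such that at most a fraction fpr of neutral scores exceed t."""
    if not neutral_scores:
        raise ValueError("need at least one neutral score")
    ordered = sorted(neutral_scores, reverse=True)
    # Number of neutral scores allowed strictly above the threshold.
    allowed = math.floor(fpr * len(ordered) + 1e-9)
    return ordered[allowed] if allowed < len(ordered) else float("-inf")

# Hypothetical loss-of-property scores for neutral variants.
neutral = [0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9]
cutoff = threshold_at_fpr(neutral, fpr=0.25)
n_above = sum(s > cutoff for s in neutral)
```

Disease variants scoring above the cutoff would then be called as losing the property, with the neutral set bounding the false positive rate.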

SLIDE 35

LOSS AND GAIN OF FUNCTIONAL SITES IS AN ACTIVE MECHANISM OF DISEASE

SLIDE 36

VALIDATION OF LOSS OF FUNCTION PREDICTIONS

  • Mutagenesis experimental data (UniProt)
      – 3,356 amino acid substitutions mapped to PDB (880 distinct proteins)
  • Feasibility of computationally predicting loss of functional sites

SLIDE 37

LOSS OF ZINC BINDING CAUSES DISEASE

D83G: Amyotrophic lateral sclerosis (ALS)

  • D83G in superoxide dismutase (SOD1) causes:
      – Loss of zinc-binding that destabilizes native structure
      – Protein aggregation that forms amyloid-like fibrils

Joyce, P.I. et al. Hum. Mol. Genet. (2014); Seetharaman, S.V. et al. Arch. Biochem. Biophys. (2010); Krishnan, U. et al. Mol Cell Biochem (2006)

SLIDE 38

SUMMARY

  • Multiple graphlet kernels for vertex classification
      – Available at http://sourceforge.net/projects/graphletkernels/
  • Successfully applied to prediction of many types of functional residues
  • Useful for predicting impact of mutations and understanding molecular mechanisms of disease
  • Implications for the ways we study disease
  • Implications for precision medicine

SLIDE 39

ACKNOWLEDGEMENTS

Indiana University

  • Predrag Radivojac
  • Vikas Pejaver
  • Radivojac’s group
  • Esfan Haghverdi

University of Washington

  • Sean Mooney’s group

Cardiff University

  • David Cooper’s group

Thank you!