Applications of Machine Learning in Computational Biology - Narges Razavian - PowerPoint PPT Presentation



slide-1
SLIDE 1

Applications of Machine Learning in Computational Biology

Narges Razavian New York University

Slides thanks to: James Galagan @ Broad Institute, Su-In Lee @ Univ. of Washington, Rainer Breitling @ Univ. of Glasgow, Christopher M. Bishop @ ECCV 2004

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4

Central Dogma of Biology

slide-5
SLIDE 5

Examples of Challenges involved

Slide Credit: Manolis Kellis

slide-6
SLIDE 6

Application: Decoding Sequences and Motif Discovery

slide-7
SLIDE 7

Motif Discovery

GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

slide-8
SLIDE 8

GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

Sequence Annotation

Gene

slide-9
SLIDE 9

GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG

Sequence Annotation

Gene

Promoter Motif

slide-10
SLIDE 10

A Generative Model

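States: Background, Island

Transition probabilities (diagram labels): 0.15, 0.25, 0.75, 0.85

Emission probabilities:

Background: A: 0.25 T: 0.25 G: 0.25 C: 0.25
Island: A: 0.15 T: 0.13 G: 0.30 C: 0.42

Example sequences:

TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA
GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC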

slide-11
SLIDE 11

A Generative Model (cont.)

[Figure: trellis of hidden labels P and B above the observed sequence]

S: C A A A T G C G

Emission probabilities:

P(S|P): A: 0.42 T: 0.30 G: 0.13 C: 0.15
P(S|B): A: 0.25 T: 0.25 G: 0.25 C: 0.25

Transition probabilities P(Li+1|Li):

      Bi+1   Pi+1
Bi    0.85   0.15
Pi    0.25   0.75
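
The generative reading of this model can be sketched in a few lines of Python: walk the hidden state chain using the transition table, and emit one base per step from the current state's emission distribution. This is only an illustrative sketch. The probabilities are the Slide 11 values; the start state and the names `draw` and `sample_hmm` are assumptions that do not come from the slides.

```python
import random

# Parameters copied from Slide 11. B = background; P = the second state,
# labelled P on the slide.
TRANS = {"B": {"B": 0.85, "P": 0.15},
         "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}


def draw(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def sample_hmm(length, rng, start="B"):
    """Return (labels, sequence) generated by the two-state model."""
    labels, seq = [], []
    state = start  # assumption: the chain starts in the background state
    for _ in range(length):
        labels.append(state)
        seq.append(draw(EMIT[state], rng))   # emit a base from the current state
        state = draw(TRANS[state], rng)      # then move to the next state
    return "".join(labels), "".join(seq)


labels, seq = sample_hmm(40, random.Random(0))
print(labels)
print(seq)
```

Printing the label string next to the sequence shows runs of P and B states of varying length, which is the structure that decoding later tries to recover from the bases alone.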

slide-12
SLIDE 12

Fundamental HMM Operations

Decoding

  • Given an HMM and sequence S
  • Find a corresponding sequence of labels, L
  • Computational biology: Annotate pathogenicity islands on a new sequence

Evaluation

  • Given an HMM and sequence S
  • Find P(S|HMM)
  • Computational biology: Score a particular sequence (not as useful for this model; will come back to this later)

Training

  • Given an HMM w/o parameters and a set of sequences S
  • Find transition and emission probabilities that maximize P(S | params, HMM)
  • Computational biology: Learn a model for sequences composed of background DNA and pathogenicity islands
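
Decoding and evaluation can be sketched for the two-state model as follows. Decoding is done here with the Viterbi algorithm and evaluation with the forward algorithm, the standard dynamic programs for these two operations. The transition and emission values are the ones from Slide 11; the uniform start distribution and the function names are assumptions made for this sketch.

```python
import math

STATES = ("B", "P")  # state labels as used on Slide 11
START = {"B": 0.5, "P": 0.5}  # assumption: uniform start, not given on the slides
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
EMIT = {"B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
        "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15}}


def viterbi(seq):
    """Decoding: most probable label path for seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for sym in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            ptr[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][sym])
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


def forward(seq):
    """Evaluation: P(S | HMM), summing over all label paths."""
    alpha = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    for sym in seq[1:]:
        alpha = {s: EMIT[s][sym] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


seq = "CAAATGCG"  # the example sequence S from Slide 11
print(viterbi(seq))
print(forward(seq))
```

Both routines cost time proportional to the sequence length times the square of the number of states, instead of enumerating every label path. For long sequences the forward pass would also need log-space arithmetic or rescaling to avoid underflow; that is left out to keep the sketch short.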

slide-13
SLIDE 13

Application: Modeling Protein Families

slide-14
SLIDE 14

Modeling Protein Families

  • Given amino acid sequences from a protein family, how can we find other members?

– Can search databases with each known member (not sensitive)
– More information is contained in full set

  • The HMM Profile Approach

– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
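
The learn, model, and score steps can be illustrated with a heavily simplified stand-in for a profile HMM. The sketch below keeps only per-column emission probabilities (what would be the match states) and has no insert or delete states, so it is not a full profile HMM and cannot handle gaps. The training strings are the first ten gap-free alignment columns of six sequences from the Slide 15 alignment; the pseudocount, the uniform background, and the function names are assumptions.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# First ten alignment columns of six family members from the Slide 15 alignment
# (these columns contain no gaps).
FAMILY = ["FPTDYPFKPP", "FTPEYPFKPP", "LPEEYPMAAP",
          "FPSGYPYNAP", "FPIDYPYSPP", "FKDDYPSSPP"]


def learn_profile(aligned, pseudocount=1.0):
    """Learn per-column emission probabilities, smoothed with a pseudocount."""
    profile = []
    for col in zip(*aligned):
        total = len(col) + pseudocount * len(AMINO)
        profile.append({a: (col.count(a) + pseudocount) / total for a in AMINO})
    return profile


def log_odds(seq, profile):
    """Score seq against the profile relative to a uniform background."""
    background = 1.0 / len(AMINO)  # assumption: uniform background frequencies
    return sum(math.log(col[a] / background) for a, col in zip(seq, profile))


profile = learn_profile(FAMILY)
print(log_odds("FPAEYPFKPP", profile))  # start of UBE2L1, held out from training
print(log_odds("GGGGGGGGGG", profile))  # unrelated sequence
```

A held-out family member scores well above zero while the unrelated sequence scores below zero. That separation is what makes searching with a family-level model more sensitive than searching with any single member.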

slide-15
SLIDE 15

UBE2D2   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3   FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697 FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1   FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1   FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9    FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N    LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016 IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10   FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34    FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156 FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1   FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B    FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I    FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5   LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1   FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6   FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H    LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12    VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS

Human Ubiquitin Conjugating Enzymes

slide-16
SLIDE 16

Profile HMM

[Figure: profile HMM with Start and End states, match states M1 ... Mj ... MN, insert states I, I1 ... Ij ... IN, and delete states D1 ... Dj ... DN. Emission distributions are shown over the amino acid alphabet A C D E F G H I K L M N O P Q R S T V W Y. The model is illustrated with the aligned rows for E2EPF5, UBE2L1, UBE2L6 and UBE2H from the alignment on Slide 15.]

slide-17
SLIDE 17

Using Profile HMMs

Decoding

  • Find sequence of labels, L, that maximizes P(L|S, HMM)
  • Computational biology: Align a new sequence to a protein family

Evaluation

  • Find P(S|HMM)
  • Computational biology: Score a sequence for membership in family

Training

  • Find transition and emission probabilities that maximize P(S | params, HMM)
  • Computational biology: Discover and model family structure

slide-18
SLIDE 18

Application: Modeling Protein Dynamics

slide-19
SLIDE 19

Background

  • Proteins: Molecular machines, composed of a sequence of amino acid subunits

slide-20
SLIDE 20

Background:

  • Protein functional analysis pipeline

Crystallize to Get X-Ray Snapshot -> Molecular Dynamics Simulations -> Learn Probabilistic Model -> Analyze and Predict

Image: H. Khanlou et al., “Durable Efficacy and Continued Safety of Ibalizumab in Treatment-Experienced Patients”, Infectious Diseases Society of America (IDSA), October 2011

slide-21
SLIDE 21

Modeling Protein Tertiary Structure

slide-22
SLIDE 22

10 second Reminder! Probability Theory

  • Sum rule
  • Product rule
  • From these we have Bayes’ theorem

– with normalization
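
For reference, the rules named on this slide can be written out as below. The slide's own equations did not survive extraction, so this is the standard textbook form for discrete variables X and Y, not a copy of the original notation.

```latex
\begin{align}
p(X) &= \sum_{Y} p(X, Y) && \text{(sum rule)} \\
p(X, Y) &= p(Y \mid X)\, p(X) && \text{(product rule)} \\
p(Y \mid X) &= \frac{p(X \mid Y)\, p(Y)}{p(X)} && \text{(Bayes' theorem)} \\
p(X) &= \sum_{Y} p(X \mid Y)\, p(Y) && \text{(normalization)}
\end{align}
```

The last line follows from the first two and is the denominator that makes the posterior in Bayes' theorem sum to one.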

slide-23
SLIDE 23

10 second Reminder (cont.)! Decomposition

  • Consider an arbitrary joint distribution
  • By successive application of the product rule
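
Written out, the decomposition reads as follows. The variable names x_1, ..., x_K are an assumption, since the slide's equations were lost in extraction; the identity itself is just the product rule applied repeatedly.

```latex
\begin{equation}
p(x_1, \ldots, x_K) = p(x_K \mid x_1, \ldots, x_{K-1}) \cdots p(x_2 \mid x_1)\, p(x_1)
\end{equation}
```

This holds for any joint distribution and any ordering of the variables. Graphical models become useful when some of the conditioning variables can be dropped from individual factors.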
slide-24
SLIDE 24

Directed Acyclic Graphs

  • Joint distribution $p(x_1, \ldots, x_D) = \prod_{i=1}^{D} p(x_i \mid \mathrm{pa}_i)$

where $\mathrm{pa}_i$ denotes the parents of i

No directed cycles

slide-25
SLIDE 25

Undirected Graphs

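  • Provided $p(\mathbf{x}) > 0$, the joint distribution is a product of non-negative functions over the cliques of the graph:

$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$

where $\psi_C(\mathbf{x}_C)$ are the clique potentials, and Z is a normalization constant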

slide-26
SLIDE 26

Undirected Graphical Models

  • Pairwise undirected graphical models (single and bivariate potentials only)

Markov Random Field as a Factor Graph

$$P(X) = \frac{\prod_{i=1}^{n} f_i(X_i) \prod_{(i,j):\, e_{ij}} f_{ij}(X_i, X_j)}{\int \prod_{i=1}^{n} f_i(X_i) \prod_{(i,j):\, e_{ij}} f_{ij}(X_i, X_j) \, dX_1 \ldots dX_n}$$

[Figure: factor graph over nodes X1, X2, X3, X4, X5, ..., Xn-1, Xn, with single-node factors f1, f2, f3, f4, f5, ..., fn-1, fn and pairwise factors including f12, f13, f34, f4n-1, f5n-1, f5n]

slide-27
SLIDE 27

Question:

  • Each potential has some parameters. How to estimate them from training data?

– Could do gradient ascent on the likelihood of the data (if we knew Z)
– Often an iterative process

  • How to compute Z?

– Belief propagation (next slides)

slide-28
SLIDE 28

Message Passing

  • Example
  • Find marginal for a particular node

– for M-state nodes, cost is exponential in length of chain
– but we can exploit the graphical structure (conditional independences)

slide-29
SLIDE 29

Message Passing

  • Joint distribution
  • Exchange sums and products
slide-30
SLIDE 30

Message Passing

  • Express as product of messages
  • Recursive evaluation of messages
  • Find Z by normalizing
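
The chain recursion can be sketched in plain Python. The sketch assumes a chain whose joint distribution is a normalized product of pairwise potentials between neighbouring nodes; the function name, the list-of-tables input format, and the example numbers are all illustrative assumptions.

```python
def chain_marginals(psis):
    """Marginals and Z for p(x) = (1/Z) * prod_k psis[k][x_k][x_{k+1}].

    psis is a list of N-1 pairwise potential tables (lists of lists),
    where psis[k] links node k to node k + 1.
    """
    n = len(psis) + 1
    sizes = [len(psis[0])] + [len(p[0]) for p in psis]
    # Forward messages: fwd[k][x_k] sums out everything to the left of node k.
    fwd = [[1.0] * sizes[0]]
    for k in range(1, n):
        fwd.append([sum(fwd[k - 1][a] * psis[k - 1][a][b] for a in range(sizes[k - 1]))
                    for b in range(sizes[k])])
    # Backward messages: bwd[k][x_k] sums out everything to the right of node k.
    bwd = [[1.0] * sizes[k] for k in range(n)]
    for k in range(n - 2, -1, -1):
        bwd[k] = [sum(psis[k][a][b] * bwd[k + 1][b] for b in range(sizes[k + 1]))
                  for a in range(sizes[k])]
    # Normalizing the unnormalized marginal of any node gives Z; use the last node.
    z = sum(fwd[n - 1])
    marginals = [[f * b / z for f, b in zip(fwd[k], bwd[k])] for k in range(n)]
    return marginals, z


# Made-up potentials for a 4-node binary chain, for illustration only.
psis = [[[1.0, 2.0], [3.0, 1.0]],
        [[2.0, 1.0], [1.0, 4.0]],
        [[1.0, 1.0], [0.5, 2.0]]]
marginals, z = chain_marginals(psis)
print(z)
print(marginals)
```

Each message costs on the order of M squared operations for M-state nodes, so all marginals and Z together cost time linear in the chain length rather than exponential.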
slide-31
SLIDE 31

Belief Propagation

  • Extension to general tree-structured graphs
  • At each node:

– form product of incoming messages and local evidence
– marginalize to give outgoing message
– one message in each direction across every link

  • No convergence guaranteed if there are loops!
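
The per-node recipe can be sketched for a tree with pairwise potentials. This is a minimal recursive sum-product sketch under stated assumptions: the graph is a tree (the recursion would not terminate on a graph with loops), potentials are given as plain Python lists, and the names `tree_marginals`, `unary` and `pairwise` are hypothetical.

```python
from math import prod


def tree_marginals(unary, pairwise):
    """Sum-product on a tree-structured pairwise model.

    unary[i] is the local evidence vector for node i.
    pairwise[(i, j)] is a table indexed [x_i][x_j], one entry per tree edge.
    Returns normalized marginals for every node.
    """
    nbrs = {i: [] for i in unary}
    for i, j in pairwise:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def psi(i, j, xi, xj):
        # Look up the edge table regardless of the orientation it was stored in.
        return pairwise[(i, j)][xi][xj] if (i, j) in pairwise else pairwise[(j, i)][xj][xi]

    cache = {}

    def message(i, j):
        """Message from node i to neighbour j, as a vector over x_j."""
        if (i, j) not in cache:
            incoming = [message(k, i) for k in nbrs[i] if k != j]
            # Product of local evidence and all incoming messages except j's.
            belief = [unary[i][xi] * prod(m[xi] for m in incoming)
                      for xi in range(len(unary[i]))]
            # Marginalize over x_i to get the outgoing message.
            cache[(i, j)] = [sum(belief[xi] * psi(i, j, xi, xj)
                                 for xi in range(len(belief)))
                             for xj in range(len(unary[j]))]
        return cache[(i, j)]

    marginals = {}
    for i in unary:
        b = [unary[i][xi] * prod(message(k, i)[xi] for k in nbrs[i])
             for xi in range(len(unary[i]))]
        total = sum(b)
        marginals[i] = [v / total for v in b]
    return marginals


# Made-up potentials on a 4-node tree with edges 0-1, 0-2, 2-3 (illustration only).
unary = {0: [1.0, 2.0], 1: [1.0, 1.0], 2: [3.0, 1.0], 3: [1.0, 0.5]}
pairwise = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
            (0, 2): [[1.0, 3.0], [2.0, 1.0]],
            (2, 3): [[4.0, 1.0], [1.0, 4.0]]}
marginals = tree_marginals(unary, pairwise)
print(marginals)
```

The cache ensures each directed message is computed once, which matches the "one message in each direction across every link" bullet above.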
slide-32
SLIDE 32

Inference and Learning

  • Data set
  • Likelihood function (independent observations)
  • Maximize (log) likelihood
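
In symbols, the three items on this slide are usually written as below. The notation (a data set D of N observations and parameters θ) is the conventional one and is an assumption here, since the slide's equations were lost in extraction.

```latex
\begin{align}
\mathcal{D} &= \{x_1, \ldots, x_N\} \\
p(\mathcal{D} \mid \theta) &= \prod_{n=1}^{N} p(x_n \mid \theta) \\
\theta_{\mathrm{ML}} &= \arg\max_{\theta} \sum_{n=1}^{N} \ln p(x_n \mid \theta)
\end{align}
```

Taking the logarithm turns the product over independent observations into a sum, which is easier to differentiate and numerically more stable.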
slide-33
SLIDE 33

Modeling Protein Tertiary Structure

  • Optimize pseudo-likelihood of training data, to estimate parameters
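
For the pairwise model of Slide 26, the pseudo-likelihood objective is commonly defined as below. This is the standard definition written in that slide's f notation; the exact form used by the author is not recoverable from the slide, so treat it as an assumption. Here X_{-i} denotes all variables except X_i, and the superscript (m) indexes M training samples.

```latex
\begin{align}
\mathrm{PL}(\theta) &= \sum_{m=1}^{M} \sum_{i=1}^{n}
  \ln P\!\left(X_i^{(m)} \,\middle|\, X_{-i}^{(m)};\, \theta\right) \\
P(X_i \mid X_{-i}) &=
  \frac{f_i(X_i) \prod_{j:\, e_{ij}} f_{ij}(X_i, X_j)}
       {\int f_i(X_i') \prod_{j:\, e_{ij}} f_{ij}(X_i', X_j)\, dX_i'}
\end{align}
```

Each conditional is normalized over a single variable only, so the global normalization constant Z never has to be computed during training.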
slide-34
SLIDE 34

Application: Microarray Gene Expression Analysis

slide-35
SLIDE 35


The dramatic consequences of gene regulation in biology

Same genome -> Different tissues

  • Different physiology
  • Different proteome
  • Different expression pattern

Anise swallowtail, Papilio zelicaon

slide-36
SLIDE 36


cDNA microarray schema

From Duggan et al., Nature Genetics 21, 10-14 (1999)

Color code for relative expression

slide-37
SLIDE 37


Hierarchical clustering

  • Combine most similar genes into agglomerative clusters, build tree of genes
  • Do the same procedure along the second dimension to cluster samples
  • Display as a heatmap
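
The agglomerative procedure can be sketched in plain Python. The sketch uses Euclidean distance and average linkage, which are assumptions because the slide does not name a distance or linkage rule, and the expression values are made up for illustration.

```python
def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(profiles):
    """Average-linkage agglomerative clustering.

    profiles maps a gene name to its expression vector.
    Returns the tree as nested tuples and the list of merges (left, right, distance).
    """
    # Each cluster is (tree, member_names); start with one cluster per gene.
    clusters = [(name, [name]) for name in profiles]
    merges = []

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(distance(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        left, right = clusters[i], clusters[j]
        merges.append((left[0], right[0], d))
        merged = ((left[0], right[0]), left[1] + right[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0], merges


# Made-up expression values (4 genes x 3 samples), for illustration only.
expression = {
    "geneA": [1.0, 1.1, 0.9],
    "geneB": [1.1, 1.0, 1.0],
    "geneC": [5.0, 5.2, 4.9],
    "geneD": [5.1, 5.0, 5.0],
}
tree, merges = agglomerate(expression)
print(tree)
```

To cluster samples instead of genes, the same function can be applied to the transposed table, with one vector per sample. Ordering the rows and columns of the heatmap by the leaf order of the two trees places similar profiles next to each other.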
slide-38
SLIDE 38

Hierarchical clustering results

Chi et al., “Endothelial cell diversity revealed by global expression profiling”, PNAS, September 16, 2003, vol. 100, no. 19, pp. 10623-10628

slide-39
SLIDE 39

Personalized cancer treatment

  • Drug sensitivity test: 160 drugs, ~100 patients at UWMC
  • RNA levels of genes in cancer cells: 30,000 genes
  • 30,000 features! (feature selection)
  • Prior knowledge on drugs’ targets
  • Publicly available RNA level data: >3000 patients (transfer learning, feature reconstruction)

slide-40
SLIDE 40

Other applications

  • Predicting phenotype (symptoms) given:

– RNA levels of genes
– Protein levels of genes
– Epigenetics (methylation)
– A few histologic features
– DNA sequence (…ACGTAGCTAGCT AGCTAGCTGATGC TAGCTACGTGCT…)

  • Predictive models can be:

– Generative (e.g. Bayesian network)
– Discriminative (e.g. regression, SVM, KNN)

slide-41
SLIDE 41

Much more exciting research to come!