Applications of Machine Learning in Computational Biology
Narges Razavian New York University
Slides thanks to James Galagan @ Broad Institute, Su-In Lee @ Univ of Washington, Rainer Breitling @ Univ of Glasgow, Christopher M. Bishop @ ECCV 2004
Slide Credit: Manolis Kellis
GCGTCTGACGGCGCACCGTTCGCGCTGCCGGCACCCCGGGCTCCATAATGAAAATCATGT TCAGTAAGCTACACTCTGCATATCGGGCTACCAACGAAATGGAGTATCGGTCATGATCTT GCCAGCCGTGCCTAAAAGCTTGGCCGCAGGGCCGAGTATAATTGGTCGCGGTCGCCTCGA AGTTAGCTTATGCAATGCAGGAGGTGGGGCAAAGTTCAGGCGGATCGGCCGATGGCGGGC GTAGGTGAAGGAGACAGCGGAGGCGTGGAGCGTGATGACATTGGCATGGTGGCCGCTTCC CCCGTCGCGTCTCGGGTAAATGGCAAGGTAGACGCTGACGTCGTCGGTCGATTTGCCACC TGCTGCCGTGCCCTGGGCATCGCGGTTTACCAGCGTAAACGTCCGCCGGACCTGGCTGCC GCCCGGTCTGGTTTCGCCGCGCTGACCCGCGTCGCCCATGACCAGTGCGACGCCTGGACC GGGCTGGCCGCTGCCGGCGACCAGTCCATCGGGGTGCTGGAAGCCGCCTCGCGCACGGCG ACCACGGCTGGTGTGTTGCAGCGGCAGGTGGAACTGGCCGATAACGCCTTGGGCTTCCTG TACGACACCGGGCTGTACCTGCGTTTTCGTGCCACCGGACCTGACGATTTCCACCTCGCG TATGCCGCTGCGTTGGCTTCGACGGGCGGGCCGGAGGAGTTTGCCAAGGCCAATCACGTG GTGTCCGGTATCACCGAGCGCCGCGCCGGCTGGCGTGCCGCCCGTTGGCTCGCCGTGGTC ATCAACTACCGCGCCGAGCGCTGGTCGGATGTCGTGAAGCTGCTCACTCCGATGGTTAAT GATCCCGACCTCGACGAGGCCTTTTCGCACGCGGCCAAGATCACCCTGGGCACCGCACTG GCCCGACTGGGCATGTTTGCCCCGGCGCTGTCTTATCTGGAGGAACCCGACGGTCCTGTC GCGGTCGCTGCTGTCGACGGTGCACTGGCCAAAGCGCTGGTGCTGCGCGCGCATGTGGAT ATGGAGTCGGCCAGCGAAGTGCTGCAGGACTTGTATGCGGCTCACCCCGAAAACGAACAG GTCGAGCAGGCGCTGTCGGATACCAGCTTCGGGATCGTCACCACCACAGCCGGGCGGATC GAGGCCCGCACCGATCCGTGGGATCCGGCGACCGAGCCCGGCGCGGAGGATTTCGTCGAT CCCGCGGCCCACGAACGCAAGGCCGCGCTGCTGCACGAGGCCGAACTCCAACTCGCCGAG
Gene
Promoter Motif
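Two-state HMM: Background and Island
Transition probabilities shown on the diagram: 0.15, 0.25, 0.75, 0.85
Background emissions: A: 0.25 T: 0.25 G: 0.25 C: 0.25
Island emissions: A: 0.15 T: 0.13 G: 0.30 C: 0.42
Example sequence: TAAGAATTGTGTCACACACATAAAAACCCTAAGTTAGAGGATTGAGATTGGCA GACGATTGTTCGTGATAATAAACAAGGGGGGCATAGATCAGGCTCATATTGGC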
Labels L: each position is labeled P or B (the diagram shows candidate label paths over S)
Sequence S: C A A A T G C G
Emission probabilities:
  P(S|P): A: 0.42 T: 0.30 G: 0.13 C: 0.15
  P(S|B): A: 0.25 T: 0.25 G: 0.25 C: 0.25
Transition probabilities P(Li+1|Li):
        Bi+1   Pi+1
  Bi    0.85   0.15
  Pi    0.25   0.75
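The joint probability of a sequence and one particular labeling follows directly from the emission and transition probabilities above. The following is a minimal Python sketch of that product, not code from the slides. The names `EMIT`, `TRANS`, `START`, and `joint_log_prob` are hypothetical, and the uniform initial distribution is an assumption, since the slides do not state one.

```python
import math

# Emission probabilities P(S_i | L_i), taken from the slide.
EMIT = {
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
}
# Transition probabilities P(L_{i+1} | L_i), taken from the slide.
TRANS = {
    "B": {"B": 0.85, "P": 0.15},
    "P": {"B": 0.25, "P": 0.75},
}
# ASSUMPTION: the slides give no initial state distribution; use uniform.
START = {"B": 0.5, "P": 0.5}


def joint_log_prob(seq, labels):
    """Return log P(S, L): start term plus all emission and transition terms."""
    if not seq or len(seq) != len(labels):
        raise ValueError("seq and labels must be non-empty and of equal length")
    logp = math.log(START[labels[0]]) + math.log(EMIT[labels[0]][seq[0]])
    for i in range(1, len(seq)):
        logp += math.log(TRANS[labels[i - 1]][labels[i]])
        logp += math.log(EMIT[labels[i]][seq[i]])
    return logp


seq = "CAAATGCG"  # the example sequence S from the slide
all_background = joint_log_prob(seq, "BBBBBBBB")
mixed = joint_log_prob(seq, "BPPPPBBB")
print(all_background, mixed)
```

Working in log space avoids numerical underflow when the same product is taken over sequences of realistic length.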
Decoding
  Given: an HMM and a sequence S
  Find: a corresponding sequence of labels, L
  Computational biology: annotate pathogenicity islands on a new sequence
Evaluation
  Given: an HMM and a sequence S
  Find: P(S|HMM)
  Computational biology: score a particular sequence (not as useful for this model; will come back to this later)
Training
  Given: an HMM w/o parameters and a set of sequences S
  Find: transition and emission probabilities that maximize P(S | params, HMM)
  Computational biology: learn a model for sequences composed of background DNA and pathogenicity islands
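Decoding is commonly solved with the Viterbi algorithm, a dynamic program over label paths. The sketch below is one possible Python version under stated assumptions: it reuses the P/B emission and transition values shown earlier in the slides, assumes a uniform initial distribution (not given in the slides), and the function name `viterbi` is hypothetical.

```python
import math

STATES = ("B", "P")
EMIT = {
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
}
TRANS = {
    "B": {"B": 0.85, "P": 0.15},
    "P": {"B": 0.25, "P": 0.75},
}
START = {"B": 0.5, "P": 0.5}  # ASSUMPTION: not specified in the slides


def viterbi(seq):
    """Return (most probable label string, its log joint probability)."""
    if not seq:
        return "", 0.0
    # score[s] = best log-probability of any label path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for ch in seq[1:]:
        new_score, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            ptr[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][ch]))
        backpointers.append(ptr)
        score = new_score
    # Trace back from the best final state.
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), score[last]


seq = "CAAATGCG"
labels, logp = viterbi(seq)
print(labels, logp)
```

The run time is linear in the sequence length and quadratic in the number of states, which is what makes annotating a long sequence practical.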
– Can search databases with each known member
– not sensitive
– More information is contained in full set
– Learn the statistical features of protein family
– Model these features with an HMM
– Search for new members by scoring with HMM
UBE2D2    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
UBE2D3    FPTDYPFKPPKVAFTTRIYHPNINSN-GSICLDILR-------------SQWSPALTISK
BAA91697  FPTDYPFKPPKVAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2D1    FPTDYPFKPPKIAFTTKIYHPNINSN-GSICLDILR-------------SQWSPALTVSK
UBE2E1    FTPEYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBCH9     FSSDYPFKPPKVTFRTRIYHCNINSQ-GVICLDILK-------------DNWSPALTISK
UBE2N     LPEEYPMAAPKVRFMTKIYHPNVDKL-GRICLDILK-------------DKWSPALQIRT
AAF67016  IPERYPFEPPQIRFLTPIYHPNIDSA-GRICLDVLKLP---------PKGAWRPSLNIAT
UBCH10    FPSGYPYNAPTVKFLTPCYHPNVDTQ-GNICLDILK-------------EKWSALYDVRT
CDC34     FPIDYPYSPPAFRFLTKMWHPNIYET-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
BAA91156  FPIDYPYSPPTFRFLTKMWHPNIYEN-GDVCISILHPPVDDPQSGELPSERWNPTQNVRT
UBE2G1    FPKDYPLRPPKMKFITEIWHPNVDKN-GDVCISILHEPGEDKYGYEKPEERWLPIHTVET
UBE2B     FSEEYPNKPPTVRFLSKMFHPNVYAD-GSICLDILQN-------------RWSPTYDVSS
UBE2I     FKDDYPSSPPKCKFEPPLFHPNVYPS-GTVCLSILEED-----------KDWRPAITIKQ
E2EPF5    LGKDFPASPPKGYFLTKIFHPNVGAN-GEICVNVLKR-------------DWTAELGIRH
UBE2L1    FPAEYPFKPPKITFKTKIYHPNIDEK-GQVCLPVISA------------ENWKPATKTDQ
UBE2L6    FPPEYPFKPPMIKFTTKIYHPNVDEN-GQICLPIISS------------ENWKPCTKTCQ
UBE2H     LPDKYPFKSPSIGFMNKIFHPNIDEASGTVCLDVIN-------------QTWTALYDLTN
UBC12     VGQGYPHDPPKVKCETMVYHPNIDLE-GNVCLNILR-------------EDWKPVLTINS
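Learning the statistical features of a family from an alignment like the one above starts with per-column residue frequencies. The Python sketch below is a deliberate simplification: it estimates match-state emissions only, with no insert or delete states, so it is a position-specific scoring model rather than a full profile HMM. The toy alignment, the 20-letter alphabet, the pseudocount of 1, and the names `match_emissions` and `log_odds_score` are all assumptions made for illustration.

```python
import math
from collections import Counter

# ASSUMPTION: the 20 standard amino acids; the slide's alphabet may differ.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def match_emissions(alignment, pseudocount=1.0):
    """Estimate P(residue | column) for each alignment column.

    Gaps and characters outside AMINO_ACIDS are ignored. The pseudocount
    keeps residues unseen in a column from getting probability zero.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    columns = []
    for j in range(length):
        counts = Counter(row[j] for row in alignment if row[j] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        columns.append({a: (counts[a] + pseudocount) / total
                        for a in AMINO_ACIDS})
    return columns


def log_odds_score(seq, columns):
    """Ungapped log2-odds of seq against a uniform background.

    seq must have exactly one residue per column.
    """
    if len(seq) != len(columns):
        raise ValueError("seq length must equal the number of columns")
    background = 1.0 / len(AMINO_ACIDS)
    return sum(math.log2(col[a] / background) for a, col in zip(seq, columns))


# Hypothetical four-column toy alignment, not taken from the slide.
toy_alignment = ["HPNI", "HPNV", "HCNI", "HPN-"]
columns = match_emissions(toy_alignment)
member_score = log_odds_score("HPNI", columns)
decoy_score = log_odds_score("WWWW", columns)
print(member_score, decoy_score)
```

A sequence resembling the family scores higher than an unrelated one, which is the basis for searching for new members by scoring.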
Figure: profile HMM with Start and End states, match states M1 ... Mj ... MN, insert states I, I1 ... Ij ... IN, and delete states D1 ... Dj ... DN. Emissions are over the amino acid alphabet A C D E F G H I K L M N O P Q R S T V W Y. Rows E2EPF5, UBE2L1, UBE2L6, and UBE2H of the alignment are shown mapped onto the model states.
Decoding
  Find: the sequence of labels, L, that maximizes P(L|S, HMM)
  Computational biology: align a new sequence to a protein family
Evaluation
  Find: P(S|HMM)
  Computational biology: score a sequence for membership in the family
Training
  Find: transition and emission probabilities that maximize P(S | params, HMM)
  Computational biology: discover and model family structure
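Evaluation, P(S|HMM), sums the joint probability over every possible label path. The forward algorithm does this without enumerating the paths. The Python sketch below is written for a generic discrete HMM and is illustrated with the two-state P/B parameters from earlier in the slides, not with a profile HMM. The uniform start distribution and the function names are assumptions. A brute-force sum is included only to check the recursion on a short sequence.

```python
from itertools import product


def forward(seq, states, start, trans, emit):
    """Return P(S | HMM) using the forward recursion."""
    if not seq:
        return 1.0
    # alpha[s] = P(prefix observed so far, current label = s)
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for ch in seq[1:]:
        alpha = {s: emit[s][ch] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def brute_force(seq, states, start, trans, emit):
    """Same quantity by explicit enumeration of every label path."""
    total = 0.0
    for labels in product(states, repeat=len(seq)):
        p = start[labels[0]] * emit[labels[0]][seq[0]]
        for i in range(1, len(seq)):
            p *= trans[labels[i - 1]][labels[i]] * emit[labels[i]][seq[i]]
        total += p
    return total


STATES = ("B", "P")
EMIT = {
    "P": {"A": 0.42, "T": 0.30, "G": 0.13, "C": 0.15},
    "B": {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25},
}
TRANS = {"B": {"B": 0.85, "P": 0.15}, "P": {"B": 0.25, "P": 0.75}}
START = {"B": 0.5, "P": 0.5}  # ASSUMPTION: not given in the slides

seq = "CAAATGCG"
p_forward = forward(seq, STATES, START, TRANS, EMIT)
p_brute = brute_force(seq, STATES, START, TRANS, EMIT)
print(p_forward, p_brute)
```

For long sequences the raw probabilities underflow, so practical implementations rescale alpha at each step or work in log space.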
Crystallize to Get X-Ray Snapshot
Molecular Dynamics Simulations
Learn Probabilistic Model
Analyze and Predict
Image: H. Khanlou et al., “Durable Efficacy and Continued Safety of Ibalizumab in Treatment-Experienced Patients”, Infectious Diseases Society of America (IDSA), October 2011
No directed cycles
Markov Random Field as a Factor Graph
$$P(X) = \frac{\prod_{i=1}^{n} f_i(X_i) \prod_{e_{ij}} f_{ij}(X_i, X_j)}{\int \prod_{i=1}^{n} f_i(X_i) \prod_{e_{ij}} f_{ij}(X_i, X_j) \, dX_1 \ldots dX_n}$$
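The Markov random field definition above is a product of node factors and edge factors divided by a normalizing constant. The Python sketch below computes that distribution for a tiny discrete model by brute force, replacing the integral with a sum over all configurations. The factor values, the three-node graph, and the name `mrf_distribution` are hypothetical. Enumeration is feasible only for very small models.

```python
from itertools import product


def mrf_distribution(n_states, node_factors, edge_factors):
    """Return {configuration: probability} for a small discrete MRF.

    node_factors[i][x_i]           is f_i(X_i)
    edge_factors[(i, j)][x_i][x_j] is f_ij(X_i, X_j)
    """
    n = len(node_factors)
    unnormalized = {}
    for x in product(range(n_states), repeat=n):
        weight = 1.0
        for i in range(n):
            weight *= node_factors[i][x[i]]
        for (i, j), table in edge_factors.items():
            weight *= table[x[i]][x[j]]
        unnormalized[x] = weight
    z = sum(unnormalized.values())  # discrete stand-in for the integral
    return {x: w / z for x, w in unnormalized.items()}


# Hypothetical example: three binary variables with two edges that
# favor neighboring variables taking the same value.
agree = [[2.0, 1.0], [1.0, 2.0]]
node_factors = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
edge_factors = {(0, 1): agree, (0, 2): agree}
dist = mrf_distribution(2, node_factors, edge_factors)
print(dist[(0, 0, 0)], dist[(0, 1, 0)])
```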
Figure: factor graph over variables X1, X2, X3, X4, X5, ..., Xn-1, Xn with node factors f1, f2, f3, f4, f5, ..., fn-1, fn and edge factors f12, f13, f34, f4n-1, f5n-1, f5n.
– for M-state nodes, cost is O(M^N) for a chain of N nodes
– exponential in length of chain
– but we can exploit the graphical structure (conditional independences)
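The gain from exploiting the chain structure can be seen by computing the same marginal two ways. The Python sketch below assumes a chain in which every edge shares one pairwise potential table, a simplification chosen for brevity. It compares brute-force enumeration of all configurations with forward and backward message passing. The function names and the potential values are hypothetical.

```python
from itertools import product


def chain_marginal_brute(pair, n, m, target):
    """Marginal of node `target` by summing over all m**n configurations."""
    marginal = [0.0] * m
    for x in product(range(m), repeat=n):
        weight = 1.0
        for i in range(n - 1):
            weight *= pair[x[i]][x[i + 1]]
        marginal[x[target]] += weight
    z = sum(marginal)
    return [v / z for v in marginal]


def chain_marginal_messages(pair, n, m, target):
    """Same marginal from forward and backward messages, O(n * m**2) work."""
    # Forward message: carries the summed-out nodes to the left of target.
    fwd = [1.0] * m
    for _ in range(target):
        fwd = [sum(fwd[a] * pair[a][b] for a in range(m)) for b in range(m)]
    # Backward message: carries the summed-out nodes to the right of target.
    bwd = [1.0] * m
    for _ in range(n - 1 - target):
        bwd = [sum(pair[a][b] * bwd[b] for b in range(m)) for a in range(m)]
    unnormalized = [fwd[k] * bwd[k] for k in range(m)]
    z = sum(unnormalized)
    return [v / z for v in unnormalized]


# Hypothetical 3-state potential on a 6-node chain.
pair = [[2.0, 1.0, 0.5],
        [1.0, 3.0, 1.0],
        [0.2, 1.0, 1.0]]
brute = chain_marginal_brute(pair, n=6, m=3, target=2)
fast = chain_marginal_messages(pair, n=6, m=3, target=2)
print(brute, fast)
```

Both functions return the same distribution, but only the message-passing version stays tractable as the chain grows.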
Same genome Different tissues
Anise swallowtail, Papilio zelicaon
From Duggan et al., Nature Genetics 21, 10–14 (1999)
Color code for relative expression
Chi et al., “Endothelial cell diversity revealed by global expression profiling”, PNAS, September 16, 2003, vol. 100, no. 19, 10623-10628
Personalized cancer treatment
160 drugs
Drug sensitivity test
~100 patients at UWMC
Figure: RNA levels of 30,000 genes (g1, g2, ..., g30,000) in cancer cells, shown alongside Drug 2, Drug 3, Drug 4, Drug 5, Drug 6, ..., Drug i, ..., Drug 160.
30,000 features!
(feature selection)
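With far more features than patients, a simple first step is a univariate filter that ranks genes by their association with drug response. The Python sketch below ranks genes by absolute Pearson correlation on synthetic data. It illustrates feature selection in general and is not the method used in the slides. The data, the sizes, and the names `pearson` and `top_k_genes` are all hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_k_genes(expression, response, k):
    """Return indices of the k genes most correlated with the response.

    expression[p][g] is the RNA level of gene g in patient p.
    response[p] is the measured drug sensitivity of patient p.
    """
    n_genes = len(expression[0])
    scored = []
    for g in range(n_genes):
        column = [row[g] for row in expression]
        scored.append((abs(pearson(column, response)), g))
    scored.sort(reverse=True)
    return [g for _, g in scored[:k]]


# Synthetic data: 100 patients, 500 genes, response driven by gene 7.
rng = random.Random(0)
n_patients, n_genes = 100, 500
expression = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
              for _ in range(n_patients)]
response = [2.0 * row[7] + rng.gauss(0.0, 0.1) for row in expression]
selected = top_k_genes(expression, response, k=5)
print(selected)
```

With many more genes than patients, some genes correlate with the response by chance alone, so the selection step should be repeated inside each cross-validation fold rather than run once on all the data.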
Prior knowledge
Publicly available RNA level data
>3000 patients
Transfer learning, Feature reconstruction
RNA levels
Protein levels
Epigenetics (Methylation) A few histologic features
…ACGTAGCTAGCT AGCTAGCTGATGC TAGCTACGTGCT…
DNA sequence