
Applications of Pattern Recognition in Computational Biology


  1. Applications of Pattern Recognition in Computational Biology
     Pattern Recognition Course (2110597), Chulalongkorn University, August 22nd, 2017
     Instructor: Sira Sriswasdi (สิระ ศรีสวัสดิ์)

  2. 89% accuracy vs 73%

  3. Biology + Computation
     Image from https://www.systemsbiology.org/about/what-is-systems-biology/

  4. Data From High-Throughput Technology
     ▪ Genome sequence (e.g. ACCAGCGGCGAAGCTCGGGGCGGAGGGGTTGA...)
     ▪ The Central Dogma: “Information Processing in Biology”
     ▪ RNA expression (genes × conditions)
     ▪ Protein interactions
     Image from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription

  5. The Omics Era
     ▪ DNA sequencing, genetic mapping, recombinant DNA
     ▪ RNA sequencing, RNA expression, transcriptional regulation
     ▪ Protein identification and quantification, post-translational modification
     ▪ Metabolites, hormones, and other signaling molecules
     ▪ DNA methylation, histone modification
     Image from http://www.cubocube.com/files/images/opengenetics/chapter11/image2.png

  6. Application I: Evolutionary Genomics

  7. Genome Sequence As Species Signature
     [Figure: genome sequences of several species shown side by side, labeled with similarities of ~98%, ~90%, ~80%, and ~60%.]
     Image adapted from http://physrev.physiology.org/content/89/3/921
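
Similarity percentages like the ones on this slide can be computed in several ways. One simple option, sketched here, is percent identity over two sequences that have already been aligned to the same length. The function name and the toy sequences are invented for illustration and are not taken from the slide.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions at which two sequences carry the same base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Toy sequences invented for illustration (not the sequences on the slide).
reference = "ACCAGCGGCG"
close_relative = "ACCAGCGGCA"    # differs from the reference at 1 of 10 positions
distant_relative = "ACTAGAGTCA"  # differs from the reference at 4 of 10 positions

print(percent_identity(reference, close_relative))
print(percent_identity(reference, distant_relative))
```

Real genome comparisons first need an alignment step to handle insertions and deletions, which this sketch leaves out.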

  8. Evolution Of DNA Sequences
     [Figure: a tree along a time axis, linking inferred ancestral sequences (e.g. AAGACTT) to observed present-day sequences.]
     Image adapted from http://physrev.physiology.org/content/89/3/921 and https://www.quora.com/Arent-all-sequences-homologous

  9. Inferring Evolutionary History (Phylogenetics)
     ▪ Reconstruction of evolutionary events over millions of years
     ▪ Based on genome sequences of currently existing species
     ▪ Assumes some model of evolution on DNA sequences, e.g. P(A → T), P(G → T)
     ▪ Outputs the most likely tree topology and branch lengths
     ▪ Extremely large number of parameters, search spaces, and models to compare
     Image from http://physrev.physiology.org/content/89/3/921
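
As one concrete instance of a "model of evolution on DNA sequences", the sketch below uses the Jukes-Cantor (JC69) model, in which every base changes to each of the other three at the same rate. The slide does not say which model the lecture had in mind, so JC69 is an assumption chosen for its simplicity. The function names and the 10% figure are hypothetical.

```python
import math


def jc69_substitution_probability(distance: float) -> float:
    """P(X -> Y) for one specific base Y != X, given `distance` in
    expected substitutions per site, under the JC69 model."""
    return 0.25 - 0.25 * math.exp(-4.0 * distance / 3.0)


def jc69_distance(p_diff: float) -> float:
    """Estimate expected substitutions per site from the observed
    fraction of aligned sites that differ between two sequences."""
    if not 0.0 <= p_diff < 0.75:
        raise ValueError("JC69 distance is only defined for 0 <= p_diff < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p_diff / 3.0)


p_observed = 0.10                       # hypothetical: 10% of aligned sites differ
d = jc69_distance(p_observed)           # branch length implied by the model
p_a_to_t = jc69_substitution_probability(d)

print(f"estimated distance: {d:.4f} substitutions per site")
print(f"P(A -> T) at that distance: {p_a_to_t:.4f}")
```

The estimated distance is slightly larger than the observed fraction of differences because the model accounts for sites that changed more than once. A phylogenetic method would combine such probabilities over all sites and branches to score each candidate tree.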

  10. Growing Amount Of Genomic Data
     ▪ >17,000 bacterial genomes
     ▪ >350 fungal genomes
     ▪ >100 insect genomes
     ▪ >150 plant genomes
     ▪ >230 animal and fish genomes
     ▪ >70 invertebrate genomes
     [Figure: base pairs over time, Dec 1982 to Jul 2016; y-axis from 0 to 2.5E+11 base pairs.]
     Plotted with data from https://www.ncbi.nlm.nih.gov/genbank/statistics/

  11. Forecasting And Regulating Evolution
     ▪ Epidemiology
       • Tracking the spread of disease outbreaks
       • Predicting the next outbreaks and preparing vaccines in advance
     ▪ Biotechnology
       • Genetic engineering and breeding of new strains with desired characteristics and capabilities
     ▪ Wildlife Conservation
       • Pairing evolutionary history with climate/environmental changes can reveal the factors that drive animal evolution and extinction

  12. Application II: Population Genetics

  13. Tracing Population Structure Over Time
     ▪ Genetic inheritance (image adapted from https://wiki.uiowa.edu/display/2360159/Autosomal+Inheritance)
     ▪ Migration (image adapted from https://www.theodysseyonline.com/why-people-migrate)

  14. Single Nucleotide Polymorphisms (SNPs) As An Individual's Genetic Signature
     ▪ Example SNP profiles, one row per individual:
       A T T C … G
       C G G A … T
       C G G A … G
       G G T C … A
     ▪ Identity-By-Descent (IBD) network: 709,358 SNPs, 774,516 individuals
     ▪ Connect individuals that share a significant portion of consecutive SNPs
     Image from http://uvmgg.wikia.com/wiki/SNP
     Han et al. Nature Communications 8, 14238 (2017)
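
The rule "connect individuals that share a significant portion of consecutive SNPs" can be illustrated with a toy network builder. This is only a sketch under simplifying assumptions: each individual is a single string of SNP alleles, sharing is measured by the longest run of identical consecutive positions, and the threshold is arbitrary. It is not the procedure used by Han et al., and all names and data are hypothetical.

```python
from itertools import combinations


def longest_shared_run(snps_a: str, snps_b: str) -> int:
    """Length of the longest stretch of consecutive positions where both SNP strings agree."""
    best = current = 0
    for a, b in zip(snps_a, snps_b):
        current = current + 1 if a == b else 0
        best = max(best, current)
    return best


def build_ibd_network(genotypes: dict, min_fraction: float) -> set:
    """Connect two individuals when their longest shared run covers
    at least `min_fraction` of the compared SNP positions."""
    edges = set()
    for (name_a, snps_a), (name_b, snps_b) in combinations(sorted(genotypes.items()), 2):
        n_snps = min(len(snps_a), len(snps_b))
        if n_snps and longest_shared_run(snps_a, snps_b) / n_snps >= min_fraction:
            edges.add((name_a, name_b))
    return edges


# Hypothetical individuals and SNP strings, invented for illustration.
genotypes = {
    "ind1": "ATTCGGAC",
    "ind2": "CGGATCAG",
    "ind3": "CGGATCAT",
    "ind4": "GGTCAGAC",
}
edges = build_ibd_network(genotypes, min_fraction=0.75)
print(edges)
```

Here only `ind2` and `ind3` are connected, because they agree on seven consecutive SNPs out of eight. At the scale quoted on the slide, comparing every pair this way would be far too slow, so a real pipeline needs a more efficient strategy.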

  15. Roots Of North American Population
     Han et al. Nature Communications 8, 14238 (2017)

  16. From Population To Personalized Medicine
     ▪ Social Sciences
       • Tracking the dynamics of populations
       • Understanding ethnic structures
     ▪ Medicine
       • Identifying common genetic variations within a population that may be associated with drug targets
       • Identifying disease risk factors
     Image from http://poshrx.com/23andme-is-back-on/

  17. Application III: Gene Expression Analysis

  18. Measuring Gene Expression
     ▪ Sequencing machine (image from https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000.html)
     ▪ Output: the amount of RNA product of each gene
     Image adapted from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription

  19. RNA Sequencing (Counting) Biases
     ▪ Due to technological limitations, the entire length of an RNA cannot be sequenced at once
       • Full-length RNA has to be fragmented
       • Bias in fragment length
     ▪ To increase sensitivity, fragmented RNA has to be amplified
       • Bias in signal amplification
     ▪ Sequencing is directional
       • Bias in head-to-tail read count
     [Figure label: Bias correction]
     Images adapted from http://bio.lundberg.gu.se/courses/vt13/rnaseq.html/ and https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

  20. Bias Normalization via Regression
     ▪ Sequencing involves sampling of RNA transcripts
     ▪ Estimated expression levels of low-, medium-, and high-expression genes are affected differently by the throughput of the RNA sequencing experiment
     ▪ Normalization by regression corrects these biases
     [Figure: expression estimates before and after normalization.]
     Adapted from Bacher et al. Nature Methods 14, 584-586 (2017)
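
To make "normalization by regression" concrete, the sketch below fits a straight line to one gene's log count against log sequencing depth across samples, then subtracts the fitted depth trend. This is a deliberately simplified stand-in using ordinary least squares. It is not a reimplementation of the method in Bacher et al., and the function names, pseudocount, and data are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def normalize_by_depth(counts, depths):
    """Remove the log-log dependence of one gene's counts on sequencing depth.

    Returns normalized log counts, centered so their mean is unchanged."""
    log_depth = [math.log(d) for d in depths]
    log_count = [math.log(c + 1.0) for c in counts]  # +1 pseudocount avoids log(0)
    _, slope = fit_line(log_depth, log_count)
    mean_log_depth = sum(log_depth) / len(log_depth)
    return [lc - slope * (ld - mean_log_depth) for lc, ld in zip(log_count, log_depth)]


# Hypothetical data: one gene measured in four samples of increasing depth.
depths = [1.0e6, 2.0e6, 4.0e6, 8.0e6]
counts = [12, 19, 45, 77]

normalized = normalize_by_depth(counts, depths)
_, slope_after = fit_line([math.log(d) for d in depths], normalized)
print(normalized, slope_after)
```

After normalization the fitted slope against log depth is zero up to rounding, so any remaining variation is no longer explained by depth under this linear model. The slide's point that low-, medium-, and high-expression genes respond differently to depth suggests fitting such trends separately per expression level rather than once for all genes.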

  21. What Can Gene Expression Tell Us?
     ▪ Time series analysis
     ▪ Down-regulated and up-regulated genes
     [Figure: expression heat maps with a color scale from Low to High.]
     Adapted from Rund et al. PNAS 108, E421-430 and Klings et al. Physiological Genomics 21, 293-298 (2005)

  22. Structure Behind Gene Expression Profiles
     ▪ Each cluster represents a group of genes with similar functions
     ▪ Clustering methods shown: hierarchical, K-means, self-organizing map
     D’haeseleer et al. Nature Biotechnology 23, 1499-1501 (2005)
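
Of the three methods named on this slide, hierarchical clustering is the simplest to sketch from scratch. The example below is a minimal agglomerative version that assumes Euclidean distance and average linkage. Neither choice is stated on the slide, and the expression profiles are invented.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clustering(profiles, n_clusters):
    """Start with one cluster per gene and repeatedly merge the two
    closest clusters until only `n_clusters` remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best_pair, best_dist = None, math.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = average_linkage(clusters[a], clusters[b], profiles)
                if dist < best_dist:
                    best_pair, best_dist = (a, b), dist
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is still valid
    return [sorted(c) for c in clusters]


# Hypothetical expression profiles (rows = genes, columns = conditions).
profiles = [
    [1.0, 1.1, 0.9],  # gene 0: low expression
    [1.2, 1.0, 1.0],  # gene 1: low expression
    [5.0, 5.2, 4.9],  # gene 2: high expression
    [5.1, 4.8, 5.0],  # gene 3: high expression
]
clusters = agglomerative_clustering(profiles, n_clusters=2)
print(clusters)
```

Recording the order of merges instead of stopping at a fixed number of clusters would give the tree (dendrogram) usually drawn next to expression heat maps.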

  23. Identifying Disease Subtypes
     ▪ Gene expression data from >4,000 colorectal cancer patients
     ▪ Node = patient; edge = similar gene expression
     Guinney et al. Nature Medicine 21, 1350-1356 (2015)

  24. Application Of Gene Expression Analysis
     To understand the blueprint of biological systems and diseases
     Image from http://pathview.r-forge.r-project.org/

  25. A Break From Biology: Basic Clustering Techniques

  26. An Illustration Of K-Means Clustering
     ▪ Randomly select K=3 centroids
     ▪ Assign points to nearest centroid
     ▪ Update centroids
     ▪ Update point assignments
     ▪ Update centroids
     From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
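
The steps illustrated on this slide can be sketched in a few lines of plain Python. This is a minimal version for small data sets, assuming points are tuples of floats. The helper names, the seed, and the toy points are hypothetical, and empty clusters simply keep their previous centroid.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) after running K-means from a random start."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # randomly select K centroids
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignments = [nearest(p, centroids) for p in points]
        if new_assignments == assignments:     # no assignment changed: converged
            break
        assignments = new_assignments
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:                        # an empty cluster keeps its old centroid
                centroids[j] = mean(members)
    return centroids, assignments


# Toy data: three well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (20.0, 0.0), (20.0, 1.0)]
centroids, assignments = k_means(points, k=3)
print(centroids, assignments)
```

Because the starting centroids are random, different seeds can give different final clusterings even on this tiny data set.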

  27. Characteristics Of K-Means Clustering
     ▪ The number of clusters, K, is specified in advance.
     ▪ Euclidean distance
       • Assigning each point x to its nearest centroid m minimizes the sum of squares, $\|x - m\|^2$.
     ▪ Always converges to a (local) minimum.
       • Poor starting centroid locations can lead to suboptimal minima.
     ▪ The model has several implicit assumptions:
       • Data points scatter around their cluster's center.
       • The boundary between adjacent clusters is always halfway between the cluster centroids.
     Image from https://en.wikipedia.org/wiki/K-means_clustering
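
The convergence and "halfway boundary" claims follow from the objective that K-means minimizes. The short derivation below is written in standard notation that the slide itself does not define: $C_k$ is the set of points assigned to cluster $k$, and $m_k$ is its centroid.

```latex
% Objective: total within-cluster sum of squares.
J = \sum_{k=1}^{K} \sum_{x \in C_k} \| x - m_k \|^2

% Assignment step: with the centroids fixed, J is minimized by
% sending each point to its nearest centroid.
x \in C_k \quad \text{where} \quad k = \arg\min_{j} \| x - m_j \|^2

% Update step: with the assignments fixed, J is minimized by the cluster mean.
m_k = \frac{1}{|C_k|} \sum_{x \in C_k} x

% Boundary between clusters a and b: the points equidistant from both centroids,
\| x - m_a \|^2 = \| x - m_b \|^2
\;\iff\;
2\,(m_b - m_a)^{\top} x = \| m_b \|^2 - \| m_a \|^2
% which is the hyperplane perpendicular to m_b - m_a through their midpoint.
```

Neither step can increase $J$, and there are only finitely many ways to partition the points, so the iteration must stop. Nothing in the argument guarantees that the stopping point is the global minimum.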

  28. Effect Of Poor Initial Centroid Locations
     From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
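
A four-point example is enough to show this effect. The sketch below runs the K-means iteration from two hand-picked starting positions on the corners of a wide rectangle. The data, the function name, and both initializations were chosen for illustration and do not come from the linked visualization.

```python
import math


def lloyd(points, centroids, max_iter=100):
    """Run K-means from the given initial centroids.

    Returns (final_centroids, sse), where sse is the sum of squared
    distances from each point to its nearest final centroid."""
    centroids = list(centroids)
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:
            k = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            clusters[k].append(p)
        updated = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # centroids stopped moving: converged
            break
        centroids = updated
    sse = sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)
    return centroids, sse


# Four corners of a rectangle that is 4 units wide and 1 unit tall.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, sse_good = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])  # start splits left / right
_, sse_poor = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])  # start splits bottom / top

print(sse_good, sse_poor)
```

Both runs converge immediately, but the second one stays at a sum of squares of 16 instead of 1, because neither step of the algorithm can move it out of the bottom/top split. A common remedy is to run the algorithm from several starting positions and keep the result with the lowest sum of squares.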

  29. Distance Functions
     ▪ Properties of a distance function
       • $d(x, y) \ge 0$ (non-negativity)
       • $d(x, y) = 0 \iff x = y$ (identity)
       • $d(x, y) = d(y, x)$ (symmetry)
       • $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
     ▪ Examples of distance functions
       • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
       • Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$
       • Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
       • Maximum distance: $d(x, y) = \max_i |x_i - y_i|$
     ▪ $p$-norm: $\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p}$
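
The distance functions listed on this slide translate directly into code. The sketch below implements them for equal-length sequences of numbers and checks the triangle inequality on three collinear points. The function names and the test points are chosen for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p); p = 2 gives the Euclidean norm, p = 1 the Manhattan norm."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)


# Three collinear points, so the Euclidean triangle inequality holds with equality.
x, y, z = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

for name, dist in [("euclidean", euclidean), ("squared euclidean", squared_euclidean),
                   ("manhattan", manhattan), ("maximum", maximum)]:
    holds = dist(x, z) <= dist(x, y) + dist(y, z)
    print(f"{name}: d(x,z)={dist(x, z)}, d(x,y)+d(y,z)={dist(x, y) + dist(y, z)}, triangle holds: {holds}")
```

On these points the squared Euclidean distance fails the triangle inequality (100 versus 25 + 25), so it does not satisfy all four properties listed above even though K-means minimizes it.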
