Applications of Pattern Recognition in Computational Biology

Pattern Recognition Course (2110597), Chulalongkorn University
August 22nd, 2017
Instructor: Sira Sriswasdi (สิระ ศรีสวัสดิ์)

89% accuracy vs 73%

Biology + Computation
Image from https://www.systemsbiology.org/about/what-is-systems-biology/

Data From High-Throughput Technology
The Central Dogma: “Information Processing in Biology”
[Figure: genome sequence (DNA), RNA expression (genes by conditions), and protein interactions]
Image from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription

The Omics Era
▪ DNA sequencing, genetic mapping, recombinant DNA
▪ DNA methylation, histone modification
▪ RNA sequencing, RNA expression, transcriptional regulation
▪ Protein identification and quantification, post-translational modification
▪ Metabolites, hormones, and other signaling molecules
Image from http://www.cubocube.com/files/images/opengenetics/chapter11/image2.png

Application I: Evolutionary Genomics

Genome Sequence As Species Signature
[Figure: genome sequences of several species compared, with approximate sequence similarities of ~98%, ~90%, ~80%, and ~60%]
Image adapted from http://physrev.physiology.org/content/89/3/921

Evolution Of DNA Sequences
[Figure: tree of ancestral sequences (inferred) diverging over time into present sequences (observed)]
Image adapted from http://physrev.physiology.org/content/89/3/921 and https://www.quora.com/Arent-all-sequences-homologous

Inferring Evolutionary History (Phylogenetics)
▪ Reconstructs evolutionary events over millions of years
▪ Based on the genome sequences of currently existing species
▪ Assumes a model of DNA sequence evolution, e.g. substitution probabilities P(A → T), P(G → T)
▪ Outputs the most likely tree topology and branch lengths
▪ Involves an extremely large number of parameters, a very large search space, and many candidate models to compare
Image from http://physrev.physiology.org/content/89/3/921
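As one illustration of what a substitution model does, the sketch below uses the Jukes-Cantor (JC69) model. The slide does not name a specific model, so JC69 is an assumption chosen only because it is the simplest case: every substitution is taken to be equally likely. Under that assumption, the observed fraction p of differing sites between two aligned sequences converts to an estimated distance d = -(3/4) ln(1 - 4p/3). The function names and the example sequences are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length, and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(seq_a, seq_b):
    """Expected substitutions per site under the (assumed) JC69 model."""
    p = p_distance(seq_a, seq_b)
    if p == 0:
        return 0.0
    if p >= 0.75:
        # The correction is undefined once sequences look random to each other.
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned sequences that differ at 1 of 10 sites.
print(jukes_cantor_distance("ACCAGCGGCG", "ACCAGCGGCA"))
```

The corrected distance comes out slightly larger than the raw fraction of mismatches, because the model accounts for sites that may have changed more than once. Full phylogenetic inference would estimate such distances (or likelihoods) jointly over a whole tree, which is where the large search space comes from.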

Growing Amount Of Genomic Data
▪ >17,000 bacterial genomes
▪ >350 fungal genomes
▪ >100 insect genomes
▪ >150 plant genomes
▪ >230 animal and fish genomes
▪ >70 invertebrate genomes
[Figure: base pairs (y-axis, 0 to 2.5E+11) versus date (x-axis, Dec-82 to Jul-16)]
Plotted with data from https://www.ncbi.nlm.nih.gov/genbank/statistics/

Forecasting And Regulating Evolution
▪ Epidemiology
• Tracking the spread of disease outbreaks
• Predicting the next outbreaks and preparing vaccines in advance
▪ Biotechnology
• Genetic engineering and breeding of new strains with desired characteristics and capabilities
▪ Wildlife Conservation
• Pairing evolutionary history with climate and environmental changes can reveal the factors that drive animal evolution and extinction

Application II: Population Genetics

Tracing Population Structure Over Time
▪ Genetic Inheritance (image adapted from https://wiki.uiowa.edu/display/2360159/Autosomal+Inheritance)
▪ Migration (image adapted from https://www.theodysseyonline.com/why-people-migrate)

Single Nucleotide Polymorphisms (SNPs) As An Individual's Genetic Signature
Image from http://uvmgg.wikia.com/wiki/SNP
Identity-By-Descent (IBD) Network
▪ 709,358 SNPs
▪ 774,516 individuals
▪ Connect individuals that share a significant portion of consecutive SNPs
Han et al. Nature Communications 8, 14238 (2017)
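The idea of connecting individuals that share long runs of consecutive SNPs can be sketched with a toy example. This is not the method used by Han et al.; real IBD detection works on phased haplotypes with dedicated statistical tools. The sketch below simply assumes each individual is a string of SNP alleles at the same positions, and links two individuals when their longest run of identical consecutive SNPs covers at least a chosen fraction of the sequence. The names `longest_shared_run`, `ibd_like_network`, and `min_fraction`, and the genotype strings, are all hypothetical.

```python
from itertools import combinations


def longest_shared_run(a, b):
    """Length of the longest stretch of consecutive positions where a and b agree."""
    best = current = 0
    for x, y in zip(a, b):
        current = current + 1 if x == y else 0
        best = max(best, current)
    return best


def ibd_like_network(genotypes, min_fraction=0.5):
    """Return edges (name_a, name_b, run_length) for pairs sharing a long run."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(sorted(genotypes.items()), 2):
        run = longest_shared_run(a, b)
        if run >= min_fraction * min(len(a), len(b)):
            edges.append((name_a, name_b, run))
    return edges


# Hypothetical SNP strings for three individuals.
genotypes = {
    "p1": "ATTCGGAT",
    "p2": "ATTCGGTA",
    "p3": "GCGGATCA",
}
print(ibd_like_network(genotypes))
```

Comparing every pair is quadratic in the number of individuals, so a data set with hundreds of thousands of people would need a far more scalable approach than this sketch.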

Roots Of North American Population
Han et al. Nature Communications 8, 14238 (2017)

From Population To Personalized Medicine
▪ Social Sciences
• Tracking the dynamics of populations
• Understanding ethnic structures
▪ Medicine
• Identifying common genetic variations within a population that may be associated with drug targets
• Identifying disease risk factors
Image from http://poshrx.com/23andme-is-back-on/

Application III: Gene Expression Analysis

Measuring Gene Expression
▪ Sequencing machine (image from https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000.html)
▪ Measured quantity: amount of RNA product of each gene
Image adapted from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription

RNA Sequencing (Counting) Biases
▪ Due to technological limitations, the entire length of an RNA cannot be sequenced at once
• Full-length RNA has to be fragmented
• Bias in fragment length
▪ To increase sensitivity, fragmented RNA has to be amplified
• Bias in signal amplification
▪ Sequencing is directional
• Bias in head-to-tail read count
▪ All of these biases call for bias correction
Images adapted from http://bio.lundberg.gu.se/courses/vt13/rnaseq.html/ and https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

Bias Normalization via Regression
[Figure: expression estimates before normalization and after normalization]
▪ Sequencing involves sampling of RNA transcripts
▪ Estimated expression levels of low-, medium-, and high-expression genes are affected differently by the throughput of the RNA sequencing experiment
▪ Normalization by regression corrects these biases
Adapted from Bacher et al. Nature Methods 14, 584-586 (2017)

What Can Gene Expression Tell Us?
▪ Time series analysis
▪ Down-regulated genes and up-regulated genes
[Figure: expression heat maps on a low-to-high color scale]
Adapted from Rund et al. PNAS 108, E421-430 and Klings et al. Physiological Genomics 21, 293-298 (2005)

Structure Behind Gene Expression Profiles
Each cluster represents a group of genes with similar functions
▪ Hierarchical
▪ K-means
▪ Self-Organizing Map
D’haeseleer et al. Nature Biotechnology 23, 1499-1501 (2005)

Identifying Disease Subtypes
▪ Gene expression data from >4,000 colorectal cancer patients
▪ Node = patient
▪ Edge = similar gene expression
Guinney et al. Nature Medicine 21, 1350-1356 (2015)

Application Of Gene Expression Analysis
To understand the blueprint of biological systems and diseases
Image from http://pathview.r-forge.r-project.org/

A Break From Biology: Basic Clustering Techniques

An Illustration Of K-Means Clustering
Steps shown, in order:
▪ Randomly select K=3 centroids
▪ Assign points to the nearest centroid
▪ Update centroids
▪ Update point assignments
▪ Update centroids
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Characteristics Of K-Means Clustering
▪ The number of clusters, K, is specified in advance.
▪ Euclidean distance
• Assigning each point x to its nearest centroid m minimizes the sum of squared distances, $\|x - m\|^2$.
▪ Always converges to a (local) minimum.
• Poor starting centroid locations can lead to a poor local minimum.
▪ The model has several implicit assumptions:
• Data points scatter around the cluster centers.
• The boundary between adjacent clusters is always halfway between the cluster centroids.
Image from https://en.wikipedia.org/wiki/K-means_clustering
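The assign-then-update loop can be written in a few lines. The following is a minimal sketch in plain Python, not a reference implementation: it assumes points are tuples of floats, uses squared Euclidean distance, and stops when assignments no longer change. The function names and the optional `init` argument (which lets the caller choose the starting centroids instead of sampling them at random) are assumptions introduced for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def k_means(points, k, init=None, max_iter=100, seed=0):
    """Return (centroids, assignments) after running the k-means loop."""
    rng = random.Random(seed)
    # Start from the given centroids, or pick k distinct data points at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = k_means(points, 2, init=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

Because the result depends on the starting centroids, a common precaution is to run the loop from several different starts and keep the solution with the smallest total squared distance.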

Effect Of Poor Initial Centroid Locations
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Distance Functions
▪ Properties of a distance function
• $d(x, y) \ge 0$ (non-negativity)
• $d(x, y) = 0 \iff x = y$ (identity)
• $d(x, y) = d(y, x)$ (symmetry)
• $d(x, z) \le d(x, y) + d(y, z)$ (triangle inequality)
▪ Examples of distance functions
• Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$
• Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
• Maximum distance: $d(x, y) = \max_i |x_i - y_i|$
▪ p-norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
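The distance functions listed above translate directly into code. The sketch below assumes points are equal-length tuples of numbers; the function names are chosen here for illustration.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def p_norm(x, p):
    """p-norm of a single vector; p=1 and p=2 give the Manhattan and Euclidean norms."""
    return sum(abs(a) ** p for a in x) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)
print(euclidean(x, y), squared_euclidean(x, y), manhattan(x, y), maximum(x, y))

# Checking the triangle inequality on three collinear points.
a, b, c = (0.0,), (1.0,), (2.0,)
print(squared_euclidean(a, c) <= squared_euclidean(a, b) + squared_euclidean(b, c))
print(euclidean(a, c) <= euclidean(a, b) + euclidean(b, c))
```

The last two lines show that squared Euclidean distance can violate the triangle inequality (4 is not at most 1 + 1), while Euclidean distance satisfies it, so the squared form is a convenient dissimilarity rather than a distance in the strict sense of the four properties.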
