Applications of Pattern Recognition in Computational Biology



slide-1
SLIDE 1

Applications of Pattern Recognition in Computational Biology

Pattern Recognition Course (2110597) Chulalongkorn University August 22nd, 2017 Instructor: Sira Sriswasdi (สิระ ศรีสวัสดิ์)

1

slide-2
SLIDE 2

2

89% accuracy vs 73%

slide-3
SLIDE 3

Image from https://www.systemsbiology.org/about/what-is-systems-biology/

Biology + Computation

3

slide-4
SLIDE 4

Data From High-Throughput Technology

4

The Central Dogma

“Information Processing in Biology”

Data types: Genome Sequence, RNA Expression (axes: Genes, Conditions), Protein Interactions

ACCAGCGGCGAAGCTCGGGGCGGAGGGGTTGA GCCACATGAGGCGATGGCGACAATGAGGCGAG ACATGGCGTGGCTGGCTGTTACATTTTGTTTT GATGAAAAGCATAACCATGCGGATGATATTTT TATTATAGACTAGAGATGATTATTGAATAGAC ATGCTCTTAACCATTTTTAACTCTAATTCCAC

Image from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription

slide-5
SLIDE 5

Image from http://www.cubocube.com/files/images/opengenetics/chapter11/image2.png

The Omics Era

5

▪ Metabolites, hormones, and other signaling molecules
▪ DNA sequencing, genetic mapping, recombinant DNA
▪ Protein identification and quantification, post-translational modification
▪ RNA sequencing, RNA expression, transcriptional regulation
▪ DNA methylation, histone modification

slide-6
SLIDE 6

Application I: Evolutionary Genomics

6

slide-7
SLIDE 7

Genome Sequence As Species Signature

7

Image adapted from http://physrev.physiology.org/content/89/3/921

CGGCGAAGCTCGGGGCGGAGGGGTTGATTTTTAACTCTAATT... AGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATG GCGTGGCTGGCTGTTACATTTTGTTTTGATGAAATTTTTAACTCTAATTC... CAGCGGCGAAGCTCGGGGCGGAGGGGTTTATTTTTATTATAGACTAGAGATGATTATTGAATAGAC ATGCTCTTAACCATTTTTAACTCTAATTCC... GCGAAGCTCGTTAACCATGCGGATGATATTTTTATTATAGACTAGAGATGATTATTGAATAGACAT GCTCTTAACCATTTTTAACTCTAA... CCAGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACA TGGCGTGGCTGGCTGTTACATTTTGTTTTGATGAAAAGCATAACCATGCGGATGATATTTTTATTA TAGACTAGAGATGATTATTGAATAGACATTTTAACTCTAATTCCA... GCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATGG CGTGGCTGGCTGTTACATTTTGTTTAACTCTAAT... GCGGCGAAGCTCGGGGCGGAGGGGTTGAGCAAAAGCATAACCATGCGGATGATATTTTTATTATAG ACTAGAGATGATTATTGAACTCTAAA... CGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGA CATGGCGTGGCTGGCTGTTAC... CCAGCGGCGAAGCGGCGATGGCGACAATGAGGCGAGACATGGCGTGGCTGGCTGTTACATTTTGTT AATCGGGGCGGAGGGGTTGAGCCACATGAGCATAACCATGCGGATGATATTTTTATTATAGACTAG AGATGATTATTGAATAGACATTTTAACTCTAATTCCA... CCAGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCATAACCATGCGGATGATATTTTTATTATAGACT AGAGATGATTATTGAATAGACATTTTAACTCTAATTCCA... CCAGCGGCGAAGCTCGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATGGCGTGGCTG GCTGTTACATTTTGTTTTGATGAAAAGCATTCTAATTCCA...

~98% ~90% ~60% ~80%

slide-8
SLIDE 8

Evolution Of DNA Sequences

8

Image adapted from http://physrev.physiology.org/content/89/3/921 and https://www.quora.com/Arent-all-sequences-homologous

Time

AAGACTT CAGCGCT CAGCACA TAGCCCT TACCCCA AAGGCCA TAGCCCA CAGCACT TAGGCCA

Ancestral Sequences (Inferred) Present Sequence (Observed)

slide-9
SLIDE 9

Inferring Evolutionary History (Phylogenetics)

▪ Reconstruction of evolutionary events over millions of years
▪ Based on genome sequences of currently existing species
▪ Assume a model of evolution on DNA sequences, e.g. substitution probabilities P(A→T), P(G→T)
▪ Output the most likely tree topology and branch lengths
▪ Extremely large numbers of parameters, search spaces, and models to compare

9

Time

Image from http://physrev.physiology.org/content/89/3/921

slide-10
SLIDE 10

Growing Amount Of Genomic Data

10

[Plot: base pairs in GenBank (y-axis, ticks up to 2.5E+11) versus date (x-axis, Dec-82 to Jul-16)]

Plotted with data from https://www.ncbi.nlm.nih.gov/genbank/statistics/

▪ >17,000 bacterial genomes
▪ >350 fungal genomes
▪ >100 insect genomes
▪ >150 plant genomes
▪ >230 animal and fish genomes
▪ >70 invertebrate genomes

slide-11
SLIDE 11

Forecasting And Regulating Evolution

▪ Epidemiology
  • Tracking the spread of disease outbreaks
  • Predicting the next outbreaks and preparing vaccines in advance
▪ Biotechnology
  • Genetic engineering and breeding of new strains with desired characteristics and capabilities
▪ Wildlife Conservation
  • Pairing evolutionary history with climate/environmental changes can reveal the factors that drive animal evolution and extinction

11

slide-12
SLIDE 12

Application II: Population Genetics

12

slide-13
SLIDE 13

Tracing Population Structure Over Time

13

Image adapted from https://www.theodysseyonline.com/why-people-migrate
Image adapted from https://wiki.uiowa.edu/display/2360159/Autosomal+Inheritance

Panels: Migration, Genetic Inheritance

slide-14
SLIDE 14

Single Nucleotide Polymorphisms (SNPs) As An Individual’s Genetic Signature

14

Han et al. Nature Communications 8, 14238 (2017)
Image from http://uvmgg.wikia.com/wiki/SNP

Identity-By-Descent (IBD) Network

774,516 individuals × 709,358 SNPs

A T T C … G
C G G A … T
C G G A … G
G G T C … A

Connect individuals that share a significant portion of consecutive SNPs
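The connection rule on this slide can be sketched in a few lines. This is a simplified illustration and not the procedure of Han et al.: it assumes each individual is represented by a string with one allele per SNP, and it links two individuals when their longest run of consecutive identical SNPs reaches a threshold. The names `longest_shared_run`, `ibd_edges`, and `min_run` are hypothetical, as is the toy data.

```python
from itertools import combinations


def longest_shared_run(a, b):
    """Length of the longest stretch of consecutive positions where a and b agree."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0  # extend the run or reset it on a mismatch
        best = max(best, run)
    return best


def ibd_edges(genotypes, min_run):
    """Return pairs of individuals whose longest shared run is at least min_run.

    genotypes: dict mapping an individual's name to a string of alleles,
    one character per SNP, all strings in the same SNP order (an assumption).
    """
    return [
        (p, q)
        for p, q in combinations(sorted(genotypes), 2)
        if longest_shared_run(genotypes[p], genotypes[q]) >= min_run
    ]


# Toy data, invented for illustration only.
toy = {"ind1": "ATTCGA", "ind2": "ATTCTT", "ind3": "GGGTCA"}
print(ibd_edges(toy, min_run=4))  # prints [('ind1', 'ind2')]
```

The resulting edge list is the network: individuals are nodes, and each returned pair is an edge.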

slide-15
SLIDE 15

Roots Of North American Population

15

Han et al. Nature Communications 8, 14238 (2017)

slide-16
SLIDE 16

From Population To Personalized Medicine

▪ Social Sciences

  • Tracking the dynamics of populations
  • Understanding ethnic structures

▪ Medicine

  • Identifying common genetic variations within a population that may be associated with drug targets

  • Identifying disease risk factors

16

Image from http://poshrx.com/23andme-is-back-on/

slide-17
SLIDE 17

Application III: Gene Expression Analysis

17

slide-18
SLIDE 18

Measuring Gene Expression

18

Image adapted from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
Image from https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000.html

Sequencing Machine
Amount of RNA product of each gene
slide-19
SLIDE 19

RNA Sequencing (Counting) Biases

19

Image adapted from http://bio.lundberg.gu.se/courses/vt13/rnaseq.html/

▪ Due to technological limitations, the entire length of an RNA molecule cannot be sequenced at once

  • Full-length RNA has to be fragmented
  • Bias in fragment length

▪ To increase sensitivity, fragmented RNA has to be amplified

  • Bias in signal amplification

▪ Sequencing is directional

  • Bias in head-to-tail read count

Image adapted from https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/

Bias correction

slide-20
SLIDE 20

Bias Normalization via Regression

▪ Sequencing involves sampling of RNA transcripts
▪ Estimated expression levels of low-, medium-, and high-expression genes are affected differently by the throughput of the RNA sequencing experiment
▪ Normalization by regression corrects the biases

20

Before Normalization After Normalization

Adapted from Bacher et al. Nature Methods 14, 584-586 (2017)

slide-21
SLIDE 21

What Can Gene Expression Tell Us?

21

Klings et al. Physiological Genomics 21, 293-298 (2005)

Up-Regulated Genes

Adapted from Rund et al. PNAS 108, E421-430

Time Series Analysis Down-Regulated Genes High Low

slide-22
SLIDE 22

Structure Behind Gene Expression Profiles

22

D’haeseleer et al. Nature Biotechnology 23, 1499-1501 (2005)

Each cluster represents a group of genes with similar functions

Hierarchical K-means Self-Organizing Map

slide-23
SLIDE 23

Identifying Disease Subtypes

23

Guinney et al. Nature Medicine 21, 1350-1356 (2015)

Gene expression data from >4,000 colorectal cancer patients

Node = Patient Edge = Similar gene expression

slide-24
SLIDE 24

Application Of Gene Expression Analysis

24

Image from http://pathview.r-forge.r-project.org/

To understand the blueprint of biological systems and diseases

slide-25
SLIDE 25

A Break From Biology: Basic Clustering Techniques

25

slide-26
SLIDE 26

An Illustration Of K-Means Clustering

26

From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

▪ Randomly select K = 3 centroids
▪ Assign points to the nearest centroid
▪ Update centroids
▪ Update point assignments
▪ Update centroids
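The assign-then-update loop illustrated on this slide can be written as a short sketch. It assumes points are equal-length numeric tuples and uses Euclidean distance. The function names, the optional `init` argument, and the `max_iter` and `seed` parameters are choices made for this illustration, not part of the lecture.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda k: math.dist(point, centroids[k]))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate point assignment and centroid update until centroids stop moving."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly selected data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(max_iter):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid = coordinate-wise mean of the cluster's members.
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster's centroid in place
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Passing `init` makes the run reproducible. Leaving it out selects random starting centroids, which is where the sensitivity to starting locations comes from.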

slide-27
SLIDE 27

Characteristics Of K-Means Clustering

▪ The number of clusters, K, is specified in advance.
▪ Euclidean distance
  • Assigning each point x to its nearest centroid m minimizes the sum of squares, ||x − m||².
▪ Always converges to a (local) minimum.
  • Poor starting centroid locations can lead to poor local minima.
▪ The model has several implicit assumptions:
  • Data points scatter around each cluster’s center.
  • The boundary between adjacent clusters is always halfway between the cluster centroids.

27

Image from https://en.wikipedia.org/wiki/K-means_clustering

slide-28
SLIDE 28

Effect Of Poor Initial Centroid Locations

28

From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

slide-29
SLIDE 29

Distance Functions

29

▪ Properties of a distance function
  • d(x, y) ≥ 0 (non-negativity)
  • d(x, y) = 0 ⇔ x = y (identity)
  • d(x, y) = d(y, x) (symmetry)
  • d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

▪ Examples of distance functions
  • Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
  • Squared Euclidean distance: $d(x, y) = \sum_i (x_i - y_i)^2$
  • Manhattan distance: $d(x, y) = \sum_i |x_i - y_i|$
  • Maximum distance: $d(x, y) = \max_i |x_i - y_i|$

▪ p-norm: $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$
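The distance functions listed on this slide translate directly into code. The sketch below assumes vectors are equal-length sequences of numbers. The function names are chosen for this illustration.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Summed squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def p_norm(x, p):
    """(sum_i |x_i|^p)^(1/p); p = 1 gives the Manhattan norm, p = 2 the Euclidean norm."""
    return sum(abs(a) ** p for a in x) ** (1 / p)


x, y = (0, 0), (3, 4)
print(euclidean(x, y), squared_euclidean(x, y), manhattan(x, y), maximum(x, y))
# prints 5.0 25 7 4
```

One caveat worth checking against the listed properties: squared Euclidean distance does not satisfy the triangle inequality. For the points 0, 1, and 2 on a line, the squared distance from 0 to 2 is 4, which exceeds 1 + 1.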

slide-30
SLIDE 30

More About Distance Functions

30

▪ Manhattan distance
  • $d(x, y) = \sum_i |x_i - y_i|$
  • Can reflect driving distance

▪ Hamming distance (edit distance restricted to substitutions)
  • For two strings s and t of equal length, d(s, t) = # of mismatched positions between the two strings.
  • Can reflect the extent of evolution between genes: more changes in sequence ~ more time has passed

ATGAGCATAACCATGCGGAT
ATGAGGATACCCATGCCGAT
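As a worked example, the mismatch count for the two sequences on this slide can be computed with a few lines. The sketch assumes equal-length strings and counts substitutions only. It does not handle insertions or deletions, which a general edit distance would.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming("ATGAGCATAACCATGCGGAT", "ATGAGGATACCCATGCCGAT"))  # prints 3
```

The three mismatches fall at positions 6, 10, and 17 (counting from 1).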

Image from https://www.quora.com/What-is-the-difference-between-Manhattan-and-Euclidean-distance-measures

slide-31
SLIDE 31

Elbow Method For Selecting K In K-Means

31

From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

Panels: K = 1 (all-data centroid), K = 2, K = 3, …, K = 4

fraction of explained variance = between-cluster variance / all-data variance

between-cluster variance = $\frac{\sum_{i=1}^{K} n_i (M_i - M)^2}{K - 1}$, where $n_i$ = size of the ith cluster, $M_i$ = centroid of the ith cluster, and $M$ = all-data centroid.

all-data variance = $\frac{\sum_{i=1}^{N} (x_i - M)^2}{N - 1}$, where $x_i$ = ith data point and $N$ = # of data points.

[Plot: fraction of explained variance (amount of explained variance) vs. number of clusters, K (model complexity)]

The elbow method chooses K where increasing complexity doesn’t yield much in return.
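The quantity plotted for the elbow method can be computed from a set of points and their cluster labels. The sketch below is a variant under a stated assumption: it uses the sum-of-squares form of the ratio and leaves out the K − 1 and N − 1 denominators from the slide's formulas, so that the value stays between 0 and 1 and is defined for K = 1. The function names are chosen for this illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def explained_fraction(points, labels):
    """Between-cluster sum of squares divided by the all-data sum of squares."""
    overall = centroid(points)
    total_ss = sum(sq_dist(p, overall) for p in points)  # zero if all points coincide
    between_ss = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        # Each cluster contributes n_i * ||M_i - M||^2.
        between_ss += len(members) * sq_dist(centroid(members), overall)
    return between_ss / total_ss


points = [(0,), (2,), (10,), (12,)]
print(explained_fraction(points, [0, 0, 0, 0]))  # K = 1: prints 0.0
print(explained_fraction(points, [0, 0, 1, 1]))  # K = 2: about 0.96
print(explained_fraction(points, [0, 1, 2, 3]))  # K = 4: prints 1.0
```

In this toy example the jump from K = 1 to K = 2 explains most of the variance and further clusters add little, which is the pattern the elbow method looks for.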

slide-32
SLIDE 32

Other Ways For Selecting K In K-Means

32

[Plot: fraction of explained variance vs. number of clusters, K, with a 95% explained-variance threshold]

▪ Choose the minimal K that explains at least 95% of the all-data variance.
▪ Choose the K that maximizes a certain objective (e.g. accuracy on testing data).

Workflow: Training, K-means clustering model (K = 2, K = 3, …, K = 4), Testing / Cross-validation

K:        2, 3, …, 4
Accuracy: 50%, 68%, …, 83%

slide-33
SLIDE 33

K-Means Clustering Of Gene Expression

33

Image from http://www2.warwick.ac.uk/alumni/services/eportfolios/hrrgak/project_overview/systems_biology/

[Plot: normalized gene expression level vs. time points, with the centroid trend]

slide-34
SLIDE 34

Hierarchical Clustering

34

Image from https://www.slideshare.net/ElenaSgis/data-preprocessing-and-unsupervised-learning-methods-in-bioinformatics

▪ Each step finds the two data points or existing clusters that are closest to each other and groups them together.
▪ The number of clusters is defined afterward by setting a cutoff on the distance.
▪ Requires a choice of distance function.
▪ Requires a choice of how to measure distance between clusters (e.g. using cluster centroids, closest members, or all members).

Cutoff

slide-35
SLIDE 35

Linkage Criteria: Distance Between Clusters

35

▪ Maximum or complete linkage: distance between the furthest data points
▪ Minimum or single linkage: distance between the closest data points
▪ Mean or average linkage: average distance over all pairs of data points
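The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. A minimal sketch, assuming each cluster is a list of equal-length numeric tuples and the underlying distance is Euclidean (function names chosen for this illustration):

```python
import math


def pair_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one point from each cluster."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]


def complete_linkage(cluster_a, cluster_b):
    """Distance between the furthest pair of points."""
    return max(pair_distances(cluster_a, cluster_b))


def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of points."""
    return min(pair_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all pairs of points."""
    d = pair_distances(cluster_a, cluster_b)
    return sum(d) / len(d)


a = [(0,), (1,)]
b = [(3,), (5,)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
# prints 2.0 5.0 3.5
```

At each step, hierarchical clustering merges the two clusters with the smallest linkage value, so the choice of criterion changes which clusters are merged first.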

slide-36
SLIDE 36

Hierarchical Clustering Of Gene Expression

36

Adapted from Chua et al. Frontiers in Bioscience 8, s913-923 (2003)

Normal Acute Severe Genes Patients Normal Acute Severe Severe A Severe B

slide-37
SLIDE 37

Application IV: Annotation of Different Cell Types

37

slide-38
SLIDE 38

Sorting Different Cell Types

38

Image from http://labs.umassmed.edu/socolovskylab/research~flow_cytometry.html

Modern cytometers contain up to 30 fluorescence detectors

slide-39
SLIDE 39

Example Of Flow Cytometry Data

39

Image from https://www.thermofisher.com/th/en/home/references/newsletters-and-journals/bioprobes-journal-of-cell-biology-applications/bioprobes-71/bioprobes-71-flow-cytometry-panel-design.html

Different fluorescence signals

slide-40
SLIDE 40

Simplifying High-Dimensional Data

40

Image from http://bigdata.csail.mit.edu/node/277
Image from http://www.stat.ucla.edu/~ybzhao/teaching/stat101c/

Panels: Dimensionality Reduction, Classification

Labels: Features, Data Points, Factor 1, Factor 2

PCA, t-SNE

slide-41
SLIDE 41

Identification Of New Cell Types

41

Unannotated cell types according to existing knowledge

Becher et al. Nature Immunology 15, 1181-1189 (2014)

Clustering process defines these as new cell types

slide-42
SLIDE 42

Automatic Cell Type Classification

42

Becher et al. Nature Immunology 15, 1181-1189 (2014)

slide-43
SLIDE 43

Deconvolution Of Cell Type Composition

43

Training

Newman et al. Nature Methods 12, 453-457 (2015)

Testing

slide-44
SLIDE 44

Cell Composition Reflects Disease State

44

Image from https://shenorrlab.github.io/bseqsc/index.html

Training New Data Deconvolution Diagnosis

slide-45
SLIDE 45

Application V: Protein-To-DNA Binding Prediction

45

slide-46
SLIDE 46

Transcription Factor (TF) Binding

46

Image from https://www.khanacademy.org/science/biology/gene-regulation/gene-regulation-in-eukaryotes/a/eukaryotic-transcription-factors

Transcription Factor

Image from http://www.cpath.pitt.edu/genoAnnot.htm

slide-47
SLIDE 47

Sequence-Based Prediction Of TF Binding

47

...ACCAGCGGCGAAGCTCGGGGCGGAGGGGT TGAGCCACATGAGGCGATGGCGACATCCCATA TATGGAGACATGGCGTGGCTGGCTGTTACATT TTGTTTTGATGAAAAGCATAACCATGCGGATG ATATTTTTATTATAGACTAGAGATGATTATTG AATAGACATGCTCTTAACCATTTTTAACTCTA ATTCCAC... ...ACCAGCGGCGAAGCTCGGGGCGGAGGGGT TGAGCCACATGAGGCGATGGCGACAGGGACCT CCGACCTTATAAGGAGACATGGCGTGGCTGGC TGTTACATTTTGTTTTGATGAAAAGCATAACC ATGCGGATGATATTTTTATTATAGACTAGAGA TGATTATTGAATAGGCCTACTTTACATGCTCT TAACCATTTTTAACTCTAATTCCAC...

Genome Sequence Expected Motif

Adapted from Mathelier et al. Cell Systems 3, 278-286 (2016)

Labels: Predicted binding sites, Downstream genes

High chance of false positives due to flexibility of binding motifs.
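The false-positive problem can be seen in a toy motif scan. The sketch below is a deliberately simple stand-in and not the method of Mathelier et al.: it assumes the expected motif is a single consensus string and reports every window of the sequence that differs from it in at most `max_mismatches` positions. The function name, the motif `TATAA`, and the example sequence are hypothetical.

```python
def scan_motif(sequence, motif, max_mismatches=0):
    """Start positions (0-based) of windows within max_mismatches of the motif."""
    width = len(motif)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


sequence = "TATAATTTAA"  # invented for illustration
print(scan_motif(sequence, "TATAA", max_mismatches=0))  # prints [0]
print(scan_motif(sequence, "TATAA", max_mismatches=1))  # prints [0, 5]
```

Allowing even one mismatch adds a second predicted site in a ten-base sequence. Across a whole genome, this kind of tolerance yields many candidate sites, only some of which are actually bound.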
slide-48
SLIDE 48

DNA Packaging Affects TF Binding

48

Image from http://images.slideplayer.com/9/2508104/slides/slide_28.jpg
Dillon et al. Trends in Genetics 18, 252-258 (2002)

Transcription factors cannot access tightly packed DNA regions.

slide-49
SLIDE 49

Local DNA 3D Structure Affects TF Binding

49

Gordan et al. Cell Reports 3, 1093-1104 (2013)

slide-50
SLIDE 50

Prediction Of TF Binding Using Physical Data

50

Mathelier et al. Cell Systems 3, 278-286 (2016)

Decision Tree with Gradient Boosting

slide-51
SLIDE 51

Application VI: Human Genome Structure

51

slide-52
SLIDE 52

Human Genome

52

Image from https://en.wikipedia.org/wiki/Human_genome#Coding_vs._noncoding_DNA

▪ 3.2 giga-base pairs ×2 on 23 pairs of chromosomes.
▪ ~0.1% variation between individuals.
▪ <1.5% codes for proteins (exons).

Image from https://bio.libretexts.org/LibreTexts/University_of_California_Davis/BIS_2A%3A_Introductory_Biology_(Easlon)/Readings/26%3A_Genomes%3A_a_Brief_Introduction

0.002% of human genome gene gene gene

slide-53
SLIDE 53

Much Of Genome Remains Unknown

▪ Using data from well-studied genomic regions to predict the structures and functions of other genomic regions in the same or in newly discovered species.

53

Image from http://www.mun.ca/biology/desmid/brian/BIOL2060/BIOL2060-18/CB18.html

▪ We studied them extensively
▪ We have some clue about what they are, but do not fully understand their impacts on our genome
▪ We don’t really know what they are or what they do

slide-54
SLIDE 54

Dynamic Bayesian Network (DBN)

54

Image from https://en.wikipedia.org/wiki/Dynamic_Bayesian_network

▪ The label of a genomic location can be predicted based on the labels and properties of nearby positions.
▪ For example, A_t may indicate the label (such as exon or intron) at location t, while B_t and C_t keep track of the properties of the genome (such as coding vs. non-coding or loosely packed vs. tightly packed) at location t.
▪ The probability of predicting a label depends on B_t and C_t.

slide-55
SLIDE 55

Histone Modifications

55

Image from http://www.peirsoncenter.com/articles/dna-methylation-in-down-syndrome
Image from http://images.slideplayer.com/9/2508104/slides/slide_28.jpg

slide-56
SLIDE 56

Inferring Labels From Genomic Signals

56

Hoffman et al. Nature Methods 9, 473-476 (2012).

Label relationship graph Histone Markers

slide-57
SLIDE 57

Decent Performance Without Using Sequence

57

Hoffman et al. Nature Methods 9, 473-476 (2012).

Tracks: Predicted labels, Known Annotation

▪ Thick bars at the bottom are translated regions (exons).
▪ There are multiple isoforms of BRD2 (i.e. different start locations and exon structures).

slide-58
SLIDE 58

Application VII: Regulation Of Gene Expression

58

slide-59
SLIDE 59

Regulation Of Gene Expression

59

Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)

Genome

ON/OFF Switch Gene Long-Range Regulator RNA Transcript

Input Signal Output Signal Tag Destroy ON ON OFF OFF Produce

slide-60
SLIDE 60

Regression Model For Gene Expression I

60

Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)

Simple linear regression: $Y = aX + b$

Multiple linear regression: $Y = a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$
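For the single-predictor form Y = aX + b, the least-squares coefficients have a closed form. The sketch below shows only that fit and is not the model of Li et al.; the function name and the toy data are made up for this illustration. The multi-predictor form needs a linear solver and is not covered here.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for Y = aX + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)  # must be non-zero: xs cannot all be equal
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Toy data lying exactly on Y = 2X + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # prints 2.0 1.0
```

In the gene expression setting, X would be a regulatory input signal and Y the measured expression level, with one fitted coefficient per input.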

slide-61
SLIDE 61

Regression Model For Gene Expression II

61

Samples: Sample #1, Sample #2, …, Sample #M

Gene #1: $Y_1 = a_{01} + a_{11} X_{11} + a_{21} X_{21} + \dots + a_{n1} X_{n1}$
Gene #2: $Y_2 = a_{02} + a_{12} X_{12} + a_{22} X_{22} + \dots + a_{n2} X_{n2}$
…
Gene #N: $Y_N = a_{0N} + a_{1N} X_{1N} + a_{2N} X_{2N} + \dots + a_{nN} X_{nN}$

Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)

Same coefficients $a_{i,j}$ across samples

slide-62
SLIDE 62

Regression Model For Gene Expression III

62

Condition A: Sample A1, Sample A2, …, Sample AM (each with Gene #1, Gene #2, …, Gene #N)
Condition B: Sample B1, Sample B2, …, Sample BP (each with Gene #1, Gene #2, …, Gene #N)

Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)

Same coefficients $a_{i,j}$ across samples, different across conditions

slide-63
SLIDE 63

Summary

63

Techniques: Generative Model, Maximum Likelihood Estimator, Clustering, Dynamic Bayesian Network, Regression, Support Vector Machine, Dimensionality Reduction, t-SNE

Image from http://physrev.physiology.org/content/89/3/921
Han et al. Nature Communications 8, 14238 (2017)
Li et al. PLoS Comp Biol 10, e1003908 (2014)
Newman et al. Nature Methods 12, 453-457 (2015)
Klings et al. Physiological Genomics 21, 293-298 (2005)
Becher et al. Nature Immunology 15, 1181-1189 (2014)
Image from https://en.wikipedia.org/wiki/Dynamic_Bayesian_network