Applications of Pattern Recognition in Computational Biology
Pattern Recognition Course (2110597)
Chulalongkorn University
August 22nd, 2017
Instructor: Sira Sriswasdi (สิระ ศรีสวัสดิ์)
1
89% accuracy vs 73%
2
Image from https://www.systemsbiology.org/about/what-is-systems-biology/
3
4
“Information Processing in Biology”
ACCAGCGGCGAAGCTCGGGGCGGAGGGGTTGA GCCACATGAGGCGATGGCGACAATGAGGCGAG ACATGGCGTGGCTGGCTGTTACATTTTGTTTT GATGAAAAGCATAACCATGCGGATGATATTTT TATTATAGACTAGAGATGATTATTGAATAGAC ATGCTCTTAACCATTTTTAACTCTAATTCCAC
Figure labels: Genes, Conditions
Image from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
Image from http://www.cubocube.com/files/images/opengenetics/chapter11/image2.png
5
Metabolites, hormones, and
DNA sequencing, genetic mapping, recombinant DNA
Protein identification and quantification, post-translational modification
RNA sequencing, RNA expression, transcriptional regulation
DNA methylation, histone modification
6
7
Image adapted from http://physrev.physiology.org/content/89/3/921
CGGCGAAGCTCGGGGCGGAGGGGTTGATTTTTAACTCTAATT... AGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATG GCGTGGCTGGCTGTTACATTTTGTTTTGATGAAATTTTTAACTCTAATTC... CAGCGGCGAAGCTCGGGGCGGAGGGGTTTATTTTTATTATAGACTAGAGATGATTATTGAATAGAC ATGCTCTTAACCATTTTTAACTCTAATTCC... GCGAAGCTCGTTAACCATGCGGATGATATTTTTATTATAGACTAGAGATGATTATTGAATAGACAT GCTCTTAACCATTTTTAACTCTAA... CCAGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACA TGGCGTGGCTGGCTGTTACATTTTGTTTTGATGAAAAGCATAACCATGCGGATGATATTTTTATTA TAGACTAGAGATGATTATTGAATAGACATTTTAACTCTAATTCCA... GCGGCGAAGCTCGGGGCGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATGG CGTGGCTGGCTGTTACATTTTGTTTAACTCTAAT... GCGGCGAAGCTCGGGGCGGAGGGGTTGAGCAAAAGCATAACCATGCGGATGATATTTTTATTATAG ACTAGAGATGATTATTGAACTCTAAA... CGGAGGGGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGA CATGGCGTGGCTGGCTGTTAC... CCAGCGGCGAAGCGGCGATGGCGACAATGAGGCGAGACATGGCGTGGCTGGCTGTTACATTTTGTT AATCGGGGCGGAGGGGTTGAGCCACATGAGCATAACCATGCGGATGATATTTTTATTATAGACTAG AGATGATTATTGAATAGACATTTTAACTCTAATTCCA... CCAGCGGCGAAGCTCGGGGCGGAGGGGTTGAGCATAACCATGCGGATGATATTTTTATTATAGACT AGAGATGATTATTGAATAGACATTTTAACTCTAATTCCA... CCAGCGGCGAAGCTCGTTGAGCCACATGAGGCGATGGCGACAATGAGGCGAGACATGGCGTGGCTG GCTGTTACATTTTGTTTTGATGAAAAGCATTCTAATTCCA...
8
Image adapted from http://physrev.physiology.org/content/89/3/921 and https://www.quora.com/Arent-all-sequences-homologous
Time
AAGACTT CAGCGCT CAGCACA TAGCCCT TACCCCA AAGGCCA TAGCCCA CAGCACT TAGGCCA
Ancestral Sequences (Inferred) Present Sequence (Observed)
9
Time
Image from http://physrev.physiology.org/content/89/3/921
10
Plot: base pairs (y-axis, ticks from 5E+10 to 2.5E+11) over time (x-axis, Dec-82 to Jul-16)
Plotted with data from https://www.ncbi.nlm.nih.gov/genbank/statistics/
11
12
13
Image adapted from https://www.theodysseyonline.com/why-people-migrate
Image adapted from https://wiki.uiowa.edu/display/2360159/Autosomal+Inheritance
14
Han et al. Nature Communications 8, 14238 (2017)
Image from http://uvmgg.wikia.com/wiki/SNP
774,516 individuals, 709,358 SNPs
A T T C … G
C G G A … T
C G G A … G
G G T C … A
Connect individuals that share a significant portion of consecutive SNPs.
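Linking individuals who share long runs of consecutive SNPs can be sketched as below. This is a toy illustration only, not the pipeline of Han et al.: the function names, the `min_run` threshold, and the four tiny genotypes are all assumptions made for the example.

```python
from itertools import combinations

def longest_shared_run(a, b):
    """Length of the longest stretch of consecutive SNP positions
    at which individuals a and b carry the same allele."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

def build_network(genotypes, min_run):
    """Return edges (i, j) between individuals whose longest shared
    run of consecutive SNPs is at least min_run (hypothetical cutoff)."""
    edges = []
    for (i, a), (j, b) in combinations(genotypes.items(), 2):
        if longest_shared_run(a, b) >= min_run:
            edges.append((i, j))
    return edges

# Hypothetical toy data: 4 individuals, 8 SNPs each
genotypes = {
    "ind1": "ATTCGGAC",
    "ind2": "ATTCGTAC",
    "ind3": "GCTCGGAC",
    "ind4": "GCGGATCA",
}
edges = build_network(genotypes, min_run=5)
print(edges)
```

At the scale on the slide (hundreds of thousands of individuals), an all-pairs comparison like this would be far too slow; it is shown only to make the idea concrete.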
15
Han et al. Nature Communications 8, 14238 (2017)
16
Image from http://poshrx.com/23andme-is-back-on/
17
18
Image adapted from https://www.khanacademy.org/science/biology/gene-expression-central-dogma/transcription-of-dna-into-rna/a/overview-of-transcription
Image from https://support.illumina.com/sequencing/sequencing_instruments/hiseq-4000.html
19
Image adapted from http://bio.lundberg.gu.se/courses/vt13/rnaseq.html/
Image adapted from https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/
20
Adapted from Bacher et al. Nature Methods 14, 584-586 (2017)
21
Klings et al. Physiological Genomics 21, 293-298 (2005)
Time Series Analysis
Up-Regulated Genes
Down-Regulated Genes
High / Low
Adapted from Rund et al. PNAS 108, E421-430
22
D’haeseleer et al. Nature Biotechnology 23, 1499-1501 (2005)
Hierarchical, K-means, Self-Organizing Map
23
Guinney et al. Nature Medicine 21, 1350-1356 (2015)
Node = Patient
Edge = Similar gene expression
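One way to build such a patient network is to connect two patients when their expression profiles are strongly correlated. The sketch below assumes Pearson correlation as the similarity measure and a hypothetical cutoff `min_corr`; the cited study may define similarity differently, and the patient names and values are made up.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def similarity_edges(expression, min_corr):
    """Nodes are patients; an edge joins two patients whose
    profiles correlate at or above min_corr."""
    return [
        (p, q)
        for p, q in combinations(sorted(expression), 2)
        if pearson(expression[p], expression[q]) >= min_corr
    ]

# Hypothetical expression levels of 4 genes in 3 patients
expression = {
    "patient_A": [1.0, 2.0, 3.0, 4.0],
    "patient_B": [2.0, 4.0, 6.0, 8.5],
    "patient_C": [4.0, 3.0, 2.0, 1.0],
}
edges = similarity_edges(expression, min_corr=0.9)
print(edges)
```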
24
Image from http://pathview.r-forge.r-project.org/
To understand the blueprint of biological systems and diseases
25
26
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
1. Randomly select K=3 centroids
2. Assign points to nearest centroid
3. Update centroids
4. Update point assignments
5. Update centroids
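The assign-and-update loop can be sketched in plain Python as follows. This is a minimal illustration, assuming squared Euclidean distance, initial centroids drawn from the data points, and a fixed random seed; the function names and the toy points are not from the slides.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # randomly select K centroids
    assignment = None
    for _ in range(max_iter):
        # assign every point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # nothing changed: converged
            break
        assignment = new_assignment
        # move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignment

# Hypothetical 2-D data: three loose groups of three points
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (20.0, 0.0), (20.0, 1.0), (21.0, 0.0),
]
centroids, labels = kmeans(points, k=3)
print(centroids, labels)
```

The result depends on the random initial centroids, so K-means can settle in a poor local optimum; rerunning with several seeds and keeping the best result is a common precaution.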
27
Image from https://en.wikipedia.org/wiki/K-means_clustering
28
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
29
p-norm: $\lVert x \rVert_p = \left( \sum_i |x_i|^p \right)^{1/p}$
30
Image from https://www.quora.com/What-is-the-difference-between-Manhattan-and-Euclidean-distance-measures
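The distance induced by the p-norm, with Manhattan (p = 1) and Euclidean (p = 2) as special cases, can be sketched as below. The function name and the example points are assumptions for illustration.

```python
def p_norm_distance(x, y, p):
    """Distance between points x and y under the p-norm (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = p_norm_distance(x, y, 1)   # |3| + |4|
euclidean = p_norm_distance(x, y, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```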
31
From https://www.naftaliharris.com/blog/visualizing-k-means-clustering/
K = 1 (all-data centroid), K = 2, K = 3, …
$$\text{fraction of explained variance} = \frac{\text{between-cluster variance}}{\text{all-data variance}}$$
$$\text{between-cluster variance} = \frac{\sum_{i=1}^{K} n_i (M_i - M)^2}{K - 1}$$
where $n_i$ = size of the ith cluster, $M_i$ = centroid of the ith cluster, and $M$ = all-data centroid.
$$\text{all-data variance} = \frac{\sum_{i=1}^{N} (x_i - M)^2}{N - 1}$$
where $x_i$ = ith data point and $N$ = number of data points.
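The two variances and their ratio can be computed directly from a clustering. The sketch below follows the formulas as written on the slide (divisors K − 1 and N − 1, so it needs K ≥ 2); the `use_divisors` flag, the function names, and the toy 1-D data are assumptions added for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def explained_variance_fraction(clusters, use_divisors=True):
    """clusters: list of K >= 2 lists of points.
    use_divisors=True applies the K - 1 and N - 1 divisors as on the slide;
    False returns the plain ratio of sums of squares."""
    all_points = [p for cluster in clusters for p in cluster]
    K, N = len(clusters), len(all_points)
    M = centroid(all_points)                       # all-data centroid
    between = sum(len(c) * sq_dist(centroid(c), M) for c in clusters)
    total = sum(sq_dist(x, M) for x in all_points)
    if use_divisors:
        between /= (K - 1)
        total /= (N - 1)
    return between / total

# Hypothetical 1-D data split into two clusters
clusters = [[(0.0,), (2.0,)], [(4.0,), (6.0,)]]
fraction = explained_variance_fraction(clusters)
ss_ratio = explained_variance_fraction(clusters, use_divisors=False)
print(fraction, ss_ratio)
```

With the K − 1 and N − 1 divisors the ratio is not confined to [0, 1] (about 2.4 for this toy data); the plain ratio of sums of squares (about 0.8 here) always lies between 0 and 1.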
Plot: fraction of explained variance vs. number of clusters, K (model complexity vs. amount of explained variance).
The elbow method chooses the K beyond which increasing model complexity yields little additional explained variance. Example: K = 4.
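One simple way to automate the elbow choice is to stop at the last K before the gain in explained variance drops below a cutoff. This is only one of several possible elbow criteria; the cutoff `min_gain` and the explained-variance values below are hypothetical.

```python
def elbow_k(explained, min_gain=0.05):
    """explained: dict mapping K -> fraction of explained variance.
    Returns the last K before the gain from adding one more
    cluster falls below min_gain (hypothetical cutoff)."""
    ks = sorted(explained)
    for prev, k in zip(ks, ks[1:]):
        if explained[k] - explained[prev] < min_gain:
            return prev
    return ks[-1]

# Hypothetical curve that flattens after K = 4
explained = {1: 0.0, 2: 0.62, 3: 0.81, 4: 0.96, 5: 0.97}
k_elbow = elbow_k(explained)
print(k_elbow)
```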
32
Plot: fraction of explained variance vs. number of clusters, K, with a threshold at 95% explained variance.
Choose the minimal K that explains at least 95% of the all-data variance.
Alternative: train a K-means clustering model for each K, then evaluate it by testing / cross-validation.
K: 2, 3, …, 4
Accuracy: 50%, 68%, …, 83%
Choose the K that maximizes a chosen objective (e.g. accuracy on testing data).
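Both selection rules reduce to a few lines once the per-K scores are available. In the sketch below the explained-variance values are hypothetical, the accuracies are the ones listed on the slide, and the function name is an assumption.

```python
def minimal_k(explained, threshold=0.95):
    """Smallest K whose fraction of explained variance reaches
    the threshold, or None if no K does."""
    for k in sorted(explained):
        if explained[k] >= threshold:
            return k
    return None

# Rule 1: minimal K reaching 95% explained variance (hypothetical values)
explained = {1: 0.0, 2: 0.62, 3: 0.81, 4: 0.96, 5: 0.97}
k_by_variance = minimal_k(explained)

# Rule 2: K that maximizes an objective, here accuracy on testing data
accuracy = {2: 0.50, 3: 0.68, 4: 0.83}
k_by_accuracy = max(accuracy, key=accuracy.get)

print(k_by_variance, k_by_accuracy)
```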
33
Image from http://www2.warwick.ac.uk/alumni/services/eportfolios/hrrgak/project_overview/systems_biology/
Plot labels: Time Points; Normalized Gene Expression Level; Centroid Trend
34
Image from https://www.slideshare.net/ElenaSgis/data-preprocessing-and-unsupervised-learning-methods-in-bioinformatics
Cutoff
35
Furthest data points; Closest data points; Average
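Read as ways of measuring the distance between two clusters in hierarchical clustering, these three options correspond to what is commonly called complete, single, and average linkage. The sketch below assumes Euclidean distance and takes "average" to mean the mean over all cross-cluster pairs; the function name and the toy clusters are illustrative.

```python
from itertools import product
from math import dist

def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under a given linkage rule."""
    pair_distances = [dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "complete":   # furthest data points
        return max(pair_distances)
    if method == "single":     # closest data points
        return min(pair_distances)
    if method == "average":    # average over all cross-cluster pairs
        return sum(pair_distances) / len(pair_distances)
    raise ValueError(f"unknown method: {method}")

# Hypothetical clusters on a line
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
complete = linkage_distance(a, b, "complete")
single = linkage_distance(a, b, "single")
average = linkage_distance(a, b, "average")
print(complete, single, average)
```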
36
Adapted from Chua et al. Frontiers in Bioscience 8, s913-923 (2003)
Figure labels: Genes; Patients; patient groups Normal, Acute, Severe (Severe subdivided into Severe A and Severe B)
37
38
Image from http://labs.umassmed.edu/socolovskylab/research~flow_cytometry.html
Modern cytometers contain up to 30 fluorescence detectors
39
Image from https://www.thermofisher.com/th/en/home/references/newsletters-and-journals/bioprobes-journal-of-cell-biology-applications/bioprobes-71/bioprobes-71-flow-cytometry-panel-design.html
40
Image from http://bigdata.csail.mit.edu/node/277
Image from http://www.stat.ucla.edu/~ybzhao/teaching/stat101c/
Plot axes: Factor 1, Factor 2
41
Becher et al. Nature Immunology 15, 1181-1189 (2014)
42
Becher et al. Nature Immunology 15, 1181-1189 (2014)
43
Newman et al. Nature Methods 12, 453-457 (2015)
44
Image from https://shenorrlab.github.io/bseqsc/index.html
45
46
Image from https://www.khanacademy.org/science/biology/gene-regulation/gene-regulation-in-eukaryotes/a/eukaryotic-transcription-factors
Image from http://www.cpath.pitt.edu/genoAnnot.htm
47
...ACCAGCGGCGAAGCTCGGGGCGGAGGGGT TGAGCCACATGAGGCGATGGCGACATCCCATA TATGGAGACATGGCGTGGCTGGCTGTTACATT TTGTTTTGATGAAAAGCATAACCATGCGGATG ATATTTTTATTATAGACTAGAGATGATTATTG AATAGACATGCTCTTAACCATTTTTAACTCTA ATTCCAC... ...ACCAGCGGCGAAGCTCGGGGCGGAGGGGT TGAGCCACATGAGGCGATGGCGACAGGGACCT CCGACCTTATAAGGAGACATGGCGTGGCTGGC TGTTACATTTTGTTTTGATGAAAAGCATAACC ATGCGGATGATATTTTTATTATAGACTAGAGA TGATTATTGAATAGGCCTACTTTACATGCTCT TAACCATTTTTAACTCTAATTCCAC...
Adapted from Mathelier et al. Cell Systems 3, 278-286 (2016)
48
Image from http://images.slideplayer.com/9/2508104/slides/slide_28.jpg
Dillon et al. Trends in Genetics 18, 252-258 (2002)
Transcription factors cannot access tightly packed DNA regions.
49
Gordan et al. Cell Reports 3, 1093-1104 (2013)
50
Mathelier et al. Cell Systems 3, 278-286 (2016)
51
52
Image from https://en.wikipedia.org/wiki/Human_genome#Coding_vs._noncoding_DNA
Image from https://bio.libretexts.org/LibreTexts/University_of_California_Davis/BIS_2A%3A_Introductory_Biology_(Easlon)/Readings/26%3A_Genomes%3A_a_Brief_Introduction
53
Image from http://www.mun.ca/biology/desmid/brian/BIOL2060/BIOL2060-18/CB18.html
54
Image from https://en.wikipedia.org/wiki/Dynamic_Bayesian_network
55
Image from http://www.peirsoncenter.com/articles/dna-methylation-in-down-syndrome
Image from http://images.slideplayer.com/9/2508104/slides/slide_28.jpg
56
Hoffman et al. Nature Methods 9, 473-476 (2012).
57
Hoffman et al. Nature Methods 9, 473-476 (2012).
58
59
Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)
Genome
Figure labels: Input Signal, Output Signal; actions: Tag, Destroy, Produce; states: ON, OFF
60
Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)
61
Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)
62
Adapted from Li et al. PLoS Comp Biol 10, e1003908 (2014)
63
Image from http://physrev.physiology.org/content/89/3/921
Han et al. Nature Communications 8, 14238 (2017)
Li et al. PLoS Comp Biol 10, e1003908 (2014)
Newman et al. Nature Methods 12, 453-457 (2015)
Klings et al. Physiological Genomics 21, 293-298 (2005)
Becher et al. Nature Immunology 15, 1181-1189 (2014)
Image from https://en.wikipedia.org/wiki/Dynamic_Bayesian_network