Computational Analysis of Methylome Sequencing Data (Master Thesis Presentation)


slide-1
SLIDE 1

The Topic The Problem The Idea The Results

Computational Analysis of Methylome Sequencing Data

Master Thesis Bioinformatics Till Helge Helwig

Eberhard-Karls-University Tübingen Wilhelm-Schickard-Institut für Informatik & Max Planck Institute for Developmental Biology

February 22, 2011

Computational Analysis of Methylome Sequencing Data Till Helge Helwig 1 / 21

slide-2
SLIDE 2

Outline

1. The Topic
  • What is a Methylome?
  • Why is the Methylome Interesting?

2. The Problem
  • Obtaining the Methylome via Sequencing
  • Problems with the Common Approach

3. The Idea
  • How Can Computer Science Help?
  • Evaluated Methods

4. The Results
  • Performance Comparison
  • What Do the Results Imply?

slide-4
SLIDE 4

What is a Methylome?

The Methylome

  • Entirety of methylated nucleotides (e.g. cytosines) in the DNA
  • Addition of a methyl group converts cytosine into 5-methylcytosine

[Figure: structural formulas of cytosine and 5-methylcytosine]

slide-12
SLIDE 12

What is a Methylome?

Properties of the Methylome

  • Additional layer of information within the DNA
  • Methylations are created by methyltransferases
  • Maintenance of methylations after replication

[Figure: double-stranded DNA with methylated cytosines (m); after the strands are copied, methyltransferases (MT) restore the methylation marks on the newly synthesized strands]

  • Environmental factors influence the methylome
  • The methylome is highly variable...
      • ...between different species
      • ...between organisms of the same species
      • ...between different cell types of the same organism

slide-14
SLIDE 14

Why is the Methylome Interesting?

Transcription Inhibition

  • Methylated nucleotides can inhibit transcription
  • Relevance for different research fields:
      • Developmental biology (e.g. for association studies)
      • Medicine (e.g. for tumorigenesis)
      • Ecology (e.g. documentation of environmental changes)
      • ...

slide-18
SLIDE 18

Obtaining the Methylome via Sequencing

Making the Methylome Visible

  • Standard sequencing cannot identify methylations
  • Bisulfite treatment makes methylations visible: unmethylated cytosines are converted to uracil (U) and read as thymine (T) after PCR, while methylated cytosines (m) are left unchanged

[Figure: a DNA double strand with two methylated cytosines (m), shown before bisulfite treatment, after bisulfite treatment (unmethylated C replaced by U) and after PCR (U replaced by T). Labels: forward, reverse, forward complement, reverse complement, complement synthesis; unmethylated vs. methylated; 1:1:1:1 vs. 0:2:2:0; 100% CG vs. 50% CG]

  • After the treatment, sequencing reports a cytosine only where the cytosine was methylated
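The effect of bisulfite treatment followed by PCR can be imitated on a plain string. The sketch below is only an illustration of the conversion rule described on this slide; the function name, the example sequence and the set of methylated positions are hypothetical and not taken from the thesis.

```python
def bisulfite_convert(strand, methylated_positions):
    """Return how one strand reads after bisulfite treatment and PCR.

    Unmethylated C becomes U and is amplified as T; methylated C stays C.
    `methylated_positions` is a set of 0-based indices (hypothetical input).
    """
    converted = []
    for index, base in enumerate(strand):
        if base == "C" and index not in methylated_positions:
            converted.append("T")  # unmethylated cytosine is lost
        else:
            converted.append(base)  # methylated C and all other bases survive
    return "".join(converted)


original = "ACGTACTG"
# Assume only the cytosine at index 1 is methylated.
treated = bisulfite_convert(original, methylated_positions={1})
print(original)
print(treated)
```

Comparing the two printed strings position by position shows which cytosines survived the treatment, which is the information a bisulfite read carries.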

slide-20
SLIDE 20

Obtaining the Methylome via Sequencing

Sequencing Protocol

  • Bisulfite treatment is inserted into the sequencing protocol

[Figure: workflow. Tissue sample → DNA extraction → DNA double strand → bisulfite treatment → PCR → Solexa sequencing → sequencing reads → mapping → reads per position]

  • Methylation rates are calculated from the read counts per position
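A minimal sketch of the last step, assuming the usual convention that the methylation rate at a cytosine is the share of mapped reads that still show C among all reads showing C or T there. The slides do not spell out the formula, so this definition, the function name and the example counts are assumptions.

```python
from typing import Optional


def methylation_rate(c_reads: int, t_reads: int) -> Optional[float]:
    """Fraction of reads that report C at a reference cytosine.

    Returns None when the position is not covered by any C or T read.
    """
    covered = c_reads + t_reads
    if covered == 0:
        return None
    return c_reads / covered


# Hypothetical per-position counts: position -> (reads with C, reads with T)
read_counts = {101: (9, 1), 205: (0, 12), 300: (0, 0)}
rates = {pos: methylation_rate(c, t) for pos, (c, t) in read_counts.items()}
for pos, rate in sorted(rates.items()):
    print(pos, rate)
```

Uncovered positions are reported as None rather than 0.0, so that "no data" is not confused with "unmethylated".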

slide-24
SLIDE 24

Problems with the Common Approach

Methylome Sequencing is Imprecise

  • Bisulfite treatment: has a significant conversion error rate.
    ⇒ Can be estimated from the mitochondrial DNA.
  • PCR: might have a preference for certain strands.
    ⇒ Difficult to take into account.
  • Sequencing: sometimes reports wrong nucleotides.
    ⇒ An accuracy value is reported as well.
  • Mapping: problematic due to repetitive regions and reduced sequence complexity.
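The first point can be sketched as follows, under the assumption that the control sequence (here the mitochondrial DNA) carries no methylation, so every C read at one of its cytosines is a failed conversion. The function name and the counts are hypothetical.

```python
def non_conversion_rate(control_counts):
    """Estimate the bisulfite conversion error from an unmethylated control.

    `control_counts` holds (c_reads, t_reads) pairs for the cytosines of the
    control sequence. Assumption: the control is fully unmethylated, so any
    C read there is a conversion failure.
    """
    control_counts = list(control_counts)  # allow generators as input
    failed = sum(c for c, _ in control_counts)
    total = sum(c + t for c, t in control_counts)
    if total == 0:
        raise ValueError("no coverage on the control sequence")
    return failed / total


# Hypothetical counts at three cytosines of the control sequence
control = [(1, 99), (0, 100), (3, 97)]
error_rate = non_conversion_rate(control)
print(f"estimated conversion error: {error_rate:.4f}")
```

An estimate like this gives a baseline: a genomic position whose C fraction is not clearly above the error rate is weak evidence for methylation.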

slide-29
SLIDE 29

How Can Computer Science Help?

Improvement via Machine Learning

  • Methyltransferases need some form of binding sites
  • Binding sites are patterns in the DNA nucleotide sequence
  • Patterns can be learned in order to be recognized in new data

Idea: Use machine learning to obtain an additional confidence measure based on sequence patterns.

slide-32
SLIDE 32

How Can Computer Science Help?

Requirements

  • Needs to handle full genomes
  • Will be used on newly sequenced genomes
    ⇒ Should not rely on more than the nucleotide sequence
  • Quantification of the likelihood for candidate nucleotides to be methylated
    ⇒ Confidence score between 0.0 and 1.0
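If only the nucleotide sequence may be used, a natural input for a classifier is the sequence window around each candidate cytosine. The sketch below shows one way to produce such windows; the window size, the function name and the restriction to the forward strand are assumptions made for illustration, not details from the thesis.

```python
def cytosine_windows(sequence, flank):
    """Yield (position, window) for each forward-strand C.

    Each window has `flank` bases on both sides of the cytosine. Cytosines
    closer than `flank` to either end of the sequence are skipped.
    """
    seq = sequence.upper()
    for pos in range(flank, len(seq) - flank):
        if seq[pos] == "C":
            yield pos, seq[pos - flank:pos + flank + 1]


# Toy sequence; a real genome would be streamed chromosome by chromosome.
windows = list(cytosine_windows("ACGTACTGCA", flank=2))
print(windows)
```

Because the function is a generator, windows are produced one at a time, which keeps memory use flat when it is run over full genomes.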

slide-34
SLIDE 34

Evaluated Methods

Dataset for Training and Test

Problem: No dataset available with confirmed methylations.
Solution: Manual creation of a high-confidence dataset.

[Figure: data from Lister et al., Cokus et al. and in-house sequencing feed into a set of high-confidence positions, from which a balanced and an unbalanced dataset are derived; each dataset covers intron, exon and intergenic regions]
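The slides do not state the rule by which positions were judged "high-confidence". One plausible criterion, assumed here purely for illustration, is agreement between the data sources: a position is kept only if enough sources call it methylated. All names and positions below are hypothetical.

```python
from collections import Counter


def high_confidence_positions(calls_by_source, min_sources):
    """Return positions called methylated by at least `min_sources` sources.

    `calls_by_source` maps a source name to the set of positions that the
    source reports as methylated. The agreement rule is an assumption.
    """
    votes = Counter(
        pos for calls in calls_by_source.values() for pos in calls
    )
    return {pos for pos, count in votes.items() if count >= min_sources}


calls = {
    "lister": {10, 25, 40, 77},
    "cokus": {10, 40, 91},
    "in_house": {10, 25, 40},
}
unanimous = high_confidence_positions(calls, min_sources=3)
print(sorted(unanimous))
```

Lowering `min_sources` trades label reliability for more training examples.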

slide-40
SLIDE 40

Evaluated Methods

Experimental Setup

  • Support Vector Machines
  • 3 different kernels:
      • k-Spectrum Kernel (considers substring occurrences in the input strings)
      • Extension of the k-Spectrum Kernel (additionally considers the positions of the substrings)
      • Weighted Degree String Kernel with shifts (adds weights to account for substring shifts, substring lengths and substring positions)
  • Prediction of methylations on the whole genome with the best classifiers
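To make the kernels concrete, here is a small sketch of two of them in their textbook form. The k-spectrum kernel is the inner product of k-mer count vectors. The second function is the plain weighted degree kernel without the shift extension named on the slide, using the commonly cited weights beta_d = 2(D - d + 1) / (D(D + 1)). The thesis implementation is not shown in the slides, so function names, parameters and example strings are assumptions.

```python
from collections import Counter


def spectrum(s, k):
    """Count every substring of length k in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(x, y, k):
    """k-spectrum kernel: inner product of the two k-mer count vectors."""
    counts_x, counts_y = spectrum(x, k), spectrum(y, k)
    return sum(n * counts_y[kmer] for kmer, n in counts_x.items())


def weighted_degree_kernel(x, y, max_degree):
    """Weighted degree kernel WITHOUT shifts (a simplification).

    Counts substrings of length 1..max_degree that match at the same
    position in both strings, with shorter substrings weighted higher.
    """
    if len(x) != len(y):
        raise ValueError("weighted degree kernel needs equal-length strings")
    total = 0.0
    for d in range(1, max_degree + 1):
        beta = 2 * (max_degree - d + 1) / (max_degree * (max_degree + 1))
        matches = sum(
            x[i:i + d] == y[i:i + d] for i in range(len(x) - d + 1)
        )
        total += beta * matches
    return total


print(spectrum_kernel("ACGCG", "CGCGA", k=2))
print(weighted_degree_kernel("ACGT", "ACGA", max_degree=2))
```

The spectrum kernel ignores where a k-mer occurs, while the weighted degree kernel only rewards matches at identical positions; the shift extension on the slide relaxes that by also allowing matches that are displaced by a few positions.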

slide-42
SLIDE 42

Performance on the Unbalanced Dataset

slide-47
SLIDE 47

Performance Comparison

Obtaining the Confidence Value

  • The three best classifiers (WDSK with shifts) were used to predict methylations for the whole genome
  • Classifiers trained on the balanced dataset performed badly (59% methylation rate)
  • Classifiers trained on the unbalanced dataset report a 6% methylation rate (7% expected)
  • The SVM calculates a confidence value
  • However: few of the reported methylated positions occur in the original datasets
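The slides do not say how the SVM output was turned into a confidence value. A common approach, assumed here for illustration only, is to squash the signed decision value through a logistic function whose parameters `a` and `b` would normally be fitted on held-out data (Platt-style scaling). The defaults below are placeholders, not fitted values.

```python
import math


def confidence(decision_value, a=1.0, b=0.0):
    """Map a signed SVM decision value to a score between 0.0 and 1.0.

    `a` and `b` are placeholder parameters; in practice they would be
    fitted on held-out data. Written to avoid overflow for large inputs.
    """
    z = a * decision_value + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Hypothetical decision values for four candidate cytosines
for value in (-2.0, -0.1, 0.0, 3.5):
    print(value, round(confidence(value), 3))
```

Positions far on the positive side of the decision boundary get scores near 1.0 and positions far on the negative side near 0.0, which matches the score range required earlier in the talk.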

slide-52
SLIDE 52

What Do the Results Imply?

What Did We Learn?

Biology
  • Methylation state is to some degree reflected by the neighboring nucleotides
  • No unique patterns identify methylated positions
  • Methylated positions have different properties in different genomic regions

Bioinformatics
  • Application of supervised learning methods requires more reliable datasets
  • Unbalanced data is more realistic but leads to additional complexity

slide-58
SLIDE 58

What Do the Results Imply?

A Look Into the Crystal Ball

  • Research toward validation of methylome datasets
  • More extensive study using more parameter values and more complex features
  • Relaxation of the confidence threshold in example selection
  • Thorough analysis of methylome variability between species, organisms and cell types
  • Unsupervised learning methods
  • Recent research promises methylome data as a byproduct of standard sequencing

slide-59
SLIDE 59

Thank you for your attention!

Acknowledgements

  • Prof. Dr. Daniel Huson
  • Prof. Dr. Detlef Weigel
  • Dr. Karsten Borgwardt
  • MLCB group
  • WeigelWorld
  • Jörg Hagmann

Most important sources:

  • S. J. Cokus, S. Feng, and S. E. Jacobsen. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature, 452(7184):215–219, 2008.
  • R. Lister, R. C. O'Malley, and J. R. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3):523–536, 2008.
  • G. Rätsch, S. Sonnenburg, and B. Schölkopf. RASE: recognition of alternatively spliced exons in C. elegans. Bioinformatics, 21(suppl 1):i369, 2005.
  • K. Schneeberger, J. Hagmann, and D. Weigel. Simultaneous alignment of short reads against multiple genomes. Genome Biology, 10:R98, 2009.
  • B. Schölkopf and A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, 2002.