UHTS: Raw data Jacques Rougemont Bioinformatics and Biostatistics - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Jacques Rougemont Bioinformatics and Biostatistics Core Facility EPFL

UHTS: Raw data

slide-2
SLIDE 2

Objectives

  • Concentrate on Solexa/Illumina technology
  • Describe the steps from imaging colonies to mapping/assembling sequence tags
  • Understand the content and structure of the output files
  • Study possible sources of systematic bias and find remedies to some of them

slide-3
SLIDE 3

Terminology

  • colony: set of identical sequences obtained on the flow-cell by amplification of a template
  • (sequencing) cycle: attempt to incorporate the next nucleotide of every complementary strand
  • (color) channel: 1 of 4 imaged colors, corresponding to the fluorophore associated with one base (e.g. A)
  • read: sequence output representing the colony
  • base calling: algorithm constructing the reads from the measurements

[Figure: identical template strands of a colony (CACGTGGTCATG ... GTGCGTGGTAAA) and the read being built from them (TTTA...)]

slide-4
SLIDE 4

Images

  • Each sequencing cycle produces 4 images for each of the 100 tiles
  • DNA colonies must be located, quantified, and tracked across image stacks (~100’000 colonies/image)
  • Each colony, at each cycle, generates a quadruplet of fluorescence intensities
  • Naively: the highest of the 4 values determines the base
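The naive rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the A, C, G, T channel order is an assumption.

```python
BASES = "ACGT"  # assumed channel order of each intensity quadruplet


def naive_call(quadruplet):
    """Return the base whose channel has the highest intensity."""
    best = max(range(4), key=lambda i: quadruplet[i])
    return BASES[best]


def naive_read(cycles):
    """Build a read from one quadruplet per sequencing cycle."""
    return "".join(naive_call(q) for q in cycles)


# Two illustrative quadruplets (one per cycle) for a single colony.
example = [
    (13.5, 43.4, 2021.8, 1180.6),
    (1497.0, 918.4, -6.6, 13.5),
]
read = naive_read(example)
```

The rule ignores how close the second-highest channel is to the highest one, which is why the biases discussed later matter.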

slide-5
SLIDE 5

Solexa/Illumina file structure

slide-8
SLIDE 8

Quality scores

slide-10
SLIDE 10

Summary of data

[Figure panels: Quality, Sequence, Intensities]

slide-15
SLIDE 15

Summary of data

There are ~10M such plots...

[Figure panels: Quality, Sequence, Intensities]

slide-16
SLIDE 16

Global look

File s_4_0001_int: rows are colonies, columns are the A C G T ... intensities

cycle 1:

#CH4:OBJ130954
       A        C        G        T
    13.5     43.4   2021.8   1180.6
   -27.6    -51.1   2531.9   1699.1
  -143.9    -43.0   2575.9   1133.8
    -9.3   -262.8   2657.1   1639.4
  -107.8    -27.3   1968.3   1320.1
   -20.5    -45.4   2312.2    862.1
   105.8    -38.9   1938.7    966.6
    52.2    201.4   1934.9   1198.6
    77.1     24.3   2467.7   1102.2
   637.6    198.8   2501.5   1500.8
   -15.2     18.6   2401.9   1053.9
#END CYCLE 1

cycle 2:

       A        C        G        T
    11.1     43.2   1875.0   1049.2
   -48.5    -56.7     63.0   1349.1
  -257.2     -5.8     59.8   1176.6
  -129.3    646.7    920.1    557.2
   964.1    540.7   1436.3   1015.0
  1497.0    918.4     -6.6     13.5
    14.1     34.4   1751.6    903.7
  1337.1    772.6    199.4    893.7
   153.9    223.8     23.4    937.0
   313.7    579.3     41.0    663.9
   688.1    347.3    655.9   1194.2

slide-19
SLIDE 19

Global look

  • Each colony is a point in 4D intensity space at each cycle
  • The naive interpretation was optimistic

slide-20
SLIDE 20

Bias 1: optical effects

  • False image from measured intensities as a function of x-y coordinates on the tile
  • There are obvious boundary effects, stronger in some color channels
  • We can correct this effect by fitting a position-dependent baseline
  • There are other position-dependent issues, like spot overlaps
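
One simple way to approximate a position-dependent baseline is sketched below. It swaps the smooth fit mentioned on the slide for a cruder piecewise-constant estimate: the tile is cut into square cells, and the median intensity of each cell (for one channel at one cycle) is subtracted from the colonies in that cell. The function name, the cell count and the use of the median are assumptions made for illustration. The median is a reasonable background estimate only if most colonies in a cell are dark in that channel.

```python
from collections import defaultdict
from statistics import median


def subtract_grid_baseline(points, tile_size, n_cells):
    """Subtract a per-cell median baseline.

    points: list of (x, y, intensity) for one channel and one cycle.
    tile_size: side length of the (square) tile, in the units of x and y.
    n_cells: number of grid cells along each axis.
    """
    cell = tile_size / n_cells

    def cell_of(x, y):
        # Clamp so that points lying exactly on the far edge stay in the grid.
        return (min(int(x / cell), n_cells - 1),
                min(int(y / cell), n_cells - 1))

    buckets = defaultdict(list)
    for x, y, value in points:
        buckets[cell_of(x, y)].append(value)

    baseline = {key: median(values) for key, values in buckets.items()}
    return [(x, y, value - baseline[cell_of(x, y)]) for x, y, value in points]


# Hypothetical tile: the left cell has a raised background of about 100.
points = [
    (10, 10, 100.0), (20, 20, 105.0), (30, 30, 95.0), (40, 40, 1100.0),
    (60, 60, 0.0), (70, 70, 5.0), (80, 80, -5.0), (90, 90, 1000.0),
]
corrected = subtract_grid_baseline(points, tile_size=100, n_cells=2)
```

After correction the two bright colonies end up with the same intensity although their raw values differed by the local background.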
slide-21
SLIDE 21

Bias 2: sticky fluorophores

T fluorophores stick to the surface of the flow cell

slide-23
SLIDE 23

Bias 3: color cross-talk and decay

  • Fluorophore spectra overlap
  • Some intensity pairs are correlated
  • We can use a basis transform in 4D space to reduce correlations
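
A basis transform of this kind can be sketched as solving a small linear system: if the observed quadruplet is a cross-talk matrix times the "pure" signal, then the pure signal is recovered by inverting that matrix. The matrix values below are hypothetical (they are not taken from the slides), and how to estimate them from data is not covered here.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Hypothetical cross-talk matrix in A, C, G, T order: column j holds the
# signal seen in each channel for one unit of pure base j.
CROSSTALK = [
    [1.0, 0.3, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.6, 1.0],
]


def remove_crosstalk(observed):
    """Map an observed quadruplet back to decorrelated intensities."""
    return solve(CROSSTALK, observed)


# A pure G signal of 2000 leaks into the T channel under this matrix.
observed = [0.0, 0.0, 2000.0, 1200.0]
pure = remove_crosstalk(observed)
```

With these assumed values, the T intensity that accompanied the G signal is attributed back to G and the corrected T channel is close to zero.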

slide-25
SLIDE 25

Bias 4: dephasing

[Figure: complementary strands of one colony at different stages of synthesis along the template CACGTGGTCATG... (e.g. TAC vs. GTAC)]

  • Suppose some strands in a colony failed to incorporate their nucleotides at a previous cycle
  • They may successfully elongate at subsequent cycles
  • These are therefore lagging behind in their synthesis and emit signal in a different channel

slide-28
SLIDE 28

Binomial law

  • There is a probability q<1 of incorporating a nucleotide n at cycle C
  • If q is independent of C and n, there is a simple way of correcting for dephasing: sum the contributions from the earlier positions k in the sequence, weighted by the probability of that many mis-incorporations
  • s_k is the base at position k of the sequence; I(n,C) are measured intensities, J(n,k) are dephasing-less intensities

\[ \mathrm{Prob}(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} \, \mathbf{1}(s_k = n) \]

\[ I(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} J(n, k) \]
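
The relation between measured and dephasing-less intensities can be sketched as follows for a single channel. The forward model is the binomial sum on this slide. The correction step is an assumption about how that sum could be inverted: because I(n, C) only involves J(n, k) for k up to C, the system is triangular and can be solved cycle by cycle. Function names and the value of q are hypothetical.

```python
from math import comb


def dephasing_weights(C, q):
    """Weights binom(C, k) * q^k * (1 - q)^(C - k) for k = 1..C."""
    return [comb(C, k) * q**k * (1 - q) ** (C - k) for k in range(1, C + 1)]


def dephase(J, q):
    """Forward model: measured intensities I from clean intensities J.

    J[c - 1] is the clean intensity of one channel at cycle c.
    """
    measured = []
    for C in range(1, len(J) + 1):
        w = dephasing_weights(C, q)
        measured.append(sum(w[k] * J[k] for k in range(C)))
    return measured


def correct_dephasing(I, q):
    """Recover J from I by forward substitution (assumed inversion scheme)."""
    J = []
    for C in range(1, len(I) + 1):
        w = dephasing_weights(C, q)
        lagging = sum(w[k] * J[k] for k in range(C - 1))
        # The last weight is q^C, the fraction of strands that are in phase.
        J.append((I[C - 1] - lagging) / w[C - 1])
    return J


# One channel over five cycles: the base appears at cycles 1 and 4.
clean = [1000.0, 0.0, 0.0, 1000.0, 0.0]
measured = dephase(clean, q=0.95)
recovered = correct_dephasing(measured, q=0.95)
```

In this toy example the channel still shows signal at cycle 2 because of lagging strands, and the substitution step removes it again. With noisy data and many cycles the division by q^C amplifies noise, so this is a sketch rather than a robust procedure.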
slide-29
SLIDE 29

Base probability

  • We would like to associate a probability with each base at each read position
  • Solution 1: normalize the four intensities at each cycle

\[ \mathrm{Prob}(n, C) = I(n, C) / \sum_{k} I(k, C) \]

  • Better solution: fit Gaussian distributions to the four data clouds
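
Solution 1 can be sketched directly. The slide does not say how negative intensities are handled, so the clipping at zero and the uniform fallback below are assumptions, and the function name is hypothetical.

```python
def base_probabilities(quadruplet):
    """Normalize four intensities into probabilities that sum to 1."""
    clipped = [max(value, 0.0) for value in quadruplet]  # assumption: negatives carry no signal
    total = sum(clipped)
    if total == 0:
        return [0.25] * 4  # assumption: no signal at all means no information
    return [value / total for value in clipped]


probs = base_probabilities([-20.0, 0.0, 300.0, 100.0])
```

This treats the four channels as directly comparable, which is only sensible after the bias corrections described earlier.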

slide-30
SLIDE 30

Entropy

  • Entropy H(p) is a measure of how flat (or peaked) a probability distribution is
  • peaked = 0 ≤ H ≤ log2(10) = flat
  • H = log2(ambiguity), where ambiguity is the number of states compatible with the observation

\[ H(p) = -\sum_{k=1}^{10} p(k) \log_2 p(k) \]
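
As a small worked illustration of the formula, the sketch below computes H and the corresponding ambiguity 2^H. The function names are hypothetical, and terms with p(k) = 0 are skipped using the usual convention that 0 log 0 = 0.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)


def ambiguity(p):
    """Effective number of states compatible with the observation."""
    return 2 ** entropy(p)


peaked = entropy([1.0, 0.0, 0.0, 0.0])
flat4 = entropy([0.25, 0.25, 0.25, 0.25])
flat10 = entropy([0.1] * 10)
```

A distribution concentrated on one state gives H = 0, a uniform distribution over four bases gives H = 2, and a uniform distribution over ten states reaches the upper bound log2(10) quoted on the slide.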

slide-31
SLIDE 31

IUPAC codes

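  • Fluorescence intensities (after bias correction and possibly normalization) provide a probability distribution over the four nucleotides
  • We use entropy to convert this into a measure of ambiguity of the call using IUPAC’s convention, e.g. M = A or C, H = A or C or T

[Figure: entropy axis H up to 2, cut at log2(1.5), log2(2.5) and log2(3.5) into four classes of codes]

  H below log2(1.5):          A C G T        (1 base)
  log2(1.5) to log2(2.5):     M R W S Y K    (2 bases)
  log2(2.5) to log2(3.5):     B D H V        (3 bases)
  log2(3.5) to 2:             N              (any base)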
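
The entropy thresholds can be turned into a call as sketched below. The thresholds log2(1.5), log2(2.5) and log2(3.5) come from the slide. Choosing the most probable bases to fill the code, and the treatment of values exactly on a threshold, are assumptions made for this illustration, as are the function names.

```python
from math import log2

# Standard IUPAC nucleotide codes, keyed by the sorted set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}


def entropy(p):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(x * log2(x) for x in p if x > 0)


def iupac_call(probs):
    """Convert probabilities over A, C, G, T into one IUPAC letter."""
    h = entropy(probs)
    # Number of bases in the code: 1 plus the number of thresholds reached.
    n_bases = 1 + sum(h >= log2(t) for t in (1.5, 2.5, 3.5))
    # Assumption: keep the n_bases most probable bases.
    ranked = sorted(range(4), key=lambda i: probs[i], reverse=True)[:n_bases]
    key = "".join(sorted("ACGT"[i] for i in ranked))
    return IUPAC[key]


confident = iupac_call([0.97, 0.01, 0.01, 0.01])
two_way = iupac_call([0.5, 0.5, 0.0, 0.0])
three_way = iupac_call([0.34, 0.33, 0.0, 0.33])
no_call = iupac_call([0.25, 0.25, 0.25, 0.25])
```

The two ambiguous examples reproduce the codes quoted on the slide: an even split between A and C gives M, and a near-even split between A, C and T gives H.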

slide-32
SLIDE 32

Sequence mapping

slide-34
SLIDE 34

Summary

  • Sequencing produces images that are then quantified into tab-delimited text files with four intensity values for each colony and each sequencing cycle
  • These values can be represented in 4D space to show color cross-talk and decay
  • They can be represented as tile pseudo-images to show optical effects
  • Two major sources of bias are dephasing and changing baselines between colors and between cycles
  • Simple signal transformations can decrease many of these biases
  • Per-base quality scores are useful information at the mapping level

slide-35
SLIDE 35

References

Image analysis:

  • ImageJ: http://rsb.info.nih.gov/ij/

Genome indexing:

  • Iseli et al. Indexing strategies for rapid searches of short words in genome sequences. PLoS ONE (2007) vol. 2 (6) pp. e579
  • tagger: http://www.isrec.isb-sib.ch/tagger/
  • bowtie: http://bowtie-bio.sourceforge.net/

Base calling:

  • Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008) vol. 9 (1) pp. 431
  • http://bbcf.epfl.ch/Software