UHTS: Raw data
Jacques Rougemont, Bioinformatics and Biostatistics Core Facility, EPFL
Objectives
- Concentrate on Solexa/Illumina technology
- Describe the steps from imaging colonies to mapping/assembling sequence tags
- Understand the content and structure of the output files
- Study possible sources of systematic bias and find remedies to some of them
Terminology
- colony: set of identical sequences obtained on the flow-cell by amplification of a template
- (sequencing) cycle: attempt to incorporate the next nucleotide of every complementary strand
- (color) channel: 1 of 4 imaged colors, corresponding to the fluorophore associated with one base (e.g. A)
- read: sequence output representing the colony
- base calling: algorithm constructing the reads from the measurements
[Figure: two colonies, with templates CACGTGGTCATG and GTGCGTGGTAAA; the complementary strand ATTT synthesized on the second template gives the read TTTA...]
Images
- Each sequencing cycle produces 4 images for each of the 100 tiles
- DNA colonies must be located, quantified, and tracked across image stacks (~100’000 colonies/image)
- Each colony, at each cycle, generates a quadruplet of fluorescence intensities
- Naively: the highest of the 4 values determines the base
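The naive rule above can be sketched in a few lines of Python. This is only an illustration: the function name `naive_base_call`, the A, C, G, T column order and the two example quadruplets are assumptions made for the sketch, not part of the Solexa/Illumina pipeline.

```python
import numpy as np

BASES = "ACGT"  # assumed channel order: A, C, G, T


def naive_base_call(intensities):
    """Call one base per cycle by taking the brightest of the 4 channels.

    intensities: array-like of shape (n_cycles, 4) for a single colony.
    Returns the read as a string with one letter per cycle.
    """
    intensities = np.asarray(intensities, dtype=float)
    return "".join(BASES[i] for i in intensities.argmax(axis=1))


# Two cycles of one hypothetical colony (illustrative values only)
example = [
    [13.5, 43.4, 2021.8, 1180.6],
    [1497.0, 918.4, -6.6, 13.5],
]
read = naive_base_call(example)
```

In the first quadruplet the second-highest channel is more than half as bright as the winner, which is the kind of ambiguity the bias corrections in the rest of the talk try to reduce.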
Solexa/Illumina file structure
Quality scores
Summary of data
[Figure: one read, with panels for quality, sequence and intensities]
There are ~10M such plots...
Global look
File s_4_0001_int, cycles 1 and 2 (rows: colonies; columns: A C G T ...)
#CH4:OBJ130954
13.5 43.4 2021.8 1180.6
-27.6 -51.1 2531.9 1699.1
-143.9 -43.0 2575.9 1133.8
-9.3 -262.8 2657.1 1639.4
-107.8 -27.3 1968.3 1320.1
-20.5 -45.4 2312.2 862.1
105.8 -38.9 1938.7 966.6
52.2 201.4 1934.9 1198.6
77.1 24.3 2467.7 1102.2
637.6 198.8 2501.5 1500.8
-15.2 18.6 2401.9 1053.9
#END CYCLE 1
11.1 43.2 1875.0 1049.2
-48.5 -56.7 63.0 1349.1
-257.2 -5.8 59.8 1176.6
-129.3 646.7 920.1 557.2
964.1 540.7 1436.3 1015.0
1497.0 918.4 -6.6 13.5
14.1 34.4 1751.6 903.7
1337.1 772.6 199.4 893.7
153.9 223.8 23.4 937.0
313.7 579.3 41.0 663.9
688.1 347.3 655.9 1194.2
- Each colony is a point in 4D intensity space at each cycle
- The naive interpretation was optimistic
Bias 1: optical effects
- Pseudo-image built from measured intensities as a function of x-y coordinates on the tile
- There are obvious boundary effects, stronger in some color channels
- We can correct this effect by fitting a position-dependent baseline
- There are other position-dependent issues, like spot overlaps
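One possible way to fit a position-dependent baseline is a least-squares polynomial surface in (x, y), fitted per channel and subtracted from the intensities. The slides do not say which model was used, so the quadratic surface, the function names and the assumption that coordinates are rescaled to [0, 1] are all choices made for this sketch.

```python
import numpy as np


def fit_baseline(x, y, intensity, degree=2):
    """Fit a polynomial surface in (x, y) to one channel by least squares.

    x, y: colony coordinates on the tile (assumed rescaled to [0, 1]).
    intensity: measured intensity of each colony in this channel.
    Returns the fitted baseline evaluated at each colony position.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Design matrix with all monomials x^i * y^j such that i + j <= degree
    columns = [x**i * y**j
               for i in range(degree + 1)
               for j in range(degree + 1 - i)]
    design = np.stack(columns, axis=1)
    coef, *_ = np.linalg.lstsq(design, intensity, rcond=None)
    return design @ coef


def subtract_baseline(x, y, intensity, degree=2):
    """Return intensities with the fitted position-dependent baseline removed."""
    return np.asarray(intensity, dtype=float) - fit_baseline(x, y, intensity, degree)
```

Fitting on all colonies mixes true signal into the baseline. A more careful variant would probably fit only on colonies whose call is a different base in that cycle, but that refinement is left out here.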
Bias 2: sticky fluorophores
T fluorophores stick to the surface of the flow cell
Bias 3: color cross-talk and decay
- Fluorophore spectra overlap
- Some intensity pairs are correlated
- We can use a basis transform in 4D space to reduce correlations
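A basis transform of this kind can be sketched as the inversion of a 4x4 cross-talk matrix. The linear mixing model, the function name and the matrix values below are hypothetical and only show the mechanics; in practice the matrix would have to be estimated from the data.

```python
import numpy as np


def remove_crosstalk(observed, crosstalk):
    """Undo linear color cross-talk.

    Assumed model: observed = crosstalk @ true for each colony, where
    crosstalk[i, j] is the fraction of fluorophore j's signal seen in channel i.
    observed: array of shape (n_colonies, 4).
    Returns the estimated cross-talk-free intensities, same shape.
    """
    observed = np.asarray(observed, dtype=float)
    crosstalk = np.asarray(crosstalk, dtype=float)
    # Solve crosstalk @ true = observed for all colonies at once
    return np.linalg.solve(crosstalk, observed.T).T


# Hypothetical matrix in which A leaks into C and G leaks into T
M = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.7, 1.0],
])
true = np.array([[1000.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 800.0, 0.0]])
observed = true @ M.T
recovered = remove_crosstalk(observed, M)
```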
Bias 4: dephasing
[Figure: three copies of the template CACGTGGTCATG in one colony; two complementary strands have reached GTAC while one lags at TAC. At the next cycle the in-phase strands incorporate A and the lagging strand incorporates G]
- Suppose some strands in a colony failed to incorporate their nucleotides at a previous cycle
- They may successfully elongate at subsequent cycles
- These strands are therefore lagging behind in their synthesis and emit signal in a different channel
Binomial law
- There is a probability q<1 of incorporating a nucleotide n at cycle C
- If q is independent of C and n, there is a simple way of correcting for dephasing
- Sum the contributions from the previous n in the sequence, weighted by the probability of that many mis-incorporations
- I(n,C) are measured intensities, J(n,k) are dephasing-less intensities
\[ \mathrm{Prob}(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1 - q)^{C-k} \, \mathbf{1}(s_k = n) \]
\[ I(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1 - q)^{C-k} \, J(n, k) \]
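Taking the second formula at face value, the measured intensities are a lower-triangular linear mix of the dephasing-less ones, so J can be recovered from I by solving a triangular system. The sketch below assumes q is known and constant. The function names are illustrative, and this is not necessarily how the correction is implemented in the base-calling software cited in the references.

```python
import numpy as np
from math import comb


def dephasing_matrix(n_cycles, q):
    """W[C-1, k-1] = binom(C, k) * q^k * (1-q)^(C-k) for 1 <= k <= C."""
    W = np.zeros((n_cycles, n_cycles))
    for C in range(1, n_cycles + 1):
        for k in range(1, C + 1):
            W[C - 1, k - 1] = comb(C, k) * q**k * (1 - q) ** (C - k)
    return W


def apply_dephasing(J, q):
    """Forward model: I(n, C) = sum_k W[C, k] * J(n, k).

    J: array of shape (n_cycles, 4), dephasing-less intensities.
    """
    J = np.asarray(J, dtype=float)
    return dephasing_matrix(J.shape[0], q) @ J


def correct_dephasing(I, q):
    """Invert the forward model to estimate J from measured I."""
    I = np.asarray(I, dtype=float)
    return np.linalg.solve(dephasing_matrix(I.shape[0], q), I)
```

The diagonal of the matrix is q^C, which shrinks with the cycle number, so the inversion amplifies noise at late cycles; a regularized solve might be preferable on real data.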
Base probability
- We would like to associate a probability with each base at each read position
- Solution 1:
\[ \mathrm{Prob}(n, C) = I(n, C) \Big/ \sum_{k} I(k, C) \]
- Better solution: fit Gaussian distributions to the four data clouds
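Solution 1 amounts to normalizing the four intensities at each cycle. Measured intensities can be negative, so the sketch below clips them to zero first and returns a uniform distribution when nothing is left. Both choices are assumptions of this example and are not stated on the slide.

```python
import numpy as np


def base_probabilities(intensities):
    """Normalize intensities into probabilities over A, C, G, T.

    intensities: array-like whose last axis has length 4.
    Negative values are clipped to 0 (assumption); an all-zero quadruplet
    gives the uniform distribution (assumption).
    """
    I = np.clip(np.asarray(intensities, dtype=float), 0.0, None)
    total = I.sum(axis=-1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero
    return np.where(total > 0, I / safe_total, 0.25)


probs = base_probabilities([[100.0, 100.0, 200.0, 0.0],
                            [-5.0, 0.0, 0.0, -1.0]])
```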
Entropy
- Entropy H(p) is a measure of how flat (or peaked) a probability distribution is
- peaked = 0 ≤ H ≤ log2(10) = flat
- H = log2(ambiguity), where ambiguity is the number of states compatible with the observation
\[ H(p) = - \sum_{k=1}^{10} p(k) \log_2 p(k) \]
IUPAC codes
- Fluorescence intensities (after bias correction and possibly normalization) provide a probability distribution over the four nucleotides
- We use entropy to convert this into a measure of ambiguity of the call using IUPAC’s convention, e.g. M=A or C, H=A or C or T
[Figure: entropy axis H from 0 to 2, divided at log2(1.5), log2(2.5) and log2(3.5) into four classes: ACGT, MRWSYK, BDHV, N]
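The conversion from probabilities to an IUPAC code can be sketched as follows. The thresholds log2(1.5), log2(2.5) and log2(3.5) are the ones shown on the entropy axis. Picking the m most probable bases once the ambiguity level m is known is an assumption of this sketch, and `iupac_call` is an illustrative name.

```python
import math

# Standard IUPAC nucleotide codes, keyed by the set of compatible bases
IUPAC = {frozenset(bases): code for code, bases in {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}.items()}

# Entropy boundaries between ambiguity levels 1, 2, 3 and 4
THRESHOLDS = [math.log2(1.5), math.log2(2.5), math.log2(3.5)]


def entropy(p):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def iupac_call(p):
    """Map probabilities over (A, C, G, T) to a single IUPAC letter."""
    h = entropy(p)
    ambiguity = 1 + sum(h > t for t in THRESHOLDS)  # 1..4 compatible bases
    ranked = sorted(zip(p, "ACGT"), reverse=True)   # most probable first
    chosen = frozenset(base for _, base in ranked[:ambiguity])
    return IUPAC[chosen]
```

For example, a clean call such as (0.97, 0.01, 0.01, 0.01) stays "A", an even split between A and C becomes "M", and a flat distribution becomes "N".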
Sequence mapping
Summary
- Sequencing produces images that are then quantified into tab-delimited text files with four intensity values for each colony and each sequencing cycle
- These values can be represented in 4D space to show color cross-talk and decay
- They can be represented as tile pseudo-images to show optical effects
- Two major sources of bias are dephasing and changing baselines between colors and between cycles
- Simple signal transformations can decrease many of these biases
- Per-base quality scores are useful information at the mapping level
References
Image analysis:
- ImageJ: http://rsb.info.nih.gov/ij/
Genome indexing:
- Iseli et al. Indexing strategies for rapid searches of short words in genome sequences. PLoS ONE (2007) vol. 2 (6) pp. e579
- tagger: http://www.isrec.isb-sib.ch/tagger/
- bowtie: http://bowtie-bio.sourceforge.net/
Base calling:
- Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008) vol. 9 (1) pp. 431
- http://bbcf.epfl.ch/Software