UHTS: Raw data. Jacques Rougemont, Bioinformatics and Biostatistics


  1. UHTS: Raw data. Jacques Rougemont, Bioinformatics and Biostatistics Core Facility, EPFL

  2. Objectives
      Concentrate on Solexa/Illumina technology.
      Describe the steps from imaging colonies to mapping/assembling sequence tags.
      Understand the content and structure of the output files.
      Study possible sources of systematic bias and find remedies to some of them.

  3. Terminology
      colony: set of identical sequences obtained on the flow-cell by amplification of a template
      (sequencing) cycle: attempt to incorporate the next nucleotide of every complementary strand
      (color) channel: 1 of 4 imaged colors, corresponding to the fluorophore associated with one base (e.g. A)
      read: sequence output representing the colony
      base calling: algorithm constructing the reads from the measurements
      [Figure: example colonies of identical template strands (CACGTGGTCATG, GTGCGTGGTAAA) and the reads derived from them]

  4. Images
      Each sequencing cycle produces 4 images for each of the 100 tiles.
      DNA colonies must be located, quantified, and tracked across image stacks (~100,000 colonies/image).
      Each colony, at each cycle, generates a quadruplet of fluorescence intensities.
      Naively: the highest of the 4 values determines the base.
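
The naive rule on this slide can be sketched in a few lines of Python. This is only an illustration: the channel order A, C, G, T is assumed from the column labels on the later "Global look" slides, and the function names `naive_call` and `naive_read` are hypothetical.

```python
# Channel order is an assumption taken from the "A C G T" labels in the deck.
BASES = "ACGT"


def naive_call(quadruplet):
    """Return the base whose channel has the highest intensity."""
    if len(quadruplet) != 4:
        raise ValueError("expected four intensities (A, C, G, T)")
    best = max(range(4), key=lambda i: quadruplet[i])
    return BASES[best]


def naive_read(quadruplets):
    """Call one base per cycle and join the calls into a read."""
    return "".join(naive_call(q) for q in quadruplets)


# Two cycles of one colony, values copied from the deck's intensity excerpt.
colony = [(-27.6, -51.1, 2531.9, 1699.1), (-48.5, -56.7, 63.0, 1349.1)]
read = naive_read(colony)
```

The rule ignores how close the runner-up channel is, which is what the bias slides later address.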

  5-7. Solexa/Illumina file structure

  8-9. Quality scores

  10-15. Summary of data: Intensities, Sequence, Quality. There are ~10M such plots...

  16-18. Global look
      Excerpt of s_4_0001_int (slide annotations: #CH4:OBJ130954, #END CYCLE 1).
      Rows are colonies; each cycle contributes four columns (A, C, G, T).

                      cycle 1                            cycle 2
             A       C       G       T          A       C       G       T
          13.5    43.4  2021.8  1180.6       11.1    43.2  1875.0  1049.2
         -27.6   -51.1  2531.9  1699.1      -48.5   -56.7    63.0  1349.1
        -143.9   -43.0  2575.9  1133.8     -257.2    -5.8    59.8  1176.6
          -9.3  -262.8  2657.1  1639.4     -129.3   646.7   920.1   557.2
        -107.8   -27.3  1968.3  1320.1      964.1   540.7  1436.3  1015.0
         -20.5   -45.4  2312.2   862.1     1497.0   918.4    -6.6    13.5
         105.8   -38.9  1938.7   966.6       14.1    34.4  1751.6   903.7
          52.2   201.4  1934.9  1198.6     1337.1   772.6   199.4   893.7
          77.1    24.3  2467.7  1102.2      153.9   223.8    23.4   937.0
         637.6   198.8  2501.5  1500.8      313.7   579.3    41.0   663.9
         -15.2    18.6  2401.9  1053.9      688.1   347.3   655.9  1194.2
           ...

  19. Global look
      Each colony is a point in 4D intensity space at each cycle.
      The naive interpretation was optimistic.
      (Same s_4_0001_int excerpt as on the preceding Global look slides.)
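
To illustrate why the naive interpretation is optimistic, the sketch below splits one colony's row into per-cycle 4D points and compares the two strongest channels. It assumes a row holds four values per cycle in A, C, G, T order, as the slide labels suggest. The ratio is a hypothetical ambiguity measure chosen for illustration, not something defined in the deck.

```python
def split_cycles(values):
    """Group a flat list of intensities into (A, C, G, T) quadruplets, one per cycle."""
    if len(values) % 4 != 0:
        raise ValueError("number of intensities must be a multiple of 4")
    return [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]


def second_to_first_ratio(quadruplet):
    """Ratio of the second-highest to the highest intensity (hypothetical measure).

    Values near 0 mean a clear winner; values near 1 mean an ambiguous call.
    """
    ordered = sorted(quadruplet, reverse=True)
    if ordered[0] <= 0:
        return float("nan")  # no positive signal at all
    return max(ordered[1], 0.0) / ordered[0]


# Fifth colony of the excerpt, cycles 1 and 2.
row = [-107.8, -27.3, 1968.3, 1320.1, 964.1, 540.7, 1436.3, 1015.0]
points = split_cycles(row)
ratios = [second_to_first_ratio(p) for p in points]
```

For this colony the runner-up channel reaches roughly two thirds of the winner in both cycles, so picking the maximum hides a good deal of uncertainty.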

  20. Bias 1: optical effects
      Pseudo-image built from measured intensities as a function of x-y coordinates on the tile.
      There are obvious boundary effects, stronger in some color channels.
      We can correct this effect by fitting a position-dependent baseline.
      There are other position-dependent issues, like spot overlaps.
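
One possible reading of "fitting a position-dependent baseline" is a least-squares plane a + b*x + c*y fitted to background intensities in one channel and then subtracted. The planar form, the use of the normal equations, and all function names below are assumptions made for this sketch; the deck does not say which baseline model was used. The example data are synthetic.

```python
def solve_linear(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: positions may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_plane(xs, ys, values):
    """Least-squares fit of values ~ a + b*x + c*y via the normal equations."""
    rows = [(1.0, x, y) for x, y in zip(xs, ys)]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(3)]
    return solve_linear(normal, rhs)


def subtract_baseline(xs, ys, values, coeffs):
    """Remove the fitted plane from every intensity."""
    a, b, c = coeffs
    return [v - (a + b * x + c * y) for x, y, v in zip(xs, ys, values)]


# Synthetic background on a 5 x 5 grid of positions (made-up numbers).
xs = [float(x) for x in range(5) for _ in range(5)]
ys = [float(y) for _ in range(5) for y in range(5)]
background = [10.0 + 0.5 * x - 0.2 * y for x, y in zip(xs, ys)]
coeffs = fit_plane(xs, ys, background)
flattened = subtract_baseline(xs, ys, background, coeffs)
```

In practice the fit would use only colonies that are dark in the channel being corrected, and a plane may be too simple for strong boundary effects.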

  21-22. Bias 2: sticky fluorophores
      T fluorophores stick to the surface of the flow cell.

  23-24. Bias 3: color cross-talk and decay
      Fluorophore spectra overlap.
      Some intensity pairs are correlated.
      We can use a basis transform in 4D space to reduce correlations.
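
A basis transform of this kind can be sketched as the inverse of a mixing matrix. The version below assumes, for simplicity, that cross-talk only couples the channels in two pairs, (A, C) and (G, T), so the 4x4 matrix is block diagonal and each 2x2 block can be inverted by hand. The pairing, the leak fractions and the function names are all hypothetical; the deck only states that some intensity pairs are correlated.

```python
def mix_pair(true_1, true_2, leak_1_into_2, leak_2_into_1):
    """Forward model: each channel picks up a fraction of its partner's signal."""
    return (true_1 + leak_2_into_1 * true_2, leak_1_into_2 * true_1 + true_2)


def unmix_pair(obs_1, obs_2, leak_1_into_2, leak_2_into_1):
    """Invert the 2x2 mixing [[1, leak_2_into_1], [leak_1_into_2, 1]]."""
    det = 1.0 - leak_1_into_2 * leak_2_into_1
    if abs(det) < 1e-12:
        raise ValueError("cross-talk matrix is singular")
    true_1 = (obs_1 - leak_2_into_1 * obs_2) / det
    true_2 = (obs_2 - leak_1_into_2 * obs_1) / det
    return (true_1, true_2)


def correct_crosstalk(quadruplet, leaks_ac, leaks_gt):
    """Apply the pairwise inverse to an (A, C, G, T) quadruplet."""
    a, c, g, t = quadruplet
    a, c = unmix_pair(a, c, *leaks_ac)
    g, t = unmix_pair(g, t, *leaks_gt)
    return (a, c, g, t)


# Made-up leak fractions: (first into second, second into first).
LEAKS_AC = (0.2, 0.1)
LEAKS_GT = (0.6, 0.05)

# A pure G signal of 2000 shows up partly in the T channel.
g_obs, t_obs = mix_pair(2000.0, 0.0, *LEAKS_GT)
corrected = correct_crosstalk((0.0, 0.0, g_obs, t_obs), LEAKS_AC, LEAKS_GT)
```

The leak fractions would have to be estimated from the data, for example from the slopes of the correlated clouds in the pairwise intensity plots.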

  25-27. Bias 4: dephasing
      Suppose some strands in a colony failed to incorporate their nucleotides at a previous cycle.
      They may successfully elongate at subsequent cycles.
      These are therefore lagging behind in their synthesis and emit signal in a different channel.
      [Figure: complementary strand GTAC and a lagging strand TAC elongating on the template CACGTGGTCATG]

  28. Binomial law
      There is a probability q < 1 of incorporating a nucleotide n at cycle C.
      If q is independent of C and n, there is a simple way of correcting for dephasing: sum the contributions from previous n in the sequence, weighted by the probability of that many mis-incorporations.
      I(n,C) are measured intensities, J(n,k) are dephasing-less intensities, and s_k is the base at position k of the sequence.
      $$\mathrm{Prob}(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} \, \mathbf{1}(s_k = n)$$
      $$I(n, C) = \sum_{k=1}^{C} \binom{C}{k} q^{k} (1-q)^{C-k} J(n, k)$$
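
Because the sum for I(n, C) only involves J(n, k) with k ≤ C, the relation is a lower-triangular linear system and can be inverted cycle by cycle. The sketch below implements the slide's formula for one channel and recovers J by forward substitution. It assumes q is known and constant; the function names are hypothetical and the example intensities are made up.

```python
from math import comb


def dephasing_weight(q, cycle, k):
    """Binomial weight of position k contributing at the given cycle (1-based)."""
    return comb(cycle, k) * q ** k * (1.0 - q) ** (cycle - k)


def apply_dephasing(clean, q):
    """Forward model: I(n, C) = sum over k = 1..C of weight(C, k) * J(n, k)."""
    return [
        sum(dephasing_weight(q, c, k) * clean[k - 1] for k in range(1, c + 1))
        for c in range(1, len(clean) + 1)
    ]


def remove_dephasing(measured, q):
    """Recover J(n, k) from I(n, C) by forward substitution."""
    clean = []
    for c in range(1, len(measured) + 1):
        lagging = sum(dephasing_weight(q, c, k) * clean[k - 1] for k in range(1, c))
        clean.append((measured[c - 1] - lagging) / dephasing_weight(q, c, c))
    return clean


# One channel over five cycles: the base occurs at positions 1 and 4.
q = 0.95
clean_signal = [1000.0, 0.0, 0.0, 1000.0, 0.0]
measured = apply_dephasing(clean_signal, q)
recovered = remove_dephasing(measured, q)
```

The diagonal weight is q^C, which shrinks with the cycle number, so the division amplifies noise at late cycles; a real implementation would need to account for that.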

  29. Base probability
      We would like to associate a probability with each base at each read position.
      Solution 1: $$\mathrm{Prob}(n, C) = I(n, C) \Big/ \sum_{k} I(k, C)$$
      Better solution: fit Gaussian distributions to the four data clouds.
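
Solution 1 is a plain normalization of the quadruplet. Since the measured intensities in the deck's excerpt can be negative, the sketch below clips them to zero first and falls back to a flat distribution when no signal is left. Both choices are assumptions added here so that the result is a valid probability distribution; they are not stated on the slide.

```python
def base_probabilities(quadruplet):
    """Solution 1: Prob(n, C) = I(n, C) / sum_k I(k, C), after clipping negatives."""
    clipped = [max(v, 0.0) for v in quadruplet]  # assumption: negative means "no signal"
    total = sum(clipped)
    if total == 0:
        return [0.25] * 4  # assumption: no signal at all gives a flat distribution
    return [v / total for v in clipped]


# Cycle 2 of the second colony in the deck's intensity excerpt (A, C, G, T).
probs = base_probabilities((-48.5, -56.7, 63.0, 1349.1))
```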

  30. Entropy
      Entropy H(p) is a measure of how flat (or peaked) a probability distribution is.
      $$H(p) = -\sum_{k=1}^{10} p(k) \log_2 p(k)$$
      peaked = 0 ≤ H ≤ log_2(10) = flat
      H = log_2(ambiguity), where ambiguity is the number of states compatible with the observation.
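
The two extremes on this slide can be checked numerically. A minimal sketch, with the hypothetical function name `entropy_bits` and the usual convention that terms with p(k) = 0 contribute nothing:

```python
from math import log2


def entropy_bits(p):
    """Shannon entropy in bits; zero-probability states are skipped."""
    return -sum(x * log2(x) for x in p if x > 0)


peaked = entropy_bits([1.0] + [0.0] * 9)  # one certain state out of 10
flat = entropy_bits([0.1] * 10)           # 10 equally likely states
ambiguity = 2 ** flat                     # inverts H = log2(ambiguity)
```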

  31. IUPAC codes
      Fluorescence intensities (after bias correction and possibly normalization) provide a probability distribution over the four nucleotides.
      We use entropy to convert this into a measure of ambiguity of the call using IUPAC's convention, e.g. M = A or C, H = A or C or T.
      Entropy H between 0 and log_2(1.5): A C G T
      Entropy H between log_2(1.5) and log_2(2.5): M R W S Y K
      Entropy H between log_2(2.5) and log_2(3.5): B D H V
      Entropy H between log_2(3.5) and 2: N
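
The entropy thresholds on this slide amount to rounding the ambiguity 2^H to 1, 2, 3 or 4 bases. The sketch below then keeps that many of the most probable bases and looks up the matching IUPAC code. Choosing the top-ranked bases is an assumption of this sketch (the slide only gives the entropy ranges), and `iupac_call` is a hypothetical name.

```python
from math import log2

# Standard IUPAC nucleotide codes, keyed by the sorted set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V",
    "ACGT": "N",
}


def iupac_call(probs):
    """Map a distribution over (A, C, G, T) to an IUPAC code via its entropy."""
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    ambiguity = 2 ** entropy
    # Thresholds log2(1.5), log2(2.5), log2(3.5) expressed on the ambiguity scale.
    if ambiguity < 1.5:
        count = 1
    elif ambiguity < 2.5:
        count = 2
    elif ambiguity < 3.5:
        count = 3
    else:
        count = 4
    ranked = sorted(range(4), key=lambda i: probs[i], reverse=True)
    chosen = "".join(sorted("ACGT"[i] for i in ranked[:count]))
    return IUPAC[chosen]


confident = iupac_call([0.97, 0.01, 0.01, 0.01])
two_way = iupac_call([0.5, 0.5, 0.0, 0.0])
three_way = iupac_call([0.4, 0.3, 0.0, 0.3])
unknown = iupac_call([0.25, 0.25, 0.25, 0.25])
```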

  32-33. Sequence mapping

  34. Summary
      Sequencing produces images that are then quantified into tab-delimited text files with four intensity values for each colony and each sequencing cycle.
      These values can be represented in 4D space to show color cross-talk and decay.
      They can be represented as tile pseudo-images to show optical effects.
      Two major sources of bias are dephasing and changing baselines between colors and between cycles.
      Simple signal transformations can decrease many of these biases.
      Per-base quality scores are useful information at the mapping level.

  35. References
      Image analysis: ImageJ, http://rsb.info.nih.gov/ij/
      Genome indexing: Iseli et al. Indexing strategies for rapid searches of short words in genome sequences. PLoS ONE (2007) vol. 2 (6) pp. e579
      tagger: http://www.isrec.isb-sib.ch/tagger/
      bowtie: http://bowtie-bio.sourceforge.net/
      Base calling: Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008) vol. 9 (1) pp. 431
      http://bbcf.epfl.ch/Software
