UHTS: Raw data Jacques Rougemont Bioinformatics and Biostatistics Core Facility EPFL

Objectives
- Concentrate on Solexa/Illumina technology
- Describe the steps from imaging colonies to mapping/assembling sequence tags
- Understand the content and structure of the output files
- Study possible sources of systematic bias and find remedies for some of them

Terminology
- colony: set of identical sequences obtained on the flow-cell by amplification of a template
- (sequencing) cycle: attempt to incorporate the next nucleotide of every complementary strand
- (color) channel: 1 of 4 imaged colors, corresponding to the fluorophore associated with one base (e.g. A)
- read: sequence output representing the colony
- base calling: algorithm constructing the reads from the measurements

Images
- Each sequencing cycle produces 4 images for each of the 100 tiles
- DNA colonies must be located, quantified, and tracked across image stacks (~100,000 colonies/image)
- Each colony, at each cycle, generates a quadruplet of fluorescence intensities
- Naively: the highest of the 4 values determines the base
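
The naive rule on this slide can be sketched in a few lines of Python. This is only an illustration: the channel order A, C, G, T and the intensity values are assumptions chosen for the example, and the function names are hypothetical.

```python
BASES = "ACGT"  # assumed channel order: A, C, G, T


def naive_call(quadruplet):
    """Return the base whose channel shows the highest intensity."""
    best = max(range(4), key=lambda i: quadruplet[i])
    return BASES[best]


def naive_read(cycles):
    """Concatenate the naive calls over all cycles of one colony."""
    return "".join(naive_call(q) for q in cycles)


# Illustrative intensities for one colony over two cycles.
colony = [
    (-27.6, -51.1, 2531.9, 1699.1),
    (-48.5, -56.7, 63.0, 1349.1),
]
print(naive_read(colony))
```

Note that the first quadruplet has two strong channels, which is exactly the kind of situation where taking the maximum alone discards information.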

Solexa/Illumina file structure

Quality scores

Summary of data Intensities Sequence Quality There are ~10M such plots...

Global look
- Each colony is a point in 4D intensity space at each cycle
- Naive interpretation was optimistic
- Excerpt of s_4_0001_int (markers shown on the slide: #CH4:OBJ130954, #END CYCLE 1); one row per colony, four channel intensities per cycle:
             cycle 1                              cycle 2
       A       C       G       T  |       A       C       G       T
    13.5    43.4  2021.8  1180.6  |    11.1    43.2  1875.0  1049.2
   -27.6   -51.1  2531.9  1699.1  |   -48.5   -56.7    63.0  1349.1
  -143.9   -43.0  2575.9  1133.8  |  -257.2    -5.8    59.8  1176.6
    -9.3  -262.8  2657.1  1639.4  |  -129.3   646.7   920.1   557.2
  -107.8   -27.3  1968.3  1320.1  |   964.1   540.7  1436.3  1015.0
   -20.5   -45.4  2312.2   862.1  |  1497.0   918.4    -6.6    13.5
   105.8   -38.9  1938.7   966.6  |    14.1    34.4  1751.6   903.7
    52.2   201.4  1934.9  1198.6  |  1337.1   772.6   199.4   893.7
    77.1    24.3  2467.7  1102.2  |   153.9   223.8    23.4   937.0
   637.6   198.8  2501.5  1500.8  |   313.7   579.3    41.0   663.9
   -15.2    18.6  2401.9  1053.9  |   688.1   347.3   655.9  1194.2
  ...

Bias 1: optical effects
- Pseudo-image built from measured intensities as a function of x-y coordinates on the tile
- There are obvious boundary effects, stronger in some color channels
- We can correct this effect by fitting a position-dependent baseline
- There are other position-dependent issues, such as spot overlaps
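
One simple way to approximate a position-dependent baseline is sketched below. It is not the fit used in the talk: it swaps in a binned median, under the assumption that most colonies are dark in any one channel at a given cycle, so that the local median approximates the background. The function name, the grid size and the tile size are hypothetical.

```python
from collections import defaultdict
from statistics import median


def subtract_local_baseline(spots, tile_size, n_bins=4):
    """Subtract a per-grid-cell median from single-channel intensities.

    spots: list of (x, y, intensity) for one channel at one cycle.
    tile_size: tile edge length, in the same units as x and y (assumed square).
    """
    def cell(x, y):
        # Map a coordinate to its grid cell, clamping the far edge.
        bx = min(int(x / tile_size * n_bins), n_bins - 1)
        by = min(int(y / tile_size * n_bins), n_bins - 1)
        return bx, by

    by_cell = defaultdict(list)
    for x, y, value in spots:
        by_cell[cell(x, y)].append(value)

    # The median of each cell serves as its local baseline estimate.
    baseline = {c: median(values) for c, values in by_cell.items()}
    return [value - baseline[cell(x, y)] for x, y, value in spots]
```

A smooth surface fit (for instance a low-order polynomial in x and y) would avoid the discontinuities at cell borders; the binned version is shown only because it is short.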

Bias 2: sticky fluorophores T fluorophores stick to the surface of the flow cell

Bias 3: color cross-talk and decay
- Fluorophore spectra overlap
- Some intensity pairs are correlated
- We can use a basis transform in 4D space to reduce correlations
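
As a minimal sketch of such a basis transform, assume the observed quadruplet is a linear mixture, observed = M · true, where column j of M is the 4-channel signature of base j. Solving this linear system then undoes the mixing. The matrix values below are invented for illustration; in practice M would have to be estimated from the data.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("cross-talk matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x


# Hypothetical cross-talk matrix (rows: channels A, C, G, T; columns: bases A, C, G, T).
CROSSTALK = [
    [1.0, 0.3, 0.0, 0.0],
    [0.4, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.2, 1.0],
]


def decorrelate(observed):
    """Estimate per-base signal from an observed 4-channel quadruplet."""
    return solve(CROSSTALK, observed)


# A pure G signal of 1000 leaks into the T channel; the transform removes the leak.
print(decorrelate([0.0, 0.0, 1000.0, 200.0]))
```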

Bias 4: dephasing
- Suppose some strands in a colony failed to incorporate their nucleotides at a previous cycle
- They may successfully elongate at subsequent cycles
- These strands are therefore lagging behind in their synthesis and emit signal in a different channel
[Figure: three copies of the template CACGTGGTCATG with their growing complementary strands; two have reached GTAC while one lags at TAC]

Binomial law
- There is a probability q<1 of incorporating a nucleotide n at cycle C
- If q is independent of C and n, there is a simple way of correcting for dephasing: sum the contributions from previous n in the sequence, weighted by the probability of that many mis-incorporations
- I(n,C) are measured intensities, J(n,k) are dephasing-free intensities, s_k is the k-th base of the sequence and 1(·) is the indicator function
$$\mathrm{Prob}(n, C) = \sum_{k=1}^{C} \binom{C}{k}\, q^{k} (1-q)^{C-k}\, \mathbf{1}(s_k = n)$$
$$I(n, C) = \sum_{k=1}^{C} \binom{C}{k}\, q^{k} (1-q)^{C-k}\, J(n, k)$$
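
Because the sum for I(n,C) only involves J(n,k) with k ≤ C, the relation is triangular and can be inverted cycle by cycle. The sketch below implements the formula exactly as written on the slide for a single channel; the function names are hypothetical and q is assumed to be known.

```python
from math import comb


def weight(C, k, q):
    """Binomial weight of cycle k in the signal measured at cycle C."""
    return comb(C, k) * q**k * (1 - q) ** (C - k)


def dephase(J, q):
    """Forward model: J[k-1] is the dephasing-free intensity at cycle k."""
    return [
        sum(weight(C, k, q) * J[k - 1] for k in range(1, C + 1))
        for C in range(1, len(J) + 1)
    ]


def rephase(I, q):
    """Invert the forward model by substitution, one cycle at a time."""
    J = []
    for C in range(1, len(I) + 1):
        # Contribution of earlier cycles, using the already recovered values.
        lag = sum(weight(C, k, q) * J[k - 1] for k in range(1, C))
        J.append((I[C - 1] - lag) / q**C)
    return J


clean = [1000.0, 0.0, 0.0, 1000.0, 0.0]  # illustrative single-channel signal
measured = dephase(clean, q=0.95)
print([round(v, 1) for v in rephase(measured, q=0.95)])
```

The division by q^C grows with the cycle number, so on real data this direct inversion would amplify noise at late cycles; some form of regularization would likely be needed.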

Base probability
- We would like to associate a probability with each base at each read position
- Solution 1: $\mathrm{Prob}(n, C) = I(n, C) / \sum_{k} I(k, C)$
- Better solution: fit Gaussian distributions to the four data clouds
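
Solution 1 amounts to normalizing the quadruplet. A minimal sketch follows; clipping negative intensities to zero and returning a uniform distribution when nothing is left are assumptions added here so that the result is a valid probability distribution, they are not stated on the slide.

```python
def base_probabilities(quadruplet):
    """Normalize a 4-channel intensity quadruplet into probabilities."""
    clipped = [max(value, 0.0) for value in quadruplet]  # assumption: negatives carry no signal
    total = sum(clipped)
    if total == 0:
        return [0.25] * 4  # assumption: no signal means no information
    return [value / total for value in clipped]


print(base_probabilities((-27.6, -51.1, 300.0, 100.0)))
```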

Entropy
- Entropy H(p) is a measure of how flat (or peaked) a probability distribution is
- For a distribution over 10 states: 0 (peaked) ≤ H ≤ log2(10) (flat)
- $H(p) = -\sum_{k=1}^{10} p(k) \log_2 p(k)$
- H = log2(ambiguity), where ambiguity is the number of states compatible with the observation
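
The definition translates directly into code. In this sketch the function accepts a distribution over any number of states, and zero-probability states are skipped, following the usual convention that 0 · log 0 = 0.

```python
from math import log2


def entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(x * log2(x) for x in p if x > 0)


print(entropy([1.0, 0.0, 0.0, 0.0]))  # fully peaked
print(entropy([0.5, 0.5, 0.0, 0.0]))  # two equally likely states
print(entropy([0.25] * 4))            # flat over four states
```

For a uniform distribution over m states the result is log2(m), which is why 2^H can be read as an effective number of compatible states.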

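IUPAC codes
- Fluorescence intensities (after bias correction and possibly normalization) provide a probability distribution over the four nucleotides
- We use entropy to convert this into a measure of ambiguity of the call using IUPAC’s convention, e.g. M = A or C, H = A or C or T
- Entropy scale from H = 0 to H = 2, with thresholds at log2(1.5), log2(2.5) and log2(3.5):
  H < log2(1.5): one base (A, C, G, T)
  log2(1.5) ≤ H < log2(2.5): two bases (M, R, W, S, Y, K)
  log2(2.5) ≤ H < log2(3.5): three bases (B, D, H, V)
  H ≥ log2(3.5): any base (N)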
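
A possible implementation of this conversion is sketched below. The thresholds log2(1.5), log2(2.5) and log2(3.5) come from the slide; picking the k most probable bases once the ambiguity k is known is an assumption made here, and the function names are hypothetical.

```python
from math import log2

# Standard IUPAC nucleotide codes, keyed by the sorted set of compatible bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
    "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B",
    "ACGT": "N",
}


def entropy(p):
    """Shannon entropy in bits, skipping zero-probability states."""
    return -sum(x * log2(x) for x in p if x > 0)


def iupac_call(p):
    """Convert probabilities over (A, C, G, T) into an IUPAC code."""
    h = entropy(p)
    if h < log2(1.5):
        k = 1
    elif h < log2(2.5):
        k = 2
    elif h < log2(3.5):
        k = 3
    else:
        k = 4
    # Assumption: keep the k most probable bases.
    ranked = sorted(range(4), key=lambda i: p[i], reverse=True)
    chosen = "".join(sorted("ACGT"[i] for i in ranked[:k]))
    return IUPAC[chosen]


print(iupac_call([0.97, 0.01, 0.01, 0.01]))
print(iupac_call([0.5, 0.5, 0.0, 0.0]))
```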

Sequence mapping

Summary
- Sequencing produces images that are then quantified into tab-delimited text files with four intensity values for each colony and each sequencing cycle
- These values can be represented in 4D space to show color cross-talk and decay
- They can be represented as tile pseudo-images to show optical effects
- Two major sources of bias are dephasing and changing baselines between colors and between cycles
- Simple signal transformations can decrease many of these biases
- Per-base quality scores are useful information at the mapping level

References
- Image analysis: ImageJ, http://rsb.info.nih.gov/ij/
- Genome indexing: Iseli et al. Indexing strategies for rapid searches of short words in genome sequences. PLoS ONE (2007) vol. 2 (6) pp. e579
- tagger: http://www.isrec.isb-sib.ch/tagger/
- bowtie: http://bowtie-bio.sourceforge.net/
- Base calling: Rougemont et al. Probabilistic base calling of Solexa sequencing data. BMC Bioinformatics (2008) vol. 9 (1) pp. 431, http://bbcf.epfl.ch/Software
