Short read quality assessment. Martin Morgan, June 20-23, 2011

SLIDE 1

Short read quality assessment

Martin Morgan (mtmorgan@fhcrc.org)

June 20-23, 2011

SLIDE 2

Why sequence?

e.g., RNA-seq

◮ Expression in novel (un-annotated) regions
◮ Exon junction / RNA editing insights
◮ Allele-specific / transcript isoform quantification
◮ Non-model organisms
◮ Greater dynamic range and sensitivity?

Lessons from microarrays

◮ Initially: variability between manufacturers, technologies, and labs
◮ MAQC: quality control standards and analysis protocols

SLIDE 3

Example work flow – [4]

Sample

◮ Purify poly(A)+ RNA with oligo(dT) magnetic beads
◮ cDNA synthesis primed with random hexamers

Microarray

◮ Dye-swap, hybridization, fluorescence, analysis

RNA-seq

◮ Fragment and size-select
◮ Illumina adapter ligation

SLIDE 6

Key issues

◮ Experimental design [1]
  ◮ Replication
  ◮ Randomization and blocking, e.g., batch effects
◮ Depth of coverage
  ◮ Statistical power
  ◮ Library complexity
◮ Coverage heterogeneity
  ◮ Estimation biases
  ◮ Legitimate comparison
◮ Sequencing uncertainty [2]

SLIDE 7

Key issues

◮ Experimental design [1]

ROC simulation

◮ Replication (red vs. blue)
◮ Randomization and blocking (solid vs. dot)

SLIDE 8

Key issues

◮ Depth of coverage
  ◮ Library complexity

[Figure: cumulative proportion of reads occurring 0, 1, ... times. Axes: number of occurrences of each read (log10) and cumulative proportion of reads; panels numbered 1 to 8.]
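
The cumulative proportion of reads occurring a given number of times can be tabulated with a few lines of code. The following is a minimal Python sketch, not the method used to produce the figure; the function name `cumulative_read_proportion` and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def cumulative_read_proportion(reads):
    """Return [(k, proportion of reads whose sequence occurs at most k times)].

    `reads` is a list of read sequences (strings). Hypothetical helper,
    written for illustration only.
    """
    copies = Counter(reads)      # sequence -> number of occurrences
    reads_at = Counter()         # k -> number of reads whose sequence occurs k times
    for n in copies.values():
        reads_at[n] += n         # a sequence seen n times accounts for n reads
    total = len(reads)
    curve, running = [], 0
    for k in sorted(reads_at):
        running += reads_at[k]
        curve.append((k, running / total))
    return curve


# Toy data (invented): one read seen three times, one twice, one once.
reads = ["ACGT", "ACGT", "ACGT", "TTGA", "GGCC", "GGCC"]
curve = cumulative_read_proportion(reads)
```

A curve that rises slowly, so that a large share of reads come from sequences seen many times, would suggest a low-complexity library; a curve that reaches 1 quickly suggests most reads are distinct.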
SLIDE 9

Key issues

◮ Coverage heterogeneity

[Figure: actual versus uniform φX174 coverage. Axes: copies per read (log10) and cumulative proportion.]
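
One way to compare actual with uniform coverage is to compute per-position depth from read start positions and set it against the depth expected if reads were spread evenly. The sketch below assumes ungapped reads on a linear genome with 0-based start positions; the function names and the toy numbers are hypothetical, and this is not necessarily how the figure was produced.

```python
def coverage_depth(starts, read_len, genome_len):
    """Per-position depth for ungapped reads on a linear genome (0-based starts)."""
    depth = [0] * genome_len
    for s in starts:
        # Clip reads that run past the end of the genome.
        for pos in range(s, min(s + read_len, genome_len)):
            depth[pos] += 1
    return depth


def uniform_expectation(n_reads, read_len, genome_len):
    """Mean depth if reads were spread evenly (ignores edge effects)."""
    return n_reads * read_len / genome_len


# Toy data (invented): five 5-bp reads on a 20-bp genome, three at one start.
starts = [0, 0, 0, 10, 15]
depth = coverage_depth(starts, read_len=5, genome_len=20)
expected = uniform_expectation(len(starts), read_len=5, genome_len=20)
ratio = max(depth) / expected  # how far the deepest position exceeds uniform
```

Positions with depth far above or below the uniform expectation are the ones that contribute to coverage heterogeneity.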

SLIDE 10

Key issues

◮ Coverage heterogeneity
  ◮ Estimation biases

Read count increases with gene length
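
Because longer genes collect more reads at the same expression level, raw counts are often scaled by gene length before genes are compared. The sketch below shows one common scaling, reads per kilobase; it is an illustration under that assumption, not a method stated in the slides, and the function name, gene names, and numbers are invented.

```python
def reads_per_kilobase(counts, lengths_bp):
    """Scale raw read counts by gene length in kilobases.

    `counts` maps gene -> read count; `lengths_bp` maps gene -> length in bp.
    Hypothetical helper for illustration.
    """
    return {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}


# Toy data (invented): geneA is four times longer and has four times the reads.
counts = {"geneA": 200, "geneB": 50}
lengths_bp = {"geneA": 4000, "geneB": 1000}
scaled = reads_per_kilobase(counts, lengths_bp)
```

After scaling, the two toy genes have the same value, although their raw counts differ fourfold.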

SLIDE 11

Key issues

◮ Sequencing uncertainty [2]

Reads, stratified by cycle, supporting a spurious SNP call in φX174
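
Stratifying the supporting reads by cycle can be sketched as follows. The code assumes ungapped, forward-strand alignments given as (start position, sequence) pairs, with cycle 1 being the first base of the read; the function name and the toy alignments are hypothetical.

```python
from collections import Counter


def support_by_cycle(alignments, snp_pos, alt_base):
    """Count reads carrying `alt_base` at `snp_pos`, keyed by 1-based cycle.

    `alignments` is a list of (start, sequence) pairs for ungapped,
    forward-strand reads. Hypothetical helper for illustration.
    """
    by_cycle = Counter()
    for start, seq in alignments:
        offset = snp_pos - start          # index of the SNP position within the read
        if 0 <= offset < len(seq) and seq[offset] == alt_base:
            by_cycle[offset + 1] += 1
    return by_cycle


# Toy data (invented): the alternate base T at position 10 appears only
# in the last cycles of the reads that cover it.
alignments = [
    (3, "AAAAAAAT"),   # T at cycle 8
    (4, "AAAAAATA"),   # T at cycle 7
    (3, "CCCCCCCT"),   # T at cycle 8
    (8, "AAGAAAAA"),   # G at cycle 3, does not support the SNP
    (10, "GAAAAAAA"),  # G at cycle 1, does not support the SNP
]
support = support_by_cycle(alignments, snp_pos=10, alt_base="T")
```

Support concentrated in late cycles, where base-call quality is typically lower, is one hint that a call may be spurious.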

SLIDE 12

Case study

Subset of Brooks et al. [3]

◮ RNAi and mRNA-seq to identify pasilla-regulated alternative splicing
◮ Purified polyA, random-hexamer primed
◮ Single- and paired-end sequences
◮ Alignment to reference genome and curated splice junctions

SLIDE 13

[1] P. L. Auer and R. W. Doerge. Statistical design and analysis of RNA sequencing data. Genetics, 185:405–416, Jun 2010.

[2] H. C. Bravo and R. A. Irizarry. Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics, 66:665–674, Sep 2010.

[3] A. N. Brooks, L. Yang, M. O. Duff, K. D. Hansen, J. W. Park, S. Dudoit, S. E. Brenner, and B. R. Graveley. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Res., 21:193–202, Feb 2011.

[4] J. H. Malone and B. Oliver. Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol., 9:34, 2011.