Errors, biases and Quality control in Next Gen Sequencing Dr David - - PowerPoint PPT Presentation


slide-1
SLIDE 1

Errors, biases and Quality control in Next Gen Sequencing

Dr David Humphreys

d.humphreys@victorchang.edu.au

  • Lab scientist : Bioinformatician
  • RNA biologist
  • small RNAs (miRNA)

Victor Chang Cardiac Research Institute, Sydney, Australia

slide-2
SLIDE 2

Testing hypotheses and theories

Errors/Biases:

  • Present in all experiments
  • Be aware/informed
  • Minimise
  • Test

[Figure: timeline of HTS/NGS data points from 1994 to 2013; labels "ME! ???" (2009) and "You???" (2013)]

Next generation sequencing:

  • Series of experiments
  • Biases/errors accumulate!
slide-3
SLIDE 3

Anscombe’s Quartet

Image source: Wikipedia

  • Maths is a tool for analysis.
  • Used blindly, it can hide biases and errors in data sets.
  • Mean, standard deviation, variance and correlation are (almost) the same for all four data sets!

Anscombe F.J (1973) American Statistician
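
The point of the quartet can be checked numerically. The sketch below is illustrative only: it uses the four published Anscombe data sets, and the helper names (`pearson`, `summarise`, `SUMMARY`) are made up for this example. The summary statistics come out nearly identical even though the four scatter plots look nothing alike.

```python
import math
from statistics import mean, variance

# The four data sets from Anscombe (1973); sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def summarise(xs, ys):
    """Mean, sample variance and correlation for one data set."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "r": pearson(xs, ys),
    }


SUMMARY = {name: summarise(xs, ys) for name, (xs, ys) in ANSCOMBE.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Plotting each data set is what reveals the differences; the numbers alone do not.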

slide-4
SLIDE 4

Workflow:

High Throughput Sequencing

Sample preparation, Library preparation, Clonal amplification, Sequencing, Bioinformatics

Challenges:

  • Quantification, purity
  • (1) Awareness: community, literature, network
  • (2) QC considerations: time, cost
  • Gels, stains, absorbance, molarity, titrations, fluorescence
  • CPU, cores, scripts, command line, RAM, threads
  • Consumption, throughput
  • Genes, genome, SNPs
  • Sensitivity/specificity, cumulative error

slide-5
SLIDE 5

Quantification:

Nanodrop spectrophotometer

http://www.nanodrop.com/Library/CVStech_17_11_FINAL.pdf

WARNING!

  • Careful of accuracy < 50ng/ul
  • Careful of concentrations > 1ug/ul
  • Does not assess quality!!

Contaminants by absorbance wavelength (* http://seqanswers.com/forums/showthread.php?t=21280):

  • 230nm: EDTA, carbohydrates, sodium acetate*, tris*
  • 270nm: Phenol (plus at 230nm*)
  • 280nm: DTT

WARNING!

  • Contaminants can impact on downstream enzymatic reactions

Ratios:

  • 260/280: 1.8 (DNA), 2.0 (RNA)
  • 260/270: 1.2 – 1.3?
  • 260/230: 2.0 – 2.2

  • Quick
  • Consumes 1-2ul sample
  • Large dynamic range (10 – 10,000ng/ul)
  • Can identify contaminations

Solution: Re-precipitate/buffer exchange
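
The ratio check can be sketched in a few lines. This is an illustration, not an instrument protocol: the target values are the ones listed on this slide, while the function name `purity_flags` and the ±0.1 tolerance are assumptions chosen for the example.

```python
# Target ratios as listed on the slide.
EXPECTED_260_280 = {"DNA": 1.8, "RNA": 2.0}
MIN_260_230 = 2.0

# Assumption: how far a ratio may drift before it is flagged.
TOLERANCE = 0.1


def purity_flags(a230, a260, a280, sample_type="DNA", tolerance=TOLERANCE):
    """Compute absorbance ratios and list any that suggest contamination."""
    ratio_280 = a260 / a280
    ratio_230 = a260 / a230
    flags = []
    if abs(ratio_280 - EXPECTED_260_280[sample_type]) > tolerance:
        flags.append("260/280 off target")
    if ratio_230 < MIN_260_230 - tolerance:
        flags.append("260/230 low (possible EDTA, carbohydrate or phenol carry-over)")
    return {"260/280": ratio_280, "260/230": ratio_230, "flags": flags}


# Hypothetical readings for a clean DNA sample and a contaminated RNA sample.
clean = purity_flags(a230=0.5, a260=1.0, a280=0.55, sample_type="DNA")
dirty = purity_flags(a230=1.0, a260=1.0, a280=0.5, sample_type="RNA")
print(clean["flags"], dirty["flags"])
```

A flagged sample would then go back for re-precipitation or buffer exchange, as suggested above.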

slide-6
SLIDE 6

Quantification:

Qubit fluorimeter

WARNING!

  • Known biases in quantifying ssRNA < 50ng/ul
  • Cannot quantitate ssDNA in presence of dsDNA

  • More sensitive than Nanodrop
  • Consumes small amount of sample
  • Specific assays
slide-7
SLIDE 7

Quantification

Agilent Bioanalyzer

  • Consumes small amount of sample
  • Quantification
  • Estimating nucleic acid size

WARNING!

  • Each chip has a quantitative range
  • Sensitive to salts.
  • Limitations on size range
  • Not accurate when quantitating broad smears

Quantitative range by chip and application (one line per chip):

  • Total RNA* 5-500ng/ul; mRNA 25-250ng/ul
  • Total RNA* 50-5000pg/ul; mRNA 250-5000pg/ul
  • dsDNA 5-500pg/ul (50-7000bp)

* RNA integrity number (RIN)

  • Use at least 50ng for meaningful RIN

Schroeder et al (2006) BMC Mol Bio.

slide-8
SLIDE 8

Criteria, with the RNA and DNA requirement and the QC method for each:

  • High complexity. RNA: Trizol vs column based. DNA: Phenol:chloroform vs column based. QC: qPCR, Northern blotting??
  • High quality. RNA: RIN > 8. DNA: Unfragmented. QC: Bioanalyzer, gel electrophoresis
  • Accurate quantification. RNA: pg - ng - ug. DNA: pg - ng - ug. QC: Qubit/Nanodrop, Agilent Bioanalyser
  • Contamination (salts, organics). RNA: A260/280 = 2, A260/230 > 2. DNA: A260/280 = 1.8, A260/230 > 2. QC: Qubit, Nanodrop
  • Enrichment. RNA: Deplete ribosomes. DNA: Exome capture. QC: qPCR/Agilent
  • Fragment. RNA and DNA: Uniform peaks better than broad. QC: Agilent

GOAL: to have a final sample with high complexity

Sample Purification/Assessment/Processing

1) Library manual as provided by the manufacturer 2) http://nxseq.bitesizebio.com/articles/

slide-9
SLIDE 9

Purification biases

miRNAs 141, -29b, -21, -106b, -15a and -34a decreased in cells grown at low confluence/loss of adhesion (library prep + sequencing).

  • Cell number: (L) = 200,000, (H) = 800,000; 1mL Trizol

Kim et al., (2011) Molecular Cell 43, 1005-1014

  • Cell number: Low = 500,000, High = 800,000; ratio 141/200c

Kim et al., (2012) Molecular Cell 46, 893-895

  • Small RNAs precipitate with longer RNA
  • Most susceptible: low GC content, secondary structure
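
GC content is easy to screen for before interpreting a small RNA result. The sketch below is a minimal illustration; the function name `gc_content` is an assumption, and the example sequence is the let-7a sequence shown later in these slides. Secondary structure needs a folding tool and is not covered here.

```python
def gc_content(sequence):
    """Fraction of G and C bases in an RNA or DNA sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in sequence) / len(sequence)


# let-7a as written on the miRNA seed analysis slide.
LET_7A = "UGAGGUAGUAGGUUGUAUAGUUU"
print(f"let-7a GC content: {gc_content(LET_7A):.2f}")
```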

slide-10
SLIDE 10

Ligation biases

  • Enzyme
  • Temperature
  • Sequence

miRNA library biases

Hafner et al., (2011) "RNA-ligase-dependent biases in miRNA ….. cDNA libraries" RNA 17(9), 1-16

Input:

  • 770 synthetic miRNAs
  • 45 designed RNAs
  • Pool A = equimolar
  • Pool B = 10-fold serial dilution

Reverse transcription bias

  • Not a significant source of sequence-specific biases

PCR bias

  • Dilute 1:10000, 10 PCR cycles
  • No appreciable distortion!

5 x WARNING!

  • Don’t compare NGS data sets from different library preps
  • Be consistent with incubation times/temperatures
slide-11
SLIDE 11

Sequencing platforms

  • Ion Torrent, Illumina, Complete Genomics
  • Kapa Biosystems, standard reagents
  • Flowcell/lane variations do occur, but are smaller than those observed between platforms

References:

  • Ross et al., Characterizing and measuring bias in sequence data. Genome Biology 2013
  • Bragg et al., Shining a light on Dark sequencing characterising errors. PLoS Comp Biol 2013
  • Loman et al., Performance comparison of benchtop HTS platforms. Nature Biotech 2012
  • Quail et al., Tale of three NGS platforms. BMC Genomics, 2012
  • Lam et al., Performance comparison of whole genome sequencing platforms. Nat Biotech 2012

slide-12
SLIDE 12

Sample preparation, Library preparation, Clonal amplification, Sequencing, Bioinformatics

Bioinformatics steps: Raw sequencing files, Assessing sequence quality, Align (pipeline), Assessing alignment data

slide-13
SLIDE 13

Raw sequencing files

The Basics:

File types: fastq, csfasta, qual, fasta, xsq

  • Header: Coordinates/other
  • Sequence: A T C G N/.
  • Quality values: Phred score

Quality encodings for Phred scores 0 to 40:

Numerical: 0 . . . 10 . . . 20 . . . 30 . . . 40
Phred+33:  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHI
Phred+64:  @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
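
Decoding a quality string is a one-line subtraction of the offset from each character code. The sketch below assumes you already know which encoding the file uses; the function names are illustrative. A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (offset 33 or 64)."""
    scores = [ord(character) - offset for character in quality_string]
    if any(score < 0 for score in scores):
        # A negative score usually means the wrong offset was chosen.
        raise ValueError("negative Phred score: check the offset")
    return scores


def error_probability(phred_score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred_score / 10)


print(decode_qualities("!5?I"))             # Phred+33 encoded string
print(decode_qualities("@Th", offset=64))   # Phred+64 encoded string
print(error_probability(30))
```

Decoding a Phred+33 file with offset 64 produces negative scores, which is why the function raises rather than returning nonsense.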

slide-14
SLIDE 14

  • Free Java utility that can assess QC metrics of HTS data sets.
  • GUI
  • Command line
  • Can create HTML output
  • Input formats: fastq (standard, gzip, colorspace, casava), SAM/BAM

Not all data sets require a full complement of green ticks!!

slide-15
SLIDE 15

Assessing sequence quality

[Figure: per-cycle quality plots for very good, reasonable and poor data; plot elements labelled median, mean, 10%, 25%, 75% and 90%]
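
The quantities behind such a plot can be computed directly from the quality strings. The following is a rough sketch only: it assumes Phred+33 encoding, uses a simple nearest-rank percentile, and the function name `per_cycle_summary` and the three toy reads are invented for the example. It is not how any particular QC tool implements the plot.

```python
import math
import statistics


def per_cycle_summary(quality_strings, offset=33):
    """Mean, median and selected percentiles of Phred scores at each cycle."""
    longest = max(len(quality) for quality in quality_strings)
    summary = []
    for cycle in range(longest):
        # Reads shorter than this cycle simply do not contribute.
        scores = sorted(
            ord(quality[cycle]) - offset
            for quality in quality_strings
            if len(quality) > cycle
        )

        def percentile(p):
            # Nearest-rank percentile on the sorted scores.
            rank = max(1, math.ceil(p / 100 * len(scores)))
            return scores[rank - 1]

        summary.append({
            "cycle": cycle + 1,
            "mean": statistics.mean(scores),
            "median": statistics.median(scores),
            "p10": percentile(10),
            "p25": percentile(25),
            "p75": percentile(75),
            "p90": percentile(90),
        })
    return summary


# Three toy reads whose quality falls away towards the 3' end.
TOY_QUALITIES = ["IIII", "II5!", "I5!!"]
CYCLES = per_cycle_summary(TOY_QUALITIES)
for row in CYCLES:
    print(row)
```

A falling median across cycles is the pattern that identifies unreliable cycles and motivates trimming.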

slide-16
SLIDE 16

Assessing sequence quality

  • Identify adaptors and primers
  • Identifies if a subset of sequences have low quality
  • May identify cycles that are unreliable

Helps assess raw data files prior to mapping:

  • Low quality data may cause incorrect alignments
  • Low quality data may lead to incorrectly called variations
  • Sequences with trailing adaptor sequences will not map
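
Because reads with trailing adaptor will not map, the adaptor is normally removed before alignment. The sketch below shows the idea with exact matching only; real trimmers also tolerate mismatches and use quality values. The function name, the `min_overlap` default and both sequences are assumptions for illustration; substitute the adaptor from your own library kit.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adaptor from a read, using exact matches only."""
    # Case 1: the whole adaptor is present somewhere in the read.
    position = read.find(adapter)
    if position != -1:
        return read[:position]
    # Case 2: the read ends part-way through the adaptor.
    longest = min(len(adapter) - 1, len(read))
    for overlap in range(longest, max(1, min_overlap) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adaptor found: return the read unchanged.
    return read


# Illustrative sequences (assumptions, not taken from the slides).
INSERT = "TGAGGTAGTAGGTTGTATAGTT"
ADAPTER = "TGGAATTCTCGGGTGCCAAGG"

print(trim_3prime_adapter(INSERT + ADAPTER + "AAAA", ADAPTER))
print(trim_3prime_adapter(INSERT + ADAPTER[:10], ADAPTER))
```

A small `min_overlap` trims more aggressively but risks removing genuine bases that happen to match the start of the adaptor.
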
slide-17
SLIDE 17

Align (pipeline)

Reference

  • Choose a suitable reference.
  • Include mitochondrial sequence
  • Design a filter set to capture repeated sequences (rRNA, tRNA)

Aligners

Be aware of the default options

  • Accepted errors
  • Multimappers

Different aligners can give different results.

Benchmarking short sequence mapping tools. Hatem et al (BMC Bioinformatics, 2013)

slide-18
SLIDE 18

Assessing alignment data

Mapping statistics (% mapped; % mapped at what length): Pass or Questionable

Alignment feature statistics

  • Coverage
  • Expression
  • Discovery

Test: filter raw data

  • Filter
  • Trim
  • Include a filter

Important!

  • Know your mapping statistics
  • Know what to expect from your data sets
  • Test on existing data set
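
The two mapping statistics named on this slide, % mapped and % mapped at each read length, can be tallied as follows. This is a minimal sketch: the input is assumed to be a list of (read length, mapped?) pairs that you have already extracted from your aligner's output, and the function name and toy data are invented for the example.

```python
from collections import defaultdict


def mapping_statistics(reads):
    """Percent of reads mapped, overall and for each read length.

    `reads` is an iterable of (read_length, is_mapped) pairs.
    """
    total = 0
    mapped = 0
    by_length = defaultdict(lambda: [0, 0])  # length -> [mapped, total]
    for length, is_mapped in reads:
        total += 1
        by_length[length][1] += 1
        if is_mapped:
            mapped += 1
            by_length[length][0] += 1

    def percent(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "percent_mapped": percent(mapped, total),
        "percent_mapped_by_length": {
            length: percent(m, t) for length, (m, t) in sorted(by_length.items())
        },
    }


# Toy data: 22 nt reads map well, 15 nt reads mostly do not.
TOY_READS = [(22, True)] * 3 + [(22, False)] + [(15, True)] + [(15, False)] * 3
STATS = mapping_statistics(TOY_READS)
print(STATS)
```

Comparing these numbers against an existing data set of the same type is what tells you whether a run is a pass or questionable.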

slide-19
SLIDE 19

Take home messages

Be familiar with existing data sets

  • NGS is a collection of experiments
  • Biases/errors can/will occur at all steps of a high throughput sequencing study
  • QC measures should be applied at all steps of a high throughput sequencing study
  • Don’t be alarmed, stay informed
slide-20
SLIDE 20

miRNA sequencing profiling

miRspring

  • Small (<2MB) HTML document that replicates the miRNA aligned sequencing data.
  • Needs NO internet connectivity.
  • Provides visualization of sequence data
  • Reports on miRNA processing
  • Complete transparency.

Humphreys D.T., and Suter C.M. Nucleic Acids Research 2013. http://miRspring.victorchang.edu.au

slide-21
SLIDE 21

microRNAs

  • Small non-coding RNAs (22nt)
  • Bind to 3’UTRs, causing decay and/or translational repression
  • Biogenesis: Derived from longer stem loop precursors

miRspring reporting tools (labelled on a 5’ to 3’ diagram):

  • i) 5’ isomiRs
  • ii) 3’ isomiRs
  • iii) Non-canonical
  • iv) Arm bias
  • v) miRNA length
  • vi) RNA editing (A, G, C, T)
slide-22
SLIDE 22

miRspring

miRNA clusters

  • Mono-cistronic
  • Poly-cistronic

miRNA seed analysis

  • miR-196a UAGGUAGUUUCCUGUUGUUGGG (seed AGGUAGU)
  • let-7a UGAGGUAGUAGGUUGUAUAGUUU (seed GAGGUAG)

[Figure: two panels, each labelled "Genomic"]
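
The seeds shown here correspond to nucleotides 2 to 8 of each mature miRNA. The sketch below extracts them and groups miRNAs by seed; the function names are illustrative, and the two sequences are the ones written on this slide.

```python
from collections import defaultdict


def seed(mirna_sequence, start=2, end=8):
    """Return the seed: nucleotides `start` to `end`, counted from 1 at the 5' end."""
    return mirna_sequence[start - 1:end]


def group_by_seed(mirnas):
    """Group miRNA names by their seed sequence."""
    families = defaultdict(list)
    for name, sequence in mirnas.items():
        families[seed(sequence)].append(name)
    return dict(families)


MIRNAS = {
    "miR-196a": "UAGGUAGUUUCCUGUUGUUGGG",
    "let-7a": "UGAGGUAGUAGGUUGUAUAGUUU",
}
print(group_by_seed(MIRNAS))
```

The two seeds are the same 7-mer offset by one nucleotide, which is the kind of relationship a seed analysis is meant to expose.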

slide-23
SLIDE 23

miRspring QC features

Sampling bias!

  • Tissue Atlas: Heart, Kidney, Liver, Lung, Ovary, Spleen, Testes, Thymus, Brain, Placenta
  • AGO IP: THP-1
  • ENCODE: HeLa S3, A549, Ag04450, Bj, Gm1287, H1hesc, HepG2, Huvec, K562, MCF7, NheK, Sknshra

  • 73 miRspring documents
  • 895 million sequence tags
  • < 55 megabytes of disk space
slide-24
SLIDE 24

miRspring reporting features

Top 100 miRNAs typically:

  • 22nt long
  • Good correlation with miRBase

miRspring provides a quick, easy way to analyse QC parameters of your data set

[Figure: two plots with the axis label "Centile Rank"]

slide-25
SLIDE 25

Final points

Victor Chang Cardiac Research Institute, Sydney, Australia

  • Many NGS protocols are well established.
  • Worth understanding what variations/features are found in data sets.
  • miRspring is a powerful tool to help you assess a data set
  • Yes, it only examines one data set at a time.
  • Provides complete transparency
  • Allows ANYONE to examine an NGS data set.

Example miRspring documents can be found at http://miRspring.victorchang.edu.au

slide-26
SLIDE 26

Acknowledgements

VCCRI

Cath Suter, Paul Young, Rupert Shuttleworth, Diane Fatkin, Monique Ohanian, Djordje Djordjevic, Chris Hayward, Kavitha Muthiah, Richard Harvey, Mirana Ramialison, Ashley Waardenberg

IT: Timothy Kersten, Pardeep Dhiman

Thomas Priess (VCCRI/ANU)

  • Pardeep Patel
  • Carly Hynes
  • Tennille Sibbritt
  • Jennifer Clancy

Matthias Hentze (EMBL)

Funding bodies: ARC, NHMRC, Viertel Charitable Foundation, Perpetual Trust, VCCRI