Analysis of Ashley Sawle based on slides by Bernard Pereira The - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Analysis of

Ashley Sawle

based on slides by Bernard Pereira

slide-2
SLIDE 2

The many faces of RNA-seq – Techniques

  • mRNA-seq
  • Exome capture
  • Targeted
  • Small RNA
  • Total RNA
  • Ribosome profiling
  • Single Cell RNA-Seq

(Small RNA: miRNA, piRNA, sncRNA)

slide-3
SLIDE 3

The many faces of RNA-seq – Applications

Discovery

  • Transcripts
  • Isoforms
  • Splice junctions
  • Fusion genes

Differential expression

  • Gene level expression changes

  • Relative isoform abundance
  • Splicing patterns

Variant calling

slide-4
SLIDE 4

Microarray → RNA-seq

Guo et al. (2013) PLOS ONE; Wang et al. (2014) Nature Biotech.

slide-5
SLIDE 5

Library Preparation & Sequencing

modified from Malone JH, Oliver B (2011) BMC Biol.

  • QC - RIN (RNA integrity number)

Sigurgeirsson, Emanuelsson & Lundeberg (2014) PLOS ONE

  • Multiplexing

slide-6
SLIDE 6

Sources of Noise

  • Biological
  • Technical
  • Sampling
  • Process

slide-7
SLIDE 7

Sources of Noise – Sampling Bias

Subsampling from a pool of RNAs

[Figure: Sample A and Sample B]

slide-8
SLIDE 8

Sources of Noise – Sampling Bias

Transcript length affects the number of RNA fragments present in the library from that gene.

[Figure: Transcript A and Transcript B]
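The length effect can be illustrated with a small simulation. This is a minimal sketch, assuming fragments are drawn with probability proportional to copy number times transcript length; the transcript names, copy numbers, and lengths are hypothetical.

```python
import random

def sample_fragments(transcripts, n_fragments, seed=0):
    """Draw fragments with probability proportional to copies x length."""
    rng = random.Random(seed)
    names = list(transcripts)
    weights = [transcripts[t]["copies"] * transcripts[t]["length"] for t in names]
    draws = rng.choices(names, weights=weights, k=n_fragments)
    return {t: draws.count(t) for t in names}

# Hypothetical transcripts: equal copy number, B is four times longer than A.
transcripts = {
    "A": {"copies": 100, "length": 500},
    "B": {"copies": 100, "length": 2000},
}
counts = sample_fragments(transcripts, n_fragments=10_000)
ratio = counts["B"] / counts["A"]  # close to the length ratio, not the copy ratio
```

Although both transcripts are equally abundant, the longer one contributes roughly four times as many fragments, which is why raw counts are not comparable between genes without a length correction.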

slide-9
SLIDE 9

Sources of Noise - Process

slide-10
SLIDE 10

Sources of Noise - Process

slide-11
SLIDE 11

Sources of Noise – Process

  • PCR duplicates
  • Optical duplicates
  • Sequencing errors
  • Index swapping

slide-12
SLIDE 12

Raw Sequence QC - FastQC

https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

slide-13
SLIDE 13

Trimming

  • Quality-based Trimming
  • Adapter contamination

[Figure: 50-base read and insert]
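The two trimming steps can be sketched as follows. This is a simplified illustration, not the algorithm of any particular trimming tool: quality trimming here just removes low-scoring bases from the 3' end, and adapter trimming looks for an exact match to the adapter or to a prefix of it at the end of the read. The adapter sequence, threshold, and `min_overlap` value are assumptions chosen for the example.

```python
def quality_trim(seq, quals, threshold=20):
    """Drop bases from the 3' end while their Phred score is below threshold."""
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]

def adapter_trim(seq, adapter, min_overlap=3):
    """Cut the read where the adapter, or a prefix of it at the 3' end, starts."""
    pos = seq.find(adapter)
    if pos != -1:
        return seq[:pos]
    # Partial adapter: the read ends with the first k bases of the adapter.
    for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
        if seq.endswith(adapter[:k]):
            return seq[:len(seq) - k]
    return seq

# Example adapter sequence (an assumption for illustration).
adapter = "AGATCGGAAGAGC"
read = "ACGTACGTAC" + "AGATCG"   # insert shorter than the read: adapter read-through
trimmed = adapter_trim(read, adapter)
q_seq, q_quals = quality_trim("ACGTACGT", [30, 30, 30, 30, 30, 10, 25, 5])
```

A short `min_overlap` removes more adapter remnants but also trims some genuine sequence by chance, so real tools usually allow mismatches and let the user tune this trade-off.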

slide-14
SLIDE 14

Adapter contamination - FastQC

slide-15
SLIDE 15

Sequence to Sense

Conesa et al. (2016) Genome Biology

slide-16
SLIDE 16

De Novo assembly

e.g. Trinity

Haas, B.J. et al. (2013) Nature Protocols

slide-17
SLIDE 17

Analysis Overview

Mapping → Summarisation → Normalisation → DE analysis → Functional analysis

slide-18
SLIDE 18

Reference-based assembly

Genome mapping

  • Can identify novel features
  • Splice aware?
  • Can be difficult to reconstruct isoform and gene structures

Transcriptome mapping

  • No repetitive reference
  • Novel features?
  • How reliable is the transcriptome?

Trapnell & Salzberg (2009) Nature Biotech

slide-19
SLIDE 19

A smart suit(e) for RNA-seq analysis

Trapnell, C. et al (2012) Nature Protocols

slide-20
SLIDE 20

Spliced Alignment

slide-21
SLIDE 21

Spliced Alignment with TopHat/Bowtie

Kim, D. et al (2012) Genome Biology

slide-22
SLIDE 22

Visualising Mapping Results – IGV

slide-23
SLIDE 23

Summarisation/Counting

Oshlack, A. et al. (2010) Genome Biology

Genome-based features

  • Exon or gene boundaries?
  • Isoform structures
  • Gene multireads

Transcript-based features

  • Transcript assembly
  • Novel structures
  • Isoform multireads
slide-24
SLIDE 24

Summarisation/Counting

e.g. HTSeq or Subread
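Gene-level counting can be sketched as assigning each aligned read to the gene it overlaps. This is a toy illustration using half-open coordinate intervals on a single chromosome; the gene names and coordinates are hypothetical, and the `__ambiguous` and `__no_feature` labels are borrowed from the style of HTSeq's output rather than reproducing its logic.

```python
def count_reads(reads, genes):
    """Count reads per gene; reads hitting several genes or none are set aside."""
    counts = {g: 0 for g in genes}
    counts["__ambiguous"] = 0
    counts["__no_feature"] = 0
    for start, end in reads:
        # Two half-open intervals overlap if each starts before the other ends.
        hits = [g for g, (g_start, g_end) in genes.items()
                if start < g_end and end > g_start]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif hits:
            counts["__ambiguous"] += 1
        else:
            counts["__no_feature"] += 1
    return counts

# Hypothetical overlapping genes and aligned reads (start, end).
genes = {"geneA": (100, 200), "geneB": (180, 300)}
reads = [(110, 160), (150, 190), (185, 195), (250, 290), (400, 450)]
counts = count_reads(reads, genes)
# counts == {"geneA": 1, "geneB": 1, "__ambiguous": 2, "__no_feature": 1}
```

How ambiguous reads are handled (discarded, split, or assigned by rule) is a design choice that differs between counting tools and modes.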

slide-25
SLIDE 25

Summarisation/Counting

Mortazavi, A. et al (2008) Nature Methods

slide-26
SLIDE 26

Counting

slide-27
SLIDE 27

Normalisation

  • Counting → estimate of relative counts for each gene

Does this accurately represent the original population?

Library size

  • Sequencing depth varies between samples

Gene properties

  • GC content, length, sequence

Library composition

  • Highly expressed genes overrepresented at cost of lowly expressed genes

slide-28
SLIDE 28

Normalisation - Scaling

Total Count

  • Normalise each sample by total number of reads sequenced.
  • Can also use another statistic similar to total count; eg. median, upper quartile
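Both scaling options can be sketched in a few lines. This is a minimal illustration, assuming a per-million scale for total count and the 75th percentile of non-zero counts for the upper quartile; the sample values are made up.

```python
import statistics

def total_count_scale(counts, per=1_000_000):
    """Scale counts by the total number of reads in the sample (counts per million)."""
    total = sum(counts)
    return [c * per / total for c in counts]

def upper_quartile_scale(counts):
    """Scale counts by the upper quartile of the non-zero counts."""
    nonzero = [c for c in counts if c > 0]
    uq = statistics.quantiles(nonzero, n=4)[2]
    return [c / uq for c in counts]

# Hypothetical samples: same composition, sample_b sequenced twice as deeply.
sample_a = [10, 20, 30, 40, 0]
sample_b = [20, 40, 60, 80, 0]
cpm_a, cpm_b = total_count_scale(sample_a), total_count_scale(sample_b)
uq_a, uq_b = upper_quartile_scale(sample_a), upper_quartile_scale(sample_b)
```

After either scaling the two samples agree gene by gene, because they differ only in depth. Scaling by total count is less robust when a few highly expressed genes dominate the library, which is the motivation for the quartile-based alternative.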


slide-29
SLIDE 29

Normalisation - TPM

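$$\text{RPK for gene A} = \frac{\text{reads for gene A}}{\text{length of gene A} \div 1000}$$

$$\text{Scaling factor} = \frac{\text{sum of all RPKs}}{1{,}000{,}000}$$

$$\text{TPM for gene A} = \frac{\text{RPK for gene A}}{\text{Scaling factor}}$$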
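The TPM calculation on this slide can be sketched as three steps: reads per kilobase (RPK) for each gene, a per-sample scaling factor, and the final division. The gene names, counts, and lengths below are hypothetical.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million from raw counts and gene lengths in bases."""
    # Step 1: reads per kilobase for every gene.
    rpk = {g: read_counts[g] / (lengths_bp[g] / 1000) for g in read_counts}
    # Step 2: per-sample scaling factor.
    scaling_factor = sum(rpk.values()) / 1_000_000
    # Step 3: divide each RPK by the scaling factor.
    return {g: rpk[g] / scaling_factor for g in rpk}

# Hypothetical genes whose counts grow in proportion to their length.
read_counts = {"A": 100, "B": 200, "C": 300}
lengths_bp = {"A": 1000, "B": 2000, "C": 3000}
values = tpm(read_counts, lengths_bp)
```

Because the length correction is applied before the depth correction, TPM values always sum to one million within a sample, and here all three genes receive the same TPM despite different raw counts.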

slide-30
SLIDE 30

Normalisation – Geometric Scaling

Geometric scaling factor

  • Assumes that most genes are not differentially expressed

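$$\text{Scaling factor (per sample)} = \operatorname{median}\left( \frac{\text{RC of Gene 1}}{\text{GM of Gene 1}}, \frac{\text{RC of Gene 2}}{\text{GM of Gene 2}}, \frac{\text{RC of Gene 3}}{\text{GM of Gene 3}}, \ldots, \frac{\text{RC of Gene N}}{\text{GM of Gene N}} \right)$$

RC = read counts (per sample), GM = geometric mean (all samples)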
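The geometric scaling factor can be sketched as a median of ratios. This is a minimal illustration with made-up counts; skipping genes that have a zero count in any sample (their geometric mean would be zero) is an assumption of this sketch.

```python
import math
import statistics

def geometric_scaling_factors(counts):
    """counts maps gene -> list of read counts, one entry per sample."""
    n_samples = len(next(iter(counts.values())))
    # Geometric mean of each gene across all samples (genes with a zero are skipped).
    gm = {g: math.exp(sum(math.log(c) for c in row) / len(row))
          for g, row in counts.items() if all(c > 0 for c in row)}
    # Per sample: median over genes of read count divided by geometric mean.
    return [statistics.median([counts[g][j] / gm[g] for g in gm])
            for j in range(n_samples)]

# Hypothetical data: sample 2 has twice the depth of sample 1.
counts = {"g1": [10, 20], "g2": [100, 200], "g3": [5, 10], "g4": [0, 3]}
factors = geometric_scaling_factors(counts)
normalised = {g: [c / f for c, f in zip(row, factors)] for g, row in counts.items()}
```

Using the median makes the factor insensitive to a minority of genes with extreme ratios, which is where the assumption that most genes are not differentially expressed comes in.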

slide-31
SLIDE 31

Normalisation – Trimmed Mean of M

Robinson, M.D. & Oshlack, A. (2010) Genome Biology

Trimmed mean of M

  • Implemented in edgeR
  • Assumes most genes are not differentially expressed
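The idea behind the trimmed mean of M-values can be sketched as follows. This is a simplified, unweighted version written for illustration and is not edgeR's implementation: the published method also weights each gene by the precision of its log ratio, which is omitted here. The trimming fractions and the example counts are assumptions.

```python
import math

def tmm_factor(sample, reference, m_trim=0.3, a_trim=0.05):
    """Unweighted trimmed mean of log ratios (M) between a sample and a reference."""
    n_s, n_r = sum(sample), sum(reference)
    m_vals, a_vals = [], []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:
            p_s, p_r = y_s / n_s, y_r / n_r
            m_vals.append(math.log2(p_s / p_r))        # log fold change
            a_vals.append(0.5 * math.log2(p_s * p_r))  # average log abundance
    n = len(m_vals)

    def kept(values, trim):
        # Indices that survive trimming the given fraction from each tail.
        order = sorted(range(n), key=lambda i: values[i])
        cut = int(n * trim)
        return set(order[cut:n - cut])

    keep = kept(m_vals, m_trim) & kept(a_vals, a_trim)
    mean_m = sum(m_vals[i] for i in keep) / len(keep)
    return 2 ** mean_m

# Hypothetical data: one gene is strongly up-regulated in the sample.
reference = [100] * 10
sample = [100] * 9 + [1000]
factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
```

In this example the single up-regulated gene inflates the sample's total count, and the factor shrinks the effective library size back to that of the reference, so the nine unchanged genes are no longer reported as down-regulated.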
slide-32
SLIDE 32

Differential Expression

  • Comparing feature abundance under different conditions
  • Assumes linearity of signal
  • When feature=gene, well-established pre- and post-analysis strategies exist

Mortazavi, A. et al (2008) Nature Methods

slide-33
SLIDE 33

Differential Expression

  • Simple difference in means
  • Replication introduces variance

[Figure: expression level (scale 1 to 7) in conditions A and B]
slide-34
SLIDE 34

Differential Expression - Modelling

Normal distribution → t-test
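Under a normal model, the difference in means is judged against the variance between replicates. A minimal sketch of the t statistic (the unequal-variance, Welch form is an assumption of this example, as are the expression values):

```python
import math
import statistics

def welch_t(a, b):
    """t statistic for two groups of replicates, allowing unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

# Hypothetical expression values: both comparisons have a difference in means of 2.
tight = welch_t([5.0, 5.1, 4.9], [3.0, 3.1, 2.9])   # little replicate variance
noisy = welch_t([5.0, 7.0, 3.0], [3.0, 5.0, 1.0])   # large replicate variance
```

The same difference in means gives a much smaller statistic when the replicates are noisy, which is why the variance estimate matters as much as the fold change.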

slide-35
SLIDE 35

Differential Expression - Modelling

  • Use the Poisson distribution for count data
  • Just one parameter required – the mean
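The single-parameter property can be checked numerically. A minimal sketch that evaluates the Poisson probability mass function and confirms that the variance equals the mean (the mean of 4 is an arbitrary example value):

```python
import math

def poisson_pmf(k, mean):
    """Probability of observing k counts when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean = 4.0
support = range(60)  # the tail beyond 60 is negligible for a mean of 4
probs = [poisson_pmf(k, mean) for k in support]
est_mean = sum(k * p for k, p in zip(support, probs))
est_var = sum((k - est_mean) ** 2 * p for k, p in zip(support, probs))
```

Because the variance is tied to the mean, the Poisson model has no way to represent extra variability between biological replicates.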
slide-36
SLIDE 36

Differential Expression - Modelling

  • Biology is never that simple
  • The negative binomial distribution represents an overdispersed Poisson distribution
  • It has two parameters: mean and (over)dispersion
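The effect of the second parameter can be checked numerically. This sketch assumes the common parameterisation in which the variance is mean + dispersion × mean², and evaluates the negative binomial probability mass function directly; the mean and dispersion values are arbitrary examples.

```python
import math

def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf parameterised by mean and dispersion."""
    r = 1.0 / dispersion          # "size" parameter
    p = r / (r + mean)
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1 - p))
    return math.exp(log_pmf)

mean, dispersion = 4.0, 0.5
support = range(400)  # the tail beyond 400 is negligible for these values
probs = [nb_pmf(k, mean, dispersion) for k in support]
est_mean = sum(k * p for k, p in zip(support, probs))
est_var = sum((k - est_mean) ** 2 * p for k, p in zip(support, probs))
expected_var = mean + dispersion * mean ** 2   # 4 + 0.5 * 16 = 12
```

With the same mean as a Poisson model, the variance is three times larger here, and as the dispersion approaches zero the distribution approaches the Poisson.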

Anders, S. & Huber, W. (2010) Genome Biology

slide-37
SLIDE 37

Differential Expression - Modelling

  • Estimating the dispersion parameter can be difficult with a small number of samples

  • edgeR: models the variance as the sum of technical and biological variance
  • ‘Share’ information from all genes to obtain global estimate - shrinkage
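The idea of shrinkage can be illustrated with a toy example. This is not the estimator used by edgeR or DESeq: it assumes a method-of-moments dispersion per gene (from variance = mean + dispersion × mean²), a global estimate taken as the average over genes, and a fixed, hypothetical weight of 0.5 between the two.

```python
import statistics

def moment_dispersion(counts):
    """Method-of-moments dispersion, floored at zero."""
    m = statistics.mean(counts)
    v = statistics.variance(counts)
    return max((v - m) / m ** 2, 0.0)

def shrink(gene_disp, common_disp, weight=0.5):
    """Pull a noisy per-gene estimate towards the global estimate."""
    return weight * gene_disp + (1 - weight) * common_disp

# Hypothetical counts for three genes across three replicates.
genes = {"g1": [10, 12, 8], "g2": [100, 160, 40], "g3": [50, 55, 45]}
raw = {g: moment_dispersion(c) for g, c in genes.items()}
common = statistics.mean(list(raw.values()))
shrunk = {g: shrink(d, common) for g, d in raw.items()}
```

With only three replicates the per-gene estimates are extreme (zero for two genes, large for one); the shrunken values sit between each raw estimate and the global one, trading a little bias for much lower variance.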

Simon Anders

slide-38
SLIDE 38

Modelling – in fashion

  • DESeq uses a similar formulation of the variance term
slide-39
SLIDE 39

Towards Biological Meaning

  • Clustering

Hamy et al. (2016) PLOS One

slide-40
SLIDE 40

Towards Biological Meaning

  • Gene Set Enrichment Analysis
slide-41
SLIDE 41

Towards Biological Meaning

  • Network analysis

Hamy et al. (2016) PLOS One

slide-42
SLIDE 42

Replicates v Sequencing Depth

Liu et al. (2014) Bioinformatics

slide-43
SLIDE 43

Replicates v Sequencing Depth

[Figure panels: HIGH, MEDIUM, LOW]

Liu et al. (2014) Bioinformatics

slide-44
SLIDE 44

Replicates v Sequencing Depth

Liu et al. (2014) Bioinformatics