Quantifying gene expression



slide-1
SLIDE 1

Quantifying gene expression

slide-2
SLIDE 2

✓ Genome ✓ GTF (annotation)?

  • Sequence reads (FASTQ) → Quality control (FASTQ)
  • Genome route: FASTQ (+reference genome index; known GTF, optional) → Alignment to Genome: HISAT2, STAR → multiple BAMs (+known GTF) → Count reads associated with genes: htseq-count, featureCounts → Count Matrix → DGE with R: DESeq2, edgeR, limma:voom
  • Transcriptome route: FASTQ (+reference transcriptome index) → Pseudocounts with Kallisto, Sailfish, Salmon → DGE with Sleuth, or Count Matrix generated using tximport → DGE with R: DESeq2, edgeR, limma:voom

slide-3
SLIDE 3

A simple case of string matching

[Figure: chrX, positions 152139280 to 152139330. The genome sequence (CGCCGTCCCTCAGAATGGAAACCTCGCT TCTCTCTGCCCCACAATGCGCAAGTCAG) is shown above rows labeled by sample (CD133hi, CD133lo and Normal:HAH), plus DBTSS:human_MCF7 and CD133hi_Cage0805 tracks. Labels: Genome, Sequence reads.]

slide-4
SLIDE 4

A simple case of string matching?

[Figure: chrX, positions 152139280 to 152139330. The genome sequence (CGCCGTCCCTCAGAATGGAAACCTCGCT TCTCTCTGCCCCACAATGCGCAAGTCAG) is shown above rows labeled by sample (CD133hi, CD133lo and Normal:HAH), plus DBTSS:human_MCF7 and CD133hi_Cage0805 tracks and an additional flcDNA_all track with the sequence CGTCCCTCAGATTGGAAACCTCGCTT. Labels: Genome, Sequence reads.]
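
The question mark on this slide can be made concrete with the sequences shown on it. The sketch below is only an illustration: it assumes the two genome fragments printed on the slide are contiguous (the gap between them may be a layout artifact), and `find_with_mismatches` is a hypothetical helper, not something from the slides. Real aligners do not scan the reference this way.

```python
# Genome fragment copied from the slide; joining the two pieces is an assumption.
GENOME = "CGCCGTCCCTCAGAATGGAAACCTCGCT" + "TCTCTCTGCCCCACAATGCGCAAGTCAG"
# Sequence shown on the flcDNA_all track of the slide.
READ = "CGTCCCTCAGATTGGAAACCTCGCTT"


def find_with_mismatches(genome, read, max_mismatches):
    """Return (offset, mismatches) for every window within max_mismatches of read.

    Hypothetical helper: a brute-force scan comparing the read to each window.
    """
    hits = []
    for offset in range(len(genome) - len(read) + 1):
        window = genome[offset:offset + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if mismatches <= max_mismatches:
            hits.append((offset, mismatches))
    return hits


exact_offset = GENOME.find(READ)  # plain exact string matching
hits = find_with_mismatches(GENOME, READ, max_mismatches=1)
print("exact match offset:", exact_offset)
print("hits allowing one mismatch:", hits)
```

Exact matching returns -1 for this read, because it differs from the reference at a single base. Allowing one mismatch recovers a placement at offset 3 of the fragment.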

slide-5
SLIDE 5
Non-comprehensive list of challenges

  • Large, incomplete and repetitive genomes OR transcriptomes with overlapping transcripts (isoforms)
  • Short reads: 50-150 bp
  • Non-unique alignment
  • Sensitive to non-exact matching (variants, sequencing errors)
  • Massive number of short reads
  • Small insert size: 200-500 bp libraries
  • Compute capacity for efficient mapping

slide-6
SLIDE 6
Building an index

  • Having an index of the reference sequence provides an efficient way to search
  • Once index is built, it can be queried any number of times
  • Every genome or transcriptome build requires a new index for the specific tool in question.
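
The build-once, query-many idea can be sketched with a toy k-mer lookup table. This is a minimal illustration under stated assumptions, not how any particular aligner stores its index: the reference string, the value of `K`, and the helper names `build_kmer_index` and `query` are all made up for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def query(index, reference, read, k):
    """Use the read's first k-mer as a seed, then verify the full read exactly."""
    candidates = index.get(read[:k], [])
    return [pos for pos in candidates if reference.startswith(read, pos)]


REFERENCE = "ACGTACGTTAGCACGTTAGG"  # toy reference, not real data
K = 4
index = build_kmer_index(REFERENCE, K)  # built once for this reference

# The same index is reused for every read.
for read in ["ACGTTAGC", "TAGCACGT", "GGGGGGGG"]:
    print(read, query(index, REFERENCE, read, K))
```

The index is tied to the exact reference string it was built from, which is why a new genome or transcriptome build needs a new index.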

slide-7
SLIDE 7
Commonly used indexing methods

  • Hash-based (Salmon, Kallisto)
  • Suffix arrays (Salmon, STAR)
  • Burrows-Wheeler Transform (BWA, Bowtie2)
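
As a rough illustration of two of these methods, the sketch below builds a suffix array for a toy sequence, derives the Burrows-Wheeler Transform from it, and searches the suffix array by binary search. The construction is deliberately naive and the function names are hypothetical; production tools use far more memory- and time-efficient algorithms.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """BWT: the character preceding each sorted suffix (text must end with '$')."""
    return "".join(text[i - 1] for i in sa)


def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


TEXT = "ACGTACGT$"  # toy sequence with an end-of-text sentinel
sa = build_suffix_array(TEXT)
print("suffix array:", sa)
print("BWT:", bwt_from_suffix_array(TEXT, sa))
print("CGT found at:", find_occurrences(TEXT, sa, "CGT"))
```

Because the suffixes are sorted, every occurrence of a pattern sits in one contiguous block of the array, which is what makes the binary search possible.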

slide-8
SLIDE 8
Genome versions matter

  • Ensembl, UCSC and NCBI often use the same genome assemblies or builds (e.g. GRCh38 == hg38)
  • Make sure that the annotation file (GTF) is exactly matched with the genome file (fasta)
  • Same genome version
  • Same source (e.g. both from FlyBase)
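
One quick sanity check along these lines is to compare the sequence names in the FASTA headers with the first column of the GTF. The sketch below uses small in-memory stand-ins for the two files and hypothetical helper names. Matching names are necessary but not sufficient, since two assembly versions can share the same chromosome names.

```python
import io


def fasta_sequence_names(handle):
    """Sequence names from FASTA header lines (text after '>' up to first whitespace)."""
    return {line[1:].split()[0] for line in handle if line.startswith(">")}


def gtf_sequence_names(handle):
    """First column (seqname) of every non-empty, non-comment GTF line."""
    return {
        line.split("\t")[0]
        for line in handle
        if line.strip() and not line.startswith("#")
    }


def names_missing_from_fasta(fasta_handle, gtf_handle):
    """GTF seqnames absent from the FASTA; an empty set means the names agree."""
    return gtf_sequence_names(gtf_handle) - fasta_sequence_names(fasta_handle)


# Toy stand-ins: a FASTA with 'chr'-prefixed names and a GTF without the prefix.
fasta_text = ">chr1 toy sequence\nACGT\n>chrX toy sequence\nACGT\n"
gtf_text = (
    "#!toy annotation\n"
    '1\tsrc\tgene\t1\t4\t.\t+\t.\tgene_id "g1";\n'
    'X\tsrc\tgene\t1\t4\t.\t-\t.\tgene_id "g2";\n'
)

missing = names_missing_from_fasta(io.StringIO(fasta_text), io.StringIO(gtf_text))
print("GTF sequence names not found in FASTA:", sorted(missing))
```

A non-empty result, as in this toy case, suggests the genome and annotation come from different sources or naming conventions.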

slide-9
SLIDE 9

✓ Genome FASTA ✓ GTF (annotation)

  • Sequence reads (FASTQ) → Quality control (FASTQ) → Alignment to Genome: HISAT2, STAR (+reference genome index; known GTF, optional) → multiple BAMs (+known GTF) → Count reads associated with genes: htseq-count, featureCounts → Count Matrix → DGE with R: DESeq2, edgeR, limma:voom

slide-10
SLIDE 10

Alignment to genome

  • Is it important that the genome index is created with awareness of known splice junctions?
  • Don’t use default parameters; read the manual and ask questions about parameters
  • Parameter sweeps may be needed if you are working on a non-model organism
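
As a hedged sketch of the first and last points, the snippet below assembles (but does not run) a STAR index-building command that passes a GTF so known splice junctions are included, and enumerates a small parameter grid. The file paths, grid values and helper names are placeholders chosen for illustration. The flags are STAR options, but the STAR manual should be consulted before relying on any of them.

```python
import itertools


def star_index_command(genome_dir, fasta, gtf, read_length, threads=4):
    """Assemble a STAR genomeGenerate command that includes known splice junctions."""
    return [
        "STAR", "--runMode", "genomeGenerate",
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
        "--sjdbGTFfile", gtf,                    # annotation supplying known junctions
        "--sjdbOverhang", str(read_length - 1),  # STAR manual suggests read length - 1
        "--runThreadN", str(threads),
    ]


def parameter_sweep(grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Placeholder paths; nothing is executed here.
cmd = star_index_command("star_index/", "genome.fa", "annotation.gtf", read_length=100)
print(" ".join(cmd))

# Example grid for the alignment step; the values are arbitrary illustrations.
grid = {"--outFilterMismatchNmax": [2, 5, 10], "--outFilterMultimapNmax": [1, 10]}
settings = list(parameter_sweep(grid))
for setting in settings:
    print(setting)
```

Each setting would then be run on the same subset of reads and compared using alignment QC metrics.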

slide-11
SLIDE 11

BAM alignment files

  • Binary version of SAM alignment format files
  • Recommended over SAM files for saving alignments
  • Contain information on a per-read basis:
      - Coordinates of alignment, including strand
      - Mismatches
      - Mapping information (unique?, properly paired?, etc.)
      - Quality of mapping (tool-specific scoring systems)

More information about SAM/BAM
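
To show where this per-read information lives, the sketch below parses one SAM-format text line (the text equivalent of a BAM record). The record itself is invented for illustration and `parse_sam_line` is a hypothetical helper. The column order, FLAG bits and the NM and NH tags follow the SAM specification, though not every aligner emits every optional tag.

```python
FLAG_BITS = {
    0x1: "paired",
    0x2: "properly_paired",
    0x4: "unmapped",
    0x10: "reverse_strand",
    0x100: "secondary",
}


def parse_sam_line(line):
    """Extract per-read alignment information from one SAM record."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    # Optional fields look like TAG:TYPE:VALUE, e.g. NM:i:1
    tags = {}
    for item in fields[11:]:
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return {
        "read": fields[0],
        "chrom": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost position
        "strand": "-" if flag & 0x10 else "+",
        "mapq": int(fields[4]),  # scale is tool-specific
        "cigar": fields[5],
        "flags": [name for bit, name in FLAG_BITS.items() if flag & bit],
        "edit_distance": int(tags["NM"]) if "NM" in tags else None,
        "num_hits": int(tags["NH"]) if "NH" in tags else None,
    }


# Invented record: MAPQ 255 is what STAR, for example, reports for unique alignments.
example = "\t".join([
    "read1", "99", "chrX", "152139283", "255", "26M", "=", "152139500", "243",
    "CGTCCCTCAGATTGGAAACCTCGCTT", "I" * 26, "NM:i:1", "NH:i:1",
])
record = parse_sam_line(example)
for key, value in record.items():
    print(key, value)
```

In practice a library such as pysam or the samtools command line would be used to read BAM files directly.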

slide-12
SLIDE 12

QC on BAM files

Evaluating the quality of the aligned data can give important information about the quality of the library:

  • Total % of reads aligning to the genome? % of uniquely mapping reads? % of properly paired PE reads?
  • Genomic origin of reads (exonic, intronic, intergenic)
  • Quantity of rRNA
  • Transcript coverage and 5'-3' bias

Samples should have fairly consistent percentages.
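
A minimal sketch of the consistency check, assuming per-sample read counts have already been collected (for example from aligner log files). The sample names, counts and the 10-point threshold are invented for illustration, and `flag_outliers` is a hypothetical helper.

```python
from statistics import median


def percentages(counts):
    """Convert raw read counts for one sample into alignment percentages."""
    total = counts["total"]
    return {
        "pct_mapped": 100.0 * counts["mapped"] / total,
        "pct_unique": 100.0 * counts["unique"] / total,
        "pct_proper_pair": 100.0 * counts["proper_pair"] / total,
    }


def flag_outliers(samples, metric, max_deviation=10.0):
    """Samples whose metric differs from the median by more than max_deviation points."""
    values = {name: percentages(c)[metric] for name, c in samples.items()}
    centre = median(values.values())
    return sorted(name for name, v in values.items() if abs(v - centre) > max_deviation)


# Invented counts for three samples.
SAMPLES = {
    "sampleA": {"total": 1_000_000, "mapped": 950_000, "unique": 880_000, "proper_pair": 900_000},
    "sampleB": {"total": 1_000_000, "mapped": 940_000, "unique": 870_000, "proper_pair": 890_000},
    "sampleC": {"total": 1_000_000, "mapped": 700_000, "unique": 550_000, "proper_pair": 600_000},
}

outliers = flag_outliers(SAMPLES, "pct_unique")
print("samples with inconsistent % uniquely mapped:", outliers)
```

The threshold is a judgment call; the point is only that a sample far from the others deserves a closer look before counting.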

slide-13
SLIDE 13

QC on BAM files

Gather QC metrics using:

  • Log files from alignment run
  • Qualimap
  • RNA-SeQC (paper)

More information about alignment QC

slide-14
SLIDE 14
Quantification from BAM files

  • htseq-count
  • featureCounts

slide-15
SLIDE 15
Quantification from BAM files

  • htseq-count and featureCounts
      - Strandedness
      - Stringency
  • Results in a gene-level counts matrix (raw)
  • Output ready for DGE analysis using tools like DESeq2 or edgeR
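
The effect of strandedness and stringency on gene-level counts can be illustrated with a toy counter. This is a simplified sketch, not the algorithm of htseq-count or featureCounts: genes are single intervals rather than sets of exons, the annotation and reads are invented, and only the `__no_feature` and `__ambiguous` category names are borrowed from htseq-count's output.

```python
from collections import Counter

# Invented annotation: gene -> (chrom, start, end, strand), 1-based inclusive.
GENES = {
    "geneA": ("chr1", 100, 500, "+"),
    "geneB": ("chr1", 450, 900, "-"),
}

# Invented aligned reads: (chrom, start, end, strand).
READS = [
    ("chr1", 120, 170, "+"),
    ("chr1", 460, 510, "+"),   # falls in the region where geneA and geneB overlap
    ("chr1", 600, 650, "-"),
    ("chr1", 1000, 1050, "+"),
]


def assign_read(read, genes, stranded=True):
    """Assign a read to the single gene it overlaps, or to a special category."""
    chrom, start, end, strand = read
    overlapping = [
        name
        for name, (g_chrom, g_start, g_end, g_strand) in genes.items()
        if g_chrom == chrom and start <= g_end and end >= g_start
        and (not stranded or strand == g_strand)
    ]
    if not overlapping:
        return "__no_feature"
    if len(overlapping) > 1:
        return "__ambiguous"  # stringent choice: do not guess between genes
    return overlapping[0]


def count_reads(reads, genes, stranded=True):
    """Raw counts per gene (plus special categories) for one sample."""
    return Counter(assign_read(r, genes, stranded) for r in reads)


stranded_counts = count_reads(READS, GENES, stranded=True)
unstranded_counts = count_reads(READS, GENES, stranded=False)
print("stranded:  ", dict(stranded_counts))
print("unstranded:", dict(unstranded_counts))
```

With strand information the second read is assigned to geneA. Without it the read is ambiguous and dropped, so the wrong strandedness setting changes the counts.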

slide-16
SLIDE 16

✓ Transcriptome FASTA

  • Sequence reads (FASTQ) → Quality control (FASTQ) → Pseudocounts with Kallisto, Sailfish, Salmon (+reference transcriptome index) → DGE with Sleuth, or Count Matrix generated using tximport → DGE with R: DESeq2, edgeR, limma:voom

slide-17
SLIDE 17

More efficient quantification approaches

  • Approaches that avoid base-to-base alignment
  • Kallisto (pseudo-aligner), Sailfish (k-mer-based), Salmon (quasi-aligner), RSEM
  • Faster, more efficient (more than ~20x faster than alignment-based approaches)
  • Improved accuracy for transcript-level quantification
  • Improvements in accuracy for gene-level quantification**

**doi: 10.12688/f1000research.7563.2
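
The idea of skipping base-to-base alignment can be sketched as a k-mer compatibility lookup: rather than placing a read at exact coordinates, find the set of transcripts that contain all of its k-mers. This is a simplified illustration with invented sequences and hypothetical function names. The real tools use more elaborate indexes and follow this step with statistical estimation of abundances.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of all k-mers in the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Two invented isoforms sharing a first exon and differing in the second.
SHARED_EXON = "ACGTTGCA"
TRANSCRIPTS = {
    "tx1": SHARED_EXON + "GGATCCTA",
    "tx2": SHARED_EXON + "TTCAGGCA",
}
K = 5
index = build_index(TRANSCRIPTS, K)

for read in ["CGTTGC", "GCAGGATC", "TTTTTTTT"]:
    print(read, sorted(compatible_transcripts(read, index, K)))
```

A read from the shared exon is compatible with both isoforms, while a read spanning the exon junction is compatible with only one. That distinction is the information transcript-level quantification builds on.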

slide-18
SLIDE 18

More efficient quantification approaches

  • Results in a matrix of abundance estimates (not raw) at the isoform-level
  • Abundance estimates can be used for differential isoform expression using sleuth (designed for Kallisto output)
  • Gene-level counts can be calculated using tximport
      - ready for DGE analysis using tools like DESeq2 or edgeR
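
The core of going from isoform-level estimates to gene-level counts is a grouped sum over a transcript-to-gene table. The sketch below shows only that summation, with invented numbers and a hypothetical `summarize_to_gene` helper. It is not tximport's implementation: tximport is an R package and additionally accounts for transcript length, which this sketch ignores.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum estimated counts over all transcripts belonging to each gene."""
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene_counts[tx2gene[transcript]] += count
    return dict(gene_counts)


# Invented estimated counts for one sample (non-integer, since they are estimates).
TRANSCRIPT_COUNTS = {"tx1": 120.5, "tx2": 30.25, "tx3": 10.0}
# Invented transcript-to-gene mapping, normally derived from the annotation.
TX2GENE = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarize_to_gene(TRANSCRIPT_COUNTS, TX2GENE)
print(gene_counts)
```
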
slide-19
SLIDE 19

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.