Quantifying gene expression

Inputs: ✓ Genome, ✓ GTF (annotation)?, sequence reads
• Genome-based route: FASTQ -> Quality control -> FASTQ (+reference genome index; known GTF, optional) -> Alignment to genome: HISAT2, STAR -> multiple BAMs (+known GTF) -> Count reads associated with genes: htseq-count, featureCounts -> Count matrix -> DGE with R: DESeq2, edgeR, limma:voom
• Transcriptome-based route: FASTQ -> Quality control -> FASTQ (+reference transcriptome index) -> Pseudocounts with Kallisto, Sailfish, Salmon -> Count matrix generated using tximport -> DGE with R: DESeq2, edgeR, limma:voom; or DGE with Sleuth

Genome (chrX:152139280-152139330): CGCCGTCCCTCAGAATGGAAACCTCGCTTCTCTCTGCCCCACAATGCGCAAGTCAG
[Figure: genome browser tracks for CD133hi and CD133lo LM-Mel samples (LM-Mel-14, LM-Mel-34, LM-Mel-42), Normal:HAH, DBTSS:human_MCF7 and CD133hi_Cage0805]
Sequence read: CGCCGTCCCTCAGAATGGAAACCTCGCTTCT
A simple case of string matching

Genome (chrX:152139280-152139330): CGCCGTCCCTCAGAATGGAAACCTCGCTTCTCTCTGCCCCACAATGCGCAAGTCAG
[Figure: genome browser tracks for CD133hi and CD133lo LM-Mel samples (LM-Mel-14, LM-Mel-34, LM-Mel-42), Normal:HAH, DBTSS:human_MCF7, CD133hi_Cage0805 and flcDNA_all]
Sequence read: CGTCCCTCAGATTGGAAACCTCGCTT (one base differs from the genome: T in place of A)
A simple case of string matching?
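The two slides above can be illustrated with a minimal Python sketch. It is a naive scan, not how any real aligner works, and the function name `find_matches` and the `max_mismatches` parameter are made up for this illustration. The sequences are the ones shown on the slides. The first read is found by exact matching; the second is only found once a mismatch is tolerated.

```python
def find_matches(genome, read, max_mismatches=0):
    """Slide the read along the genome and report (offset, mismatches)
    for every position where the mismatch count is within the budget."""
    hits = []
    for start in range(len(genome) - len(read) + 1):
        window = genome[start:start + len(read)]
        mismatches = sum(1 for g, r in zip(window, read) if g != r)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Sequences copied from the slides (chrX:152139280-152139330).
GENOME = "CGCCGTCCCTCAGAATGGAAACCTCGCTTCTCTCTGCCCCACAATGCGCAAGTCAG"
EXACT_READ = "CGCCGTCCCTCAGAATGGAAACCTCGCTTCT"
VARIANT_READ = "CGTCCCTCAGATTGGAAACCTCGCTT"

print(find_matches(GENOME, EXACT_READ))                      # exact search succeeds
print(find_matches(GENOME, VARIANT_READ))                    # exact search finds nothing
print(find_matches(GENOME, VARIANT_READ, max_mismatches=1))  # tolerant search succeeds
```

This scan compares the read against every position, which is far too slow for a whole genome and millions of reads. That cost is one reason aligners build an index first.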

Non-comprehensive list of challenges
• Large, incomplete and repetitive genomes OR transcriptomes with overlapping transcripts (isoforms)
• Short reads: 50-150 bp
  - Non-unique alignment
  - Sensitive to non-exact matching (variants, sequencing errors)
• Massive number of short reads
• Small insert size: 200-500 bp libraries
• Compute capacity for efficient mapping

Building an index
• Having an index of the reference sequence provides an efficient way to search
• Once the index is built, it can be queried any number of times
• Every genome or transcriptome build requires a new index for the specific tool in question

Commonly used indexing methods
• Hash-based (Salmon, Kallisto)
• Suffix arrays (Salmon, STAR)
• Burrows-Wheeler Transform (BWA, Bowtie2)
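To make the hash-based idea concrete, here is a toy Python sketch of a k-mer index. It is only an illustration of the general principle (build once, query many times) and does not reproduce how Salmon, Kallisto or any other tool implements its index. The function names and the choice of k = 8 are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_offsets(index, read, k):
    """Look up each k-mer of the read and vote for the reference offset
    at which the read would have to start for that k-mer to match."""
    votes = Counter()
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            votes[pos - i] += 1
    return votes


reference = "CGCCGTCCCTCAGAATGGAAACCTCGCTTCTCTCTGCCCCACAATGCGCAAGTCAG"
read = "CGTCCCTCAGATTGGAAACCTCGCTT"  # carries one mismatch relative to the reference

index = build_kmer_index(reference, k=8)  # built once
votes = candidate_offsets(index, read, k=8)  # can be queried for any number of reads
print(votes.most_common(1))
```

K-mers that overlap the mismatched base find no hit, but the remaining k-mers still agree on the same starting offset, so the read can be placed without scanning the whole reference.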

Genome versions matter
• Ensembl, UCSC and NCBI often distribute the same genome assemblies (builds) under different names (e.g. GRCh38 == hg38)
• Make sure that the annotation file (GTF) exactly matches the genome file (FASTA)
  - Same genome version
  - Same source (e.g. both from FlyBase)
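One quick sanity check for a mismatched genome and annotation is to compare sequence names, since sources differ in naming (for example, UCSC uses "chr1" where Ensembl uses "1"). The sketch below is a minimal Python illustration with hypothetical function names and toy inline data. Matching names do not prove that the assembly versions agree; this only catches the most obvious mismatch.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names from FASTA header lines (text after '>' up to the first space)."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gtf_sequence_names(gtf_lines):
    """Collect the first column (seqname) of every non-comment GTF line."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t")[0])
    return names


def unmatched_gtf_names(fasta_lines, gtf_lines):
    """Return GTF sequence names that are absent from the FASTA file."""
    return sorted(gtf_sequence_names(gtf_lines) - fasta_sequence_names(fasta_lines))


# Toy data: a UCSC-style FASTA paired with an Ensembl-style GTF.
fasta = [">chr1 toy sequence", "ACGTACGT", ">chrX", "ACGTACGT"]
gtf = [
    "#!genome-build GRCh38",
    '1\tensembl\tgene\t1\t8\t.\t+\t.\tgene_id "g1";',
    'X\tensembl\tgene\t1\t8\t.\t-\t.\tgene_id "g2";',
]
print(unmatched_gtf_names(fasta, gtf))
```

With real files, the same functions can be fed open file handles, for example `unmatched_gtf_names(open(fasta_path), open(gtf_path))`, where both paths are placeholders.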

Inputs: ✓ Genome FASTA, ✓ GTF (annotation), sequence reads
FASTQ -> Quality control -> FASTQ (+reference genome index; known GTF, optional) -> Alignment to genome: HISAT2, STAR -> multiple BAMs (+known GTF) -> Count reads associated with genes: htseq-count, featureCounts -> Count matrix -> DGE with R: DESeq2, edgeR, limma:voom

Alignment to genome
• Is it important that the genome index is created with awareness of known splice junctions?
• Don’t just accept default parameters; read the manual and ask questions about parameters
• Parameter sweeps may be needed if you are working on a non-model organism

BAM alignment files
• Binary version of SAM alignment format files
• Recommended over SAM files for saving alignments
• Contain information on a per-read basis:
  -- Coordinates of alignment, including strand
  -- Mismatches
  -- Mapping information (unique?, properly paired?, etc.)
  -- Quality of mapping (tool-specific scoring systems)
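The per-read information listed above can be seen by decoding a single SAM record (the text form of a BAM record). This is a minimal Python sketch that parses one tab-separated line by hand. The FLAG bit values and the NM tag (edit distance) come from the SAM format; the function name and the example record are made up for illustration. In practice a library such as pysam or the samtools command line would be used instead.

```python
def parse_sam_line(line):
    """Decode the mandatory SAM columns plus the optional NM tag of one alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])

    # Optional fields have the form TAG:TYPE:VALUE and start at column 12.
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value

    return {
        "read_name": fields[0],
        "reference": fields[2],
        "position": int(fields[3]),            # 1-based leftmost coordinate
        "mapq": int(fields[4]),                # mapping quality, tool-specific scale
        "cigar": fields[5],
        "is_paired": bool(flag & 0x1),
        "is_proper_pair": bool(flag & 0x2),
        "is_unmapped": bool(flag & 0x4),
        "strand": "-" if flag & 0x10 else "+",
        "is_secondary": bool(flag & 0x100),
        "edit_distance": int(tags["NM"]) if "NM" in tags else None,
    }


# Hypothetical record, invented for this example.
example = "\t".join([
    "read1", "99", "chrX", "152139283", "60", "26M", "=", "152139500", "243",
    "CGTCCCTCAGATTGGAAACCTCGCTT", "I" * 26, "NM:i:1",
])
record = parse_sam_line(example)
print(record["strand"], record["is_proper_pair"], record["edit_distance"])
```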

QC on BAM files
Evaluating the quality of the aligned data can give important information about the quality of the library:
-- Total % of reads aligning to the genome? % of uniquely mapping reads? % of properly paired PE reads?
-- Genomic origin of reads (exonic, intronic, intergenic)
-- Quantity of rRNA
-- Transcript coverage and 5'-3' bias
Samples should have fairly consistent percentages.
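Two of the percentages above can be derived directly from SAM FLAG values. The following Python sketch assumes the flags have already been extracted from the alignment file (for instance, column 2 of each SAM line); the function name is hypothetical. The percentage of uniquely mapping reads is left out on purpose, because how uniqueness is encoded differs between aligners.

```python
def alignment_summary(flags):
    """Summarize mapping and proper-pairing rates from a list of SAM FLAG integers."""
    # Ignore secondary (0x100) and supplementary (0x800) records so each read counts once.
    primary = [f for f in flags if not f & 0x900]
    total = len(primary)
    if total == 0:
        return {"total_reads": 0, "pct_mapped": 0.0, "pct_properly_paired": 0.0}

    mapped = sum(1 for f in primary if not f & 0x4)
    paired = sum(1 for f in primary if f & 0x1)
    proper = sum(1 for f in primary if f & 0x1 and f & 0x2 and not f & 0x4)

    return {
        "total_reads": total,
        "pct_mapped": 100.0 * mapped / total,
        "pct_properly_paired": 100.0 * proper / paired if paired else 0.0,
    }


# Toy input: one properly paired pair, one unmapped pair, one secondary alignment.
summary = alignment_summary([99, 147, 77, 141, 355])
print(summary)
```

Running the same summary on every sample and comparing the numbers side by side is a simple way to spot a sample whose percentages are out of line with the rest.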

QC on BAM files
Gather QC metrics using:
• Log files from alignment run
• Qualimap
• RNA-SeQC

Quantification from BAM files
• htseq-count
• featureCounts

Quantification from BAM files
• htseq-count and featureCounts
  -- Strandedness
  -- Stringency
• Results in a gene-level counts matrix (raw)
• Output ready for DGE analysis using tools like DESeq2 or edgeR
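The effect of strandedness and stringency on gene-level counts can be sketched in a few lines of Python. This is a deliberately simplified model, not the actual algorithm of htseq-count or featureCounts: genes are single intervals rather than sets of exons, and a read overlapping more than one gene is simply set aside. The `__no_feature` and `__ambiguous` labels are borrowed from htseq-count's output; everything else (function name, data layout) is an assumption made for this example.

```python
from collections import Counter


def count_reads(reads, genes, stranded=True):
    """Assign each read to the single gene it overlaps.

    reads: iterable of (chrom, start, end, strand), inclusive coordinates
    genes: dict of gene_id -> (chrom, start, end, strand)
    stranded: if True, a read only counts for a gene on the same strand
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end, strand in reads:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end, g_strand) in genes.items()
            if g_chrom == chrom
            and start <= g_end
            and end >= g_start
            and (not stranded or g_strand == strand)
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1  # strict choice: do not guess between genes
    return counts


# Toy annotation and reads, invented for illustration.
genes = {
    "geneA": ("chrX", 100, 200, "+"),
    "geneB": ("chrX", 180, 300, "+"),
    "geneC": ("chrX", 400, 500, "-"),
}
reads = [
    ("chrX", 110, 150, "+"),  # inside geneA only
    ("chrX", 185, 195, "+"),  # overlaps geneA and geneB
    ("chrX", 410, 450, "+"),  # inside geneC but on the opposite strand
    ("chrX", 600, 650, "+"),  # intergenic
]
print(count_reads(reads, genes, stranded=True))
print(count_reads(reads, genes, stranded=False))
```

The third read is counted for geneC only when strand is ignored, which shows why the strandedness setting has to match the library preparation.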

Inputs: ✓ Transcriptome FASTA, sequence reads
FASTQ -> Quality control -> FASTQ (+reference transcriptome index) -> Pseudocounts with Kallisto, Sailfish, Salmon -> Count matrix generated using tximport -> DGE with R: DESeq2, edgeR, limma:voom
Alternative: Pseudocounts -> DGE with Sleuth

More efficient quantification approaches
• Approaches that avoid base-to-base alignment
• Kallisto (pseudo-aligner), Sailfish (k-mer-based), Salmon (quasi-aligner), RSEM
• Faster and more efficient (roughly 20x or more faster than alignment-based approaches)
• Improved accuracy for transcript-level quantification
• Improvements in accuracy for gene-level quantification**
**doi: 10.12688/f1000research.7563.2

More efficient quantification approaches
• Results in a matrix of abundance estimates (not raw) at the isoform level
• Abundance estimates can be used for differential isoform expression using sleuth (designed for Kallisto output)
• Gene-level counts can be calculated using tximport
  -- ready for DGE analysis using tools like DESeq2 or edgeR
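The core idea of going from isoform-level estimates to gene-level values is a sum over each gene's transcripts. The Python sketch below shows only that step, with hypothetical transcript and gene identifiers. tximport itself is an R package and does more than this (for example, it can account for transcript length), so this is a conceptual illustration rather than a replacement.

```python
from collections import defaultdict


def summarize_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level totals.

    transcript_counts: dict of transcript_id -> estimated count (may be fractional)
    tx2gene: dict of transcript_id -> gene_id
    A transcript missing from tx2gene raises KeyError, which flags an
    annotation that does not match the quantification index.
    """
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_counts[tx2gene[transcript_id]] += count
    return dict(gene_counts)


# Hypothetical identifiers and values, invented for illustration.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = {"tx1": 10.5, "tx2": 4.5, "tx3": 7.0}

gene_counts = summarize_to_gene(transcript_counts, tx2gene)
print(gene_counts)
```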

These materials have been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC). These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
