Characterizing transcriptomes using NGS data

SLIDE 1

Characterizing transcriptomes using NGS data

T. Källman
BILS/Scilife Lab/Uppsala University
Feb. 2015

20150212 1/33

SLIDE 2

Outline

1 The transcriptome
2 RNA sequence technologies
3 RNA-seq analysis
  • Mapping based approach
  • Tools for working with NGS alignments
  • Gene expression from RNA-seq
  • De novo assembly

SLIDE 3

The transcriptome

The Central Dogma

[Figure: the central dogma. DNA (promoter region with TATA box, 5' untranslated region, ATG start codon, exons and introns, stop codons UGA/UAA/UAG) → transcription and mRNA processing → mRNA (5' cap, 3' poly-A tail) → translation (starting with methionine) → protein → post-translational modification (labels PO4, S-S) → active protein]

SLIDE 4

The transcriptome

A more complex view

SLIDE 5

The transcriptome

Transcriptomes vs genomes

  • Dynamic: not the same across tissues and time points
  • Smaller sequence space
  • Less repetitive (but large gene families can be found)
  • Fairly stable in size? (e.g. 2-4 fold variation among eukaryotes, whereas genome size can vary 1000-fold)
  • Genes are often expressed as multiple splice variants
  • RNA is often transcribed from only one strand

SLIDE 6

RNA sequence technologies

NGS data

SLIDE 7

RNA sequence technologies

Machine output

SLIDE 8

RNA sequence technologies

Machine output

SLIDE 9

RNA sequence technologies

Sequence quality

Phred quality scores: Q = -10 × log10(P), where P is the probability that the base call is wrong (high Q = high probability of the base being correct).
A Phred quality score of 20 means that the base is called incorrectly 1 time out of 100.
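The relationship between Q and P can be checked with a few lines of Python. This is a minimal sketch; the function names are made up for illustration, and the ASCII offset of 33 used to decode quality strings is an assumption (it matches Sanger and recent Illumina output, while some older Illumina pipelines used an offset of 64).

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred score Q to the probability P that the base call is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual, offset=33):
    """Decode a FASTQ quality string into Phred scores (offset is an assumption)."""
    return [ord(char) - offset for char in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(q, phred_to_error_prob(q))
    print(decode_quality("I5"))
```

With an offset of 33, the character `I` decodes to Q40 and `5` to Q20.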

SLIDE 10

RNA sequence technologies

Paired-end (PE) sequencing

SLIDE 11

RNA sequence technologies

Paired-end reads

File format
  • Two files are created, one per read end
  • The reads appear in the same order in both files, and read names are identical except for the suffix marking the end (/1 or /2)
  • Read-naming conventions have changed over time, so the exact read names depend on the software version

@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
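Because downstream tools rely on both files listing the mates in the same order, it can be worth verifying this before mapping. The sketch below reads two FASTQ streams in parallel and compares read names after stripping the `/1` and `/2` suffixes. It assumes exactly four lines per record and the older `/1`, `/2` naming shown above; other naming schemes would need a different `pair_id`. All function names are illustrative.

```python
import io
from itertools import zip_longest

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is skipped
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual

def pair_id(name):
    """Strip a trailing /1 or /2 so that both mates share one identifier."""
    return name[:-2] if name.endswith(("/1", "/2")) else name

def pairs_in_sync(handle1, handle2):
    """Return True if both streams hold the same reads in the same order."""
    for rec1, rec2 in zip_longest(read_fastq(handle1), read_fastq(handle2)):
        if rec1 is None or rec2 is None:
            return False  # one file has more records than the other
        if pair_id(rec1[0]) != pair_id(rec2[0]):
            return False
    return True

# The two records from the slide, one per file.
R1_TEXT = "\n".join([
    "@61DFRAAXX100204:1:100:10494:3070/1",
    "AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT",
    "+",
    "ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA",
]) + "\n"
R2_TEXT = "\n".join([
    "@61DFRAAXX100204:1:100:10494:3070/2",
    "ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA",
    "+",
    "_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad",
]) + "\n"

if __name__ == "__main__":
    print(pairs_in_sync(io.StringIO(R1_TEXT), io.StringIO(R2_TEXT)))
```

For real data the `io.StringIO` objects would be replaced by open file handles.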

SLIDE 12

RNA sequence technologies

Paired-end data

SLIDE 13

RNA sequence technologies

Stranded or not

SLIDE 14

RNA-seq analysis

Two main routes for analysis

Haas & Zody (2010), Nature Biotechnology 28, 421–423

SLIDE 15

RNA-seq analysis Mapping based approach

Aligning short reads from RNA to genomes

  • Large number of programs available: STAR, TopHat, Subread, etc.
  • Important feature: allow for spliced mapping
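In SAM/BAM output, a spliced alignment is recorded with the CIGAR operation `N` (a region skipped on the reference), so the introns implied by a read can be recovered from its mapping position and CIGAR string. The sketch below illustrates that bookkeeping only; it is not how any particular aligner works internally, and the function name is hypothetical.

```python
import re

# One CIGAR element: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive (start, end) reference intervals skipped by N operations.

    pos is the 1-based leftmost mapping position of the read (SAM POS field).
    """
    introns = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return introns

if __name__ == "__main__":
    # 30 aligned bases, a 500 bp skipped region, then 20 aligned bases.
    print(introns_from_cigar(100, "30M500N20M"))
```

A read with CIGAR `50M` yields an empty list, i.e. it maps without crossing a splice junction.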

SLIDE 16

RNA-seq analysis Mapping based approach

Example workflow

  • Tophat: aligns reads to the genome (allows for spliced read mapping)
  • Cufflinks: extracts transcripts from spliced read alignments
  • Cuffmerge: merges the results from multiple Cufflinks runs

Trapnell et al. (2012), Nature Protocols 7, 562–578

SLIDE 17

RNA-seq analysis Mapping based approach

Tophat

1 Efficient and fast alignment to the genome using bowtie2
2 Create a database of putative splice junctions from the reads mapped in step 1
3 Map the reads that did not map in step 1 using the splice information

SLIDE 18

RNA-seq analysis Mapping based approach

Cufflinks

SLIDE 19

RNA-seq analysis Tools for working with ngs alignments

Samtools

  • Program to work with NGS alignment files (SAM, BAM, CRAM)
  • Can be used to view data, calculate basic statistics, extract subsets of alignments and convert between file formats

http://www.htslib.org
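To give a feel for the kind of basic statistics such tools report, the sketch below tallies a few properties from the FLAG field of SAM text records. It is a toy illustration written against the SAM format itself, not a reimplementation of samtools; the function name and the example records are made up.

```python
from collections import Counter

# Bit values of the SAM FLAG field used below.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100

def summarize_sam(lines):
    """Count total, mapped, unmapped, paired, reverse-strand and secondary records."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):  # header lines carry no alignments
            continue
        flag = int(line.rstrip("\n").split("\t")[1])
        counts["total"] += 1
        if flag & FLAG_UNMAPPED:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
            if flag & FLAG_REVERSE:
                counts["reverse_strand"] += 1
        if flag & FLAG_PAIRED:
            counts["paired"] += 1
        if flag & FLAG_SECONDARY:
            counts["secondary"] += 1
    return counts

# Hypothetical records: a mapped pair, an unmapped read and a reverse-strand single read.
EXAMPLE_SAM = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tIIII",
    "r1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

if __name__ == "__main__":
    print(dict(summarize_sam(EXAMPLE_SAM)))
```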

SLIDE 20

RNA-seq analysis Tools for working with ngs alignments

Picard

  • A set of Java command line tools with the same (or similar) functionality as samtools
  • Note that even though they largely aim to perform similar functions, Picard and Samtools do not always generate compatible file formats

http://broadinstitute.github.io/picard/

SLIDE 21

RNA-seq analysis Tools for working with ngs alignments

Samtools tview, a text-based alignment viewer

$ samtools tview alignment.bam target.fasta

SLIDE 22

RNA-seq analysis Tools for working with ngs alignments

IGV: Integrative Genomics Viewer

SLIDE 23

RNA-seq analysis Tools for working with ngs alignments

IGV: Integrative Genomics Viewer

SLIDE 24

RNA-seq analysis Gene expression from RNA-seq

From counts to gene expression

SLIDE 25

RNA-seq analysis Gene expression from RNA-seq

From counts to gene expression

SLIDE 26

RNA-seq analysis Gene expression from RNA-seq

Not all reads are the same

from: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

SLIDE 27

RNA-seq analysis Gene expression from RNA-seq

Normalized expression Values

Transcript-mapped read counts are normalized for both the length of the transcript and the total sequencing depth. Count data are hence converted to reads (or fragments) per kilobase of transcript length per million mapped reads (RPKM or FPKM).
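Written out, the normalization is RPKM = count × 10^9 / (transcript length in bp × total mapped reads). A minimal sketch follows; the function name and the example numbers are hypothetical, and for paired-end data the same formula applied to fragment counts gives FPKM.

```python
def rpkm(count, length_bp, total_mapped):
    """Reads per kilobase of transcript per million mapped reads.

    count: reads (or fragments, for FPKM) assigned to the transcript
    length_bp: transcript length in base pairs
    total_mapped: total number of mapped reads (or fragments) in the sample
    """
    return count * 1e9 / (length_bp * total_mapped)

if __name__ == "__main__":
    # Hypothetical transcript: 500 reads, 2000 bp long, 20 million mapped reads.
    print(rpkm(500, 2000, 20_000_000))  # prints 12.5
```

Doubling the transcript length or doubling the sequencing depth, at the same read count, each halves the value.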

SLIDE 28

RNA-seq analysis Gene expression from RNA-seq

Experimental design

SLIDE 29

RNA-seq analysis Gene expression from RNA-seq

Experimental design

  • Count reads (convert to RPKM/FPKM?)
  • Genes with a small number of reads (= low RPKM/FPKM values) are often non-significant
  • Remember that fold change is not the same as significance

Gene     Condition 1   Condition 2   Fold change   Significant?
Gene A   1             2             2-fold        No
Gene B   100           200           2-fold        Yes
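Why the same 2-fold change can be non-significant for Gene A but significant for Gene B can be illustrated with a simple exact test. The sketch below assumes one sample per condition with equal sequencing depth, so that under the null hypothesis each read is equally likely to come from either condition (a binomial with p = 0.5). This is a deliberate simplification that ignores biological variation between replicates and is not the method used by dedicated differential-expression tools.

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Two-sided exact binomial p-value for k successes in n trials with p = 0.5."""
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * min(lower, upper))

def compare(count1, count2):
    """p-value for observing count1 vs count2 reads if expression were equal."""
    return binomial_two_sided_p(count1, count1 + count2)

if __name__ == "__main__":
    print("Gene A (1 vs 2):    ", compare(1, 2))
    print("Gene B (100 vs 200):", compare(100, 200))
```

With 1 versus 2 reads the p-value is 1.0, while 100 versus 200 reads gives a p-value many orders of magnitude below 0.05, although the fold change is identical.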

SLIDE 30

RNA-seq analysis de-novo assembly

Major challenges in relation to genome assembly

  • Genes are expressed at different levels, hence coverage is uneven among genes
  • Many genes are expressed as several isoforms
  • As sequencing depth increases, the number of detected loci increases (what is actually expressed?)
  • Sequence errors from highly expressed genes might be seen more often than "true" sequences from lowly expressed genes

SLIDE 31

RNA-seq analysis de-novo assembly

Several programs available

  • SOAPdenovo-Trans
  • Oases
  • Trans-ABySS
  • Trinity

All of them use de Bruijn graphs to cope with the data, and many of them have been developed from a genome assembly program.
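The core idea of a de Bruijn graph can be shown on a toy scale: every k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and overlapping reads end up sharing nodes. The sketch below is only an illustration with made-up reads and function names; real assemblers add error correction, coverage handling and isoform resolution on top of this.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while node in graph and len(graph[node]) == 1:
        node = next(iter(graph[node]))
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

if __name__ == "__main__":
    reads = ["ATGGCG", "GGCGTA"]  # two overlapping toy reads
    graph = de_bruijn(reads, k=4)
    print(walk_unbranched(graph, "ATG"))  # prints ATGGCGTA
```

The two reads share the k-mer `GGCG`, which is what lets the walk join them into one longer sequence.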

SLIDE 32

RNA-seq analysis de-novo assembly

Trinity

SLIDE 33

RNA-seq analysis de-novo assembly

Trinity
