SLIDE 1

Transcript Assembly and Quantification from RNASeq Data

Angelika Merkel & David Gonzales-Knowles

Centre de Regulacio Genomica (CRG), Barcelona, Spain

COST RNASeq workshop, Uppsala (May 2012)

Tuesday, May 22, 2012

SLIDE 2

Why RNASeq...

  • it's amazing!
  • represents steady-state RNA abundance over a wide dynamic range
  • quantifies alternative splicing
  • de novo detection of splice junctions and elements

SLIDE 3

RNASeq Data

  • illustrate the “volume”:
  • give some specs on current RNAseq datasets, e.g. CLL, ENCODE, Illumina...
  • amount: number of sets in the lab

SLIDE 4

Software

  • Mapping: TopHat, GEM
  • Transcript assembly: Cufflinks
  • Transcript quantification: Flux Capacitor

SLIDE 5

Mapping I - TopHat

  • TopHat identifies splice junctions ab initio by large-scale mapping of RNA-Seq reads. It maps reads to splice sites in a mammalian genome at a rate of ∼2.2 million reads per CPU hour.
  • TopHat first maps non-junction reads (those contained within exons) using Bowtie (http://bowtie-bio.sourceforge.net), an ultra-fast short-read mapping program (Langmead et al., 2009), allowing 2 mismatches and up to 10 multi-mappings per read. Reads that fail to map are kept as initially unmapped (IUM) reads.
  • The TopHat pipeline: RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside. An initial consensus of mapped regions is computed by Maq. Sequences flanking potential donor/acceptor splice sites within neighboring regions are joined to form potential splice junctions. The IUM reads are indexed and aligned to these splice junction sequences.

Trapnell, Pachter & Salzberg (2009)

SLIDE 6

TopHat seeding

The seed-and-extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
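The seed-and-extend step described above can be sketched in a few lines of Python. This is a toy illustration, not TopHat's implementation: the function name, the `seed_half` and `max_mismatches` parameters, and the example sequences are all assumptions made for the sketch, and the read "index" is replaced by a plain substring search that only looks at the first occurrence of the seed in each read.

```python
def find_spliced_reads(genome, donor, acceptor, reads, seed_half=4, max_mismatches=1):
    """Toy seed-and-extend check for one candidate splice junction.

    donor    -- index of the first intronic base (upstream exon ends at donor - 1)
    acceptor -- index of the first base of the downstream exon
    Returns (read, start_in_spliced_sequence, mismatches) for every accepted read.
    """
    upstream = genome[:donor]
    downstream = genome[acceptor:]
    spliced = upstream + downstream  # exon sequence with the intron removed
    # Seed: a few bases upstream of the donor joined to a few bases downstream of the acceptor
    seed = upstream[-seed_half:] + downstream[:seed_half]
    hits = []
    for read in reads:
        pos = read.find(seed)  # exact seed match required (first occurrence only)
        if pos == -1:
            continue
        # Where the read would start on the spliced sequence if the seed is placed on the junction
        start = len(upstream) - seed_half - pos
        if start < 0 or start + len(read) > len(spliced):
            continue
        window = spliced[start:start + len(read)]
        # Extend: count mismatches over the whole read (the seed itself matches exactly)
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((read, start, mismatches))
    return hits


# Hypothetical example: two exons separated by a GT...AG intron
exon1, intron, exon2 = "ACGTACGGAT", "GTAAGTCCCCAG", "TTGCAACGTC"
genome = exon1 + intron + exon2
unmapped_reads = [
    "ACGGATTTGCAA",  # spans the junction exactly
    "ACGGATTTGCAT",  # spans the junction with one mismatch outside the seed
    "ACGGATGTAAGT",  # runs into the intron, so it does not contain the seed
    "TTGGATTTGCTT",  # contains the seed but the flanks do not match
]
hits = find_spliced_reads(genome, len(exon1), len(exon1) + len(intron), unmapped_reads)
for read, start, mismatches in hits:
    print(read, start, mismatches)
```

Only the first two reads are reported; the fourth contains the seed but is rejected in the extension step because its flanks disagree with the exons.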

SLIDE 7
  • latest release: TopHat 2.0.0 (released 4/09/2012)
  • TopHat homepage: http://tophat.cbcb.umd.edu/

SLIDE 8

GEM

  • pipeline for RNAseq:
  • initial mapping with the GEM short-read mapper (genome, transcriptome)
  • unaligned reads mapped with the GEM split-mapper
  • recursive mapping with trimmed reads and increased no. of mismatches to improve

SLIDE 9

GEM split-mapper

  • short description + picture (hopefully Paolo is sending me something)

SLIDE 10
  • latest release: ?
  • GEM homepage: http://sourceforge.net/apps/mediawiki/gemlibrary/

SLIDE 11

Transcript assembly + quantification

  • Genes sometimes have multiple alternative splicing events, and there may be many possible reconstructions of the gene model that explain the sequencing data. In fact, it is often not obvious how many splice variants of the gene may be present.
  • Thus, Cufflinks reports a parsimonious transcriptome assembly of the data. The algorithm reports as few full-length transcript fragments or 'transfrags' as are needed to 'explain' all the splicing event outcomes in the input data.
  • The same issues apply to the Flux Capacitor.
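One way to picture the parsimony criterion is as a minimum chain cover: if fragments are nodes and an edge u -> v means "u lies upstream of v and both could come from the same transcript", the fewest transcripts that explain all fragments is the fewest chains that cover all nodes. The sketch below is a simplified illustration under assumptions, not Cufflinks' code: the function name, the integer fragment IDs and the example edges are hypothetical, the compatibility relation is assumed to be transitively closed, and the cover is found with a plain augmenting-path bipartite matching.

```python
def min_chain_cover(n, edges):
    """Cover nodes 0..n-1 with as few chains as possible.

    edges -- pairs (u, v): fragment u is upstream of and compatible with fragment v.
             The relation is assumed to be transitively closed (a partial order).
    Returns a list of chains; each chain is one candidate transcript.
    """
    successors = {u: [] for u in range(n)}
    for u, v in edges:
        successors[u].append(v)

    match_to = {}  # v -> u means the chain steps from u directly to v

    def try_augment(u, seen):
        # Standard augmenting-path search for bipartite matching
        for v in successors[u]:
            if v in seen:
                continue
            seen.add(v)
            if v not in match_to or try_augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    # Every unmatched "right" node starts a chain; follow matched edges to extend it
    next_of = {u: v for v, u in match_to.items()}
    chains = []
    for start in range(n):
        if start in match_to:
            continue
        chain = [start]
        while chain[-1] in next_of:
            chain.append(next_of[chain[-1]])
        chains.append(chain)
    return chains


# Hypothetical locus: fragments 0, 1 and 2 are mutually incompatible;
# fragment 3 (upstream) and fragment 4 (downstream) are compatible with all of them.
edges = {(3, 0), (3, 1), (3, 2), (0, 4), (1, 4), (2, 4), (3, 4)}
chains = min_chain_cover(5, edges)
print(len(chains), chains)
```

The three mutually incompatible fragments force at least three chains, and the matching finds a cover with exactly three, which is the statement of Dilworth's theorem for this small example.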

SLIDE 12

Cufflinks

  • Overview of Cufflinks. The algorithm takes as input cDNA fragment sequences that have been (a) aligned to the genome by software capable of producing spliced alignments, such as TopHat. With paired-end RNA-Seq, Cufflinks treats each pair of fragment reads as a single alignment. The algorithm assembles overlapping ‘bundles’ of fragment alignments (b-c) separately, which reduces running time and memory use because each bundle typically contains the fragments from no more than a few genes. Cufflinks then estimates the abundances of the assembled transcripts (d-e).

(b) The first step in fragment assembly is to identify pairs of ‘incompatible’ fragments that must have originated from distinct spliced mRNA isoforms. Fragments are connected in an ‘overlap graph’ when they are compatible and their alignments overlap in the genome. Each fragment has one node in the graph, and an edge, directed from left to right along the genome, is placed between each pair of compatible fragments. In this example, the yellow, blue, and red fragments must have originated from separate isoforms, but any other fragment could have come from the same transcript as one of these three.

(c) Assembling isoforms from the overlap graph. Paths through the graph correspond to sets of mutually compatible fragments that could be merged into complete isoforms. The overlap graph here can be minimally ‘covered’ by three paths, each representing a different isoform. Dilworth's Theorem states that the number of mutually incompatible reads is the same as the minimum number of transcripts needed to “explain” all the fragments. Cufflinks implements a proof of Dilworth's Theorem that produces a minimal set of paths that cover all the fragments in the overlap graph by finding the largest set of reads with the property that no two could have originated from the same isoform.

(d) Estimating transcript abundance. Fragments are matched (denoted here using color) to the transcripts from which they could have originated. The violet fragment could have originated from the blue or red isoform. Gray fragments could have come from any of the three shown. Cufflinks estimates transcript abundances using a statistical model in which the probability of observing each fragment is a linear function of the abundances of the transcripts from which it could have originated. Because only the ends of each fragment are sequenced, the length of each may be unknown. Assigning a fragment to different isoforms often implies a different length for it. Cufflinks can incorporate the distribution of fragment lengths to help assign fragments to isoforms. For example, the violet fragment would be much longer, and very improbable according to Cufflinks' model, if it were to come from the red isoform instead of the blue isoform.

(e) The program then numerically maximizes a function that assigns a likelihood to all possible sets of relative abundances of the yellow, red and blue isoforms (γ1,γ2,γ3), producing the abundances that best explain the observed fragments, shown as a pie chart.

Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Trapnell (2010)
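The abundance estimation in panels (d)-(e) can be illustrated with a small expectation-maximization loop over a mixture model in which each fragment's probability is a linear function of the isoform abundances. This is a simplified sketch under assumptions, not the Cufflinks model: transcript lengths and other effects are ignored, EM is used as a stand-in for the numerical maximization, and the function name, the fragment counts and the per-isoform probabilities (standing in for the fragment length distribution) are invented for the example.

```python
def estimate_abundances(fragment_probs, transcripts, iterations=200):
    """Maximize the likelihood  prod_f  sum_t gamma[t] * P(f | t)  by EM.

    fragment_probs -- one dict per fragment: {transcript: P(fragment | transcript)},
                      listing only the transcripts the fragment is compatible with
                      (all probabilities assumed to be positive).
    Returns relative abundances gamma that sum to 1.
    """
    gamma = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(iterations):
        expected = {t: 0.0 for t in transcripts}
        for probs in fragment_probs:
            # E-step: split each fragment among its compatible transcripts
            total = sum(gamma[t] * p for t, p in probs.items())
            for t, p in probs.items():
                expected[t] += gamma[t] * p / total
        # M-step: abundances are the normalized expected fragment counts
        gamma = {t: expected[t] / len(fragment_probs) for t in transcripts}
    return gamma


transcripts = ["yellow", "blue", "red"]
fragments = (
    [{"yellow": 1.0}] * 4                           # fragments unique to the yellow isoform
    + [{"blue": 1.0}] * 2                           # unique to blue
    + [{"red": 1.0}]                                # unique to red
    + [{"blue": 0.9, "red": 0.01}]                  # "violet": implied length unlikely under red
    + [{"yellow": 1.0, "blue": 1.0, "red": 1.0}]    # "gray": compatible with all three
)
gamma = estimate_abundances(fragments, transcripts)
for name in transcripts:
    print(name, round(gamma[name], 3))
```

The violet fragment is assigned almost entirely to the blue isoform because its implied length is far more probable there, which is the effect described for panel (d).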

SLIDE 13
  • latest release: Cufflinks 2.0.0 (released 5/4/2012; includes cuffmerge, cuffcompare, cuffdiff)
  • Cufflinks homepage: http://cufflinks.cbcb.umd.edu/
  • Workshop at Berkeley, June 30th (Cufflinks and eXpress), online viewing $50: http://qb3.berkeley.edu/qb3/starseq/

SLIDE 14

Flux Capacitor

Reference: Montgomery et al. 2010

The basic problem addressed by the Flux Capacitor. The exonic structure of two spliceforms (labeled "SF A" and "SF B") is shown, with aligned reads obtained by RNA-Seq methods (top). Those reads, mapped to the edges of a splicing graph (bottom), represent a signal measured as the flux: the relative coverage along an exonic stretch. Where transcripts overlap in exons, their respective fluxes are combined. Given the information from all edges in a locus, signal separation is achieved by decomposition across a flow network.
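The idea that overlapping spliceforms contribute a combined flux can be made concrete with a tiny two-transcript example. The sketch below is only a stand-in: it fits the two expression levels by ordinary least squares over per-segment fluxes rather than by the flow-network decomposition the Flux Capacitor actually uses, and the function name, the segment layout and the flux values are hypothetical.

```python
def decompose_two_spliceforms(segments):
    """Least-squares estimate of the expression of spliceforms A and B.

    segments -- list of (in_a, in_b, flux): whether the segment belongs to
                spliceform A and/or B (0 or 1), and its observed flux.
    Model: flux = in_a * x_a + in_b * x_b for every segment.
    """
    s_aa = sum(a * a for a, _, _ in segments)
    s_bb = sum(b * b for _, b, _ in segments)
    s_ab = sum(a * b for a, b, _ in segments)
    s_af = sum(a * f for a, _, f in segments)
    s_bf = sum(b * f for _, b, f in segments)
    det = s_aa * s_bb - s_ab * s_ab
    if det == 0:
        raise ValueError("spliceforms cannot be told apart from these segments")
    # Solve the 2x2 normal equations with Cramer's rule
    x_a = (s_af * s_bb - s_ab * s_bf) / det
    x_b = (s_aa * s_bf - s_ab * s_af) / det
    return x_a, x_b


# Hypothetical locus: two shared segments and one segment unique to each spliceform
segments = [
    (1, 1, 30.0),  # shared exon: fluxes of A and B add up
    (1, 0, 20.0),  # segment only in SF A
    (0, 1, 10.0),  # segment only in SF B
    (1, 1, 30.0),  # shared exon
]
x_a, x_b = decompose_two_spliceforms(segments)
print(x_a, x_b)
```

With noise-free fluxes as in this example the fit recovers the two levels exactly; with real coverage the segment equations would conflict and the estimate becomes a compromise between them.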

SLIDE 15

The assignment of reads, after they have been mapped to genomic locations, is not straightforward. The Flux Capacitor follows a conservative annotation assignment, i.e., reads are assigned uniquely to genomic regions ("segments" or "junctions"). These regions are defined by the exon-intron structure of each locus; an example is shown in Fig. 1.

Fig. 1: An example locus with two transcripts, I and II (names to the left), that overlap in segments of their exons (green boxes denoted by letters A through E; indices indicate segments of overlapping exons). The Flux Capacitor further distinguishes 5 non-exonic areas. 19 sequencing reads (arrows with heart labels) have been mapped in the area of the locus as shown.

http://fluxcapacitor.wikidot.com/capacitor
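The segment definition and the unique assignment of reads can be sketched as follows. This is a minimal illustration under assumptions, not the Flux Capacitor's code: the function names, the half-open coordinates and the example locus are hypothetical, junction-spanning reads are not modelled, and any read that does not fall entirely inside one exonic segment is simply counted as unassigned.

```python
def exonic_segments(transcripts):
    """Cut the exons of a locus at every exon boundary.

    transcripts -- {name: [(start, end), ...]} with half-open exon intervals.
    Returns (start, end, owners) for each exonic segment, where owners is the
    sorted tuple of transcripts whose exons contain the segment.
    """
    cuts = sorted({pos for exons in transcripts.values() for exon in exons for pos in exon})
    segments = []
    for left, right in zip(cuts, cuts[1:]):
        owners = tuple(sorted(
            name for name, exons in transcripts.items()
            if any(start <= left and right <= end for start, end in exons)
        ))
        if owners:  # stretches covered by no exon are not exonic segments
            segments.append((left, right, owners))
    return segments


def count_reads(segments, reads):
    """Assign each read (start, end) to the single segment that contains it."""
    counts = {(left, right): 0 for left, right, _ in segments}
    unassigned = 0
    for start, end in reads:
        for left, right, _ in segments:
            if left <= start and end <= right:
                counts[(left, right)] += 1
                break
        else:
            unassigned += 1  # crosses a segment boundary or lies outside the exons
    return counts, unassigned


# Hypothetical locus with two overlapping transcripts
locus = {"I": [(0, 100), (200, 300)], "II": [(50, 100), (200, 350)]}
segments = exonic_segments(locus)
reads = [(10, 40), (60, 90), (40, 70), (210, 250), (310, 340), (120, 150)]
counts, unassigned = count_reads(segments, reads)
print(segments)
print(counts, unassigned)
```

In this example the exons split into four segments, two of them shared by both transcripts; the read crossing a segment boundary and the intronic read are left unassigned.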

SLIDE 16
  • latest release: ?
  • Flux Capacitor homepage: http://flux.sammeth.net/capacitor.html
  • video: http://www.scivee.tv/node/10013