STAR: Spliced Transcripts Alignment & Reconstruction

Jul 04, 2023


SLIDE 1

STAR: Spliced Transcripts Alignment & Reconstruction
using high throughput RNA-seq data

CSHL

Alexander Dobin, Philippe Batut, Sudipto Chakrabortty, Carrie Davis, Delphine Fagegaltier, Sonali Jha, Wei Lin, Felix Schlesinger, Chenghai Xue, Christopher Zaleski, Thomas Gingeras

SLIDE 2

STAR: spliced transcript alignment and reconstruction

'Ab initio' splice junctions
– un-annotated, non-canonical, distal exons, chimeric ...
Unique and multiple mappers
Any read length, any number of splice junctions (SJs) per read
Any (reasonable) number of mismatches and indels
Alignment scoring using read quality scores
"Auto" trimming of poor quality ends
Poly-A tail detection
Very fast: 60 million human 75-mer reads per hour
Memory: RAM ~ 9 × (genome length) bytes, 25 GB for human
II. Algorithm

SLIDE 3

Splitting the reads

Split the read at poor quality bases (QS<10) and at 'N' bases
Map each good piece separately
Detect mismatches caused by poor SNR
Avoid erroneous mapping caused by sequencing errors:
– just 1 SNP can cause mis-mapping from paralog to paralog
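The splitting step can be illustrated with a small Python sketch. It is not STAR's code: the function name `split_read`, the list-of-integers quality representation, and the toy read are assumptions made for illustration. Only the QS<10 threshold and the 'N' rule come from the slide.

```python
def split_read(seq, quals, min_q=10):
    """Split a read into maximal runs of good bases.

    A base is "good" if its quality score is at least min_q and it is not 'N'.
    Returns a list of (start_offset_in_read, subsequence) tuples.
    """
    pieces = []
    start = None  # start offset of the current good run, if any
    for i, (base, q) in enumerate(zip(seq, quals)):
        good = q >= min_q and base != "N"
        if good and start is None:
            start = i                       # a new good piece begins
        elif not good and start is not None:
            pieces.append((start, seq[start:i]))  # close the current piece
            start = None
    if start is not None:                   # piece running to the end of the read
        pieces.append((start, seq[start:]))
    return pieces


# Toy read (hypothetical): an 'N' at offset 4 and a low-quality base at offset 7.
read = "ACGTNACGTTGCA"
quals = [30, 30, 30, 30, 2, 30, 30, 5, 30, 30, 30, 30, 30]
pieces = split_read(read, quals)
print(pieces)
```

Each returned piece keeps its offset in the read, so pieces that are mapped separately can later be placed back in read order.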

SLIDE 4

Suffix array based search

For each good piece:
– find maximum exactly mappable length (could be a multiple mapper)
– if a long portion of the good piece is still unmapped, repeat
– repeat this procedure backwards (from 3' to 5' of a good piece)
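The forward part of this search can be sketched in Python on a toy genome. This is a simplified illustration, not STAR's implementation: the suffix array is built naively, the helper names (`build_suffix_array`, `max_mappable_prefix`, `seed_search`) and the toy sequences are hypothetical, and the backward (3' to 5') pass is omitted.

```python
def build_suffix_array(genome):
    # Naive construction by sorting suffixes; only suitable for toy inputs.
    return sorted(range(len(genome)), key=lambda i: genome[i:])


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def max_mappable_prefix(genome, sa, query):
    """Return (length, positions) of the longest prefix of query found in genome."""
    # Binary search for the first suffix that is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if genome[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is shared with a neighbour of the insertion point.
    best = 0
    for j in (lo - 1, lo):
        if 0 <= j < len(sa):
            best = max(best, common_prefix_len(genome[sa[j]:], query))
    if best == 0:
        return 0, []
    # All suffixes sharing that prefix are contiguous around the insertion point.
    hits = []
    j = lo - 1
    while j >= 0 and genome[sa[j]:sa[j] + best] == query[:best]:
        hits.append(sa[j])
        j -= 1
    j = lo
    while j < len(sa) and genome[sa[j]:sa[j] + best] == query[:best]:
        hits.append(sa[j])
        j += 1
    return best, sorted(hits)


def seed_search(genome, sa, piece):
    """Repeat the maximum-mappable-prefix search on the unmapped remainder."""
    seeds = []
    pos = 0
    while pos < len(piece):
        length, hits = max_mappable_prefix(genome, sa, piece[pos:])
        if length == 0:
            pos += 1  # this base matches nowhere; skip it
            continue
        seeds.append((pos, length, hits))
        pos += length
    return seeds


# Toy genome (hypothetical): two "exons" separated by a run of T's.
genome = "GATTACAGG" + "TTTTTTTT" + "CCATGCA"
sa = build_suffix_array(genome)
seeds = seed_search(genome, sa, "TTACAGGCCATG")  # read spanning the junction
print(seeds)
```

In this toy case the first search stops exactly where the read leaves the first exon, and the second search picks up the remainder in the second exon, so the junction falls out of two exact searches. More than one position in `hits` would correspond to a multiple mapper.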

SLIDE 5

Maximum mappable length

Typical short read aligner: does the read map entirely, i.e. at full length?
STAR asks instead: what is the maximum mappable length?
– can detect many mismatches
– can precisely "trim" poor quality tails
– can detect splice junctions

With suffix arrays we find the maximum mappable length in no extra time

[Figure labels: Map, Extend, Map, Map, Map again]

SLIDE 6

Scoring with quality scores

Similar to local alignment scoring, but penalties have probabilistic meaning

Illumina quality score: $Q = -10 \cdot \log_{10}(P_{\text{base error}})$
+QS for matches; -QS for mismatches
Penalty for gap opening: $P_{\text{gap}} = -10 \cdot \log_{10}(P_{\text{SJ gap}})$
Total score: $S = \sum_{i=\text{match}} Q_i - \sum_{i=\text{mismatch}} Q_i - \sum_{\text{gap}} P_{\text{gap}}$

A more elaborate iterative penalty system is being developed
– gap penalty is calculated from mapped gap length distribution
– mismatch penalties vs QS scores are re-calibrated after mapping

Choose the alignment(s) with highest score
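As a worked illustration of this scoring scheme, the Python sketch below adds the quality scores of matched bases, subtracts those of mismatched bases, and subtracts one penalty per gap. It assumes the standard Phred convention Q = -10·log10(P) and, by analogy, a positive gap penalty of -10·log10(P_gap). The function names and the example numbers are hypothetical, not taken from the slides.

```python
import math


def phred(p_error):
    """Phred-scaled score for an error probability (standard convention)."""
    return -10.0 * math.log10(p_error)


def alignment_score(match_quals, mismatch_quals, gap_probs):
    """Score = sum(Q over matches) - sum(Q over mismatches) - sum(gap penalties).

    gap_probs holds one probability per gap; each is turned into a positive
    Phred-like penalty (an assumption made by analogy with base qualities).
    """
    gap_penalties = [phred(p) for p in gap_probs]
    return sum(match_quals) - sum(mismatch_quals) - sum(gap_penalties)


# Hypothetical alignment: 70 matches at Q30, 2 mismatches at Q20,
# and one gap whose probability is taken to be 0.01 (penalty of about 20).
score = alignment_score([30] * 70, [20] * 2, [0.01])
print(round(score, 2))
```

Because every term is on the same Phred-like scale, a mismatch at a low-quality base costs less than one at a high-quality base, which is what gives the penalties a probabilistic reading.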

SLIDE 7

Stitch and extend mapped pieces

Select anchors and alignment windows
Collect all mapped pieces within an alignment window
Consider all collinear transcripts of mapped pieces within a window
Stitch all pieces together
Extend the transcripts through the un-mapped 5' and 3' ends

[Figure labels: Extend, Extend, Stitch]
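The collinearity requirement can be sketched as a small dynamic program over the pieces in one window. This is only an illustration under simplifying assumptions: pieces are hypothetical `(read_start, genome_start, length)` tuples, overlapping pieces are not allowed, and the chain is ranked by the number of read bases it covers rather than by a full alignment score with mismatches and gap penalties.

```python
def best_collinear_chain(pieces):
    """Pick the collinear chain of pieces covering the most read bases.

    pieces: list of (read_start, genome_start, length) within one window.
    Two pieces are collinear here if the second starts at or after the end
    of the first in both read and genome coordinates.
    """
    pieces = sorted(pieces)
    n = len(pieces)
    if n == 0:
        return []
    best = [p[2] for p in pieces]   # best covered length of a chain ending at j
    prev = [None] * n               # back-pointer to the previous piece
    for j in range(n):
        rj, gj, lj = pieces[j]
        for i in range(j):
            ri, gi, li = pieces[i]
            collinear = ri + li <= rj and gi + li <= gj
            if collinear and best[i] + lj > best[j]:
                best[j] = best[i] + lj
                prev[j] = i
    # Trace back from the best chain end.
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j is not None:
        chain.append(pieces[j])
        j = prev[j]
    return chain[::-1]


# Hypothetical window: three pieces form a spliced chain, while the piece at
# genome position 40 lies upstream of the first piece and cannot follow it.
pieces = [(0, 100, 30), (30, 5130, 20), (30, 40, 20), (50, 5150, 26)]
chain = best_collinear_chain(pieces)
print(chain)
```

The large jump in genome coordinates between the first and second chained pieces is the kind of gap that would be stitched as a splice junction. The extension step through unmapped ends is not shown.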

SLIDE 8

Comparison with exhaustive search

Fly embryo 76-mer RNA-seq, 1 Illumina lane: 8,930,945 total reads, good quality

         Exhaustively mapped   Only in STAR   Missed by STAR
Exact              5,125,614              -            2,425
1MM                1,353,709             94            3,217
2MM                  417,225             23            4,172

Only in STAR: multiple mappers by exhaustive search, <0.002% of all reads
Missed by STAR: poor quality reads which did not have a single unique "anchor"
STAR maps 99.8% of all exhaustively mapped reads
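The two percentages quoted on this slide can be checked from the counts with a few lines of Python. The assignment of 2,425 to the "Missed by STAR" column for exact matches is an assumption about the flattened table; it is the reading under which the "<0.002%" figure holds.

```python
# Counts from the slide (fly embryo, 76-mer reads).
exhaustive = {"exact": 5_125_614, "1MM": 1_353_709, "2MM": 417_225}
only_in_star = {"1MM": 94, "2MM": 23}
# Assumption: the 2,425 on the "Exact" row belongs to "Missed by STAR".
missed_by_star = {"exact": 2_425, "1MM": 3_217, "2MM": 4_172}
total_reads = 8_930_945

total_exhaustive = sum(exhaustive.values())
recovered = 1 - sum(missed_by_star.values()) / total_exhaustive
only_star_fraction = sum(only_in_star.values()) / total_reads

print(f"exhaustively mapped reads: {total_exhaustive:,}")
print(f"recovered by STAR: {recovered:.2%}")
print(f"only in STAR, as a share of all reads: {only_star_fraction:.4%}")
```

Under that reading, STAR recovers a little over 99.8% of the exhaustively mapped reads, and the reads found only by STAR stay below 0.002% of the lane.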

III. Application

SLIDE 9

Reads mapped by STAR

77% overlap with exhaustive search
11% STAR >2MM or shorter length
0.2% STAR InDels
8.5% STAR splice junctions
1.5% multi-mappers
1.8% not mapped by STAR

SLIDE 10

STAR alignments

~1,000,000 alignments found by STAR and not by exhaustive search

[Plot: distribution of mapped lengths, mean length = 72]
[Plot: distribution of mismatches]
[Plot annotations: poor quality tails, spliced portions]

SLIDE 11

Benchmarks

Single thread benchmarks, 75-mer reads
Bowtie (-v2 -k1) only reports non-spliced alignments with 0-2 MM, 1 or 2 alignments per read
BLAT and STAR report >2MM and spliced alignments, and all the multiple alignments

Millions of reads aligned per hour:

         BLAT   Bowtie   STAR
Fly        13       19     91
Human       1       13     58

SLIDE 12

Human total RNA

K562 human cell line, 270M reads (76-mer)

                      uniquely mapped reads   mean mapped length (bases)
0-2 MM                                  51M                           76
0-2 MM & trim to 50                     72M                           50
STAR                                   106M                         64.8

[Plot: quality scores vs cycle (percentile). Poor quality tails!]
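To compare the three strategies on one scale, the Python sketch below turns the table into the fraction of the 270M reads that map uniquely and the total number of uniquely mapped bases (reads × mean mapped length). The derived quantities are illustrative arithmetic on the slide's rounded figures, not results reported in the presentation.

```python
total_reads = 270e6  # K562 human total RNA, 76-mer reads

# strategy -> (uniquely mapped reads, mean mapped length in bases), from the slide
rows = {
    "0-2 MM": (51e6, 76.0),
    "0-2 MM & trim to 50": (72e6, 50.0),
    "STAR": (106e6, 64.8),
}

summary = {}
for name, (reads, mean_len) in rows.items():
    fraction = reads / total_reads   # share of all reads mapped uniquely
    bases = reads * mean_len         # total uniquely mapped bases
    summary[name] = (fraction, bases)
    print(f"{name:22s} {fraction:6.1%} of reads, {bases / 1e9:.2f} Gb mapped")
```

Trimming every read to 50 bases maps more reads than the untrimmed 0-2 MM search but discards bases from good reads too, whereas mapping the maximum mappable length keeps both more reads and more bases per read.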

SLIDE 13

Splice junctions

All spliced reads: 3.75M
– Canonical GT/AG: 3.6M (96%)
– Non-canonical: 150k (4%)

Splice junctions with 3 or more reads per junction: 87k
– Annotated canonical: 78k (90%)
– Un-annotated canonical: 6k (7%)
– Non-canonical: 2.8k (3%)

~0.5% of mapped reads are chimeric: inter-chromosome or inter-strand

SLIDE 14

Summary

STAR: ab initio splice junction detection
Maximum mappable length search with suffix arrays
Alignment scoring uses quality scores of the reads
Very fast: 60M/hour for 75-mer reads in human,