SV Detection Strategy: combine methods to detect a wider range of SVs

SLIDE 1

SV Detection Strategy

  • Combine methods to detect a wider range of SVs
  • Read-pair (RP) analysis: deletions, insertions, inversions, translocations
  • Split-read (SR) mapping: deletions, small insertions, deletions with small insertions
  • Read depth (RD): deletions, duplications
  • Single-end cluster (SEC) analysis: large insertions

Thursday, 11 March 2010

SLIDE 2

Analysis pipeline for calling SVs in Mouse genomes

  • Input: *.bam file
  • Callers: Pindel (split reads), BreakDancer (read pairs), CND (HMM read depth), SE Cluster (single-end mapped clusters; large insertions)
  • Filter each caller's calls by score, number of supporting reads, or posterior probabilities; exclude calls near gaps, centromeres and telomeres
  • Copy number calls: min. loss = 1kb, min. gain = 2kb
  • Separate small (<100bp) and large SVs
  • Merge overlapping SVs of the same type
  • Create tab-delimited (BED) ‘merged’ SV list
  • Run local assemblies
  • Align contigs to reference
  • Parse contig alignments: refine coordinates; rank calls based on alignment evidence
  • Overlap SVs with: genes/exons, QTL regions, other regions of interest
  • Summary stats: total deletions, insertions, etc.; number of SVs affecting exons; affected genes in QTL

SLIDE 3

BreakDancer

  • Performs read-pair analysis
  • Accepts Maq map files or bam files
  • Handles multiple library insert sizes
  • Provides a confidence score and number of supporting reads

  • K. Chen et al., Nature Methods (2009)

SLIDE 4

Read Pair Analysis

[Figure: read-pair signatures against the reference (ref) for an insertion, a deletion, an inversion and a translocation (labelled A and B), drawn for a 200bp insert library]
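The read-pair signatures on this slide can be sketched roughly as follows. This is a toy classifier, not BreakDancer's algorithm: the 200bp expected insert comes from the slide, while the `ReadPair` fields, the `classify_pair` name and the 50bp tolerance are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    chrom1: str
    pos1: int       # leftmost mapped position of the first read
    strand1: str    # '+' or '-'
    chrom2: str
    pos2: int       # leftmost mapped position of the second read
    strand2: str
    read_len: int


def classify_pair(pair, expected_insert=200, tolerance=50):
    """Return a coarse SV signature label for one mapped read pair."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"      # mates map to different chromosomes
    if pair.strand1 == pair.strand2:
        return "inversion"          # mates align in the same orientation
    span = abs(pair.pos2 - pair.pos1) + pair.read_len
    if span > expected_insert + tolerance:
        return "deletion"           # mapped further apart than expected
    if span < expected_insert - tolerance:
        return "insertion"          # mapped closer together than expected
    return "concordant"
```

A real caller would estimate the insert size distribution per library and require several supporting pairs before reporting an SV.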

SLIDE 5

Read pairs displayed in LookSeq. Deletions are easily spotted: read pairs are mapped further apart than expected.

SLIDE 6

Inversion on chr 4

  • Mate pairs align in the same orientation
  • Dip in coverage at the breakpoints

SLIDE 7

Pindel

  • Maq can’t align reads at breakpoints of larger SVs
  • Often one read in a pair is mapped and the other is unmapped
  • Pindel uses the mapped partner to localise unmapped reads, and uses a split-read approach to align the read (deletions: 1bp - 100kb; insertions: 1bp - <100bp)

[Figure: split-read alignment against the reference (ref) for a deletion and an insertion, 200bp insert]

K. Ye et al., Bioinformatics (2009)
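The split-read idea can be illustrated with a brute-force toy. It is not Pindel's actual search: it assumes `ref` is the local reference window near the mapped partner, that the read matches exactly, and the `find_deletion` name and `min_anchor` parameter are hypothetical.

```python
def find_deletion(ref, read, min_anchor=5):
    """Split the read into two parts and look for a deletion between them.

    Returns (deletion_start, deletion_length) in ref coordinates, or None
    if no split places the two parts at non-adjacent positions in ref.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:k], read[k:]
        p = ref.find(left)              # anchor the left part of the read
        if p == -1:
            continue
        # The right part must start at least 1bp beyond the end of the left part.
        q = ref.find(right, p + k + 1)
        if q != -1:
            return p + k, q - (p + k)
    return None


ref = "GATTACAGTC" + "TTTTT" + "CGGACTGAAC"   # the middle 5bp are deleted in the sample
read = "TACAGTC" + "CGGACTG"                  # read spanning the deletion breakpoint
print(find_deletion(ref, read))               # (10, 5)
```

Repeats or microhomology around the breakpoint make the split position ambiguous, which this sketch does not handle.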

SLIDE 8

CND

  • Copy number detection from mapped read depth using a Hidden Markov Model (HMM)

  • Resolution: 1kb
  • Uses bam files and samtools pileup
  • Repeat regions are skipped
  • GC correction included to improve calls
  • J. Simpson et al., Bioinformatics (2010)
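To make the HMM idea concrete, here is a minimal Viterbi sketch over per-window depths. It is not CND's model: the three states, the depth ratios, the Poisson emissions and the `stay` probability are all assumptions, and it returns the single most likely state path rather than the posterior probabilities that the pipeline filters on.

```python
import math

STATES = ("loss", "normal", "gain")
RATIO = {"loss": 0.5, "normal": 1.0, "gain": 1.5}   # assumed depth relative to normal


def log_poisson(k, lam):
    """Log probability of observing depth k under a Poisson with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def viterbi(depths, mean_depth, stay=0.99):
    """Most likely copy number state for each window (e.g. each 1kb window)."""
    switch = (1.0 - stay) / (len(STATES) - 1)
    log_trans = {(a, b): math.log(stay if a == b else switch)
                 for a in STATES for b in STATES}
    score = {s: math.log(1.0 / len(STATES))
                + log_poisson(depths[0], mean_depth * RATIO[s]) for s in STATES}
    back = []
    for d in depths[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log_trans[(p, s)])
            pointers[s] = best
            score[s] = (prev[best] + log_trans[(best, s)]
                        + log_poisson(d, mean_depth * RATIO[s]))
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):     # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With a mean depth of 30, a run of windows at depth 15 flanked by windows at depth 30 is labelled as a loss, while an isolated noisy window is absorbed by the high `stay` probability.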

SLIDE 9

GC content of mapped reads from individual lanes

SLIDE 10

[Figure: total Mb included in regions of copy number gain or loss, with and without GC correction. Series: gain, no GC correction; gain, GC correction; loss, no GC correction; loss, GC correction]

Corrected depth for each 1 kb window:

$\tilde{d}_i = d_i \cdot \frac{m}{m_{GC}}$

where $d_i$ is the mean depth per base of the $i$th window, $m_{GC}$ is the median depth of all windows with the same G+C percentage as the $i$th window, and $m$ is the median depth of all windows (revised from Yoon et al., 2009)
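The correction defined on this slide (window depth times m over mGC) could be computed along these lines. This is a sketch assuming G+C content has already been rounded to an integer percentage per window; the function name and the zero-depth fallback are illustrative choices, not part of CND.

```python
from collections import defaultdict
from statistics import median


def gc_correct(depths, gc_percents):
    """Return d_i * m / m_GC for each window.

    depths: mean depth per base of each window
    gc_percents: integer G+C percentage of each window (same length)
    """
    m = median(depths)                          # median depth of all windows
    by_gc = defaultdict(list)
    for d, gc in zip(depths, gc_percents):
        by_gc[gc].append(d)
    m_gc = {gc: median(vals) for gc, vals in by_gc.items()}
    # If a GC bin has median depth 0, report 0.0 instead of dividing by zero.
    return [d * m / m_gc[gc] if m_gc[gc] else 0.0
            for d, gc in zip(depths, gc_percents)]


depths = [10, 20, 30, 40, 60, 80]
gc = [40, 40, 40, 60, 60, 60]
print(gc_correct(depths, gc))   # the median window of each GC bin maps to m = 35
```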

SLIDE 11

SE Clusters

  • Identifies candidate insertion sites by finding clusters of single-end mapped reads (one end mapped, other end unmapped)

[Figure: reference (ref) with an inserted sequence; unmapped reads fall inside the insertion]
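A minimal sketch of the clustering step, assuming the positions of single-end mapped reads on one chromosome have already been extracted from the bam file. The `max_gap` and `min_reads` thresholds are illustrative assumptions, not the values used in the pipeline.

```python
def cluster_positions(positions, max_gap=200, min_reads=3):
    """Group positions of single-end mapped reads into (start, end, n_reads) clusters."""
    clusters = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current cluster.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_reads:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_reads:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


print(cluster_positions([100, 150, 220, 5000, 9000, 9050, 9100, 9120]))
# [(100, 220, 3), (9000, 9120, 4)]
```

The lone read at 5000 is discarded because it does not reach `min_reads`.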

SLIDE 12

Analysis pipeline for calling SVs in Mouse genomes (same diagram as slide 2), now labelling two outputs: the ‘Merged’ set, the tab-delimited (BED) list produced by merging overlapping SVs of the same type, and the ‘Refined’ set, the calls whose coordinates are refined from the contig alignments.

SLIDE 13

Filtering and merging

  • SVs are filtered by score or number of supporting reads; HMM calls are filtered by posterior probabilities
  • Remove SVs near reference sequence gaps, and within 1Mb of a centromere or telomere
  • Reference mm9: 562 gaps (~96 Mb total)
  • Take the union of calls from all methods to create a ‘Merged’ set
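The exclusion and merging steps could look roughly like this. It is a sketch under assumptions: calls are `(chrom, start, end, sv_type)` tuples pooled from all callers, `excluded` holds gap, centromere and telomere intervals per chromosome, and the function names and `margin` parameter are hypothetical.

```python
def near_any(start, end, regions, margin):
    """True if [start, end) lies within `margin` bp of any excluded region."""
    return any(start < r_end + margin and end > r_start - margin
               for r_start, r_end in regions)


def filter_and_merge(calls, excluded, margin=0):
    """Drop calls near excluded regions, then merge overlapping calls of the same type."""
    kept = [c for c in calls
            if not near_any(c[1], c[2], excluded.get(c[0], []), margin)]
    merged = []
    # Sort so that calls of the same chromosome and type are adjacent, ordered by start.
    for chrom, start, end, sv_type in sorted(kept, key=lambda c: (c[0], c[3], c[1])):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and last[3] == sv_type and start <= last[2]:
            merged[-1] = (chrom, last[1], max(last[2], end), sv_type)
        else:
            merged.append((chrom, start, end, sv_type))
    return merged


calls = [("4", 100, 500, "DEL"), ("4", 400, 900, "DEL"), ("4", 450, 600, "INS"),
         ("4", 10000, 10500, "DEL"), ("6", 100, 200, "DEL")]
for row in filter_and_merge(calls, {"4": [(10400, 11000)]}):
    print("\t".join(map(str, row)))     # tab-delimited, BED-like output
```

Passing `margin=1_000_000` for the centromere and telomere intervals would approximate the 1Mb exclusion described above.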

SLIDE 14

Applying several methods to identify SVs

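[Figure: Venn diagram of deletion calls in the NOD strain from Pindel (min 100bp) and BreakDancer; region counts shown: 544, 13,442, 4377]

[Figure: Venn diagram of insertion calls in the NOD strain from BreakDancer, Pindel and SECluster (all calls are >100bp); region counts shown: 3, 21, 275, 11, 934, 3957, 11359]

Overlap between methods is small; using only a single method will not provide a complete list of SVs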

SLIDE 15

Local assembly and breakpoint refinement

  • Local assemblies (Velvet) are performed for each SV in the ‘merged’ set, except CND calls
  • Two assemblies for each, with scaffolding and no scaffolding
  • Contigs are aligned (Exonerate) to the reference, alignments are parsed to find breakpoint(s); SVs are ranked and breakpoint coordinates adjusted:
  • Expected SV is found within range: rank 1 (all rank 1 are in the ‘refined’ set)
  • Expected SV not found, the alternate SV is recorded: rank 1
  • Expected SV not found, there are gaps in contig coverage: rank 2 (inconclusive)
  • No breakpoints are found, no large gaps in contig coverage: rank 3
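The ranking rules above can be restated as a small decision function. The three boolean inputs are hypothetical summaries of the parsed contig alignments, and the mapping simply follows the ranks as listed on the slide.

```python
def rank_call(expected_found, alternate_found, contig_gaps):
    """Rank one SV call from its local assembly evidence, following the slide."""
    if expected_found:
        return 1        # expected SV found within range
    if alternate_found:
        return 1        # a different SV is recorded instead
    if contig_gaps:
        return 2        # inconclusive: gaps in contig coverage
    return 3            # no breakpoints and no large gaps in contig coverage


print(rank_call(False, False, True))   # 2
```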

SLIDE 16

Complications

  • Local assembly from mapped reads:
  • Unmapped mates are included, but large insertions can’t be completely assembled (e.g. when both reads of a pair lie inside the insertion, both may be ‘unmapped’)
  • Reads are not aligned near breakpoints if too many SNPs/indels are present (scaffolding helps sometimes)
  • Automated parsing of alignments:
  • Repetitive sequence or microhomology at or flanking breakpoints
  • Variants and small indels near breakpoints

SLIDE 17

Examples

SV calls confirmed by local assembly

SLIDE 18

LookSeq view of Chr 4 inversion + insertion

[Figure panels: insertion; insertion (zoomed in); inversion]

SLIDE 19

Chr 4 inversion

  • Local assembly
  • NODE_1 and NODE_2 contigs contain breakpoints

[Figure: UCSC browser view of aligned contigs, with the inversion and the insertion (Ins) labelled]

SLIDE 20

Chr 6 LTR deletion

Orange reads have mapping quality 0 (they map to more than one location). Mate pairs span an LTR in the reference genome

SLIDE 21

Chr 6 LTR deletion

  • The repetitive reads aligned to this region by Maq are also assembled
  • Local assembly
  • No scaffolding
  • Contig spans the breakpoints

[Figure: UCSC browser view of aligned contigs, with the deletion labelled]

SLIDE 22

Chr 13 insertion

  • Dip in coverage
  • Clusters of single-end mapped reads (green)

SLIDE 23

Chr 13 insertion

  • Local assembly
  • Scaffolding
  • Contig alignments are cut off near the expected breakpoint location
  • Part of the insertion is reconstructed with scaffolding

UCSC browser view of aligned contigs

SLIDE 24

Chr 13 insertion

BLAST: both contigs align full-length to the Mouse Celera assembly*

[Figure: BLAST hits for Contig 1 and Contig 2, each showing a reference assembly hit and a Celera assembly hit; N’s in the contigs come from scaffolding]

*The Celera assembly is a mixed-strain assembly of 129X1/SvJ, DBA/2J, and A/J (Released in 2001)

SLIDE 25

Chr 17 deletion

SLIDE 26

Chr 17 deletion

  • Local assembly
  • No scaffolding
  • No contig coverage in the deleted region, but no contig crosses the breakpoint

UCSC browser view of aligned contigs

SLIDE 27

Chr 17 deletion

  • Scaffolding allows contigs to join and provides evidence for the deletion

UCSC browser view of aligned contigs

SLIDE 28

Analysis of 17 inbred mouse Genomes

SLIDE 29

Local assembly of deletions in Mouse (predicted size >=100bp)

[Figure: local assembly outcomes per strain; y-axis ticks 22500, 45000, 67500, 90000; strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO; legend: Confirm(>=75bp), Confirm(<75bp), Complex, Gap, NoMatch, NoAssem]

  • Confirm(<75): local assembly identified a deletion, but less than 75bp
  • Complex: event involving a deletion and another event
  • Gap: not enough contig coverage (inconclusive)
  • NoMatch: No deletion found, but another SV type may be found
  • NoAssem: predicted deletion is in very high coverage region

Deletions are easiest to predict and find exact breakpoints for using local assembly

SLIDE 30

Local assembly of insertions (predicted size >=100bp)

[Figure: local assembly outcomes per strain; y-axis ticks 7000, 14000, 21000, 28000, 35000; strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO; legend: Confirm(>=75bp), Confirm(<75bp), Complex, Gap, NoMatch, NoAssem]

  • Confirm(<75): local assembly identified an insertion, but less than 75bp
  • Complex: event involving an insertion and another event
  • Gap: not enough contig coverage (inconclusive)
  • NoMatch: No insertion found, but another SV type may be found
  • NoAssem: predicted insertion is in very high coverage region

Insertions are harder to predict and often turn out to be smaller than predicted

SLIDE 31

Local assembly of inversions (predicted size >=100bp)

[Figure: local assembly outcomes per strain; y-axis ticks 175, 350, 525, 700; strains: NOD, 129S1_SvImJ, C3H_HeJ, CAST_Ei, LP_J, PWK_Ph, Spretus_Ei, CBA_J, AKR_J, 129P2, BALBc_J, C57BL_6N, A_J, WSB_Ei, DBA_2J, 129S5, NZO; legend: Confirm(>=75bp), Confirm(<75bp), Complex, Gap, NoMatch, NoAssem]

  • Confirm(<75): local assembly identified an inversion, but less than 75bp
  • Complex: event involving an inversion and another event
  • Gap: not enough contig coverage (inconclusive)
  • NoMatch: No inversion found, but another SV type may be found
  • NoAssem: predicted inversion is in very high coverage region

Inversion ‘signatures’ are easy to identify, but many do not validate. PCR shows that many read pairs with inversion ‘signatures’ turn out to be insertions.

SLIDE 32

Ongoing PCR Validation

  • Collaborators at Oxford are performing PCR validations on automated and manual calls
  • Deletions:
  • Calls from manual inspection compared to automated calls: ‘merged’ set misses 3%
  • Insertions from ‘refined’ (local assembly) set:
  • Random selection of 24 calls
  • 20/24 validated; 4 may require long range PCR
  • Inversions from ‘refined’ set:
  • Initial results suggest 50% do not validate, but this may be due to PCR issues

SLIDE 33

Analysis of COLO-829 data

SLIDE 34

DATA

  • COLO-829: cell line derived (before treatment) from a metastasis of a malignant melanoma, 43 yr old male; 40-fold haploid coverage
  • 2 insert sizes: 456bp (36bp reads) and 201bp (75bp reads)
  • COLO-829-BL: lymphoblastoid line from the same patient
  • 196bp mean insert; 75bp reads

Pleasance et al. A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191-197 (2010)

SLIDE 35

Merged SV calls (100bp min)

SV Type   COLO-829                Mean size (bp)   COLO-829-BL          Mean size (bp)
DEL       3679                    10998            3252                 10046
INS       1480 (124 large INS)    171*             113 (8 large INS)    104*
INV       192                     117056           142                  135289

*Based on insertions with available size estimates; excludes ‘large’ insertions from SE cluster analysis
**Table excludes Copy Number Gain/Loss calls, translocations

SLIDE 36

SVs unique to COLO-829

  • Start with merged, filtered call sets (unrefined breakpoints), 100bp or larger
  • Find unique SVs in COLO-829:
  • DEL and INS: non-overlapping calls
  • INV: reciprocal overlap less than 50%
  • Check unique SVs in ‘refined’ lists (confirmed by local assembly) to get the highest confidence set

[Figure: normal and cancer call sets compared, with the calls unique to the cancer sample highlighted]
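The uniqueness rules on this slide can be sketched as below, assuming calls on one chromosome are `(start, end)` intervals with positive length. The function names are hypothetical, and the reciprocal overlap is taken as the smaller of the two overlap fractions.

```python
def overlap_len(a, b):
    """Number of bases shared by intervals a and b."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions (shared length / interval length)."""
    shared = overlap_len(a, b)
    return min(shared / (a[1] - a[0]), shared / (b[1] - b[0]))


def unique_calls(tumour, normal, sv_type):
    """Tumour calls with no matching call in the normal sample."""
    if sv_type == "INV":
        # Inversions match only if reciprocal overlap is at least 50%.
        def matches(t, n):
            return reciprocal_overlap(t, n) >= 0.5
    else:
        # DEL and INS match on any overlap at all.
        def matches(t, n):
            return overlap_len(t, n) > 0
    return [t for t in tumour if not any(matches(t, n) for n in normal)]


print(unique_calls([(100, 200), (1000, 2000)], [(150, 300)], "DEL"))   # [(1000, 2000)]
print(unique_calls([(0, 1000)], [(600, 1400)], "INV"))                 # [(0, 1000)]
```

In the second example the intervals share 400bp, which is 40% of the tumour call, so the inversion counts as unique to the tumour.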

SLIDE 37

SVs unique to COLO-829

SV type     Calls unique to COLO-829 (from comparison of ‘merged’ sets)*     Confirmed unique calls by local assembly
Deletion    1027                                                             308 (129 >=100bp)
Insertion   1301                                                             416 (111 >=100bp)
Inversion   88                                                               6

Preliminary analysis: find SVs unique to COLO-829 by comparison of ‘merged’ sets; highest confidence set is from those with breakpoints found by local assembly

*Excludes CND Gain/Loss calls; predictions >=100bp

SLIDE 38

Comparison to validated rearrangements

PCR validated SVs (Pleasance et al., 2010)

  • 33 total (excluding 3 translocations and 1 large ‘Other Intrachromosomal’)

Comparison with:

  • ‘Merged’ set: 8 are missed (due to low read pair support)
  • ‘Refined’ set: an additional 7 are missed (i.e. no breakpoints are found for a large SV)

Why?

  • Heterozygous SVs are harder to confirm with local assembly
  • Assembly parameters were optimised for shorter reads

SLIDE 39

Deletion validated by CGP; not called by BreakDancer due to low read pair support

SLIDE 40

Inversion validated by CGP; not called by BreakDancer due to low read pair support

SLIDE 41

Future Work

  • Currently testing other read-depth based copy number callers (e.g. RDXplorer, Yoon et al.; another in-house HMM CNV caller, Aylwyn Scally)
  • Package up the code, make it usable for any genome (tested on mouse, human, zebrafish so far); add new SV callers, assemblers
  • Mouse:
  • More PCR validations (inversions, complex SVs)
  • Human melanoma data:
  • Further optimisation of pipeline parameters
  • PCR validation of SVs unique to COLO-829
