Discovering Individual Human Genetic Variation with CLEVER + SMART

Alexander Schönhuth, joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep. CWI Scientific Meeting, March 30, 2012.


slide-1
SLIDE 1

Discovering Individual Human Genetic Variation with CLEVER + SMART

Alexander Schönhuth

joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep

CWI Scientific Meeting March 30, 2012

slide-2
SLIDE 2

Structural Variations

The human reference genome:
...ACCGGAGTAGTATATTTCAGG...

Assumption until 2006: only single nucleotide polymorphisms (SNPs)
...ACCGGAGTAGTATATTTCAGG...
...ACTGGAGTACTATATATCAGG...

Since 2006: also insertions and deletions (indels), inversions, translocations ...
...ACCGGAGTAGTATATTT---CAGG...
...AC----GTAGATATTTTTTTCAGG...

Structural Variation Discovery 2

slide-3
SLIDE 3

Next-Generation Sequenced Genomes

Figure: MRC National Center for Medical Research, London

slide-4
SLIDE 4

Insert Size Distribution

Figure: a paired-end read, consisting of two read ends separated by an insert.

Paired-End Reads

Read ends of known length
Insert of unknown length

Figure: insert size distribution, fragments from a Yoruban individual.

slide-5
SLIDE 5

Discovering Insertions and Deletions

Current Challenges

Small/mid-size deletions
Repetitive regions
Multiply mapped reads

Possible Approaches

Coverage based
Insert size based
Split read based

IGV screenshot: deletion. Red reads: insert ≥ µ + 2.5σ

slide-6
SLIDE 6

Insertions and Deletions: Alignments

Alignment A of a paired-end read against the reference genome, with coordinates x_A and y_A: I(A) = y_A - x_A - 1

Alignment B of a paired-end read against the reference genome, with coordinates x_B and y_B: I(B) = y_B - x_B - 1

Insertion: I(B) too small
Deletion: I(A) too large
Indels: alignment length deviates from insert size distribution
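The length computation and the insertion/deletion reading above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the threshold factor k = 2.5 is borrowed from the IGV colouring rule on the previous slide, and applying the same factor symmetrically on the insertion side is an assumption.

```python
def internal_segment_length(x, y):
    """I(A) = y_A - x_A - 1 for an alignment with coordinates x < y."""
    return y - x - 1


def classify(length, mu, sigma, k=2.5):
    """Compare one alignment length with the insert size distribution.

    mu and sigma are the mean and standard deviation of the insert size
    distribution; k is an assumed threshold factor.
    """
    if length >= mu + k * sigma:
        return "deletion"      # alignment too long
    if length <= mu - k * sigma:
        return "insertion"     # alignment too short
    return "concordant"


if __name__ == "__main__":
    length = internal_segment_length(100, 313)
    print(length, classify(length, mu=112, sigma=15))
```

A single alignment classified this way is weak evidence on its own, which is why the following slides group alignments before testing.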

slide-7
SLIDE 7

The Read Alignment Graph

Figure: alignments A1 to A7 placed along the reference genome, and the corresponding read alignment graph on nodes 1 to 7, with cliques C1 = (A1, A2, A3) and C2 = (A5, A6, A7).

slide-8
SLIDE 8

The Read Alignment Graph

Idea: Find all maximal cliques.

slide-9
SLIDE 9

Figure: four scenarios for a pair of alignments A and B, with internal segment lengths I(A) and I(B), mean internal segment length I(A,B), and overlap O(A,B).

Incompatible alignments (no edge):

(1) Too large length difference: I(A) ≈ μ, I(B) > μ
(2) Long internal segments, but small overlap: I(A) > μ, I(B) > μ, I(A,B) > μ

Compatible alignments (edge):

(3) Average internal segment lengths, small overlap: I(A) ≈ μ, I(B) ≈ μ, I(A,B) ≈ μ
(4) Long internal segments, sufficient overlap: I(A) > μ, I(B) > μ, I(A,B) > μ

slide-10
SLIDE 10

Significantly incompatible?

Notations

Difference of internal segment lengths: $\Delta_{12}$
Overlap of internal segments: $\cap_{12}$
Mean internal segment length: $\bar{I}_{12}$
Length compatibility: $U_{12} := \bar{I}_{12} - \cap_{12}$

Statistical tests

1 Mean compatibility: $P\left(X \ge \frac{\Delta_{12}}{\sqrt{2}\,\sigma}\right) \le \alpha = 0.1$
2 Intersection compatibility: $P\left(X \ge \frac{\sqrt{2}\,(U_{12} - \mu)}{\sigma}\right) \le \alpha = 0.1$

X is an N(0,1)-distributed random variable.
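The two tests can be sketched as follows. This is a hedged illustration, not the CLEVER implementation: the function names are hypothetical, and it assumes that two alignments are joined by an edge exactly when neither test is significant, that is, when both tail probabilities exceed α.

```python
import math


def normal_sf(z):
    """Upper tail P(X >= z) of a standard normal variable X."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def has_edge(i1, i2, overlap, mu, sigma, alpha=0.1):
    """Return True if two alignments are not significantly incompatible.

    i1, i2: internal segment lengths; overlap: overlap of the two
    internal segments; mu, sigma: insert size mean and standard deviation.
    """
    delta = abs(i1 - i2)                    # difference of lengths
    u = (i1 + i2) / 2 - overlap             # U12 = mean length - overlap
    p_mean = normal_sf(delta / (math.sqrt(2) * sigma))
    p_overlap = normal_sf(math.sqrt(2) * (u - mu) / sigma)
    # Assumption: an edge exists only if neither test rejects at level alpha.
    return p_mean > alpha and p_overlap > alpha


if __name__ == "__main__":
    mu, sigma = 112, 15
    print(has_edge(300, 310, 290, mu, sigma))  # long segments, large overlap
    print(has_edge(300, 400, 290, mu, sigma))  # lengths differ strongly
    print(has_edge(300, 300, 50, mu, sigma))   # long segments, small overlap
    print(has_edge(110, 115, 10, mu, sigma))   # average lengths, small overlap
```

The standard normal tail is computed with `math.erfc`, so the sketch needs nothing beyond the Python standard library.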

slide-11
SLIDE 11

CLEVER: Workflow

1 Compute all maximal cliques.
2 Evaluate all maximal cliques statistically.
3 Output: maximal cliques which statistically significantly indicate insertions or deletions.

slide-13
SLIDE 13

Short Read Alignments

Integrative Genomics Viewer (IGV) Screenshot

slide-14
SLIDE 14

CLEVER: Clique-Enumerating Variant Finder

Outline

1 Iterate over all alignments, sorted by position
2 Maintain a set of active cliques (and active alignments)
3 Output a clique once it “goes out of scope” (+ free memory)

For each alignment

1 Find set of adjacent nodes
2 Intersect with all active cliques and either
    Add to existing clique
    Split clique
    Create new clique
3 Eliminate duplicate and non-maximal cliques
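The per-alignment update above can be sketched as an incremental maximal-clique routine. This is a simplified illustration under stated assumptions: nodes are inserted in a fixed order, the names `add_node` and `maximal_cliques` are hypothetical, and the bookkeeping for active alignments and for cliques that go out of scope is left out.

```python
def add_node(cliques, v, neighbours):
    """Update a set of maximal cliques (frozensets) after inserting node v.

    neighbours: frozenset of already inserted nodes adjacent to v.
    """
    kept = set()        # old cliques that remain maximal
    candidates = set()  # new cliques containing v
    for c in cliques:
        common = c & neighbours
        if common == c:
            candidates.add(c | {v})        # add v to the existing clique
        else:
            kept.add(c)                    # old clique stays as it is
            candidates.add(common | {v})   # split: shared part plus v
    if not cliques:
        candidates.add(frozenset({v}))     # very first node: new clique
    # Eliminate duplicates (via the set) and non-maximal candidates.
    maximal = {c for c in candidates if not any(c < d for d in candidates)}
    return kept | maximal


def maximal_cliques(nodes, edges):
    """Enumerate all maximal cliques by inserting the nodes one by one."""
    adjacent = {v: set() for v in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    cliques, seen = set(), set()
    for v in nodes:
        cliques = add_node(cliques, v, frozenset(adjacent[v] & seen))
        seen.add(v)
    return cliques


if __name__ == "__main__":
    # Hypothetical toy graph, not taken from the slides.
    edges = [(1, 2), (1, 3), (2, 3), (3, 4), (5, 6), (5, 7), (6, 7)]
    for clique in sorted(maximal_cliques(range(1, 8), edges), key=min):
        print(sorted(clique))
```

When a node has no neighbours among the inserted nodes, every intersection is empty and the split case yields the singleton clique, which matches the "create new clique" branch.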

slide-15
SLIDE 15

Fast Implementation

Techniques

Active alignments: binary search tree (sorted by insert length)
Cliques: stored as bit-vectors over active alignments
Clique intersection: bit-parallel
Storage reorganized now and then
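The bit-vector idea can be illustrated with plain Python integers, where bit i stands for the i-th active alignment. This is a sketch only: the helper names and the four example alignments are hypothetical, and a production implementation would use fixed-width machine words rather than arbitrary-precision integers.

```python
def to_bits(members, index):
    """Encode a set of alignments as an integer bit-vector."""
    bits = 0
    for m in members:
        bits |= 1 << index[m]
    return bits


def from_bits(bits, active):
    """Decode a bit-vector back into the list of alignments it contains."""
    return [a for i, a in enumerate(active) if (bits >> i) & 1]


active = ["A1", "A2", "A3", "A4"]              # assumed active alignments
index = {a: i for i, a in enumerate(active)}

clique = to_bits(["A1", "A2", "A3"], index)
neighbours = to_bits(["A2", "A3", "A4"], index)

common = clique & neighbours                    # one AND intersects all bits
fully_adjacent = common == clique               # subset test on bit-vectors

if __name__ == "__main__":
    print(from_bits(common, active), fully_adjacent)
```

A single bitwise AND handles a whole word of alignments at once, which is what makes the intersection step bit-parallel.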

Runtime

30× coverage, all reads, up to ≈ 650 alignments per read: around 20 minutes for the whole of chromosome 1

slide-17
SLIDE 17

CLEVER: Clique-Enumerating Variant Finder

For each max-clique C (accounting for multiply mapped reads):

$$P(H_0 \mid C) = \sum_{C' \subset C} P(C' \text{ correct and } C \setminus C' \text{ incorrect}) \cdot P(H_0 \mid C' \text{ correct})$$

where $H_0$ is the null hypothesis of no variation, and

$$P(H_0 \mid C' \text{ correct}) = P\left(X_{N(0,1)} \le \sqrt{|C'|} \cdot \frac{|\bar{I} - \mu|}{\sigma}\right)$$

reflects a Z-test for a sample of size $|C'|$.

After correction for multiple hypothesis testing: predict indels from all significant cliques C.
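The Z-test ingredient of the formula can be sketched as below. Several points are assumptions rather than statements from the slide: the function names are hypothetical, the sum over subsets for multiply mapped reads and the multiple-testing correction are omitted, and the sketch reports the upper tail P(X ≥ z), the usual one-sided p-value for a non-negative Z statistic, because the tail direction in the extracted slide text is not fully clear.

```python
import math


def z_statistic(lengths, mu, sigma):
    """sqrt(|C|) * |mean(I) - mu| / sigma for the alignments of one clique."""
    n = len(lengths)
    mean_length = sum(lengths) / n
    return math.sqrt(n) * abs(mean_length - mu) / sigma


def upper_tail(z):
    """P(X >= z) for a standard normal variable X (assumed p-value)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


if __name__ == "__main__":
    # Four alignments whose internal segments are 30 bp longer than expected.
    z = z_statistic([142, 142, 142, 142], mu=112, sigma=15)
    print(z, upper_tail(z))
```

The factor sqrt(|C|) is what lets a clique of several mildly deviating alignments reach significance where a single alignment would not.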

slide-18
SLIDE 18

Prior Approaches

Issues

Discard massive amounts of read alignments (all alignments from concordant reads)
Statistically less principled definition of variant-related alignment groups
No correction for multiple hypothesis testing

slide-19
SLIDE 19

Evaluation

Benchmarks

1 Simulated data: Venter’s Genome
2 Real data: Yoruban individual (NA18507)

slide-20
SLIDE 20

Results (Hit Statistics)

Simulation Study: Deletions in Venter’s Genome

Length range       10–49           50–99           100–499
Algorithm          Prec. / Rec.    Prec. / Rec.    Prec. / Rec.
PINDEL             42.0 / 42.0     52.0 / 35.3     85.8 / 40.5
CLEVER             51.9 / 22.7     51.1 / 76.5     82.5 / 72.4
BreakDancer        15.1 /  0.3     43.5 / 20.1     48.6 / 56.5
GASV                1.1 / 10.5     29.6 / 26.1      0.8 / 53.6
HYDRA                 – /  0.0        – /  0.0     85.7 / 61.3
VariationHunter    15.2 /  0.8     29.3 / 20.5     49.2 / 59.4
MoDIL (1)          18.6 / 16.0     22.3 / 68.5     41.7 / 41.7

(1) MoDIL: run only on chromosome 1.

slide-21
SLIDE 21

Results (Hit Statistics)

Real Data: Individual NA18507

Length range       10–49           50–99           100–499
Algorithm          Rec. / Excl.    Rec. / Excl.    Rec. / Excl.
PINDEL             45.5 / 38.5     33.7 /  1.0     52.7 /  0.0
CLEVER              9.4 /  2.1     73.2 / 26.3     78.1 /  5.2
BreakDancer         0.2 /  0.0     22.1 /  0.5     63.7 /  0.4
GASV                5.0 /  1.8     27.4 /  0.5     61.7 /  2.3
HYDRA               0.0 /  0.0      0.0 /  0.0     67.2 /  0.4
VariationHunter     0.3 /  0.0     18.9 /  0.0     69.8 /  2.3

Used Annotations

Mills et al., Genome Research, 2011.

slide-22
SLIDE 22

Room for improvements

Issues

1 Integrating split-read information
2 Multiply mapped reads
3 Discovering other types of variants
4 Overlapping cliques → Cluster Editing

slide-24
SLIDE 24

Split-Read Information

Integrative Genomics Viewer (IGV) Screenshot

slide-26
SLIDE 26

Read Mapping Ambiguities

Red(dish): Misplaced Alignments

Goal: Determine correct alignment for multiply mapped reads.

slide-28
SLIDE 28

Other Variations

Figure: Feuk et al., 2006

slide-30
SLIDE 30

Software

Clique Enumeration + Significance Testing

CLEVER - CLique Enumerating Variant findER
Availability: http://clever-sv.googlecode.com

Fred Clever and Jeff Smart

Resolving ambiguities with EM algorithm

SMART - Sparse Mixture Ambiguity Resolving Tool
Availability: Coming Soon

slide-31
SLIDE 31

Conclusions

Summary

Statistically sound criterion for edges
Enumerating all maximal cliques is feasible
Significance test for cliques
Results are good (even without SMART)

Future work

Finish and benchmark SMART
Try other read mappers / re-map reads
Integrate split-read information into CLEVER
