Discovering Individual Human Genetic Variation with CLEVER + SMART

Discovering Individual Human Genetic Variation with CLEVER + SMART. Alexander Schönhuth, joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep. CWI Scientific Meeting, March 30, 2012.


  1. Discovering Individual Human Genetic Variation with CLEVER + SMART
     Alexander Schönhuth
     Joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep
     CWI Scientific Meeting, March 30, 2012

  2. Structural Variations
     The human reference genome:
       ...ACCGGAGTAGTATATTTCAGG...
     Assumption until 2006: only single nucleotide polymorphisms (SNPs)
       ...ACCGGAGTAGTATATTTCAGG...
       ...ACTGGAGTACTATATATCAGG...
     Since 2006: also insertions and deletions (indels), inversions, translocations, ...
       ...ACCGGAGTAGTATATTT---CAGG...
       ...AC----GTAGATATTTTTTTCAGG...
     Structural Variation Discovery

  3. Next-Generation Sequenced Genomes
     [Figure: MRC National Center for Medical Research, London]

  4. Insert Size Distribution
     Paired-End Reads
       Read ends of known length
       Insert of unknown length
     [Figure: paired-end read, two ends separated by the insert]
     [Figure: insert size distribution, fragments from Yoruban individual]

  5. Discovering Insertions and Deletions
     Current Challenges
       Small/mid-size deletions
       Repetitive regions
       Multiply mapped reads
     Possible Approaches
       Coverage based
       Insert size based
       Split read based
     [Figure: IGV screenshot of a deletion; red reads have insert ≥ µ + 2.5σ]

  6. Insertions and Deletions: Alignments
     Alignment A of a paired-end read on the reference genome, with positions x_A and y_A:
       I(A) = y_A - x_A - 1
       Deletion: I(A) too large
     Alignment B of a paired-end read on the reference genome, with positions x_B and y_B:
       I(B) = y_B - x_B - 1
       Insertion: I(B) too small
     Indels: alignment length deviates from insert size distribution
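
The length formula on this slide can be written out as a small Python sketch. The function names, the example values of µ and σ, and the symmetric use of a 2.5σ cutoff are assumptions made for illustration (the 2.5σ figure is borrowed from the IGV coloring on slide 5, where it is only stated for large inserts). It only shows in which direction a single alignment deviates; it is not CLEVER's clique-based test.

```python
def internal_segment_length(x: int, y: int) -> int:
    """I(A) = y_A - x_A - 1, as defined on the slide."""
    return y - x - 1


def deviation_direction(x: int, y: int, mu: float, sigma: float, k: float = 2.5) -> str:
    """Label one alignment by how its length deviates from the insert size mean.

    mu, sigma: mean and standard deviation of the insert size distribution.
    k: cutoff in standard deviations (assumed symmetric; hypothetical default).
    """
    deviation = internal_segment_length(x, y) - mu
    if deviation >= k * sigma:
        return "deletion"    # I(A) too large
    if deviation <= -k * sigma:
        return "insertion"   # I(A) too small
    return "concordant"


# Hypothetical insert size distribution: mu = 100, sigma = 15.
print(deviation_direction(1000, 1301, mu=100, sigma=15))
```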

  7. The Read Alignment Graph
     [Figure: alignments A1 to A7 on the reference genome and the corresponding read alignment graph with nodes 1 to 7]
     C1 = (A1, A2, A3)
     C2 = (A5, A6, A7)

  8. The Read Alignment Graph
     [Figure: alignments A1 to A7 on the reference genome and the corresponding read alignment graph with nodes 1 to 7]
     C1 = (A1, A2, A3)
     C2 = (A5, A6, A7)
     Idea: Find all maximal cliques.

  9. Incompatible Alignments (NO edge):
       (1) Too large length difference
           [Figure labels: I(A) ≈ μ, I(B) > μ]
       (2) Long internal segments, but small overlap
           [Figure labels: I(A) > μ, I(B) > μ, I(A,B) > μ, O(A,B)]
     Compatible Alignments (edge):
       (3) Average internal segment lengths, small overlap
           [Figure labels: I(A) ≈ μ, I(B) ≈ μ, I(A,B) ≈ μ, O(A,B)]
       (4) Long internal segments, sufficient overlap
           [Figure labels: I(A) > μ, I(B) > μ, I(A,B) > μ, O(A,B)]

  10. Significantly incompatible?
      Notations
        Difference of internal segment length: $\Delta_{12}$
        Overlap of internal segments: $\cap_{12}$
        Mean internal segment length: $\bar{I}_{12}$
        Length compatibility: $U_{12} := \bar{I}_{12} - \cap_{12}$
      Statistical tests
        1. Mean compatibility: $P\left(X \ge \frac{\Delta_{12}}{\sqrt{2}\,\sigma}\right) \le \alpha = 0.1$
        2. Intersection compatibility: $P\left(X \ge \frac{\sqrt{2}\,(U_{12} - \mu)}{\sigma}\right) \le \alpha = 0.1$
      $X$ is an $N(0,1)$-distributed random variable.
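
A minimal sketch of the two tests, assuming that an edge is drawn only when neither tail probability falls to α or below (this reading follows the slide title, it is not stated explicitly). Alignments are represented as hypothetical `(x, y)` pairs, the overlap is counted in shared internal positions, and the values of µ and σ are made up.

```python
import math


def upper_tail(z: float) -> float:
    """P(X >= z) for a standard normal X."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def internal_length(a):
    x, y = a
    return y - x - 1


def overlap(a, b):
    """Number of positions shared by the two internal segments (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) - 1)


def compatible(a, b, mu, sigma, alpha=0.1):
    """True if neither test flags the pair as significantly incompatible."""
    ia, ib = internal_length(a), internal_length(b)
    delta = abs(ia - ib)                      # difference of lengths
    u = (ia + ib) / 2.0 - overlap(a, b)       # U_12 = mean length - overlap
    p_length = upper_tail(delta / (math.sqrt(2.0) * sigma))
    p_overlap = upper_tail(math.sqrt(2.0) * (u - mu) / sigma)
    return p_length > alpha and p_overlap > alpha


mu, sigma = 100.0, 15.0                                      # hypothetical
print(compatible((1000, 1301), (1020, 1321), mu, sigma))     # long segments, large overlap
print(compatible((1000, 1301), (1250, 1551), mu, sigma))     # long segments, small overlap
```

The first pair corresponds to case (4) on the previous slide and the second to case (2).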

  11. CLEVER: Workflow
        1. Compute all maximal cliques.
        2. Evaluate all maximal cliques statistically.
        3. Output: maximal cliques which statistically significantly indicate insertions or deletions.

  13. Short Read Alignments
      [Figure: Integrative Genomics Viewer (IGV) screenshot]

  14. CLEVER: Clique-Enumerating Variant Finder
      Outline
        1. Iterate over all alignments, sorted by position
        2. Maintain a set of active cliques (and active alignments)
        3. Output a clique once it "goes out of scope" (+ free memory)
      For each alignment
        1. Find set of adjacent nodes
        2. Intersect with all active cliques and either
             add to existing clique,
             split clique, or
             create new clique
        3. Eliminate duplicate and non-maximal cliques
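
The per-alignment steps above can be sketched with plain Python sets. This is a simplified illustration under stated assumptions, not the CLEVER implementation: the function names and the toy graph are hypothetical, and the "out of scope" output and memory freeing are left out, so every clique stays active until the end.

```python
def add_alignment(cliques, v, neighbors):
    """Update the maximal cliques after adding node v.

    cliques: list of frozensets, the maximal cliques of the nodes seen so far.
    neighbors: already-seen nodes adjacent to v.
    """
    neighbors = frozenset(neighbors)
    # Intersect v's neighborhood with every active clique. A full intersection
    # extends the clique, a partial one splits it, an empty one yields {v}.
    candidates = {(c & neighbors) | {v} for c in cliques}
    candidates.add(frozenset({v}))
    # Old cliques lying entirely inside the neighborhood are replaced by
    # their extension; all others remain maximal.
    kept = [c for c in cliques if not c <= neighbors]
    # Duplicates vanish in the set; drop candidates inside a larger candidate.
    new = [c for c in candidates if not any(c < d for d in candidates)]
    return kept + new


def maximal_cliques(nodes, edges):
    """Process nodes in the given order and return all maximal cliques."""
    adjacent = {v: set() for v in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    cliques, seen = [], set()
    for v in nodes:
        cliques = add_alignment(cliques, v, adjacent[v] & seen)
        seen.add(v)
    return cliques


# Hypothetical read alignment graph with seven alignments.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (5, 6), (5, 7), (6, 7)]
for clique in maximal_cliques(range(1, 8), edges):
    print(sorted(clique))
```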

  15. Fast Implementation Techniques
        Active alignments: binary search tree (sorted by insert length)
        Cliques: store as bit-vectors over active alignments
        Clique intersection bit-parallel
        Reorganize storage now and then
      Runtime
        30× coverage, all reads, up to ≈ 650 alignments per read:
        around 20 minutes for whole chromosome 1
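
To illustrate the bit-vector idea, the sketch below stores a clique as a Python integer whose bit i is set when the active alignment in slot i belongs to the clique, so that intersecting a clique with a neighborhood is a single `&`. The slot numbering and helper names are assumptions for this example; the actual data layout in CLEVER is not described on the slide.

```python
def to_mask(slots):
    """Encode a collection of slot indices as a bit-vector."""
    mask = 0
    for s in slots:
        mask |= 1 << s
    return mask


def to_slots(mask):
    """Decode a bit-vector back into a sorted list of slot indices."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def is_subset(a, b):
    """True if every alignment in bit-vector a is also in b."""
    return (a & b) == a


clique = to_mask([0, 1, 2])        # clique over active slots 0, 1, 2
neighbors = to_mask([1, 2, 4])     # neighborhood of a new alignment
common = clique & neighbors        # bit-parallel intersection
print(to_slots(common))
```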

  16. CLEVER: Workflow
        1. Compute all maximal cliques.
        2. Evaluate all maximal cliques statistically.
        3. Output: maximal cliques which statistically significantly indicate insertions or deletions.

  17. CLEVER: Clique-Enumerating Variant Finder
      Each max-clique $\mathcal{C}$ (accounting for multiply mapped reads):
        $P(H_0 \mid \mathcal{C}) = \sum_{C \subset \mathcal{C}} P(C \text{ correct and } \mathcal{C} \setminus C \text{ incorrect}) \cdot P(H_0 \mid C \text{ correct})$
      where $H_0$ is the null hypothesis of no variation, and
        $P(H_0 \mid C \text{ correct}) = P\left(X_{N(0,1)} \le \frac{\sqrt{|C|} \cdot |\bar{I} - \mu|}{\sigma}\right)$
      reflects a Z-test for a sample of size $|C|$.
      After correction for multiple hypothesis testing: predict indels from all significant cliques $\mathcal{C}$.
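
The Z-test term can be sketched as follows for a set of alignments that are all taken to be correct. Two assumptions should be noted: the function name and example numbers are hypothetical, and the sketch returns the upper tail P(X ≥ z) as the p-value, which is the usual one-sided convention, whereas the slide prints the inequality as ≤. The sum over subsets for multiply mapped reads and the multiple-testing correction are not covered.

```python
import math
from statistics import fmean


def clique_p_value(lengths, mu, sigma):
    """One-sided Z-test p-value for the mean internal segment length of a clique.

    lengths: internal segment lengths I(A) of the alignments in the clique.
    mu, sigma: mean and standard deviation of the insert size distribution.
    """
    n = len(lengths)
    z = math.sqrt(n) * abs(fmean(lengths) - mu) / sigma
    # Upper tail of the standard normal distribution (assumed convention).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


mu, sigma = 100.0, 15.0                       # hypothetical
print(clique_p_value([130], mu, sigma))       # one alignment, weak evidence
print(clique_p_value([130] * 4, mu, sigma))   # four alignments, same deviation
```

With the same mean deviation, a larger clique gives a smaller p-value, which is why grouping alignments into cliques makes small and mid-size indels detectable.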

  18. Prior Approaches
      Issues
        Discard massive amounts of read alignments (all alignments from concordant reads).
        Statistically less principled definition of variant-related alignment groups
        No correction for multiple hypothesis testing

  19. Evaluation
      Benchmarks
        1. Simulated data: Venter's Genome
        2. Real data: Yoruban individual (NA18507)

  20. Results (Hit Statistics)
      Simulation Study: Deletions in Venter's Genome

      | Algorithm       | 10–49 (Prec. / Rec.) | 50–99 (Prec. / Rec.) | 100–499 (Prec. / Rec.) |
      |-----------------|----------------------|----------------------|------------------------|
      | PINDEL          | 42.0 / 42.0          | 52.0 / 35.3          | 85.8 / 40.5            |
      | CLEVER          | 51.9 / 22.7          | 51.1 / 76.5          | 82.5 / 72.4            |
      | BreakDancer     | 15.1 / 0.3           | 43.5 / 20.1          | 48.6 / 56.5            |
      | GASV            | 1.1 / 10.5           | 29.6 / 26.1          | 0.8 / 53.6             |
      | HYDRA           | – / 0.0              | – / 0.0              | 85.7 / 61.3            |
      | VariationHunter | 15.2 / 0.8           | 29.3 / 20.5          | 49.2 / 59.4            |
      | MoDIL (1)       | 18.6 / 16.0          | 22.3 / 68.5          | 41.7 / 41.7            |

      (1) MoDIL: Run only on Chromosome 1.
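
For readers unfamiliar with the two column entries, the sketch below computes precision and recall from hit counts, expressed as percentages (the table values appear to be percentages, which is an assumption). The counts and the function name are hypothetical, and the criterion for when a predicted deletion counts as a hit is not given on the slide.

```python
def precision_recall(true_hits: int, false_calls: int, missed: int):
    """Return (precision, recall) in percent; None where undefined.

    true_hits: predictions that match a true deletion.
    false_calls: predictions that match no true deletion.
    missed: true deletions that no prediction matches.
    """
    calls = true_hits + false_calls
    truths = true_hits + missed
    precision = 100.0 * true_hits / calls if calls else None   # undefined without calls
    recall = 100.0 * true_hits / truths if truths else None
    return precision, recall


print(precision_recall(true_hits=50, false_calls=50, missed=150))
```

A tool that makes no calls in a length range has an undefined precision and zero recall, which would match a "– / 0.0" entry.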

  21. Results (Hit Statistics)
      Real Data: Individual NA18507

      | Algorithm       | 10–49 (Rec. / Excl.) | 50–99 (Rec. / Excl.) | 100–499 (Rec. / Excl.) |
      |-----------------|----------------------|----------------------|------------------------|
      | PINDEL          | 45.5 / 38.5          | 33.7 / 1.0           | 52.7 / 0.0             |
      | CLEVER          | 9.4 / 2.1            | 73.2 / 26.3          | 78.1 / 5.2             |
      | BreakDancer     | 0.2 / 0.0            | 22.1 / 0.5           | 63.7 / 0.4             |
      | GASV            | 5.0 / 1.8            | 27.4 / 0.5           | 61.7 / 2.3             |
      | HYDRA           | 0.0 / 0.0            | 0.0 / 0.0            | 67.2 / 0.4             |
      | VariationHunter | 0.3 / 0.0            | 18.9 / 0.0           | 69.8 / 2.3             |

      Used annotations: Mills et al., Genome Research, 2011.

  22. Room for improvements
      Issues
        1. Integrating split-read information
        2. Multiply mapped reads
        3. Discovering other types of variants
        4. Overlapping cliques → Cluster Editing

  24. Split-Read Information
      [Figure: Integrative Genomics Viewer (IGV) screenshot]

  25. Room for improvements
      Issues
        1. Integrating split-read information
        2. Multiply mapped reads
        3. Discovering other types of variants
        4. Overlapping cliques → Cluster Editing

  26. Read Mapping Ambiguities
      [Figure: read alignments; red(dish): misplaced alignments]
      Goal: Determine correct alignment for multiply mapped reads.

  27. Room for improvements
      Issues
        1. Integrating split-read information
        2. Multiply mapped reads
        3. Discovering other types of variants
        4. Overlapping cliques → Cluster Editing

  28. Other Variations
      [Figure: Feuk et al., 2006]

  29. Room for improvements
      Issues
        1. Integrating split-read information
        2. Multiply mapped reads
        3. Discovering other types of variants
        4. Overlapping cliques → Cluster Editing

  30. Software
      Clique Enumeration + Significance Testing
        CLEVER - CLique Enumerating Variant findER
        Availability: http://clever-sv.googlecode.com
      [Figure: Fred Clever and Jeff Smart]
      Resolving ambiguities with EM algorithm
        SMART - Sparse Mixture Ambiguity Resolving Tool
        Availability: Coming Soon

  31. Conclusions
      Summary
        Statistically sound criterion for edges
        Enumerating all maximal cliques is feasible
        Significance test for cliques
        Results are good (even without SMART)
      Future work
        Finish and benchmark SMART
        Try other read mappers / re-map reads
        Integrate split-read information into CLEVER
