Discovering Individual Human Genetic Variation with CLEVER + SMART

Alexander Schönhuth
Joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep
CWI Scientific Meeting, March 30, 2012

Structural Variations
The human reference genome:
...ACCGGAGTAGTATATTTCAGG...
Assumption until 2006: only single nucleotide polymorphisms (SNPs)
...ACCGGAGTAGTATATTTCAGG...
...ACTGGAGTACTATATATCAGG...
Since 2006: also insertions and deletions (indels), inversions, translocations, ...
...ACCGGAGTAGTATATTT---CAGG...
...AC----GTAGATATTTTTTTCAGG...
Structural Variation Discovery

Next-Generation Sequenced Genomes
[Figure: MRC National Center for Medical Research, London]

Insert Size Distribution
Paired-end reads: read ends of known length, insert of unknown length
[Figure: a paired-end read, made up of two ends with the insert between them]
[Figure: insert size distribution, fragments from a Yoruban individual]

Discovering Insertions and Deletions
Current challenges:
- Small/mid-size deletions
- Repetitive regions
- Multiply mapped reads
Possible approaches:
- Coverage based
- Insert size based
- Split read based
[Figure: IGV screenshot of a deletion; red reads have insert ≥ µ + 2.5σ]

Insertions and Deletions: Alignments
Alignment A of a paired-end read, at reference genome coordinates x_A and y_A: I(A) = y_A - x_A - 1
Alignment B of a paired-end read, at reference genome coordinates x_B and y_B: I(B) = y_B - x_B - 1
Deletion: I(A) too large
Insertion: I(B) too small
Indels: alignment length deviates from the insert size distribution
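
The rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the symmetric lower cutoff for insertions, and the example values for µ and σ are assumptions. Only the formula I = y - x - 1 and the µ + 2.5σ cutoff (shown for the red reads in the IGV screenshot) come from the slides.

```python
def internal_segment_length(x: int, y: int) -> int:
    """I(A) = y_A - x_A - 1, as defined on the slide."""
    return y - x - 1


def classify(x: int, y: int, mu: float, sigma: float, k: float = 2.5) -> str:
    """Label one alignment by comparing I(A) with the insert size distribution.

    k = 2.5 mirrors the mu + 2.5 * sigma cutoff from the IGV slide; using the
    same k on the lower side is an assumption made for this sketch.
    """
    length = internal_segment_length(x, y)
    if length >= mu + k * sigma:
        return "deletion"    # internal segment too large
    if length <= mu - k * sigma:
        return "insertion"   # internal segment too small
    return "concordant"


# Hypothetical insert size distribution: mu = 200, sigma = 20.
print(classify(1000, 1401, mu=200.0, sigma=20.0))
```

A per-read cutoff like this is what the talk later criticizes in prior approaches, since it discards all concordant reads; CLEVER instead tests groups of alignments.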

The Read Alignment Graph
[Figure: alignments A1 to A7 along the reference genome and the corresponding read alignment graph, with C1 = (A1, A2, A3) and C2 = (A5, A6, A7) marked]
Idea: find all maximal cliques.
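
To make "find all maximal cliques" concrete, here is a small Python sketch using the textbook Bron-Kerbosch algorithm with pivoting. This is a generic stand-in, not CLEVER's own enumeration, which processes alignments in positional order. The toy edge set is also an assumption, since the figure's actual edges cannot be recovered from the slide text.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours.
    Generic Bron-Kerbosch with pivoting (not CLEVER's algorithm).
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)  # r cannot be extended: it is maximal
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Toy read alignment graph (edges are hypothetical).
adj = {
    "A1": {"A2", "A3"}, "A2": {"A1", "A3"}, "A3": {"A1", "A2"},
    "A4": set(),
    "A5": {"A6", "A7"}, "A6": {"A5", "A7"}, "A7": {"A5", "A6"},
}
cliques = set(maximal_cliques(adj))
for clique in sorted(cliques, key=sorted):
    print(sorted(clique))
```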

Incompatible alignments (NO edge):
(1) Too large length difference: I(A)≈μ, I(B)>μ
(2) Long internal segments, but small overlap O(A,B): I(A)>μ, I(B)>μ, I(A,B)>μ
Compatible alignments (edge):
(3) Average internal segment lengths, small overlap O(A,B): I(A)≈μ, I(B)≈μ, I(A,B)≈μ
(4) Long internal segments, sufficient overlap O(A,B): I(A)>μ, I(B)>μ, I(A,B)>μ

Significantly incompatible?
Notations:
- Difference of internal segment lengths: $\Delta_{12}$
- Overlap of internal segments: $\cap_{12}$
- Mean internal segment length: $\bar{I}_{12}$
- Length compatibility: $U_{12} := \bar{I}_{12} - \cap_{12}$
Statistical tests:
1. Mean compatibility: $P\left(X \ge \frac{\Delta_{12}}{\sqrt{2}\,\sigma}\right) \le \alpha = 0.1$
2. Intersection compatibility: $P\left(X \ge \frac{\sqrt{2}\,(U_{12} - \mu)}{\sigma}\right) \le \alpha = 0.1$
$X$ is an $N(0,1)$-distributed random variable.
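
The two tests can be tried out with the Python sketch below. It assumes that two alignments are treated as incompatible (no edge) as soon as either probability drops to α or below, which is how the slide title reads. The function names and the numeric inputs are hypothetical.

```python
import math


def upper_tail(z: float) -> float:
    """P(X >= z) for a standard normal X."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def compatible(i1: float, i2: float, overlap: float,
               mu: float, sigma: float, alpha: float = 0.1) -> bool:
    """Return True if neither test flags the pair as significantly incompatible."""
    delta = abs(i1 - i2)           # difference of internal segment lengths
    mean_len = (i1 + i2) / 2.0     # mean internal segment length
    u = mean_len - overlap         # U_12 = mean length minus overlap
    p_mean = upper_tail(delta / (math.sqrt(2.0) * sigma))
    p_intersection = upper_tail(math.sqrt(2.0) * (u - mu) / sigma)
    # Assumption: an edge is drawn only if both probabilities exceed alpha.
    return p_mean > alpha and p_intersection > alpha


# Hypothetical insert size distribution: mu = 200, sigma = 20.
print(compatible(400, 410, overlap=350, mu=200, sigma=20))  # similar lengths, large overlap
print(compatible(200, 400, overlap=200, mu=200, sigma=20))  # very different lengths
print(compatible(400, 400, overlap=50, mu=200, sigma=20))   # long segments, small overlap
```

The factors of √2 follow from the variance of a difference (2σ²) and of a mean (σ²/2) of two independent lengths drawn from the insert size distribution.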

CLEVER: Workflow
1. Compute all maximal cliques.
2. Evaluate all maximal cliques statistically.
3. Output: maximal cliques which statistically significantly indicate insertions or deletions.

Short Read Alignments
[Figure: Integrative Genomics Viewer (IGV) screenshot]

CLEVER: Clique-Enumerating Variant Finder
Outline:
1. Iterate over all alignments, sorted by position
2. Maintain a set of active cliques (and active alignments)
3. Output a clique once it “goes out of scope” (+ free memory)
For each alignment:
1. Find set of adjacent nodes
2. Intersect with all active cliques and either
   - add to existing clique,
   - split clique, or
   - create new clique
3. Eliminate duplicate and non-maximal cliques

Fast Implementation
Techniques:
- Active alignments: binary search tree (sorted by insert length)
- Cliques: stored as bit-vectors over active alignments
- Clique intersection is bit-parallel
- Reorganize storage now and then
Runtime:
- 30× coverage, all reads, up to ≈ 650 alignments per read: around 20 minutes for the whole of chromosome 1
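
As a rough illustration of the bit-vector idea, the Python sketch below stores a clique as an integer whose bit i is set when active alignment i belongs to the clique, so intersecting a clique with a neighbour set is a single AND. The helper names and indices are hypothetical, and the actual CLEVER data layout is not described on the slide.

```python
def to_bits(indices) -> int:
    """Encode a set of active-alignment indices as a bit-vector."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits


def members(bits: int) -> list:
    """Decode a bit-vector back into a sorted list of indices."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def extend_clique(clique: int, neighbours: int, new_index: int) -> int:
    """Clique formed by the new alignment and its neighbours within `clique`.

    If (clique & neighbours) == clique, the new alignment simply joins the
    existing clique; otherwise the result is a split-off clique.
    """
    return (clique & neighbours) | (1 << new_index)


clique = to_bits([0, 1, 2])        # active clique {0, 1, 2}
neighbours = to_bits([1, 2, 5])    # nodes adjacent to the new alignment 3
print(members(extend_clique(clique, neighbours, 3)))
```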

CLEVER: Clique-Enumerating Variant Finder
For each maximal clique $\mathcal{C}$ (accounting for multiply mapped reads):
$P(H_0 \mid \mathcal{C}) = \sum_{C \subset \mathcal{C}} P(C \text{ correct and } \mathcal{C} \setminus C \text{ incorrect}) \cdot P(H_0 \mid C \text{ correct})$
where $H_0$ is the null hypothesis of no variation, and
$P(H_0 \mid C \text{ correct}) = P\left(X_{N(0,1)} \ge \frac{\sqrt{|C|}\,|\bar{I} - \mu|}{\sigma}\right)$
reflects a Z-test for a sample of size $|C|$.
After correction for multiple hypothesis testing: predict indels from all significant cliques $\mathcal{C}$.
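
The Z-test term for a single set C of alignments assumed correct can be sketched as follows. The sketch reads the probability as the upper tail P(X ≥ z), which is an assumption about the slide's intent. It does not cover the sum over subsets for multiply mapped reads or the multiple-testing correction, and the function name and numbers are hypothetical.

```python
import math


def null_probability(lengths, mu: float, sigma: float) -> float:
    """Z-test term for a set C of alignments assumed correct.

    z = sqrt(|C|) * |mean(I) - mu| / sigma, and the returned value is
    P(X >= z) for X ~ N(0, 1) (upper-tail reading, assumed here).
    """
    n = len(lengths)
    mean_len = sum(lengths) / n
    z = math.sqrt(n) * abs(mean_len - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical insert size distribution: mu = 200, sigma = 20.
print(null_probability([260, 270, 250, 260], mu=200.0, sigma=20.0))
print(null_probability([200, 200], mu=200.0, sigma=20.0))
```

Because of the factor √|C|, several alignments that each deviate only moderately from µ can together be significant, which is why concordant reads need not be discarded up front.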

Prior Approaches
Issues:
- Discard massive amounts of read alignments (all alignments from concordant reads)
- Statistically less principled definition of variant-related alignment groups
- No correction for multiple hypothesis testing

Evaluation
Benchmarks:
1. Simulated data: Venter’s genome
2. Real data: Yoruban individual (NA18507)

Results (Hit Statistics)
Simulation study: deletions in Venter’s genome
Entries are Prec. / Rec. per deletion length range.
Algorithm          10–49          50–99          100–499
PINDEL             42.0 / 42.0    52.0 / 35.3    85.8 / 40.5
CLEVER             51.9 / 22.7    51.1 / 76.5    82.5 / 72.4
BreakDancer        15.1 / 0.3     43.5 / 20.1    48.6 / 56.5
GASV               1.1 / 10.5     29.6 / 26.1    0.8 / 53.6
HYDRA              – / 0.0        – / 0.0        85.7 / 61.3
VariationHunter    15.2 / 0.8     29.3 / 20.5    49.2 / 59.4
MoDIL (1)          18.6 / 16.0    22.3 / 68.5    41.7 / 41.7
(1) MoDIL: run only on chromosome 1.

Results (Hit Statistics)
Real data: individual NA18507
Entries are Rec. / Excl. per length range.
Algorithm          10–49          50–99          100–499
PINDEL             45.5 / 38.5    33.7 / 1.0     52.7 / 0.0
CLEVER             9.4 / 2.1      73.2 / 26.3    78.1 / 5.2
BreakDancer        0.2 / 0.0      22.1 / 0.5     63.7 / 0.4
GASV               5.0 / 1.8      27.4 / 0.5     61.7 / 2.3
HYDRA              0.0 / 0.0      0.0 / 0.0      67.2 / 0.4
VariationHunter    0.3 / 0.0      18.9 / 0.0     69.8 / 2.3
Used annotations: Mills et al., Genome Research, 2011.

Room for Improvements
Issues:
1. Integrating split-read information
2. Multiply mapped reads
3. Discovering other types of variants
4. Overlapping cliques → cluster editing

Split-Read Information
[Figure: Integrative Genomics Viewer (IGV) screenshot]

Read Mapping Ambiguities
[Figure: red(dish) = misplaced alignments]
Goal: determine the correct alignment for multiply mapped reads.

Other Variations
[Figure: Feuk et al., 2006]

Software
Clique enumeration + significance testing:
CLEVER: CLique Enumerating Variant findER
Availability: http://clever-sv.googlecode.com
Fred Clever and Jeff Smart
Resolving ambiguities with EM algorithm:
SMART: Sparse Mixture Ambiguity Resolving Tool
Availability: coming soon

Conclusions
Summary:
- Statistically sound criterion for edges
- Enumerating all maximal cliques is feasible
- Significance test for cliques
- Results are good (even without SMART)
Future work:
- Finish and benchmark SMART
- Try other read mappers / re-map reads
- Integrate split-read information into CLEVER
