slide-1
SLIDE 1

Discovery of Genomic Structural Variations with Next-Generation Sequencing Data

Marcel H. Schulz
Advanced Topics in Computational Genomics, Oct 2011

with slides from Tobias Rausch (EMBL) and Kai Ye (Leiden University)

slide-2
SLIDE 2

Genomic Rearrangements/ Structural Variations (SVs)

  • 1 Kb to several Mb in size

courtesy of Tobias Rausch (EMBL)

slide-3
SLIDE 3
Genomic Rearrangements/ Structural Variations (SVs)

  • 1 Kb to several Mb in size
  • Copy number variants (CNVs)
    – Deletion
    – Duplication

courtesy of Tobias Rausch (EMBL)

slide-4
SLIDE 4

Genomic Rearrangements/ Structural Variations (SVs)

  • 1 Kb to several Mb in size
  • Copy number variants (CNVs)
    – Deletion
    – Duplication

  • Insertion

courtesy of Tobias Rausch (EMBL)

slide-5
SLIDE 5

Genomic Rearrangements/ Structural Variations (SVs)

  • 1 Kb to several Mb in size
  • Copy number variants (CNVs)
    – Deletion
    – Duplication

  • Insertion, Inversion

courtesy of Tobias Rausch (EMBL)

slide-6
SLIDE 6
Genomic Rearrangements/ Structural Variations

  • 1 Kb to several Mb in size
  • Copy number variants (CNVs)
    – Deletion
    – Duplication
  • Insertion, Inversion, Translocation

courtesy of Tobias Rausch (EMBL)

slide-7
SLIDE 7
Genomic Rearrangements/ Structural Variations

  • 1 Kb to several Mb in size
  • Copy number variants
    – Deletion
    – Duplication
  • Insertion, Inversion, Translocation
  • More abundant than SNPs

SNP example: …ACGATACG… vs. …ACGAGACG…

courtesy of Tobias Rausch (EMBL)

slide-8
SLIDE 8
Genomic Rearrangements/ Structural Variations

  • 1 Kb to several Mb in size
  • Copy number variants
    – Deletion
    – Duplication
  • Insertion, Inversion, Translocation
  • More abundant than SNPs
  • Either neutral or non-neutral in function
  • Non-neutral mechanisms
    – Disrupting genes
    – Creating fusion genes
    – Copy number changes of dosage-sensitive genes

courtesy of Tobias Rausch (EMBL)

slide-9
SLIDE 9

Why Structural Variation Discovery

  • Find disease-causing genes
  • Trace the evolutionary history of genomes
  • Analyze the mechanisms by which SVs arise
  • Understand the spread of repetitive elements (LINEs, Alus, etc.)

slide-10
SLIDE 10

Technologies to Discover Structural Variations

slide-11
SLIDE 11

Technologies

  • Fluorescence in situ hybridization (FISH)
    – Fluorescent probes (≈100 kb) detect and localize the presence or absence of specific DNA sequences

Perry et al. (2007)

courtesy of Tobias Rausch (EMBL)

slide-12
SLIDE 12

Technologies

  • Fluorescence in situ hybridization (FISH)
  • Comparative Genomic Hybridization (CGH)
    – Test vs. reference sample
    – 2.1 million probes
    – Different types
      • Whole-Genome Tiling Arrays
      • Whole-Genome Exon-Focused Arrays
      • CNV Arrays

courtesy of Tobias Rausch (EMBL)

slide-13
SLIDE 13

Technologies

  • Fluorescence in situ hybridization (FISH)
  • Comparative Genomic Hybridization (CGH)
  • Genome-Wide Human SNP Array 6.0
    – 1.8 million genetic markers
      • 906,600 SNPs
      • 946,000 probes for CNVs

courtesy of Tobias Rausch (EMBL)

slide-14
SLIDE 14

Technologies

  • Fluorescence in situ hybridization (FISH)
  • Comparative Genomic Hybridization (CGH)
  • Genome-Wide Human SNP Array 6.0
  • Human 1M-Duo DNA Analysis BeadChip
    – 1.2 million genetic markers
      • Markers for SNPs and CNV regions
    – Targeted studies
      • 60,800 additional custom SNPs
      • 60,000 custom CNV targets

courtesy of Tobias Rausch (EMBL)

slide-15
SLIDE 15

Technologies

  • Fluorescence in situ hybridization (FISH)
  • Comparative Genomic Hybridization (CGH)
  • Genome-Wide Human SNP Array 6.0
  • Human 1M-Duo DNA Analysis BeadChip
  • Next-Generation Sequencing (NGS)
    – Whole-genome sequencing
    – Targeted, e.g. RNA-Seq

courtesy of Tobias Rausch (EMBL)

slide-16
SLIDE 16

Focus on NGS

  • Limitations of Arrays
    – Lower resolution for genomic rearrangements
    – Balanced events (e.g., inversions) cannot be detected using signal intensity differences
    – No breakpoint information

courtesy of Tobias Rausch (EMBL)

slide-17
SLIDE 17

Paired-end data

  • Two protocols for paired-end data
    – Mate-pair sequencing by circularization (traditional Sanger sequencing)
    – Paired-end NGS
  • Overview protocol
slide-18
SLIDE 18

Paired-end data

– paired-end NGS (insert distribution known due to fragment size selection)
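Because the insert size distribution is known, read pairs whose mapped distance deviates strongly from it can be flagged as candidate SV signals. The following is a minimal sketch under stated assumptions, not a published method: the function name, the labels, and the 3-standard-deviation cutoff are illustrative choices, and the library distribution is summarized only by its mean and standard deviation.

```python
from statistics import mean, stdev


def classify_insert_sizes(observed, library_sizes, n_sd=3.0):
    """Label mapped insert sizes relative to the library distribution.

    observed:      mapped distances of the read pairs to classify
    library_sizes: insert sizes used to estimate the expected distribution
    n_sd:          assumed cutoff in standard deviations (illustrative)
    """
    mu = mean(library_sizes)
    sd = stdev(library_sizes)
    labels = []
    for size in observed:
        if size > mu + n_sd * sd:
            # Mates map farther apart than expected: consistent with a
            # deletion in the donor relative to the reference.
            labels.append("too_far_apart")
        elif size < mu - n_sd * sd:
            # Mates map closer than expected: consistent with an
            # insertion in the donor relative to the reference.
            labels.append("too_close")
        else:
            labels.append("concordant")
    return labels


# Hypothetical library with a mean insert size of 200 bp.
library = [190, 200, 210, 195, 205, 200, 198, 202]
print(classify_insert_sizes([201, 850, 120], library))
```

Real callers additionally use read orientation and require several supporting pairs per event; this sketch only illustrates the distance criterion.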

slide-19
SLIDE 19

Computational Methods

slide-20
SLIDE 20

Experiment

slide-21
SLIDE 21

Detecting Genomic Rearrangements

Signals relative to the reference:
  • Split-read alignments
  • Read depth signals
  • Mate-pair or paired-end mapping abnormalities

courtesy of Tobias Rausch (EMBL)

slide-22
SLIDE 22

Detecting Genomic Rearrangements

Signals relative to the reference:
  • Unmapped or single-anchored reads
  • Split-read alignments
  • Read depth signals
  • Mate-pair or paired-end mapping abnormalities
  • Local assembly

courtesy of Tobias Rausch (EMBL)

slide-23
SLIDE 23

courtesy of Tobias Rausch (EMBL)

slide-24
SLIDE 24

courtesy of Tobias Rausch (EMBL)

slide-25
SLIDE 25

Insertions   Deletions

courtesy of Tobias Rausch (EMBL)

slide-26
SLIDE 26

courtesy of Tobias Rausch (EMBL)

slide-27
SLIDE 27

Korbel et al. (2007); Lee et al. (2009)

courtesy of Tobias Rausch (EMBL)

slide-28
SLIDE 28

courtesy of Tobias Rausch (EMBL)

slide-29
SLIDE 29

courtesy of Tobias Rausch (EMBL)

slide-30
SLIDE 30

courtesy of Tobias Rausch (EMBL)

slide-31
SLIDE 31

courtesy of Tobias Rausch (EMBL)

slide-32
SLIDE 32

courtesy of Tobias Rausch (EMBL)

slide-33
SLIDE 33

Figure: regions labeled 1 copy, 1 copy, 0 copies, 2 copies, 2 copies

Chiang et al. (2009)

courtesy of Tobias Rausch (EMBL)
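Assuming the copy-number labels on this slide come from a read depth signal, the basic idea can be sketched as follows. This is a deliberately simple illustration, not the method of Chiang et al.: the function name is hypothetical, the expected read count per single copy (`depth_per_copy`) is assumed to be known from a region of known copy number, and GC bias and mappability are ignored.

```python
def estimate_copy_number(window_counts, depth_per_copy):
    """Round each window's read count to the nearest integer copy number.

    window_counts:  number of reads mapped to each fixed-size window
    depth_per_copy: assumed expected read count per window for one copy
    """
    if depth_per_copy <= 0:
        raise ValueError("depth_per_copy must be positive")
    return [round(count / depth_per_copy) for count in window_counts]


# Hypothetical counts for five windows at about 50 reads per copy.
print(estimate_copy_number([48, 52, 3, 101, 97], 50))  # [1, 1, 0, 2, 2]
```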

slide-34
SLIDE 34
  • Down syndrome
    – Partial trisomy 21

Xie et al. (2009)

courtesy of Tobias Rausch (EMBL)

slide-35
SLIDE 35

Chiang et al. (2009): Human cancer cell lines compared to normal cell lines (SegSeq algorithm, no fixed window size, multiple change-point method)

slide-36
SLIDE 36

With reads of length 40-100 bp, are we able to find the exact breakpoint of a structural variation?

slide-37
SLIDE 37

With reads of length 40-100 bp, are we able to find the exact breakpoint of a structural variation?

Yes, using split-read mapping.

Example for a read of length 40: how many random matches are expected for a 12 bp read prefix in the human genome?

Donor Reference

slide-38
SLIDE 38

With reads of length 40-100 bp, are we able to find the exact breakpoint of a structural variation?

Yes, using split-read mapping.

Example for a read of length 40: how many random matches are expected for a 12 bp read prefix in the human genome?

Donor Reference

$\frac{3 \cdot 10^{9}}{4^{12}} \approx 179$
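The estimate can be checked with a few lines of code. This sketch assumes a genome of about 3·10^9 positions on a single strand and independent, uniformly distributed bases; the function name is illustrative.

```python
def expected_random_matches(genome_length, k, alphabet_size=4):
    """Expected number of chance occurrences of one fixed k-mer.

    Assumes each genome position starts a candidate match and every base
    is drawn independently and uniformly from the alphabet.
    """
    return genome_length / alphabet_size ** k


# 12 bp prefix in a genome of roughly 3 billion bases.
print(round(expected_random_matches(3e9, 12)))  # 179

# A full 40 bp read is expected to match by chance far less than once.
print(expected_random_matches(3e9, 40) < 1)  # True
```

A 12 bp prefix alone is therefore far from unique genome-wide, which is why the search space has to be restricted.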

slide-39
SLIDE 39

With reads of length 40-100 bp, are we able to find the exact breakpoint of a structural variation?

Yes, using anchored split-read mapping: the mappable read mate provides an anchor to narrow down the search space.

Donor Reference

Medvedev et al. (2009)

slide-40
SLIDE 40

The Pindel algorithm (Deletions)

Ye et al. (2009)

slide-41
SLIDE 41

The Pindel algorithm (Deletions)

Ye et al. (2009)

How to do that?

slide-42
SLIDE 42

The Pindel algorithm (Deletions)

Ye et al. (2009)

① Use the 3′ end of the left read as anchor point
② Use pattern growth to search for minimum and maximum unique substrings from the 3′ end of the unmapped read (search range ≤ 2× insert size)

slide-43
SLIDE 43

String matching by pattern growth

ATGCA ATCAAGTATGCTTAGC

courtesy of Kai Ye (Leiden U.)

slide-44
SLIDE 44

String matching by pattern growth

ATGCA ATCAAGTATGCTTAGC

courtesy of Kai Ye (Leiden U.)

slide-45
SLIDE 45

String matching by pattern growth

ATGCA ATCAAGTATGCTTAGC

courtesy of Kai Ye (Leiden U.)

slide-46
SLIDE 46

String matching by pattern growth

ATGCA ATCAAGTATGCTTAGC

courtesy of Kai Ye (Leiden U.)

slide-47
SLIDE 47

String matching by pattern growth

ATGCA ATCAAGTATGCTTAGC

Minimum unique substring: ATG
Maximum unique substring: ATGC

courtesy of Kai Ye (Leiden U.)
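The pattern-growth search on this slide can be sketched as follows, using the pattern ATGCA and the text ATCAAGTATGCTTAGC shown above. This is a simplified illustration with a hypothetical function name: it grows the pattern from its first base with exact matching and keeps the list of candidate positions that still match.

```python
def min_max_unique_substring(pattern, text):
    """Grow a prefix of `pattern` base by base and track where it matches.

    Returns (minimum_unique, maximum_unique, position), where the minimum
    is the shortest prefix occurring exactly once in `text` and the maximum
    is the longest prefix that still matches at that position.
    Returns None if no prefix ever becomes unique.
    """
    candidates = list(range(len(text)))  # every position matches the empty prefix
    minimum = maximum = position = None
    for length in range(1, len(pattern) + 1):
        base = pattern[length - 1]
        # Keep only candidates whose next base agrees with the pattern.
        candidates = [p for p in candidates
                      if p + length <= len(text) and text[p + length - 1] == base]
        if not candidates:
            break  # the prefix can no longer be extended
        if len(candidates) == 1:
            if minimum is None:
                minimum = pattern[:length]
            maximum = pattern[:length]
            position = candidates[0]
    if minimum is None:
        return None
    return minimum, maximum, position


print(min_max_unique_substring("ATGCA", "ATCAAGTATGCTTAGC"))
# ('ATG', 'ATGC', 7)
```

"A" and "AT" occur more than once, "ATG" is the first unique prefix, and "ATGCA" no longer matches, which reproduces the minimum and maximum on the slide.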

slide-48
SLIDE 48

The Pindel algorithm (Deletions)

Ye et al. (2009)

① Use the 3′ end of the left read as anchor point
② Use pattern growth to search for minimum and maximum unique substrings from the 3′ end of the unmapped read (search range ≤ 2× insert size)
③ Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read (search range: read length + Max_D), starting from the mapped end in step 2

slide-49
SLIDE 49

The Pindel algorithm (Deletions)

Ye et al. (2009)

① Use the 3′ end of the left read as anchor point
② Use pattern growth to search for minimum and maximum unique substrings from the 3′ end of the unmapped read (search range ≤ 2× insert size)
③ Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read (search range: read length + Max_D), starting from the mapped end in step 2
④ Check whether the complete unmapped read can be reconstructed from the 3′ and 5′ end substring matches

slide-50
SLIDE 50

The Pindel algorithm (Insertions)

Ye et al. (2009)

① Use the 3′ end of the left read as anchor point
② Use pattern growth to search for minimum and maximum unique substrings from the 3′ end of the unmapped read (search range ≤ 2× insert size)
③ Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read (search range: read length - 1), starting from the mapped end in step 2
④ Check whether the complete unmapped read can be reconstructed from the 3′ and 5′ end substring matches

slide-51
SLIDE 51

The Pindel algorithm (Insertions)

Ye et al. (2009)

① Use the 3′ end of the left read as anchor point
② Use pattern growth to search for minimum and maximum unique substrings from the 3′ end of the unmapped read (search range ≤ 2× insert size)
③ Use pattern growth to search for minimum and maximum unique substrings from the 5′ end of the unmapped read (search range: read length - 1), starting from the mapped end in step 2
④ Check whether the complete unmapped read can be reconstructed from the 3′ and 5′ end substring matches

  • In the initial Pindel version, exact matches to the reference were required
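Steps ② to ④ for deletions and insertions can be illustrated with a simplified sketch. It is not the Pindel implementation: all names are hypothetical, the read is assumed to be given in reference orientation (so the two ends are treated as prefix and suffix rather than 3′ and 5′ ends on opposite strands), matching is exact, and a single downstream region of read length + `max_deletion` is searched, with the event type decided from the geometry of the two matches.

```python
def grow_unique_prefix(read, text):
    """Return (min_len, max_len, start) of the uniquely matching prefix, or None."""
    candidates = list(range(len(text)))
    min_len = max_len = start = None
    for length in range(1, len(read) + 1):
        base = read[length - 1]
        candidates = [p for p in candidates
                      if p + length <= len(text) and text[p + length - 1] == base]
        if not candidates:
            break
        if len(candidates) == 1:
            if min_len is None:
                min_len = length
            max_len, start = length, candidates[0]
    return None if min_len is None else (min_len, max_len, start)


def call_split_read(reference, read, window_start, window_end, max_deletion):
    """Try to explain `read` as a deletion or short insertion near an anchor.

    window_start/window_end delimit the assumed search window near the anchor.
    Returns ("deletion", start, size), ("insertion", position, bases) or None.
    """
    n = len(read)
    left = grow_unique_prefix(read, reference[window_start:window_end])
    if left is None:
        return None
    pmin, pmax, p = left
    p += window_start  # reference coordinate where the prefix match starts
    # Search the other read end downstream of the prefix match by reversing
    # both strings, so that suffix growth becomes prefix growth.
    region = reference[p:p + n + max_deletion]
    right = grow_unique_prefix(read[::-1], region[::-1])
    if right is None:
        return None
    smin, smax, rpos = right
    end = p + len(region) - rpos  # exclusive reference end of the suffix match
    span = end - p                # reference bases covered from prefix start to suffix end
    if span > n:
        # Prefix and suffix must together cover the read exactly.
        lo, hi = max(pmin, n - smax), min(pmax, n - smin)
        if lo <= hi:
            return ("deletion", p + lo, span - n)
    elif span < n:
        # Prefix and suffix are adjacent in the reference; the read bases
        # between them are the inserted sequence.
        lo, hi = max(pmin, span - smax), min(pmax, span - smin)
        if lo <= hi:
            return ("insertion", p + lo, read[lo:n - (span - lo)])
    return None


ref = "ACGTTGCAAGGCTCCATAGATTCGGAACTTGACCGTA"
deleted = ref[6:14] + ref[24:32]           # read spanning a 10 bp deletion
inserted = ref[6:14] + "TAT" + ref[14:21]  # read spanning a 3 bp insertion
print(call_split_read(ref, deleted, 0, len(ref), 20))   # ('deletion', 14, 10)
print(call_split_read(ref, inserted, 0, len(ref), 20))  # ('insertion', 14, 'TAT')
```

When several split points are compatible with the unique ranges, the sketch reports the leftmost one; this is a convention chosen for the example.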
slide-52
SLIDE 52

The Pindel algorithm (Simulations)

Ye et al. (2009)

slide-53
SLIDE 53

The Pindel algorithm (Real Data)

Ye et al. (2009)

slide-54
SLIDE 54

The Pindel algorithm (Real Data)

Ye et al. (2009)

slide-55
SLIDE 55

The Pindel algorithm for complex variants

Ye et al., Pindel manual

a) large deletion
b) tandem duplication
c) inversion
d-f) same as a-c with non-template sequence (yellow part)

slide-56
SLIDE 56

Comparison to SplazerS

Emde et al. (submitted)

① SplazerS detects any possible prefix-suffix decomposition of the unmapped read in the search region
② Allows an arbitrary number of mismatches and even small indels in the unmapped read
③ Delays the decision to the indel calling step

slide-57
SLIDE 57

Computational Methods for De Novo Genomic Rearrangement Detection

courtesy of Tobias Rausch (EMBL)

slide-58
SLIDE 58

Computational Methods for De Novo Genomic Rearrangement Detection

courtesy of Tobias Rausch (EMBL)

slide-59
SLIDE 59

Computational Methods for De Novo Genomic Rearrangement Detection

courtesy of Tobias Rausch (EMBL)

slide-60
SLIDE 60

Acknowledgements

  • Tobias Rausch (EMBL)
  • Kai Ye (Leiden University Medical Center)
  • Anne-Katrin Emde (Freie Universität Berlin)

References

Kai Ye, Marcel H. Schulz, Quan Long, Rolf Apweiler, and Zemin Ning. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics (2009) 25(21): 2865-2871.

Pindel homepage: https://trac.nbic.nl/pindel/
SplazerS homepage: http://www.seqan.de/projects/splazers.html