Synthetic long read technologies in genome phasing and beyond



slide-1
SLIDE 1

Synthetic long read technologies in genome phasing and beyond

Volodymyr Kuleshov

Stanford University Batzoglou & Snyder Labs

slide-2
SLIDE 2

+ Latest ongoing research on synthetic long reads

slide-3
SLIDE 3

Genome phasing

  • ---- [A/T] ------ [C/G] ----- [G/T] ------
  • ---- [A] ------ [G] ----- [G] ------
  • ---- [T] ------ [C] ----- [T] ------

Fundamental aspect of human genetics that is relevant in many applied problems

slide-4
SLIDE 4

Scientific application

Figure labels: TF binding site; differentially methylated region

Allele-specific methylation

Paternal and maternal methylation levels

slide-5
SLIDE 5
Medical application

HLA typing

  • Immune response during organ transplantation depends on compatibility between HLA genes
  • These genes are highly heterozygous
slide-6
SLIDE 6

General principle

Unphased genome:

  • ---- [A/T] ------ [C/G] ----- [G/T] ------

Sequence reads:

  • ---- [A] ------ [G] -----
  • ----- [C] ----- [T] ------
  • ---- [T] ------ [C] -----

Phased result:

  • ---- [A] ------ [G] ----- [G] ------
  • ---- [T] ------ [C] ----- [T] ------
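
The principle on this slide can be sketched in a few lines of Python. This is a toy greedy procedure, not the method used in the talk: it assumes error-free reads, a single connected block, and a hypothetical representation in which each read is a dict mapping a site index to the observed allele. The function name `phase_reads` is made up for illustration.

```python
def phase_reads(genotypes, reads):
    """Greedy read-based phasing of heterozygous sites (toy sketch).

    genotypes: list of (allele_a, allele_b) tuples, one per heterozygous site.
    reads: list of dicts {site_index: observed_allele}.
    Returns two haplotypes as lists; unphased sites stay None.
    """
    n = len(genotypes)
    hap1 = [None] * n
    hap2 = [None] * n
    # Anchor the first site arbitrarily: haplotype labels are interchangeable.
    hap1[0], hap2[0] = genotypes[0]

    changed = True
    while changed:
        changed = False
        for read in reads:
            # Count agreements with each haplotype at already phased sites.
            votes1 = sum(1 for s, a in read.items() if hap1[s] == a)
            votes2 = sum(1 for s, a in read.items() if hap2[s] == a)
            if votes1 == votes2:
                continue  # read is uninformative or ambiguous so far
            target, other = (hap1, hap2) if votes1 > votes2 else (hap2, hap1)
            for s, a in read.items():
                if target[s] is None and a in genotypes[s]:
                    x, y = genotypes[s]
                    target[s] = a
                    other[s] = y if a == x else x  # complementary allele
                    changed = True
    return hap1, hap2


genotypes = [("A", "T"), ("C", "G"), ("G", "T")]
reads = [{0: "A", 1: "G"}, {1: "C", 2: "T"}, {0: "T", 1: "C"}]
hap1, hap2 = phase_reads(genotypes, reads)
print(hap1, hap2)  # ['A', 'G', 'G'] ['T', 'C', 'T']
```

With the three reads from the slide, the sketch recovers the two haplotypes shown as the phased result. Real data contain sequencing errors, which is why the talk treats phasing as probabilistic inference rather than greedy extension.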
slide-7
SLIDE 7

Long read sequencing

  • Phasing is now becoming possible thanks to new synthetic long read technologies
  • Examples: Moleculo, Long Fragment Reads (LFR), 10X Genomics
  • Produce virtual multi-kb reads on regular sequencers

slide-8
SLIDE 8

1. Moleculo starts with quality DNA
2. DNA is cut into 10 Kbp fragments
3. The fragments are placed into wells
4. Wells are assigned a unique barcode
5. The contents of each well are sequenced with short reads and reconstructed on a computer
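
A rough illustration of the last step: once short reads carry a well barcode, reads from the same well can be grouped and merged back into long fragments. The sketch below is a simplification and not the Moleculo pipeline: it assumes the short reads are already placed on reference coordinates as hypothetical `(barcode, start, end)` tuples, whereas the slide describes reconstruction of each well's contents on a computer. All function names are illustrative.

```python
from collections import defaultdict


def group_by_barcode(short_reads):
    """Collect (start, end) intervals of short reads per well barcode."""
    wells = defaultdict(list)
    for barcode, start, end in short_reads:
        wells[barcode].append((start, end))
    return wells


def merge_intervals(intervals, max_gap=0):
    """Merge overlapping (or nearly adjacent) intervals into long fragments."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the current fragment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            # Start a new fragment.
            merged.append((start, end))
    return merged


def synthetic_long_reads(short_reads, max_gap=0):
    """Return, per barcode, the long fragments implied by its short reads."""
    wells = group_by_barcode(short_reads)
    return {bc: merge_intervals(iv, max_gap) for bc, iv in wells.items()}


short_reads = [("w1", 0, 100), ("w1", 80, 200), ("w1", 5000, 5100), ("w2", 50, 150)]
print(synthetic_long_reads(short_reads))
# {'w1': [(0, 200), (5000, 5100)], 'w2': [(50, 150)]}
```

The barcode is what keeps reads from different fragments apart: `w1` and `w2` overlap on the reference but are never merged with each other.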

slide-9
SLIDE 9

Locally phased blocks

  • Phasing as inference in a probabilistic model (ECCB14)
  • 11% more accurate than RefHap
  • Produces useful confidence scores
slide-10
SLIDE 10

Shortcomings

            Moleculo   LFR
N50         60 Kbp     600 Kbp
% phased    90%        95%

  • Reads too short relative to other methods
  • 10% of variants unphased due to sequencing bias
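
N50 is the main length statistic in these slides (haplotype blocks, contigs). For reference, a minimal sketch of the standard definition: the largest length L such that blocks of length at least L cover at least half of the total. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of block or contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest block down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


print(n50([10, 20, 30, 40]))  # 30
```

In the example, the blocks of length 40 and 30 together cover 70 of 100 bases, so the N50 is 30.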

slide-11
SLIDE 11

Idea:

Use statistical phasing!

slide-12
SLIDE 12

Prism Statistical Phaser

  • Extends earlier methods to handle pre-phased blocks
  • Prior information from blocks significantly improves accuracy
  • Works best where molecular phasing fails
  • Produces useful confidence scores

slide-13
SLIDE 13

Prism Statistical Phaser

  • Augments the HMM model of Li and Stephens (used in Impute2, Shape-IT, etc.) with additional variables
  • Determines scores using probabilistic inference in the model
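
For readers unfamiliar with the underlying model, here is a minimal sketch of the basic Li and Stephens copying HMM, in which a haplotype is modeled as a mosaic of reference panel haplotypes. It does not include the additional variables Prism adds for pre-phased blocks, and the parameter values (`switch`, `mismatch`) and the function name are assumptions chosen for illustration.

```python
def li_stephens_likelihood(target, panel, switch=0.01, mismatch=0.001):
    """Forward algorithm for the basic Li and Stephens copying model.

    target: haplotype as a string of '0'/'1' alleles.
    panel: list of reference haplotypes, same length as target.
    switch: per-site probability of re-drawing the copied haplotype.
    mismatch: probability that the emitted allele differs from the copied one.
    Returns P(target | panel).
    """
    k = len(panel)

    def emit(j, site):
        return 1.0 - mismatch if panel[j][site] == target[site] else mismatch

    # Uniform prior over which panel haplotype is copied at the first site.
    forward = [emit(j, 0) / k for j in range(k)]
    for site in range(1, len(target)):
        total = sum(forward)
        # Either keep copying haplotype j, or switch and pick uniformly.
        forward = [
            emit(j, site) * ((1.0 - switch) * forward[j] + switch * total / k)
            for j in range(k)
        ]
    return sum(forward)


panel = ["0011", "1100"]
# A haplotype present in the panel is far more likely than one that
# needs a switch between panel haplotypes to be explained.
print(li_stephens_likelihood("0011", panel) > li_stephens_likelihood("0000", panel))
```

Posterior probabilities from the same kind of recursion are one way to obtain confidence scores by probabilistic inference, which is the role the slide assigns to the model.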

slide-14
SLIDE 14

Experiments

Sample                    Haplotype block N50 length (bp)   Phasing rate over SNVs   Switches per Mbp
NA12878 (two libraries)   563,801                           99.00%                   0.47
NA12891 (two libraries)   647,599                           99.25%                   0.68
NA12892 (two libraries)   531,804                           98.84%                   0.75
NA12878 (library #1)      401,342                           98.49%                   0.51

  • 500 Kbp N50
  • < 1 error/Mbp
  • 99% of SNVs phased
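
The "switches per Mbp" column counts switch errors: places where the predicted haplotype flips from following one true haplotype to following the other. A minimal sketch of that metric, assuming every site is heterozygous and haplotypes are encoded as hypothetical '0'/'1' strings (the evaluation code used in the talk is not shown in the slides):

```python
def count_switches(predicted, truth):
    """Count switch errors between a predicted and a true haplotype.

    Both arguments give one haplotype over the same heterozygous sites.
    A mismatch at a site means the prediction follows the other haplotype.
    """
    follows_truth = [p == t for p, t in zip(predicted, truth)]
    # A switch is any change of orientation between consecutive sites.
    return sum(1 for a, b in zip(follows_truth, follows_truth[1:]) if a != b)


def switches_per_mbp(predicted, truth, span_bp):
    """Normalize the switch count by the genomic span in megabases."""
    return count_switches(predicted, truth) / (span_bp / 1e6)


print(count_switches("0011", "0000"))               # 1
print(switches_per_mbp("0011", "0000", 2_000_000))  # 0.5
```

A single switch flips every downstream site, so counting switches rather than mismatched sites avoids penalizing one error many times.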

slide-15
SLIDE 15

Comparison

This shows how clever algorithms can greatly improve sequencing technology

slide-16
SLIDE 16

Metagenomics

  • We used Moleculo to assemble the human gut microbiome, which led to:
  • Very long contigs
  • High resolution analysis of strains
  • Enabled by a new software package called Nanoscope

slide-17
SLIDE 17

Assembly results

  • 650 Mbp of sequence as 50 Kbp (N50) contigs (7x longer than with Illumina)
  • Several megabase-long contigs, including a recently discovered species

slide-18
SLIDE 18

Sub-strain identification

slide-19
SLIDE 19

Sub-strain identification

  • Lens phasing algorithm reconstructs bacterial haplotypes
  • Over 200K variants
  • Haplotype N50 length of 22 Kbp
  • Several long haplotypes of over 120 Kbp

[Figure, panels a to c: assembly and detection steps; nucleotide labels not recoverable]

slide-20
SLIDE 20

De-novo Assembly

[Figure, three panels:]

  • Two regions with repeat R (A R B and C R D) covered by long reads
  • Repeat structure in the assembly graph
  • Resolving the repeat using raw short reads
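
The ambiguity in the figure is that a collapsed repeat R has two ways in (A, C) and two ways out (B, D), so the graph alone cannot tell A R B from A R D. One simple way to picture the resolution step, offered only as an assumption about how such evidence could be used and not as the method from the talk, is to count reads that link a left flank to a right flank across R and keep the best-supported pairing. The link tuples and function name are hypothetical.

```python
from collections import Counter


def resolve_repeat(in_flanks, out_flanks, spanning_links):
    """Pair each incoming flank with its best-supported outgoing flank.

    spanning_links: list of (left_flank, right_flank) observations, one per
    read (or read pair) that connects the two flanks across the repeat.
    """
    support = Counter(spanning_links)
    pairing = {}
    for left in in_flanks:
        best = max(out_flanks, key=lambda right: support[(left, right)])
        if support[(left, best)] > 0:  # leave unresolved without evidence
            pairing[left] = best
    return pairing


links = [("A", "B"), ("A", "B"), ("C", "D"), ("A", "D")]
print(resolve_repeat(["A", "C"], ["B", "D"], links))  # {'A': 'B', 'C': 'D'}
```

Here the paths A R B and C R D win, matching the two regions in the first panel; the single conflicting A to D link is outvoted.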

slide-21
SLIDE 21

Conclusion

  • Synthetic long reads are a promising sequencing technology that can make progress on important genomics problems
  • This technology requires developing novel computational methods, which opens a new research direction

slide-22
SLIDE 22

Acknowledgements

  • Snyder Lab: Mike Snyder, Dan Xie, Chao Jiang, Wenyu Zhou
  • Batzoglou Lab: Serafim Batzoglou, Alex Bishara, Yuling Liu
  • Moleculo Team: Dmitry Pushkarev, Michael Kertesz, Tim Blauwkamp
  • Funding Agencies
  • NIH Training Grant
  • NSERC Canada Graduate Fellowship
  • ISCB for travel support
slide-23
SLIDE 23

Thank you!