Workshop Schedule 9am Introductions & Running the VM 10:30am - - PowerPoint PPT Presentation

workshop schedule
SMART_READER_LITE
LIVE PREVIEW

Workshop Schedule 9am Introductions & Running the VM 10:30am - - PowerPoint PPT Presentation

Workshop Schedule 9am Introductions & Running the VM 10:30am Coffee 11am Analyzing tool performance & Merging 11:30am Managing File Structures 12pm Understanding VCFs and Visualizing your SVs 1pm Lunch 2pm


slide-1
SLIDE 1

Workshop Schedule

 9am

Introductions & Running the VM

 10:30am Coffee  11am

Analyzing tool performance & Merging

 11:30am Managing File Structures  12pm

Understanding VCFs and Visualizing your SVs

 1pm

Lunch

 2pm

Customizing the VM – new data/new tools

 2:30pm

BreakOut Sessions

  • Option 1: Run your own Data
  • Option 2: Discussion of projects, current difficulties, Issues

 3:45pm

Wrap-Up Session

slide-2
SLIDE 2

THE NEW FACE OF SV DETECTION

  • Dr. Tiffanie Yael Maoz (Moss)

Weizmann Institute of Science, Israel

slide-3
SLIDE 3

Motivation for Work

 AllBio Research Consortium Project  Test Case #2 – Identification of Large SVs  Goal: Create a tool or pipeline which can identify

large structural variants in multiple accessions of a genome

slide-4
SLIDE 4

The Need for a Benchmark

 Many tools exist to predict structural variants (SVs),

but have largely been developed and tested on human germline or somatic (e.g. cancer) variation.

 Benchmarking is needed to determine which tools

are most suited non-human genomes.

slide-5
SLIDE 5

SV Prediction Tools – Algorithm design

slide-6
SLIDE 6

Procedure/Methodology

 Choose Non-Human Genome  Choose SV tools to be tested  Measure computational metrics  Measure SV tool accuracy  Optimize output for users  Standardize process for future SV tool development

slide-7
SLIDE 7

Non-Human Dataset

 Choice of Genome subject: Arabidopsis (Landsberg)

 Well annotated reference (TAIR v9/10)  Validated Structural Variants (Landsberg – Ler ecotype)  Diploid  Homogenous

 Creation of semi-synthetic dataset  Integration of INDEL calls into Reference genome  30x coverage  More vs. Less ideal datasets

slide-8
SLIDE 8

Tools

Delly InGap (Java) BreakDancer GenomeSTRIP Tigra-SV GASV/GASVPro Hydra SVMerge SVDetect CNVSeq CNVnator Pindel PRISM

slide-9
SLIDE 9

Popular SV tools

slide-10
SLIDE 10

To do List

 Check quality of genomes and apply quality control

as needed using fastqc

 Align reads with multiple alignment programs:

BWA, Bowtie2, SOAP?

 Quality check the alignments  Run SV programs on each alignment  Format output as needed for visualization  Test accuracy of output  Write up the analysis

slide-11
SLIDE 11

Day 1

 Use Bam and Bowtie2 sorted bam files to run the

following programs:

 Delly  Breakdancer  GASV-Pro  SVDetect  Pindel  Prism  Clever

slide-12
SLIDE 12

Day 2

 Completed the testing of the selected structural variation programs

with bwa and bowtie2 alignments

 flagstat – Statistics for mapping quality for bwa and bowtie2  Used a script to compare the quality of the SV programs on the

basis of precision, recall, novel calls, distance of the predicted from the true call and the difference in length. (Only examine SVs >10nt)

 Need add additional SV calls to our reference data set to be used

in generating the synthetic reads

 Will use tool from the assemble-a-thon to create the synthetic reads

in future (use same mean and standard deviation as was seen in the reference data)

 Working on generating a make file which should automate future

runs on new test data and would properly format the VCF files for matrix analysis

slide-13
SLIDE 13

Genomic Data

Test dataset

Data QC Fastqc Reads alignment Alignments QC SV analysis Output format Performance Metrics Optional: Merge Select Tool(s) Genome Browser

Validate SVs

Develop/Refine New Tools

Meta-Tool: Identification of the best tool for SV analysis Downstream Options

BWA-mem 0.7.5a Samtools flagstat

Breakdancer GASV Clever Pindel Prism SVdetect Delly

Modified VCF Analysis PDF Merged VCF ESTs Retro-elements Small RNAs X

Pipeline for Meta-Tool Virtual Machine (VM)

Sickle

slide-14
SLIDE 14

Computational Performance

Tool Threads (reserved / ran) Mem use on Tair Mem use on Human (mb) CPU time Tair (h:m.s) CPU time Human (h:m.s) Algorithm SV’s gasv 1 1058 594 0:02.08 0:01.20 Paired End IDVT delly 1 578 1236 0:15.02 0:03.18 Paired End & Split Read DVTP breakdancer 1 21.9 7 0:02.41 0:27.7 Paired End IDVT pindel 4 3500.5 5779 3:02.46 1:16.0 Split Read IDVP clever 1 - 4 238.7 1598 0:15.47 0:14.04 Paired End ID svdetect 1 - 2 172.3 3223 0:07.56 0:07.31 Paired End IDVTP

SV types: Insertion (I), Deletion (D), Inversion (V), Translocation(T) , SNP (S), Duplication (P)

slide-15
SLIDE 15

Measuring SV tool accuracy

 Measurements  Precision & Recall

 Precision: percentage of correct calls among all calls  Recall: percentage of true SV being called

 Across SV size classes

 20-49nt  50-99nt  100-249nt  250-999nt  1000-50,000nt

slide-16
SLIDE 16

Benchmarking Summary

 Tools vary in the calling ability across size classes  Some tools are more adaptable to the Arabidopsis

dataset

 Trade-off of many precise calls but often with more

false positives

 There is benefit to using multiple tools, but the

number of false positives would overwhelm the researcher with many non-true calls

slide-17
SLIDE 17

Expectations for SV discovery

Alkan, Coe & Eichler, 2011

slide-18
SLIDE 18

Merging Package

 Merge Package  Cluster and merge output  Statistically deterimined clustering based on tool algorithms  Calls reduced by as much as 70%  VCF output format (suitable for IGV)  Benefits of 7 callers with fewer false positives

slide-19
SLIDE 19

Merge Package Overlay

slide-20
SLIDE 20

AllBio Test Case #2 Hackers:

W.Y. Leung, Sequencing Analysis Support Core, Leiden University Medical, The Netherlands Alexander Schoenhuth, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands Tobias Marschall, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands Leon Mei, Sequencing Analysis Support Core, Leiden University Medical, The Netherlands Yogesh Paudel, Animal breeding and genomics centre, Wageningen University, The Netherlands Laurent Falquet, University of Fribourg and Swiss Institute of Bioinformatics, Fribourg, Switzerland Tiffanie Yael Maoz (Moss), Weizmann Institute of Science, Rehovot, Israel

Supported by: ALLBio Leadership with special thanks to Gert Vriend and Greg Rossier