Input Input Data Next-generation sequencing data Single-end or - PowerPoint PPT Presentation

Input Input Data • Next-generation sequencing data • Single-end or paired-end • Mix of viruses, host, bacteria and phage DNA and RNA • Capture sequencing to enrich for viruses Aim of Vi VirMap • Identify reads that map to known viral sequences • De novo sequence assembly and recover putative novel virus sequences • Assign each viral sequence to species (e.g. identify the taxonomy)

Instructions for Installation of Vi VirMap on Gadi • Lack of installation instructions (early 2019). Enlisted help from ResTech • Instructions from ResTech: • https://github.com/rsoftone/virmap • Download VirMap pipeline: https://github.com/cmmr/virmap.git Miniconda • Conda modules • (e.g. bbmap, diamond, megahit) • Perl • RocksDB – taxonomy lookup • Recommend not to install in default $TMPDIR (/scratch/projectID/userID/tmp). Please update $TMPDIR with some other location (e.g. export TMPDIR= /g/data/projectID/userID/tmp).

Testing Vi VirMap Installation

Pa Part 1: Taxonomy database generation • Run script to generate the taxaJson.dat file • The name is misleading - it is not JSON but actually Sereal- encoded, Zstd compressed Perl hash reference built up from NCBI taxonomy files. Code highlights • Download taxonomy data from NCBI • Construct edges to child-nodes from taxonomy data • Sereal encoding + Zstd compression of Perl hash-reference • https://github.com/rsoftone/virmap/tree/master/10- helper-scripts

Pa Part 2: Accession -> > GI lookup • Downloads a table that converts NCBI accession to GI accession • NCBI accession: NM_000202.5 • GI accession: 262118207 FASTA format header: • blue = fields that are contained in the NCBI Genbank feature table. • yellow = not currently in Genbank’s feature table and need to be obtained via Accession -> GI lookup • Information obtained by correspondence with VirMap authors. Nucleotide: • >gi|2765944|gb|Z99337.1|.Human.immunodeficiency.virus.1.isolate.C49.DNA.for.reverse;taxId=11676 Protein: • >GI|GI:817036220|KR080198.1|AKF14503.1|GI:817036221|hypothetical.protein|Mycobacterium.phage.Ca mbiare;pos=46..537;codonStart=1;taxId=1647305 https://github.com/rsoftone/virmap/tree/master/10-helper-scripts

Pa Part 3: Genbank division download • https://github.com/rsoftone/virmap/tree/master/referencedbs • Download aspera fast data transfer (ascp) utility • List files served by GenBank FTP server • Downloading required GenBank divisions • The GenBank database is divided into 18 divisions. For example: • PRI - primate sequences • VRT - other vertebrate sequences • PLN - plant, fungal, and algal sequences • BCT - bacterial sequences • VRL - viral sequences

Ot Other Steps s for Setting g the Database se • Part 4: Nucleotide .fasta generation using Gadi PBS script • Part 5: Protein .fasta generation using Gadi PBS script • Part 6: Build BBmap virus database (reads mapping) • Part 7: Build Diamond virus database (protein sequence DB search) • Part 7.5: Build Diamond genbank database (gbBlastx) • Part 8: Build Kraken2 database – find lowest common ancestor (LCA) of all genomes containing the short sequence (k-mer), assign taxonomy to sequence • Part 9: Build blastn genbank database (gbBlastn)

Con Configure virma rmap_con onfig.sh in the ## Under $TMPDIR/virmap/ virmap_config.sh ## File path, used as the value for the --gbBlastx parameter GB_BLASTX ="/g/data1a/projectID/VirMap/200625-gbblastx.dmnd" ## Partial file path, used as the value for the --gbBlastn parameter GB_BLASTN ="/g/data1a/projectID/VirMap/191205-gbBlastn/191205-gbBlastn" ## Folder path, used as the value for the --virBbmap parameter VIR_BBMAP ="/g/data1a/projectID/VirMap/191205-virBbmap" ## File path, used as the value for the --virDmnd parameter VIR_DMND ="/g/data1a/projectID/VirMap/200625-virdiamond.dmnd" ## File path, used as the value for the --taxaJson parameter TAXA_JSON ="/g/data1a/projectID/VirMap/200117-taxaJson.dat"

Run Vi VirMap Wrapper Script (part 1) # Activate Conda Environment INSTALL_DIR=$TMPDIR/virmap MINICONDA_DIR=$TMPDIR/miniconda3 export TMPDIR= /g/data/projectID/userID/tmp mkdir -p/path/to/store/output cd /path/to/store/output cp $INSTALL_DIR/activate.sh /path/to/store/output source activate.sh

Run Vi VirMap Wrapper Script (part 2) qsub -P projectID -q normal -l walltime=6:00:00,\ mem=48G,ncpus=12,wd,jobfs=100GB,storage=scratch/projectID +gdata/projectID -joe -- \ $INSTALL_DIR/virmap_wrapper.sh \ --readUnpaired "/path/to/my/sample/.fastq" \ --sampleName "sample_name"

Vi VirMap Output • Final Output file (e.g. nCoV_228_S83_L001_R1_001.final.fa) • Fasta file contained list of all assembled virus genomes and taxonomy information, if available. • Example fasta header: >gi|1369125429|gb|MG772934.1|Bat.SARS-like.coronavirus.isolate.bat-SL- CoVZXC21,.complete.genome.,...;taxId=1508227;length=29566;size=53956; maxScore=9709;maxAlign=9964;maxPossibleScore=15495;protScore=9709; maxProtAlign=9964;nuclScore=0;maxNuclAlign=0;letterOccupancy=71.54;hi ghDivergence=1

Ag Aggregate Results • Give the script a list of results directories and it will aggregate data from running many samples into one Excel file • Whether the run is successful and returned usable results • For each sample • List of viruses, listed by NCBI taxonomy ID • The taxonomy classification hierarchy • De-duplicate read counts https://github.com/rsoftone/virmap/tree/master/aggreg_stats

Ag Aggregate number of errors for each sample

Wa Walltimes for running each step for individual sa sample

Duration for Each Step 12 cores and 48 Gb Ram 24 cores and 112 Gb RAM

Ru Run Statistics • 94 samples from 93 individuals • Two of the 94 samples failed our first attempt at running VirMap • 8 samples failed de-duplication but still produced results • The total amount of SU used by the 92 samples is 5,894 SU • median of 62.5 SU ±2.5 SU standard error (se) per sample • Walltime: • 1:37 ±4 min se. for 12 cores and 48 Gb RAM, and • 1:22 ± 2.3 min. se for 24 cores and 112 Gb RAM

Ge Get t de-dupl duplicated ed rea ead d coun unts • In a table format. Columns include: • NCBI tax id (e.g. 1508227) • Size = number of deduplicated reads (e.g. 408) • Weak or strong match, whether the sequence can be assigned to a taxa, potential for mis-annotation etc… • Taxonomy lineage in text. For example: • Viruses -> Riboviria -> Nidovirales -> Cornidovirineae -> Coronaviridae -> Orthocoronavirinae -> Betacoronavirus -> Sarbecovirus -> Severe acute respiratory syndrome-related coronavirus -> Bat SARS-like coronavirus

Hea Heat ma map of 94 94 sam sample les s ve versus vir viruse ses s id identifie ified – pr prelim. . re results Coronaviruses

Summary Results • 63/92 (68%) samples had coronavirus of some sort (2019 reference database) as detected by VirMap • 81/92 (88%) samples had other viruses that are not coronaviruses or phage. • 55/92 (59.8%) samples had co-infection of Coronavirus and other viruses. • Endogenous retrovirus K, found, in 89/92 samples, not Coronaviruses shown in heat map

Re Re-map Reads to Vi Viral Sequences to get Read Cou Counts • Get read counts for each virus • Remove all reads that mapped to • The human genome • Bacterial Genbank division genomes • Bacteriophage Genbank division genomes • Map all remaining reads to the *.final.fa virus genomes assembled from the sample • Remove multi-mapped reads • Get read counts • Usage: ./get_read_counts.sh -g human_genome_bwa -p phage_bwa -b bacteria_bwa virmap_output.final.fa virmap_input_sequence.fq • https://github.com/rsoftone/virmap/blob/master/get_read_counts.sh

Av Availability on Other Platforms • Katana • Duncan has attempted to install VirMap • Differences in CentOS version (Katana v6, Gadi v7) • Conda dependencies in Katana cannot be resolved • Microsoft Azure • VirMap will install in an Ubuntu instance • Need to transfer data into blob storage and transfer output back into blob storage • Around $300 test Credits for UNSW staff and students • Need to submit request for higher capacity machine, contact support. • Remember to switch off the instance when you don’t need it • Amazon • Original VirMap Github currently has instructions on how to install pipeline and reference DB in Amazon. (Was not available before in 2019.)

Futur Future e Works • Comparison to other tools (e.g. IDSeq and Genome Detective) • Analyze more samples wit VirMap • Update sequence database to 2020 version on Gadi (for COVID19 analysis) • Possible developments: • Option to replace for BlastN with a faster tool MMseqs2 (https://github.com/soedinglab/MMseqs2) • Re-write parts of VirMap in a workflow scripting language (e.g. Snakemake or Nextflow)

Input Input Data Next-generation sequencing data Single-end or - PowerPoint PPT Presentation

Input Input Data Next-generation sequencing data Single-end or paired-end Mix of viruses, host, bacteria and phage DNA and RNA Capture sequencing to enrich for viruses Aim of Vi VirMap Identify reads that map to known viral

File Input and Output File Input and Output 1 / 9 File input/output input function reads values

Input Input devices Text entry Positional input Input Devices 1 iPod Wheel Input Devices 2

Tra ffi c Management as a Service | Ghent, Belgium INPUT PROCESS OUTPUT INPUT PROCESS OUTPUT

Input Input devices Text entry Positional input Input Devices 1 MacBook Wheel (The Onion) -

Chapter 2 SPATIAL DATA INPUT SPATIAL DATA INPUT Analysis and modeling in a GIS requires input of

Input Devices Managing text and positional input 1 CS 349 - Input Devices iPod Wheel 2 CS 349

Expanding the Reach of Fuzzing Caroline Lemieux September 8 th , 2020 Fuzzcon Europe

Wayland Input Methods Michael Hasselmann Openismus GmbH Wayland Input Methods Input methods?

problems of direct input and solutions Input devices vs. Finger-based input Indirect vs. Direct

Exercise 1: Basic Input Exercise 1: Basic Input FLUKA Beginners Course Exercise 1: Basic Input

17. Recursion 2 Input: 3 + 5 * 20 Output: 103 Input: (3 + 5) * 20 Output: 160 Input: -(3 + 5) + 20

7. Java Input/Output User Input/Console Output, File Input and Output (I/O) 133 User Input (half

Input and Validation Mendel Rosenblum CS142 Lecture Notes - Input Early web app input: HTTP form

BASIC INPUT/OUTPUT Fundamentals of Computer Science I Outline: Basic Input/Output Screen

16. Recursion 2 Output: 103 Input: (3 + 5) * 20 Output: 160 Input: -(3 + 5) + 20 Output: 12

Input Input devices Text entry Positional input 1 MacBook Wheel (The Onion) -

SAFE INJECTION SERVICES TASK FORCE MEETING 1 JUNE 15 TH , 2017 9AM-11AM 25 VAN NESS

MOL2NET In silico analysis of cytotoxicity, rate of absorption and molecular docking of natural

Next Up in Series on Px Research & Reality: Hormonal contraception and HIV Prof. Helen Rees

A phylogenetic follow-up study of 4 individuals infected with closely related HIV-1 strains MIEP

Cancer Panomics Hoifung Poon 1 Overview ATTCGG A TATTTAAG G C ATTCGGGTATTTAAGCC

Disclosures I have nothing to disclose Paired Exchange: Compatible Pairs and Process Valerie

Disclosure None Medication Management and Beyond David J. Quan, Pharm.D., BCPS Pharmacist

Live Webcast This activity is supported by independent educational grants from July 16, 2020

Sambuz

Useful Links

Newsletter

Mail Us

Input Input Data Next-generation sequencing data Single-end or - PowerPoint PPT Presentation

Input Input Data Next-generation sequencing data Single-end or paired-end Mix of viruses, host, bacteria and phage DNA and RNA Capture sequencing to enrich for viruses Aim of Vi VirMap Identify reads that map to known viral

File Input and Output File Input and Output 1 / 9 File input/output input function reads values

Input Input devices Text entry Positional input Input Devices 1 iPod Wheel Input Devices 2

Tra ffi c Management as a Service | Ghent, Belgium INPUT PROCESS OUTPUT INPUT PROCESS OUTPUT

Input Input devices Text entry Positional input Input Devices 1 MacBook Wheel (The Onion) -

Chapter 2 SPATIAL DATA INPUT SPATIAL DATA INPUT Analysis and modeling in a GIS requires input of

Input Devices Managing text and positional input 1 CS 349 - Input Devices iPod Wheel 2 CS 349

Expanding the Reach of Fuzzing Caroline Lemieux September 8 th , 2020 Fuzzcon Europe

Wayland Input Methods Michael Hasselmann Openismus GmbH Wayland Input Methods Input methods?

problems of direct input and solutions Input devices vs. Finger-based input Indirect vs. Direct

Exercise 1: Basic Input Exercise 1: Basic Input FLUKA Beginners Course Exercise 1: Basic Input

17. Recursion 2 Input: 3 + 5 * 20 Output: 103 Input: (3 + 5) * 20 Output: 160 Input: -(3 + 5) + 20

7. Java Input/Output User Input/Console Output, File Input and Output (I/O) 133 User Input (half

Input and Validation Mendel Rosenblum CS142 Lecture Notes - Input Early web app input: HTTP form

BASIC INPUT/OUTPUT Fundamentals of Computer Science I Outline: Basic Input/Output Screen

16. Recursion 2 Output: 103 Input: (3 + 5) * 20 Output: 160 Input: -(3 + 5) + 20 Output: 12

Input Input devices Text entry Positional input 1 MacBook Wheel (The Onion) -

SAFE INJECTION SERVICES TASK FORCE MEETING 1 JUNE 15 TH , 2017 9AM-11AM 25 VAN NESS

MOL2NET In silico analysis of cytotoxicity, rate of absorption and molecular docking of natural

Next Up in Series on Px Research &amp; Reality: Hormonal contraception and HIV Prof. Helen Rees

A phylogenetic follow-up study of 4 individuals infected with closely related HIV-1 strains MIEP

Cancer Panomics Hoifung Poon 1 Overview ATTCGG A TATTTAAG G C ATTCGGGTATTTAAGCC

Disclosures I have nothing to disclose Paired Exchange: Compatible Pairs and Process Valerie

Disclosure None Medication Management and Beyond David J. Quan, Pharm.D., BCPS Pharmacist

Live Webcast This activity is supported by independent educational grants from July 16, 2020

Sambuz

Useful Links

Newsletter

Mail Us

Next Up in Series on Px Research & Reality: Hormonal contraception and HIV Prof. Helen Rees