Input Input Data Next-generation sequencing data Single-end or - - PowerPoint PPT Presentation

input input data
SMART_READER_LITE
LIVE PREVIEW

Input Input Data Next-generation sequencing data Single-end or - - PowerPoint PPT Presentation

Input Input Data Next-generation sequencing data Single-end or paired-end Mix of viruses, host, bacteria and phage DNA and RNA Capture sequencing to enrich for viruses Aim of Vi VirMap Identify reads that map to known viral


slide-1
SLIDE 1

Input Input Data

  • Next-generation sequencing data
  • Single-end or paired-end
  • Mix of viruses, host, bacteria and phage DNA and RNA
  • Capture sequencing to enrich for viruses
  • Identify reads that map to known viral sequences
  • De novo sequence assembly and recover putative novel virus

sequences

  • Assign each viral sequence to species (e.g. identify the taxonomy)

Aim of Vi VirMap

slide-2
SLIDE 2

Instructions for Installation of Vi VirMap on Gadi

  • Lack of installation instructions (early 2019). Enlisted help from ResTech
  • Instructions from ResTech:
  • https://github.com/rsoftone/virmap
  • Download VirMap pipeline: https://github.com/cmmr/virmap.git

Miniconda

  • Conda modules
  • (e.g. bbmap, diamond, megahit)
  • Perl
  • RocksDB – taxonomy lookup
  • Recommend not to install in default $TMPDIR

(/scratch/projectID/userID/tmp). Please update $TMPDIR with some other location (e.g. export TMPDIR= /g/data/projectID/userID/tmp).

slide-3
SLIDE 3

Testing Vi VirMap Installation

slide-4
SLIDE 4

Pa Part 1: Taxonomy database generation

  • Run script to generate the taxaJson.dat file
  • The name is misleading - it is not JSON but actually Sereal-

encoded, Zstd compressed Perl hash reference built up from NCBI taxonomy files. Code highlights

  • Download taxonomy data from NCBI
  • Construct edges to child-nodes from taxonomy data
  • Sereal encoding + Zstd compression of Perl hash-reference
  • https://github.com/rsoftone/virmap/tree/master/10-

helper-scripts

slide-5
SLIDE 5

Pa Part 2: Accession -> > GI lookup

  • Downloads a table that converts NCBI accession to GI accession
  • NCBI accession: NM_000202.5
  • GI accession: 262118207

FASTA format header:

  • blue = fields that are contained in the NCBI Genbank feature table.
  • yellow = not currently in Genbank’s feature table and need to be obtained via Accession -> GI lookup
  • Information obtained by correspondence with VirMap authors.

Nucleotide:

  • >gi|2765944|gb|Z99337.1|.Human.immunodeficiency.virus.1.isolate.C49.DNA.for.reverse;taxId=11676

Protein:

  • >GI|GI:817036220|KR080198.1|AKF14503.1|GI:817036221|hypothetical.protein|Mycobacterium.phage.Ca

mbiare;pos=46..537;codonStart=1;taxId=1647305

https://github.com/rsoftone/virmap/tree/master/10-helper-scripts

slide-6
SLIDE 6

Pa Part 3: Genbank division download

  • https://github.com/rsoftone/virmap/tree/master/referencedbs
  • Download aspera fast data transfer (ascp) utility
  • List files served by GenBank FTP server
  • Downloading required GenBank divisions
  • The GenBank database is divided into 18 divisions. For example:
  • PRI - primate sequences
  • VRT - other vertebrate sequences
  • PLN - plant, fungal, and algal sequences
  • BCT - bacterial sequences
  • VRL - viral sequences
slide-7
SLIDE 7

Ot Other Steps s for Setting g the Database se

  • Part 4: Nucleotide .fasta generation using Gadi PBS script
  • Part 5: Protein .fasta generation using Gadi PBS script
  • Part 6: Build BBmap virus database (reads mapping)
  • Part 7: Build Diamond virus database (protein sequence DB search)
  • Part 7.5: Build Diamond genbank database (gbBlastx)
  • Part 8: Build Kraken2 database – find lowest common ancestor (LCA)
  • f all genomes containing the short sequence (k-mer), assign

taxonomy to sequence

  • Part 9: Build blastn genbank database (gbBlastn)
slide-8
SLIDE 8

Con Configure virma rmap_con

  • nfig.sh in the

## Under $TMPDIR/virmap/ virmap_config.sh ## File path, used as the value for the --gbBlastx parameter GB_BLASTX="/g/data1a/projectID/VirMap/200625-gbblastx.dmnd" ## Partial file path, used as the value for the --gbBlastn parameter GB_BLASTN="/g/data1a/projectID/VirMap/191205-gbBlastn/191205-gbBlastn" ## Folder path, used as the value for the --virBbmap parameter VIR_BBMAP="/g/data1a/projectID/VirMap/191205-virBbmap" ## File path, used as the value for the --virDmnd parameter VIR_DMND="/g/data1a/projectID/VirMap/200625-virdiamond.dmnd" ## File path, used as the value for the --taxaJson parameter TAXA_JSON="/g/data1a/projectID/VirMap/200117-taxaJson.dat"

slide-9
SLIDE 9

Run Vi VirMap Wrapper Script (part 1)

# Activate Conda Environment INSTALL_DIR=$TMPDIR/virmap MINICONDA_DIR=$TMPDIR/miniconda3 export TMPDIR= /g/data/projectID/userID/tmp mkdir -p/path/to/store/output cd /path/to/store/output cp $INSTALL_DIR/activate.sh /path/to/store/output source activate.sh

slide-10
SLIDE 10

Run Vi VirMap Wrapper Script (part 2)

qsub -P projectID -q normal -l walltime=6:00:00,\ mem=48G,ncpus=12,wd,jobfs=100GB,storage=scratch/projectID +gdata/projectID -joe -- \ $INSTALL_DIR/virmap_wrapper.sh \

  • -readUnpaired "/path/to/my/sample/.fastq" \
  • -sampleName "sample_name"
slide-11
SLIDE 11

Vi VirMap Output

  • Final Output file (e.g. nCoV_228_S83_L001_R1_001.final.fa)
  • Fasta file contained list of all assembled virus genomes and taxonomy

information, if available.

  • Example fasta header:

>gi|1369125429|gb|MG772934.1|Bat.SARS-like.coronavirus.isolate.bat-SL- CoVZXC21,.complete.genome.,...;taxId=1508227;length=29566;size=53956; maxScore=9709;maxAlign=9964;maxPossibleScore=15495;protScore=9709; maxProtAlign=9964;nuclScore=0;maxNuclAlign=0;letterOccupancy=71.54;hi ghDivergence=1

slide-12
SLIDE 12

Ag Aggregate Results

  • Give the script a list of results directories and it will

aggregate data from running many samples into one Excel file

  • Whether the run is successful and returned usable results
  • For each sample
  • List of viruses, listed by NCBI taxonomy ID
  • The taxonomy classification hierarchy
  • De-duplicate read counts

https://github.com/rsoftone/virmap/tree/master/aggreg_stats

slide-13
SLIDE 13

Ag Aggregate number of errors for each sample

slide-14
SLIDE 14

Wa Walltimes for running each step for individual sa sample

slide-15
SLIDE 15

Duration for Each Step

12 cores and 48 Gb Ram 24 cores and 112 Gb RAM

slide-16
SLIDE 16

Ru Run Statistics

  • 94 samples from 93 individuals
  • Two of the 94 samples failed our first attempt at running VirMap
  • 8 samples failed de-duplication but still produced results
  • The total amount of SU used by the 92 samples is 5,894 SU
  • median of 62.5 SU ±2.5 SU standard error (se) per sample
  • Walltime:
  • 1:37 ±4 min se. for 12 cores and 48 Gb RAM, and
  • 1:22 ± 2.3 min. se for 24 cores and 112 Gb RAM
slide-17
SLIDE 17

Ge Get t de-dupl duplicated ed rea ead d coun unts

  • In a table format. Columns include:
  • NCBI tax id (e.g. 1508227)
  • Size = number of deduplicated reads (e.g. 408)
  • Weak or strong match, whether the sequence can be assigned to a

taxa, potential for mis-annotation etc…

  • Taxonomy lineage in text. For example:
  • Viruses -> Riboviria -> Nidovirales -> Cornidovirineae -> Coronaviridae
  • > Orthocoronavirinae -> Betacoronavirus -> Sarbecovirus -> Severe

acute respiratory syndrome-related coronavirus -> Bat SARS-like coronavirus

slide-18
SLIDE 18

Hea Heat ma map of 94 94 sam sample les s ve versus vir viruse ses s id identifie ified – pr prelim. . re results

Coronaviruses

slide-19
SLIDE 19

Summary Results

  • 63/92 (68%) samples had

coronavirus of some sort (2019 reference database) as detected by VirMap

  • 81/92 (88%) samples had
  • ther viruses that are not

coronaviruses or phage.

  • 55/92 (59.8%) samples had

co-infection of Coronavirus and other viruses.

  • Endogenous retrovirus K,

found, in 89/92 samples, not shown in heat map

Coronaviruses

slide-20
SLIDE 20

Re Re-map Reads to Vi Viral Sequences to get Read Cou Counts

  • Get read counts for each virus
  • Remove all reads that mapped to
  • The human genome
  • Bacterial Genbank division genomes
  • Bacteriophage Genbank division genomes
  • Map all remaining reads to the *.final.fa virus genomes assembled from the sample
  • Remove multi-mapped reads
  • Get read counts
  • Usage: ./get_read_counts.sh -g human_genome_bwa -p phage_bwa -b bacteria_bwa

virmap_output.final.fa virmap_input_sequence.fq

  • https://github.com/rsoftone/virmap/blob/master/get_read_counts.sh
slide-21
SLIDE 21

Av Availability on Other Platforms

  • Katana
  • Duncan has attempted to install VirMap
  • Differences in CentOS version (Katana v6, Gadi v7)
  • Conda dependencies in Katana cannot be resolved
  • Microsoft Azure
  • VirMap will install in an Ubuntu instance
  • Need to transfer data into blob storage and transfer output back into blob storage
  • Around $300 test Credits for UNSW staff and students
  • Need to submit request for higher capacity machine, contact support.
  • Remember to switch off the instance when you don’t need it
  • Amazon
  • Original VirMap Github currently has instructions on how to install pipeline and

reference DB in Amazon. (Was not available before in 2019.)

slide-22
SLIDE 22

Futur Future e Works

  • Comparison to other tools (e.g. IDSeq and Genome Detective)
  • Analyze more samples wit VirMap
  • Update sequence database to 2020 version on Gadi (for COVID19

analysis)

  • Possible developments:
  • Option to replace for BlastN with a faster tool MMseqs2

(https://github.com/soedinglab/MMseqs2)

  • Re-write parts of VirMap in a workflow scripting language (e.g.

Snakemake or Nextflow)

slide-23
SLIDE 23

Acknowledgements

  • Vi

Virology Research h Labo boratory:

  • Prof. Maria Craig.
  • Prof. William Rawlinson
  • Sonia Isaacs (PhD)
  • Emily Gibson & Cynthia Lau (RA)
  • Clare Faulkner, Dylan Foskett, Jessica Lau (Students)
  • Coordinators: Jacki, Deb, Susan, Kelly, Jess
  • UN

UNSW:

  • Prof. Marc Wilkins
  • Dr. Ignatius Pang
  • A/Prof. Rowena Bull
  • A/Prof. Fabio Luciani
  • UNSW ResTech (Simon, David, Martin, Duncan, Robin)
  • Center for Infection & Immunity:
  • Prof. Ian Lipkin
  • A/Prof. Thomas Briese
  • EN

ENDIA Study:

  • Prof. Jennifer Couper (CI)
  • All investigators
  • Study coordinators
  • Study team
  • Microsoft (Andreas Wilm & Auda Eltahla)
  • Mo

Mothers rs & Childre ren support rting ENDIA

  • National Collaborative Research Infrastructure Strategy

(NCRIS) and Bioplatforms Australia

Serology & UNSW ResTech

slide-24
SLIDE 24

Ta Taxonomy data from NCBI

Notes from Simon. Downloading and unzipping should produce the files: names.dmp nodes.dmp You'll need the following fields from each file, based on the VirMap krakenFilter.pl, which imports the following columns ($parents, $names, $ranks, $children) nodes.dmp tax_id

  • - node id in GenBank taxonomy database. The ‘children’ needs to be derived from reverse of ‘parents’, so some code is

needed to perform this. parent tax_id

  • - parent node id in GenBank taxonomy database (I think this corresponds to the 'parents' field)

rank

  • - rank of this node (superkingdom, kingdom, ... (I think this corresponds to the 'ranks' field)

names.dmp tax_id

  • - the id of node associated with this name (use this field to merge with the nodes.dmp file)

name_txt

  • - name itself (I think this corresponds to the ‘names’ field).