NGI-RNAseq: Processing RNA-seq data at the National Genomics Infrastructure
Phil Ewels (phil.ewels@scilifelab.se)
NBIS RNA-seq tutorial, 2017-11-09
NGI Stockholm

SciLifeLab NGI
Our mission is to offer a state-of-the-art infrastructure for massively parallel DNA sequencing and SNP genotyping, available to researchers all over Sweden.

SciLifeLab NGI
• National resource
• State-of-the-art infrastructure
• Guidelines and support
We provide guidelines and support for sample collection, study design, protocol selection and bioinformatics analysis.

NGI Organisation
• Nodes: NGI Stockholm, NGI Uppsala
• Diagram labels: Reagent costs, User fees, Funding, Premises and service, Staff salaries, Capital equipment contracts, Host universities, SciLifeLab, VR, KAW

Project timeline
• Sample QC
• Library preparation, Sequencing, Genotyping
• Data processing and primary analysis
• Scientific support and project consultation
• Data delivery

Methods offered at NGI
• Exome sequencing
• Whole Genome seq
• de novo
• RNA-seq
• ChIP-seq
• ATAC-seq
• Bisulphite sequencing
• RAD-seq
• Metagenomics
• Nanopore sequencing
• Just Sequencing
Legend: Accredited methods; Data analysis included for FREE

RNA-Seq: NGI Stockholm
• RNA-seq is the most common project type
# Projects in 2016
    RNA-Seq            131
    WG Re-Seq          110
    De-Novo             72
    Targeted Re-Seq     25
    Metagenomics        19
    ChIP-Seq             9
    Epigenetics          6
    RAD Seq              1

RNA-Seq: NGI Stockholm
• RNA-seq is the most common project type
• Production protocols:
  • TruSeq (poly-A)
  • RiboZero
• In development:
  • SMARTer Pico
  • RNA Access
# Samples in 2016
    RNA-Seq            6,048
    WG Re-Seq          4,006
    De-Novo              306
    Targeted Re-Seq    5,153
    Metagenomics       1,482
    ChIP-Seq             244
    Epigenetics           33
    RAD Seq              288

RNA-Seq Pipeline (NGI-RNAseq)
• Takes raw FastQ sequencing data as input
• Provides range of results
  • Alignments (BAM)
  • Gene counts (Counts, FPKM)
  • Quality Control
• First RNA Pipeline running since 2012
• Second RNA Pipeline in use since April 2017
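
The FPKM values listed among the results normalise raw counts for gene length and sequencing depth. The sketch below shows the textbook definition only; the function name and the example numbers are illustrative assumptions, and StringTie estimates abundances with its own model, so its reported values need not match this simple per-gene calculation.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # count / (length in kb) / (library size in millions)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


# Hypothetical gene: 500 fragments, 2,000 bp long, 20 million mapped fragments
example = fpkm(500, 2000, 20_000_000)
print(example)
```

Doubling the gene length or the library size halves the value, which is why FPKM is easier to compare between genes and samples than raw counts.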

RNA-Seq Pipeline (NGI-RNAseq)
    Tool             Step
    FastQC           Sequence QC
    TrimGalore!      Read trimming
    STAR             Alignment
    dupRadar         Duplication QC
    featureCounts    Gene counts
    StringTie        Normalised FPKM
    RSeQC            Alignments QC
    Preseq           Library complexity
    edgeR            Heatmap, clustering
    MultiQC          Reporting
File formats along the way: FastQ, BAM, TSV, HTML

Nextflow
• Tool to manage computational pipelines
• Handles interaction with compute infrastructure
• Easy to learn how to run, minimal oversight required

Nextflow
https://www.nextflow.io/

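Nextflow
    #!/usr/bin/env nextflow
    // fromFilePairs emits one (sample name, [read files]) tuple per sample
    input = Channel.fromFilePairs( params.reads )
    process fastqc {
        input:
        // unpack the tuple so that $reads holds the FastQ files
        set val(name), file(reads) from input
        output:
        file "*_fastqc.{zip,html}" into results
        script:
        """
        fastqc -q $reads
        """
    }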

Nextflow
Script: the same fastqc process as on the previous slide.
Default: run locally, assume software is installed.
Config: submit jobs to the SLURM queue and use environment modules.
    process {
        executor = 'slurm'
        clusterOptions = { "-A b2017123" }
        cpus = 1
        memory = 8.GB
        time = 2.h
        $fastqc {
            module = ['bioinfo-tools', 'FastQC']
        }
    }

Nextflow
Script: the same fastqc process as on the previous slides.
Config: run locally, use a Docker container for all software dependencies.
    docker {
        enabled = true
    }
    process {
        container = 'biocontainers/fastqc'
        cpus = 1
        memory = 8.GB
        time = 2.h
    }
Config: the SLURM alternative, for comparison.
    process {
        executor = 'slurm'
        clusterOptions = { "-A b2017123" }
        cpus = 1
        memory = 8.GB
        time = 2.h
        $fastqc {
            module = ['bioinfo-tools', 'FastQC']
        }
    }

NGI-RNAseq
https://github.com/SciLifeLab/NGI-RNAseq

Running NGI-RNAseq
Step 1: Install Nextflow
• UPPMAX: load the Nextflow module
    module load nextflow
• Anywhere (including UPPMAX): install Nextflow
    curl -s https://get.nextflow.io | bash
Step 2: Try running the NGI-RNAseq pipeline
    nextflow run SciLifeLab/NGI-RNAseq --help

Running NGI-RNAseq
Step 3: Choose your reference
• Common organism: use iGenomes
    --genome GRCh37
• Custom genome: Fasta + GTF (minimum)
    --fasta genome.fa --gtf genes.gtf
Step 4: Organise your data
• One (if single-end) or two (if paired-end) FastQ files per sample
• Everything in one directory, simple filenames help!

Running NGI-RNAseq
Step 5: Run the pipeline on your data
• Remember to run detached from your terminal: screen / tmux / nohup
Step 6: Check your results
• Read the Nextflow log and check the MultiQC report
Step 7: Delete temporary files
• Delete the ./work directory, which holds all intermediates

Typical pipeline output

Using UPPMAX
    nextflow run SciLifeLab/NGI-RNAseq \
        --project b2017123 \
        --genome GRCh37 \
        --reads "data/*_R{1,2}.fastq.gz"
• Default config is for UPPMAX
• Knows about central iGenomes references
• Uses centrally installed software

Using other clusters
    nextflow run SciLifeLab/NGI-RNAseq \
        -profile hebbe \
        --fasta genome.fa --gtf genes.gtf \
        --reads "data/*_R{1,2}.fastq.gz"
• Can run just about anywhere
• Supports local, SGE, LSF, SLURM, PBS/Torque, HTCondor, DRMAA, DNAnexus, Ignite, Kubernetes

Using Docker
    nextflow run SciLifeLab/NGI-RNAseq \
        -profile docker \
        --fasta genome.fa --gtf genes.gtf \
        --reads "data/*_R{1,2}.fastq.gz"
• Can run anywhere with Docker
• Downloads required software and runs in a container
• Portable and reproducible

Using AWS
    nextflow run SciLifeLab/NGI-RNAseq \
        -profile aws \
        --genome GRCh37 \
        --reads "s3://my-bucket/*_{1,2}.fq.gz" \
        --outdir "s3://my-bucket/results/"
• Runs on the AWS cloud with Docker
• Pay-as-you-go, flexible computing
• Can launch from anywhere with minimal configuration

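Input data
    ERROR ~ Cannot find any reads matching: XXXX
    NB: Path needs to be enclosed in quotes!
    NB: Path requires at least one * wildcard!
    If this is single-end data, please specify --singleEnd on the command line.
Examples that work:
    --reads '*_R{1,2}.fastq.gz'             (paired-end)
    --reads '*.fastq.gz' --singleEnd        (single-end)
Examples that trigger the error:
    --reads sample.fastq.gz                 (no quotes, no * wildcard)
    --reads *_R{1,2}.fastq.gz               (no quotes, so the shell expands the pattern first)
    --reads '*.fastq.gz'                    (single-end data without --singleEnd)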
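
Before launching a run, it can help to check that a paired-end pattern really groups the files as expected. The sketch below is a rough Python approximation of that grouping for the pattern '*_R{1,2}.fastq.gz'; it is not Nextflow's own implementation, and the function names and file names are hypothetical.

```python
import re
from collections import defaultdict

# Regex equivalent of the glob '*_R{1,2}.fastq.gz' (illustrative only)
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def group_read_pairs(filenames):
    """Group FastQ file names by sample prefix; names that do not match are ignored."""
    pairs = defaultdict(list)
    for name in sorted(filenames):
        match = PAIR_PATTERN.match(name)
        if match:
            pairs[match.group("sample")].append(name)
    return dict(pairs)


def incomplete_samples(pairs):
    """Return the samples that do not have exactly two files (R1 and R2)."""
    return sorted(sample for sample, files in pairs.items() if len(files) != 2)


# Hypothetical directory listing
files = ["ctrl_R1.fastq.gz", "ctrl_R2.fastq.gz", "treated_R1.fastq.gz", "notes.txt"]
pairs = group_read_pairs(files)
print(pairs)
print(incomplete_samples(pairs))
```

In this made-up listing the sample "treated" has no R2 file, which is the kind of problem worth catching before the pipeline starts.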

Read trimming
• Pipeline runs TrimGalore! to remove adapter contamination and low quality bases automatically
• Some library preps also include additional adapters
  • Will get poor alignment rates without additional trimming
    --clip_r1 [int]
    --clip_r2 [int]
    --three_prime_clip_r1 [int]
    --three_prime_clip_r2 [int]
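
The four clipping options remove a fixed number of bases from the 5' end (--clip_r1, --clip_r2) or the 3' end (--three_prime_clip_r1, --three_prime_clip_r2) of each read. The Python sketch below illustrates the idea on a bare sequence string; the function name and the example read are hypothetical, and real trimming works on FastQ records, clips the quality string as well, and is combined with adapter and quality trimming.

```python
def clip_read(sequence, five_prime_clip=0, three_prime_clip=0):
    """Remove a fixed number of bases from the 5' and 3' ends of a read."""
    if five_prime_clip < 0 or three_prime_clip < 0:
        raise ValueError("clip lengths must be non-negative")
    end = len(sequence) - three_prime_clip
    # max() keeps the slice empty, not negative, when clipping exceeds the read length
    return sequence[five_prime_clip:max(end, five_prime_clip)]


read = "NNNACGTACGTTT"  # hypothetical 13 bp read
print(clip_read(read, five_prime_clip=3))
print(clip_read(read, three_prime_clip=3))
```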

Library strandedness
• Most RNA-seq data is strand-specific now
• Can be "forward-stranded" (same as transcript) or "reverse-stranded" (opposite to transcript)
• UPPMAX config runs as reverse stranded by default
• If wrong, QC will say most reads don't fall within genes
    --forward_stranded
    --reverse_stranded
    --unstranded
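
To see why the wrong strandedness setting makes reads appear to fall outside genes, consider how a read is matched to a gene. The Python sketch below assumes a single-end read (or read 1 of a pair); the function name and the string labels are hypothetical and do not correspond to any tool's API.

```python
def read_matches_gene(read_strand, gene_strand, library):
    """Return True if a read's alignment strand is compatible with the gene's strand."""
    if read_strand not in ("+", "-") or gene_strand not in ("+", "-"):
        raise ValueError("strands must be '+' or '-'")
    if library == "unstranded":
        return True  # strand information is ignored
    if library == "forward":
        return read_strand == gene_strand  # read has the same strand as the transcript
    if library == "reverse":
        return read_strand != gene_strand  # read is opposite to the transcript
    raise ValueError(f"unknown library type: {library}")


# A reverse-stranded read from a '+' gene aligns to the '-' strand
print(read_matches_gene("-", "+", "reverse"))
print(read_matches_gene("-", "+", "forward"))
```

With the correct setting the read is assigned to the gene; with the opposite setting the same read is rejected, so most reads end up unassigned.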

Lib-prep presets
• There are some presets for common kits
• Clontech SMARTer PICO
  • Forward stranded; needs 3 bp trimmed from the 5' end of R1 and 3 bp from the 3' end of R2
    --pico
• Please suggest others!

Saving intermediates
• By default, the pipeline doesn't save some intermediate files to your final results directory
  • Reference genome indices that have been built
    --saveReference
  • FastQ files from TrimGalore!
    --saveTrimmed
  • BAM files from STAR (we have BAMs from Picard)
    --saveAlignedIntermediates

Resuming pipelines
• If something goes wrong, you can resume a stopped pipeline
  • Will use cached versions of completed processes
  • NB: Only one hyphen!
    -resume
• Can resume specific past runs
  • Use nextflow log to find job names
    -resume job_name

Customising output
    -name               Give a name to your run. Used in logs and reports
    --outdir            Specify the directory for saved results
    --aligner hisat2    Use HISAT2 instead of STAR for alignment
    --email             Get e-mailed a summary report when the pipeline finishes

Nextflow config files
• Can save a config file with defaults
• Anything with two hyphens is a param
Config file locations:
    ./nextflow.config
    ~/.nextflow/config
    -c /path/to/my.config
Example:
    params {
        email = 'phil.ewels@scilifelab.se'
        project = "b2017123"
    }
    process.$multiqc.module = []
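
When the same param is set in more than one place, Nextflow applies the more specific source: values given on the command line override config files, and ./nextflow.config overrides ~/.nextflow/config. The Python sketch below only illustrates that layering idea with plain dictionaries; the function name and the outdir values are hypothetical, and it is not how Nextflow itself parses configuration.

```python
def resolve_params(*layers):
    """Merge parameter dictionaries; later layers override earlier ones."""
    resolved = {}
    for layer in layers:
        resolved.update(layer)
    return resolved


home_config = {"email": "phil.ewels@scilifelab.se"}              # ~/.nextflow/config
project_config = {"project": "b2017123", "outdir": "./results"}  # ./nextflow.config
command_line = {"outdir": "./results_run2"}                      # --outdir on the command line

params = resolve_params(home_config, project_config, command_line)
print(params)
```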
