Short Reads Alignment to a Reference Genome, Joanna Krupka, CRUK



slide-1
SLIDE 1

Short Reads Alignment to a Reference Genome

Joanna Krupka


CRUK Summer School in Bioinformatics Cambridge, July 2020

slide-2
SLIDE 2

Shotgun Sequencing and sequence assembly approaches

Commins J. et al, Biol Proced Online 11(1) 2015

De novo assembly

  • Recreate the genome with no prior knowledge
  • Problem with repeated regions; high coverage and long reads required

Mapping to reference sequence

  • Recreate the genome using prior knowledge as reference
  • Mapping is only as good as the reference used

slide-3
SLIDE 3

Mappability

Mappability (or uniqueness) is a measure of how well short reads can be aligned to a unique location in the reference genome.

Repeat regions: mapping is uncertain if the reads are shorter than a repeat region.

Rozowsky J. et al. Nat Biotechnol 2009
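The effect of read length on uniqueness can be illustrated with a minimal Python sketch. It assumes a made-up toy reference and scores each position by how often the k-mer starting there occurs; the function names are hypothetical, and real mappability tracks also account for the reverse strand and mismatches, which this sketch ignores.

```python
from collections import Counter


def kmer_mappability(reference: str, k: int) -> list[float]:
    """Score each start position as 1 / (occurrences of its k-mer in the reference)."""
    kmers = [reference[i:i + k] for i in range(len(reference) - k + 1)]
    counts = Counter(kmers)
    return [1.0 / counts[kmer] for kmer in kmers]


def unique_fraction(reference: str, k: int) -> float:
    """Fraction of start positions whose k-mer occurs exactly once."""
    scores = kmer_mappability(reference, k)
    return sum(score == 1.0 for score in scores) / len(scores)


# Toy reference (made up): a CAG repeat flanked by unique sequence.
reference = "TTCAGCAGCAGGA"

for k in (3, 9):
    print(f"k={k}: {unique_fraction(reference, k):.2f} of positions are unique")
```

With reads of length 3, most positions fall inside the repeat and are ambiguous; with reads of length 9, every position in this toy reference is unique.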

slide-4
SLIDE 4

Short sequence mapping tools

More than 80 different mappers

https://www.ecseq.com/support/ngs/what-is-the-best-ngs-alignment-software

slide-5
SLIDE 5

Short sequence mapping tools

Not splice aware: Bowtie2, BWA

  • e.g. Whole Genome Sequencing

Splice aware: STAR, TopHat2, HISAT2

  • e.g. RNA-Seq

Reference for RNA-Seq: reference genome plus annotations with exon genomic coordinates. Alternatively: reference transcriptome.

slide-6
SLIDE 6

Short sequence mapping tools

Not splice aware: Bowtie2, BWA

  • e.g. Whole Genome Sequencing, ChIP-Seq

Splice aware: STAR, TopHat2, HISAT2

  • e.g. RNA-Seq

Reference for RNA-Seq: reference genome plus annotations with exon genomic coordinates. Alternatively: reference transcriptome.

slide-7
SLIDE 7

ENCODE: encyclopedia of DNA elements

https://www.encodeproject.org

The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, employing a variety of assays and techniques.

slide-8
SLIDE 8

Annotations: GTF/GFF file

Resources: RefSeq

Gencode vs. Ensembl

GENCODE annotation is made by merging the manual gene annotation produced by the Ensembl-Havana team and the Ensembl-genebuild automated gene annotation.

  • The gene annotation is the same in both files. The only exception is that genes common to the human chromosome X and Y PAR regions appear twice in the GENCODE GTF, while the Ensembl file shows them only for chromosome X.
  • The GENCODE GTF also contains APPRIS tags, and its annotations are on the reference chromosomes only.

Always make sure that the annotations match the genome FASTA file (the same version and source).

slide-9
SLIDE 9

Short sequence mapping tools

Not splice aware: Bowtie2, BWA

  • e.g. Whole Genome Sequencing, ChIP-Seq

Splice aware: STAR, TopHat2, HISAT2

  • e.g. RNA-Seq

Reference for RNA-Seq: reference genome plus annotations with exon genomic coordinates. Alternatively: reference transcriptome.

Pseudo-aligners

slide-10
SLIDE 10

Annotations: GTF/GFF file

[Screenshot of a GTF file: header lines, followed by one feature per new line]

Feature type: {gene, transcript, exon, CDS, UTR, start_codon, stop_codon}

[Figure: gene model showing transcript/gene, exon, 5'UTR, start_codon, CDS, stop_codon, 3'UTR]

slide-11
SLIDE 11

Annotations: GTF/GFF file

[Screenshot of a GTF file: header lines, followed by one feature per new line]

Fields labelled on each line:

  • Genomic coordinates
  • Annotation source
  • Strand
  • Additional information: gene id, transcript id, gene type, gene status, gene name, transcript type, transcript status, exon number, exon id, level
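The fields listed above can be pulled out of a single GTF line with a short Python sketch. It assumes the standard nine tab-separated GTF columns; the example record is illustrative (written in GENCODE style), and `parse_gtf_line` is a hypothetical helper, not part of any library.

```python
def parse_gtf_line(line: str) -> dict:
    """Split one GTF feature line into its nine columns and parse the attributes."""
    fields = line.rstrip("\n").split("\t")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        # Keys that repeat on a line (e.g. "tag") keep only the last value here.
        attrs[key] = value.strip('"')

    return {
        "seqname": seqname,      # chromosome or contig
        "source": source,        # annotation source
        "feature": feature,      # gene, transcript, exon, CDS, ...
        "start": int(start),     # 1-based, inclusive
        "end": int(end),         # 1-based, inclusive
        "strand": strand,
        "attributes": attrs,     # additional information
    }


# Illustrative record in GENCODE style (values are for demonstration only).
example = (
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
    'gene_name "DDX11L1"; exon_number 1; level 2;'
)

record = parse_gtf_line(example)
print(record["feature"], record["seqname"], record["start"], record["end"])
print(record["attributes"]["gene_name"])
```

Header lines in a GTF file start with `#` and should be skipped before calling the parser.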

slide-12
SLIDE 12

Pseudo-aligners

Salmon, Sailfish, Kallisto

  • Quantification estimates rather than base-to-base alignment
  • Can model sequencing bias, e.g. GC bias, fragment length
  • Can handle multi-mapping reads
  • Faster
  • Improved accuracy at the transcript level

Zhang, C., Zhang, B., Lin, L. L., & Zhao, S. (2017). Evaluation and comparison of computational tools for RNA-seq isoform quantification. BMC Genomics, 18(1), 1–11.
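The idea of estimating compatibility instead of aligning base by base can be shown with a toy Python sketch. This is a simplified illustration under stated assumptions, not how Salmon, Sailfish or Kallisto are implemented: the transcripts are made up, the function names are hypothetical, and a read is simply assigned the set of transcripts that share all of its known k-mers.

```python
from collections import defaultdict


def build_kmer_index(transcripts: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def compatible_transcripts(read: str, index: dict[str, set[str]], k: int) -> set[str]:
    """Intersect the transcript sets of all k-mers of the read found in the index."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if not hits:
            continue  # k-mer absent from every transcript, e.g. a sequencing error
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Two made-up isoforms sharing their first 9 bases.
transcripts = {
    "tx1": "ACGTACCGGTTAGC",
    "tx2": "ACGTACCGGAATCC",
}
index = build_kmer_index(transcripts, k=5)

shared_read = "CGTACCG"     # lies in the shared region
specific_read = "CCGGTTAG"  # extends into the tx1-specific part

print(compatible_transcripts(shared_read, index, k=5))
print(compatible_transcripts(specific_read, index, k=5))
```

The first read is compatible with both isoforms (a multi-mapping read), the second only with tx1. Real tools then use such compatibility sets to estimate transcript abundances statistically.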

slide-13
SLIDE 13

Before you align checklist & standard workflow

  • Do I need a splice-aware aligner?
  • Am I using the right genome version (e.g. hg38 for human, mm10 for mouse)?
  • Do the annotations match the reference genome?
  • Read the manual, select parameters, check default settings

Standard alignment workflow

  • Once per genome: reference genome (FASTA) + annotations (GTF, optional) → genome index
  • Alignment: sequenced reads (FASTQ) + genome index → aligned reads (BAM)
  • Pseudo-alignment: sequenced reads (FASTQ) → transcript abundance
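The two alignment steps of the workflow (index once per genome, then align each sample) can be sketched in Python for STAR as one possible aligner. The sketch only builds and prints the command lines; it does not run them. All file and directory names are hypothetical placeholders, and parameters such as the thread count should be chosen after reading the STAR manual.

```python
import shlex


def star_index_cmd(genome_dir, fasta, gtf=None, threads=4):
    """Build the STAR command that generates a genome index (run once per genome)."""
    cmd = [
        "STAR", "--runMode", "genomeGenerate",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--genomeFastaFiles", fasta,
    ]
    if gtf is not None:
        # Annotations are optional but must match the FASTA version and source.
        cmd += ["--sjdbGTFfile", gtf]
    return cmd


def star_align_cmd(genome_dir, fastq_files, out_prefix, threads=4):
    """Build the STAR command that aligns FASTQ reads and writes a sorted BAM."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


# Hypothetical paths, for illustration only.
index_cmd = star_index_cmd("star_index", "genome.fa", gtf="annotation.gtf")
align_cmd = star_align_cmd("star_index", ["sample_R1.fastq", "sample_R2.fastq"], "sample_")

print(shlex.join(index_cmd))
print(shlex.join(align_cmd))
```

The printed commands can be pasted into a terminal or passed to `subprocess.run(cmd, check=True)`. Gzipped FASTQ files additionally need STAR's `--readFilesCommand` option.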

slide-14
SLIDE 14

Coverage and Depth

The average depth of sequencing coverage can be defined theoretically as LN/G, where L is the read length, N is the number of reads and G is the haploid genome length.

Example: if we sequence a genome with a total length of 100 nucleotides and obtain 500 reads of 25 nucleotides each, the average depth of sequencing is 25 × 500 / 100 = 125.

Coverage: average number of reads of a given length that align to a given region.

Depth: redundancy of coverage, or the total number of bases sequenced and aligned at a given reference position.

Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: Key considerations in genomic analyses. Nature Reviews Genetics, 15(2),
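The LN/G formula can be checked with a few lines of Python. The sketch reuses the numbers from the slide example; the function names are hypothetical, and the inverse calculation (reads needed for a target depth) is just the same formula rearranged.

```python
import math


def average_depth(read_length: int, n_reads: int, genome_length: int) -> float:
    """Theoretical average depth: L * N / G."""
    return read_length * n_reads / genome_length


def reads_needed(target_depth: float, read_length: int, genome_length: int) -> int:
    """Rearranged formula: N = depth * G / L, rounded up to whole reads."""
    return math.ceil(target_depth * genome_length / read_length)


# Slide example: G = 100 nt, N = 500 reads, L = 25 nt.
print(average_depth(read_length=25, n_reads=500, genome_length=100))
print(reads_needed(target_depth=125, read_length=25, genome_length=100))
```

This is a theoretical average; the depth actually observed at a given position varies with mappability and protocol biases.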

slide-15
SLIDE 15

Mapping quality check

  • SAMstat: a C program that plots nucleotide overrepresentation and other statistics in mapped and unmapped reads and helps understand the relationship between potential protocol biases and poor mapping
  • Log files returned by the aligner, e.g. the Log.final.out file from STAR
  • FastQC
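Aligner log files are easy to check programmatically. The Python sketch below parses text in the `key | value` layout used by STAR's Log.final.out. The sample log is made up for illustration, and the field names follow a typical STAR log but should be checked against your own file; `parse_star_log` is a hypothetical helper.

```python
# Made-up excerpt in the layout of a STAR Log.final.out file.
SAMPLE_LOG = """\
                          Number of input reads |\t1000000
                                   UNIQUE READS:
                   Uniquely mapped reads number |\t900000
                        Uniquely mapped reads % |\t90.00%
             % of reads mapped to multiple loci |\t5.00%
"""


def parse_star_log(text: str) -> dict[str, str]:
    """Collect 'key | value' pairs; section headers without '|' are skipped."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


stats = parse_star_log(SAMPLE_LOG)
unique_pct = float(stats["Uniquely mapped reads %"].rstrip("%"))
print(f"Uniquely mapped: {unique_pct:.1f}% of {stats['Number of input reads']} reads")
```

To use it on a real run, read the file with `open("Log.final.out").read()` and pass the text to `parse_star_log`. A low uniquely mapped percentage is a prompt to revisit the pre-alignment checklist (genome version, annotations, aligner choice).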

slide-16
SLIDE 16


Let’s practice!