File Types in Bioinformatics 2017-11-28 Martin Dahlö




File Types in Bioinformatics

2017-11-28 Martin Dahlö martin.dahlo@scilifelab.uu.se Valentin Georgiev valentin.georgiev@icm.uu.se Jacques Dainat jacques.dainat@nbis.se


http://xkcd.com

  • Overwhelming at first
  • Overview

○ FASTA – reference sequences
○ FASTQ – reads in raw form
○ SAM – aligned reads
○ BAM – compressed SAM file
○ CRAM – even more compressed SAM file
○ GTF/GFF/BED – annotations


FASTA

  • Used for: nucleotide or peptide sequences
  • Simple structure

>header
sequence
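
A minimal sketch of reading this structure in Python, assuming each record is a ">" header line followed by one or more sequence lines. The function name `parse_fasta` and the example records are made up for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


# Hypothetical example data: one nucleotide and one peptide record.
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKV
"""
records = parse_fasta(example.splitlines())
print(records)
```

For real data a dedicated library is usually a better choice; this only shows how little structure the format has.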


FASTQ

  • Just like FASTA, but with quality values
  • Used for: raw data from sequencing (unaligned reads)

@header
sequence
+
quality
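
The four-line record can be read with a short loop. This is a sketch that assumes every record is exactly four lines (header, sequence, "+", quality) with no line wrapping; `parse_fastq` and the two reads are hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) for each 4-line FASTQ record."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


# Hypothetical example data: two reads of four bases each.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"
reads = list(parse_fastq(example.splitlines()))
print(reads)
```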


FASTQ

  • Quality: Phred score 0-40 (up to 41 in Illumina 1.8+)

○ Higher = better (40 = best on the 0-40 scale)

  • ASCII encoded: one character per base

FASTQ

Phred Quality Score    Error                    Accuracy
10                     1/10 = 10%               90%
20                     1/100 = 1%               99%
30                     1/1000 = 0.1%            99.9%
40                     1/10000 = 0.01%          99.99%
50                     1/100000 = 0.001%        99.999%
60                     1/1000000 = 0.0001%      99.9999%
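
The table follows the relation error = 10^(-Q/10). A small sketch of decoding a quality string and converting scores to error probabilities, assuming the Phred+33 ASCII offset (used by Sanger and Illumina 1.8+); older Illumina files use a different offset, and the function names here are illustrative.

```python
from typing import List


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in qual]


def phred_to_error(q: int) -> float:
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


scores = decode_quality("!5?I")
errors = [phred_to_error(q) for q in scores]
print(scores)
print(errors)
```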


SAM

  • Used for: aligned reads
  • Many columns (11 mandatory fields, tab-separated)

Columns highlighted: read name, chr, start position (bp), sequence, quality
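
To see where those columns sit, here is a sketch that splits one alignment line into the 11 mandatory SAM fields, using the field names from the SAM specification. The alignment line itself is a made-up example, and optional tags after column 11 are ignored.

```python
from typing import Dict, Union

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one alignment line into the 11 mandatory SAM fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, Union[str, int]] = {}
    for name, value in zip(SAM_FIELDS, values):
        record[name] = int(value) if name in INT_FIELDS else value
    return record


# Hypothetical alignment: read1 mapped to chr1 at position 100.
line = "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = parse_sam_line(line)
print(record["QNAME"], record["RNAME"], record["POS"], record["SEQ"], record["QUAL"])
```

Header lines (starting with "@") would need to be skipped before calling this on a real file.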


BAM

  • Binary SAM (compressed)
  • About 25% of the size of the SAM file
  • Use SAMtools to convert between SAM and BAM
  • .bai = BAM index

BAM

  • Reads start out in random order
  • Must be sorted (by coordinate) before indexing

[Figure labels: Chr1 Chr2 Chr3 Chr4 Chr5]
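
The convert, sort, index order can be sketched as three SAMtools calls driven from Python. The file names are placeholders, and the commands only run if `samtools` is installed and the hypothetical input file exists; otherwise the script just prints them.

```python
import os
import shutil
import subprocess
from typing import List


def bam_pipeline(sam_path: str, prefix: str) -> List[List[str]]:
    """Build the SAMtools commands: SAM -> BAM -> sorted BAM -> .bai index."""
    unsorted_bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", unsorted_bam, sam_path],  # convert
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],      # sort by coordinate
        ["samtools", "index", sorted_bam],                         # writes sorted_bam + ".bai"
    ]


# "sample.sam" is a placeholder input file name.
commands = bam_pipeline("sample.sam", "sample")
for cmd in commands:
    print(" ".join(cmd))

# Only execute when the tool and the input are actually available.
if shutil.which("samtools") and os.path.exists("sample.sam"):
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Indexing comes last because the index only works on a coordinate-sorted file.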


CRAM

  • Very complex format
  • Used together with a reference genome

CRAM

  • Quality scores?
  • 3 modes:

○ Lossless
○ Binned
○ No quality


Quality values: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 … 32 33 34 35 36 37 38 39 40 41
Bins: 1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45

=> Reducing the number of quality values increases shared blocks and improves compression.
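
A rough illustration of why binning helps, using bins of width 5 as on the slide. Two things here are assumptions for the sketch: each bin is represented by its lower bound, and zlib stands in for CRAM's real codecs. The quality values are random, not real data.

```python
import random
import zlib


def bin_quality(q: int, width: int = 5) -> int:
    """Map a quality value (>= 1) to the lower bound of its bin: 1-5 -> 1, 6-10 -> 6, ..."""
    return ((q - 1) // width) * width + 1


random.seed(0)
raw = [random.randint(1, 41) for _ in range(10_000)]  # synthetic qualities
binned = [bin_quality(q) for q in raw]

raw_size = len(zlib.compress(bytes(raw)))
binned_size = len(zlib.compress(bytes(binned)))

print("distinct values:", len(set(raw)), "->", len(set(binned)))
print("compressed bytes:", raw_size, "->", binned_size)
```

Fewer distinct symbols means less information per base, so a general-purpose compressor needs fewer bytes.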

  • CRAM is not widespread yet

GTF/GFF/BED

  • Used for: annotations
  • Column structure
  • One line = one feature (match, exon, etc.)

GTF/GFF/BED

BED format:

  • 3-12 columns

3 mandatory fields (chr, start, stop) + 9 optional fields (extra info)

  • Plus optional track definition lines

chr1 213941196 213942363
chr1 213942363 213943530
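
A short sketch of reading the three mandatory columns. BED coordinates are 0-based with an exclusive end, so the feature length is simply stop minus start. The function name `parse_bed3` is illustrative.

```python
from typing import Tuple


def parse_bed3(line: str) -> Tuple[str, int, int]:
    """Return (chr, start, stop) from the first three BED columns."""
    chrom, start, stop = line.split()[:3]
    return chrom, int(start), int(stop)


chrom, start, stop = parse_bed3("chr1 213941196 213942363")
length = stop - start  # 0-based start, exclusive stop
print(chrom, start, stop, length)
```

Note that the second example line starts exactly where the first one stops; with half-open intervals the two features are adjacent and do not overlap.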


GTF/GFF/BED

BED format:

  • optional fields
  • 4. name - Label to be displayed under the feature, if turned on in "Configure this page".
  • 5. score - A score between 0 and 1000.
  • 6. strand - defined as + (forward) or - (reverse).
  • 7. thickStart - coordinate at which to start drawing the feature as a solid rectangle
  • 8. thickEnd - coordinate at which to stop drawing the feature as a solid rectangle
  • 9. itemRgb - an RGB colour value (e.g. 0,0,255). Only used if there is a track line with the value of itemRgb set to "on" (case-insensitive).

  • 10. blockCount - the number of sub-elements (e.g. exons) within the feature
  • 11. blockSizes - the size of these sub-elements
  • 12. blockStarts - the start coordinate of each sub-element

chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
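
The optional columns can be named by position. A sketch, assuming whitespace-separated columns and the UCSC field names (chrom, chromStart, chromEnd for the three mandatory ones); `parse_bed` is a hypothetical helper.

```python
from typing import Any, Dict

BED_FIELDS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb",
              "blockCount", "blockSizes", "blockStarts"]
INT_FIELDS = ("chromStart", "chromEnd", "score", "thickStart", "thickEnd", "blockCount")


def parse_bed(line: str) -> Dict[str, Any]:
    """Name the 3 to 12 columns of a BED line and convert the numeric ones."""
    values = line.split()
    if not 3 <= len(values) <= 12:
        raise ValueError("a BED line has between 3 and 12 columns")
    record: Dict[str, Any] = dict(zip(BED_FIELDS, values))
    for key in INT_FIELDS:
        if key in record:
            record[key] = int(record[key])
    if "itemRgb" in record:
        # "255,0,0" -> (255, 0, 0)
        record["itemRgb"] = tuple(int(part) for part in record["itemRgb"].split(","))
    return record


feature = parse_bed("chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0")
print(feature)
```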


GTF/GFF/BED

BED format:

  • optional track definition lines

The track line consists of the word 'track' followed by space-separated key=value pairs. Parameters differ between databases. Ensembl example:

track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
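
Because values may contain spaces inside quotes, a plain split on spaces is not enough. One possible approach, sketched here with Python's standard `shlex` module, is shell-style tokenising; `parse_track_line` is a hypothetical helper.

```python
import shlex
from typing import Dict


def parse_track_line(line: str) -> Dict[str, str]:
    """Parse 'track key=value ...' into a dict, honouring quoted values."""
    tokens = shlex.split(line)  # keeps "Item RGB demonstration" as one value
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track definition line")
    return dict(token.split("=", 1) for token in tokens[1:])


track = parse_track_line(
    'track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"'
)
use_colours = track.get("itemRgb", "").lower() == "on"  # case-insensitive check
print(track, use_colours)
```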


GTF/GFF/BED

GFF/GTF format:

  • 9 columns

/!\ Different versions: 1, 2, 2.5, 3
GTF = GFF version 2

Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN

  • 1. sequence id
  • 2. source
  • 3. feature type
  • 4. start
  • 5. end
  • 6. score
  • 7. strand
  • 8. phase
  • 9. attribute(s): tag=value
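
A sketch of splitting one feature line into the nine columns and turning the attribute column into a dictionary. It assumes tab-separated columns and semicolon-separated tag=value attributes as in the example above; the column names used as dictionary keys are illustrative.

```python
from typing import Any, Dict

GFF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gff_line(line: str) -> Dict[str, Any]:
    """Split a 9-column GFF line and parse its tag=value attributes."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 9:
        raise ValueError("expected 9 tab-separated columns")
    record: Dict[str, Any] = dict(zip(GFF_COLUMNS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes: Dict[str, str] = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()  # tolerate "ID=gene1; Name=EDEN"
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = value
    record["attributes"] = attributes
    return record


line = "Ctg123\tcufflinks\tGene\t1000\t9000\t.\t+\t.\tID=gene1; Name=EDEN"
gene = parse_gff_line(line)
print(gene)
```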


GTF/GFF/BED

GFF3:

  • Headers

##gff-version 3
##sequence-region ctg123 1 1497228

  • Features

Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN

  • Sequences (optional)

##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat


GTF/GFF/BED

##gff-version 3.2.1
##sequence-region ctg123 1 1497228
ctg123 . Gene 1000 9000 . + . ID=gene1;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA1;Parent=gene1
ctg123 . exon 1050 1500 . + . ID=exon1;Parent=mRNA1
ctg123 . exon 7000 9000 . + . ID=exon2;Parent=mRNA1
ctg123 . CDS 1201 1500 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
ctg123 . CDS 7000 7600 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
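
The ID and Parent attributes link features into a gene, mRNA, exon/CDS hierarchy. A sketch of recovering that hierarchy from the example above; for readability the columns are split on whitespace here, which only works because these attributes contain no spaces (real files are tab-separated). `build_children` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

GFF3_TEXT = """\
##gff-version 3.2.1
##sequence-region ctg123 1 1497228
ctg123 . Gene 1000 9000 . + . ID=gene1;Name=EDEN
ctg123 . mRNA 1050 9000 . + . ID=mRNA1;Parent=gene1
ctg123 . exon 1050 1500 . + . ID=exon1;Parent=mRNA1
ctg123 . exon 7000 9000 . + . ID=exon2;Parent=mRNA1
ctg123 . CDS 1201 1500 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
ctg123 . CDS 7000 7600 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
"""


def build_children(text: str) -> Dict[str, List[str]]:
    """Map each parent ID to the IDs of its child features."""
    children: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and ## directives
        columns = line.split()
        attrs = dict(pair.split("=", 1) for pair in columns[8].split(";") if pair)
        feature_id = attrs.get("ID")
        for parent in attrs.get("Parent", "").split(","):
            # A CDS split over several lines shares one ID; list it once.
            if parent and feature_id not in children[parent]:
                children[parent].append(feature_id)
    return dict(children)


tree = build_children(GFF3_TEXT)
print(tree)
```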

  • Laboratory time! (yet again)