file types in bioinformatics

File Types in Bioinformatics 2017-11-28 Martin Dahlö



  1. File Types in Bioinformatics 2017-11-28 Martin Dahlö martin.dahlo@scilifelab.uu.se Valentin Georgiev valentin.georgiev@icm.uu.se Jacques Dainat jacques.dainat@nbis.se

  2. http://xkcd.com

  3. ● Overwhelming at first
     ● Overview
       ○ FASTA – reference sequences
       ○ FASTQ – reads in raw form
       ○ SAM – aligned reads
       ○ BAM – compressed SAM file
       ○ CRAM – even more compressed SAM file
       ○ GTF/GFF/BED – annotations

  4. FASTA
     ● Used for: nucleotide or peptide sequences
     ● Simple structure:
         >header
         sequence
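The header/sequence structure can be read with a few lines of Python. This is a minimal sketch, not a full parser: the function name `read_fasta` and the example records are hypothetical, and it assumes each record starts with a `>` line and that a sequence may be wrapped over several lines.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by header text (without the leading '>')."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # A sequence may span several lines; join them into one string.
            records[header] += line
    return records


# Hypothetical example data (not from the slides).
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
MKV
"""
fasta_records = read_fasta(example.splitlines())
```

For real work, an established library parser is usually a safer choice than a hand-written one.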


  6. FASTQ
     ● Just like FASTA, but with quality values
     ● Used for: raw data from sequencing (unaligned reads)
     ● Structure:
         @header
         sequence
         +
         quality
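The four-line FASTQ record (header, sequence, `+`, quality) can be sketched in Python as follows. The names `FastqRecord` and `read_fastq` and the example reads are hypothetical, and the sketch assumes the common unwrapped layout where every record is exactly four lines.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ entries: @header, sequence, +, quality."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    if len(stripped) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(stripped), 4):
        header, sequence, plus, quality = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical example data (not from the slides).
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
reads = list(read_fastq(example.splitlines()))
```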


  8. FASTQ
     ● Quality 0-40 (Illumina 1.8+: 0-41)
       ○ Higher = better (40 = best on the 0-40 scale)
     ● ASCII encoded
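As an illustration of the ASCII encoding, the sketch below turns a quality string back into numeric scores. It assumes the Phred+33 offset used by Illumina 1.8+ (older Illumina pipelines used other offsets); the function name `decode_qualities` and the example string are hypothetical.

```python
from typing import List


def decode_qualities(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its score: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality]


# '!' is ASCII 33 (score 0) and 'J' is ASCII 74 (score 41).
scores = decode_qualities("!5?IJ")
print(scores)  # [0, 20, 30, 40, 41]
```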


  11. FASTQ
      Phred Quality Score   Error                   Accuracy
      10                    1/10 = 10%              90%
      20                    1/100 = 1%              99%
      30                    1/1000 = 0.1%           99.9%
      40                    1/10000 = 0.01%         99.99%
      50                    1/100000 = 0.001%       99.999%
      60                    1/1000000 = 0.0001%     99.9999%
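The table follows the standard Phred relation Q = -10 * log10(P), where P is the probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are hypothetical):

```python
import math


def error_probability(q: float) -> float:
    """Probability of a wrong base call for Phred score q."""
    return 10 ** (-q / 10)


def phred_score(p: float) -> float:
    """Phred score for error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = error_probability(q)
    print(f"Q{q}: error {p:.4%}, accuracy {1 - p:.4%}")
```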

  12. SAM
      ● Used for: aligned reads
      ● Lots of columns

  13. SAM

  14. SAM
      ● Used for: aligned reads
      ● Lots of columns
        (figure labels: Start position bp, chr, Sequence, Quality, Read name)
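To make the columns concrete, here is a minimal sketch that splits one SAM alignment line into the 11 mandatory tab-separated fields defined by the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The names `SamRecord` and `parse_sam_line` and the example line are hypothetical; header lines starting with `@` are assumed to have been skipped already.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence (chromosome)
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    optional: List[str]  # any TAG:TYPE:VALUE fields after column 11


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Hypothetical alignment line (not from the slides).
sam_line = "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(sam_line)
```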

  15. BAM
      ● Binary SAM (compressed)
      ● 25% of the size of the SAM file
      ● Use SAMtools to convert
      ● .bai = BAM index


  17. BAM
      ● Random order
      ● Have to sort before indexing
        (figure labels: Chr1 Chr2 Chr3 Chr4 Chr5)
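The convert, sort, index order can be sketched as a list of SAMtools commands driven from Python. The file names and the helper names `bam_pipeline_commands` and `run_pipeline` are hypothetical; the sketch assumes a reasonably recent `samtools` on the PATH (one that accepts `sort -o`).

```python
import subprocess
from typing import List


def bam_pipeline_commands(sam_path: str, prefix: str) -> List[List[str]]:
    """Build the commands: SAM -> BAM, sort by coordinate, then index."""
    unsorted_bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", unsorted_bam, sam_path],  # convert to BAM
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],      # sort first
        ["samtools", "index", sorted_bam],                         # writes the .bai index
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = bam_pipeline_commands("aln.sam", "aln")
for command in commands:
    print(" ".join(command))
```

Calling `run_pipeline(commands)` would execute the steps; it is left out here so the sketch runs without SAMtools installed.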

  18. BAM

  19. BAM

  20. BAM

  21. CRAM
      ● Very complex format
      ● Used together with a reference genome

  22. CRAM
      ● Quality scores?
      ● 3 modes:
        ○ Lossless
        ○ Binned
        ○ No quality

  23. Quality values 1 to 41 are grouped into bins:
      1-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40, 41-45
      => Reducing the number of quality values increases shared blocks and improves compression.
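The binning idea can be illustrated with a short sketch that maps each quality score to a 5-wide bin. This only demonstrates the grouping; it is not how any particular CRAM implementation chooses its bins or representative values, and the function name `quality_bin` is hypothetical.

```python
from typing import Tuple


def quality_bin(score: int, width: int = 5) -> Tuple[int, int]:
    """Return the (lower, upper) bounds of the bin that contains score."""
    if score < 1:
        raise ValueError("this sketch bins scores starting at 1")
    lower = ((score - 1) // width) * width + 1
    return (lower, lower + width - 1)


# 41 distinct scores collapse into 9 distinct bins.
bins = [quality_bin(q) for q in range(1, 42)]
distinct_bins = sorted(set(bins))
print(len(distinct_bins), distinct_bins[0], distinct_bins[-1])
```

Fewer distinct values means longer runs of identical symbols, which is what a compressor exploits.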

  24. CRAM
      ● Quality scores?
      ● 3 modes:
        ○ Lossless
        ○ Binned
        ○ No quality
      ● Not widespread, yet

  25. GTF/GFF/BED
      ● Used for: annotations
      ● Column structure
      ● One line = one feature (match, exon, etc.)

  26. GTF/GFF/BED
      BED format:
      ● 3-12 columns: 3 mandatory fields + 9 optional fields
          chr    start       stop        extra info
          chr1   213941196   213942363
          chr1   213942363   213943530
      ● + optional track definition lines

  27. GTF/GFF/BED
      BED format:
      ● optional fields
        4. name - Label to be displayed under the feature, if turned on in "Configure this page".
        5. score - A score between 0 and 1000.
        6. strand - defined as + (forward) or - (reverse).
        7. thickStart - coordinate at which to start drawing the feature as a solid rectangle
        8. thickEnd - coordinate at which to stop drawing the feature as a solid rectangle
        9. itemRgb - an RGB colour value (e.g. 0,0,255). Only used if there is a track line with the value of itemRgb set to "on" (case-insensitive).
        10. blockCount - the number of sub-elements (e.g. exons) within the feature
        11. blockSizes - the size of these sub-elements
        12. blockStarts - the start coordinate of each sub-element
          chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
          chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
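A minimal sketch of reading one BED line, using the first example line from the slide. The names `BedFeature` and `parse_bed_line` are hypothetical, and only the first six columns are kept. BED coordinates are 0-based with an exclusive end, so the feature length is simply end minus start.

```python
from typing import NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int             # 0-based, inclusive
    end: int               # exclusive
    name: Optional[str]
    score: Optional[int]
    strand: Optional[str]


def parse_bed_line(line: str) -> BedFeature:
    """Parse the 3 mandatory columns plus name, score and strand when present."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


feature = parse_bed_line("chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0")
length = feature.end - feature.start
print(feature.name, length)  # Pos1 1167
```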

  28. GTF/GFF/BED
      BED format:
      ● optional track definition lines
        The track line consists of the word 'track' followed by space-separated key=value pairs.
        Parameters differ between databases. Ensembl example:
          track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
          chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
          chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
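Because track-line values can contain spaces inside quotes, a plain `split()` is not enough. One possible sketch uses the standard-library `shlex` module to honour the quotes; the function name `parse_track_line` is hypothetical.

```python
import shlex
from typing import Dict


def parse_track_line(line: str) -> Dict[str, str]:
    """Split a 'track' line into key=value pairs, respecting quoted values."""
    tokens = shlex.split(line)
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track definition line")
    return dict(token.split("=", 1) for token in tokens[1:])


track = parse_track_line(
    'track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"'
)
print(track["description"])  # Item RGB demonstration
```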

  29. GTF/GFF/BED
      GFF/GTF format:
      ● 9 columns
        1. sequence id
        2. source
        3. feature type
        4. start
        5. end
        6. score
        7. strand
        8. phase
        9. attribute(s) (tag=value)
          Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN
      /!\ Different versions: 1, 2, 2.5, 3
      GTF = GFF version 2

  30. GTF/GFF/BED
      GFF3:
      ● Headers
          ##gff-version 3
          ##sequence-region ctg123 1 1497228
      ● Features
          Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN
      ● Sequences (optional)
          ##FASTA
          >ctg123
          cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
          tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
          tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
          aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat

  31. GTF/GFF/BED
        ##gff-version 3.2.1
        ##sequence-region ctg123 1 1497228
        ctg123 . Gene 1000 9000 . + . ID=gene1;Name=EDEN
        ctg123 . mRNA 1050 9000 . + . ID=mRNA1;Parent=gene1
        ctg123 . exon 1050 1500 . + . ID=exon1;Parent=mRNA1
        ctg123 . exon 7000 9000 . + . ID=exon2;Parent=mRNA1
        ctg123 . CDS  1201 1500 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
        ctg123 . CDS  7000 7600 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
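The ID/Parent attributes in the example link features into a gene > mRNA > exon hierarchy. The sketch below parses a few of the slide's lines and collects children per parent. The names `GffFeature` and `parse_gff3_line` are hypothetical, and the sketch assumes tab-separated columns as required by GFF3. GFF coordinates are 1-based and inclusive, so a feature's length is end - start + 1.

```python
from typing import Dict, List, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int   # 1-based, inclusive
    end: int     # inclusive
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    attributes: Dict[str, str] = {}
    for pair in fields[8].split(";"):
        pair = pair.strip()
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = value
    return GffFeature(fields[0], fields[1], fields[2], int(fields[3]),
                      int(fields[4]), fields[5], fields[6], fields[7], attributes)


rows = [
    ["ctg123", ".", "Gene", "1000", "9000", ".", "+", ".", "ID=gene1;Name=EDEN"],
    ["ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".", "ID=mRNA1;Parent=gene1"],
    ["ctg123", ".", "exon", "1050", "1500", ".", "+", ".", "ID=exon1;Parent=mRNA1"],
    ["ctg123", ".", "exon", "7000", "9000", ".", "+", ".", "ID=exon2;Parent=mRNA1"],
]
gff_lines = ["##gff-version 3"] + ["\t".join(row) for row in rows]

# Skip header/directive lines, which start with '#'.
features = [parse_gff3_line(l) for l in gff_lines if l and not l.startswith("#")]

# Map each parent ID to the IDs of its direct children.
children: Dict[str, List[str]] = {}
for feature in features:
    parent = feature.attributes.get("Parent")
    if parent:
        children.setdefault(parent, []).append(feature.attributes["ID"])

gene_length = features[0].end - features[0].start + 1
```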

  32. ● Laboratory time! (yet again)
