File Types in Bioinformatics 2017-11-28 Martin Dahlö

File Types in Bioinformatics
2017-11-28
Martin Dahlö, martin.dahlo@scilifelab.uu.se
Valentin Georgiev, valentin.georgiev@icm.uu.se
Jacques Dainat, jacques.dainat@nbis.se

http://xkcd.com

● Overwhelming at first
● Overview
  ○ FASTA – reference sequences
  ○ FASTQ – reads in raw form
  ○ SAM – aligned reads
  ○ BAM – compressed SAM file
  ○ CRAM – even more compressed SAM file
  ○ GTF/GFF/BED – annotations

FASTA
● Used for: nucleotide or peptide sequences
● Simple structure:
    >header
    sequence
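The header/sequence structure can be read with a few lines of Python. This is a minimal sketch, not a replacement for an established parser: the function name `parse_fasta` and the example record are made up for illustration, and the record name is assumed to be the first word after `>`.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record name: sequence} from the lines of a FASTA file."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the record name is the first word of the header.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif name is not None:
            # A sequence may be wrapped over several lines; collect them all.
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


# Hypothetical two-line record, wrapped the way FASTA files usually are.
example = """>ctg123 example contig
cttctgggcgtacccgattc
tcggagaacttgccgcacca
"""
sequences = parse_fasta(example.splitlines())
print(sequences)
```

Joining the wrapped lines is the only real work here: everything between one `>` line and the next belongs to the same record.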


FASTQ
● Just like FASTA, but with quality values
● Used for: raw data from sequencing (unaligned reads)
    @header
    sequence
    +
    quality
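The four-line record layout can be sketched as a small reader. This assumes each record occupies exactly four lines (sequence and quality not wrapped), which is how sequencers normally write FASTQ; the function name `parse_fastq` and the two reads are invented for the example.

```python
from typing import Iterable, Iterator, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, "")
        plus = next(it, "")
        qual = next(it, "")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, qual


# Hypothetical reads, for illustration only.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n"
reads = list(parse_fastq(example.splitlines()))
print(reads)
```

The length check reflects the format itself: there is one quality character per base.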


FASTQ
● Quality 0-40 (Illumina 1.8+ = 41)
  ○ 40 = best
● ASCII encoded
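"ASCII encoded" means each quality value is stored as a single character. The sketch below assumes the Phred+33 offset (used by Sanger and Illumina 1.8+ files); older Illumina pipelines used an offset of 64, so the offset is a parameter. The function name is hypothetical.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string into a list of Phred scores."""
    # Each character's ASCII code minus the offset is the quality value.
    return [ord(char) - offset for char in quality]


scores = decode_quality("!5?IJ")
print(scores)  # [0, 20, 30, 40, 41]
```

With this offset `!` is the lowest quality (0), `I` is 40 and `J` is 41.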

FASTQ

Phred Quality Score   Error                   Accuracy
10                    1/10 = 10%              90%
20                    1/100 = 1%              99%
30                    1/1000 = 0.1%           99.9%
40                    1/10000 = 0.01%         99.99%
50                    1/100000 = 0.001%       99.999%
60                    1/1000000 = 0.0001%     99.9999%
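The table follows the Phred definition, where the error probability is P = 10^(-Q/10). A short sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40, 50, 60):
    p = phred_to_error(q)
    print(f"Q{q}: error {p:.6%}, accuracy {1 - p:.6%}")
```

Every 10 points of quality therefore corresponds to a tenfold drop in the error rate.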


SAM
● Used for: aligned reads
● Lots of columns..
● Labelled in the example: read name, chr, start position (bp), sequence, quality
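To make the "lots of columns" concrete: a SAM alignment line has 11 mandatory tab-separated fields, followed by optional tags. The sketch below splits one line into those fields, using the field names from the SAM specification. The alignment line itself is made up for illustration.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name (chr)
    pos: int     # 1-based leftmost start position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: List[str]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], fields[11:],
    )


# Hypothetical alignment line, not taken from a real file.
record = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0")
print(record.qname, record.rname, record.pos, record.seq, record.qual)
```

Header lines, which start with `@`, are not alignment lines and would need to be skipped before calling this function.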

BAM
● Binary SAM (compressed)
● About 25% of the size of the SAM file
● SAMtools to convert
● .bai = BAM index


BAM
● Random order
● Have to sort before indexing
● Sorted order: Chr1, Chr2, Chr3, Chr4, Chr5
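The convert, sort, index sequence can be sketched as three SAMtools calls. The helper below only builds the command lines, so it works without SAMtools installed; the function name and the output file names (`.bam`, `.sorted.bam`) are assumptions made for this example.

```python
from pathlib import Path
from typing import List


def samtools_commands(sam_path: str) -> List[List[str]]:
    """Build the commands that turn a SAM file into a sorted, indexed BAM."""
    stem = Path(sam_path).with_suffix("")  # "reads.sam" -> "reads"
    bam = f"{stem}.bam"
    sorted_bam = f"{stem}.sorted.bam"
    return [
        # 1. Convert SAM to BAM (-b = write BAM output).
        ["samtools", "view", "-b", "-o", bam, sam_path],
        # 2. Sort by coordinate; indexing requires a sorted file.
        ["samtools", "sort", "-o", sorted_bam, bam],
        # 3. Index the sorted BAM; this writes the .bai file next to it.
        ["samtools", "index", sorted_bam],
    ]


commands = samtools_commands("reads.sam")
for command in commands:
    print(" ".join(command))
```

To actually run them, each list could be passed to `subprocess.run(command, check=True)` on a system where `samtools` is on the PATH.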

CRAM
● Very complex format
● Used together with a reference genome

CRAM
● Quality scores?
● 3 modes:
  ○ Lossless
  ○ Binned
  ○ No quality

Quality values: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 … 32 33 34 35 36 37 38 39 40 41
Binned:         1-5  6-10  11-15  16-20  21-25  26-30  31-35  36-40  41-45
=> Reducing the number of quality values increases shared blocks and improves compression.
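The binning idea can be illustrated with the bins of width 5 shown above. This is only a sketch of the principle: the function name is made up, and real binning schemes use their own bin boundaries and representative values.

```python
def bin_quality(q: int, width: int = 5) -> str:
    """Return the label of the bin (e.g. '36-40') that quality q falls into."""
    if q < 1:
        raise ValueError("this sketch bins qualities starting at 1")
    low = ((q - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


qualities = range(1, 42)  # 1..41, as in the figure
bins = [bin_quality(q) for q in qualities]
print(len(set(qualities)), "distinct values before binning")
print(len(set(bins)), "distinct values after binning")
```

Fewer distinct symbols means longer runs of identical values, which is what a compressor exploits.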

CRAM
● Not widespread, yet

GTF/GFF/BED
● Used for: annotations
● Column structure
● one line = one feature (match, exon, etc)

GTF/GFF/BED
BED format:
● 3-12 columns: 3 mandatory fields + 9 optional fields

    chr    start       stop        extra info
    chr1   213941196   213942363
    chr1   213942363   213943530

● + optional track definition lines

GTF/GFF/BED
BED format:
● optional fields
    4. name - Label to be displayed under the feature, if turned on in "Configure this page".
    5. score - A score between 0 and 1000.
    6. strand - defined as + (forward) or - (reverse).
    7. thickStart - coordinate at which to start drawing the feature as a solid rectangle
    8. thickEnd - coordinate at which to stop drawing the feature as a solid rectangle
    9. itemRgb - an RGB colour value (e.g. 0,0,255). Only used if there is a track line with the value of itemRgb set to "on" (case-insensitive).
    10. blockCount - the number of sub-elements (e.g. exons) within the feature
    11. blockSizes - the size of these sub-elements
    12. blockStarts - the start coordinate of each sub-element

    chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
    chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
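Because the columns are positional, a BED line can be mapped onto the field names listed above. A minimal sketch, assuming whitespace-separated columns and using the slide's names `chr`, `start` and `stop` for the three mandatory fields (the function name is hypothetical):

```python
from typing import Dict, Union

BED_FIELDS = [
    "chr", "start", "stop",                   # 3 mandatory fields
    "name", "score", "strand", "thickStart",  # 9 optional fields
    "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]
INT_FIELDS = {"start", "stop", "score", "thickStart", "thickEnd", "blockCount"}


def parse_bed_line(line: str) -> Dict[str, Union[str, int]]:
    """Map the 3 to 12 columns of one BED line onto their field names."""
    values = line.split()
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError("a BED line has between 3 and 12 columns")
    record: Dict[str, Union[str, int]] = dict(zip(BED_FIELDS, values))
    for key in record.keys() & INT_FIELDS:
        record[key] = int(record[key])
    return record


feature = parse_bed_line(
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0"
)
print(feature["name"], feature["strand"], feature["stop"] - feature["start"])
```

Only the columns that are present are returned, so a three-column line yields just `chr`, `start` and `stop`.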

GTF/GFF/BED
BED format:
● optional track definition lines
The track line consists of the word 'track' followed by space-separated key=value pairs. Parameters differ between databases. Ensembl example:

    track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
    chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
    chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
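Since values in a track line may be quoted and contain spaces, a plain `split()` is not enough. One possible approach, sketched here with Python's standard `shlex` module (the function name is an assumption for the example):

```python
import shlex
from typing import Dict


def parse_track_line(line: str) -> Dict[str, str]:
    """Split a BED track line into its key=value pairs."""
    tokens = shlex.split(line)  # honours the double quotes around values
    if not tokens or tokens[0] != "track":
        raise ValueError("not a track definition line")
    return dict(token.split("=", 1) for token in tokens[1:])


track = parse_track_line(
    'track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"'
)
print(track)
# itemRgb is compared case-insensitively, as described for the itemRgb field.
print(track["itemRgb"].lower() == "on")
```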

GTF/GFF/BED
GFF/GTF format:
● 9 columns
    1. sequence id
    2. source
    3. feature type
    4. start
    5. end
    6. score
    7. strand
    8. phase
    9. attribute(s) (tag=value)

    Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN

/!\ Different versions: 1, 2, 2.5, 3
GTF = GFF version 2
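The nine columns, with the tag=value attributes in the last one, can be unpacked as follows. This sketch assumes GFF3-style attributes (`tag=value` pairs separated by `;`) and tab-separated columns; GTF writes its attributes differently (`tag "value";`), so it would need a different attribute parser. The names used are illustrative.

```python
from typing import Any, Dict
from urllib.parse import unquote

GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def parse_gff3_line(line: str) -> Dict[str, Any]:
    """Split one GFF3 feature line into its nine columns."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF feature line has exactly 9 columns")
    record: Dict[str, Any] = dict(zip(GFF_COLUMNS, columns))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()
        if pair:
            tag, _, value = pair.partition("=")
            attributes[tag] = unquote(value)  # GFF3 percent-encodes reserved characters
    record["attributes"] = attributes
    return record


# The feature line from the slide, written with tabs between the columns.
line = "\t".join(
    ["Ctg123", "cufflinks", "Gene", "1000", "9000", ".", "+", ".", "ID=gene1; Name=EDEN"]
)
gene = parse_gff3_line(line)
print(gene["type"], gene["attributes"])
# GFF coordinates are 1-based and inclusive, so the length is end - start + 1.
print(gene["end"] - gene["start"] + 1)
```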

GTF/GFF/BED
GFF3:
● Headers
    ##gff-version 3
    ##sequence-region ctg123 1 1497228
● Features
    Ctg123 cufflinks Gene 1000 9000 . + . ID=gene1; Name=EDEN
● Sequences (optional)
    ##FASTA
    >ctg123
    cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
    tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
    tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
    aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat

GTF/GFF/BED

    ##gff-version 3.2.1
    ##sequence-region ctg123 1 1497228
    ctg123 . Gene 1000 9000 . + . ID=gene1;Name=EDEN
    ctg123 . mRNA 1050 9000 . + . ID=mRNA1;Parent=gene1
    ctg123 . exon 1050 1500 . + . ID=exon1;Parent=mRNA1
    ctg123 . exon 7000 9000 . + . ID=exon2;Parent=mRNA1
    ctg123 . CDS 1201 1500 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
    ctg123 . CDS 7000 7600 . + 0 ID=cds1;Parent=mRNA1;Name=edenprotein.1
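The ID and Parent attributes in this example link the features into a gene → mRNA → exon/CDS hierarchy. A rough sketch of how that hierarchy could be recovered, assuming tab-separated columns and no `##FASTA` section in the input (the function name is made up):

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GFF3_EXAMPLE = """\
##gff-version 3.2.1
##sequence-region ctg123 1 1497228
ctg123\t.\tGene\t1000\t9000\t.\t+\t.\tID=gene1;Name=EDEN
ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA1;Parent=gene1
ctg123\t.\texon\t1050\t1500\t.\t+\t.\tID=exon1;Parent=mRNA1
ctg123\t.\texon\t7000\t9000\t.\t+\t.\tID=exon2;Parent=mRNA1
ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds1;Parent=mRNA1;Name=edenprotein.1
ctg123\t.\tCDS\t7000\t7600\t.\t+\t0\tID=cds1;Parent=mRNA1;Name=edenprotein.1
"""


def children_by_parent(text: str) -> Dict[str, List[Tuple[str, int, int]]]:
    """Group features as (type, start, end) under the ID of their parent."""
    children = defaultdict(list)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip header and comment lines
        columns = line.split("\t")
        attributes = dict(
            pair.split("=", 1) for pair in columns[8].split(";") if pair
        )
        # A feature may list several parents, separated by commas.
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                children[parent].append(
                    (columns[2], int(columns[3]), int(columns[4]))
                )
    return dict(children)


tree = children_by_parent(GFF3_EXAMPLE)
for parent, features in tree.items():
    print(parent, features)
```

The two CDS lines share the ID `cds1`: in GFF3 this marks one coding sequence that spans several lines.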

● Laboratory time! (yet again)
