Databases # sequenced genomes Bas E. Dutilh Systems Biology: - PDF document

25-Mar-15

Databases
Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 26th 2015

Biology is Big Data science
[Figure: # sequenced genomes]
• Moore's Law: computer power doubles every ~2 years.

History
[Figure: IBM 7090 computer]
• First protein sequence: bovine insulin (51 amino acids, 1956)
• Atlas of Protein Sequence and Structure (1965)
– Margaret Oakley Dayhoff
• Protein DataBank (10 proteins, 1972)
– X-ray crystallographic protein structures
• SWISSPROT (1987)
– Protein sequence database
• Genbank (1982)
– Nucleotide and protein sequences

How would you figure out the function of a protein?
• Activity assay
• X-ray structure
• Knock-out mouse
• BLAST search

Fasta files
• Biological sequences are stored in Fasta files
• Fasta files are plain text files (open e.g. in a text editor)
• Every new sequence entry starts with a ">" sign at the start of a line
• Each sequence has an identifier that has to be unique in the file
• The sequence can be on one or more lines, until the next ">" at the start of a new line
• Spaces and newlines just make sequences easier to read/count; they do not have any meaning

    >protein_sequence_A
    MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
    >protein_sequence_B
    MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
    >protein_sequence_C
    MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
    RPLLRPIKSP FGAWVVIV

Fasta file extensions
• The file extension of a Fasta file is .fa or .fasta
• The preferred extension for protein Fasta files is .faa
– Fasta Amino Acid (e.g. the protein sequences above)
• The preferred extension for DNA Fasta files is .fna
– Fasta Nucleic Acid

    >DNA_sequence_X
    GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
    >DNA_sequence_Y
    CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
    >DNA_sequence_Z
    GAAGCTGACC CGTTTCCGGA AGAGGGAGG
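
The Fasta rules on these slides (entries start with ">", identifiers are unique, sequences may span several lines, spaces and newlines carry no meaning) can be sketched as a small Python parser. This is a minimal illustration, not a reference implementation: the function name `parse_fasta` is hypothetical, and it assumes the whole header line after ">" is the identifier.

```python
def parse_fasta(text):
    """Parse Fasta-formatted text into a dict mapping identifier -> sequence.

    Assumption for this sketch: the full header line after ">" is the identifier.
    """
    chunks = {}          # identifier -> list of sequence fragments
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank lines carry no meaning
        if line.startswith(">"):
            current_id = line[1:].strip()
            if current_id in chunks:
                # identifiers have to be unique within one file
                raise ValueError(f"duplicate identifier: {current_id}")
            chunks[current_id] = []
        elif current_id is not None:
            # drop the spaces that only make the sequence easier to read
            chunks[current_id].append("".join(line.split()))
    # join the fragments of multi-line sequences
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


EXAMPLE = """>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
"""

records = parse_fasta(EXAMPLE)
for seq_id, sequence in records.items():
    print(seq_id, len(sequence))
```

In practice an established library parser would usually be preferable; the sketch only makes the format rules concrete.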

DNA sequencing
• DNA sequencing depends on A/C/G/T signal being "read"
– Differently colored fluorophore signals
– Signal is not always unambiguous
• DNA sequencing machines estimate the quality of a sequenced nucleotide

[Figure: Bad quality sequencing read]

DNA sequencing quality scores
• DNA sequencing quality is measured in Phred scores
– Phred 10: 10^-1 chance that the base is wrong
  • 90% accuracy; 10% error rate
– Phred 20: 10^-2 chance that the base is wrong
  • 99% accuracy; 1% error rate
– Phred 30: 10^-3 chance that the base is wrong
  • 99.9% accuracy; 0.1% error rate
– Etcetera
• Phred scores in Fastq files are stored as ASCII characters
– Phred score + 33, converted to ASCII text

[Figure: Good quality sequencing read]

Fastq
• Sequencing output and quality are stored in Fastq format
– Based on Fasta format
– Contains information about quality of each nucleotide
– Quality score is estimated by sequencing machine
• Four lines per sequence:
– Identifier line starting with @
– DNA sequence on one line
– Second identifier line starting with +
– String of quality scores on one line, encoded in ASCII characters

Fasta:

    >sequence_identifier_1
    GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
    >sequence_identifier_2
    AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

Fastq:

    @sequence_identifier_1
    GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
    +sequence_identifier_1
    hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%

Genbank format
• Used by the Genbank database
– Used in sequence similarity searches (more about this later)
• Contains the sequence and all related information
• Click on "FASTA" to get Fasta format
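
The Phred and Fastq conventions described above (error probability 10^(-Q/10), quality characters stored as Phred score + 33, four lines per sequence) can be sketched in Python. This is a minimal sketch under stated assumptions: the function names and the example read `read_1` are hypothetical, and the parser assumes the strict four-line layout from the slide.

```python
def phred_to_error_probability(q):
    """Chance that a base is wrong for Phred score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Convert ASCII quality characters back to Phred scores (score + 33)."""
    return [ord(char) - offset for char in quality_string]


def parse_fastq(text):
    """Parse Fastq text with exactly four lines per sequence.

    Returns a list of (identifier, sequence, phred_scores) tuples.
    """
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per sequence")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality string differ in length")
        records.append((header[1:], sequence, decode_quality(quality)))
    return records


# Hypothetical example read; 'I' encodes Phred 40, '5' Phred 20, '+' Phred 10.
EXAMPLE = "@read_1\nGGAAATGG\n+read_1\nII5?+!5I\n"

for identifier, sequence, scores in parse_fastq(EXAMPLE):
    worst = min(scores)
    print(identifier, sequence, scores, phred_to_error_probability(worst))
```

Reading the file in blocks of four lines, rather than searching for "@", matters because "@" is itself a valid quality character.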

Central paradigm of Bioinformatics
• Central dogma of (molecular) biology
• Biological sequences encode a lot of information
• One of the most important applications of Bioinformatic Data Analysis is to extract this information

International Nucleotide Sequence Database Collaboration
• INSDC is a collaboration between:
– DNA Data Bank of Japan (DDBJ)
– National Center for Biotechnology Information (NCBI)
– European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL-EBI)

Protein families
• Pfam database
• SEED database
• eggNOG database

Protein structures

Ribosomal RNA genes
• Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
– 16S rRNA (Bacteria and Archaea)
– 18S rRNA (Eukaryotes)

Transcription factor binding sites (TFBS)

Metabolic pathways (etc.)

Scientific literature

Protein interactions

Using databases reproducibly in science
• Databases are not static, but are constantly updated
• Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier
– Sometimes identifiers are changed in new versions of the database
• Entries can be retrieved from the database
– Search possibilities depend on the database
– Often the complete database can be downloaded for large-scale analyses
– When you access a database, note the version of the database or the date of accessing the database for reproducible science!
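
One way to follow the advice on noting the database version or access date is to store a small provenance record next to the downloaded data. The Python sketch below is only an illustration: the function `provenance_record`, the database name `ExampleDB`, the version string, and the entry identifiers are all hypothetical placeholders.

```python
import json
from datetime import date


def provenance_record(database, version=None, accessed=None, identifiers=()):
    """Describe which database was used, in which version, and when."""
    if accessed is None:
        accessed = date.today()       # default to the day of access
    return {
        "database": database,
        "version": version,           # may be None if the database has no releases
        "accessed": accessed.isoformat(),
        "identifiers": sorted(identifiers),
    }


# Hypothetical values, for illustration only.
record = provenance_record(
    "ExampleDB",
    version="release-1",
    accessed=date(2015, 3, 26),
    identifiers=["entry_B", "entry_A"],
)
print(json.dumps(record, indent=2))
```

Saving such a record alongside the results makes it possible to repeat the analysis later, even if identifiers change in a newer version of the database.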
