
Databases (Bas E. Dutilh, Systems Biology: Bioinformatic Data Analysis)



  1. 25-Mar-15

     Databases
     Bas E. Dutilh
     Systems Biology: Bioinformatic Data Analysis
     Utrecht University, March 26th 2015

     Biology is Big Data science
     [Figure: # sequenced genomes]
     Moore's Law: computer power doubles every ~2 years.

     How would you figure out the function of a protein?
     • Activity assay
     • X-ray structure
     • Knock-out mouse
     • BLAST search

     History
     [Image: IBM 7090 computer]
     • First protein sequence: bovine insulin (51 amino acids, 1956)
     • Atlas of Protein Sequence and Structure (1965)
       – Margaret Oakley Dayhoff
     • Protein DataBank (10 proteins, 1972)
       – X-ray crystallographic protein structures
     • SWISSPROT (1987)
       – Protein sequence database
     • Genbank (1982)
       – Nucleotide and protein sequences

     Fasta files
     • Biological sequences are stored in Fasta files
     • Fasta files are plain text files (open e.g. in a text editor)
     • Every new sequence entry starts with a “>” sign at the start of a line
     • Each sequence has an identifier that has to be unique in the file
     • The sequence can be on one or more lines, until the next “>” at the start of a new line
     • Spaces and newlines just make sequences easier to read/count; they do not have any meaning

       >protein_sequence_A
       MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
       >protein_sequence_B
       MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
       >protein_sequence_C
       MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
       RPLLRPIKSP FGAWVVIV

     Fasta file extensions
     • The file extension of a Fasta file is .fa or .fasta
     • The preferred extension for protein Fasta files is .faa
       – Fasta Amino Acid
     • The preferred extension for DNA Fasta files is .fna
       – Fasta Nucleic Acid

       >DNA_sequence_X
       GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
       >DNA_sequence_Y
       CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
       >DNA_sequence_Z
       GAAGCTGACC CGTTTCCGGA AGAGGGAGG
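The Fasta rules on this slide (a “>” at the start of a line opens a new entry, identifiers must be unique, sequences may span several lines, spaces and newlines carry no meaning) can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: the function name `parse_fasta` is hypothetical, and the sketch assumes the whole header line after “>” is the identifier.

```python
def parse_fasta(text):
    """Return a dict mapping each identifier to its sequence string."""
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no meaning
        if line.startswith(">"):
            # Assumption: the full header line is used as the identifier.
            identifier = line[1:].strip()
            if identifier in records:
                raise ValueError(f"duplicate identifier: {identifier}")
            records[identifier] = []
        elif identifier is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            # Spaces only aid reading/counting, so remove them.
            records[identifier].append("".join(line.split()))
    # Join the pieces of multi-line sequences into one string each.
    return {name: "".join(parts) for name, parts in records.items()}


example = """>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
"""

sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence))
```

Raising an error on a repeated identifier mirrors the slide's requirement that identifiers be unique within a file; a real pipeline might instead rely on an established library.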

  2. DNA sequencing
     • DNA sequencing depends on A/C/G/T signal being “read”
       – Differently colored fluorophore signals
       – Signal is not always unambiguous
     • DNA sequencing machines estimate the quality of a sequenced nucleotide
     [Figures: good quality sequencing read; bad quality sequencing read]

     DNA sequencing quality scores
     • DNA sequencing quality is measured in Phred scores
       – Phred 10: 10^-1 chance that the base is wrong
         • 90% accuracy; 10% error rate
       – Phred 20: 10^-2 chance that the base is wrong
         • 99% accuracy; 1% error rate
       – Phred 30: 10^-3 chance that the base is wrong
         • 99.9% accuracy; 0.1% error rate
       – Etcetera
     • Phred scores in Fastq files are stored as ASCII characters
       – Phred score + 33, converted to ASCII text

     Fastq
     • Sequencing output and quality are stored in Fastq format
       – Based on Fasta format
       – Contains information about quality of each nucleotide
       – Quality score is estimated by sequencing machine
     • Four lines per sequence:
       – Identifier line starting with @
       – DNA sequence on one line
       – Second identifier line starting with +
       – String of quality scores on one line, encoded in ASCII characters

       Fasta:
       >sequence_identifier_1
       GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
       >sequence_identifier_2
       AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

       Fastq:
       @sequence_identifier_1
       GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
       +sequence_identifier_1
       hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%

     Genbank format
     • Used by the Genbank database
       – Used in sequence similarity searches (more about this later)
     • Contains the sequence and all related information
     • Click on “FASTA” to get Fasta format
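The Phred arithmetic and the four-line Fastq layout described on this slide can be sketched as follows. The function names and the example read are hypothetical, chosen only for illustration; the sketch assumes the "Phred score + 33" encoding stated on the slide and exactly four lines per record.

```python
def phred_to_error_probability(score):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


def decode_quality(quality_string, offset=33):
    """Turn ASCII quality characters back into Phred scores (ASCII code - offset)."""
    return [ord(character) - offset for character in quality_string]


def parse_fastq(text):
    """Return a list of (identifier, sequence, phred_scores) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per Fastq record")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, plus_line, quality = lines[start:start + 4]
        if not header.startswith("@") or not plus_line.startswith("+"):
            raise ValueError("malformed Fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        records.append((header[1:], sequence, decode_quality(quality)))
    return records


# Hypothetical read: 'I' is ASCII 73 (Phred 40), '5' is 53 (Phred 20),
# '+' is 43 (Phred 10), '!' is 33 (Phred 0), '?' is 63 (Phred 30).
example = "@read_1\nGATTACA\n+read_1\nII5+!?I\n"

records = parse_fastq(example)
for identifier, sequence, scores in records:
    print(identifier, sequence, scores)
    print([phred_to_error_probability(score) for score in scores])
```

Checking that the sequence and quality lines have equal length catches truncated records early, since every nucleotide should have exactly one quality character.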

  3. Central paradigm of Bioinformatics
     • Central dogma of (molecular) biology
     • Biological sequences encode a lot of information
     • One of the most important applications of Bioinformatic Data Analysis is to extract this information

     International Nucleotide Sequence Database Collaboration
     • INSDC is a collaboration between:
       – DNA Data Bank of Japan (DDBJ)
       – National Center for Biotechnology Information (NCBI)
       – European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL-EBI)

     Protein families
     • Pfam database
     • SEED database
     • eggNOG database

     Protein structures

     Ribosomal RNA genes
     • Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)
       – 16S rRNA (Bacteria and Archaea)
       – 18S rRNA (Eukaryotes)

  4. Transcription factor binding sites (TFBS)

     Metabolic pathways (etc.)

     Scientific literature

     Protein interactions

     Using databases reproducibly in science
     • Databases are not static, but are constantly updated
     • Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier
       – Sometimes identifiers are changed in new versions of the database
     • Entries can be retrieved from the database
       – Search possibilities depend on the database
       – Often the complete database can be downloaded for large-scale analyses
       – When you access a database, note the version of the database or the date of accessing the database for reproducible science!
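One way to follow the advice about noting the database version or access date is to store a small provenance record next to each analysis. The sketch below is only an illustration: the function name `provenance_record`, the field names, and the example version string and identifiers are all hypothetical and do not correspond to any real database release.

```python
import json
from datetime import date


def provenance_record(database, version, identifiers, accessed=None):
    """Describe which database entries were used, and when they were retrieved."""
    return {
        "database": database,
        # Record the version if the database reports one; otherwise None.
        "version": version,
        # Default to today's date when no access date is supplied.
        "accessed": (accessed or date.today()).isoformat(),
        "identifiers": sorted(identifiers),
    }


# Hypothetical values, for illustration only.
record = provenance_record(
    database="example_database",
    version="hypothetical-release-1",
    identifiers=["ENTRY_0002", "ENTRY_0001"],
    accessed=date(2015, 3, 26),
)
print(json.dumps(record, indent=2))
```

Keeping the identifiers together with the version and date makes it possible to notice later when an identifier has changed in a newer release of the database.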
