Databases
Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 26th 2015
Biology is Big Data science
[Figure: # sequenced genomes]
Moore's Law: computer power doubles every ~2 years.
History
- First protein sequence: bovine insulin (51 amino acids, 1956)
- Atlas of Protein Sequence and Structure (1965)
– Margaret Oakley Dayhoff
- Protein DataBank (10 proteins, 1972)
– X‐ray crystallographic protein structures
- SWISSPROT (1987)
– Protein sequence database
- Genbank (1982)
– Nucleotide and protein sequences
[Image: IBM 7090 computer]
How would you figure out the function of a protein?
- X‐ray structure
- Activity assay
- Knock‐out mouse
- BLAST search
Fasta files
- Biological sequences are stored in Fasta files
- Fasta files are plain text files (open e.g. in a text editor)
>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
- Every new sequence entry starts with a “>” sign at the start of a line
- Each sequence has an identifier that has to be unique in the file
- The sequence can be on one or more lines until the next “>” at the start of a new line
- Spaces and newlines just make sequences easier to read/count; they do not have any meaning
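The format rules above can be sketched as a small Python reader. This is a minimal illustration, not a reference implementation: the function name `parse_fasta` is hypothetical, and taking the first word after “>” as the identifier is an assumption (a common convention, not something the slides state).

```python
def parse_fasta(text):
    """Return a dict mapping each sequence identifier to its sequence.

    Hypothetical helper illustrating the Fasta rules; assumes the
    identifier is the first word after the ">" sign.
    """
    records = {}
    current_id = None
    for raw_line in text.splitlines():
        if raw_line.startswith(">"):
            # A ">" at the start of a line opens a new sequence entry.
            words = raw_line[1:].split()
            if not words:
                raise ValueError("header line without an identifier")
            current_id = words[0]
            if current_id in records:
                # Identifiers have to be unique within one file.
                raise ValueError("duplicate identifier: " + current_id)
            records[current_id] = ""
            continue
        # Spaces and newlines carry no meaning, so drop all whitespace.
        residues = "".join(raw_line.split())
        if not residues:
            continue
        if current_id is None:
            raise ValueError("sequence data before the first '>' line")
        # A sequence may continue over several lines until the next ">".
        records[current_id] += residues
    return records


example = """>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
"""

records = parse_fasta(example)
for identifier, sequence in records.items():
    print(identifier, len(sequence))
```

Raising an error on a repeated identifier is one possible design choice; a reader could also keep the first entry or rename duplicates, depending on what the downstream analysis expects.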
Fasta file extensions
- The file extension of a Fasta file is .fa or .fasta
- The preferred extension for protein Fasta files is .faa
– Fasta Amino Acid
>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
- The preferred extension for DNA Fasta files is .fna
– Fasta Nucleic Acid
>DNA_sequence_X
GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
>DNA_sequence_Y
CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
>DNA_sequence_Z
GAAGCTGACC CGTTTCCGGA AGAGGGAGG
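The extension conventions can be written down as a simple lookup. The sketch below is only illustrative: the function name `sequence_type_from_filename`, the labels it returns, and the example file names are hypothetical, and an extension is a hint about the contents, not a guarantee.

```python
import os

# Mapping from the Fasta extensions named above to the kind of sequence
# they are expected to hold. The labels are illustrative assumptions.
EXTENSION_TO_TYPE = {
    ".faa": "protein",      # Fasta Amino Acid
    ".fna": "nucleotide",   # Fasta Nucleic Acid
    ".fa": "unspecified",   # generic Fasta, type not encoded in the name
    ".fasta": "unspecified",
}


def sequence_type_from_filename(filename):
    """Return the expected sequence type, or None for non-Fasta extensions."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_TYPE.get(extension)


for name in ["proteome.faa", "genome.fna", "reads.fasta", "notes.txt"]:
    print(name, sequence_type_from_filename(name))
```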