Databases # sequenced genomes Bas E. Dutilh Systems Biology: - - PDF document



SLIDE 1

25‐Mar‐15 1

Databases

Bas E. Dutilh
Systems Biology: Bioinformatic Data Analysis
Utrecht University, March 26th 2015

Biology is Big Data science

[Figure: number of sequenced genomes]

Moore's Law: computer power doubles every ~2 years.

History

  • First protein sequence: bovine insulin (51 amino acids, 1956)
  • Atlas of Protein Sequence and Structure (1965)

– Margaret Oakley Dayhoff

  • Protein DataBank (10 proteins, 1972)

– X‐ray crystallographic protein structures

  • SWISSPROT (1987)

– Protein sequence database

  • Genbank (1982)

– Nucleotide and protein sequences

[Image: IBM 7090 computer]

How would you figure out the function of a protein?

  • X‐ray structure
  • Activity assay
  • Knock‐out mouse
  • BLAST search

Fasta files

  • Biological sequences are stored in Fasta files
  • Fasta files are plain text files (open them e.g. in a plain text editor)
  • Every new sequence entry starts with a “>” sign at the start of a line
  • Each sequence has an identifier that has to be unique in the file
  • The sequence can be on one or more lines until the next “>” at the start of a new line
  • Spaces and newlines just make sequences easier to read/count; they do not have any meaning

>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
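The rules above can be sketched as a small parser. This is only a minimal illustration in Python, not a replacement for an established library; the function name `parse_fasta` is an assumption made for this example. It takes the first word after “>” as the identifier, rejects duplicate identifiers, and discards spaces and newlines inside sequences.

```python
def parse_fasta(text):
    """Parse Fasta-formatted text into a dict of identifier -> sequence.

    Hypothetical helper for illustration; follows the rules on the slide.
    """
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no meaning
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("entry without an identifier")
            current_id = header[0]
            if current_id in records:
                raise ValueError(f"duplicate identifier: {current_id}")
            records[current_id] = []
        elif current_id is not None:
            # Remove the spaces that only make the sequence easier to read.
            records[current_id].append("".join(line.split()))
        else:
            raise ValueError("sequence data before the first '>' line")
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV
"""

sequences = parse_fasta(example)
for name, sequence in sequences.items():
    print(name, len(sequence))
```

Sequence C spans two lines in the file but is returned as one continuous string, because the line break has no meaning.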

Fasta file extensions

  • The file extension of a Fasta file is .fa or .fasta
  • The preferred extension for protein Fasta files is .faa

– Fasta Amino Acid

>protein_sequence_A
MTQSSHAVAA FDLGAALRQE GLTETDYSEI QRDPNRAELG TFGV
>protein_sequence_B
MLTETDYSEI QRRLGRDPNR AELGMFGVMN RAELGMFGY
>protein_sequence_C
MHAVAAFDLG AALRQEGLTE TDYSEIQRRL GRAMFGVMWS EHCCYRNDDA
RPLLRPIKSP FGAWVVIV

  • The preferred extension for DNA Fasta files is .fna

– Fasta Nucleic Acid

>DNA_sequence_X
GAGGAATTCA TAGCTGACGA GTCGAGTGAA AACCGTGTCG TAAAAGA
>DNA_sequence_Y
CTGACGAGTC GCCCCCCCCC ATAGAGTGGT TTCCGTTTCC GGAAGGGTCG
>DNA_sequence_Z
GAAGCTGACC CGTTTCCGGA AGAGGGAGG

SLIDE 2

DNA sequencing

  • DNA sequencing depends on A/C/G/T signal being “read”

– Differently colored fluorophore signals
– Signal is not always unambiguous

  • DNA sequencing machines estimate the quality of a sequenced nucleotide

[Figure: bad quality sequencing read vs. good quality sequencing read]

DNA sequencing quality scores

  • DNA sequencing quality is measured in Phred scores

– Phred 10: 10^-1 chance that the base is wrong (90% accuracy; 10% error rate)
– Phred 20: 10^-2 chance that the base is wrong (99% accuracy; 1% error rate)
– Phred 30: 10^-3 chance that the base is wrong (99.9% accuracy; 0.1% error rate)
– Etcetera

  • Phred scores in Fastq files are stored as ASCII characters

– Phred score + 33, converted to ASCII text
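The pattern in the list above follows from the definition of the Phred score, Q = -10 * log10(P), where P is the probability that the base call is wrong. The Python sketch below shows that conversion and the "+ 33" ASCII encoding. The function names are assumptions chosen for this example.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for an error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def phred_to_ascii(q, offset=33):
    """Encode a Phred score as one ASCII character (score + offset)."""
    return chr(q + offset)


def ascii_to_phred(char, offset=33):
    """Decode one ASCII quality character back to a Phred score."""
    return ord(char) - offset


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Phred {q}: error rate {p:.3%}, encoded as {phred_to_ascii(q)!r}")
```

With the offset of 33, a Phred score of 0 is stored as "!" and a Phred score of 40 as "I".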

Fastq

  • Sequencing output and quality are stored in Fastq format

– Based on Fasta format
– Contains information about quality of each nucleotide
– Quality score is estimated by sequencing machine

  • Four lines per sequence:

– Identifier line starting with @
– DNA sequence on one line
– Second identifier line starting with +
– String of quality scores on one line, encoded in ASCII characters

Fasta:

>sequence_identifier_1
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
>sequence_identifier_2
AAGCATCCGAATGACGAGCTAGGAGAGATCTGAGCCTTTCAAA

Fastq:

@sequence_identifier_1
GGAAATGGAGTACGGATCGATTTTGTTTGGAACCGAAAGGGTC
+sequence_identifier_1
hhhhhhhhhhhhhhhh7F@71,'";C?,B;?6B;:EA1EA1E%
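The four-line layout can be read with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: the function name `parse_fastq`, the read name `read_1`, and its quality string are made up for this example, and the quality characters are decoded with the "+ 33" offset described earlier. Records are grouped strictly by position (four lines each), because a quality string may itself begin with "@".

```python
def parse_fastq(text, offset=33):
    """Parse four-line Fastq records into (identifier, sequence, scores).

    Hypothetical helper for illustration only.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("Fastq input must contain four lines per sequence")
    records = []
    for start in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[start:start + 4]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' identifier line, got: {header}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' line, got: {plus}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality string differ in length")
        # One quality character per nucleotide: ASCII code minus the offset.
        scores = [ord(char) - offset for char in quality]
        records.append((header[1:], sequence, scores))
    return records


# Made-up read: two high-quality bases, one medium, one very poor.
example = """@read_1
ACGT
+read_1
II5!
"""

for identifier, sequence, scores in parse_fastq(example):
    print(identifier, sequence, scores)
```

Each nucleotide is paired with exactly one quality character, which is why the sequence and quality lines must have the same length.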

Genbank format

  • Used by the Genbank database

– Used in sequence similarity searches (more about this later)

  • Contains the sequence and all related information
  • Click on “FASTA” to get Fasta format

SLIDE 3

Central paradigm of Bioinformatics

  • Central dogma of (molecular) biology
  • Biological sequences encode a lot of information
  • One of the most important applications of Bioinformatic Data Analysis is to extract this information

International Nucleotide Sequence Database Collaboration

  • INSDC is a collaboration between:

– DNA Data Bank of Japan (DDBJ)
– National Center for Biotechnology Information (NCBI)
– European Molecular Biology Laboratory / European Bioinformatics Institute (EMBL‐EBI)

Protein families

  • Pfam database
  • SEED database
  • eggNOG database

Ribosomal RNA genes

  • Small subunit ribosomal RNA (SSU rRNA) is a universal marker gene that indicates the taxonomic group of an organism, and was used to discover the three domains in the Tree of Life (ToL)

– 16S rRNA (Bacteria and Archaea)
– 18S rRNA (Eukaryotes)

Protein structures

SLIDE 4

Transcription factor binding sites (TFBS)

Metabolic pathways (etc.)

Scientific literature

Protein interactions

Using databases reproducibly in science

  • Databases are not static, but are constantly updated
  • Thus, every entry (sequence, protein, structure, function, etc.) in a database has a unique identifier

– Sometimes identifiers are changed in new versions of the database

  • Entries can be retrieved from the database

– Search possibilities depend on the database
– Often the complete database can be downloaded for large‐scale analyses
– When you access a database, note the version of the database or the date of accessing the database for reproducible science!
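One simple way to follow this advice is to store a small provenance record next to every retrieved entry. The Python sketch below is only an illustration: the function name `provenance_record`, the field names, the identifier `PF00000`, and the version string are all hypothetical placeholders, not real database values.

```python
import json
from datetime import date


def provenance_record(database, identifier, version=None, accessed=None):
    """Describe where and when a database entry was retrieved.

    Hypothetical helper; the field names are chosen for this example.
    """
    return {
        "database": database,
        "identifier": identifier,
        "version": version,  # database release, if the database reports one
        "accessed": (accessed or date.today()).isoformat(),
    }


# Placeholder values for illustration only.
record = provenance_record(
    "Pfam",
    "PF00000",
    version="hypothetical-release",
    accessed=date(2015, 3, 26),
)
print(json.dumps(record, indent=2))
```

Saving such a record as a text file alongside the downloaded data makes it possible to state later exactly which database version or access date an analysis was based on.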