MySQL A Little Science Michael Schroeder BIOTEC TU Dresden - - PowerPoint PPT Presentation

mysql a little science
SMART_READER_LITE
LIVE PREVIEW

MySQL A Little Science Michael Schroeder BIOTEC TU Dresden - - PowerPoint PPT Presentation

MySQL A Little Science Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de Structure SCOP tables in detail SCOP entity relation diagram Complex queries to SCOP How many nodes for class, fold, superfamily and


slide-1
SLIDE 1

MySQL – A Little Science

Michael Schroeder

BIOTEC TU Dresden ms@biotec.tu-dresden.de

slide-2
SLIDE 2

Structure

§ SCOP tables in detail § SCOP entity relation diagram § Complex queries to SCOP § How many nodes for class, fold, superfamily and family? § Which superfamily has the most families? § Which families does the DNA binding-domain superfamily have? § Retrieve PDB ids for a given family/superfamily? § What is the percentage of superfamilies with just one family? § What is the PDB structure with the biggest number of distinct superfamilies? § What is the percentage of PDBs with just one superfamily? § What are the most popular superfamilies (biggest number of PDB entries) ? § Selections of selections § Creting temporary tables

slide-3
SLIDE 3

A Little Science

§ When working with SCOP through the web interface we are limited in what we can ask § What can we get out of SCOP when it is available as a relational table? §

§ Classes: all alpha, all beta, alpha/beta, alpha+beta § SCOP family: >30% sequence similarity § SCOP superfamily: good structural similarity (possibly <30%) A reminder

slide-4
SLIDE 4

Sequence vs structure

slide-5
SLIDE 5

SCOPE tables

§ Three tables:

§ cla, PDB entry and reference to its class, fold, superfamily, family, domain § des, description of each node in the SCOP hierarchy § astral, sequence for a domain

slide-6
SLIDE 6

SCOPE tables

mysql> SELECT * FROM cla LIMIT 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> SELECT * FROM des LIMIT 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ mysql> SELECT * FROM astral LIMIT 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+

slide-7
SLIDE 7

Entity relationship diagram for SCOP

slide-8
SLIDE 8

Complex queries to SCOP

slide-9
SLIDE 9

A Little Science

§ How many nodes are there in the hierarchy of type class, fold, superfamily, family?

slide-10
SLIDE 10

A Little Science

§ How many nodes are there in the hierarchy of type class, fold, superfamily, family? § Let us first find out how these types are called: SELECT DISTINCT type FROM des;

Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+

+------+ | type | +------+ | cl | | cf | | sf | | fa | | dm | | sp | | px | +------+

slide-11
SLIDE 11

A Little Science

§ How many nodes are there in the hierarchy of type class, fold, superfamily, family?

slide-12
SLIDE 12

A Little Science

§ How many nodes are there in the hierarchy of type class, fold, superfamily, family?

SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num;

+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067| | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+

slide-13
SLIDE 13

A Little Science

§ How many nodes are there in the hierarchy of type class, fold, superfamily, family?

SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num; § There are not that many more families than superfamilies.

§ Which superfamily has the most families?

+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067 | | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+

slide-14
SLIDE 14

A Little Science

§ Which superfamily has the most families?

Cla (domain table) +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+

slide-15
SLIDE 15

A Little Science

§ Which superfamily has the most families?

SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC;

+---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |

slide-16
SLIDE 16

A Little Science

§ Which families does the DNA binding-domain superfamily have? § The sccs of the superfamily is a.4.5. § Its families have sccs,… a.4.5.1, a.4.5.2 § How can we list them?

Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+

slide-17
SLIDE 17

A Little Science

Which families does the DNA binding-domain superfamily have?

SELECT DISTINCT sccs, description FROM des WHERE sccs LIKE “a.4.5.%” AND type=“fa“ ORDER BY sccs;

| sccs | description +----------+------------------------------------------------ | a.4.5.1 | Biotin repressor-like | a.4.5.10 | Replication initiation protein | a.4.5.11 | Helicase DNA-binding domain | a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain | a.4.5.13 | Linker histone H1/H5 | a.4.5.14 | Forkhead DNA-binding domain | a.4.5.15 | DNA-binding domain from rap30 | a.4.5.16 | C-terminal domain of RPA32 | a.4.5.17 | Cell cycle transcription factor e2f-dp | a.4.5.18 | The central core domain of TFIIE beta | a.4.5.19 | Z-DNA binding domain | a.4.5.2 | LexA repressor, N-terminal DNA-binding domain | a.4.5.20 | P4 origin-binding domain-like | a.4.5.21 | ets domain | a.4.5.22 | Heat-shock transcription factor | a.4.5.23 | Interferon regulatory factor

[...] => 45 rows

slide-18
SLIDE 18

A Little Science

§ Find example PDB structure for the families of the DNA binding-domain superfamily ...

mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1"; +--------+ | pdb_id | +--------+ | 1bia | | 1hxd | | 1bib | | 1j5y | +--------+ mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2"; +--------+ | pdb_id | +--------+ | 1jhf | | 1jhh | | 1lea | | 1leb | +--------+

slide-19
SLIDE 19

A Little Science

1bia (a.4.5.1) 1jhf (a.4.5.2)

Structure very similar! LexA represor Biotin represor-like

slide-20
SLIDE 20

Sequence vs structure

slide-21
SLIDE 21

A Little Science

1cgp (a.4.5.4)

slide-22
SLIDE 22

A Little Science

Some more…

1smt (a.4.5.5) 1hw1 (a.4.5.6) 1b9n (a.4.5.8) 1bm9 (a.4.5.7) 1f4k (a.4.5.7)

slide-23
SLIDE 23

A Little Science

§ How many percent of superfamilies have only 1 family, how many 2,… ?

slide-24
SLIDE 24

A Little Science

§ How many percent of superfamilies have only 1, 2, 3, … families?

  • 1. First let’s create the result of the query that found the number of

families for each superfamily. SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc

+---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |

slide-25
SLIDE 25

A Little Science

+-----------+------+ | fa_per_sf | freq | +-----------+------+ | 1 | 1279 | | 2 | 183 | | 3 | 83 | | 4 | 39 | | 5 | 30 | | 6 | 19 | | 7 | 9 | | 8 | 9 | | 9 | 2 | | 10 | 4 |

§ How many percent of superfamilies have only 1, 2, 3, … families?

  • 2. Now we count the frequency of families per superfamilies:

SELECT num AS fa_per_sf, COUNT(*) AS freq FROM (SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc) AS fa_freq GROUP BY num; We have frequency, but we need percentage!

slide-26
SLIDE 26

A Little Science

Reusing the previous query: SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num; We have a total of 1678 superfamilies!

+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067| | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+

slide-27
SLIDE 27

A Little Science

+-----------+------+ | fa_per_sf | perc | +-----------+------+ | 1 | 0.76 | | 2 | 0.10 | | 3 | 0.04 | | 4 | 0.02 | | 5 | 0.01 | | 6 | 0.01 | | 7 | 0.00 | | 8 | 0.00 | | 9 | 0.00 | | 10 | 0.00 | +-----------+------+

§ How many percent of superfamilies have only 1,2,3,… families?

  • 3. We now calculate the percentage.

SELECT num AS fa_per_sf, (COUNT(*)/1678) AS perc FROM (SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc) AS fa_freq GROUP BY num;

=> 76% of superfamilies have only one familie! This is interesting! For most superfamilies there is only one family!

slide-28
SLIDE 28

A Little Science

SELECT pdb_id, COUNT(sf) AS sf_num FROM cla GROUP BY pdb_id ORDER BY sf_num DESC LIMIT 10;

+--------+--------+ | pdb_id | sf_num | +--------+--------+ | 1pf9 | 49 | | 1aon | 49 | | 1pcq | 49 | | 1htq | 48 | | 1p7g | 48 | | 1mcz | 48 | | 1hto | 48 | | 1ir2 | 48 | | 1kpo | 42 | | 1j4z | 42 | +--------+--------+

SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_num FROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC LIMIT 10;

+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1kd1 | 23 | | 1nji | 23 | | 1mk1 | 23 | | 1q82 | 23 | | 1kc8 | 23 | | 1jj2 | 23 | | 1m90 | 23 | | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | +--------+-----------------+

§ What is the PDB structure with the largest number of (distinct) superfamilies?

slide-29
SLIDE 29

1k9m: a structure with 23 different superfamilies

Crystal structure of different macrolide antibiotics binding to the large ribosomal unit of Haloarcula marismortui

slide-30
SLIDE 30

A Little Science

§ Now let’s plot how many PDBs have 1, 2, 3,… distinct superfamilies We are going to use the query of the previous slide: A. Use the complete query inside other queries B. Save it as a new table and reference it by the given name

  • If the table already exists we have to erase it first:

DROP TABLE pdb_sf_num.

  • Create a table with the query:

CREATE TABLE pdb_sf_num AS SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_num FROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC;

+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | | 1q81 | 23 | | 1qvg | 23 |

slide-31
SLIDE 31

A Little Science

§ How many PDBs have 1, 2, 3,… distinct superfamilies?

SELECT distinct_sf_num, COUNT(pdb_id) AS num FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;

+-----------------+-------+ | distinct_sf_num | num | +-----------------+-------+ | 1 | 18955 | | 2 | 3823 | | 3 | 874 | | 4 | 249 | | 5 | 37 | | 6 | 35 | | 7 | 2 | | 9 | 4 | | 20 | 6 | | 21 | 7 |

+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | | 1q81 | 23 | | 1qvg | 23 |

slide-32
SLIDE 32

A Little Science

Let’s do the same in percent ...

SELECT COUNT(DISTINCT pdb_id) FROM cla;

+----------+ | count(*) | +----------+ | 24037 | +----------+

=> There are 24037 PDB IDs

slide-33
SLIDE 33

A Little Science

§ How many PDBs have 1, 2, 3,… distinct superfamilies? SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS perc FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;

+-----------------+------+ | distinct_sf_num | perc | +-----------------+------+ | 1 | 0.80 | | 2 | 0.16 | | 3 | 0.03 | | 4 | 0.01 | | 5 | 0.00 | | 6 | 0.00 | | 7 | 0.00 | | 9 | 0.00 | | 20 | 0.00 | | 21 | 0.00 | | 22 | 0.00 | | 23 | 0.00 | +-----------------+------+

=> 80% of PDB entries consist only of one type of superfamily!

slide-34
SLIDE 34

A Little Science

SELECT des.sccs, des.description, COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids FROM cla,des WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num_of_pdb_ids DESC LIMIT 10;

+--------+------------------------------------------------------+----------------+ | sccs | description | num_of_pdb_ids | +--------+------------------------------------------------------+----------------+ | b.1.1 | Immunoglobulin | 823 | | d.2.1 | Lysozyme-like | 777 | | b.47.1 | Trypsin-like serine proteases | 649 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 521 | | c.2.1 | NAD(P)-binding Rossmann-fold domains | 384 | | a.1.1 | Globin-like | 384 | | c.1.8 | (Trans)glycosidases | 332 | | b.50.1 | Acid proteases | 288 | | b.29.1 | Concanavalin A-like lectins/glucanases | 230 | | c.47.1 | Thioredoxin-like | 217 | +--------+------------------------------------------------------+----------------+

§ What are the most popular superfamilies? i.e. for which are there the most PDB entries?

slide-35
SLIDE 35

Entity relationship diagram for SCOP

slide-36
SLIDE 36

Summary

■ When working with the web interface we are limited in what we can ask ■ Creating temporary tables is often necessary / helpful for executing complex queries ■ For most superfamilies, there is only one family ■ Most of PDB entries consist only of one type of superfamily ■ Immunoglobulin superfamily is the most popular among the proteins in PDB