MySQL – A Little Science
Michael Schroeder
BIOTEC TU Dresden ms@biotec.tu-dresden.de
MySQL A Little Science Michael Schroeder BIOTEC TU Dresden - - PowerPoint PPT Presentation
MySQL A Little Science Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de Structure SCOP tables in detail SCOP entity relation diagram Complex queries to SCOP How many nodes for class, fold, superfamily and
BIOTEC TU Dresden ms@biotec.tu-dresden.de
§ SCOP tables in detail § SCOP entity relation diagram § Complex queries to SCOP § How many nodes for class, fold, superfamily and family? § Which superfamily has the most families? § Which families does the DNA binding-domain superfamily have? § Retrieve PDB ids for a given family/superfamily? § What is the percentage of superfamilies with just one family? § What is the PDB structure with the biggest number of distinct superfamilies? § What is the percentage of PDBs with just one superfamily? § What are the most popular superfamilies (biggest number of PDB entries) ? § Selections of selections § Creting temporary tables
mysql> SELECT * FROM cla LIMIT 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> SELECT * FROM des LIMIT 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ mysql> SELECT * FROM astral LIMIT 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+
Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+
+------+ | type | +------+ | cl | | cf | | sf | | fa | | dm | | sp | | px | +------+
+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067| | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+
+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067 | | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+
Cla (domain table) +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+
SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC;
+---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |
Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+
SELECT DISTINCT sccs, description FROM des WHERE sccs LIKE “a.4.5.%” AND type=“fa“ ORDER BY sccs;
| sccs | description +----------+------------------------------------------------ | a.4.5.1 | Biotin repressor-like | a.4.5.10 | Replication initiation protein | a.4.5.11 | Helicase DNA-binding domain | a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain | a.4.5.13 | Linker histone H1/H5 | a.4.5.14 | Forkhead DNA-binding domain | a.4.5.15 | DNA-binding domain from rap30 | a.4.5.16 | C-terminal domain of RPA32 | a.4.5.17 | Cell cycle transcription factor e2f-dp | a.4.5.18 | The central core domain of TFIIE beta | a.4.5.19 | Z-DNA binding domain | a.4.5.2 | LexA repressor, N-terminal DNA-binding domain | a.4.5.20 | P4 origin-binding domain-like | a.4.5.21 | ets domain | a.4.5.22 | Heat-shock transcription factor | a.4.5.23 | Interferon regulatory factor
[...] => 45 rows
mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1"; +--------+ | pdb_id | +--------+ | 1bia | | 1hxd | | 1bib | | 1j5y | +--------+ mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2"; +--------+ | pdb_id | +--------+ | 1jhf | | 1jhh | | 1lea | | 1leb | +--------+
Structure very similar! LexA represor Biotin represor-like
§ How many percent of superfamilies have only 1, 2, 3, … families?
families for each superfamily. SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc
+---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |
+-----------+------+ | fa_per_sf | freq | +-----------+------+ | 1 | 1279 | | 2 | 183 | | 3 | 83 | | 4 | 39 | | 5 | 30 | | 6 | 19 | | 7 | 9 | | 8 | 9 | | 9 | 2 | | 10 | 4 |
§ How many percent of superfamilies have only 1, 2, 3, … families?
SELECT num AS fa_per_sf, COUNT(*) AS freq FROM (SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc) AS fa_freq GROUP BY num; We have frequency, but we need percentage!
Reusing the previous query: SELECT type, COUNT(*) AS num FROM des GROUP BY type ORDER BY num; We have a total of 1678 superfamilies!
+------+-------+ | type | num | +------+-------+ | cl | 11 | | cf | 1067| | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+
+-----------+------+ | fa_per_sf | perc | +-----------+------+ | 1 | 0.76 | | 2 | 0.10 | | 3 | 0.04 | | 4 | 0.02 | | 5 | 0.01 | | 6 | 0.01 | | 7 | 0.00 | | 8 | 0.00 | | 9 | 0.00 | | 10 | 0.00 | +-----------+------+
§ How many percent of superfamilies have only 1,2,3,… families?
SELECT num AS fa_per_sf, (COUNT(*)/1678) AS perc FROM (SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc) AS fa_freq GROUP BY num;
=> 76% of superfamilies have only one familie! This is interesting! For most superfamilies there is only one family!
SELECT pdb_id, COUNT(sf) AS sf_num FROM cla GROUP BY pdb_id ORDER BY sf_num DESC LIMIT 10;
+--------+--------+ | pdb_id | sf_num | +--------+--------+ | 1pf9 | 49 | | 1aon | 49 | | 1pcq | 49 | | 1htq | 48 | | 1p7g | 48 | | 1mcz | 48 | | 1hto | 48 | | 1ir2 | 48 | | 1kpo | 42 | | 1j4z | 42 | +--------+--------+
SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_num FROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC LIMIT 10;
+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1kd1 | 23 | | 1nji | 23 | | 1mk1 | 23 | | 1q82 | 23 | | 1kc8 | 23 | | 1jj2 | 23 | | 1m90 | 23 | | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | +--------+-----------------+
§ What is the PDB structure with the largest number of (distinct) superfamilies?
Crystal structure of different macrolide antibiotics binding to the large ribosomal unit of Haloarcula marismortui
§ Now let’s plot how many PDBs have 1, 2, 3,… distinct superfamilies We are going to use the query of the previous slide: A. Use the complete query inside other queries B. Save it as a new table and reference it by the given name
DROP TABLE pdb_sf_num.
CREATE TABLE pdb_sf_num AS SELECT pdb_id, COUNT(DISTINCT sf) AS distinct_sf_num FROM cla GROUP BY pdb_id ORDER BY distinct_sf_num DESC;
+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | | 1q81 | 23 | | 1qvg | 23 |
§ How many PDBs have 1, 2, 3,… distinct superfamilies?
SELECT distinct_sf_num, COUNT(pdb_id) AS num FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;
+-----------------+-------+ | distinct_sf_num | num | +-----------------+-------+ | 1 | 18955 | | 2 | 3823 | | 3 | 874 | | 4 | 249 | | 5 | 37 | | 6 | 35 | | 7 | 2 | | 9 | 4 | | 20 | 6 | | 21 | 7 |
+--------+-----------------+ | pdb_id | distinct_sf_num | +--------+-----------------+ | 1qvf | 23 | | 1k73 | 23 | | 1n8r | 23 | | 1q81 | 23 | | 1qvg | 23 |
+----------+ | count(*) | +----------+ | 24037 | +----------+
§ How many PDBs have 1, 2, 3,… distinct superfamilies? SELECT distinct_sf_num, COUNT(pdb_id)/17434 AS perc FROM pdb_sf_num GROUP BY distinct_sf_num ORDER BY distinct_sf_num;
+-----------------+------+ | distinct_sf_num | perc | +-----------------+------+ | 1 | 0.80 | | 2 | 0.16 | | 3 | 0.03 | | 4 | 0.01 | | 5 | 0.00 | | 6 | 0.00 | | 7 | 0.00 | | 9 | 0.00 | | 20 | 0.00 | | 21 | 0.00 | | 22 | 0.00 | | 23 | 0.00 | +-----------------+------+
=> 80% of PDB entries consist only of one type of superfamily!
SELECT des.sccs, des.description, COUNT(DISTINCT cla.pdb_id) AS num_of_pdb_ids FROM cla,des WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num_of_pdb_ids DESC LIMIT 10;
+--------+------------------------------------------------------+----------------+ | sccs | description | num_of_pdb_ids | +--------+------------------------------------------------------+----------------+ | b.1.1 | Immunoglobulin | 823 | | d.2.1 | Lysozyme-like | 777 | | b.47.1 | Trypsin-like serine proteases | 649 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 521 | | c.2.1 | NAD(P)-binding Rossmann-fold domains | 384 | | a.1.1 | Globin-like | 384 | | c.1.8 | (Trans)glycosidases | 332 | | b.50.1 | Acid proteases | 288 | | b.29.1 | Concanavalin A-like lectins/glucanases | 230 | | c.47.1 | Thioredoxin-like | 217 | +--------+------------------------------------------------------+----------------+
§ What are the most popular superfamilies? i.e. for which are there the most PDB entries?