mysql a little science
play

MySQL A Little Science Michael Schroeder BIOTEC TU Dresden - PowerPoint PPT Presentation

MySQL A Little Science Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de Structure SCOP tables in detail SCOP entity relation diagram Complex queries to SCOP How many nodes for class, fold, superfamily and


  1. MySQL – A Little Science Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de

  2. Structure § SCOP tables in detail § SCOP entity relation diagram § Complex queries to SCOP § How many nodes for class, fold, superfamily and family? § Which superfamily has the most families? § Which families does the DNA binding-domain superfamily have? § Retrieve PDB ids for a given family/superfamily? § What is the percentage of superfamilies with just one family? § What is the PDB structure with the biggest number of distinct superfamilies? § What is the percentage of PDBs with just one superfamily? § What are the most popular superfamilies (biggest number of PDB entries) ? § Selections of selections § Creting temporary tables

  3. A Little Science § When working with SCOP through the web interface we are limited in what we can ask § What can we get out of SCOP when it is available as a relational table? § A reminder § Classes: all alpha, all beta, alpha/beta, alpha+beta § SCOP family: >30% sequence similarity § SCOP superfamily: good structural similarity (possibly <30%)

  4. Sequence vs structure

  5. SCOPE tables § Three tables: § cla , PDB entry and reference to its class, fold, superfamily, family, domain § des , description of each node in the SCOP hierarchy § astral , sequence for a domain

  6. SCOPE tables mysql> SELECT * FROM cla LIMIT 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> SELECT * FROM des LIMIT 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ mysql> SELECT * FROM astral LIMIT 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+

  7. Entity relationship diagram for SCOP

  8. Complex queries to SCOP

  9. A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family?

  10. A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ § Let us first find out how these types are called: +------+ | type | SELECT DISTINCT type +------+ | cl | FROM des; | cf | | sf | | fa | | dm | | sp | | px | +------+

  11. A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family?

  12. A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? +------+-------+ SELECT type, COUNT(*) AS num | type | num | FROM des +------+-------+ | cl | 11 | GROUP BY type | cf | 1067| ORDER BY num; | sf | 1678 | | fa | 2886 | | dm | 6475 | | sp | 9526 | | px | 65122 | +------+-------+

  13. A Little Science § How many nodes are there in the hierarchy of type class, fold, superfamily, family? +------+-------+ SELECT type, COUNT(*) AS num | type | num | FROM des +------+-------+ | cl | 11 | GROUP BY type | cf | 1067 | ORDER BY num; | sf | 1678 | | fa | 2886 | § There are not that many more | dm | 6475 | | sp | 9526 | families than superfamilies. | px | 65122 | +------+-------+ § Which superfamily has the most families?

  14. A Little Science § Which superfamily has the most families? Cla (domain table) +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+

  15. A Little Science § Which superfamily has the most families? SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num DESC; +---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |

  16. A Little Science § Which families does the DNA binding-domain superfamily have? Des (Description table) +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | - | All alpha proteins | +-------+------+------+------+--------------------+ § The sccs of the superfamily is a.4.5. § Its families have sccs ,… a.4.5.1, a.4.5.2 § How can we list them?

  17. A Little Science Which families does the DNA binding-domain superfamily have? SELECT DISTINCT sccs, description FROM des WHERE sccs LIKE “a.4.5.%” AND type=“fa“ ORDER BY sccs; | sccs | description +----------+------------------------------------------------ | a.4.5.1 | Biotin repressor-like | a.4.5.10 | Replication initiation protein | a.4.5.11 | Helicase DNA-binding domain | a.4.5.12 | Restriction endonuclease FokI, N-terminal (recognition) domain | a.4.5.13 | Linker histone H1/H5 | a.4.5.14 | Forkhead DNA-binding domain | a.4.5.15 | DNA-binding domain from rap30 | a.4.5.16 | C-terminal domain of RPA32 | a.4.5.17 | Cell cycle transcription factor e2f-dp | a.4.5.18 | The central core domain of TFIIE beta | a.4.5.19 | Z-DNA binding domain | a.4.5.2 | LexA repressor, N-terminal DNA-binding domain | a.4.5.20 | P4 origin-binding domain-like | a.4.5.21 | ets domain | a.4.5.22 | Heat-shock transcription factor | a.4.5.23 | Interferon regulatory factor [...] => 45 rows

  18. A Little Science § Find example PDB structure for the families of the DNA binding-domain superfamily ... mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.1"; +--------+ | pdb_id | +--------+ | 1bia | | 1hxd | | 1bib | | 1j5y | +--------+ mysql> SELECT DISTINCT pdb_id FROM cla WHERE sccs="a.4.5.2"; +--------+ | pdb_id | +--------+ | 1jhf | | 1jhh | | 1lea | | 1leb | +--------+

  19. A Little Science Structure very similar! Biotin represor-like LexA represor 1jhf (a.4.5.2) 1bia (a.4.5.1)

  20. Sequence vs structure

  21. A Little Science 1cgp ( a.4.5.4)

  22. A Little Science Some more… 1b9n (a.4.5.8) 1smt (a.4.5.5) 1f4k (a.4.5.7) 1bm9 (a.4.5.7) 1hw1 (a.4.5.6)

  23. A Little Science How many percent of superfamilies have only § 1 family, how many 2,… ?

  24. A Little Science § How many percent of superfamilies have only 1, 2, 3, … families? 1. First let’s create the result of the query that found the number of families for each superfamily. SELECT des.sccs, des.description, COUNT(DISTINCT cla.fa) AS num FROM des, cla WHERE des.id=cla.sf GROUP BY cla.sf ORDER BY num desc +---------+---------------------------------------------------------+-----+ | sccs | description | num | +---------+---------------------------------------------------------+-----+ | a.4.5 | "Winged helix" DNA-binding domain | 38 | | c.66.1 | S-adenosyl-L-methionine-dependent methyltransferases | 35 | | c.69.1 | alpha/beta-Hydrolases | 30 | | c.37.1 | P-loop containing nucleotide triphosphate hydrolases | 22 | | c.52.1 | Restriction endonuclease-like | 22 | | b.18.1 | Galactose-binding domain-like | 21 |

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend