Michael Schroeder
BIOTEC TU Dresden ms@biotec.tu-dresden.de
Programming for Bioinformatics
Organization
Lectures: Thursday 9:20-10:50, Auditorium right CRTD
Prof. Michael Schroeder: michael.schroeder@tu-dresden.de
Predoc Melissa
§ Introduction to SQL
§ A Little Exercise
§ A Little Science

§ Data types and loops
§ Sequences and lists
§ Patterns and functions
§ Dictionaries & More Concepts
§ Data Visualization

§ MySQL Database Connection
§ REST Queries
§ Dynamic Programming & Clustering

§ Commands and Scripting
§ PyMOL Movie Project
§ Online resources for Python and MySQL available on course web page
http://www.biotec.tu-dresden.de/de/forschung/schroeder/teaching/programming-for-bioinformatics.html
§ Python in a Nutshell, Alex Martelli (O’Reilly)
§ Python Cookbook *, David Beazley (O’Reilly)
§ Learn Python the Hard Way *, Zed A. Shaw
§ Think Python: How to Think Like a Computer Scientist *, Allen B. Downey
The publisher O’Reilly has many good general programming (e-)books on Linux, Python, etc.
* free HTML version
§ W3Schools SQL (interactive online tutorial)
§ MySQL Cookbook, Paul DuBois (O'Reilly)
§ Jump Start MySQL, Timothy Boronczyk (O'Reilly)
§ MySQL Reference Manual (includes tutorials)
§ Each week you will get exercises in the lab. Do them during the lab (recommended) or finish them on your own during the week.
§ Results will be discussed in the lab the following week. Questions on the exercises can be asked at the beginning of the lecture or the labs.
§ Using the machines in the PC pool is recommended:
  § Access to databases
  § Availability of Python modules
§ No marks for the exercises
§ Doing all exercises each week makes the exam easier
§ You should try yourself before asking others
§ Goal I: Demonstrate your ability to use SQL and Python
§ Goal II: Apply your skills to a real-world problem
§ You will work in a team and get a biological problem.
  § Implementation of small workflows
  § Integration of data from various sources
  § Visualization of data
  § Explain your approach to others (5-minute presentation)
§ Possible tasks:
  § What is the largest ligand that can bind a metalloprotease?
In the last term,
§ we accessed most information online via the web
§ we interacted directly and manually with databases and tools
§ we had to manually submit queries, interpret results, select interesting results, cut and paste them, and submit queries again, …
Pro:
§ Reasonably easy to get hold of information
Con:
§ Not possible to ask many queries
§ Queries limited by the interface provided by the web page
§ Difficult or impossible to integrate information from different sites
In this term,
§ we will look at the databases underlying the online front ends
§ How is the data stored internally?
§ How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries, ask large queries, and integrate different systems?
You are limited by what the web server allows you to ask. Example: CATH
But you cannot ask:
there a P-loop domain?
DNA-binding domain
the largest structure in PDB have?
database you could answer all these questions (and many more)
§ We worked with SCOP, the Structural Classification of Proteins
§ Family: >30% sequence identity
§ Superfamily: similar structure and function (possibly below 30% sequence identity)
Sequence identity Structure similarity
§ How many families and superfamilies are there?
§ Do all superfamilies have roughly the same number of families?
§ How many families does the immunoglobulin superfamily have?
§ Which superfamily has the most families, and how many?
§ What percentage of superfamilies have only one family?
§ Which PDB structure has the largest number of distinct superfamilies?
§ What percentage of PDB structures contain only one type of superfamily, and what percentage contain at least two?
§ Which is the most popular superfamily?
§ Are all superfamilies equally likely to co-occur, or do they have preferences?
§ Which superfamily has the most co-occurrence partners?
§ Are the number of co-occurrence partners and the frequency of a superfamily correlated?
Can we do it with the knowledge you have so far?
§ Logical organization of data
  § Data models, schema design, dictionaries
§ Physical organization of data
  § Fast retrieval, indexing, compact storage of data
+-------+------+---------+----------+
| id    | name | salary  | role     |
+-------+------+---------+----------+
| 46457 | pete | 50.000  | director |
| 46458 | jane | 60.000  | nurse    |
| 46459 | asif | 70.000  | driver   |
+-------+------+---------+----------+
+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A:                              |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A:                              |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A:                              |
+-------+------+---------+---------+--------------------------------------+
Cla
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

Des
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+

Astral
+---------+---------+-----------------------------------------------------------+
| sid     | sccs    | seq                                                       |
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+

Subchain
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
| 1  | 14982 | A        |       |      |
+----+-------+----------+-------+------+
Do you see any relation between tables?
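One way to make the relations concrete is to join the example rows on their shared columns. The following is a minimal sketch, not the course's setup: it loads the four rows shown above into an in-memory SQLite database (an assumption so that the example runs without a server; the course itself uses MySQL), and the column types are guesses based on the sample values.

```python
import sqlite3

# SQLite stands in for MySQL here (assumption); column types are guessed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cla (sid TEXT, pdb_id TEXT, sccs TEXT, cl INTEGER, cf INTEGER,
                  sf INTEGER, fa INTEGER, dm INTEGER, sp INTEGER, px INTEGER);
CREATE TABLE des (id INTEGER, type TEXT, sccs TEXT, sid TEXT, description TEXT);
CREATE TABLE astral (sid TEXT, sccs TEXT, seq TEXT);
CREATE TABLE subchain (id INTEGER, px INTEGER, chain_id TEXT,
                       "begin" INTEGER, "end" INTEGER);
""")

conn.execute("INSERT INTO cla VALUES (?,?,?,?,?,?,?,?,?,?)",
             ("d1dlwa_", "1dlw", "a.1.1.1",
              46456, 46457, 46458, 46459, 46460, 46461, 14982))
conn.execute("INSERT INTO des VALUES (?,?,?,?,?)",
             (46456, "cl", "a", "-", "All alpha proteins"))
# The sequence is truncated on the slide; only the visible prefix is stored.
conn.execute("INSERT INTO astral VALUES (?,?,?)",
             ("d1dlwa_", "a.1.1.1",
              "slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg"))
conn.execute("INSERT INTO subchain VALUES (?,?,?,?,?)",
             (1, 14982, "A", None, None))

# cla.cl points to des.id, cla.sid matches astral.sid,
# and cla.px matches subchain.px.
query = """
SELECT cla.sid, cla.pdb_id, des.description, subchain.chain_id
FROM cla
JOIN des      ON des.id      = cla.cl
JOIN astral   ON astral.sid  = cla.sid
JOIN subchain ON subchain.px = cla.px
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The numeric columns cl, cf, sf, fa, dm, sp and px of Cla all hold ids that can be looked up in Des in the same way.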
SQL = Structured Query Language
§ Select: Which attributes?
§ From: Which tables?
§ Where: Which conditions?
Select … from … where …
§ Distinct
§ Like
§ Union/intersect
§ Join
§ Count/average/sum/min/max
§ Group by
§ Having
§ Show tables
§ Show databases
§ Use
§ Create database
§ Create table … as
§ Drop table
§ Load data
§ Insert into
Using SQL constructs, we can answer complex biological questions!
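As a small illustration of a few of these constructs (count, group by, having, distinct, like), here is a sketch that runs them on the ten Des rows from the earlier example table. SQLite is used in place of MySQL so that the snippet is self-contained; that substitution, and the column types, are assumptions.

```python
import sqlite3

# The ten example rows of the Des table shown earlier.
des_rows = [
    (46457, "cf", "a.1", "-", "Globin-like"),
    (46458, "sf", "a.1.1", "-", "Globin-like"),
    (46459, "fa", "a.1.1.1", "-", "Truncated hemoglobin"),
    (46460, "dm", "a.1.1.1", "-", "Truncated hemoglobin"),
    (46461, "sp", "a.1.1.1", "-", "Ciliate (Paramecium caudatum)"),
    (14982, "px", "a.1.1.1", "d1dlwa_", "1dlw A:"),
    (46462, "sp", "a.1.1.1", "-", "Green alga (Chlamydomonas eugametos)"),
    (14983, "px", "a.1.1.1", "d1dlya_", "1dly A:"),
    (63437, "sp", "a.1.1.1", "-", "Mycobacterium tuberculosis"),
    (62301, "px", "a.1.1.1", "d1idra_", "1idr A:"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL server (assumption)
conn.execute(
    "CREATE TABLE des (id INTEGER, type TEXT, sccs TEXT, sid TEXT, description TEXT)"
)
conn.executemany("INSERT INTO des VALUES (?,?,?,?,?)", des_rows)

# Count/group by/having: which entry types occur more than once?
per_type = conn.execute(
    "SELECT type, COUNT(*) FROM des "
    "GROUP BY type HAVING COUNT(*) > 1 ORDER BY type"
).fetchall()

# Distinct/like: which distinct descriptions mention hemoglobin?
hemoglobin = conn.execute(
    "SELECT DISTINCT description FROM des WHERE description LIKE '%hemoglobin%'"
).fetchall()

print(per_type)
print(hemoglobin)
```

On the full SCOP tables, the same pattern (group by superfamily, count families) addresses questions such as which superfamily has the most families.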
http://www.sallmann-genealogy.de/Stammbaum1RobertGross.jpg
§ Easy to write quick programs
§ More than just a scripting language
§ Interpreted, interactive, indented
§ Supports string processing well
§ Widely used in bioinformatics
§ Object-oriented, general purpose
§ Many nice libraries for database access, graphics, web, GUI, R, …
§ Scientific orientation: Numerical Python (math), Scientific Python, Biopython
Beware: Python is comparatively slow, but computationally expensive parts can be included as C libraries.
§ Variables, strings
§ For/while loops
§ If statements
§ File I/O
§ Regular expressions
§ Data structures:
  § Lists: an ordered, changeable collection indexed by position; items may be of mixed types
  § Dictionaries: a changeable collection of key-value pairs, indexed by key (insertion-ordered since Python 3.7)
§ Code structure:
  § Classes: blueprints that bundle functions and variables
  § Objects: instances created from a class, which serves as the blueprint
  § Modules: files that store Python code, from small helpers to larger chunks and whole programs
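A short sketch of how lists, dictionaries, and classes fit together, using the domain identifiers from the SCOP example table. The `Domain` class and its `superfamily` method are hypothetical names introduced only for illustration.

```python
# List: ordered and changeable; items are accessed by position.
domain_ids = ["d1dlwa_", "d1dlya_"]
domain_ids.append("d1idra_")

# Dictionary: maps keys to values; items are accessed by key.
descriptions = {
    "d1dlwa_": "1dlw A:",
    "d1dlya_": "1dly A:",
    "d1idra_": "1idr A:",
}


class Domain:
    """Hypothetical class for illustration: one SCOP domain."""

    def __init__(self, sid, sccs):
        self.sid = sid
        self.sccs = sccs

    def superfamily(self):
        """Return the class.fold.superfamily prefix of the sccs string."""
        return ".".join(self.sccs.split(".")[:3])


# Objects: each Domain(...) call creates a new instance of the class.
domains = [Domain(sid, "a.1.1.1") for sid in domain_ids]
for d in domains:
    print(d.sid, d.superfamily(), descriptions[d.sid])
```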
Many tools today have web services available:
§ REST provides a straightforward way to send queries
§ No need to set up libraries, tools, or databases locally
§ Just formulate a query with your parameters and retrieve the result via the browser or programmatically
§ E.g. for listing tabular information for a query in UniProt
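A minimal sketch of such a programmatic query using only the standard library. The endpoint URL, the parameter names (`query`, `format`, `fields`, `size`) and the field names are assumptions about UniProt's current REST interface and should be checked against its documentation; the helper names `build_query_url`, `fetch`, and `parse_tsv` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the UniProt REST search service; verify before use.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query_url(query, fields=("accession", "id", "protein_name", "length"), size=5):
    """Encode the query parameters into a REST URL."""
    params = {
        "query": query,
        "format": "tsv",            # ask for tab-separated output
        "fields": ",".join(fields),
        "size": size,               # number of results to return
    }
    return f"{UNIPROT_SEARCH}?{urlencode(params)}"


def fetch(url):
    """Retrieve the response body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_tsv(text):
    """Turn tab-separated text with a header line into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


url = build_query_url("insulin")
print(url)  # the same URL can be pasted into a browser
```

With network access, `parse_tsv(fetch(url))` would give one dictionary per result row.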
§ Can we replicate the plot below? § Can we create a similar plot for specific superfamilies? E.g. DNA-binding domains?
script or another tool
TM-score or plain RMSD calculation
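For orientation, a plain RMSD between two equally long lists of 3D coordinates can be sketched as follows. This assumes the two structures are already superimposed and that atoms are paired by position in the lists; it performs no alignment or optimal superposition, and it is not a TM-score. The function name `rmsd` is a hypothetical helper.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both lists have the same length and are already superimposed.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy coordinates, invented purely for illustration.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(a, b))
```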
■ Can we characterise the amino acid composition of different families/superfamilies?
■ Again: select the relevant sequences from Astral and count the frequencies of amino acids
■ Is the amino acid composition at the interface of a domain different from the rest of the domain?
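Counting amino acid frequencies can be sketched in a few lines. The example uses the visible, truncated part of the Astral sequence for d1dlwa_ from the example table, so the numbers describe only that fragment; the function name `aa_composition` is a hypothetical helper.

```python
from collections import Counter


def aa_composition(sequences):
    """Return the relative frequency of each amino acid over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Visible prefix of the d1dlwa_ sequence (truncated on the slide).
astral_fragment = "slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg"

composition = aa_composition([astral_fragment])
# Print the five most frequent residues of the fragment.
for aa, freq in sorted(composition.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{aa}: {freq:.3f}")
```

For a whole family or superfamily, the list passed to `aa_composition` would hold all sequences selected from Astral for that group.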
■ Given a SCOP superfamily and its sequences, how can we divide it into families?
■ First, we need dynamic programming to determine the sequence similarity
■ Then we do the following:
■ For all pairs of sequences, call the sequence similarity algorithm and record the result in a distance matrix
■ Next, run hierarchical clustering to cluster the sequences
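The steps above can be sketched as follows, under several assumptions: the similarity is a Needleman-Wunsch global alignment score with simple match/mismatch/gap values (no substitution matrix), the score is converted to a distance by a simple normalisation, and the clustering is single-linkage agglomerative clustering stopped at a distance threshold. The slides do not specify any of these choices, and the toy sequences are invented for illustration.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score by dynamic programming, keeping two rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def distance(a, b):
    """0 for identical sequences, larger for less similar ones."""
    return 1 - nw_score(a, b) / max(len(a), len(b))


def single_linkage(names, dist, threshold):
    """Merge the closest clusters until none is closer than the threshold."""
    clusters = [{n} for n in names]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(dist[frozenset((p, q))]
                    for p in clusters[x] for q in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] |= clusters[y]  # x < y, so deleting y keeps x valid
        del clusters[y]
    return clusters


# Toy sequences, invented for illustration only.
seqs = {"s1": "ACDEFGHIK", "s2": "ACDEFGHIR", "s3": "WWYYPPNNQ", "s4": "WWYYPPNNE"}

# Distance matrix stored as a dict keyed by unordered name pairs.
dist = {frozenset((p, q)): distance(seqs[p], seqs[q])
        for p, q in combinations(seqs, 2)}

families = single_linkage(list(seqs), dist, threshold=0.5)
print(families)
```

Other linkage rules (complete, average) differ only in how the distance between two clusters is computed from the pairwise distances.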