Programming for Bioinformatics, Michael Schroeder, BIOTEC TU Dresden



slide-1
SLIDE 1

Michael Schroeder

BIOTEC TU Dresden ms@biotec.tu-dresden.de

Programming for Bioinformatics

slide-2
SLIDE 2

Organization

§ Lectures: Thursday 9:20 – 10:50, Auditorium right CRTD
  • Prof. Michael Schroeder: michael.schroeder@tu-dresden.de
  • Predoc Melissa Adasme: melissa.adasme@tu-dresden.de
§ Labs: Thursday 11:10 – 12:40, E030 (PC Pool) BIOTEC
  • Predoc Negin Malekian: negin.malekian@tu-dresden.de
§ Group projects: 31st January 9:20 – 10:50 (task description will be released 3 weeks before)

1

slide-3
SLIDE 3

The module…

§ will teach you basic programming skills relevant to bioinformatics, which will enable you to actively develop bioinformatics tools.
§ will take a problem-driven approach.
§ will present bioinformatics problems and show how to solve them using existing online tools and how to implement such tools.
§ will revisit some of the problems and databases discussed in applied bioinformatics.
§ will be a very practical, hands-on introduction to basic computer science tools, such as using command-line operating systems, programming in Python, and using relational databases.

2

slide-4
SLIDE 4

Objectives

§ You will be able to automate simple repetitive information retrieval tasks
§ You will be able to write simple programs in Python
§ You will be able to work with relational databases
§ You will appreciate the principles, limits, and possibilities of programming
§ You will be able to formulate biological questions as information processing problems
§ You will understand when and how programming can help to automate bioinformatics problems

3

slide-5
SLIDE 5

Module Structure

■ Introduction
■ Databases
  § Introduction to SQL
  § A Little Exercise
  § A Little Science
■ Introduction to Python
■ Programming concepts
  § Data types and loops
  § Sequences and lists
  § Patterns and functions
  § Dictionaries & More Concepts
  § Data Visualization
■ More Python
  § MySQL Database Connection
  § REST Queries
  § Dynamic Programming & Clustering
■ Introduction to PyMOL
  § Commands and Scripting
  § PyMOL Movie Project
■ Revision Class

4

slide-6
SLIDE 6

Resources

§ Online resources for Python and MySQL available on course web page

http://www.biotec.tu-dresden.de/de/forschung/schroeder/teaching/programming-for-bioinformatics.html

5

slide-7
SLIDE 7

Resources: Python

§ Python in a Nutshell, Alex Martelli (O’Reilly)
§ Python Cookbook *, David Beazley (O’Reilly)
§ Learn Python the Hard Way *, Zed A. Shaw
§ Think Python: How to Think Like a Computer Scientist *, Allen B. Downey

The publisher O’Reilly has many good general programming (e-)books on Linux, Python, etc.

* free HTML version

6

slide-8
SLIDE 8

Resources: MySQL

§ W3Schools SQL (interactive online tutorial)
§ MySQL Cookbook, Paul DuBois (O'Reilly)
§ Jump Start MySQL, Timothy Boronczyk (O'Reilly)
§ MySQL Reference Manual (includes tutorials)

7

slide-9
SLIDE 9

Lab Exercises

§ Each week you will get exercises to do during the lab (recommended) or to finish on your own during the week
§ Results will be discussed in the lab the following week. Questions on exercises can be asked at the beginning of the lecture or the lab
§ Using the machines in the PC pool is recommended
  § Access to databases
  § Availability of Python modules
§ No marks for the exercises
§ Doing all exercises each week makes the exam easier
§ You should try on your own before asking others

8

slide-10
SLIDE 10

Programming Projects

§ Goal I: Demonstrate ability to use SQL and Python
§ Goal II: Apply your skills to a real-world problem
§ You will work in a team and get a biological problem.
  § Implementation of small workflows
  § Integration of data from various sources
  § Visualization of data
  § Explain approach to others (5-minute presentation)
§ Possible tasks:
  § What is the largest ligand that can bind a metalloprotease?

Suggestions for tasks?

9

slide-11
SLIDE 11

Motivation: Databases

In the last term,

§ we accessed most information online via the web
§ we interacted directly and manually with databases and tools
§ we had to manually submit queries, interpret results, select interesting results, cut & paste them, and submit queries again, …

Pro:

§ Reasonably easy to get hold of information

Con:

§ Not possible to ask many queries
§ Queries limited by the interface provided by the web page
§ Difficult/impossible to integrate information from different sites

In this term,

§ we will look at the databases underlying the online front ends
§ How is the data stored internally?
§ How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries and larger queries, and integrate different systems?

10

slide-12
SLIDE 12

What actually happens

You are limited by what the web server allows you to ask. Example CATH:

  • PDB ID,
  • CATH code, or
  • General text

But you cannot ask:

  • In how many different PDB structures is there a P-loop domain?
  • Is there a PDB entry with a P-loop and a DNA-binding domain?
  • How many different superfamilies does the largest structure in PDB have?

With direct access to the underlying database you could answer all these questions (and many more).

11

slide-13
SLIDE 13

Motivation: SCOP as Relational Database

§ We worked with SCOP, the Structural Classification of Proteins
  § Family: >30% sequence identity
  § Superfamily: Similar structure and function (possibly below 30% sequence identity)

[Figure: sequence identity vs. structure similarity]

12

slide-14
SLIDE 14

Motivation: Databases

We wish to answer the following questions:

§ How many families and superfamilies are there?
§ Do all superfamilies have roughly the same number of families?
§ How many families does the immunoglobulin superfamily have?
§ Which superfamily has the most families and how many?
§ What percentage of superfamilies have only one family?
§ Which PDB structure has the largest number of distinct superfamilies?
§ What percentage of PDB structures have only one type of superfamily, and what percentage have at least two?
§ Which is the most popular superfamily?
§ Are all superfamilies equally likely to co-occur or do they have preferences?
§ Which superfamily has the most co-occurrence partners?
§ Are the number of co-occurrence partners and the frequency of the superfamily correlated?

Can we do it with the knowledge you have so far?

13

slide-15
SLIDE 15

What is a Database ?

§ SCOP contains relevant information, but we cannot answer the above questions through the web interface of SCOP
§ The problem is that we do not have access to the underlying database

What is a database? A database provides…

§ Logical organization of data
  § Data models, schema design, dictionaries
§ Physical organization of data
  § Fast retrieval, indexing, compact storage of data

14

slide-16
SLIDE 16

Relational Database

Central Idea: Data as relations in a table

§ E.g. Employee

+-------+------+---------+----------+
| id    | name | salary  | role     |
+-------+------+---------+----------+
| 46457 | pete | 50.000  | director |
| 46458 | jane | 60.000  | nurse    |
| 46459 | asif | 70.000  | driver   |
+-------+------+---------+----------+

15

slide-17
SLIDE 17

Relational Database

Central Idea: Data as relations in a table

§ E.g. SCOP, Structural Classification of Proteins

+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A:                              |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A:                              |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A:                              |
+-------+------+---------+---------+--------------------------------------+

16

slide-18
SLIDE 18

SCOP database

17

slide-19
SLIDE 19

SCOP Tables

Cla
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

Des
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+

Astral
+---------+---------+-----------------------------------------------------------+
| sid     | sccs    | seq                                                       |
+---------+---------+-----------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...|
+---------+---------+-----------------------------------------------------------+

Subchain
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
| 1  | 14982 | A        |       |      |
+----+-------+----------+-------+------+

Do you see any relation between tables?

18

slide-23
SLIDE 23

Querying Relational Databases

SQL = Structured Query Language

§ Select: Which attributes?
§ From: Which tables?
§ Where: Which conditions?

Select … from … where …

§ Distinct
§ Like
§ Union/intersect
§ Join
§ Count/average/sum/min/max
§ Group by
§ Having
§ Show tables
§ Show databases
§ Use
§ Create database
§ Create table … as
§ Drop table
§ Load data
§ Insert into

Using SQL constructs we can answer the complex biological questions!

19
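As a rough illustration of how count, distinct, group by, and having combine, the sketch below answers three of the earlier questions on a tiny table. Everything in it is hypothetical: the table is a cut-down version of Cla, the ids and PDB codes are invented, and sqlite3 again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced Cla table: one row per domain.
conn.execute("CREATE TABLE cla (sid TEXT, pdb_id TEXT, sf INT, fa INT)")
conn.executemany("INSERT INTO cla VALUES (?, ?, ?, ?)", [
    ("d1aaaa_", "1aaa", 100, 1001),
    ("d1bbba_", "1bbb", 100, 1002),
    ("d1ccca_", "1ccc", 100, 1002),
    ("d1ddda_", "1ddd", 200, 2001),
    ("d1aaab_", "1aaa", 200, 2001),
])

# Number of families per superfamily, largest first.
per_sf = conn.execute("""
    SELECT sf, COUNT(DISTINCT fa) AS n_families
    FROM cla GROUP BY sf ORDER BY n_families DESC
""").fetchall()

# Superfamilies that contain exactly one family.
single = conn.execute("""
    SELECT sf FROM cla GROUP BY sf HAVING COUNT(DISTINCT fa) = 1
""").fetchall()

# PDB structure with the largest number of distinct superfamilies.
top_pdb = conn.execute("""
    SELECT pdb_id, COUNT(DISTINCT sf) AS n_sf
    FROM cla GROUP BY pdb_id ORDER BY n_sf DESC LIMIT 1
""").fetchone()

print(per_sf)   # [(100, 2), (200, 1)]
print(single)   # [(200,)]
print(top_pdb)  # ('1aaa', 2)
```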

slide-24
SLIDE 24

Why SQL is not enough

Total number of ancestors for any given family member?

http://www.sallmann-genealogy.de/Stammbaum1RobertGross.jpg
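A query with a fixed number of joins only reaches a fixed number of generations, whereas a family tree can be arbitrarily deep. In a programming language the chain of parents can simply be followed until it ends. The sketch below is one possible way to do this; the family tree and all names in it are made up.

```python
# Hypothetical family tree: each person maps to the list of known parents.
parents = {
    "anna": ["bert", "clara"],
    "bert": ["dora", "emil"],
    "clara": ["fritz"],
    "dora": [],
    "emil": ["gustav"],
    "fritz": [],
    "gustav": [],
}


def ancestors(person, tree):
    """Return the set of all ancestors of person by following parent links."""
    found = set()
    for parent in tree.get(person, []):
        found.add(parent)
        found |= ancestors(parent, tree)  # recurse into earlier generations
    return found


print(len(ancestors("anna", parents)))  # 6
print(len(ancestors("bert", parents)))  # 3
```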

21

slide-25
SLIDE 25

What’s needed…

§ …programming in Python

22

slide-26
SLIDE 26

Programming

We will use Python (created by Guido van Rossum, named after Monty Python) as a convenient extension to the operating system

§ Easy to write quick programs
§ More than just a scripting language
§ Interpreted, interactive, indented
§ Supports string processing well
§ Widely used in bioinformatics
§ Object oriented, general purpose
§ Many nice libraries for database access, graphics, web, GUI, R…
§ Scientific orientation: Numerical Python (math), Scientific Python, Biopython

Beware: Python is comparatively slow, but computationally expensive parts can be included as C libraries

23

slide-27
SLIDE 27

Python Programming Constructs

§ Variables, strings
§ For/while loops
§ If statements
§ File I/O
§ Regular expressions
§ Data structures:
  § Lists: an ordered, changeable collection (items may be of different types)
  § Dictionaries: a changeable collection of key-value pairs, indexed by key (insertion order is preserved since Python 3.7)
§ Code structure:
  § Objects: instances created from a class, which serves as the blueprint
  § Classes: containers that bundle functions and variables
  § Modules: files that store Python code, from small helpers to whole programs

24
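To make these constructs concrete, here is a small sketch that uses a class, objects, a list, a for loop, an if statement, and a dictionary together. The class name `Domain` and the identifiers `dxxxxa_` and `dyyyya_` are invented for illustration; only `d1dlwa_` and the start of its sequence come from the SCOP slides.

```python
class Domain:
    """Class: a blueprint bundling variables (sid, seq) and a function (length)."""

    def __init__(self, sid, seq):
        self.sid = sid
        self.seq = seq

    def length(self):
        return len(self.seq)


# Objects: instances created from the class, stored in a list (ordered, changeable).
domains = [Domain("d1dlwa_", "slfeqlggqa"), Domain("dxxxxa_", "mkv")]
domains.append(Domain("dyyyya_", "acdef"))

# Dictionary: maps a key (the sid) to a value (the sequence length).
lengths = {}
for domain in domains:          # for loop over the list
    if domain.length() > 4:     # if statement filters short sequences
        lengths[domain.sid] = domain.length()

print(lengths)  # {'d1dlwa_': 10, 'dyyyya_': 5}
```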

slide-28
SLIDE 28

Don’t spend time on setup

Many tools today have web services available:

§ REST provides a straightforward way to send queries
§ no need to set up libraries, tools, or databases locally
§ just formulate a query with your parameters and retrieve the result via the browser or programmatically
§ E.g. for listing tabular information for a query in UniProt

http://www.uniprot.org/uniprot/?query=insulin&sort=score&columns=id,entry%20name,organism,length&format=tab

Also available for BLAST, Clustal Omega, and others.
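Retrieving such a result programmatically could look roughly like the sketch below, which uses only the standard library. The function names are hypothetical, the URL scheme is the one shown on the slide (the service may have changed since), and the sample text passed to the parser is made up to show the tab-separated format; no request is actually sent here.

```python
import csv
import io
from urllib.parse import quote, urlencode
from urllib.request import urlopen

BASE_URL = "http://www.uniprot.org/uniprot/"  # as on the slide; may be outdated


def build_query_url(term):
    """Assemble the query URL from its parameters."""
    params = {
        "query": term,
        "sort": "score",
        "columns": "id,entry name,organism,length",
        "format": "tab",
    }
    # quote_via=quote encodes spaces as %20; safe="," keeps commas readable.
    return BASE_URL + "?" + urlencode(params, safe=",", quote_via=quote)


def fetch(url):
    """Download the response body as text (not called in this sketch)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_tab(text):
    """Turn tab-separated text with a header line into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


print(build_query_url("insulin"))
sample = "Entry\tEntry name\nP01308\tINS_HUMAN\n"  # made-up sample response
print(parse_tab(sample))
```

Calling `parse_tab(fetch(build_query_url("insulin")))` would combine the three steps, provided the service still answers at that address.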

25

slide-29
SLIDE 29

Motivation: Sequence vs. Structure

§ Can we replicate the plot below?
§ Can we create a similar plot for specific superfamilies, e.g. DNA-binding domains?

26

slide-30
SLIDE 30

Motivation: Sequence vs. Structure

  1. Select the relevant sequences from the astral table
  2. Compute the pairwise sequence identity using a simple Python script or another tool
  3. Retrieve all structures from PDB for the given sequences
  4. Compute pairwise structural similarity using an algorithm such as TM-score or plain RMSD calculation
  5. Finally, plot the two similarities against each other in a scatter plot
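Step 2 can be sketched in a few lines. The version below assumes the sequences are already aligned to the same length, which real sequences from the astral table generally are not, so an alignment step would have to come first. The three sequences and their names are invented; structural similarity and plotting are left out.

```python
from itertools import combinations


def percent_identity(a, b):
    """Percentage of identical positions in two pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    # A gap ('-') aligned to a gap does not count as a match.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


# Hypothetical, already aligned toy sequences.
seqs = {
    "seqA": "ACDEFGHIKL",
    "seqB": "ACDEFGHIKV",
    "seqC": "ACDQFGHWKV",
}

# One identity value per unordered pair of sequences.
identities = {
    (p, q): percent_identity(seqs[p], seqs[q])
    for p, q in combinations(seqs, 2)
}
for pair, value in identities.items():
    print(pair, value)
```

Each value in `identities` would become the x coordinate of one point in the scatter plot, with the structural similarity of the same pair as the y coordinate.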

27

slide-31
SLIDE 31

Motivation: Amino Acid Composition of Families

■ Can we characterise the amino acid composition of different families/superfamilies?
■ Again: select the relevant sequences from astral and count the frequencies of amino acids
■ Is the amino acid composition at the interface of a domain different from the rest of the domain?
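Counting amino acid frequencies is a short exercise with `collections.Counter`. The sketch below is a minimal version under the assumption that the sequences have already been selected from astral; as input it uses only the truncated sequence fragment of d1dlwa_ shown on the SCOP Tables slide, so the numbers are not a real family composition.

```python
from collections import Counter


def composition(sequences):
    """Relative frequency of each amino acid over all given sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


# Truncated fragment from the Astral table on the slide, not the full sequence.
fragment = "slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg"
freqs = composition([fragment])

# Print the amino acids from most to least frequent.
for aa, freq in sorted(freqs.items(), key=lambda item: -item[1]):
    print(aa, round(freq, 3))
```

Running `composition` once per family and comparing the resulting dictionaries would address the first question on the slide.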

28

slide-32
SLIDE 32

Motivation: Let’s rebuild SCOP families

■ Given a SCOP superfamily and its sequences, how can we divide it into families?
■ First, we need dynamic programming to determine the sequence similarity
■ Then we do the following:
  ■ For all pairs of sequences, call the sequence similarity algorithm and record the result in a distance matrix
  ■ Next, run hierarchical clustering to cluster the sequences
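The whole pipeline can be sketched in plain Python. Two simplifications are assumed here: edit distance (computed by dynamic programming) stands in for a proper alignment-based similarity, and clustering is a simple single-linkage procedure that stops at a distance threshold. The sequences and the threshold are invented, and this is not how SCOP families are actually defined.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def distance_matrix(seqs):
    """Symmetric matrix of pairwise edit distances."""
    n = len(seqs)
    dist = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = edit_distance(seqs[i], seqs[j])
    return dist


def single_linkage(dist, threshold):
    """Merge the two closest clusters until they are farther apart than threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x stays valid
    return clusters


# Hypothetical toy sequences forming two obvious groups.
seqs = ["ACDEFG", "ACDEFH", "WWYYKK", "WWYYKR"]
families = single_linkage(distance_matrix(seqs), threshold=2)
print(families)  # [[0, 1], [2, 3]]
```

Each inner list holds the indices of the sequences grouped into one candidate family.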

29