Programming for Bioinformatics Michael Schroeder BIOTEC TU Dresden - PowerPoint PPT Presentation

Programming for Bioinformatics Michael Schroeder BIOTEC TU Dresden ms@biotec.tu-dresden.de

Organization
§ Lectures: Thursday 9:20 – 10:50, Auditorium right, CRTD
  Prof. Michael Schroeder: michael.schroeder@tu-dresden.de
  Predoc Melissa Adasme: melissa.adasme@tu-dresden.de
§ Labs: Thursday 11:10 – 12:40, E030 (PC Pool), BIOTEC
  Predoc Negin Malekian: negin.malekian@tu-dresden.de
§ Group projects: 31st January, 9:20 – 10:50 (task description will be released 3 weeks before)

The module…
§ will teach you basic programming skills relevant to bioinformatics, which will enable you to actively develop bioinformatics tools.
§ will take a problem-driven approach.
§ will present bioinformatics problems and show how to solve them using existing online tools and how to implement such tools.
§ will revisit some of the problems and databases discussed in applied bioinformatics.
§ will be a very practical, hands-on introduction to basic computer science tools such as command-line operating systems, programming in Python, and relational databases.

Objectives
§ You will be able to automate simple repetitive information retrieval tasks
§ You will be able to write simple programs in Python
§ You will be able to work with relational databases
§ You will appreciate the principles, limits, and possibilities of programming
§ You will be able to formulate biological questions as information processing problems
§ You will understand when and how programming can help to automate bioinformatics problems

Module Structure
■ Introduction
■ Databases
  § Introduction to SQL
  § A Little Exercise
  § A Little Science
■ Introduction to Python
■ Programming concepts
  § Data types and loops
  § Sequences and lists
  § Patterns and functions
  § Dictionaries & More Concepts
  § Data Visualization
■ More Python
  § MySQL Database Connection
  § REST Queries
  § Dyn. Progr. & Clustering
■ Introduction to PyMOL
  § Commands and Scripting
  § PyMOL Movie Project
■ Revision Class

Resources
§ Online resources for Python and MySQL are available on the course web page:
  http://www.biotec.tu-dresden.de/de/forschung/schroeder/teaching/programming-for-bioinformatics.html

Resources: Python
§ Python in a Nutshell, Alex Martelli (O’Reilly)
§ Python Cookbook *, David Beazley (O’Reilly)
  The publisher O’Reilly has many good general programming (e-)books on Linux, Python, etc.
§ Learn Python the Hard Way *, Zed A. Shaw
§ Think Python: How to Think Like a Computer Scientist *, Allen B. Downey
* free HTML version

Resources: MySQL
§ W3schools SQL (interactive online tutorial)
§ MySQL Cookbook, Paul DuBois (O'Reilly)
§ Jump Start MySQL, Timothy Boronczyk (O'Reilly)
§ MySQL Reference Manual (includes tutorials)

Lab Exercises
§ Each week you will get exercises in the lab, which you should do during the lab (recommended) or finish on your own during the week
§ Results will be discussed in the lab the following week. Questions on the exercises can be asked at the beginning of the lecture or lab
§ Using the machines in the PC pool is recommended
  § Access to databases
  § Availability of Python modules
§ No marks for the exercises
§ Doing all exercises each week makes the exam easier
§ You should try yourself before asking others

Programming Projects
§ Goal I: Demonstrate your ability to use SQL and Python
§ Goal II: Apply your skills to a real-world problem
§ You will work in a team and get a biological problem.
  § Implementation of small workflows
  § Integration of data from various sources
  § Visualization of data
  § Explain your approach to others (5-minute presentation)
§ Possible tasks:
  § What is the largest ligand that can bind a metalloprotease?
Suggestions for tasks?

Motivation: Databases
In the last term,
§ we accessed most information online via the web
§ we interacted directly and manually with databases and tools
§ we had to manually submit queries, interpret results, select interesting results, cut and paste them, and submit queries again, …
Pro:
§ Reasonably easy to get hold of information
Con:
§ Not possible to ask many queries
§ Queries limited by the interface provided by the web page
§ Difficult or impossible to integrate information from different sites
In this term,
§ we will look at the databases underlying the online front ends
§ How is the data stored internally?
§ How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries, ask large queries, and integrate different systems?

What actually happens
You are limited by what the web server allows you to ask.
Example CATH: you can search by
• PDB ID,
• CATH code, or
• General text
But you cannot ask:
• In how many different PDB structures is there a P-loop domain?
• Is there a PDB entry with a P-loop and a DNA-binding domain?
• How many different superfamilies does the largest structure in PDB have?
With direct access to the underlying database you could answer all these questions (and many more).

Motivation: SCOP as Relational Database
§ We worked with SCOP, the Structural Classification of Proteins
§ Family: >30% sequence identity
§ Superfamily: similar structure and function (possibly lower than 30% sequence identity)
[Figure: structure similarity vs. sequence identity]

Motivation: Databases
We wish to answer the following questions:
§ How many families and superfamilies are there?
§ Do all superfamilies have roughly the same number of families?
§ How many families does the immunoglobulin superfamily have?
§ Which superfamily has the most families, and how many?
§ What percentage of superfamilies have only one family?
§ Which PDB structure has the largest number of distinct superfamilies?
§ What percentage of PDB structures contain only one type of superfamily, and what percentage contain at least two?
§ Which is the most popular superfamily?
§ Are all superfamilies equally likely to co-occur, or do they have preferences?
§ Which superfamily has the most co-occurrence partners?
§ Are the number of co-occurrence partners and the frequency of a superfamily correlated?
Can we do it with the knowledge you have so far?

What is a Database?
§ SCOP contains relevant information, but we cannot answer the above questions through the web interface of SCOP
§ The problem is that we do not have access to the underlying database
What is a database? A database provides…
§ Logical organization of data
  § Data models, schema design, dictionaries
§ Physical organization of data
  § Fast retrieval, indexing, compact storage of data

Relational Database
Central Idea: Data as relations in a table
§ E.g. Employee

+-------+------+--------+----------+
| id    | name | salary | role     |
+-------+------+--------+----------+
| 46457 | pete | 50.000 | director |
| 46458 | jane | 60.000 | nurse    |
| 46459 | asif | 70.000 | driver   |
+-------+------+--------+----------+
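The Employee relation above can be tried out directly from Python. The following is only a minimal sketch: it uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL server used in the course, and it assumes that the salaries on the slide ("50.000") are thousands written with a dot separator, so they are stored as integers. The column types and the salary threshold are likewise assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")

# One relation = one table; each row is one employee.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, role TEXT)"
)

# Rows taken from the slide; "50.000" is read as 50000 (assumption).
rows = [
    (46457, "pete", 50000, "director"),
    (46458, "jane", 60000, "nurse"),
    (46459, "asif", 70000, "driver"),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)

# A declarative query: who earns more than 55000, ordered by salary?
high_earners = conn.execute(
    "SELECT name, role FROM employee WHERE salary > ? ORDER BY salary",
    (55000,),
).fetchall()

print(high_earners)  # [('jane', 'nurse'), ('asif', 'driver')]
```

The query states what is wanted (a condition on `salary`), not how to find it; the database decides how to retrieve the matching rows.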

Relational Database
Central Idea: Data as relations in a table
§ E.g. SCOP, Structural Classification of Proteins

+-------+------+---------+---------+--------------------------------------+
| id    | type | sccs    | sid     | description                          |
+-------+------+---------+---------+--------------------------------------+
| 46457 | cf   | a.1     | -       | Globin-like                          |
| 46458 | sf   | a.1.1   | -       | Globin-like                          |
| 46459 | fa   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46460 | dm   | a.1.1.1 | -       | Truncated hemoglobin                 |
| 46461 | sp   | a.1.1.1 | -       | Ciliate (Paramecium caudatum)        |
| 14982 | px   | a.1.1.1 | d1dlwa_ | 1dlw A:                              |
| 46462 | sp   | a.1.1.1 | -       | Green alga (Chlamydomonas eugametos) |
| 14983 | px   | a.1.1.1 | d1dlya_ | 1dly A:                              |
| 63437 | sp   | a.1.1.1 | -       | Mycobacterium tuberculosis           |
| 62301 | px   | a.1.1.1 | d1idra_ | 1idr A:                              |
+-------+------+---------+---------+--------------------------------------+
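The `sccs` column in the table above encodes the upper levels of the hierarchy in a single dotted string: the fold row (`cf`) has `a.1`, the superfamily row (`sf`) has `a.1.1`, and the family row (`fa`) has `a.1.1.1`. A small Python sketch of how such a code could be split into its levels is shown below. The function name `split_sccs` and the dictionary keys are hypothetical names chosen for illustration, not part of SCOP.

```python
# Levels encoded in an sccs string, from left to right.
LEVELS = ("class", "fold", "superfamily", "family")


def split_sccs(sccs):
    """Return the sccs prefix for every level present in the code.

    Hypothetical helper for illustration: "a.1.1.1" has four dotted
    parts, one per level in LEVELS.
    """
    parts = sccs.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"unexpected sccs code: {sccs!r}")
    # The prefix up to position i identifies the entry at level i.
    return {LEVELS[i]: ".".join(parts[: i + 1]) for i in range(len(parts))}


codes = split_sccs("a.1.1.1")
print(codes)
# {'class': 'a', 'fold': 'a.1', 'superfamily': 'a.1.1', 'family': 'a.1.1.1'}
```

Two domains belong to the same superfamily exactly when their `superfamily` prefixes are equal, which is the kind of comparison many of the questions above rely on.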

SCOP database

SCOP Tables

Cla
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| sid     | pdb_id | sccs    | cl    | cf    | sf    | fa    | dm    | sp    | px    |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+
| d1dlwa_ | 1dlw   | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 |
+---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+

Des
+-------+------+------+------+--------------------+
| id    | type | sccs | sid  | description        |
+-------+------+------+------+--------------------+
| 46456 | cl   | a    | -    | All alpha proteins |
+-------+------+------+------+--------------------+

Astral
+---------+---------+------------------------------------------------------------+
| sid     | sccs    | seq                                                        |
+---------+---------+------------------------------------------------------------+
| d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg... |
+---------+---------+------------------------------------------------------------+

Subchain
+----+-------+----------+-------+------+
| id | px    | chain_id | begin | end  |
+----+-------+----------+-------+------+
| 1  | 14982 | A        |       |      |
+----+-------+----------+-------+------+

Do you see any relation between tables?
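One relation between the tables is that the numeric columns of Cla (`cl`, `cf`, `sf`, `fa`, …) hold `id` values of Des, so the two can be joined. The sketch below illustrates this with a question from the earlier slide, "how many families does each superfamily have?". It is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` in place of MySQL, the lower-case table names `cla` and `des` and the column types are assumptions, the rows for `d1dlya_` and `d1idra_` are inferred from the order of the Des listing shown earlier, and the last row of each table (ids starting with 9000…) is invented purely so that the count is not trivial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the course's MySQL server
conn.executescript("""
CREATE TABLE des (id INTEGER PRIMARY KEY, type TEXT, sccs TEXT,
                  sid TEXT, description TEXT);
CREATE TABLE cla (sid TEXT PRIMARY KEY, pdb_id TEXT, sccs TEXT,
                  cl INTEGER, cf INTEGER, sf INTEGER, fa INTEGER,
                  dm INTEGER, sp INTEGER, px INTEGER);
""")

conn.executemany("INSERT INTO des VALUES (?, ?, ?, ?, ?)", [
    (46456, "cl", "a", "-", "All alpha proteins"),
    (46457, "cf", "a.1", "-", "Globin-like"),
    (46458, "sf", "a.1.1", "-", "Globin-like"),
    (46459, "fa", "a.1.1.1", "-", "Truncated hemoglobin"),
    (900001, "fa", "a.1.1.2", "-", "INVENTED second family"),  # not real data
])

conn.executemany("INSERT INTO cla VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", [
    ("d1dlwa_", "1dlw", "a.1.1.1", 46456, 46457, 46458, 46459, 46460, 46461, 14982),
    # The next two rows are inferred from the Des listing (assumption).
    ("d1dlya_", "1dly", "a.1.1.1", 46456, 46457, 46458, 46459, 46460, 46462, 14983),
    ("d1idra_", "1idr", "a.1.1.1", 46456, 46457, 46458, 46459, 46460, 63437, 62301),
    # Invented row so that the superfamily has a second family.
    ("dxxxxa_", "xxxx", "a.1.1.2", 46456, 46457, 46458, 900001, 900002, 900003, 900004),
])

# Join Cla to Des on the superfamily id, then count per superfamily.
result = conn.execute("""
    SELECT des.description, des.sccs,
           COUNT(DISTINCT cla.fa)     AS n_families,
           COUNT(DISTINCT cla.pdb_id) AS n_pdb
    FROM cla JOIN des ON cla.sf = des.id
    GROUP BY des.id
    ORDER BY n_families DESC
""").fetchall()

print(result)  # [('Globin-like', 'a.1.1', 2, 4)]
```

With the full SCOP tables loaded, the same query shape would list every superfamily with its number of families, which is not something the web interface offers.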
