Challenge of bioinformatics: knowledge discovery
Jaak Vilo
vilo@ut.ee
biit.cs.ut.ee
20.09.2008
Bioinformatics
■ Management, analysis and interpretation of biological data
■ Goal: gain new insights into the biology
4 Sep 2008
EMBL nucleotide DB
Total nucleotides (current): 216,455,190,745
Number of entries (current): 136,401,022
http://www.ebi.ac.uk/embl/Services/DBStats/
Many data types
Sequence (DNA, RNA, Protein…)
Structure
Variation
Gene Expression
Protein expression
Metabolism
Interactions
Regulation
…
Imaging
…
6340 experiments, 192031 assays available. 1 assay = 10-100 MB “image”, converted into values for all genes on the assay…
Computer science and bioinformatics
Jacques Cohen. Communications of the ACM, Volume 48, Issue 3 (March 2005), The disappearing computer, pages 72-78.
■ In barely half a century computer science has grown from infancy to maturity.
■ Computer scientists should be encouraged to learn biology, and biologists computer science, to prepare themselves for an intellectually stimulating and financially rewarding future in bioinformatics.
Computer Literacy Interview With Donald Knuth
By Dan Doernberg December 7th, 1993
CLB: If you were a soon-to-graduate college senior or Ph.D. and you didn't have any "baggage", what kind of research would you want to do? Or would you even choose research again?
Knuth: I think the most exciting computer research now is partly in robotics, and partly in applications to biochemistry. Robotics, for example, that's terrific. Making devices that actually move around and communicate with each other. Stanford has a big robotics lab now, and our plan is for a new building that will have a hundred robots walking the corridors, to stimulate the students. It'll be two or three years until we move in to the building. Just seeing robots there, you'll think of neat projects. These projects also suggest a lot of good mathematical and theoretical questions. And high level graphical tools, there's a tremendous amount of great stuff in that area too. Yeah, I'd love to do that... only one life, you know, but...
CLB: Why do you mention biochemistry?
Knuth: There's millions and millions of unsolved problems. Biology is so digital, and incredibly complicated, but incredibly useful. The trouble with biology is that, if you have to work as a biologist, it's boring. Your experiments take you three years and then, one night, the electricity goes off and all the things die! You start over. In computers we can create our own worlds. Biologists deserve a lot of credit for being able to slug it through. It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.
[Figure: DNA structure from Level 0 (the sequence, e.g. ATCGCTGAATTCCAATGTG) through Level 6]
A eukaryotic genome can be thought of as six Levels of DNA structure. The loops at Level 4 range from 0.5 kb to 100 kb in length. If these loops were stabilized then the genes inside the loop would not be expressed.
A simple gene
DNA, gene, RNA, protein, gene regulation, …
[Figure: panels A and B. Double-stranded DNA (ATCGAAAT paired with TAGCTTTA), with upstream/promoter and downstream regions marked, plus modifications]
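The two strands shown on the slide, ATCGAAAT and TAGCTTTA, are complementary: A pairs with T and C pairs with G. A minimal Python sketch of that pairing follows; the function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the slides.

```python
# Watson-Crick base pairing: A<->T, C<->G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the paired strand, base by base, in the same left-to-right order."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the paired strand read in the opposite direction."""
    return complement(strand)[::-1]


top = "ATCGAAAT"
print(complement(top))          # TAGCTTTA
print(reverse_complement(top))  # ATTTCGAT
```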
From parts list to a system Network
[Figure: network of five nodes H, L, K, J, S]
Undirected: 5+4+3+2+1 = 15
Directed graph: 5^2 = 25
Connection/not: 2^15 = 32768
Activate/repress: 3^15 = 14348907
20 connections: 2^20 ≈ 1M or 3^20 ≈ 3400M
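The counts on this slide follow from simple combinatorics, reproduced in the sketch below. It assumes that self-loops are counted, which is what the sum 5+4+3+2+1 implies. The function name and dictionary keys are illustrative.

```python
def count_wirings(n_nodes: int, directed: bool = False) -> dict:
    """Count possible connections and wirings among n_nodes parts."""
    # Self-loops included: undirected n + (n-1) + ... + 1, directed n^2.
    n_edges = n_nodes ** 2 if directed else n_nodes * (n_nodes + 1) // 2
    return {
        "possible_connections": n_edges,
        "present_or_absent": 2 ** n_edges,      # each connection on or off
        "activate_repress_none": 3 ** n_edges,  # each one +, - or absent
    }


print(count_wirings(5))
# {'possible_connections': 15, 'present_or_absent': 32768, 'activate_repress_none': 14348907}
print(count_wirings(5, directed=True)["possible_connections"])  # 25
print(2 ** 20, 3 ** 20)  # 1048576 3486784401
```

Even for five parts the space of candidate networks is far too large to test one by one, which is the point of the slide.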
Models and parameters
[Figure: network of nodes H, L, K, J, S]
Logical switches on nodes
Boolean or continuous
Firing thresholds, growth functions?
Problem:
■ We have parts (?)
■ We may have partial info on wirings
– Scientific literature
■ We have some observations under some conditions at some timepoints
■ What’s the content of the “black box”?
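One of the model families named above, Boolean nodes with firing thresholds, can be illustrated with a toy network. The node names H, L, K, J, S are taken from the figure, but the wiring, signs and threshold below are invented for illustration and are not from the slides.

```python
# Hypothetical signed wiring: WEIGHTS[target][source] = +1 (activate) or -1 (repress).
WEIGHTS = {
    "H": {"S": -1},
    "L": {"H": +1},
    "K": {"H": +1, "L": +1},
    "J": {"K": +1},
    "S": {"J": +1},
}
THRESHOLD = 1  # a node switches on when its summed input reaches this value


def step(state):
    """Synchronously update every node from the current Boolean state."""
    return {
        node: int(sum(w * state[src] for src, w in inputs.items()) >= THRESHOLD)
        for node, inputs in WEIGHTS.items()
    }


def simulate(state, n_steps):
    """Return the list of states visited, starting with the initial one."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step(state)
        trajectory.append(state)
    return trajectory


start = {"H": 1, "L": 0, "K": 0, "J": 0, "S": 0}
for t, s in enumerate(simulate(start, 4)):
    print(t, s)
```

The "black box" problem is the inverse of this simulation: given a few observed states, recover the weights and thresholds that could have produced them.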
Reality
■ ~25,000 genes
■ ~1,000,000 proteins
■ ~300 + 10,000 cell types
■ Infinite nr of conditions
■ Other levels of control:
– Micro-RNA
– Chromosome level effects
– Cell-cell signaling
– …
Mid-term Review
Embryonic Stem Cell (ES) key regulators
collaboration between James Adjaye, Jaak Vilo and Ioannis Xenarios
OCT4 SOX2 NANOG
siRNA knockdown of SOX2
Identify positively and negatively affected gene lists
OCT4 SOX2 NANOG
SOX2 -> SOX2 -|
Network reconstruction using gene expression
OCT4 SOX2 NANOG
SOX2 -> SOX2 -| OCT4 -> OCT4 -| NANOG -> NANOG -|
Network reconstruction using gene expression
http://www.biology.emory.edu/research/Lucchesi/html/research.html
Expression Profiler (2002):
Pattern + Sequence + Expression data combined view
Gene Expression
$g \approx g_P \approx \alpha P_{m_B} t_X + \beta P_{m_C} t_Y + \gamma P_{m_D} t_Y$
Chapter 3
Can we model gene expression?
$G_{ij} \approx \sum_{l,k} \alpha_{lk} M_{il} T_{kj}$
Linear regression
$G \approx M A T$
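The linear model G ≈ MAT can be fitted by ordinary least squares. The sketch below is an illustration under stated assumptions, not the authors' procedure: it reads M as a gene-by-motif matrix, T as a regulator-by-condition matrix and A as the coefficient matrix (α_lk), generates synthetic data, and recovers A with Moore-Penrose pseudoinverses using NumPy. All dimensions and the noise level are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_motifs, n_tfs, n_conditions = 50, 4, 3, 10

# Synthetic inputs (assumed interpretation of the symbols, see lead-in).
M = rng.integers(0, 2, size=(n_genes, n_motifs)).astype(float)  # motif present in gene?
T = rng.normal(size=(n_tfs, n_conditions))                       # regulator activity
A_true = rng.normal(size=(n_motifs, n_tfs))                      # coefficients alpha_lk
noise = 0.01 * rng.normal(size=(n_genes, n_conditions))
G = M @ A_true @ T + noise                                       # expression matrix

# Least-squares estimate of A in G ≈ M A T:
# A_hat = pinv(M) G pinv(T) minimises the Frobenius norm of G - M A T.
A_hat = np.linalg.pinv(M) @ G @ np.linalg.pinv(T)

print("largest coefficient error:", np.abs(A_hat - A_true).max())
```

With real data M and T would come from motif scans and measured regulator levels, and a regularised fit would usually be preferable to a plain pseudoinverse.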
cc ~ expression+motifs
[Figure: cell cycle phases G1, S, S/G2, G2/M and M/G1, each annotated with signed regulators (+/-) derived from knockout expression (KOexpr) and from motifs; regulators shown include Mbp1, Swi4, Swi5, Swi6, Ace2, Mcm1, Fkh1 and Fkh2; knockout data]
Predict new knowledge
■ Provided some knowledge of elements on pathways
■ Predict missing elements and links using all available knowledge
– Collaborations with biologists: verification
Reactome
Ongoing EU projects
- Systems biology
– Embryonic stem cell regulation
– Pathway reconstruction (LKB1, TGFB, …)
– Dry-lab and wet-lab connection
- Cancer diagnostics
– Patient data entry and management
– Biomarker identification
- Stem cell based toxicology profiling
– Data management – Analysis
- Published (in 1 year)
http://biit.cs.ut.ee/software
- Ongoing…
Research Focus
- Algorithms (Data Mining & Bioinformatics)
- Tools (web based)
- Databases & information systems
- Gene regulation & Systems Biology
- Cancer; Stem Cells;
- Microarray & other high-throughput data
Fast Approximate Hierarchical Clustering using Similarity Heuristics
Hierarchical clustering is applied in gene expression data analysis; the number of genes can be 20000+.
Each subtree is a cluster.
Hierarchical clustering: the hierarchy is built by iteratively joining the two most similar clusters into a larger one.
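The iterative joining described above can be written down directly. This is a naive sketch using average linkage and Euclidean distance (both choices are assumptions made here); it recomputes all pairwise cluster distances at every step, which is exactly the cost a fast approximate method tries to avoid. The example profiles are made up.

```python
import math
from itertools import combinations


def average_linkage(points, a, b):
    """Mean Euclidean distance between members of clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points):
    """Naive agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of point indices.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters among all current pairs.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(points, *pair))
        merges.append((a, b, average_linkage(points, a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


profiles = [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1), (5.1, 5.0), (9.0, 0.0)]
for a, b, d in agglomerate(profiles):
    print(a, "+", b, "at distance", round(d, 2))
```

Each entry of the merge history is one internal node of the tree, so every subtree corresponds to a cluster.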
Fast Hierarchical Clustering
Avoid calculating all O(n^2) distances:
– Estimate distances
– Use pivots
– Find close objects
– Cluster with partial information
Meelis Kull
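One generic way to estimate distances with pivots is the triangle inequality: after computing distances from every object to a few pivot objects, any pairwise distance can be bounded without computing it. The sketch below shows that general idea only; it is not claimed to be the algorithm behind this slide, and the points and pivot choice are made up.

```python
import math


def pivot_table(points, pivots):
    """Distances from every point to each pivot: O(n * k) work, not O(n^2)."""
    return [[math.dist(p, points[q]) for q in pivots] for p in points]


def distance_bounds(table, i, j):
    """Triangle-inequality bounds on d(i, j) from pivot distances alone."""
    lower = max(abs(a - b) for a, b in zip(table[i], table[j]))
    upper = min(a + b for a, b in zip(table[i], table[j]))
    return lower, upper


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
table = pivot_table(points, pivots=[0, 3])

low, high = distance_bounds(table, 1, 4)
print(low, "<=", math.dist(points[1], points[4]), "<=", high)
```

Pairs whose lower bound is already large can be skipped, so exact distances are only needed for candidate close objects.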
MEM
GraphWeb: mining biological networks for submodules with functional significance
Jüri Reimand: GraphWeb. Genome Informatics, CSHL. Nov 1 2007
- Genes as nodes
- omics define edges
– expression correlation
– protein-protein interactions
– literature co-occurrence
– regulation
– binding site discovery
Gene modules
- Integrate data sources as graph layers
- Find well-connected subgraphs
- Combine evidence to infer knowledge about regulation and function
GO: cell cycle, regulation, growth. KEGG: Alzheimer’s disease
Data as graphs
.. everything is interconnected
Public datasets for H. sapiens:
- IntAct: Protein interactions (PPI), 18773 interactions
- IntAct: PPI via orthologs from IntAct, 6705 interactions
- MEM: gene expression similarity over 89 tumor datasets, 46286 interactions
- Transfac: gene regulation data, 5183 interactions
Finding the modules
- Cliques
– Fully connected graphs ~ protein complexes
- Hubs
– Highly connected nodes ~ transcriptional regulators
- Sets of neighbors
– Specific genes of interest + near neighbors
- Graph clustering
– MCL: Markov clustering (van Dongen, 2000), betweenness centrality clustering
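Three of the module types listed above (cliques, hubs and neighbor sets) can be sketched on a small undirected graph. The code below is a plain-Python illustration and not GraphWeb's implementation. The gene names g1 to g6 and their edges are hypothetical, and cliques are enumerated with the basic Bron-Kerbosch algorithm.

```python
def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hubs(graph, min_degree):
    """Nodes with at least min_degree neighbors."""
    return sorted(n for n, nbrs in graph.items() if len(nbrs) >= min_degree)


def neighborhood(graph, genes):
    """Genes of interest plus their direct neighbors."""
    result = set(genes)
    for g in genes:
        result |= graph.get(g, set())
    return result


def maximal_cliques(graph, r=frozenset(), p=None, x=frozenset()):
    """Yield all maximal cliques (Bron-Kerbosch without pivoting)."""
    if p is None:
        p = frozenset(graph)
    if not p and not x:
        yield r
    for v in list(p):
        yield from maximal_cliques(graph, r | {v}, p & graph[v], x & graph[v])
        p = p - {v}
        x = x | {v}


# Hypothetical interaction edges between made-up genes.
edges = [("g1", "g2"), ("g1", "g3"), ("g2", "g3"),
         ("g3", "g4"), ("g4", "g5"), ("g4", "g6")]
graph = build_graph(edges)

print(hubs(graph, min_degree=3))
print(sorted(neighborhood(graph, ["g5"])))
print([sorted(c) for c in maximal_cliques(graph) if len(c) >= 3])
```

Graph clustering methods such as MCL need more machinery and are left out of this sketch.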
Module evaluation
GO: JAK-STAT cascade, kinase inhibitor activity, insulin receptor signaling pw. KEGG: Type II diabetes mellitus
GO: Transforming growth factor beta signaling pw., embryonic development, gastrulation. KEGG: Cell cycle, cancers, WNT pw.
GO: Brain development, pigment granule, melanin metabolic process
Summary
- Bioinformatics: manage and make sense of the data
- Methods, tools, discovery, interpretation
- Algorithms, Statistics, Machine Learning, Data Mining, visualisation, Tools & Databases => new scientific knowledge