SLIDE 1

20.09.2008

Challenge of bioinformatics:

knowledge discovery

Jaak Vilo vilo@ut.ee biit.cs.ut.ee

SLIDE 2

Bioinformatics

■ Management, analysis and interpretation of biological data
■ Goal: gain new insights into the biology

4 Sep 2008

SLIDE 3

EMBL nucleotide DB

Total nucleotides (current 216,455,190,745)
Number of entries (current 136,401,022)
http://www.ebi.ac.uk/embl/Services/DBStats/

SLIDE 4

Many data types

Sequence (DNA, RNA, protein, …)
Structure
Variation
Gene expression: 6340 experiments, 192031 assays available; 1 assay = 10-100 MB "image", converted into values for all genes on the assay …
Protein expression
Metabolism
Interactions
Regulation
…
Imaging
…

SLIDE 5

Computer science and bioinformatics
Jacques Cohen. Communications of the ACM, Volume 48, Issue 3 (March 2005), "The disappearing computer", pages 72-78.

Computer science and bioinformatics (Jacques Cohen, CACM 2005)
■ In barely half a century computer science has grown from infancy to maturity.
■ Computer scientists should be encouraged to learn biology, and biologists computer science, to prepare themselves for an intellectually stimulating and financially rewarding future in bioinformatics.

SLIDE 6

Computer Literacy Interview With Donald Knuth

By Dan Doernberg December 7th, 1993

CLB: If you were a soon-to-graduate college senior or Ph.D. and you didn't have any "baggage", what kind of research would you want to do? Or would you even choose research again?

Knuth: I think the most exciting computer research now is partly in robotics, and partly in applications to biochemistry. Robotics, for example, that's terrific. Making devices that actually move around and communicate with each other. Stanford has a big robotics lab now, and our plan is for a new building that will have a hundred robots walking the corridors, to stimulate the students. It'll be two or three years until we move in to the building. Just seeing robots there, you'll think of neat projects. These projects also suggest a lot of good mathematical and theoretical questions. And high level graphical tools, there's a tremendous amount of great stuff in that area too. Yeah, I'd love to do that... only one life, you know, but...

CLB: Why do you mention biochemistry?

Knuth: There's millions and millions of unsolved problems. Biology is so digital, and incredibly complicated, but incredibly useful. The trouble with biology is that, if you have to work as a biologist, it's boring. Your experiments take you three years and then, one night, the electricity goes off and all the things die! You start over. In computers we can create our own worlds. Biologists deserve a lot of credit for being able to slug it through. It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.

[Figure: levels of DNA structure, labelled Level 0 (sequence, e.g. ATCGCTGAATTCCAATGTG) to Level 6]

A eukaryotic genome can be thought of as six Levels of DNA structure. The loops at Level 4 range from 0.5 kb to 100 kb in length. If these loops were stabilized then the genes inside the loop would not be expressed.

SLIDE 7

A simple gene

[Figure, panels A and B: DNA, gene, RNA, protein, gene regulation, …]

DNA: double strand ATCGAAAT / TAGCTTTA, with upstream/promoter and downstream regions

+ Modifications

SLIDE 8

From parts list to a system Network

[Figure: network with nodes H, L, K, J, S]

Undirected: $5+4+3+2+1 = 15$
Directed graph: $5^2 = 25$
Connection/not: $2^{15} = 32768$
Activate/repress: $3^{15} = 14348907$
With 20 possible connections: about 1M ($2^{20}$) or about 3400M ($3^{20}$)
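The counts on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the five nodes are the ones labelled in the figure and that self-connections are allowed (which is what makes the undirected count 15 rather than 10); the variable names are illustrative only.

```python
from itertools import combinations_with_replacement, product

nodes = ["H", "L", "K", "J", "S"]  # node labels taken from the slide's figure

# Unordered pairs, a node paired with itself included: 5+4+3+2+1
undirected_pairs = len(list(combinations_with_replacement(nodes, 2)))

# Ordered pairs, self-loops included: 5^2
directed_pairs = len(list(product(nodes, repeat=2)))

# Each undirected pair is either connected or not
wirings_on_off = 2 ** undirected_pairs

# Each undirected pair is absent, activating, or repressing
wirings_signed = 3 ** undirected_pairs

print(undirected_pairs, directed_pairs, wirings_on_off, wirings_signed)
print(2 ** 20, 3 ** 20)  # the "20" case: roughly 1M and roughly 3400M
```

Even for five parts the number of candidate wirings is far too large to test one by one, which is the point the slide makes.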

SLIDE 9

Models and parameters

[Figure: network with nodes H, L, K, J, S]

Logical switches on nodes
Boolean or continuous
Firing thresholds, growth functions?
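A Boolean model of such a network can be sketched in a few lines. The wiring below is purely hypothetical (the slide does not give the edges between H, L, K, J and S); it only illustrates what "logical switches on nodes" means when every node is updated synchronously from the previous state.

```python
# Hypothetical update rules, one logical switch per node.
# The slide does not specify the real wiring; these are made up for illustration.
rules = {
    "H": lambda s: s["S"],
    "L": lambda s: s["H"] and not s["K"],
    "K": lambda s: s["L"],
    "J": lambda s: s["H"] or s["K"],
    "S": lambda s: not s["J"],
}

def step(state):
    """Apply every rule to the previous state (synchronous update)."""
    return {node: bool(rule(state)) for node, rule in rules.items()}

def trajectory(state, n_steps):
    """Return the list of states visited, starting with the initial one."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states

start = {"H": False, "L": False, "K": False, "J": False, "S": True}
for t, s in enumerate(trajectory(start, 4)):
    print(t, "".join(name if on else "-" for name, on in s.items()))
```

A continuous model would replace the Boolean rules with thresholds or growth functions, at the cost of more parameters to estimate.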

Problem:

■ We have parts (?)
■ We may have partial info on wirings
– Scientific literature
■ We have some observations under some conditions at some timepoints
■ What’s the content of the “black box”?

SLIDE 10

Reality

■ ~25,000 genes
■ ~1,000,000 proteins
■ ~300 + 10,000 cell types
■ Infinite number of conditions
■ Other levels of control:
– Micro-RNA
– Chromosome level effects
– Cell-cell signaling
– …

Mid-term Review

Embryonic Stem Cell (ES) key regulators

Collaboration between James Adjaye, Jaak Vilo and Ioannis Xenarios

OCT4 SOX2 NANOG

SLIDE 11

siRNA knockdown of SOX2
Identify positively and negatively affected gene lists

OCT4 SOX2 NANOG

SOX2 -> SOX2 -|

Network reconstruction using gene expression

OCT4 SOX2 NANOG

SOX2 -> SOX2 -| OCT4 -> OCT4 -| NANOG -> NANOG -|
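The gene lists behind the "->" and "-|" arrows can be sketched as a simple split on expression change after knockdown. This is an illustration under stated assumptions, not the project's actual pipeline: the gene names and log2 fold changes are invented, the threshold of 1.0 is arbitrary, and a response to knockdown may be indirect rather than a direct regulatory link.

```python
def split_targets(log2_fold_change, threshold=1.0):
    """Split genes by their response to knocking down a regulator.

    Genes that go down when the regulator is removed are candidates for
    "regulator ->" (activation); genes that go up are candidates for
    "regulator -|" (repression).
    """
    down = sorted(g for g, v in log2_fold_change.items() if v <= -threshold)
    up = sorted(g for g, v in log2_fold_change.items() if v >= threshold)
    return {"activated": down, "repressed": up}

# Hypothetical knockdown-vs-control values, for illustration only
sox2_knockdown = {"GENE_A": -2.1, "GENE_B": 1.7, "GENE_C": 0.2, "GENE_D": -1.0}

targets = split_targets(sox2_knockdown)
print("SOX2 ->", targets["activated"])
print("SOX2 -|", targets["repressed"])
```

Repeating the same split for OCT4 and NANOG knockdowns gives the six lists shown on the slide.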

SLIDE 12

Network reconstruction using gene expression

http://www.biology.emory.edu/research/Lucchesi/html/research.html

SLIDE 13

Expression Profiler (2002):

Pattern + Sequence + Expression data combined view

Gene Expression

$g \approx g_P \approx \alpha P_{m_B} t_X + \beta P_{m_C} t_Y + \gamma P_{m_D} t_Y$

Chapter 3

SLIDE 14

Can we model gene expression?

$G_{ij} \approx \sum_{l,k} \alpha_{lk} M_{il} T_{kj}$

Linear regression

$G \approx M A T$, with $A = (\alpha_{lk})$
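Before fitting anything, it helps to see what the model computes. The sketch below evaluates the index form and the matrix form on toy data and checks that they agree. The reading of M as motif presence per gene and T as transcription factor activity per condition is an assumption based on the surrounding slides, and all numbers are invented.

```python
def bilinear_model(M, A, T):
    """Index form: G[i][j] = sum over l, k of A[l][k] * M[i][l] * T[k][j]."""
    n_genes, n_motifs = len(M), len(A)
    n_tfs, n_conditions = len(T), len(T[0])
    return [
        [
            sum(A[l][k] * M[i][l] * T[k][j]
                for l in range(n_motifs) for k in range(n_tfs))
            for j in range(n_conditions)
        ]
        for i in range(n_genes)
    ]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

# Hypothetical toy values
M = [[1, 0], [1, 1], [0, 1]]   # 3 genes x 2 motifs (assumed: motif present or not)
A = [[2, 0, 1], [0, -1, 3]]    # 2 motifs x 3 TFs (the coefficients alpha_lk)
T = [[1, 2], [0, 1], [1, 0]]   # 3 TFs x 2 conditions (assumed: TF activity)

G_sum = bilinear_model(M, A, T)
G_mat = matmul(matmul(M, A), T)
print(G_sum == G_mat, G_sum)
```

For fixed M and T the prediction is linear in the entries of A, which is why estimating A from observed G is a linear regression problem.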

SLIDE 15

cc ~ expression+motifs

[Figure: cell cycle phases G1, S, S/G2, G2/M and M/G1, each annotated with regulators supported by knockout expression (KOexpr) and by motifs, with + / - signs. Regulators shown: Mbp1, Swi4, Swi5, Swi6, Ace2, Mcm1, Fkh1, Fkh2. Source: knockout data]

Predict new knowledge

■ Provided some knowledge of elements on pathways
■ Predict missing elements and links using all available knowledge
– Collaborations with biologists: verification

Reactome

SLIDE 16

Ongoing EU projects

  • Systems biology
– Embryonic stem cell regulation
– Pathway reconstruction (LKB1, TGFB, …)
– Dry-lab and wet-lab connection
  • Cancer diagnostics
– Patient data entry and management
– Biomarker identification
  • Stem cell based toxicology profiling
– Data management
– Analysis
  • Published (in 1 year)
http://biit.cs.ut.ee/software
  • Ongoing…
SLIDE 17

Research Focus

  • Algorithms (Data Mining & Bioinformatics)
  • Tools (web based)
  • Databases & information systems
  • Gene regulation & Systems Biology
  • Cancer; Stem Cells;
  • Microarray & other high-throughput data

Fast Approximate Hierarchical Clustering using Similarity Heuristics

Hierarchical clustering is applied in gene expression data analysis; the number of genes can be 20000+.
Hierarchical clustering: each subtree is a cluster. The hierarchy is built by iteratively joining the two most similar clusters into a larger one.
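The "iteratively join the two most similar clusters" loop can be sketched directly. This is the naive exact procedure, not the fast approximate method named in the title; Euclidean distance and average linkage are assumptions chosen for the illustration, and the points are toy data.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def average_linkage(c1, c2, points):
    """Mean distance over all pairs with one member in each cluster."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points):
    """Return the merges (cluster_a, cluster_b, distance) in the order made."""
    clusters = [(i,) for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        # Scan every pair of current clusters for the closest one
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], points))
        merges.append((a, b, average_linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical expression profiles for four genes
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
for a, b, d in agglomerate(profiles):
    print(a, "+", b, "at distance", round(d, 2))
```

Every iteration rescans all cluster pairs, so the cost grows quickly with the number of genes; with 20000+ genes this is what motivates an approximate method.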

SLIDE 18

Fast Hierarchical Clustering

Avoid calculating all $O(n^2)$ distances:

– Estimate distances
– Use pivots
– Find close objects
– Cluster with partial information

Meelis Kull
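One common way to estimate distances from pivots is the triangle inequality: once the distances from every object to a few pivots are known, any pairwise distance is bounded from below and above without being computed. Whether the method on this slide uses exactly these bounds is an assumption; the sketch only shows the general idea, with invented points and pivots.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def pivot_table(points, pivots):
    """Distances from every point to every pivot: n * k values instead of n^2."""
    return [[euclidean(p, q) for q in pivots] for p in points]

def lower_bound(table, i, j):
    """|d(i,p) - d(j,p)| <= d(i,j) holds for every pivot p; keep the tightest."""
    return max(abs(a - b) for a, b in zip(table[i], table[j]))

def upper_bound(table, i, j):
    """d(i,j) <= d(i,p) + d(j,p) holds for every pivot p; keep the tightest."""
    return min(a + b for a, b in zip(table[i], table[j]))

# Hypothetical data: four objects and two pivots
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0), (9.0, 9.0)]
pivots = [(0.0, 0.0), (10.0, 10.0)]
table = pivot_table(points, pivots)

for i in range(len(points)):
    for j in range(i + 1, len(points)):
        true = euclidean(points[i], points[j])
        print(i, j, round(lower_bound(table, i, j), 2),
              round(true, 2), round(upper_bound(table, i, j), 2))
```

Pairs whose lower bound is already large cannot be close, so exact distances need to be computed only for the remaining candidates.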

SLIDE 19

MEM

SLIDE 20

GraphWeb: mining biological networks for submodules with functional significance

Jüri Reimand: GraphWeb. Genome Informatics, CSHL. Nov 1 2007

  • Genes as nodes
  • omics define edges
– expression correlation
– protein-protein interactions
– literature co-occurrence
– regulation
– binding site discovery

SLIDE 21

Gene modules

  • Integrate data sources as graph layers
  • Find well-connected subgraphs
  • Combine evidence to infer knowledge about regulation and function

GO: cell cycle, regulation, growth. KEGG: Alzheimer’s disease

Data as graphs

.. everything is interconnected

Public datasets for H. sapiens
IntAct: protein interactions (PPI), 18773 interactions
IntAct: PPI via orthologs from IntAct, 6705 interactions
MEM: gene expression similarity over 89 tumor datasets, 46286 interactions
Transfac: gene regulation data, 5183 interactions
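Treating each dataset as one layer of edges over the same genes, the evidence for a gene pair can be combined by counting how many layers contain it. This is a minimal sketch of that idea and not GraphWeb's actual scoring: the layer names, genes and edges are invented, and regulatory edges are treated as undirected for simplicity.

```python
from collections import Counter

# Hypothetical toy layers; genes and edges are made up for illustration
layers = {
    "ppi": {("g1", "g2"), ("g2", "g3")},
    "coexpression": {("g1", "g2"), ("g3", "g4")},
    "regulation": {("g2", "g1"), ("g2", "g3")},
}

def combine_layers(layers):
    """Count, for every unordered gene pair, the layers that contain it."""
    support = Counter()
    for edges in layers.values():
        # A pair counts at most once per layer, whatever its direction
        for pair in {frozenset(edge) for edge in edges}:
            support[pair] += 1
    return support

def well_supported(support, min_layers=2):
    """Gene pairs backed by at least min_layers independent layers."""
    return sorted(tuple(sorted(pair)) for pair, n in support.items()
                  if n >= min_layers)

support = combine_layers(layers)
print(well_supported(support))
```

Edges seen in several independent data sources are the natural starting point when looking for well-connected subgraphs.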

SLIDE 22

Finding the modules

  • Cliques
– Fully connected graphs ~ protein complexes
  • Hubs
– Highly connected nodes ~ transcriptional regulators
  • Sets of neighbors
– Specific genes of interest + near neighbors
  • Graph clustering
– MCL: Markov clustering (van Dongen, 2000), betweenness centrality clustering
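The first three module types can be sketched on a small undirected graph. The code below is an illustration with invented edges, not GraphWeb's implementation; cliques are enumerated with the basic Bron-Kerbosch recursion, which is fine for toy graphs but not tuned for large networks, and graph clustering (MCL) is not shown.

```python
def build_adjacency(edges):
    """Undirected adjacency sets; assumes no self-loops."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def hubs(adj, min_degree):
    """Nodes with at least min_degree neighbors."""
    return sorted(n for n, nb in adj.items() if len(nb) >= min_degree)

def neighborhood(adj, seeds):
    """Genes of interest plus their direct neighbors."""
    module = set(seeds)
    for s in seeds:
        module |= adj.get(s, set())
    return module

def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration of maximal fully connected subgraphs."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}   # v has been fully explored
            x = x | {v}   # exclude v from later cliques at this level

    expand(set(), set(adj), set())
    return cliques

# Hypothetical interaction graph
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
print(sorted(maximal_cliques(adj)))
print(hubs(adj, min_degree=3))
print(sorted(neighborhood(adj, ["e"])))
```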

Module evaluation

GO: JAK-STAT cascade, kinase inhibitor activity, insulin receptor signaling pathway. KEGG: Type II diabetes mellitus
GO: Transforming growth factor beta signaling pathway, embryonic development, gastrulation. KEGG: Cell cycle, cancers, WNT pathway
GO: Brain development, pigment granule, melanin metabolic process

SLIDE 23

Finding the modules

Public datasets for H. sapiens
IntAct: protein interactions (PPI), 18773 interactions
IntAct: PPI via orthologs from IntAct, 6705 interactions
MEM: gene expression similarity over 89 tumor datasets, 46286 interactions
Transfac: gene regulation data, 5183 interactions

SLIDE 24

Summary

  • Bioinformatics: manage and make sense of the data
  • Methods, tools, discovery, interpretation
  • Algorithms, Statistics, Machine Learning, Data Mining, visualisation, Tools & Databases => new scientific knowledge

SLIDE 25

Anno 2007 (BIIT and Quretec)

2008