The Linnaeus Centre for Bioinformatics

Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns

Jan Komorowski
The Linnaeus Centre for Bioinformatics
Uppsala University and The Swedish University of Agricultural Sciences
Uppsala, Sweden

Belfast November 2003

Implications of microarray data

Issues:

– Analysis, especially the statistical aspects of developing models from this data (30 samples, 10,000 parameters!)
– Annotation of biological data
– Classification and function prediction

  • Re-use of biological knowledge

– The bibliome
– Ontologies

  • Multiple sources of biomedical knowledge

– Proteomic, metabolomic, biobanks and clinical data

  • Organization of data:

– Management, analysis, interpretation and publication

Functional Classification from time profiles

  • Aim

– find relationships between gene functions and gene expression profiles (i.e. a model)

The LCB Information System for Microarrays (MIAME standard)

[Diagram: BASE and the LCB data warehouse (DW), with paths for export to visualization, export to analysis, and publication to ArrayExpress; annotations come from public databases, e.g. Gene Ontology]

The data analysis process

[Diagram: Microarray scans → Image analysis → Microarray data → Filtering → Filtered data → Normalization → Normalized data → Data mining → Model → Validation → Quality estimate of the model → Interpretation → Knowledge. The data mining step comprises discretization, feature selection and learning.]

Selected genes

Hs. Cluster   SYMBOL        NAME
Hs.291        ENPEP         glutamyl aminopeptidase
Hs.823        HPN           hepsin (transmembrane protease, serine 1)
Hs.74861      PC4           activated RNA polymerase II transcription cofactor 4
Hs.60478      <Hs.60478>    ESTs, Moderately similar to protein HZF2
Hs.284266     MGC8471       hypothetical protein MGC8471
Hs.96         PMAIP1        phorbol-12-myristate-13-acetate-induced protein 1
Hs.2025       TGFB3         transforming growth factor, beta 3
P-value       0.001 0.001 0.002 0.002

Discretized training data

PMAIP1            ENPEP
[*, 0.036)        [*, -0.046)
[0.036, 0.440)    [0.380, *)
[0.440, *)        [0.380, *)
[*, 0.036)        [*, -0.046)
[0.440, *)        [0.380, *)
[*, 0.036)        [-0.046, 0.380)
[0.036, 0.440)    [*, -0.046)
[0.440, *)        [0.380, *)
[0.036, 0.440)    [*, -0.046)
Undefined         [-0.046, 0.380)
[*, 0.036)        [*, -0.046)

TGFB3             <Hs.60478>        MGC8471           ...   Class
[*, -0.152)       [-0.124, 0.341)   [-0.016, 0.318)   ...   Y
[0.108, *)        [0.341, *)        [0.318, *)        ...   Y
[-0.152, 0.108)   [0.341, *)        [0.318, *)        ...   Y
[*, -0.152)       [*, -0.124)       [*, -0.016)       ...   N
[0.108, *)        [-0.124, 0.341)   [0.318, *)        ...   Y
[0.108, *)        [-0.124, 0.341)   [-0.016, 0.318)   ...   Y
[-0.152, 0.108)   [-0.124, 0.341)   [-0.016, 0.318)   ...   N
[0.108, *)        [0.341, *)        [0.318, *)        ...   Y
[*, -0.152)       Undefined         [*, -0.016)       ...   N
[*, -0.152)       [*, -0.124)       Undefined         ...   N
[0.108, *)        [*, -0.124)       [*, -0.016)       ...   N

Rule model

<PMAIP1,[*, 0.036)> ∧ <PC4,[-0.716, -0.073)> → <Class,Y>
<PMAIP1,[*, 0.036)> ∧ <PC4,[*, -0.716)> → <Class,N>
<PMAIP1,[0.036, 0.440)> ∧ <PC4,[*, -0.716)> → <Class,N>
<TGFB3,[*, -0.152)> ∧ <MGC8471,[-0.016, 0.318)> → <Class,Y>
<TGFB3,[0.108, *)> ∧ <MGC8471,[-0.016, 0.318)> → <Class,Y>
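The discretization and rule-application steps of this slide can be illustrated with a small sketch. It is not the ROSETTA implementation: the cut points are simply read off the interval labels shown for PMAIP1 and PC4, the two rules are the first two from the rule model, and the function names and sample expression values are hypothetical.

```python
from bisect import bisect_right

# Cut points read off the interval labels on the slide (PMAIP1 and PC4 only).
CUTS = {
    "PMAIP1": [0.036, 0.440],
    "PC4": [-0.716, -0.073],
}


def discretize(gene, value):
    """Return the half-open interval label, e.g. '[*, 0.036)', containing value."""
    cuts = CUTS[gene]
    i = bisect_right(cuts, value)  # number of cut points <= value
    lo = "*" if i == 0 else f"{cuts[i - 1]:.3f}"
    hi = "*" if i == len(cuts) else f"{cuts[i]:.3f}"
    return f"[{lo}, {hi})"


# Two rules from the slide: a conjunction of (gene, interval) conditions -> class.
RULES = [
    ({"PMAIP1": "[*, 0.036)", "PC4": "[-0.716, -0.073)"}, "Y"),
    ({"PMAIP1": "[*, 0.036)", "PC4": "[*, -0.716)"}, "N"),
]


def classify(sample):
    """Discretize a sample and return the class of the first matching rule, or None."""
    labels = {gene: discretize(gene, value) for gene, value in sample.items()}
    for conditions, cls in RULES:
        if all(labels.get(g) == interval for g, interval in conditions.items()):
            return cls
    return None


# Hypothetical expression values, chosen only to exercise the rules.
print(classify({"PMAIP1": 0.0, "PC4": -0.5}))  # first rule fires
print(classify({"PMAIP1": 0.0, "PC4": -0.9}))  # second rule fires
```

Returning the first matching rule is a simplification; a full rule model would typically let all matching rules contribute to the decision.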

An Example: The Transcriptional Program in the Response of Human Fibroblasts to Serum

Iyer et al, Science, 283: 83, 1999

[Figure: 8 hours serum treatment. Labelled spots: 1, protein disulfide isomerase-related protein; 2, IL-8 precursor; 3, EST AA057170; 4, vascular endothelial growth factor]

fibroblast 24 h serum response

[Diagram: quiescent, non-proliferating fibroblasts are given serum and become proliferating; samples for microarray analysis are taken along a time axis marked 1, 4, 8 and 24]

fibroblast serum response

wound healing

  • blood coagulation and hemostasis (PAI1, Factor III, Endothelin-1)
  • chemotaxis and activation of immune cells (COX2, MCP1, IL-8, ICAM-1)
  • angiogenesis (VEGF)
  • migration and proliferation of fibroblasts (CTGF)
  • differentiation of fibroblasts to myofibroblasts (vimentin)
  • migration and proliferation of keratinocytes (FGF7)

dynamic processes

[Diagram: time axis marked 1, 4, 8 and 24, running from quiescent, non-proliferating to proliferating cells; response phases labelled immediate early, delayed immediate early, intermediate and late; levels labelled primary, secondary and tertiary]

Molecular mechanisms of transcriptional response

[Diagram labels: serum = signal; immediate early response genes; immediate early response factors; delayed immediate early response genes; secondary transcription factors; intermediate/late response genes; effectors = cellular response]

Protein dynamics is not always similar to transcript dynamics

[Plot: gene, transcript and protein over a time axis marked 1, 4, 8 and 24]

Protein appears after the transcript

[Diagram: time axis marked 1, 4, 8 and 24, from quiescent, non-proliferating to proliferating cells; levels labelled primary, secondary and tertiary]

Processes

[Diagram: time axis marked 1, 4, 8 and 24, from quiescent, non-proliferating to proliferating cells; processes shown: protein synthesis, lipid synthesis, stress response, cell motility, re-entry cell cycle, organelle biogenesis, transcription]

The dynamics of cellular processes

[Diagram: time axis marked 1, 4, 8 and 24, from quiescent, non-proliferating to proliferating cells; processes shown along the axis: protein synthesis, DNA synthesis, energy metabolism, cell motility, stress response, cell motility, cell adhesion, DNA synthesis, lipid synthesis, cell cycle regulation; cell proliferation, negative regulation]

co-regulation of genes coding for proteins in a network

[Diagram, built up over five slides: pro-endothelin, active endothelin and inactive endothelin; the enzymes furin and CALLA/CD10 are then added, followed by "+" marks on the network]

fibroblast serum-response transcriptional program

– 517 gene-probes with differential gene expression
– 497 unique genes
– 284 known genes
– 213 unknown genes


Iyer’s analysis of transcriptional fibroblast serum response

Functional clusters Expression clusters

These genes usually do not cluster!!!

Functional Classification from time profiles

  • Aim

– find relationship between gene function and gene expression profile (i.e. a model)

Selected Challenges in Gene-expression Analysis

  • Function similarity corresponds to expression similarity, but:

– Functionally correlated genes may be expression-wise dissimilar (e.g. anti-coregulated)
– Genes usually have multiple functions
– Measurements may be approximate and contradictory

  • Can we obtain clusters of biologically related genes?
  • Can we build models that classify unknown genes to functional classes, that are human legible, and that handle approximate and often contradictory data?
  • How can we re-use biological knowledge?

Methodology

  • 1. Mining functional classes from an ontology
  • 2. Extracting features for learning
  • 3. Inducing minimal decision rules using rough sets
  • 4. The function of unknown genes is predicted using the rules

Gene  0HR   15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR   Process
g1    0.00  -0.47  -3.32  -0.81  0.11   -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2    0.00  0.66   0.07   0.20   0.29   -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3    0.00  0.14   -0.04  0.00   -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4    0.00  -0.04  0.00   -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5    0.00  0.28   0.37   0.11   -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...

[Diagram: ontology of processes (Transport, Defense response, Cell cycle control, Positive control of cell proliferation) with the annotated genes g2, g2, g3, g4, g5, ... attached to their classes]

0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)

[Plot: expression level (axis from -2 to 2) against time (axis ticks up to 24)]

Gene Ontology*

* The homepage of Ashburner’s Gene Ontology: http://genome-www.standford.edu/GO/

GENE FUNCTION: CELLULAR COMPARTMENT, PROCESS, FUNCTION

[Diagram: tree of PROCESS terms] Cell growth and maintenance; Metabolism; Energy pathways; Nucleotide and nucleic acid metabolism; DNA metabolism; Transcription; DNA packaging; DNA repair; Mutagenesis; Intracellular protein traffic; Ion homeostasis; Transport; Lipid metabolism; Protein metabolism and modification; Amino-acid and derivative metabolism; Protein targeting; Cell death; Cell motility; Stress response; Organelle organization and response; Oncogenesis; Cell proliferation; Cell cycle; Cell communication; Cell adhesion; Signal transduction; Cell surface receptor linked signal transduction; Intracellular signalling cascade; Developmental processes; Physiological processes; Blood Coagulation; Circulation

annotations

Table 1. Annotation of Known Genes

SEPP1: selenoprotein P, plasma, 1 (GenBank accession AA045003)
– Annotations at the most specific level of GO: oxidative stress response(GO:0006979), metal ion transport(GO:0006823)
– Annotations to the 23 broad cellular processes used for learning: stress response(GO:0006950), transport(GO:0006810)

EPB41L2: erythrocyte membrane protein band 4.1-like 2 (GenBank accession W88572)
– Most specific level: positive control of cell proliferation(GO:0008284)
– Broad processes: cell proliferation(GO:0008283)

OA48-18: acid-inducible phosphoprotein (GenBank accession AA029909)
– Most specific level: cell proliferation(GO:0008283)
– Broad processes: cell proliferation(GO:0008283)

CTSK: cathepsin K (pycnodysostosis) (GenBank accession AA044619)
– Most specific level: proteolysis and peptidolysis(GO:0006508)
– Broad processes: protein metabolism and modification(GO:0006411)

CPT1B: carnitine palmitoyltransferase I, muscle (GenBank accession W89012)
– Most specific level: fatty acid beta-oxidation(GO:0006635)
– Broad processes: lipid metabolism(GO:0006629)

CLDN11: claudin 11 (oligodendrocyte transmembrane protein) (GenBank accession N22392)
– Most specific level: cell adhesion(GO:0007155), substrate-bound cell migration(GO:0006929), cell proliferation(GO:0008283), developmental processes(GO:0007275)
– Broad processes: cell adhesion(GO:0007155), cell motility(GO:0006928), cell proliferation(GO:0008283), developmental processes(GO:0007275)

RPL5: ribosomal protein L5 (GenBank accession AA027277)
– Most specific level: protein biosynthesis(GO:0006412), ribosomal large subunit assembly and maintenance(GO:0000027)
– Broad processes: protein metabolism and modification(GO:0006411), cell organization and biogenesis(GO:0006996)

(no symbol): Homo sapiens clone 23785 mRNA sequence (GenBank accession N32247)
– Most specific level: calcium-independent cell-cell matrix adhesion(GO:0007161)
– Broad processes: cell adhesion(GO:0007155)
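Moving an annotation from the most specific GO level up to one of the broad classes used for learning amounts to walking up the ontology. The sketch below is only an illustration of that idea: the GO identifiers are taken from Table 1, but the direct child-to-parent links in `PARENTS` are an assumption (the real ontology is a much larger graph with intermediate terms), and `broad_classes` is a hypothetical helper name.

```python
# Assumed child -> parent links, for illustration only; identifiers from Table 1.
PARENTS = {
    "GO:0006979": ["GO:0006950"],  # oxidative stress response -> stress response
    "GO:0006823": ["GO:0006810"],  # metal ion transport -> transport
    "GO:0008284": ["GO:0008283"],  # positive control of cell proliferation -> cell proliferation
}

# A subset of the broad classes used for learning.
BROAD = {"GO:0006950", "GO:0006810", "GO:0008283"}


def broad_classes(term, parents=PARENTS, broad=BROAD):
    """Walk up from a specific term and collect the broad classes it falls under."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in broad:
            found.add(current)  # stop climbing once a broad class is reached
            continue
        stack.extend(parents.get(current, []))
    return found


# SEPP1 carries two specific annotations; the union gives its training classes.
sepp1 = broad_classes("GO:0006979") | broad_classes("GO:0006823")
print(sorted(sepp1))
```

Because a term can have several parents, one specific annotation may map to more than one broad class, which is one reason genes end up with multiple training labels.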

Biological processes from GO

Energy pathways; DNA metabolism; Amino acid and derivative metabolism; Protein targeting; Lipid metabolism; Transport; Ion homeostasis; Intracellular traffic; Cell death; Cell motility; Stress response; Organelle organization and biogenesis; Oncogenesis; Cell cycle; Cell adhesion; Cell surface receptor linked signal transduction; Intracellular signaling cascade; Developmental processes; Blood coagulation; Circulation


Gene Ontology vs. Clusters Found by Iyer et al.

Nota bene: transcription genes belong to all 10 clusters!

Methodology

  • 1. Mining functional classes from an ontology
  • 2. Extracting features for learning
  • 3. Inducing minimal decision rules using rough sets
  • 4. The function of unknown genes is predicted using the rules

Gene  0HR   15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR   Process
g1    0.00  -0.47  -3.32  -0.81  0.11   -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2    0.00  0.66   0.07   0.20   0.29   -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3    0.00  0.14   -0.04  0.00   -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4    0.00  -0.04  0.00   -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5    0.00  0.28   0.37   0.11   -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...

[Diagram: ontology of processes (Transport, Defense response, Cell cycle control, Positive control of cell proliferation) with the annotated genes g2, g2, g3, g4, g5, ... attached to their classes]

0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)

[Plot: expression level (axis from -2 to 2) against time (axis ticks up to 24)]

Examples of template definitions

[Figure: a Constant-template and an Increasing-template drawn over windows of time points (2HR, 4HR, 6HR, 8HR, 12HR) around a MEAN line, with threshold labels MIN. 0.6, MAX 0.2, MIN. 0.1, MIN. 0.2 and MIN. 0, and axis values 0.5 and 1.0]
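A template can be read as a predicate over the expression values in one time window. The sketch below is a simplified guess at that idea, not the definitions drawn on the slide: the function name `matches`, the use of first-to-last change for increasing and decreasing, the use of value spread for constant, and both default thresholds are assumptions made for illustration.

```python
def matches(template, values, min_change=0.6, max_spread=0.2):
    """Check whether the values in one time window fit a template.

    Assumed definitions (not taken from the slide):
      increasing: last value exceeds first by at least min_change
      decreasing: first value exceeds last by at least min_change
      constant:   all values lie within a band of width max_spread
    """
    first, last = values[0], values[-1]
    if template == "increasing":
        return last - first >= min_change
    if template == "decreasing":
        return first - last >= min_change
    if template == "constant":
        return max(values) - min(values) <= max_spread
    raise ValueError(f"unknown template: {template}")


# Hypothetical windows of log-ratio expression values.
print(matches("increasing", [0.0, 0.3, 0.7]))
print(matches("constant", [0.10, 0.15, 0.20]))
print(matches("decreasing", [0.0, 0.3, 0.7]))
```

Thresholds of this kind make the features tolerant to measurement noise: small fluctuations inside a window do not change which template the window matches.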

Template-based feature synthesis

– Input: gene expression time series data
– All possible subintervals in the time series: 12 measurement points, 55 possible intervals of length >2
– Templates: Increasing, Decreasing, Constant
– MATCH: subintervals and templates are combined into groups containing genes matching the same templates over the same subinterval
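The count of 55 subintervals can be checked by enumeration. The sketch below assumes that "length >2" means a subinterval covering at least three measurement points; under that reading, 12 points give 66 start-end pairs, of which the 11 pairs of adjacent points are excluded. The time-point labels are those of the expression table shown earlier in the talk; the function name is hypothetical.

```python
from itertools import combinations

# The 12 measurement points of the fibroblast serum-response time series.
TIME_POINTS = ["0HR", "15MIN", "30MIN", "1HR", "2HR", "4HR",
               "6HR", "8HR", "12HR", "16HR", "20HR", "24HR"]


def subintervals(n_points, min_points=3):
    """All (start, end) index pairs covering at least min_points measurements."""
    return [(i, j) for i, j in combinations(range(n_points), 2)
            if j - i + 1 >= min_points]


intervals = subintervals(len(TIME_POINTS))
print(len(intervals))  # number of candidate subintervals
print(TIME_POINTS[intervals[0][0]], "to", TIME_POINTS[intervals[0][1]])
```

Each subinterval paired with each of the three templates gives one candidate feature, so a gene is described by which (subinterval, template) pairs it matches.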

Rule example 1

Rule: 0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Covered genes: M35296, J02783, D13748, X05130, X60957, D13748, U90918 (unknown)

[Plot: expression profiles of the covered genes, expression level (axis from -1 to 3) against time (axis ticks from 2 to 24)]

Rule example 2

Rule: 0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation) OR GO(cell-cell signaling) OR GO(intracellular signaling cascade) OR GO(oncogenesis)

Covered genes: Y07909, X58377, U66468, X58377, X85106, Y07909

[Plot: expression profiles of the covered genes, expression level (axis from -2 to 2) against time (axis ticks up to 24)]

Classification using template-based rules

[Diagram: the expression profile of gene X60957 (plotted against time, axis ticks from 2 to 24) is matched against the rule set (IF … THEN …); one matching rule is highlighted and the figure is annotated "+4"]

IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(prot. met. and mod.) OR …

Process                                Votes
protein metabolism and modification    6
protein amino acid phosphorylation     3
proteolysis and peptidolysis           2
transcription                          1
transport                              1
vision                                 1
…

Votes are normalized and processes with vote fractions higher than a selection-threshold are chosen as predictions
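The voting step can be sketched as follows. The vote counts are the ones listed for X60957 (the table is elided after "vision", so the totals here are incomplete), and two things are assumptions made for illustration: that normalization means dividing by the total number of votes, and the threshold values themselves. The function name `predict` is hypothetical.

```python
# Vote counts as listed on the slide; the original table continues past "vision".
VOTES = {
    "protein metabolism and modification": 6,
    "protein amino acid phosphorylation": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "transport": 1,
    "vision": 1,
}


def predict(votes, threshold):
    """Normalize votes to fractions and keep processes above the threshold."""
    total = sum(votes.values())
    fractions = {process: count / total for process, count in votes.items()}
    return sorted(p for p, f in fractions.items() if f > threshold)


# Assumed thresholds: a lower threshold admits more predictions per gene.
print(predict(VOTES, threshold=0.25))
print(predict(VOTES, threshold=0.20))
```

Sweeping the threshold trades coverage against precision, which is what allows several operating points to be reported for the same rule model.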

Protein Metabolism and Modification

[Figure: panels A, B, C, D, E]

A – annotations
B – false negatives
C – false positives
D – true positives
E – pred. unknown gene

Cross validation estimates

PROCESS                                 AUC    SE
Ion homeostasis                         1.00   0.00
Protein targeting                       0.99   0.03
Blood coagulation                       0.96   0.08
DNA metabolism                          0.94   0.09
Intracellular signaling cascade         0.94   0.06
Energy pathways                         0.93   0.12
Cell cycle                              0.93   0.04
Oncogenesis                             0.92   0.11
Circulation                             0.91   0.11
Cell death                              0.90   0.10
Developmental processes                 0.90   0.07
Transcription                           0.88   0.11
Defense (immune) response               0.88   0.05
Cell adhesion                           0.87   0.09
Stress response                         0.86   0.15
Protein metabolism and modification     0.85   0.10
Cell motility                           0.84   0.11
Cell surface rec linked signal transd   0.82   0.15
Lipid metabolism                        0.81   0.14
Transport                               0.79   0.17
Cell organization and biogenesis        0.79   0.11
Cell proliferation                      0.79   0.06
Amino acid and derivative metabolism    0.69   0.06
AVERAGE                                 0.88   0.09

A: Coverage: 84% Precision: 50%
B: Coverage: 71% Precision: 60%
C: Coverage: 39% Precision: 90%

Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)
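The two definitions translate directly into code. In the sketch below the counts are hypothetical, chosen only so that the result lands on operating point A; they are not counts from the study.

```python
def coverage(tp, fn):
    """Coverage (sensitivity): fraction of true annotations that were predicted."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Precision: fraction of predictions that are true annotations."""
    return tp / (tp + fp)


# Hypothetical counts that reproduce operating point A (84% coverage, 50% precision).
tp, fn, fp = 42, 8, 42
print(f"coverage  = {coverage(tp, fn):.0%}")
print(f"precision = {precision(tp, fp):.0%}")
```

Points A, B and C show the usual trade-off: raising the selection threshold removes false positives (precision rises) at the cost of missing more true annotations (coverage falls).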

Randomization results

PROCESS                                            AVG      STD.DEV.  MIN      MAX
transcription                                      0.5559   0.0390    0.5004   0.6601
DNA metabolism                                     0.5555   0.0413    0.5000   0.6760
cell cycle                                         0.5596   0.0461    0.5006   0.7123
cell surface receptor linked signal transduction   0.5586   0.0448    0.5009   0.7062
cell organization and biogenesis                   0.5544   0.0404    0.5004   0.6830
stress response                                    0.5586   0.0417    0.5000   0.6715
ion homeostasis                                    0.5584   0.0433    0.5009   0.7264
amino-acid and derivative metabolism               0.5579   0.0482    0.5011   0.7425
defense (immune) response                          0.5528   0.0438    0.5019   0.7293
protein targeting                                  0.5596   0.0446    0.5003   0.6883
blood coagulation                                  0.5582   0.0438    0.5017   0.7043
cell death                                         0.5644   0.0502    0.5002   0.6984
developmental processes                            0.5650   0.0479    0.5005   0.7556
protein metabolism and modification                0.5559   0.0444    0.5004   0.7278
energy pathways                                    0.5524   0.0408    0.5003   0.7132
cell motility                                      0.5579   0.0460    0.5009   0.7532
cell adhesion                                      0.5550   0.0404    0.5024   0.6777
circulation                                        0.5540   0.0417    0.5002   0.6915
oncogenesis                                        0.5609   0.0450    0.5023   0.7326
transport                                          0.5552   0.0448    0.5000   0.7110
intracellular signalling cascade                   0.5601   0.0449    0.5001   0.6973
cell proliferation                                 0.5548   0.0453    0.5000   0.7420
lipid metabolism                                   0.5556   0.0433    0.5001   0.7109
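A randomization baseline of this kind can be approximated by shuffling class labels and recomputing the AUC. The sketch below is a simplified stand-in, not the procedure used in the study: it shuffles labels against fixed scores instead of re-inducing rules on shuffled annotations, the scores and labels are invented, and folding the AUC to max(a, 1 - a) is a guess prompted by the MIN column never falling below 0.5.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def shuffled_aucs(scores, labels, n_rounds=200, seed=0):
    """AUC values obtained after randomly reassigning the class labels."""
    rng = random.Random(seed)
    labels = list(labels)
    results = []
    for _ in range(n_rounds):
        rng.shuffle(labels)                # shuffling keeps the class sizes fixed
        a = auc(scores, labels)
        results.append(max(a, 1.0 - a))    # assumed folding, see lead-in
    return results


# Invented scores and labels for one process class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
baseline = shuffled_aucs(scores, labels)
print(f"observed AUC      = {auc(scores, labels):.2f}")
print(f"mean shuffled AUC = {sum(baseline) / len(baseline):.2f}")
```

An observed AUC well above the range of the shuffled values suggests that the model captures a real relationship between expression profile and function.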

the model

Annotations, Rules and Classifications

Annotated genes within the 23 broad classes of GO biological process: 273
Gene probes associated with the 273 genes within the 23 broad biological process classes: 284

Training examples
– annotations associated with the genes in the 23 broad biological process classes: 549
– co-annotations associated with the genes in the 23 broad biological process classes: 444

Rules generated from the training examples: 18064

Estimated quality of classifications of unknown genes (cross-validation estimates)
– Sensitivity: 84%
– Specificity: 91%
– Fraction of classifications that are correct: 49%

Classifications for unknown (uncharacterized) genes
– 548 classifications were obtained for 211 of the 213 unknown genes

(Re-)Classifications for training examples: 728
– True positive classifications: 519
– True positive co-classifications: 356
– False positive classifications: 219
– False negative (missing) classifications: 30
– For 272 of the 273 training examples at least one correct (re-)classification was obtained

several different biological processes annotated for each gene

# biological processes per gene   # genes
1                                 105
2                                 100
3                                 41
≥ 4                               27

the model predicts several different biological processes for each gene

# biological processes annotated or classified per gene

# biological processes per gene   annotations (training example genes)   (re-)classifications (training example genes)   classifications (unknown genes)
1                                 105                                    30                                              27
2                                 100                                    93                                              84
3                                 41                                     96                                              59
≥ 4                               27                                     54                                              41

24 false positive predictions for oncogenesis

12 of these were correct (that is “missing annotations”)

Symbol: Gene name; Molecular function; Comment; Reference (PMID)

CCNG1: Cyclin G1; CDK kinase regulator; p53 target; 11327114
CDKN1C: cyclin-dependent kinase inhibitor 1C; cyclin-dependent protein kinase inhibitor; tumor suppressor; 7729684
CAT: catalase; oxidoreductase; tumor progression; 8513880,
ALDH3A2: aldehyde dehydrogenase 10; aldehyde dehydrogenase; tumor progression; 92393980
ADD3: adducin 3 (gamma); membrane-cytoskeleton-associated protein; tumor progression; 9607561
TFDP2: transcription factor Dp-2 (E2F dimerization partner 2); transcription co-factor; cell cycle regulation; 7784053
ATRX: alpha thalassemia/mental retardation syndrome; DNA helicase; transcription & DNA repair; 10362365
EPS15: epidermal growth factor receptor pathway substrate 15; kinase substrate; growth regulation; 93361014
EGR1: early growth response 1; transcription factor; tumor suppressor; 9109500
NR4A2: nuclear receptor subfam 4, group A, m2 (Nurr1, Not); ligand-dependent nuclear receptor; proto-oncogene; 9592180
NR4A3: nuclear receptor subfam 4, group A, m 3 (Nor1); ligand-dependent nuclear receptor; proto-oncogene; 9592180
COPEB: core promoter element binding protein; transcription factor; proto-oncogene; 9268646

11 of 24 unknown genes had correct predictions (homology-based annotations)

LOC55977: hypothetical protein 24636
– Homology information: H.sapiens thromboxane A-2 receptor, 53%/55 aa
– Classification: protein metabolism and modification, lipid metabolism, developmental processes, blood coagulation
– Classification matching assumed biological process: developmental processes, blood coagulation

FLJ10217: oxysterol-binding protein-related protein 1
– Homology information: H.sapiens oxysterol-binding protein, 36% / 711 aa
– Classification: cell death, blood coagulation
– Classification matching assumed biological process: cell death, blood coagulation

WW45: WW Domain-Containing Gene
– Homology information: WW domain
– Classification: cell surface receptor linked signal transduction, transcription
– Classification matching assumed biological process: transcription

PIST: PDZ/coiled-coil domain binding partner for the rho-family GTPase TC10
– Homology information: M.musculus syntrophin, 43% / 110 aa
– Classification: stress response, protein metabolism and modification, cell surface receptor linked signal transduction, transcription
– Classification matching assumed biological process: stress response

ESTs: ESTs, Weakly similar to A Chain A, Human Cd69 - Trigonal Form {SUB 82-199 [H.sapiens]
– Homology information: H.sapiens cd69, 52% / 37 aa
– Classification: cell surface receptor linked signal transduction, cell cycle, transcription
– Classification matching assumed biological process: cell surface receptor linked signal transduction

H-l(3)mbt-l: H-l(3)mbt-like protein
– Homology information: D.melanogaster T13797 tumor suppressor protein, 46% / 176 aa
– Classification: cell proliferation, lipid metabolism, transcription, oncogenesis
– Classification matching assumed biological process: cell proliferation, oncogenes

cyclin L: ania-6a
– Homology information: H.sapiens cyclin K, 27% / 363 aa
– Classification: cell proliferation, transcription, oncogenesis
– Classification matching assumed biological process: cell proliferation, transcriptio

HMGE: GrpE-like protein cochaperone
– Homology information: E.coli heat shock protein grpE, 32% / 178 aa
– Classification: stress response, protein metabolism and modification
– Classification matching assumed biological process: stress response, protein metabolism and modificatio

(no symbol): ESTs, Highly similar to SMHU1B metallothionein 1B [H.sapiens]
– Homology information: H.sapiens SMHU1B metallothionein 1B, 98% / 60 aa
– Classification: stress response, protein metabolism and modification, ion homeostasis
– Classification matching assumed biological process: stress response, ion homeostasis

In preparation

  • “Yeast responses to environmental factors” (Gasch 2001) (graduate student Anna Johansson)
  • Cardiac arrest in rats (Ellingsen et al)
  • Inflammatory mechanisms in cardiovascular diseases (Johansen et al)
  • Studies of endothelial cells (Cross et al) (Affymetrix data)
  • A Rough Set Approach to Gene Network Reverse Engineering (graduate student Martin Eklund)

Results – The Gasch expression data set*

  • Data

– Expression data from 18 different microarray time-course experiments on yeast that have been exposed to different kinds of stress (Gasch 2001)

  • Method:

– The differentially expressed genes were selected from the data set (approximately 950 genes out of 6000 were selected)
– The microarray data was translated into templates to handle the time-profile experiments in an orderly way.
– Some of the genes were annotated with more general terms to obtain more populated classes than is possible when using the most specific annotations.

*The work of Anna Johansson

Results – AUC values for classifiers obtained from the Gasch dataset

GO-classes                                        HS1    HS2    HPT    Shuffled
GO:0006006 - glucose metabolism                   0.69   0.66   0.82   0.52
GO:0030163 - protein catabolism                   0.74   0.76   0.77   0.55
GO:0030503 - regulation of redox homeostasis      0.77   0.78   0.83   0.51
GO:0016310 - phosphorylation                      0.77   0.52   0.54   0.52
GO:0042273 - ribosomal large subunit biogenesis   0.77   0.54   0.66   0.56
GO:0006725 - aromatic compound metabolism         0.75   0.58   0.59   0.69
GO:0006979 - response to oxidative stress         0.68   0.54   0.67   0.58
Mean for all classes                              0.63   0.56   0.60   0.55

HS1 – Heat shock from 25°C to 37°C
HS2 – Temperature shift from 37°C to 25°C
HPT – Hydrogen Peroxide Treatment
Shuffled – Annotations have been randomly assigned. The same settings as in HS1

Results – The Gasch expression data set

Some examples:

GO-classes                                    HS     1+2+3   1-6    Shuffled
GO:0006006 - glucose metabolism               0.69   0.75    0.99
GO:0030163 - protein catabolism               0.74   0.67    0.72
GO:0030503 - reg of redox homeost             0.77   0.69    0.92
GO:0016310 - phosphorylation                  0.77   0.86    0.75
GO:0030490 - proc of 20S pre-rRNA             0.78   0.66    0.84
GO:0042273 - ribosomal lrg subunit biogenes   0.77   0.74    0.81
GO:0006725 - aromatic compound metabolism     0.75   0.72    0.67   0.69

Conclusions

  • Our methodology

– Incorporates background biological knowledge
– Handles the noise and incompleteness in the microarray data well
– Can be objectively evaluated
– Predicts multiple functions per gene
– Can re-classify known genes and provide possible new functions of the known genes
– Can provide hypotheses about the function of unknown genes

  • Experimental work needs to be done to confirm our predictions


Genomic ROSETTA:

http://www.idi.ntnu.no/~aleks/rosetta


Acknowledgements

Uppsala

  • Erik Bongcam-Rudloff
  • Herman Midelfart
  • Torgeir Hvidsten
  • Claes Andersson
  • Helena Strömbergsson
  • Aleksejs Kontijevskis
  • Vladimir Yankovski
  • Anna Johansson
  • Martin Eklund
  • Eva Berglund
  • Adam Ameur
  • Jakub Orzechowski
  • Michael Cross
  • Ingrid Lönnstedt
  • Mats Gustafsson
  • Anders Isaksson
  • The Ludwig Institute for Cancer Research

Trondheim

  • Astrid Lägreid
  • Vidar Beisvåg, Öyvind Ellingsen
  • Arne Sandvik, Fekadu Yadetsie
  • Berit Johansen, Sjur Huseby
  • Hans Richard Brattbakk

Warsaw

  • Andrzej Skowron
  • Son Nguyen
  • Witold Rudnicki
  • Jan Radomski

Funding

  • The Wallenberg Foundation
  • The Swedish Strategic Research Foundation

  • The Swedish Research Council

Publications

  • Astrid Lægreid, Torgeir R. Hvidsten, Herman Midelfart, Jan Komorowski, and Arne K. Sandvik, in Genome Research 2003 May;13(5):965-79
  • T.R. Hvidsten, A. Lægreid, J. Komorowski, Special Issue on Microarrays of Bioinformatics journal, Oxford University Press, pp. 1116–1123.
  • A literature network of human genes for high-throughput gene-expression analysis, (T.-K. Jenssen, A. Lægreid, J. Komorowski and E. Hovig), Nature Genetics, pp. 21–28, Vol 28, May 2001.
  • T. R. Hvidsten, J. Komorowski, A. K. Sandvik and A. Lægreid, Proc. of the Pacific Symposium on Biocomputing, pp. 299–310, Hawaii, January 2001.


For publications see:

http://www.lcb.uu.se/~janko/