
Knowledge Mining from Biological Data

Jaak Vilo, Winter School in Theoretical Computer Science, Arula, 3 February 2003

Goals

  • Introduce the problems of analysing (mining) biological data
  • Does bioinformatics have anything to offer theoretical computer science?
  • Examples of mining biological data

What is Bioinformatics?

  • Bioinformatics is the use of information technology to store and analyze genetic information
  • Bioinformatic researchers develop and apply computing tools to extract the secrets of the life and death of organisms from the genetic blueprints and molecular structure stored in digital collections

DNA determines function (?)

  • DNA: GenBank / EMBL Bank
  • Protein: SwissProt/TrEMBL
  • Structure: PDB/Molecular Structure Database

4 nucleotides, 20+ amino acids (3 nt → 1 AA)

Function?
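The "3 nt → 1 AA" rule can be made concrete with a short Python sketch. The table is the standard genetic code; the function name and the example sequence are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Standard genetic code: 64 codons enumerated in TCAG order
# (third base varies fastest), paired with one-letter amino acids.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    """Translate a coding DNA sequence, three nucleotides per amino acid."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # a stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

# Made-up example sequence: ATG GCC AAA TAA
print(translate("ATGGCCAAATAA"))
```

Four nucleotides give 4^3 = 64 codons for about 20 amino acids, so several codons encode the same amino acid.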

A Simple Gene

DNA (two complementary strands):

ATCGAAAT
TAGCTTTA

[Figure: a gene on the DNA, with the upstream/promoter region and the downstream region marked]

David S. Goodsell http://www.scripps.edu/pub/goodsell/


Research questions

  • What is the function of each gene and gene product (protein, RNA)?
  • How exactly do biological processes work?
  • Transcription and translation, and their regulation
  • Which proteins interact, and why
  • Metabolic pathways and networks
  • Signal transduction in the cell and in the organism

Where is computer science needed?

  • Support for the data collection process
  • Deriving generalised data from raw data
  • Data management
  • Data analysis
  • Knowledge representation
  • Modelling and simulation

Excerpts from curricula descriptions

  • New experimental techniques generating mass data are being developed.
  • Every new biological research idea requires a specifically tailored, algorithmic approach, raising an abundance of challenging questions in algorithm design, analysis and implementation.
  • The demands of biosequence analysis require advances in many classical fields of algorithmics. Our recent work concentrates on suffix trees and related index structures, on efficient pattern matching in strings and trees, and on a new, algebraic style of dynamic programming.

DNA (shotgun) sequencing

  • Cut randomly, sequence the pieces (5-10x coverage)
  • Assemble: find the shortest supersequence that contains all the pieces
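The assembly step can be sketched as a greedy merge of the most overlapping fragments. This is only a heuristic: finding the truly shortest common superstring is NP-hard, and real assemblers must also cope with sequencing errors and repeats. The function names and the toy fragments are assumptions made for this illustration.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments: list[str]) -> str:
    """Greedy heuristic: repeatedly merge the pair with the largest overlap."""
    unique = sorted(set(fragments))
    # Fragments contained in another fragment add nothing to the result.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags[0]

# Toy fragments cut from a made-up 12-letter sequence.
print(greedy_assemble(["ATCGAA", "CGAAAT", "AAATTAGC"]))
```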

Data Mining

  • Data Mining and Knowledge Discovery from Databases (KDD) are new areas of computer science that have much in common with statistics
  • The goal is the analysis of very large data collections
  • Search for (simple) rules and associations that hold in the data (at least in part of it)
  • Statistics, machine learning, database theory, + domain knowledge

Mining biological data

  • 1. Which biological data exist, and how could they be used
  • 2. Data collection, cleaning, preparation, and integration
  • 3. Analysis on the computer (the main algorithm)
  • 4. Presenting the results, identifying rules, visualisation
  • 5. Interpreting the results

TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCC TTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCA TCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTC TTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAAT GCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAA GTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTT GGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCT TCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTT CTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTG TGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACT TTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTAC TTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTA GATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGC TTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCG AGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGT CTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC

A Challenge Problem (P. Pevzner, 2000)

  • Insert into every sequence a 15-mer in which 4 positions have been randomly changed
  • Challenge: discover the original sequence that was inserted
  • Why is this hard? There are 4^15 possible 15-mers in total, and two inserted variants can differ from each other in as many as 8 positions out of 15
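A small Python sketch of how such a challenge instance could be generated, and why any two planted variants differ in at most 8 positions (each is 4 changes away from the original). The number of sequences, their length, and the random seed are assumptions made for illustration; the slide does not state them.

```python
import random

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def mutate(motif: str, d: int, rng: random.Random) -> str:
    """Change exactly d randomly chosen positions to a different nucleotide."""
    chars = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        chars[pos] = rng.choice([c for c in "ACGT" if c != chars[pos]])
    return "".join(chars)

def implant(sequences, motif, d, rng):
    """Insert one mutated copy of the motif at a random position of each sequence."""
    planted = []
    for seq in sequences:
        variant = mutate(motif, d, rng)
        pos = rng.randrange(len(seq) + 1)
        planted.append((seq[:pos] + variant + seq[pos:], variant))
    return planted

rng = random.Random(0)  # fixed seed, only for reproducibility
motif = "".join(rng.choice("ACGT") for _ in range(15))
# Assumed instance size: 20 random background sequences of 600 nucleotides.
background = ["".join(rng.choice("ACGT") for _ in range(600)) for _ in range(20)]
data = implant(background, motif, 4, rng)
variants = [v for _, v in data]

# Every variant is 4 changes from the motif, so two variants differ in at most 8.
print(max(hamming(a, b) for a in variants for b in variants))
```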

Searching a text for a pattern

  • Does ATGCAGA occur in the text?
  • Does ATGCAGA occur in the text approximately (ATCCAGA, ATGCGA)?
  • Does A[TA].C[CG].{3,7} A[TA].C[CG] occur in the text?
  • How long does it take to answer the queries above?
  • How should the text be preprocessed and indexed?
  • Suffix trees and suffix arrays
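The three kinds of query can be sketched in Python without any index. Exact search uses `str.find`, approximate search uses the classic dynamic programming for approximate substring matching (edit distance with a free starting point), and the third query is an ordinary regular expression. The function names and example texts are assumptions, and the regular expression is written without the space that appears in the slide text.

```python
import re

def exact_occurrences(text: str, pattern: str) -> list[int]:
    """Start positions of all exact occurrences (overlaps included)."""
    hits, start = [], text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits

def approx_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """End positions (exclusive) of substrings within edit distance k of pattern.

    Column-wise dynamic programming: the top row is all zeros, so a match
    may start anywhere in the text. Time is O(len(text) * len(pattern)).
    """
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            prev_diag, col[i] = col[i], min(col[i] + 1,        # extra text character
                                            col[i - 1] + 1,    # missing pattern character
                                            prev_diag + cost)  # match or substitution
        if col[m] <= k:
            ends.append(j)
    return ends

text = "CCATGCAGATTATCCAGACC"  # made-up example text
print(exact_occurrences(text, "ATGCAGA"))
print(approx_end_positions(text, "ATGCAGA", 1))

regex = re.compile(r"A[TA].C[CG].{3,7}A[TA].C[CG]")
print(bool(regex.search("ATGCCGGGGAATCG")))
```

Each of these scans the whole text for every query. An index such as a suffix tree or suffix array is built once and then answers exact queries in time that depends mainly on the pattern length.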

Patterns: AT

Algorithms

  • Technical: enumerate all patterns occurring in the given strings, together with their frequencies
  • Biological: collect the promoters of similarly expressed genes and search for patterns that could be favourable binding sites for transcription factors
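The technical task can be sketched by brute force in Python: count every substring up to a length limit, recording both the total number of occurrences and the number of input strings that contain it. The function name, the length limit, and the toy input are assumptions; a suffix tree would produce the same counts far more efficiently.

```python
from collections import Counter

def count_substrings(strings, min_len=1, max_len=8):
    """Count every substring of length min_len..max_len over all strings.

    Returns two counters: total occurrences, and the number of distinct
    input strings in which each substring occurs at least once.
    """
    occurrences, support = Counter(), Counter()
    for s in strings:
        seen = set()
        for k in range(min_len, max_len + 1):
            for i in range(len(s) - k + 1):
                sub = s[i:i + k]
                occurrences[sub] += 1
                seen.add(sub)
        support.update(seen)
    return occurrences, support

# Toy input, not real promoter sequences.
occ, sup = count_substrings(["ATAT", "TAT"])
print(occ.most_common(3))
print(sup["AT"])
```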

Pattern Discovery

  • 1. Choose the language (formalism) to represent the patterns
  • 2. Choose the rating for patterns, to tell that one pattern is “better” than another
  • 3. Design an algorithm that finds the best patterns from the pattern class, fast.


Patterns: AT

Patterns: WHAT ([AT][ACT]AT)

Bioinformatics algorithms

  • Not only a question of computing resources
  • Ways of combining data
  • Is the result obtained biologically relevant?
  • Does it advance our understanding of biological phenomena?

Example methods

  • Gene expression analysis
  • Prediction of transcription factor binding sites
  • Analysis of protein-protein interactions
  • Analysis of the associations between G-protein-coupled receptors and G-proteins
  • Analysis of scientific texts

Analysis of biological samples with microarrays

[Figure labels: culture 1, culture 2; mRNA; cDNA; hybridise; LASER, scanning; DB]

From microarray images to gene expression data

  • Raw data: array scans
  • Intermediate data: spots × image quantifications (spot/image quantitations)
  • Final data: genes × samples (gene expression levels)


Eisen et al., PNAS 98; Spellman et al., Mol Biol Cell 98

Tumor classification

Golub et al., Science, Oct 15th 1999

  • 38 samples of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)
  • 6817 genes
  • classifier built from the 50 best-correlated genes
  • tested on 34 new samples, 29 of them predicted accurately
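A toy Python sketch of this kind of classifier: rank genes by how well they separate the two classes, keep the best ones, and let them vote on a new sample. It is loosely modelled on a signal-to-noise score with weighted voting and is not a reproduction of the published method. The expression values below are invented for illustration.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b): large magnitude means good separation."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def train(samples, labels, n_genes):
    """samples: list of expression vectors; labels: 'ALL' or 'AML' per sample."""
    scored = []
    for g in range(len(samples[0])):
        a = [s[g] for s, lab in zip(samples, labels) if lab == "ALL"]
        b = [s[g] for s, lab in zip(samples, labels) if lab == "AML"]
        weight = signal_to_noise(a, b)
        midpoint = (mean(a) + mean(b)) / 2
        scored.append((abs(weight), g, weight, midpoint))
    # Keep the n_genes genes that separate the two classes best.
    return [(g, w, mid) for _, g, w, mid in sorted(scored, reverse=True)[:n_genes]]

def predict(model, sample):
    """Weighted vote: each gene contributes w * (x - midpoint); positive means ALL."""
    total = sum(w * (sample[g] - mid) for g, w, mid in model)
    return "ALL" if total > 0 else "AML"

# Invented toy data: 6 samples x 3 genes (gene 2 carries no class information).
samples = [[5.0, 1.0, 3.0], [6.0, 1.2, 2.9], [5.5, 0.9, 3.1],
           [1.0, 4.0, 3.0], [1.5, 4.4, 3.2], [0.8, 3.9, 2.8]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
model = train(samples, labels, n_genes=2)
print(predict(model, [5.2, 1.1, 3.0]), predict(model, [1.2, 4.1, 3.0]))
```

In the study summarised above the same idea is applied at a different scale: 50 genes selected out of 6817, trained on 38 samples.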

[Figure: ALL and AML samples]

[Figure: MIAMExpress, ArrayExpress (simplified ArrayExpress model) and Expression Profiler exchange data as MAGE-ML over the Internet]

Getting data into EP

[Figure: ArrayExpress → Internet → EP, by URL, e.g. http://host/data.txt or http://host/q.cgi?d=D&t=ratio]

Expression Profiler: EPCLUST

[Figure labels: DATA; SELECT/FILTER; FOLDER; ANALYZE; A “CLUSTER”; URLMAP; GeneOntology; Pathways; Databases; SPEXS; Other tools]


Unsupervised vs. Supervised

  • Unsupervised: find groups inherent to data (clustering)
  • Supervised: find a “classifier” for known classes

Clustering: it’s “easy” (for humans)

Clustering cont…

Distance measures: which two profiles are similar to each other?

  • Euclidean, Manhattan, etc.
  • Correlation, angle, etc.
  • Rank correlation
  • Time warping
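Several of the listed distance measures are short enough to write out. The Python sketch below covers Euclidean, Manhattan, correlation, and rank-correlation distance; time warping is omitted. The function names and the example profiles are assumptions made for illustration.

```python
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def correlation_distance(x, y):
    """0 for perfectly correlated profiles, 2 for perfectly anti-correlated ones."""
    return 1 - pearson(x, y)

def ranks(x):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def rank_correlation_distance(x, y):
    """1 minus the Spearman rank correlation."""
    return 1 - pearson(ranks(x), ranks(y))

# Two made-up profiles with the same shape but different scale.
a, b = [1, 2, 3, 4], [10, 20, 30, 40]
print(euclidean(a, b), manhattan(a, b), correlation_distance(a, b))
```

The two example profiles are far apart in Euclidean and Manhattan terms yet have the same shape, which is why correlation-based distances are often preferred for expression profiles.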

F.C.P. Holstege, E.G. Jennings, J.J. Wyrick, Tong Ihn Lee, C.J. Hengartner, M.R. Green, T.R. Golub, E.S. Lander, and R.A. Young: Dissecting the Regulatory Circuitry of a Eukaryotic Genome. Cell 95: 717-728 (1998)

Model of RNA Polymerase II Transcription Initiation Machinery. The machinery depicted here encompasses over 85 polypeptides in ten (sub)complexes: core RNA polymerase II (RNAPII) consists of 12 subunits; TFIIH, 9 subunits; TFIIE, 2 subunits; TFIIF, 3 subunits; TFIIB, 1 subunit; TFIID, 14 subunits; core SRB/mediator, more than 16 subunits; Swi/Snf complex, 11 subunits; Srb10 kinase complex, 4 subunits; and SAGA, 13 subunits.

Cluster of co-expressed genes, pattern discovery in regulatory regions

Genome Research 1998; ISMB (Intelligent Systems in Mol. Biol.) 2000


>YAL036C chromo=1 coord=(76154-75048(C)) start=-600 end=+2 seq=(76152-76754)
TGTTCTTTCTTCTTCTGCTTCTCCTTTTCCTTTTTTTCCTTCTCCTTTTCCTTCTTGGACTTTAGTATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTTTCTTG
CTTCTTCTTCGATTGCTTCAAAGTAGACATGAAGTCGCCTTCAATGGCCTCAGCACCTTCAGCACTTGCACTTGCTTCTCTGGAAGTGTCATCTGCACCTGCGCTGCTTT
CTGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGTTCTGGGCGGCGTCTTCTTCGAATTCCTCATCCCAGTAGTTCTGTTGGTTCTTTTTACTCTTTTTCGCCATCTTT
CACTTATCTGATGTTCCTGATTGCCCTTCTTATCCCCTCAAAGTTCACCTTTGCCACTTATTCTAGTGCAAGATCTCTTGCTTTCAATGGGCTTAAAGCTTGAAAAATTT
TTTCACATCACAAGCGACGAGGGCCCGTTTTTTTCATCGATGAGCTATAAGAGTTTTCCACTTTTAAGATGGGATATTACGGTGTGATGAGGGCGCAATGATAGGAAGTG
TTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTTGTTGATTGAGCAAA_ATG_

>YAL025C chromo=1 coord=(101147-100230(C)) start=-600 end=+2 seq=(101145-101747)
CTTAGAAGATAAAGTAGTGAATTACAATAAATTCGATACGAACGTTCAAATAGTCAAGAATTTCATTCAAAGGGTTCAATGGTCCAAGTTTTACACTTTCAAAGTTAACC
ACGAATTGCTGAGTAAGTGTGTTTATATTAGCACATTAACACAAGAAGAGATTAATGAACTATCCACATGAGGTATTGTGCCACTTTCCTCCAGTTCCCAAATTCCTCTT
GTAAAAAACTTTGCATATAAAATATACAGATGGAGCATATATAGATGGAGCATACATACATGTTTTTTTTTTTTTAAAAACATGGACTCGAACAGAATAAAAGAATTTAT
AATGATAGATAATGCATACTTCAATAAGAGAGAATACTTGTTTTTAAATGAGAATTGCTTTCATTAGCTCATTATGTTCAGATTATCAAAATGCAGTAGGGTAATAAACC
TTTTTTTTTTTTTTTTTTTTTTTTGAAAAATTTTCCGATGAGCTTTTGAAAAAAAATGAAAAAGTGATTGGTATAGAGGCAGATATTGCATTGCTTAGTTCTTTCTTTTG
ACAGTGTTCTCTTCAGTACATAACTACAACGGTTAGAATACAACGAGGAT_ATG_

...

>YBR084W chromo=2 coord=(411012-413936) start=-600 end=+2 seq=(410412-411014)
CCATGTATCCAAGACCTGCTGAAGATGCTTACAATGCCAATTATATTCAAGGTCTGCCCCAGTACCAAACATCTTATTTTTCGCAGCTGTTATTATCATCACCCCAGCAT
TACGAACATTCTCCACATCAAAGGAACTTTACGCCATCCAACCAATCGCATGGGAACTTTTATTAAATGTCTACATACATACATACATCTCGTACATAAATACGCATACG
TATCTTCGTAGTAAGAACCGTCACAGATATGATTGAGCACGGTACAATTATGTATTAGTCAAACATTACCAGTTCTCGAACAAAACCAAAGCTACTCCTGCAACACTCTT
CTATCGCACATGTATGGTTCTTATTGTTTCCCGAGTTCTTTTTTACTGACGCGCCAGAACGAGTAAGAAAGTTCTCTAGCGCCATGCTGAAATTTTTTTCACTTCAACGG
ACAGCGATTTTTTTTCTTTTTCCTCCGAAATAATGTTGCAGCGGTTCTCGATGCCTCAAGAATTGCAGAAGTAAACCAGCCAATACACATCAAAAAACAACTTTCATTAC
TGTGATTCTCTCAGTCTGTTCATTTGTCAGATATTTAAGGCTAAAAGGAA_ATG_

101 Sequences relative to ORF start

GATGAG.T 1:52/70 2:453/508 R:7.52345 BP:1.02391e-33
G.GATGAG.T 1:39/49 2:193/222 R:13.244 BP:2.49026e-33
AAAATTTT 1:63/77 2:833/911 R:4.95687 BP:5.02807e-32
TGAAAA.TTT 1:45/53 2:333/350 R:8.85687 BP:1.69905e-31
TG.AAA.TTT 1:53/61 2:538/570 R:6.45662 BP:3.24836e-31
TG.AAA.TTTT 1:40/43 2:254/260 R:10.3214 BP:3.84624e-30
TGAAA..TTT 1:54/65 2:608/645 R:5.82106 BP:1.0887e-29
...
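One plausible way to rate such patterns is sketched below in Python: compare how often a pattern hits the cluster with how often it hits a background set, and ask how surprising the cluster count would be under a binomial model. The reading of the columns (matching sequences in set 1 and set 2, a ratio R, a binomial probability BP) is an assumption, and so is the background size `N`, which the slide does not give. The sketch is not expected to reproduce the printed values.

```python
import re
from math import comb

def sequences_matching(pattern: str, sequences) -> int:
    """Number of sequences containing the pattern ('.' matches any nucleotide)."""
    regex = re.compile(pattern)
    return sum(1 for s in sequences if regex.search(s))

def ratio(k, n, K, N):
    """Fraction of matching cluster sequences divided by the background fraction."""
    return (k / n) / (K / N)

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more matching sequences."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# k of n cluster sequences match; K of N background sequences match.
k, n = 52, 101   # 52 matches among the 101 cluster sequences
K, N = 453, 6000  # N = 6000 is a hypothetical background size
print(ratio(k, n, K, N))
print(binomial_tail(k, n, K / N))
```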

[Figure: occurrences of GATGAG.T and TGAAA..TTT in the upstream sequences (600 bp), e.g. YGR128C, + 100; GATGAG.T W/30; TGAAA..TTT with 1 mismatch]

Pattern + Sequence + Expression data combined view

  • 1: ..[AG][AG][AG]CAGTCAC[AG].. (Homol-D), 121 vs 249, probability < 1e-117
  • 1: ..[AG]CCCTA[CA]CCT.. (Homol-E), 58 vs. 159
  • S. pombe GO + genome: Cytosolic Ribosome, 187 vs. 4897 genes in total

[Figure labels: ATG, W, C]

Protein-protein interactions (PPI pairs)

Problem: new technologies make it possible to find thousands of potential interactions. Which of them are the more trustworthy?

Kemmeren et al., Molecular Cell, Vol. 9, 1133–1143, May 2002

[Figure labels: Randomized expression data; Yeast 2-hybrid studies; Known (literature) PPI; MPK1 YLR350w; SNF4 YCL046W; SNF7 YGR122W]


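Interacting pairs of proteins A and B; C and D. Which would you trust?

[Figure: expression profiles of the pair A, B and of the pair C, D, with the distance d between the two profiles of each pair]

EP:PPI: combine PPI and expression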

Text mining

  • Scientific articles contain an enormous amount of information
  • How can it be collected and presented systematically?
  • Identify the gene names that occur in Medline abstracts and the relations between them
  • Build graphical representations of the relations from this information
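The simplest version of this idea is co-occurrence: two genes are linked if they are mentioned in the same abstract, and the link is weighted by the number of such abstracts. A minimal Python sketch follows. The function name and the example sentences are invented, and real gene-name recognition is much harder because of synonyms and ambiguous names.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(abstracts, gene_names):
    """Count, for every pair of genes, the abstracts that mention both."""
    patterns = {g: re.compile(r"\b" + re.escape(g) + r"\b", re.IGNORECASE)
                for g in gene_names}
    edges = Counter()
    for text in abstracts:
        found = sorted(g for g, rx in patterns.items() if rx.search(text))
        edges.update(combinations(found, 2))
    return edges

# Invented sentences standing in for Medline abstracts.
abstracts = [
    "We studied STE2 together with GPA1.",
    "FUS1 was measured alongside GPA1 and STE2.",
    "This abstract mentions only FUS1 and STE24.",
]
edges = cooccurrence_edges(abstracts, ["STE2", "GPA1", "FUS1"])
for (a, b), weight in edges.most_common():
    print(a, b, weight)
```

The resulting weighted pairs are the edges of a graph that can be drawn directly.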

Signal transduction pathway from Text mining and experimental data

Lappe, Schlitt, Dietmann, Holm, submitted (2003)

GPCR

  • G-protein-coupled receptors are among the most important drug targets
  • Question: can one predict from the receptor which signal transduction pathway it triggers?
  • Which G-proteins bind to a given GPCR protein?

GPCR coupling

[Figure, current perspective: signal (agonist) → GPCR → G-protein → effector (enzymes, channels) → intracellular messengers]


Our Computational Approach

  • Using a new membrane topology prediction algorithm (designed specifically for GPCRs), we constrained our pattern search to the intracellular domains of ≈ 100 receptor sequences with well-characterised and non-promiscuous coupling (split into Gs, Gi/o and Gq/11)

[Figure: Receptor Match Positions]

Croning, Vilo, Möller, ISMB 2001

[RK]....R.{0,9}EK
DR.{4,11}H...[AGS]
FR....[RK].{0,3}L
S...L.{1,10}T[ILV]
C.[FWY].{2,11}K
[ILV].L.{6,10}A.T
S....[RK]A.{3,10}S
A[ILV].{1,5}Y..[ILV].T
LR.{1,9}T...[ILV]
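The patterns listed above are already valid regular expressions, so match positions in a protein sequence can be found with Python's `re` module. In this sketch the function name and the test sequence are made up; the slide does not say which coupling class each pattern belongs to, so no class labels are attached.

```python
import re

# Patterns copied from the list above.
PATTERNS = [
    r"[RK]....R.{0,9}EK",
    r"DR.{4,11}H...[AGS]",
    r"FR....[RK].{0,3}L",
    r"S...L.{1,10}T[ILV]",
    r"C.[FWY].{2,11}K",
    r"[ILV].L.{6,10}A.T",
    r"S....[RK]A.{3,10}S",
    r"A[ILV].{1,5}Y..[ILV].T",
    r"LR.{1,9}T...[ILV]",
]

def match_positions(sequence: str) -> dict[str, list[int]]:
    """Return {pattern: [start offsets]} for every pattern found in the sequence."""
    hits = {}
    for p in PATTERNS:
        starts = [m.start() for m in re.finditer(p, sequence)]
        if starts:
            hits[p] = starts
    return hits

# Made-up amino-acid string, not a real receptor domain.
print(match_positions("AAKAAAARAEKAA"))
```

`re.finditer` reports non-overlapping matches only, which is enough for marking where on an intracellular domain a pattern occurs.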

[Figure; recoverable labels: expression data; sequence, function, annotation; discover patterns; provide links; external data and tools (pathways, function, etc.); visualise patterns; GeneOntology; Prot-Prot ia.]

Networks

  • Graphical models
  • Directed labelled graph
  • Nodes: genes
  • Arcs/Edges: relationships
  • Labels: types of relationships

Graph drawing

[Figure: start node (gene) A, end node (gene) B, connection weight w]


Different interpretation of arcs

  • Edges can have different meanings, hence different networks
  • Binding site for A is in front of B
  • Proteins A and B interact
  • Deletion of gene A affects expression of B (is somewhere in regulation cascade)
  • “Literature” mentions genes together

Features/distributions that do not depend on discretisation thresholds

  • Visual inspection, biological interpretation
  • General statistics and features of the graphs
  • Indegree/Outdegree
  • Complexity of the networks
  • What is the modularity?
  • How many components?
  • Deletion of hot-spots, does it break the net?
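A directed labelled graph and some of these statistics can be sketched in a few lines of Python. The class name, the arc labels, and the toy arcs are assumptions made for illustration. Components are computed ignoring arc direction, and "deleting a hot-spot" is modelled by recomputing the components without that node.

```python
from collections import defaultdict

class LabelledDigraph:
    """Directed graph whose arcs carry a label (relationship type) and a weight."""

    def __init__(self):
        self.out = defaultdict(dict)  # out[a][b] = (label, weight)
        self.nodes = set()

    def add_arc(self, a, b, label, weight=1.0):
        self.out[a][b] = (label, weight)
        self.nodes.update((a, b))

    def outdegree(self, v):
        return len(self.out.get(v, {}))

    def indegree(self, v):
        return sum(1 for a in self.out if v in self.out[a])

    def components(self, removed=()):
        """Weakly connected components, ignoring direction and removed nodes."""
        removed = set(removed)
        undirected = defaultdict(set)
        for a, targets in self.out.items():
            for b in targets:
                if a not in removed and b not in removed:
                    undirected[a].add(b)
                    undirected[b].add(a)
        seen, comps = set(), []
        for start in sorted(self.nodes - removed):
            if start in seen:
                continue
            comp, stack = set(), [start]
            while stack:  # depth-first search from the start node
                v = stack.pop()
                if v not in comp:
                    comp.add(v)
                    stack.extend(undirected[v] - comp)
            seen |= comp
            comps.append(comp)
        return comps

# Toy network: gene A is a hub connected to B, C and D.
g = LabelledDigraph()
g.add_arc("A", "B", "upregulation", 0.9)
g.add_arc("A", "C", "downregulation", 0.7)
g.add_arc("D", "A", "upregulation", 0.8)
print(g.outdegree("A"), g.indegree("A"))
print(len(g.components()), len(g.components(removed=["A"])))
```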

[Figure: deletion mutants ∆A, ∆B, ∆C; genes A, B, C, D; the resulting graph on nodes A, B, C, D]

Hughes, T. R. et al.: “Functional Discovery via a Compendium of Expression Profiles”, Cell 102 (2000), 109-126.

Green arrows: upregulation. Red arrows: downregulation. Thickness of arrow represents certainty of direction (up/down).

A complete graph


Filter

  • choose a list of genes (MATING, marked in red)
  • filter for these genes plus neighbouring genes from the graph
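The filtering step amounts to taking the selected genes, adding every gene that shares an arc with one of them, and keeping the arcs among the resulting nodes. A minimal Python sketch follows. The function name and the toy arcs are assumptions; the arcs are not taken from the actual network.

```python
def filter_with_neighbours(arcs, selected):
    """Keep the selected genes plus every gene directly connected to one of them.

    arcs: list of (source, target) pairs; direction is ignored when looking
    for neighbours. Returns the kept nodes and the arcs among them.
    """
    selected = set(selected)
    keep = set(selected)
    for a, b in arcs:
        if a in selected:
            keep.add(b)
        if b in selected:
            keep.add(a)
    return keep, [(a, b) for a, b in arcs if a in keep and b in keep]

# Toy arcs using gene names only as labels.
arcs = [("STE12", "FUS1"), ("FUS1", "AGA1"), ("TUP1", "ICL1"), ("AGA1", "MFA1")]
nodes, kept = filter_with_neighbours(arcs, ["FUS1"])
print(sorted(nodes), kept)
```

Arcs between two neighbours are kept as well. Whether that is desirable depends on what the filtered picture is meant to show.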

[Figure: filtered network. Node labels:]

CUP5 AKR1 VMA8 YAR014C SST2 YEL044W YER050C MFA1 STE2 BAR1 MFA2 AGA1 AFG3 FUS1 FKS1 FUS3 VCX1 ADR1 URA3 ICL1 YGR250C PGU1 YLR042C YNR067C HOG1 FIG1 AGA2 KSS1 RAD6 STE6 RAS2 RPD3 CRS4 ASG7 KAR4 NRC465 YIL080W FUS2 YNL279W YOL154W YPL156C YPL192C YML048W

STE11 STE12 GPA1 STE18 STE24 STE4 STE5 STE7 TUP1 YER044C YJL107C AFR1 SHE4 CMK2 PHO89 RAD16 CYC8 QCR2 SWI4 NPR2

Mutation network ∆γ=4

[Figure: mutation network; the node labels (several hundred yeast gene names) are not reliably recoverable from the extraction]

Mutation network ∆γ=2

Legend: unknown, aam, mat, cos, ribo, mitochondrial

Probability network Π(γ=2.0, τ=0.8, ξ=10). Groups of genes that are more interconnected are underlaid in green. The genes are coloured according to their annotation in YPD (“cellular role”). The genes that are more interconnected are involved in the same cellular processes, such as mating behaviour (mat, green), amino acid metabolism (aam, red), the cos gene family (cos, light blue), mitochondrial function (mitochondrial, dark blue), ribosome (ribo, purple), and a group of genes of unknown function (unknown, grey).

[Figure: probability network; the node labels (several hundred yeast gene names, with group markers A and B) are not reliably recoverable from the extraction]
  • A

In summary

  • There are many different types of (molecular) biological data
  • The volume of biological data is currently growing exponentially
  • Computational analysis is the only realistic way to make use of these data
  • This requires a creative approach, both in posing the questions and in using or inventing technically good algorithms

Mining biological data

  • 1. Which biological data exist, and how could they be used
  • 2. Data collection, cleaning, preparation, and integration
  • 3. Analysis on the computer (the main algorithm)
  • 4. Presenting the results, identifying rules, visualisation
  • 5. Interpreting the results

Acknowledgements

  • Alvis Brazma + the EBI microarray team
  • Misha Kapushesky, EBI (EP development)
  • Patrick Kemmeren, Frank Holstege, Utrecht U. (PPI)
  • Esko Ukkonen, Kimmo Palin, Helsinki Univ.
  • Meelis Kull, Tanel Kaart, Hedi Peterson, Kristo Käärmann, etc.
  • Maido Remm, Reidar Andreson, …