Inductive Databases and Queries for Computational Scientific Discovery

SLIDE 1

Sašo Džeroski
Jozef Stefan Institute, Department of Knowledge Technologies
Ljubljana, Slovenia

Inductive Databases and Queries for Computational Scientific Discovery

SLIDE 2

Outline

  • What is Computational Scientific Discovery
    – Introduction
    – Examples (ecological models, reaction pathways)
  • What are Inductive Databases and Queries
    – Introduction
    – Examples (QSAR, integrative genomics)
  • How the two can be connected, i.e., how Inductive Databases and Queries can be used for Computational Scientific Discovery

SLIDE 3

Computational Scientific Discovery

  • What is Scientific Discovery: the process by which a scientist creates or finds some hitherto unknown knowledge, such as a class of objects, an empirical law, or an explanatory theory
  • Computational Scientific Discovery attempts to provide computational support for this process
    – Early research reconstructed episodes from the history of science
    – Recent efforts in this area have focussed on individual scientific activities (such as formulating quantitative laws) and have led to several new discoveries

SLIDE 4

Elements of Scientific Behavior

  • Scientific knowledge structures
    – Observations
    – Taxonomies:
      • Define or describe concepts for a domain, along with specialization relations among them
      • Specify the concepts and terms used to state laws and theories
    – Laws: Summarize relations among observed variables, objects or events
    – Theories:
      • Statements about the structures or processes that arise in the environment
      • Stated using terms from the domain's taxonomy
      • Interconnect laws into a unified theoretical account
    – Models, Predictions, Explanations (derived from the above)

SLIDE 5

Elements of Scientific Behavior

  • Scientific processes/activities are concerned with generating and manipulating scientific data and knowledge structures
  • Scientific activities
    – Collecting data/observations
    – Formation and revision of:
      • Taxonomies: Organize observations into classes and subclasses; define those classes and subclasses
      • Laws: Given observed data, find empirical laws
      • Theories: Given one or more laws, generate a theory
    – Deriving models, predictions, and explanations

SLIDE 6

Laws of Dynamic Systems' Behavior

  • Input: Observed behavior of dynamic systems
  • Output: Set of differential equations
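
As a minimal sketch of this input/output relation (not the method used in the talk), the following Python fragment picks, from a few candidate right-hand sides, the one-term equation dx/dt = a * f(x) that best fits an observed trajectory. The candidate terms, the synthetic data, and the finite-difference least-squares fit are all illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical candidate terms f(x) for an equation of the form dx/dt = a * f(x).
CANDIDATES: Dict[str, Callable[[float], float]] = {
    "1": lambda x: 1.0,
    "x": lambda x: x,
    "x*x": lambda x: x * x,
}

def discover_equation(xs: List[float], dt: float) -> Tuple[str, float, float]:
    """Return (term, coefficient, squared error) of the best one-term equation."""
    # Forward differences approximate dx/dt at every sample except the last.
    dxs = [(xs[k + 1] - xs[k]) / dt for k in range(len(xs) - 1)]
    best = None
    for name, f in CANDIDATES.items():
        fs = [f(x) for x in xs[:-1]]
        # Closed-form least squares for a single coefficient a.
        a = sum(fv * d for fv, d in zip(fs, dxs)) / sum(fv * fv for fv in fs)
        err = sum((d - a * fv) ** 2 for fv, d in zip(fs, dxs))
        if best is None or err < best[2]:
            best = (name, a, err)
    return best

# Synthetic "observed behavior": exponential growth with rate 0.5 (an assumption).
dt = 0.01
observed = [math.exp(0.5 * k * dt) for k in range(201)]
term, coeff, error = discover_equation(observed, dt)
print(f"dx/dt = {coeff:.3f} * {term}")
```

Real equation discovery systems search a much larger space of equation structures; the point here is only that the output is an equation, selected and parameterized from observed behavior.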

SLIDE 7

Explanatory Models

  • Looking deeper into the model
  • Three processes
    – Exponential growth of hare population
    – Exponential loss of fox population
    – Predator-prey interaction between the two species
  • Terms in equations correspond to processes
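
The equations themselves were shown as a figure that is not preserved in this transcript. Assuming the standard Lotka-Volterra form, with hypothetical symbols g (hare growth rate), l (fox loss rate), r (interaction rate) and e (conversion efficiency), the correspondence between terms and processes could be written as:

```latex
\begin{align}
\frac{d\,\mathit{hare}}{dt} &=
  \underbrace{g \cdot \mathit{hare}}_{\text{exponential growth}}
  \;-\; \underbrace{r \cdot \mathit{hare} \cdot \mathit{fox}}_{\text{predator-prey interaction}} \\
\frac{d\,\mathit{fox}}{dt} &=
  \underbrace{e \cdot r \cdot \mathit{hare} \cdot \mathit{fox}}_{\text{predator-prey interaction}}
  \;-\; \underbrace{l \cdot \mathit{fox}}_{\text{exponential loss}}
\end{align}
```

Each process contributes one term per variable it affects; the interaction process appears in both equations because it affects both populations.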

SLIDE 8

Domain Knowledge: Generic Processes

  • Generic process for predator-prey interaction
  • Instantiation to specific processes
  • In this case: Pred = fox, Prey = hare, r = 0.3, e = 0.1
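
A minimal Python sketch of instantiating a generic process, under stated assumptions: the slide gives only the role bindings and the values of r and e, so the class name `PredationProcess`, the reading of r as an interaction rate and e as a conversion efficiency, and the form of the rate expressions are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class PredationProcess:
    """A generic predator-prey process bound to concrete variables and parameters."""
    pred: str   # variable playing the Pred role
    prey: str   # variable playing the Prey role
    r: float    # assumed meaning: interaction (predation) rate
    e: float    # assumed meaning: efficiency of converting prey into predators

    def rates(self, state: Dict[str, float]) -> Dict[str, float]:
        """Contribution of this process to d(prey)/dt and d(pred)/dt."""
        flux = self.r * state[self.prey] * state[self.pred]
        return {self.prey: -flux, self.pred: self.e * flux}

# Instantiation from the slide: Pred = fox, Prey = hare, r = 0.3, e = 0.1.
fox_eats_hare = PredationProcess(pred="fox", prey="hare", r=0.3, e=0.1)

# Hypothetical population sizes, chosen only to show the call.
contribution = fox_eats_hare.rates({"hare": 10.0, "fox": 2.0})
print(contribution)
```

The same class could be bound to any other pair of variables, which is what makes the process generic.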

SLIDE 9

Process-based Models of Dynamic Systems

  • Input: Observed behavior + Set of generic processes
  • Output: Set of instantiated processes + ODEs
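
The output side can be sketched in Python as follows: a set of instantiated processes is summed into the right-hand sides of the ODEs, which can then be simulated. This shows only how processes define equations, not the search that selects them from observed behavior. The helper names, the growth and loss rates (0.5 and 0.2), the initial state, and the use of explicit Euler integration are assumptions made for illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, float]
Process = Callable[[State], Dict[str, float]]

def exp_growth(var: str, rate: float) -> Process:
    return lambda s: {var: rate * s[var]}

def exp_loss(var: str, rate: float) -> Process:
    return lambda s: {var: -rate * s[var]}

def predation(pred: str, prey: str, r: float, e: float) -> Process:
    def rates(s: State) -> Dict[str, float]:
        flux = r * s[pred] * s[prey]
        return {prey: -flux, pred: e * flux}
    return rates

def derivatives(processes: List[Process], state: State) -> State:
    """ODE right-hand sides: the sum of all process contributions per variable."""
    d = {var: 0.0 for var in state}
    for process in processes:
        for var, contribution in process(state).items():
            d[var] += contribution
    return d

def simulate(processes: List[Process], state: State, dt: float, steps: int) -> List[State]:
    """Explicit Euler integration; adequate for a sketch, not for careful modelling."""
    trajectory = [dict(state)]
    for _ in range(steps):
        d = derivatives(processes, state)
        state = {var: state[var] + dt * d[var] for var in state}
        trajectory.append(state)
    return trajectory

# Instantiated processes; only r = 0.3 and e = 0.1 come from the slides.
model = [
    exp_growth("hare", 0.5),
    exp_loss("fox", 0.2),
    predation("fox", "hare", r=0.3, e=0.1),
]
start = {"hare": 10.0, "fox": 2.0}
rhs = derivatives(model, start)
trajectory = simulate(model, start, dt=0.1, steps=100)
```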

SLIDE 10

Integrating Data and Knowledge

  • Using different types of domain knowledge
    – Background knowledge on basic processes
    – Using existing models and revising them
    – Completing partially specified models

SLIDE 11

Example Applications: Ecology

  • Modelling aquatic ecosystems
    – Venice lagoon
    – Lake Glumsoe, Denmark
    – Many others: Lake Bled (Slovenia), Lake Kasumigaura (Japan), Lake Greifensee (Switzerland), Lake Kinnereth (Israel), Lake Ohrid (Macedonia)

SLIDE 12

Example Apps: Metabolic Networks

SLIDE 13

CSD Focusses

  • On standard scientific formalisms (e.g., equations, pathways) introduced and routinely used by scientists
  • The results should be communicable to domain scientists and publishable in the relevant scientific literature
  • Integration of domain knowledge is of primary importance (e.g., concepts from the relevant scientific domain, existing laws/models)
  • Interaction with domain scientists and an incremental approach are also crucial
  • Many of these concerns are ill met by data mining; some are addressed by inductive databases/queries

SLIDE 14

Inductive Databases and Queries

  • A database perspective on knowledge discovery: knowledge discovery processes are query processes
  • "There is no discovery in KDD, it's all a matter of the expressive power of the query language"
  • Inductive database = Database + Patterns/Models
  • Sets of patterns can be materialized or views
  • Data mining operations = Inductive queries
  • IQ: Inductive Queries for Mining Patterns and Models (EU funded project, Future and Emerging Technologies)

SLIDE 15

Inductive Queries

  • Inductive query = Set of constraints that a pattern/model has to satisfy
    – Language constraints (only on the pattern/model)
    – Evaluation constraints (concern the validity of the pattern/model with respect to a database)
  • Given IDB = D + B + P, we have different types of queries
    – Data retrieval (D+B => D): "classical" database query
    – Cross-over (D+B+P => D): uses patterns and data to obtain new data
    – Processing patterns (P+B => P): patterns queried without access to the data (post-processing)
    – Data mining (D+B+P => P): new patterns generated on the basis of the data and the existing patterns
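
The four query types can be sketched in Python on a toy inductive database, reading D as the data, B as background knowledge and P as the stored patterns. This is an illustration under assumptions, not an actual IDB system: rows are item sets, patterns are frequent itemsets, B is left out for brevity, and all function names are hypothetical. The size bound is a language constraint and the support threshold an evaluation constraint.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Row = FrozenSet[str]
Patterns = Dict[FrozenSet[str], float]  # itemset -> support in D

def support(itemset: FrozenSet[str], data: List[Row]) -> float:
    """Fraction of rows that contain every item of the itemset."""
    return sum(itemset <= row for row in data) / len(data)

def data_retrieval(data: List[Row], item: str) -> List[Row]:
    """D => D: a classical selection query."""
    return [row for row in data if item in row]

def data_mining(data: List[Row], patterns: Patterns,
                min_support: float, max_size: int) -> Patterns:
    """D + P => P: extend the stored patterns with new frequent itemsets.

    Language constraint: 2 <= size <= max_size.
    Evaluation constraint: support in D >= min_support.
    """
    items = sorted(set().union(*data))
    result = dict(patterns)  # existing patterns are reused, not recomputed
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            if itemset not in result:
                s = support(itemset, data)
                if s >= min_support:
                    result[itemset] = s
    return result

def process_patterns(patterns: Patterns, item: str) -> Patterns:
    """P => P: post-processing of patterns without touching the data."""
    return {p: s for p, s in patterns.items() if item in p}

def cross_over(data: List[Row], patterns: Patterns) -> List[Row]:
    """D + P => D: rows covered by at least one stored pattern."""
    return [row for row in data if any(p <= row for p in patterns)]

D = [frozenset("ab"), frozenset("abc"), frozenset("bc"), frozenset("a")]
P = data_mining(D, {}, min_support=0.5, max_size=2)
with_c = process_patterns(P, "c")
covered = cross_over(D, P)
rows_with_c = data_retrieval(D, "c")
```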

SLIDE 16

Inductive Databases for QSAR

QSAR = Quantitative Structure-Activity Relationships

  • Basic data structure: Molecule
    – Represented as a labeled graph, or
    – relationally through atom/bond facts
  • Patterns: Molecular fragments/substructures
  • Models: Equations (linear) or other predictive models (e.g., regression trees) based on bulk features and molecular fragments as indicator variables
  • Domain knowledge: Functional groups

SLIDE 17

Inductive Databases for QSAR

Inductive queries

  • Find frequent patterns (molecular fragments)
  • Check for occurrence of fragments in molecules to obtain features
  • Build predictive models from bulk features and molecular fragments/functional groups as indicator variables

Underlying application: Drug design

SLIDE 18

Example Inductive Queries for QSAR

Let us be given datasets D1 and D2 of molecules

Q1: In the context of dataset D1, find all molecular fragments that
– appear in the compound AZT (which is a drug for AIDS)
– occur frequently in the active compounds (≥ 15% of them) and
– occur infrequently in the inactive ones (≤ 5% of them)

Q2: Use the fragments resulting from Q1 as features to describe the molecules in D2

Q3: Use the data resulting from Q2 to find a decision tree for predicting activity that
– is of size at most 7 (leaves)
– is as accurate as possible
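
Q1 and Q2 can be sketched in Python under a strong simplifying assumption: each molecule is represented by the set of fragments it contains, so that fragment occurrence is a set-membership test rather than subgraph matching. The fragment names and toy datasets are hypothetical. Q3 would pass the resulting feature table to a decision-tree learner with a size limit of 7 leaves and is not shown.

```python
from typing import List, Set

Molecule = Set[str]  # toy representation: the fragments a molecule contains

def frequency(fragment: str, molecules: List[Molecule]) -> float:
    """Fraction of molecules in which the fragment occurs."""
    return sum(fragment in m for m in molecules) / len(molecules)

def q1(azt: Molecule, actives: List[Molecule], inactives: List[Molecule],
       min_active: float = 0.15, max_inactive: float = 0.05) -> List[str]:
    """Fragments of AZT that are frequent in actives and infrequent in inactives."""
    return sorted(
        f for f in azt
        if frequency(f, actives) >= min_active
        and frequency(f, inactives) <= max_inactive
    )

def q2(fragments: List[str], molecules: List[Molecule]) -> List[List[int]]:
    """Describe each molecule by 0/1 indicator features, one per fragment."""
    return [[int(f in m) for f in fragments] for m in molecules]

# Hypothetical toy data; the fragment names are placeholders, not real chemistry.
azt = {"azide", "thymine", "hydroxyl"}
d1_active = [{"azide", "thymine"}, {"azide"}, {"thymine", "chloro"}, {"hydroxyl"}]
d1_inactive = [{"hydroxyl"}, {"chloro"}, {"hydroxyl", "chloro"}]
d2 = [{"azide", "chloro"}, {"hydroxyl"}, {"azide", "thymine"}]

selected = q1(azt, d1_active, d1_inactive)
features = q2(selected, d2)
```

Note that Q1 combines a language constraint (the fragment must occur in AZT) with two evaluation constraints (the frequency thresholds on D1).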

SLIDE 19

IDBs for Integrative Genomics

  • Basic data structure: A microarray
    – In the dataset, rows are patients (with diagnoses),
    – columns are probes/genes,
    – entries are gene expression levels
  • Patterns: Rankings of genes (wrt differential expression in the light of diagnosis)
  • Models: Relational regression trees/rules predicting the rank of a gene in terms of domain knowledge (DK)
  • Domain knowledge: gene ontology, gene interactions, pathways

SLIDE 20

IDBs for Integrative Genomics

  • Take microarray data from three neuroblastoma studies (M1, M2, M3), where for each patient we have the status (relapse or 'no event')
  • On each of these datasets, rank the genes wrt differential expression in relapse vs. 'no event', producing rankings R1, R2, R3
  • From R1, R2, and R3, produce an aggregate ranking R
  • Build a model for predicting the rank R of a gene from the domain knowledge, i.e., characterize highly ranked genes in terms of GO/interactions/pathways
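
The ranking and aggregation steps can be sketched in Python. The slides do not say which ranking statistic or aggregation method was used, so this sketch assumes the absolute difference of class means as the differential-expression score and the mean rank as the aggregate. Gene names and expression values are made up, and the final modelling step is not shown.

```python
from statistics import mean
from typing import Dict, List

def rank_genes(expression: Dict[str, List[float]], relapse: List[bool]) -> Dict[str, int]:
    """Rank genes (1 = most differential) by |mean(relapse) - mean(no event)|."""
    def score(values: List[float]) -> float:
        pos = [v for v, r in zip(values, relapse) if r]
        neg = [v for v, r in zip(values, relapse) if not r]
        return abs(mean(pos) - mean(neg))
    ordered = sorted(expression, key=lambda g: score(expression[g]), reverse=True)
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def aggregate(rankings: List[Dict[str, int]]) -> List[str]:
    """Order genes by their mean rank across all rankings (assumed method)."""
    genes = list(rankings[0])
    return sorted(genes, key=lambda g: mean(r[g] for r in rankings))

# Hypothetical toy studies: per gene, one expression value per patient.
m1 = ({"g1": [5, 6, 1, 2], "g2": [3, 3, 2, 2], "g3": [1, 1, 1, 1.5]},
      [True, True, False, False])
m2 = ({"g1": [4, 1, 2], "g2": [9, 1, 1], "g3": [2, 2, 2.2]},
      [True, False, False])
m3 = ({"g1": [7, 7, 1], "g2": [2, 3, 2], "g3": [1, 3, 1]},
      [True, True, False])

r1, r2, r3 = (rank_genes(expr, status) for expr, status in (m1, m2, m3))
r = aggregate([r1, r2, r3])
```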

SLIDE 21

IDBs for Integrative Genomics

  • Take microarray data from neuroblastoma patients (N) and Wilms' tumor patients (W), as well as controls (C)
  • Rank the genes wrt differential expression in N vs. C and W vs. C, producing rankings R1 and R2
  • Find the pathways with the highest number of highly ranked genes (according to R1 and R2 separately)
  • Find the pathways common for R1 and R2
  • Underlying application: identify genes/pathways to be targeted with new drugs
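
The two pathway queries can be sketched in Python. The cut-off for "highly ranked" genes and the number of pathways to keep are assumptions (here the top 3 genes and the top 2 pathways), and the gene and pathway names are placeholders.

```python
from typing import Dict, List, Set

def top_pathways(ranking: List[str], pathways: Dict[str, Set[str]],
                 top_genes: int, top_paths: int) -> List[str]:
    """Pathways containing the most of the top_genes highest-ranked genes."""
    highly_ranked = set(ranking[:top_genes])
    counts = {name: len(genes & highly_ranked) for name, genes in pathways.items()}
    # Sort by descending count; ties are broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda name: (-counts[name], name))
    return ordered[:top_paths]

# Hypothetical domain knowledge: pathway -> member genes.
pathways = {
    "apoptosis": {"g1", "g2", "g3"},
    "cell_cycle": {"g2", "g4"},
    "mapk": {"g4", "g6"},
    "wnt": {"g5", "g6"},
}

# Hypothetical rankings, best gene first (N vs. C and W vs. C).
r1 = ["g1", "g2", "g3", "g4", "g5", "g6"]
r2 = ["g2", "g4", "g6", "g1", "g3", "g5"]

paths_r1 = top_pathways(r1, pathways, top_genes=3, top_paths=2)
paths_r2 = top_pathways(r2, pathways, top_genes=3, top_paths=2)
common = sorted(set(paths_r1) & set(paths_r2))
```

The first two calls query data together with domain knowledge; the intersection is a query on the resulting patterns alone.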

SLIDE 22

IDBs and IQs for CSD

  • IDBs and IQs address some of the central concerns of Computational Scientific Discovery
    – The explicit storage of patterns/models and background knowledge allows for the (re)use of domain knowledge together with data
    – The process of (inductive) querying is interactive and allows for significant user involvement
    – The use of constraint-based data mining approaches allows for additional influence of the user on the discovery process

SLIDE 23

Outlook

  • Scientific task: Construct a model of a new lake ecosystem, for which some measurements are available
  • First, find a model from the existing literature that has been constructed for a similar ecosystem [query on patterns/models]
  • Apply this model to the dataset at hand [cross-over query]
  • If the fit of the model to the data is bad, revise the model by using the data or construct a new model by using data and domain knowledge [IQ]
  • For this, both scientific data and models need to be stored in (distributed) scientific IDBs!
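
This workflow can be sketched in Python with a deliberately trivial stand-in for an ecosystem model: a one-parameter law y = k * x. The model store, its metadata fields, the measurements, and the error threshold for a "bad" fit are all hypothetical; revision here means re-estimating k by least squares while keeping the model structure.

```python
from typing import Dict, List, Tuple

Data = List[Tuple[float, float]]  # (input measurement, observed response)

# Hypothetical model store: metadata plus the single parameter k of y = k * x.
MODEL_STORE: List[Dict] = [
    {"name": "lake_A_model", "ecosystem": "lake", "k": 2.0},
    {"name": "lagoon_model", "ecosystem": "lagoon", "k": 0.5},
]

def query_models(store: List[Dict], ecosystem: str) -> List[Dict]:
    """Query on patterns/models: select stored models without touching data."""
    return [m for m in store if m["ecosystem"] == ecosystem]

def fit_error(model: Dict, data: Data) -> float:
    """Cross-over query: apply a stored model to data, return mean squared error."""
    return sum((y - model["k"] * x) ** 2 for x, y in data) / len(data)

def revise(model: Dict, data: Data) -> Dict:
    """Inductive query: re-estimate k from the data, keeping the model structure."""
    k = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
    return {**model, "name": model["name"] + "_revised", "k": k}

# Hypothetical measurements from the new lake.
new_lake: Data = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.0)]

candidate = query_models(MODEL_STORE, "lake")[0]
error_before = fit_error(candidate, new_lake)
if error_before > 0.5:  # hypothetical threshold for a "bad" fit
    candidate = revise(candidate, new_lake)
error_after = fit_error(candidate, new_lake)
```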
