Analyzing manuscript traditions using constraint-based data mining - - PowerPoint PPT Presentation



SLIDE 1

Analyzing manuscript traditions using constraint-based data mining

Tara Andrews, Hendrik Blockeel, Bart Bogaerts, Maurice Bruynooghe, Marc Denecker, Stef De Pooter, Caroline Macé, Jan Ramon KU Leuven, Department of Computer Science KU Leuven, Faculty of Arts

A case study in declarative data mining

SLIDE 2

Overview

• Principles of “declarative data mining”
• IDP: a modeling language based on first-order logic
• Using IDP for data analysis in stemmatology

SLIDE 3

Declarative data mining

SLIDE 4

Data mining

Current state of the art in data mining: a large variety of tasks, methods, and systems. Data analysis is limited to mapping your problem onto one of the predefined tasks, then running an existing system. Not much flexibility.

[Figure: boxes for predefined methods: DT, ANN, SVM, PCA, k-means, Assoc-rules]

SLIDE 5

Constraint-based data mining

More flexibility: define the task more precisely by imposing constraints on the solutions you want to find. E.g.:
• “find all frequent itemsets that have Cheese & Beer in the IF part”
• “find the clustering with minimal SSE that has a & b in the same cluster, and c & d in another cluster” (must-link/cannot-link constraints)
• ...
The basic task structure remains the same.
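As a minimal illustration of such a constrained mining task (this sketch is ours, not from the talk; the function name and the Cheese & Beer toy data are hypothetical), a brute-force miner that only returns itemsets satisfying a “must contain these items” constraint:

```python
from itertools import chain, combinations

def frequent_itemsets_with(transactions, must_contain, minfreq):
    """Brute-force constrained itemset mining: enumerate the itemsets that
    contain every item in must_contain and occur in >= minfreq transactions.
    Illustrative only; real systems push such constraints into the search."""
    items = sorted(set(chain.from_iterable(transactions)))
    result = []
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            s = set(subset)
            if not must_contain <= s:
                continue  # constraint: itemset must include the required items
            support = sum(1 for t in transactions if s <= t)
            if support >= minfreq:
                result.append(s)
    return result
```

The point of the constraint-based view is that the constraint (`must_contain <= s`) is data, not code: changing the task means changing the constraint, not rewriting the miner.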


SLIDE 6

Inductive querying

Fits in the “inductive databases” viewpoint (Imielinski & Mannila, 1996):
• patterns are DB objects that can be stored, queried, manipulated
• data mining = “querying for patterns”
Most inductive query languages still focus on particular types of data mining approaches (e.g., the MINE RULE extension to SQL, Meo et al. 1998: association rule mining).


A unified approach

SLIDE 7

Is a more generic approach possible?

A general-purpose modeling language for data mining?
• allowing to model the task, the background knowledge, the inputs, constraints on the outputs, ...
• in (numerical) ML, linear algebra & optimization play a similar role
First steps towards this in DM (Nijssen & Guns, 2010): rephrase itemset mining in a constraint programming framework and demonstrate the efficiency of the approach. This work continues in that direction.


DM task modeling system

SLIDE 8

IDP

SLIDE 9

IDP

An environment for knowledge-based programming (Wittocx et al. 2008). Combines imperative and declarative elements:
• declarative objects: vocabularies, theories, structures
• (predefined) procedures to create and manipulate these objects, and to perform inference on them (model expansion, ...)
Includes a state-of-the-art model generator (cf. ASP competition).

SLIDE 10

FO(.)IDP

FO(.) = a family of extensions of first-order logic. IDP supports FO(.)IDP, an FO(.) language that supports:
• integer & real arithmetic
• aggregates
• inductive definitions
• ...

SLIDE 11

Inductive definitions

Inductive definition: the model is the “minimal” interpretation that fulfills the constraints (“minimal” = the number of true facts is minimal). Set of constraints: the model is any interpretation that fulfills the constraints. FO itself cannot express inductive definitions; it needs an extension for that.

// inductive definition (least model):
{ integer(0).
  integer(s(X)) <- integer(X). }

// mere constraints (any model that satisfies them):
integer(0).
integer(s(X)) <= integer(X).
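The “minimal interpretation” semantics can be sketched procedurally (our illustrative Python, not part of the talk; `s(X)` is represented as `X + 1` and a `limit` bounds the universe so the loop terminates):

```python
def least_model(seed, step, limit):
    """Least fixpoint of {integer(0). integer(s(X)) <- integer(X).}:
    start from the base facts and apply the rule until nothing new
    can be derived. Any superset closed under the implication would
    satisfy the mere constraints; only this set is the inductive model."""
    model = set()
    changed = True
    while changed:
        changed = False
        for fact in seed | {step(x) for x in model}:
            if fact not in model and fact <= limit:
                model.add(fact)
                changed = True
    return model
```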

SLIDE 12

Example: find shortest path

vocabulary sp_voc {
  type node
  from, to : node
  edge(node,node)
  edgeOnPath(node,node)
  reaches(node,node)
}
theory sp_theory : sp_voc {
  // edgeOnPath is a subgraph of edge
  ! x y : edgeOnPath(x,y) => edge(x,y).
  // the path begins in 'from' and ends in 'to'
  ~(? x : edgeOnPath(x,from)) & ~(? x : edgeOnPath(to,x)).
  // not branching: fewer than two incoming and outgoing path edges per node
  ! x : (?<2 y : edgeOnPath(y,x)) & (?<2 y : edgeOnPath(x,y)).
  // connectedness: inductive definition of reachability over path edges
  { reaches(x,y) <- edgeOnPath(x,y).
    reaches(x,y) <- reaches(x,z) & reaches(z,y). }
  // the path must connect 'from' and 'to'
  reaches(from,to).
  ! x y : edgeOnPath(x,y) => reaches(from,y).
}
// theory satisfied <=> edgeOnPath represents a path from 'from' to 'to'

SLIDE 13

Example: find shortest path

(vocabulary sp_voc and theory sp_theory as on slide 12)

structure sp_struct : sp_voc {  // the input graph (= partial interpretation of sp_voc)
  node = {A..D}                 // shorthand for A,B,C,D
  edge = {A,B; B,C; C,D; A,D}
  from = A
  to = D
}
term lengthOfPath : sp_voc {    // defines the length of the path
  #{ x y : edgeOnPath(x,y) }
}

SLIDE 14

Example: find shortest path

(vocabulary, theory, structure, and term as on slides 12 and 13)

procedure main() {
  sols = minimize(sp_theory, sp_struct, lengthOfPath)
  if sols then print(sols[1])
  else print("No models exist.\n") end
}
// main procedure: finds a path with minimal length in the given graph
// and prints it, if it exists
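For comparison with the declarative specification, a procedural counterpart (our illustrative Python sketch, not from the talk): since every edge counts 1 toward `lengthOfPath`, a breadth-first search finds the same minimal path:

```python
from collections import deque

def shortest_path(edges, src, dst):
    """Procedural counterpart of minimizing lengthOfPath under sp_theory:
    find a path with the fewest edges from src to dst, or None if
    dst is unreachable."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    parent = {src: None}          # BFS tree; also marks visited nodes
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:           # reconstruct the path back to src
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

On the graph from the slide (edges A-B, B-C, C-D, A-D), the minimal path from A to D is the direct edge.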

SLIDE 15

Example: find frequent itemsets

vocabulary FrequentItemsetMiningVoc {
  type Transaction
  type Item
  Freq: int
  Includes(Transaction,Item)
  FrequentItemset(Item)   // represents a set of items
}
theory FrequentItemsetMiningTh : FrequentItemsetMiningVoc {
  // i.e., #{t : FrequentItemset ⊆ t} >= Freq.
  #{ t : ! i : FrequentItemset(i) => Includes(t,i) } >= Freq.
}
structure Input : FrequentItemsetMiningVoc {
  Freq = 7                          // threshold for frequent itemsets
  Transaction = { t1; ...; tn }     // n transactions
  Item = { i1; ...; im }            // m items
  Includes = { t1,i2; t1,i7; ... }  // items of transactions
}
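The single constraint in the theory reads directly as executable logic. A small Python transcription of that check (ours, for illustration; `includes` maps each transaction to its item set):

```python
def is_frequent(itemset, includes, freq):
    """Direct reading of FrequentItemsetMiningTh: an itemset is frequent
    iff the number of transactions whose items include the whole itemset
    is at least freq."""
    support = sum(1 for items in includes.values() if itemset <= items)
    return support >= freq
```

In IDP the solver inverts this check: it searches for interpretations of `FrequentItemset` that make the constraint true, instead of testing a given candidate.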

SLIDE 16

IDP for stemmatology

SLIDE 17

Stemmatology (stemmatics)

Subfield of philology concerned with studying relationships between surviving variants of an old text (for instance, in order to reconstruct a lost original).
• Monks copied manuscripts manually and made changes -> “evolution” of the story
• Stemma = “family tree” of a set of manuscripts
• Somewhat similar to phylogenetic trees in bioinformatics, but there are some differences... solutions specific to stemmatology are needed

SLIDE 18

Stemma

[Figure: example stemma over manuscripts A–H, with callouts marking contamination and multifurcation]

Stemma = connected DAG with one root (“rooted DAG”)
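The “connected DAG with one root” condition can be checked mechanically; a sketch in Python (ours, not from the talk), using a reachability pass plus Kahn's algorithm for acyclicity:

```python
def is_stemma(nodes, edges):
    """Check the slide's definition of a stemma: a directed acyclic graph
    with exactly one root (no incoming edges) from which every node is
    reachable. Contamination (two parents) and multifurcation are allowed."""
    preds = {n: set() for n in nodes}
    succs = {n: set() for n in nodes}
    for a, b in edges:
        preds[b].add(a)
        succs[a].add(b)
    roots = [n for n in nodes if not preds[n]]
    if len(roots) != 1:
        return False
    # connectedness: every node reachable from the root (DFS)
    seen, stack = set(), [roots[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succs[n])
    if seen != set(nodes):
        return False
    # acyclicity: Kahn's algorithm must visit every node
    indeg = {n: len(preds[n]) for n in nodes}
    queue = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while queue:
        n = queue.pop()
        visited += 1
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return visited == len(nodes)
```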

SLIDE 19

Stemma with witnesses

[Figure: the same stemma over A–H, with some nodes marked as “witnesses” (surviving manuscripts); one witness is a non-leaf node]

SLIDE 20

The data

Given:
• A set of manuscripts, which differ in particular places
• Each manuscript is described by a fixed set of attributes; an attribute indicates, for a particular position, which variant occurs there

        P1    P2    P3                ...
text1   has   Fred  “no”, he said     ...
text2   had   he    he said no        ...
text3   has   he    “never”, he said  ...

SLIDE 21

The “classical” task

Classical task: given the data, hypothesize a stemma:
• a DAG indicating relationships between the documents
• may include nodes for “lost” documents, the existence of which is hypothesized
But this is not the only task we can consider (nor the task our philologists were interested in).

SLIDE 22

Other tasks

In this case, for a number of cases a stemma is given together with the dataset:
• for synthetic data: the correct stemma
• for real data: the current best guess
Analyze the relationship between the stemmata & data in order to learn something about the evolution of manuscript traditions. E.g., which types of copying errors are more/less commonly made, ...?

SLIDE 23

Task 1

Tara’s original question: “Is there an algorithm that solves the following problem: given a directed graph, with some nodes assigned to particular ‘groups’, is it possible to complete the groups such that each node occurs in at most one group, and each group is connected?”

SLIDE 24

DAG formulation

In a DAG with some groups of nodes defined, complete the groups such that each group forms a rooted DAG itself (“is connected”)

[Figure: a given partial grouping and a completed solution]

SLIDE 25

How to solve?

Several algorithms had been tried; all but one were found incorrect on at least one case. “I haven’t been able to find any case where my latest algorithm won’t work, but I can’t prove it’s correct either.” (370 lines of Perl code, excluding I/O etc.) So we tried a declarative approach:
• v1: model groups using an equivalence relation
• v2: model groups using labels
• v3: use the concept of a “source” (we only discuss this one)

SLIDE 26

Terminology

A source of a variant = a document where the variant first occurred (= its parents do not have that variant). The problem reduces to: “given a partially labeled DAG, can you complete the labeling such that each label has only one source?”
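A brute-force sketch of this reduced problem (our illustrative Python, not the IDP or Perl implementation; note the search is exponential in the number of unlabeled nodes, consistent with the NP-completeness result mentioned later):

```python
from itertools import product

def one_source_completion(parents, labels, variants):
    """Can the partial labeling (manuscript -> variant, None if unknown)
    be completed so that every occurring variant has exactly one source?
    A source is a node none of whose parents carries the same variant.
    parents maps each manuscript to its list of parents in the stemma.
    Returns a completed labeling, or None if no completion works."""
    unknown = [m for m, v in labels.items() if v is None]
    for choice in product(variants, repeat=len(unknown)):
        full = dict(labels)
        full.update(zip(unknown, choice))
        ok = True
        for v in variants:
            sources = [m for m, lv in full.items() if lv == v and
                       all(full[p] != v for p in parents[m])]
            if [m for m in full if full[m] == v] and len(sources) != 1:
                ok = False
                break
        if ok:
            return full
    return None
```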

SLIDE 27

IDP formulation

/* ---------- Knowledge base ------------------------- */
vocabulary V {
  type Manuscript
  type Variant
  CopiedBy(Manuscript,Manuscript)
  VariantIn(Manuscript): Variant
}
vocabulary Vsrc {
  extern vocabulary V
  SourceOf(Variant): Manuscript
}
theory Tsrc : Vsrc {
  // every manuscript that is not the source of its variant
  // copied the variant from a manuscript carrying the same variant
  ! x : (x ~= SourceOf(VariantIn(x))) =>
        ? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x).
}

SLIDE 28

IDP formulation

/* --------- Check whether sample fits stemma -------- */
procedure check(sample) {
  idpintern.setvocabulary(sample, Vsrc)
  return sat(Tsrc, sample)
}

SLIDE 29

IDP formulation

procedure main() {
  process("besoin")
  process("parzival")
  process("florilegium")
  process("sermon158")
  process("heinrichi")
}

/* ---------- Procedures for processing -------------- */
procedure process(name) {
  io.write("Processing ", name, ".\n")
  local path = "data/"
  local stemmafilename = path..name..".dot"
  local samplefilename = path..name..".json"
  processFiles(stemmafilename, samplefilename)
}

procedure processFiles(stemmafilename, samplefilename) {
  local stemma, nbnodes, nbedges = readStemma(stemmafilename)
  io.write("Stemma has ", nbnodes, " nodes and ", nbedges, " edges.\n")
  local nbp, nbs, time = processSamples(stemma, samplefilename)
  io.write("Found ", nbp, " positive out of ", nbs, " groupings ")
  io.write("in ", time, " sec.\n")
}

// these procedures create the IDP structures from the input files
procedure readStemma(stemmafilename) { ... }
procedure processSamples(stemma, samplefilename) { ... }

SLIDE 30

Results

• Tested on five datasets: same results as the earlier procedural implementation
• About equally efficient (slightly faster)
• Easier to write, and provably correct!
• The original implementation turned out to be incorrect: suspicions arose after we proved the problem NP-complete and noticed the implementation was polynomial; a counterexample was found

SLIDE 31

Further steps...

Noticed that many problems were not satisfiable (stemma + observed variants contradict the one-source hypothesis). So, what’s the minimal number of sources needed to explain the observations for a particular stemma & attribute?

SLIDE 32

IDP formulation

vocabulary V { ... }
vocabulary Vms {
  extern vocabulary V
  IsSource(Manuscript)
}
theory Tms : Vms {
  { ! x : IsSource(x) <- ~? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x). }
}
term NbOfSources : Vms { #{ x : IsSource(x) } }
procedure minSources(sample) {
  idpintern.setvocabulary(sample, Vms)
  return minimize(Tms, sample, NbOfSources)[1]
}
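The same minimization, sketched as brute force in Python (ours, for illustration only; the IDP solver does this far more cleverly than enumerating all completions):

```python
from itertools import product

def min_sources(parents, labels, variants):
    """Counterpart of minimizing NbOfSources under Tms: over all completions
    of the partial labeling (manuscript -> variant, None if unknown), return
    the minimal number of sources, i.e. nodes none of whose parents carries
    the same variant. Brute force, feasible for small stemmata only."""
    unknown = [m for m, v in labels.items() if v is None]
    best = None
    for choice in product(variants, repeat=len(unknown)):
        full = dict(labels)
        full.update(zip(unknown, choice))
        n = sum(1 for m, v in full.items()
                if all(full[p] != v for p in parents[m]))
        best = n if best is None else min(best, n)
    return best
```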

SLIDE 33

Results

With limited changes to the declarative specification, this problem gets solved in seconds Adapting the procedural program would not be trivial

Processing besoin.
Stemma has 13 nodes and 13 edges.
IsSource = { T2; U }
IsSource = { C; T2 }
IsSource = { D; J; L; M; T2; U; V }
...
IsSource = { B; F; J; T2 }
Minimized for 44 groupings in 0 sec.

SLIDE 34

Reversions

Reversion = document differs from parent, but has the same variant as some earlier ancestor Assign reversions a lower cost than sources Try to find minimal-cost solution

SLIDE 35

IDP formulation

vocabulary Vcls {
  extern vocabulary V
  type Cost isa nat
  type Class
  Copy: Class
  Revert: Class
  Source: Class
  ClassOf(Manuscript): Class
  IndirectAncestor(Manuscript,Manuscript)
}
theory Tcls : Vcls {
  ! x : (ClassOf(x) = Copy) <=>
        ? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x).
  ! x : (ClassOf(x) = Revert) <=> ClassOf(x) ~= Copy &
        ? y : IndirectAncestor(y,x) & VariantIn(y) = VariantIn(x).
  { ! x y : IndirectAncestor(x,y) <- ? z : CopiedBy(x,z) & IndirectAncestor(z,y).
    ! x y : IndirectAncestor(x,y) <- ? z : CopiedBy(x,z) & CopiedBy(z,y). }
  NbOfSources = #{ x : ClassOf(x) = Source }.
  NbOfReverts = #{ x : ClassOf(x) = Revert }.
}
term TotalCost : Vcls { 3 * NbOfSources + NbOfReverts }
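Given a full labeling, the classification and cost are deterministic; a Python transcription of the Tcls reading (ours, for illustration; `ancestors` maps each manuscript to its indirect, i.e. non-parent, ancestors, mirroring `IndirectAncestor`):

```python
def total_cost(parents, ancestors, variant):
    """Counterpart of TotalCost under Tcls: classify each manuscript as a
    copy (some parent carries the same variant), a revert (no such parent,
    but some indirect ancestor carries the same variant) or a source, and
    weigh sources 3 times as heavily as reverts. Copies are free."""
    sources = reverts = 0
    for m, v in variant.items():
        if any(variant[p] == v for p in parents[m]):
            continue  # copy
        if any(variant[a] == v for a in ancestors[m]):
            reverts += 1
        else:
            sources += 1
    return 3 * sources + reverts
```

As in the slide, the minimization then searches over completions of a partial labeling for the one with the smallest total cost.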

SLIDE 36

Results

Works, but a lot slower (one run on the most difficult dataset takes hours):
• not sure why
• no attempt has been made to write a procedural solution until now

Processing besoin.
Stemma has 13 nodes and 13 edges.
ClassOf = {T2->s; U->s}
ClassOf = {C->s; T2->s}
ClassOf = {A->s; D->r; J->r; L->s; M->r; T2->s; U->r; V->r}
...
ClassOf = {A->s; B->s; F->r; J->r; T2->s}
Minimized for 44 groupings in 3 sec.

SLIDE 37

Conclusions

SLIDE 38

Conclusions

• Modeling and solving these data analysis problems with IDP is feasible
• Apart from “write your own program”, no better approach is known for this type of analysis: not a standard DM problem, and not trivial to solve procedurally
• We believe this is an interesting step towards “Declarative Data Mining”, offering flexibility, ease, correctness, efficiency
• The use of an advanced modeling language (FO(.)-based) and advanced solvers is crucial!