Analyzing manuscript traditions using constraint-based data mining - - PowerPoint PPT Presentation



SLIDE 1

Analyzing manuscript traditions using constraint-based data mining

Tara Andrews, Hendrik Blockeel, Bart Bogaerts, Maurice Bruynooghe, Marc Denecker, Stef De Pooter, Caroline Macé, Jan Ramon KU Leuven, Department of Computer Science KU Leuven, Faculty of Arts

A case study in declarative data mining

SLIDE 2

Overview

• Principles of “declarative data mining”
• IDP: a modeling language based on first-order logic
• Using IDP for data analysis in stemmatology

SLIDE 3

Declarative data mining

SLIDE 4

Data mining

Current state of the art in data mining: a large variety of tasks, methods, and systems. Data analysis is limited to mapping your problem onto one of the predefined tasks, then running an existing system. Not much flexibility.

[Figure: boxes for predefined methods: DT, ANN, SVM, PCA, k-means, Assoc-rules]

SLIDE 5

Constraint-based data mining

More flexibility: define the task more precisely by imposing constraints on the solutions you want to find. E.g.:
• “find all frequent itemsets that have Cheese & Beer in the IF part”
• “find the clustering with minimal SSE that has a & b in the same cluster, and c & d in another cluster” (must-link/cannot-link constraints)
• ...
The basic task structure remains the same.
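As a minimal illustration of such a constrained mining task (this sketch is ours, not from the talk; the function name and the Cheese & Beer toy data are hypothetical), a brute-force miner that only returns itemsets satisfying a “must contain these items” constraint:

```python
from itertools import chain, combinations

def frequent_itemsets_with(transactions, must_contain, minfreq):
    """Brute-force constrained itemset mining: enumerate the itemsets that
    contain every item in must_contain and occur in >= minfreq transactions.
    Illustrative only; real systems push such constraints into the search."""
    items = sorted(set(chain.from_iterable(transactions)))
    result = []
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            s = set(subset)
            if not must_contain <= s:
                continue  # constraint: itemset must include the required items
            support = sum(1 for t in transactions if s <= t)
            if support >= minfreq:
                result.append(s)
    return result
```

The point of the constraint-based view is that the constraint (`must_contain <= s`) is data, not code: changing the task means changing the constraint, not rewriting the miner.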


SLIDE 6

Inductive querying

Fits in the “inductive databases” viewpoint (Imielinski & Mannila, 1996):
• patterns are DB objects that can be stored, queried, manipulated
• data mining = “querying for patterns”
Most inductive query languages still focus on particular types of data mining approaches (e.g., the MINE RULE extension to SQL, Meo et al. 1998: association rule mining).


A unified approach

SLIDE 7

Is a more generic approach possible?

A general-purpose modeling language for data mining?
• allowing to model the task, the background knowledge, the inputs, constraints on the outputs, ...
• in (numerical) ML, linear algebra & optimization play a similar role
First steps towards this in DM (Nijssen & Guns, 2010): rephrase itemset mining in a constraint programming framework and demonstrate the efficiency of the approach. This work continues in that direction.


DM task modeling system

SLIDE 8

IDP

SLIDE 9

IDP

An environment for knowledge-based programming (Wittocx et al. 2008). Combines imperative and declarative elements:
• declarative objects: vocabularies, theories, structures
• (predefined) procedures to create and manipulate these objects, and to perform inference on them (model expansion, ...)
Includes a state-of-the-art model generator (cf. ASP competition).

SLIDE 10

FO(.)IDP

FO(.) = a family of extensions of first-order logic. IDP supports FO(.)IDP, an FO(.) language that supports:
• integer & real arithmetic
• aggregates
• inductive definitions
• ...

SLIDE 11

Inductive definitions

Inductive definition: the model is the “minimal” interpretation that fulfills the constraints (“minimal” = the number of true facts is minimal). Set of constraints: the model is any interpretation that fulfills the constraints. FO itself cannot express inductive definitions; it needs an extension for that.

// inductive definition (least model):
{ integer(0).
  integer(s(X)) <- integer(X). }

// mere constraints (any model that satisfies them):
integer(0).
integer(s(X)) <= integer(X).
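The “minimal interpretation” semantics can be sketched procedurally (our illustrative Python, not part of the talk; `s(X)` is represented as `X + 1` and a `limit` bounds the universe so the loop terminates):

```python
def least_model(seed, step, limit):
    """Least fixpoint of {integer(0). integer(s(X)) <- integer(X).}:
    start from the base facts and apply the rule until nothing new
    can be derived. Any superset closed under the implication would
    satisfy the mere constraints; only this set is the inductive model."""
    model = set()
    changed = True
    while changed:
        changed = False
        for fact in seed | {step(x) for x in model}:
            if fact not in model and fact <= limit:
                model.add(fact)
                changed = True
    return model
```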

SLIDE 12

Example: find shortest path

vocabulary sp_voc {
  type node
  from, to : node
  edge(node,node)
  edgeOnPath(node,node)
  reaches(node,node)
}
theory sp_theory : sp_voc {
  // edgeOnPath is a subgraph of edge
  ! x y : edgeOnPath(x,y) => edge(x,y).
  // the path begins in 'from' and ends in 'to'
  ~(? x : edgeOnPath(x,from)) & ~(? x : edgeOnPath(to,x)).
  // not branching: fewer than two incoming and outgoing path edges per node
  ! x : (?<2 y : edgeOnPath(y,x)) & (?<2 y : edgeOnPath(x,y)).
  // connectedness: inductive definition of reachability over path edges
  { reaches(x,y) <- edgeOnPath(x,y).
    reaches(x,y) <- reaches(x,z) & reaches(z,y). }
  // the path must connect 'from' and 'to'
  reaches(from,to).
  ! x y : edgeOnPath(x,y) => reaches(from,y).
}
// theory satisfied <=> edgeOnPath represents a path from 'from' to 'to'

SLIDE 13

Example: find shortest path

(vocabulary sp_voc and theory sp_theory as on slide 12)

structure sp_struct : sp_voc {  // the input graph (= partial interpretation of sp_voc)
  node = {A..D}                 // shorthand for A,B,C,D
  edge = {A,B; B,C; C,D; A,D}
  from = A
  to = D
}
term lengthOfPath : sp_voc {    // defines the length of the path
  #{ x y : edgeOnPath(x,y) }
}

SLIDE 14

Example: find shortest path

(vocabulary, theory, structure, and term as on slides 12 and 13)

procedure main() {
  sols = minimize(sp_theory, sp_struct, lengthOfPath)
  if sols then print(sols[1])
  else print("No models exist.\n") end
}
// main procedure: finds a path with minimal length in the given graph
// and prints it, if it exists
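For comparison with the declarative specification, a procedural counterpart (our illustrative Python sketch, not from the talk): since every edge counts 1 toward `lengthOfPath`, a breadth-first search finds the same minimal path:

```python
from collections import deque

def shortest_path(edges, src, dst):
    """Procedural counterpart of minimizing lengthOfPath under sp_theory:
    find a path with the fewest edges from src to dst, or None if
    dst is unreachable."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    parent = {src: None}          # BFS tree; also marks visited nodes
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:           # reconstruct the path back to src
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None
```

On the graph from the slide (edges A-B, B-C, C-D, A-D), the minimal path from A to D is the direct edge.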

SLIDE 15

Example: find frequent itemsets

vocabulary FrequentItemsetMiningVoc {
  type Transaction
  type Item
  Freq: int
  Includes(Transaction,Item)
  FrequentItemset(Item)   // represents a set of items
}
theory FrequentItemsetMiningTh : FrequentItemsetMiningVoc {
  // i.e., #{t : FrequentItemset ⊆ t} >= Freq.
  #{ t : ! i : FrequentItemset(i) => Includes(t,i) } >= Freq.
}
structure Input : FrequentItemsetMiningVoc {
  Freq = 7                          // threshold for frequent itemsets
  Transaction = { t1; ...; tn }     // n transactions
  Item = { i1; ...; im }            // m items
  Includes = { t1,i2; t1,i7; ... }  // items of transactions
}
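The single constraint in the theory reads directly as executable logic. A small Python transcription of that check (ours, for illustration; `includes` maps each transaction to its item set):

```python
def is_frequent(itemset, includes, freq):
    """Direct reading of FrequentItemsetMiningTh: an itemset is frequent
    iff the number of transactions whose items include the whole itemset
    is at least freq."""
    support = sum(1 for items in includes.values() if itemset <= items)
    return support >= freq
```

In IDP the solver inverts this check: it searches for interpretations of `FrequentItemset` that make the constraint true, instead of testing a given candidate.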

SLIDE 16

IDP for stemmatology

SLIDE 17

Stemmatology (stemmatics)

Subfield of philology concerned with studying relationships between surviving variants of an old text (for instance, in order to reconstruct a lost original).
• Monks copied manuscripts manually and made changes -> “evolution” of the story
• Stemma = “family tree” of a set of manuscripts
• Somewhat similar to phylogenetic trees in bioinformatics, but there are some differences... solutions specific to stemmatology are needed

SLIDE 18

Stemma

[Figure: example stemma over manuscripts A–H, with callouts marking contamination and multifurcation]

Stemma = connected DAG with one root (“rooted DAG”)
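The “connected DAG with one root” condition can be checked mechanically; a sketch in Python (ours, not from the talk), using a reachability pass plus Kahn's algorithm for acyclicity:

```python
def is_stemma(nodes, edges):
    """Check the slide's definition of a stemma: a directed acyclic graph
    with exactly one root (no incoming edges) from which every node is
    reachable. Contamination (two parents) and multifurcation are allowed."""
    preds = {n: set() for n in nodes}
    succs = {n: set() for n in nodes}
    for a, b in edges:
        preds[b].add(a)
        succs[a].add(b)
    roots = [n for n in nodes if not preds[n]]
    if len(roots) != 1:
        return False
    # connectedness: every node reachable from the root (DFS)
    seen, stack = set(), [roots[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(succs[n])
    if seen != set(nodes):
        return False
    # acyclicity: Kahn's algorithm must visit every node
    indeg = {n: len(preds[n]) for n in nodes}
    queue = [n for n in nodes if indeg[n] == 0]
    visited = 0
    while queue:
        n = queue.pop()
        visited += 1
        for m in succs[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return visited == len(nodes)
```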

SLIDE 19

Stemma with witnesses

[Figure: the same stemma over A–H, with some nodes marked as “witnesses” (surviving manuscripts); one witness is a non-leaf node]

SLIDE 20

The data

Given:
• A set of manuscripts, which differ in particular places
• Each manuscript is described by a fixed set of attributes; an attribute indicates, for a particular position, which variant occurs there

        P1    P2    P3                ...
text1   has   Fred  “no”, he said     ...
text2   had   he    he said no        ...
text3   has   he    “never”, he said  ...

SLIDE 21

The “classical” task

Classical task: given the data, hypothesize a stemma:
• a DAG indicating relationships between the documents
• may include nodes for “lost” documents, the existence of which is hypothesized
But this is not the only task we can consider (nor the task our philologists were interested in).

SLIDE 22

Other tasks

In this case, for a number of cases a stemma is given together with the dataset:
• for synthetic data: the correct stemma
• for real data: the current best guess
Analyze the relationship between the stemmata & data in order to learn something about the evolution of manuscript traditions. E.g., which types of copying errors are more/less commonly made, ...?

SLIDE 23

Task 1

Tara’s original question: “Is there an algorithm that solves the following problem: given a directed graph, with some nodes assigned to particular ‘groups’, is it possible to complete the groups such that each node occurs in at most one group, and each group is connected?”

SLIDE 24

DAG formulation

In a DAG with some groups of nodes defined, complete the groups such that each group forms a rooted DAG itself (“is connected”)

[Figure: a given partial grouping and a completed solution]

SLIDE 25

How to solve?

Several algorithms had been tried; all but one were found incorrect on at least one case. “I haven’t been able to find any case where my latest algorithm won’t work, but I can’t prove it’s correct either.” (370 lines of Perl code, excluding I/O etc.) So we tried a declarative approach:
• v1: model groups using an equivalence relation
• v2: model groups using labels
• v3: use the concept of a “source” (we only discuss this one)

SLIDE 26

Terminology

A source of a variant = a document where the variant first occurred (= its parents do not have that variant). The problem reduces to: “given a partially labeled DAG, can you complete the labeling such that each label has only one source?”
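A brute-force sketch of this reduced problem (our illustrative Python, not the IDP or Perl implementation; note the search is exponential in the number of unlabeled nodes, consistent with the NP-completeness result mentioned later):

```python
from itertools import product

def one_source_completion(parents, labels, variants):
    """Can the partial labeling (manuscript -> variant, None if unknown)
    be completed so that every occurring variant has exactly one source?
    A source is a node none of whose parents carries the same variant.
    parents maps each manuscript to its list of parents in the stemma.
    Returns a completed labeling, or None if no completion works."""
    unknown = [m for m, v in labels.items() if v is None]
    for choice in product(variants, repeat=len(unknown)):
        full = dict(labels)
        full.update(zip(unknown, choice))
        ok = True
        for v in variants:
            sources = [m for m, lv in full.items() if lv == v and
                       all(full[p] != v for p in parents[m])]
            if [m for m in full if full[m] == v] and len(sources) != 1:
                ok = False
                break
        if ok:
            return full
    return None
```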

SLIDE 27

IDP formulation

/* ---------- Knowledge base ------------------------- */
vocabulary V {
  type Manuscript
  type Variant
  CopiedBy(Manuscript,Manuscript)
  VariantIn(Manuscript): Variant
}
vocabulary Vsrc {
  extern vocabulary V
  SourceOf(Variant): Manuscript
}
theory Tsrc : Vsrc {
  // every manuscript that is not the source of its variant
  // copied the variant from a manuscript carrying the same variant
  ! x : (x ~= SourceOf(VariantIn(x))) =>
        ? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x).
}

SLIDE 28

IDP formulation

/* --------- Check whether sample fits stemma -------- */
procedure check(sample) {
  idpintern.setvocabulary(sample, Vsrc)
  return sat(Tsrc, sample)
}

SLIDE 29

IDP formulation

procedure main() {
  process("besoin")
  process("parzival")
  process("florilegium")
  process("sermon158")
  process("heinrichi")
}

/* ---------- Procedures for processing -------------- */
procedure process(name) {
  io.write("Processing ", name, ".\n")
  local path = "data/"
  local stemmafilename = path..name..".dot"
  local samplefilename = path..name..".json"
  processFiles(stemmafilename, samplefilename)
}

procedure processFiles(stemmafilename, samplefilename) {
  local stemma, nbnodes, nbedges = readStemma(stemmafilename)
  io.write("Stemma has ", nbnodes, " nodes and ", nbedges, " edges.\n")
  local nbp, nbs, time = processSamples(stemma, samplefilename)
  io.write("Found ", nbp, " positive out of ", nbs, " groupings ")
  io.write("in ", time, " sec.\n")
}

// these procedures create the IDP structures from the input files
procedure readStemma(stemmafilename) { ... }
procedure processSamples(stemma, samplefilename) { ... }

SLIDE 30

Results

• Tested on five datasets: same results as the earlier procedural implementation
• About equally efficient (slightly faster)
• Easier to write, and provably correct!
• The original implementation turned out to be incorrect: suspicions arose after we proved the problem NP-complete and noticed the implementation was polynomial; a counterexample was found

SLIDE 31

Further steps...

Noticed that many problems were not satisfiable (stemma + observed variants contradict the one-source hypothesis). So, what’s the minimal number of sources needed to explain the observations for a particular stemma & attribute?

SLIDE 32

IDP formulation

vocabulary V { ... }
vocabulary Vms {
  extern vocabulary V
  IsSource(Manuscript)
}
theory Tms : Vms {
  { ! x : IsSource(x) <- ~? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x). }
}
term NbOfSources : Vms { #{ x : IsSource(x) } }
procedure minSources(sample) {
  idpintern.setvocabulary(sample, Vms)
  return minimize(Tms, sample, NbOfSources)[1]
}
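The same minimization, sketched as brute force in Python (ours, for illustration only; the IDP solver does this far more cleverly than enumerating all completions):

```python
from itertools import product

def min_sources(parents, labels, variants):
    """Counterpart of minimizing NbOfSources under Tms: over all completions
    of the partial labeling (manuscript -> variant, None if unknown), return
    the minimal number of sources, i.e. nodes none of whose parents carries
    the same variant. Brute force, feasible for small stemmata only."""
    unknown = [m for m, v in labels.items() if v is None]
    best = None
    for choice in product(variants, repeat=len(unknown)):
        full = dict(labels)
        full.update(zip(unknown, choice))
        n = sum(1 for m, v in full.items()
                if all(full[p] != v for p in parents[m]))
        best = n if best is None else min(best, n)
    return best
```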

SLIDE 33

Results

With limited changes to the declarative specification, this problem gets solved in seconds Adapting the procedural program would not be trivial

Processing besoin.
Stemma has 13 nodes and 13 edges.
IsSource = { T2; U }
IsSource = { C; T2 }
IsSource = { D; J; L; M; T2; U; V }
...
IsSource = { B; F; J; T2 }
Minimized for 44 groupings in 0 sec.

SLIDE 34

Reversions

Reversion = document differs from parent, but has the same variant as some earlier ancestor Assign reversions a lower cost than sources Try to find minimal-cost solution

SLIDE 35

IDP formulation

vocabulary Vcls {
  extern vocabulary V
  type Cost isa nat
  type Class
  Copy: Class
  Revert: Class
  Source: Class
  ClassOf(Manuscript): Class
  IndirectAncestor(Manuscript,Manuscript)
}
theory Tcls : Vcls {
  ! x : (ClassOf(x) = Copy) <=>
        ? y : CopiedBy(y,x) & VariantIn(y) = VariantIn(x).
  ! x : (ClassOf(x) = Revert) <=> ClassOf(x) ~= Copy &
        ? y : IndirectAncestor(y,x) & VariantIn(y) = VariantIn(x).
  { ! x y : IndirectAncestor(x,y) <- ? z : CopiedBy(x,z) & IndirectAncestor(z,y).
    ! x y : IndirectAncestor(x,y) <- ? z : CopiedBy(x,z) & CopiedBy(z,y). }
  NbOfSources = #{ x : ClassOf(x) = Source }.
  NbOfReverts = #{ x : ClassOf(x) = Revert }.
}
term TotalCost : Vcls { 3 * NbOfSources + NbOfReverts }
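Given a full labeling, the classification and cost are deterministic; a Python transcription of the Tcls reading (ours, for illustration; `ancestors` maps each manuscript to its indirect, i.e. non-parent, ancestors, mirroring `IndirectAncestor`):

```python
def total_cost(parents, ancestors, variant):
    """Counterpart of TotalCost under Tcls: classify each manuscript as a
    copy (some parent carries the same variant), a revert (no such parent,
    but some indirect ancestor carries the same variant) or a source, and
    weigh sources 3 times as heavily as reverts. Copies are free."""
    sources = reverts = 0
    for m, v in variant.items():
        if any(variant[p] == v for p in parents[m]):
            continue  # copy
        if any(variant[a] == v for a in ancestors[m]):
            reverts += 1
        else:
            sources += 1
    return 3 * sources + reverts
```

As in the slide, the minimization then searches over completions of a partial labeling for the one with the smallest total cost.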

SLIDE 36

Results

Works, but a lot slower (one run on the most difficult dataset takes hours):
• not sure why
• no attempt has been made to write a procedural solution until now

Processing besoin.
Stemma has 13 nodes and 13 edges.
ClassOf = {T2->s; U->s}
ClassOf = {C->s; T2->s}
ClassOf = {A->s; D->r; J->r; L->s; M->r; T2->s; U->r; V->r}
...
ClassOf = {A->s; B->s; F->r; J->r; T2->s}
Minimized for 44 groupings in 3 sec.

SLIDE 37

Conclusions

SLIDE 38

Conclusions

• Modeling and solving these data analysis problems with IDP is feasible
• Apart from “write your own program”, no better approach is known for this type of analysis: not a standard DM problem, and not trivial to solve procedurally
• We believe this is an interesting step towards “Declarative Data Mining”, offering flexibility, ease, correctness, efficiency
• The use of an advanced modeling language (FO(.)-based) and advanced solvers is crucial!