Declarative Information Extraction Declarative Information - - PowerPoint PPT Presentation




SLIDE 1

Declarative Information Extraction Using Datalog with Embedded Extraction Predicates

Warren Shen, AnHai Doan, Jeffrey Naughton (University of Wisconsin, Madison)
Raghu Ramakrishnan (Yahoo! Research)

SLIDE 2

Information Extraction

Extracting structured information from unstructured data

Talks (input text):
  “Feedback in IR”
  Relevance feedback is important ...
  “Personalized Search”
  Customizing rankings with relevance feedback ...

talks (extracted table):
  title                    abstract
  “Feedback in IR”         “Relevance feedback is important...”
  “Personalized Search”    “Customizing rankings with relevance feedback...”

SLIDE 3

IE Plays a Crucial Role in Many Applications

• Examples
  – Business intelligence
  – Enterprise search
  – Personal information management
  – Community information management
  – Scientific data management
  – Web search and advertising
  – Many more…
• Increasing attention in the DB community
  – Columbia, Google, IBM Almaden, IBM T.J. Watson, IIT-Bombay, MIT, MSR, Stanford, UIUC, UMass Amherst, U. Washington, U. Wisconsin, Yahoo! Research
  – Recent tutorials in SIGMOD-06, KDD-06, KDD-03

SLIDE 4

Previous Solutions Unsatisfactory

• Employ an off-the-shelf monolithic “blackbox”
  – Limited expressiveness
• Stitch together blackboxes, e.g. with Perl or Java
  – Example: DBlife
  – Difficult to understand, debug, modify, reuse, optimize
• Compositional frameworks, e.g. UIMA, GATE
  – Easier to develop IE programs
  – Still difficult to optimize because there are no formal semantics for the interactions between blackboxes

SLIDE 5

Optimization However is Critical

• Many real-world systems run complex IE programs on large data sets
  – DBlife: Unoptimized IE program takes more than a day to process 10,000 documents
  – Avatar: IE program to extract band reviews from blogs takes 8 hours to process 4.5 million blogs
• Optimization is also critical for debugging and development

SLIDE 6

Proposed Solution: Datalog with Embedded Procedural Predicates

titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,a) :- docs(d), extractAbstract(d,a).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a,“relevance feedback”).

[Figure: the embedded predicates in the rules are annotated as perl module, C++ module, perl module, perl module. The docs relation holds documents d1, d2, d3, for example the Talks page: “Feedback in IR” Relevance feedback is important ... “Personalized Search” Customizing rankings with relevance feedback ...]

Extracted result:
  title                    abstract
  “Feedback in IR”         “Relevance feedback is important...”
  “Personalized Search”    “Customizing rankings with relevance feedback...”
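The three rules on this slide can be read as joins over relations whose tuples are produced by ordinary procedures. The following is a minimal Python sketch of that reading, not the authors' implementation: the toy documents, the regex-based extractors, and the representation of spans as (start, end) offsets are all assumptions made for illustration.

```python
import re

# Toy documents standing in for the docs relation (contents invented for illustration).
DOCS = {
    "d1": '"Feedback in IR"\nRelevance feedback is important ...\n'
          '"Personalized Search"\nCustomizing rankings with relevance feedback ...',
    "d2": '"Query Optimization"\nOptimizing queries is important because ...',
}

def extract_title(doc_id):
    """Hypothetical extractTitle: yields (start, end) spans of fully quoted lines."""
    for m in re.finditer(r'^"[^"\n]+"$', DOCS[doc_id], flags=re.M):
        yield m.span()

def extract_abstract(doc_id):
    """Hypothetical extractAbstract: yields spans of non-empty lines that are not quoted."""
    for m in re.finditer(r'^[^"\n][^\n]*$', DOCS[doc_id], flags=re.M):
        yield m.span()

def imm_before(doc_id, t, a):
    """immBefore(t, a): span t ends before span a starts, with only whitespace between."""
    return t[1] <= a[0] and DOCS[doc_id][t[1]:a[0]].strip() == ""

def contains(doc_id, span, phrase):
    """contains(a, phrase): case-insensitive substring test on the span's text."""
    return phrase.lower() in DOCS[doc_id][span[0]:span[1]].lower()

def talks(phrase="relevance feedback"):
    # titles(d,t) :- docs(d), extractTitle(d,t).
    titles = [(d, t) for d in DOCS for t in extract_title(d)]
    # abstracts(d,a) :- docs(d), extractAbstract(d,a).
    abstracts = [(d, a) for d in DOCS for a in extract_abstract(d)]
    # talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a, phrase).
    return [
        (d, DOCS[d][t[0]:t[1]], DOCS[d][a[0]:a[1]])
        for d, t in titles
        for d2, a in abstracts
        if d == d2 and imm_before(d, t, a) and contains(d, a, phrase)
    ]
```

Calling talks() on the toy data returns the two title and abstract pairs from d1; d2 is dropped because its abstract does not mention the phrase.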

SLIDE 7

Benefits of Our Solution

• Easier to understand, debug, modify, reuse
  – People already write IE programs by stitching blackboxes together
  – Stitching them together using Datalog is a more natural way
• Can optimize IE programs effectively
  – based on data set characteristics
  – automatically

SLIDE 8

Example 1

[Figure: two execution plans for the talks program. One plan consists of docs(d), extractTitle(d,t), extractAbstract(d,a), σimmBefore(t,a), and σcontains(a, “relevance feedback”). The other plan has the same operators plus two σcontains(d, “relevance feedback”) selections on the documents.]

Example documents:
  SIGIR Talks: “Feedback in IR” Relevance feedback is important ... “Personalized Search” Customizing rankings with relevance feedback ...
  SIGMOD Talks: “Information Extraction” Text data is everywhere... “Query Optimization” Optimizing queries is important because ...

SLIDE 9

Example 2

• Tested our framework on an IE program in DBlife
  – Originally took 7+ hours on one snapshot (9572 pages, 116 MB)
  – Manually optimized to 24 minutes by 2 grad students over 3 days in 2005
• Converted this IE program to our language
  – Conversion cost: 3 hours by 1 student
  – Automatically optimized in 1 minute, bringing the runtime to 61 minutes
• Our framework can drastically speed up development time by eliminating labor-intensive manual optimization

SLIDE 10

Challenges and Contributions

• How do we formally define the Datalog extension?
  – Xlog language
• How do we optimize IE programs?
  – Three text-centric optimization techniques
  – Cost-based plan selection
• Extensive experiments on real-world data

SLIDE 11

Xlog: Syntax

titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,t) :- docs(d), extractAbstract(d,t).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a,“relevance feedback”).

[Figure: callouts labeled p-predicate, p-function, and IE-predicate, with the following example invocations.]
  contains(d1, “relevance feedback”) → true
  namePatterns(“Dave”, “Smith”) → (“Dave”, “Smith”, “Smith,\s+D.”), (“Dave”, “Smith”, “Dr.\s+Smith”)
  extractTitle(d1) → (d1, t1), (d1, t2)

Example document d1: Talk titles “Exploiting Clicks” “Relevance feedback”
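The namePatterns example on this slide takes a first and last name and produces several output tuples, each carrying a regular expression. A minimal Python sketch of such a procedure follows. The function name name_patterns and the choice of exactly two templates are assumptions that simply mirror the two output tuples shown on the slide.

```python
import re

def name_patterns(first, last):
    """Hypothetical namePatterns: returns (first, last, regex) tuples.

    The two templates reproduce the outputs shown on the slide. As there,
    the '.' is left unescaped, so it matches any single character.
    """
    return [
        (first, last, rf"{last},\s+{first[0]}."),
        (first, last, rf"Dr.\s+{last}"),
    ]

# Usage: generate the patterns for one person and try them on some text.
patterns = name_patterns("Dave", "Smith")
mentions = [p for _, _, p in patterns if re.search(p, "Talk by Dr. Smith at noon")]
```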

SLIDE 12

Xlog: Semantics

titles(d,t) :- docs(d), extractTitle(d,t).
abstracts(d,t) :- docs(d), extractAbstract(d,t).
talks(d,t,a) :- titles(d,t), abstracts(d,a), immBefore(t,a), contains(a,“relevance feedback”).

[Figure: the execution plan for the talks program, built from docs(d), extractTitle(d,t), extractAbstract(d,a), σimmBefore(t,a), and σcontains(a, “relevance feedback”). Each operator is annotated with the intermediate tuples it produces over documents d1, d2, titles t1, t2, and abstracts a1, a2.]

SLIDE 13

Optimization 1: Pushing Down Text Properties

[Figure: the original plan (docs(d), extractTitle(d,t), extractAbstract(d,a), σimmBefore(t,a), σcontains(a, “relevance feedback”)) next to the rewritten plan, which adds two σcontains(d, “relevance feedback”) selections on the documents. An inset shows extractAbstract taking a document d and producing an abstract a.]

Text properties:
  contains(a,w) ∧ comes-from(a,d) → contains(d,w)
  italics(s) ∧ overlaps(s,t) → containsItalics(t)
  (lengthWord(s) = 3) ∧ comes-from(s,t) → lengthWord(t) > 3
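The first text property above justifies filtering documents before extraction: an abstract that contains a phrase can only come from a document that contains it. The Python sketch below illustrates the idea under stated assumptions. The toy documents, the partition-based extractor, and the call counter are all hypothetical and are there only to show that both plans return the same tuples while the rewritten plan calls the extractor less often.

```python
DOCS = {
    "d1": "Feedback in IR. Relevance feedback is important ...",
    "d2": "Query Optimization. Optimizing queries is important because ...",
    "d3": "Personalized Search. Customizing rankings with relevance feedback ...",
}

# Counts how often the (assumed expensive) extractor runs.
calls = {"extract_abstract": 0}

def extract_abstract(doc_id):
    """Hypothetical extractor: everything after the first '. ' is the abstract."""
    calls["extract_abstract"] += 1
    _, _, rest = DOCS[doc_id].partition(". ")
    return [rest] if rest else []

def contains(text, phrase):
    return phrase.lower() in text.lower()

def plan_original(phrase):
    # Extract from every document, then filter the abstracts.
    return [(d, a) for d in DOCS for a in extract_abstract(d) if contains(a, phrase)]

def plan_pushed_down(phrase):
    # contains(a, w) and comes-from(a, d) imply contains(d, w), so documents
    # without the phrase cannot yield a qualifying abstract and are skipped.
    candidates = [d for d in DOCS if contains(DOCS[d], phrase)]
    return [(d, a) for d in candidates for a in extract_abstract(d) if contains(a, phrase)]
```

The selection on the abstract is kept in the rewritten plan because the document-level test is only a necessary condition: the phrase could appear in the title rather than the abstract.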

SLIDE 14

Optimization 2: Scoping Extractions

Narrow the text regions that an IE predicate must operate over
  – Exploit location conditions used to prune span pairs

[Figure: the original plan (docs(d), extractTitle(d,t), extractAbstract(d,a), σimmBefore(t,a), σcontains(a, “relevance feedback”)) next to the scoped plan, in which V(a) and sp(a,immBefore(t,a),d,d’) are added and title extraction becomes extractTitle(d’,t).]

Example document:
  Talks: “Feedback in IR” Relevance feedback is important ... “Personalized Search” Customizing rankings with relevance feedback ...
  Papers: “Information Extraction” “Data mining” ...
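One way to picture scoping: once the abstracts are known, a title that satisfies immBefore(t,a) can only lie in the text just before an abstract, so the title extractor needs to scan only those regions. The Python sketch below is an illustration under assumptions and does not reproduce the sp operator from the slide. The extractors are hypothetical regexes, and scope_before assumes a title never spans more than one line.

```python
import re

DOC = ('Papers\n"Information Extraction"\n"Data mining"\n'
       'Talks\n"Feedback in IR"\nRelevance feedback is important ...\n'
       '"Personalized Search"\nCustomizing rankings with relevance feedback ...\n')

def extract_title(text, offset=0):
    """Hypothetical title extractor: every double-quoted string is a candidate title."""
    return [(offset + m.start(), offset + m.end())
            for m in re.finditer(r'"[^"\n]+"', text)]

def extract_abstract(text):
    """Hypothetical abstract extractor: lines starting with a capital and ending in '...'."""
    return [m.span() for m in re.finditer(r'^[A-Z][^\n]*\.\.\.$', text, flags=re.M)]

def imm_before(text, t, a):
    return t[1] <= a[0] and text[t[1]:a[0]].strip() == ""

def scope_before(text, a):
    """Region in which a title with immBefore(t, a) must lie: the line just before a."""
    end = a[0]
    start = text.rfind("\n", 0, max(end - 1, 0)) + 1
    return start, end

def talks_unscoped(text):
    titles = extract_title(text)  # scans the whole document
    return [(t, a) for a in extract_abstract(text) for t in titles
            if imm_before(text, t, a)]

def talks_scoped(text):
    out = []
    for a in extract_abstract(text):
        s, e = scope_before(text, a)
        # The title extractor only sees the narrowed region text[s:e].
        for t in extract_title(text[s:e], offset=s):
            if imm_before(text, t, a):
                out.append((t, a))
    return out
```

On this toy document both functions return the same span pairs, but the scoped version never runs the title extractor over the Papers section.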

SLIDE 15

Optimization 3: Pattern Matching

• IE programs often match many patterns
• Matching all patterns against all documents is expensive
  – Unoptimized DBlife takes 14 hours to match 148,514 name patterns against 10,000 documents daily
• Usually only a few patterns occur in a document
  – Index patterns to consider only promising patterns for each document

Patterns:
  p1 = “Peter\s\s*Haas”
  p2 = “Laura\s\s*Haas”
  p3 = “(Jeff\s|Jeffrey\s)\s*Ullman”

Pattern index:
  “Haas”   → p1, p2
  “Laura”  → p2
  “Peter”  → p1
  “Ullman” → p3

Document: “Homepage of Laura Haas”
Candidate patterns: p1, p2
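The index on this slide maps words to the patterns that need them, so a document is only matched against patterns that share a word with it. A minimal Python sketch follows. The REQUIRED_WORDS table is written by hand here as an assumption, since the slide does not say how these words are derived from the patterns.

```python
import re
from collections import defaultdict

PATTERNS = {
    "p1": r"Peter\s\s*Haas",
    "p2": r"Laura\s\s*Haas",
    "p3": r"(Jeff\s|Jeffrey\s)\s*Ullman",
}

# Words indexed for each pattern (listed by hand for this sketch, following the slide).
REQUIRED_WORDS = {"p1": ["Peter", "Haas"], "p2": ["Laura", "Haas"], "p3": ["Ullman"]}

def build_index(required_words):
    """Inverted index: word -> set of pattern ids that were indexed under it."""
    index = defaultdict(set)
    for pid, words in required_words.items():
        for w in words:
            index[w].add(pid)
    return index

def candidate_patterns(index, document):
    """Union of the index entries for every word that occurs in the document."""
    cands = set()
    for word in re.findall(r"\w+", document):
        cands |= index.get(word, set())
    return cands

def match(index, document):
    """Run the regex engine only on the candidate patterns."""
    return sorted(p for p in candidate_patterns(index, document)
                  if re.search(PATTERNS[p], document))

index = build_index(REQUIRED_WORDS)
```

For “Homepage of Laura Haas” the candidates are p1 and p2, as on the slide, and only p2 actually matches; p3 is never evaluated.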

SLIDE 16

Estimating Plan Cost

• Similar to estimating cost of relational plans with user-defined operators and functions
• But, need to adapt cost model to account for text data
  – Model cost of IE-predicates to account for length of input text spans

[Figure: the talks plan, built from docs(d), extractTitle(d,t), extractAbstract(d,a), σimmBefore(t,a), and σcontains(a, “relevance feedback”).]
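The slide does not give the cost formulas, so the following Python sketch is purely illustrative: it assumes that an IE predicate's runtime grows linearly with the total length of its input spans, which is one simple way to make span length part of the estimate. The class PredicateStats and both functions are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class PredicateStats:
    seconds_per_char: float   # assumed: runtime is linear in input span length
    outputs_per_input: float  # average number of output tuples per input tuple

def ie_cost(stats, n_inputs, avg_span_len):
    """Estimated runtime (seconds) of applying an IE predicate to n_inputs spans."""
    return stats.seconds_per_char * n_inputs * avg_span_len

def ie_output_size(stats, n_inputs):
    """Estimated number of output tuples, used as input size for the next operator."""
    return stats.outputs_per_input * n_inputs

# Usage: the same extractor over whole documents vs. over short scoped regions.
stats = PredicateStats(seconds_per_char=1e-6, outputs_per_input=2.0)
full_cost = ie_cost(stats, n_inputs=1000, avg_span_len=5000)
scoped_cost = ie_cost(stats, n_inputs=1000, avg_span_len=200)
```

A model that ignored span length would assign both plans the same cost, because the number of input tuples is identical.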

SLIDE 17

Finding the Optimal Plan

At start, no statistics about procedural predicates and functions

Adopt reoptimization strategy

  1. Execute default plan for k documents and collect statistics for each procedural predicate and function
     – runtime
     – number of output tuples
     – extracted span lengths
  2. Update cost model with new statistics
  3. Search plan space for the plan with the lowest cost
  4. Finish executing with reoptimized plan
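The four steps above can be sketched as a small driver loop. This Python example is a simplified illustration, not the system's optimizer: it has only two hand-written plans, uses document length as a stand-in for measured runtime, and assumes that a substring check costs 10% of a full extraction per character. All function names are hypothetical.

```python
PHRASE = "relevance feedback"

def sentences(doc):
    return [s.strip() for s in doc.split(".") if s.strip()]

def plan_extract_then_filter(doc, stats):
    """Default plan: extract from every document; also records statistics."""
    stats["docs"] = stats.get("docs", 0) + 1
    stats["chars"] = stats.get("chars", 0) + len(doc)
    hits = [s for s in sentences(doc) if PHRASE in s.lower()]
    stats["docs_with_hits"] = stats.get("docs_with_hits", 0) + bool(hits)
    return hits

def plan_filter_then_extract(doc, stats):
    """Alternative plan: cheap document-level check before extraction."""
    if PHRASE not in doc.lower():
        return []
    return [s for s in sentences(doc) if PHRASE in s.lower()]

PLANS = {
    "extract_then_filter": plan_extract_then_filter,
    "filter_then_extract": plan_filter_then_extract,
}

def estimate_cost(plan_name, stats):
    """Per-document cost estimate from the collected statistics."""
    avg_len = stats["chars"] / stats["docs"]
    selectivity = stats["docs_with_hits"] / stats["docs"]
    if plan_name == "extract_then_filter":
        return avg_len
    # Assumed: the substring check costs 10% of full extraction per character.
    return 0.1 * avg_len + selectivity * avg_len

def run_with_reoptimization(docs, plans, default_plan, k):
    stats, results = {}, []
    # 1. Execute the default plan on the first k documents and collect statistics.
    for doc in docs[:k]:
        results.extend(plans[default_plan](doc, stats))
    # 2-3. Feed the statistics to the cost model and pick the cheapest plan.
    best = min(plans, key=lambda name: estimate_cost(name, stats))
    # 4. Finish the remaining documents with the reoptimized plan.
    for doc in docs[k:]:
        results.extend(plans[best](doc, stats))
    return best, results

docs = ["Relevance feedback is important. More text."] + ["Optimizing queries is important."] * 9
best_plan, found = run_with_reoptimization(docs, PLANS, "extract_then_filter", k=4)
```

With one of the first four documents containing the phrase, the estimated selectivity is 0.25, so the filter-first plan is chosen for the remaining six documents.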
SLIDE 18

Experimental Setup

Data Set       Number of Documents    Size
Homepages      294                    3.2 MB
DBWorld        90                     5.5 KB
Conferences    142                    2.5 MB

IE Programs    Description
confDate       Find (X,Y) where conference X is held during date Y.
confTopic      Find (X,Y) where topic X is discussed at conference Y.
affiliation    Find (X,Y) where person X is affiliated with organization Y.
advise         Find (X,Y) where person X is advising person Y.
chair          Find (X,Y,Z) where person X is a chair of type Y at conference Z.

SLIDE 19

The Need for Optimization

• Optimization reduces runtime significantly by 52-99%

[Figure: five bar charts (confTopic, confDate, affiliation, advise, and chair programs) showing runtime in seconds, Unoptimized vs. Optimized, on the Conferences, DBWorld, and Homepages data sets. Bars that exceed the axis are labeled with their values: 1474, 1078, 1664, 80, 1015, 6240, 1586, 148.]

SLIDE 20

Component Contributions

• Removing one optimization produces an inferior plan in 35 out of 45 cases

[Figure: five bar charts (confTopic, confDate, affiliation, advise, and chair programs) showing runtime in seconds on the Conferences, DBWorld, and Homepages data sets for four variants: Optimized with no pushing down text properties, Optimized with no scoping, Optimized with no optimized match, and Optimized. Bars that exceed the axis are labeled with their values: 261, 891, 258, 146, 1560, 1064.]

SLIDE 21

The Need for Modeling Text Properties

• Accounting for text span length in cost model results in a superior or comparable plan in 12 out of 15 cases
• Optimizing for given data set vs. optimizing for a different data set reduces runtime by 3-78% in 8 out of 15 cases (See paper for more experiments)

Runtime (seconds); data sets: C = Conferences, D = DBWorld, H = Homepages

Program        Data set    Optimized    Optimized w/out modeling text span length
confDate       C           16.8         1056.0
confDate       D           5.6          33.2
confDate       H           45.5         1557.5
confTopic      C           297.7        240.9
confTopic      D           33.2         33.2
confTopic      H           147.0        143.0
affiliation    C           30.7         10.0
affiliation    D           3.7          3.7
affiliation    H           12.6         22.0
advise         C           7.1          9.6
advise         D           0.9          1.0
advise         H           146.0        176.4
chair          C           8.1          8.1
chair          D           3.9          4.5
chair          H           16.1         19.3

SLIDE 22

Related Work

• Much work on IE in the DB, AI, Web, and KDD communities
  – Most works have focused on improving IE accuracy
  – See tutorials in KDD-03 and SIGMOD-06
• Optimizing IE
  – Pruning useless documents [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD-06]
  – Incorporating large dictionaries [Chandel, Nagesh, Sarawagi, ICDE-06]
  – Indexing documents for pattern matching [Cho, Rajagopalan, ICDE-02]
  – See tutorial on scalable IE [Agichtein, Sarawagi, KDD-06]
• Datalog extensions
  – Wrappers in LIXTO [Gottlob, Koch, Baumgartner, Herzog, Flesca, PODS-04]
  – Declarative networking [Loo et al., SIGMOD-06]
  – Diagnosis of distributed systems [Abiteboul, Abrams, Haar, Milo, PODS-05]
  – Software analysis [Whaley, Lam, APLAS-05]

SLIDE 23

Conclusion and Future Work

• Datalog with embedded IE predicates provides a natural framework to write IE programs
  – Easier to understand, debug, modify, reuse
  – Can automatically optimize programs based on the data they are run on
• Defined challenges and provided initial solutions
  – Xlog language
  – Three text-centric optimizations
  – Cost-based plan search
• Promising empirical results
  – Tested on deployed system and real-world data
• Future work
  – Richer data models, recursion and negation
  – “Workbench” to let developers easily build scalable IE programs


SLIDE 25

The Need for Data-specific Optimization

Optimizing for given data set vs. optimizing for a different data set reduces runtime by 3-78% in 8 out of 15 cases

[Figure: five bar charts (confTopic, confDate, affiliation, advise, and chair programs) showing runtime in seconds on the Conferences, DBWorld, and Homepages data sets for three plans: Optimized for Conferences, Optimized for DBWorld, and Optimized for Homepages. Bars that exceed the axis are labeled with their values: 982, 167, 215, 146.]