Querying Probabilistic Information Extraction Daisy Zhe Wang, - - PowerPoint PPT Presentation


Querying Probabilistic Information Extraction. Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein. VLDB, Singapore, 16th September, 2010.


slide-1
SLIDE 1

Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein

VLDB, Singapore, 16th September, 2010

Querying Probabilistic Information Extraction

slide-2
SLIDE 2

Outline

  • Information Extraction Systems
      • Information Extraction (IE)
      • “Extract-then-Query” – Standard IE System
      • “Query-Time-Extraction” – BayesStore IE System
  • Primer on CRF
  • Query-Driven Extraction
      • Select-over-Top1 Queries
  • Probabilistic SPJ Queries
      • Probabilistic Join Queries
  • Experimental Results
  • Conclusion

slide-3
SLIDE 3

Information Extraction (IE)

Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference Wednesday, where Steve Jobs is expected to announce the birth of new stars in his product galaxy, including (probably) new iPods and (possibly) a successor to Apple TV.

  -- From WIRED August 30, 2010

slide-4
SLIDE 4

Information Extraction (IE)

Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference Wednesday, where Steve Jobs is expected to announce the birth of new stars in his product galaxy, including (probably) new iPods and (possibly) a successor to Apple TV.

  -- From WIRED August 30, 2010

Labels: Person, Company, Product, Event, Other

slide-5
SLIDE 5

“Extract-then-Query” – Standard IE Systems

[Diagram: Text → Information Extraction Systems → Top-1 Entity Extractions → Traditional DBMS; Query in, Answer out]

Problems:

  • 1. Exhaustive extraction for all entities over all incoming documents
  • 2. Loses the uncertainties and probabilities that are inherent in IE

slide-6
SLIDE 6

Exhaustive vs. Query-Driven Extraction Example

Example Query: SELECT persons FROM blog articles WHERE company = “Apple”

  • Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference...
  • The Big Apple lands '14 Super Bowl. Giants co-owner Jonathan Tisch said: “The greatest game will be played on the greatest stage!”…
  • Apple Soufflé recipe by Julia Child: ... Pare, cut up, and stew …

slide-8
SLIDE 8

Exhaustive vs. Query-Driven Extraction Example

Example Query: SELECT persons FROM blog articles WHERE company = “Apple”

  • Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference...
  • The Big Apple lands
  • Apple Soufflé recipes

How to perform fast filtering without full inference?

Challenge: Need to push the condition Label = ‘company’ into inference by deep integration of inference and relational operators.
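A cheap first step toward query-driven extraction is a keyword filter: the evaluation slides later mention an inverted index for this purpose. The sketch below is only an illustration, not the BayesStore implementation. The function names and the fourth document are hypothetical. Note that all three slide documents mention "Apple", so the index alone cannot decide whether the label is "company"; that still needs inference.

```python
from collections import defaultdict

# Documents 1-3 paraphrase the slide; document 4 is a made-up non-matching example.
DOCS = {
    1: "Steve Jobs introduced the iPhone 4's videoconferencing feature. Apple will hold a press conference",
    2: "The Big Apple lands '14 Super Bowl",
    3: "Apple Soufflé recipe by Julia Child",
    4: "Giants co-owner Jonathan Tisch praised the stadium.",
}

def build_inverted_index(docs):
    """Map each lower-cased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for raw in text.split():
            token = raw.strip(".,:;!?\"'").lower()
            if token:
                index[token].add(doc_id)
    return index

def candidate_docs(index, keyword):
    """Documents that mention the keyword; only these need CRF inference."""
    return sorted(index.get(keyword.lower(), set()))

if __name__ == "__main__":
    idx = build_inverted_index(DOCS)
    print(candidate_docs(idx, "Apple"))
```

Inference then runs only on the returned candidates instead of on every incoming document.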

slide-9
SLIDE 9

“Extract-then-Query” – Storing Extractions and Probabilities

[Diagram: Text → Information Extraction Systems → p(Entities) probabilities → Probabilistic DBMS; Query in, p(Answer) out]

  • Still performs exhaustive extraction
  • Does not have the right representations to support IE probabilistic models inside of a PDB [Gupta,VLDB2005]

slide-10
SLIDE 10

“Query-Time-Extraction” – BayesStoreIE

[Diagram labels: Text, Query, Constraints, IE Probabilistic Model + Inf. Engine, Relational Query Engine, Pr[Entities], Pr[Answer], all within BayesStoreIE]

Our Contributions:

  • Deep Integration between Inference and Relational Operators
  • Enable Query-Driven On-line Extraction
  • Enable Probabilistic Queries over IE models

slide-11
SLIDE 11

Outline

  • Information Extraction Systems
      • Information Extraction (IE)
      • “Extract-then-Query” – Standard IE Approach
      • “Query-Time-Extraction” – BayesStore IE Approach
  • Primer on CRF
  • Query-Driven Extraction
      • Select-over-Top1 Queries
  • Probabilistic SPJ Queries
      • Probabilistic Join Queries
  • Experimental Results
  • Conclusion

slide-12
SLIDE 12

Conditional Random Fields (CRF)

CRF Model: X = tokens, Y = labels; each token x_i in the chain has a label y_i.

Text (address string): E.g., “2181 Shattuck North Berkeley CA USA”

Possible Extraction Worlds: [table of candidate labelings not recoverable from the extracted slide]
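To make "possible extraction worlds" concrete: a linear-chain CRF scores every labeling of the token sequence, and the normalized exponentiated scores form a distribution over worlds. The sketch below enumerates all worlds for the address string by brute force. The feature weights in `node_score` and `edge_score` are invented for illustration and are not the model from the talk.

```python
import itertools
import math

LABELS = ["street_num", "street_name", "city", "state", "country"]
TOKENS = ["2181", "Shattuck", "North", "Berkeley", "CA", "USA"]

def node_score(token, label):
    """Hypothetical per-token feature weights (assumption, not from the talk)."""
    if label == "street_num":
        return 2.0 if token.isdigit() else -1.0
    if label == "state":
        return 2.0 if token == "CA" else -1.0
    if label == "country":
        return 2.0 if token == "USA" else -1.0
    if label == "city":
        return 1.0 if token in {"Berkeley", "North"} else 0.0
    return 0.5  # street_name: weak default

def edge_score(prev_label, label):
    """Hypothetical transition weights: reward staying put or moving one field forward."""
    step = LABELS.index(label) - LABELS.index(prev_label)
    return 1.0 if step in (0, 1) else -1.0

def world_score(labels):
    """Unnormalized log-score of one labeling (one extraction world)."""
    score = sum(node_score(t, l) for t, l in zip(TOKENS, labels))
    return score + sum(edge_score(a, b) for a, b in zip(labels, labels[1:]))

def extraction_worlds():
    """All labelings with their probabilities, most likely first."""
    worlds = list(itertools.product(LABELS, repeat=len(TOKENS)))
    weights = [math.exp(world_score(w)) for w in worlds]
    z = sum(weights)  # partition function
    ranked = sorted(zip(worlds, weights), key=lambda pair: -pair[1])
    return [(world, weight / z) for world, weight in ranked]

if __name__ == "__main__":
    for world, prob in extraction_worlds()[:3]:
        print(f"{prob:.3f}", world)
```

Enumeration is exponential in the number of tokens, which is why dynamic programming is used for real inference.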

slide-13
SLIDE 13

Two Query Families

Query Family 1 (SPJ-over-Top1): Queries using only the most-likely extractions

Query Family 2 (Probabilistic SPJ): Queries using probability distributions

slide-14
SLIDE 14

Query Family 1: Select-over-Top1

Example Query:
  Select *
  From Top-1 extractions of document set D
  Where company like “%Apple%”

slide-15
SLIDE 15

Viterbi Top-1 Inference on CRF

CRF Model: X = tokens, Y = labels. Tokens: 2181 Shattuck North Berkeley …

Viterbi Dynamic Programming Algorithm: the V matrix is filled one row (token position) at a time. The slide builds it up row by row; the final V matrix is:

pos   street num   street name   city   state   country
(first row incomplete in the extracted slide: 5 1 1 1)
1     2            15            7      8       7
2     12           24            21     18      17
3     21           32            32     30      26
4     29           40            38     42      35
5     39           47            46     46      50
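The row-by-row fill of the V matrix can be sketched as a standard Viterbi recurrence with additive scores: V[i][y] is the best score of any labeling of the first i+1 tokens that ends in label y. The scoring functions below are hypothetical stand-ins, so the numbers will not match the slide's matrix.

```python
LABELS = ["street_num", "street_name", "city", "state", "country"]
TOKENS = ["2181", "Shattuck", "North", "Berkeley", "CA", "USA"]

def node_score(token, label):
    """Hypothetical per-token feature weights (assumption, not from the talk)."""
    if label == "street_num":
        return 2.0 if token.isdigit() else -1.0
    if label == "state":
        return 2.0 if token == "CA" else -1.0
    if label == "country":
        return 2.0 if token == "USA" else -1.0
    if label == "city":
        return 1.0 if token in {"Berkeley", "North"} else 0.0
    return 0.5  # street_name: weak default

def edge_score(prev_label, label):
    """Hypothetical transition weights: reward staying put or moving one field forward."""
    step = LABELS.index(label) - LABELS.index(prev_label)
    return 1.0 if step in (0, 1) else -1.0

def viterbi(tokens, labels, node_score, edge_score):
    """Return (best_score, best_labeling) for a non-empty token list."""
    V = [{y: node_score(tokens[0], y) for y in labels}]  # one row per position
    back = [{}]                                          # back-pointers per row
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: V[i - 1][p] + edge_score(p, y))
            V[i][y] = V[i - 1][prev] + edge_score(prev, y) + node_score(tokens[i], y)
            back[i][y] = prev
    last = max(labels, key=lambda y: V[-1][y])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):  # follow back-pointers to position 0
        path.append(back[i][path[-1]])
    path.reverse()
    return V[-1][last], path

if __name__ == "__main__":
    print(viterbi(TOKENS, LABELS, node_score, edge_score))
```

The cost is O(n · |labels|²) for n tokens, compared with |labels|ⁿ for enumerating all worlds.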

slide-16
SLIDE 16

Query Family 1: Select-over-Top1 – Viterbi Early-Stopping Algorithm

Example Query:
  Select *
  From Viterbi-Top1 extractions of document set D
  Where company like “%Apple%”

Text: Big Apple lands '14 Super Bowl

V matrix (built up row by row until the algorithm stops):

pos   event   city   company   state   other
(first row incomplete in the extracted slide: 5 1 1 1)
1     2       15     7         8       7
2     12      24     21        18      17

STOP!

Implemented in PostgreSQL using recursive queries and array functions
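One way early stopping can work, offered here as an assumption rather than as the authors' exact rule: while filling the V matrix, track for every label the label that its best partial path assigns to the queried token. Once all surviving partial paths agree on that label, the top-1 label of the token is fixed and the rest of the document need not be processed. The weights and function names below are hypothetical.

```python
LABELS = ["event", "city", "company", "state", "other"]

def node_score(tokens, i, label):
    """Hypothetical feature weights, chosen only to make the example work."""
    token = tokens[i]
    if token == "Apple":
        if label == "company":
            return 2.0
        if label == "city":  # "Big Apple" is a strong cue for a city
            return 3.0 if i > 0 and tokens[i - 1] == "Big" else 0.0
        return 0.0
    if token == "Big":
        return {"city": 1.0, "other": 0.5}.get(label, 0.0)
    return 1.0 if label == "other" else 0.0

def edge_score(prev_label, label):
    """Hypothetical transition weight: small bonus for repeating a label."""
    return 0.5 if prev_label == label else 0.0

def top1_label_early_stop(tokens, t):
    """Return (top-1 label of tokens[t], number of positions processed)."""
    V = {y: node_score(tokens, 0, y) for y in LABELS}
    # anchor[y]: label at position t on the best partial path ending in y
    anchor = {y: (y if t == 0 else None) for y in LABELS}
    for i in range(1, len(tokens)):
        new_V, new_anchor = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: V[p] + edge_score(p, y))
            new_V[y] = V[prev] + edge_score(prev, y) + node_score(tokens, i, y)
            new_anchor[y] = y if i == t else anchor[prev]
        V, anchor = new_V, new_anchor
        if i > t and len(set(anchor.values())) == 1:
            return anchor[LABELS[0]], i + 1  # all survivors agree: stop here
    best = max(LABELS, key=lambda y: V[y])
    return anchor[best], len(tokens)

def company_matches(docs, keyword="Apple"):
    """Ids of documents whose top-1 label for the keyword token is 'company'."""
    hits = []
    for doc_id, tokens in docs.items():
        label, _ = top1_label_early_stop(tokens, tokens.index(keyword))
        if label == "company":
            hits.append(doc_id)
    return hits
```

With these toy weights, "Big Apple lands '14 Super Bowl" is rejected after three of its six positions, which mirrors the three-row matrix on the slide.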

slide-17
SLIDE 17

Query Family 2: Probabilistic Join

Example Query:
  Select Top-1 results
  From extraction distributions of documents in D1, D2
  Where D1.city = D2.city

slide-18
SLIDE 18

Query Family 2: Probabilistic Join

Naïve algorithm: First compute the top-k extractions for both input document sets, then compute the join.

Problem: The k needed to compute the Top-1 results varies for different documents.

Solution: Probabilistic Rank-Join algorithm based on incremental ranked access to the list of possible extractions.

Example Query:
  Select Top-1 results
  From extraction distributions of documents in D1, D2
  Where D1.city = D2.city

slide-19
SLIDE 19

Accessing Ranked List of Extractions – Incremental Viterbi Algorithm

  • A novel variation of the Top-1 Viterbi algorithm, which computes the next highest-probability extraction incrementally and more efficiently

Text: Sacramento Avenue San Francisco CA USA

pos   street num   street name   city   state   country
(first row incomplete in the extracted slide: 5 1 1 1)
1     2            15            7      8       7
2     12           24            21     18      17
3     21           32,           32     30      26
4     29           40            38     42      35
5     39           47            46     46      50

slide-20
SLIDE 20

Accessing Ranked List of Extractions – Incremental Viterbi Algorithm

  • A novel variation of the Top-1 Viterbi algorithm, which computes the next highest-probability extraction incrementally and more efficiently

Text: Sacramento Avenue San Francisco CA USA

pos   street num   street name   city    state   country
(first row incomplete in the extracted slide: 5 1 1 1)
1     2            15,10         7       8       7
2     12           24,18         21      18      17
3     21           32,           32,31   30      26
4     29           40            38      42,38   35
5     39           47            46      46      50,48

3rd highest-probability extraction can be computed by another call…
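The interface this slide describes, "give me the next-best extraction on demand", can be sketched as a Python generator. The version below is not the incremental Viterbi algorithm from the talk: it swaps in a best-first search over label prefixes, guided by an exact backward-Viterbi bound, which yields labelings in non-increasing score order. All weights and names are hypothetical.

```python
import heapq
import itertools

LABELS = ["street_num", "street_name", "city", "state", "country"]
TOKENS = ["Sacramento", "Avenue", "San", "Francisco", "CA", "USA"]

def node_score(token, label):
    """Hypothetical per-token feature weights (assumption, not from the talk)."""
    if label == "street_num":
        return 2.0 if token.isdigit() else -1.0
    if label == "street_name":
        return 2.0 if token == "Avenue" else 0.5
    if label == "city":
        return 1.0 if token in {"Sacramento", "San", "Francisco"} else 0.0
    if label == "state":
        return 2.0 if token == "CA" else -1.0
    return 2.0 if token == "USA" else -1.0  # country

def edge_score(prev_label, label):
    """Hypothetical transition weights: reward staying put or moving one field forward."""
    step = LABELS.index(label) - LABELS.index(prev_label)
    return 1.0 if step in (0, 1) else -1.0

def ranked_extractions(tokens, labels, node_score, edge_score):
    """Lazily yield (score, labeling) pairs, best first, for a non-empty token list."""
    n = len(tokens)
    # best_rest[i][y]: best score of positions i+1..n-1 given label y at position i
    best_rest = [dict.fromkeys(labels, 0.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for y in labels:
            best_rest[i][y] = max(
                edge_score(y, nxt) + node_score(tokens[i + 1], nxt) + best_rest[i + 1][nxt]
                for nxt in labels
            )
    tie = itertools.count()  # unique tie-breaker so the heap never compares prefixes
    heap = []
    for y in labels:
        g = node_score(tokens[0], y)
        heapq.heappush(heap, (-(g + best_rest[0][y]), next(tie), g, (y,)))
    while heap:
        _, _, g, prefix = heapq.heappop(heap)
        if len(prefix) == n:
            yield g, list(prefix)  # complete labeling: next-best extraction
            continue
        i = len(prefix)
        for y in labels:
            g2 = g + edge_score(prefix[-1], y) + node_score(tokens[i], y)
            heapq.heappush(heap, (-(g2 + best_rest[i][y]), next(tie), g2, prefix + (y,)))

if __name__ == "__main__":
    stream = ranked_extractions(TOKENS, LABELS, node_score, edge_score)
    print(next(stream))  # top-1 extraction
    print(next(stream))  # second-best, computed only when asked for
```

Because the generator is lazy, a consumer such as a join operator pays only for as many extractions as it actually pulls.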

slide-21
SLIDE 21

Probabilistic Rank-Join

Rank-join is applied to each pair of “joinable” documents to compute Top-1 join results

Outer Doc_i        Inner Doc_j
key   Ext. p       key   Ext. p
A     .83          D     .77
B     .12          C     .15
C     .02          A     .03

Pointers shown on the slide: O_top, O_bottom, I1_top, I1_bottom, k
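A rank-join over two ranked extraction lists can be sketched as follows. This is a generic threshold-based rank-join, not necessarily the exact algorithm from the talk. It pulls entries alternately from the outer and inner lists and stops once no unseen pair can beat the best join result found so far. Scoring a joined pair by the product of its two probabilities assumes the two documents are independent; the function name is hypothetical.

```python
def top1_rank_join(outer, inner):
    """outer, inner: iterables of (join_key, probability), highest probability first.

    Returns (best, pulled_outer, pulled_inner), where best is (key, joint_prob)
    or None if the lists share no key.
    """
    outer, inner = iter(outer), iter(inner)
    seen_o, seen_i = [], []          # entries pulled so far from each side
    done_o = done_i = False
    best = None
    turn = 0
    while not (done_o and done_i):
        # Alternate sides; fall back to the other side once one is exhausted.
        pull_outer = (turn % 2 == 0 and not done_o) or done_i
        turn += 1
        if pull_outer:
            source, seen, other = outer, seen_o, seen_i
        else:
            source, seen, other = inner, seen_i, seen_o
        item = next(source, None)
        if item is None:
            if pull_outer:
                done_o = True
            else:
                done_i = True
        else:
            seen.append(item)
            key, p = item
            for other_key, other_p in other:
                if other_key == key and (best is None or p * other_p > best[1]):
                    best = (key, p * other_p)
        if best is not None and seen_o and seen_i:
            # Upper bounds on any join result that involves an unseen entry.
            bound_o = 0.0 if done_o else seen_o[-1][1] * seen_i[0][1]
            bound_i = 0.0 if done_i else seen_i[-1][1] * seen_o[0][1]
            if best[1] >= max(bound_o, bound_i):
                break
    return best, len(seen_o), len(seen_i)

if __name__ == "__main__":
    outer_doc = [("A", 0.83), ("B", 0.12), ("C", 0.02)]
    inner_doc = [("D", 0.77), ("C", 0.15), ("A", 0.03)]
    print(top1_rank_join(outer_doc, inner_doc))
```

On the slide's example the best pair joins on key A, and both lists must be read to the end because A sits last in the inner list. When the matching entries rank high on both sides, the loop stops after a few pulls, which is where incremental ranked access pays off.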

slide-22
SLIDE 22

Probabilistic Rank-Join

A set of rank-joins is computed simultaneously for a set of outer documents and a set of inner documents

Outer Doc_1        Inner Doc_1        …   Inner Doc_n
key   Ext. p       key   Ext. p           key   Ext. p
A     .83          D     .77              C     .95
B     .12          C     .15              D     .02
C     .02          B     .03              A     .01

Pointers shown on the slide: O_top, O_bottom, I1_top, I1_bottom, k

slide-23
SLIDE 23

Other Algorithms

  • Probabilistic Selection
  • Probabilistic Projection
  • Query-Driven Join-over-Top1

slide-24
SLIDE 24

Outline

  • Information Extraction Systems
      • Information Extraction (IE)
      • “Extract-then-Query” – Standard IE Approach
      • “Query-Time-Extraction” – BayesStore IE Approach
  • Primer on CRF
  • Query-Driven Extraction
      • Select-over-Top1 Queries
  • Probabilistic SPJ Queries
      • Probabilistic Join Queries
  • Experimental Results
  • Conclusion

slide-25
SLIDE 25

Evaluation 1: [Efficiency Improvement] Exhaustive vs. Query-Driven Extraction with Inverted Index

Select-over-Top1 Queries

slide-26
SLIDE 26

Evaluation 2: [Efficiency Improvement] Query-Driven Extraction Inverted Index vs. Early-Stopping

Take-away: Query-Driven Extraction improves Efficiency.

Select-over-Top1 Queries

slide-27
SLIDE 27

Evaluation 3: [Accuracy Improvement] Probabilistic Join vs. Join-over-Top1

Take-away: Probabilistic SPJ improves accuracy at a computation cost.

A Query Design Space: efficiency vs. accuracy

slide-28
SLIDE 28

Conclusion

  • Querying Probabilistic IE
      • BayesStoreIE framework
      • Deep Integration of Relational and Inference
      • Query-Driven Extraction
      • Probabilistic SPJ Queries
  • Current & Future Work
      • MCMC inference in DB
      • Conditional and Aggregation Queries in IE
      • Optimizer for Inference Operators (cost-accuracy co-optimization)

slide-29
SLIDE 29

Thank you! ... Questions?

BayesStore Project Page:

http://www.cs.berkeley.edu/~daisyw/BayesStore.html