
Querying Probabilistic Information Extraction
Daisy Zhe Wang, Michael J. Franklin, Minos Garofalakis, Joseph M. Hellerstein
VLDB, Singapore, 16th September 2010

Outline  Information Extraction Systems  Information Extraction (IE)  “Extract -then- Query” – Standard IE System  “Query -Time- Extraction” – BayesStore IE System  Primer on CRF  Query-Driven Extraction  Select-over-Top1 Queries  Probabilistic SPJ Queries  Probabilistic Join Queries  Experimental Results  Conclusion 1

Information Extraction (IE)  Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference Wednesday, where Steve Jobs is expected to announce the birth of new stars in his product galaxy, including (probably) new iPods and (possibly) a successor to Apple TV. --- From WIRED August 30, 2010 2

Information Extraction (IE)  Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference Wednesday, where Steve Jobs is expected to announce the birth of new stars in his product galaxy, including (probably) new iPods and (possibly) a successor to Apple TV. --- From WIRED August 30, 2010 Labels: Person Company Product Event Other 3

“Extract-then-Query” – Standard IE Systems
[Diagram: Text -> Information Extraction Systems -> Top-1 Entity Extractions -> Traditional DBMS; Query -> Answer]
Problems:
1. Exhaustive extraction of all entities over all incoming documents
2. Loses the uncertainties and probabilities that are inherent in IE

Exhaustive vs. Query-Driven Extraction Example
Example query: SELECT persons FROM blog articles WHERE company = “Apple”
- Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference...
- The Big Apple lands '14 Super Bowl. Giants co-owner Jonathan Tisch said: “The greatest game will be played on the greatest stage!”…
- Apple Soufflé recipe by Julia Child: ... Pare, cut up, and stew …


Exhaustive vs. Query-Driven Extraction Example
Example query: SELECT persons FROM blog articles WHERE company = “Apple”
- Steve Jobs introduced the iPhone 4's videoconferencing feature FaceTime at WWDC 2010. Apple will hold a press conference...
- The Big Apple lands
- Apple Soufflé recipes
How can documents be filtered quickly without running full inference?
Challenge: the condition Label = ‘company’ must be pushed into inference, which requires deep integration of inference and relational operators.
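The evaluation slides of this talk compare exhaustive extraction against query-driven extraction with an inverted index. As a rough illustration of that first filtering step only, the sketch below builds a keyword index over a few sample documents and returns the ones worth running inference on. The function names (`build_inverted_index`, `candidate_docs`), the tokenization, and the shortened document texts are assumptions made for this example, not part of BayesStore.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "Apple." and "Apple" match.
            index[token.strip(".,:;!?\"'")].add(doc_id)
    return index

def candidate_docs(index, keyword):
    """Return ids of documents that mention the keyword at all."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical, shortened versions of the example documents.
docs = {
    1: "Steve Jobs introduced the iPhone 4 at WWDC 2010. Apple will hold a press conference.",
    2: "The Big Apple lands '14 Super Bowl.",
    3: "Apple Souffle recipe by Julia Child.",
    4: "Giants co-owner Jonathan Tisch said the greatest game will be played.",
}

index = build_inverted_index(docs)
candidates = candidate_docs(index, "Apple")
print(candidates)
```

The index drops document 4 without any inference, but documents 2 and 3 still pass the keyword test even though “Apple” is not a company in either of them. Deciding that requires the label condition to reach the inference step itself, which is the challenge stated on the slide.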

“Extract-then-Query” – Storing Extractions and Probabilities
[Diagram: Text -> Information Extraction Systems -> p(Entities), probabilities -> Probabilistic DBMS; Query -> p(Answer)]
- Still performs exhaustive extraction
- Does not have the right representations to support IE probabilistic models inside the PDB [Gupta, VLDB 2005]

“Query-Time Extraction” – BayesStoreIE
[Diagram: BayesStoreIE engine. Labels: Query, Constraints, Text, IE Model + Inf. (tokens X0 … X3, labels Y0 … Y3), Relational Probabilistic Query Engine, Pr[Entities], Pr[Answer]]
Our contributions:
- Deep integration between inference and relational operators
- Enable query-driven online extraction
- Enable probabilistic queries over IE models

Outline  Information Extraction Systems  Information Extraction (IE)  “Extract -then- Query” – Standard IE Approach  “Query -Time- Extraction” – BayesStore IE Approach  Primer on CRF  Query-Driven Extraction  Select-over-Top1 Queries  Probabilistic SPJ Queries  Probabilistic Join Queries  Experimental Results  Conclusion 10

Conditional Random Fields (CRF)
Text (address string), e.g., “2181 Shattuck North Berkeley CA USA”
CRF model:
  X = tokens: x0 = 2181, x1 = Shattuck, x2 = North, x3 = Berkeley, x4 = CA, x5 = USA
  Y = labels: y0, y1, y2, y3, y4, y5
Possible extraction worlds: …
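For reference, the standard linear-chain CRF assigns a probability to every label sequence, which is what makes each labeling of the address string a "possible extraction world". The slide does not spell the formula out; the feature functions f_k, weights λ_k, and sequence length T below are the usual textbook notation, introduced here as assumptions.

```latex
\[
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\Big( \sum_{t=0}^{T-1} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\Big( \sum_{t=0}^{T-1} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
\]
```

Here x = (x_0, …, x_{T-1}) are the tokens, y = (y_0, …, y_{T-1}) the labels, and y_{-1} is a fixed start symbol. For the six-token address above, T = 6, and the top-1 extraction is the y that maximizes p(y | x).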

Two Query Families
- Query Family 1 (SPJ-over-Top1): queries using only the most likely extractions
- Query Family 2 (Probabilistic SPJ): queries using probability distributions

Query Family 1: Select-over-Top1
Example query: Select * From Top-1 extractions of document set D Where company like “%Apple%”

Viterbi Top-1 Inference on CRF
Viterbi dynamic programming algorithm
CRF model: X = tokens (2181, Shattuck, North, Berkeley, …), Y = labels
Dynamic programming V matrix:

pos  street num  street name  city  state  country
0    5           1            0     1      1
1    2           15           7     8      7
2    12          24           21    18     17
3    21          32           32    30     26
4    29          40           38    42     35
5    39          47           46    46     50
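The V matrix above is filled row by row: each cell holds the best score of any label sequence that ends in that label at that position. The sketch below shows the textbook Viterbi recurrence with additive scores. It is written in Python for readability, whereas the talk states that BayesStore implements it in PostgreSQL; the score tables `emit` and `trans` and the three-label example are made-up values for illustration, not the model behind the slide's numbers.

```python
def viterbi(tokens, labels, emit, trans):
    """Return (best_score, best_path) for a linear-chain model.

    emit[t][y]  : score of label y at position t
    trans[p][y] : score of moving from label p to label y
    """
    # V[t][y] = best score of any labeling of tokens[0..t] ending in y.
    V = [{y: emit[0][y] for y in labels}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: V[t - 1][p] + trans[p][y])
            V[t][y] = V[t - 1][prev] + trans[prev][y] + emit[t][y]
            back[t][y] = prev  # remember which predecessor gave the max
    # Backtrack from the best final label.
    last = max(labels, key=lambda y: V[-1][y])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return V[-1][last], path

# Hypothetical scores for a three-token address fragment.
tokens = ["2181", "Shattuck", "Berkeley"]
labels = ["num", "name", "city"]
emit = [
    {"num": 5, "name": 1, "city": 0},
    {"num": 0, "name": 4, "city": 2},
    {"num": 0, "name": 1, "city": 5},
]
trans = {
    "num": {"num": 0, "name": 2, "city": 0},
    "name": {"num": 0, "name": 1, "city": 2},
    "city": {"num": 0, "name": 0, "city": 1},
}

score, path = viterbi(tokens, labels, emit, trans)
print(score, path)
```

With these made-up scores the best labeling is num, name, city with a total score of 18. The cost is proportional to the number of tokens times the square of the number of labels, which is why skipping rows of the matrix matters for query-driven extraction.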

Query Family 1: Select-over-Top1 – Viterbi Early-Stopping Algorithm
Example query: Select * From Viterbi-Top1 extractions of document set D Where company like “%Apple%”
Tokens: Big, Apple, lands, '14, Super, Bowl

pos  event  city  company  state  other
0    5      1     0        1      1
1    2      15    7        8      7
2    12     24    21       18     17
STOP! (the remaining rows are not computed)

Implemented in PostgreSQL using recursive queries and array functions


Query Family 2: Probabilistic Join
Example query: Select Top-1 results From extraction distributions of documents in D1, D2 Where D1.city = D2.city
Naïve algorithm: first compute the top-k extractions for both input document sets, then compute the join.
Problem: the k needed to compute the Top-1 results varies from document to document.
Solution: a probabilistic rank-join algorithm based on incremental ranked access to the list of possible extractions.

Accessing Ranked List of Extractions – Incremental Viterbi Algorithm
A novel variation of the Top-1 Viterbi algorithm, which computes the next highest-probability extraction incrementally and more efficiently.
Tokens: Sacramento, Avenue, San, Francisco, CA, USA

pos  street num  street name  city  state  country
0    5           1            0     1      1
1    2           15           7     8      7
2    12          24           21    18     17
3    21          32,          32    30     26
4    29          40           38    42     35
5    39          47           46    46     50

Accessing Ranked List of Extractions – Incremental Viterbi Algorithm
A novel variation of the Top-1 Viterbi algorithm, which computes the next highest-probability extraction incrementally and more efficiently.
Tokens: Sacramento, Avenue, San, Francisco, CA, USA

pos  street num  street name  city   state  country
0    5           1            0      1      1
1    2           15,10        7      8      7
2    12          24,18        21     18     17
3    21          32,          32,31  30     26
4    29          40           38     42,38  35
5    39          47           46     46     50,48

The 3rd highest-probability extraction can be computed by another call…

Probabilistic Rank-Join
Rank-join is applied to each pair of “joinable” documents to compute the Top-1 join results.
Table columns: key, Ext., p

Outer Doc_i              Inner Doc_j
Ext.  p                  Ext.  p
A     .83  (O_top)       D     .77  (I1_top)
B     .12                C     .15
C     .02  (O_bottom)    A     .03  (I1_bottom)

Probabilistic Rank-Join
A set of rank-joins is computed simultaneously for a set of outer documents and a set of inner documents.
Table columns: key, Ext., p

Outer Doc_1              Inner Doc_1
Ext.  p                  Ext.  p
A     .83  (O_top)       D     .77  (I1_top)
B     .12                C     .15
C     .02  (O_bottom)    B     .03  (I1_bottom)

………                      Inner Doc_n
                         Ext.  p
                         C     .95
                         D     .02
                         A     .01
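To make the idea of ranked access concrete, here is a minimal sketch of a top-1 rank-join over one outer and one inner list. It is a generic threshold-style rank-join, not necessarily the algorithm from the talk. It assumes each list is sorted by probability in descending order, that each extraction is reduced to a single join value, that values are distinct within a list, and that a join result is scored by the product of the two probabilities. The function name `rank_join_top1` is hypothetical.

```python
def rank_join_top1(outer, inner):
    """Top-1 equi-join over two ranked lists of (value, prob) pairs.

    Rows are read alternately from each list. Reading stops as soon as the
    best result found is at least as good as any result that could still
    involve an unread row. Returns (value, score, rows_read).
    """
    seen_o, seen_i = {}, {}
    best_value, best_score = None, 0.0
    io = ii = 0
    while io < len(outer) or ii < len(inner):
        # Read from the outer list when it is not ahead of the inner one.
        if io < len(outer) and (io <= ii or ii >= len(inner)):
            v, p = outer[io]
            io += 1
            seen_o.setdefault(v, p)
            other = seen_i.get(v)
        else:
            v, p = inner[ii]
            ii += 1
            seen_i.setdefault(v, p)
            other = seen_o.get(v)
        if other is not None and p * other > best_score:
            best_value, best_score = v, p * other
        if io >= 1 and ii >= 1:
            # An unread row has probability at most that of the last row read
            # from its list; its partner has at most the other list's top.
            o_rest = outer[io - 1][1] if io < len(outer) else 0.0
            i_rest = inner[ii - 1][1] if ii < len(inner) else 0.0
            threshold = max(o_rest * inner[0][1], outer[0][1] * i_rest)
            if best_score >= threshold:
                break
    return best_value, best_score, io + ii

# The Outer Doc_i / Inner Doc_j lists from the single-pair rank-join slide.
outer = [("A", 0.83), ("B", 0.12), ("C", 0.02)]
inner = [("D", 0.77), ("C", 0.15), ("A", 0.03)]
value, score, rows_read = rank_join_top1(outer, inner)

# A hypothetical pair where the answer is found after two reads.
easy = rank_join_top1([("A", 0.9), ("B", 0.05)],
                      [("A", 0.8), ("C", 0.1), ("D", 0.05)])
print(value, round(score, 4), rows_read, easy[2])
```

On the slide's lists the only match with a high-ranked outer row sits at the bottom of the inner list, so all six rows are read before A is confirmed as the top-1 result. In the second pair the match is found and confirmed after two reads. This difference is the point made on the earlier slide: the depth k needed for a correct Top-1 answer varies per document pair, so a fixed top-k is either wasteful or insufficient.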

Other Algorithms  Probabilistic Selection  Probabilistic Projection  Query-Driven Join-over-Top1 22

Outline  Information Extraction Systems  Information Extraction (IE)  “Extract -then- Query” – Standard IE Approach  “Query -Time- Extraction” – BayesStore IE Approach  Primer on CRF  Query-Driven Extraction  Select-over-Top1 Queries  Probabilistic SPJ Queries  Probabilistic Join Queries  Experimental Results  Conclusion 23

Evaluation 1 [Efficiency Improvement]: Exhaustive vs. Query-Driven Extraction with Inverted Index
Select-over-Top1 queries

Evaluation 2 [Efficiency Improvement]: Query-Driven Extraction, Inverted Index vs. Early-Stopping
Select-over-Top1 queries
Take-away: query-driven extraction improves efficiency.

Evaluation 3 [Accuracy Improvement]: Probabilistic Join vs. Join-over-Top1
Take-away: probabilistic SPJ improves accuracy at a computation cost.
A query design space: efficiency vs. accuracy
