
Integrating Structured Data and Text

A Tagged Document

<DOC>
<DOCNO> WSJ870323-0180 </DOCNO>
<HL> Italy’s Commercial Vehicle Sales </HL>
<DD> 03/23/87 </DD>
<DATELINE> TURIN, Italy </DATELINE>
<TEXT>
Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers. Sales for the Association are expected to rise an additional 2% in July.
</TEXT>
</DOC>

Another Tagged Document

<DOC>
<DOCNO> WSJ870323-0181 </DOCNO>
<HL> Ford Discontinues Taurus SHO Five-Speed Vehicle </HL>
<DD> 01/21/95 </DD>
<DATELINE> Georgia, Atlanta </DATELINE>
<TEXT>
Ford Motor Company announced that beginning in 1996, the Taurus SHO will no longer include a five-speed vehicle.
</TEXT>
</DOC>

Representing Unstructured Text as Relations

DOC (doc_id, doc_name, date, dateline)
DOC_TERM (doc_id, term, tf)
DOC_TERM_PROX (doc_id, term, offset)
IDF (term, idf)
STOP_TERM (term)
QUERY (term, tf)

DOC:

doc_id  doc_name        date     dateline
1       WSJ870323-0180  3/23/87  Turin, Italy
2       WSJ870323-0181  1/21/95  Georgia, Atlanta

DOC_TERM:

doc_id  term         tf
1       commercial   1
1       vehicle      1
1       sales        2
1       italy        1
1       rose         1
1       11.4%        1
1       february     1
1       year         1
1       earlier      1
1       8,848        1
1       according    1
1       provisional  1
1       figures      1
1       italian      1
1       association  2
1       auto         1
1       makers       1
1       expected     1
1       additional   1
1       2%           1
1       July         1
...     ...          ...
2       vehicle      1
...     ...          ...

DOC_TERM_PROX:

doc_id  term         offset
1       commercial   1
1       vehicle      2
1       sales        3
1       italy        4
1       rose         5
1       11.4%        6
1       february     7
1       year         8
1       earlier      9
1       8,848        10
1       according    11
1       provisional  12
1       figures      13
1       italian      14
1       association  15
1       auto         16
1       makers       17
1       sales        18
1       association  19
...     ...          ...
1       July         24
...     ...          ...
2       vehicle      14
...     ...          ...

IDF:

term         idf
11.4%        2.9595
2%           1.5911
8848         4.3936
according    0.7782
additional   1.0792
association  1.0792
auto         1.2788
commercial   1.0000
earlier      0.6021
expected     0.6990
february     1.3222
figures      1.2553
italian      1.8451
italy        1.6721
July         1.0414
makers       1.3010
provisional  2.5172
rose         0.7782
sales        0.6990
vehicle      1.7709
year         0.0000

QUERY:

term     tf
vehicle  1
sales    1

STOP_TERM:

term
a
about
after
all
also
an
and
...
the
...

Assumptions

• DOC_TERM does not contain duplicate terms for the same document.
• QUERY does not contain duplicate terms.

The text preprocessor that creates the relations can remove duplicates that it finds in documents or queries.
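The preprocessing step described above could look roughly like the following Python sketch. The tokenizer (lowercase runs of letters, digits and `%`), the function name `build_relations`, and the choice to number offsets after stop terms are dropped are all assumptions made for illustration; the last one is only inferred from the sample DOC_TERM_PROX rows. The stop list is the abridged one shown on the STOP_TERM slide.

```python
import re
from collections import Counter

# Abridged stop list copied from the STOP_TERM slide; a real list would be longer.
STOP_TERMS = {"a", "about", "after", "all", "also", "an", "and", "the"}

def build_relations(doc_id, text, stop_terms=STOP_TERMS):
    """Return (DOC_TERM rows, DOC_TERM_PROX rows) for one document."""
    # Assumed tokenizer: lowercase, keep runs of letters, digits and '%'.
    tokens = re.findall(r"[a-z0-9%]+", text.lower())
    terms = [t for t in tokens if t not in stop_terms]
    # DOC_TERM_PROX: one row per occurrence; offsets are counted after
    # stop-term removal (an assumption, see the lead-in).
    prox_rows = [(doc_id, term, pos) for pos, term in enumerate(terms, start=1)]
    # DOC_TERM: one row per distinct term, so repeats collapse into tf.
    doc_term_rows = [(doc_id, term, tf) for term, tf in Counter(terms).items()]
    return doc_term_rows, prox_rows

doc_term, prox = build_relations(1, "Commercial-vehicle sales and the sales")
print(doc_term)  # [(1, 'commercial', 1), (1, 'vehicle', 1), (1, 'sales', 2)]
print(prox)      # [(1, 'commercial', 1), (1, 'vehicle', 2), (1, 'sales', 3), (1, 'sales', 4)]
```

Collapsing repeated terms into a single row with a `tf` count is what guarantees the no-duplicates assumption for DOC_TERM; the same routine could be applied to the query text to build QUERY.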

Boolean OR

    SELECT d.doc_id
    FROM DOC_TERM d
    WHERE d.term = input_term1
       OR d.term = input_term2
       OR d.term = input_term3
       ...
       OR d.term = input_termN

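Boolean AND

    SELECT a.doc_id
    FROM DOC_TERM a, DOC_TERM b, ... DOC_TERM n
    WHERE a.term = input_term1
      AND b.term = input_term2
      AND c.term = input_term3
      AND d.term = input_term4
      ...
      AND n.term = input_termN
      AND a.doc_id = b.doc_id
      AND a.doc_id = c.doc_id
      ...
      AND a.doc_id = n.doc_id

The aliases must also be joined on doc_id; otherwise the terms could be matched in different documents.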

Boolean AND

• A better implementation of Boolean AND.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id
    HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM QUERY)

Returns document 1 for our example.
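As a quick check of the QUERY-driven Boolean AND, the sketch below loads a few of the sample rows into an in-memory SQLite database through Python's standard `sqlite3` module and runs the same statement. The helper name `boolean_and` and the added `ORDER BY` are illustrative assumptions; identifiers are written with underscores so that they are valid SQL.

```python
import sqlite3

def boolean_and(doc_term_rows, query_rows):
    """Return the doc_ids that contain every term listed in QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER)")
    conn.execute("CREATE TABLE QUERY (term TEXT, tf INTEGER)")
    conn.executemany("INSERT INTO DOC_TERM VALUES (?, ?, ?)", doc_term_rows)
    conn.executemany("INSERT INTO QUERY VALUES (?, ?)", query_rows)
    rows = conn.execute("""
        SELECT d.doc_id
        FROM DOC_TERM d, QUERY q
        WHERE d.term = q.term
        GROUP BY d.doc_id
        HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM QUERY)
        ORDER BY d.doc_id
    """).fetchall()
    conn.close()
    return [doc_id for (doc_id,) in rows]

# A subset of the sample DOC_TERM rows and the sample QUERY.
DOC_TERM_ROWS = [(1, "commercial", 1), (1, "vehicle", 1), (1, "sales", 2),
                 (1, "italy", 1), (2, "vehicle", 1)]
QUERY_ROWS = [("vehicle", 1), ("sales", 1)]

print(boolean_and(DOC_TERM_ROWS, QUERY_ROWS))  # [1]
```

Unlike the self-join version, the statement does not change as the number of query terms grows; only the contents of QUERY change.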

Boolean OR

• A better implementation of Boolean OR.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id

Threshold AND

• Retrieve documents that contain at least k of the terms specified in the query.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id
    HAVING COUNT(DISTINCT d.term) >= k
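The threshold query can be tried the same way. In this sketch `k` is passed as a bound parameter; the helper name `threshold_and` and the added `ORDER BY` are assumptions, and the rows are a subset of the sample data.

```python
import sqlite3

def threshold_and(doc_term_rows, query_rows, k):
    """Return the doc_ids that contain at least k distinct QUERY terms."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER)")
    conn.execute("CREATE TABLE QUERY (term TEXT, tf INTEGER)")
    conn.executemany("INSERT INTO DOC_TERM VALUES (?, ?, ?)", doc_term_rows)
    conn.executemany("INSERT INTO QUERY VALUES (?, ?)", query_rows)
    rows = conn.execute("""
        SELECT d.doc_id
        FROM DOC_TERM d, QUERY q
        WHERE d.term = q.term
        GROUP BY d.doc_id
        HAVING COUNT(DISTINCT d.term) >= ?
        ORDER BY d.doc_id
    """, (k,)).fetchall()
    conn.close()
    return [doc_id for (doc_id,) in rows]

SAMPLE_DOC_TERM = [(1, "vehicle", 1), (1, "sales", 2), (1, "italy", 1),
                   (2, "vehicle", 1)]
SAMPLE_QUERY = [("vehicle", 1), ("sales", 1)]

print(threshold_and(SAMPLE_DOC_TERM, SAMPLE_QUERY, 1))  # [1, 2]
print(threshold_and(SAMPLE_DOC_TERM, SAMPLE_QUERY, 2))  # [1]
```

With k = 1 the result matches Boolean OR, and with k equal to the number of query terms it matches Boolean AND, so the threshold form covers both as special cases.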

Proximity Searches

• Request for all documents that contain n terms within a term window of size width.
• Use a relation DOC_TERM_PROX with attributes doc_id, term, offset.
• The query insists that not only must all terms found in QUERY be contained in a given document, but at least one occurrence of each term in QUERY must fall within a term window of size width.

Proximity Search using SQL

    SELECT a.doc_id
    FROM DOC_TERM_PROX a, DOC_TERM_PROX b
    WHERE a.term IN (SELECT q.term FROM QUERY q)
      AND b.term IN (SELECT q.term FROM QUERY q)
      AND a.doc_id = b.doc_id
      AND (b.offset - a.offset) BETWEEN 0 AND (width - 1)
    GROUP BY a.doc_id, a.term, a.offset
    HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM QUERY)

Example. Find documents that have the terms vehicle and sales in a window of size 4.

After the first three conditions in the WHERE clause:

a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
1         vehicle  2         1         vehicle  2
1         vehicle  2         1         sales    3
1         vehicle  2         1         sales    18
1         sales    3         1         vehicle  2
1         sales    3         1         sales    3
1         sales    3         1         sales    18
1         sales    18        1         vehicle  2
1         sales    18        1         sales    3
1         sales    18        1         sales    18
2         vehicle  14        2         vehicle  14

The following tuples satisfy all four conditions in the WHERE clause:

a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
1         vehicle  2         1         vehicle  2
1         vehicle  2         1         sales    3
1         sales    3         1         sales    3
1         sales    18        1         sales    18
2         vehicle  14        2         vehicle  14

GROUP BY partitions the result set based on document id, term, and offset.

a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
1         vehicle  2         1         vehicle  2
1         vehicle  2         1         sales    3

1         sales    3         1         sales    3

1         sales    18        1         sales    18

2         vehicle  14        2         vehicle  14

HAVING makes sure that all terms are found in a group. DISTINCT ensures that duplicate terms within the term window are not counted.

Answer: Document 1.
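The walkthrough above can be reproduced with a small SQLite sketch. Two details are assumptions rather than part of the slides: `offset` is double-quoted because OFFSET is an SQL keyword, and `SELECT DISTINCT` is added so that a document is reported once even if several windows qualify. The helper name `proximity_search` is also illustrative.

```python
import sqlite3

def proximity_search(prox_rows, query_rows, width):
    """Return doc_ids in which every QUERY term occurs within one window."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE DOC_TERM_PROX (doc_id INTEGER, term TEXT, "offset" INTEGER)')
    conn.execute("CREATE TABLE QUERY (term TEXT, tf INTEGER)")
    conn.executemany("INSERT INTO DOC_TERM_PROX VALUES (?, ?, ?)", prox_rows)
    conn.executemany("INSERT INTO QUERY VALUES (?, ?)", query_rows)
    rows = conn.execute("""
        SELECT DISTINCT a.doc_id
        FROM DOC_TERM_PROX a, DOC_TERM_PROX b
        WHERE a.term IN (SELECT q.term FROM QUERY q)
          AND b.term IN (SELECT q.term FROM QUERY q)
          AND a.doc_id = b.doc_id
          AND (b."offset" - a."offset") BETWEEN 0 AND (? - 1)
        GROUP BY a.doc_id, a.term, a."offset"
        HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM QUERY)
        ORDER BY a.doc_id
    """, (width,)).fetchall()
    conn.close()
    return [doc_id for (doc_id,) in rows]

# Sample DOC_TERM_PROX rows for the query terms, plus two non-query terms.
PROX_ROWS = [(1, "commercial", 1), (1, "vehicle", 2), (1, "sales", 3),
             (1, "italy", 4), (1, "sales", 18), (2, "vehicle", 14)]
PROX_QUERY = [("vehicle", 1), ("sales", 1)]

print(proximity_search(PROX_ROWS, PROX_QUERY, 4))  # [1]
```

Each group is a window that starts at one occurrence of a query term and extends forward by `width` positions, which is why only the group starting at (1, vehicle, 2) contains both terms.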

Computing Relevance Using Unchanged SQL

• If a large number of documents match the query, we want to retrieve the documents ranked by their relevance to the query.
• How do we compute the relevance?
  – Model each document and the query as a vector of terms. Then rank the documents by a similarity measure between the query vector and each document vector.

$d$ = total number of documents
$n$ = number of distinct terms in the document collection
$tf_{ij}$ = number of occurrences of term $t_j$ in document $D_i$
$df_j$ = number of documents which contain $t_j$
$idf_j = \log_{10}(d / df_j)$

The calculation of the weighting factor ($w$) for a term in a document is defined as a combination of term frequency ($tf$) and inverse document frequency ($idf$). The value of the $j$th entry in the vector corresponding to document $i$ is:

$$D_{ij} = tf_{ij} \times idf_j, \quad 1 \le j \le n$$

Similarly, we compute the vector for the query:

$$Q_j = tf_j \times idf_j, \quad 1 \le j \le n$$
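The weighting can be sketched in a few lines of Python. The function names and the collection size of 1000 are hypothetical; the slides do not state the size of the collection behind the sample IDF values.

```python
import math

def idf(total_docs, doc_freq):
    """Inverse document frequency: log10(d / df_j)."""
    return math.log10(total_docs / doc_freq)

def weight(tf, total_docs, doc_freq):
    """Vector entry for one term: tf * idf."""
    return tf * idf(total_docs, doc_freq)

# Hypothetical collection of 1000 documents.
print(idf(1000, 1000))       # 0.0: a term found in every document
print(idf(1000, 10))         # about 2.0: a term found in 1% of documents
print(weight(2, 1000, 10))   # about 4.0: the same term occurring twice
```

A term that appears in every document gets an idf of zero and so contributes nothing to the ranking, which is what the formula gives for an entry such as `year 0.0000` in the sample IDF relation.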

Similarity Coefficients

A similarity coefficient between a query $Q$ and a document $D_i$ is just the dot product of the corresponding vectors, which is equivalent to the following sum:

$$\mathrm{sim}(Q, D_i) = \sum_{j=1}^{n} Q_j \times D_{ij}$$

SQL for Relevance Ranking

    SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
    FROM QUERY q, DOC_TERM d, IDF i
    WHERE q.term = i.term
      AND d.term = i.term
    GROUP BY d.doc_id
    ORDER BY 2 DESC

Example for Relevance Ranking

Since vehicle has an idf of 1.7709 and sales has an idf of 0.6990, the similarity coefficients for the two documents are computed as follows, where the first product is the contribution of vehicle and the second that of sales:

$$\mathrm{sim}(Q, D_1) = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(2) = 4.113$$

$$\mathrm{sim}(Q, D_2) = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(0) = 3.136$$

RESULT:

doc_id  similarity coefficient
1       4.113
2       3.136
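The example can be checked by running the ranking statement against the sample values in SQLite. This is a sketch: the helper name `rank_documents` is an assumption, and only the rows needed for the two query terms are loaded.

```python
import sqlite3

def rank_documents(doc_term_rows, idf_rows, query_rows):
    """Return (doc_id, similarity) pairs, most relevant first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER)")
    conn.execute("CREATE TABLE IDF (term TEXT, idf REAL)")
    conn.execute("CREATE TABLE QUERY (term TEXT, tf INTEGER)")
    conn.executemany("INSERT INTO DOC_TERM VALUES (?, ?, ?)", doc_term_rows)
    conn.executemany("INSERT INTO IDF VALUES (?, ?)", idf_rows)
    conn.executemany("INSERT INTO QUERY VALUES (?, ?)", query_rows)
    rows = conn.execute("""
        SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
        FROM QUERY q, DOC_TERM d, IDF i
        WHERE q.term = i.term
          AND d.term = i.term
        GROUP BY d.doc_id
        ORDER BY 2 DESC
    """).fetchall()
    conn.close()
    return rows

RANK_DOC_TERM = [(1, "vehicle", 1), (1, "sales", 2), (2, "vehicle", 1)]
RANK_IDF = [("vehicle", 1.7709), ("sales", 0.6990)]
RANK_QUERY = [("vehicle", 1), ("sales", 1)]

for doc_id, sim in rank_documents(RANK_DOC_TERM, RANK_IDF, RANK_QUERY):
    print(doc_id, round(sim, 3))
# 1 4.113
# 2 3.136
```

A document with no occurrence of a query term simply has no matching DOC_TERM row, so that term adds nothing to its sum; this corresponds to the zero factor in the calculation for document 2.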
