SLIDE 1

Integrating Structured Data and Text

SLIDE 2

A Tagged Document

<DOC>
<DOCNO> WSJ870323-0180 </DOCNO>
<HL> Italy’s Commercial Vehicle Sales </HL>
<DD> 03/23/87 </DD>
<DATELINE> TURIN, Italy </DATELINE>
<TEXT>
Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers. Sales for the Association are expected to rise an additional 2% in July.
</TEXT>
</DOC>

SLIDE 3

Another Tagged Document

<DOC>
<DOCNO> WSJ870323-0181 </DOCNO>
<HL> Ford Discontinues Taurus SHO Five-Speed Vehicle </HL>
<DD> 01/21/95 </DD>
<DATELINE> Georgia, Atlanta </DATELINE>
<TEXT>
Ford Motor Company announced that beginning in 1996, the Taurus SHO will no longer include a five-speed vehicle.
</TEXT>
</DOC>

SLIDE 4

Representing Unstructured Text as Relations

    DOC (doc_id, doc_name, date, dateline)
    DOC_TERM (doc_id, term, tf)
    DOC_TERM_PROX (doc_id, term, offset)
    IDF (term, idf)
    STOP_TERM (term)
    QUERY (term, tf)
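As a minimal sketch, the six relations can be declared in SQLite through Python's standard sqlite3 module. The slides give only relation and attribute names, so the column types and primary keys below are assumptions. Identifiers are written with underscores (doc_id, DOC_TERM) so that they are valid SQL names.

```python
import sqlite3

# Column types and keys are assumptions; the slides list only the names.
SCHEMA = """
CREATE TABLE DOC (doc_id INTEGER PRIMARY KEY, doc_name TEXT, date TEXT, dateline TEXT);
CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER, PRIMARY KEY (doc_id, term));
CREATE TABLE DOC_TERM_PROX (doc_id INTEGER, term TEXT, offset INTEGER);
CREATE TABLE IDF (term TEXT PRIMARY KEY, idf REAL);
CREATE TABLE STOP_TERM (term TEXT PRIMARY KEY);
CREATE TABLE QUERY (term TEXT PRIMARY KEY, tf INTEGER);
"""


def create_schema(conn):
    """Create the six relations in an open SQLite connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO DOC VALUES (1, 'WSJ870323-0180', '3/23/87', 'Turin, Italy')")

# List the user tables that now exist.
tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The primary key on (doc_id, term) in DOC_TERM is one way to have the database itself reject duplicate terms for a document.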

SLIDE 5

DOC:

    doc_id  doc_name        date     dateline
    1       WSJ870323-0180  3/23/87  Turin, Italy
    2       WSJ870323-0181  1/21/95  Georgia, Atlanta

SLIDE 6

DOC_TERM:

    doc_id  term         tf
    1       commercial   1
    1       vehicle      1
    1       sales        2
    1       italy        1
    1       rose         1
    1       11.4%        1
    1       february     1
    1       year         1
    1       earlier      1
    1       8,848        1
    1       according    1
    1       provisional  1
    1       figures      1
    1       italian      1
    1       association  2
    1       auto         1
    1       makers       1
    1       expected     1
    1       additional   1
    1       2%           1
    1       July         1
    ...     ...          ...
    2       vehicle      1
    ...     ...          ...

DOC_TERM_PROX:

    doc_id  term         offset
    1       commercial   1
    1       vehicle      2
    1       sales        3
    1       italy        4
    1       rose         5
    1       11.4%        6
    1       february     7
    1       year         8
    1       earlier      9
    1       8,848        10
    1       according    11
    1       provisional  12
    1       figures      13
    1       italian      14
    1       association  15
    1       auto         16
    1       makers       17
    1       sales        18
    1       association  19
    ...     ...          ...
    1       July         24
    ...     ...          ...
    2       vehicle      14
    ...     ...          ...

IDF:

    term         idf
    11.4%        2.9595
    2%           1.5911
    8848         4.3936
    according    0.7782
    additional   1.0792
    association  1.0792
    auto         1.2788
    commercial   1.0000
    earlier      0.6021
    expected     0.6990
    february     1.3222
    figures      1.2553
    italian      1.8451
    italy        1.6721
    July         1.0414
    makers       1.3010
    provisional  2.5172
    rose         0.7782
    sales        0.6990
    vehicle      1.7709
    year         0.0000

SLIDE 7

QUERY:

    term     tf
    vehicle  1
    sales    1

STOP_TERM:

    term
    a
    about
    after
    all
    also
    an
    and
    ...
    the
    ...

SLIDE 8

Assumptions

  • DOC_TERM does not contain duplicate terms for the same document.

  • QUERY does not contain duplicate terms.

The text preprocessor that creates the relations can remove duplicates that it finds in documents or queries.
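A preprocessor of this kind might look like the sketch below. It is an illustration, not the slides' implementation. The `preprocess` function, the letters-only tokenizer, and the short stop list (only the stop terms actually shown on the slides) are all assumptions. Repeated terms collapse into one DOC_TERM row whose tf holds the count, while DOC_TERM_PROX keeps every occurrence with its offset.

```python
import re
from collections import Counter

# Only the stop terms visible on the slide; the elided ones are not guessed.
STOP_TERMS = {"a", "about", "after", "all", "also", "an", "and", "the"}


def preprocess(doc_id, text, stop_terms=STOP_TERMS):
    """Return (DOC_TERM rows, DOC_TERM_PROX rows) for one document.

    Tokenization (lowercased runs of letters) is an assumption; offsets are
    counted after stop terms are dropped, starting at 1.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = [t for t in tokens if t not in stop_terms]
    # One row per distinct term, so DOC_TERM never holds duplicates.
    doc_term = [(doc_id, term, tf) for term, tf in Counter(terms).items()]
    # One row per occurrence, so proximity queries can see every position.
    doc_term_prox = [(doc_id, term, pos) for pos, term in enumerate(terms, start=1)]
    return doc_term, doc_term_prox


doc_term, doc_term_prox = preprocess(1, "The sales and the association sales")
print(doc_term)       # [(1, 'sales', 2), (1, 'association', 1)]
print(doc_term_prox)  # [(1, 'sales', 1), (1, 'association', 2), (1, 'sales', 3)]
```

The same function can be applied to a query string to build QUERY rows without duplicates.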

SLIDE 9

Boolean OR

    SELECT d.doc_id
    FROM DOC_TERM d
    WHERE d.term = input_term1
       OR d.term = input_term2
       OR d.term = input_term3
       OR ...
          d.term = input_termN

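SLIDE 10

Boolean AND

    SELECT a.doc_id
    FROM DOC_TERM a, DOC_TERM b, ... DOC_TERM n
    WHERE a.term = input_term1
      AND b.term = input_term2
      AND c.term = input_term3
      AND d.term = input_term4
      AND ...
          n.term = input_termN
      AND a.doc_id = b.doc_id
      AND a.doc_id = c.doc_id
      AND ...
          a.doc_id = n.doc_id

The conditions on doc_id make every alias refer to the same document. Without them the query would pair terms taken from different documents.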

SLIDE 11

Boolean AND

  • A better implementation of boolean AND.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id
    HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM QUERY)

Returns document 1 for our example.
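The join-based AND can be tried out with a small sketch using Python's standard sqlite3 module. The in-memory database and the handful of rows (only the DOC_TERM rows relevant to the example query) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
CREATE TABLE QUERY (term TEXT, tf INTEGER);
-- Only the rows that matter for the example query are loaded.
INSERT INTO DOC_TERM VALUES (1, 'vehicle', 1), (1, 'sales', 2), (2, 'vehicle', 1);
INSERT INTO QUERY VALUES ('vehicle', 1), ('sales', 1);
""")

# A document qualifies only if it matches as many distinct terms as QUERY holds.
BOOLEAN_AND = """
SELECT d.doc_id
FROM DOC_TERM d, QUERY q
WHERE d.term = q.term
GROUP BY d.doc_id
HAVING COUNT(DISTINCT d.term) = (SELECT COUNT(*) FROM QUERY)
"""

and_docs = [row[0] for row in conn.execute(BOOLEAN_AND)]
print(and_docs)  # [1]
```

Because the query terms live in a relation rather than in the SQL text, the statement stays the same no matter how many terms the query has.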

SLIDE 12

Boolean OR

  • A better implementation of boolean OR.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id

SLIDE 13

Threshold AND

  • Retrieve documents that contain at least k of the specified terms in the query.

    SELECT d.doc_id
    FROM DOC_TERM d, QUERY q
    WHERE d.term = q.term
    GROUP BY d.doc_id
    HAVING COUNT(DISTINCT d.term) >= k
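The threshold query can be sketched with Python's standard sqlite3 module, binding k as a parameter. The helper name `docs_matching_at_least`, the sample rows, and the added ORDER BY (used only to make the output order predictable) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
CREATE TABLE QUERY (term TEXT, tf INTEGER);
INSERT INTO DOC_TERM VALUES (1, 'vehicle', 1), (1, 'sales', 2), (2, 'vehicle', 1);
INSERT INTO QUERY VALUES ('vehicle', 1), ('sales', 1);
""")

THRESHOLD_AND = """
SELECT d.doc_id
FROM DOC_TERM d, QUERY q
WHERE d.term = q.term
GROUP BY d.doc_id
HAVING COUNT(DISTINCT d.term) >= ?
ORDER BY d.doc_id
"""


def docs_matching_at_least(conn, k):
    """Return ids of documents containing at least k distinct query terms."""
    return [row[0] for row in conn.execute(THRESHOLD_AND, (k,))]


print(docs_matching_at_least(conn, 1))  # [1, 2]
print(docs_matching_at_least(conn, 2))  # [1]
```

With k = 1 the result is the same set of documents as the boolean OR, and with k equal to the number of query terms it matches the boolean AND.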

SLIDE 14

Proximity Searches

  • Request for all documents that contain n terms within a term window of size width.

  • Use a relation DOC_TERM_PROX with attributes doc_id, term, offset.

  • The query insists that not only must all terms found in QUERY be contained in a given document, but at least one occurrence of each term in QUERY must fall within a term window of size width.

SLIDE 15

Proximity Search using SQL

    SELECT a.doc_id
    FROM DOC_TERM_PROX a, DOC_TERM_PROX b
    WHERE a.term IN (SELECT q.term FROM QUERY q)
      AND b.term IN (SELECT q.term FROM QUERY q)
      AND a.doc_id = b.doc_id
      AND (b.offset - a.offset) BETWEEN 0 AND (width - 1)
    GROUP BY a.doc_id, a.term, a.offset
    HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM QUERY)

  • Example. Find documents that have the terms vehicle and sales in a window of size 4.
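The proximity query can be run as a sketch with Python's standard sqlite3 module, passing the window width as a bound parameter. The helper `proximity_search` and the reduced set of DOC_TERM_PROX rows are assumptions. A document may qualify through more than one window, so the helper collapses duplicate ids, which the SQL itself does not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC_TERM_PROX (doc_id INTEGER, term TEXT, offset INTEGER);
CREATE TABLE QUERY (term TEXT, tf INTEGER);
-- A reduced set of rows: the query terms plus two unrelated terms.
INSERT INTO DOC_TERM_PROX VALUES
    (1, 'commercial', 1), (1, 'vehicle', 2), (1, 'sales', 3), (1, 'italy', 4),
    (1, 'sales', 18), (2, 'vehicle', 14);
INSERT INTO QUERY VALUES ('vehicle', 1), ('sales', 1);
""")

# Each group is a window that starts at one occurrence (a) of a query term;
# b ranges over query-term occurrences that lie inside that window.
PROXIMITY = """
SELECT a.doc_id
FROM DOC_TERM_PROX a, DOC_TERM_PROX b
WHERE a.term IN (SELECT q.term FROM QUERY q)
  AND b.term IN (SELECT q.term FROM QUERY q)
  AND a.doc_id = b.doc_id
  AND (b.offset - a.offset) BETWEEN 0 AND (? - 1)
GROUP BY a.doc_id, a.term, a.offset
HAVING COUNT(DISTINCT b.term) = (SELECT COUNT(*) FROM QUERY)
"""


def proximity_search(conn, width):
    """Return sorted ids of documents with all query terms inside one window."""
    return sorted({row[0] for row in conn.execute(PROXIMITY, (width,))})


print(proximity_search(conn, 4))  # [1]
print(proximity_search(conn, 1))  # []
```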

SLIDE 16

After the first three conditions in the WHERE clause:

    a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
    1         vehicle  2         1         vehicle  2
    1         vehicle  2         1         sales    3
    1         vehicle  2         1         sales    18
    1         sales    3         1         vehicle  2
    1         sales    3         1         sales    3
    1         sales    3         1         sales    18
    1         sales    18        1         vehicle  2
    1         sales    18        1         sales    3
    1         sales    18        1         sales    18
    2         vehicle  14        2         vehicle  14

SLIDE 17

The following tuples satisfy all four conditions in the WHERE clause:

    a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
    1         vehicle  2         1         vehicle  2
    1         vehicle  2         1         sales    3
    1         sales    3         1         sales    3
    1         sales    18        1         sales    18
    2         vehicle  14        2         vehicle  14

SLIDE 18

GROUP BY partitions the result set based on document id, term, and offset:

    a.doc_id  a.term   a.offset  b.doc_id  b.term   b.offset
    1         vehicle  2         1         vehicle  2
    1         vehicle  2         1         sales    3

    1         sales    3         1         sales    3

    1         sales    18        1         sales    18

    2         vehicle  14        2         vehicle  14

HAVING makes sure that all terms are found in a group. DISTINCT ensures that duplicate terms within the term window are not counted.

Answer: Document 1.

SLIDE 19

Computing Relevance Using Unchanged SQL

  • If a large number of documents match the query, we want to retrieve the documents ranked by their relevance to the query.

  • How do we compute the relevance?

– Model each document and query as a vector of terms. Then use the Cartesian distance between the query and a document to rank the documents.

SLIDE 20

$d$ = total number of documents
$n$ = number of distinct terms in the document collection
$\mathrm{tf}_{ij}$ = number of occurrences of term $t_j$ in document $D_i$
$\mathrm{df}_j$ = number of documents which contain $t_j$
$\mathrm{idf}_j = \log_{10}\left(\frac{d}{\mathrm{df}_j}\right)$

The calculation of the weighting factor (w) for a term in a document is defined as a combination of term frequency (tf) and inverse document frequency (idf). The value of the jth entry in the vector corresponding to document i is:

$$D_{ij} = \mathrm{tf}_{ij} \times \mathrm{idf}_j, \quad 1 \le j \le n$$

Similarly, we compute the vector for the query:

$$Q_j = \mathrm{tf}_j \times \mathrm{idf}_j, \quad 1 \le j \le n$$
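The weighting can be sketched in a few lines of Python. The function names `idf` and `weight` and the collection sizes in the example are hypothetical and chosen only so the arithmetic is easy to follow. They are not taken from the slides' collection.

```python
import math


def idf(d, df_j):
    """Inverse document frequency: log10(d / df_j)."""
    return math.log10(d / df_j)


def weight(tf_ij, d, df_j):
    """Vector entry for a term: term frequency times idf."""
    return tf_ij * idf(d, df_j)


# Hypothetical collection of 1000 documents.
rare = idf(1000, 10)         # term appears in 10 documents
everywhere = idf(1000, 1000)  # term appears in every document
entry = weight(3, 1000, 10)   # term occurs 3 times in the document

print(rare, everywhere, entry)
```

A term that occurs in every document gets an idf of 0, which is why year contributes nothing to a score in the IDF relation above.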

SLIDE 21

Similarity Coefficients

A similarity coefficient between a query Q and a document Di is just the dot product of the corresponding vectors, which is equivalent to the following sum:

$$\mathrm{sim}(Q, D_i) = \sum_{j=1}^{n} Q_j \times D_{ij}$$

SLIDE 22

SQL for Relevance Ranking

    SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
    FROM QUERY q, DOC_TERM d, IDF i
    WHERE q.term = i.term
      AND d.term = i.term
    GROUP BY d.doc_id
    ORDER BY 2 DESC

SLIDE 23

Example for Relevance Ranking

Since vehicle has an idf of 1.7709 and sales has an idf of 0.6990, the similarity coefficient for each document is computed as:

                (vehicle)                   (sales)
    D1 = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(2)
    D2 = (1.7709)(1)(1.7709)(1) + (0.6990)(1)(0.6990)(0)

RESULT:

    doc_id  similarity coefficient
    1       4.113
    2       3.136
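The worked example can be checked with a short sketch using Python's standard sqlite3 module. The in-memory database and the reduced rows are assumptions: only the entries needed for the query, plus one unrelated term, are loaded, with the idf values quoted above. Rounding to three decimals is added only for display.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC_TERM (doc_id INTEGER, term TEXT, tf INTEGER);
CREATE TABLE QUERY (term TEXT, tf INTEGER);
CREATE TABLE IDF (term TEXT, idf REAL);
INSERT INTO DOC_TERM VALUES
    (1, 'vehicle', 1), (1, 'sales', 2), (1, 'italy', 1), (2, 'vehicle', 1);
INSERT INTO QUERY VALUES ('vehicle', 1), ('sales', 1);
INSERT INTO IDF VALUES ('vehicle', 1.7709), ('sales', 0.6990), ('italy', 1.6721);
""")

# Each joined row contributes (query weight) * (document weight) for one term;
# SUM over a document's rows is the dot product of the two vectors.
RANKING = """
SELECT d.doc_id, SUM(q.tf * i.idf * d.tf * i.idf)
FROM QUERY q, DOC_TERM d, IDF i
WHERE q.term = i.term
  AND d.term = i.term
GROUP BY d.doc_id
ORDER BY 2 DESC
"""

ranking = [(doc_id, round(score, 3)) for doc_id, score in conn.execute(RANKING)]
print(ranking)  # [(1, 4.113), (2, 3.136)]
```

Document 2 has no sales row in DOC_TERM, so the join simply produces no row for that term, which has the same effect as the factor of 0 in the hand calculation.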