Extraction and Integration of Web Data by End-Users, Sudhir Agarwal (PowerPoint presentation)



SLIDE 1

Extraction and Integration of Web Data by End-Users

Sudhir Agarwal and Michael Genesereth Stanford Logic Group, Stanford Computer Science Department, Stanford University

SLIDE 2

Introduction

  • End users often need to quickly find and analyze information from many web sites
  • Search engines:
    – (+) good at finding individual documents
    – (-) not good at fulfilling a complex information requirement
  • End users have new ideas as soon as some information is easily available, so Web aggregators only shift the problem but don’t solve it
  • We propose an approach that empowers end-users to
    – easily extract information from web pages,
    – clean, integrate and search extracted information by writing Datalog rules and queries

SLIDE 3

Extraction of Data from the Web

  • While browsing normally, end-users select the information they want to remember
  • Our extraction algorithm automatically creates a table from the user’s selection
    – Input: HTML DOM tree D (representing the selection)
    – D’ ← compress D by replacing parents of lone children by their respective children
    – D’’ ← remove all nodes of D’ that have no text content
    – T ← create a table from D’’ by interpreting nodes in level 1 as rows and those in level 2 as column values
    – Output: T

[Figure: the level-1 subtrees D’’1 … D’’k of D’’, of sizes |D’’1| … |D’’k|, map to a table T with k rows and max(|D’’i|), 1<=i<=k, columns]
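The extraction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class is a hypothetical stand-in for a DOM tree in which text lives only in leaf nodes, "no text content" is read as "no non-blank text anywhere in the subtree", and short rows are padded with empty strings so that every row has the maximum column count.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical simplified DOM node: text lives in leaves only."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def compress(node: Node) -> Node:
    """D -> D': replace each parent of a lone child by that child."""
    children = [compress(child) for child in node.children]
    if len(children) == 1:
        return children[0]
    return Node(node.text, children)


def has_text(node: Node) -> bool:
    """True if the node or any descendant carries non-blank text."""
    return bool(node.text.strip()) or any(has_text(c) for c in node.children)


def prune(node: Node) -> Node:
    """D' -> D'': drop every node whose subtree has no text content."""
    return Node(node.text, [prune(c) for c in node.children if has_text(c)])


def leaf_texts(node: Node) -> List[str]:
    """Collect the text of all leaves below (or at) a node, left to right."""
    if not node.children:
        return [node.text.strip()]
    return [t for child in node.children for t in leaf_texts(child)]


def to_table(root: Node) -> List[List[str]]:
    """D'' -> T: level-1 nodes become rows, level-2 nodes become cells."""
    rows = [[" ".join(leaf_texts(cell)) for cell in (row.children or [row])]
            for row in root.children]
    width = max((len(r) for r in rows), default=0)
    # Pad short rows so that every row has max(|D''i|) columns.
    return [r + [""] * (width - len(r)) for r in rows]


def extract(selection: Node) -> List[List[str]]:
    return to_table(prune(compress(selection)))


def el(*children: Node) -> Node:
    """Helper for building element nodes in the example below."""
    return Node(children=list(children))


# A made-up selection: two rows, the second with an empty second cell.
selection = el(
    el(el(Node("A")), el(el(Node("B")))),   # second cell wrapped in a layout element
    el(el(Node("C")), el(Node(""))),        # second cell has no text
)
table = extract(selection)
print(table)  # [['A', 'B'], ['C', '']]
```

The wrapping layout element around "B" disappears in the compress step, and the empty cell disappears in the prune step, which is why the second row has to be padded when the table is built.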

SLIDE 4

Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman

SLIDE 5

Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman

SLIDE 6

Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)

SLIDE 7

Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)

SLIDE 8

Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)

SLIDE 9

Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)

SLIDE 10

Extraction of text paragraphs (illustrated with the New York Times front page)

SLIDE 11

Extraction of text paragraphs (illustrated with the New York Times front page)

SLIDE 12

Extraction of non-adjacent elements that may even be on different web pages with the help of a clipboard (illustrated with the Amazon web page)

SLIDE 13

Extraction of non-adjacent elements that may even be on different web pages with the help of a clipboard (illustrated with the Amazon web page)

SLIDE 14

Rule-based Cleaning

ct3(A,C,E,F):-t3(A,B) &
    distinct(B,"") &
    matches(B,"[^:]+","0,1",C) &
    matches(B,"[^:]+","1",D) &
    matches(D,"[^.]+","0,1",E) &
    matches(D,"[^.]+","1",F)

[Figure: table t3 before cleaning and the resulting table ct3]
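To make the effect of the cleaning rule concrete, here is a rough Python sketch. The slides do not define the semantics of `matches`, so the reading used here is an assumption: the third argument is taken to be a slice "start[,stop]" over the list of regular-expression matches, so that "0,1" selects the first match and "1" selects everything from the second match on. The helper names and the sample row are made up for illustration.

```python
import re
from typing import List, Optional, Tuple


def matches(text: str, pattern: str, index_range: str) -> Optional[str]:
    """Assumed reading of matches/4: pick regex matches by a slice 'start[,stop]'.

    Returns None when nothing is selected, mirroring a failing rule body.
    """
    parts = [int(p) for p in index_range.split(",")]
    start = parts[0]
    stop = parts[1] if len(parts) > 1 else None
    selected = re.findall(pattern, text)[start:stop]
    return "".join(selected) if selected else None


def clean_t3(t3: List[Tuple[str, str]]) -> List[Tuple[str, str, str, str]]:
    """Derive ct3(A,C,E,F) from t3(A,B) as in the rule above."""
    ct3 = []
    for a, b in t3:
        if b == "":                      # distinct(B, "")
            continue
        c = matches(b, r"[^:]+", "0,1")  # part before the first colon
        d = matches(b, r"[^:]+", "1")    # part after the first colon
        if c is None or d is None:
            continue
        e = matches(d, r"[^.]+", "0,1")  # part of D before the first period
        f = matches(d, r"[^.]+", "1")    # part of D after the first period
        if e is None or f is None:
            continue
        ct3.append((a, c, e, f))
    return ct3


# Made-up input rows: (row id, raw text of the extracted cell).
t3 = [("1", "Ann Lee: Some Title. Some Venue"), ("2", "")]
ct3 = clean_t3(t3)
print(ct3)  # [('1', 'Ann Lee', ' Some Title', ' Some Venue')]
```

Under this reading the rule splits one raw column into three, and rows with an empty raw column are dropped. The leading spaces survive because the rule itself does no trimming.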

SLIDE 15

Rule-based Integration

  • End users write simple Datalog rules to integrate (clean) tables in GAV (global-as-view) fashion
  • Multiple rules with the same head define a view as a union
  • Body can contain multiple tables and views

pubOf("Jeffrey Ullman",A,B,C,D):-ct3(A,B,C,D)
csPubs(A,B,C,D):-ct3(A,B,C,D)
csPubs(A,B,C,D):-ct4(A,B,C,D)

Assume table ct2 contains names of faculty members in column A

faculty(A):-ct2(A,B,C,D,E,F)
facPubs(B,C,D,E):-pubOf(A,B,C,D,E)&faculty(A)
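The integration rules above can be read as set operations over tuples. The following Python sketch evaluates them in that way; it is an illustration only, not how the plugin evaluates Datalog, and the function name and sample tuples are made up.

```python
from typing import Set, Tuple

Row = Tuple[str, ...]


def integrate(ct2: Set[Row], ct3: Set[Row], ct4: Set[Row]):
    """Evaluate the views pubOf, csPubs, faculty and facPubs bottom-up."""
    # pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)
    pub_of = {("Jeffrey Ullman", *row) for row in ct3}
    # Two rules with the same head csPubs: the view is their union.
    cs_pubs = set(ct3) | set(ct4)
    # faculty(A) :- ct2(A,B,C,D,E,F): project ct2 onto its first column.
    faculty = {row[0] for row in ct2}
    # facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A): join on A.
    fac_pubs = {(b, c, d, e) for (a, b, c, d, e) in pub_of if a in faculty}
    return pub_of, cs_pubs, faculty, fac_pubs


# Made-up sample tables with the arities used in the rules.
ct2 = {("Jeffrey Ullman", "-", "-", "-", "-", "-")}
ct3 = {("1", "Ann Lee", "Some Title", "Some Venue")}
ct4 = {("7", "Bob Ray", "Other Title", "Other Venue")}

pub_of, cs_pubs, faculty, fac_pubs = integrate(ct2, ct3, ct4)
print(sorted(cs_pubs))
print(sorted(fac_pubs))
```

The union for `csPubs` corresponds to the two rules sharing a head, and the membership test in `fac_pubs` corresponds to the shared variable A in the body of the `facPubs` rule.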

SLIDE 16

Conclusion

  • We presented an end-user driven web data extraction, integration and search approach
  • The approach is implemented as a browser plugin (please refer to http://seamail.ksri.kit.edu/swb/ for details)
  • Our approach can suggest cleaning and integration rules that could be reused for tables of the same arity and origin (not part of this presentation, please refer to the paper)
  • As a next step we plan to derive reusable extraction scripts from users’ browsing logs and extraction actions

Thank you!

SLIDE 17

Cleaning Extracted Data

  • Extracted data often needs to be cleaned before it can be integrated with other data
  • The simplest way, allowing end users to freely edit the extracted tables, can quickly become very tedious if similar steps have to be repeated for multiple tables
  • We propose rule-based cleaning since cleaning rules are reusable and thus save time

SLIDE 18

[Figure: the level-1 subtrees D’’1 … D’’k of D’’, of sizes |D’’1| … |D’’k|, map to a table T with k rows and max(|D’’i|), 1<=i<=k, columns]