Extraction and Integration of Web Data by End-Users
Sudhir Agarwal and Michael Genesereth
Stanford Logic Group, Stanford Computer Science Department, Stanford University
Introduction
- End users often need to quickly find and analyze information from many web sites
- Search engines:
  – (+) good at finding individual documents
  – (-) not good at fulfilling a complex information requirement
- End users have new ideas as soon as some information is easily available
- Web aggregators only shift but don’t solve the problem
- We propose an approach that empowers end-users to
  – easily extract information from web pages,
  – clean, integrate and search extracted information by writing Datalog rules and queries
Extraction of Data from the Web
- While browsing normally, end-users select the information they want to remember
- Our extraction algorithm automatically creates a table from the user’s selection
  – Input: HTML DOM tree D (representing the selection)
  – D’ ← compress D by replacing parents of lone children by their respective children
  – D’’ ← remove all nodes of D’ that have no text content
  – T ← create a table from D’’ by interpreting the nodes at level 1 as rows and those at level 2 as column values
  – Output: T
[Figure: the level-1 subtrees D’’1 … D’’k become the k rows of T; T has max( |D’’i| ), 1<=i<=k, columns]
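The three steps above can be sketched in Python. This is a minimal illustration and not the plugin's implementation: the `Node` class, the function names, the rule that text lives only in leaf nodes, and the padding of short rows with empty strings are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node; text is kept on leaves only."""
    text: str = ""
    children: list = field(default_factory=list)


def text_content(node):
    """Concatenate the non-blank text of a node and its descendants."""
    parts = [node.text.strip()] + [text_content(c) for c in node.children]
    return " ".join(p for p in parts if p)


def compress(node):
    """D -> D': replace every parent of a lone child by that child."""
    kids = [compress(c) for c in node.children]
    if len(kids) == 1:
        return kids[0]
    return Node(node.text, kids)


def prune(node):
    """D' -> D'': drop nodes whose subtree has no text content."""
    kids = [k for k in (prune(c) for c in node.children) if k is not None]
    if not kids and not node.text.strip():
        return None
    return Node(node.text, kids)


def to_table(root):
    """D'' -> T: level-1 nodes are rows, level-2 nodes are column values."""
    rows = []
    for row in root.children:
        cells = row.children or [row]  # a leaf row yields a single cell
        rows.append([text_content(c) for c in cells])
    width = max((len(r) for r in rows), default=0)
    # Assumption: rows shorter than the widest row are padded with "".
    return [r + [""] * (width - len(r)) for r in rows]


def extract(selection):
    pruned = prune(compress(selection))
    return to_table(pruned) if pruned is not None else []
```

For a selection that wraps a two-row table in an extra container, `extract` drops the wrapper, removes the empty cell, and returns two rows of three columns each.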
Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman
Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)
Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)
Extraction of text paragraphs (illustrated with the New York Times front page)
Extraction of non-adjacent elements that may even be on different web pages with the help of a clipboard (illustrated by the Amazon web page)
Rule-based Cleaning
ct3(A,C,E,F) :- t3(A,B) & distinct(B,"") &
    matches(B,"[^:]+","0,1",C) & matches(B,"[^:]+","1",D) &
    matches(D,"[^.]+","0,1",E) & matches(D,"[^.]+","1",F)
[Figure: table t3 (extracted) and table ct3 (cleaned)]
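The slide does not define `matches`, so the following Python sketch rests on a guess: the third argument is read as a range over the list of regex matches ("0,1" as the first match only, "1" as all matches from the second onward), and the selected matches are concatenated. The function names and the sample row are hypothetical and serve only to show what the rule might compute.

```python
import re


def matches(text, pattern, index_range):
    """Assumed reading of matches: pick regex matches by index range.

    "0,1" is read as found[0:1] and "1" as found[1:]; the chosen matches
    are concatenated. This is an assumption, not the tool's definition.
    """
    found = re.findall(pattern, text)
    parts = index_range.split(",")
    start = int(parts[0])
    end = int(parts[1]) if len(parts) > 1 else None
    return "".join(found[start:end])


def clean_t3(t3):
    """Apply the ct3 rule to a list of (A, B) rows."""
    ct3 = []
    for a, b in t3:
        if b == "":                       # distinct(B, "")
            continue
        c = matches(b, r"[^:]+", "0,1")   # first colon-free run of B
        d = matches(b, r"[^:]+", "1")     # later colon-free runs of B, joined
        e = matches(d, r"[^.]+", "0,1")   # first period-free run of D
        f = matches(d, r"[^.]+", "1")     # later period-free runs of D, joined
        ct3.append((a, c, e, f))
    return ct3


# Made-up row in an "authors: title. venue" layout.
rows = clean_t3([("1", "Ann Lee: A Title. Some Venue 2010"), ("2", "")])
```

Under this reading the rule splits column B at the first colon and the remainder at the first period, and skips rows whose B is empty. Leading spaces are kept because the rule itself does no trimming.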
Rule-based Integration
- End users write simple Datalog rules to integrate (clean) tables in GAV fashion
- Multiple rules with the same head define a view as a union
- Body can contain multiple tables and views

pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)
csPubs(A,B,C,D) :- ct3(A,B,C,D)
csPubs(A,B,C,D) :- ct4(A,B,C,D)

Assume table ct2 contains names of faculty members in column A

faculty(A) :- ct2(A,B,C,D,E,F)
facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A)
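To make the meaning of these rules concrete, here is a minimal Python sketch that evaluates each view over tables held as sets of tuples. The function names and the set-of-tuples representation are assumptions for illustration; the actual system evaluates the Datalog rules directly.

```python
def pub_of(ct3):
    """pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)"""
    return {("Jeffrey Ullman", a, b, c, d) for (a, b, c, d) in ct3}


def cs_pubs(ct3, ct4):
    """Two rules with the same head csPubs: the view is their union."""
    return set(ct3) | set(ct4)


def faculty(ct2):
    """faculty(A) :- ct2(A,B,C,D,E,F): project the first column."""
    return {row[0] for row in ct2}


def fac_pubs(pub_of_rows, faculty_names):
    """facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A): join on A."""
    return {(b, c, d, e)
            for (a, b, c, d, e) in pub_of_rows
            if a in faculty_names}
```

The shared variable A in the body of facPubs acts as the join condition, which the sketch expresses as a membership test on the faculty names.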
Conclusion
- We presented an end-user-driven approach to web data extraction, integration and search
- The approach is implemented as a browser plugin (please refer to http://seamail.ksri.kit.edu/swb/ for details)
- Our approach can suggest cleaning and integration rules that can be reused for tables of the same arity and origin (not part of this presentation; please refer to the paper)
- As a next step, we plan to derive reusable extraction scripts from users’ browsing logs and extraction actions
Thank you!
Cleaning Extracted Data
- Extracted data often needs to be cleaned before it can be integrated with other data
- The simplest approach, letting end users freely edit the extracted tables, quickly becomes tedious when similar steps must be repeated for multiple tables
- We propose rule-based cleaning, since cleaning rules are reusable and thus save time