Extraction and Integration of Web Data by End-Users Sudhir Agarwal - PowerPoint PPT Presentation

Extraction and Integration of Web Data by End-Users Sudhir Agarwal and Michael Genesereth Stanford Logic Group, Stanford Computer Science Department, Stanford University

Introduction
• End users often need to quickly find and analyze information from many web sites
• Search engines:
  – (+) good at finding individual documents
  – (-) not good at fulfilling a complex information requirement
• End users have new ideas as soon as some information is easily available
  → Web aggregators only shift the problem but don’t solve it
• We propose an approach that empowers end users to
  – easily extract information from web pages,
  – clean, integrate and search extracted information by writing Datalog rules and queries

Extraction of Data from the Web
• While browsing normally, end users select the information they want to remember
• Our extraction algorithm automatically creates a table from the user’s selection
  – Input: HTML DOM tree D (representing the selection)
  – D’ ← compress D by replacing parents of lone children by their respective children
  – D’’ ← remove all nodes of D’ that have no text content
  – T ← create a table from D’’ by interpreting the nodes in level 1 as rows and those in level 2 as column values
  – Output: T
(Figure: D’’ with level-1 subtrees D’’1 … D’’k of sizes |D’’1| … |D’’k|; table T with k rows and max( |D’’i| ), 1<=i<=k, columns)
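The three steps above can be sketched in Python. This is a minimal illustration, not the plugin's implementation: the Node class, the function names and the sample tree are all hypothetical, text is assumed to live in leaf nodes, and short rows are assumed to be padded with empty strings up to the widest row.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node: own text plus child nodes."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def text_content(n: Node) -> str:
    """Concatenated text of a node and all of its descendants."""
    return n.text + "".join(text_content(c) for c in n.children)


def compress(n: Node) -> Node:
    """D -> D': replace each parent of a lone child by that child."""
    kids = [compress(c) for c in n.children]
    if len(kids) == 1 and not n.text.strip():
        return kids[0]
    return Node(n.text, kids)


def prune(n: Node) -> Optional[Node]:
    """D' -> D'': drop every node that has no text content."""
    if not text_content(n).strip():
        return None
    kids = [p for p in (prune(c) for c in n.children) if p is not None]
    return Node(n.text, kids)


def extract(d: Node) -> List[List[str]]:
    """D'' -> T: level-1 nodes become rows, level-2 nodes become cells."""
    d2 = prune(compress(d))
    if d2 is None:
        return []
    rows = [
        [text_content(c).strip() for c in r.children] or [text_content(r).strip()]
        for r in (d2.children or [d2])
    ]
    width = max(len(r) for r in rows)  # assumed: pad to the widest row
    return [r + [""] * (width - len(r)) for r in rows]


# Made-up selection: table > tbody > two rows; one cell wraps its text in a
# link-like node, one cell is empty.
D = Node(children=[Node(children=[
    Node(children=[Node(children=[Node("Alice")]), Node("Dept A"), Node("")]),
    Node(children=[Node("Bob"), Node("Dept B"), Node("Databases")]),
])])

table = extract(D)
```

Because empty nodes are removed before the table is built, an empty cell in the middle of a row would shift the following values to the left; the sketch follows the slide's step order and does not try to correct for that.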

Extraction from HTML tables illustrated with the DBLP page of Jeffrey Ullman

Extraction from HTML Tables that have DIV elements for layout (illustrated with the Stanford CS faculty web page)

Extraction from arbitrary HTML elements (illustrated with the MIT CS faculty page)

Extraction of text paragraphs (illustrated with the New York Times front page)

Extraction of non-adjacent elements, which may even be on different web pages, with the help of a clipboard (illustrated with the Amazon web page)

Rule-based Cleaning
ct3(A,C,E,F) :- t3(A,B) & distinct(B,"") &
                matches(B,"[^:]+","0,1",C) & matches(B,"[^:]+","1",D) &
                matches(D,"[^.]+","0,1",E) & matches(D,"[^.]+","1",F)
(Figure: extracted table t3 and the resulting cleaned table ct3)
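What the cleaning rule appears to do can be sketched in Python. The slides do not define the matches predicate, so this sketch rests on a loudly stated assumption: the third argument selects a match of the regular expression by position, with "0,1" read as the first match and "1" as the second. Under that reading the rule splits column B at the first colon and then splits the remainder at the first period. The helper names and the sample row are made up.

```python
import re
from typing import Optional, Tuple


def nth_match(text: str, pattern: str, index: int) -> Optional[str]:
    """Return the index-th regex match, stripped, or None if there is none.

    Assumed reading of matches(Text, Pattern, Selector, Result) in the rule.
    """
    found = re.findall(pattern, text)
    return found[index].strip() if index < len(found) else None


def clean_row(a: str, b: str) -> Optional[Tuple[str, str, str, str]]:
    """Mimic ct3(A,C,E,F) :- t3(A,B) & ... for one row (A, B) of t3."""
    if b == "":                        # distinct(B, "")
        return None
    c = nth_match(b, r"[^:]+", 0)      # part before the first colon
    d = nth_match(b, r"[^:]+", 1)      # part after the first colon
    if c is None or d is None:
        return None
    e = nth_match(d, r"[^.]+", 0)      # part of D before the first period
    f = nth_match(d, r"[^.]+", 1)      # next period-free segment of D
    if e is None or f is None:
        return None
    return (a, c, e, f)


# Made-up row in an "authors: title. venue" shape.
cleaned = clean_row("2010", "A. Author: Some Title. Some Venue")
```

As in the Datalog rule, a row is dropped when any of the conditions fails, for example when column B is empty or contains no colon.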

Rule-based Integration
• End users write simple Datalog rules to integrate (cleaned) tables in GAV fashion
    pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)
• Multiple rules with the same head define a view as a union
    csPubs(A,B,C,D) :- ct3(A,B,C,D)
    csPubs(A,B,C,D) :- ct4(A,B,C,D)
• The body can contain multiple tables and views
  Assume table ct2 contains the names of faculty members in column A
    faculty(A) :- ct2(A,B,C,D,E,F)
    facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A)
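Read as set operations, the integration rules above could be evaluated as in the following Python sketch. It only illustrates the semantics of the rules (a constant in the head, union of rules with the same head, join on a shared variable); it is not how the plugin evaluates Datalog, and all table contents are invented.

```python
# Made-up cleaned tables; each row is a tuple of column values.
ct3 = {("2010", "A. Author", "Title One", "Venue X")}
ct4 = {("2011", "B. Author", "Title Two", "Venue Y")}
ct2 = {("Jeffrey Ullman", "", "", "", "", "")}

# pubOf("Jeffrey Ullman",A,B,C,D) :- ct3(A,B,C,D)
# The constant in the head is prepended to every row of ct3.
pub_of = {("Jeffrey Ullman", *row) for row in ct3}

# csPubs(A,B,C,D) :- ct3(A,B,C,D)
# csPubs(A,B,C,D) :- ct4(A,B,C,D)
# Two rules with the same head define the union of their bodies.
cs_pubs = ct3 | ct4

# faculty(A) :- ct2(A,B,C,D,E,F)
# Projection of ct2 onto its first column.
faculty = {row[0] for row in ct2}

# facPubs(B,C,D,E) :- pubOf(A,B,C,D,E) & faculty(A)
# Join on the shared variable A, then project A away.
fac_pubs = {(b, c, d, e) for (a, b, c, d, e) in pub_of if a in faculty}
```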

Conclusion
• We presented an end-user-driven approach to web data extraction, integration and search
• The approach is implemented as a browser plugin (please refer to http://seamail.ksri.kit.edu/swb/ for details)
• Our approach can suggest cleaning and integration rules that could be reused for tables of the same arity and origin (not part of this presentation, please refer to the paper)
• As a next step we plan to derive reusable extraction scripts from users’ browsing logs and extraction actions
Thank you!

Cleaning Extracted Data
• Extracted data often needs to be cleaned before it can be integrated with other data
• The simplest way, allowing end users to freely edit the extracted tables, can quickly become very tedious if similar steps need to be performed repeatedly for multiple tables
• We propose rule-based cleaning since cleaning rules are reusable and thus save time
