SLIDE 1 Interactive Wrapper Generation with Minimal User Effort
Utku Irmak and Torsten Suel
CIS Department Polytechnic University Brooklyn, NY 11201 uirmak@cis.poly.edu and suel@poly.edu
SLIDE 2 Introduction
Information on WWW is usually unstructured in
nature, and presented via HTML
Not appropriate for (certain types of) automatic processing
Significant amount of embedded structured data
Stock data, product/price data, various statistics, …
Expressed through layout, HTML structure
Wrapper: a software tool and set of rules for
extracting such structured data from web pages
Challenge: different sites, variations within sites
SLIDE 3
An Example: Meta Search Engine
SLIDE 4 An Example: Meta Search Engine
Rank | Title | URL | Snippet
1 | Parallel and Distributed Databases | www.csse.monash... | ... Introduction …
2 | distributed and parallel databases | springerlink.com/app... |
3 | Shared Cache – The Future of Parallel Databases | csdl2.computer.org… | … Shared Cache – The future …
4 | Distributed and Parallel Databases | www.informatik.uni-trier.edu/... | … Distributed and Parallel…
SLIDE 5 Introduction
Extract the relevant data embedded in web
pages and store it in a relational structure for further processing
Specialized software programs called wrappers
Manual wrappers: e.g., Perl scripts …
Due to the shortcomings of manually developing
wrappers, many tools have been proposed for generating wrappers
Semi-automatic (interactive and non-interactive)
Fully-automatic
SLIDE 6
An Example: Meta Search Engine
SLIDE 7 Our Goal in this Work
Design a complete interactive system
for generating wrappers
Developed for industrial application
Overcome common obstacles such as
Missing (multiple) attributes
Visual variations
Minimize user effort
Create robust and reliable wrappers on future pages
SLIDE 8 Related Work
Semi-automatic approaches
WIEN, SoftMealy, STALKER
Active learning techniques are employed by Muslea et al.
Semi-automatic interactive approaches
W4F, XWrap, Lixto
Fully-automatic approaches
IEPAD, RoadRunner, work by Zhai et al.
SLIDE 9 Our Contributions
We describe a new system for semi-automatic wrapper generation based on
an interactive interface
a powerful extraction language
ranking of likely candidate sets
To implement the interface, we describe a framework based on active learning
We propose the use of a category utility function for ranking the tuple sets
We perform a detailed experimental evaluation
SLIDE 10 Framework
[Diagram: User, Training Webpage, Verification Set, Wrapper Generation System]
Input:
- a training webpage
- a number of verification pages
SLIDE 11 Framework
(1) User highlights a tuple
SLIDE 12 Framework
(2) Selected tuple submitted to our system, which generates several wrappers
SLIDE 13 Framework
(3a) System presents user with a candidate tuple set
SLIDE 14 Framework
(3b) System presents user with another candidate tuple set
SLIDE 15 Framework
(3c) System presents user with another candidate tuple set
SLIDE 16 Framework
(4) User selects one of the proposed candidate tuple sets
SLIDE 17 Framework
(5) System refines wrapper and tests it on verification set
SLIDE 18 Framework
(6) System finds one page where the wrapper “disagrees”
SLIDE 19 Framework
(7a) System presents user with a candidate tuple set on this page in verification set
SLIDE 20 Framework
(7b) System presents user with another candidate tuple set on page in verification set
SLIDE 21 Framework
(8) User selects one of the proposed candidate tuple sets
SLIDE 22 Framework
(9) System outputs final wrapper
SLIDE 23
Definition: Wrapper
A wrapper is a set of extraction rules
that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)
The extraction rules within a wrapper
may disagree on not yet encountered web pages
In this case, a wrapper can be refined by
removing some of the extraction rules
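The definition above can be illustrated with a minimal Python sketch. It assumes (this is not from the slides) that an extraction rule can be modeled as a function from page text to a set of tuples; the names Rule, agrees_on, refine and the two toy rules are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

Row = Tuple[str, ...]                    # one extracted tuple
Rule = Callable[[str], FrozenSet[Row]]   # assumption: page text -> set of tuples


def agrees_on(rules: List[Rule], page: str) -> bool:
    """True if every rule extracts exactly the same tuple set from the page."""
    return len({rule(page) for rule in rules}) <= 1


def refine(rules: List[Rule], page: str, correct: FrozenSet[Row]) -> List[Rule]:
    """Remove the rules whose output on the page differs from the confirmed set."""
    return [rule for rule in rules if rule(page) == correct]


# Two toy rules: identical on pages without blank lines, different otherwise.
def by_lines(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.split("\n"))


def by_nonempty_lines(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.split("\n") if line)


wrapper = [by_lines, by_nonempty_lines]
seen_page = "a\nb"      # both rules agree here, so they form one wrapper
new_page = "a\n\nb"     # a not yet encountered page where they disagree
```

On seen_page the two rules agree; on new_page they do not, and refine keeps only the rule that matches the tuple set the user confirms.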
SLIDE 24 Summary of Interaction Steps:
User highlights a tuple on training page
This allows system to generate a number of wrappers that capture different candidate tuple sets
System presents candidate tuple sets on the
training page to user, in order of “plausibility”
User selects the correct tuple set
System tests resulting wrapper on verification
set to find any “disagreements”
For any disagreement, user selects the correct
set from a ranked list of choices
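As a rough sketch of the second step, the candidate rules can be grouped by the tuple set they extract on the training page, keeping only rules that capture the highlighted tuple. This is an illustration under assumptions, not the system's actual algorithm: rules are modeled as functions from page text to tuple sets, and group_into_wrappers and the toy rules are hypothetical names.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Tuple[str, ...]
Rule = Callable[[str], FrozenSet[Row]]   # assumption: page text -> set of tuples


def group_into_wrappers(
    rules: List[Rule], training_page: str, highlighted: Row
) -> Dict[FrozenSet[Row], List[Rule]]:
    """Group rules by candidate tuple set; drop rules missing the highlighted tuple."""
    groups: Dict[FrozenSet[Row], List[Rule]] = defaultdict(list)
    for rule in rules:
        extracted = rule(training_page)
        if highlighted in extracted:
            groups[extracted].append(rule)
    return dict(groups)


# Toy rules over a "name:value" page format (purely illustrative).
def colon_lines(page: str) -> FrozenSet[Row]:
    return frozenset(tuple(l.split(":")) for l in page.split("\n") if ":" in l)


def two_field_lines(page: str) -> FrozenSet[Row]:
    parts = (l.split(":") for l in page.split("\n"))
    return frozenset(tuple(p) for p in parts if len(p) == 2)


def first_line(page: str) -> FrozenSet[Row]:
    return frozenset({tuple(page.split("\n")[0].split(":"))})


def last_line(page: str) -> FrozenSet[Row]:
    return frozenset({tuple(page.split("\n")[-1].split(":"))})


training_page = "x:1\ny:2\nnote"
candidates = group_into_wrappers(
    [colon_lines, two_field_lines, first_line, last_line],
    training_page,
    highlighted=("x", "1"),
)
```

Each key of candidates is one candidate tuple set that would be shown to the user, and each value is the wrapper (set of agreeing rules) behind it.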
SLIDE 25 A Real Example: half.ebay.com
Extract tuples with attributes:
Price, Total Price, Shipping, Seller
Only extract those tuples that:
Are listed in “Like New Items” and
Whose sellers are awarded a Red Star
SLIDE 26
A Real Example: half.ebay.com
SLIDE 27 A Real Example: half.ebay.com
Training page:
SLIDE 28
Observations:
There can be a lot of unexpected cases
and variations on real websites
A powerful language is needed to specify
extraction rules
Simple extraction followed by SQL
filtering conditions will often not work
The final wrapper may still contain many
extraction rules and may disagree on webpages encountered in the future
SLIDE 29 User Effort:
(0) Cost of defining table structure: number of attributes, their names, maybe types
(1) Cost of highlighting one (or maybe two) tuples on training pages
(2) Cost of one or more selections from a ranked list of candidate tuple sets
SLIDE 30 To Implement We Need:
(0) User interface based on browser extensions
(1) Powerful extraction language
(2) Algorithms for generating extraction rules and grouping them into wrappers
(3) Techniques for ranking wrappers in terms
SLIDE 31
System Architecture Overview
SLIDE 32
Document Representation
SLIDE 33 Extraction Language Overview
Based on DOM-tree with auxiliary properties
Extraction patterns consist of a sequence of expressions on the path from root to a tuple attribute
Each expression consists of conjunctions and disjunctions of predicates
If a node at depth i
Satisfies its expression: Accept
Otherwise: Reject
Only children of accepted nodes are checked further for the expression defined at depth i+1
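The level-by-level evaluation described above can be sketched in Python as follows. This is a simplified illustration, not the system's extraction language: the Node class, apply_pattern, and the predicate helpers tag_is and text_starts_with are hypothetical stand-ins, and all_of / any_of model conjunctions and disjunctions of predicates.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    tag: str                                   # "#text" marks a text node
    text: str = ""
    children: List["Node"] = field(default_factory=list)


Expression = Callable[[Node], bool]


def tag_is(name: str) -> Expression:
    return lambda n: n.tag == name


def text_starts_with(prefix: str) -> Expression:
    return lambda n: n.text.startswith(prefix)


def all_of(*preds: Expression) -> Expression:  # conjunction of predicates
    return lambda n: all(p(n) for p in preds)


def any_of(*preds: Expression) -> Expression:  # disjunction of predicates
    return lambda n: any(p(n) for p in preds)


def apply_pattern(root: Node, pattern: List[Expression]) -> List[Node]:
    """Return nodes accepted at the last depth; only children of accepted nodes are checked."""
    if not pattern or not pattern[0](root):
        return []
    accepted = [root]
    for expression in pattern[1:]:
        accepted = [
            child
            for node in accepted
            for child in node.children
            if expression(child)
        ]
    return accepted


page = Node("table", children=[
    Node("tr", children=[
        Node("td", children=[Node("#text", "$5.00")]),
        Node("th", children=[Node("#text", "Price")]),
    ]),
    Node("div", children=[Node("td", children=[Node("#text", "$9.99")])]),
])
pattern = [
    tag_is("table"),
    tag_is("tr"),
    any_of(tag_is("td"), tag_is("th")),
    all_of(tag_is("#text"), text_starts_with("$")),
]
matches = [n.text for n in apply_pattern(page, pattern)]
```

In this toy page the text "$9.99" is never examined, because its ancestor div is rejected at depth 1.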
SLIDE 34 Predicates in the Extraction Language
Element Nodes
tagName
tagAttr
tagAttrArray
elementSiblingPosition
tagPstn
…
Text Nodes
textNode
textSiblingPosition
syntax
leftTextNode
leftElementNode
…
SLIDE 35
The Wrapper Structure
SLIDE 36 Wrapper Generation Algorithm
Creating dom_path and LCA objects
Creating patterns that extract tuple attributes
Creating initial wrappers
Generating the tuple validation rules and new wrappers
Combining the wrappers
Ranking the tuple sets
Getting confirmation from the user
Testing the wrapper on the verification set
SLIDE 37 Ranking the Tuple Sets
We adopt the concept of category utility:
Maximize inter-cluster dissimilarity
Maximize intra-cluster similarity
Dom-Path, specific value, missing attributes, indexing, content specification
1) The weight of attribute A
2) The probability that an item has value v for attribute A, given it belongs to cluster C
3) The probability that an item belongs to cluster C, given it has value v for attribute A
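The exact formula is not recoverable from the slide text, so the following Python sketch rests on an assumption: it combines the three listed terms as a weighted sum of products, score = sum over clusters C, attributes A and values v of w_A * P(A=v | C) * P(C | A=v). The function name category_utility, the feature name dom_path and the toy data are hypothetical.

```python
from collections import Counter
from typing import Dict, List

Item = Dict[str, str]   # attribute name -> value for one candidate item


def category_utility(clusters: List[List[Item]], weights: Dict[str, float]) -> float:
    """Assumed form: sum of w_A * P(A=v | C) * P(C | A=v) over clusters, attributes, values."""
    all_items = [item for cluster in clusters for item in cluster]
    score = 0.0
    for attr, weight in weights.items():
        overall = Counter(item.get(attr) for item in all_items)
        for cluster in clusters:
            in_cluster = Counter(item.get(attr) for item in cluster)
            for value, count in in_cluster.items():
                p_value_given_cluster = count / len(cluster)
                p_cluster_given_value = count / overall[value]
                score += weight * p_value_given_cluster * p_cluster_given_value
    return score


weights = {"dom_path": 1.0}

# Candidate 1: tuples and non-tuples are cleanly separated by DOM path.
clean = [
    [{"dom_path": "table/tr/td"}, {"dom_path": "table/tr/td"}],   # tuples
    [{"dom_path": "div/p"}, {"dom_path": "div/p"}],               # non-tuples
]
# Candidate 2: both clusters mix the two DOM paths.
mixed = [
    [{"dom_path": "table/tr/td"}, {"dom_path": "div/p"}],
    [{"dom_path": "table/tr/td"}, {"dom_path": "div/p"}],
]
ranked = sorted([("clean", clean), ("mixed", mixed)],
                key=lambda c: category_utility(c[1], weights), reverse=True)
```

Under this assumed scoring, a candidate whose tuples look alike and differ from the non-tuples ranks ahead of one whose clusters are mixed.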
SLIDE 38
Ranking: Discussion
Note: we are ranking tuple sets and
wrappers
A wrapper is more plausible if the tuples
it extracts are very similar to each other, and if those tuples are very different from the non-tuples
One could also try to rank extraction
patterns, say using MDL
SLIDE 39 Experimental Evaluations
Number of training tuples required by our system and previous works
Results on four previously used data sets from RISE
Okra, BigBook, Internet Address Finder, Quote Server
SLIDE 40 Experimental Evaluations
We chose ten well-known web sites and collected fifty web pages from each:
AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)
SLIDE 41 Experimental Evaluation
Updating Term Weights (effect of adaptive approach):
The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites
SLIDE 42
Summary
An approach to interactive wrapper
generation that combines
Powerful extraction language
Techniques for deriving extraction patterns from user input
A framework using active learning
A ranking technique using a category utility function