Interactive Wrapper Generation with Minimal User Effort Utku Irmak - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Interactive Wrapper Generation with Minimal User Effort

Utku Irmak and Torsten Suel

CIS Department Polytechnic University Brooklyn, NY 11201 uirmak@cis.poly.edu and suel@poly.edu

slide-2
SLIDE 2

Introduction

  • Information on the WWW is usually unstructured in nature and presented via HTML
      • Not appropriate for (certain types of) automatic processing
  • Significant amount of embedded structured data
      • Stock data, product/price data, various statistics, …
      • Expressed through layout, HTML structure
  • Wrapper: a software tool and set of rules for extracting such structured data from web pages
  • Challenge: different sites, variations within sites

slide-3
SLIDE 3

An Example: Meta Search Engine

slide-4
SLIDE 4

An Example: Meta Search Engine

Rank  Title                                            URL                               Snippet
1     Parallel and Distributed Databases               www.csse.monash...                ... Introduction …
2     distributed and parallel databases               springerlink.com/app...
4     Distributed and Parallel Databases               www.informatik.uni-trier.edu/...  … Distributed and Parallel…
3     Shared Cache – The Future of Parallel Databases  csdl2.computer.org…               … Shared Cache – The future …

slide-5
SLIDE 5

Introduction

  • Extracting the relevant data embedded in web pages and storing it in a relational structure for further processing
      • Specialized software programs called wrappers
  • Manual wrappers: e.g., Perl scripts …
  • Due to the shortcomings of manually developing wrappers, many tools have been proposed for generating wrappers
      • Semi-automatic (interactive and non-interactive)
      • Fully-automatic

slide-6
SLIDE 6

An Example: Meta Search Engine

slide-7
SLIDE 7

Our Goal in this Work

  • Design a complete interactive system for generating wrappers
  • Developed for industrial application
  • Overcome common obstacles such as
      • Missing (multiple) attributes
      • Visual variations
  • Minimize user effort
  • Create robust and reliable wrappers on future pages

slide-8
SLIDE 8

Related Work

  • Semi-automatic approaches
      • WIEN, SoftMealy, STALKER
      • Active learning techniques are employed by Muslea et al.
  • Semi-automatic interactive approaches
      • W4F, XWrap, Lixto
  • Fully-automatic approaches
      • IEPAD, RoadRunner, work by Zhai et al.

slide-9
SLIDE 9

Our Contributions

  • We describe a new system for semi-automatic wrapper generation based on
      • an interactive interface
      • a powerful extraction language
      • ranking of likely candidate sets
  • To implement the interface, we describe a framework based on active learning
  • We propose the use of a category utility function for ranking the tuple sets
  • We perform a detailed experimental evaluation

slide-10
SLIDE 10

Framework

[Diagram: User, Training Webpage, Verification Set, Wrapper Generation System]

Input:

  • a training webpage
  • a number of verification pages
slide-11
SLIDE 11

Framework

(1) User highlights a tuple on the training webpage
slide-12
SLIDE 12

Framework

(2) Selected tuple submitted to our system, which generates several wrappers

slide-13
SLIDE 13

Framework

(3a) System presents user with a candidate tuple set

slide-14
SLIDE 14

Framework

(3b) System presents user with another candidate tuple set

slide-15
SLIDE 15

Framework

(3c) System presents user with another candidate tuple set

slide-16
SLIDE 16

Framework

(4) User selects one of the proposed candidate tuple sets

slide-17
SLIDE 17

Framework

(5) System refines wrapper and tests it on verification set

slide-18
SLIDE 18

Framework

(6) System finds one page where the wrapper “disagrees”

slide-19
SLIDE 19

Framework

(7a) System presents user with a candidate tuple set on this page in verification set

slide-20
SLIDE 20

Framework

(7b) System presents user with another candidate tuple set on a page in the verification set
slide-21
SLIDE 21

Framework

(8) User selects one of the proposed candidate tuple sets

slide-22
SLIDE 22

Framework

(9) System outputs final wrapper

slide-23
SLIDE 23

Definition: Wrapper

  • A wrapper is a set of extraction rules that agree on all pages considered thus far (i.e., that extract exactly the same set of tuples on these pages)
  • The extraction rules within a wrapper may disagree on not-yet-encountered web pages
  • In this case, a wrapper can be refined by removing some of the extraction rules
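
The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a page is modeled as a plain string, an extraction rule as a function from a page to a set of tuples, and the names `rules_agree`, `refine`, `all_lines`, and `dollar_lines` are hypothetical, not part of the presented system.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical modeling choices: a page is a string, a rule maps a page to a tuple set.
Row = Tuple[str, ...]
Rule = Callable[[str], FrozenSet[Row]]


def rules_agree(rules: List[Rule], page: str) -> bool:
    """True if every rule extracts exactly the same tuple set from the page."""
    return len({rule(page) for rule in rules}) <= 1


def refine(rules: List[Rule], page: str, correct: FrozenSet[Row]) -> List[Rule]:
    """Drop the rules whose output on the page differs from the confirmed tuple set."""
    return [rule for rule in rules if rule(page) == correct]


# Two toy rules that agree on the page seen so far but not on a new one.
def all_lines(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.splitlines() if line)


def dollar_lines(page: str) -> FrozenSet[Row]:
    return frozenset((line,) for line in page.splitlines() if line.startswith("$"))


wrapper = [all_lines, dollar_lines]
seen_page = "$5\n$7"
new_page = "$5\nSold out"

agree_on_seen = rules_agree(wrapper, seen_page)
agree_on_new = rules_agree(wrapper, new_page)
refined = refine(wrapper, new_page, frozenset({("$5",)}))
```

Here both rules form one wrapper as long as only `seen_page` is known; `new_page` exposes the disagreement, and refinement keeps only `dollar_lines`.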

slide-24
SLIDE 24

Summary of Interaction Steps:

  • User highlights a tuple on the training page
      • This allows the system to generate a number of wrappers that capture different candidate tuple sets
  • System presents candidate tuple sets on the training page to the user, in order of “plausibility”
  • User selects the correct tuple set
  • System tests the resulting wrapper on the verification set to find any “disagreements”
  • For any disagreement, user selects the correct set from a ranked list of choices
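
The interaction steps can be sketched as a simple loop. This is a rough sketch, not the system's actual algorithm: it assumes the candidate rules have already been generated from the highlighted tuple, it models the user as a callback (`choose`), and it uses a placeholder `score` function where the real system would apply its plausibility ranking. All names are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

Row = Tuple[str, ...]
TupleSet = FrozenSet[Row]
Rule = Callable[[str], TupleSet]
Chooser = Callable[[str, List[TupleSet]], TupleSet]  # stands in for the user


def group_by_output(rules: List[Rule], page: str) -> Dict[TupleSet, List[Rule]]:
    """Group rules by the tuple set they extract from the page."""
    groups: Dict[TupleSet, List[Rule]] = {}
    for rule in rules:
        groups.setdefault(rule(page), []).append(rule)
    return groups


def interactive_refine(rules: List[Rule], training_page: str,
                       verification_pages: Sequence[str], choose: Chooser,
                       score: Callable[[TupleSet], float]) -> Tuple[List[Rule], int]:
    """Ask the user only on pages where the remaining rules disagree."""
    questions = 0
    for page in [training_page, *verification_pages]:
        groups = group_by_output(rules, page)
        if len(groups) > 1:  # disagreement: present ranked candidates
            ranked = sorted(groups, key=score, reverse=True)
            rules = groups[choose(page, ranked)]
            questions += 1
    return rules, questions


# Toy rules over line-based "pages" (assumption: one tuple per line).
def all_lines(page: str) -> TupleSet:
    return frozenset((l,) for l in page.splitlines() if l)


def dollar_lines(page: str) -> TupleSet:
    return frozenset((l,) for l in page.splitlines() if l.startswith("$"))


def ends_with_digit(page: str) -> TupleSet:
    return frozenset((l,) for l in page.splitlines() if l and l[-1].isdigit())


def simulated_user(page: str, ranked: List[TupleSet]) -> TupleSet:
    """Pretend the user always wants the lines that start with a dollar sign."""
    truth = dollar_lines(page)
    return next(c for c in ranked if c == truth)


final_rules, questions = interactive_refine(
    [all_lines, dollar_lines, ends_with_digit],
    "$5\n$7\nShipping",
    ["$3\n$4", "$9\nItem 2"],
    simulated_user,
    score=len,  # placeholder ranking, not the category utility function
)
```

In this toy run the user is asked twice: once on the training page and once on the single verification page where the surviving rules disagree.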

slide-25
SLIDE 25

A Real Example: half.ebay.com

  • Extract tuples with attributes:
      • Price, Total Price, Shipping, Seller
  • Only extract those tuples
      • that are listed in “Like New Items” and
      • whose sellers are awarded a Red Star

slide-26
SLIDE 26

A Real Example: half.ebay.com

slide-27
SLIDE 27

A Real Example: half.ebay.com

Training page:

slide-28
SLIDE 28

Observations:

  • There can be a lot of unexpected cases and variations on real websites
  • A powerful language is needed to specify extraction rules
  • Simple extraction followed by SQL filtering conditions will often not work
  • The final wrapper may still contain many extraction rules and may disagree on webpages encountered in the future

slide-29
SLIDE 29

User Effort:

(0) Cost of defining the table structure: number of attributes, their names, maybe types

(1) Cost of highlighting one (or maybe two) tuples on training pages

(2) Cost of one or more selections from a ranked list of candidate tuple sets

slide-30
SLIDE 30

To Implement We Need:

(0) User interface based on browser extensions

(1) Powerful extraction language

(2) Algorithms for generating extraction rules and grouping them into wrappers

(3) Techniques for ranking wrappers in terms of plausibility
slide-31
SLIDE 31

System Architecture Overview

slide-32
SLIDE 32

Document Representation

slide-33
SLIDE 33

Extraction Language Overview

  • Based on DOM tree with auxiliary properties
  • Extraction patterns consist of a sequence of expressions on the path from the root to a tuple attribute
  • Each expression consists of conjunctions and disjunctions of predicates
  • If a node at depth i
      • satisfies its expression: Accept
      • otherwise: Reject
  • Only children of accepted nodes are checked further for the expression defined at depth i+1
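
The level-by-level matching described above might look as follows. This is a minimal sketch under assumptions: the `Node` class is a stand-in for a real DOM node, and the helpers `tag_name` and `tag_attr` only borrow their names from the predicate list in this talk; their exact semantics in the actual extraction language are not given here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Simplified stand-in for a DOM node (assumption, not the system's type)."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    text: str = ""
    children: List["Node"] = field(default_factory=list)


Predicate = Callable[[Node], bool]


def all_of(*preds: Predicate) -> Predicate:
    """Conjunction of predicates."""
    return lambda n: all(p(n) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Disjunction of predicates."""
    return lambda n: any(p(n) for p in preds)


def tag_name(name: str) -> Predicate:
    return lambda n: n.tag == name


def tag_attr(key: str, value: str) -> Predicate:
    return lambda n: n.attrs.get(key) == value


def match(root: Node, pattern: List[Predicate]) -> List[Node]:
    """Apply one expression per depth; only children of accepted nodes go on."""
    frontier = [root] if pattern and pattern[0](root) else []
    for expr in pattern[1:]:
        frontier = [c for n in frontier for c in n.children if expr(c)]
    return frontier


page = Node("html", children=[
    Node("table", children=[
        Node("tr", {"class": "header"}, children=[Node("td", text="Price")]),
        Node("tr", {"class": "item"}, children=[Node("td", text="$5")]),
    ]),
])

pattern = [
    tag_name("html"),
    tag_name("table"),
    all_of(tag_name("tr"), tag_attr("class", "item")),
    any_of(tag_name("td"), tag_name("th")),
]

matched_texts = [n.text for n in match(page, pattern)]
```

The header row is rejected at depth 2, so its cell is never examined at depth 3; only the item row's cell is returned.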

slide-34
SLIDE 34

Predicates in the Extraction Language

  • Element Nodes
      • tagName
      • tagAttr
      • tagAttrArray
      • elementSiblingPosition
      • tagPstn
  • Text Nodes
      • textNode
      • textSiblingPosition
      • syntax
      • leftTextNode
      • leftElementNode

slide-35
SLIDE 35

The Wrapper Structure

slide-36
SLIDE 36

Wrapper Generation Algorithm

  • Creating dom_path and LCA objects
  • Creating patterns that extract tuple attributes
  • Creating initial wrappers
  • Generating the tuple validation rules and new wrappers
  • Combining the wrappers
  • Ranking the tuple sets
  • Getting confirmation from the user
  • Testing the wrapper on the verification set
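
As a small illustration of the first step, the lowest common ancestor (LCA) of the highlighted attribute nodes can be found from their root-to-node paths. The encoding below (a path as a tuple of child indices) is an assumption made for this sketch; the slides do not specify how dom_path objects are represented.

```python
from typing import Sequence, Tuple

# Hypothetical encoding: a DOM path is the sequence of child indices from the root.
DomPath = Tuple[int, ...]


def lowest_common_ancestor(paths: Sequence[DomPath]) -> DomPath:
    """Return the longest common prefix of the given root-to-node paths."""
    if not paths:
        return ()
    prefix = []
    for steps in zip(*paths):
        if len(set(steps)) != 1:  # paths diverge at this depth
            break
        prefix.append(steps[0])
    return tuple(prefix)


# Two attributes of one highlighted tuple, e.g. a price cell and a seller link.
price_path = (0, 1, 2, 0)
seller_path = (0, 1, 2, 3, 0)
tuple_root = lowest_common_ancestor([price_path, seller_path])
```

The node at `tuple_root` is the smallest subtree containing all highlighted attributes, which is a natural place to anchor per-attribute patterns.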

slide-37
SLIDE 37

Ranking the Tuple Sets

We adopt the concept of category utility:

Maximize inter-cluster dissimilarity

Maximize intra-cluster similarity

Dom-Path, specific value, missing attributes, indexing, content specification

1) The weight of attribute A

2) The probability that an item has value v for attribute A, given it belongs to cluster C

3) The probability that an item belongs to cluster C, given it has value v for attribute A
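
The slide's formula did not survive in this transcript, so the following is only one plausible way to combine the three terms listed above: a weighted sum, over attributes and values, of the product of the two conditional probabilities. The function name, the feature names, and the example data are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Item = Dict[str, str]


def cluster_score(cluster: List[Item], others: List[Item],
                  weights: Dict[str, float]) -> float:
    """Sum over attributes A and values v of w_A * P(v | C) * P(C | v).

    Assumed combination of the three terms; not the exact formula from the talk.
    A missing attribute is treated as the value None.
    """
    score = 0.0
    for attr, weight in weights.items():
        in_cluster: Counter = Counter(item.get(attr) for item in cluster)
        overall = in_cluster + Counter(item.get(attr) for item in others)
        for value, count in in_cluster.items():
            p_value_given_cluster = count / len(cluster)
            p_cluster_given_value = count / overall[value]
            score += weight * p_value_given_cluster * p_cluster_given_value
    return score


weights = {"path": 1.0, "syntax": 1.0}
price_a = {"path": "tr/td", "syntax": "price"}
price_b = {"path": "tr/td", "syntax": "price"}
label = {"path": "tr/td", "syntax": "text"}
banner = {"path": "div/p", "syntax": "text"}

# Candidate 1 groups the two price cells; candidate 2 mixes a price with a banner.
coherent = cluster_score([price_a, price_b], [label, banner], weights)
mixed = cluster_score([price_a, banner], [price_b, label], weights)
```

Under this scoring, the candidate whose members share values that are rare outside the cluster ranks higher, which matches the intuition of similar tuples that differ from the non-tuples.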


slide-38
SLIDE 38

Ranking: Discussion

  • Note: we are ranking tuple sets and wrappers
  • A wrapper is more plausible if the tuples it extracts are very similar to each other, and if those tuples are very different from the non-tuples
  • One could also try to rank extraction patterns, say using MDL

slide-39
SLIDE 39

Experimental Evaluations

Number of training tuples required by our system and previous works

Results on four previously used data sets from RISE

Okra, BigBook, Internet Address Finder, Quote Server

slide-40
SLIDE 40

Experimental Evaluations

We chose ten well-known web sites and collected fifty web pages from each:

AltaVista, CNN, Google, Hotjobs, IMDb, YMB (Yahoo! Message Board), MSN Q (MSN Money - Quotes), Weather, Art, and BN (Barnes & Noble)

slide-41
SLIDE 41

Experimental Evaluation

Updating Term Weights (effect of adaptive approach):

The effect of pregenerating wrappers for the same extraction scenario on Art and BN websites

slide-42
SLIDE 42

Summary

  • An approach to interactive wrapper generation that combines
      • Powerful extraction language
      • Techniques for deriving extraction patterns from user input
      • A framework using active learning
      • A ranking technique using a category utility function