Automatic Wrapper Generation and Data Extraction
Kristina Lerman
University of Southern California
August 25, 2010

Automatic Data Extraction
• Data extraction with wrappers
  • User specifies the schema of the information source
    • Single tuple
    • Nested type
    • List of tuples or nested types
  • User labels data on several example HTML pages
    • Tedious, especially for lists
• Goal of Automatic Data Extraction is to eliminate user intervention
  • Automatically generate wrappers to extract data from Web pages
  • cf. Information Extraction: automatically extracts data from text

Overview
• Methods for automatic wrapper generation and data extraction
  • Grammar Induction approach
    • Towards Automatic Data Extraction from Large Web Sites
  • Website structure-based approach
    • AutoFeed: An Unsupervised Learning System for Generating Webfeeds
    • [Optional] Using the Structure of Web Sites for Automatic Segmentation of Tables

Grammar Induction Approach
• Pages are automatically generated by scripts that encode the results of a database query into HTML
  • Script = grammar
• Given a set of pages generated by the same script
  • Learn the grammar of the pages
    • Wrapper induction step
  • Use the grammar to parse the pages
    • Data extraction step

Limitations
• Learnable grammars
  • Union-Free Regular Expressions (RoadRunner)
    • Variety of schema structure: tuples (with optional attributes) and lists of (nested) tuples
    • Does not efficiently handle disjunctions (pages with alternate presentations of the same attribute)
  • Context-free Grammars
    • Limited learning ability
• User needs to provide a set of pages of the same type

Website Structure-based Approach
• Websites attempt to simplify user navigation and interaction with data by organizing how data is presented across the site
  • Hierarchical organization
  • List of results
  • Detail pages
  • …
• Machine learning methods take advantage of structure to extract data

RoadRunner: Towards Automatic Data Extraction from Large Web Sites
by Crescenzi, Mecca, & Merialdo

RoadRunner Overview
• Automatically generates a wrapper from large web pages
  • Pages of the same class
  • No dynamic content from JavaScript, AJAX, etc.
• Infers source schema
  • Supports nested structures and lists
• Extracts data from pages
• Efficient approach to large, complex pages with regular structure

Example Pages
• Compares two pages at a time to find similarities and differences
• Infers nested structure (schema) of page
• Extracts fields

Extracted Result [figure]

Union-Free Regular Expression (UFRE)
• Web page structure can be represented as a Union-Free Regular Expression (UFRE)
• A UFRE is a regular expression without disjunctions
• If a and b are UFREs, then the following are also UFREs
  • a.b: string fields
  • (a)+: lists (possibly nested)
  • (a)?: optional fields
• Strong assumption that usually holds
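The UFRE definition above can be made concrete with a small sketch. This is only an illustration, not RoadRunner's own data structure: the class names (Tok, Field, Seq, Plus, Opt), the to_regex helper, the sample pages, and the choice to let a #PCDATA field match any run of non-"<" characters are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Tok:
    """A literal token: an HTML tag or fixed text (hypothetical name)."""
    text: str


@dataclass(frozen=True)
class Field:
    """#PCDATA: a variable string field."""


@dataclass(frozen=True)
class Seq:
    """Concatenation a.b (generalized to any number of parts)."""
    parts: Tuple["UFRE", ...]


@dataclass(frozen=True)
class Plus:
    """Iterator (a)+ : one or more repetitions, i.e. a list."""
    body: "UFRE"


@dataclass(frozen=True)
class Opt:
    """Optional (a)? ."""
    body: "UFRE"


# Note what is missing: there is no Union/Or node, hence "union-free".
UFRE = Union[Tok, Field, Seq, Plus, Opt]


def to_regex(e: UFRE) -> str:
    """Translate a UFRE into Python regex syntax."""
    if isinstance(e, Tok):
        return re.escape(e.text)
    if isinstance(e, Field):
        return r"([^<]+)"  # assumption: a field is any run of non-tag text
    if isinstance(e, Seq):
        return "".join(to_regex(p) for p in e.parts)
    if isinstance(e, Plus):
        return "(?:" + to_regex(e.body) + ")+"
    if isinstance(e, Opt):
        return "(?:" + to_regex(e.body) + ")?"
    raise TypeError(e)


# A wrapper with one field, one optional image, and one list of titles.
wrapper = Seq((
    Tok("<HTML>Books of: "), Field(),
    Opt(Tok("<IMG/>")),
    Tok("<UL>"),
    Plus(Seq((Tok("<LI>Title: "), Field(), Tok("</LI>")))),
    Tok("</UL></HTML>"),
))
pattern = re.compile(to_regex(wrapper))

page1 = "<HTML>Books of: John Smith<UL><LI>Title: DB Primer</LI></UL></HTML>"
page2 = ("<HTML>Books of: Paul Jones<IMG/><UL><LI>Title: XML at Work</LI>"
         "<LI>Title: HTML Scripts</LI></UL></HTML>")
```

Both sample pages are accepted by the same pattern even though they differ in the optional image and in list length, which is the sense in which one UFRE "contains" a set of pages.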

Approach
• Given a set of example pages
• Generate the Union-Free Regular Expression which contains the example pages
• Find the least upper bound on the RE lattice to generate a wrapper in linear time
• Reduces to finding the least upper bound of two UFREs

Matching/Mismatches
Given a set of pages of the same type
• Take the first page to be the wrapper (UFRE)
• Match each successive sample page against the wrapper
• Mismatches result in generalizations of the wrapper
  • String mismatches
    • Discover fields
  • Tag mismatches
    • Discover optional fields
    • Discover iterators

Example Matching [figure]

String Mismatches: Discovering Fields
• String mismatches are used to discover fields of the document
• Wrapper is generalized by replacing “John Smith” with #PCDATA
  <HTML>Books of: John Smith → <HTML>Books of: #PCDATA
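A minimal sketch of the string-mismatch rule, under simplifying assumptions: both pages tokenize to the same number of tokens, the author name sits in its own text token (the <B> tags are added here for that purpose), and the helper names tokenize and generalize_strings are hypothetical.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and text tokens, dropping blank runs."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def generalize_strings(wrapper, sample):
    """Replace text tokens that differ between wrapper and sample by #PCDATA.

    Tag mismatches are out of scope for this sketch and raise ValueError.
    """
    if len(wrapper) != len(sample):
        raise ValueError("token counts differ: not a pure string mismatch")
    out = []
    for w, s in zip(wrapper, sample):
        is_text = not s.startswith("<")
        if w == s or (w == "#PCDATA" and is_text):
            out.append(w)                      # tokens agree, keep as is
        elif not w.startswith("<") and is_text:
            out.append("#PCDATA")              # string mismatch: new field
        else:
            raise ValueError(f"tag mismatch: {w!r} vs {s!r}")
    return out


wrapper = tokenize("<HTML>Books of: <B>John Smith</B></HTML>")
sample = tokenize("<HTML>Books of: <B>Paul Jones</B></HTML>")
generalized = generalize_strings(wrapper, sample)
```

The fixed text "Books of:" survives because it is identical on both pages; only the differing name is turned into a field.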

Tag Mismatches: Discovering Optionals
• First check to see if the mismatch is caused by an iterator (described next)
• If not, it could be an optional field in the wrapper or the sample
• Cross search is used to determine possible optionals
• Image field determined to be optional:
  ( <img src=…/>)?
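The cross search can be sketched roughly as follows. This is a simplified reading of the slide rather than RoadRunner's actual procedure: it works on flat token lists, takes the first reoccurrence of the mismatching token, and represents an optional as a ("?", tokens) tuple. The function names and the <IMG/> stand-in for the image tag are assumptions.

```python
def tok_match(w, s):
    """A wrapper token matches a sample token if equal, or if it is a field
    (#PCDATA) and the sample token is text rather than a tag."""
    return w == s or (w == "#PCDATA" and not s.startswith("<"))


def first_mismatch(wrapper, sample):
    """Index of the first position where the two token lists disagree."""
    for idx, (w, s) in enumerate(zip(wrapper, sample)):
        if not tok_match(w, s):
            return idx
    return None


def cross_search(wrapper, sample, i, j):
    """Look for the mismatching token of one side further along the other.

    Returns (side, tokens): the side that holds the extra tokens and the
    tokens that would become optional, or None if neither search succeeds.
    """
    if wrapper[i] in sample[j + 1:]:
        k = sample.index(wrapper[i], j + 1)
        return ("sample", sample[j:k])      # sample has an extra piece
    if sample[j] in wrapper[i + 1:]:
        k = wrapper.index(sample[j], i + 1)
        return ("wrapper", wrapper[i:k])    # wrapper has an extra piece
    return None


wrapper = ["<HTML>", "Books of:", "<B>", "#PCDATA", "</B>",
           "<UL>", "<LI>", "#PCDATA", "</LI>", "</UL>", "</HTML>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>", "<IMG/>",
          "<UL>", "<LI>", "XML at Work", "</LI>", "</UL>", "</HTML>"]

pos = first_mismatch(wrapper, sample)
side, optional = cross_search(wrapper, sample, pos, pos)
# The extra piece was found on the sample, so it is spliced into the wrapper.
generalized = wrapper[:pos] + [("?", optional)] + wrapper[pos:]
```

Here the wrapper's <UL> reappears one token later in the sample, so the skipped <IMG/> is the candidate optional.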

Example Matching [figure; labels: String Mismatch, String Mismatch]

Tag Mismatches: Discovering Iterators
• Assume the mismatch is caused by repeated elements in a list
  • End of the list corresponds to the last matching token: </LI>
  • Beginning of the list corresponds to one of the mismatched tokens: <LI> or </UL>
  • These create possible “squares”
• Match possible squares against earlier squares
• Generalize the wrapper by finding all contiguous repeated occurrences:
  ( <LI>Title:#PCDATA</LI> )+
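The square-finding steps above might look like the following sketch. It covers only the case where the extra repetition is on the sample side, uses flat token lists, and encodes the iterator as a ("+", tokens) tuple; those choices and the name discover_iterator are assumptions for illustration, not the published algorithm.

```python
def tok_match(w, s):
    """Equal tokens match; a #PCDATA field matches any text token."""
    return w == s or (w == "#PCDATA" and not s.startswith("<"))


def discover_iterator(wrapper, sample, i, j):
    """Try to explain a tag mismatch at wrapper[i] / sample[j] as a list.

    The last matching token (sample[j - 1]) is taken as the end of a list
    item. The candidate square runs from sample[j] to the next occurrence
    of that token and must match the tokens just before wrapper[i].
    Returns the generalized wrapper, or None if no square is found.
    """
    if i == 0 or j == 0:
        return None
    terminal = sample[j - 1]
    try:
        end = sample.index(terminal, j)
    except ValueError:
        return None
    square = sample[j:end + 1]
    n = len(square)
    if i < n:
        return None
    earlier = wrapper[i - n:i]              # the earlier square
    if not all(tok_match(w, s) for w, s in zip(earlier, square)):
        return None
    # Collapse the repeated element into a single iterator node.
    return wrapper[:i - n] + [("+", earlier)] + wrapper[i:]


wrapper = ["<UL>", "<LI>", "Title:", "#PCDATA", "</LI>", "</UL>"]
sample = ["<UL>", "<LI>", "Title:", "XML at Work", "</LI>",
          "<LI>", "Title:", "HTML Scripts", "</LI>", "</UL>"]

# Mismatch at index 5: the wrapper expects </UL>, the sample has <LI>.
generalized = discover_iterator(wrapper, sample, 5, 5)
```

The result corresponds to <UL>( <LI>Title:#PCDATA</LI> )+</UL> in the slide's notation.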

Internal Mismatches
• An internal mismatch is generated while trying to match a square against earlier squares on the same page
• Solving internal mismatches yields further refinements in the wrapper
  • List of book editions
  • Special!

Recursive Example [figure]

Discussion
• Assumptions:
  • Pages are well-structured
  • Structure can be modeled by UFRE (no disjunctions)
• Search space for explaining mismatches is huge
  • Uses a number of heuristics to prune the space
    • Limited backtracking
    • Limit on number of choices to explore
    • Patterns cannot be delimited by optionals
  • Will result in pruning possible wrappers

AutoFeed: An Unsupervised Learning System for Generating Webfeeds

Relational Model of a Web Site
• Sites are well structured to improve user experience
• Generation: given relational data, scripts generate the web site, e.g., a weather site
• Extraction is the opposite task: given the web site, find the underlying relational data

[figure: relational data behind the example weather site, with page types Homepage, State, and CityWeather]

Homepage
0  AutoFeedWeather

StateList
0  0
0  1

States
0  California    CA
1  Pennsylvania  PA

CityList
0  0
0  1
0  2
1  3
1  4

CityWeather
0  Los Angeles    70
1  San Francisco  65
2  San Diego      75
3  Pittsburgh     50
4  Philadelphia   55
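The generation/extraction duality can be illustrated with the CityWeather rows from the figure. The page template, the reading of the numeric column as a temperature, and the function names are assumptions; a real site's script and markup are unknown, and the point of automatic extraction is precisely that the pattern below would have to be learned rather than written by hand.

```python
import re

# CityWeather rows from the figure: id -> (city, value).
# Assumption: the numeric column is read here as a temperature.
city_weather = {
    0: ("Los Angeles", 70),
    1: ("San Francisco", 65),
    2: ("San Diego", 75),
    3: ("Pittsburgh", 50),
    4: ("Philadelphia", 55),
}


def render_city_page(city, temp):
    """Generation: a script turns one relational row into an HTML page.
    The template is hypothetical."""
    return f"<HTML><H1>{city}</H1><P>Temperature: {temp}</P></HTML>"


# Extraction: a wrapper for the CityWeather page type. It is written by
# hand here, standing in for what wrapper induction would produce.
CITY_PAGE = re.compile(
    r"<HTML><H1>([^<]+)</H1><P>Temperature: (\d+)</P></HTML>"
)


def extract_city_page(page):
    """Recover the (city, temp) row from a page, or None if it does not fit."""
    m = CITY_PAGE.fullmatch(page)
    if m is None:
        return None
    return (m.group(1), int(m.group(2)))


pages = {cid: render_city_page(*row) for cid, row in city_weather.items()}
recovered = {cid: extract_city_page(page) for cid, page in pages.items()}
```

Running generation and then extraction returns the original table, which is the round trip the slide describes.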

Chicken ‘n Egg Problem
• If we could pick out a set of pages of the same class, we could learn their grammar and extract data
• But how do we pick the right pages in the first place?

[figure: the same relational model of the weather site as on the previous slide]

Overview of Approach
• Many types of structure within a site
  • Graph structure of site’s links
  • URL naming scheme
  • Content of pages
  • HTML structure within page types, …
• Experts focus on individual structures and output discoveries as hints
  • Experts are heterogeneous
• Probabilistically combine experts (don’t have to be correct all the time)
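As a rough illustration of what a single expert might emit, the sketch below groups pages by URL naming scheme. It is not AutoFeed's implementation: the URLs are invented, the digit-masking rule is an assumed heuristic, and the probabilistic combination of several experts is not shown.

```python
import re
from collections import defaultdict


def url_expert(urls):
    """Group URLs that share a naming scheme once digit runs are masked.

    Each group is a hint that its pages may belong to the same page type.
    A hint is only evidence; it does not have to be correct.
    """
    groups = defaultdict(list)
    for url in urls:
        scheme = re.sub(r"\d+", "<N>", url)
        groups[scheme].append(url)
    return dict(groups)


# Hypothetical URLs for the example weather site.
urls = [
    "http://weather.example/",
    "http://weather.example/state/0",
    "http://weather.example/state/1",
    "http://weather.example/city/0",
    "http://weather.example/city/1",
    "http://weather.example/city/2",
]
hints = url_expert(urls)
```

On this toy input the expert separates the homepage, the state pages, and the city pages. On a real site the same rule can easily be wrong, which is why hints from several heterogeneous experts are combined.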

Hints
• Hints describe local structural similarities within pages or within data
• Hints help find relational structure of the site
• Simultaneously cluster pages and data using hints
