Databases and Information Systems Research Group Computer Science Department Albert-Ludwigs-University Freiburg
Learning Rules to Pre-process Web Data for Automatic Integration
Kai Simon, Thomas Hornung, Georg Lausen
Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006
DBIS Research Group Computer Science Department Albert-Ludwigs-University Freiburg
Motivation
- Most information available on the Web is accessible only to humans, through presentation-oriented HTML pages.
- We still lack techniques that enable machines (agents) to
  - extract and
  - understand
  presentation-oriented HTML pages and so act on behalf of humans.
Outline
- Introduction
  - System Overview
- Extraction and Alignment
- Table Mining
System Overview
[Figure: pipeline with the labels "DB", "embedded by scripts", "extract & align records", "Materialized View".]
1st access of an unknown source:
- Table Mining heuristics
  - Column Splitting
  - Label Assignment
  - Arithmetic Dependencies
- Export the result of the Table Mining heuristics as Table Mining rules
System Overview
[Figure: pipeline with the labels "DB", "embedded by scripts", "extract & align records", "Materialized View".]
Access of a known source:
- Apply Table Mining rules
  - Column Splitting
  - Label Assignment
  - Arithmetic Dependencies
Outline
- Introduction
- Extraction and Alignment
  - ViPER [CIKM'05]
    - Automatic Data Extraction
    - Tabular Alignment
- Table Mining
Automatic Data Extraction
- Scan the Web page for similar data records.
- Use visual information to
  - segment the data records
  - compute their relevance according to their location inside the Web page.
- Extract the similar data records with the highest relevance.
Tabular Alignment
Data record alignment
Data Representation
F-Logic Facts
Outline
- Introduction
- Extraction and Alignment
- Table Mining (data-driven / statistical methods)
  - Column Splitting
  - Label Assignment
  - Arithmetic Dependencies
Column Splitting
Example column with three data items:
- data item 1: Save: £3.00 (13%)
- data item 2: Save: £6.00 (21%)
- data item 3: Save: £3.00 (9%)
Each data item is split into typed tokens:
- Save: (text)
- £ (currency)
- 3.00, 6.00 (float)
- ( (punctuation)
- 13, 21, 9 (int)
- %) (text)
[Figure: the typed tokens of the data items are grouped into subset 1 and subset 2.]
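The splitting step above can be sketched as a small typed tokenizer. This is only an illustration under stated assumptions: the regular expressions, the names `TOKEN_SPEC`, `split_item` and `split_column`, and the rule "split only if all items share one type signature" are hypothetical and are not the rules learned by the system. Unlike the slide, this sketch treats "%" and ")" as two punctuation tokens.

```python
import re

# Token types tried in order. The type names follow the slide; the patterns
# are illustrative assumptions, not the system's learned splitting rules.
TOKEN_SPEC = [
    ("float", r"\d+\.\d+"),
    ("int", r"\d+"),
    ("currency", r"[£$€]"),
    ("text", r"[A-Za-z]+:?"),
    ("punctuation", r"[^\w\s]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def split_item(item):
    """Return the (type, token) pairs of one data item; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(item)]


def split_column(items):
    """Split a column into sub-columns if all items share one type signature.

    Returns one list of token strings per data item, or None when the items
    do not tokenize into the same sequence of types.
    """
    rows = [split_item(item) for item in items]
    signatures = {tuple(kind for kind, _ in row) for row in rows}
    if len(signatures) != 1:
        return None
    return [[token for _, token in row] for row in rows]


column = ["Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)"]
sub_columns = split_column(column)
```

Requiring a shared type signature is one simple, data-driven way to decide that a split is safe for the whole column; a statistical variant could tolerate a few deviating items instead.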
Splitting Rules
… … … …
Label Assignment
[Figure: three variants of the same price information ("Our Price: $299.95", "List Price: $499.99"). Each variant is shown as visual representation, HTML source code, and assignment strategy over the columns Col_i, Col_i+1, Col_i+2 (and Col_i+3). The HTML variants use b, span and br elements, or a table with tr and td elements.]
Inter Label Assignment
Inner Label Assignment
Column Label Assignment Rules
- Inner label assignment
- Inter label assignment
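The rule content of these slides did not survive, so the following is only a rough sketch of what an inner label assignment could look like. It assumes that "inner" means the label is part of each cell's own text, as in "Our Price: $299.95", and that the label is separated from the value by a colon. The function name `inner_label` and the pattern `LABELED` are hypothetical.

```python
import re

# Assumed cell shape: "<label>: <value>". This is an illustrative pattern,
# not a rule taken from the presentation.
LABELED = re.compile(r"^\s*(?P<label>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")


def inner_label(cells):
    """Return (label, values) if every cell is 'label: value' with one shared label.

    Returns None when a cell has no label part or the labels differ, so the
    column is left untouched in those cases.
    """
    matches = [LABELED.match(cell) for cell in cells]
    if not cells or any(m is None for m in matches):
        return None
    labels = {m.group("label") for m in matches}
    if len(labels) != 1:
        return None
    return labels.pop(), [m.group("value") for m in matches]


result = inner_label(["Our Price: $299.95", "Our Price: $19.99"])
# result is ("Our Price", ["$299.95", "$19.99"])
```

An inter label assignment would instead look for the label in a neighbouring column; that case is not sketched here.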
Functional Methods and Updates
Solution
Improvement of Heuristics
Rules can be:
- modified
- removed
- added
Arithmetic Dependencies
Find arithmetic dependencies between numeric columns by checking the homogeneous systems of linear equations over the column values for non-trivial solutions.
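One way to check such a homogeneous system for non-trivial solutions is to compute a basis of its null space. The sketch below does this with exact Gauss-Jordan elimination over fractions. It is an assumption-laden illustration: the function name `null_space`, the sample rows, and the use of exact arithmetic are not from the presentation, and real extracted data would typically need a tolerance rather than exact equality.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : A x = 0} for matrix A (list of rows), by exact elimination.

    Each row is one data record, each column one numeric table column.
    Values should be ints or decimal strings so that Fraction is exact.
    """
    a = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(a[0])
    pivots = []  # pivots[k] is the pivot column of reduced row k
    r = 0
    for c in range(n_cols):
        if r == len(a):
            break
        pivot = next((i for i in range(r, len(a)) if a[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot here: column c is a free variable
        a[r], a[pivot] = a[pivot], a[r]
        pv = a[r][c]
        a[r] = [v / pv for v in a[r]]
        for i in range(len(a)):
            if i != r and a[i][c] != 0:
                factor = a[i][c]
                a[i] = [vi - factor * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        x = [Fraction(0)] * n_cols
        x[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            x[p] = -a[row_idx][free]
        basis.append(x)
    return basis


# Hypothetical records with the columns (oldPrice, newPrice, discount).
records = [
    ("499.99", "299.95", "200.04"),
    ("20.00", "15.00", "5.00"),
    ("10.00", "9.00", "1.00"),
]
dependencies = null_space(records)
```

For these sample records the null space is spanned by the single vector (-1, 1, 1), which reads as -oldPrice + newPrice + discount = 0. An empty basis means the columns admit only the trivial solution, so no linear dependency was found.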
The system has discovered the arithmetic dependency:
newPrice = oldPrice - discount
It is checked against a tolerance: oldPrice - newPrice - discount < threshold
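A discovered dependency of this kind can be used as a constraint on newly extracted records. The following sketch shows one possible check. The function name `satisfies_dependency`, the dictionary layout of a record, and the default threshold are hypothetical, and taking the absolute value of the residual is an assumption that the slide text does not confirm.

```python
def satisfies_dependency(record, threshold=0.01):
    """Check newPrice = oldPrice - discount up to a tolerance.

    The absolute value is an assumption, so that deviations in both
    directions count as violations.
    """
    residual = record["oldPrice"] - record["newPrice"] - record["discount"]
    return abs(residual) < threshold


good = {"oldPrice": 499.99, "newPrice": 299.95, "discount": 200.04}
bad = {"oldPrice": 20.00, "newPrice": 15.00, "discount": 4.00}
flags = [satisfies_dependency(good), satisfies_dependency(bad)]
```

The tolerance absorbs floating-point rounding and small inconsistencies in the source data; records that fail the check could be flagged for review rather than discarded.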
Conclusion
Advantages
- Table Mining heuristics are applied only once for each source
- The results of the heuristics can be post-processed manually
- Qualitative information integration based on identified constraints
- HTML streams can be annotated on the fly (OntoGather [PPSWR'06])
[Figure: example with the labels newPrice, discount, oldPrice, brand, description.]
Outlook
So far
- Conversion of structured HTML pages to F-Logic facts
What’s next
- Use Text Mining techniques to push the limits of structured content (e.g., rental listings)