Learning Rules to Pre-process Web Data for Automatic Integration

Kai Simon, Thomas Hornung, Georg Lausen
Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006
Databases and Information Systems Research Group


SLIDE 1

Databases and Information Systems Research Group, Computer Science Department, Albert-Ludwigs-University Freiburg

Learning Rules to Pre-process Web Data for Automatic Integration

Kai Simon, Thomas Hornung, Georg Lausen
Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006

SLIDE 2

Motivation

  • Most information available on the Web is only human-accessible through presentation-oriented HTML pages.
  • We still lack techniques which enable machines [agents]
      • to extract and
      • understand
    presentation-oriented HTML pages to act on behalf of humans.

SLIDE 3

Outline

  • Introduction
      • System Overview
  • Extraction and Alignment
  • Table Mining

SLIDE 4

System Overview

[Diagram: 1st access of an unknown source. Labels: DB, embedded by scripts, Materialized View.]

SLIDE 5

System Overview

[Diagram: 1st access of an unknown source. Labels: DB, embedded by scripts, extract & align records, Materialized View.]

SLIDE 6

System Overview

[Diagram: 1st access of an unknown source. Labels: DB, embedded by scripts, extract & align records, Materialized View, Table Mining.]

Table Mining

  • Column Splitting
  • Label Assignment
  • Arithmetic Dependencies

SLIDE 7

System Overview

[Diagram: 1st access of an unknown source. Labels: DB, embedded by scripts, extract & align records, Materialized View, Table Mining Heuristics, Export result of Table Mining Heuristics, Table Mining Rules.]

SLIDE 8

System Overview

[Diagram: access of a known source. Labels: DB, embedded by scripts, extract & align records, Materialized View, Apply Table Mining Rules.]

Apply Table Mining Rules

  • Column Splitting
  • Label Assignment
  • Arithmetic Dependencies
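
The split between the first access of an unknown source and later accesses of a known source can be sketched roughly as follows. This is only an illustration under stated assumptions: `RuleStore`, `toy_heuristics`, and the idea of a rule as a row-to-row function are hypothetical names introduced here, not part of the system described on the slides.

```python
from typing import Callable, Dict, List

Row = List[str]
Rule = Callable[[Row], Row]


class RuleStore:
    """Keeps the rules learned for each source (hypothetical helper)."""

    def __init__(self, learn: Callable[[List[Row]], List[Rule]]) -> None:
        self._learn = learn                      # stand-in for the table mining heuristics
        self._rules: Dict[str, List[Rule]] = {}  # source name -> learned rules
        self.learn_calls = 0                     # how often the heuristics were run

    def process(self, source: str, rows: List[Row]) -> List[Row]:
        if source not in self._rules:
            # 1st access of an unknown source: run the heuristics once, keep the rules.
            self._rules[source] = self._learn(rows)
            self.learn_calls += 1
        # Any access (first or later): apply the stored rules to every row.
        result = []
        for row in rows:
            for rule in self._rules[source]:
                row = rule(row)
            result.append(row)
        return result


def toy_heuristics(rows: List[Row]) -> List[Rule]:
    """Assumed placeholder: 'learns' a single rule that trims whitespace."""
    return [lambda row: [cell.strip() for cell in row]]


store = RuleStore(toy_heuristics)
first = store.process("shop", [[" Save: ", " 3.00 "]])
second = store.process("shop", [[" Save: ", " 6.00 "]])
```

After both calls the heuristics have run only once; the second access reuses the stored rules.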

SLIDE 9

Outline

  • Introduction
  • Extraction and Alignment
      • ViPER [CIKM'05]
          • Automatic Data Extraction
          • Tabular Alignment
  • Table Mining

SLIDE 10

Automatic Data Extraction

Scan the Web page for similar data records

SLIDE 11

Automatic Data Extraction

Scan the Web page for similar data records

Use visual information to

  • segment the data records
  • compute the relevance

according to the location inside the Web page.

SLIDE 12

Automatic Data Extraction

Scan the Web page for similar data records

Use visual information to

  • segment the data records
  • compute the relevance

according to the location inside the Web page.

Extract similar data records with the highest relevance.
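
The slides do not say how relevance is computed from the location inside the page. As a purely illustrative sketch, assuming each candidate region of similar records is known by its bounding box, one could score regions by covered area and closeness to the page centre. The `Region` type, the scoring formula, and the example coordinates are all assumptions made here and are not ViPER's actual measure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Bounding box of a candidate group of similar data records (assumed input)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def relevance(region: Region, page_width: float, page_height: float) -> float:
    """Assumed score: share of the page area, weighted by closeness to the centre."""
    area_share = (region.width * region.height) / (page_width * page_height)
    centre_x = region.x + region.width / 2
    centre_y = region.y + region.height / 2
    # Normalised offsets from the page centre: 0 at the centre, 1 at the border.
    dx = abs(centre_x - page_width / 2) / (page_width / 2)
    dy = abs(centre_y - page_height / 2) / (page_height / 2)
    centrality = max(0.0, 1.0 - max(dx, dy))
    return area_share * centrality


def most_relevant(regions: List[Region], page_width: float, page_height: float) -> Region:
    """Pick the region with the highest assumed relevance score."""
    return max(regions, key=lambda r: relevance(r, page_width, page_height))


candidates = [
    Region("sidebar", 0, 0, 200, 1000),     # narrow navigation column at the left
    Region("results", 200, 100, 700, 800),  # large block near the page centre
]
best = most_relevant(candidates, 1000, 1000)
```

With these made-up coordinates the central "results" block outranks the sidebar.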

SLIDE 13

Tabular Alignment

Data record alignment

SLIDE 14

Data Representation

F-Logic Facts
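
The slide's own facts are not preserved in this transcript. As a rough sketch of what turning one aligned record into a fact could look like, the helper below renders a record in the general `object:class[attribute -> value]` molecule style of F-Logic. The function name, the object and class names, and the attribute values are hypothetical, and the exact concrete syntax (for example the attribute separator) differs between F-Logic systems.

```python
from typing import Dict, Union

Value = Union[str, int, float]


def to_flogic_fact(object_id: str, class_name: str, attributes: Dict[str, Value]) -> str:
    """Render one record as an F-Logic-style molecule (illustrative syntax only)."""
    parts = []
    for name, value in attributes.items():
        if isinstance(value, str):
            # Quote strings and escape backslashes and embedded double quotes.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        else:
            rendered = repr(value)
        parts.append(name + " -> " + rendered)
    body = "; ".join(parts)
    return object_id + ":" + class_name + "[" + body + "]."


fact = to_flogic_fact("rec1", "record", {"brand": "Acme", "newPrice": 299.95})
```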

SLIDE 15

Outline

  • Introduction
  • Extraction and Alignment
  • Table Mining
      • Column Splitting
      • Label Assignment
      • Arithmetic Dependencies

Data driven / statistical methods

SLIDE 16

Column Splitting

[Figure: a column of data items is split into typed tokens. Token types shown: text, currency, float, punctuation, int. The tokens are grouped into subset 1 and subset 2.]

  data item 1: Save: £3.00 (13%)
  data item 2: Save: £6.00 (21%)
  data item 3: Save: £3.00 (9%)

Extracted values: £3.00 → 3.00 (float), £6.00 → 6.00 (float), (13%) → 13 (int), (21%) → 21 (int), (9%) → 9 (int)
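
A minimal sketch of this kind of splitting, assuming a simple regular-expression tokenizer: each data item is cut into typed tokens, and if all items share the same sequence of token types, the tokens at each position form a new column. The pattern, the function names, and the decision to treat "%" and ")" as separate punctuation tokens are assumptions made here; the slides do not give the actual splitting procedure.

```python
import re
from typing import List, Tuple

# Order matters: floats before ints, currency symbols before generic punctuation.
TOKEN_PATTERN = re.compile(
    r"(?P<float>\d+\.\d+)"
    r"|(?P<int>\d+)"
    r"|(?P<currency>[£$€])"
    r"|(?P<text>[A-Za-z]+:?)"
    r"|(?P<punctuation>[^\sA-Za-z\d])"
)


def tokenize(item: str) -> List[Tuple[str, str]]:
    """Split one data item into (token type, token text) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(item)]


def split_column(items: List[str]) -> List[Tuple[str, List[str]]]:
    """Split a column into sub-columns when every item has the same type sequence."""
    rows = [tokenize(item) for item in items]
    signature = [kind for kind, _ in rows[0]]
    if any([kind for kind, _ in row] != signature for row in rows):
        raise ValueError("items do not share one token pattern")
    return [
        (kind, [row[position][1] for row in rows])
        for position, kind in enumerate(signature)
    ]


columns = split_column(["Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)"])
```

For the three data items on the slide this yields seven sub-columns, among them a float column with the amounts and an int column with the percentages.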

SLIDE 17

Splitting Rules

… … … …

SLIDE 18

Splitting Rules

… …

SLIDE 19

Outline

  • Introduction
  • Extraction and Alignment
  • Table Mining
      • Column Splitting
      • Label Assignment
      • Arithmetic Dependencies

Data driven / statistical methods

SLIDE 20

Label Assignment

[Figure: three example layouts of labelled prices spread over the columns Col_i, Col_i+1, Col_i+2 (and Col_i+3 in the third example). Each layout is shown in three views: visual representation, HTML source code, and assignment strategy.]

Example values: Our Price: $299.95, List Price: $499.99

HTML elements shown in the source code view: b, span, br, tr, td
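
The slides name two cases, inner and inter label assignment, but their definitions are not preserved here. The sketch below therefore rests on an assumption: an inner label is a constant "Label:" prefix inside every cell of a column, and an inter label is a neighbouring column that holds the same text in every row. The function names are hypothetical, and the second price in each example ("$149.00", "$89.00") is made up for illustration.

```python
import re
from typing import List, Optional, Tuple


def inner_label(cells: List[str]) -> Optional[Tuple[str, List[str]]]:
    """Return (label, values) if every cell starts with the same 'Label:' prefix."""
    prefixes = set()
    values = []
    for cell in cells:
        match = re.match(r"\s*([^:]+):\s*(.+)$", cell)
        if match is None:
            return None
        prefixes.add(match.group(1).strip())
        values.append(match.group(2).strip())
    if len(prefixes) != 1:
        return None
    return prefixes.pop(), values


def inter_label(label_cells: List[str], value_cells: List[str]) -> Optional[str]:
    """Return the label if the neighbouring column is constant and non-empty."""
    distinct = {cell.strip().rstrip(":") for cell in label_cells}
    if len(distinct) != 1 or len(label_cells) != len(value_cells):
        return None
    label = distinct.pop()
    return label or None


inner = inner_label(["Our Price: $299.95", "Our Price: $149.00"])
inter = inter_label(["List Price", "List Price"], ["$499.99", "$89.00"])
```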

SLIDE 21

Inter Label Assignment

Inter label assignment

SLIDE 22

Inner Label Assignment

Inner label assignment

SLIDE 23

Column Label Assignment Rules

  • Inner label assignment
  • Inter label assignment

SLIDE 24

Functional Methods and Updates

SLIDE 25

Functional Methods and Updates

SLIDE 26

Functional Methods and Updates

SLIDE 27

Solution

Functional Methods and Updates

SLIDE 28

Improvement of Heuristics

SLIDE 29

Improvement of Heuristics

SLIDE 30

Improvement of Heuristics

SLIDE 31

Improvement of Heuristics

Rules can be:

  • modified
  • removed
  • added

SLIDE 32

Outline

  • Introduction
  • Extraction and Alignment
  • Table Mining
      • Column Splitting
      • Label Assignment
      • Arithmetic Dependencies

Data driven / statistical methods

SLIDE 33

Arithmetic Dependencies

Find arithmetic dependencies between numeric columns by checking the homogeneous systems of linear equations [equation not preserved in this transcript] for non-trivial solutions.
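
The equation shown on this slide is not recoverable from the transcript. One common way to write such a system, offered here only as an assumed formulation and not as the authors' exact notation, uses numeric columns $c_1, \dots, c_k$ with $n$ rows, where $c_{ij}$ is the value of column $j$ in row $i$:

```latex
\sum_{j=1}^{k} \lambda_j \, c_{ij} = 0 \qquad \text{for } i = 1, \dots, n
```

A non-trivial solution $(\lambda_1, \dots, \lambda_k) \neq (0, \dots, 0)$ indicates a linear dependency among the columns. For the dependency newPrice = oldPrice - discount reported later in the deck, the columns (oldPrice, newPrice, discount) would admit the solution $\lambda = (1, -1, -1)$.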

SLIDE 34

Arithmetic Dependencies

Find arithmetic dependencies between numeric columns by checking the homogeneous systems of linear equations [equation not preserved in this transcript] for non-trivial solutions.

SLIDE 35

Arithmetic Dependencies

The system has discovered the arithmetic dependency: newPrice = oldPrice - discount

SLIDE 36

Arithmetic Dependencies

The system has discovered the arithmetic dependency: newPrice = oldPrice - discount

oldPrice - newPrice - discount < threshold

SLIDE 37

Arithmetic Dependencies

The system has discovered the arithmetic dependency: newPrice = oldPrice - discount

oldPrice - newPrice - discount < threshold
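
A small sketch of a threshold check of this kind. Instead of solving the homogeneous system in general, it simply tries all coefficient combinations from {-1, 0, 1} and keeps those whose weighted row sums stay below the threshold in absolute value. That brute-force search, the use of an absolute value, the function name, and the price figures are assumptions made for illustration, not the method or data from the talk.

```python
from itertools import product
from typing import Dict, List


def find_dependencies(columns: Dict[str, List[float]], threshold: float = 0.01) -> List[Dict[str, int]]:
    """Return coefficient assignments in {-1, 0, 1} whose weighted sum is ~0 in every row."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    found = []
    for coeffs in product((-1, 0, 1), repeat=len(names)):
        nonzero = [c for c in coeffs if c != 0]
        # Skip trivial candidates and fix the sign so each dependency is reported once.
        if len(nonzero) < 2 or nonzero[0] != 1:
            continue
        holds = all(
            abs(sum(c * columns[name][i] for c, name in zip(coeffs, names))) < threshold
            for i in range(n_rows)
        )
        if holds:
            found.append(dict(zip(names, coeffs)))
    return found


# Made-up example rows in which newPrice = oldPrice - discount holds.
table = {
    "oldPrice": [499.99, 29.00, 23.00],
    "newPrice": [299.95, 23.00, 20.00],
    "discount": [200.04, 6.00, 3.00],
}
dependencies = find_dependencies(table)
```

The single assignment found, oldPrice: 1, newPrice: -1, discount: -1, corresponds to oldPrice - newPrice - discount ≈ 0. The threshold absorbs rounding in the extracted values.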

SLIDE 38

Conclusion

Advantages

  • Table Mining Heuristics are only applied once for each resource
  • Manual post-processing of heuristics
  • Qualitative information integration based on identified constraints
  • Annotating HTML streams on-the-fly (OntoGather [PPSWR ‘06])

[Figure labels: oldPrice, newPrice, discount, brand, description]

SLIDE 39

Outlook

So far

  • Conversion of structured HTML pages to F-Logic facts

What’s next

  • Use Text Mining techniques to push the limit of structured content (e.g. rental listings)

SLIDE 40

??? Questions ???

Thank you for your attention!