Learning Rules to Pre-process Web Data for Automatic Integration

Learning Rules to Pre-process Web Data for Automatic Integration. Kai Simon, Thomas Hornung, Georg Lausen. Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006. Databases and Information Systems Research Group.


  1. Learning Rules to Pre-process Web Data for Automatic Integration. Kai Simon, Thomas Hornung, Georg Lausen. Workshop "Model Checking and Semantic Web Rules Languages", 28th October 2006. Databases and Information Systems Research Group, Computer Science Department, Albert-Ludwigs-University Freiburg.

  2. Motivation (DBIS Research Group, Computer Science Department, Albert-Ludwigs-University Freiburg): Most information available on the Web is accessible only to humans, through presentation-oriented HTML pages. We still lack techniques that enable machines (agents) to extract and understand presentation-oriented HTML pages so that they can act on behalf of humans.

  3. Outline: Introduction (System Overview); Extraction and Alignment; Table Mining.

  4. System Overview: 1st access of an unknown source. Diagram labels: DB; embedded by scripts; Materialized View.

  5. System Overview: 1st access of an unknown source. Diagram labels: DB; embedded by scripts; Materialized View; extract & align records.

  6. System Overview: 1st access of an unknown source. Diagram labels: DB; embedded by scripts; Materialized View; extract & align records; Table Mining (Column Splitting, Label Assignment, Arithmetic Dependencies).

  7. System Overview: 1st access of an unknown source. Diagram labels: DB; embedded by scripts; Materialized View; extract & align records; export result of Table Mining Heuristics; Table Mining Rules.

  8. System Overview: Access of a known source. Diagram labels: DB; embedded by scripts; Materialized View; extract & align records; apply Table Mining Rules (Column Splitting, Label Assignment, Arithmetic Dependencies).
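
The two access modes in the system overview (mine rules on the first access of an unknown source, then only apply the stored rules once the source is known) can be sketched as a small rule cache. This is a minimal illustration, not the authors' implementation: `rule_store`, `mine_rules`, `strip_cells`, and `process` are hypothetical names, and the single whitespace-stripping rule merely stands in for the real table mining heuristics.

```python
from typing import Callable, Dict, List

Table = List[List[str]]
Rule = Callable[[Table], Table]

# Hypothetical store: source identifier -> rules exported by the heuristics.
rule_store: Dict[str, List[Rule]] = {}
# Records every time the (expensive) heuristics actually ran.
mined_sources: List[str] = []


def strip_cells(table: Table) -> Table:
    """Stand-in rule: trim whitespace in every cell."""
    return [[cell.strip() for cell in row] for row in table]


def mine_rules(source: str, table: Table) -> List[Rule]:
    """Placeholder for the table mining heuristics (assumption, not the real ones)."""
    mined_sources.append(source)
    return [strip_cells]


def process(source: str, table: Table) -> Table:
    """Run heuristics once per source; afterwards only apply the stored rules."""
    if source not in rule_store:  # 1st access of an unknown source
        rule_store[source] = mine_rules(source, table)
    for rule in rule_store[source]:  # access of a known source
        table = rule(table)
    return table
```

Calling `process` twice for the same source runs `mine_rules` only on the first call; the second call goes straight to rule application.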

  9. Outline: Introduction; Extraction and Alignment, ViPER [CIKM'05] (Automatic Data Extraction, Tabular Alignment); Table Mining.

  10. Automatic Data Extraction: Scan the Web page for similar data records.

  11. Automatic Data Extraction: Scan the Web page for similar data records. Use visual information to (a) segment the data records and (b) compute their relevance according to their location inside the Web page.

  12. Automatic Data Extraction: Scan the Web page for similar data records. Use visual information to (a) segment the data records and (b) compute their relevance according to their location inside the Web page. Extract the similar data records with the highest relevance.

  13. Tabular Alignment: Data record alignment.

  14. Data Representation: F-Logic facts.
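
The facts shown on this slide are not preserved in the transcript. As a rough illustration of what representing aligned records as F-Logic facts could look like, the sketch below serializes table rows into object molecules of the general form `object:class[attribute->value]`. The function name `to_flogic_facts`, the object ids `r1`, `r2`, ..., the class name `record`, and the attribute names are all assumptions made for this example.

```python
def to_flogic_facts(labels, rows, cls="record"):
    """Serialize aligned rows as F-Logic-style facts (illustrative naming only)."""
    facts = []
    for i, row in enumerate(rows, start=1):
        attrs = "; ".join(
            '{}->"{}"'.format(label, str(value).replace('"', '\\"'))
            for label, value in zip(labels, row)
        )
        facts.append("r{}:{}[{}].".format(i, cls, attrs))
    return facts


for fact in to_flogic_facts(["oldPrice", "newPrice"], [["499.99", "299.95"]]):
    print(fact)  # r1:record[oldPrice->"499.99"; newPrice->"299.95"].
```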

  15. Outline: Introduction; Extraction and Alignment; Table Mining, data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  16. Column Splitting (example): Data item 1: "Save: £3.00 (13%)"; data item 2: "Save: £6.00 (21%)"; data item 3: "Save: £3.00 (9%)". First split: "Save:" | "£3.00", "£6.00", "£3.00" | "(13%)", "(21%)", "(9%)". Further split: "£" | "3.00", "6.00", "3.00" | "(" | "13", "21", "9" | "%)". Token types shown in the figure: text, punctuation, currency, float, int. The figure also marks two subsets (subset 1, subset 2).
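
The splitting example can be approximated with a small typed tokenizer: each cell is cut into tokens tagged with the types named on the slide, and a column is split position by position only when every cell yields the same type sequence. This is a sketch under assumptions, not the authors' data-driven method: the regular expression, the currency symbol set, and the names `tokenize` and `split_column` are illustrative.

```python
import re

# One alternative per token type named on the slide.
TOKEN_RE = re.compile(
    r"""
    (?P<currency>[£$€])
  | (?P<float>\d+\.\d+)
  | (?P<int>\d+)
  | (?P<text>[A-Za-z]+)
  | (?P<punctuation>[^\sA-Za-z\d£$€]+)
    """,
    re.VERBOSE,
)


def tokenize(cell):
    """Return (type, token) pairs for one cell; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(cell)]


def split_column(cells):
    """Split a column into sub-columns if all cells share one type signature."""
    tokenized = [tokenize(cell) for cell in cells]
    signatures = {tuple(kind for kind, _ in tokens) for tokens in tokenized}
    if len(signatures) != 1:
        return [list(cells)]  # heterogeneous column: leave it unsplit
    return [[token for _, token in column] for column in zip(*tokenized)]


columns = split_column(["Save: £3.00 (13%)", "Save: £6.00 (21%)", "Save: £3.00 (9%)"])
```

For the three data items above, `columns[3]` holds the float amounts and `columns[5]` the integer percentages. Unlike the slide, this tokenizer also separates the colon from "Save".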

  17. Splitting Rules: … … … …

  18. Splitting Rules: … …

  19. Outline: Introduction; Extraction and Alignment; Table Mining, data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  20. Label Assignment (figure): three examples, each shown as visual representation, HTML source code, and assignment strategy over columns Col i to Col i+3. (1) Labels precede the values: "List Price: $499.99", "Our Price: $299.95" (tags b, span, br). (2) Labels follow the values: "$499.99 List Price", "$299.95 Our Price" (tags span, b, br). (3) Labels and values sit in separate table cells: "List Price", "Our Price", "$499.99", "$299.95" (tags tr, td, b).

  21. Inter Label Assignment.

  22. Inner Label Assignment.

  23. Column Label Assignment Rules: inter label assignment; inner label assignment.
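
The slides do not define the two strategies in text, so the following sketch rests on one plausible reading: inter label assignment takes a column whose cells all hold the same text (for example "List Price:") as the label of the neighbouring value column, and inner label assignment extracts a label shared as a prefix inside the cells of a single column. The function names and the colon-based cut are assumptions for illustration.

```python
import os


def inter_label(columns):
    """Map value-column index -> label taken from a constant text column to its left."""
    labels = {}
    for i in range(len(columns) - 1):
        distinct = set(columns[i])
        if len(distinct) == 1:
            label = distinct.pop().strip().rstrip(":")
            if label and not any(ch.isdigit() for ch in label):
                labels[i + 1] = label
    return labels


def inner_label(column):
    """Split a shared 'Label:' prefix off every cell; return (label, values)."""
    prefix = os.path.commonprefix(column)  # character-wise common prefix
    cut = prefix.rfind(":")
    if cut <= 0:
        return None, list(column)  # no shared label found
    label = prefix[:cut].strip()
    return label, [cell[cut + 1:].strip() for cell in column]


print(inter_label([["List Price:", "List Price:"], ["$499.99", "$89.00"]]))
print(inner_label(["List Price: $499.99", "List Price: $89.00"]))
```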

  24. Functional Methods and Updates.

  25. Functional Methods and Updates.

  26. Functional Methods and Updates.

  27. Functional Methods and Updates: Solution.

  28. Improvement of Heuristics.

  29. Improvement of Heuristics.

  30. Improvement of Heuristics.

  31. Improvement of Heuristics: Rules can be modified, removed, or added.

  32. Outline: Introduction; Extraction and Alignment; Table Mining, data-driven / statistical methods (Column Splitting, Label Assignment, Arithmetic Dependencies).

  33. Arithmetic Dependencies: Find arithmetic dependencies between numeric columns by checking the homogeneous systems of linear equations [equation] or [equation] for non-trivial solutions.
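
The equations on this slide are missing from the transcript. As a generic illustration of the idea, the sketch below treats each record as one row of a matrix A over the numeric columns and computes a basis of the solutions of A x = 0 by Gauss-Jordan elimination with exact rationals; any non-trivial solution is a candidate linear dependency between the columns. The function name `null_space` and the sample prices are assumptions, and the authors' actual formulation may differ.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of {x : A x = 0}, computed exactly with Gauss-Jordan elimination."""
    a = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(a[0])
    pivots = []  # pivot column of each reduced row, in order
    r = 0
    for c in range(n_cols):
        if r == len(a):
            break
        pivot = next((i for i in range(r, len(a)) if a[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot here: column c is a free variable
        a[r], a[pivot] = a[pivot], a[r]
        pivot_value = a[r][c]
        a[r] = [v / pivot_value for v in a[r]]
        for i in range(len(a)):
            if i != r and a[i][c] != 0:
                factor = a[i][c]
                a[i] = [vi - factor * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        vec = [Fraction(0)] * n_cols
        vec[free] = Fraction(1)
        for row_index, pivot_col in enumerate(pivots):
            vec[pivot_col] = -a[row_index][free]
        basis.append(vec)
    return basis


# Hypothetical records with columns (oldPrice, newPrice, discount).
records = [[23, 20, 3], [28, 22, 6], [33, 30, 3]]
solution = null_space(records)
```

Here `solution` contains the single vector (-1, 1, 1), meaning -oldPrice + newPrice + discount = 0 in every record. Exact rationals avoid floating-point noise; for prices with decimals, passing strings such as "499.99" to `Fraction` keeps the arithmetic exact.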

  35. Arithmetic Dependencies: The system has discovered the arithmetic dependency: newPrice = oldPrice - discount.

  36. Arithmetic Dependencies: The system has discovered the arithmetic dependency: newPrice = oldPrice - discount. Check applied to each record: |oldPrice - newPrice - discount| < threshold.
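
Once a dependency is known, the threshold test can be applied per record to flag rows that violate it, for example because of extraction errors. The sketch assumes that the absolute value of the residual is what gets compared with the threshold; the function name `violations`, the threshold value, and the sample records are illustrative.

```python
def violations(rows, coeffs, threshold):
    """Indices of rows whose residual sum(c_i * v_i) is not within the threshold."""
    bad = []
    for index, row in enumerate(rows):
        residual = sum(c * v for c, v in zip(coeffs, row))
        if abs(residual) >= threshold:
            bad.append(index)
    return bad


# Columns: (oldPrice, newPrice, discount); coefficients encode
# oldPrice - newPrice - discount.
records = [
    (499.99, 299.95, 200.04),  # exact up to floating-point noise
    (19.99, 17.00, 3.00),      # off by one cent, tolerated by the threshold
    (10.00, 5.00, 3.00),       # clearly inconsistent
]
flagged = violations(records, (1, -1, -1), threshold=0.05)
```

With these sample values only the third record is flagged. The tolerance absorbs rounding in displayed prices as well as floating-point error.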

  38. Conclusion (figure: columns labeled oldPrice, description, newPrice, brand, discount). Advantages: Table Mining Heuristics are applied only once for each resource; manual post-processing of heuristics; qualitative information integration based on identified constraints; annotating HTML streams on-the-fly (OntoGather [PPSWR '06]).

  39. Outlook: So far: conversion of structured HTML pages to F-Logic facts. What’s next: use Text Mining techniques to push the limit of structured content (e.g. rental listings).

  40. Questions? Thank you for your attention!
