Ontology-based Web Information Extraction in Practice


  1. Ontology-based Web Information Extraction in Practice: eRecruitment – eTourism – eProcurement
     Japan-Austria Joint Workshop on “ICT”, Tokyo, October 18-19, 2010
     a.Univ.-Prof. Dr. DI Birgit Pröll, Institute for Application Oriented Knowledge Processing, bproell@faw.jku.at

  2. Contents
     • Motivation
     • Web Information Extraction (WebIE) by Examples
     • General Architecture
     • Web Crawler
     • Ontology Aware WebIE
     • Structure Analysis: Page Segmentation, Table Extraction
     • Evaluation & Manual Correction of Results
     • Lessons Learned & Future Work

  3. Web Information Extraction (WebIE)
     …extracting structured data from Web pages into templates, e.g.:
     • accommodation’s name: Alpenrose
     • phone: ++43 (0)524352930
     • address: A-6212 Maurach
     • pool facility: -

  4. WebIE Projects in cooperation with Austrian Industry

  5. Projects’ Requirements and Approach Taken
     Some WebIE peculiarities in the given projects
     • Heterogeneously designed Web pages
     • Mixture of (semi-)structured data and full text
     • Significant structural aspects, e.g.,
       - location of information on Web page
       - information “hidden” in Web tables
     • Information scattered over several Web pages
     • Web site evolution
     WebIE Approaches [Appelt et al., 1999]
     • Screen scraping approaches (wrapper generation)
     • Automatically trainable systems (machine learning)
     • Knowledge-engineering approach
     • Knowledge-engineering approach + Web crawler + structural analysis + …

  7. Overall Architecture
     • Pre-Processing: Crawler (collects Web sites)
     • Information Extraction: IE-Pipeline (GATE *) with, e.g., Tokenizer, Sentence Splitter, Gazetteer Lists, Transducers, Ontology Plugin; yields annotated Web pages
     • Post-Processing: Output as XML, e.g.:
         <?xml version="1.0"?>
         <masterdata>
           <accname>Alpengasthof XYZ</accname>
           <adress>Baumbachstr. 4040 Linz</adress>
           ….
         </masterdata>
     • Knowledge Base: Gazetteer lists, Rules, Domain Ontology
     *) [Cunningham et al, 2006]
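
The output step of the architecture can be illustrated with a minimal Python sketch. It assumes the extracted fields are already available as a flat dictionary; the function name `to_masterdata_xml` is hypothetical, and only the tag names `masterdata`, `accname` and `adress` come from the slide.

```python
import xml.etree.ElementTree as ET


def to_masterdata_xml(record: dict) -> str:
    """Serialise extracted fields into a <masterdata> element (hypothetical helper)."""
    root = ET.Element("masterdata")
    for tag, value in record.items():
        child = ET.SubElement(root, tag)
        child.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


# Field values taken from the slide's sample output.
record = {"accname": "Alpengasthof XYZ", "adress": "Baumbachstr. 4040 Linz"}
xml_text = to_masterdata_xml(record)
print(xml_text)
```

How the real system maps GATE annotations to XML elements is not described on the slides.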

  8. Web Crawler
     • Collects relevant Web pages
     • Classifies Web pages
       - Home page, price pages, location pages, etc.
       - Based on Support Vector Machine
     • Recognises language
       - Using meta-tags and an n-gram based algorithm
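
The slide does not say which n-gram algorithm the crawler uses. As one simple possibility, the sketch below scores a text by how many of its character trigrams occur in a per-language profile. The two training sentences and all function names are assumptions made for illustration; a real system would build its profiles from much larger corpora and would also consult the page's meta-tags.

```python
def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all character n-grams of the lower-cased, whitespace-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Tiny, hypothetical training samples; far too small for real use.
SAMPLES = {
    "en": "the hotel offers rooms with a view of the lake and the mountains and a heated pool",
    "de": "das hotel bietet zimmer mit blick auf den see und die berge und ein beheiztes schwimmbad",
}
PROFILES = {lang: set(char_ngrams(text)) for lang, text in SAMPLES.items()}


def detect_language(text: str) -> str:
    """Pick the language whose profile shares the most trigrams with the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: sum(g in PROFILES[lang] for g in grams))


print(detect_language("a room with a view of the mountains"))
print(detect_language("ein zimmer mit blick auf die berge"))
```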

  9. Overall Architecture
     • Types of annotations: syntactical, morphological; ontological; structural; relevance judging
     • Pipeline as on slide 7 (Pre-Processing: Crawler; Information Extraction: IE-Pipeline (GATE); Post-Processing: Output as XML), extended by Manual Evaluation and Manual Correction
     • Knowledge Base: Gazetteer lists, Rules, Domain Ontology

  10. Regular Expressions & Gazetteer Lookup
      Gazetteer list ‘phone keywords’: Phone, Telephone, Tel., Tel:, Tel.:, Telefon
      Rule (GATE JAPE syntax):
        Rule: Phone1
        (
          {Token.string == "+"}
          {Token.kind == number}
          ({SpaceToken.kind == space})*
          {Token.string == "("}
          {Token.kind == number}
          {Token.string == ")"}
          (({SpaceToken.kind == space})* {Token.kind == number})+
        ):phone
        -->
        :phone.MyPhone = {}
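
For readers without a GATE installation, the Phone1 rule can be approximated with a plain regular expression. This is a rough Python equivalent, not the project's code: it works on characters instead of GATE tokens, and the helper names `find_phones` and `has_phone_keyword` are made up for this sketch.

```python
import re

# Keywords from the slide's 'phone keywords' gazetteer list.
PHONE_KEYWORDS = ["Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon"]

# Character-level approximation of the Phone1 rule.
PHONE_PATTERN = re.compile(
    r"\+"          # {Token.string == "+"}
    r"\d+"         # {Token.kind == number}
    r" *"          # ({SpaceToken.kind == space})*
    r"\(\d+\)"     # "(" number ")"
    r"(?: *\d+)+"  # (space* number)+
)


def find_phones(text: str) -> list[str]:
    """Return all substrings that match the phone pattern."""
    return PHONE_PATTERN.findall(text)


def has_phone_keyword(text: str) -> bool:
    """Gazetteer-style lookup: does the text contain a phone keyword?"""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in PHONE_KEYWORDS)


line = "Tel.: ++43 (0)524352930"
print(has_phone_keyword(line), find_phones(line))
```

As in the rule, the match starts at a "+" that is directly followed by a number, so only the second "+" of "++43" is part of the result.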

  11. Ontology-Aware Entity Recognition (1/2)
      Ontology fragment:
      • pool facilities rdfs:subClassOf hotel amenities
      • rdfs:Label “swimming pool” (lang=en)
      • rdfs:Label “Schwimmbad” (lang=de)
      • owl:Synonym “Hallenbad”

  12. Ontology-Aware Entity Recognition (2/2)
      Example Web page text: “We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!”
      Ontology fragment:
      • pool facilities rdfs:subClassOf hotel amenities
      • pool hasAttribute (ObjectProperty) pool attributes
      • “heated” is recognised as a pool attribute
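
The idea behind these two slides can be sketched in Python: labels and synonyms from the ontology act as a multilingual lookup list, and attribute words found shortly before a match are attached to it. The dictionary layout, the four-word window and the function name `annotate` are assumptions for illustration, not the GATE ontology plugin's behaviour.

```python
import re

# Hypothetical in-memory view of the ontology fragment shown on the slides.
ONTOLOGY = {
    "pool facilities": {
        "subClassOf": "hotel amenities",
        "labels": ["swimming pool", "pool", "Schwimmbad", "Hallenbad"],
        "attributes": ["heated"],
    }
}


def annotate(text: str, window: int = 4) -> list[dict]:
    """Find concept mentions and attach attribute words seen in the preceding words."""
    annotations = []
    for concept, entry in ONTOLOGY.items():
        # Longest labels first, so "swimming pool" wins over "pool".
        labels = sorted(entry["labels"], key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(map(re.escape, labels)) + r")s?\b", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            preceding = re.findall(r"\w+", text[:match.start()].lower())[-window:]
            annotations.append({
                "concept": concept,
                "parent": entry["subClassOf"],
                "text": match.group(0),
                "attributes": [a for a in entry["attributes"] if a in preceding],
            })
    return annotations


sample = "Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool"
for annotation in annotate(sample):
    print(annotation)
```

The word boundary keeps "whirlpool" from matching "pool"; a fixed word window is a crude stand-in for proper linguistic analysis.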

  13. Structure Analysis: Web Page Segmentation
      • Page is segmented into Top part, Content part and Bottom part [Debnath et al., 2005] [Chakrabarti et al., 2007]
      • Templates:
        - job title: Senior Java Developer
        - IT skills + level: JAVA + perfect, MySQL + basic
        - operation area: SW programming, testing
        - language skills: English fluently
        - contact: -

  14. Structure Analysis: Block Identification
      The Content part is divided into blocks:
      • Block: Responsibilities
      • Block: Requirements
      • Block: Offer
      • further blocks (unlabelled on the slide)
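
The slides do not describe how blocks are detected. One simple heuristic, shown here purely as an assumption, is to treat short lines containing a known keyword as block headings and to assign the following lines to that block. The keyword lists and the name `identify_blocks` are hypothetical.

```python
# Hypothetical heading keywords per block type.
BLOCK_KEYWORDS = {
    "Responsibilities": ("responsibilities", "tasks"),
    "Requirements": ("requirements", "profile"),
    "Offer": ("offer", "benefits"),
}


def heading_label(line: str):
    """Return the block name if the line looks like a heading, else None."""
    if len(line.split()) > 4:  # headings are assumed to be short
        return None
    lowered = line.lower()
    for name, keywords in BLOCK_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return name
    return None


def identify_blocks(lines: list[str]) -> dict[str, list[str]]:
    """Group the lines of the content part under the most recent heading."""
    blocks: dict[str, list[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        label = heading_label(line)
        if label is not None:
            current = label
            blocks.setdefault(current, [])
        elif current is not None:
            blocks[current].append(line)
    return blocks


content = [
    "Senior Java Developer",
    "Your responsibilities",
    "SW programming, testing",
    "Requirements",
    "JAVA (perfect)",
    "MySQL (basic)",
    "We offer",
    "Flexible working hours",
]
print(identify_blocks(content))
```

Knowing the block lets later rules restrict themselves, for example looking for IT skills only inside the Requirements block.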

  15. Structure Analysis: Table Data Extraction in Marlies [Yang et al., 2002] [Gatterbauer et al., 2007]
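
As a very reduced illustration of table data extraction, the following Python sketch turns a simple HTML table with a header row into records. It is not the method used in Marlies or TourIE: nested tables, rowspan/colspan and tables used purely for layout are not handled, and the sample table is invented.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell texts of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join text fragments and normalise whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html: str) -> list[dict]:
    """Use the first row as header and map each further row to a dict."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented sample data.
HTML = (
    "<table><tr><th>Room</th><th>Price</th></tr>"
    "<tr><td>Single</td><td>50 EUR</td></tr>"
    "<tr><td>Double</td><td>80 EUR</td></tr></table>"
)
print(table_to_records(HTML))
```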

  16. Structure Analysis: Table Data Extraction in TourIE

  18. Evaluation: TourIE
      • Evaluation results were satisfactory with respect to the preliminary study.
      • Pool facility extraction quality was poor because of an incomplete ontology.

  19. Evaluation: JobOlize
      [Chart: Precision, Recall and F-Measure for Operation Area, IT-Skill and Language Skill; context-driven vs. non context-driven extraction]
      • Page segmentation & block identification raise precision considerably.
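
The three measures in the chart follow their standard definitions. A small Python sketch, using invented skill sets rather than JobOlize data, shows how they are computed from extracted and manually annotated (gold) items:

```python
def precision_recall_f1(extracted: set, gold: set) -> tuple[float, float, float]:
    """Standard precision, recall and balanced F-measure over two item sets."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: three extracted IT skills, four in the gold standard.
extracted = {"Java", "MySQL", "English"}
gold = {"Java", "MySQL", "Linux", "SQL"}
precision, recall, f1 = precision_recall_f1(extracted, gold)
print(f"P={precision:.2f} R={recall:.2f} F={f1:.2f}")
```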

  20. Evaluation: Marlies
      • Marlies Ontology: Classes: 2313; Instances: 2661; Assignments of object properties to instances: 42791
      • Preliminary results for recall: [Chart: recall per template on a 0.00 to 1.00 axis; apart from “Gesamt” (total), the template labels are not recoverable from the transcript]
      • Work in progress (e.g., table extraction).

  21. Manual Correction via Rich Client GUI

  23. Lessons Learned
      • Today’s Web pages do not adhere to standards or semantic Web proposals.
        - Only a few RDF resources available; proposed microformats rarely used
        - Poor HTML, e.g., tables used for layout purposes
        - Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
      • Development & maintenance of knowledge-based WebIE systems is expensive.
        - Domain experts & knowledge engineers are needed.
        - Rule-coding is tedious and error-prone.
        - Evaluation of numerous methods & algorithms; multiplied due to multilinguality
        - Manual evaluation is time consuming.
      • WebIE performance depends considerably on the quality of the domain ontology.
      • We have to observe (evolving) legal issues
        - Robots exclusion standard, Sitemap etc.
        - Further processing of extracted data
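
Respecting the robots exclusion standard can be checked with Python's standard library. The sketch below parses an inline robots.txt so that it runs offline; the rules, the URLs and the user-agent name `webie-crawler` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a crawler would normally download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("webie-crawler", "https://example.org/prices.html")
blocked = parser.can_fetch("webie-crawler", "https://example.org/private/data.html")
print(allowed, blocked)
```

This covers only the technical side; whether extracted data may be processed further is a legal question that code cannot answer.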
