 
Ontology-based Web Information Extraction in Practice
eRecruitment – eTourism – eProcurement
Japan-Austria Joint Workshop on “ICT”, Tokyo, October 18-19, 2010
a.Univ.-Prof. Dr. DI Birgit Pröll
Institute for Application Oriented Knowledge Processing
bproell@faw.jku.at

Contents
• Motivation
• Web Information Extraction (WebIE) by Examples
• General Architecture
• Web Crawler
• Ontology Aware WebIE
• Structure Analysis: Page Segmentation, Table Extraction
• Evaluation & Manual Correction of Results
• Lessons Learned & Future Work
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll

Web Information Extraction (WebIE)
…extracting structured data from Web pages
Template (example):
• accommodation’s name: Alpenrose
• phone: ++43 (0)524352930
• address: A-6212 Maurach
• pool facility: -

WebIE Projects in Cooperation with Austrian Industry

Projects’ Requirements and Approach Taken
Some WebIE peculiarities in the given projects
• Heterogeneously designed Web pages
• Mixture of (semi-)structured data and full text
• Significant structural aspects, e.g.,
  • location of information on the Web page
  • information “hidden” in Web tables
• Information scattered over several Web pages
• Web site evolution
WebIE approaches [Appelt et al., 1999]
• Screen scraping approaches (wrapper generation)
• Automatically trainable systems (machine learning)
• Knowledge-engineering approach
• Approach taken: knowledge-engineering approach + Web crawler + structural analysis + …

Overall Architecture
Pre-Processing → Information Extraction → Post-Processing
• Pre-Processing: Crawler (Web sites → Web pages)
• Information Extraction: IE-Pipeline (GATE *), e.g., Tokenizer, Sentence Splitter, Gazetteer Lists, Transducers, Ontology Plugin (Web pages → annotated Web pages)
• Post-Processing: Output (annotated Web pages → XML)
• Knowledge Base: gazetteer lists, rules, domain ontology
Example XML output:
  <?xml version="1.0"?>
  <masterdata>
    <accname>Alpengasthof XYZ</accname>
    <adress>Baumbachstr. 4040 Linz</adress>
    ….
  </masterdata>
*) [Cunningham et al, 2006]
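
The output step can be illustrated with a minimal sketch, assuming the extracted fields arrive as a plain dictionary. The function name `to_masterdata_xml` is hypothetical; only the element names `masterdata`, `accname` and `adress` come from the slide, and the real output component is not described in the talk.

```python
import xml.etree.ElementTree as ET


def to_masterdata_xml(fields):
    """Serialise extracted fields as children of a <masterdata> element.

    `fields` maps element names to extracted strings (an assumed format).
    """
    root = ET.Element("masterdata")
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


# Values taken from the example on the slide.
record = {"accname": "Alpengasthof XYZ", "adress": "Baumbachstr. 4040 Linz"}
xml_text = to_masterdata_xml(record)
```

Building the tree with an XML library rather than string concatenation keeps the output well-formed even when extracted values contain characters such as `&` or `<`.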

Web Crawler
• Collects relevant Web pages
• Classifies Web pages
  • Home page, price pages, location pages, etc.
  • Based on Support Vector Machine
• Recognises language
  • Using meta-tags and an n-gram based algorithm
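
The language recognition step might look roughly like the following sketch. The slide only states that meta-tags and an n-gram based algorithm are used; the character-trigram overlap score, the tiny training samples and all names below are assumptions for illustration, not the project's implementation.

```python
from collections import Counter

# Tiny hypothetical training samples; a real system would use large corpora.
SAMPLES = {
    "en": "the hotel offers a swimming pool and a sauna with a view of the lake "
          "we are happy to welcome you and your family",
    "de": "das hotel bietet ein schwimmbad und eine sauna mit blick auf den see "
          "wir freuen uns auf sie und ihre familie",
}


def char_trigrams(text):
    """Count character trigrams of lower-cased, whitespace-normalised text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One set of known trigrams per language.
PROFILES = {lang: set(char_trigrams(sample)) for lang, sample in SAMPLES.items()}


def overlap_score(doc_counts, lang_trigrams):
    """Fraction of the document's trigram occurrences known to the language."""
    total = sum(doc_counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for gram, n in doc_counts.items() if gram in lang_trigrams)
    return hits / total


def guess_language(text, declared=None):
    """Trust a declared language (e.g. from a meta-tag), else use trigrams."""
    if declared:
        code = declared.lower().split("-")[0]
        if code in PROFILES:
            return code
    doc_counts = char_trigrams(text)
    return max(PROFILES, key=lambda lang: overlap_score(doc_counts, PROFILES[lang]))
```

For example, `guess_language("Wir bieten ein Schwimmbad mit Blick")` falls back to the trigram comparison, while `guess_language(text, declared="de-AT")` uses the declared value directly.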

Overall Architecture
Types of annotations
• syntactical, morphological
• ontological
• structural
• relevance judging
Pre-Processing → Information Extraction → Post-Processing
• Pre-Processing: Crawler (Web sites → Web pages)
• Information Extraction: IE-Pipeline (GATE), e.g., Tokenizer, Sentence Splitter, Gazetteer Lists, Transducers, Ontology Plugin (Web pages → annotated Web pages)
• Post-Processing: Output (XML), Manual Evaluation, Manual Correction
• Knowledge Base: gazetteer lists, rules, domain ontology

Regular Expressions & Gazetteer Lookup
Gazetteer list ‘phone keywords’: Phone, Telephone, Tel., Tel:, Tel.:, Telefon
Rule: Phone1
(
  {Token.string=="+"}
  {Token.kind==number}
  ({SpaceToken.kind==space})*
  {Token.string=="("}
  {Token.kind==number}
  {Token.string==")"}
  (
    ({SpaceToken.kind==space})*
    {Token.kind==number}
  )+
):phone
-->
:phone.MyPhone = {}
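
For readers without GATE at hand, the Phone1 pattern can be approximated with an ordinary regular expression. This is a rough sketch, not a translation of the JAPE rule: requiring a gazetteer keyword directly before the number is an assumption about how the two resources are combined, and the names `PHONE_RE` and `extract_phones` are hypothetical.

```python
import re

# Keywords from the 'phone keywords' gazetteer list on the slide.
PHONE_KEYWORDS = ["Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon"]

# Longest keywords first, so "Tel.:" is tried before "Tel.".
_keyword = "|".join(
    re.escape(k) for k in sorted(PHONE_KEYWORDS, key=len, reverse=True)
)

# "+" number, optional spaces, "(" number ")", then one or more number groups,
# mirroring the token sequence of the Phone1 rule.
_number = r"\+\d+\s*\(\d+\)(?:\s*\d+)+"

PHONE_RE = re.compile(rf"(?:{_keyword})\s*({_number})")


def extract_phones(text):
    """Return phone numbers that directly follow a phone keyword."""
    return [match.group(1) for match in PHONE_RE.finditer(text)]
```

Usage: `extract_phones("Tel.: +43 (0)5243 52930, Fax on request")` yields the number without the keyword. Unlike GATE tokens, the regular expression works on raw characters, so it does not see token kinds or other annotations.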

Ontology-Aware Entity Recognition (1/2)
• hotel amenities
  • pool facilities (rdfs:subClassOf hotel amenities)
    • rdfs:Label “swimming pool” (lang=en)
    • rdfs:Label “Schwimmbad” (lang=de)
    • owl:Synonym “Hallenbad”

Ontology-Aware Entity Recognition (2/2)
Example text: “We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!”
Ontology fragment used for recognition:
• hotel amenities
  • pool facilities (rdfs:subClassOf hotel amenities), matched by “pool”
    • hasAttribute (ObjectProperty) → pool attributes, matched by “heated”
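
The idea behind ontology-aware recognition can be sketched with a plain dictionary standing in for the domain ontology. The data structure, the function names and the language tag on “Hallenbad” are assumptions; the actual system uses an ontology plugin inside the GATE pipeline.

```python
import re

# Hypothetical in-memory stand-in for the domain ontology on the slides.
ONTOLOGY = {
    "hotel amenities": {"parent": None, "labels": []},
    "pool facilities": {
        "parent": "hotel amenities",
        "labels": [
            ("swimming pool", "en"),
            ("Schwimmbad", "de"),
            ("Hallenbad", "de"),  # synonym; language tag is an assumption
        ],
    },
}


def superclasses(concept):
    """Follow subClassOf links upwards and return all ancestors."""
    chain = []
    parent = ONTOLOGY[concept]["parent"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["parent"]
    return chain


def annotate(text):
    """Annotate label matches (with trailing word characters) with concepts."""
    annotations = []
    for concept, entry in ONTOLOGY.items():
        for label, lang in entry["labels"]:
            pattern = r"\b" + re.escape(label) + r"\w*"
            for match in re.finditer(pattern, text, re.IGNORECASE):
                annotations.append({
                    "text": match.group(0),
                    "concept": concept,
                    "lang": lang,
                    "superclasses": superclasses(concept),
                })
    return annotations
```

Because every match carries its concept and superclasses, a German “Hallenbad” and an English “swimming pools” both end up as pool facilities, and therefore as hotel amenities, without language-specific rules.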

Structure Analysis: Web Page Segmentation
Page parts: Top part, Content part, Bottom part [Debnath et al., 2005] [Chakrabarti et al., 2007]
Template (example):
• job title: Senior Java Developer
• IT skills + level: JAVA + perfect; MySQL + basic
• operation area: SW programming, testing
• language skills: English fluently
• contact: -

Structure Analysis: Block Identification
The content part is divided into blocks, e.g.:
• Block: Responsibilities
• Block: Requirements
• Block: Offer

Structure Analysis: Table Data Extraction in Marlies
[Yang et al., 2002] [Gatterbauer et al., 2007]

Structure Analysis: Table Data Extraction in TourIE

Evaluation: TourIE
• Evaluation results were satisfactory with respect to the preliminary study.
• Pool facility extraction quality was poor because of an incomplete ontology.

Evaluation: JobOlize
[Figure: bar chart of precision, recall and F-measure (y-axis 0.0 to 1.2) for Operation Area, IT-Skill and Language Skill, comparing context-driven extraction with non context-driven extraction]
• Page segmentation & block identification considerably raise precision.
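
For reference, the three measures in the chart follow the standard definitions and can be computed as in this sketch. The function name and the example values are hypothetical and are not taken from the JobOlize evaluation.

```python
def precision_recall_f1(extracted, gold):
    """Compare extracted values with manually annotated (gold) values."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IT-skill annotations for a single job advertisement.
extracted_skills = {"Java", "MySQL", "Linux", "Excel"}
gold_skills = {"Java", "MySQL", "Python"}
precision, recall, f1 = precision_recall_f1(extracted_skills, gold_skills)
```

Here two of four extracted skills are correct (precision 0.5) and two of three gold skills are found (recall 2/3). Restricting extraction to the relevant block of a page removes spurious matches elsewhere on the page, which is why it mainly affects precision.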

Evaluation: Marlies
Marlies Ontology
• Classes: 2313
• Instances: 2661
• Assignments of object properties to instances: 42791
Preliminary results for recall:
[Figure: bar chart of recall per template (y-axis 0.00 to 1.00), with a total (“Gesamt”) entry]
• Work in progress (e.g., table extraction).

Manual Correction via Rich Client GUI

Lessons Learned
• Today’s Web pages do not adhere to standards or Semantic Web proposals.
  • Only a few RDF resources available; proposed microformats rarely used
  • Poor HTML, e.g., tables used for layout purposes
  • Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
• Development & maintenance of knowledge-based WebIE systems is expensive.
  • Domain experts & knowledge engineers are needed.
  • Rule-coding is tedious and error-prone.
  • Evaluation of numerous methods & algorithms; multiplied due to multilinguality
  • Manual evaluation is time consuming.
• WebIE performance considerably depends on the quality of the domain ontology.
• We have to observe (evolving) legal issues.
  • Robots exclusion standard, Sitemap etc.
• Further processing of extracted data
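
Observing the robots exclusion standard can be as simple as the following sketch, which uses Python's standard `urllib.robotparser`. The robots.txt content, the user agent name and the URLs are made up for illustration; a crawler would normally download the file from the target site instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a hotel Web site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Keep only the URLs that the given robots.txt allows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


candidate_urls = [
    "http://example.com/rooms.html",
    "http://example.com/private/rates.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "HypotheticalWebIEBot", candidate_urls)
```

Filtering the crawl frontier this way before fetching keeps the check in one place, so later changes to a site's robots.txt are respected on the next crawl.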