Ontology-based Web Information Extraction in Practice

Ontology-based Web Information Extraction in Practice
eRecruitment – eTourism – eProcurement
Japan-Austria Joint Workshop on “ICT”, Tokyo, October 18-19, 2010
a.Univ.-Prof. Dr. DI Birgit Pröll
Institute for Application Oriented Knowledge Processing
bproell@faw.jku.at

Contents
• Motivation
• Web Information Extraction (WebIE) by Examples
• General Architecture
• Web Crawler
• Ontology Aware WebIE
• Structure Analysis: Page Segmentation, Table Extraction
• Evaluation & Manual Correction of Results
• Lessons Learned & Future Work

Web Information Extraction (WebIE)
…extracting structured data from Web pages into templates
Template (filled example):
• accommodation's name: Alpenrose
• phone: ++43 (0)524352930
• address: A-6212 Maurach
• pool facility: -

WebIE Projects in Cooperation with Austrian Industry

Projects' Requirements and Approach Taken
Some WebIE peculiarities in the given projects
• Heterogeneously designed Web pages
• Mixture of (semi-)structured data and full text
• Significant structural aspects, e.g.,
  - location of information on the Web page
  - information "hidden" in Web tables
• Information scattered over several Web pages
• Web site evolution
WebIE approaches
• Screen scraping approaches (wrapper generation)
• Automatically trainable systems (machine learning)
• Knowledge-engineering approach
• Knowledge-engineering approach + Web crawler + structural analysis + …
[Appelt et al., 1999]


Overall Architecture
Data flow: Web sites → Annotated Web pages → XML
• Pre-Processing: Crawler
• Information Extraction: IE-Pipeline (GATE *) with, e.g., Tokenizer, Sentence-Splitter, Gazetteer-Lists, Transducers, Ontology-Plugin
• Post-Processing: Output
• Knowledge Base: Gazetteer lists, Rules, Domain Ontology
Example output:
    <?xml version="1.0"?>
    <masterdata>
      <accname>Alpengasthof XYZ</accname>
      <adress>Baumbachstr. 4040 Linz</adress>
      ….
    </masterdata>
*) [Cunningham et al, 2006]
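
The Post-Processing step that turns extracted values into XML could be sketched roughly as follows. This is only an illustration, not the project's code: the element names `masterdata`, `accname`, and `adress` are taken from the slide, while the function name `to_masterdata_xml` and the dictionary input are assumptions.

```python
import xml.etree.ElementTree as ET


def to_masterdata_xml(record):
    """Serialize a flat dict of extracted values into a <masterdata> element.

    Hypothetical helper: keys become element names, values become text.
    """
    root = ET.Element("masterdata")
    for field, value in record.items():
        child = ET.SubElement(root, field)
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Sample values as shown on the slide.
record = {
    "accname": "Alpengasthof XYZ",
    "adress": "Baumbachstr. 4040 Linz",
}
xml_output = to_masterdata_xml(record)
print(xml_output)
```

Building the document with an XML library rather than string concatenation means special characters in extracted text (such as `&` or `<`) are escaped automatically.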

Web Crawler
• Collects relevant Web pages
• Classifies Web pages
  - Home page, price pages, location pages, etc.
  - Based on Support Vector Machine
• Recognises language
  - Using meta-tags and an n-gram based algorithm
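
The slide does not say which n-gram algorithm the crawler uses for language recognition. As a minimal sketch of the general idea, the following compares the character trigrams of a text against small per-language trigram profiles and picks the best overlap. The training sentences, the function names, and the overlap score are all assumptions made for illustration; a real system would train on much larger corpora.

```python
from collections import Counter


def trigrams(text):
    """Return all character trigrams of the lower-cased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def build_profile(sample, size=300):
    """Keep the most frequent trigrams of a training sample as the language profile."""
    return {gram for gram, _ in Counter(trigrams(sample)).most_common(size)}


def identify_language(text, profiles):
    """Pick the language whose profile covers the largest share of the text's trigrams."""
    grams = trigrams(text)
    if not grams:
        return None
    scores = {
        lang: sum(gram in profile for gram in grams) / len(grams)
        for lang, profile in profiles.items()
    }
    return max(scores, key=scores.get)


# Tiny, made-up training samples (assumption: real profiles need far more text).
PROFILES = {
    "en": build_profile(
        "the hotel offers a swimming pool and a sauna with a view of the lake and the mountains"
    ),
    "de": build_profile(
        "das hotel bietet ein schwimmbad und eine sauna mit blick auf den see und die berge"
    ),
}

print(identify_language("the pool is heated and the rooms have a view", PROFILES))
```

In line with the slide, such a statistical guess would typically be combined with language hints from the page's meta-tags.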

Overall Architecture
Types of annotations
- syntactical, morphological
- ontological
- structural
- relevance judging
Processing steps: Pre-Processing (Crawler) → Information Extraction (IE-Pipeline, GATE) → Post-Processing (Output as XML) → Manual Evaluation and Manual Correction
Knowledge Base: Gazetteer lists, Rules, Domain Ontology

Regular Expressions & Gazetteer Lookup
Gazetteer list 'phone keywords': Phone, Telephone, Tel., Tel:, Tel.:, Telefon
Rule: Phone1
(
  {Token.string == "+"}
  {Token.kind == number}
  ({SpaceToken.kind == space})*
  {Token.string == "("}
  {Token.kind == number}
  {Token.string == ")"}
  (
    ({SpaceToken.kind == space})*
    {Token.kind == number}
  )+
):phone
-->
:phone.MyPhone = {}
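
For readers unfamiliar with GATE's rule syntax, the Phone1 rule can be approximated by an ordinary regular expression. The sketch below is an approximation only: it works on characters rather than GATE tokens, and the way it attaches a preceding gazetteer keyword (the function `find_phones` and its return format) is an assumption, not something shown on the slide.

```python
import re

# Keywords from the 'phone keywords' gazetteer list on the slide.
PHONE_KEYWORDS = ["Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon"]

# "+" number, optional spaces, "(" number ")", then one or more
# (optional spaces + number) groups, mirroring the structure of rule Phone1.
PHONE_PATTERN = re.compile(r"\+\d+[ ]*\(\d+\)(?:[ ]*\d+)+")


def find_phones(text):
    """Return (phone, keyword) pairs; keyword is None if no gazetteer entry precedes the number."""
    results = []
    for match in PHONE_PATTERN.finditer(text):
        prefix = text[:match.start()].rstrip()
        # Try longer keywords first so "Tel.:" wins over "Tel.".
        keyword = next(
            (k for k in sorted(PHONE_KEYWORDS, key=len, reverse=True) if prefix.endswith(k)),
            None,
        )
        results.append((match.group(0), keyword))
    return results


print(find_phones("Tel.: +43 (0)5243 52930, Fax on request"))
```

Like the original rule, the pattern starts matching at a single "+", so in a string such as "++43 (0)524352930" the match begins at the second plus sign.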

Ontology-Aware Entity Recognition (1/2)
• pool facilities rdfs:subClassOf hotel amenities
• pool facilities rdfs:Label "swimming pool" (lang=en)
• pool facilities rdfs:Label "Schwimmbad" (lang=de)
• pool facilities owl:Synonym "Hallenbad"

Ontology-Aware Entity Recognition (2/2)
Example text: "We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!"
Ontology fragment:
• pool facilities rdfs:subClassOf hotel amenities (matched term: "pool")
• pool facilities hasAttribute (ObjectProperty) pool attributes (matched term: "heated")
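
A minimal sketch of how labels, synonyms, the subclass relation, and the hasAttribute property might drive entity recognition is given below. The dictionaries form a hypothetical miniature ontology built from the terms on the two slides; the function names and the "attach an attribute to the nearest following entity" heuristic are assumptions for illustration, not the method used in the projects.

```python
import re

# Hypothetical miniature ontology based on the slide's example terms.
SUBCLASS_OF = {"pool facilities": "hotel amenities"}
LABELS = {
    "swimming pool": "pool facilities",
    "pool": "pool facilities",
    "Schwimmbad": "pool facilities",
    "Hallenbad": "pool facilities",
    "heated": "pool attributes",
}
HAS_ATTRIBUTE = {"pool facilities": "pool attributes"}


def is_a(cls, ancestor):
    """Follow subclass links upwards to test whether cls is (a subclass of) ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def annotate(text):
    """Return non-overlapping (start, surface, ontology_class) mentions, longest labels first."""
    taken = [False] * len(text)
    mentions = []
    for label in sorted(LABELS, key=len, reverse=True):
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            if not any(taken[m.start():m.end()]):
                for i in range(m.start(), m.end()):
                    taken[i] = True
                mentions.append((m.start(), m.group(0), LABELS[label]))
    return sorted(mentions)


def link_attributes(mentions):
    """Attach each attribute mention to the nearest following entity that allows it (assumed heuristic)."""
    links = []
    for i, (_, surface, cls) in enumerate(mentions):
        for _, target, target_cls in mentions[i + 1:]:
            if HAS_ATTRIBUTE.get(target_cls) == cls:
                links.append((target, surface))
                break
    return links


text = "Indoor swimming pool and a heated outdoor pool"
mentions = annotate(text)
print(mentions)
print(link_attributes(mentions))
```

Because matching uses word boundaries, a term such as "whirlpool" is not annotated as "pool"; covering plurals and compounds would require additional labels or morphological analysis.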

Structure Analysis: Web Page Segmentation
Page parts: Top part, Content part, Bottom part [Debnath et al., 2005] [Chakrabarti et al., 2007]
Template (filled example):
• job title: Senior Java Developer
• IT skills + level: JAVA + perfect, MySQL + basic
• operation area: SW programming, testing
• language skills: English fluently
• contact: -

Structure Analysis: Block Identification
The content part is divided into blocks:
• Block Responsibilities
• Block Requirements
• Block Offer
• further (unlabeled) blocks
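
The slides do not describe how blocks are detected. As a deliberately simplified stand-in, the sketch below groups the lines of a job ad's content part under heading keywords and then restricts IT-skill extraction to the Requirements block. The heading keywords, the skill list, the sample lines, and all function names are hypothetical.

```python
# Hypothetical heading keywords per block type.
BLOCK_HEADINGS = {
    "responsibilities": ("responsibilities", "your tasks"),
    "requirements": ("requirements", "your profile"),
    "offer": ("we offer", "offer"),
}

# Hypothetical gazetteer of IT skills.
IT_SKILLS = {"java", "mysql"}


def classify_heading(line):
    """Return the block type if the line is a known block heading, else None."""
    normalized = line.lower().rstrip(":")
    for block, keywords in BLOCK_HEADINGS.items():
        if normalized in keywords:
            return block
    return None


def identify_blocks(lines):
    """Group non-empty lines under the most recent block heading."""
    blocks = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        heading = classify_heading(stripped)
        if heading:
            current = heading
            blocks.setdefault(current, [])
        elif current:
            blocks[current].append(stripped)
    return blocks


def extract_it_skills(blocks):
    """Context-driven extraction: look for IT skills only inside the requirements block."""
    return [
        token
        for line in blocks.get("requirements", [])
        for token in line.split()
        if token.lower() in IT_SKILLS
    ]


# Made-up sample content loosely following the slide's job ad example.
content = [
    "Senior Java Developer",
    "Responsibilities:",
    "SW programming, testing",
    "Requirements:",
    "JAVA perfect",
    "MySQL basic",
    "We offer",
    "Attractive salary",
]
blocks = identify_blocks(content)
print(blocks)
print(extract_it_skills(blocks))
```

Restricting a rule to the block where its target normally appears is one way to reduce false matches elsewhere on the page.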

Structure Analysis: Table Data Extraction in Marlies
[Yang et al., 2002] [Gatterbauer et al., 2007]

Structure Analysis: Table Data Extraction in TourIE

Evaluation: TourIE
• Evaluation results were satisfactory with respect to the preliminary study.
• Pool facility extraction quality was poor because of an incomplete ontology.

Evaluation: JobOlize
[Bar chart: Precision, Recall, and F-Measure (axis 0 to 1.2) for Operation Area, IT-Skill, and Language Skill, comparing context-driven and non context-driven extraction]
• Page segmentation & block identification considerably raise precision.
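
For reference, precision, recall, and F-measure for a set of extracted values against a manually created gold standard can be computed as in the sketch below. The slide does not state which F-measure variant was used; the balanced F1 is assumed here, and the sample skill sets are made up.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall, and balanced F1 for extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: three extracted IT skills, four in the gold standard, two correct.
extracted_skills = {"java", "mysql", "excel"}
gold_skills = {"java", "mysql", "linux", "oracle"}
precision, recall, f1 = precision_recall_f1(extracted_skills, gold_skills)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```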

Evaluation: Marlies
Marlies Ontology
• Classes: 2313
• Instances: 2661
• Assignments of object properties to instances: 42791
Preliminary results for recall:
[Bar chart: recall (0.00 to 1.00) per template, including a total ("Gesamt") bar; individual template labels not recoverable]
• Work in progress (e.g., table extraction).

Manual Correction via Rich Client GUI

Lessons Learned
• Today's Web pages do not adhere to standards or semantic Web proposals.
  - Only a few RDF resources available; proposed microformats rarely used
  - Poor HTML, e.g., tables used for layout purposes
  - Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
• Development & maintenance of knowledge-based WebIE systems is expensive.
  - Domain experts & knowledge engineers are needed.
  - Rule-coding is tedious and error-prone.
  - Evaluation of numerous methods & algorithms; multiplied due to multilinguality
  - Manual evaluation is time-consuming.
• WebIE performance considerably depends on quality of domain ontology.
• We have to observe (evolving) legal issues
  - Robots exclusion standard, Sitemap etc.
• Further processing of extracted data
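
Observing the robots exclusion standard before crawling can be sketched with Python's standard library. The robots.txt content, the user agent name `WebIEBot`, and the example URLs below are made up; a real crawler would fetch each site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return only those URLs the given robots.txt content permits for the user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Made-up robots.txt and URLs for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""
candidate_urls = [
    "https://hotel.example/rooms.html",
    "https://hotel.example/private/admin.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "WebIEBot", candidate_urls)
print(crawlable)
```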
