Ontology-based Web Information Extraction in Practice: eRecruitment, eTourism, eProcurement
Japan-Austria Joint Workshop on ICT, Tokyo, October 18-19, 2010
a.Univ.-Prof. Dr. DI Birgit Pröll, Institute for Application Oriented Knowledge
Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 2
Contents
- Motivation
- Web Information Extraction (WebIE) by Examples
- General Architecture
- Web Crawler
- Ontology Aware WebIE
- Structure Analysis: Page Segmentation, Table Extraction
- Evaluation & Manual Correction of Results
- Lessons Learned & Future Work
Web Information Extraction (WebIE)
…extracting structured data from Web pages
Template slots: accommodation's name, address, phone, pool facility
Filled template:
- accommodation's name: Alpenrose
- address: A-6212 Maurach
- phone: ++43 (0)524352930
- pool facility
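A template of this kind can be modelled as a record with one optional field per slot. The following is a minimal Python sketch, not the data model used in the projects; the class name `AccommodationTemplate` and the field names are hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class AccommodationTemplate:
    """One extraction template; a slot stays None until a value is extracted."""
    name: Optional[str] = None
    address: Optional[str] = None
    phone: Optional[str] = None
    pool_facility: Optional[str] = None

    def unfilled_slots(self) -> List[str]:
        """Names of the slots that have no value yet."""
        return [slot for slot, value in asdict(self).items() if value is None]


# Values taken from the filled template on the slide; pool facility is left open.
filled = AccommodationTemplate(
    name="Alpenrose",
    address="A-6212 Maurach",
    phone="++43 (0)524352930",
)
```

Calling `filled.unfilled_slots()` lists the slots that still need a value, which is the kind of information a manual correction step would work from.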
WebIE Projects in cooperation with Austrian Industry
Projects' Requirements and Approach Taken
WebIE Approaches
- Screen scraping approaches (wrapper generation)
- Automatically trainable systems (machine learning)
- Knowledge-engineering approach
Some WebIE peculiarities in the given projects
- Heterogeneously designed Web pages
- Mixture of (semi-)structured data and full text
- Significant structural aspects, e.g.,
  - location of information on the Web page
  - information "hidden" in Web tables
- Information scattered over several Web pages
- Web site evolution
Approach taken
- Knowledge-engineering approach + Web crawler + structural analysis + …
[Appelt et al., 1999]
Overall Architecture
- Input: Web sites
- Crawler
- Pre-Processing
- Information Extraction: IE-Pipeline (GATE *), e.g., Tokenizer, Gazetteer Lists, Sentence Splitter, Ontology Plugin, Transducers
- Knowledge Base: Domain Ontology, Gazetteer lists, Rules
- Post-Processing
- Intermediate result: Annotated Web pages
- Output: XML
Example XML output:
    <?xml version="1.0"?>
    <masterdata>
      <accname> Alpengasthof XYZ </accname>
      <adress> Baumbachstr. 4040 Linz </adress>
      …. ….
    </masterdata>
*) [Cunningham et al., 2006]
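The XML output step of the architecture can be illustrated with Python's standard `xml.etree.ElementTree` module. This is only a sketch under assumptions: the function name `to_masterdata_xml` is hypothetical, and only the two elements visible on the slide (`accname`, `adress`) are produced, since the remaining elements are elided there.

```python
import xml.etree.ElementTree as ET


def to_masterdata_xml(name: str, address: str) -> str:
    """Serialize two extracted slots into a <masterdata> element."""
    root = ET.Element("masterdata")
    # Element names are copied from the slide, including the spelling "adress".
    ET.SubElement(root, "accname").text = name
    ET.SubElement(root, "adress").text = address
    return ET.tostring(root, encoding="unicode")


xml_output = to_masterdata_xml("Alpengasthof XYZ", "Baumbachstr. 4040 Linz")
```

Building the tree with a library rather than by string concatenation means characters such as `&` or `<` in extracted values are escaped automatically.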
Web Crawler
- Collects relevant Web pages
- Classifies Web pages
- Home page, price pages, location pages, etc.
- Based on Support Vector Machine
- Recognises language
- Using meta-tags and an n-gram based algorithm
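Language recognition from meta-tags with an n-gram fallback can be sketched as follows. This is an illustration, not the crawler's actual algorithm: the two one-sentence training samples, the use of character trigrams with cosine similarity, and all function names are assumptions made for the example. A real system would build its profiles from large corpora.

```python
import re
from collections import Counter
from math import sqrt

# Tiny illustrative training samples (assumption); real profiles need far more text.
SAMPLES = {
    "en": "the hotel offers a swimming pool and rooms with a view of the lake",
    "de": "das hotel bietet ein schwimmbad und zimmer mit blick auf den see",
}


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams of the lower-cased text."""
    text = " " + re.sub(r"\s+", " ", text.lower()).strip() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

# Matches <meta http-equiv="content-language" content="xx"> in this attribute order.
META_LANG = re.compile(
    r'<meta[^>]+http-equiv=["\']content-language["\'][^>]+content=["\']([a-zA-Z]{2})',
    re.IGNORECASE,
)


def detect_language(html: str) -> str:
    """Prefer a declared language; otherwise fall back to trigram similarity."""
    declared = META_LANG.search(html)
    if declared:
        return declared.group(1).lower()
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping for the sketch
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Checking the meta-tag first is cheap, but declared languages on real pages are often missing or wrong, which is presumably why a content-based fallback is combined with it.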
Overall Architecture (continued)
Types of annotations
- syntactical, morphological
- ontological
- structural
- relevance judging
Manual steps: Manual Evaluation, Manual Correction
Regular Expressions & Gazetteer Lookup
// "+", number, optional spaces, "(", number, ")", then one or more number groups
Rule: Phone1
(
  {Token.string == "+"}
  {Token.kind == number}
  ({SpaceToken.kind == space})*
  {Token.string == "("}
  {Token.kind == number}
  {Token.string == ")"}
  (({SpaceToken.kind == space})* {Token.kind == number})+
):phone
-->
:phone.MyPhone = {}

Gazetteer list 'phone keywords':
Phone
Telephone
Tel.
Tel:
Tel.:
Telefon
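For readers without a GATE installation, the combination of a pattern rule and a gazetteer lookup can be approximated with Python's `re` module. This is a rough stand-in for the JAPE rule, not an equivalent of it: it works on characters instead of GATE tokens, and the 20-character look-back window and the function name `find_phones` are assumptions.

```python
import re

# Keywords copied from the 'phone keywords' gazetteer list on the slide.
PHONE_KEYWORDS = ["Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon"]

# "+" number, optional spaces, "(" number ")", then one or more number groups.
PHONE_PATTERN = re.compile(r"\+\d+\s*\(\d+\)(?:\s*\d+)+")

# Longest keywords first so that "Tel.:" is preferred over "Tel.".
KEYWORD_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PHONE_KEYWORDS, key=len, reverse=True))
)


def find_phones(text: str) -> list:
    """Return (number, keyword-or-None) pairs for every phone pattern match."""
    results = []
    for match in PHONE_PATTERN.finditer(text):
        # Look for a gazetteer keyword in the 20 characters before the number.
        window = text[max(0, match.start() - 20):match.start()]
        keywords = KEYWORD_PATTERN.findall(window)
        results.append((match.group(), keywords[-1] if keywords else None))
    return results
```

A preceding keyword can serve as supporting evidence that a digit sequence really is a phone number and not, for example, a fax number or a price.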
Ontology-Aware Entity Recognition (1/2)
- Classes: hotel amenities, pool facilities (linked by rdfs:subClassOf)
- Labels (rdfs:label): swimming pool (lang=en), Schwimmbad (lang=de)
- Synonym (owl:Synonym): Hallenbad
Ontology-Aware Entity Recognition (2/2)
- Classes (rdfs:subClassOf): hotel amenities, pool facilities, pool
- hasAttribute (ObjectProperty): links pool to pool attributes, e.g., heated
Example text:
We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!
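Ontology-aware recognition of this kind can be sketched with plain dictionaries standing in for the domain ontology. Everything below is an assumption made for illustration: the mapping of labels to the class `pool`, the naive plural handling, and the three-word window used to attach attributes are not taken from the projects.

```python
import re

# Hypothetical mini ontology mirroring the slides: class -> parent class.
SUBCLASS_OF = {
    "pool": "pool facilities",
    "pool facilities": "hotel amenities",
}

# Surface forms (labels and synonyms in any language) -> ontology class.
LABELS = {
    "swimming pool": "pool",
    "pool": "pool",
    "schwimmbad": "pool",
    "hallenbad": "pool",
}

# Attribute values that may be attached to a pool mention via hasAttribute.
POOL_ATTRIBUTES = {"heated"}

# Longest labels first so "swimming pool" is not reduced to "pool";
# the optional "s" is a naive English plural.
LABEL_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(label) for label in sorted(LABELS, key=len, reverse=True))
    + r")s?\b",
    re.IGNORECASE,
)


def ancestors(cls: str) -> list:
    """Follow the subclass links from a class up to the root."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def annotate(text: str) -> list:
    """Return one dict per label match, with superclasses and nearby attributes."""
    annotations = []
    for m in LABEL_PATTERN.finditer(text):
        cls = LABELS[m.group(1).lower()]
        # Attributes are searched in the three words before the mention.
        before = text[:m.start()].lower().split()[-3:]
        annotations.append({
            "text": m.group(),
            "class": cls,
            "superclasses": ancestors(cls),
            "attributes": sorted(POOL_ATTRIBUTES & set(before)),
        })
    return annotations
```

Applied to the hotel text above, the sketch finds "swimming pools" and the "heated natural outdoor pool" but skips "whirlpool", because the word boundary prevents a match inside a longer word.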
Structure Analysis: Web Page Segmentation
Page segments: Top part, Content part, Bottom part
Templates (example job ad):
- job title: Senior Java Developer
- IT skills + level: JAVA + perfect, MySQL + basic
- operation area: SW programming, testing
- language skills: English fluently
- contact
Page footer text in the example: powered by Typo3
[Debnath et al., 2005] [Chakrabarti et al., 2007]
Structure Analysis: Block Identification
Content part: divided into blocks (e.g., Requirements, Offer, Responsibilities)
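Block identification can be pictured as grouping the lines of the content part under their block headings, so that later extraction rules can take the block into account. The slides do not state how blocks are detected; the keyword-based heading test below is a simple stand-in, and the function name `split_into_blocks` is hypothetical.

```python
# Block names from the slide; matching headings by keyword is an assumption.
BLOCK_HEADINGS = {"requirements", "offer", "responsibilities"}


def split_into_blocks(lines: list) -> dict:
    """Group non-empty lines under the most recent block heading."""
    blocks = {}
    current = None
    for line in lines:
        key = line.strip().rstrip(":").lower()
        if key in BLOCK_HEADINGS:
            current = key
            blocks[current] = []
        elif current is not None and line.strip():
            blocks[current].append(line.strip())
    return blocks


job_ad = [
    "Senior Java Developer",
    "Responsibilities:",
    "Develop software in a Java team",
    "Requirements:",
    "Java (perfect)",
    "MySQL (basic)",
    "Offer",
    "Work in a team of Java experts",
]
blocks = split_into_blocks(job_ad)
```

With the blocks separated, an IT-skill rule could be restricted to `blocks["requirements"]`, so that the mentions of Java in the other blocks are not extracted as required skills.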
Structure Analysis: Table Data Extraction in Marlies
Japanese-Austrian Workshop on Natural Language and Spatio-Temporal Information, 1st-2nd Oct. 2009 – Birgit Pröll
[Yang et al., 2002] [Gatterbauer et al., 2007]
Structure Analysis: Table Data Extraction in TourIE
Evaluation: TourIE
Evaluation results were satisfactory with respect to the preliminary study. Pool facility extraction quality was poor because of an incomplete ontology.
Evaluation: JobOlize
[Figure: bar chart of Precision, Recall, and F-Measure (axis from 0.0 to 1.2) for the slots Operation Area, IT-Skill, and Language Skill, comparing Context-driven Extraction with Non Context-driven Extraction]
Page segmentation & block identification raise precision considerably.
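For reference, the three measures in the chart can be computed per slot from the set of extracted values and the set of manually annotated (gold) values. The sketch below assumes the balanced F-measure (F1); the slides do not state which weighting was used, and the example values are invented for illustration.

```python
def precision_recall_f1(extracted: set, gold: set) -> tuple:
    """Precision, recall, and balanced F-measure for one slot."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: four extracted IT skills, three skills in the gold standard.
p, r, f = precision_recall_f1(
    extracted={"Java", "MySQL", "Linux", "Perl"},
    gold={"Java", "MySQL", "Python"},
)
```

In this example two of the four extracted values are correct, so precision is 0.5, while two of the three gold values are found, so recall is about 0.67. Removing false positives such as those from unrelated page blocks raises precision without affecting recall.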
Evaluation: Marlies
Work in progress (e.g., table extraction).
Marlies Ontology
- Classes: 2313
- Instances: 2661
- Assignments of object properties to instances: 42791
Preliminary results for recall:
[Figure: bar chart "Gesamt" (overall), axis "Templates %" from 0.00 to 1.00, with bars for Gesamt (overall), Name, Adresse (address), Mail, Telefon (phone), Fax, UID, Bezeichnung (designation), Attribute (attributes), Material, Hauptmaß (main dimension), H-Einheit (unit), H-Wert (value), XYZ, Material]
Manual Correction via Rich Client GUI
Lessons Learned
- Today's Web pages do not adhere to standards or Semantic Web proposals.
- Only a few RDF resources available; proposed microformats rarely used
- Poor HTML, e.g., tables used for layout purposes
- Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
- Development & maintenance of knowledge-based WebIE systems are expensive.
- Domain experts & knowledge engineers are needed.
- Rule-coding is tedious and error-prone.
- Evaluation of numerous methods & algorithms; multiplied due to multilinguality
- Manual evaluation is time-consuming.
- WebIE performance depends considerably on the quality of the domain ontology.
- We have to observe (evolving) legal issues
- Robots exclusion standard, Sitemap etc.
- Further processing of extracted data
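Observing the robots exclusion standard is straightforward to automate in a crawler. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the user agent name `ExampleCrawler`, and the example.com URLs are invented for illustration, and the rules are parsed from a string so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a crawler would download it from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
# "ExampleCrawler" is a hypothetical user agent name.
allowed = parser.can_fetch("ExampleCrawler", "https://example.com/prices.html")
blocked = parser.can_fetch("ExampleCrawler", "https://example.com/private/data.html")
```

Such a check only covers the technical convention. Whether extracted data may be processed further remains a legal question that code cannot answer.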
Future Work: Ontosophia
Ontology-driven IE Supported by (Semi-) Automatic Corrective Feedback
Components shown in the architecture diagram:
- Domain Expert(s), Validation GUI
- Documents
- Ontology-driven Information Extraction: Rule-Generator, Information Extraction Pipeline, IE-Evaluation
- Ontology Lookup, Document Annotation, Template Filling
- Ontology Optimization: Ontology Learning, Ontology Population, Ontology Evaluation
- Visualization of IE-Process, Error Trace Back
- IE-Rules, Extraction Domain Ontology
Numbered building blocks:
- (1) Extraction Domain Ontology
- (2) Ontology Optimization
- (3) Visual Error Trace Back
- (4) Evaluation Support