Ontology-based Web Information Extraction in Practice: eRecruitment - eTourism - eProcurement


Japan-Austria Joint Workshop on ICT, Tokyo, October 18-19, 2010. a.Univ.-Prof. Dr. DI Birgit Pröll, Institute for Application Oriented Knowledge Processing.


SLIDE 1

Ontology-based Web Information Extraction in Practice
eRecruitment - eTourism - eProcurement

Japan-Austria Joint Workshop on “ICT”

Tokyo, October 18-19, 2010

a.Univ.-Prof. Dr. DI Birgit Pröll
Institute for Application Oriented Knowledge Processing
bproell@faw.jku.at

SLIDE 2

Japan-Austria Joint Workshop on “ICT”, Oct. 18-19, 2010 – Birgit Pröll 2

Contents

  • Motivation
  • Web Information Extraction (WebIE) by Examples
  • General Architecture
  • Web Crawler
  • Ontology Aware WebIE
  • Structure Analysis: Page Segmentation, Table Extraction
  • Evaluation & Manual Correction of Results
  • Lessons Learned & Future Work

SLIDE 3

Web Information Extraction (WebIE)

…extracting structured data from Web pages

Template (empty):

  • accommodation’s name:
  • address:
  • phone:
  • pool facility:

Template (filled):

  • accommodation’s name: Alpenrose
  • address: A-6212 Maurach
  • phone: ++43 (0)524352930
  • pool facility:

SLIDE 4

WebIE Projects in cooperation with Austrian Industry

SLIDE 5

Projects' Requirements and Approach Taken

WebIE Approaches

  • Screen scraping approaches (wrapper generation)
  • Automatically trainable systems (machine learning)
  • Knowledge-engineering approach

Some WebIE peculiarities in the given projects

  • Heterogeneously designed Web pages
  • Mixture of (semi-)structured data and full text
  • Significant structural aspects, e.g.,
      • location of information on the Web page
      • information "hidden" in Web tables
  • Information scattered over several Web pages
  • Web site evolution

Approach taken: knowledge-engineering approach + Web crawler + structural analysis + …

[Appelt et al., 1999]

SLIDE 6

Contents

  • Motivation
  • Web Information Extraction (WebIE) by Examples
  • General Architecture
  • Web Crawler
  • Ontology Aware WebIE
  • Structure Analysis: Page Segmentation, Table Extraction
  • Evaluation & Manual Correction of Results
  • Lessons Learned & Future Work

SLIDE 7

Overall Architecture

[Figure: overall architecture]

  • Input: Web sites
  • Processing stages: Crawler, Pre-Processing, Information Extraction, Post-Processing, Output
  • IE-Pipeline (GATE *), e.g.: Tokenizer, Sentence-Splitter, Gazetteer-Lists, Ontology-Plugin, Transducers
  • Knowledge Base: Domain Ontology, Gazetteer lists, Rules
  • Intermediate result: Annotated Web pages
  • Output format: XML, e.g.,

    <?xml version="1.0"?>
    <masterdata>
      <accname> Alpengasthof XYZ </accname>
      <adress> Baumbachstr. 4040 Linz </adress>
      …. ….
    </masterdata>

*) [Cunningham et al., 2006]
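
The output step of the architecture can be illustrated with a minimal Python sketch. It assumes the extracted values arrive as a plain dictionary; the function name `to_masterdata_xml` and the record layout are hypothetical. Only the element names `masterdata`, `accname`, and `adress` come from the slide's example output.

```python
import xml.etree.ElementTree as ET


def to_masterdata_xml(record: dict) -> str:
    """Serialize extracted slot values as a <masterdata> XML fragment.

    Element names follow the slide's example output; the record layout
    is an assumption made for this sketch.
    """
    root = ET.Element("masterdata")
    for tag, value in record.items():
        child = ET.SubElement(root, tag)
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Values taken from the slide's XML example.
example = {"accname": "Alpengasthof XYZ", "adress": "Baumbachstr. 4040 Linz"}
xml_text = to_masterdata_xml(example)
```

Keeping the output step separate from the extraction pipeline lets the same slot values be written to other formats later without touching the extraction rules.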

SLIDE 8

Web Crawler

  • Collects relevant Web pages
  • Classifies Web pages
      • Home page, price pages, location pages, etc.
      • Based on Support Vector Machine
  • Recognises language
      • Using meta-tags and an n-gram based algorithm
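
The slides do not say which n-gram algorithm the crawler uses. As an illustration only, the following Python sketch compares character trigram profiles. The reference texts, function names, and the overlap score are all assumptions made for this example, not the project's method.

```python
from collections import Counter


def trigram_profile(text: str) -> Counter:
    """Count character trigrams of a lower-cased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def overlap(a: Counter, b: Counter) -> float:
    """Share of trigram occurrences in `a` that are also covered by `b`."""
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, b[gram]) for gram, count in a.items())
    return shared / total


# Tiny reference texts, invented for this sketch; a real system would
# build profiles from much larger corpora per language.
REFERENCE_TEXTS = {
    "en": "the hotel offers rooms with a view of the lake and a swimming pool",
    "de": "das hotel bietet zimmer mit blick auf den see und ein schwimmbad",
}
PROFILES = {lang: trigram_profile(text) for lang, text in REFERENCE_TEXTS.items()}


def guess_language(text: str) -> str:
    """Return the language whose reference profile overlaps most with `text`."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: overlap(profile, PROFILES[lang]))
```

In a crawler, a check of this kind would typically serve as a fallback when the language meta-tag is missing or unreliable.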

SLIDE 9

Overall Architecture

[Figure: overall architecture as on slide 7 (Web sites, Crawler, Pre-Processing, Information Extraction with the GATE IE-Pipeline, Post-Processing, XML output; Knowledge Base with Domain Ontology, Gazetteer lists, and Rules), here with Manual Evaluation and Manual Correction added]

Types of annotations

  • syntactical, morphological
  • ontological
  • structural
  • relevance judging

SLIDE 10

Regular Expressions & Gazetteer Lookup

    Rule: Phone1
    (
      {Token.string == "+"}
      {Token.kind == number}
      ({SpaceToken.kind == space})*
      {Token.string == "("}
      {Token.kind == number}
      {Token.string == ")"}
      (
        ({SpaceToken.kind == space})*
        {Token.kind == number}
      )+
    ):phone
    -->
    :phone.MyPhone = {}

Gazetteer list 'phone keywords'

    Phone
    Telephone
    Tel.
    Tel:
    Tel.:
    Telefon
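
For readers unfamiliar with JAPE, the same pattern can be approximated with an ordinary regular expression. The Python sketch below is a rough counterpart, not a translation of GATE's tokenizer semantics. The function names are hypothetical, and the keyword check is a plain case-sensitive substring test.

```python
import re

# Regex counterpart of the Phone1 rule: "+", digits, optional spaces,
# a parenthesised digit group, then one or more digit groups.
PHONE_PATTERN = re.compile(r"\+\d+\s*\(\d+\)(?:\s*\d+)+")

# Keywords copied from the 'phone keywords' gazetteer list on the slide.
PHONE_KEYWORDS = ("Phone", "Telephone", "Tel.", "Tel:", "Tel.:", "Telefon")


def find_phone_numbers(text: str) -> list[str]:
    """Return all substrings that match the phone pattern."""
    return PHONE_PATTERN.findall(text)


def has_phone_keyword(text: str) -> bool:
    """Check whether any gazetteer keyword occurs in the text."""
    return any(keyword in text for keyword in PHONE_KEYWORDS)
```

A keyword hit close to a pattern match can serve as supporting evidence that the number is a phone number rather than a fax number.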

SLIDE 11

Ontology-Aware Entity Recognition (1/2)

[Figure: ontology fragment]

  • Classes: hotel amenities, pool facilities (related via rdfs:subClassOf)
  • Terms attached via rdfs:label (lang=en, lang=de) and wl:Synonym: swimming pool, Schwimmbad, Hallenbad
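
A minimal Python sketch of ontology-aware lookup follows. It assumes that all three terms from the figure refer to the class "pool facilities"; the figure residue does not show which term is a label and which is a synonym. The dictionary layout and function names are hypothetical.

```python
# Toy ontology fragment; class names and terms are taken from the slide,
# but the dictionary layout is an assumption made for this sketch.
SUBCLASS_OF = {"pool facilities": "hotel amenities"}

# Surface forms (labels and synonyms alike) mapped to (class, language).
SURFACE_FORMS = {
    "swimming pool": ("pool facilities", "en"),
    "schwimmbad": ("pool facilities", "de"),
    "hallenbad": ("pool facilities", "de"),
}


def ancestors(cls: str) -> list[str]:
    """Follow rdfs:subClassOf links upwards from `cls`."""
    chain = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain


def annotate(text: str) -> list[dict]:
    """Return one annotation per known surface form found in `text`."""
    lowered = text.lower()
    hits = []
    for term, (cls, lang) in SURFACE_FORMS.items():
        start = lowered.find(term)
        if start != -1:
            hits.append({"term": term, "start": start, "class": cls,
                         "lang": lang, "superclasses": ancestors(cls)})
    return hits
```

Because every surface form resolves to an ontology class, a German "Hallenbad" and an English "swimming pool" end up in the same template slot.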

SLIDE 12

Ontology-Aware Entity Recognition (2/2)

[Figure: ontology fragment]

  • Classes: hotel amenities, pool facilities, pool, pool attributes (e.g., heated)
  • Relations: rdfs:subClassOf, hasAttribute (ObjectProperty)

Example text:

We offer a wonderful 2500m2 wellness area, lead by a trained wellness team. Indoor swimming pools, new heated natural outdoor pool with sandy beach, open air whirlpool with a wonderful view of lake Caldaro, large sauna world, and our private beach directly at lake Caldaro, full fill all wishes!
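
How an attribute such as "heated" might be linked to a pool mention in the sample text above can be sketched in Python. The slides do not describe the actual mechanism. The token window, its size, and the function name are assumptions made for this example.

```python
import re

# Attribute terms that may describe a pool; "heated" is from the slide,
# the window size and function name are assumptions of this sketch.
POOL_ATTRIBUTES = {"heated"}


def pool_attributes(text: str, window: int = 3) -> list[tuple[str, list[str]]]:
    """Pair each 'pool' token with attribute terms in the preceding window."""
    tokens = re.findall(r"\w+", text.lower())
    results = []
    for i, token in enumerate(tokens):
        if token in ("pool", "pools"):
            # Look at up to `window` tokens directly before the mention.
            context = tokens[max(0, i - window):i]
            found = [t for t in context if t in POOL_ATTRIBUTES]
            results.append((token, found))
    return results
```

Applied to "Indoor swimming pools, new heated natural outdoor pool with sandy beach", the sketch attaches "heated" only to the outdoor pool and leaves the indoor pools without attributes.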

SLIDE 13

Structure Analysis: Web Page Segmentation

[Figure: job advertisement page divided into Top part, Content part, and Bottom part; visible page text includes "powered by Typo3"]

Template (filled):

  • job title: Senior Java Developer
  • IT skills + level: JAVA + perfect, MySQL + basic
  • operation area: SW programming, testing
  • language skills: English fluently
  • contact:

[Debnath et al., 2005] [Chakrabarti et al., 2007]

SLIDE 14

Structure Analysis: Block Identification

[Figure: the Content part is divided into blocks, e.g., Requirements, Offer, Responsibilities]
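
One simple way to identify such blocks is to group the lines of the content part under the most recent known heading. The Python sketch below shows this idea. It is an assumption for illustration, not the project's documented algorithm, and the function name is hypothetical.

```python
# Heading keywords from the slide; mapping lines to blocks by heading is
# an assumption of this sketch, not the project's documented algorithm.
BLOCK_HEADINGS = ("Requirements", "Offer", "Responsibilities")


def identify_blocks(lines: list[str]) -> dict[str, list[str]]:
    """Group content-part lines under the most recent known heading."""
    blocks: dict[str, list[str]] = {}
    current = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        heading = next((h for h in BLOCK_HEADINGS
                        if stripped.lower().startswith(h.lower())), None)
        if heading is not None:
            current = heading
            blocks.setdefault(current, [])
        elif current is not None:
            blocks[current].append(stripped)
    return blocks
```

Once lines are assigned to blocks, extraction rules can be restricted by context, for example by looking for IT skills only inside the Requirements block.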

SLIDE 15

Structure Analysis: Table Data Extraction in Marlies

Japanese-Austrian Workshop on Natural Language and Spatio-Temporal Information, 1st-2nd Oct. 2009 – Birgit Pröll

[Yang et al., 2002] [Gatterbauer et al., 2007]

SLIDE 16

Structure Analysis: Table Data Extraction in TourIE

SLIDE 17

Contents

  • Motivation
  • Web Information Extraction (WebIE) by Examples
  • General Architecture
  • Web Crawler
  • Ontology Aware WebIE
  • Structure Analysis: Page Segmentation, Table Extraction
  • Evaluation & Manual Correction of Results
  • Lessons Learned & Future Work

SLIDE 18

Evaluation: TourIE

Evaluation results were satisfactory with respect to the preliminary study. Pool facility extraction quality was poor because of an incomplete ontology.

SLIDE 19

Evaluation: JobOlize

[Chart: Precision, Recall, and F-Measure (axis 0 to 1.2) for the slots Operation Area, IT-Skill, and Language Skill, comparing Context-driven Extraction with Non Context-driven Extraction]

Page segmentation & block identification considerably raise precision.
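
The three measures in the chart can be computed per slot as in the Python sketch below. The slides do not state which F-measure variant was used; the balanced F1 is assumed here, and the function name is hypothetical.

```python
def precision_recall_f1(extracted: set, gold: set) -> tuple[float, float, float]:
    """Compute precision, recall and balanced F-measure for one slot."""
    true_positives = len(extracted & gold)
    # Precision: share of extracted values that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold-standard values that were found.
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

If the system extracts three IT skills of which two are correct, and the gold standard lists four, precision is 2/3, recall is 1/2, and F1 is 4/7.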

SLIDE 20

Evaluation: Marlies

Work in progress (e.g., table extraction).

Marlies Ontology

  • Classes: 2313
  • Instances: 2661
  • Assignments of object properties to instances: 42791

Preliminary results for recall:

[Chart: recall per template (axis "Templates %", scale 0.00 to 1.00) for Total (Gesamt), Name, Address (Adresse), Mail, Phone (Telefon), Fax, UID, Designation (Bezeichnung), Attributes (Attribute), Material, Main dimension (Hauptmaß), Ho Einheit (unit), Ho Wert (value), X, Y, Z, Material]

SLIDE 21

Manual Correction via Rich Client GUI

SLIDE 22

Contents

  • Motivation
  • Web Information Extraction (WebIE) by Examples
  • General Architecture
  • Web Crawler
  • Ontology Aware WebIE
  • Structure Analysis: Page Segmentation, Table Extraction
  • Evaluation & Manual Correction of Results
  • Lessons Learned & Future Work

SLIDE 23

Lessons Learned

  • Today's Web pages do not adhere to standards or Semantic Web proposals.
      • Only a few RDF resources available; proposed microformats rarely used
      • Poor HTML, e.g., tables used for layout purposes
      • Web 2.0 coded Web pages in progress; content-based image retrieval & OCR
  • Development & maintenance of knowledge-based WebIE systems is expensive.
      • Domain experts & knowledge engineers are needed.
      • Rule-coding is tedious and error-prone.
      • Evaluation of numerous methods & algorithms; multiplied due to multilinguality
      • Manual evaluation is time-consuming.
  • WebIE performance considerably depends on quality of domain ontology.
  • We have to observe (evolving) legal issues
      • Robots exclusion standard, Sitemap etc.
      • Further processing of extracted data
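
Observing the Robots exclusion standard mentioned in the list above can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent name `webie-crawler`, and the URLs are invented for this example. A real crawler would fetch the file from the target site instead of embedding it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url: str, user_agent: str = "webie-crawler") -> bool:
    """Check a URL against the parsed robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would run such a check before every request and skip any URL for which it returns False.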

SLIDE 24

Future Work: Ontosophia

Ontology-driven IE Supported by (Semi-) Automatic Corrective Feedback

[Figure: Ontosophia architecture]

  • Actors and inputs: Domain Expert(s), Documents, Domain Ontology
  • Components: Validation GUI, Ontology-driven Information Extraction Rule-Generator, IE-Rules, Information Extraction Pipeline, IE-Evaluation, Ontology Optimization, Visualization of IE-Process, Error Trace Back
  • Pipeline steps: Ontology Lookup, Document Annotation, Template Filling
  • Ontology tasks: Ontology Learning, -Population, -Evaluation
  • Numbered feedback areas: (1) Extraction Domain Ontology, (2) Ontology Optimization, (3) Visual Error Trace Back, (4) Evaluation Support

SLIDE 25

Thank you for your Attention!

Acknowledgements to:

Christina Feilmayr, Stefan Parzer, Christina Buttinger, Michael Guttenbrunner