BOA: Bootstrapping the Linked Data Web

Jul 06, 2023

SLIDE 1

BOA

Bootstrapping the Linked Data Web

Daniel Gerber, Axel-Cyrille Ngonga Ngomo

AKSW, Universität Leipzig

http://www.volunteer-conservation-peru.org

http://aksw.org/files/boa.pdf

SLIDE 2

Motivation

most knowledge bases extracted from (semi)-structured data
Linked Data Cloud grows
BUT: only 15-20% of information
How can we extract data from the document-oriented web?

SLIDE 3

Idea

start with triples from the Data Web
extract natural language patterns which express predicates found in triples
combine patterns & NLP to find labels which stand in relation with predicate
generate RDF and feed it into Data Web

SLIDE 4

Related Work

NLP & RDF:
Fox, Extractiv, Alchemy, OpenCalais
NELL [CAR+10]
initial ontology: 100+ categories/relations
PROSPERA [NDA+11]
harvesting of n-grams-itemset patterns ➤ generalisation without adding noise

[JUR+10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. ACL, pages 1003–1011, 2009.

SLIDE 5

The BOA approach

SLIDE 6

Corpus extraction

Crawler: starts from seed pages, removes HTML
Cleaner: sentence boundary detection (SBD), UTF-8 filters to remove noise
Indexer: sentences are indexed

SLIDE 7

Knowledge acquisition

Class C that serves as the rdfs:domain or as the rdfs:range of predicate p
knowledge base for background knowledge
extract statements with entities of rdf:type C as subject or object
db:Place, db:Person, db:Organisation

SLIDE 8

Pattern Search

set of entities s and o connected through p
find sentences which contain s & o, strip the rest
replace labels with variables (?D?, ?R?)
A BOA pattern is a pair P = (μ(p), θ), where μ(p) is p's URI and θ is a natural language representation of p.
A BOA pattern mapping is a function M such that M(p) = S, where S is the set of natural language representations for p.
Occurrences, sentences, labels p is learned from, number of occurrences for each label combination
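
The pattern search steps can be illustrated with a minimal Python sketch. It is not the BOA implementation: the function names `extract_pattern` and `build_pattern_mapping`, the plain substring matching, and the toy triple and sentences are all assumptions made for illustration. It assumes ?D? stands for the subject (domain) label and ?R? for the object (range) label.

```python
from collections import defaultdict


def extract_pattern(sentence, subject_label, object_label):
    """Return the span between both labels with the labels replaced by
    ?D? (subject) and ?R? (object), or None if the sentence does not fit.

    Hypothetical helper: labels are matched as plain substrings."""
    s_pos = sentence.find(subject_label)
    o_pos = sentence.find(object_label)
    if s_pos == -1 or o_pos == -1:
        return None
    if s_pos < o_pos:
        first_end, second_start = s_pos + len(subject_label), o_pos
        first_var, second_var = "?D?", "?R?"
    else:
        first_end, second_start = o_pos + len(object_label), s_pos
        first_var, second_var = "?R?", "?D?"
    if first_end > second_start:
        return None  # labels overlap, nothing usable in between
    # Everything before the first label and after the second is stripped.
    return first_var + sentence[first_end:second_start] + second_var


def build_pattern_mapping(triples, sentences):
    """Sketch of a pattern mapping M: predicate -> {pattern: occurrences}."""
    mapping = defaultdict(lambda: defaultdict(int))
    for subject_label, predicate, object_label in triples:
        for sentence in sentences:
            pattern = extract_pattern(sentence, subject_label, object_label)
            if pattern is not None:
                mapping[predicate][pattern] += 1
    return mapping


# Toy background knowledge and corpus (illustrative only).
triples = [("Angela Merkel", "birthPlace", "Hamburg")]
sentences = [
    "Angela Merkel was born in Hamburg in 1954.",
    "Hamburg is a port city.",
]
mapping = build_pattern_mapping(triples, sentences)
```

Only the first sentence contains both labels, so `mapping["birthPlace"]` holds the single pattern `?D? was born in ?R?` with one occurrence.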

SLIDE 9

Pattern Scoring

Pattern Filtering: Length, Stop Words, Occurrence
Support: used across several triples in background knowledge
Typicity: allows mapping ?D?, ?R? to entities with rdf:type of domain/range of p
Specificity: used exclusively to express p, IDF adapted to patterns
(Similarity: how similar is a pattern to label of predicate)
Combine Support, Typicity, Specificity to calculate local maximum
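
The slides do not give the scoring formulas, so the following is only a sketch of how "IDF adapted to patterns" could look. It assumes the textbook IDF definition with predicates playing the role of documents; the function name `specificity` and the toy mapping are hypothetical, and Support, Typicity and their combination are not modelled.

```python
import math


def specificity(pattern, mapping):
    """IDF-style score: log(#predicates / #predicates that use `pattern`).

    `mapping` maps each predicate to its set of patterns. A pattern used
    for only one predicate scores highest; a pattern shared by all
    predicates scores 0. This is an assumed formula, not BOA's own."""
    total = len(mapping)
    containing = sum(1 for patterns in mapping.values() if pattern in patterns)
    if containing == 0:
        raise ValueError("pattern does not occur in the mapping")
    return math.log(total / containing)


# Toy mapping (illustrative only).
toy_mapping = {
    "birthPlace": {"?D? was born in ?R?", "?D? , ?R?"},
    "deathPlace": {"?D? died in ?R?", "?D? , ?R?"},
}
specific_score = specificity("?D? was born in ?R?", toy_mapping)
generic_score = specificity("?D? , ?R?", toy_mapping)
```

Here the pattern `?D? , ?R?` appears under both predicates and scores 0, while `?D? was born in ?R?` is exclusive to birthPlace and scores log 2.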

SLIDE 10

RDF Generation

use top-n patterns for each relation
find sentences which contain the pattern
NER-tag sentence
look for tokens whose classes match domain/range
extract labels
URI retrieval: above threshold, do not create new URI
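
The step of applying a pattern to a sentence and extracting labels can be sketched as below. This is an illustration under strong assumptions: a run of capitalised words stands in for real NER tagging, there is no domain/range type check and no URI retrieval, and the names `ENTITY`, `pattern_to_regex` and `extract_labels` are hypothetical.

```python
import re

# Assumed stand-in for an NER tagger: one or more capitalised words.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"


def pattern_to_regex(pattern):
    """Turn a pattern such as '?D? was born in ?R?' into a compiled regex
    with named groups 'domain' and 'range'."""
    regex = ""
    for part in re.split(r"(\?D\?|\?R\?)", pattern):
        if part == "?D?":
            regex += f"(?P<domain>{ENTITY})"
        elif part == "?R?":
            regex += f"(?P<range>{ENTITY})"
        else:
            regex += re.escape(part)  # literal text of the pattern
    return re.compile(regex)


def extract_labels(pattern, sentence):
    """Return (domain_label, range_label) if the pattern matches, else None."""
    match = pattern_to_regex(pattern).search(sentence)
    if match is None:
        return None
    return match.group("domain"), match.group("range")


labels = extract_labels(
    "?D? was born in ?R?",
    "Angela Merkel was born in Hamburg in 1954.",
)
```

In a fuller pipeline the extracted labels would still have to be checked against the domain and range classes and resolved to URIs before a triple is generated.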

SLIDE 11

Demo

SLIDE 12

http://139.18.2.164:8080/boa

SLIDE 13

Evaluation

Corpora: en_wiki (44.7M), en_news (256.1M)
Background Knowledge: Organisation, Place and Person (283 relations from 1 to 471920 triples)
Parameters: top1,2 pattern, kappa, 500 sentences for Typicity, 100 example sentences for 12 different KBs

SLIDE 14

Results

SLIDE 15

Examples

Top-2 patterns per relation URI (Domain/Range), for en-wiki and en-news

foundationPerson (Organisation/Person)
en-wiki: 1. R , co-founder of D 2. R , founder of D
en-news: 1. R, the co-founder of D 2. R, founder of the D

subsidiary (Organisation/Organisation)
en-wiki: 1. R , a subsidiary of D 2. - (R , a division of D)
en-news: 1. R, a division of D 2. D's acquisition of R

birthPlace (Person/PopulatedPlace)
en-wiki: 1. D was born in R 2. - (D , the mayor of R)
en-news: 1. D has been named in the 2. D, MP for R

SLIDE 16

Discussion

we can use patterns from wiki for every corpus

we create many new triples
we create correct triples
we need 15 minutes for one iteration
Q1 & Q2 answered with YES

SLIDE 17

Future Work

Train NER on DBpedia classes
Iteration 1+
Human feedback
Pattern generalization
rdf:type extractor
Languages/Corpora
Webservices

SLIDE 18

Thank you!

Questions?

SLIDE 19

References

[NDA+11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, Hong Kong, 2011.

[CAR+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.