BOA Bootstrapping the Linked Data Web Daniel Gerber, Axel-Cyrille - - PowerPoint PPT Presentation



slide-1
SLIDE 1

BOA

Bootstrapping the Linked Data Web

Daniel Gerber, Axel-Cyrille Ngonga Ngomo

AKSW, Universität Leipzig

http://www.volunteer-conservation-peru.org

http://aksw.org/files/boa.pdf

slide-2
SLIDE 2

Motivation

  • most knowledge bases are extracted from (semi-)structured data
  • Linked Data Cloud grows
  • BUT: only 15-20 % of information
  • How can we extract data from the document-oriented web?

slide-3
SLIDE 3

Idea

  • start with triples from the Data Web
  • extract natural language patterns which express predicates found in triples
  • combine patterns & NLP to find labels which stand in relation with predicate
  • generate RDF and feed it into Data Web
slide-4
SLIDE 4

Related Work

  • NLP & RDF:
      • Fox, Extractiv, Alchemy, OpenCalais
  • NELL [CAR+10]
      • initial ontology: 100+ categories/relations
  • PROSPERA [NDA+11]
      • harvesting of n-grams-itemset patterns ➤ generalisation without adding noise
  • [JUR+10] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. ACL, pages 1003–1011, 2009.

slide-5
SLIDE 5

The BOA approach

slide-6
SLIDE 6

Corpus extraction

  • Crawler
      • starts from seed pages, removes HTML
  • Cleaner
      • sentence boundary disambiguation (SBD), UTF-8 filters to remove noise
  • Indexer
      • each sentence gets an index
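
The three stages above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BOA implementation: the function names (`strip_html`, `split_sentences`, `build_index`) are hypothetical, pages are passed in as strings rather than crawled, and the regex split is only a naive stand-in for real sentence boundary disambiguation.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Crawler step (simplified): keep only the visible text of a page."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.parts).split())


def split_sentences(text):
    """Cleaner step (naive SBD stand-in): split after . ! ? when followed
    by whitespace and an uppercase letter."""
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in pieces if p.strip()]


def build_index(pages):
    """Indexer step: give every sentence a running integer id."""
    index = {}
    for html in pages:
        for sentence in split_sentences(strip_html(html)):
            index[len(index)] = sentence
    return index


pages = [
    "<html><body><p>Leipzig is a city in Germany. It hosts AKSW.</p>"
    "<script>var x = 1;</script></body></html>"
]
index = build_index(pages)
```

A production pipeline would replace the regex with a trained sentence splitter, since abbreviations such as "Dr." break this heuristic.
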
slide-7
SLIDE 7

Knowledge acquisition

  • class C that serves as the rdfs:domain or as the rdfs:range of predicate p
  • knowledge base for background knowledge
  • extract statements with entities of rdf:type C as subject or object
  • db:Place, db:Person, db:Organisation
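
The selection step can be illustrated on an in-memory list of triples. This is a sketch under assumptions: the function name `select_background_triples`, the prefixed-name strings, and the toy triples are all made up for illustration, and a real setup would query a knowledge base instead of filtering a Python list.

```python
RDF_TYPE = "rdf:type"


def select_background_triples(triples, target_class):
    """Return all non-type statements whose subject or object is an
    instance of target_class (hypothetical helper, not BOA code)."""
    # Entities explicitly typed with the target class.
    typed = {s for s, p, o in triples if p == RDF_TYPE and o == target_class}
    return [
        (s, p, o)
        for s, p, o in triples
        if p != RDF_TYPE and (s in typed or o in typed)
    ]


# Toy data, for illustration only.
triples = [
    ("db:Leipzig", "rdf:type", "db:Place"),
    ("db:Bach", "rdf:type", "db:Person"),
    ("db:Bach", "dbo:deathPlace", "db:Leipzig"),
    ("db:Bach", "dbo:instrument", "db:Organ"),
]
place_statements = select_background_triples(triples, "db:Place")
```
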
slide-8
SLIDE 8

Pattern Search

  • set of entities s and o connected through p
  • find sentences which contain s & o, strip the rest
  • replace labels with variables (?D?, ?R?)
  • A BOA pattern is a pair P = (μ(p), θ), where μ(p) is p’s URI and θ is a natural language representation of p.
  • A BOA pattern mapping is a function M such that M(p) = S, where S is the set of natural language representations for p.
  • Occurrences, sentences, labels p is learned from, number of occurrences for each label combination
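
A minimal sketch of the search step for a single predicate p follows. It assumes exact string matching of labels and keeps only the text between the two labels; the function name `extract_patterns` and the example sentences are hypothetical. ?D? stands for the label of the subject (the domain side) and ?R? for the label of the object (the range side).

```python
from collections import Counter


def extract_patterns(sentences, label_pairs):
    """Count candidate patterns for one predicate.

    label_pairs holds (subject label, object label) pairs taken from the
    background knowledge. Only the text between the labels is kept.
    """
    counts = Counter()
    for sentence in sentences:
        for s_label, o_label in label_pairs:
            s_pos = sentence.find(s_label)
            o_pos = sentence.find(o_label)
            if s_pos < 0 or o_pos < 0:
                continue  # sentence does not mention both labels
            if s_pos + len(s_label) <= o_pos:
                middle = sentence[s_pos + len(s_label):o_pos]
                theta = "?D?" + middle + "?R?"
            elif o_pos + len(o_label) <= s_pos:
                middle = sentence[o_pos + len(o_label):s_pos]
                theta = "?R?" + middle + "?D?"
            else:
                continue  # labels overlap, no usable pattern
            if middle.strip():
                counts[theta] += 1
    return counts


sentences = [
    "Albert Einstein was born in Ulm in 1879.",
    "Marie Curie was born in Warsaw.",
    "Ulm is the birthplace of Albert Einstein.",
]
pairs = [("Albert Einstein", "Ulm"), ("Marie Curie", "Warsaw")]
patterns = extract_patterns(sentences, pairs)
```

The resulting counter gives the number of occurrences per pattern; tracking the contributing sentences and label pairs as well would only need a dictionary of lists in place of the counter.
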

slide-9
SLIDE 9

Pattern Scoring

  • Pattern Filtering: Length, Stop Words, Occurrence
  • Support: the pattern is used across several triples in the background knowledge
  • Typicity: the pattern allows mapping ?D?, ?R? to entities with the rdf:type of the domain/range of p
  • Specificity: the pattern is used exclusively to express p; IDF adapted to patterns
  • (Similarity: how similar is a pattern to the label of the predicate)
  • Combine Support, Typicity, Specificity to calculate local maximum
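
To make the scores concrete, here is a hedged sketch of support and specificity. The formulas are assumptions chosen for illustration and are not taken from the slides: support grows with the number of distinct label pairs and the highest pair count, specificity is an IDF-style logarithm over predicates, and the combination is a plain product. Typicity needs NER output and is simply passed in as a number.

```python
import math


def support(pair_counts):
    """pair_counts maps each (subject label, object label) pair to how
    often the pattern was seen with it. Assumed formula."""
    if not pair_counts:
        return 0.0
    distinct_pairs = len(pair_counts)
    max_count = max(pair_counts.values())
    return math.log(1 + distinct_pairs) * math.log(1 + max_count)


def specificity(pattern, patterns_by_predicate):
    """IDF-style score: log(#predicates / #predicates using the pattern)."""
    using = sum(1 for pats in patterns_by_predicate.values() if pattern in pats)
    if using == 0:
        raise ValueError("pattern not found for any predicate")
    return math.log(len(patterns_by_predicate) / using)


def score(support_value, typicity_value, specificity_value):
    # Assumption: a plain product; the slides only say the three are combined.
    return support_value * typicity_value * specificity_value


# Toy pattern sets, for illustration only.
patterns_by_predicate = {
    "birthPlace": {"?D? was born in ?R?", "?D? , ?R?"},
    "deathPlace": {"?D? died in ?R?", "?D? , ?R?"},
}
specific = specificity("?D? was born in ?R?", patterns_by_predicate)
generic = specificity("?D? , ?R?", patterns_by_predicate)
```

Under these assumptions a pattern shared by every predicate gets specificity 0, so its combined score is 0 regardless of its support.
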

slide-10
SLIDE 10

RDF Generation

  • use the top-n patterns for each relation
  • find sentences which contain a pattern
  • NER-tag the sentence
  • look for tokens whose classes match domain/range
  • extract labels
  • URI retrieval: if a match is above the threshold, do not create a new URI
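
The generation step can be sketched as follows. Several parts are assumptions made for illustration: a capitalised-phrase regex stands in for the NER tagger (so no domain/range type check happens), `difflib.SequenceMatcher` stands in for the URI retrieval similarity, and the namespace `http://example.org/boa/`, the threshold 0.9, and all function names are hypothetical. Patterns are expected to contain both ?D? and ?R?.

```python
import re
from difflib import SequenceMatcher

# Crude stand-in for NER: one or more capitalised words.
ENTITY = r"[A-Z]\w*(?: [A-Z]\w*)*"


def pattern_to_regex(theta):
    """Turn '?D? was born in ?R?' into a regex with named groups D and R."""
    out = []
    for part in re.split(r"(\?D\?|\?R\?)", theta):
        if part == "?D?":
            out.append("(?P<D>" + ENTITY + ")")
        elif part == "?R?":
            out.append("(?P<R>" + ENTITY + ")")
        else:
            out.append(re.escape(part))
    return re.compile("".join(out))


def resolve_uri(label, known_labels, threshold=0.9,
                namespace="http://example.org/boa/"):
    """Reuse a known URI if its label is similar enough, else mint one."""
    best_uri, best_score = None, 0.0
    for uri, known in known_labels.items():
        similarity = SequenceMatcher(None, label.lower(), known.lower()).ratio()
        if similarity > best_score:
            best_uri, best_score = uri, similarity
    if best_uri is not None and best_score >= threshold:
        return best_uri
    return namespace + label.replace(" ", "_")


def generate_triples(sentences, theta, predicate_uri, known_labels):
    """Apply one pattern to sentences and emit (s, p, o) URI triples."""
    regex = pattern_to_regex(theta)
    triples = []
    for sentence in sentences:
        for match in regex.finditer(sentence):
            subject = resolve_uri(match.group("D"), known_labels)
            obj = resolve_uri(match.group("R"), known_labels)
            triples.append((subject, predicate_uri, obj))
    return triples


known = {"http://dbpedia.org/resource/Albert_Einstein": "Albert Einstein"}
triples = generate_triples(
    ["Albert Einstein was born in Ulm.", "Nothing to see here."],
    "?D? was born in ?R?",
    "http://dbpedia.org/ontology/birthPlace",
    known,
)
```
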

slide-11
SLIDE 11

Demo

slide-12
SLIDE 12

http://139.18.2.164:8080/boa

slide-13
SLIDE 13

Evaluation

  • Corpora
      • en_wiki (44.7M), en_news (256.1M)
  • Background Knowledge
      • Organisation, Place and Person (283 relations from 1 to 471920 triples)
  • Parameters
      • top1,2 pattern, kappa, 500 sentences for Typicity, 100 example sentences for 12 different KBs

slide-14
SLIDE 14

Results

slide-15
SLIDE 15

Examples

Top-2 patterns per relation URI (Domain/Range) and corpus:

foundationPerson (Organisation/Person)
  • en-wiki: 1. R , co-founder of D; 2. R , founder of D
  • en-news: 1. R, the co-founder of D; 2. R, founder of the D

subsidiary (Organisation/Organisation)
  • en-wiki: 1. R , a subsidiary of D; 2. - (R , a division of D)
  • en-news: 1. R, a division of D; 2. D's acquisition of R

birthPlace (Person/PopulatedPlace)
  • en-wiki: 1. D was born in R; 2. - (D , the mayor of R)
  • en-news: 1. D has been named in the R; 2. D, MP for R
slide-16
SLIDE 16

Discussion

  • we can use patterns from wiki for every corpus

  • we create many new triples
  • we create correct triples
  • we need 15 minutes for one iteration
  • Q1 & Q2 answered with YES
slide-17
SLIDE 17

Future Work

  • Train NER on DBpedia classes
  • Iteration 1+
  • Human feedback
  • Pattern generalization
  • rdf:type extractor
  • Languages/Corpora
  • Webservices
slide-18
SLIDE 18

Thank you!

Questions?

slide-19
SLIDE 19

References

  • [NDA+11] Ndapandula Nakashole, Martin Theobald, and Gerhard Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236, Hong Kong, 2011.
  • [CAR+10] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI, 2010.