Naimdjon Takhirov, Fabien Duchateau, Trond Aalberg and Ingeborg Sølvberg
APWeb 2013, Sydney, Australia
April 4th, 2013
An Integrated Approach for Large-scale Relation Extraction from the Web
Figure: example relationships between entities
– entities: Bored of the Rings, Lord of the Rings, J.R.R. Tolkien, Henry N. Beard, Douglas C. Kenney, National Lampoon (magazine), Signet (publisher)
– relationship types: parodyOf, creatorOf, publishedBy, founderOf
– pattern generation
– relationship and example generation
– scalability

– relationship discovery
– performance
– data created several decades ago (legacy data)
– large investments (in infrastructure and manpower)

– identification of entities
– ambiguity
– huge source of information
– the place we discuss and share knowledge about our cultural heritage (news, wikis, blogs, personal catalogs etc.)
– extracting semantic information from the documents
– reasonable recall/precision with respect to the state of the art
– pattern generation
– relationship example generation
Figure: overview of pattern generation (steps: EXTENSION, DOCUMENT EXTRACTION, GENERALIZATION, SELECTION)
– input: entities e1 and e2 (e.g. lord of the rings)
– knowledge bases: DBpedia, Freebase, OpenCyc
– entity labels: l11 l12 ... l1m and l21 l22 ... l2n (e.g. LOTR)
– intermediate results: collection of documents, candidate patterns, ranked patterns, generic patterns
– strategies: Simple strategy, Contextual strategy
– example patterns: "{e2} is a parody of {e1}"; "a spoof of {e1}, entitled {e2}"; "{e2} is a short satirical novel by … parodying {e1}"
– POS-tagged patterns: "{e2} is/VBZ a/DT parody of/IN {e1}"; "a/DT spoof of/IN {e1}, entitled/VBD {e2}"
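The step from a sentence to a candidate pattern such as "{e2} is a parody of {e1}" can be illustrated with a minimal sketch. The function name `candidate_pattern` and the plain string matching are assumptions made for illustration only; the slides do not show how the extraction step is implemented, and POS tagging and generalization are left out.

```python
import re
from typing import Optional


def candidate_pattern(sentence: str, e1: str, e2: str) -> Optional[str]:
    """Replace the mentions of both entities with {e1}/{e2} placeholders.

    Hypothetical helper: returns None when either entity label does not
    occur in the sentence, so the sentence yields no candidate pattern.
    """
    pattern = sentence
    for placeholder, label in (("{e1}", e1), ("{e2}", e2)):
        # Literal, case-insensitive match of the entity label.
        pattern, count = re.subn(
            re.escape(label), placeholder, pattern, flags=re.IGNORECASE
        )
        if count == 0:
            return None
    return pattern


sentence = "Bored of the Rings is a parody of Lord of the Rings"
print(candidate_pattern(sentence, "Lord of the Rings", "Bored of the Rings"))
# prints: {e2} is a parody of {e1}
```

In practice each entity would be matched under all of its labels, not just one, which is what the label lists l11 ... l1m and l21 ... l2n in the figure suggest.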
how to select the right entity?
what to do when disambiguation is not possible?
the type of relationship (e.g. Bored of the Rings is a parody of Lord of the Rings)
words
\[
\mathrm{conf}(p) = \frac{\alpha \cdot \mathrm{sup}_p + \beta \cdot \mathrm{occ}_p + \gamma \cdot \mathrm{prov}_p}{\alpha + \beta + \gamma}
\]
PageRank
SpamScore
RelScore
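The confidence score conf(p) is a weighted average of three per-pattern scores, which can be sketched as follows. The function name `pattern_confidence`, the default weights and the assumption that the three scores are already normalised to a common range are illustrative; the slides do not state how sup, occ and prov are computed.

```python
def pattern_confidence(sup: float, occ: float, prov: float,
                       alpha: float = 1.0, beta: float = 1.0,
                       gamma: float = 1.0) -> float:
    """Weighted average of the three pattern scores, as in conf(p).

    sup, occ and prov are taken as given inputs (assumed to be on the
    same scale); alpha, beta and gamma are the weights of the formula.
    """
    total = alpha + beta + gamma
    if total <= 0:
        raise ValueError("the weights must sum to a positive value")
    return (alpha * sup + beta * occ + gamma * prov) / total


# Equal weights reduce to the plain mean of the three scores.
print(pattern_confidence(0.8, 0.4, 0.6))
# Doubling alpha pulls the result towards the sup score.
print(pattern_confidence(0.8, 0.4, 0.6, alpha=2.0))
```

Dividing by the sum of the weights keeps conf(p) on the same scale as its inputs, so patterns can be ranked whatever weights are chosen.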
– relationship discovery
– discover new examples
– similarity between a sentence and a pattern
– calculate the semantic distance
– flexible because missing or extra words in the sentence do not significantly affect the similarity score
– NER is important and an integral part of the approach
– need to identify and extract entities
– rely on the formalism of our patterns (since they have been POS-tagged, the tags serve as delimiters and may constrain the candidate entities)
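\[
\mathrm{semdist}(w_p^i, w_s^j) =
\begin{cases}
0.0 & \text{if } w_s^j \in FT_p \\
\mathrm{resnik}(w_p^i, w_s^j) & \text{if } w_s^j \notin FT_p \text{ and } t_p^i = t_s^j \\
1.0 & \text{otherwise}
\end{cases}
\]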
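The three cases of semdist can be written as a small function. This is a sketch under stated assumptions: `ft_p` stands for the set FT_p, whose construction is not given on the slide, and `resnik` is passed in as a callable because the slide does not say how the Resnik measure is computed or scaled.

```python
from typing import Callable, Set


def semdist(word_p: str, tag_p: str, word_s: str, tag_s: str,
            ft_p: Set[str],
            resnik: Callable[[str, str], float]) -> float:
    """Distance between a pattern word and a sentence word.

    word_p/tag_p: word and POS tag from the pattern.
    word_s/tag_s: word and POS tag from the sentence.
    ft_p: the set FT_p of the pattern (assumed to be a set of words).
    resnik: caller-supplied measure for two words (hypothetical hook).
    """
    if word_s in ft_p:
        return 0.0                      # first case: word is in FT_p
    if tag_p == tag_s:
        return resnik(word_p, word_s)   # second case: same POS tag
    return 1.0                          # remaining case


# Usage with a stand-in measure that always returns 0.25.
stub = lambda a, b: 0.25
print(semdist("parody", "NN", "spoof", "NN", {"parody"}, stub))   # 0.25
print(semdist("parody", "NN", "parody", "NN", {"parody"}, stub))  # 0.0
print(semdist("parody", "NN", "quickly", "RB", {"parody"}, stub)) # 1.0
```

Keeping the measure as a parameter makes the case analysis testable without a lexical database; a WordNet-based implementation could be plugged in later.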
– precision: number of correct discovered results divided by the total number of discovered results
– recall: sample-based estimation (3600 entries manually validated by 8 people)
– F-measure: harmonic mean of precision and recall
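The precision and F-measure definitions above can be sketched in a few lines. The function names and the sample numbers are illustrative and are not results from the experiments; recall is taken as an already estimated value, since the sampling procedure is not detailed on the slide.

```python
def precision(correct: int, discovered: int) -> float:
    """Correct discovered results divided by all discovered results."""
    return correct / discovered if discovered else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision p and (estimated) recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Made-up numbers, for illustration only.
p = precision(correct=45, discovered=60)   # 0.75
r = 0.5                                    # assumed estimated recall
print(round(f_measure(p, r), 3))           # prints: 0.6
```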
– manually constructed (200 relationships)
Example                                  Discovered type   Confidence score
Obama, Hawai                             birthplace        0.42
                                         senator           0.31
                                         president-elect   0.18
cockatoo, yellow                         amazon            0.32
                                         parrot            0.31
                                         tail              0.16
eucalyptus, myrtaceae                    plant             0.51
                                         family            0.43
                                         specie            0.27
Bartolomeo Cristofori, piano             inventor          0.60
                                         instrument        0.43
                                         maker             0.19
Dave Grohl, Nirvana                      Cobain            0.34
                                         band member       0.27
                                         drummer           0.16
Bored of the Rings, Lord of the Rings    parody            0.53
                                         links             0.24
                                         Middle-Earth      0.23
Figure: two plots, precision in percentage and estimated recall in percentage, at top-1, top-3, top-5 and top-10, each for no training, 1 example and 5 examples.
– pattern generation
– relation and example generation

– experiments from different domains (recently released dataset: ClueWeb2012)
– study impact of parameters and contradictory cases
– enriching instances with attributes
– SPARQL endpoint)
– a demo is under submission
takhirov@idi.ntnu.no