Entity Representation and Retrieval
Laura Dietz
University of New Hampshire
Alexander Kotov
Wayne State University
Edgar Meij
Bloomberg
SIGIR 2018 Tutorial on Utilizing KGs for Text-centric IR
Besides documents, users often search for concrete or abstract entities/objects (e.g., people, products, organizations, books)
Users are willing to express these information needs more elaborately than with a few keywords [Balog et al., SIGIR’08]
Entities (or entity cards) provide immediate answers to such queries → natural units for organizing search results
Knowledge graphs are built around entities → Entity Retrieval from Knowledge Graph(s) (ERKG)
Entity Search: simple queries aimed at finding a particular entity or an entity that is an attribute of another entity
◮ “Ben Franklin”
◮ “Einstein Relativity theory”
◮ “England football player highest paid”
List Search: descriptive queries with several relevant entities
◮ “US presidents since 1960”
◮ “animals lay eggs mammals”
◮ “Formula 1 drivers that won the Monaco Grand Prix”
Question Answering: queries are natural-language questions
◮ “Who founded Intel?”
◮ “For which label did Elvis record his first album?”
Evolution of entity retrieval tasks:
◮ Expert search at the TREC 2005–2008 enterprise track: find experts knowledgeable about a given topic
◮ Entity ranking track at INEX 2007–2009: find the Wikipedia pages of entities with a given target type
◮ Related entity search at the TREC 2009–2011 entity track: find Web pages of entities related to a given entity in a certain way
Can also be used for entity linking: a fragment of text as the query, a list of linked entities as the result
Can be combined with methods that use KGs for ad-hoc or Web search (part 3 of this tutorial)
Unique IR problem: there are no documents; entities in a KG have no textual representation apart from their names
Challenging IR problem: knowledge graphs are best suited for structured, graph-pattern-based SPARQL queries, not for traditional IR models
ERKG requires accurate interpretation of unstructured textual queries and matching them with entity semantics:
◮ How to leverage entity properties and relations to other entities?
◮ How to build entity representations?
[Tonon, Demartini et al., SIGIR’12]
Outline:
◮ Entity representation
◮ Entity retrieval
◮ Entity set expansion
◮ Entity ranking
Build a textual representation (i.e., a “document”) for each entity from all triples in which it appears as the subject (or object)
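As a minimal Python sketch (not from the tutorial), building such an entity “document” can look like this; the triples, URIs, and label map below are hypothetical examples:

```python
# Sketch: build a textual "document" for an entity from all triples in
# which it appears as subject or object. Entity URIs and labels are
# made-up examples in DBpedia-like style.

def build_entity_document(entity, triples, labels):
    """Concatenate (labels of) the other end of every triple touching `entity`."""
    parts = []
    for s, p, o in triples:
        if s == entity:                 # entity as subject: keep the object
            parts.append(labels.get(o, o))
        elif o == entity:               # entity as object: keep the subject
            parts.append(labels.get(s, s))
    return " ".join(parts)

triples = [
    ("dbr:Ben_Franklin", "rdf:type", "dbo:Scientist"),
    ("dbr:Ben_Franklin", "dbo:birthPlace", "dbr:Boston"),
    ("dbr:Poor_Richard", "dbo:author", "dbr:Ben_Franklin"),
]
labels = {"dbo:Scientist": "Scientist", "dbr:Boston": "Boston",
          "dbr:Poor_Richard": "Poor Richard's Almanack"}

doc = build_entity_document("dbr:Ben_Franklin", triples, labels)
```

The resulting flat text can then be indexed with any standard IR engine.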
Simple approach: each predicate corresponds to one field of the entity document
Problem: KGs contain a very large number of distinct predicates → optimizing per-field importance weights is computationally intractable
Predicate folding: group predicates into a small set of predefined categories → entity documents with a smaller number of fields
◮ by predicate type (attributes, incoming/outgoing links) [Pérez-Agüera et al., SemSearch 2010]
◮ by predicate importance (determined based on predicate popularity) [Blanco et al., ISWC 2011]
The number and type of fields depend on the retrieval task
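A small Python sketch of predicate folding (the predicate→category map and field names are hypothetical, not from any cited paper):

```python
# Sketch of predicate folding: map each predicate to one of a few
# predefined categories so the entity document has a small, fixed set
# of fields. The FOLD map below is a made-up example.

FOLD = {
    "rdfs:label": "name", "foaf:name": "name",
    "dbo:abstract": "attributes", "dbo:birthPlace": "attributes",
}

def fold_triples(entity, triples):
    fields = {"name": [], "attributes": [], "links": []}
    for s, p, o in triples:
        if s != entity:
            continue
        category = FOLD.get(p, "links")  # unknown predicates fold into "links"
        fields[category].append(o)
    return fields

fields = fold_triples("dbr:Boston", [
    ("dbr:Boston", "rdfs:label", "Boston"),
    ("dbr:Boston", "dbo:abstract", "Boston is the capital of Massachusetts."),
    ("dbr:Boston", "dbo:country", "dbr:United_States"),
])
```

With only three fields, per-field weights become easy to tune.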
[Neumayer, Balog et al., ECIR’12]
Each entity is represented as a two-field document:
◮ title: literal of the “label” or “title” predicate
◮ content: all other predicate values, concatenated together into a flat text representation
This simple scheme is effective for entity retrieval
[Zhiltsov and Agichtein, CIKM’13]
Each entity is represented as a three-field document:
◮ names: literals of the foaf:name and rdfs:label predicates, along with tokens extracted from entity URIs
◮ attributes: literals of all other predicates
◮ outgoing entity names: names of entities in the object position
This scheme is effective for entity retrieval
[Zhiltsov, Kotov et al., SIGIR’15]
Each entity is represented as a five-field document:
◮ names: labels or names of the entity
◮ attributes: all entity properties other than names
◮ categories: classes or groups to which the entity has been assigned
◮ similar entity names: names of entities that are very similar or identical to the given entity
◮ related entity names: names of entities in the object position
This flexible scheme is effective for a variety of tasks: entity search, list search, question answering
Vocabulary mismatch between the descriptions of relevant entities and the query terms that can be used to search for them
Associations between words and entities depend on the context:
◮ Germany should be returned for queries related to both World War II and the 2006 Soccer World Cup
Real-life events change the descriptions of entities:
◮ Ferguson, Missouri before and after August 2014
[Graus, Tsagkias et al., WSDM’16]
Idea: create static entity representations using knowledge bases and leverage different social media sources to dynamically update them
◮ Represent entities as fielded documents, in which each field corresponds to a different source
◮ Tweak the weights of the different fields over time
Outline:
◮ Entity representation
◮ Entity retrieval
◮ Entity set expansion
◮ Entity ranking
ERKG has been addressed in a probabilistic generative framework: P(e|q) ∝ P(q|e)P(e)
Besides keywords qw, the query q implicitly or explicitly contains target entity type(s) qt, which can be incorporated into entity retrieval models
Two ways to combine term-based similarity P(qw|e) and type-based similarity P(qt|e):
◮ Filtering [Bron et al., CIKM’10]: P(q|e) = P(qw|e)P(qt|e)
◮ Interpolation [Balog et al., TOIS’11; Kaptein et al., AI’13; Pehcevski et al., IR’10; Raviv et al., JIWES’12]: P(q|e) = (1 − λt)P(qw|e) + λtP(qt|e)
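The two combination strategies can be sketched in a few lines of Python; the probability values below are made up for illustration:

```python
# Sketch of the two ways to combine term-based similarity P(qw|e) and
# type-based similarity P(qt|e). Inputs are toy probabilities.

def filtering(p_qw, p_qt):
    """Type match acts as a (soft) filter on the term-based score."""
    return p_qw * p_qt

def interpolation(p_qw, p_qt, lam_t=0.3):
    """Linear mixture of term-based and type-based similarity."""
    return (1.0 - lam_t) * p_qw + lam_t * p_qt

p_qw, p_qt = 0.4, 0.9
filtered = filtering(p_qw, p_qt)
interpolated = interpolation(p_qw, p_qt, lam_t=0.3)
```

Note that filtering zeroes out entities with no type match, while interpolation only demotes them.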
Possible options for P(qw|e):
unigram bag-of-words models for structured document retrieval:
◮ Mixture of Language Models (MLM) [Ogilvie and Callan, SIGIR’03]
◮ BM25 for multi-field documents (BM25F) [Robertson et al., CIKM’04]
◮ Probabilistic Retrieval Model for Semi-structured Data (PRMS) [Kim and Croft, ECIR’09]
term dependence (bigram) models:
◮ Sequential Dependence Model (SDM) [Metzler and Croft, SIGIR’05]
term dependence models for structured document retrieval:
◮ Fielded Sequential Dependence Model (FSDM) [Zhiltsov et al., SIGIR’15]
◮ Parameterized Fielded Sequential Dependence Model (PFSDM) [Nikolaev et al., SIGIR’16]
[Zhiltsov, Kotov et al., SIGIR’15]
Idea: account both for phrases (bigrams) and document structure
Document score is a linear combination of matching functions for unigrams and bigrams in each document field:

PΛ(D|Q) rank= λT Σ_{qi∈Q} f̃T(qi, D) + λO Σ_{qi,qi+1∈Q} f̃O(qi, qi+1, D) + λU Σ_{qi,qi+1∈Q} f̃U(qi, qi+1, D)

MLM is a special case of FSDM, when λT = 1, λO = 0, λU = 0
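The linear combination above can be sketched in Python; the matching functions here are stand-ins that return precomputed log-probabilities rather than real field language models:

```python
# Sketch of the FSDM document score: a weighted sum of unigram,
# ordered-bigram, and unordered-bigram matching functions. The lambdas
# and matching-function values below are toy numbers.

def fsdm_score(query_terms, f_T, f_O, f_U, lam=(0.8, 0.1, 0.1)):
    lam_T, lam_O, lam_U = lam
    score = sum(lam_T * f_T(q) for q in query_terms)
    for q1, q2 in zip(query_terms, query_terms[1:]):  # consecutive pairs
        score += lam_O * f_O(q1, q2) + lam_U * f_U(q1, q2)
    return score

# With lam = (1, 0, 0) FSDM degenerates to MLM (unigrams only).
terms = ["apollo", "astronauts"]
score = fsdm_score(terms, f_T=lambda q: -2.0,
                   f_O=lambda a, b: -3.0, f_U=lambda a, b: -3.0)
```

Here the MLM special case falls out by simply zeroing the bigram weights.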
FSDM matching function for unigrams (a weighted mixture of Dirichlet-smoothed per-field language models):

f̃T(qi, D) = log Σ_j w^T_j P(qi|θ^j_D), where P(qi|θ^j_D) = (tf_{qi,Dj} + μj · cf^j_{qi}/|Cj|) / (|Dj| + μj)

Example query: “apollo astronauts who walked on the moon” (e.g., both “apollo astronauts” and “who walked on the moon” are best matched against the category field)
Parameters: field weights w^T_j and per-field smoothing parameters μj
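A minimal Python sketch of this unigram matching function; the field statistics and smoothing values are hypothetical toy numbers:

```python
import math

# Sketch of the FSDM unigram matching function: a weighted mixture of
# Dirichlet-smoothed per-field language models. Document fields and
# collection statistics below are made-up toy values.

def f_T(term, doc_fields, coll, weights, mu):
    """log sum_j w_j * (tf + mu_j * cf/|C_j|) / (|D_j| + mu_j)"""
    mixture = 0.0
    for j, field in doc_fields.items():
        tf = field.count(term)
        p_coll = coll[j].get(term, 0.0)     # cf^j_term / |C_j|
        p = (tf + mu[j] * p_coll) / (len(field) + mu[j])
        mixture += weights[j] * p
    return math.log(mixture)

doc = {"names": ["neil", "armstrong"],
       "categories": ["apollo", "astronauts"]}
coll = {"names": {"neil": 0.001}, "categories": {"apollo": 0.01}}
score = f_T("apollo", doc, coll,
            weights={"names": 0.4, "categories": 0.6},
            mu={"names": 100.0, "categories": 100.0})
```

The field weights w_j decide how much a match in, say, the category field contributes relative to the names field.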
Limitation of FSDM: the same field weights are used for all query unigrams and all query bigrams
Example: in “capitals in Europe which were host cities of summer Olympic games”, different query concepts should be matched against different fields (“capitals” → category, “in Europe” → attribute, “summer Olympic games” → category)
[Nikolaev, Kotov et al., SIGIR’16]
Idea: calculate the field weight for each unigram and bigram based on features:

w^T_{qi,j} = Σ_k α^U_{j,k} φ_k(qi, j)

φ_k(qi, j) is the k-th feature value for unigram qi in field j
α^U_{j,k} are feature weights that are learned by coordinate ascent to maximize the target retrieval metric
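The parameterized field weight is just a linear model over features, as in this sketch (the feature vector and learned alphas are hypothetical toy values, not from the paper):

```python
# Sketch of PFSDM's parameterized field weights: each query unigram's
# weight for a field is a linear combination of feature values. The
# feature vector and alpha weights below are made-up toy numbers.

def field_weight(features, alphas):
    """w_{q,j} = sum_k alpha_{j,k} * phi_k(q, j)"""
    return sum(a * f for a, f in zip(alphas, features))

# phi = (is_proper_noun, collection-statistics posterior, intercept)
phi_capitals_cat = (0.0, 0.7, 1.0)   # "capitals" against the category field
alphas_cat = (0.5, 1.2, 0.1)
w = field_weight(phi_capitals_cat, alphas_cat)
```

In PFSDM these alphas are tuned by coordinate ascent on a retrieval metric rather than set by hand.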
Features for estimating field weights of a query concept κ (CT = concept type: UG = unigram, BG = bigram):
Source: Collection statistics
◮ Posterior probability P(Ej|κ) (UG, BG)
◮ Top SDM score of the j-th field when κ is used as a query (BG)
Source: Stanford POS Tagger
◮ Is concept κ a proper noun? (UG)
◮ Is κ a plural non-proper noun? (UG, BG)
◮ Is κ a superlative adjective? (UG)
Source: Stanford Parser
◮ Is κ part of a noun phrase? (BG)
◮ Is κ the only singular non-proper noun in a noun phrase? (UG)
Intercept (UG, BG)
[Hasibi et al., ICTIR’16]
Idea: use linked entities as an additional feature function in FSDM:

PΛ(D|Q) rank= λT Σ_{qi∈Q} f̃T(qi, D) + λO Σ_{qi,qi+1∈Q} f̃O(qi, qi+1, D) + λU Σ_{qi,qi+1∈Q} f̃U(qi, qi+1, D) + λE Σ_e f̃E(e, D)
[Garigliotti and Balog, ICTIR’17]
If target type(s) qt are provided with the query, the distribution of types for entity e is estimated with Dirichlet smoothing:

P(t|θe) = (n(t, e) + μP(t)) / (Σ_t′ n(t′, e) + μ)

With both θq and θe in place, type-based similarity between q and e is estimated from the KL divergence between the two type distributions:

P(qt|e) = z (max_e′ KL(θq‖θe′) − KL(θq‖θe))
[Garigliotti and Balog, ICTIR’17]
Options for the set of types used to represent an entity: (a) all assigned types, (b) most general types, (c) most specific types
If no target type(s) are provided with the query, they can be inferred using:
Type-centric approach [Balog and Neumayer, CIKM’12]: build a document for each type by concatenating the descriptions of all entities that belong to it:

P(q|t) = Π_{i=1..|q|} P(wi|θt) = Π_{i=1..|q|} ((1 − λ) Σ_e P(wi|θe)P(e|t) + λP(wi))

Entity-centric approach [Balog and Neumayer, CIKM’12]: aggregate retrieval scores and type distributions of top retrieved entities:

P(q|t) = Σ_e P(q|e)P(e|t)
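The entity-centric aggregation can be sketched in a few lines; entity scores and type memberships below are made-up toy values, and P(e|t) is assumed uniform over a type's members for simplicity:

```python
# Sketch of the entity-centric approach to target type identification:
# aggregate the retrieval scores P(q|e) of entities, weighted by their
# type membership P(e|t). All numbers are toy values.

def entity_centric_type_score(query_scores, type_members):
    """P(q|t) = sum_e P(q|e) P(e|t), with uniform P(e|t) over members."""
    return sum(query_scores.get(e, 0.0) for e in type_members) / len(type_members)

query_scores = {"Paris": 0.5, "London": 0.3, "Everest": 0.2}
score_city = entity_centric_type_score(query_scores, ["Paris", "London"])
score_mountain = entity_centric_type_score(query_scores, ["Everest"])
```

For a query about European capitals, the City type would thus outscore Mountain.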
Type ranking [Garigliotti et al., SIGIR’17]: combines scores of the entity- and type-centric approaches with taxonomy and type label features
Head-modifier approach [Ma et al., WWW’18]: query and type names are phrases that consist of a head word (hq and ht) and a set of modifiers (Mq and Mt), e.g., “Italian Nobel prize winners”, “Musicians who appeared in the Blues Brothers movies”:

P(q|t) = P(ht|hq)^α1 P(Mt|hq)^α2 P(ht|Mq)^α3 P(Mt|Mq)^α4
[Raviv et al., JIWES’17]
Entity name EN, description ED, and types ET can be combined into a Markov Random Field-based retrieval model:

P(E|Q) = λEN P(EN|Q) + λED P(ED|Q) + λET P(ET|Q)
Outline:
◮ Entity representation
◮ Entity retrieval
◮ Entity set expansion
◮ Entity ranking
[Tonon, Demartini et al., SIGIR’12]
Maintain an inverted index for entity representations and a triple store for entity relations
Hybrid approach: IR models for initial entity retrieval, SPARQL queries for expansion
Follow predicates leading to entity attributes
Explore entity neighbors and the neighbors of neighbors
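A Python sketch of the hybrid expansion step; the SPARQL template and the tiny in-memory adjacency list are hypothetical stand-ins for a real triple store:

```python
# Sketch of graph-based entity set expansion: start from seed entities
# returned by an IR model and follow KG links for a few hops. The SPARQL
# template and the GRAPH adjacency list are illustrative stand-ins.

EXPAND_QUERY = """
SELECT ?o WHERE { <%s> ?p ?o }
"""  # one-hop expansion template a triple store would execute

GRAPH = {  # toy adjacency list standing in for the triple store
    "dbr:Intel": ["dbr:Gordon_Moore", "dbr:Robert_Noyce"],
    "dbr:Gordon_Moore": ["dbr:Moores_law"],
}

def expand(seeds, hops=1):
    frontier, result = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {n for e in frontier for n in GRAPH.get(e, [])}
        result |= frontier
    return result

expanded = expand(["dbr:Intel"], hops=2)
```

Two hops reach both the direct neighbors and the neighbors of neighbors.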
Outline:
◮ Entity representation
◮ Entity retrieval
◮ Entity set expansion
◮ Entity ranking
[Dali and Fortuna, WWW’11]
Potential features:
◮ Popularity and importance of the Wikipedia page: # of accesses from logs, # of edits, page length
◮ RDF features: # of triples in which E is the subject, the object, or the subject of a triple whose object is a literal; # of categories the Wikipedia page for E belongs to; size of the biggest/smallest/median category
◮ HITS scores and PageRank of the Wikipedia page and of E in the RDF graph
◮ # of hits from a search engine API for the top 5 keywords from the abstract of the Wikipedia page for E
◮ Count of the entity name in Google N-grams
Features approximating the importance of the Wikipedia page (hub and authority scores, PageRank) are effective
PageRank and HITS scores on the RDF graph are not effective (outperformed by simpler RDF features)
Google N-grams are an effective proxy for entity popularity, cheaper than a search engine API
Feature combinations improve both the robustness and the accuracy of ranking
For a knowledge graph with n distinct entities and m distinct predicates, construct a tensor X of size n × n × m, where X_ijk = 1 if the k-th predicate holds between the i-th entity and the j-th entity, and X_ijk = 0 otherwise
Each k-th frontal tensor slice X_k is an adjacency matrix for the k-th predicate
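Constructing this binary tensor is straightforward; a dependency-free Python sketch with a toy two-entity, one-predicate graph:

```python
# Sketch: build the n x n x m binary tensor X from a list of triples,
# where X[i][j][k] = 1 iff the k-th predicate holds between entity i
# (subject) and entity j (object). Plain nested lists keep it simple.

def build_tensor(entities, predicates, triples):
    n, m = len(entities), len(predicates)
    ei = {e: i for i, e in enumerate(entities)}   # entity -> row/column index
    pi = {p: k for k, p in enumerate(predicates)} # predicate -> slice index
    X = [[[0] * m for _ in range(n)] for _ in range(n)]
    for s, p, o in triples:
        X[ei[s]][ei[o]][pi[p]] = 1
    return X

X = build_tensor(
    ["Intel", "Gordon_Moore"], ["foundedBy"],
    [("Intel", "foundedBy", "Gordon_Moore")],
)
```

The frontal slice X[:][:][k] is exactly the adjacency matrix of predicate k.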
[Nickel et al., ICML’11, WWW’12]
Given r, the number of latent factors, factorize each slice X_k:

X_k ≈ A R_k Aᵀ, k = 1, …, m

where A is a dense n × r matrix of latent embeddings for the entities, and R_k is an r × r matrix of latent factors for the k-th predicate
Idea: represent KG entities and relations as dense real-valued vectors (i.e., embeddings) and predict a relation between entities es and eo in a KG with a scoring function f(es, eo, Θ)
Interaction-based methods:
◮ RESCAL [Nickel et al., ICML’11]: wkᵀ(es ⊗ eo)
◮ LFM [Jenatton et al., NIPS’12]: esᵀ Wp eo
◮ HolE [Nickel et al., AAAI’16]: σ(pᵀ(es ⋆ eo))
Neural network-based methods:
◮ ER-MLP [Dong et al., KDD’14]: wᵀ g(Cᵀ[es; eo; p])
Distance-based methods:
◮ Unstructured [Bordes et al., AAAI’11]: −‖es − eo‖²₂
◮ SE [Bordes et al., AAAI’11]: −‖Wes es − Weo eo‖₁
◮ TransE [Bordes et al., NIPS’13]: −‖es + p − eo‖₁ (or ‖·‖₂)
⊗, ⋆, and [·; ·] denote tensor product, circular correlation, and vector concatenation
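The distance-based scoring is easy to see in code; this sketch implements the TransE score on hand-picked toy embeddings (the vectors are chosen so the true triple scores best, not learned):

```python
# Sketch of TransE scoring: a triple (s, p, o) is plausible when
# e_s + p is close to e_o under the L1 norm. Toy 2-d embeddings below
# are hand-picked, not learned.

def transe_score(e_s, p, e_o):
    """Negative L1 distance: -||e_s + p - e_o||_1 (higher = more plausible)."""
    return -sum(abs(s + r - o) for s, r, o in zip(e_s, p, e_o))

intel   = [0.0, 1.0]
founded = [1.0, 0.0]   # relation vector for "foundedBy" (illustrative)
moore   = [1.0, 1.0]
paris   = [5.0, 5.0]

good = transe_score(intel, founded, moore)  # exact translation, distance 0
bad  = transe_score(intel, founded, paris)
```

Training would adjust the embeddings so that observed triples outscore corrupted ones by a margin.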
[Jameel et al., SIGIR’17]
Salient properties of entities are modeled as hyperplanes that separate entities that have a property in their descriptions from entities that do not
Normals of the separating hyperplanes point to the regions where entities with a salient property occur
[Zhiltsov and Agichtein, CIKM’13]
Idea: use similarities between entities in the low-dimensional embedding space as features:
◮ cosine similarity: cos(e, etop)
◮ Euclidean distance: ‖e − etop‖₂
◮ heat kernel: exp(−‖e − etop‖²₂ / σ)
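These three embedding-based features can be computed with plain Python; the two toy vectors below are made up:

```python
import math

# Sketch of the three embedding-based features between a candidate
# entity e and a top-ranked entity e_top, on toy 2-d vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def heat_kernel(a, b, sigma=1.0):
    # decays with squared distance; sigma controls the bandwidth
    return math.exp(-euclidean(a, b) ** 2 / sigma)

e, e_top = [1.0, 0.0], [1.0, 0.0]
```

Identical embeddings give the maximal value of all three similarity features.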
[Schuhmacher, Dietz et al., CIKM’15]
Aim: complex entity-focused informational queries (e.g. “Argentine British relations”)
Summary:
◮ Use dynamic entity representations built from different sources (not only static KB descriptions)
◮ Use retrieval models that account for query unigrams and bigrams (FSDM and PFSDM), rather than bag-of-words structured document retrieval models (BM25F and MLM), to obtain candidate entities
◮ Leverage entity links and types in entity retrieval models
◮ Expand candidate entities by following KG links
◮ Re-rank candidate entities using a variety of features, including ones based on KG entity embeddings