
A Ranking Framework for Entity Oriented Search using Markov Random Fields

Hadas Raviv, David Carmel
IBM Research - Haifa Lab, Israel

Oren Kurland
Faculty of IE and Management, Technion, Israel


MRF for EoS JIWES 2012, Portland OR



Outline

Entity Oriented Search

  • Popular Approaches

MRF for information retrieval

MRF for Entity Oriented Search

  • Entity Document Scoring
  • Entity Type Scoring
  • Entity Name Scoring

Evaluation

  • Benchmarks: INEX entity tracks 2007-2009
  • Experimental Results

Summary and future work


Entity Oriented Search (EoS)

When people use retrieval systems, they are often not searching for documents or text passages

Often named entities play a central role in answering such information needs

  • persons, organizations, locations, products…

At least 20-30% of the queries submitted to Web search engines are simply named entities

~71% of Web search queries contain named entities

(Named entity recognition in query, Guo et al, SIGIR09)


EoS: Profile-based approach (Craswell et al 2001)

Represent each entity by a virtual document (a profile), e.g.

  • Entity home page
  • Concatenation of passages mentioning the entity

Rank those profiles according to their relevance to the query

  • Using standard IR ranking techniques

Difficulties:

  • Co-reference resolution and name disambiguation
  • Profiling is not an easy task

[Figure: entities e1-e4 with their profile documents d_e1-d_e4, and the query q matched against the profiles]


EoS: Voting approach (Balog06, MacDonald09)

Any relevant document is a “voter” for the entities mentioned within its content

What is the rationale behind it?

An entity mentioned many times in relevant (top retrieved) docs is more likely to be relevant to the given topic

$$Score(q, p) = \sum_{d} Score(q, d) \cdot Score(d, p)$$

[Figure: query q, documents d1, d2, d3 and candidate entities p1, p2, p3]
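The voting idea can be sketched in a few lines of Python: each top retrieved document passes its retrieval score on to the entities it mentions. The function name, the dictionary layout and the toy scores are illustrative assumptions, not taken from the cited systems.

```python
from collections import defaultdict


def voting_scores(doc_scores, doc_entity_scores):
    """Score(q, p) = sum over documents d of Score(q, d) * Score(d, p)."""
    scores = defaultdict(float)
    for doc_id, q_d in doc_scores.items():
        for entity, d_p in doc_entity_scores.get(doc_id, {}).items():
            scores[entity] += q_d * d_p
    return dict(scores)


# Toy data (hypothetical): retrieval scores and entity-mention weights.
doc_scores = {"d1": 0.5, "d2": 0.25, "d3": 0.25}
doc_entity_scores = {
    "d1": {"p1": 1.0, "p2": 1.0},
    "d2": {"p1": 1.0},
    "d3": {"p3": 1.0},
}
ranking = sorted(
    voting_scores(doc_scores, doc_entity_scores).items(),
    key=lambda kv: kv[1],
    reverse=True,
)
```

Here p1 collects votes from two documents and therefore ranks first.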


Markov Random Fields for IR (Metzler & Croft 2005)

  • Full independence: CT = (qi, D)
  • Sequential dependence: CO = (#1(qi..qi+k), D)
  • Full dependence: CU = (#uwN{qi..qj}, D)

$$P(D \mid Q) \stackrel{rank}{=} \sum_{c \in C(G)} \lambda_c f(c)$$
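A rough sketch of how the query-term sets behind the three clique types could be enumerated. It assumes that sequential dependence uses contiguous runs of two or more query terms (matched as exact phrases, `#1`) and that full dependence uses any subset of two or more terms (matched in an unordered window, `#uwN`); the function name is hypothetical.

```python
from itertools import combinations


def cliques(terms):
    """Enumerate query-term sets for the three clique types."""
    # Full independence: one clique per single query term.
    full_independence = [(t,) for t in terms]
    # Sequential dependence: every contiguous run of two or more terms.
    sequential = [
        tuple(terms[i:j])
        for i in range(len(terms))
        for j in range(i + 2, len(terms) + 1)
    ]
    # Full dependence: every subset of two or more terms, adjacent or not.
    full_dependence = [
        c for k in range(2, len(terms) + 1) for c in combinations(terms, k)
    ]
    return full_independence, sequential, full_dependence
```

Each clique is then paired with the document D and scored by its own feature function.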


MRF for EoS

$$P(E \mid Q) \stackrel{rank}{=} \sum_{P \in \{D, T, N\}} \lambda_{E_P} P(E_P \mid Q)$$

$$Q = \{q_1, \ldots, q_n\},\ T$$


MRF based Entity Document Scoring P(ED|Q)

We consider cliques of three types

  • Full independent (CT)
  • Sequential dependent (CO)
  • Full dependent (CU)

The feature function f^I_D(c) over a clique of type I (I in {T,O,U}) measures how well the clique's terms represent the entity document

  • Based on a Dirichlet-smoothed language model
  • For CO and CU we replace qi with #1(qi..qi+k) and #uwN({qi,..qj}) respectively

The entity document scoring function aggregates the feature functions over all clique types

$$P(E_D \mid Q) \stackrel{rank}{=} \sum_{I \in \{T, O, U\}} \sum_{c \in I_{E_D}} \lambda^{I}_{E_D} f^{I}_{D}(c)$$

$$f^{T}_{D}(q_i, E_D) = \log \frac{tf(q_i, E_D) + \mu \cdot cf(q_i) / |C|}{|E_D| + \mu}$$
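A minimal sketch of a Dirichlet-smoothed language-model feature for a single query term, assuming the entity document is a list of tokens and collection statistics are given as a term-frequency dictionary. The function name, the argument layout and the default `mu` are assumptions; the slides do not state the smoothing parameter.

```python
import math


def dirichlet_feature(term, doc_tokens, collection_tf, collection_len, mu=1000.0):
    """log((tf + mu * cf / |C|) / (|E_D| + mu)) for one query term."""
    tf = doc_tokens.count(term)        # term frequency in the entity document
    cf = collection_tf.get(term, 0)    # term frequency in the whole collection
    smoothed = (tf + mu * cf / collection_len) / (len(doc_tokens) + mu)
    # A term unseen in both document and collection gets probability zero.
    return math.log(smoothed) if smoothed > 0 else float("-inf")
```

For the ordered and unordered cliques, `tf` and `cf` would count phrase or window matches instead of single-term occurrences.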


Entity type Scoring P(ET|Q)

fT(c) is defined over a single clique composed of ET and QT

$$P(E_T \mid Q) \stackrel{rank}{=} f_T(c) = \log \frac{e^{-\alpha\, d(Q_T, E_T)}}{\sum_{E' \in R} e^{-\alpha\, d(Q_T, E'_T)}}$$

d(QT,ET), the type distance, is domain dependent

  • In our experiments we measured the distance in the Wikipedia category graph
  • The minimal path length between all pairs of the query and the entity’s page categories
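The type distance and the resulting score can be sketched as follows, assuming the category graph is an undirected adjacency dictionary and that the score is a normalized exponential decay of the distance. The function names and the graph representation are hypothetical; the defaults `d_max=5` and `alpha=3.0` follow the values reported in the parameter tuning slide.

```python
import math
from collections import deque


def path_length(graph, source, target, d_max=5):
    """Shortest path length between two categories, capped at d_max."""
    if source == target:
        return 0
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == d_max:
            continue  # do not expand beyond the maximal distance
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return d_max


def type_distance(graph, query_cats, entity_cats, d_max=5):
    """Minimal path length over all (query category, entity category) pairs."""
    return min(
        path_length(graph, q, e, d_max) for q in query_cats for e in entity_cats
    )


def type_scores(distances, alpha=3.0):
    """Log of exp(-alpha * d), normalized over all candidate entities."""
    log_norm = math.log(sum(math.exp(-alpha * d) for d in distances.values()))
    return {e: -alpha * d - log_norm for e, d in distances.items()}
```

Entities whose categories sit closer to the query categories receive a higher (less negative) score.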


Entity Name Scoring P(EN|Q)

Clique types:

Query terms independent: SEN, a single-node clique containing the entity name alone

  • Equivalent to the voting approach

Query terms dependent: consider proximity to the query terms

  • Is the entity usually mentioned in proximity to the query terms?
  • In analogy to document scoring:
  • TEN Full independent
  • OEN Sequential dependent
  • UEN Full dependent


Entity Name Scoring P(EN|Q) (cont.)

Local approach

  • Measure the relationship (e.g. proximity) between the query terms and the entity name in the top retrieved documents

Global approach

  • Measure the PMI between the query term(s) and the entity name in the whole collection
  • PMI, the pointwise mutual information: the likelihood of finding one term in proximity to another term

$$P(E_N \mid Q) \stackrel{rank}{=} \sum_{X \in A} \sum_{c \in X_{E_N}} \lambda^{X}_{E_N} f^{X}_{N}(c)$$

$$A = \{\underbrace{S, T, O, U}_{\text{local}},\ \underbrace{PMI_T, PMI_O, PMI_U}_{\text{global}}\}$$
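For the global approach, pointwise mutual information can be estimated from collection counts. The sketch below assumes counts are taken over fixed-size text windows (how many windows contain the query term, the entity name, and both); the function name and this counting scheme are assumptions, not details given in the slides.

```python
import math


def pmi(cooccur_count, count_x, count_y, total):
    """PMI(x, y) = log(P(x, y) / (P(x) * P(y))), estimated from window counts."""
    if min(cooccur_count, count_x, count_y) == 0:
        return float("-inf")  # never observed together (or not observed at all)
    p_xy = cooccur_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))
```

A positive value means the entity name and the query term co-occur more often than they would if they were independent.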


Entity Scoring Process

[Figure: entity scoring pipeline for an input (Q,T). Candidate lists E1..En, E’1..E’n+k and E’’1..E’’n+k are scored in stages by P(ED|Q), P(ED,ET|Q) and P(ED,ET,EN|Q).]


Evaluation

The INEX Entity Ranking track 2007-2009

  • The collection: Wikipedia articles
  • A retrievable entity must have a Wikipedia page
  • No need for third-party named-entity extraction tools
  • Each entity has a unique name, document and type (Wikipedia categories)

INEX topics perfectly fit our model: each looks for entities relevant to a given topic

Metrics: MAP for 2007, infAP for 2008-2009

Data set   Wikipedia year   #docs       Train topics   Test topics
2007       2006             659,388     28             46
2008       2006             659,388     74             35
2009       2009             2,666,190   -              55
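As a reminder of how the MAP metric is computed, here is a small sketch of average precision per topic and its mean over topics. The function names are hypothetical, and infAP (used for 2008-2009) is a sampling-based estimate of the same quantity that is not implemented here.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant items."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked entity list, set of relevant entities) per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```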


INEX – Entity Ranking track

Entity Ranking (XER)

  • Return entities that satisfy a topic described in natural language text

List Completion (LC)

  • Complete a partial list of given answers

Entities:

  • Must have a Wikipedia page

Entity type is determined by corresponding Wikipedia categories

  • “movies”, “trees”, “Italian politicians”

<inex_topic>
  <title>circus mammals</title>
  <description>I want a list of mammals which have ever been tamed to perform in circuses.</description>
  <narrative>Each answer should contain an article about mammal which can be a part of any circus show.</narrative>
  <categories>
    <category id="138">mammals</category>
  </categories>
  <entities>
    <entity id="379035">Asian Elephant</entity>
    <entity id="4402">Brown Bear</entity>
  </entities>
</inex_topic>
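A topic in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below assumes the element names shown in the example topic; `parse_topic` and the returned dictionary keys are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

TOPIC_XML = """
<inex_topic>
  <title>circus mammals</title>
  <categories>
    <category id="138">mammals</category>
  </categories>
  <entities>
    <entity id="379035">Asian Elephant</entity>
    <entity id="4402">Brown Bear</entity>
  </entities>
</inex_topic>
"""


def parse_topic(xml_text):
    """Extract the query text, target categories and example entities."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("title", default="").strip(),
        "categories": [(c.get("id"), c.text.strip()) for c in root.iter("category")],
        "examples": [(e.get("id"), e.text.strip()) for e in root.iter("entity")],
    }


topic = parse_topic(TOPIC_XML)
```

The title supplies the query terms, the categories supply the target type, and the example entities are the given answers used in the list completion task.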


Parameter tuning

The following parameters are common to all the entity ranking scores. Values were selected after an extensive search over a wide range of values.

Entity Property   Symbol   Parameter name                            Optimal Value
ED                N        Query term proximity window size          10
ET                dmax     Max. distance in category graph           5
                  alpha    Category score decay                      3
EN                R        #top-docs for voting score computation    500
                  Rinit    #top-docs for entity expansion            10
                  K        Entity name terms proximity window        3


Parameter tuning (cont) – Coordinate Ascent

The parameters of the scoring function were tuned using the Coordinate Ascent algorithm for each benchmark

  • Optimization was done separately for each dataset, using the training topics
  • Performance was estimated using the test topics
  • For 2009 we used cross-validation
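A generic coordinate ascent loop might look like the sketch below: one weight is changed at a time, and a change is kept only if the objective improves. This is not the authors' implementation; the step schedule, the number of rounds and the function name are assumptions, and in this setting `objective` would compute MAP or infAP on the training topics for a given weight vector.

```python
def coordinate_ascent(objective, weights, step=0.1, rounds=20):
    """Greedy one-coordinate-at-a-time search that never lowers the objective."""
    weights = list(weights)
    best = objective(weights)
    for _ in range(rounds):
        improved = False
        for i in range(len(weights)):
            for delta in (step, -step):
                trial = list(weights)
                trial[i] += delta
                value = objective(trial)
                if value > best:
                    weights, best, improved = trial, value, True
        if not improved:
            step /= 2  # refine the search once no single move helps
    return weights, best
```

Because the retrieval metric is not differentiable in the weights, a derivative-free search of this kind is a natural fit.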


[Figure: bar chart of retrieval performance on INEX 2007, 2008 and 2009 for S(ED), S(ED,ET) and S(ED,ET,EN) under the FI, SD and FD models, compared with the top INEX result]

  • Results improved significantly when type and name scoring were added
  • Final results are superior to the top INEX results for 2007 and 2008, and comparable for 2009
  • Dependence models (SD, FD) have not improved over the independence model (FI) ???
  • Global name scoring (PMI) outperforms local name scoring


Summary

In this work we presented an entity ranking model using the MRF framework, which integrates

  • Profile approach: query E-document relationship
  • Voting approach: query E-name relationship
  • Type filtering approach: query E-type relationship

Experiments over INEX benchmarks showed that

  • Performance is relatively high and comparable to leading INEX systems
  • Using dependence models did not result in significant improvement over the Full Independence model
  • Global name scoring outperforms local name scoring

Future work

  • Explore this model with additional data collections, specifically large web collections
  • Use additional entity properties, e.g. by exploring the entity graph
  • Further investigate the dependence models


Thank You! Questions?