Lecture 24: NER & Entity Linking, Kai-Wei Chang, CS @ University of Virginia



SLIDE 1

Lecture 24: NER & Entity Linking

Kai-Wei Chang
CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16

CS6501-NLP

SLIDE 2

Organizing knowledge

It’s a version of Chicago – the standard classic Macintosh menu font, with that distinctive thick diagonal in the “N”.

Chicago was used by default for Mac menus through MacOS 7.6, and OS 8 was released mid-1997.

Chicago VIII was one of the early 70s-era Chicago albums to catch my ear, along with Chicago II.

Slides are adapted from Dan Roth

SLIDE 3

Cross-document co-reference resolution

[Same Chicago example text as Slide 2]

SLIDE 4

Reference resolution: (disambiguation to Wikipedia)

[Same Chicago example text as Slide 2]

SLIDE 5

The “Reference” Collection has Structure

[Same Chicago example text as Slide 2; figure edge labels: Used_In, Is_a, Is_a, Succeeded, Released]

SLIDE 6

Analysis of Information Networks

[Same Chicago example text as Slide 2]

SLIDE 7

Wikipedia as a knowledge resource…

[Figure edge labels: Used_In, Is_a, Is_a, Succeeded, Released]

SLIDE 8

Wikification: The Reference Problem

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

Cycles of Knowledge: Grounding for/using Knowledge

SLIDE 9

Challenging

  • Dealing with ambiguity of natural language
      • Mentions of entities and concepts could have multiple meanings
  • Dealing with variability of natural language
      • A given concept could be expressed in many ways
  • Wikification addresses these two issues in a specific way: the Reference Problem
      • What is meant by this concept? (WSD + grounding)
      • More than just co-reference (within and across documents)

SLIDE 10
General Challenges

  • Ambiguity
  • Concepts outside of Wikipedia (NIL)
      • Blumenthal?
  • Variability
  • Scale
      • Millions of labels

Blumenthal (D) is a candidate for the U.S. Senate seat now held by Christopher Dodd (D), and he has held a commanding lead in the race since he entered it. But the Times report has the potential to fundamentally reshape the contest in the Nutmeg State.

[Slide callouts: “Connecticut”, “CT”, “The Nutmeg State”; “Times”, “The New York Times”, “The Times”]

SLIDE 11

Wikification: Subtasks

Wikification and Entity Linking require addressing several sub-tasks:

  • Identifying Target Mentions
      • Mentions in the input text that should be Wikified
  • Identifying Candidate Titles
      • Candidate Wikipedia titles that could correspond to each mention
  • Candidate Title Ranking
      • Rank the candidate titles for a given mention
  • NIL Detection and Clustering
      • Identify mentions that do not correspond to a Wikipedia title
      • Entity Linking: cluster NIL mentions that represent the same entity
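As a rough illustration of the candidate-title and NIL sub-tasks, here is a minimal sketch. The alias table, its entries, and all function names are hypothetical, made up for this example. It treats a mention as NIL only when it has no candidates at all, which is a simplification; a real system would also reject low-scoring candidates. Clustering NIL mentions by normalized surface form is likewise only a crude stand-in.

```python
from collections import defaultdict

# Hypothetical alias table: lowercased surface form -> candidate Wikipedia titles.
ALIASES = {
    "nutmeg state": ["Connecticut"],
    "ct": ["Connecticut"],
    "times": ["The New York Times", "The Times"],
}


def normalize(mention):
    """Lowercase and collapse whitespace so surface variants share one key."""
    return " ".join(mention.lower().split())


def candidate_titles(mention, aliases=ALIASES):
    """Return the candidate titles for a mention (empty list if none)."""
    return list(aliases.get(normalize(mention), []))


def split_linkable_and_nil(mentions, aliases=ALIASES):
    """Separate mentions that have candidates from those treated as NIL."""
    linkable, nil_mentions = {}, []
    for mention in mentions:
        candidates = candidate_titles(mention, aliases)
        if candidates:
            linkable[mention] = candidates
        else:
            nil_mentions.append(mention)
    return linkable, nil_mentions


def cluster_nil(nil_mentions):
    """Group NIL mentions whose normalized surface forms are identical."""
    clusters = defaultdict(list)
    for mention in nil_mentions:
        clusters[normalize(mention)].append(mention)
    return list(clusters.values())


mentions = ["Blumenthal", "Nutmeg State", "blumenthal", "Times"]
linkable, nil_mentions = split_linkable_and_nil(mentions)
nil_clusters = cluster_nil(nil_mentions)
```

In this toy run, "Times" keeps two candidates (ambiguity), "Nutmeg State" maps to a single title (variability), and both spellings of "Blumenthal" end up in one NIL cluster.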

SLIDE 12

High-level Algorithmic Approach.

  • Input: a text document d; Output: a set of pairs (m_i, t_i)
      • m_i are mentions in d; t_i are the corresponding Wikipedia titles, or NIL
  • (1) Identify mentions m_i in d
  • (2) Local Inference
      • For each m_i in d:
          • Identify a set of relevant titles T(m_i)
          • Rank titles t_i ∈ T(m_i) [e.g., consider local statistics of the edges (m_i, t_i), (m_i, *), and (*, t_i) occurrences in the Wikipedia graph]
  • (3) Global Inference
      • For each document d:
          • Consider all m_i ∈ d and all t_i ∈ T(m_i)
          • Re-rank titles t_i ∈ T(m_i) [e.g., if m, m' are related by virtue of being in d, their corresponding titles t, t' may also be related]
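The local statistics mentioned in step (2) can be sketched as ratios of edge counts. This is only one plausible reading of the slide, not necessarily the lecture's exact features. The link counts below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical anchor-link counts: (mention surface form, linked title) -> count.
# The numbers are made up for illustration only.
LINK_COUNTS = Counter({
    ("Chicago", "Chicago"): 900,
    ("Chicago", "Chicago (band)"): 80,
    ("Chicago", "Chicago (typeface)"): 20,
    ("Windy City", "Chicago"): 50,
})


def local_statistics(mention, title, counts=LINK_COUNTS):
    """Return (P(title | mention), P(mention | title)) from edge counts."""
    edge = counts[(mention, title)]                                        # (m, t)
    from_mention = sum(c for (m, _), c in counts.items() if m == mention)  # (m, *)
    into_title = sum(c for (_, t), c in counts.items() if t == title)      # (*, t)
    p_title_given_mention = edge / from_mention if from_mention else 0.0
    p_mention_given_title = edge / into_title if into_title else 0.0
    return p_title_given_mention, p_mention_given_title


def rank_titles(mention, counts=LINK_COUNTS):
    """Rank the titles linked from a mention by P(title | mention), best first."""
    titles = {t for (m, t) in counts if m == mention}
    return sorted(
        titles,
        key=lambda t: local_statistics(mention, t, counts)[0],
        reverse=True,
    )


ranking = rank_titles("Chicago")
```

A purely local ranking like this always prefers the most frequently linked title, which is why step (3) re-ranks using the other mentions in the document.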

SLIDE 13

Local approach

  • Γ is a solution to the problem
      • A set of pairs (m, t)
      • m: a mention in the document
      • t: the matched Wikipedia title

[Figure labels: a text document; Wikipedia articles; identified mentions; local score of matching the mention to the title (decomposed by m_i)]
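The slide's equation did not survive in the text. Judging from the labels (a local score that decomposes over mentions), one plausible form of the local objective is the following, where the score symbol \phi and the mention count N are assumptions rather than notation confirmed by the slide:

```latex
\[
  \Gamma^{*}_{\text{local}} \;=\; \arg\max_{\Gamma} \sum_{i=1}^{N} \phi(m_i, t_i)
\]
```

Because the sum decomposes over mentions, each m_i can be resolved independently by picking its highest-scoring title.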

SLIDE 14

Global Approach: Using Additional Structure

[Figure labels: text document(s) (news, blogs, …); Wikipedia articles]

Adding a “global” term to evaluate how good the structure of the solution is.

  • Use the local solutions Γ’ (each mention considered independently).
  • Evaluate the structure based on pairwise coherence scores Ψ(t_i, t_j).
  • Choose those that satisfy document coherence conditions.
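One way to read the three bullets above is sketched below, under stated assumptions: take the local solution Γ’ as context, then re-score each mention's candidates by adding their pairwise coherence with the titles Γ’ chose for the other mentions. The local scores (PHI), coherence values (PSI), and function names are all hypothetical, and this is not claimed to be the lecture's exact formulation.

```python
# Hypothetical local scores: mention -> {candidate title: local score}.
PHI = {
    "Chicago": {"Chicago": 0.6, "Chicago (band)": 0.3, "Chicago (typeface)": 0.1},
    "Chicago VIII": {"Chicago VIII": 1.0},
}

# Hypothetical symmetric pairwise coherence scores between titles.
PSI = {("Chicago (band)", "Chicago VIII"): 0.8}


def coherence(psi, a, b):
    """Look up the coherence of two titles in either order (default 0)."""
    return psi.get((a, b), psi.get((b, a), 0.0))


def local_solution(phi):
    """Pick the best title for each mention independently (the local solution)."""
    return {m: max(scores, key=scores.get) for m, scores in phi.items()}


def global_rerank(phi, psi):
    """Re-rank each mention's candidates using the local solution as context."""
    gamma_local = local_solution(phi)
    result = {}
    for mention, scores in phi.items():
        # Titles chosen locally for all *other* mentions in the document.
        context = [t for other, t in gamma_local.items() if other != mention]

        def total(title):
            # Local score plus coherence with every context title.
            return scores[title] + sum(coherence(psi, title, c) for c in context)

        result[mention] = max(scores, key=total)
    return result


local = local_solution(PHI)
reranked = global_rerank(PHI, PSI)
```

With these made-up numbers, the local pass links "Chicago" to the city, while the coherence with "Chicago VIII" flips it to the band after re-ranking.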
