Optimizing Sampling-based Entity Resolution over Streaming Documents - PowerPoint PPT Presentation



SLIDE 1

Optimizing Sampling-based Entity Resolution over Streaming Documents

Christan Grant and Daisy Zhe Wang, University of Florida

SIAM BSA Workshop 2015

SLIDE 2

Knowledge bases are important structures for organizing and categorizing information.

SLIDE 6
  • Many of these knowledge bases, and new knowledge bases, are bootstrapped using Wikipedia/Freebase.

  • All Wikipedia information is based on facts from (reputable?) web sources.

SLIDE 10

[Histogram: proportion of citations vs. time lag in days (5 to 2000, log scale); median = 356 days]

The median time between an event and its appearance on Wikipedia is 356 days.

  • J. Frank et al. 2012
SLIDE 12

Knowledge Base Acceleration

NIST TREC created a track that reads in streaming documents and a set of entities, and suggests citations for Wikipedia entities.

SLIDE 19

Knowledge Base Acceleration

NIST TREC created a track that reads in streaming documents and a set of entities, and suggests citations for Wikipedia entities.

Challenges:
  1) A large volume of documents
  2) Ambiguous text
  3) Ambiguous entities
  4) Finding relevant facts

SLIDE 20

Example System

[System diagram: Chunk Files Index Generator; StreamItems Index Generator; Streaming Slot Value Extraction; High Accuracy Filter; Web Corpus; Streaming Slot Values; Manual Aliases Extraction (Twitter); Alias Extraction (Wiki API, Wiki text); Name Order Generator; Training Data; Wikipedia]

SLIDE 25

Entity Resolution

  • Entity resolution is the process of identifying and clustering different manifestations (e.g., mentions, noun phrases, named entities) of the same real-world object.

  • Difficult because of ambiguity: same name, different person; different name, same person.

SLIDE 31

Entity Resolution Model

[Graphical model: mentions "Jimmy Fallon", "James Thomas Fallon", "Tonight Show Host", "Jimmy Kimmel", "James Christian Kimmel", connected by pairwise factors ψ]

Find the best arrangement.

SLIDE 34

Entity Resolution Algorithm

The baseline ER Metropolis-Hastings sampler takes a random mention and adds it to a random entity.

SLIDE 56

Entity Resolution Algorithm

1. Select a source mention at random.
2. Select a destination mention at random.
3. Propose a merge.
4. Accept the merge when it improves the state.

SLIDE 59

Entity Resolution Algorithm

Eventually the chain converges: the state no longer oscillates or varies. This is Markov chain Monte Carlo with a Metropolis-Hastings proposal.
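The four-step proposal loop above can be sketched as a minimal sampler. This is an illustrative simplification, not the paper's implementation: the acceptance test is a greedy "accept if the score improves" variant of Metropolis-Hastings, and the mention strings and token-overlap score in the usage note are hypothetical.

```python
import random

def mh_entity_resolution(mentions, score, num_samples=10000, seed=0):
    """Baseline ER sampler: repeatedly pick a random source and
    destination mention, propose moving the source into the
    destination's entity, and accept when the state score improves
    (a greedy simplification of Metropolis-Hastings)."""
    rng = random.Random(seed)
    state = {m: i for i, m in enumerate(mentions)}  # start: singleton entities
    best = score(state)
    for _ in range(num_samples):
        src, dst = rng.sample(mentions, 2)   # steps 1-2: random source/destination
        if state[src] == state[dst]:
            continue
        proposal = dict(state)               # step 3: propose the merge
        proposal[src] = state[dst]
        new = score(proposal)
        if new > best:                       # step 4: accept if it improves the state
            state, best = proposal, new
    return state
```

With a score that, say, rewards token overlap within an entity and penalizes non-overlap, mentions of the same person end up clustered together after enough samples.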

SLIDE 61

Sampling Optimizations

Distributed computations (Singh et al. 2011)
Query-driven computation (Grant et al. 2015)

SLIDE 68

Sampling Inefficiencies

  1. Large clusters are the slowest: pairwise comparisons are expensive, Θ(n²).
  2. Excessive computation on unambiguous entities: entities such as Carnegie Mellon are relatively unambiguous.

Streaming documents exacerbate these problems.

SLIDE 72

Optimizer for MCMC Sampling

A database-style optimizer for streaming MCMC. This optimizer makes two decisions:
1. Can I approximate the state score calculation?
2. Should I compress an entity?

SLIDE 73

Experiments

  • Wikilinks data set (Singh, Subramanya, Pereira, McCallum, 2011)
  • Largest fully-labeled data set
  • 40 million mentions
  • 180 GB of data
SLIDE 74

Large Entity Sizes

SLIDE 80

Entity Compression

  • Known matches can be compressed into a representative mention.
  • Entity compression can reduce the number of mentions (n).
  • Compression of large and popular entities is costly.
  • Compression errors are permanent.
SLIDE 81

Compression Types

  • Run-Length Encoding
  • Hierarchical Compression (Wick et al.)

[Figure: run-length encoding of a mention cluster]
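The run-length-encoding idea above can be sketched for exact-duplicate mentions: a cluster stores one representative per distinct string plus a count, shrinking n for pairwise scoring. The cluster contents below are illustrative.

```python
from itertools import groupby

def rle_compress(mentions):
    """Collapse identical mention strings in a cluster into
    (representative, count) pairs."""
    return [(m, len(list(g))) for m, g in groupby(sorted(mentions))]

def rle_expand(pairs):
    """Invert the compression (original order is not preserved)."""
    return [m for m, count in pairs for _ in range(count)]
```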

SLIDE 83

Early Stopping

  • Can we estimate the computation of the features?
  • Given a p value, randomly select fewer values.

Singh et al. EMNLP'12
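The subsampling idea above (in the spirit of Singh et al. EMNLP'12) can be sketched as an approximate cluster score: evaluate each pairwise factor only with probability p and rescale the subtotal, giving an unbiased estimate of the full Θ(n²) sum. The `pair_score` function is a hypothetical stand-in for the model's factors.

```python
import random
from itertools import combinations

def approx_cluster_score(mentions, pair_score, p=0.2, seed=0):
    """Estimate the sum of pairwise factor scores over a cluster:
    include each pair with probability p, then divide the subtotal
    by p (a Horvitz-Thompson style unbiased estimate)."""
    rng = random.Random(seed)
    subtotal = sum(pair_score(a, b)
                   for a, b in combinations(mentions, 2)
                   if rng.random() < p)
    return subtotal / p
```

The estimate's variance shrinks as p grows, which is why (as the later slides argue) full comparison is preferable for very small clusters, where there are too few pairs to subsample reliably.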

SLIDE 85

Optimizer

Current work

  1. A classifier for deciding when to perform early stopping.
  2. A classifier for the decision to compress.
SLIDE 92

When should it compress?

The power-law distribution of cluster sizes means there are only a small number of very large clusters. We can treat these in a special way. Examining the Wikilink data set:

[Cluster-size distributions: exact string match initialization vs. ground truth]

SLIDE 96

When should it compress?

We could make 100,000 insertions in the time it takes to compress a 300K-mention cluster. Compression must be worth it.

SLIDE 99

When should we approximate?

  • Early stopping only makes sense for clusters of medium size.
  • It is better to do a full comparison for small and large cluster sizes.

SLIDE 102

Optimizer for Query-Driven Sampling

while samples-- > 0:
    m ~ Mentions
    e ~ Entities
    state' = move(state, m, e)
    o = Optimize(state, state', m, e)
    if (!score(state', state, o)):
        state = state'
    doCompress(state, m, e, o)

The optimizer needs to know:

  • the current cardinality of items in each entity
  • the memory/CPU configuration, for estimating baseline time
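The optimizer's two decisions can be sketched as a size-based policy over the entity cardinalities it tracks. The policy shape follows the slides (approximate only medium clusters; compress only the power-law tail), but the specific thresholds are illustrative assumptions:

```python
def choose_plan(cluster_size, small=50, large=10000):
    """Per-proposal strategy from cluster cardinality: exact scoring
    for small and very large clusters, early stopping (approximate
    scoring) for medium ones, and compression only for the few very
    large clusters where its cost pays for itself."""
    approximate = small <= cluster_size < large   # medium: subsample the pairs
    compress = cluster_size >= large              # power-law tail: worth compressing
    return {"approximate": approximate, "compress": compress}
```

A classifier trained on observed sampling times could replace the fixed thresholds, which is the direction the "Current work" slide describes.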

slide-103
SLIDE 103

Summary

  • We motivated the need and discussed the open space for optimization of MCMC sampling methods.
  • We plan to use the newly released labeled TREC stream corpus.
  • Want to collaborate?
  • Let's talk if you want to do a Ph.D. at the University of Oklahoma!
SLIDE 104

Thank you!