Optimizing Sampling-based Entity Resolution over Streaming Documents - PowerPoint PPT Presentation



SLIDE 1

Optimizing Sampling-based Entity Resolution over Streaming Documents

Christan Grant and Daisy Zhe Wang, University of Florida

SIAM BSA Workshop 2015

SLIDE 2

Knowledge bases are important structures for organizing and categorizing information.

SLIDE 6
  • Many of these knowledge bases, and new knowledge bases, are bootstrapped using Wikipedia/Freebase.

  • All Wikipedia information is based on facts from (reputable?) web sources.

SLIDE 10

[Histogram: proportion of citations vs. time lag in days (5 to 2000, log scale); median = 356 days]

The median time between an event and its appearance on Wikipedia is 356 days.

  • J. Frank et al. 2012
SLIDE 12

Knowledge Base Acceleration

NIST TREC created a track that reads in streaming documents and a set of entities, and suggests citations for Wikipedia entities.

SLIDE 19

Knowledge Base Acceleration

NIST TREC created a track that reads in streaming documents and a set of entities, and suggests citations for Wikipedia entities.

Challenges:
  1) A large volume of documents
  2) Ambiguous text
  3) Ambiguous entities
  4) Finding relevant facts

SLIDE 20

Example System

[System diagram: Chunk Files Index Generator; StreamItems Index Generator; Streaming Slot Value Extraction; High Accuracy Filter; Web Corpus; Streaming Slot Values; Manual Aliases Extraction (Twitter); Alias Extraction (Wiki API, Wiki text); Name Order Generator; Training Data; Wikipedia]

SLIDE 25

Entity Resolution

  • Entity resolution is the process of identifying and clustering different manifestations (e.g., mentions, noun phrases, named entities) of the same real-world object.

  • Difficult because of ambiguity: same name, different person; different name, same person.

SLIDE 31

Entity Resolution Model

[Graphical model: mentions "Jimmy Fallon", "James Thomas Fallon", "Tonight Show Host", "Jimmy Kimmel", "James Christian Kimmel", connected by pairwise factors ψ]

Find the best arrangement.

SLIDE 34

Entity Resolution Algorithm

The baseline ER Metropolis-Hastings sampler takes a random mention and adds it to a random entity.

SLIDE 56

Entity Resolution Algorithm

1. Select a source mention at random.
2. Select a destination mention at random.
3. Propose a merge.
4. Accept the merge when it improves the state.

SLIDE 59

Entity Resolution Algorithm

Eventually the chain converges: the state no longer oscillates or varies. This is Markov chain Monte Carlo with a Metropolis-Hastings proposal.
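The four-step proposal loop above can be sketched as a minimal sampler. This is an illustrative simplification, not the paper's implementation: the acceptance test is a greedy "accept if the score improves" variant of Metropolis-Hastings, and the mention strings and token-overlap score in the usage note are hypothetical.

```python
import random

def mh_entity_resolution(mentions, score, num_samples=10000, seed=0):
    """Baseline ER sampler: repeatedly pick a random source and
    destination mention, propose moving the source into the
    destination's entity, and accept when the state score improves
    (a greedy simplification of Metropolis-Hastings)."""
    rng = random.Random(seed)
    state = {m: i for i, m in enumerate(mentions)}  # start: singleton entities
    best = score(state)
    for _ in range(num_samples):
        src, dst = rng.sample(mentions, 2)   # steps 1-2: random source/destination
        if state[src] == state[dst]:
            continue
        proposal = dict(state)               # step 3: propose the merge
        proposal[src] = state[dst]
        new = score(proposal)
        if new > best:                       # step 4: accept if it improves the state
            state, best = proposal, new
    return state
```

With a score that, say, rewards token overlap within an entity and penalizes non-overlap, mentions of the same person end up clustered together after enough samples.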

SLIDE 61

Sampling Optimizations

Distributed computations (Singh et al. 2011)
Query-driven computation (Grant et al. 2015)

SLIDE 68

Sampling Inefficiencies

  1. Large clusters are the slowest: pairwise comparisons are expensive, Θ(n²).
  2. Excessive computation on unambiguous entities: entities such as Carnegie Mellon are relatively unambiguous.

Streaming documents exacerbate these problems.

SLIDE 72

Optimizer for MCMC Sampling

A database-style optimizer for streaming MCMC. This optimizer makes two decisions:
1. Can I approximate the state score calculation?
2. Should I compress an entity?

SLIDE 73

Experiments

  • Wikilinks data set (Singh, Subramanya, Pereira, McCallum, 2011)
  • Largest fully-labeled data set
  • 40 million mentions
  • 180 GB of data
SLIDE 74

Large Entity Sizes

SLIDE 80

Entity Compression

  • Known matches can be compressed into a representative mention.
  • Entity compression can reduce the number of mentions (n).
  • Compression of large and popular entities is costly.
  • Compression errors are permanent.
SLIDE 81

Compression Types

  • Run-Length Encoding
  • Hierarchical Compression (Wick et al.)

[Figure: run-length encoding of a mention cluster]
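The run-length-encoding idea above can be sketched for exact-duplicate mentions: a cluster stores one representative per distinct string plus a count, shrinking n for pairwise scoring. The cluster contents below are illustrative.

```python
from itertools import groupby

def rle_compress(mentions):
    """Collapse identical mention strings in a cluster into
    (representative, count) pairs."""
    return [(m, len(list(g))) for m, g in groupby(sorted(mentions))]

def rle_expand(pairs):
    """Invert the compression (original order is not preserved)."""
    return [m for m, count in pairs for _ in range(count)]
```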

SLIDE 83

Early Stopping

  • Can we estimate the computation of the features?
  • Given a p value, randomly select fewer values.

Singh et al. EMNLP'12
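The subsampling idea above (in the spirit of Singh et al. EMNLP'12) can be sketched as an approximate cluster score: evaluate each pairwise factor only with probability p and rescale the subtotal, giving an unbiased estimate of the full Θ(n²) sum. The `pair_score` function is a hypothetical stand-in for the model's factors.

```python
import random
from itertools import combinations

def approx_cluster_score(mentions, pair_score, p=0.2, seed=0):
    """Estimate the sum of pairwise factor scores over a cluster:
    include each pair with probability p, then divide the subtotal
    by p (a Horvitz-Thompson style unbiased estimate)."""
    rng = random.Random(seed)
    subtotal = sum(pair_score(a, b)
                   for a, b in combinations(mentions, 2)
                   if rng.random() < p)
    return subtotal / p
```

The estimate's variance shrinks as p grows, which is why (as the later slides argue) full comparison is preferable for very small clusters, where there are too few pairs to subsample reliably.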

SLIDE 85

Optimizer

Current work

  1. A classifier for deciding when to perform early stopping.
  2. A classifier for the decision to compress.
SLIDE 92

When should it compress?

The power-law distribution of cluster sizes means there are only a small number of very large clusters. We can treat these in a special way. Examining the Wikilink data set:

[Cluster-size distributions: exact string match initialization vs. ground truth]

SLIDE 96

When should it compress?

We could make 100,000 insertions in the time it takes to compress a 300K-mention cluster. Compression must be worth it.

SLIDE 99

When should we approximate?

  • Early stopping only makes sense for clusters of medium size.
  • It is better to do a full comparison for small and large cluster sizes.

SLIDE 102

Optimizer for Query-Driven Sampling

while samples-- > 0:
    m ~ Mentions
    e ~ Entities
    state' = move(state, m, e)
    o = Optimize(state, state', m, e)
    if (!score(state', state, o)):
        state = state'
    doCompress(state, m, e, o)

The optimizer needs to know:

  • the current cardinality of items in each entity
  • the memory/CPU configuration, for estimating baseline time
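The optimizer's two decisions can be sketched as a size-based policy over the entity cardinalities it tracks. The policy shape follows the slides (approximate only medium clusters; compress only the power-law tail), but the specific thresholds are illustrative assumptions:

```python
def choose_plan(cluster_size, small=50, large=10000):
    """Per-proposal strategy from cluster cardinality: exact scoring
    for small and very large clusters, early stopping (approximate
    scoring) for medium ones, and compression only for the few very
    large clusters where its cost pays for itself."""
    approximate = small <= cluster_size < large   # medium: subsample the pairs
    compress = cluster_size >= large              # power-law tail: worth compressing
    return {"approximate": approximate, "compress": compress}
```

A classifier trained on observed sampling times could replace the fixed thresholds, which is the direction the "Current work" slide describes.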

slide-103
SLIDE 103

Summary

  • We motivated the need and discussed the open space for optimization of MCMC sampling methods.
  • We plan to use the newly released labeled TREC stream corpus.
  • Want to collaborate?
  • Let's talk if you want to do a Ph.D. at the University of Oklahoma!
SLIDE 104

Thank you!