Robust Entity Clustering via Phylogenetic Inference Nicholas - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Robust Entity Clustering via Phylogenetic Inference

Nicholas Andrews

with Jason Eisner and Mark Dredze

slide-2
SLIDE 2

Did Taylor swift just dis harry sytles on the #grammys ? Lmao

Some data

slide-3
SLIDE 3

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys

Some data

slide-4
SLIDE 4

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs.

Some data

slide-5
SLIDE 5

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Taylor swift is apart of the Illuminati #grammys

Some data

slide-6
SLIDE 6

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. Taylor swift is apart of the Illuminati #grammys

Some data

slide-7
SLIDE 7

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. LL Cool J is looking realllll chizzled! Taylor swift is apart of the Illuminati #grammys

Some data

slide-8
SLIDE 8

Did Taylor swift(1) just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift(1) will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle (1) is on drugs. Lots of drugs. Ladies STILL love LL Cool James(2). LL Cool J(2) is looking realllll chizzled! Taylor swift(1) is apart of the Illuminati #grammys

(1) (2)

Entity clustering

slide-9
SLIDE 9

Key idea: “Directed” name variation

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. LL Cool J is looking realllll chizzled! Taylor swift is apart of the Illuminati #grammys Abbreviation

slide-10
SLIDE 10

“Directed” name variation

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. LL Cool J is looking realllll chizzled! Taylor swift is apart of the Illuminati #grammys Copy

slide-11
SLIDE 11

“Directed” name variation

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. LL Cool J is looking realllll chizzled! Taylor swift is apart of the Illuminati #grammys

  • izzle
slide-12
SLIDE 12

“Directed” name variation

Did Taylor swift just dis harry sytles on the #grammys ? Lmao Lets see how bad T Swift will be. #grammys Watching the Grammy’s - it’s clear that T-Swizzle is on drugs. Lots of drugs. Ladies STILL love LL Cool James. LL Cool J is looking realllll chizzled! Taylor swift is apart of the Illuminati #grammys Abbreviation

slide-13
SLIDE 13

“phylogeny”?

LL Cool James T-swift T-swizzle Taylor swift NEW Taylor swift LL Cool J

slide-14
SLIDE 14

“phylogeny”?

LL Cool James T-swift T-swizzle Taylor swift NEW Taylor swift LL Cool J

mutation

slide-15
SLIDE 15

“phylogeny”?

LL Cool James T-swift T-swizzle Taylor swift NEW Taylor swift LL Cool J

species

slide-16
SLIDE 16

A Generative Story

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati

slide-17
SLIDE 17

A Generative Story

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati

slide-18
SLIDE 18

Pick topics

Did [Taylor swift] just dis harry sytles (topics: 1 5 2 1 2 3)
Lets see how bad [T Swift] will be. #grammys (topics: 3 4 1 2 5 1 2 7)
it’s clear that [T-Swizzle] is on drugs (topics: 3 1 2 5 1 2 4)
Ladies STILL love [LL Cool James]. (topics: 1 2 3 10)
[LL Cool J] is looking realllll chizzled! (topics: 10 2 5 1 7)
[Taylor swift] is apart of the Illuminati (topics: 5 1 3 2 3 8)

slide-19
SLIDE 19

Pick words | topics

Did [Taylor swift] just dis harry sytles (topics: 1 5 2 1 2 3)
Lets see how bad [T Swift] will be. #grammys (topics: 3 4 1 2 5 1 2 7)
it’s clear that [T-Swizzle] is on drugs (topics: 3 1 2 5 1 2 4)
Ladies STILL love [LL Cool James]. (topics: 1 2 3 10)
[LL Cool J] is looking realllll chizzled! (topics: 10 2 5 1 7)
[Taylor swift] is apart of the Illuminati (topics: 5 1 3 2 3 8)

[ ] is a placeholder for an entity mention

slide-20
SLIDE 20

Talk about NEW entity or an old one?

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

?

slide-21
SLIDE 21

Talk about NEW entity or an old one?

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

slide-22
SLIDE 22

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

?

Talk about NEW entity or an old one?

slide-23
SLIDE 23

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

Talk about NEW entity or an old one?

slide-24
SLIDE 24

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

?

Talk about NEW entity or an old one?

slide-25
SLIDE 25

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 1 5 2 1 2 3 NEW 10 2 5 1 7

Talk about NEW entity or an old one?

slide-26
SLIDE 26

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 1 5 2 1 2 3 NEW 10 2 5 1 7

?

Talk about NEW entity or an old one?

slide-27
SLIDE 27

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 1 5 2 1 2 3 NEW 10 2 5 1 7

Talk about NEW entity or an old one?

slide-28
SLIDE 28

5 1 3 2 3 8

Selecting a parent

Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 1 2 3 10 NEW 10 2 5 1 7

slide-29
SLIDE 29

10 2 5 1 7 5 1 3 2 3 8

Selecting a parent

Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati NEW

Unknown parameters

1 2 3 10

slide-30
SLIDE 30

5 1 3 2 3 8

Selecting a parent

Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 1 2 3 10 NEW 10 2 5 1 7

slide-31
SLIDE 31

5 1 3 2 3 8

Selecting a parent

Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 1 2 3 10 NEW 10 2 5 1 7

Hyperparameter: controls # of entities
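The parent choice can be pictured as a draw from a weighted list in which NEW competes with every earlier mention. The sketch below is only an illustration under assumed weights: `alpha` stands in for the hyperparameter that controls the number of entities, and `same_topic_weight` / `other_weight` are hypothetical stand-ins for the unknown affinity parameters, not the talk's actual parameterization.

```python
import random

def parent_distribution(prev_topics, topic, alpha=1.0,
                        same_topic_weight=4.0, other_weight=1.0):
    """Return (candidates, probabilities) for the parent of a new mention.

    Candidate -1 stands for NEW; candidates 0..n-1 index earlier mentions.
    All weights are illustrative assumptions, not learned values.
    """
    candidates = [-1] + list(range(len(prev_topics)))
    weights = [alpha] + [
        same_topic_weight if t == topic else other_weight for t in prev_topics
    ]
    total = sum(weights)
    return candidates, [w / total for w in weights]

def sample_parent(prev_topics, topic, rng, **kwargs):
    """Draw one parent index (-1 means NEW) from the distribution above."""
    candidates, probs = parent_distribution(prev_topics, topic, **kwargs)
    return rng.choices(candidates, weights=probs)[0]
```

Raising `alpha` makes NEW likelier, so more entities are created; lowering it merges more mentions into existing entities.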

slide-32
SLIDE 32

Another view

[LL Cool James] [T-swift] [T-swizzle] [Taylor swift] NEW [Taylor swift] [LL Cool J]

Note: No names yet, that’s next…

slide-33
SLIDE 33

Generating a name

If parent = NEW: name new entity

[LL Cool James] NEW [Taylor swift]

slide-34
SLIDE 34

Generating a name

If parent = NEW: name new entity

We draw a name from a simple character LM with trainable parameters θ

[LL Cool James] NEW [Taylor swift]

(... room for fancier distributions over ∑* …)
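As a rough illustration of naming a new entity, the sketch below trains a character bigram model and uses it to score or sample names. The bigram order, the add-one smoothing, the boundary symbols, and every function name are assumptions made for this example; the slides only specify "a simple character LM".

```python
import math
import random
from collections import defaultdict

BOS, EOS = "^", "$"  # hypothetical boundary symbols, assumed absent from names

def train_char_bigram(names):
    """Count character bigrams, including start and end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    alphabet = {EOS}
    for name in names:
        chars = [BOS] + list(name) + [EOS]
        alphabet.update(name)
        for prev, cur in zip(chars, chars[1:]):
            counts[prev][cur] += 1
    return counts, sorted(alphabet)

def char_prob(counts, alphabet, prev, cur):
    """Add-one smoothed estimate of P(cur | prev)."""
    total = sum(counts[prev].values())
    return (counts[prev][cur] + 1) / (total + len(alphabet))

def name_log_prob(counts, alphabet, name):
    """Log-probability of a whole name, including the end marker."""
    chars = [BOS] + list(name) + [EOS]
    return sum(math.log(char_prob(counts, alphabet, p, c))
               for p, c in zip(chars, chars[1:]))

def sample_name(counts, alphabet, rng, max_len=30):
    """Sample characters until the end marker (or max_len) is reached."""
    prev, out = BOS, []
    while len(out) < max_len:
        weights = [char_prob(counts, alphabet, prev, c) for c in alphabet]
        cur = rng.choices(alphabet, weights=weights)[0]
        if cur == EOS:
            break
        out.append(cur)
        prev = cur
    return "".join(out)
```

Long names get low probability under such a model, which is what makes copying from a parent attractive for them.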

slide-35
SLIDE 35

Pick mention name | parent

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 1 5 2 1 2 3 NEW 3 1 2 5 1 2 4 3 4 1 2 5 1 2 7 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7

slide-36
SLIDE 36

Pick mention name | parent

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

slide-37
SLIDE 37

Pick mention name | parent

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

slide-38
SLIDE 38

Generating a name

If parent != NEW: copy (maybe mutate) the parent’s name

We draw the name from a stochastic contextual edit model (see Cotterell et al., 2014) with trainable parameters θ

[T-Swift] [Taylor swift]

[Figure: animation deriving T-Swift from the parent name Taylor swift one edit at a time. Edit labels: Copy, Del, Subst, Copy, Stop. ε marks a deleted character; the final aligned output is Tεεεεε-Swift.]

slide-39
SLIDE 39

Generating a name

If parent != NEW: copy (maybe mutate) the parent’s name

We draw the name from a stochastic contextual edit model (see Cotterell et al., 2014) with trainable parameters θ

θ gives the contextual probability of different character edits

  • copy, delete, substitute c, insert c

[T-Swift] [Taylor swift]
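To make the four edit types concrete, here is a deliberately simplified sketch that computes P(child name | parent name) by summing over all edit sequences with dynamic programming. Unlike the contextual model cited on the slide, it uses fixed, context-free edit probabilities; the probability values, `alphabet_size`, and the function name are assumptions for illustration only.

```python
def edit_prob(parent, child, alphabet_size=64,
              p_copy=0.85, p_del=0.05, p_sub=0.05, p_ins=0.05):
    """P(child | parent) under a context-free copy/delete/substitute/insert model.

    While parent characters remain, one of the four edits is chosen.
    After the parent is consumed, the model inserts with probability p_ins
    or stops with probability 1 - p_ins.
    """
    assert abs(p_copy + p_del + p_sub + p_ins - 1.0) < 1e-9
    n, m = len(parent), len(child)
    # f[i][j]: probability of consuming i parent chars and emitting j child chars
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            cur = f[i][j]
            if cur == 0.0:
                continue
            if j < m:  # insert child[j]
                f[i][j + 1] += cur * p_ins / alphabet_size
            if i < n:
                f[i + 1][j] += cur * p_del  # delete parent[i]
                if j < m:
                    if parent[i] == child[j]:  # copy
                        f[i + 1][j + 1] += cur * p_copy
                    else:  # substitute child[j] for parent[i]
                        f[i + 1][j + 1] += cur * p_sub / (alphabet_size - 1)
    return f[n][m] * (1.0 - p_ins)  # final stop decision
```

With these assumed values, an exact copy of "Taylor swift" scores highest, "T-Swift" scores far lower, and an unrelated name such as "LL Cool J" lower still.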

slide-40
SLIDE 40

Pick mention name | parent

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati 3 4 1 2 5 1 2 7 3 1 2 5 1 2 4 5 1 3 2 3 8 1 2 3 10 10 2 5 1 7 1 5 2 1 2 3 NEW

slide-44
SLIDE 44

End result

LL Cool James T-swift T-swizzle Taylor swift NEW Taylor swift LL Cool J

slide-45
SLIDE 45

End result

LL Cool James T-swift T-swizzle Taylor swift NEW Taylor swift LL Cool J

slide-46
SLIDE 46

Inference (fixed parameters)

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati

slide-47
SLIDE 47

Many possible phylogenies

Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. [LL Cool J] is looking realllll chizzled! [Taylor swift] is apart of the Illuminati NEW

slide-52
SLIDE 52

Many possible topic assignments

Did [Taylor swift] just dis harry sytles (topics: 1 5 2 1 2 3)
Lets see how bad [T Swift] will be. #grammys (topics: 3 4 1 2 5 1 2 7)
it’s clear that [T-Swizzle] is on drugs (topics: 3 1 2 5 1 2 4)
Ladies STILL love [LL Cool James]. (topics: 1 2 3 10)
[LL Cool J] is looking realllll chizzled! (topics: 10 2 5 1 7)
[Taylor swift] is apart of the Illuminati (topics: 5 1 3 2 3 8)

slide-53
SLIDE 53

Many possible topic assignments

Did [Taylor swift] just dis harry sytles (topics: 1 1 2 1 2 3)
Lets see how bad [T Swift] will be. #grammys (topics: 3 4 1 2 1 1 2 7)
it’s clear that [T-Swizzle] is on drugs (topics: 3 1 2 2 1 2 4)
Ladies STILL love [LL Cool James]. (topics: 1 2 3 5)
[LL Cool J] is looking realllll chizzled! (topics: 2 2 5 1 7)
[Taylor swift] is apart of the Illuminati (topics: 3 1 3 2 3 8)

slide-54
SLIDE 54

Many possible topic assignments

Did [Taylor swift] just dis harry sytles (topics: 1 3 2 1 2 3)
Lets see how bad [T Swift] will be. #grammys (topics: 3 4 1 2 3 1 2 7)
it’s clear that [T-Swizzle] is on drugs (topics: 3 1 2 3 1 2 4)
Ladies STILL love [LL Cool James]. (topics: 1 2 3 3)
[LL Cool J] is looking realllll chizzled! (topics: 3 2 5 1 7)
[Taylor swift] is apart of the Illuminati (topics: 3 1 3 2 3 8)

slide-55
SLIDE 55

Block Gibbs (fixed parameters)

LL Cool James, 1 T-swift, 1 T-swizzle, 1 Taylor swift, 1 NEW Taylor swift, 1 LL Cool J, 1

Sample phylogeny | topics

slide-56
SLIDE 56

Sample topics | phylogeny (We use tree-structured BP to construct proposals; see paper for details.)

Block Gibbs (fixed parameters)

LL Cool James, 2 T-swift, 1 T-swizzle, 1 Taylor swift, 1 NEW Taylor swift, 1 LL Cool J, 3

slide-57
SLIDE 57

Block Gibbs (fixed parameters)

LL Cool James, 2 T-swift, 1 T-swizzle, 1 Taylor swift, 1 NEW Taylor swift, 1 LL Cool J, 3

Sample phylogeny | topics

slide-58
SLIDE 58

Block Gibbs (fixed parameters)

LL Cool James, 2 T-swift, 1 T-swizzle, 1 Taylor swift, 1 NEW Taylor swift, 1 LL Cool J, 2

Sample topics | phylogeny
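The "Sample phylogeny | topics" half of the sampler can be sketched as follows, assuming a fixed mention order so that each mention may only pick NEW or an earlier mention as its parent. Under that assumption the parent choices are resampled one mention at a time. The callback `edge_weight` is hypothetical: it stands for whatever unnormalized score the model assigns to an edge (parent affinity times name probability), and is not part of the released code.

```python
import random

def sample_phylogeny(n_mentions, edge_weight, rng):
    """Sample a parent for every mention, given fixed topics and a fixed order.

    edge_weight(parent, child) returns an unnormalized non-negative weight;
    parent is None for NEW, otherwise the index of an earlier mention.
    Returns a list where entry i is the sampled parent of mention i.
    """
    parents = []
    for child in range(n_mentions):
        candidates = [None] + list(range(child))  # NEW or any earlier mention
        weights = [edge_weight(p, child) for p in candidates]
        parents.append(rng.choices(candidates, weights=weights)[0])
    return parents
```

The topic step would then resample topic assignments with this phylogeny held fixed, and the two steps alternate.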

slide-59
SLIDE 59

What is the sampler thinking about?

Taylor Swift, 1 T Swift, 5 T-swift, 1

Likelier mutation

(similar name)

?

Likelier source

(similar context)

slide-60
SLIDE 60

What is the sampler thinking about?

NEW John Jacob Jingleheimer Schmidt

?

Prefer to copy rather than generate from scratch

John Jacob Jingleheimer Schmidt

(short or common names would be plausible to generate twice)

(see paper for an improved mutation model that considers pragmatics)

slide-61
SLIDE 61

What do samples tell us?

LL Cool James T-swift T-Swizzle NEW Taylor swift Cool James LL Cool J

  • 1. Which names are copies of other names

(used for EM training of all parameters)

slide-62
SLIDE 62

What do samples tell us?

LL Cool James T-swift T-Swizzle NEW Taylor swift Cool James LL Cool J

  • 2. Which names corefer

(used for your IE task)
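Reading coreference off a sampled phylogeny amounts to grouping every mention with the NEW-rooted ancestor it descends from. A minimal sketch, assuming the phylogeny is stored as a list of parent indices with `None` for NEW (a representation chosen here for illustration):

```python
def clusters_from_parents(parents):
    """Group mentions by their root ancestor.

    parents[i] is the index of mention i's parent, or None if the parent is NEW.
    The parent pointers are assumed to be acyclic.
    """
    root_of = {}

    def find_root(i):
        path = []
        while parents[i] is not None and i not in root_of:
            path.append(i)
            i = parents[i]
        root = root_of.get(i, i)
        for node in path:  # cache the root for every node on the path
            root_of[node] = root
        root_of[i] = root
        return root

    clusters = {}
    for i in range(len(parents)):
        clusters.setdefault(find_root(i), []).append(i)
    return sorted(clusters.values())

# Mentions: 0 Taylor swift, 1 T Swift, 2 T-Swizzle,
#           3 LL Cool James, 4 LL Cool J, 5 Taylor swift
example = clusters_from_parents([None, 0, 1, None, 3, 0])
```

Here `example` groups mentions 0, 1, 2 and 5 into one entity and mentions 3 and 4 into another.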

slide-63
SLIDE 63

A complication

Lets see how bad [T Swift] will be. #grammys

Wait a second…

slide-64
SLIDE 64

Authors copy from multiple sources

Lets see how bad [T Swift] will be. #grammys [Taylor Swift]

slide-65
SLIDE 65

Authors copy from multiple sources

Lets see how bad [T Swift] will be. #grammys …… [Taylor Swift]

slide-66
SLIDE 66

Authors copy from multiple sources

Lets see how bad [T Swift] will be. #grammys [Taylor Swift]

slide-67
SLIDE 67

Bad ordering

Did [Taylor swift] just dis harry sytles it’s clear that [T-Swizzle] is on drugs Lets see how bad [T Swift] will be. #grammys NEW

slide-68
SLIDE 68

Bad ordering

Did [Taylor swift] just dis harry sytles it’s clear that [T-Swizzle] is on drugs Lets see how bad [T Swift] will be. #grammys [Taylor Swift] [T Swift]

slide-69
SLIDE 69

Solution: Treat order as unknown

(5) LL Cool James (2) T-swift (3) T-swizzle (4) Taylor swift NEW (1) Taylor swift (6) LL Cool J

slide-70
SLIDE 70

Updated Block Gibbs (fixed params)

(5) LL Cool James (2) T-swift (3) T-swizzle (4) Taylor swift NEW (1) Taylor swift (6) LL Cool J

Sample phylogeny | ordering, topics

slide-71
SLIDE 71

Sample ordering | phylogeny, topics (we use a proposal distribution and correct with MH; see paper for details)

Updated Block Gibbs (fixed params)

(1) LL Cool James (4) T-swift (5) T-swizzle (6) Taylor swift NEW (2) Taylor swift (3) LL Cool J

slide-72
SLIDE 72

Summary

So far we’ve seen:

  • The generative story
  • A sampler for posterior inference

Wrap up with:

  • Parameter estimation
  • MBR decoding
  • Experiments
slide-73
SLIDE 73

Parameter Estimation

Monte Carlo EM Repeat:

  • E-Step: Count edges in sample
  • M-Step: Take stochastic gradient step

○ Update* parent model parameters
○ Update* mutation model parameters

* We use AdaGrad which has an adaptive learning rate
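For reference, a single AdaGrad update looks roughly like the sketch below. It performs gradient ascent (as one would on an expected log-likelihood), keeps a per-parameter sum of squared gradients, and divides the step by its square root. The dictionary representation, the learning rate, and the parameter name used in the example are assumptions for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Apply one in-place AdaGrad ascent step.

    params, grads, accum: dicts keyed by parameter name.
    accum holds the running sum of squared gradients per parameter.
    """
    for key, g in grads.items():
        accum[key] = accum.get(key, 0.0) + g * g
        params[key] = params.get(key, 0.0) + lr * g / (math.sqrt(accum[key]) + eps)
    return params

# Two steps with the same gradient: the second step is smaller.
theta, history = {"copy": 0.0}, {}
adagrad_step(theta, {"copy": 2.0}, history)
first = theta["copy"]
adagrad_step(theta, {"copy": 2.0}, history)
second_step = theta["copy"] - first
```

Parameters that receive large or frequent gradients therefore take progressively smaller steps, which is the adaptive learning rate the slide refers to.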

slide-74
SLIDE 74

Consensus Clustering

To get a single “hard” clustering C for evaluation, we use minimum Bayes risk:

$$\operatorname{argmin}_{C} \; \mathbb{E}_{C'}\left[\operatorname{loss}(C, C')\right]$$

  • Minimize expected loss of C

○ With respect to C’ drawn from the model posterior
○ Estimate this using our samples of C’

(we approximate argmin using spectral clustering; see paper for details)
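A small sketch of the MBR idea, under two simplifying assumptions that differ from the talk: the loss is the number of mention pairs on which two clusterings disagree, and the argmin is restricted to the sampled clusterings themselves rather than approximated with spectral clustering.

```python
from itertools import combinations

def pair_loss(c1, c2):
    """Count mention pairs that one clustering links and the other separates.

    c1 and c2 are lists giving a cluster label for each mention.
    """
    return sum(
        (c1[i] == c1[j]) != (c2[i] == c2[j])
        for i, j in combinations(range(len(c1)), 2)
    )

def mbr_from_samples(samples):
    """Return the sampled clustering with the lowest average loss to all samples."""
    def expected_loss(candidate):
        return sum(pair_loss(candidate, other) for other in samples) / len(samples)
    return min(samples, key=expected_loss)

consensus = mbr_from_samples([[0, 0, 1], [0, 0, 1], [0, 1, 1]])
```

In this toy case the consensus is the clustering that two of the three samples agree on.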

slide-75
SLIDE 75

Summary

Unsupervised clustering procedure:

  • 1. Train model using Monte Carlo EM
  • 2. Sample from the posterior
  • 3. Pick the MBR clustering given posterior
slide-76
SLIDE 76

Evaluation: Twitter

Corpus of Twitter messages discussing the 2013 Grammy award ceremony (~5000 posts total, ~300 entities).

  • Procedure: 4-fold cross validation

○ Train: Tune weight of picking NEW as the parent to control precision / recall trade-off
○ Test: Run clustering procedure with this weight fixed

slide-77
SLIDE 77

Results: Twitter

System                B3 F1
Exact-Match           69.8

(averages over 4 folds)

slide-78
SLIDE 78

Results: Twitter

System                B3 F1
Exact-Match           69.8
Green et al. (2012)   79.3

(averages over 4 folds)

slide-79
SLIDE 79

Results: Twitter

System                B3 F1
Exact-Match           69.8
Green et al. (2012)   79.3
Phylo (no context)    88.7

(averages over 4 folds)

slide-80
SLIDE 80

Results: Twitter

System                B3 F1
Exact-Match           69.8
Green et al. (2012)   79.3
Phylo (no context)    88.7
Phylo + Topic         91.8

(averages over 4 folds)

slide-81
SLIDE 81

Results: Twitter

System                B3 F1
Exact-Match           69.8
Green et al. (2012)   79.3
Phylo (no context)    88.7
Phylo + Topic         91.8
Phylo + Topic + MBR   91.9

(averages over 4 folds)
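For readers unfamiliar with the metric, B3 (B-cubed) scores each mention by how well its predicted cluster overlaps its gold cluster, then averages over mentions. The sketch below follows the standard definition; the list-of-labels representation is an assumption for illustration, and the numbers in the example are toy values unrelated to the results above.

```python
def b_cubed_f1(pred, gold):
    """B-cubed F1 for two clusterings given as per-mention cluster labels."""
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        pred_cluster = {j for j in range(n) if pred[j] == pred[i]}
        gold_cluster = {j for j in range(n) if gold[j] == gold[i]}
        overlap = len(pred_cluster & gold_cluster)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(gold_cluster)
    precision /= n
    recall /= n
    return 2 * precision * recall / (precision + recall)

# Putting every mention in its own cluster gives perfect precision
# but only half recall when the gold clusters have two mentions each.
toy_score = b_cubed_f1([0, 1, 2, 3], [0, 0, 1, 1])
```

This is why an exact-match baseline can have high precision yet a low overall score: it never links name variants.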

slide-82
SLIDE 82

Results: ACE 2008

Corpus of news articles, mostly politics (~4000 mentions, ~2000 entities)

  • Procedure

○ Train: Tune weight of picking NEW as the parent to control precision / recall trade-off
○ Test: Run clustering procedure with this weight fixed

slide-83
SLIDE 83

Evaluation: ACE 2008

System                PER B3 F1   ORG B3 F1
Exact-Match           88.8        87.1

slide-84
SLIDE 84

Evaluation: ACE 2008

System                PER B3 F1   ORG B3 F1
Exact-Match           88.8        87.1
Green et al. (2012)   91.9        90.3

slide-85
SLIDE 85

Evaluation: ACE 2008

System                PER B3 F1   ORG B3 F1
Exact-Match           88.8        87.1
Green et al. (2012)   91.9        90.3
Phylo+Topic+MBR       92.7        87.6

slide-86
SLIDE 86

Thanks!

  • More experiments in the paper

○ Name canonicalization

  • Code will be released soon

○ https://bitbucket.org/noandrews/phyloinf

  • Future uses of model and code?

○ Track diffusion of memes through social media
○ Derivational morphology

slide-87
SLIDE 87

[Taylor swift] is apart of the Illuminati Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James].

Samples tell us which entities corefer

Useta love reading old [Tom Swift] books

More about this person?

slide-88
SLIDE 88

[Taylor swift] is apart of the Illuminati Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James].

Samples tell us which entities corefer

Useta love reading old [Tom Swift] books

More about this person?

slide-89
SLIDE 89

[Taylor swift] is apart of the Illuminati Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James].

Coref probabilities from many samples

Useta love reading old [Tom Swift] books

More about this person? .1 .8 .9 1.0 .8 0.0
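Coreference probabilities of this kind can be estimated by counting, across posterior samples, how often each mention lands in the same cluster as the query mention. A minimal sketch, assuming each sample is a list of cluster labels (one per mention); the sample values below are made up for illustration and are not the figures shown on the slide.

```python
def coref_probabilities(samples, query):
    """Estimate P(mention i corefers with mention `query`) from sampled clusterings."""
    n = len(samples[0])
    return [
        sum(s[i] == s[query] for s in samples) / len(samples)
        for i in range(n)
    ]

# Three hypothetical samples over three mentions, querying mention 0.
probs = coref_probabilities([[0, 0, 1], [0, 1, 1], [0, 0, 0]], query=0)
```

The query mention always corefers with itself, so its own entry is 1.0.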

slide-90
SLIDE 90

[Taylor swift] is apart of the Illuminati Did [Taylor swift] just dis harry sytles Lets see how bad [T Swift] will be. #grammys it’s clear that [T-Swizzle] is on drugs Ladies STILL love [LL Cool James]. Useta love reading old [Tom Swift] books

More about this person? .1 .8 .9 1.0 .8 0.0

.8 .1