Lecture 8: Word Clustering
Kai-Wei Chang, CS @ University of Virginia
kw@kwchang.net
Course webpage: http://kwchang.net/teaching/NLP16
6501 Natural Language Processing
This lecture
- Brown Clustering
Brown Clustering
- Similar to a language model, but the basic unit is "word clusters"
- Intuition: again, similar words appear in similar contexts
- Recap: bigram language models
  P(x_0, x_1, x_2, ..., x_n) = P(x_1 | x_0) P(x_2 | x_1) ... P(x_n | x_{n-1})
                             = Π_{i=1}^{n} P(x_i | x_{i-1})
  x_0 is a dummy word representing the beginning of a sentence
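To make the bigram recap concrete, here is a minimal sketch with maximum-likelihood estimates; the function name, the `<s>` dummy token, and the count dictionaries are my own illustration, not from the slides:

```python
def bigram_prob(sentence, bigram_counts, unigram_counts):
    """P(x_1, ..., x_n) = prod_i P(x_i | x_{i-1}), where x_0 is a dummy
    begin-of-sentence token and P(x_i | x_{i-1}) is an MLE count ratio."""
    words = ["<s>"] + sentence          # "<s>" plays the role of x_0
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]
    return p
```

For example, with counts from a hypothetical two-sentence corpus "a dog" / "a cat", the sketch gives P("a dog") = P(a | x_0) P(dog | a) = 1.0 × 0.5 = 0.5.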
Motivation example
- "a dog is chasing a cat"
- P(x_0, "a", "dog", ..., "cat") = P("a" | x_0) P("dog" | "a") ... P("cat" | "a")
- Assume every word belongs to a cluster, e.g.:
  Cluster 3:  a, the
  Cluster 8:  chasing, following, biting, ...
  Cluster 46: dog, cat, fox, rabbit, bird, boy
  Cluster 64: is, was
Motivation example
- Assume every word belongs to a cluster
- "a dog is chasing a cat"
  a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
Motivation example
- Assume every word belongs to a cluster
- "the boy is following a rabbit"
  the/C3  boy/C46  is/C64  following/C8  a/C3  rabbit/C46
Motivation example
- Assume every word belongs to a cluster
- "a fox was chasing a bird"
  a/C3  fox/C46  was/C64  chasing/C8  a/C3  bird/C46
Brown Clustering
- Let C(x) denote the cluster that x belongs to
- "a dog is chasing a cat"
  a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
  with terms such as P(C(dog) | C(a)) and P(cat | C(cat))
Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    × P("a" | C("a")) P("dog" | C("dog")) ...
Brown clustering model
- P("a dog is chasing a cat")
  = P(C("a") | C_0) P(C("dog") | C("a")) P(C("is") | C("dog")) ...
    × P("a" | C("a")) P("dog" | C("dog")) ...
- In general,
  P(x_0, x_1, x_2, ..., x_n)
  = P(C(x_1) | C(x_0)) P(C(x_2) | C(x_1)) ... P(C(x_n) | C(x_{n-1}))
    × P(x_1 | C(x_1)) P(x_2 | C(x_2)) ... P(x_n | C(x_n))
  = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
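This factorization can be transcribed directly into code. A sketch, with hypothetical names: `C` maps words to cluster ids, `q` holds cluster-transition probabilities, `e` holds word-emission probabilities, and cluster id 0 stands in for C(x_0):

```python
def brown_sentence_prob(sentence, C, q, e):
    """P(x_1..x_n) = prod_i P(C(x_i) | C(x_{i-1})) * P(x_i | C(x_i))."""
    prev_c = 0                          # cluster of the dummy word x_0
    p = 1.0
    for w in sentence:
        c = C[w]
        p *= q[(prev_c, c)]             # cluster transition P(C(x_i) | C(x_{i-1}))
        p *= e[(w, c)]                  # word emission P(x_i | C(x_i))
        prev_c = c
    return p
```

Note how the chain only conditions on the previous cluster, never the previous word: all bigram information flows through the clustering.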
Model parameters
  P(x_0, x_1, ..., x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- Parameter set 1: cluster transition probabilities P(C(x_i) | C(x_{i-1}))
- Parameter set 2: word emission probabilities P(x_i | C(x_i))
- Parameter set 3: the clustering C(x_i)
  e.g., a/C3  dog/C46  is/C64  chasing/C8  a/C3  cat/C46
Model parameters
  P(x_0, x_1, ..., x_n) = Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
- A vocabulary set V
- A function C: V → {1, 2, 3, ..., k}: a partition of the vocabulary into k classes
- Conditional probability P(c' | c) for c, c' ∈ {1, ..., k}
- Conditional probability P(x | c) for c ∈ {1, ..., k}, x ∈ V
- θ represents the set of conditional probability parameters; C represents the clustering
Log likelihood
  LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
           = log Π_{i=1}^{n} P(C(x_i) | C(x_{i-1})) P(x_i | C(x_i))
           = Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
- Maximizing LL(θ, C) can be done by alternately updating θ and C:
  1. max_{θ ∈ Θ} LL(θ, C)
  2. max_C LL(θ, C)
max_{θ ∈ Θ} LL(θ, C)
  LL(θ, C) = log P(x_0, x_1, x_2, ..., x_n | θ, C)
           = Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
- With C fixed, the maximum-likelihood estimates are relative counts:
  P(c' | c) = #(c, c') / #(c)
  P(x | c) = #(x, c) / #(c)
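These count ratios can be computed in one pass over a clustered corpus. A sketch with hypothetical names (the corpus is a list of tokenized sentences, `C` a fixed word-to-cluster map):

```python
from collections import Counter

def estimate_params(corpus, C):
    """MLE for the two parameter sets, given a fixed clustering C:
    q(c' | c) = #(c, c') / #(c)   over adjacent cluster pairs,
    e(x | c)  = #(x, c) / #(c)    over word tokens."""
    emit, cluster_count, pair_count = Counter(), Counter(), Counter()
    for sentence in corpus:
        clusters = [C[w] for w in sentence]
        for w, c in zip(sentence, clusters):
            emit[(w, c)] += 1
            cluster_count[c] += 1
        pair_count.update(zip(clusters, clusters[1:]))
    q = {(c1, c2): n / cluster_count[c1] for (c1, c2), n in pair_count.items()}
    e = {(w, c): n / cluster_count[c] for (w, c), n in emit.items()}
    return q, e
```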
max_C LL(θ, C)
  max_C Σ_{i=1}^{n} [log P(C(x_i) | C(x_{i-1})) + log P(x_i | C(x_i))]
  = n Σ_{c=1}^{k} Σ_{c'=1}^{k} p(c, c') log [ p(c, c') / (p(c) p(c')) ] + G,
  where G is a constant
- Here, p(c, c') = #(c, c') / Σ_{d,d'} #(d, d') and p(c) = #(c) / Σ_d #(d)
- Σ_{c,c'} p(c, c') log [ p(c, c') / (p(c) p(c')) ] is the mutual information between adjacent clusters
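The mutual-information term can be evaluated directly from cluster-bigram counts. A sketch with hypothetical names; note that it uses separate left-position and right-position marginals for p(c) and p(c'), which is one common reading of the formula:

```python
import math
from collections import Counter

def clustering_quality(corpus, C):
    """Mutual information between adjacent clusters:
    sum_{c,c'} p(c, c') log[ p(c, c') / (p(c) p(c')) ]."""
    pair = Counter()
    for sentence in corpus:
        cs = [C[w] for w in sentence]
        pair.update(zip(cs, cs[1:]))
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    # (n/total) * log((n/total) / ((left/total) * (right/total)))
    # simplifies to (n/total) * log(n * total / (left * right))
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())
```

When every cluster bigram is the same (the clusters carry no information about their neighbors), the quantity is 0; it grows as clusters become more predictive of the next cluster.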
Algorithm 1
- Start with |V| clusters: each word is in its own cluster
- The goal is to get k clusters
- We run |V| − k merge steps:
  - Pick 2 clusters and merge them
  - Each step, pick the merge maximizing LL(θ, C)
- Cost: O(|V| − k) iterations × O(|V|^2) candidate pairs × O(|V|^2) to compute LL = O(|V|^5)
  (can be improved to O(|V|^3))
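Algorithm 1 can be sketched naively as follows, with hypothetical names; the merge score here is the mutual-information objective from the previous slides, recomputed from scratch for every candidate pair, which is exactly what makes the naive version O(|V|^5):

```python
import math
from collections import Counter
from itertools import combinations

def merge_score(corpus, C):
    # Objective for a clustering C: mutual information between adjacent clusters.
    pair = Counter()
    for s in corpus:
        cs = [C[w] for w in s]
        pair.update(zip(cs, cs[1:]))
    total = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in pair.items():
        left[c1] += n
        right[c2] += n
    return sum((n / total) * math.log(n * total / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())

def brown_cluster_naive(corpus, k):
    # Start with one cluster per word, then run |V| - k greedy merges,
    # each time trying every cluster pair and keeping the best-scoring merge.
    vocab = sorted({w for s in corpus for w in s})
    C = {w: i for i, w in enumerate(vocab)}
    while len(set(C.values())) > k:
        best = None
        for a, b in combinations(sorted(set(C.values())), 2):
            trial = {w: (a if c == b else c) for w, c in C.items()}
            score = merge_score(corpus, trial)
            if best is None or score > best[0]:
                best = (score, trial)
        C = best[1]
    return C
```

The O(|V|^3) improvement mentioned on the slide comes from maintaining the objective incrementally across merges instead of recomputing it per candidate pair.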
Algorithm 2
- m: a hyper-parameter; sort words by frequency
- Take the top m most frequent words and put each of them in its own cluster c_1, c_2, c_3, ..., c_m
- For i = m+1, ..., |V|:
  - Create a new cluster c_{m+1} for the i-th most frequent word (we now have m+1 clusters)
  - Choose two clusters from the m+1 clusters based on LL(θ, C) and merge them → back to m clusters
- Carry out m − 1 final merges → full hierarchy
- Running time: O(|V| m^2 + n), where n = #words in the corpus
Example clusters (Brown et al., 1992)
Example hierarchy (Miller et al., 2004)
Quiz 1
- 30 min (Tue. 9/20, 12:30pm-1:00pm)
- Fill-in-the-blank, true/false, short answer
- Closed book, closed notes, closed laptop
- Sample questions:
  - Add-one smoothing vs. add-lambda smoothing
  - Given a = (1, 3, 5) and b = (2, 3, 6), what is the cosine similarity between a and b?
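For the sample cosine-similarity question, the computation is a · b / (||a|| ||b||): the dot product is 1·2 + 3·3 + 5·6 = 41, ||a|| = √35, ||b|| = √49 = 7, so the similarity is 41 / (7√35) ≈ 0.990. A quick check with a hypothetical helper:

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```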