

SLIDE 1

Perplexity on Reduced Corpora

— Analysis of Cutoff by Power Law

Hayato Kobayashi, Yahoo Japan Corporation

SLIDE 2

Cutoff

- Removing low-frequency words from a corpus
- Common practice to save computational costs in learning
  - Language modeling: needed even in a distributed environment, since the feature space of k-grams is quite large [Brants+ 2007]
  - Topic modeling: enough for roughly analyzing topics, since low-frequency words have a small impact on the statistics [Steyvers&Griffiths 2007]
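
Concretely, the following is a minimal sketch of the cutoff operation on a tokenized corpus; the corpus representation (a flat list of tokens) and the `min_count` threshold are illustrative assumptions, not the paper's setup.

```python
from collections import Counter

def cutoff(corpus, min_count):
    """Remove all word types whose frequency is below min_count.

    corpus: a flat list of tokens. Returns the reduced corpus and the
    vocabulary that survives the cutoff.
    """
    freq = Counter(corpus)
    kept = {w for w, c in freq.items() if c >= min_count}
    reduced = [w for w in corpus if w in kept]
    return reduced, kept

# With min_count=2, the singleton "hoc" is removed.
corpus = ["ad", "hoc", "ad", "lib", "ad", "lib"]
print(cutoff(corpus, min_count=2))
```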

SLIDE 3

Question

- How many low-frequency words can we remove while maintaining sufficient performance?
  - More generally, how much can we reduce a corpus/model using a certain strategy?
- Many experimental studies address this question
  - [Stolcke 1998], [Buchsbaum+ 1998], [Goodman&Gao 2000], [Gao&Zhang 2002], [Ha+ 2006], [Hirsimaki 2007], [Church+ 2007]
  - They discuss trade-off relationships between the size of the reduced corpus/model and its performance
- No theoretical study!

SLIDE 4

This work

- First study to address the question from a theoretical standpoint
- Derive the trade-off formulae of the cutoff strategy for k-gram models and topic models
  - Perplexity vs. reduced vocabulary size
- Verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on several real corpora

SLIDE 5

Approach

- Assume a corpus follows Zipf's law (a power law)
  - Empirical rule representing a long-tail property of a corpus
- Essentially the same approach as in physics
  - Constructing a theory while believing experimentally observed results (e.g., the gravitational acceleration g)

We can derive the landing point of a ball by believing g: a ball launched with speed v_0 at angle θ lands at distance v_0² sin(2θ) / g. Similarly, we try to clarify the trade-off relationships by believing Zipf's law.

SLIDE 6

Outline

- Preliminaries
  - Zipf's law
  - Perplexity (PP)
  - Cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion

SLIDE 7

Zipf's law

- Empirical rule discovered on real corpora [Zipf, 1935]
- Word frequency f(w) is inversely proportional to its frequency ranking r(w):

  f(w) = C / r(w)

[Figure: log-log graph of frequency f(w) against frequency ranking r(w) for a real corpus and a Zipf random corpus, with the maximum frequency marked. Real corpora roughly follow Zipf's law, i.e., the curves are linear on a log-log graph.]
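
For a rough feel of what a "Zipf random" corpus looks like, here is a small sketch that samples tokens with probability proportional to 1/rank and checks the slope on a log-log scale; the vocabulary and corpus sizes are arbitrary illustrative values.

```python
import numpy as np

# Sample a synthetic "Zipf random" corpus with p(rank r) proportional to 1/r,
# then check that the head of the frequency curve is linear on a log-log scale
# with slope close to -1.
rng = np.random.default_rng(0)
W, N = 10_000, 1_000_000
ranks = np.arange(1, W + 1)
p = (1.0 / ranks) / np.sum(1.0 / ranks)          # f(w) = C / r(w), normalized
corpus = rng.choice(W, size=N, p=p)
freq = np.sort(np.bincount(corpus, minlength=W))[::-1]

head = slice(0, 1_000)
slope, _ = np.polyfit(np.log(ranks[head]), np.log(freq[head]), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")   # ≈ 1
```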

SLIDE 8

Perplexity (PP)

- Widely used evaluation measure of statistical models
- Geometric mean of the inverse of the per-word likelihood on the held-out test corpus:

  PP = ( Π_{i=1..N} p(w_i) )^(-1/N)   (N: size of the test corpus)

- PP means how many possibilities one has for estimating the next word
- Lower perplexity means better generalization performance
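
A minimal sketch of this definition, assuming the model is a plain word-to-probability mapping that covers every test word.

```python
import math

def perplexity(model, test_corpus):
    """Perplexity = geometric mean of the inverse per-word likelihood.

    model: dict mapping each word to its probability; it must assign a
    nonzero probability to every word in test_corpus.
    """
    log_sum = sum(math.log(model[w]) for w in test_corpus)
    return math.exp(-log_sum / len(test_corpus))

# Sanity check: a uniform distribution over V words has perplexity V.
V = 8
uniform = {w: 1.0 / V for w in range(V)}
print(perplexity(uniform, list(range(V)) * 10))   # 8.0
```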

SLIDE 9

Cutoff

- Removing low-frequency words
- f(remaining word) ≥ f(removed word) holds

[Figure: frequency f(w) vs. ranking r(w); the reduced corpus keeps the head of the distribution. After learning probabilities from the reduced corpus, the probabilities of the removed words still need to be inferred.]

SLIDE 10

Constant restoring

- Infer the probability of the removed words as a constant
- Approximate the result learned from the original corpus

[Figure: inferred probability vs. probability ranking; the head of the distribution is learned from the reduced corpus, and every removed word is assigned the constant probability λ.]
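
One possible reading of constant restoring, as a sketch: learn a unigram model from the reduced corpus, assign the constant λ to every removed word, and rescale the kept words to keep the distribution normalized. The rescaling step is an assumption made for illustration, not necessarily the paper's exact formulation.

```python
from collections import Counter

def restored_unigram(reduced_corpus, full_vocab, lam):
    """Unigram model learned from the reduced corpus, with every removed word
    restored to the constant probability lam. The kept words are rescaled so
    that the distribution stays normalized (requires lam * #removed < 1)."""
    freq = Counter(reduced_corpus)
    n = sum(freq.values())
    removed = [w for w in full_vocab if w not in freq]
    kept_mass = 1.0 - lam * len(removed)
    model = {w: kept_mass * c / n for w, c in freq.items()}
    model.update({w: lam for w in removed})
    return model

# Example: "rare" was cut off but still receives the constant probability.
model = restored_unigram(["a", "a", "b"], full_vocab={"a", "b", "rare"}, lam=0.01)
print(model)  # {'a': 0.66, 'b': 0.33, 'rare': 0.01}
```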

SLIDE 11

Outline

- Preliminaries
  - Zipf's law
  - Perplexity (PP)
  - Cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion

SLIDE 12

Perplexity of unigram models

- Predictive distribution of unigram models:

  p̂(w) = f(w) / N

  (N: corpus size, N': reduced corpus size, W: vocabulary size, W': reduced vocabulary size)

- Optimal restoring constant
  - Obtained by minimizing PP with respect to the constant λ, after substituting the restored probability into PP
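
The paper derives the optimal constant analytically; the following numerical sketch only illustrates the same minimization by scanning λ on a synthetic Zipf corpus. All sizes, the reduction level, and the scan range are illustrative assumptions.

```python
import numpy as np

# For a fixed reduced vocabulary, scan over the restoring constant lambda and
# keep the value that minimizes perplexity on a held-out Zipf corpus.
rng = np.random.default_rng(1)
W, N = 5_000, 200_000
ranks = np.arange(1, W + 1)
p = (1.0 / ranks) / np.sum(1.0 / ranks)
train = rng.choice(W, size=N, p=p)
test = rng.choice(W, size=N, p=p)

W_reduced = 1_000                                  # keep the 1,000 most frequent words
freq = np.bincount(train, minlength=W).astype(float)
kept = np.argsort(-freq)[:W_reduced]

best = None
for lam in np.logspace(-8, -4, 30):
    prob = np.full(W, lam)                         # constant for every removed word
    prob[kept] = (1 - lam * (W - W_reduced)) * freq[kept] / freq[kept].sum()
    pp = np.exp(-np.mean(np.log(prob[test])))
    if best is None or pp < best[1]:
        best = (lam, pp)
print(f"best lambda ≈ {best[0]:.2e}, perplexity ≈ {best[1]:.1f}")
```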

SLIDE 13

Theorem (PP of unigram models)

- For any reduced vocabulary size W', the perplexity PP1 of the optimally restored distribution of a unigram model is calculated in closed form in terms of the harmonic series H(X) = Σ_{x=1..X} 1/x and a special form of the Bertrand series B(X) = Σ_{x=1..X} (ln x)/x

SLIDE 14

Approximation of PP of unigrams

- H(X) and B(X) can be approximated by definite integrals: H(X) ≈ ln X + γ (γ: Euler-Mascheroni constant) and B(X) ≈ (ln X)² / 2
- The resulting approximate formula for PP1 is quasi-polynomial (quadratic) in the logarithm of the reduced vocabulary size W'
  - It behaves as a quadratic function on a log-log graph
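
A quick numerical check of these integral approximations for the a = 1 case used by the unigram theorem, assuming H(X) = Σ 1/x and B(X) = Σ (ln x)/x as above.

```python
import math

def H(X):
    """Harmonic series H(X) = sum_{x=1..X} 1/x."""
    return sum(1.0 / x for x in range(1, X + 1))

def B(X):
    """Bertrand-type series B(X) = sum_{x=1..X} ln(x)/x."""
    return sum(math.log(x) / x for x in range(1, X + 1))

gamma = 0.5772156649015329                    # Euler-Mascheroni constant
for X in (10**3, 10**5):
    print(X, H(X), math.log(X) + gamma)       # H(X) ≈ ln X + γ
    print(X, B(X), math.log(X) ** 2 / 2)      # B(X) ≈ (ln X)² / 2
```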

SLIDE 15

PP of unigrams vs. reduced vocab. size

[Figure: log-log graph of unigram perplexity vs. reduced vocabulary size for the real corpus (Reuters), the theory, and a Zipf random corpus of the same size as Reuters. Maximum f(w): 234,705 for the Zipf random corpus, 136,371 for Reuters.]

Our theory is suited for inferring the growth rate of perplexity rather than the perplexity value itself.

SLIDE 16

Outline

- Preliminaries
  - Zipf's law
  - Perplexity (PP)
  - Cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion

SLIDE 17

Perplexity of k-gram models

- Simple model where k-grams are calculated from a random word sequence based on Zipf's law
- The model is "stupid"
  - The bigram "is is" is quite frequent: p("is is") = p("is") p("is")
  - The two bigrams "is a" and "a is" have the same frequency: p("is a") = p("is") p("a") = p("a is")
- A later experiment will show that the model can nevertheless roughly capture the behavior of real corpora
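
A small simulation of this "stupid" model, assuming words are drawn i.i.d. from a Zipf(1) distribution; the vocabulary and sequence sizes are arbitrary illustrative values.

```python
import numpy as np
from collections import Counter

# Words are drawn i.i.d. from a Zipf(1) distribution, so bigram probabilities
# factorize, e.g. p("is a") = p("is") p("a"), and "is a" / "a is" are about
# equally frequent.
rng = np.random.default_rng(2)
W, N = 1_000, 2_000_000
ranks = np.arange(1, W + 1)
p = (1.0 / ranks) / np.sum(1.0 / ranks)
seq = rng.choice(W, size=N, p=p)

bigrams = Counter(zip(seq[:-1], seq[1:]))
# Compare the empirical and factorized probabilities for the two most
# frequent word types (ids 0 and 1).
print(bigrams[(0, 1)] / (N - 1), p[0] * p[1])   # roughly equal
print(bigrams[(0, 1)], bigrams[(1, 0)])         # about the same count
```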

SLIDE 18

Frequency of a k-gram

- The frequency f_k of a k-gram w_k is defined through a decay function g_k
- The decay function g_2 of bigrams can be written explicitly
- The decay function g_k of general k-grams is defined through its inverse, using the Piltz divisor function, which represents the number of ways of writing n as a product of k factors (the number of divisors of n when k = 2)

SLIDE 19

Exponent of k-gram distributions

- Assume k-gram frequencies follow a power law
  - [Ha+ 2006] found that k-gram frequencies roughly follow a power law whose exponent π_k is smaller than 1 for k > 1
- Optimal exponent in our model based on this assumption
  - Obtained by minimizing the sum of squared errors between the inverse decay function g_k^(-1)(r) and r^(1/π_k) on a log-log graph
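
A sketch of the same kind of least-squares fit, assuming the exponent is read off as the negated slope of log-frequency against log-rank; this is a generic power-law fit, not the paper's exact estimator based on g_k^(-1).

```python
import numpy as np
from collections import Counter

def powerlaw_exponent(counts, head=10_000):
    """Estimate a power-law exponent by least squares: fit log-frequency
    against log-rank and return the negated slope."""
    freq = np.sort(np.fromiter(counts, dtype=float))[::-1][:head]
    ranks = np.arange(1, len(freq) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freq), 1)
    return -slope

# Example with an i.i.d. Zipf(1) word sequence, as in the previous sketch:
# the fitted bigram exponent comes out smaller than 1, in line with [Ha+ 2006].
rng = np.random.default_rng(3)
W, N = 1_000, 2_000_000
p = 1.0 / np.arange(1, W + 1)
p /= p.sum()
seq = rng.choice(W, size=N, p=p)
print(powerlaw_exponent(Counter(seq).values()))                     # ≈ 1 (unigrams)
print(powerlaw_exponent(Counter(zip(seq[:-1], seq[1:])).values()))  # < 1 (bigrams)
```
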
SLIDE 20

Exponent of k-grams vs. gram size

[Figure: exponent π_k vs. gram size k on a normal (linear) graph, comparing the real corpus (Reuters) with the theory.]

SLIDE 21

Corollary (PP of k-gram models)

- For any reduced vocabulary size W', the perplexity of the optimally restored distribution of a k-gram model is calculated in terms of the hyper-harmonic series

  H_a(X) := Σ_{x=1..X} 1 / x^a

  and another special form of the Bertrand series

  B_a(X) := Σ_{x=1..X} (ln x) / x^a

SLIDE 22

PP of k-grams vs. reduced vocab. size

[Figure: log-log graph of perplexity vs. reduced vocabulary size, comparing the unigram curve with Theory (Bigram), Theory (Trigram), Zipf (Bigram), and Zipf (Trigram). The deviation for higher orders is due to sparseness.]

We need to make assumptions that include backoff and smoothing for higher-order k-grams.

SLIDE 23

Additional properties by power-law

- Treat this as a variant of the coupon collector's problem
  - How many trials are needed to collect all coupons whose occurrence probabilities follow some stable distribution?
  - There exist several works on power-law distributions
- Corpus size needed for collecting all of the k-grams, according to [Boneh&Papanicolaou 1996]

  - When π_k = 1, the required corpus size is on the order of W ln² W; otherwise, it is on the order of W_k^(1/π_k)
- Lower and upper bounds on the number of k-grams in terms of the corpus size N and the vocabulary size W, according to [Atsonios+ 2011]
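
A small simulation of the coupon-collector view for the π = 1 case, comparing the number of draws needed to see every word type with the W ln² W order of growth; the vocabulary size and number of trials are illustrative assumptions.

```python
import math
import numpy as np

# Count how many draws from a Zipf(1) distribution over W word types are
# needed before every type has been seen at least once.
rng = np.random.default_rng(4)
W = 500
ranks = np.arange(1, W + 1)
p = (1.0 / ranks) / np.sum(1.0 / ranks)

def draws_to_collect_all():
    seen, draws = set(), 0
    while True:
        for w in rng.choice(W, size=10_000, p=p):   # draw in batches for speed
            draws += 1
            seen.add(int(w))
            if len(seen) == W:
                return draws

trials = [draws_to_collect_all() for _ in range(5)]
print(np.mean(trials), W * math.log(W) ** 2)        # same order of magnitude
```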

SLIDE 24

Outline

- Preliminaries
  - Zipf's law
  - Perplexity (PP)
  - Cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion

SLIDE 25

Perplexity of topic models

- Latent Dirichlet Allocation (LDA) [Blei+ 2003]
- Learning with Gibbs sampling
  - Obtain a "good" topic assignment z_i for each word w_i
- Posterior (point) estimates of the two hidden parameters [Griffiths&Steyvers 2004]:

  θ̂_z^(d) = (n_z^(d) + α) / (n^(d) + Tα)   (document-topic distribution: mixture rate of topic z in document d)

  ϕ̂_w^(z) = (n_w^(z) + β) / (n^(z) + Wβ)   (topic-word distribution: occurrence rate of word w in topic z)
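
A minimal sketch of these point estimates, assuming the count matrices n_z^(d) and n_w^(z) are already available from a Gibbs sampler; the toy counts and hyperparameters are made up for illustration.

```python
import numpy as np

def lda_point_estimates(n_dz, n_zw, alpha, beta):
    """Point estimates of theta (document-topic) and phi (topic-word) from the
    topic-assignment counts of a collapsed Gibbs sampler.

    n_dz[d, z]: number of words in document d assigned to topic z
    n_zw[z, w]: number of times word w is assigned to topic z
    """
    theta = (n_dz + alpha) / (n_dz.sum(axis=1, keepdims=True) + alpha * n_dz.shape[1])
    phi = (n_zw + beta) / (n_zw.sum(axis=1, keepdims=True) + beta * n_zw.shape[1])
    return theta, phi

# Toy counts: 2 documents, 3 topics, 5 word types.
n_dz = np.array([[10, 0, 2], [1, 8, 3]])
n_zw = np.array([[4, 3, 2, 1, 1], [0, 5, 1, 2, 0], [1, 1, 1, 1, 1]])
theta, phi = lda_point_estimates(n_dz, n_zw, alpha=0.1, beta=0.1)
print(theta.sum(axis=1), phi.sum(axis=1))  # each row sums to 1
```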

SLIDE 26

Rough assumptions of ϕ and θ

- Assumption on ϕ
  - The word distribution ϕ_z of each topic z follows Zipf's law (this is natural if each topic is regarded as a corpus)
- Assumptions on θ (two extreme cases)
  - Case All: each document evenly has all topics (θ_z^(d) = 1/T for every topic z)
  - Case One: each document has only one topic (drawn from a uniform distribution)
- In Case All, the PP of a topic model ≈ the PP of a unigram model
  - The marginal predictive distribution is then independent of the document d
- The curve of the actual perplexity is expected to lie between the values of these two cases
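
A tiny numerical check of the Case All argument (the marginal predictive distribution does not depend on the document), with randomly generated topic-word distributions as stand-ins for learned topics.

```python
import numpy as np

# If every document has the uniform mixture theta = 1/T, the marginal
# predictive distribution p(w | d) = sum_z theta_z^(d) * phi_z(w) collapses to
# (1/T) * sum_z phi_z(w), which no longer depends on the document d.
rng = np.random.default_rng(5)
T, W = 20, 1_000
phi = rng.dirichlet(np.full(W, 0.1), size=T)          # random topic-word distributions
theta_all = np.full((3, T), 1.0 / T)                  # Case All for 3 documents
p_w_given_d = theta_all @ phi                         # shape (3, W)
print(np.allclose(p_w_given_d[0], p_w_given_d[1]))    # True: independent of d
print(np.allclose(p_w_given_d[0], phi.mean(axis=0)))  # True: equals (1/T) * sum_z phi_z
```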

SLIDE 27

Theorem (PP of LDA models: Case One)

- For any reduced vocabulary size W', the perplexity of the optimally restored distribution of a topic model in Case One is calculated in closed form (T: number of topics in LDA)

SLIDE 28

PP of LDA models vs. reduced vocab. size

[Figure: log-log graph of LDA perplexity vs. reduced vocabulary size, comparing Theory (Case One), Theory (Case All), their average (Case One + Case All) / 2, a mixture of 20 Zipf corpora, and the real corpus (Reuters). Settings: T = 20, collapsed Gibbs sampling with 100 iterations, α = β = 0.1.]

SLIDE 29

Time, memory, and PP of LDA learning

- Results on the Reuters corpus
  - Memory usage of the (1/10)-corpus is only 60% of that of the original corpus
  - This helps in-memory computing for a larger corpus, although the computational time decreased only a little

SLIDE 30

Outline

- Preliminaries
  - Zipf's law
  - Perplexity (PP)
  - Cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion

SLIDE 31

Conclusion

- Derived trade-off formulae of the cutoff strategy for k-gram models and topic models based on Zipf's law
  - Perplexity vs. reduced vocabulary size
- Experiments on real corpora showed that the estimation of the perplexity growth rate is reasonable
- We can get the best cutoff parameter by maximizing the reduction rate while ensuring an acceptable (relative) perplexity (see the sketch below)
- There is a possibility that we can theoretically derive empirical parameters, or "rules of thumb", for different NLP problems

Can we derive other "rules of thumb" based on Zipf's law?
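
As a closing illustration of the cutoff-selection idea mentioned above, here is a sketch that picks the smallest reduced vocabulary whose perplexity stays within a tolerance of the full-vocabulary value; the tolerance and the example numbers are made up.

```python
def best_cutoff(vocab_sizes, perplexities, tolerance=0.05):
    """Return the smallest reduced vocabulary size whose perplexity stays
    within `tolerance` (relative) of the full-vocabulary perplexity."""
    full_pp = perplexities[vocab_sizes.index(max(vocab_sizes))]
    admissible = [w for w, pp in zip(vocab_sizes, perplexities)
                  if pp <= (1 + tolerance) * full_pp]
    return min(admissible) if admissible else max(vocab_sizes)

# Example with made-up numbers: 20,000 is the smallest size within 5% of 200.0.
sizes = [1_000, 5_000, 20_000, 100_000]
pps = [260.0, 215.0, 206.0, 200.0]
print(best_cutoff(sizes, pps))  # 20000
```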

SLIDE 32

Thank you