Language and Stats 11-(7/6)61 Heterogeneity of language Types and tokens



slide-1
SLIDE 1

Language and Stats 11-(7/6)61 Heterogeneity of language Types and tokens

Bhiksha Raj

1 11-761

slide-2
SLIDE 2

The fiction we maintain

  • To generate a text, the source randomly chooses a “hidden” message h
  • The concept to be conveyed
  • It also randomly produces a “surface form” to convey the message h
  • The accessible form
  • Words, sentences, paragraphs, documents..
  • We only get to observe the surface form
  • This is what we must work with
  • To try to decipher the inner message h
  • Or just to learn all about valid surface forms
  • Course objectives: Learn all about statistical mechanisms to achieve the above..

11-761 2

slide-3
SLIDE 3

Story so far

  • Language has a hidden inner form and an observed surface form

  • Statistical model for language:
  • Generative stochastic process that produces it
  • Generates (h,s) pairs, but we only see s
  • CL: Make inferences about h from s
  • LM: Model s

11-761 3

slide-4
SLIDE 4

Story so far: Source channel model

  • Given an LT task that requires inference of the form:
  • X → Y
  • The source channel model hypothesizes the following model

[Figure: Y → Noisy Channel → X]

  • Inference is performed using Bayes Rule

$Y^* = \arg\max_{Y} P(Y)\, P(X \mid Y)$

  • Used in most LT (ASR, MT, Document classification, IR, QA…)

11-761 4
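A minimal sketch of source-channel inference, assuming the decoder picks the hidden Y that maximizes P(Y)·P(X|Y) for an observed X. The function name `decode`, the topic labels, and every probability below are hypothetical values invented only so the example runs; they are not from the course.

```python
def decode(observation, prior, channel):
    """Return the hidden value y that maximizes P(y) * P(observation | y)."""
    def score(y):
        # Unseen observations get probability 0 under this toy channel model.
        return prior[y] * channel[y].get(observation, 0.0)
    return max(prior, key=score)


# Hypothetical toy numbers, invented purely for illustration.
PRIOR = {"sports": 0.7, "finance": 0.3}      # P(Y)
CHANNEL = {                                  # P(X | Y)
    "sports": {"goal": 0.05, "stock": 0.001},
    "finance": {"goal": 0.005, "stock": 0.04},
}

for word in ("goal", "stock"):
    print(word, "->", decode(word, PRIOR, CHANNEL))
```

The prior alone would always answer "sports"; the channel term is what lets the observation override it.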

slide-5
SLIDE 5

Heterogeneity in language

  • Topic model assumption: P(S|T) is different for different topics

  • Language changes with topic!
  • What you say and how you say it depends on the topic
  • In fact, language is influenced by many factors
  • And we can predict those factors from language..
  • No such thing as a language
  • Rather, many recognizable sublanguages
  • Even if we call them all “English”

11-761 5

slide-6
SLIDE 6

Heterogeneity in language

  • Let’s play a little game:
  • I will show you snippets of text and ask you stuff about it..

  • You must explain your answers!

11-761 6

slide-17
SLIDE 17

Heterogeneity in language

  • Hey Yo Trump... Were you IN on the @NRA's ties to Russia? Russian cash to @NRA and @NRA cash to YOU ? WERE YOU IN ON IT ? Your move. #NRAGate #MariaButina #KremlinCashNRA

  • Tweet, email or text?
  • To an individual, or public address?
  • To a friend?
  • Friendly or combative?
  • Recent or old?
  • Guess the sender’s age
  • Guess the sender’s education
  • Is the sender interested in politics?
  • Where is the sender?
  • Sender’s gender?
  • …(What else can we say)?

11-761 17

slide-19
SLIDE 19

Heterogeneity in language

Said the pelican to the elephant
“I think we should marry, I do
’Cause there’s no name that rhymes with me,
And no one else rhymes with you.”
Said the elephant to the pelican,
“There’s sense to what you’ve said,
For rhyming’s as good a reason as any
For any two to wed.”
And so the elephant wed the pelican,
And they dined upon lemons and limes
And now they have a baby pelicant,
And everybody rhymes

  • What can you tell me about this?

11-761 19

slide-20
SLIDE 20

Heterogeneity

  • The Dover mail was in its usual genial position that the guard suspected the passengers, the passengers suspected one another and the guard, they all suspected everybody else, and the coachman was sure of nothing but the horses; as to which cattle he could with a clear conscience have taken his oath on the two Testaments that they were not fit for the journey.

  • News or fiction?
  • Time period of events described?
  • When was this written?
  • Nationality of author?
  • What else..

11-761 20

slide-26
SLIDE 26

Who said this?

  • “Sad”

11-761 26

slide-29
SLIDE 29

Heterogeneity in language

  • What are the various sources of heterogeneity in language?

  • How do they differ?
  • Homework coming up on this problem

11-761 29

slide-30
SLIDE 30

True or false

  • The Merriam-Webster dictionary has 470,000 words
  • The Merriam-Webster dictionary has over 50 million words

11-761 30

slide-32
SLIDE 32

Types vs. Tokens

  • Type: Uniquely identifiable value
  • Words in a lexicon (the left column of the dictionary)
  • Notes in music (how many)
  • Token: Instances of types
  • “Number of words in this article”
  • “12 notes to a bar”

11-761 32

slide-33
SLIDE 33

Word types vs. word tokens

  • How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood.
  • Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones
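As a rough illustration of the type/token distinction on these two snippets, the sketch below counts distinct word types and total word tokens. The tokenization (lowercasing, keeping only runs of letters) and the helper names are assumptions made here for the example; a different tokenizer would give different counts.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_types_and_tokens(text):
    """Return (number of distinct types, number of tokens)."""
    tokens = tokenize(text)
    return len(set(tokens)), len(tokens)


WOODCHUCK = (
    "How much wood would woodchuck chuck if woodchuck could chuck wood? "
    "As much wood as woodchuck could chuck woodchuck would chuck wood."
)
CAESAR = (
    "Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, "
    "not to praise him. The evil that men do lives after them; "
    "The good is oft interred with their bones"
)

for name, text in (("woodchuck", WOODCHUCK), ("caesar", CAESAR)):
    n_types, n_tokens = count_types_and_tokens(text)
    print(f"{name}: {n_types} types, {n_tokens} tokens")
```

Under this tokenization the tongue twister has 9 types in 22 tokens, while the Shakespeare passage has 30 types in 32 tokens.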

11-761 33

slide-34
SLIDE 34

Frequency of encountering new types

  • How frequently do we encounter a new type as we read the following texts:

How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood. Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones
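One way to make this question concrete is to track the running number of distinct types while reading token by token. This is a minimal sketch; the function name `type_token_curve` and the letters-only tokenization are assumptions for illustration.

```python
import re


def type_token_curve(text):
    """Return a list whose i-th entry is the number of distinct types
    seen after reading the first i + 1 tokens of the text."""
    seen = set()
    curve = []
    for token in re.findall(r"[a-z]+", text.lower()):
        seen.add(token)          # no effect if the type was already seen
        curve.append(len(seen))
    return curve


WOODCHUCK = (
    "How much wood would woodchuck chuck if woodchuck could chuck wood? "
    "As much wood as woodchuck could chuck woodchuck would chuck wood."
)

curve = type_token_curve(WOODCHUCK)
print(curve)
```

The curve rises by one whenever a new type appears and stays flat on a repeat, so it can never decrease.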

11-761 34

slide-57
SLIDE 57

Type-token curves

  • Typical type-token curve
  • Increases monotonically
  • Gets flatter all the time
  • But never gets completely flat
  • There are always new words you will encounter
  • Type-token curves will differ for different sublanguages

11-761 57

[Figure: type-token curve; axes: ntokens, ntypes]

slide-58
SLIDE 58

Comparing type-token curves for sublanguages

  • “Wall Street Journal” Corpus (WSJ):
  • Newspaper articles, 1988-1992
  • Written English, rich vocabulary (leaning towards finance)
  • “Switchboard” Corpus (SWB):
  • Transcribed spoken conversations over the telephone
  • Prescribed topic (one of 70)
  • 1990s
  • “Broadcast News” Corpus (BN):
  • Transcribed TV/Radio News programs
  • Spoken, but somewhat scripted
slide-59
SLIDE 59

Comparing type-token curves for sub-languages

11-761 59

slide-60
SLIDE 60

WSJ vs BN vs SWB (log scale)

60

Note: slope << 1

slide-61
SLIDE 61

Token-type curves: Bigrams

  • The number of bigram types is greater than the number of unigram types
  • The probability of hitting a “new” bigram type is higher
  • The curve is steeper, but flattens out after a few tens of millions of tokens
  • Distinctions between sub-languages are more stark
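The effect of n-gram order can be seen even on a toy text. The sketch below counts n-gram types and tokens for n = 1, 2, 3 on the woodchuck snippet used earlier in the lecture; the helper name and the tokenization (letters only, sentence boundaries ignored) are assumptions for illustration.

```python
import re


def ngram_types_and_tokens(text, n):
    """Return (distinct n-gram types, total n-gram tokens) for the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Slide a window of width n over the token sequence.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)), len(ngrams)


WOODCHUCK = (
    "How much wood would woodchuck chuck if woodchuck could chuck wood? "
    "As much wood as woodchuck could chuck woodchuck would chuck wood."
)

for n in (1, 2, 3):
    n_types, n_tokens = ngram_types_and_tokens(WOODCHUCK, n)
    print(f"n={n}: {n_types} types / {n_tokens} tokens = {n_types / n_tokens:.2f}")
```

Even here the type-to-token ratio climbs with n (9/22, then 16/21, then 19/20), which is the small-scale version of the steeper curves for higher-order n-grams.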

61

slide-62
SLIDE 62

Bigram Token Type Curve – BN vs. SWB (log scale)

Note: slope closer to 1.0 than for unigrams

slide-63
SLIDE 63

Token-type curves: Trigrams

  • The number of trigram types is greater than the number of bigram types
  • The curve is steeper than for bigrams
  • Will flatten out after hundreds of millions of tokens

11-761 63

slide-64
SLIDE 64

Trigram Token-Type Curve – BN vs. SWB (log scale)

Note: slope almost 1.0

slide-65
SLIDE 65

Head of word-frequency lists

  • WSJ vs BN vs Switchboard

11-761 65

Count unit: 1000

slide-66
SLIDE 66

Tail of word-frequency lists

  • WSJ vs BN vs Switchboard

11-761 66

Singletons (Count = 1)

slide-67
SLIDE 67

Sub-language Example 2

  • The Diabetes set includes 9 diabetes-related journals and a total of 4.5M tokens and 95K types.
  • The Veterinary science set includes 11 journals and 3.2M tokens and 87K types.
  • All journals were extracted from PubMed in Oct 2010, and they include everything that was available from those journals up until then.
  • This example is provided by Dana Movshovitz-Attias.
slide-68
SLIDE 68

Diabetes vs. Veterinary: Type-Token Curve

slide-69
SLIDE 69

Diabetes vs. Veterinary: Type-Token Curve (log scale)

slide-70
SLIDE 70

Head of Word Frequency List (counts per 1,000 tokens)

diabetes   count   veterinary   count
THE        42      THE          57
OF         35      OF           39
AND        31      AND          30
IN         29      IN           29
TO         16      TO           17
WITH       13      A            14
A          13      WERE         11
FOR        10      WAS          10
WAS        10      FOR          10
WERE       9       WITH         9
DIABETES   7       FROM         7
THAT       7       THAT         6
BY         6       IS           6
IS         6       AS           6
2          6       BY           6
AS         5       ON           5
INSULIN    5       AT           5
OR         5       1            4
GLUCOSE    5       BE           4
1          5       THIS         4

Count unit: 1000

slide-71
SLIDE 71

Tail of Word Frequency List: Count=1 (“Singletons”)

Diabetes              Veterinary
QUESTIONNAIRE-BASED   MOLARITIES
CAPACITY-CONSTRAINED  LIDOCAIN
DND                   MULTIORGAN
1003500               MICROGLIA-MEDIATED
ENZYME-INHIBITOR      NALYSIS
ALVEOLUS-CAPILLARY    10702
KUZUYA                BLUE-DNA
$6054                 HAIR-LOSS
SENTENCING            POPULATION-DYNAMICAL
PAPER-AND-PENCIL      STATE-TRANSITION

Singletons (Count = 1)

slide-72
SLIDE 72

An interesting feature

  • In every case, word frequency count goes down very fast
  • Let’s plot the relative frequency of words against their rank..

11-761 72

slide-73
SLIDE 73

P(word) vs rank

  • The probability of a word falls off rapidly with rank!
  • This is an instance of a more generic principle..

11-761 73

slide-75
SLIDE 75

A peculiar phenomenon..

  • There are many more rare things than there are common things!
  • This is true, not just of words..
  • In any sufficiently large collection..
  • The most frequent event is
  • ~2 times as frequent as the second most frequent event
  • ~3 times as frequent as the third most frequent event
  • ~4 times as frequent as the fourth most frequent event
  • ..
  • ~N times as frequent as the N-th most frequent event
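The pattern above corresponds to probabilities proportional to 1/rank. As a small sketch, assuming an idealized collection with a finite number of event types (the function name and the size 1000 are arbitrary choices for illustration):

```python
def zipf_probabilities(n_types):
    """Probabilities proportional to 1/rank for ranks 1..n_types."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    total = sum(weights)                 # normalizer so the values sum to 1
    return [w / total for w in weights]


probs = zipf_probabilities(1000)         # 1000 types is an arbitrary choice
for rank in (2, 3, 4, 10):
    ratio = probs[0] / probs[rank - 1]
    print(f"rank 1 is {ratio:.1f}x as frequent as rank {rank}")
```

In this idealized case the ratio between rank 1 and rank N is exactly N; real data only follows it approximately.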

11-761 75

slide-76
SLIDE 76

Typical behavior

  • Rank vs relative frequency

11-761 76

[Figure: frequency vs. rank for ranks 1 to 10; the frequency at rank r is 1/r times the frequency at rank 1]

slide-77
SLIDE 77

Typical behavior in Log domain

  • Rank vs relative frequency
  • In a log-log plot it’s just a line with slope approximately -1

11-761 77

slide-78
SLIDE 78

Examples: Population of cities

  • Caveat: Axes are flipped w.r.t. earlier figure
  • The most populous city is approx. twice as populous as the second most populous city and so on..

78

slide-79
SLIDE 79

Examples: AOL users vs. sites

  • AOL visitors to sites
  • http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
slide-80
SLIDE 80

Examples: Cryptocurrencies

  • https://steemit.com/steem/@akrid/applying-zipf-s-law-to-the-crypto-market

80

slide-81
SLIDE 81

Examples: UNK

  • I think this is an example
  • No clue what it is..

81

slide-82
SLIDE 82

And, of course, words..

  • Word counts in Wikipedias of 30 languages
  • (from Wikipedia)

11-761 82

slide-83
SLIDE 83

Zipf’s law

  • Define the probability of a word in terms of its rank
  • Zipf’s hypothesis

$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$

  • This is an empirical law

11-761 83

George Kingsley Zipf (1902-1950) Linguist

slide-84
SLIDE 84

Zipf’s law

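$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$

  • In a log-log plot the relationship is linear

$\log P(\text{word}) = C - \log(\text{rank}(\text{word}))$, where $C$ is the log of the proportionality constant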

  • The slope of the plot is -1.0.
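The slope claim can be checked numerically with an ordinary least-squares fit of log probability against log rank. This is a sketch on idealized probabilities proportional to 1/rank (not on a real corpus); the function name and the vocabulary size of 500 are assumptions for illustration.

```python
import math


def loglog_slope(probabilities):
    """Least-squares slope of log(probability) against log(rank),
    where probabilities[0] belongs to rank 1."""
    xs = [math.log(rank) for rank in range(1, len(probabilities) + 1)]
    ys = [math.log(p) for p in probabilities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Idealized Zipf probabilities over 500 ranks (500 is an arbitrary choice).
weights = [1.0 / rank for rank in range(1, 501)]
zipf = [w / sum(weights) for w in weights]

print(round(loglog_slope(zipf), 6))
```

Applying the same fit to ranked counts from a real corpus would give the empirical slope, which need not be exactly -1.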

11-761 84

[Figure: P(word) vs. rank(word), alongside log P(word) vs. log rank(word)]

slide-85
SLIDE 85

Frequency vs. rank, Brown Corpus

  • Brown Corpus (1969)
  • 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961

  • 15 text categories
  • Appears to match Zipf

85

slide-86
SLIDE 86

Is Zipf’s distribution valid?

  • Is the following a valid distribution?

$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$

  • Correction:

$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$

11-761 86

[Figure: P(word) vs. rank(word)]

slide-87
SLIDE 87

Adjustment

  • Is the following a valid distribution?

$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$

  • Only for finite vocabularies
  • Correction for infinite vocabularies:

$P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$
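The reason the plain 1/rank form only works for finite vocabularies is that its normalizer, the sum of 1/rank over all ranks, keeps growing without bound, while the sum of 1/rank^(1+ε) converges for any ε > 0. A small numerical sketch, with ε = 0.1 chosen arbitrarily for illustration:

```python
def partial_sum(exponent, n_terms):
    """Sum of 1 / rank**exponent for rank = 1..n_terms."""
    return sum(1.0 / rank ** exponent for rank in range(1, n_terms + 1))


EPSILON = 0.1  # arbitrary illustrative value; any epsilon > 0 converges

for n in (10**3, 10**4, 10**5):
    plain = partial_sum(1.0, n)
    corrected = partial_sum(1.0 + EPSILON, n)
    print(f"n={n}: exponent 1 -> {plain:.3f}, exponent 1+eps -> {corrected:.3f}")
```

With exponent 1 each tenfold increase in n adds about ln 10 ≈ 2.3 to the sum forever; with exponent 1 + ε the sum stays below 1 + 1/ε, so a valid normalizing constant exists.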

87

[Figure: P(word) vs. rank(word)]

slide-88
SLIDE 88

Word unigrams, WSJ corpus

  • Word frequency vs rank, WSJ corpus

11-761 88

slide-89
SLIDE 89

Word bigrams, WSJ corpus

  • Word bigram frequency vs rank, WSJ corpus

11-761 89

slide-90
SLIDE 90

Word trigrams, WSJ corpus

  • Word trigram frequency vs rank, WSJ corpus

11-761 90

slide-91
SLIDE 91

Word trigrams, WSJ corpus

  • Falls off rapidly with rank as predicted

11-761 91

slide-92
SLIDE 92

Zipf’s law

  • Zipf’s law is seen to apply to a large variety of natural phenomena
  • Zipf suggested it is the outcome of the “principle of least effort”
  • We, and nature, spend the least effort most of the time
  • Short, easy (or easy-to-recall) words are used most frequently. Longer, harder words are less frequent.
  • But a more mathematical explanation came from Benoit Mandelbrot and George Miller

11-761 92

slide-93
SLIDE 93

Zipf’s law: Monkey on a typewriter

  • A monkey on a keyboard will produce text that follows Zipf’s law
  • Shorter words are more likely than longer words. If the keyboard has only 26 characters + space:

$P(\text{length} = n) = \frac{26^n}{27^{n+1}}, \qquad P(\text{word} \mid \text{length} = n) = \frac{1}{26^n}$

  • Combining these will give you Zipf’s law (not quite; but we return to this)
  • Problem: Language is not a monkey on a typewriter
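A rough way to see the "not quite" is to work out the monkey model exactly rather than simulate it. The sketch assumes 27 equally likely keys, so one specific word of length n has probability 1/27^(n+1) (n given letters, then a space), and all words of a given length share the same probability. Ranking words by probability, ignoring the empty word, puts shorter words first. The function names are chosen here for illustration.

```python
import math

LETTERS = 26          # letter keys
KEYS = LETTERS + 1    # plus the space bar


def length_probability(n):
    """P(length = n): n letters followed by a space."""
    return (LETTERS / KEYS) ** n / KEYS


def word_probability(n):
    """Probability of one specific word of length n."""
    return (1.0 / KEYS) ** (n + 1)


def last_rank(n):
    """Rank of the last length-n word when all shorter words come first."""
    return sum(LETTERS ** k for k in range(1, n + 1))


def loglog_slope(n_a, n_b):
    """Slope of log P against log rank between the last words of two lengths."""
    dy = math.log(word_probability(n_b)) - math.log(word_probability(n_a))
    dx = math.log(last_rank(n_b)) - math.log(last_rank(n_a))
    return dy / dx


print(loglog_slope(5, 6))
print(-math.log(KEYS) / math.log(LETTERS))
```

The slope settles near -log 27 / log 26, slightly steeper than the -1 of pure Zipf, and the rank-frequency plot is a staircase rather than a straight line.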

93

slide-94
SLIDE 94

Don’t try this at home

In 2003, lecturers and students from the University of Plymouth MediaLab Arts course used a £2,000 grant from the Arts Council to study the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes crested macaques in Paignton Zoo in Devon in England for a month, with a radio link to broadcast the results on a website… Not only did the monkeys produce nothing but five total pages largely consisting of the letter S, the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it.

From Wikipedia

94

slide-95
SLIDE 95

The Pareto distribution

  • Real life phenomena are concentrated into a few frequent types, with a long tail of infrequent ones. 20% of the values account for 80% of the probability mass.
  • A great many real-world quantities approximately follow the Pareto distribution
  • The distribution of wealth (most money in the hands of a few)
  • The sizes of human settlements (few cities, many hamlets/villages)
  • File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)

  • Hard disk drive error rates
  • The values of oil reserves in oil fields (a few large fields, many small fields)
  • The sizes of sand particles

11-761 95

$p(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}$ for $x \ge x_m$ (Pareto type 1)

Discovered by Vilfredo Pareto (1848-1923)
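To connect the 80/20 statement to the Pareto type 1 density α·x_m^α / x^(α+1) for x ≥ x_m, the sketch below uses the standard result that, for shape α > 1, the top fraction q of values holds a share q^(1 − 1/α) of the total. It reads "80% of the mass" as 80% of the total quantity (for example, wealth), which is an interpretation assumed here; the function names are also chosen for illustration.

```python
import math


def pareto_pdf(x, alpha, x_m):
    """Pareto type 1 density: alpha * x_m**alpha / x**(alpha + 1) for x >= x_m."""
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)


def top_share(top_fraction, alpha):
    """Share of the total held by the largest `top_fraction` of values.
    Valid for alpha > 1 (otherwise the mean is infinite)."""
    return top_fraction ** (1.0 - 1.0 / alpha)


# Shape parameter for which the top 20% holds exactly 80%.
ALPHA_80_20 = math.log(5) / math.log(4)

print(f"alpha for 80/20: {ALPHA_80_20:.4f}")
print(f"top 20% share:   {top_share(0.2, ALPHA_80_20):.4f}")
print(f"top 1% share:    {top_share(0.01, ALPHA_80_20):.4f}")
```

The 80/20 split is therefore a statement about one particular shape, roughly α ≈ 1.16; other values of α give other splits.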

slide-96
SLIDE 96

Zipf’s law: The Pareto distribution

  • Matthew 13:12 For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.
  • When you use a word once, it’s easier to recall, so you are more likely to use it again
  • Over time, you will use familiar words more and more frequently and unfamiliar ones less and less
  • The same principle applies to other data

11-761 96

slide-97
SLIDE 97

Mandelbrot..

  • Benoit Mandelbrot (1924-2010)
  • Fractal theory, Chaos theory, Mandelbrot-Zipf law

11-761 97

slide-98
SLIDE 98

Mandelbrot-Zipf law

  • The monkey on the typewriter doesn’t actually produce Zipf’s distribution
  • Instead you get the Mandelbrot distribution

$P(w) = \frac{C}{(\rho + \text{rank}(w))^{B}}$

  • Zipf’s rule is a special case if $\rho = 0$ and $B = 1$

11-761 98

slide-99
SLIDE 99

Mandelbrot distribution

$P(w) = \frac{C}{(\rho + \text{rank}(w))^{B}}$

  • In logs

$\log P(w) = \log C - B \log(\rho + \text{rank}(w))$

  • No longer exactly a line (curves “down”)
  • $B$ changes the slope of the curve
  • $\rho$ changes the log transform itself
  • For low-rank $w$, it allows you to start from a relatively high log transform

  • Better fits most data
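A small sketch of the Mandelbrot form, writing the shift as `rho` and the exponent as `b` (symbol names chosen here), with probability proportional to 1 / (rho + rank)^b. It checks that rho = 0, b = 1 recovers the Zipf ratios, and compares the local log-log slope at the head and in the tail. The values rho = 5 and a vocabulary of 2000 types are arbitrary illustrations, not fitted to any corpus.

```python
import math


def mandelbrot_probabilities(n_types, rho, b):
    """Probabilities proportional to 1 / (rho + rank)**b for ranks 1..n_types."""
    weights = [1.0 / (rho + rank) ** b for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def local_slope(probs, rank):
    """Log-log slope between `rank` and `rank + 1` (ranks start at 1)."""
    dy = math.log(probs[rank]) - math.log(probs[rank - 1])
    dx = math.log(rank + 1) - math.log(rank)
    return dy / dx


zipf_like = mandelbrot_probabilities(2000, rho=0.0, b=1.0)
shifted = mandelbrot_probabilities(2000, rho=5.0, b=1.0)

print("rho=0: P(rank 1) / P(rank 2) =", zipf_like[0] / zipf_like[1])
print("rho=5: slope near rank 1    =", round(local_slope(shifted, 1), 3))
print("rho=5: slope near rank 1000 =", round(local_slope(shifted, 1000), 3))
```

With a positive shift the curve is nearly flat for the first few ranks and bends down toward slope -b in the tail, which is the "curves down" shape described above.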

11-761 99

slide-100
SLIDE 100

These curves (Unigram)..

  • Actually better fit the Mandelbrot distribution for different values of the shift $\rho$ and the exponent $B$

11-761 100

slide-101
SLIDE 101

These curves (Bigram)..

  • Actually better fit the Mandelbrot distribution for different values of the shift $\rho$ and the exponent $B$

11-761 101

slide-102
SLIDE 102

These curves (Trigram)..

  • Actually better fit the Mandelbrot distribution for different values of the shift $\rho$ and the exponent $B$

11-761 102

slide-104
SLIDE 104

The story of ETAOIN SHRDLU

  • And how it became ETAOIN SRHLDCU

norvig.com/mayzner.html

11-761 104