Language and Statistics 11-761/11-661: Introduction, Objectives, Logistics



slide-1
SLIDE 1

Language and Statistics 11-761/11-661: Introduction, Objectives, Logistics
Statistical Language Modeling (SLM); Computational Linguistics (CL)

Bhiksha Raj

1 11-761

slide-2
SLIDE 2

11-761 2

slide-3
SLIDE 3

Language and Statistics

  • Iozmne pqmnzg habfbngyeydh shahmw
  • Language or not?

3 11-761

slide-4
SLIDE 4

Language and Statistics

  • Iozmne pqmnzg habfbngyeydh shahmw
  • Language or not?
  • pair none fair happy happy happy but but but brave brave

brave the the the the deserves

  • Language or not?
  • happy happy happy pair

none but the brave none but the brave none but the brave deserves the fair

  • Language or not?

4 11-761

slide-5
SLIDE 5

Language and Statistics

  • Iozmne pqmnzg habfbngyeydh shahmw
  • Language or not?
  • pair none fair happy happy happy but but but brave brave

brave the the the the deserves

  • Language or not?
  • happy happy happy pair

none but the brave none but the brave none but the brave deserves the fair

  • Language or not?

5 11-761

slide-6
SLIDE 6

Language and Statistics

  • Composed of mutually agreed upon units
  • pair happy none fair deserves but the
  • In a mutually agreed upon arrangement
  • happy happy happy pair

none but the brave none but the brave none but the brave deserves the fair

6 11-761

slide-7
SLIDE 7

The linguistic point of view

  • Language is the outcome of a complex process of lexical

semiosis to communicate information

  • Requiring conceptualization, planning, formation and delivery
  • Based on a set of implicitly agreed upon units and rules of

combination

  • Phonological, morphological and syntactic rules
  • Adequately conveying semantics requires following rules
  • Deep complex theories dating back to Plato..
  • Key point: absolutely not random!
  • Random gobbledygook doesn’t convey any useful meaning

7 11-761

slide-8
SLIDE 8

The linguistic point of view

  • Language is the outcome of a complex process of lexical

semiosis to communicate information

  • Requiring conceptualization, planning, formation and delivery
  • Based on a set of implicitly agreed upon units and rules of

combination

  • Phonological, morphological and syntactic rules
  • Adequately conveying semantics requires following rules
  • Deep complex theories dating back to Plato..
  • Key point: absolutely not random!
  • Random gobbledygook doesn’t convey any useful meaning

8 11-761

slide-9
SLIDE 9

“Mutually agreed upon”?

  • When a fox is in the bottle where the tweetle

beetles battle with their paddles in a puddle on a noodle-eating poodle, THIS is what they call…a tweetle beetle noodle poodle bottled paddled muddled duddled fuddled wuddled fox in socks, sir

9 11-761

slide-10
SLIDE 10

“Mutually agreed upon”?

  • When a fox is in the bottle where the tweetle

beetles battle with their paddles in a puddle on a noodle-eating poodle, THIS is what they call…a tweetle beetle noodle poodle bottled paddled muddled duddled fuddled wuddled fox in socks, sir

  • ’Twas brillig, and the slithy toves

Did gyre and gimble in the wabe: All mimsy were the borogoves, And the mome raths outgrabe.

10 11-761

slide-11
SLIDE 11

Rules?

  • It’s like déjà vu all over again.
  • We made too many wrong mistakes.
  • I never said most of the things I said.
  • The future ain’t what it used to be.
  • Bill Dickey is learning me his experience.
  • Who?
  • “How do you like them apples?”
  • "Soylent Green is people!“
  • "It's a fool looks for logic in the chambers of the human heart."
  • Slim said, "You hadda, George. I swear you hadda. Come on with me." He led

George into the entrance of the trail and up toward the highway. Curley and Carlson looked after them. And Carlson said, "Now what the hell ya suppose is eatin' them two guys?“

11 11-761

slide-12
SLIDE 12

Rules?

  • It’s like déjà vu all over again.
  • We made too many wrong mistakes.
  • I never said most of the things I said.
  • The future ain’t what it used to be.
  • Bill Dickey is learning me his experience.
  • Who?
  • “How do you like them apples?”
  • "Soylent Green is people!“
  • "It's a fool looks for logic in the chambers of the human heart."
  • Slim said, "You hadda, George. I swear you hadda. Come on with me." He led

George into the entrance of the trail and up toward the highway. Curley and Carlson looked after them. And Carlson said, "Now what the hell ya suppose is eatin' them two guys?“

12 11-761

slide-13
SLIDE 13

Rules?

  • It’s like déjà vu all over again.
  • We made too many wrong mistakes.
  • I never said most of the things I said.
  • The future ain’t what it used to be.
  • Bill Dickey is learning me his experience.
  • Who?
  • “How do you like them apples?”
  • "Soylent Green is people!“
  • "It's a fool looks for logic in the chambers of the human heart."
  • Slim said, "You hadda, George. I swear you hadda. Come on with me." He led

George into the entrance of the trail and up toward the highway. Curley and Carlson looked after them. And Carlson said, "Now what the hell ya suppose is eatin' them two guys?“

13 11-761

slide-14
SLIDE 14

Language and Statistics

  • Why are these understandable?
  • Bill Dickey is learning me his experience
  • It's a fool looks for logic
  • What’s eating them two guys
  • They are built on common usages
  • Statistically plausible
  • Statistical approach: The “acceptability” of a sequence of words is related to how frequently it is used
  • Or how statistically plausible it is

14 11-761

slide-15
SLIDE 15

Statistical approach

  • Based entirely on frequency of occurrence
  • “Acceptable” word sequences will occur
  • “Unacceptable” ones won’t
  • Actually – predicted frequency of occurrence
  • Not just counting from whatever we have already observed
  • Will require predicting probability of word sequences we

have never encountered

  • Some sequences we have never seen are nevertheless much more

likely to be expressed in a valid sentence than other sequences we have never seen

11-761 15
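A minimal sketch of this frequency-based view (Python, not from the original slides): a bigram model estimated from the toy word sequence used earlier, with add-one smoothing so that word pairs never seen in training still get a small nonzero probability. The corpus and scoring functions are illustrative assumptions.

    from collections import Counter

    # Toy corpus: the "none but the brave deserves the fair" example from the earlier slides.
    corpus = "none but the brave none but the brave none but the brave deserves the fair".split()

    vocab = sorted(set(corpus))
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def bigram_prob(w1, w2):
        """P(w2 | w1) with add-one (Laplace) smoothing, so unseen pairs get > 0 probability."""
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

    def sequence_prob(words):
        """Probability of a word sequence under the bigram model (ignoring sentence boundaries)."""
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            p *= bigram_prob(w1, w2)
        return p

    # An "acceptable" ordering scores far higher than a scrambled one built from the same words.
    print(sequence_prob("none but the brave deserves the fair".split()))
    print(sequence_prob("fair the deserves brave the but none".split()))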

slide-16
SLIDE 16

The problem with the statistical approach

char O, o[]; main(l) {for(;~l;O||puts(o)) O=(O[o]=~(l=getchar())?4<(4^l>>5)?l:46:0)?-~O & printf("%02x ",l)*5:!O;}

  • Will a statistical model know with certainty if the

above is valid code?

11-761 16

slide-17
SLIDE 17

The linguist’s objection

  • The statistical approach treats language as a random process
  • Language is not random
  • A blind statistical approach ignores agency:
  • Human (or animal) language treated no differently from other

patterned sequences of symbolic units

  • Is the sequence of sounds your car produces really language?
  • Language has an agent
  • Is generally the outcome of a deliberate act of communication
  • With an entire sequence of conceptualization, composition and

communication

  • Agents intend to communicate
  • The rules of language affect what unseen word sequences are likely

11-761 17

slide-18
SLIDE 18

In this course

  • We take the perspective that the statistical framework is

more appropriate

  • Can never explicitly catalogue all the rules of language
  • Particularly when they change all the time
  • Not utilizing the prescriptive theory of linguistics
  • Frequency/plausibility of usage is representative of the rules of the

language

  • Statistical characterization of language
  • Related to descriptive theory of linguistics
  • But the framework may be informed by linguistics or

linguistic intuition

  • Required in particular to predict occurrences/behaviors of

previously unseen patterns

11-761 18

slide-19
SLIDE 19

The fiction we maintain

  • Language comes from a probabilistic source..
  • Which randomly produces the text we see
  • We will focus on written language
  • We will concede agency
  • The source is trying to convey a message, not just to

produce text

  • But we will often ignore it(!)

11-761 19

slide-20
SLIDE 20

The fiction we maintain

  • To generate a text, the source randomly chooses a “hidden” message ℎ
  • The concept to be conveyed
  • It also randomly produces a “surface form” to convey the message ℎ
  • The accessible form
  • Words, sentences, paragraphs, documents..
  • We only get to observe the surface form
  • This is what we must work with
  • To try to decipher inner message ℎ
  • Or just to learn all about valid surface forms
  • Course objectives: Learn all about statistical mechanisms to achieve the

above..

11-761 20

slide-21
SLIDE 21

21

Course Goals

  • Teaching statistical foundation and techniques for language

technologies

  • Plugging gaping holes in LTI/CS grad student education in

probability, statistics and information theory.

  • “This course is about how to convert linguistic intuition and understanding of language into statistical models.”
  • “About how to develop statistically sound methodology, but informed by what we know of the domain of language.”

11-761

slide-22
SLIDE 22

22

Course philosophy

  • Socratic Method
  • Based on discussion and engagement
  • Participation strongly encouraged (please state your name)
  • Highly interactive
  • Highly adaptable
  • based on how fast we move
  • Lots of Probability, Statistics, Information theory
  • not in the abstract, but rather as the need arises
  • Lectures emphasize intuition, not rigor or detail
  • background reading will have rigor & detail
  • Will be done partially using slides, and partially on the board

11-761

slide-23
SLIDE 23

23

Course Prerequisites & Mechanics

  • You need to be able to program, from scratch.
  • Largest program is O(100) lines
  • You need to be comfortable with probabilities
  • Can you derive Bayes equation in your sleep?
  • 11-661 (master’s level): no final project
  • Hand in assignments via Blackboard
  • Vigorous enforcement of collaboration &

disclosure policy

11-761

slide-24
SLIDE 24

24

Background Material

No single book exists which covers the course material.

  • “Foundations of Statistical NLP”, Manning & Schütze
  • Computational Linguistics perspective
  • “Statistical Methods in Speech Recognition”, Jelinek
  • “Text Compression”, Bell, Cleary & Witten
  • first 4 chapters; rest is mostly text compression
  • “Probability and Statistics”, DeGroot
  • “All of Statistics” & “All of Nonparametric Statistics”, Wasserman
  • Lots of individual articles

11-761

slide-25
SLIDE 25

25

High Level Syllabus (subject to change)

  • Language Technology formalisms
  • source-channel formulation
  • Bayes classifier
  • Words, Words, Words
  • type vs. token, Zipf, Mandelbrot, heterogeneity of language
  • Modeling Word distributions - the unigram:
  • [estimators, ML, zero frequency, smoothing, shrinkage, G-T]
  • N-grams:
  • Deleted Interpolation Model, backoff, toolkit
  • Measuring Success: perplexity
  • Info theory [entropy, KL-div, MI], the entropy of English,

alternatives

11-761

slide-26
SLIDE 26

26

Syllabus (continued)

  • Clustering:
  • class-based N-grams, hierarchical clustering
  • hard and soft clustering
  • Latent Variable Models, EM
  • Hidden Markov Models, revisiting interpolated and class

n-grams

  • Part-Of-Speech tagging, Word Sense Disambiguation
  • Decision & Regression Trees
  • Particularly as applied to language
  • Stochastic Grammars
  • (SCFG, inside-outside alg., Link grammar)

11-761

slide-27
SLIDE 27

27

Syllabus (continued)

  • Maximum Entropy Modeling
  • exponential models, ME principle, feature induction...
  • Language Model Adaptation
  • caches, backoff
  • Dimensionality reduction
  • latent semantic analysis, word2vec
  • Syntactic Language Models

11-761

slide-28
SLIDE 28

Logistics

  • Several assignments:
  • ~ once a week
  • One exam at the three-quarters point of the term
  • And a project
  • Everyone has the same project definition, but you may

choose your approach

  • Ideally teams of four
  • Project presentations (~10 mins/team) at the end of the course
  • With truffles as the prize for the best project.

28 11-761

slide-29
SLIDE 29

Topics for the rest of today

  • Surface (s) and hidden (h) components of language
  • The p(s,h) function
  • Statistical language modeling: estimating p(s)
  • Distribution of words, sentences, documents
  • Computational linguistics / NLP: estimating p(h|s)
  • The source-channel model (aka, a Bayes classifier for everything)
  • SLM used as prior: speech, translation, spelling correction, OCR,...
  • SLM used as likelihood: document classification,...
  • [Probability: prior, posterior, Bayes' theorem, Bayes classifier]

29 11-761

slide-30
SLIDE 30

Our fiction

  • Language comes from a probabilistic source
  • That it has a distribution
  • Which may change with time
  • Language has a surface form
  • The accessible, non-controversial text form
  • E.g. words, sentences, paragraphs
  • We will assume text for this course
  • And under the surface is the hidden form
  • The deeper aspect of meaning
  • Not always apparent from the surface form

30 11-761

slide-31
SLIDE 31

Surface vs Hidden

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • The surface form is non-controversial
  • Just the words
  • What about the meaning?
  • Hard to decipher. This little snippet of text is

loaded with ambiguity.

11-761 31

slide-32
SLIDE 32

The hidden meaning

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • What are all the entities (people/objects/locations)

that are directly identifiable?

11-761 32

slide-33
SLIDE 33

The hidden meaning

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • What are all the entities (people/objects/locations)

that are directly identifiable?

11-761 33

[Entities highlighted on the slide: Mary, Jane]

slide-34
SLIDE 34

Ambiguities

Mary saw Jane standing by the bank with a telescope. She waved to her.

11-761 34

slide-35
SLIDE 35

Part of speech ambiguity

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Saw (noun)

OR

  • Saw (verb, to see)
  • Part of speech tagging
  • If the part of speech is wrongly identified, downstream

analysis will break badly

11-761 35
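For a concrete illustration of part-of-speech tagging (an assumed setup, not from the slides), the sketch below runs NLTK's off-the-shelf tagger on the example sentence; it presumes the 'punkt' and 'averaged_perceptron_tagger' resources can be downloaded.

    import nltk

    # One-time downloads (assumed available): tokenizer and POS tagger models.
    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "Mary saw Jane standing by the bank with a telescope."
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))
    # A correct tagging labels "saw" as a past-tense verb (VBD), not a noun (NN);
    # getting this wrong would derail any downstream parsing or interpretation.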

slide-36
SLIDE 36

Ambiguities

Mary saw Jane standing by the bank with a telescope. She waved to her.

11-761 36

slide-37
SLIDE 37

Word sense disambiguation

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Word sense disambiguation
  • Bank senses:
  • Financial institution
  • River bank
  • To tilt
  • To depend on something…

11-761 37

slide-38
SLIDE 38

Word senses

  • A “word sense” is one of the possible meanings of a

word

  • Some words can have 50 or more senses (Wikipedia quotes

“play” as an example)

  • Misinterpreting word sense will completely destroy

the intent of the sentence

  • WordNet: Lexical resource created by George Miller,

which characterizes word senses rather than words as the basic building block

11-761 38
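As a small hedged example (assuming NLTK with the WordNet corpus installed; not part of the slides), one can inspect the senses WordNet lists for “bank” and count the senses of “play”:

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)  # assumed one-time download

    # Each synset is one word sense, with a short gloss.
    for synset in wn.synsets("bank")[:4]:
        print(synset.name(), "-", synset.definition())

    # "play" is a famously polysemous word; WordNet records dozens of senses for it.
    print(len(wn.synsets("play")), "senses of 'play' in WordNet")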

slide-39
SLIDE 39

Ambiguities

Mary saw Jane standing by the bank with a telescope. She waved to her.

11-761 39

slide-40
SLIDE 40

Prepositional phrase attachment

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Did Mary see Jane through the telescope, or was

Jane standing with a telescope?

  • Are there other such examples here?

11-761 40

slide-41
SLIDE 41

Prepositional phrase attachment

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • “Saw by the bank” or “Standing by the bank”?

11-761 41

slide-42
SLIDE 42

More ambiguities

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Mary or Jane?

11-761 42

slide-43
SLIDE 43

More ambiguities

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Mary or Jane?
  • Coreference resolution..
  • Also pronoun resolution

11-761 43

slide-44
SLIDE 44

Coreference resolution

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Mary or Jane?
  • Coreference resolution..

11-761 44

slide-45
SLIDE 45

Coreference resolution

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • Who waved to whom?

11-761 45

slide-46
SLIDE 46

World knowledge affects interpretation

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • “She” and “her” cannot both refer to the same person..

11-761 46

slide-47
SLIDE 47

World knowledge affects interpretation

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • If they’re close enough to see each other when they

wave to one another, Mary probably doesn’t need a telescope to see Jane

  • => “with a telescope” probably attaches to

standing..

11-761 47

slide-48
SLIDE 48

Still more ambiguity

Mary saw Jane standing by the bank with a telescope. She waved to her.

  • If they’re close enough to see each other when they

wave to one another, Mary probably doesn’t need a telescope to see Jane

  • => “with a telescope” probably attaches to

standing..

  • But who is standing??

11-761 48

slide-49
SLIDE 49

Parsing Ambiguity

  • Several other valid parses
  • In general very hard to disambiguate

11-761 49

[Two parse trees of “Mary saw Jane standing with a telescope”, with “with a telescope” attached differently in each]

slide-50
SLIDE 50

Surface vs Hidden

  • The surface form is just the observed sequence of words
  • Analyzed by language modellers
  • But underneath is the hidden form with the fully annotated

richness of meaning(s)

  • Word senses
  • Various attachments
  • Coreferences
  • Part of speech
  • Named entity
  • Semantic attachments
  • Topic identification
  • Deriving these from the surface form is the task of computational

linguists

11-761 50

slide-51
SLIDE 51

Formalizing our fiction

  • The probabilistic source draws hidden messages h and surface forms s from a joint distribution P(h, s)
  • Note: a given meaning h may have multiple surface forms s, and vice versa
  • Knowing P(h, s) would solve nearly all problems in language technologies and computational linguistics
  • CL problem: Given a surface form s, make inferences about the hidden message h
  • Can be done if P(h, s) is known
  • LM problem: Determine the plausibility of surface forms
  • Know P(s)
  • Can be done if P(h, s) is known

51 11-761

slide-52
SLIDE 52

CL problem

  • Determine h from a given s
  • A good heuristic: find the most likely hidden value for the given surface form:

    h* = argmax_h P(h|s) = argmax_h P(h, s)

  • In theory, h includes all hidden information (complete annotation)
  • In practice, computational linguists will solve one problem at a time, and h will be defined accordingly
  • WSD: h represents word senses, s is a sentence
  • Topic recognition: h represents topics, s is a document

11-761 52

slide-53
SLIDE 53

LM Problem

  • Language modelling has to do with simply estimating the probability of surface phenomena, P(s)
  • Easily derived if P(h, s) is known:

    P(s) = Σ_h P(h, s)

  • In practice we will often try to learn P(s) directly and not bother with h or P(h, s)

11-761 53
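To make the two problems concrete, here is a tiny numerical sketch with an invented joint distribution P(h, s) over two senses of “bank” and two sentences (the numbers are illustrative assumptions, not course data): it recovers the LM quantity P(s) by marginalization and the CL quantity h* by argmax.

    # Toy joint distribution P(h, s): keys are (hidden sense, surface sentence).
    # The numbers are invented purely for illustration and sum to 1.
    P = {
        ("bank=financial", "she sat by the bank"): 0.10,
        ("bank=riverside", "she sat by the bank"): 0.30,
        ("bank=financial", "she robbed the bank"): 0.55,
        ("bank=riverside", "she robbed the bank"): 0.05,
    }

    def p_s(s):
        """LM problem: P(s) = sum over hidden h of P(h, s)."""
        return sum(p for (h, s2), p in P.items() if s2 == s)

    def best_h(s):
        """CL problem: h* = argmax_h P(h|s) = argmax_h P(h, s)."""
        return max((h for (h, s2) in P if s2 == s), key=lambda h: P[(h, s)])

    print(p_s("she sat by the bank"))     # 0.40
    print(best_h("she sat by the bank"))  # bank=riverside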

slide-54
SLIDE 54

Requirement for P(s): Automatic Speech Recognition

  • Input: An acoustic signal A
  • Output: Some piece of language w
  • Sequence of words
  • The ASR system maps A → w
  • How..
  • And why does this involve P(w)?

11-761 54

[Diagram: acoustic signal A → ASR Engine → word sequence w]

slide-55
SLIDE 55

The source-channel model

  • Model: The acoustic signal A actually used to be text w produced by the source
  • But the text w first passed through a “noisy” channel before being output to us
  • The noisy channel mangled and modified w so that it was converted to an acoustic signal A
  • From channel output A we must now work our way backward and figure out what went into the channel
  • Note the inversion of order of the process model from the original problem statement

11-761 55

[Diagram: source → w → Noisy Channel → A]

slide-56
SLIDE 56

The source-channel model for speech

  • For speech, this is not an absurd model
  • Reasonable to imagine that the sounds we produce

started off as sentences in our heads, which were then translated by our cognitive and speech-production processes into a speech signal

  • For other problems the model will be obviously silly
  • But we’ll use it anyway..

11-761 56

[Diagram: source → w → Noisy Channel → A]

slide-57
SLIDE 57

Automatic Speech Recognition

  • Given only A we must guess w
  • We will do it as:

    w* = argmax_w P(w|A)

  • Using Bayes rule to reverse dependencies:

    w* = argmax_w P(w) P(A|w)

  • The AM P(A|w) is much harder to estimate than the LM P(w).
  • Most of the effort in automatic speech recognition goes into the AM

57

[Diagram: source → w → Noisy Channel → A]

slide-58
SLIDE 58

Automatic Speech Recognition

  • Given only A we must guess w
  • We will do it as:

    w* = argmax_w P(w|A)

  • Using Bayes rule to reverse dependencies:

    w* = argmax_w P(w) P(A|w)

  • The AM P(A|w) is much harder to estimate than the LM P(w).
  • Most of the effort in automatic speech recognition goes into the AM

58

[Diagram: source → w → Noisy Channel → A. Language Model P(w): quantifies the plausibility of word sequences]

slide-59
SLIDE 59

Automatic Speech Recognition

  • Given only A we must guess w
  • We will do it as:

    w* = argmax_w P(w|A)

  • Using Bayes rule to reverse dependencies:

    w* = argmax_w P(w) P(A|w)

  • The AM P(A|w) is much harder to estimate than the LM P(w).
  • Most of the effort in automatic speech recognition goes into the AM

59

[Diagram: source → w → Noisy Channel → A. Language Model P(w): quantifies the plausibility of word sequences. Acoustic Model P(A|w): quantifies the degree of match between a candidate word sequence and the observed acoustics]

slide-60
SLIDE 60

Automatic Speech Recognition

  • Given only A we must guess w
  • We will do it as:

    w* = argmax_w P(w|A)

  • Using Bayes rule to reverse dependencies:

    w* = argmax_w P(w) P(A|w)

  • The AM P(A|w) is much harder to estimate than the LM P(w).
  • Most of the effort in automatic speech recognition goes into the AM

60

[Diagram: source → w → Noisy Channel → A. Language Model P(w): quantifies the plausibility of word sequences. Acoustic Model P(A|w): quantifies the degree of match between a candidate word sequence and the observed acoustics]

slide-61
SLIDE 61

Automatic Speech Recognition in practice

  • Given only A we must guess w
  • We will do it as:

    w* = argmax_w P(w|A)

  • Using Bayes rule to reverse dependencies:

    w* = argmax_w P̂(w) P̂(A|w)

  • We won’t have the actual probability distributions for language and acoustics. We must estimate them

61

[Diagram: source → w → Noisy Channel → A. Estimated Language Model P̂(w); Estimated Acoustic Model P̂(A|w)]
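A hedged sketch of how this decomposition is used at decoding time (the candidate list and both scoring tables below are invented stand-ins, not the course's models): rescore candidate transcriptions by adding an estimated LM log-probability to an estimated AM log-probability and take the argmax.

    import math

    # Hypothetical candidates an ASR decoder might propose for one utterance.
    candidates = ["recognize speech", "wreck a nice beach"]

    # Stand-ins for the estimated models; a real system would use an n-gram or
    # neural LM for lm_logprob and an acoustic model for am_logprob.
    lm_logprob = {"recognize speech": math.log(1e-5), "wreck a nice beach": math.log(1e-8)}
    am_logprob = {"recognize speech": math.log(0.02), "wreck a nice beach": math.log(0.03)}

    # w* = argmax_w P̂(w) P̂(A|w), computed as a sum of log-probabilities.
    best = max(candidates, key=lambda w: lm_logprob[w] + am_logprob[w])
    print(best)  # the language model outweighs the slightly better acoustic match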

slide-62
SLIDE 62

Another example: Machine Translation

  • Problem: given a sentence in one language (e.g.

English) come up with an equivalent in another (e.g. Spanish)

  • Input: Sentence in English (E)
  • Output: Translation into Spanish (S)
  • The MT system is an E → S converter

11-761 62

[Diagram: “four score and seven years ago” (E) → MT System → “Hace cuatro y siete años” (S)]

slide-63
SLIDE 63

Source channel model for MT

  • Spanish went into a noisy channel and came out as English
  • Again note inversion of process order from target inference
  • Best guess for Spanish

    S* = argmax_S P(S|E) = argmax_S P(S) P(E|S)

11-761 63

[Diagram: Spanish S → Noisy Channel → English E]

slide-64
SLIDE 64

Source channel model for MT

  • Spanish went into a noisy channel and came out as English
  • Again note inversion of process order from target inference
  • Best guess for Spanish

    S* = argmax_S P(S|E) = argmax_S P(S) P(E|S) ≈ argmax_S P̂(S) P̂(E|S)

11-761 64

[Diagram: Spanish S → Noisy Channel → English E]

slide-65
SLIDE 65

ASR and MT: The problem of search

    S* = argmax_S P̂(S) P̂(E|S)

  • The set of all possible sentences from which to select the most likely one is exponentially large
  • Exhaustive search is impossible
  • Heuristics must be applied to select the most promising set to evaluate
  • The problem of search

11-761 65

[Diagram: the MT and ASR noisy channels]
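One common heuristic is beam search: build hypotheses incrementally and keep only the top-k partial hypotheses at each step instead of scoring every complete sequence. The sketch below is illustrative only; the vocabulary and the stand-in conditional scores are assumptions, not the course's decoder.

    import math

    vocab = ["none", "but", "the", "brave", "deserves", "fair"]  # toy vocabulary

    def step_logprob(prev, word):
        # Stand-in for log P(word | prev): mildly favors the familiar ordering.
        good_pairs = {("none", "but"), ("but", "the"), ("the", "brave"),
                      ("brave", "deserves"), ("deserves", "the"), ("the", "fair")}
        return math.log(0.5) if (prev, word) in good_pairs else math.log(0.01)

    def beam_search(length=6, beam=3):
        # Each hypothesis is (log-probability, word list); start with every single word.
        hyps = [(0.0, [w]) for w in vocab]
        for _ in range(length - 1):
            expanded = [(lp + step_logprob(words[-1], w), words + [w])
                        for lp, words in hyps for w in vocab]
            # Prune: keep only the 'beam' best partial hypotheses.
            hyps = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam]
        return hyps[0]

    print(beam_search())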

slide-66
SLIDE 66

A simpler problem without search: Topic classification

  • Given a document, determine the topic it’s about
  • Sports/politics/finance..
  • Only a finite number of topics
  • Computationally feasible to search exhaustively over all topics to find the best guess for a document

11-761 66

[Diagram: example document (“There was ease in Casey’s manner as he stepped into his place; there was pride in Casey’s bearing and a smile lit Casey’s face. And when, responding to the cheers, he lightly doffed his hat, no stranger in the crowd could doubt ’twas Casey at the bat.”) → Topic Classifier → SPORTS]

slide-67
SLIDE 67

Source channel model

  • Source produces a topic t
  • Noisy channel converts a topic t into a full document D
  • Reasonable model: The channel is a writer who takes in a topic and produces a document
  • Guessing the topic from the document:

    t* = argmax_t P(t|D) = argmax_t P(t) P(D|t) ≈ argmax_t P̂(t) P̂(D|t)

67

[Diagram: topic (SPORTS) → Noisy Channel → the “Casey at the bat” document]

slide-68
SLIDE 68

Source channel model

  • Source produces a topic t
  • Noisy channel converts a topic t into a full document D
  • Reasonable model: The channel is a writer who takes in a topic and produces a document
  • Guessing the topic from the document:

    t* = argmax_t P(t|D) = argmax_t P(t) P(D|t) ≈ argmax_t P̂(t) P̂(D|t)

68

[Diagram: topic (SPORTS) → Noisy Channel → the “Casey at the bat” document. P̂(t): an estimated probability distribution over the topics of documents (a multinomial). P̂(D|t): a separate language model for every topic, e.g. P(D|sports), P(D|politics), etc.]
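A minimal sketch of this source-channel classifier (a Naive Bayes model with an invented toy training set; the add-one smoothing choice is an assumption, not the course's recipe): estimate P̂(t) from topic counts and P̂(D|t) as a per-topic unigram model, then pick t* = argmax_t P̂(t) P̂(D|t).

    import math
    from collections import Counter

    # Invented toy training data: (topic, document) pairs.
    train = [
        ("SPORTS",   "the crowd cheered as casey stepped up to bat"),
        ("SPORTS",   "the pitcher threw and the batter swung"),
        ("POLITICS", "the senate passed the bill after a long debate"),
        ("POLITICS", "voters went to the polls to elect a new senator"),
    ]

    topics = Counter(t for t, _ in train)            # for the prior P̂(t)
    words = {t: Counter() for t in topics}           # per-topic unigram counts
    for t, doc in train:
        words[t].update(doc.split())
    vocab = {w for c in words.values() for w in c}

    def log_posterior(t, doc):
        # log P̂(t) + sum_w log P̂(w|t), with add-one smoothing for unseen words.
        lp = math.log(topics[t] / sum(topics.values()))
        denom = sum(words[t].values()) + len(vocab)
        for w in doc.split():
            lp += math.log((words[t][w] + 1) / denom)
        return lp

    doc = "casey swung the bat and the crowd cheered"
    print(max(topics, key=lambda t: log_posterior(t, doc)))  # SPORTS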

slide-69
SLIDE 69

Source-channel model is Bayes classification

  • The generic source-channel model:

    X* = argmax_X P(X|Y) = argmax_X P(X) P(Y|X)

  • This is also the Bayes classifier
  • Also known as the Maximum A Posteriori classifier
  • This is the optimal classification rule when we have the true values of P(X|Y) (or alternately of P(X) and P(Y|X))
  • Optimal in that it guarantees the least expected misclassification
  • No guarantee of optimality if we use estimates for any of the distributions

11-761 69

[Diagram: source → X → Noisy Channel → Y]

slide-70
SLIDE 70

Source-channel model is Bayes classification

  • The generic source-channel model:

    X* = argmax_X P(X|Y) = argmax_X P(X) P(Y|X)

  • This is also the Bayes classifier
  • Also known as the Maximum A Posteriori classifier
  • This is the optimal classification rule when we have the true values of P(X|Y) (or alternately of P(X) and P(Y|X))
  • Optimal in that it guarantees the least expected misclassification
  • No guarantee of optimality if we use estimates for any of the distributions

11-761 70

[Diagram: source → X → Noisy Channel → Y. P(X): the a priori probability distribution of X (also called the prior); tells us about the natural biases in the data – which Xs are produced more preferentially by the source]

slide-71
SLIDE 71

Source-channel model is Bayes classification

  • The generic source-channel model:

    X* = argmax_X P(X|Y) = argmax_X P(X) P(Y|X)

  • This is also the Bayes classifier
  • Also known as the Maximum A Posteriori classifier
  • This is the optimal classification rule when we have the true values of P(X|Y) (or alternately of P(X) and P(Y|X))
  • Optimal in that it guarantees the least expected misclassification
  • No guarantee of optimality if we use estimates for any of the distributions

11-761 71

[Diagram: source → X → Noisy Channel → Y. P(X): the a priori probability distribution of X (the prior); tells us which Xs are produced more preferentially by the source. P(Y|X): the conditional probability of Y given X; gives us a measure of the “fit” of Y to a given X]

slide-72
SLIDE 72

Source-channel model is Bayes classification

  • The generic source-channel model:

    X* = argmax_X P(X|Y) = argmax_X P(X) P(Y|X)

  • This is also the Bayes classifier
  • Also known as the Maximum A Posteriori classifier
  • This is the optimal classification rule when we have the true values of P(X|Y) (or alternately of P(X) and P(Y|X))
  • Optimal in that it guarantees the least expected misclassification
  • No guarantee of optimality if we use estimates for any of the distributions

11-761 72

[Diagram: source → X → Noisy Channel → Y. P(X): the prior, which Xs the source prefers. P(Y|X): the fit of Y to a given X. The decomposition allows us to learn these two from two entirely different datasets]

slide-73
SLIDE 73

X* = argmax_X P(X) P(Y|X)   vs.   X* = argmax_X P(X|Y)

  • Advantages to the generative framework:
  • Sometimes the generative story just makes more sense
  • E.g. in ASR P(A|w) is more natural, since acoustic production of sounds is a well-studied problem
  • More generally, the decomposition separates out two sources of evidence
  • P(Y|X): Captures goodness of fit
  • E.g. captures different ways of translating English to Spanish
  • Requires expensive bilingual data to learn
  • P(X): Captures natural preference for some X over others
  • E.g. Some ways of saying things in Spanish are preferred over others
  • Can be trained separately from large amounts of monolingual data
  • Enables better usage of training data
  • P(Spanish) can be trained from large quantities of monolingual text
  • And adaptation
  • P(Spanish) may change over time
  • But P(English|Spanish) is relatively stable

11-761 73

slide-74
SLIDE 74

Sources of error in Bayes classification

    X* = argmax_X P(X) P(Y|X)

  • Intrinsic error of classifier (intrinsic confusability of classes)
  • The same Y can be produced from different Xs
  • But the Bayes classifier will always give you the same X for a given Y
  • The “Bayes” error
  • Cannot do better than this

11-761 74

[Diagram: source → X → Noisy Channel → Y]

slide-75
SLIDE 75

Sources of error in Bayes classification

    X* = argmax_X P̂(X) P(Y|X)

  • Error in estimating the a priori probability of X
  • Now the error will be greater than the Bayes error

11-761 75

[Diagram: source → X → Noisy Channel → Y]

slide-76
SLIDE 76

Sources of error in Bayes classification

    X* = argmax_X P̂(X) P̂(Y|X)

  • Error in estimating the a priori probability of X
  • Error in estimating the conditional probability of Y
  • Will result in errors exceeding the optimal Bayes error

11-761 76

[Diagram: source → X → Noisy Channel → Y]

slide-77
SLIDE 77

Sources of error in Bayes classification

    X* = argmax_X P̂(X) P̂(Y|X)

  • Error in estimating the a priori probability of X
  • Error in estimating the conditional probability of Y
  • Insufficient search
  • Even with true probabilities, only optimal if you search over all possible Xs to pick the best one

77

[Diagram: source → X → Noisy Channel → Y]

slide-78
SLIDE 78

Sources of error in Topic recognition

    t* ≈ argmax_t P̂(t) P̂(D|t)

  • Bayes error
  • Error in estimating the prior
  • Error in estimating the conditional probability of the document given the topic

11-761 78

[Diagram: topic (SPORTS) → Noisy Channel → the “Casey at the bat” document]

slide-79
SLIDE 79

Sources of error in MT / ASR

    S* = argmax_S P̂(S) P̂(E|S)

  • Bayes error
  • Error in estimating prior
  • Error in estimating conditional
  • Search error from incomplete evaluation of the space of S

79

[Diagram: the MT and ASR noisy channels]

slide-80
SLIDE 80

Source-channel model

  • Can be applied to any LT problem: OCR, Spelling

Correction, Question answering, IR..

  • Originally formalized for automatic speech

recognition (in the 80s)

  • Applied to MT in the late 80s/ early 90s.
  • Now used everywhere (often the first thing tried)..

11-761 80

slide-81
SLIDE 81

The uncertainty principle of language modelling

  • Sub-languages vary along many dimensions
  • If you combine data from multiple genres, you get

more data for the combined category

  • Better estimation of distributions
  • But you lose the distinction between genres
  • Loss of resolution
  • One of the primary challenges of LM: How to make

better use of data, without losing resolution..

  • “Uncertainty principle of language modelling”

11-761 81