

SLIDE 1

CSE 6240: Web Search and Text Mining. Spring 2020

Language Models

  • Prof. Srijan Kumar, with Roshan Pati and Arindum Roy

SLIDE 2

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram and n-gram language model

  • Neural language models
SLIDE 3

Language Models: Objective

  • Key question: How well does a model represent the language?

– Character language model: Given alphabet vocabulary V, models the probability of generating strings in the language
– Word language model: Given word vocabulary V, models the probability of generating sentences in the language

SLIDE 4

Language Model: Applications

  • Assign a probability to sentences

– Machine translation:

  • P(high wind tonight) > P(large wind tonight)

– Spell correction:

  • The office is about fifteen minuets from my house
  • P(about fifteen minutes from) > P(about fifteen minuets from)

– Speech recognition:

  • P(I saw a van) >> P(eyes awe of an)

– Information retrieval: use words that you expect to find in matching documents as your query
– Many more: summarization, question-answering, and more

SLIDE 5

Language Models

  • What are language models?
  • Statistical language models
  • Neural language models
SLIDE 6

Language Model: Definition

  • Goal: Compute the probability of a sentence or sequence of words: P(s) = P(w1, w2, …, wn)
  • Related task: Probability of an upcoming word: P(w5 | w1, w2, w3, w4)

  • A model that computes either of these is a language model
  • How to compute the joint probability?

– Intuition: apply the chain rule

SLIDE 7

How To Compute Sentence Probability?

  • Given sentence s = t1 t2 t3 t4
  • Applying the chain rule under language model M:

P(s | M) = P(t1 | M) P(t2 | t1, M) P(t3 | t1, t2, M) P(t4 | t1, t2, t3, M)
SLIDE 8

Complexity of Language Models

  • The complexity of language models depends on the window of the word-word or character-character dependency they can handle

  • Common types are:

– Unigram language model
– Bigram language model
– N-gram language model

SLIDE 9

Unigram Model

  • Unigram language model only models the probability of each word according to the model

– Does NOT model word-word dependency
– The word order is irrelevant
– Akin to the “bag of words” model

SLIDE 10

Bigram Model

  • Bigram language model models the consecutive word dependency

– Does NOT model longer dependency
– Word order is relevant here

SLIDE 11

N-gram Model

  • N-gram language model models longer sequences of word dependency

– Most complex among all three

SLIDE 12

Unigram Language Model: Example

  • What is the probability of the sentence s under language model M?
  • Example:

s = “the man likes the woman”
P(s | M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01 = 0.00000008

Language Model M:

Word     Probability
the      0.2
a        0.1
man      0.01
woman    0.01
said     0.03
likes    0.02
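As an illustration (not from the slides), a minimal Python sketch of the same calculation; the word probabilities are the ones from model M above, and unigram_sentence_prob is a hypothetical helper.

```python
# Hypothetical sketch: probability of a sentence under a unigram language model.
unigram_probs = {"the": 0.2, "a": 0.1, "man": 0.01,
                 "woman": 0.01, "said": 0.03, "likes": 0.02}

def unigram_sentence_prob(sentence, probs):
    """P(s|M) is the product of the unigram probabilities of the words in s."""
    p = 1.0
    for word in sentence.split():
        p *= probs.get(word, 0.0)  # words outside the vocabulary get probability 0
    return p

print(unigram_sentence_prob("the man likes the woman", unigram_probs))
# ~8e-08, matching the 0.00000008 on the slide
```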

SLIDE 13

Comparing Language Models

  • Given two language models, how can we decide which language model is better?
  • Solution:

– Take a set S of sentences we desire to model
– For each language model:

  • Find the probability of each sentence
  • Average the probability scores

– The language model with the highest average probability is the best fit for the language

SLIDE 14

Comparing Language Models

  • s: “the man likes the woman”
  • M1: 0.2 x 0.01 x 0.02 x 0.2 x 0.01 ⇒ P(s|M1) = 0.00000008
  • M2: 0.1 x 0.1 x 0.01 x 0.1 x 0.1 ⇒ P(s|M2) = 0.000001
  • P(s|M2) > P(s|M1) ⇒ M2 is a better language model

Language Model M1          Language Model M2
Word     Probability       Word     Probability
the      0.2               the      0.1
a        0.1               a        0.02
man      0.01              man      0.1
woman    0.01              woman    0.1
said     0.03              said     0.02
likes    0.02              likes    0.01
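A minimal sketch of the comparison recipe above, assuming the two unigram tables and a small made-up sentence set S; in practice, average log-probability (or perplexity) is used instead to avoid numerical underflow.

```python
# Hypothetical sketch: pick the better of two unigram language models by the
# average probability they assign to a set S of sentences.
M1 = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}
M2 = {"the": 0.1, "a": 0.02, "man": 0.1, "woman": 0.1, "said": 0.02, "likes": 0.01}

def sentence_prob(sentence, model):
    p = 1.0
    for w in sentence.split():
        p *= model.get(w, 0.0)
    return p

def average_prob(sentences, model):
    return sum(sentence_prob(s, model) for s in sentences) / len(sentences)

S = ["the man likes the woman", "a woman said"]   # made-up evaluation set
better = max([("M1", M1), ("M2", M2)], key=lambda nm: average_prob(S, nm[1]))
print(better[0])   # "M2" for this toy set, consistent with the comparison above
```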

SLIDE 15

Estimating Probabilities

  • N-gram conditional probabilities can be estimated from the raw occurrence counts in the observed corpus
  • Unigram: P(wi) = c(wi) / N, where N is the total number of word tokens
  • Bigram: P(wi | wi-1) = c(wi-1, wi) / c(wi-1)
  • N-gram: P(wi | wi-n+1, …, wi-1) = c(wi-n+1, …, wi) / c(wi-n+1, …, wi-1)
SLIDE 16

Estimating Bigram Probabilities: Case Study

  • Corpus: Berkeley Restaurant Project sentences


SLIDE 17

Raw Bigram Counts: Case Study

  • Bigram matrix created from 9222 sentences
SLIDE 18

Raw Bigram Probabilities: Case Study

  • Unigram counts
  • Normalize by unigrams

P(want | i) = C(i, want)/C(i) = 827 / 2533 = 0.33
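A small sketch of the same maximum-likelihood estimate in Python; the toy corpus below is made up, since the Berkeley Restaurant Project data is not reproduced here.

```python
# Hypothetical sketch: estimate bigram probabilities by maximum likelihood
# from raw counts, P(w2 | w1) = C(w1, w2) / C(w1).
from collections import Counter

corpus = [
    "i want to eat chinese food",
    "i want english food",
    "tell me about chinese food",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob("i", "want"))  # 2/2 = 1.0 on this toy corpus
```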

SLIDE 19

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram, and n-gram language models

  • Neural language models
  • Language models for IR
SLIDE 20

Neural Language Models

  • So far, the language models have been statistics and counting based
  • Now, language models are created using neural networks/deep learning

  • Key question: how to model sequences?
SLIDE 21

Neural-based Bigram Language Model

Problem: Does not model sequential information (too local)

Each input word is represented as a 1-hot encoding

SLIDE 22

Sequences in Inputs or Outputs?

SLIDE 23

Sequences in Inputs or Outputs?

SLIDE 24

Key Conceptual Ideas

  • Parameter Sharing

– in computational graphs = adding gradients

  • “Unrolling”

– in computational graphs with parameter sharing

  • Parameter Sharing + “Unrolling”

– Allows modeling arbitrary length sequences!
– Keeps number of parameters in check

SLIDE 25

Recurrent Neural Network

SLIDE 26

Recurrent Neural Network

  • We can process a sequence of vectors x by applying a recurrence formula at every time step: ht = fW(ht-1, xt)
  • The same function fW is used at every time step and shared across all data
SLIDE 27

(Vanilla) Recurrent Neural Network

  • Vanilla RNN with learned weight matrices: ht = tanh(Whh ht-1 + Wxh xt), output yt = Why ht
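A minimal numpy sketch (not from the slides) of one vanilla RNN step under the recurrence above; the sizes and random weights are arbitrary placeholders.

```python
# Hypothetical sketch: one forward step of a vanilla RNN,
# h_t = tanh(W_hh @ h_prev + W_xh @ x_t), y_t = W_hy @ h_t.
import numpy as np

hidden_size, input_size, output_size = 8, 4, 4
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # new hidden state
    y_t = W_hy @ h_t                            # output (e.g., scores over a vocabulary)
    return h_t, y_t

# Process a toy sequence: the same weights are reused at every time step.
h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):
    h, y = rnn_step(h, x)
```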

SLIDE 28

RNN Computational Graph

Figure: input at time 1, input at time 2, input at time 3; initial hidden state; final hidden state

SLIDE 29

RNN Computational Graph

  • The same weight matrix W is shared across all time steps

Shared weight matrix

SLIDE 30

RNN Computational Graph: Many to Many

Figure: output at time 1, output at time 2, output at time 3; final output

  • Many-to-many architecture has one output per time step
SLIDE 31

RNN Computational Graph: Many to Many

Figure: loss at time 1, loss at time 2, loss at time 3; final loss

SLIDE 32

RNN Computational Graph: Many to Many

Total loss

SLIDE 33

RNN Computational Graph: Many to one

  • Many-to-one architecture has one final output
SLIDE 34

RNN Computational Graph: One to many

  • One-to-many architecture has one input and several outputs
SLIDE 35

Example: Character-level Language Model

  • Input: one hot representation of the characters
  • Vocabulary: [‘h’, ‘e’, ‘l’, ‘o’]
SLIDE 36

Example: Character-level Language Model

  • Transform every input into the hidden vector

SLIDE 37

Example: Character-level Language Model

  • Transform each hidden vector into an output vector

SLIDE 38

Example: Generating Output via Sampling

  • Generating output: sample from the vocabulary based on the normalized output layer
  • At test time, sample one character at a time and feed back into the RNN model at the next time step
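A rough sketch (not from the slides) of this test-time sampling loop over the ['h', 'e', 'l', 'o'] vocabulary from the earlier slide; the weights here are untrained and random, so the output is gibberish, but a trained model would use the same loop.

```python
# Hypothetical sketch: character-level sampling, feeding each sampled
# character back in as the next input.
import numpy as np

vocab = ['h', 'e', 'l', 'o']
hidden_size = 8
rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_xh = rng.normal(scale=0.1, size=(hidden_size, len(vocab)))
W_hy = rng.normal(scale=0.1, size=(len(vocab), hidden_size))

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def sample_text(first_char='h', length=10):
    h = np.zeros(hidden_size)
    idx = vocab.index(first_char)
    out = [first_char]
    for _ in range(length):
        x = one_hot(idx, len(vocab))
        h = np.tanh(W_hh @ h + W_xh @ x)
        scores = W_hy @ h
        probs = np.exp(scores) / np.exp(scores).sum()   # normalized output layer
        idx = rng.choice(len(vocab), p=probs)           # sample the next character
        out.append(vocab[idx])
    return ''.join(out)

print(sample_text())
```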

SLIDE 39

Calculating Loss: BackProp Through Time

SLIDE 40

Truncated BackProp Through Time
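As a rough sketch of the idea (assuming PyTorch, which the lecture does not mention): truncated backpropagation through time carries the hidden state forward across fixed-length chunks of the sequence, but detaches it between chunks so that gradients only flow within each chunk.

```python
# Hypothetical sketch: truncated BPTT over a long sequence in fixed-length chunks.
import torch
import torch.nn as nn

vocab_size, hidden_size, chunk_len = 4, 16, 25
rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Toy data: a long random sequence with next-step prediction targets.
seq_len = 200
targets = torch.randint(vocab_size, (1, seq_len))
inputs = torch.nn.functional.one_hot(targets, vocab_size).float()

h = None
for start in range(0, seq_len - 1, chunk_len):
    x = inputs[:, start:start + chunk_len]
    y = targets[:, start + 1:start + chunk_len + 1]
    out, h = rnn(x, h)
    loss = loss_fn(head(out[:, :y.shape[1]]).reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()       # backprop only within this chunk
    optimizer.step()
    h = h.detach()        # truncate: do not backprop into earlier chunks
```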

SLIDE 41

Truncated BackProp Through Time

SLIDE 42

Truncated BackProp Through Time

SLIDE 43

Example: Learning to Write Like Shakespeare

SLIDE 44

Example: Learning to Write Like Shakespeare

SLIDE 45

Example: Learning to Write Like Shakespeare

SLIDE 46

Example: Learning to Code

SLIDE 47

Example: Generating Rap Lyrics

SLIDE 48

Example: Movie generated by AI

SLIDE 49

Complex RNNs: Multilayer

  • Multilayer RNNs: create multiple hidden layers stacked on top of one another

SLIDE 50

Long Short-Term Memory (LSTM)

  • Problem with RNNs: they cannot model long sequences well

– Vanishing gradient problem: the gradient of the loss function decays exponentially with time

  • LSTM overcomes this issue
SLIDE 51

LSTM Architecture

SLIDE 52

Long Short-Term Memory (LSTM)

  • Cell State: long-term memory of the information
SLIDE 53

LSTM Intuition: Forget Gate

  • Forget gate: should we remember the past information?
SLIDE 54

LSTM Intuition: Input Gate

  • Input gate: should we update the memory using the new information bit? If so, by how much?

SLIDE 55

LSTM Intuition: Memory Update

  • Forget what needs to be forgotten, update what needs to be updated

SLIDE 56

LSTM Intuition: Output Gate

  • Output gate: should we output this bit of information, e.g., to deeper LSTM layers?
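Putting the four gates together, a minimal numpy sketch (not from the slides) of one LSTM cell step; the single stacked weight matrix is one illustrative layout, not the lecture's notation.

```python
# Hypothetical sketch: one LSTM cell step with forget, input, and output gates.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenated [h_prev, x_t] to the 4 gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0:H])          # forget gate: keep past memory?
    i = sigmoid(z[H:2*H])        # input gate: admit new information?
    g = np.tanh(z[2*H:3*H])      # candidate memory content
    o = sigmoid(z[3*H:4*H])      # output gate: expose memory to the output?
    c_t = f * c_prev + i * g     # additive memory update
    h_t = o * np.tanh(c_t)       # hidden output
    return h_t, c_t

# Toy dimensions and random weights, just to show the shapes involved.
H, D = 8, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, b)
```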

SLIDE 57

LSTM Intuition: Additive Updates

  • Backpropagation from ct to ct-1 requires only elementwise multiplication by f, no matrix multiply by W

SLIDE 58

LSTM Intuition: Additive Updates

SLIDE 59

Gated Recurrent Unit (GRU)

  • Simpler than LSTM
  • No separate memory unit, memory = hidden output
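For comparison, a minimal sketch (not from the slides) of one GRU step; the gate names and weight layout follow one common convention.

```python
# Hypothetical sketch: one GRU cell step. Unlike the LSTM, there is no
# separate cell state; the hidden output itself is the memory.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh):
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                        # update gate
    r = sigmoid(Wr @ hx)                                        # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # interpolated memory/output

H, D = 8, 4
rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(H, H + D)) for _ in range(3))
h = gru_step(rng.normal(size=D), np.zeros(H), Wz, Wr, Wh)
```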
SLIDE 60

Language Models

  • What are language models?
  • Statistical language models

– Unigram, bigram, and n-gram language models

  • Neural language models
  • Language models for IR

Parts of this lecture are inspired by ChengXiang Zhai

SLIDE 61

Probability of Relevance

  • Three random variables: Query Q, Document D, Relevance R ∈ {0, 1}
  • Key question: what is the probability that THIS document is relevant to THIS query?
  • Goal: Given a particular query q and a particular document d, compute p(R=1 | Q=q, D=d)

– Then, rank D based on P(R=1 | Q, D)

  • Solution: Language modeling based
SLIDE 62

Defining P(R=1|Q,D): Generative models

  • Basic idea

– Define P(Q,D|R)
– Compute O(R=1|Q,D) using Bayes’ rule

  • Types of models

– Query “generation” model: P(Q,D|R) = P(Q|D,R) P(D|R)
– Document “generation” model: P(Q,D|R) = P(D|Q,R) P(Q|R)

$$O(R=1 \mid Q,D) = \frac{P(R=1 \mid Q,D)}{P(R=0 \mid Q,D)} = \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} \cdot \underbrace{\frac{P(R=1)}{P(R=0)}}_{\text{ignored for ranking } D}$$

SLIDE 63

Query Generation: A Language Model for IR

$$O(R=1 \mid Q,D) \propto \frac{P(Q,D \mid R=1)}{P(Q,D \mid R=0)} = \frac{P(Q \mid D,R=1)\, P(D \mid R=1)}{P(Q \mid D,R=0)\, P(D \mid R=0)} \propto \underbrace{P(Q \mid D,R=1)}_{\text{query likelihood}} \cdot \underbrace{\frac{P(D \mid R=1)}{P(D \mid R=0)}}_{\text{document prior}} \qquad \text{(assume } P(Q \mid D,R=0) \approx P(Q \mid R=0)\text{)}$$

  • Assuming a uniform document prior, we have

$$O(R=1 \mid Q,D) \propto P(Q \mid D, R=1)$$

  • P(Q|D, R=1) = probability that a user who likes D would pose query Q
  • How to estimate it?

SLIDE 64

Estimating Probabilities

  • How to compute P(Q|D, R=1)?
  • The Basic LM Approach, by Ponte & Croft 1998
  • Generally involves two steps:
  • 1. Estimate a language model based on D
  • 2. Compute the query likelihood according to the estimated model
SLIDE 65

Ranking Docs by Query Likelihood

Documents d1, d2, …, dN and query q
Doc LMs: θd1, θd2, …, θdN
Query likelihoods: p(q | θd1), p(q | θd2), …, p(q | θdN)

Step 1: Given a document, generate a language model
Step 2: Compute the query likelihood from the document language model

SLIDE 66

Example

Document               Language Model
Text mining paper      text ?, mining ?, association ?, clustering ?, …, food ?, …
Food nutrition paper   food ?, nutrition ?, healthy ?, diet ?, …

Query = “data mining algorithms”

Which model would most likely have generated this query?
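A small Python sketch of this example; the document texts and resulting counts are made up, since the slide leaves the word probabilities as “?”.

```python
# Hypothetical sketch: estimate each document's unigram LM by maximum
# likelihood and look at the per-word query probabilities.
from collections import Counter

docs = {
    "text mining paper": "text mining association clustering text mining food",
    "food nutrition paper": "food nutrition healthy diet food nutrition",
}

def doc_lm(text):
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

query = "data mining algorithms".split()
for name, text in docs.items():
    lm = doc_lm(text)
    print(name, [round(lm.get(w, 0.0), 3) for w in query])
# The text mining paper gives "mining" a much higher probability, so it is the
# more likely source of the query; the zeros for unseen words ("data",
# "algorithms") are exactly the problem that smoothing addresses later.
```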

SLIDE 67

Modeling Queries: Multi-Bernoulli Model

  • Multi-Bernoulli: Modeling word presence/absence

– q = (x1, …, x|V|)
– xi = 1 for presence of word wi; xi = 0 for absence
– Parameters: {p(wi=1|d), p(wi=0|d)}

  • p(wi=1|d) + p(wi=0|d) = 1

$$p\big(q=(x_1,\ldots,x_{|V|}) \mid d\big) = \prod_{i=1}^{|V|} p(w_i = x_i \mid d) = \prod_{i=1,\,x_i=1}^{|V|} p(w_i=1 \mid d) \prod_{i=1,\,x_i=0}^{|V|} p(w_i=0 \mid d)$$

SLIDE 68

Modeling Queries: Multinomial Model

  • Multinomial Language Model: Modeling word frequency

– q = (q1, …, qm), where qj is a query word
– c(wi, q) is the count of word wi in query q
– Parameters: {p(wi|d)}

  • p(w1|d) + … + p(w|V||d) = 1
  • The multinomial language model has performed better than the multi-Bernoulli model

$$p(q = q_1 \ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i, q)}$$

SLIDE 69

Retrieval as LM Estimation

  • Using Multinomial Language Model
  • Document ranking based on query likelihood
  • Retrieval problem ≈ estimation of p(wi|d)

$$\log p(q \mid d) = \sum_{j=1}^{m} \log p(q_j \mid d) = \sum_{i=1}^{|V|} c(w_i, q)\, \log \underbrace{p(w_i \mid d)}_{\text{document language model}}, \quad \text{where } q = q_1 q_2 \ldots q_m$$

$$p(q = q_1 \ldots q_m \mid d) = \prod_{j=1}^{m} p(q_j \mid d) = \prod_{i=1}^{|V|} p(w_i \mid d)^{c(w_i, q)}$$

SLIDE 70

How To Estimate P(w|d)?

  • Simplest solution: Maximum Likelihood Estimator

– P(w|d) = relative frequency of word w in d

  • What if a word doesn’t appear in the text? Then P(w|d)=0

– What probability should we give a word that has not been observed?

– Requires smoothing


SLIDE 71

How To Smooth A Language Model

  • Key question: what probability should be assigned to an unseen word?
  • Solution: let the probability of an unseen word be proportional to its probability given by a reference LM
  • Example: reference LM = collection LM

$$p(w \mid d) = \begin{cases} p_{Seen}(w \mid d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w \mid C) & \text{otherwise} \end{cases}$$

where pSeen(w|d) is the discounted ML estimate, p(w|C) is the collection language model, and αd is a document-dependent constant that makes the probabilities sum to one.
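A minimal sketch of smoothed query-likelihood ranking. The slides only define the general scheme; Jelinek-Mercer interpolation with λ = 0.7 is used here as one concrete choice, and the documents are made up.

```python
# Hypothetical sketch: query-likelihood ranking where each document's ML
# estimate is interpolated with the collection language model p(w|C).
import math
from collections import Counter

docs = {
    "d1": "text mining association clustering text mining",
    "d2": "food nutrition healthy diet food nutrition",
}
collection = Counter(" ".join(docs.values()).split())
coll_total = sum(collection.values())

def smoothed_log_likelihood(query, doc_text, lam=0.7):
    counts = Counter(doc_text.split())
    total = sum(counts.values())
    score = 0.0
    for w in query.split():
        p_ml = counts[w] / total              # ML estimate from the document
        p_coll = collection[w] / coll_total   # collection (reference) LM
        score += math.log(lam * p_ml + (1 - lam) * p_coll)
    return score

query = "mining text"
ranking = sorted(docs, key=lambda d: smoothed_log_likelihood(query, docs[d]), reverse=True)
print(ranking)   # d1 ranks first for this query
```

Dirichlet prior smoothing is another common choice; either way, query words unseen in a document fall back to the collection model instead of zeroing out the whole likelihood.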

SLIDE 72

Rewriting the Ranking Function with Smoothing

Split the sum into query words matched and not matched in d:

$$\log p(q \mid d) = \sum_{w \in V} c(w,q)\, \log p(w \mid d) = \underbrace{\sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log p_{Seen}(w \mid d)}_{\text{query words matched in } d} \;+\; \underbrace{\sum_{w \in V,\ c(w,d)=0} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{query words not matched in } d}$$

Rewrite the unmatched-word sum as a sum over all query words minus the matched ones:

$$\sum_{w \in V,\ c(w,d)=0} c(w,q)\, \log \alpha_d\, p(w \mid C) = \underbrace{\sum_{w \in V} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{all query words}} \;-\; \underbrace{\sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log \alpha_d\, p(w \mid C)}_{\text{query words matched in } d}$$

Putting the two together:

$$\log p(q \mid d) = \sum_{w \in V,\ c(w,d)>0} c(w,q)\, \log \frac{p_{Seen}(w \mid d)}{\alpha_d\, p(w \mid C)} \;+\; |q| \log \alpha_d \;+\; \sum_{w \in V} c(w,q)\, \log p(w \mid C)$$

SLIDE 73

Benefit of Rewriting

  • Better understanding of the ranking function

– Smoothing with p(w|C) ⇒ TF-IDF weighting + document length normalization

  • Enables efficient computation (the sum runs only over the matched query terms)

$$\log p(q \mid d) = \underbrace{\sum_{w_i \in d,\ w_i \in q} c(w_i,q)\, \log \frac{\overbrace{p_{Seen}(w_i \mid d)}^{\text{TF weighting}}}{\alpha_d\, \underbrace{p(w_i \mid C)}_{\text{IDF weighting}}}}_{\text{matched query terms}} \;+\; \underbrace{n \log \alpha_d}_{\text{doc length normalization}} \;+\; \underbrace{\sum_{i=1}^{n} \log p(w_i \mid C)}_{\text{ignore for ranking}}$$

where n = |q| is the query length.