SLIDE 1

Natural Language Processing Basics

Yingyu Liang University of Wisconsin-Madison

SLIDE 2

Natural Language Processing (NLP)

  • The processing of human languages by computers
  • One of the oldest AI tasks
  • One of the most important AI tasks
  • One of the hottest AI tasks nowadays
SLIDE 3

Difficulty

  • Difficulty 1: language is ambiguous, with typically no formal description
  • Example: “We saw her duck.”

How many different meanings?

SLIDE 4

Difficulty

  • Difficulty 1: language is ambiguous, with typically no formal description
  • Example: “We saw her duck.”
  • 1. We looked at a duck that belonged to her.
  • 2. We watched her quickly squat down to avoid something.
  • 3. We used a saw to cut her duck.
SLIDE 5

Difficulty

  • Difficulty 2: computers do not have human concepts
  • Example: “She likes little animals. For example, yesterday we saw her duck.”

  • 1. We looked at a duck that belonged to her.
  • 2. We watched her quickly squat down to avoid something.
  • 3. We used a saw to cut her duck.

With this context, a human easily picks reading 1; a computer without human concepts cannot.
SLIDE 6

Words

  • Preprocess
  • Zipf’s Law

SLIDE 7

Preprocess

  • Corpus: often a set of text documents
  • Tokenization or text normalization: turn the corpus into sequence(s) of tokens
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
SLIDE 8

Preprocess

  • Tokenization or text normalization: turn data into sequence(s) of tokens
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
  • 3. Remove stopwords: the, of, a, with, …
SLIDE 9

Preprocess

  • Tokenization or text normalization: turn data into sequence(s) of tokens (see the sketch after this list)
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
  • 3. Remove stopwords: the, of, a, with, …
  • 4. Case folding: lower-case all characters
  • Sometimes this can be tricky, e.g., “US” vs. “us”
  • 5. Stemming/Lemmatization (optional): looks, looked, looking → look
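A minimal sketch of steps 1-4 in plain Python; the tag-stripping regex and the tiny stopword list are illustrative assumptions, not part of the slides.

```python
import re

# Toy stopword list for illustration; real pipelines use much larger ones.
STOPWORDS = {"the", "of", "a", "an", "with", "is", "and", "to"}

def tokenize(raw_text):
    """Turn a raw document into a list of normalized tokens."""
    # 1. Remove unwanted stuff: here, anything that looks like an HTML tag.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # 2. Word boundaries: keep runs of letters, splitting on whitespace/punctuation.
    #    (This is where a tricky case like "Ph.D." gets split into "ph" and "d".)
    words = re.findall(r"[A-Za-z]+", text)
    # 4. Case folding: lower-case everything ("US" and "us" collapse together).
    words = [w.lower() for w in words]
    # 3. Remove stopwords.
    return [w for w in words if w not in STOPWORDS]

print(tokenize("<p>We saw her duck.</p>"))  # ['we', 'saw', 'her', 'duck']
```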
SLIDE 10

Vocabulary

Given the preprocessed text

  • Word token: an occurrence of a word
  • Word type: a unique word as a dictionary entry (i.e., a unique token)
  • Vocabulary: the set of word types
  • Often 10k to 1 million word types on different corpora
  • Words that are too rare are often removed
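Using the `tokenize` sketch above, tokens vs. types is just a set (the corpus file name here is hypothetical):

```python
tokens = tokenize(open("corpus.txt").read())  # word tokens (hypothetical file)
vocab = set(tokens)                           # word types = unique tokens
print(len(tokens), "tokens,", len(vocab), "word types")
```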
SLIDE 11

Zipf’s Law

  • Word count f, word rank r
  • Zipf’s law: f * r ≈ constant

[Figure: Zipf’s law illustrated on the corpus Tom Sawyer]
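A quick way to eyeball Zipf's law on any token list (reusing `tokens` from the sketch above):

```python
from collections import Counter

# For words sorted by descending frequency, count * rank should stay roughly constant.
counts = Counter(tokens)
for rank, (word, count) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:4d}  {word:<15s} {count:6d}  f*r = {count * rank}")
```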

SLIDE 12

Text: Bag-of-Words Representation

  • Bag-of-Words
  • tf-idf

SLIDE 13

Bag-of-Words

How to represent a piece of text (sentence/document) as numbers?

  • Let m denote the size of the vocabulary
  • Given a document d, let c(w, d) denote the number of occurrences of word w in d
  • Bag-of-Words representation of the document:

v_d = [c(w_1, d), c(w_2, d), …, c(w_m, d)] / z_d

  • Often z_d = Σ_w c(w, d), the total number of tokens in d
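A minimal sketch of this representation; the ordering of `vocab` fixes the vector's coordinates:

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Normalized Bag-of-Words vector v_d = [c(w_1,d), ..., c(w_m,d)] / z_d."""
    counts = Counter(tokens)
    z = len(tokens)  # z_d = sum of all word counts = total tokens in the document
    return [counts[w] / z for w in vocab]
```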
SLIDE 14

Example

  • Preprocessed text: this is a good sentence this is another good sentence
  • BoW representation:

[c('a', d)/z_d, c('is', d)/z_d, …, c('example', d)/z_d]

  • What is z_d?
  • What is c('a', d)/z_d?
  • What is c('example', d)/z_d?
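One way to check these with the `bag_of_words` sketch above (the three-word vocabulary slice is just for illustration):

```python
doc = "this is a good sentence this is another good sentence".split()
vocab = ["a", "is", "example"]    # a slice of the full vocabulary, for illustration
print(len(doc))                   # z_d = 10 tokens
print(bag_of_words(doc, vocab))   # [0.1, 0.2, 0.0]: 'example' never occurs
```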
SLIDE 15

tf-idf

  • tf: normalized term frequency

tf(w) = c(w, d) / max_v c(v, d)

  • idf: inverse document frequency

idf(w) = log( total #documents / #documents containing w )

  • tf-idf:

tf-idf(w) = tf(w) * idf(w)

  • Representation of the document:

v_d = [tf-idf(w_1), tf-idf(w_2), …, tf-idf(w_m)]
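A direct sketch of these formulas; the corpus is given as a list of token lists, and the zero-idf guard for words that appear in no document is my own assumption:

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, vocab):
    """tf-idf representation of one document; `corpus` is a list of token lists."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())               # denominator of the normalized tf
    n_docs = len(corpus)
    vec = []
    for w in vocab:
        tf = counts[w] / max_count
        df = sum(1 for d in corpus if w in d)      # #documents containing w
        idf = math.log(n_docs / df) if df else 0.0 # guard against df = 0
        vec.append(tf * idf)
    return vec
```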

SLIDE 16

Cosine Similarity

How to measure similarities between pieces of text?

  • Given the document vectors, we can use any similarity notion on vectors
  • Commonly used in NLP: cosine of the angle between the two vectors

sim(x, y) = x⊤y / ( √(x⊤x) · √(y⊤y) )
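A dependency-free sketch of this formula:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```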

SLIDE 17

Text: Statistical Language Model

  • Statistical language model
  • n-gram
  • Smoothing

SLIDE 18

Probabilistic view

  • Use a probability distribution to model the language
  • Dates back to Shannon (information theory; bits in the message)
SLIDE 19

Statistical language model

  • Language model: probability distribution over sequences of tokens
  • Typically, tokens are words, and the distribution is discrete
  • Tokens can also be characters or even bytes
  • Sentence: “the quick brown fox jumps over the lazy dog”

Tokens: x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9

SLIDE 20

Statistical language model

  • For simplification, consider a fixed-length sequence of tokens (a sentence)
  • Probabilistic model:

(x_1, x_2, x_3, …, x_{T-1}, x_T)  ↦  P[x_1, x_2, x_3, …, x_{T-1}, x_T]

SLIDE 21

Unigram model

  • Unigram model: define the probability of the sequence as the product of the probabilities of the tokens in the sequence
  • Independence!

P[x_1, x_2, …, x_T] = ∏_{t=1}^{T} P[x_t]

SLIDE 22

A simple unigram example

  • Sentence: “the dog ran away”
  • How to estimate it on the training corpus?

P̂[the dog ran away] = P̂[the] · P̂[dog] · P̂[ran] · P̂[away], where e.g. P̂[the] = count(the) / total #tokens
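A minimal sketch of this maximum-likelihood estimate (the function names are my own):

```python
from collections import Counter

def train_unigram(corpus_tokens):
    """MLE unigram model: P[x] = count(x) / total number of tokens."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: c / total for w, c in counts.items()}

def sentence_prob(model, sentence):
    """P[x_1,...,x_T] = product of P[x_t]; unseen tokens get probability 0."""
    p = 1.0
    for w in sentence:
        p *= model.get(w, 0.0)
    return p
```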


SLIDE 24

n-gram model

  • n-gram: a sequence of n tokens
  • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens:

P[x_1, x_2, …, x_T] = P[x_1, …, x_{n-1}] ∏_{t=n}^{T} P[x_t | x_{t-n+1}, …, x_{t-1}]

SLIDE 25

n-gram model

  • n-gram: a sequence of n tokens
  • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens:

P[x_1, x_2, …, x_T] = P[x_1, …, x_{n-1}] ∏_{t=n}^{T} P[x_t | x_{t-n+1}, …, x_{t-1}]

Markovian assumption: each token depends only on the n-1 tokens before it.

SLIDE 26

Typical n-gram models

  • n = 1: unigram
  • n = 2: bigram
  • n = 3: trigram
SLIDE 27

Training an n-gram model

  • Straightforward counting: count the co-occurrences of the grams

For all grams (x_{t-n+1}, …, x_{t-1}, x_t):

  • 1. count and estimate P̂[x_{t-n+1}, …, x_{t-1}, x_t]
  • 2. count and estimate P̂[x_{t-n+1}, …, x_{t-1}]
  • 3. compute

P̂[x_t | x_{t-n+1}, …, x_{t-1}] = P̂[x_{t-n+1}, …, x_{t-1}, x_t] / P̂[x_{t-n+1}, …, x_{t-1}]
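A counting sketch of these three steps for n ≥ 2; the ratio of raw counts stands in for the ratio of the estimated probabilities, which agrees with step 3 up to the nearly equal numbers of n-grams and (n-1)-grams in the corpus:

```python
from collections import Counter

def train_ngram(corpus_tokens, n):
    """Estimate P[x_t | x_{t-n+1},...,x_{t-1}] by counting grams and contexts."""
    grams = Counter(tuple(corpus_tokens[i:i + n])
                    for i in range(len(corpus_tokens) - n + 1))
    contexts = Counter(tuple(corpus_tokens[i:i + n - 1])
                       for i in range(len(corpus_tokens) - n + 2))
    # g[:-1] is the (n-1)-token context of the n-gram g.
    return {g: c / contexts[g[:-1]] for g, c in grams.items()}
```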

SLIDE 28

A simple trigram example

  • Sentence: “the dog ran away”

P̂[the dog ran away] = P̂[the dog ran] · P̂[away | dog ran]

P̂[the dog ran away] = P̂[the dog ran] · P̂[dog ran away] / P̂[dog ran]

SLIDE 29

Drawback

  • Sparsity issue: P̂[…] is most likely to be 0
  • Bad case: “dog ran away” never appears in the training corpus, so P̂[dog ran away] = 0
  • Even worse: “dog ran” never appears in the training corpus, so P̂[dog ran] = 0

SLIDE 30

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog] = actualcount(dog) + 1

SLIDE 31

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog] = actualcount(dog) + 1

P̂[dog] = pseudocount[dog] / pseudo length of the corpus = pseudocount[dog] / (actual length of the corpus + |V|)

SLIDE 32

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog ran away] = actualcount(dog ran away) + 1

pseudocount[dog ran] = ?

SLIDE 33

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog ran away] = actualcount(dog ran away) + 1
pseudocount[dog ran] = actualcount(dog ran) + |V|

P̂[away | dog ran] ≈ pseudocount[dog ran away] / pseudocount[dog ran], since #bigrams ≈ #trigrams on the corpus
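A sketch of the smoothed conditional, generalized to any context length (the function name is my own):

```python
def laplace_cond_prob(gram_counts, context_counts, token, context, vocab_size):
    """Laplace-smoothed P[token | context]: +1 to the gram count, +|V| to the context count."""
    num = gram_counts.get(context + (token,), 0) + 1
    den = context_counts.get(context, 0) + vocab_size
    return num / den
```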

SLIDE 34

Example

  • Preprocessed text: this is a good sentence this is another good sentence
  • How many unigrams?
  • How many bigrams?
  • Estimate P̂[is | this] without using Laplace smoothing
  • Estimate P̂[is | this] using Laplace smoothing (|V| = 10000)
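These can be checked with the counting sketches above; the text has 10 unigram tokens and 9 bigrams:

```python
from collections import Counter

corpus = "this is a good sentence this is another good sentence".split()
bigrams = Counter(tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1))
unigrams = Counter((w,) for w in corpus)

print(bigrams[("this", "is")] / unigrams[("this",)])                 # 2/2 = 1.0, unsmoothed
print(laplace_cond_prob(bigrams, unigrams, "is", ("this",), 10000))  # (2+1)/(2+10000) ≈ 0.0003
```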

SLIDE 35

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing
  • Back-off methods: resort to lower-order statistics
  • Example: if P̂[away | dog ran] does not work, use P̂[away | ran] as a replacement
  • Mixture methods: use a linear combination of P̂[away | ran] and P̂[away | dog ran]
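Minimal sketches of both ideas; the function names and the interpolation weight are my own choices:

```python
def backoff_prob(trigram_probs, bigram_probs, w1, w2, w3):
    """Back off: use the bigram estimate when the trigram was never observed."""
    p = trigram_probs.get((w1, w2, w3))
    return p if p is not None else bigram_probs.get((w2, w3), 0.0)

def mixture_prob(trigram_probs, bigram_probs, w1, w2, w3, lam=0.5):
    """Mixture: lam * P[w3 | w1 w2] + (1 - lam) * P[w3 | w2]."""
    return (lam * trigram_probs.get((w1, w2, w3), 0.0)
            + (1 - lam) * bigram_probs.get((w2, w3), 0.0))
```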

SLIDE 36

Another drawback

  • High dimension: the number of grams is too large
  • Vocabulary size: about 10k ≈ 2^14
  • #trigrams: about (2^14)^3 = 2^42
SLIDE 37

Rectify: clustering

  • Class-based language models: cluster tokens into classes; replace each token with its class
  • Significantly reduces the vocabulary size; also addresses the sparsity issue
  • Combinations of smoothing and clustering are also possible