SLIDE 1

Natural Language Processing Basics

Yingyu Liang University of Wisconsin-Madison

SLIDE 2

Natural Language Processing (NLP)

  • The processing of human languages by computers
  • One of the oldest AI tasks
  • One of the most important AI tasks
  • One of the hottest AI tasks nowadays
SLIDE 3

Difficulty

  • Difficulty 1: language is ambiguous, with typically no formal description
  • Example: “We saw her duck.”

How many different meanings?

SLIDE 4

Difficulty

  • Difficulty 1: language is ambiguous, with typically no formal description
  • Example: “We saw her duck.”
  • 1. We looked at a duck that belonged to her.
  • 2. We watched her quickly squat down to avoid something.
  • 3. We used a saw to cut her duck.
SLIDE 5

Difficulty

  • Difficulty 2: computers do not have human concepts
  • Example: “She likes little animals. For example, yesterday we saw her duck.”

  • 1. We looked at a duck that belonged to her.
  • 2. We watched her quickly squat down to avoid something.
  • 3. We used a saw to cut her duck.

With this context, a human easily picks reading 1; a computer without human concepts cannot.
SLIDE 6

Words

  • Preprocess
  • Zipf’s Law

SLIDE 7

Preprocess

  • Corpus: often a set of text documents
  • Tokenization or text normalization: turn the corpus into sequence(s) of tokens
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
SLIDE 8

Preprocess

  • Tokenization or text normalization: turn data into sequence(s) of tokens
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
  • 3. Remove stopwords: the, of, a, with, …
SLIDE 9

Preprocess

  • Tokenization or text normalization: turn data into sequence(s) of tokens (see the sketch after this list)
  • 1. Remove unwanted stuff: HTML tags, encoding tags
  • 2. Determine word boundaries: usually whitespace and punctuation
  • Sometimes this can be tricky, e.g., “Ph.D.”
  • 3. Remove stopwords: the, of, a, with, …
  • 4. Case folding: lower-case all characters
  • Sometimes this can be tricky, e.g., “US” vs. “us”
  • 5. Stemming/Lemmatization (optional): looks, looked, looking → look
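A minimal sketch of steps 1-4 in plain Python; the tag-stripping regex and the tiny stopword list are illustrative assumptions, not part of the slides.

```python
import re

# Toy stopword list for illustration; real pipelines use much larger ones.
STOPWORDS = {"the", "of", "a", "an", "with", "is", "and", "to"}

def tokenize(raw_text):
    """Turn a raw document into a list of normalized tokens."""
    # 1. Remove unwanted stuff: here, anything that looks like an HTML tag.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # 2. Word boundaries: keep runs of letters, splitting on whitespace/punctuation.
    #    (This is where a tricky case like "Ph.D." gets split into "ph" and "d".)
    words = re.findall(r"[A-Za-z]+", text)
    # 4. Case folding: lower-case everything ("US" and "us" collapse together).
    words = [w.lower() for w in words]
    # 3. Remove stopwords.
    return [w for w in words if w not in STOPWORDS]

print(tokenize("<p>We saw her duck.</p>"))  # ['we', 'saw', 'her', 'duck']
```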
SLIDE 10

Vocabulary

Given the preprocessed text

  • Word token: an occurrence of a word
  • Word type: a unique word as a dictionary entry (i.e., a unique token)
  • Vocabulary: the set of word types
  • Often 10k to 1 million word types on different corpora
  • Words that are too rare are often removed
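Using the `tokenize` sketch above, tokens vs. types is just a set (the corpus file name here is hypothetical):

```python
tokens = tokenize(open("corpus.txt").read())  # word tokens (hypothetical file)
vocab = set(tokens)                           # word types = unique tokens
print(len(tokens), "tokens,", len(vocab), "word types")
```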
SLIDE 11

Zipf’s Law

  • Word count f, word rank r
  • Zipf’s law: f * r ≈ constant

[Figure: Zipf’s law illustrated on the corpus Tom Sawyer]
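A quick way to eyeball Zipf's law on any token list (reusing `tokens` from the sketch above):

```python
from collections import Counter

# For words sorted by descending frequency, count * rank should stay roughly constant.
counts = Counter(tokens)
for rank, (word, count) in enumerate(counts.most_common(20), start=1):
    print(f"{rank:4d}  {word:<15s} {count:6d}  f*r = {count * rank}")
```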

SLIDE 12

Text: Bag-of-Words Representation

  • Bag-of-Words
  • tf-idf

SLIDE 13

Bag-of-Words

How to represent a piece of text (sentence/document) as numbers?

  • Let m denote the size of the vocabulary
  • Given a document d, let c(w, d) denote the number of occurrences of word w in d
  • Bag-of-Words representation of the document:

v_d = [c(w_1, d), c(w_2, d), …, c(w_m, d)] / z_d

  • Often z_d = Σ_w c(w, d), the total number of tokens in d
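A minimal sketch of this representation; the ordering of `vocab` fixes the vector's coordinates:

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Normalized Bag-of-Words vector v_d = [c(w_1,d), ..., c(w_m,d)] / z_d."""
    counts = Counter(tokens)
    z = len(tokens)  # z_d = sum of all word counts = total tokens in the document
    return [counts[w] / z for w in vocab]
```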
SLIDE 14

Example

  • Preprocessed text: this is a good sentence this is another good sentence
  • BoW representation:

[c('a', d)/z_d, c('is', d)/z_d, …, c('example', d)/z_d]

  • What is z_d?
  • What is c('a', d)/z_d?
  • What is c('example', d)/z_d?
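One way to check these with the `bag_of_words` sketch above (the three-word vocabulary slice is just for illustration):

```python
doc = "this is a good sentence this is another good sentence".split()
vocab = ["a", "is", "example"]    # a slice of the full vocabulary, for illustration
print(len(doc))                   # z_d = 10 tokens
print(bag_of_words(doc, vocab))   # [0.1, 0.2, 0.0]: 'example' never occurs
```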
SLIDE 15

tf-idf

  • tf: normalized term frequency

tf(w) = c(w, d) / max_v c(v, d)

  • idf: inverse document frequency

idf(w) = log( total #documents / #documents containing w )

  • tf-idf:

tf-idf(w) = tf(w) * idf(w)

  • Representation of the document:

v_d = [tf-idf(w_1), tf-idf(w_2), …, tf-idf(w_m)]
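A direct sketch of these formulas; the corpus is given as a list of token lists, and the zero-idf guard for words that appear in no document is my own assumption:

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, vocab):
    """tf-idf representation of one document; `corpus` is a list of token lists."""
    counts = Counter(doc_tokens)
    max_count = max(counts.values())               # denominator of the normalized tf
    n_docs = len(corpus)
    vec = []
    for w in vocab:
        tf = counts[w] / max_count
        df = sum(1 for d in corpus if w in d)      # #documents containing w
        idf = math.log(n_docs / df) if df else 0.0 # guard against df = 0
        vec.append(tf * idf)
    return vec
```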

SLIDE 16

Cosine Similarity

How to measure similarities between pieces of text?

  • Given the document vectors, we can use any similarity notion on vectors
  • Commonly used in NLP: cosine of the angle between the two vectors

sim(x, y) = x⊤y / ( √(x⊤x) · √(y⊤y) )
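A dependency-free sketch of this formula:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: x.y / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```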

SLIDE 17

Text: Statistical Language Model

  • Statistical language model
  • n-gram
  • Smoothing

SLIDE 18

Probabilistic view

  • Use a probability distribution to model the language
  • Dates back to Shannon (information theory; bits in the message)
SLIDE 19

Statistical language model

  • Language model: probability distribution over sequences of tokens
  • Typically, tokens are words, and the distribution is discrete
  • Tokens can also be characters or even bytes
  • Sentence: “the quick brown fox jumps over the lazy dog”

Tokens: x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9

SLIDE 20

Statistical language model

  • For simplification, consider a fixed-length sequence of tokens (a sentence)
  • Probabilistic model:

(x_1, x_2, x_3, …, x_{T-1}, x_T)  ↦  P[x_1, x_2, x_3, …, x_{T-1}, x_T]

SLIDE 21

Unigram model

  • Unigram model: define the probability of the sequence as the product of the probabilities of the tokens in the sequence
  • Independence!

P[x_1, x_2, …, x_T] = ∏_{t=1}^{T} P[x_t]

SLIDE 22

A simple unigram example

  • Sentence: “the dog ran away”
  • How to estimate it on the training corpus?

P̂[the dog ran away] = P̂[the] · P̂[dog] · P̂[ran] · P̂[away], where e.g. P̂[the] = count(the) / total #tokens
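A minimal sketch of this maximum-likelihood estimate (the function names are my own):

```python
from collections import Counter

def train_unigram(corpus_tokens):
    """MLE unigram model: P[x] = count(x) / total number of tokens."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    return {w: c / total for w, c in counts.items()}

def sentence_prob(model, sentence):
    """P[x_1,...,x_T] = product of P[x_t]; unseen tokens get probability 0."""
    p = 1.0
    for w in sentence:
        p *= model.get(w, 0.0)
    return p
```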


SLIDE 24

n-gram model

  • n-gram: a sequence of n tokens
  • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens:

P[x_1, x_2, …, x_T] = P[x_1, …, x_{n-1}] ∏_{t=n}^{T} P[x_t | x_{t-n+1}, …, x_{t-1}]

SLIDE 25

n-gram model

  • n-gram: a sequence of n tokens
  • n-gram model: define the conditional probability of the n-th token given the preceding n-1 tokens:

P[x_1, x_2, …, x_T] = P[x_1, …, x_{n-1}] ∏_{t=n}^{T} P[x_t | x_{t-n+1}, …, x_{t-1}]

Markovian assumption: each token depends only on the n-1 tokens before it.

SLIDE 26

Typical n-gram models

  • n = 1: unigram
  • n = 2: bigram
  • n = 3: trigram
SLIDE 27

Training an n-gram model

  • Straightforward counting: count the co-occurrences of the grams

For all grams (x_{t-n+1}, …, x_{t-1}, x_t):

  • 1. count and estimate P̂[x_{t-n+1}, …, x_{t-1}, x_t]
  • 2. count and estimate P̂[x_{t-n+1}, …, x_{t-1}]
  • 3. compute

P̂[x_t | x_{t-n+1}, …, x_{t-1}] = P̂[x_{t-n+1}, …, x_{t-1}, x_t] / P̂[x_{t-n+1}, …, x_{t-1}]
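A counting sketch of these three steps for n ≥ 2; the ratio of raw counts stands in for the ratio of the estimated probabilities, which agrees with step 3 up to the nearly equal numbers of n-grams and (n-1)-grams in the corpus:

```python
from collections import Counter

def train_ngram(corpus_tokens, n):
    """Estimate P[x_t | x_{t-n+1},...,x_{t-1}] by counting grams and contexts."""
    grams = Counter(tuple(corpus_tokens[i:i + n])
                    for i in range(len(corpus_tokens) - n + 1))
    contexts = Counter(tuple(corpus_tokens[i:i + n - 1])
                       for i in range(len(corpus_tokens) - n + 2))
    # g[:-1] is the (n-1)-token context of the n-gram g.
    return {g: c / contexts[g[:-1]] for g, c in grams.items()}
```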

SLIDE 28

A simple trigram example

  • Sentence: “the dog ran away”

P̂[the dog ran away] = P̂[the dog ran] · P̂[away | dog ran]

P̂[the dog ran away] = P̂[the dog ran] · P̂[dog ran away] / P̂[dog ran]

SLIDE 29

Drawback

  • Sparsity issue: P̂[…] is most likely to be 0
  • Bad case: “dog ran away” never appears in the training corpus, so P̂[dog ran away] = 0
  • Even worse: “dog ran” never appears in the training corpus, so P̂[dog ran] = 0

SLIDE 30

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog] = actualcount(dog) + 1

SLIDE 31

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog] = actualcount(dog) + 1

P̂[dog] = pseudocount[dog] / pseudo length of the corpus = pseudocount[dog] / (actual length of the corpus + |V|)

SLIDE 32

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog ran away] = actualcount(dog ran away) + 1

pseudocount[dog ran] = ?

SLIDE 33

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing, which adds one count to all n-grams

pseudocount[dog ran away] = actualcount(dog ran away) + 1
pseudocount[dog ran] = actualcount(dog ran) + |V|

P̂[away | dog ran] ≈ pseudocount[dog ran away] / pseudocount[dog ran], since #bigrams ≈ #trigrams on the corpus
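A sketch of the smoothed conditional, generalized to any context length (the function name is my own):

```python
def laplace_cond_prob(gram_counts, context_counts, token, context, vocab_size):
    """Laplace-smoothed P[token | context]: +1 to the gram count, +|V| to the context count."""
    num = gram_counts.get(context + (token,), 0) + 1
    den = context_counts.get(context, 0) + vocab_size
    return num / den
```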

SLIDE 34

Example

  • Preprocessed text: this is a good sentence this is another good sentence
  • How many unigrams?
  • How many bigrams?
  • Estimate P̂[is | this] without using Laplace smoothing
  • Estimate P̂[is | this] using Laplace smoothing (|V| = 10000)
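These can be checked with the counting sketches above; the text has 10 unigram tokens and 9 bigrams:

```python
from collections import Counter

corpus = "this is a good sentence this is another good sentence".split()
bigrams = Counter(tuple(corpus[i:i + 2]) for i in range(len(corpus) - 1))
unigrams = Counter((w,) for w in corpus)

print(bigrams[("this", "is")] / unigrams[("this",)])                 # 2/2 = 1.0, unsmoothed
print(laplace_cond_prob(bigrams, unigrams, "is", ("this",), 10000))  # (2+1)/(2+10000) ≈ 0.0003
```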

SLIDE 35

Rectify: smoothing

  • Basic method: adding non-zero probability mass to zero entries
  • Example: Laplace smoothing
  • Back-off methods: resort to lower-order statistics
  • Example: if P̂[away | dog ran] does not work, use P̂[away | ran] as a replacement
  • Mixture methods: use a linear combination of P̂[away | ran] and P̂[away | dog ran]
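Minimal sketches of both ideas; the function names and the interpolation weight are my own choices:

```python
def backoff_prob(trigram_probs, bigram_probs, w1, w2, w3):
    """Back off: use the bigram estimate when the trigram was never observed."""
    p = trigram_probs.get((w1, w2, w3))
    return p if p is not None else bigram_probs.get((w2, w3), 0.0)

def mixture_prob(trigram_probs, bigram_probs, w1, w2, w3, lam=0.5):
    """Mixture: lam * P[w3 | w1 w2] + (1 - lam) * P[w3 | w2]."""
    return (lam * trigram_probs.get((w1, w2, w3), 0.0)
            + (1 - lam) * bigram_probs.get((w2, w3), 0.0))
```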

SLIDE 36

Another drawback

  • High dimension: the number of grams is too large
  • Vocabulary size: about 10k ≈ 2^14
  • #trigrams: about (2^14)^3 = 2^42
SLIDE 37

Rectify: clustering

  • Class-based language models: cluster tokens into classes; replace each token with its class
  • Significantly reduces the vocabulary size; also addresses the sparsity issue
  • Combinations of smoothing and clustering are also possible