SLIDE 1

Social Media & Text Analysis

lecture 3 - Language Identification 
 (supervised learning and Naive Bayes algorithm)

CSE 5539-0010, Ohio State University
Instructor: Alan Ritter
Website: socialmedia-class.org

SLIDE 2

In-class Presentation

  • a 10-minute presentation plus 2-minute Q&A (20 points)
  • a Social Media Platform or an NLP Researcher
  • pair up (teams of two students)
  • Sign up now!
SLIDE 3

Reading #1

SLIDE 4

Reading #1

SLIDE 5

Reading #2

SLIDE 6

Dan Jurafsky

Language Technology

[overview figure: NLP tasks grouped into three columns]

mostly solved:
  • Spam detection: “Let’s go to Agra!” ✓  “Buy V1AGRA …” ✗
  • Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
  • Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON ORG LOC

making good progress:
  • Sentiment analysis: “Best roast chicken in San Francisco!” “The waiter ignored us for 20 minutes.”
  • Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
  • Word sense disambiguation (WSD): “I need new batteries for my mouse.”
  • Parsing: “I can see Alcatraz from the window!”
  • Machine translation (MT): “The 13th Shanghai International Film Festival…”
  • Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → add “Party, May 27” to calendar

still really hard:
  • Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
  • Paraphrase: “XYZ acquired ABC yesterday” ≈ “ABC has been taken over by XYZ”
  • Summarization: “The Dow Jones is up” “Housing prices rose” “The S&P500 jumped” → “Economy is good”
  • Dialog: “Where is Citizen Kane playing in SF?” → “Castro Theatre at 7:30. Do you want a ticket?”

Natural Language Processing

SLIDE 7

Domain/Genre

  • NLP tools are often designed for one domain (in-domain) and may not work well for other domains (out-of-domain).
  • Why?

News · Blogs · Wikipedia · Forums · Comments · Twitter …

SLIDE 8

Domain/Genre

  • How different?

Source: Baldwin et al., “How Noisy Social Media Text, How Diffrnt Social Media Sources?”, IJCNLP 2013

SLIDE 9

Domain/Genre

  • How different?
  • out-of-vocabulary words

Source: Baldwin et al., “How Noisy Social Media Text, How Diffrnt Social Media Sources?”, IJCNLP 2013

SLIDE 10

Domain/Genre

  • How similar?


Twitter ≡ Comments < Forums < Blogs < BNC < Wikipedia


Source: Baldwin et al., “How Noisy Social Media Text, How Diffrnt Social Media Sources?”, IJCNLP 2013

SLIDE 11

Domain/Genre

  • What to do?
  • build robust tools/models that work across domains
  • or build specific tools/models for Twitter data only; many of the techniques/algorithms are useful elsewhere

 (we will see examples of both in this class)

SLIDE 12

Domain/Genre

  • Why so much Twitter?
  • publicly available (vs. SMS, emails)
  • large amount of data
  • large demand for research/commercial purposes
  • very different from well-edited text (which most NLP tools were built for)

SLIDE 13

NLP Pipeline

SLIDE 14

NLP Pipeline

Language Identification → Tokenization → Part-of-Speech (POS) Tagging → Shallow Parsing (Chunking) → Named Entity Recognition (NER)

Stemming · Normalization

SLIDE 15

NLP Pipeline

Language Identification → Tokenization → Part-of-Speech (POS) Tagging → Shallow Parsing (Chunking) → Named Entity Recognition (NER)

Stemming · Normalization

SLIDE 16

Language Identification

(a.k.a. Language Detection)

SLIDE 17

LangID: why needed?

  • Twitter is highly multilingual
  • But NLP is often monolingual
SLIDE 18

SLIDE 19

Known as the “Chinese Twitter”: 120 million posts / day

SLIDE 20

LangID: Google Translate

SLIDE 21

LangID: Twitter API

  • introduced in March 2013
  • uses two-letter ISO 639-1 codes
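
As an illustration (not from the slides): each tweet object returned by the API carries a top-level lang field holding the detected code. The tweet dict below is made up for the example:

```python
# Minimal sketch: reading the language code from a tweet object
# (the example tweet is illustrative, not real API output).
tweet = {"id": 123, "text": "hola mundo", "lang": "es"}  # hypothetical tweet

if tweet["lang"] == "es":           # two-letter ISO 639-1 code
    print("Spanish tweet:", tweet["text"])
```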

SLIDE 22

LangID Tool: langid.py

SLIDE 23

LangID Tool: langid.py
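
langid.py is pip-installable; a minimal usage sketch (returned scores vary by version, so the outputs noted in comments are indicative only):

```python
# Minimal usage of the langid.py library (pip install langid).
import langid

print(langid.classify("The quick brown fox"))     # e.g. ('en', score)
print(langid.classify("Wikipedia en español"))    # e.g. ('es', score)

# Optionally restrict the set of candidate languages:
langid.set_languages(["en", "es", "fr"])
```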

SLIDE 24

LangID:

A Classification Problem

  • Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cj}
  • Output:
  • a predicted class c ∈ C
SLIDE 25

Classification Method:

Hand-crafted Rules

  • Keyword-based approaches do not work well for language identification:
  • poor recall
  • expensive to build large dictionaries for all the different languages
  • cognate words (shared across languages)
SLIDE 26

Classification Method:

Supervised Machine Learning

  • Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cj}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
  • Output:
  • a learned classifier δ: d → c (see the sketch below)
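
As a concrete illustration of this setup (not part of the slides), here is a minimal supervised text classifier in Python with scikit-learn; the toy documents and labels are invented:

```python
# A minimal sketch of supervised document classification with
# scikit-learn (assumed installed); the training data is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["free English Wikipedia", "Wikipedia editor",
              "español de Wikipedia"]       # hand-labeled documents d1..dm
train_labels = ["en", "en", "es"]           # their classes c1..cm

vectorizer = CountVectorizer()              # bag-of-words features
X = vectorizer.fit_transform(train_docs)

classifier = MultinomialNB()                # the learned classifier δ: d -> c
classifier.fit(X, train_labels)

print(classifier.predict(vectorizer.transform(["la edición en español"])))
# expected: ['es']
```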
SLIDE 27

Classification Method:

Supervised Machine Learning

Source: NLTK Book

SLIDE 28

Classification Method:

Supervised Machine Learning

Source: NLTK Book

SLIDE 29

Classification Method:

Supervised Machine Learning

  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines (SVM)
SLIDE 30

Classification Method:

Supervised Machine Learning

  • Naïve Bayes
  • Logistic Regression
  • Support Vector Machines (SVM)
SLIDE 31

Naïve Bayes

  • a family of simple probabilistic classifiers based on Bayes’ theorem, with strong (naive) independence assumptions between the features
  • Bayes’ Theorem:

P(c | d) = P(d | c) P(c) / P(d)

SLIDE 32

Naïve Bayes

  • For a document d, find the most probable class c:

cMAP = argmax_{c ∈ C} P(c | d)    (maximum a posteriori)

Source: adapted from Dan Jurafsky

SLIDE 33

Naïve Bayes

  • For a document d, find the most probable class c:

cMAP = argmax_{c ∈ C} P(c | d)
     = argmax_{c ∈ C} P(d | c) P(c) / P(d)    (Bayes Rule)

Source: adapted from Dan Jurafsky

SLIDE 34

Naïve Bayes

  • For a document d, find the most probable class c:

cMAP = argmax_{c ∈ C} P(c | d)
     = argmax_{c ∈ C} P(d | c) P(c) / P(d)    (Bayes Rule)
     = argmax_{c ∈ C} P(d | c) P(c)           (drop the denominator)

Source: adapted from Dan Jurafsky

SLIDE 35

Naïve Bayes

  • document d represented as features t1, t2, …, tn:

cMAP = argmax_{c ∈ C} P(d | c) P(c)
     = argmax_{c ∈ C} P(t1, t2, ..., tn | c) P(c)

Source: adapted from Dan Jurafsky

SLIDE 36

Naïve Bayes

  • document d represented as features t1, t2, …, tn:

cMAP = argmax_{c ∈ C} P(t1, t2, ..., tn | c) P(c)

P(c) is the prior: how often does this class occur? (a simple count)

Source: adapted from Dan Jurafsky

SLIDE 37

Naïve Bayes

  • document d represented as features t1, t2, …, tn:

cMAP = argmax_{c ∈ C} P(t1, t2, ..., tn | c) P(c)

likelihood × prior: the full likelihood has O(|T|^n · |C|) parameters (n = number of unique n-gram tokens) — we need to make a simplifying assumption

Source: adapted from Dan Jurafsky

SLIDE 38

Naïve Bayes

  • Conditional Independence Assumption: the feature probabilities P(ti | c) are independent given the class c:

P(t1, t2, ..., tn | c) = P(t1 | c) · P(t2 | c) · ... · P(tn | c)

Source: adapted from Dan Jurafsky

SLIDE 39

Naïve Bayes

  • For a document d, find the most probable class c:

cMAP = argmax_{c ∈ C} P(t1, t2, ..., tn | c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c)

Source: adapted from Dan Jurafsky
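
For reference, the whole chain from the last few slides in one LaTeX block (the same formulas, just consolidated):

```latex
% Consolidated derivation (slides 32-39)
\begin{aligned}
c_{\mathrm{MAP}} &= \arg\max_{c \in C} P(c \mid d)                      % most probable class
  = \arg\max_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)}                   % Bayes rule
  = \arg\max_{c \in C} P(d \mid c)\,P(c) \\                             % drop P(d)
c_{\mathrm{NB}} &= \arg\max_{c \in C} P(c) \prod_{t_i \in d} P(t_i \mid c) % independence assumption
\end{aligned}
```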

SLIDE 40

Naïve Bayes

cNB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c)

Probabilistic Graphical Model: [figure: class node c with arrows to feature nodes t1, t2, …, tn]

SLIDE 41

Variations of Naïve Bayes

cMAP = argmax_{c ∈ C} P(d | c) P(c)

  • different assumptions on the distributions of features:
  • Multinomial: discrete features
  • Bernoulli: binary features
  • Gaussian: continuous features

Source: adapted from Dan Jurafsky
SLIDE 42

Variations of Naïve Bayes

cMAP = argmax_{c ∈ C} P(d | c) P(c)

  • different assumptions on the distributions of features:
  • Multinomial: discrete features
  • Bernoulli: binary features
  • Gaussian: continuous features

Source: adapted from Dan Jurafsky
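
In scikit-learn these three variants map directly onto three classes; a small sketch (the tiny feature matrices are made up for illustration):

```python
# The three Naive Bayes variants in scikit-learn (assumed installed);
# the feature matrices below are illustrative only.
import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

y = np.array([0, 0, 1])

MultinomialNB().fit(np.array([[2, 1], [3, 0], [0, 4]]), y)           # discrete counts
BernoulliNB().fit(np.array([[1, 0], [1, 1], [0, 1]]), y)             # binary presence/absence
GaussianNB().fit(np.array([[0.5, 1.2], [0.3, 0.9], [2.2, 0.1]]), y)  # continuous values
```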
SLIDE 43

LangID features

  • n-gram features:
  • 1-gram: “the” “following” “Wikipedia” “en” “español” …
  • 2-gram: “the following” “following is” “Wikipedia en” “en español” …
  • 3-gram: …

Example documents:

English: “The following is a list of words that occur in both Modern English and Modern Spanish, but which are pronounced differently and may have different meanings in each language. …”

Spanish: “Wikipedia en español es la edición en idioma español de Wikipedia. Actualmente cuenta con 1 185 590 páginas válidas de contenido y ocupa el décimo puesto en esta estadística entre …” [Spanish Wikipedia is the Spanish-language edition of Wikipedia. It currently has 1,185,590 valid content pages and ranks tenth in this statistic among …]
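
A quick sketch of extracting word n-gram features like those above (a generic helper written for this example, not from the slides):

```python
# A minimal word n-gram extractor (illustrative helper).
def word_ngrams(text, n):
    tokens = text.split()  # naive whitespace tokenization
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc = "Wikipedia en español es la edición en idioma español de Wikipedia"
print(word_ngrams(doc, 1)[:5])  # ['Wikipedia', 'en', 'español', 'es', 'la']
print(word_ngrams(doc, 2)[:3])  # ['Wikipedia en', 'en español', 'español es']
```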

SLIDE 44

Bag-of-Words Model

  • positional independence assumption:
  • features are the words occurring in the document, and their value is the number of occurrences
  • word probabilities are position-independent
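
A bag-of-words representation is a couple of lines in Python (an illustrative sketch):

```python
# Bag-of-words: word counts, ignoring position (illustrative sketch).
from collections import Counter

bow = Counter("to be or not to be".split())
print(bow)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```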
SLIDE 45

Naïve Bayes

  • Learning the Multinomial Naïve Bayes model simply uses the frequencies in the training data:

P̂(c) = count(c) / Σ_{cj ∈ C} count(cj)

P̂(t | c) = count(t, c) / Σ_{ti ∈ V} count(ti, c)

cNB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c)

Source: adapted from Dan Jurafsky

SLIDE 46

Naïve Bayes

         Doc   Words                      Class
Training  1    English Wikipedia editor   en
          2    free English Wikipedia     en
          3    Wikipedia editor           en
          4    español de Wikipedia       es
Test      5    Wikipedia español el       ?

P(en) = 3/4,  P(es) = 1/4

P(“Wikipedia” | en) = 3/8,  P(“Wikipedia” | es) = 1/3
P(“español” | en) = 0/8,   P(“español” | es) = 1/3
P(“el” | en) = 0/8,        P(“el” | es) = 0/3

P(en | doc5) = 3/4 × 3/8 × 0/8 × 0/8 = 0
P(es | doc5) = 1/4 × 1/3 × 1/3 × 0/3 = 0

P̂(c) = count(c) / Σ_{cj ∈ C} count(cj)
P̂(t | c) = count(t, c) / Σ_{ti ∈ V} count(ti, c)

cNB = argmax_{c ∈ C} P(c) ∏_{ti ∈ d} P(ti | c)

SLIDE 47

Naïve Bayes

  • What if the word “el” doesn’t occur in the training documents that are labeled as Spanish (es)?

P̂(“el” | es) = count(“el”, es) / Σ_{t ∈ V} count(t, es) = 0

  • To deal with 0 counts, use add-one (Laplace) smoothing:

P̂(t | c) = count(t, c) / Σ_{ti ∈ V} count(ti, c)   →  smoothed:  P̂(t | c) = (count(t, c) + 1) / (Σ_{ti ∈ V} count(ti, c) + |V|)

Source: adapted from Dan Jurafsky

SLIDE 48

Naïve Bayes

         Doc   Words                      Class
Training  1    English Wikipedia editor   en
          2    free English Wikipedia     en
          3    Wikipedia editor           en
          4    español de Wikipedia       es
Test      5    Wikipedia español el       ?

P(en) = 3/4,  P(es) = 1/4

P(“Wikipedia” | en) = (3+1)/(8+6),  P(“Wikipedia” | es) = (1+1)/(3+6)
P(“español” | en) = (0+1)/(8+6),   P(“español” | es) = (1+1)/(3+6)
P(“el” | en) = (0+1)/(8+6),        P(“el” | es) = (0+1)/(3+6)

P(en | doc5) = 3/4 × 4/14 × 1/14 × 1/14 ≈ 0.00109
P(es | doc5) = 1/4 × 2/9 × 2/9 × 1/9 ≈ 0.00137

P̂(c) = count(c) / Σ_{cj ∈ C} count(cj)
P̂(t | c) = (count(t, c) + 1) / (Σ_{ti ∈ V} count(ti, c) + |V|)
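
A short script, written for this worked example, that reproduces the smoothed scores above (the vocabulary has |V| = 6 word types):

```python
# Sketch: Multinomial Naive Bayes with add-one smoothing on the toy
# data above; reproduces the scores 0.00109 (en) and 0.00137 (es).
from collections import Counter

train = [("English Wikipedia editor", "en"),
         ("free English Wikipedia", "en"),
         ("Wikipedia editor", "en"),
         ("español de Wikipedia", "es")]
test_doc = "Wikipedia español el"

docs_per_class = Counter(c for _, c in train)
word_counts = {c: Counter() for c in docs_per_class}
for text, c in train:
    word_counts[c].update(text.split())

vocab = {w for text, _ in train for w in text.split()}  # |V| = 6

for c in docs_per_class:
    score = docs_per_class[c] / len(train)  # the prior P(c)
    for w in test_doc.split():
        # add-one smoothing: (count(w,c) + 1) / (total words in c + |V|)
        score *= (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(vocab))
    print(c, round(score, 5))  # en 0.00109, es 0.00137
```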

SLIDE 49

Naïve Bayes

  • Pros (works well for spam filtering, text classification, sentiment analysis, language identification):
  • simple (no iterative learning)
  • fast and lightweight
  • fewer parameters, so it needs less training data
  • even if the NB assumption doesn’t hold, a NB classifier often still performs surprisingly well in practice
  • Cons:
  • assumes independence of features
  • can’t model dependencies/structures
SLIDE 50

Correlated Features

  • For example, in spam email classification the word “win” often occurs together with “free” and “prize”.
  • Solution:
  • feature selection
  • or other models (e.g. logistic/softmax regression)
SLIDE 51

Model Structure

  • For example, word order matters in part-of-speech tagging:

Naive Bayes: [figure: class node c with feature nodes t1, t2, …, tn]

Hidden Markov Model (HMM): [figure: a sequence of tag states X1 … X5, each emitting a word w1 … w5], e.g.

  words: <s>  I    love  cooking  .
  tags:  <s>  PRP  VBP   NN       .

SLIDE 52

LangID Tool: langid.py

SLIDE 53

LangID Tool: langid.py

  • main techniques:
  • Multinomial Naïve Bayes
  • diverse training data from multiple domains (Wikipedia, Reuters, Debian, etc.)
  • plus feature selection using Information Gain (IG) to choose features that are informative about language, but not informative about domain

Source: Lui and Baldwin, “langid.py: An Off-the-shelf Language Identification Tool”, ACL 2012

SLIDE 54

Entropy & Information Gain

  • Entropy is a measure of disorder in a dataset:

H(X) = − Σ_i P(xi) log P(xi)

[figure: two example distributions, with H(X) = 0 and H(X) = 1]

SLIDE 55

Entropy & Information Gain

  • Entropy is a measure of disorder in a dataset
  • Information Gain is a measure of the decrease in disorder achieved by partitioning the original dataset:

H(X) = − Σ_i P(xi) log P(xi)

IG(Y | X) = H(Y) − H(Y | X)

[figure: two example distributions, with H(X) = 0 and H(X) = 1]
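
A small sketch of both quantities in Python (the toy label lists are invented for illustration):

```python
# Sketch: entropy H(X) and information gain IG(Y|X) for toy label lists.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(ys, xs):
    # IG(Y|X) = H(Y) - H(Y|X); H(Y|X) is a weighted average over X's values
    n = len(ys)
    h_y_given_x = 0.0
    for x in set(xs):
        subset = [y for y, xv in zip(ys, xs) if xv == x]
        h_y_given_x += len(subset) / n * entropy(subset)
    return entropy(ys) - h_y_given_x

print(entropy(["en", "en", "es", "es"]))                   # 1.0 (50/50 split)
print(information_gain(["en", "en", "es", "es"],
                       ["wiki", "wiki", "news", "news"]))  # 1.0 (perfect split)
```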

SLIDE 56

Information Gain

H(X) = − Σ_i P(xi) log P(xi)

IG(Y | X) = H(Y) − H(Y | X)

Source: Andrew Moore

SLIDE 57

Information Gain

Source: Andrew Moore

SLIDE 58

Information Gain used for?

  • choose features that are informative (most useful) for discriminating between the classes, e.g. predicting longevity from attributes like wealth:

IG(LongLife | HairColor) = 0.01
IG(LongLife | Smoker) = 0.2
IG(LongLife | Gender) = 0.25
IG(LongLife | LastDigitOfSSN) = 0.00001

SLIDE 59

LangID Tool: langid.py

  • feature selection using Information Gain (IG)

[figure: features that correlate with domain vs. domain-independent features]

Source: Lui and Baldwin, “langid.py: An Off-the-shelf Language Identification Tool”, ACL 2012

SLIDE 60

LangID Tool: langid.py

  • main advantages:
  • cross-domain (works on all kinds of texts)
  • works for Twitter (accuracy = 0.89)
  • fast (300 tweets/second, 24G RAM)
  • currently supports 97 languages
  • retrainable

Source: Lui and Baldwin, “langid.py: An Off-the-shelf Language Identification Tool”, ACL 2012

SLIDE 61

Summary

Language Identification → Tokenization → Part-of-Speech (POS) Tagging → Shallow Parsing (Chunking) → Named Entity Recognition (NER)

Stemming · Normalization

classification (Naïve Bayes)

SLIDE 62

Sign up for in-class presentation (by next week)

socialmedia-class.org