Social Media & Text Analysis
lecture 3 - Language Identification (supervised learning and Naive Bayes algorithm)
CSE 5539-0010 Ohio State University Instructor: Alan Ritter Website: socialmedia-class.org
Alan Ritter ◦ socialmedia-class.org
Source: Dan Jurafsky

mostly solved
Spam detection: Let’s go to Agra! ✓ / Buy V1AGRA … ✗
Part-of-speech (POS) tagging: Colorless green ideas sleep furiously. (ADJ ADJ NOUN VERB ADV)
Named entity recognition (NER): Einstein met with UN officials in Princeton (PERSON, ORG, LOC)

making good progress
Sentiment analysis: Best roast chicken in San Francisco! / The waiter ignored us for 20 minutes.
Coreference resolution: Carter told Mubarak he shouldn’t run again.
Word sense disambiguation (WSD): I need new batteries for my mouse.
Parsing: I can see Alcatraz from the window!
Machine translation (MT): The 13th Shanghai International Film Festival… / 13…
Information extraction (IE): You’re invited to our dinner party, Friday May 27 at 8:30 (Party, May 27, add)

still really hard
Question answering (QA): Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?
Paraphrase: XYZ acquired ABC yesterday / ABC has been taken over by XYZ
Summarization: The Dow Jones is up / The S&P500 jumped / Housing prices rose / Economy is good
Dialog: Where is Citizen Kane playing in SF? / Castro Theatre at 7:30. Do you want a ticket?
and may not work well for other domains (out-of-domain).
News Blogs Wikipedia Forums Comments Twitter …
Source: Baldwin et al. "How Noisy Social Media Text, How Diffrnt Social Media Sources?" IJCNLP 2013
Twitter ≡ Comments < Forums < Blogs < BNC < Wikipedia
many techniques/algorithms are useful elsewhere (we will see examples of both in the class)
NLP tools have been made for)
Language Identification
Tokenization
Part-of-Speech (POS) Tagging
Shallow Parsing (Chunking)
Named Entity Recognition (NER)
Stemming
Normalization
known as the “Chinese Twitter” 120 Million Posts / Day
language identification:
different languages
Classification Method:
(d1, c1), … , (dm, cm)
Source: NLTK Book
Bayes’ theorem with strong (naive) independence assumptions between the features.
For a document d and a class c, choose the maximum a posteriori (MAP) class:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(c \mid d)$$

Bayes Rule:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)}$$

drop the denominator:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(d \mid c)\, P(c)$$

The document d is represented by its tokens t_1, …, t_n:

$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(t_1, t_2, \ldots, t_n \mid c)\, P(c)$$

Source: adapted from Dan Jurafsky
$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(t_1, t_2, \ldots, t_n \mid c)\, P(c)$$

prior P(c): how often does this class occur? A simple count.
likelihood P(t_1, …, t_n | c): O(|T|^n · |C|) parameters, where n = number of unique n-gram tokens; need to make a simplifying assumption.
Simplifying (naive) assumption: the features P(t_i | c) are independent given the class c.
Source: adapted from Dan Jurafsky
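$$c_{MAP} = \operatorname*{argmax}_{c \in C} P(t_1, t_2, \ldots, t_n \mid c)\, P(c)$$

$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{t_i \in d} P(t_i \mid c)$$

Source: adapted from Dan Jurafsky
Probabilistic Graphical Model
[Figure: Naive Bayes as a probabilistic graphical model]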
Unigrams: “the” “following” “Wikipedia” “en” “español” …
Bigrams: “the following” “following is” “Wikipedia en” “en español” …
….
English: The following is a list of words that occur in both Modern English and Modern Spanish, but which are pronounced differently and may have different meanings in each language. …
Spanish: Wikipedia en español es la edición en idioma español de Wikipedia. Actualmente cuenta con 1 185 590 páginas válidas de contenido y en esta estadística entre …
Alan Ritter ◦ socialmedia-class.org
and their value is the number of occurrences
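Count-valued n-gram features of this kind can be sketched in a few lines of Python. This is only an illustration: it assumes word-level n-grams split on whitespace, and the helper names `word_ngrams` and `ngram_features` are hypothetical, not taken from the lecture or from any library.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of text as space-joined strings (whitespace tokenization assumed)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Map every 1-gram up to max_n-gram to its number of occurrences in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(word_ngrams(text, n))
    return counts


features = ngram_features(
    "Wikipedia en español es la edición en idioma español de Wikipedia"
)
print(features["español"])     # count of the unigram
print(features["en español"])  # count of the bigram
```

A `Counter` returns 0 for n-grams that never occurred, which matches the idea that a feature's value is its number of occurrences.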
uses the frequencies in the training data:
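$$\hat{P}(c) = \frac{count(c)}{\sum_{c_j \in C} count(c_j)}$$

$$\hat{P}(t \mid c) = \frac{count(t, c)}{\sum_{t_i \in V} count(t_i, c)}$$

$$c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{t_i \in d} P(t_i \mid c)$$

Source: adapted from Dan Jurafsky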
          Doc   Words                       Class
Training  1     English Wikipedia editor    en
          2     free English Wikipedia      en
          3     Wikipedia editor            en
          4     español de Wikipedia        es
Test      5     Wikipedia español el        ?

P(en) = 3/4, P(es) = 1/4
P(“Wikipedia” | en) = 3/8, P(“Wikipedia” | es) = 1/3
P(“español” | en) = 0/8, P(“español” | es) = 1/3
P(“el” | en) = 0/8, P(“el” | es) = 0/3
P(en | doc5) ∝ 3/4 × 3/8 × 0/8 × 0/8 = 0
P(es | doc5) ∝ 1/4 × 1/3 × 1/3 × 0/3 = 0
Both scores are 0: a single zero count wipes out the whole product.
documents that are labeled as Spanish (es)?

$$\hat{P}(\text{“el”} \mid es) = \frac{count(\text{“el”}, es)}{\sum_{t \in V} count(t, es)} = 0$$

smoothing:

$$\hat{P}(t \mid c) = \frac{count(t, c)}{\sum_{t_i \in V} count(t_i, c)} \;\xrightarrow{\text{smooth}}\; \hat{P}(t \mid c) = \frac{count(t, c) + 1}{\sum_{t_i \in V} count(t_i, c) + |V|}$$

Source: adapted from Dan Jurafsky
          Doc   Words                       Class
Training  1     English Wikipedia editor    en
          2     free English Wikipedia      en
          3     Wikipedia editor            en
          4     español de Wikipedia        es
Test      5     Wikipedia español el        ?

P(en) = 3/4, P(es) = 1/4
With add-one smoothing and |V| = 6:
P(“Wikipedia” | en) = (3+1)/(8+6), P(“Wikipedia” | es) = (1+1)/(3+6)
P(“español” | en) = (0+1)/(8+6), P(“español” | es) = (1+1)/(3+6)
P(“el” | en) = (0+1)/(8+6), P(“el” | es) = (0+1)/(3+6)
P(en | doc5) ∝ 3/4 × 4/14 × 1/14 × 1/14 ≈ 0.00109
P(es | doc5) ∝ 1/4 × 2/9 × 2/9 × 1/9 ≈ 0.00137
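The worked example can be checked with a small script. This is a minimal sketch, not the lecture's implementation: it assumes whitespace tokenization, writes the Spanish class as `es`, and uses hypothetical helper names (`train`, `score`). The `alpha` parameter is the count added to every token, so `alpha=1` is add-one smoothing and `alpha=0` falls back to the raw frequencies.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Toy training set from the slide: (document text, class label).
TRAIN = [
    ("English Wikipedia editor", "en"),
    ("free English Wikipedia", "en"),
    ("Wikipedia editor", "en"),
    ("español de Wikipedia", "es"),
]


def train(docs):
    """Collect class counts, per-class token counts, and the training vocabulary."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    for text, label in docs:
        class_counts[label] += 1
        token_counts[label].update(text.split())
    vocab = {t for counts in token_counts.values() for t in counts}
    return class_counts, token_counts, vocab


def score(text, label, model, alpha=1):
    """Unnormalized P(label) * prod_i P(t_i | label), with add-alpha smoothing."""
    class_counts, token_counts, vocab = model
    p = Fraction(class_counts[label], sum(class_counts.values()))
    total_tokens = sum(token_counts[label].values())
    for t in text.split():
        p *= Fraction(token_counts[label][t] + alpha,
                      total_tokens + alpha * len(vocab))
    return p


model = train(TRAIN)
scores = {c: score("Wikipedia español el", c, model) for c in ("en", "es")}
best = max(scores, key=scores.get)
print({c: round(float(s), 5) for c, s in scores.items()}, best)
```

Exact fractions are used so the products can be compared directly with the hand calculation. A practical classifier would usually sum log-probabilities instead, to avoid underflow on long documents.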
sentiment analysis, language identification)
“win” often occurs together with “free”, “prize”.
speech tagging:
[Figure: Naive Bayes, class c with tokens t1, t2, …, tn]
[Figure: Hidden Markov Model (HMM), states X1 … X5 with words w1 … w5]
<s> I love cooking .
<s> PRP VBP NN .
sequence
(Wikipedia, Reuters, Debian, etc.)
to choose features that are informative about language, but not informative about domain
Source: Lui and Baldwin “langid.py: An Off-the-shelf Language Identification Tool” ACL 2012
H(X) = 0 H(X) = 1
decrease in disorder achieved by partitioning the original data set.
Source: Andrew Moore
$$H(X) = -\sum_i P(x_i) \log P(x_i)$$

$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
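Both quantities can be estimated from observed data in a few lines. The sketch below makes two assumptions: logarithms are base 2 (so a fair two-way split has entropy 1), and the labels and features in the usage lines are made-up toy values, not data from the lecture. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(outcomes):
    """H(X) = -sum_i P(x_i) log2 P(x_i), with P estimated from a list of outcomes."""
    total = len(outcomes)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(outcomes).values())


def information_gain(ys, xs):
    """IG(Y | X) = H(Y) - H(Y | X) for paired observations (x, y)."""
    total = len(ys)
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    # H(Y | X) is the entropy of Y within each value of X, weighted by P(X = x).
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(ys) - conditional


# Hypothetical toy data: language labels and two candidate features.
langs = ["en", "en", "es", "es"]
contains_el = [False, False, True, True]  # perfectly separates the classes
doc_parity = [0, 1, 0, 1]                 # unrelated to the class

print(entropy(langs))
print(information_gain(langs, contains_el))
print(information_gain(langs, doc_parity))
```

A feature that splits the classes cleanly recovers all of H(Y), while an unrelated feature gains nothing, which is why information gain is useful for ranking candidate features.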
for discriminating between the classes.
Wealth Longevity
IG(LongLife | HairColor) = 0.01
IG(LongLife | Smoker) = 0.2
IG(LongLife | Gender) = 0.25
IG(LongLife | LastDigitOfSSN) = 0.00001
correlate domain independent
Language Identification
Tokenization
Part-of-Speech (POS) Tagging
Shallow Parsing (Chunking)
Named Entity Recognition (NER)
Stemming
Normalization
classification (Naïve Bayes)