Latent Dirichlet allocation
INTRODUCTION TO TEXT ANALYSIS IN R
Marc Dotson, Assistant Professor of Marketing

Unsupervised learning
Some more natural language processing (NLP) vocabulary:
- Latent Dirichlet allocation (LDA) is a standard topic model
- A collection of documents is known as a corpus
- Bag-of-words is treating every word in a document separately
- Topic models find patterns of words appearing together
- Searching for patterns rather than predicting is known as unsupervised learning
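As a rough illustration of the bag-of-words idea, the base R sketch below counts each word in a document separately and ignores word order. The two-document corpus and the names `corpus`, `tokens`, and `bag_of_words` are made up for this example and are not part of the course data.

```r
# Hypothetical two-document corpus (not the course's review data)
corpus <- c(
  doc_1 = "The vacuum cleans pet hair and the floor",
  doc_2 = "Pet hair is everywhere"
)

# Lower-case each document and split it on anything that is not a letter
tokens <- lapply(corpus, function(doc) {
  strsplit(tolower(doc), "[^a-z]+")[[1]]
})

# Bag-of-words: one table of word counts per document, word order discarded
bag_of_words <- lapply(tokens, table)

bag_of_words$doc_1
bag_of_words$doc_2
```

A topic model then works from counts like these rather than from the original sentences.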

Word probabilities

Clustering vs. topic modeling
Clustering
- Clusters are uncovered based on distance, which is continuous.
- Every object is assigned to a single cluster.
Topic modeling
- Topics are uncovered based on word frequency, which is discrete.
- Every document is a mixture (i.e., partial member) of every topic.
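The contrast between a single cluster label and a mixture over topics can be sketched in base R. The points and the document-topic proportions below are invented for illustration; in a fitted LDA model, comparable per-document proportions are what `tidy(matrix = "gamma")` from tidytext returns.

```r
# Clustering: every object receives exactly one cluster label
set.seed(42)
points <- rbind(c(0, 0), c(0.2, 0.1), c(5, 5), c(5.1, 4.9))
clusters <- kmeans(points, centers = 2, nstart = 10)$cluster
clusters

# Topic modeling: every document is a partial member of every topic.
# Hypothetical proportions; each row sums to 1.
doc_topics <- rbind(
  doc_1 = c(topic_1 = 0.90, topic_2 = 0.10),
  doc_2 = c(topic_1 = 0.35, topic_2 = 0.65)
)
doc_topics
rowSums(doc_topics)
```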

Let's practice!

Document term matrices

Matrices and sparsity

    sparse_review

        Terms
    Docs admit ago albeit amazing angle awesome
       4     1   0      1       0     0       0
       5     0   1      0       1     1       0
       3     0   0      0       0     0       1
       2     0   0      0       0     0       0
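To make "sparsity" concrete, the sketch below rebuilds the small matrix from this slide in base R and computes the share of zero entries. The variable names `n_nonzero`, `n_zero`, and `sparsity` are chosen for this example only.

```r
# The 4 x 6 excerpt shown on the slide
sparse_review <- matrix(
  c(1, 0, 1, 0, 0, 0,
    0, 1, 0, 1, 1, 0,
    0, 0, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Docs  = c("4", "5", "3", "2"),
    Terms = c("admit", "ago", "albeit", "amazing", "angle", "awesome")
  )
)

n_nonzero <- sum(sparse_review != 0)  # cells where a word actually occurs
n_zero    <- sum(sparse_review == 0)  # cells where it does not

# Sparsity is the fraction of cells that are zero
sparsity <- n_zero / length(sparse_review)
sparsity
```

Here 18 of the 24 cells are zero, so this excerpt is 75% sparse.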

Using cast_dtm()

    tidy_review %>%
      count(word, id) %>%
      cast_dtm(id, word, n)

    <<DocumentTermMatrix (documents: 1791, terms: 9669)>>
    Non-/sparse entries: 62766/17252622
    Sparsity           : 100%
    Maximal term length: NA
    Weighting          : term frequency (tf)
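Since `tidy_review` is not defined on this slide, here is a minimal self-contained sketch of the same pipeline on a made-up tidy table. It assumes the dplyr, tidytext, and tm packages are installed; `toy_tidy` and `toy_dtm` are hypothetical names. The last lines recompute the sparsity from the entry counts printed above, which suggests the displayed 100% is a rounded value of roughly 99.6%.

```r
library(dplyr)
library(tidytext)

# Hypothetical tidy data: one row per word occurrence per document
toy_tidy <- tibble(
  id   = c(1, 1, 1, 2, 2, 3),
  word = c("clean", "floor", "clean", "hair", "pet", "clean")
)

# Count words per document, then cast to a document-term matrix
toy_dtm <- toy_tidy %>%
  count(word, id) %>%
  cast_dtm(id, word, n)

toy_dtm
dim(toy_dtm)        # documents x terms
as.matrix(toy_dtm)  # dense view, fine for a matrix this small

# Sparsity implied by the printed non-sparse/sparse entry counts
printed_sparsity <- 17252622 / (62766 + 17252622)
printed_sparsity
```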

Using as.matrix()

    dtm_review <- tidy_review %>%
      count(word, id) %>%
      cast_dtm(id, word, n) %>%
      as.matrix()

    dtm_review[1:4, 2000:2004]

          Terms
    Docs   consecutive consensus consequences considerable considerably
      223            0         0            0            0            0
      615            0         0            0            0            0
      1069           0         0            0            0            0
      425            0         0            0            0            0

Running topic models

Using LDA()

    library(topicmodels)

    lda_out <- LDA(
      dtm_review,
      k = 2,
      method = "Gibbs",
      control = list(seed = 42)
    )

LDA() output

    lda_out

    A LDA_Gibbs topic model with 2 topics.

Using glimpse()

    glimpse(lda_out)

    Formal class 'LDA_Gibbs' [package "topicmodels"] with 16 slots
      ..@ seedwords : NULL
      ..@ z         : int [1:75670] 1 2 2 1 1 2 1 1 2 2 ...
      ..@ alpha     : num 25
      ..@ call      : language LDA(x = dtm_review, k = 2, method = "Gibbs", ...
      ..@ Dim       : int [1:2] 1791 9668
      ..@ control   :Formal class 'LDA_Gibbscontrol' [package "topicmodels"] ...
      ..@ beta      : num [1:2, 1:17964] -8.81 -10.14 -9.09 -8.43 -12.53 ...
      ...

Using tidy()

    lda_topics <- lda_out %>%
      tidy(matrix = "beta")

    lda_topics %>%
      arrange(desc(beta))

    # A tibble: 19,336 x 3
      topic term       beta
      <int> <chr>     <dbl>
    1     1 hair     0.0241
    2     2 clean    0.0231
    3     2 cleaning 0.0201
    # … with 19,333 more rows
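The `beta` slot shown by `glimpse()` holds negative numbers, while `tidy(matrix = "beta")` returns values such as 0.0241. The slot is stored on the log scale and `tidy()` reports probabilities by default. The base R sketch below shows the relationship on a made-up 2-topic, 3-term matrix; `probs` and `log_beta` are hypothetical names and values.

```r
# Hypothetical word probabilities: one row per topic, one column per term
probs <- rbind(
  topic_1 = c(hair = 0.5, clean = 0.3, floor = 0.2),
  topic_2 = c(hair = 0.1, clean = 0.1, floor = 0.8)
)

# What a log-scale beta matrix would look like for these probabilities
log_beta <- log(probs)
log_beta

# Exponentiating recovers the probabilities; each topic's row sums to 1
exp(log_beta)
rowSums(exp(log_beta))
```

Read this way, a beta of 0.0241 for "hair" in topic 1 means that about 2.4% of the words generated by that topic are expected to be "hair".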

Let's practice!

Interpreting topics

Two topics

    lda_topics <- LDA(
      dtm_review,
      k = 2,
      method = "Gibbs",
      control = list(seed = 42)
    ) %>%
      tidy(matrix = "beta")

    word_probs <- lda_topics %>%
      group_by(topic) %>%
      top_n(15, beta) %>%
      ungroup() %>%
      mutate(term2 = fct_reorder(term, beta))

Two topics

    ggplot(
      word_probs,
      aes(term2, beta, fill = as.factor(topic))
    ) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      coord_flip()

Three topics

    lda_topics2 <- LDA(
      dtm_review,
      k = 3,
      method = "Gibbs",
      control = list(seed = 42)
    ) %>%
      tidy(matrix = "beta")

    word_probs2 <- lda_topics2 %>%
      group_by(topic) %>%
      top_n(15, beta) %>%
      ungroup() %>%
      mutate(term2 = fct_reorder(term, beta))

Three topics

    ggplot(
      word_probs2,
      aes(term2, beta, fill = as.factor(topic))
    ) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~ topic, scales = "free") +
      coord_flip()

Four topics

The art of model selection
- Adding topics that are different is good
- If we start repeating topics, we've gone too far
- Name the topics based on the combination of high-probability words
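One way to apply these guidelines is to fit the model for several values of `k` and read the top terms side by side, looking for the point where topics start to repeat. The sketch below is only an illustration: it uses the `AssociatedPress` data that ships with topicmodels instead of the course's review data, and the subset size, the `iter = 200` setting, and the names `dtm_small` and `fits` are arbitrary choices for a quick run.

```r
library(topicmodels)

# Example document-term matrix bundled with topicmodels (not the review data)
data("AssociatedPress", package = "topicmodels")
dtm_small <- AssociatedPress[1:100, ]

# Fit one model per candidate number of topics
fits <- lapply(2:4, function(k) {
  LDA(
    dtm_small,
    k = k,
    method = "Gibbs",
    control = list(seed = 42, iter = 200)
  )
})

# Print the five highest-probability terms of every topic for each k
for (fit in fits) {
  cat("k =", fit@k, "\n")
  print(terms(fit, 5))
}
```

If a column of top terms at a larger `k` looks like a near copy of another column, that is a sign the extra topic is not adding anything new.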

Wrap-up

Summary
- Tokenizing text and removing stop words
- Visualizing word counts
- Conducting sentiment analysis
- Running and interpreting topic models

Next steps
Other DataCamp courses:
- Sentiment Analysis in R: The Tidy Way
- Topic Modeling in R
Additional resource:
- Text Mining with R

All the best!
