DataCamp Topic Modeling in R
Finding the best number of topics
TOPIC MODELING IN R
Pavel Oleinikov
Associate Director, Quantitative Analysis Center, Wesleyan University
Approaches
Topic coherence - examine the ...
newdata: the document-term matrix on which perplexity is evaluated
perplexity(object=mod, newdata=dtm)
186.7139
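For intuition about the number that perplexity() reports, the sketch below computes a perplexity by hand in base R. It assumes the usual definition, the exponential of the negative log-likelihood per token. The toy counts and both "models" are illustrative assumptions, not part of the slides.

```r
# Toy "document": token counts for a 4-word vocabulary (illustrative values).
counts <- c(apple = 3, bank = 1, loan = 2, menu = 2)

# Hypothetical model that gives every word the same probability.
p_uniform <- rep(1 / length(counts), length(counts))

# Perplexity = exp(-log-likelihood / number of tokens).
perplexity_by_hand <- function(counts, p) {
  log_lik <- sum(counts * log(p))
  exp(-log_lik / sum(counts))
}

perplexity_uniform <- perplexity_by_hand(counts, p_uniform)

# A model whose probabilities match the observed frequencies scores lower.
p_fitted <- counts / sum(counts)
perplexity_fitted <- perplexity_by_hand(counts, p_fitted)
```

A uniform model over V words has perplexity V, so lower values mean the model is less "surprised" by the text.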
mod_log_lik = numeric(10)
mod_perplexity = numeric(10)
for (i in 2:10) {
  mod = LDA(dtm, k=i, method="Gibbs",
            control=list(alpha=0.5, iter=1000, seed=12345, thin=1))
  mod_log_lik[i] = logLik(mod)
  mod_perplexity[i] = perplexity(mod, dtm)
}
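One detail of the loop above: numeric(10) fills the vectors with zeros and the loop starts at 2, so position 1 stays at 0 and could be misread as the best perplexity. A minimal sketch of one way around this uses NA placeholders. The perplexity values here are invented purely to illustrate the bookkeeping, and the names ks, best_k and gain are hypothetical.

```r
# Pre-allocate with NA instead of 0 so unfitted values of k are not mistaken for results.
ks <- 2:10
mod_perplexity <- rep(NA_real_, max(ks))

# Stand-in values for illustration only; in practice these come from perplexity(mod, dtm).
mod_perplexity[ks] <- c(310, 262, 231, 214, 205, 199, 196, 194, 193)

# Smallest perplexity among the fitted models (which.min skips NA entries).
best_k <- which.min(mod_perplexity)

# Relative improvement from each extra topic, to look for where the curve flattens.
gain <- -diff(mod_perplexity[ks]) / head(mod_perplexity[ks], -1)
names(gain) <- ks[-1]
```

Perplexity computed on the same data used for fitting tends to keep falling as k grows, so the point where gain levels off is often more informative than the raw minimum.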
# Initial run
mod = LDA(x=dtm, method="Gibbs", k=4,
          control=list(alpha=0.5, seed=12345, iter=1000, keep=1))

# Resumed run
mod2 = LDA(x=dtm, model=mod,
           control=list(thin=1, seed=10000, iter=200))
The study of disease using mathematical models has a long and rich history. Much interesting and new mathematics has been motivated by disease, because the problems are inherently nonlinear and multidimensional.
corpus %>%
  unnest_tokens(input=text, output=word) %>%
  count(chapter, word)

7 %/% 3
2
25784 %/% 1000
25
corpus %>%
  unnest_tokens(input=text, output=word) %>%
  mutate(word_index = 1:n()) %>%
  mutate(doc_number = word_index %/% 1000 + 1) %>%
  count(doc_number, word) %>%
  cast_dtm(term=word, document=doc_number, value=n)
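To see what the integer-division step does to chunk boundaries, here is a small base R sketch with a chunk size of 4 instead of 1000. The word vector is a stand-in for the tokenised corpus, and doc_number_even is a variant added for comparison, not something taken from the slides.

```r
# Stand-in for the tokenised corpus: 10 words (illustrative).
words <- c("the", "study", "of", "disease", "using",
           "mathematical", "models", "has", "a", "long")
chunk_size <- 4   # the slide uses 1000

word_index <- seq_along(words)

# Formula from the slide: the first chunk ends up one word short.
doc_number <- word_index %/% chunk_size + 1

# Variant (an assumption, not from the slide) that gives chunks of equal size.
doc_number_even <- (word_index - 1) %/% chunk_size + 1

table(doc_number)
table(doc_number_even)
```

With 1000-word chunks the one-word difference in the first document is unlikely to matter, but the shifted form may be preferable when chunks are short.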
control=list(seed=12345)
seedwords requires a matrix with k rows (one per topic) and N columns (one per term in the vocabulary).
k is 2
seedwords = matrix(nrow=2, ncol=34, data=0)
colnames(seedwords) = colnames(dtm)
seedwords[1, "restaurant"] = 1
seedwords[2, "loans"] = 1
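The hard-coded ncol=34 above ties the matrix to one particular dtm. A hedged sketch of the same construction sized from the vocabulary itself follows. The vocab vector stands in for colnames(dtm), and the seeds list is a hypothetical convenience, not part of the slides or of any package.

```r
# Hypothetical vocabulary standing in for colnames(dtm).
vocab <- c("bank", "loans", "menu", "restaurant", "rates")
k <- 2

# One row per topic, one column per term, sized from the vocabulary itself.
seedwords <- matrix(0, nrow = k, ncol = length(vocab),
                    dimnames = list(NULL, vocab))

# Hypothetical helper structure: seed terms for each topic get a weight of 1.
seeds <- list(c("restaurant", "menu"), c("loans", "bank"))
for (topic in seq_along(seeds)) {
  stopifnot(all(seeds[[topic]] %in% vocab))  # fail early on misspelled terms
  seedwords[topic, seeds[[topic]]] <- 1
}
```

Checking that every seed term exists in the vocabulary avoids a "subscript out of bounds" error when a term was dropped during preprocessing.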
lda_mod = LDA(x=dtm, k=2, method="Gibbs",
              control=list(alpha=1, seed=1234))
tidy(lda_mod, "beta") %>%
  spread(key=topic, value=beta) %>%
  filter(term %in% c("restaurant", "loans"))

  term          `1`     `2`
1 loans      0.0767 0.00379
2 restaurant 0.0272 0.0795

lda_mod = LDA(x=dtm, k=2, method="Gibbs",
              seedwords=seedwords,
              control=list(alpha=1, seed=1234))
tidy(lda_mod, "beta") %>%
  spread(key=topic, value=beta) %>%
  filter(term %in% c("restaurant", "loans"))

  term           `1`     `2`
1 loans      0.00379 0.0967
2 restaurant 0.155   0.00236
Word2Vec models:
word2vec models use very large corpora (e.g., 2 billion words)