DataCamp Topic Modeling in R
Using topic models as classifiers
TOPIC MODELING IN R
Pavel Oleinikov
Associate Director, Quantitative Analysis Center, Wesleyan University

Topic models as soft classifiers
  document   `1`    `2`
  <chr>    <dbl>  <dbl>
1 d_1      0.154 0.846
2 d_2      0.278 0.722
3 d_3      0.875 0.125
4 d_4      0.923 0.0769
5 d_5      0.5   0.5
k 50
  document   `1`   `2`
  <chr>    <dbl> <dbl>
1 d_1      0.475 0.525
2 d_2      0.530 0.470
3 d_3      0.482 0.518
4 d_4      0.508 0.492
5 d_5      0.5   0.5
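To turn soft assignments like these into a hard classification, one option is to pick the most probable topic for each document. The sketch below is an illustration in base R and not part of the original slides: it retypes the first table as a matrix, and the names `gamma`, `hard_label` and `confidence` are made up for the example.

```r
# Document-topic probabilities, retyped from the first table above
gamma <- matrix(
  c(0.154, 0.846,
    0.278, 0.722,
    0.875, 0.125,
    0.923, 0.0769,
    0.5,   0.5),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("d_", 1:5), c("1", "2"))
)

# Hard label: index of the most probable topic in each row.
# which.max() returns the first maximum, so the d_5 tie goes to topic 1.
hard_label <- apply(gamma, 1, which.max)

# The winning probability shows how decisive each assignment is
confidence <- apply(gamma, 1, max)

print(data.frame(hard_label, confidence))
```

Keeping the winning probability next to the label makes it easy to flag documents such as d_5, where the model has no real preference between the topics.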
      [,1]  [,2]  [,3]
[1,] 0.604 0.100 0.295
[2,] 0.133 0.609 0.259
[3,] 0.514 0.221 0.265
[4,] 0.113 0.112 0.775
[5,] 0.258 0.502 0.240
treasure_L1 which_L2 had_R1 bequeathed_R2 two_L1 years_L2 was_R1
docs <- df %>%
  group_by(entity) %>%
  summarise(doc_id = first(entity),
            text = paste(text, collapse=" "))
[A-Z][a-z]+ - one uppercase letter followed by one or more lowercase letters
pattern <- "[A-Z][a-z]+"
# gregexpr() takes the pattern first, then the text to search
m <- gregexpr(pattern, text)
entities <- unlist(regmatches(text, m))
(St[.] ) is a group. The ? quantifier means the group is optional
p <- "(St[.] )?[A-Z][a-z]+"
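A small base R comparison may make the effect of the optional group easier to see. The sample sentence and the variable names `plain` and `with_prefix` are invented for this illustration; only the two patterns come from the slides.

```r
# Hypothetical sample sentence, not taken from the course data
text <- "They met St. Peter and Darius near Rome"

# Without the optional group, "St" and "Peter" are matched separately
plain <- unlist(regmatches(text, gregexpr("[A-Z][a-z]+", text)))

# With the optional group, "St. Peter" is captured as a single entity
p <- "(St[.] )?[A-Z][a-z]+"
with_prefix <- unlist(regmatches(text, gregexpr(p, text)))

print(plain)
print(with_prefix)
```

Both patterns also pick up the sentence-initial "They", so capitalized words at the start of a sentence would need separate filtering.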
t <- "the great Darius threw across"
gsub("^([a-z]+) ([a-z]+)", "\\1_L1 \\2_L2", t)

"the_L1 great_L2 Darius threw across"
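The same backreference idea can be extended to tag the right-hand context and pull out the entity itself. This is a minimal sketch under the assumption that exactly two lowercase words surround the entity; the single combined pattern and the names `tagged` and `entity` are illustrative and not the course's own code.

```r
t <- "the great Darius threw across"

# Five groups: two left words, the capitalized entity, two right words
pattern <- "([a-z]+) ([a-z]+) ([A-Z][a-z]+) ([a-z]+) ([a-z]+)"

# Keep only the context words, suffixed by side and position
tagged <- gsub(pattern, "\\1_L1 \\2_L2 \\4_R1 \\5_R2", t)

# Group 3 is the entity that the context words describe
entity <- gsub(pattern, "\\3", t)

print(tagged)
print(entity)
```

Dropping the entity from the tagged text means the resulting document consists only of context words, which matches the `_L1`, `_L2`, `_R1`, `_R2` tokens shown earlier.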
topics = tidy(mod, matrix="gamma") %>%
  spread(topic, gamma)
topics %>%
  filter(document %in% c(" Alboin", " Alexander", " Asia Minor",
                         " Amorium", " Cappadocian"))

  document          `1`   `2`   `3`
  <chr>           <dbl> <dbl> <dbl>
1 " Alboin"      0.143  0.143 0.714
2 " Alexander"   0.143  0.143 0.714
3 " Amorium"     0.364  0.364 0.273
4 " Asia Minor"  0.0213 0.723 0.255
5 " Cappadocian" 0.571  0.143 0.286
new_data must be aligned with the vocabulary used in the model
model = LDA(...)
result = posterior(model, new_data)
result$topics
model_vocab <- tidy(mod, matrix="beta") %>%
  select(term) %>%
  distinct()
new_table <- new_doc %>%
  unnest_tokens(input=text, output=word) %>%
  count(doc_id, word) %>%
  right_join(model_vocab, by=c("word"="term"))
  doc_id word           n
  <chr>  <chr>      <int>
1 NA     emerged_r1    NA
2 NA     from_r2       NA
3 NA     horde_l1      NA

new_dtm <- new_table %>%
  arrange(desc(doc_id)) %>%
  mutate(doc_id = ifelse(is.na(doc_id), first(doc_id), doc_id),
         n = ifelse(is.na(n), 0, n)) %>%
  cast_dtm(document=doc_id, term=word, value=n)
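The alignment step can also be illustrated without any packages. The sketch below is a base R stand-in for the join-and-fill approach above, not a replacement for it: the vocabulary and the new document's tokens are invented, and it builds a plain count matrix rather than a DocumentTermMatrix.

```r
# Hypothetical model vocabulary and tokens from one new document
model_vocab <- c("emerged_r1", "from_r2", "horde_l1", "the_l2")
new_words   <- c("the_l2", "the_l2", "unseen_r1", "from_r2")

# Fixing the factor levels to the model vocabulary drops unseen words
# (they become NA) and gives a zero count to every missing term
counts <- table(factor(new_words, levels = model_vocab))

# One row per document, one column per vocabulary term, in model order
new_row <- matrix(as.integer(counts), nrow = 1,
                  dimnames = list("new_doc", model_vocab))

print(new_row)
```

The result has exactly the model's columns: words the model never saw are discarded, and model terms absent from the new document appear with a count of zero, which is what the `NA` rows and the `ifelse(is.na(n), 0, n)` step achieve in the tidy version.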