
Literary Data: Some Approaches (Andrew Goldstone)



  1. Literary Data: Some Approaches
     Andrew Goldstone
     http://www.rci.rutgers.edu/~ag978/litdata
     April 30, 2015
     Named entity recognition

  2. one tool does not do it all
     ▶ the NLP and openNLP packages let you interface with the Apache OpenNLP Java software (and theoretically Stanford CoreNLP too). These packages are not good.
     ▶ much useful software lacks R glue
     ▶ but some of it can be used from the command line…and thus via R

  3. This is a Unix system! I know this!
     ▶ think of a program as a function:

         program <- function (input, arg1, arg2, ...) {
             ...  # make calculations
             ...  # possible side effects
             output
         }

     ▶ in the shell, this looks like

         program arg1 arg2 ...

  4. example

         echo "Hello Raptor"

  5. where did input and output go?

     standard input: the input source; might represent the user typing
     standard output: the output target; might represent the console
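A minimal shell illustration of the two streams, using wc -w (word count): printf writes to its standard output, the pipe turns that into wc's standard input, and wc writes its count to its own standard output (the console here). The sample text is made up.

```shell
# printf's stdout becomes wc's stdin via the pipe;
# wc -w counts words and writes the count to its stdout
printf 'one two three\n' | wc -w
# prints 3 (some systems pad the count with leading spaces)
```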

  6. redirection
     ▶ can replace stdin and stdout with files (no more typing)
     ▶ can connect stdout of one program to stdin of the next one
     ▶ shell analogue of dplyr pipelines:

         wc < sheik.txt > sheik_wc.txt
         echo "Hello Raptor" | wc
         txl %>% filter(Genre == "poetry") %>% arrange(Price)

  7. openNLP command line

         opennlp SomeTool arg ...

     ▶ takes text input on stdin, sends results to stdout
     ▶ arg is the name of an auxiliary data file SomeTool needs

     SentenceDetector   split text into sentences
     TokenizerME        split sentences (N.B.) into words
     TokenNameFinder    named entity recognition (persons, places…)
     POSTagger          part-of-speech labeling

  8. openNLP pipeline

         opennlp SentenceDetector en-sent.bin < sheik.txt \
             > sheik_sent.txt
         opennlp TokenizerME en-tok.bin < sheik_sent.txt \
             > sheik_words.txt
         opennlp POSTagger en-pos.bin < sheik_words.txt \
             > sheik_pos.txt
         head sheik_pos.txt

     "_`` Are_VBP you_PRP coming_VBG in_IN to_TO watch_VB the_DT dancing_NN ,_, Lady_NNP Conway_NNP ?_. "_'' "_`` I_PRP most_RBS decidedly_RB am_VBP not_RB ._.

  9. simplify the pipeline

         opennlp SentenceDetector en-sent.bin < sheik.txt | \
         opennlp TokenizerME en-tok.bin | \
         opennlp POSTagger en-pos.bin > sheik_pos.txt

  10. the command line from R

          system(s)                # run command s in current working directory
          system(s, intern=TRUE)   # return stdout as a character vector

  11. getting the stuff

          install.packages("openNLP")
          install.packages("openNLPmodels.en",
                           repos="http://datacube.wu.ac.at")

          brew install apache-opennlp  # Mac, requires homebrew
          # I'm working on a setup for the VM

  12. composing a command

          sent_model_file <- system.file("models", "en-sent.bin",
                                         package = "openNLPdata")
          token_model_file <- system.file("models", "en-token.bin",
                                          package = "openNLPdata")
          ner_command <- function (input_file, entity_type) {
              entity_model_file <- system.file("models",
                  str_c("en-ner-", entity_type, ".bin"),
                  package = "openNLPmodels.en")
              str_c("opennlp SentenceDetector ", sent_model_file,
                    " < ", input_file,
                    " | opennlp TokenizerME ", token_model_file,
                    " | opennlp TokenNameFinder ", entity_model_file)
          }

  13. doing the deed

          # Stowe, Uncle Tom's Cabin, vol. 1
          input_file <- "wright_body/VAC7958.txt"
          stowe <- system(ner_command(input_file, "location"), intern=TRUE)
          head(stowe)

      [1] "LATE in the afternoon of a chilly day in February , two gentlemen were sitting"
      [2] "alone over their wine , in a well-furnished dining parlor , in the town of <START:location> P—— <END> , in"
      [3] "<START:location> Kentucky <END> ."
      [4] "There were no servants present , and the gentlemen , with chairs closely"
      [5] "approaching , seemed to be discussing some subject with great earnestness ."
      [6] "For convenience sake , we have said , hitherto , two gentlemen ."

  14. re-parsing the results

          extract_locs <- function (tagged) {
              unlist(str_extract_all(tagged,
                                     perl("<START:location> .*? <END>"))) %>%
                  str_replace("^<START:location> ", "") %>%
                  str_replace(" <END>$", "")
          }
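The same tag stripping can be sketched at the command line as well; a rough analogue of extract_locs with grep and sed (the sample line is made up, modeled on the tagged output shown earlier):

```shell
# pull out each <START:location> ... <END> span, then strip the tags
echo 'in the town of <START:location> Kentucky <END> , in' \
  | grep -o '<START:location> [^<]* <END>' \
  | sed -e 's/^<START:location> //' -e 's/ <END>$//'
# prints: Kentucky
```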

  15. corpus

          locs_frame <- function (fs) {
              data_frame(filename=fs) %>%
                  mutate(cmd=ner_command(filename, "location")) %>%
                  group_by(filename) %>%
                  do({
                      message("NER on ", .$filename)
                      locs <- extract_locs(system(.$cmd, intern=TRUE))
                      if (length(locs) == 0) locs <- NA
                      data_frame(loc=locs)
                  }) %>%
                  group_by(filename, loc) %>%
                  summarize(count=n())
          }

  16.     fs <- Sys.glob("wright_body/*.txt")

          if (!file.exists("wright_locs.tsv")) {
              locs_frame(fs) %>%  # takes a while
                  write.table("wright_locs.tsv", sep="\t",
                              col.names=TRUE, row.names=FALSE, quote=FALSE)
          }

  17.     locs <- read.table("wright_locs.tsv", sep="\t",
                             comment.char="", header=TRUE, as.is=TRUE,
                             quote="")
          locs <- locs %>%
              group_by(loc) %>%
              filter(sum(count) > 5, n() > 2)  # Wilkens's filter
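The group-and-count step has a classic shell analogue, sort | uniq -c, shown here on toy data (not the Wright corpus): sort groups identical lines together so uniq -c can count each run, and the final sort -rn puts the most frequent first.

```shell
# count occurrences of each line, most frequent first
printf 'Paris\nBoston\nParis\n' | sort | uniq -c | sort -rn
# prints (counts padded by uniq -c):
#   2 Paris
#   1 Boston
```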

  18. merge with metadata

          meta <- read.table("wright_meta.tsv", sep="\t",
                             as.is=TRUE, header=TRUE, quote="") %>%
              mutate(filename=file.path("wright_body", file)) %>%
              select(filename, pubplace) %>%
              mutate(pubplace=str_trim(pubplace))
          # (but the place names need more cleaning than this)
          locs <- inner_join(locs, meta, by="filename")

  19. preliminary peek

          top_locs <- locs %>%
              group_by(pubplace, loc) %>%  # but just what have we been counting here?
              summarize(count=sum(count)) %>%
              filter(pubplace %in% c("Boston", "New York")) %>%
              top_n(5) %>%
              arrange(desc(count)) %>%
              rename(published=pubplace, mentioned=loc)

  20. hmmm

          print_tabular(top_locs)

      published   mentioned    count
      Boston      New York      3493
      Boston      Boston        3273
      Boston      England       2716
      Boston      Florence      2309
      Boston      Paris         1781
      New York    New York     13866
      New York    England       7800
      New York    Paris         5463
      New York    London        5131
      New York    Washington    4852

  21. what could come next: geospatial data and visualization
     ▶ Lincoln Mullen's extensive draft chapters
       lincolnmullen.com/projects/dh-r/geospatial-data.html
       lincolnmullen.com/projects/dh-r/mapping.html
     ▶ Bivand, Roger, et al. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer, 2013. DOI: 10.1007/978-1-4614-7618-4.
     ▶ ggmap package (geocoding)
     ▶ sp and many more R packages (spatial analysis)
     ▶ GIS libraries and command-line tools (not R): GDAL, GEOS, …

  22. I saw something shiny

          install.packages("tmap")   # new on CRAN; needs GDAL, GEOS
          library("tmap")
          vignette("tmap-nutshell")  # some documentation

  23. next
