Literary Data: Some Approaches - Andrew Goldstone



SLIDE 1

Literary Data: Some Approaches

Andrew Goldstone
http://www.rci.rutgers.edu/~ag978/litdata
April 30, 2015

Named entity recognition

SLIDE 2
▶ one tool does not do it all

▶ NLP and openNLP packages let you interface with the Apache OpenNLP Java software (and theoretically Stanford CoreNLP too). These packages are not good.

▶ much useful software lacks R glue
▶ but some of it can be used from the command line…and thus via R

SLIDE 3

This is a Unix system! I know this!

▶ think of a program as a function:

program <- function (input, arg1, arg2, ...) {
    ... # make calculations
    ... # possible side effects
    output
}

▶ in the shell, this looks like

program arg1 arg2 ...
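As a sketch of the analogy (the file name and its contents here are made up for illustration):

```shell
# a command line mirrors a function call: program(input, args...)
# here `sort -r` plays the role of something like sort(input, order = "reverse")
printf 'b\na\nc\n' > /tmp/letters.txt   # made-up input file
sort -r /tmp/letters.txt                # prints: c, b, a (one per line)
```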

SLIDE 4

example

echo "Hello Raptor"

SLIDE 5

where did input and output go?

standard input: source, might represent user typing
standard output: target, might represent the console
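A quick way to see both streams at work, using `wc -w` (a toy example, not from the slides):

```shell
# run bare, wc -w would read stdin from the terminal until you type Ctrl-D;
# here a pipe replaces the typing, and stdout still goes to the console:
printf 'one two three\n' | wc -w   # counts 3 words (output may be space-padded)
```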

SLIDE 6

redirection

▶ can replace stdin and stdout with files (no more typing)

wc < sheik.txt > sheik_wc.txt

▶ can connect stdout of one program to stdin of the next one

echo "Hello Raptor" | wc

▶ shell analogue of dplyr pipelines:

txl %>% filter(Genre == "poetry") %>% arrange(Price)
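The analogy can be made concrete with `grep` and `sort` standing in for `filter` and `arrange`, on a few made-up CSV rows (Genre,Price):

```shell
# ~ filter(Genre == "poetry") then arrange(Price), in shell form:
printf 'poetry,3\nnovel,10\npoetry,1\n' |
  grep '^poetry' |    # keep only the poetry rows
  sort -t, -k2 -n     # order by the numeric second field
# prints: poetry,1 then poetry,3
```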

SLIDE 7
openNLP command line

opennlp SomeTool arg ...

▶ takes text input on stdin, sends results to stdout
▶ arg is the name of an auxiliary data file SomeTool needs

SentenceDetector   split text into sentences
TokenizerME        split sentences (N.B.) into words
TokenNameFinder    Named Entity Recognition (persons, places…)
POSTagger          part-of-speech labeling

SLIDE 8
openNLP pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt \
    > sheik_sent.txt
opennlp TokenizerME en-tok.bin < sheik_sent.txt \
    > sheik_words.txt
opennlp POSTagger en-pos.bin < sheik_words.txt \
    > sheik_pos.txt

head sheik_pos.txt

"_`` Are_VBP you_PRP coming_VBG in_IN to_TO watch_VB the_DT dancing_NN ,_, Lady_NNP Conway_NNP ?_. "_''
"_`` I_PRP most_RBS decidedly_RB am_VBP not_RB ._.

SLIDE 9

simplify the pipeline

opennlp SentenceDetector en-sent.bin < sheik.txt | \
opennlp TokenizerME en-tok.bin | \
opennlp POSTagger en-pos.bin > sheik_pos.txt
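Since opennlp and its model files may not be installed, the same three-stage chaining shape can be tried with everyday tools standing in for the OpenNLP steps (a toy illustration only):

```shell
# lowercase, then one token per line, then sort - three piped stages:
printf 'Hello Raptor\n' |
  tr '[:upper:]' '[:lower:]' |   # stage 1: lowercase everything
  tr ' ' '\n' |                  # stage 2: split into one word per line
  sort                           # stage 3: order the tokens
# prints: hello, then raptor
```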
SLIDE 10

the command line from R

system(s)           # run command s in current working directory
system(s, intern=T) # return stdout as a character vector

SLIDE 11

getting the stuff

install.packages("openNLP")
install.packages("openNLPmodels.en",
                 repos="http://datacube.wu.ac.at")

brew install apache-opennlp # Mac, requires homebrew
# I'm working on a setup for the VM

SLIDE 12

composing a command

ner_command <- function (input_file, entity_type) {
    sent_model_file <- system.file("models", "en-sent.bin",
                                   package = "openNLPdata")
    token_model_file <- system.file("models", "en-token.bin",
                                    package = "openNLPdata")
    entity_model_file <- system.file("models",
        str_c("en-ner-", entity_type, ".bin"),
        package = "openNLPmodels.en")
    str_c("opennlp SentenceDetector ", sent_model_file,
          " < ", input_file,
          " | opennlp TokenizerME ", token_model_file,
          " | opennlp TokenNameFinder ", entity_model_file)
}

SLIDE 13

doing the deed

# Stowe, Uncle Tom's Cabin, vol. 1
input_file <- "wright_body/VAC7958.txt"
stowe <- system(ner_command(input_file, "location"), intern=T)
head(stowe)

[1] "LATE in the afternoon of a chilly day in February , two gentlemen were sitting"
[2] "alone over their wine , in a well-furnished dining parlor , in the town of <START:location> P—— <END> , in"
[3] "<START:location> Kentucky <END> ."
[4] "There were no servants present , and the gentlemen , with chairs closely"
[5] "approaching , seemed to be discussing some subject with great earnestness ."
[6] "For convenience sake , we have said , hitherto , two gentlemen ."

SLIDE 14

re-parsing the results

extract_locs <- function (tagged) {
    unlist(str_extract_all(tagged,
        perl("<START:location> .*? <END>"))) %>%
        str_replace("^<START:location> ", "") %>%
        str_replace(" <END>$", "")
}

SLIDE 15

corpus

locs_frame <- function (fs) {
    data_frame(filename=fs) %>%
        mutate(cmd=ner_command(filename, "location")) %>%
        group_by(filename) %>%
        do({
            message("NER on ", .$filename)
            locs <- extract_locs(system(.$cmd, intern=T))
            if (length(locs) == 0) locs <- NA
            data_frame(loc=locs)
        }) %>%
        group_by(filename, loc) %>%
        summarize(count=n())
}

SLIDE 16

if (!file.exists("wright_locs.tsv")) {
    fs <- Sys.glob("wright_body/*.txt")
    locs_frame(fs) %>% # takes a while
        write.table("wright_locs.tsv", sep="\t",
                    col.names=T, row.names=F, quote=F)
}

SLIDE 17

locs <- read.table("wright_locs.tsv", sep="\t",
                   comment.char="", header=T, as.is=T, quote="")
locs <- locs %>%
    group_by(loc) %>%
    filter(sum(count) > 5, n() > 2) # Wilkens's filter

SLIDE 18

merge with metadata

meta <- read.table("wright_meta.tsv", sep="\t",
                   as.is=T, header=T, quote="") %>%
    mutate(filename=file.path("wright_body", file)) %>%
    select(filename, pubplace) %>%
    mutate(pubplace=str_trim(pubplace))
# (but the place names need more cleaning than this)
locs <- inner_join(locs, meta, by="filename")

SLIDE 19

preliminary peek

top_locs <- locs %>%
    group_by(pubplace, loc) %>% # but just what have we been counting here?
    summarize(count=sum(count)) %>%
    filter(pubplace %in% c("Boston", "New York")) %>%
    top_n(5) %>%
    arrange(desc(count)) %>%
    rename(published=pubplace, mentioned=loc)

SLIDE 20

hmmm

print_tabular(top_locs)

published  mentioned    count
Boston     New York      3493
Boston     Boston        3273
Boston     England       2716
Boston     Florence      2309
Boston     Paris         1781
New York   New York     13866
New York   England       7800
New York   Paris         5463
New York   London        5131
New York   Washington    4852

SLIDE 21

what could come next: geospatial data and visualization

▶ Lincoln Mullen’s extensive draft chapters

lincolnmullen.com/projects/dh-r/geospatial-data.html lincolnmullen.com/projects/dh-r/mapping.html

▶ Bivand, Roger, et al. Applied Spatial Data Analysis with R. 2nd ed. New York: Springer, 2013. DOI: 10.1007/978-1-4614-7618-4.

▶ ggmap package (geocoding)
▶ sp and many more R packages (spatial analysis)
▶ GIS libraries and command line tools (not R): GDAL, GEOS, …

SLIDE 22

I saw something shiny

install.packages("tmap") # new on CRAN; needs GDAL, GEOS
library("tmap")
vignette("tmap-nutshell") # some documentation

SLIDE 23

next