SLIDE 1
Literary Data: Some Approaches
Andrew Goldstone http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.
SLIDE 2 it depends
▶ every program has dependencies
  ▶ software packages (library)
  ▶ data files (readLines, read.csv, scan, dir…)
▶ good programs document their dependencies clearly at the start
▶ nice programs allow their users to meet dependencies in a controlled fashion
Which is better as a file dependency:
▶ "/Users/agoldst/jockers/data/plainText/austen.txt"
▶ "austen.txt"
▶ "../../../../data/plainText/austen.txt"
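One way to make the comparison concrete is to classify the three candidate paths. This is a minimal sketch in base R: the names is_absolute and climbs_out are illustrative, and the absolute-path test is a simplification that only recognizes Unix-style paths.

```r
paths <- c(
    "/Users/agoldst/jockers/data/plainText/austen.txt",
    "austen.txt",
    "../../../../data/plainText/austen.txt"
)

# Simplified assumption: an absolute path starts with "/" or "~" (Unix-style only).
is_absolute <- grepl("^[/~]", paths)

# A relative path starting with "../" reaches outside the working directory.
climbs_out <- grepl("^\\.\\./", paths)

data.frame(path=paths, is_absolute, climbs_out)
```

Only the second path is neither tied to one machine's directory layout nor dependent on the directories above the script, which fits the guidance on portable dependencies later in these slides.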
SLIDE 3 file system, once and for all
▶ Every R process has a working directory
▶ RStudio defaults to
  ▶ ~
  ▶ the project directory
  ▶ the containing directory of the file you launched RStudio to open
▶ Knitting starts a new R process whose working directory is the containing directory of the R markdown file
SLIDE 4
portable dependencies
▶ all file paths relative to the working directory
▶ working directory set to the directory containing the program script
▶ working directory never subsequently modified
Testing portability
▶ for console testing, start each session by setting the working directory once to the script-containing directory
▶ for knitting, do not modify the working directory
▶ read your error messages (including in knit PDFs)
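The rules above could be sketched as a short block at the top of a script. Everything here is an assumption for illustration: "austen.txt" is a hypothetical input file, the package list is a placeholder, and missing_dependencies is an invented helper, not part of any package.

```r
# Dependencies, declared once at the top of the script.
# "austen.txt" is a hypothetical input file, relative to the working directory.
required_packages <- c("stats", "utils")   # placeholders: list what you library()
input_files <- c("austen.txt")

# Report which dependencies cannot be met, without changing the working directory.
missing_dependencies <- function (packages, files) {
    pkg_ok <- vapply(packages, requireNamespace, logical(1), quietly=TRUE)
    file_ok <- file.exists(files)
    c(packages[!pkg_ok], files[!file_ok])
}

missing <- missing_dependencies(required_packages, input_files)
if (length(missing) > 0) {
    message("Unmet dependencies (working directory: ", getwd(), "): ",
            paste(missing, collapse=", "))
}
```

Printing the working directory alongside the missing names makes the most common failure, running from the wrong directory, visible in the error message itself.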
SLIDE 5 ha ha ha, character encoding
ef bb bf 54 68 65 20 50 72 6f 6a 65 63 74 20 47
.  .  .  T  h  e     P  r  o  j  e  c  t     G
Moral
readLines(filename, encoding="UTF-8")
read.csv(filename, as.is=T, ..., encoding="UTF-8")
scan(..., encoding="UTF-8")
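The first three bytes in the dump, ef bb bf, are the UTF-8 byte order mark (BOM). The sketch below writes the sixteen bytes from the slide to a temporary file and reads them back two ways. Treating the BOM as the point of the slide is an assumption, and the "UTF-8-BOM" connection encoding shown at the end is an addition that the slide does not mention.

```r
# Write the sixteen bytes from the slide, plus a newline, to a temporary file.
bytes <- as.raw(c(0xef, 0xbb, 0xbf,
                  0x54, 0x68, 0x65, 0x20,
                  0x50, 0x72, 0x6f, 0x6a, 0x65, 0x63, 0x74, 0x20,
                  0x47, 0x0a))
path <- tempfile(fileext=".txt")
writeBin(bytes, path)

# Declaring the encoding marks the text as UTF-8 but keeps the three BOM bytes.
marked <- readLines(path, encoding="UTF-8")
head(charToRaw(marked), 3)

# A connection opened with the "UTF-8-BOM" encoding drops the BOM while reading.
con <- file(path, encoding="UTF-8-BOM")
stripped <- readLines(con)
close(con)
stripped
```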
SLIDE 7
first class
▶ function definitions look like assignments because they are
▶ function (...) { ... } is a value like any other
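A small illustration of what "a value like any other" allows; the names square, also_square, and operations are invented for this sketch.

```r
square <- function (x) { x * x }      # an ordinary assignment
also_square <- square                 # the value can be copied to another name
operations <- list(                   # ...or stored inside a list
    square=square,
    negate=function (x) { -x }
)

class(square)
also_square(4)
operations$negate(3)
```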
SLIDE 8
function as parameter
bind <- function (x, f) { f(x) }
bind(c(1, 2, 3), sum)
[1] 6
library(stringr) # provides str_c
twice <- function (s) { str_c(s, s) }
bind("ha", twice)
[1] "haha"
SLIDE 9
funny function
`%p%` <- function (x, y) { x + y }
100 %p% 200
[1] 300
`%b%` <- function (x, f) { f(x) }
"ha" %b% twice
[1] "haha"
SLIDE 11
anonymous function
bind("parenthetical", function (s) { str_c("(", s, ")") } )
[1] "(parenthetical)"
SLIDE 12
map <- function (f, xs) {
    result <- list()
    for (j in seq_along(xs)) {
        result[[j]] <- f(xs[[j]])
    }
    result
}
map(twice, c("well", "now", "no"))
[[1]]
[1] "wellwell"

[[2]]
[1] "nownow"

[[3]]
[1] "nono"
SLIDE 13
filter_vector <- function (f, xs) {
    result <- c()
    for (x in xs) {
        if (f(x)) {
            result <- c(result, x)
        }
    }
    result
}
pos <- function (x) (x > 0)
filter_vector(pos, (-5):5)
[1] 1 2 3 4 5
SLIDE 14
curry
map_f <- function (f) {
    function (xs) {
        result <- list()
        for (j in seq_along(xs)) {
            result[[j]] <- f(xs[[j]])
        }
        result
    }
}
map_f(twice)(c("well", "now", "no"))
[[1]]
[1] "wellwell"

[[2]]
[1] "nownow"

[[3]]
[1] "nono"
SLIDE 15
▶ how would you write filter_f?
filter_f <- function (f) {
    function (xs) {
        result <- c()
        for (x in xs) {
            if (f(x)) {
                result <- c(result, x)
            }
        }
        result
    }
}
filter_f(pos)(c(-1, 1))
[1] 1
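One possible payoff of the curried form is that the returned function can be named and reused. The sketch repeats the definitions of map_f and filter_f so that it runs on its own; keep_pos and double_all are illustrative names.

```r
map_f <- function (f) {
    function (xs) {
        result <- list()
        for (j in seq_along(xs)) {
            result[[j]] <- f(xs[[j]])
        }
        result
    }
}
filter_f <- function (f) {
    function (xs) {
        result <- c()
        for (x in xs) {
            if (f(x)) {
                result <- c(result, x)
            }
        }
        result
    }
}

# Supplying only f yields a specialized function that still waits for xs.
keep_pos <- filter_f(function (x) { x > 0 })
double_all <- map_f(function (x) { 2 * x })

kept <- keep_pos(c(-2, 3, 0, 7))
doubled <- double_all(kept)
```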
SLIDE 17
built-in
lapply(lst, f)             # same as map(f, lst)
lapply(lst, f, x, y, ...)  # ... passed on to f:
# inside the for loop: result[[j]] <- f(xs[[j]], x, y, ...)
▶ sapply (returns a vector if possible)
▶ apply (iterate over rows/columns of a matrix)
▶ tapply (iterate over groups identified by a factor)
▶ what a mess!
SLIDE 18
dplyr: split-apply-combine
▶ split a data frame up into pieces (rows, groups of rows…)
▶ do something to each piece
▶ put together the result
SLIDE 19
No new functionality (but…)
dplyr verb   base R equivalent
select       column indexing by name (or $)
filter       logical row subscripting
arrange      order
mutate       expressions in terms of columns
summarize    for loops, table, sum, mean…
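The base R side of that correspondence might look like the following sketch. The data frame toy is a three-row stand-in for laureates, made up for illustration, and the decade column is an invented example of a computed column.

```r
# A toy stand-in for the laureates data frame (illustrative rows only).
toy <- data.frame(
    surname=c("Modiano", "Munro", "Lessing"),
    gender=c("male", "female", "female"),
    year=c(2014, 2013, 2007),
    stringsAsFactors=FALSE
)

surnames_df <- toy["surname"]                  # select: still a data frame
surnames_vec <- toy$surname                    # $ gives a plain vector instead
women <- toy[toy$gender == "female", ]         # filter: logical row subscripting
by_year <- toy[order(toy$year), ]              # arrange: order
with_decade <- transform(toy, decade=10 * (year %/% 10))  # mutate: column expression
n_women <- sum(toy$gender == "female")         # summarize: sum
```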
SLIDE 20
surnames <- select(laureates, surname)
head(surnames) # data type!
       surname
1      Modiano
2        Munro
3          Yan
4  Tranströmer
5 Vargas Llosa
6       Müller
women <- filter(laureates, gender=="female")
women$surname[1:3]
[1] "Munro" "Müller" "Lessing"
SLIDE 21
(notionally)
women <- filter(laureates, function (laur_row) { laur_row$gender == "female" })
SLIDE 22
modularity, but…
women_surnames <- select(filter(laureates, gender=="female"), surname, year) # ugh
SLIDE 23
curiouser and curiouser
women_last <- laureates %>%
    filter(gender=="female") %>%
    select(surname, year)
women_last[1:3, ]
  surname year
1   Munro 2013
2  Müller 2009
3 Lessing 2007
SLIDE 24
“pipe”
x %>% f          # equivalent to...
x %b% f          # or...
f(x)
x %>% f(y, z)    # equivalent to...
f(x, y, z)       # wow!?
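To see how an operator could turn x %>% f(y, z) into f(x, y, z), here is a toy version in base R. The operator name %then% and the placeholder name .lhs are invented for this sketch; it is not how the real %>% is implemented and it ignores the real operator's other features.

```r
`%then%` <- function (lhs, rhs) {
    rhs <- substitute(rhs)              # capture the right-hand side unevaluated
    if (is.call(rhs)) {
        # f(y, z) becomes f(.lhs, y, z)
        parts <- c(list(rhs[[1]], quote(.lhs)), as.list(rhs)[-1])
        piped <- as.call(parts)
    } else {
        # a bare function name f becomes f(.lhs)
        piped <- as.call(list(rhs, quote(.lhs)))
    }
    # evaluate the rewritten call with .lhs bound to the left-hand value
    eval(piped, list(.lhs=lhs), parent.frame())
}

total <- c(4, 9) %then% sqrt %then% sum
joined <- "ha" %then% paste("ha", sep="-")
```

The second call shows the rewriting step: the right-hand side paste("ha", sep="-") never runs as written, since the left-hand value is inserted as its first argument first.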
SLIDE 25
a special function
laureates %>%
    filter(gender=="female") %>%
    summarize(total=length(surname)) # think about it...
  total
1    12
laureates %>%
    filter(gender=="female") %>%
    summarize(total=n())
  total
1    12
SLIDE 27
laureates %>%
    filter(!is.na(diedCountryCode)) %>%
    filter(bornCountryCode != diedCountryCode) %>%
    summarize(n_exiles=n())
  n_exiles
1       34
SLIDE 28
f_sws <- laureates %>%
    filter(gender=="female" | bornCountry=="Sweden") %>%
    select(firstname, surname) %>%
    slice(1:5)
f_sws
  firstname     surname
1     Alice       Munro
2     Tomas Tranströmer
3     Herta      Müller
4     Doris     Lessing
5  Elfriede     Jelinek
f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" "))
  firstname     surname          fullname
1     Alice       Munro       Alice Munro
2     Tomas Tranströmer Tomas Tranströmer
3     Herta      Müller      Herta Müller
4     Doris     Lessing     Doris Lessing
5  Elfriede     Jelinek  Elfriede Jelinek
SLIDE 29
f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
    select(-firstname) # minus!!
      surname          fullname
1       Munro       Alice Munro
2 Tranströmer Tomas Tranströmer
3      Müller      Herta Müller
4     Lessing     Doris Lessing
5     Jelinek  Elfriede Jelinek
SLIDE 30
f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
    arrange(surname) %>%
    select(fullname)
           fullname
1  Elfriede Jelinek
2     Doris Lessing
3       Alice Munro
4      Herta Müller
5 Tomas Tranströmer
SLIDE 31
laur_ages <- laureates %>%
    filter(born != "0000-00-00") %>% # Mo Yan
    mutate(born_year=as.numeric(str_sub(born, 1, 4))) %>%
    mutate(age=year - born_year) %>%
    mutate(decade=str_c(str_sub(year, 1, 3), 0)) %>%
    select(surname, year, decade, age)
laur_ages %>% slice(1:3)
      surname year decade age
1     Modiano 2014   2010  69
2       Munro 2013   2010  82
3 Tranströmer 2011   2010  80
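The string arithmetic in that pipeline can be checked on plain vectors. This sketch uses base R's substr and paste0 in place of str_sub and str_c. The three input rows are illustrative values, chosen to be consistent with the ages shown above and with the "0000-00-00" placeholder the slide attributes to Mo Yan.

```r
# Illustrative inputs shaped like the born and year columns.
born <- c("1945-07-30", "1931-07-10", "0000-00-00")
year <- c(2014, 2013, 2012)

known <- born != "0000-00-00"                         # drop the placeholder date
born_year <- as.numeric(substr(born[known], 1, 4))    # first four characters
age <- year[known] - born_year
decade <- paste0(substr(year[known], 1, 3), "0")      # "2014" -> "201" -> "2010"
```

Note that decade built this way is a character vector, not a number.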
SLIDE 32
split
laur_ages %>% group_by(decade) # no-op?
Source: local data frame [106 x 4]
Groups: decade

        surname year decade age
1       Modiano 2014   2010  69
2         Munro 2013   2010  82
3   Tranströmer 2011   2010  80
4  Vargas Llosa 2010   2010  74
5        Müller 2009   2000  56
6     Le Clézio 2008   2000  68
7       Lessing 2007   2000  88
8         Pamuk 2006   2000  54
9        Pinter 2005   2000  75
10      Jelinek 2004   2000  58
..          ...  ...    ... ...
SLIDE 33
apply-combine
laur_ages %>%
    group_by(decade) %>%
    summarize(avg_age=mean(age))
Source: local data frame [12 x 2]

   decade  avg_age
1    1900 64.11111
2    1910 58.87500
3    1920 60.10000
4    1930 56.44444
5    1940 64.33333
6    1950 63.70000
7    1960 66.20000
8    1970 67.00000
9    1980 67.60000
10   1990 67.50000
11   2000 66.40000
12   2010 76.25000
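The three steps can also be spelled out one at a time in base R, which may make clearer what group_by and summarize do together. The data frame toy is a four-row illustration, not the full laur_ages table, so its averages differ from the output above.

```r
# Four illustrative rows with the same columns used in the summary.
toy <- data.frame(
    decade=c("2010", "2010", "2000", "2000"),
    age=c(69, 82, 56, 68),
    stringsAsFactors=FALSE
)

pieces <- split(toy$age, toy$decade)    # split: one vector of ages per decade
means <- sapply(pieces, mean)           # apply: one mean per piece
result <- data.frame(                   # combine: back into a data frame
    decade=names(means),
    avg_age=unname(means),
    stringsAsFactors=FALSE
)
result
```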
SLIDE 34
input…
txl <- read.csv("three-percent.csv", as.is=T,
                encoding="UTF-8") # N.B.
nrow(txl)
[1] 3187
colnames(txl)
 [1] "ISBN"         "Titles"       "AuthorFN"
 [4] "AuthorLN"     "TranslatorFN" "TranslatorLN"
 [7] "Publisher"    "Genre"        "Price"
[10] "Month"        "Year"         "Lanuage"
[13] "Country"      "OtherFN"      "OtherLN"
[16] "Other.Role"
SLIDE 35
txl <- txl %>%
    filter(Year < 2015) %>%
    select(ISBN, Year, Country, Genre, Publisher,
           Language=Lanuage)
SLIDE 36
split, apply, combine
txl %>%
    group_by(Year, Genre) %>%
    summarize(count=n())
Source: local data frame [14 x 3]
Groups: Year

   Year   Genre count
1  2008 Fiction   278
2  2008  Poetry    82
3  2009 Fiction   291
4  2009  Poetry    72
5  2010 Fiction   265
6  2010  Poetry    75
7  2011 Fiction   303
8  2011  Poetry    67
9  2012 Fiction   387
10 2012  Poetry    73
11 2013 Fiction   448
12 2013  Poetry    93
13 2014 Fiction   494
14 2014  Poetry    93
SLIDE 37
split, apply, combine (more than one row)
▶ top_n(x, n, wt): top n rows of x by wt
lang_counts <- txl %>%
    group_by(Year, Language) %>%
    summarize(count=n())
win_place_langs <- lang_counts %>%
    top_n(2, count) %>%
    arrange(Year, desc(count))
SLIDE 38
win_place_langs
Source: local data frame [14 x 3]
Groups: Year

   Year Language count
1  2008   French    60
2  2008  Spanish    48
3  2009  Spanish    62
4  2009   French    54
5  2010   French    60
6  2010  Spanish    52
7  2011   French    63
8  2011  Spanish    50
9  2012   French    68
10 2012  Spanish    58
11 2013   French    90
12 2013  Spanish    71
13 2014   French   106
14 2014   German    85
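A rough base R counterpart of "top two rows per group" is sketched here on made-up data; the columns group, item, and count and their values are hypothetical. One known difference: top_n keeps extra rows when counts tie, while head() as used here does not.

```r
# Hypothetical toy data: three items counted in each of two groups.
toy <- data.frame(
    group=c("a", "a", "a", "b", "b", "b"),
    item=c("p", "q", "r", "p", "q", "r"),
    count=c(5, 9, 2, 7, 1, 8),
    stringsAsFactors=FALSE
)

pieces <- split(toy, toy$group)                  # split by group
top_two <- lapply(pieces, function (piece) {     # apply: rank, keep two rows
    ranked <- piece[order(piece$count, decreasing=TRUE), ]
    head(ranked, 2)
})
result <- do.call(rbind, top_two)                # combine
rownames(result) <- NULL
result
```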
SLIDE 39
variety
txl %>%
    group_by(Publisher) %>%
    summarize(langs=n_distinct(Language)) %>%
    arrange(desc(langs)) %>%
    top_n(5, langs)
Source: local data frame [5 x 2]

       Publisher langs
1 Dalkey Archive    26
2    Archipelago    21
3    Open Letter    20
4          Knopf    17
5    Other Press    17