Literary Data: Some Approaches Andrew Goldstone - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Literary Data: Some Approaches

Andrew Goldstone http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.

slide-2
SLIDE 2

it depends

▶ every program has dependencies
  ▶ software packages (library)
  ▶ data files (readLines, read.csv, scan, dir…)
▶ good programs document their dependencies clearly at the start
▶ nice programs allow their users to meet dependencies in a controlled fashion

Which is better as a file dependency:

▶ "/Users/agoldst/jockers/data/plainText/austen.txt"
▶ "austen.txt"
▶ "../../../../data/plainText/austen.txt"

slide-3
SLIDE 3

file system, once and for all

▶ Every R process has a working directory
▶ RStudio defaults to
  ▶ ~
  ▶ the project directory
  ▶ the containing directory of the file you launched RStudio to open
▶ Knitting starts a new R process whose working directory is the containing directory of the R markdown file

slide-4
SLIDE 4

portable dependencies

▶ all file paths relative to the working directory
▶ working directory set to the directory containing the program script
▶ working directory never subsequently modified

Testing portability

▶ for console testing, start each session by setting the working directory once to the script-containing directory
▶ for knitting, do not modify the working directory
▶ read your error messages (including in knit PDFs)
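A minimal sketch of how a script might follow these rules: declare file dependencies once at the top, as paths relative to the working directory, and check them without ever calling setwd(). The file name data/austen.txt and the helper missing_dependencies are hypothetical, chosen only for illustration.

```r
# Dependencies, declared once at the top of the script.
# (Hypothetical file name, for illustration only.)
data_files <- c(austen=file.path("data", "austen.txt"))

# Report which file dependencies cannot be found relative to the
# current working directory, without changing that directory.
missing_dependencies <- function (paths) {
    paths[!file.exists(paths)]
}

missing <- missing_dependencies(data_files)
if (length(missing) > 0) {
    message("Missing files (relative to ", getwd(), "): ",
            paste(missing, collapse=", "))
}
```

Because the check only reports what is missing, a user on another machine learns exactly which files to supply and where, relative to the script's directory.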

slide-5
SLIDE 5

ha ha ha, character encoding

ef bb bf 54 68 65 20 50 72 6f 6a 65 63 74 20 47
.  .  .  T  h  e     P  r  o  j  e  c  t     G

Moral

readLines(filename, encoding="UTF-8")
read.csv(filename, as.is=T, ..., encoding="UTF-8")
scan(..., encoding="UTF-8")
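The three leading bytes in the dump (ef bb bf) are a UTF-8 byte-order mark. As a hedged sketch, not part of the original slides, the following writes those bytes plus "The" to a scratch file and reads them back through a connection declared as "UTF-8-BOM", an encoding name that base R's file() accepts for reading and that drops the mark. The scratch file and variable names are illustrative assumptions.

```r
# Write the first bytes from the dump to a scratch file:
# a UTF-8 byte-order mark (ef bb bf), then "The" and a newline.
path <- tempfile(fileext=".txt")
writeBin(as.raw(c(0xef, 0xbb, 0xbf, 0x54, 0x68, 0x65, 0x0a)), path)

# Inspect the raw bytes, as a hex dump would.
first_bytes <- readBin(path, what="raw", n=3)
print(first_bytes)

# Reading through a connection that declares "UTF-8-BOM" drops the mark.
con <- file(path, encoding="UTF-8-BOM")
first_line <- readLines(con, n=1)
close(con)
print(first_line)
```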

slide-7
SLIDE 7

first class

▶ function definitions look like assignments because they are
▶ function (...) { ... } is a value like any other
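A small illustration of what "a value like any other" can mean in practice: a function can be given a second name, stored in a list, and called from there. The names square, also_square, and ops are made up for this sketch.

```r
# A function definition is an ordinary assignment of a function value.
square <- function (x) { x * x }

# The same value can be given a second name...
also_square <- square

# ...stored in a list alongside other functions...
ops <- list(sq=square, neg=function (x) { -x })

# ...and retrieved and called like any other list element.
results <- c(also_square(4), ops$sq(3), ops$neg(3))
print(results)
```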

slide-8
SLIDE 8

function as parameter

bind <- function (x, f) { f(x) }
bind(c(1, 2, 3), sum)

[1] 6

library(stringr) # str_c comes from stringr
twice <- function (s) { str_c(s, s) }
bind("ha", twice)

[1] "haha"

slide-9
SLIDE 9

funny function

`%p%` <- function (x, y) { x + y }
100 %p% 200

[1] 300

`%b%` <- function (x, f) { f(x) }
"ha" %b% twice

[1] "haha"

slide-11
SLIDE 11

anonymous function

bind("parenthetical", function (s) { str_c("(", s, ")") } )

[1] "(parenthetical)"

slide-12
SLIDE 12

map <- function (f, xs) {
    result <- list()
    for (j in seq_along(xs)) {
        result[[j]] <- f(xs[[j]])
    }
    result
}
map(twice, c("well", "now", "no"))

[[1]]
[1] "wellwell"

[[2]]
[1] "nownow"

[[3]]
[1] "nono"

slide-13
SLIDE 13

filter_vector <- function (f, xs) {
    result <- c()
    for (x in xs) {
        if (f(x)) {
            result <- c(result, x)
        }
    }
    result
}
pos <- function (x) (x > 0)
filter_vector(pos, (-5):5)

[1] 1 2 3 4 5

slide-14
SLIDE 14

curry

map_f <- function (f) {
    function (xs) {
        result <- list()
        for (j in seq_along(xs)) {
            result[[j]] <- f(xs[[j]])
        }
        result
    }
}
map_f(twice)(c("well", "now", "no"))

[[1]]
[1] "wellwell"

[[2]]
[1] "nownow"

[[3]]
[1] "nono"
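One reason to curry, sketched under assumptions: supplying only f gives back a function that remembers f and can be reused on many inputs. map_f is restated here so the block runs by itself; double_up is a hypothetical stand-in for twice that uses base R's paste0 instead of str_c.

```r
# Same map_f as on the slide, restated so this block is self-contained.
map_f <- function (f) {
    function (xs) {
        result <- list()
        for (j in seq_along(xs)) {
            result[[j]] <- f(xs[[j]])
        }
        result
    }
}

# Hypothetical helper standing in for twice(), using base R's paste0.
double_up <- function (s) { paste0(s, s) }

# Supplying only f yields a new function that remembers f.
map_double <- map_f(double_up)

# The remembered f is applied on every later call.
first <- map_double(c("ha", "ho"))
second <- map_double(list("well"))
```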

slide-15
SLIDE 15

▶ how would you write filter_f?

filter_f <- function (f) {
    function (xs) {
        result <- c()
        for (x in xs) {
            if (f(x)) {
                result <- c(result, x)
            }
        }
        result
    }
}
filter_f(pos)(c(-1, 1))

[1] 1

slide-17
SLIDE 17

built-in

lapply(lst, f)            # same as map(f, lst)
lapply(lst, f, x, y, ...) # ... passed on to f:
# inside the for loop: result[[j]] <- f(xs[[j]], x, y, ...)

▶ sapply (returns a vector if possible)
▶ apply (iterate over rows/columns of a matrix)
▶ tapply (iterate over groups identified by a factor)
▶ what a mess!
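A short tour of the built-ins listed above, on made-up inputs. Filter is also shown as base R's counterpart to filter_vector; it is not mentioned on the slide and is included here only as a related pointer.

```r
words <- c("well", "now", "no")

# lapply always returns a list, like map() above.
as_list <- lapply(words, nchar)

# sapply simplifies to a vector when every result has length one.
as_vector <- sapply(words, nchar, USE.NAMES=FALSE)

# Extra arguments after f are passed on to f on every call.
prefixes <- sapply(words, substr, 1, 2, USE.NAMES=FALSE)

# Filter is base R's counterpart to filter_vector().
positives <- Filter(function (x) { x > 0 }, (-5):5)

# apply iterates over rows (1) or columns (2) of a matrix.
m <- matrix(1:6, nrow=2)
row_sums <- apply(m, 1, sum)

# tapply iterates over groups identified by a factor.
group_means <- tapply(c(1, 3, 10, 20), factor(c("a", "a", "b", "b")), mean)
```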

slide-18
SLIDE 18

dplyr: split-apply-combine

▶ split a data frame up into pieces (rows, groups of rows…)
▶ do something to each piece
▶ put together the result
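The three steps can be sketched in base R before any dplyr is involved. The data frame toy and its values are invented for illustration; they are not the laureates data.

```r
# Hypothetical toy data, for illustration only.
toy <- data.frame(
    decade=c("1990", "1990", "2000", "2000"),
    age=c(60, 70, 50, 90)
)

# split: one data frame per decade
pieces <- split(toy, toy$decade)

# apply: do something to each piece
means <- lapply(pieces, function (piece) { mean(piece$age) })

# combine: put the results back together
combined <- data.frame(
    decade=names(means),
    avg_age=unlist(means, use.names=FALSE)
)
print(combined)
```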

slide-19
SLIDE 19

No new functionality (but…)

select     column indexing by name (or $)
filter     logical row subscripting
arrange    order
mutate     expressions in terms of columns
summarize  for loops, table, sum, mean…
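To make the correspondence concrete, here is each verb next to a rough base R equivalent, on an invented data frame. This sketch assumes the dplyr package is installed; the data and variable names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
toy <- data.frame(
    name=c("c", "a", "b"),
    score=c(3, 1, 2),
    stringsAsFactors=FALSE
)

by_select <- select(toy, name)                   # toy[, "name", drop=FALSE]
by_filter <- filter(toy, score > 1)              # toy[toy$score > 1, ]
by_arrange <- arrange(toy, score)                # toy[order(toy$score), ]
by_mutate <- mutate(toy, double=score * 2)       # toy$double <- toy$score * 2
by_summarize <- summarize(toy, total=sum(score)) # sum(toy$score)
```

Every dplyr call here takes a data frame and returns a data frame, which is what lets the verbs be chained.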

slide-20
SLIDE 20

surnames <- select(laureates, surname)
head(surnames) # data type!

       surname
1      Modiano
2        Munro
3          Yan
4  Tranströmer
5 Vargas Llosa
6       Müller

women <- filter(laureates, gender=="female")
women$surname[1:3]

[1] "Munro" "Müller" "Lessing"

slide-21
SLIDE 21

(notionally)

women <- filter(laureates, function (laur_row) { laur_row$gender == "female" })

slide-22
SLIDE 22

modularity, but…

women_surnames <- select(filter(laureates, gender=="female"), surname, year) # ugh

slide-23
SLIDE 23

curiouser and curiouser

women_last <- laureates %>%
    filter(gender=="female") %>%
    select(surname, year)
women_last[1:3, ]

  surname year
1   Munro 2013
2  Müller 2009
3 Lessing 2007

slide-24
SLIDE 24

“pipe”

x %>% f       # equivalent to...
x %b% f       # or...
f(x)
x %>% f(y, z) # equivalent to...
f(x, y, z)    # wow!?
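The equivalences can be checked directly on small values. This sketch assumes dplyr is installed (it re-exports %>% from magrittr); the values are arbitrary.

```r
library(dplyr) # re-exports %>% from magrittr

# x %>% f is f(x)
a <- c(1, 2, 3) %>% sum
b <- sum(c(1, 2, 3))

# x %>% f(y, z) is f(x, y, z): x becomes the first argument
c1 <- "ha" %>% paste("ho", sep="-")
c2 <- paste("ha", "ho", sep="-")

# Steps chain left to right instead of nesting inside out.
d <- c(4, 9, 16) %>% sqrt %>% rev
e <- rev(sqrt(c(4, 9, 16)))
```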

slide-25
SLIDE 25

a special function

laureates %>%
    filter(gender=="female") %>%
    summarize(total=length(surname)) # think about it...

  total
1    12

laureates %>%
    filter(gender=="female") %>%
    summarize(total=n())

  total
1    12

slide-27
SLIDE 27

laureates %>%
    filter(!is.na(diedCountryCode)) %>%
    filter(bornCountryCode != diedCountryCode) %>%
    summarize(n_exiles=n())

  n_exiles
1       34

slide-28
SLIDE 28

f_sws <- laureates %>%
    filter(gender=="female" | bornCountry=="Sweden") %>%
    select(firstname, surname) %>%
    slice(1:5)
f_sws

  firstname     surname
1     Alice       Munro
2     Tomas Tranströmer
3     Herta      Müller
4     Doris     Lessing
5  Elfriede     Jelinek

f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" "))

  firstname     surname          fullname
1     Alice       Munro       Alice Munro
2     Tomas Tranströmer Tomas Tranströmer
3     Herta      Müller      Herta Müller
4     Doris     Lessing     Doris Lessing
5  Elfriede     Jelinek  Elfriede Jelinek

slide-29
SLIDE 29

f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
    select(-firstname) # minus!!

      surname          fullname
1       Munro       Alice Munro
2 Tranströmer Tomas Tranströmer
3      Müller      Herta Müller
4     Lessing     Doris Lessing
5     Jelinek  Elfriede Jelinek

slide-30
SLIDE 30

f_sws %>%
    mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
    arrange(surname) %>%
    select(fullname)

           fullname
1  Elfriede Jelinek
2     Doris Lessing
3       Alice Munro
4      Herta Müller
5 Tomas Tranströmer

slide-31
SLIDE 31

laur_ages <- laureates %>%
    filter(born != "0000-00-00") %>% # Mo Yan
    mutate(born_year=as.numeric(str_sub(born, 1, 4))) %>%
    mutate(age=year - born_year) %>%
    mutate(decade=str_c(str_sub(year, 1, 3), 0)) %>%
    select(surname, year, decade, age)
laur_ages %>% slice(1:3)

      surname year decade age
1     Modiano 2014   2010  69
2       Munro 2013   2010  82
3 Tranströmer 2011   2010  80

slide-32
SLIDE 32

split

laur_ages %>% group_by(decade) # no-op?

Source: local data frame [106 x 4]
Groups: decade

        surname year decade age
1       Modiano 2014   2010  69
2         Munro 2013   2010  82
3   Tranströmer 2011   2010  80
4  Vargas Llosa 2010   2010  74
5        Müller 2009   2000  56
6     Le Clézio 2008   2000  68
7       Lessing 2007   2000  88
8         Pamuk 2006   2000  54
9        Pinter 2005   2000  75
10      Jelinek 2004   2000  58
..          ...  ...    ... ...

slide-33
SLIDE 33

apply-combine

laur_ages %>%
    group_by(decade) %>%
    summarize(avg_age=mean(age))

Source: local data frame [12 x 2]

   decade  avg_age
1    1900 64.11111
2    1910 58.87500
3    1920 60.10000
4    1930 56.44444
5    1940 64.33333
6    1950 63.70000
7    1960 66.20000
8    1970 67.00000
9    1980 67.60000
10   1990 67.50000
11   2000 66.40000
12   2010 76.25000
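Why group_by looks like a no-op yet changes the result can be seen on a toy example: the rows are untouched, but the same summarize call produces one row per group instead of one row overall. This sketch assumes dplyr is installed; the data frame toy is invented and is not the laureates data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only.
toy <- data.frame(
    decade=c("1990", "1990", "2000", "2000"),
    age=c(60, 70, 50, 90),
    stringsAsFactors=FALSE
)

# Without grouping, summarize collapses everything to one row.
overall <- toy %>% summarize(avg_age=mean(age))

# group_by leaves the rows alone but records the split...
grouped <- toy %>% group_by(decade)

# ...so the same summarize call now yields one row per group.
per_decade <- grouped %>% summarize(avg_age=mean(age))
```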

slide-34
SLIDE 34

input…

txl <- read.csv("three-percent.csv", as.is=T,
                encoding="UTF-8") # N.B.
nrow(txl)

[1] 3187

colnames(txl)

 [1] "ISBN"         "Titles"       "AuthorFN"
 [4] "AuthorLN"     "TranslatorFN" "TranslatorLN"
 [7] "Publisher"    "Genre"        "Price"
[10] "Month"        "Year"         "Lanuage"
[13] "Country"      "OtherFN"      "OtherLN"
[16] "Other.Role"

slide-35
SLIDE 35

txl <- txl %>%
    filter(Year < 2015) %>%
    select(ISBN, Year, Country, Genre, Publisher,
           Language=Lanuage)

slide-36
SLIDE 36

split, apply, combine

txl %>%
    group_by(Year, Genre) %>%
    summarize(count=n())

Source: local data frame [14 x 3]
Groups: Year

   Year   Genre count
1  2008 Fiction   278
2  2008  Poetry    82
3  2009 Fiction   291
4  2009  Poetry    72
5  2010 Fiction   265
6  2010  Poetry    75
7  2011 Fiction   303
8  2011  Poetry    67
9  2012 Fiction   387
10 2012  Poetry    73
11 2013 Fiction   448
12 2013  Poetry    93
13 2014 Fiction   494
14 2014  Poetry    93

slide-37
SLIDE 37

split, apply, combine (more than one row)

▶ top_n(x, n, wt): top n rows of x by wt

lang_counts <- txl %>%
    group_by(Year, Language) %>%
    summarize(count=n())
win_place_langs <- lang_counts %>%
    top_n(2, count) %>%
    arrange(Year, desc(count))
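In dplyr 1.0.0 and later, top_n is superseded by slice_max, which returns the selected rows already ordered within each group. A hedged sketch of both spellings on invented counts follows; toy_counts and its values are hypothetical, not the three-percent data, and the sketch assumes a dplyr version that provides slice_max.

```r
library(dplyr)

# Hypothetical toy counts, for illustration only.
toy_counts <- data.frame(
    Year=c(2008, 2008, 2008, 2009, 2009, 2009),
    Language=c("A", "B", "C", "A", "B", "C"),
    count=c(60, 48, 30, 54, 62, 40),
    stringsAsFactors=FALSE
)

# The slide's approach: top_n within each year, then sort.
with_top_n <- toy_counts %>%
    group_by(Year) %>%
    top_n(2, count) %>%
    arrange(Year, desc(count)) %>%
    ungroup()

# The newer spelling: slice_max within each year.
with_slice_max <- toy_counts %>%
    group_by(Year) %>%
    slice_max(count, n=2) %>%
    ungroup()
```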

slide-38
SLIDE 38

win_place_langs

Source: local data frame [14 x 3]
Groups: Year

   Year Language count
1  2008   French    60
2  2008  Spanish    48
3  2009  Spanish    62
4  2009   French    54
5  2010   French    60
6  2010  Spanish    52
7  2011   French    63
8  2011  Spanish    50
9  2012   French    68
10 2012  Spanish    58
11 2013   French    90
12 2013  Spanish    71
13 2014   French   106
14 2014   German    85

slide-39
SLIDE 39

variety

txl %>%
    group_by(Publisher) %>%
    summarize(langs=n_distinct(Language)) %>%
    arrange(desc(langs)) %>%
    top_n(5, langs)

Source: local data frame [5 x 2]

       Publisher langs
1 Dalkey Archive    26
2    Archipelago    21
3    Open Letter    20
4          Knopf    17
5    Other Press    17