
Literary Data: Some Approaches, Andrew Goldstone



  1. Literary Data: Some Approaches Andrew Goldstone http://www.rci.rutgers.edu/~ag978/litdata March 12, 2015. Higher-order functions and dplyr.

  2. it depends
     ▶ every program has dependencies
       ▶ software packages (library)
       ▶ data files (readLines, read.csv, scan, dir…)
     ▶ good programs document their dependencies clearly at the start
     ▶ nice programs allow their users to meet dependencies in a controlled fashion

     Which is better as a file dependency:
     ▶ "/Users/agoldst/jockers/data/plainText/austen.txt"
     ▶ "austen.txt"
     ▶ "../../../../data/plainText/austen.txt"

  3. file system, once and for all
     ▶ Every R process has a working directory
     ▶ RStudio defaults to
       ▶ ~
       ▶ the project directory
       ▶ the containing directory of the file you launched RStudio to open
     ▶ Knitting starts a new R process whose working directory is the containing directory of the R markdown file

  4. portable dependencies
     ▶ all file paths relative to the working directory
     ▶ working directory set to the directory containing the program script
     ▶ working directory never subsequently modified

     Testing portability
     ▶ for console testing, start each session by setting the working directory once to the script-containing directory
     ▶ for knitting, do not modify the working directory
     ▶ read your error messages (including in knit PDFs)
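The portability rules above can be sketched in a few lines of base R. This is only an illustration: the `data/plainText` layout and the helper `check_dependency` are hypothetical names, not part of the course materials.

```r
# Build the path relative to the working directory instead of hard-coding
# an absolute path. "data/plainText" is a hypothetical layout.
data_dir <- file.path("data", "plainText")
austen_path <- file.path(data_dir, "austen.txt")

# Hypothetical helper: report a missing dependency together with the
# working directory R is actually using, then carry on.
check_dependency <- function (path) {
    if (!file.exists(path)) {
        message("Missing file: ", path,
                " (working directory: ", getwd(), ")")
        return(FALSE)
    }
    TRUE
}

ok <- check_dependency(austen_path)
```

Printing the working directory alongside the missing path makes the usual failure (a script run from the wrong directory) easy to spot in the console or in a knit log.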

  5. ha ha ha, character encoding
     ef bb bf 54 68 65 20 50 72 6f 6a 65 63 74 20 47 . . .
     (ef bb bf is the UTF-8 byte-order mark; the remaining bytes spell "The Project G")

     Moral
         scan(..., encoding="UTF-8")
         read.csv(filename, as.is=T, ..., encoding="UTF-8")
         readLines(filename, encoding="UTF-8")
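To see what the byte listing on this slide encodes, the bytes can be decoded by hand in base R. This is a small sketch, not part of the original deck; the variable names are made up for illustration.

```r
# The bytes shown on the slide, as a raw vector.
bytes <- as.raw(c(0xef, 0xbb, 0xbf,
                  0x54, 0x68, 0x65, 0x20, 0x50, 0x72, 0x6f,
                  0x6a, 0x65, 0x63, 0x74, 0x20, 0x47))

# The first three bytes are the UTF-8 byte-order mark.
has_bom <- identical(bytes[1:3], as.raw(c(0xef, 0xbb, 0xbf)))

# Drop the mark and decode the rest; these particular bytes are plain ASCII.
text <- rawToChar(bytes[-(1:3)])
text
```

The three leading bytes are invisible in most editors, which is why a file that looks fine can still produce a strange first line when it is read without attention to encoding.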


  7. first class
     ▶ function definitions look like assignments because they are
     ▶ function (...) { ... } is a value like any other

  8. function as parameter
         bind <- function (x, f) { f(x) }
         bind(c(1, 2, 3), sum)
         [1] 6
         twice <- function (s) { str_c(s, s) }
         bind("ha", twice)
         [1] "haha"

  9. funny function
         `%p%` <- function (x, y) { x + y }
         100 %p% 200
         [1] 300
         `%b%` <- function (x, f) { f(x) }
         "ha" %b% twice
         [1] "haha"


  11. anonymous function
          bind("parenthetical", function (s) { str_c("(", s, ")") } )
          [1] "(parenthetical)"

  12. map <- function (f, xs) {
          result <- list()
          for (j in seq_along(xs)) {
              result[[j]] <- f(xs[[j]])
          }
          result
      }
      map(twice, c("well", "now", "no"))
      [[1]]
      [1] "wellwell"

      [[2]]
      [1] "nownow"

      [[3]]
      [1] "nono"

  13. pos <- function (x) (x > 0)
      filter_vector <- function (f, xs) {
          result <- c()
          for (x in xs) {
              if (f(x)) {
                  result <- c(result, x)
              }
          }
          result
      }
      filter_vector(pos, (-5):5)
      [1] 1 2 3 4 5

  14. curry
          map_f <- function (f) {
              function (xs) {
                  result <- list()
                  for (j in seq_along(xs)) {
                      result[[j]] <- f(xs[[j]])
                  }
                  result
              }
          }
          map_f(twice)(c("well", "now", "no"))
          [[1]]
          [1] "wellwell"

          [[2]]
          [1] "nownow"

          [[3]]
          [1] "nono"

  15. ▶ how would you write filter_f ?
          filter_f <- function (f) {
              function (xs) {
                  result <- c()
                  for (x in xs) {
                      if (f(x)) {
                          result <- c(result, x)
                      }
                  }
                  result
              }
          }
          filter_f(pos)(c(-1, 1))
          [1] 1


  17. built-in
          lapply(lst, f) # same as map(f, lst)
          lapply(lst, f, x, y, ...) # ... passed on to f:
          # inside the for loop: result[[j]] <- f(xs[[j]], x, y, ...)
      ▶ sapply (returns a vector if possible)
      ▶ apply (iterate over rows/columns of a matrix)
      ▶ tapply (iterate over groups identified by a factor)
      ▶ what a mess!
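A short base-R sketch of how the built-ins line up with the hand-written map and filter_vector from the earlier slides. Here `twice` uses `paste0` as a stand-in for `str_c` so that the example needs no packages; that substitution is an assumption of this sketch, not the deck's code.

```r
pos <- function (x) (x > 0)
twice <- function (s) paste0(s, s)   # base-R stand-in for str_c(s, s)

# lapply always returns a list, like the hand-written map
doubled_list <- lapply(c("well", "now", "no"), twice)

# sapply simplifies the result to a vector when it can
doubled_vec <- sapply(c("well", "now", "no"), twice, USE.NAMES=FALSE)

# Filter is the built-in counterpart of filter_vector
positives <- Filter(pos, (-5):5)

# extra arguments after the function are passed on to it
padded <- sapply(c("a", "b"), paste0, "!", USE.NAMES=FALSE)
```

Note the argument order: the built-in `lapply` and `sapply` take the data first and the function second, while `Filter` takes the function first, which is part of the mess the slide complains about.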

  18. dplyr: split-apply-combine
      ▶ split a data frame up into pieces (rows, groups of rows…)
      ▶ do something to each piece
      ▶ put together the result

  19. No new functionality (but…)
          select       column indexing by name (or $)
          filter       logical row subscripting
          mutate       expressions in terms of columns
          summarize    for loops, table, sum, mean…
          arrange      order

  20. surnames <- select(laureates, surname)
      head(surnames) # data type!
             surname
      1      Modiano
      2        Munro
      3          Yan
      4  Tranströmer
      5 Vargas Llosa
      6       Müller
      women <- filter(laureates, gender=="female")
      women$surname[1:3]
      [1] "Munro"   "Müller"  "Lessing"

  21. (notionally)
          women <- filter(laureates, function (laur_row) {
              laur_row$gender == "female"
          })

  22. modularity, but…
          women_surnames <- select(filter(laureates, gender=="female"),
                                   surname, year) # ugh

  23. curiouser and curiouser
          women_last <- laureates %>%
              filter(gender=="female") %>%
              select(surname, year)
          women_last[1:3, ]
            surname year
          1   Munro 2013
          2  Müller 2009
          3 Lessing 2007

  24. “pipe”
          x %>% f # equivalent to...
          x %b% f # or...
          f(x)
          x %>% f(y, z) # equivalent to...
          f(x, y, z) # wow!?
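The idea behind the pipe can be tried out without loading any package, using the `%b%` operator defined on an earlier slide. The helper `pipeline` below is a hypothetical name introduced for this sketch; it is not how `%>%` is implemented, only a plain higher-order function with a similar left-to-right feel.

```r
# %b% from the earlier slide: apply the function on the right
# to the value on the left
`%b%` <- function (x, f) { f(x) }

# Chaining %b% reads left to right, like a pipeline
chained <- c(1, 4, 9) %b% sqrt %b% sum

# The same computation as ordinary nested calls, read inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Hypothetical helper: feed x through a list of functions in order
pipeline <- function (x, fs) {
    Reduce(function (acc, f) f(acc), fs, x)
}
piped <- pipeline(c(1, 4, 9), list(sqrt, sum))
```

What `%b%` cannot do is the second form on the slide, `x %>% f(y, z)`, where the left-hand value is slotted in as the first argument of a call that already has other arguments.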

  25. a special function
          laureates %>%
              filter(gender=="female") %>%
              summarize(total=length(surname)) # think about it...
            total
          1    12
          laureates %>%
              filter(gender=="female") %>%
              summarize(total=n())
            total
          1    12


  27. laureates %>%
          filter(!is.na(diedCountryCode)) %>%
          filter(bornCountryCode != diedCountryCode) %>%
          summarize(n_exiles=n())
        n_exiles
      1       34

  28. f_sws <- laureates %>%
          filter(gender=="female" | bornCountry=="Sweden") %>%
          select(firstname, surname) %>%
          slice(1:5)
      f_sws
        firstname     surname
      1     Alice       Munro
      2     Tomas Tranströmer
      3     Herta      Müller
      4     Doris     Lessing
      5  Elfriede     Jelinek
      f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" "))
        firstname     surname          fullname
      1     Alice       Munro       Alice Munro
      2     Tomas Tranströmer Tomas Tranströmer
      3     Herta      Müller      Herta Müller
      4     Doris     Lessing     Doris Lessing
      5  Elfriede     Jelinek  Elfriede Jelinek

  29. f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
          select(-firstname) # minus!!
            surname          fullname
      1       Munro       Alice Munro
      2 Tranströmer Tomas Tranströmer
      3      Müller      Herta Müller
      4     Lessing     Doris Lessing
      5     Jelinek  Elfriede Jelinek

  30. f_sws %>%
          mutate(fullname=str_c(firstname, surname, sep=" ")) %>%
          arrange(surname) %>%
          select(fullname)
                 fullname
      1  Elfriede Jelinek
      2     Doris Lessing
      3       Alice Munro
      4      Herta Müller
      5 Tomas Tranströmer

  31. laur_ages <- laureates %>%
          filter(born != "0000-00-00") %>% # Mo Yan
          mutate(born_year=as.numeric(str_sub(born, 1, 4))) %>%
          mutate(age=year - born_year) %>%
          mutate(decade=str_c(str_sub(year, 1, 3), 0)) %>%
          select(surname, year, decade, age)
      laur_ages %>% slice(1:3)
            surname year decade age
      1     Modiano 2014   2010  69
      2       Munro 2013   2010  82
      3 Tranströmer 2011   2010  80

  32. split
          laur_ages %>%
              group_by(decade) # no-op?
          Source: local data frame [106 x 4]
          Groups: decade

                  surname year decade age
          1       Modiano 2014   2010  69
          2         Munro 2013   2010  82
          3   Tranströmer 2011   2010  80
          4  Vargas Llosa 2010   2010  74
          5        Müller 2009   2000  56
          6     Le Clézio 2008   2000  68
          7       Lessing 2007   2000  88
          8         Pamuk 2006   2000  54
          9        Pinter 2005   2000  75
          10      Jelinek 2004   2000  58
          ..          ...  ...    ... ...

  33. apply-combine
          laur_ages %>%
              group_by(decade) %>%
              summarize(avg_age=mean(age))
          Source: local data frame [12 x 2]

             decade  avg_age
          1    1900 64.11111
          2    1910 58.87500
          3    1920 60.10000
          4    1930 56.44444
          5    1940 64.33333
          6    1950 63.70000
          7    1960 66.20000
          8    1970 67.00000
          9    1980 67.60000
          10   1990 67.50000
          11   2000 66.40000
          12   2010 76.25000
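For comparison, here is a rough base-R version of the same split, apply, combine steps, tying the pattern back to `lapply` and `tapply`. It swaps dplyr for base functions. The toy data frame holds only the ten decade and age values printed on the `group_by` slide, so the 2000s average here covers six laureates rather than the full decade; the 2010s average uses all four rows.

```r
# Decade and age for the ten rows printed on the group_by slide
ages <- data.frame(
    decade=c(rep("2010", 4), rep("2000", 6)),
    age=c(69, 82, 80, 74, 56, 68, 88, 54, 75, 58),
    stringsAsFactors=FALSE
)

# split: one vector of ages per decade
pieces <- split(ages$age, ages$decade)

# apply: do something to each piece
means <- lapply(pieces, mean)

# combine: put the results back into a data frame
avg_age <- data.frame(decade=names(means),
                      avg_age=unlist(means),
                      row.names=NULL)

# tapply does all three steps in one call
avg_age_tapply <- tapply(ages$age, ages$decade, mean)
```

The three explicit steps are what `group_by` followed by `summarize` packages up, with the advantage that the dplyr result stays a data frame throughout.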

  34. input…
          txl <- read.csv("three-percent.csv", as.is=T,
                          encoding="UTF-8") # N.B.
          nrow(txl)
          [1] 3187
          colnames(txl)
          [output: 16 column names; their printed order is not recoverable from the transcript]
          "ISBN" "Titles" "AuthorFN" "AuthorLN" "TranslatorFN" "TranslatorLN"
          "Publisher" "OtherFN" "OtherLN" "Month" "Price" "Genre"
          "Year" "Lanuage" "Country" "Other.Role"

  35. txl <- txl %>%
          filter(Year < 2015) %>%
          select(ISBN, Year, Country, Genre, Publisher,
                 Language=Lanuage)

  36. split, apply, combine
          txl %>%
              group_by(Year, Genre) %>%
              summarize(count=n())
          Source: local data frame [14 x 3]
          Groups: Year

             Year   Genre count
          1  2008 Fiction   278
          2  2008  Poetry    82
          3  2009 Fiction   291
          4  2009  Poetry    72
          5  2010 Fiction   265
          6  2010  Poetry   [73 or 93]
          7  2011 Fiction   303
          8  2011  Poetry    67
          9  2012 Fiction   387
          10 2012  Poetry   [93 or 73]
          11 2013 Fiction   448
          12 2013  Poetry    93
          13 2014 Fiction   494
          14 2014  Poetry    75
          [the counts 73 and 93 belong to rows 6 and 10; which goes with which is not recoverable from the transcript]
