Counting for Humanists Andrew Goldstone (http://andrewgoldstone.com)



SLIDE 1

Counting for Humanists

Andrew Goldstone (http://andrewgoldstone.com) Wednesday, April 30, 2014

SLIDE 2

shall we count?

Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity…. Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.

James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.

SLIDE 4

what shall we count?

favorite author      female (%)   male (%)
Stephen King               17.5       35.9
Wilbur Smith                3.0       23.5
Agatha Christie            11.0        7.2
Danielle Steel             13.0        0.3
Jeffrey Archer              8.1        9.1
Virginia Andrews           11.9        0.8
Catherine Cookson          11.0        0.9
Sidney Sheldon              3.7        3.1
Bryce Courtenay             3.2        2.7
Tom Clancy                  1.5       11.6

Table: Australian readers’ favorite authors, by gender, from Tony Bennett et al., Accounting for Tastes (Cambridge UP, 1999), 151.

SLIDE 5

Figure 6: Book imports into India, 1850 to 1900, in thousands of pounds sterling (y axis from 50 to 300). [Chart not reproduced.]

Source: Priya Joshi, In Another Country: Colonialism, Culture, and the English Novel in India, New York 2002.

Figure reprinted in Franco Moretti, “Graphs, Maps, Trees,” NLR 24 (Nov.-Dec. 2003): 75.

SLIDE 6

comma-separated values

    "firstname","surname","bornCountry"
    "Alice","Munro","Canada"
    "Mo","Yan","China"
    "Tomas","Tranströmer","Sweden"

SLIDE 7

the norms of CSV

▶ plain-text file for tabular data
▶ delimiter separates columns (usually , or a tab)
▶ newline separates rows
▶ names of columns in first row (optional)
▶ tricky bits:
    ▶ what if a data point contains a comma?
    ▶ what if a data point contains a quotation mark?
    ▶ what text-encoding should be used?
    ▶ how do you know what rules have been followed? (There is RFC 4180, but no promises.)
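
The first two tricky bits can be tried out directly in R. The following is a minimal sketch, assuming R's default quoting conventions in write.csv and read.csv; the two rows are invented for illustration, and the encoding question is left aside (read.csv does take a fileEncoding argument for that).

```r
# Hypothetical toy rows, invented to exercise the tricky cases:
# one value contains a comma, the other contains quotation marks.
tricky <- data.frame(
  name = c("Nitzkydorf, Banat", "The \"Duke\""),
  n = c(1, 2),
  stringsAsFactors = FALSE
)

# Write the frame to a temporary file and look at the raw lines.
csv_file <- tempfile(fileext = ".csv")
write.csv(tricky, csv_file, row.names = FALSE)
csv_lines <- readLines(csv_file)
csv_lines

# Reading the file back recovers the original strings.
round_trip <- read.csv(csv_file, stringsAsFactors = FALSE)
identical(round_trip$name, tricky$name)
```

In the raw lines, the value with a comma is wrapped in quotation marks, and the embedded quotation marks are doubled. Other programs may follow other rules, which is the point of the last bullet.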

SLIDE 8

people

    id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
    892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
    880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
    868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
    854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
    844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
    832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
    817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
    808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
    801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005

Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer

SLIDE 9

words

    WORDCOUNTS,WEIGHT
    the,766
    of,482
    and,305
    in,259
    to,224
    a,195
    new,101
    as,101
    that,86
    it,75

Source: a wordcounts CSV file from a http://dfr.jstor.org request.

SLIDE 10

affordances

What kinds of data can be accommodated in this format? And what can’t be?

SLIDE 12

data types: simple: numerical

▶ Whole numbers (integer scale). How many (books, people, words, genres…)?
▶ Real numbers (interval scale). How much (distance, time, money…)? Special cases:
    ▶ percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
    ▶ dates. When? (And does the day, month, year, decade, century… matter?)

SLIDE 13

data types: simple: categorical

▶ Unordered. Which of… (languages, nations, genders(?))? Special cases:
    ▶ binary or Boolean category: true or false, yes or no.
    ▶ many categories (headwords in the dictionary, authors in the catalogue).
▶ Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?

Categories to numbers

▶ true: 1, false: 0
▶ like: 1, neutral: 0, dislike: -1
▶ like: 2, neutral: 1, dislike: 0
▶ a: 1, b: 2, c: 3… (character encoding)
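
These category-to-number mappings can be sketched in R. The survey answers below are invented, and the ordering of the levels is an assumption chosen to match the “like, neutral, dislike” example.

```r
# Booleans: TRUE becomes 1 and FALSE becomes 0.
bool_codes <- as.integer(c(TRUE, FALSE))

# Hypothetical survey answers, stored as an ordered category.
answers <- c("like", "dislike", "neutral", "like")
opinion <- factor(answers,
                  levels = c("dislike", "neutral", "like"),
                  ordered = TRUE)
codes <- as.integer(opinion)   # dislike: 1, neutral: 2, like: 3
centred <- codes - 2           # dislike: -1, neutral: 0, like: 1

# Letters: position in the alphabet, and the code point R actually stores.
alphabet_pos <- match(c("a", "b", "c"), letters)
code_point <- utf8ToInt("a")
```

The numbers chosen for a category are a convention. The alphabet position of "a" is 1, but its code point in the usual character encodings is 97.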

SLIDE 14

data types: compound

The list / the series

17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5

The list of lists / the table

firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden

firstname   surname       bornCountry
Alice       Munro         Canada
Mo          Yan           China
Tomas       Tranströmer   Sweden

(more elaborate possibilities exist…)

SLIDE 15

and text?

a (looooong) list of characters (a “string”):

O, n, c, e, *space*, u, p, o, n, *space*, a, *space*, t, i, m, e

other representations

▶ the bag of words (to: 2, be: 2, or: 1, not: 1)
▶ content analysis (automated, human, or semi-automated)
▶ marked-up text

    <sp who="#Salinus"><speaker>Duke.</speaker> <p>Haplesse <name>Egeon</name> whom the fates haue markt...</p>

▶ parsed trees
▶ page images
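
Two of these representations, the list of characters and the bag of words, can be sketched in a few lines of R. Splitting on single spaces is a simplifying assumption; real text needs more careful tokenizing.

```r
phrase <- "to be or not to be"

# The string as a list of characters (spaces included).
chars <- strsplit(phrase, "")[[1]]

# The bag of words: split on spaces, then count each distinct word.
words <- strsplit(phrase, " ")[[1]]
bag <- table(words)
bag
```

The bag keeps the counts (to: 2, be: 2, or: 1, not: 1) and throws away the word order.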

SLIDE 16

programming in a nutshell

1. A program is a formal description of a process for transforming data.
2. A computer performs calculations on numbers and stores the results of those calculations.
3. If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer.
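
As a small illustration of the third point, here is a sketch of text encoded as numbers, transformed by arithmetic, and decoded again. It assumes nothing beyond base R; the word and the “add one” transformation are arbitrary choices.

```r
# Encode a string as a vector of integer code points.
codes <- utf8ToInt("Once")

# Decode the numbers back into the same text.
same_text <- intToUtf8(codes)

# A (silly) program: shift every character one place along.
shifted <- intToUtf8(codes + 1L)
```

Here codes is 79, 110, 99, 101, and shifted is "Podf".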

SLIDE 19

the R experience

The console

You type an expression, R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk).

The script

You prepare a list of expressions in a file, and R figures out their values one by one.

SLIDE 20

first steps in the console

R is a parrot

    2
    "Shiver me timbers"

R gets crabby easily

    Shiver
    Shiver me timbers
    help (
    "Shiver

Press ESC.

SLIDE 21

some hidden features

▶ history navigation with up and down arrows (or RStudio History pane)
▶ tab completion
▶ help: help("sqrt") or ?sqrt

SLIDE 22

R data kinds (“modes”)

Numbers

Whole, integer, real, ratio… (complex too)

Strings

    "Avast"
    "\"Avast,\" he said"
    "Beware the \\"

Represent a newline with \n and a tab with \t.

Booleans

TRUE and FALSE or T and F for short

Factors

(For categorical data: more later)
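
R can report the kind of any value. A minimal sketch, using only base R functions; note that R's own name for the string mode is "character" and for the Boolean mode is "logical".

```r
number_mode <- mode(2)                # "numeric"
string_mode <- mode("Avast")          # "character"
boolean_mode <- mode(TRUE)            # "logical"
factor_class <- class(factor("a"))    # "factor"

# Escapes are typed with a backslash but stored as single characters.
quoted <- "\"Avast,\" he said"
quoted_length <- nchar(quoted)        # counts the quotation marks, not the backslashes

# cat() prints the string itself, turning \t and \n into a tab and a newline.
cat("one\ttwo\nthree\n")
```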

SLIDE 23

Rithmetic

Try:

    2 * 2
    5/7
    TRUE | FALSE
    TRUE & FALSE
    T & T
    !FALSE
    !TRUE
    4 == 3
    !(4 == 3)
    4 != 3
    1 < 5

SLIDE 24

R functions

Functions map inputs to outputs. Describe these:

    sqrt(4)
    nchar("Munro")
    paste("Alice", "Munro")

Functions in R can have named parameters as well. Experiment with:

    paste("Munro", "Alice", sep = ", ")
    paste("Munro", "Alice", sep = "")

SLIDE 25

assignment

<- stores a value under a name which you can refer to (or change) later.

    x <- 108
    x
    x + 2
    storage <- 10
    storage <- storage - 10
    My_Perfectly_Good_Name2012 <- "Mo Yan"

SLIDE 26

R compound data types

vectors (for a series of values)

Construct a vector with the special function c (concatenate):

    xs <- c(2, 4, 8)
    xs
    bs <- c(T, F, T)
    bs
    people <- c("Munro", "Mo", "Transtromer")
    people
    c(people, "Vargas Llosa")

SLIDE 27

subscripting

Choose an element or elements from a vector with []:

    xs[2]
    people[1]

sequences

    1:3
    c(1:3, 6:8)

What is the value of these expressions?

    people[1:2]
    people[c(1, 3)]

logical subscripting

    people[bs]

SLIDE 28

vector operations

    c(1, 3, 5) + c(2, 4, 6)
    paste(c("a", "b"), c("c", "d"))
    c(T, F, F) | c(F, T, F)
    c("a", "b", "c") %in% c("b", "c", "d", "e")

SLIDE 29

recycling

    c(1, 3, 5) + 1
    paste("The", c("beginning", "end"))
    c(1, 3, 5) == 3
    xs
    choice <- xs > 3
    choice
    xs[choice]

What does Boolean-vector subscripting express?
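
One way to read Boolean-vector subscripting is as a filter: keep the elements for which a condition holds. A minimal sketch, reusing the xs vector from the earlier slides:

```r
xs <- c(2, 4, 8)

# The comparison is recycled across xs, giving one TRUE or FALSE per element.
is_big <- xs > 3

# Subscripting by that vector keeps only the TRUE positions.
big_values <- xs[is_big]

# Related questions: where are the matches, and how many are there?
big_positions <- which(is_big)
big_count <- sum(is_big)   # TRUE counts as 1, FALSE as 0
```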

SLIDE 30

factors

A special type for categorical data, normally made out of strings:

    nationalities <- c("American", "Canadian", "French", "French", "Chinese", "American")
    nat_fact <- factor(nationalities)
    nat_fact
    nat_fact[1]
    nat_fact[3:4]
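
What the factor adds to the plain strings can be seen by looking inside it. A minimal sketch, using the same nationalities vector and only base R:

```r
nationalities <- c("American", "Canadian", "French", "French", "Chinese", "American")
nat_fact <- factor(nationalities)

# The distinct categories, in sorted order.
nat_levels <- levels(nat_fact)

# Each element is stored as an integer code pointing into the levels.
nat_codes <- as.integer(nat_fact)

# Counting by category.
nat_counts <- table(nat_fact)
```

The factor stores each category name once and an integer code for every element.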

SLIDE 31

the data frame

A list of vectors not necessarily of the same type, but all of the same length:

    laureates <- data.frame(
        firstname=c("Alice","Mo","Tomas"),
        surname=c("Munro","Yan","Tranströmer"),
        bornCountry=c("Canada","China","Sweden"),
        age_now=c(82,59,83))
    laureates
    laureates$surname # Levels?? (R 4.0.0 and later default to stringsAsFactors=FALSE, so no levels appear there)
    laureates <- data.frame(
        firstname=c("Alice","Mo","Tomas"),
        surname=c("Munro","Yan","Tranströmer"),
        bornCountry=c("Canada","China","Sweden"),
        age_now=c(82,59,83),
        stringsAsFactors=F)

SLIDE 32

indexing by row and column

    laureates[1, 1]
    laureates[1, 2]
    laureates[1, "firstname"]
    laureates[2, "surname"]
    laureates[3, c("firstname", "surname")]

Exercise

Write a single expression in terms of laureates to produce the full name of Canada’s laureate.

SLIDE 33

omitted indices

    laureates[3, ]
    laureates[, 2]
    laureates[, c("surname", "bornCountry")]
    laureates[c(T, F, T), ]

A shorthand

    laureates[, "surname"]
    laureates$surname
    laureates$surname[2]

SLIDE 34

getting a real dataframe

    laureates <- read.csv("laureates.csv", stringsAsFactors=F)
    laureates # Scroll up!
    laureates$surname

properties of the frame

    names(laureates)
    nrow(laureates)

SLIDE 35

the logic of the query

    laureates$bornCountry == "Sweden"
    swedes <- laureates$bornCountry == "Sweden"
    laureates$surname[swedes]
    laureates[swedes, ]
    women <- laureates$gender == "female"
    laureates[women, ]
    laureates[women & swedes, ]
    laureates[women | swedes, ]
    laureates$surname[women & !swedes]

SLIDE 36

exercise

Write an expression whose value is a dataframe containing the names and prize-years of all the laureates who died in a country other than the country of their birth.

SLIDE 37

exiles and émigrés

    laureates[laureates$bornCountryCode != laureates$diedCountryCode,
              c("surname","year")]
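
One thing to check in the result: if a living laureate's diedCountryCode is read in as an empty string, it differs from every birth country, so living laureates pass the test too. The sketch below shows one possible refinement. It uses a three-row toy frame built from rows shown on the “people” slide, and it assumes empty fields arrive as "" (as read.csv leaves them by default for text columns).

```r
# Toy frame: three rows taken from the "people" slide.
toy <- data.frame(
  surname = c("Munro", "Lessing", "Pinter"),
  year = c(2013, 2007, 2005),
  bornCountryCode = c("CA", "IR", "UK"),
  diedCountryCode = c("", "UK", "UK"),
  stringsAsFactors = FALSE
)

# The comparison alone also selects Munro, whose death fields are empty.
moved <- toy$bornCountryCode != toy$diedCountryCode
with_living <- toy[moved, c("surname", "year")]

# Requiring a non-empty death country leaves only those who died elsewhere.
has_died <- toy$diedCountryCode != ""
exiles <- toy[has_died & moved, c("surname", "year")]
```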

SLIDE 38

counting

    table(c("a", "b", "a", "c", "b"))

…and division

    table(laureates$bornCountryCode)
    table(laureates$bornCountryCode)/nrow(laureates) * 100

Exercise

Write an expression for a tabulation of the number of men and women to win the Nobel in literature.

SLIDE 39

cross-tabulation

    table(laureates$bornCountryCode, laureates$gender)

Sorting

    laureate_countries <- table(laureates$bornCountryCode)
    sort(laureate_countries)
    sort(laureate_countries, decreasing = T)

Exercise

Write an expression for the top three countries-of-death of the Nobel laureates.

SLIDE 40

we’ll always have…

    sort(table(laureates$diedCountry), decreasing = T)[1:4]

SLIDE 41

messier data

Metadata for every item in the TEI-encoded Poetry and Crisis from the Modernist Journals Project: from http://sourceforge.net/projects/mjplab/files/.

    readLines("Poetry_2.everytitle.txt", n = 4)

!?!!! After consulting help(read.csv)…

    poetry_titles <- read.table("Poetry_2.everytitle.txt", sep = "|",
        strip.white = T, stringsAsFactors = F, quote = "", header = T)
    crisis_titles <- read.table("Crisis_2.everytitle.txt", sep = "|",
        strip.white = T, stringsAsFactors = F, quote = "", header = T)

SLIDE 42

a comparison

overall proportions

    table(poetry_titles$genre)/nrow(poetry_titles)
    table(crisis_titles$genre)/nrow(crisis_titles)

combine and recount

    mags <- rbind(poetry_titles, crisis_titles)
    table(mags$genre, mags$journal.title)
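
An alternative for the proportions is base R's prop.table(), which divides a table by its own total, so each journal's counts are always scaled by that journal's number of items. The genre vectors below are invented stand-ins for the two genre columns.

```r
# Hypothetical genre labels for two journals of different sizes.
genres_a <- c("poetry", "poetry", "articles", "fiction")
genres_b <- c("articles", "articles", "poetry", "articles",
              "fiction", "articles", "poetry", "articles")

# Each table is divided by its own total.
props_a <- prop.table(table(genres_a))
props_b <- prop.table(table(genres_b))

# Same result as dividing by the length by hand.
by_hand_b <- table(genres_b) / length(genres_b)
```

Because each set of proportions sums to 1, the two journals can be compared even though one has twice as many items.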

SLIDE 43

who’s in both?

    poetry_in_crisis <- poetry_titles$creator %in% crisis_titles$creator
    shared_auths <- poetry_titles$creator[poetry_in_crisis]
    unique(shared_auths)

Whoops!

    shared_auths <- shared_auths[shared_auths != "" & shared_auths != "Anonymous"]
    mags_shared <- mags[mags$creator %in% shared_auths, ]
    table(mags_shared$journal.title, mags_shared$genre, mags_shared$creator)

SLIDE 44

from tables back to data frames

    laur_country_tab <- table(laureates$bornCountryCode)
    laureate_countries <- as.data.frame(laur_country_tab)
    names(laureate_countries)
    names(laureate_countries) <- c("country", "count")
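
The renaming step makes more sense once the default column names are visible. A minimal sketch on a hypothetical six-row frame; the country codes are invented stand-ins for laureates$bornCountryCode.

```r
# Hypothetical stand-in for the laureates frame.
toy <- data.frame(code = c("CA", "CN", "SE", "SE", "CA", "SE"),
                  stringsAsFactors = FALSE)

tab <- table(toy$code)
counts <- as.data.frame(tab, stringsAsFactors = FALSE)
default_names <- names(counts)        # "Var1" "Freq"

# Give the columns meaningful names, then sort by count, largest first.
names(counts) <- c("country", "count")
counts_sorted <- counts[order(counts$count, decreasing = TRUE), ]
```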

SLIDE 45

visualization, grammatically

A visualization transforms data inputs into graphical outputs (sound familiar?). A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.

    library("ggplot2")

SLIDE 46

making a point (plot)

data: translations published in US, year by year

Source: UNESCO Index Translationum

    us_tx <- read.csv("us-trans.csv")

1. years on x axis
2. counts on y axis
3. what to draw?
    ▶ point for each yearly entry
    ▶ line connecting
    ▶ shaded-in area

SLIDE 47

the code

    qplot(x=year,           # aesthetics (mapping)
          y=translations,
          geom="point",     # geometry (shape)
          data=us_tx)       # data source
    qplot(x=year,y=translations,
          group=1,          # special aesthetic: "1 line"
          geom="line",      # geometry (shape)
          data=us_tx)
    qplot(x=year,y=translations,geom="area",
          data=us_tx)
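
The same three plots can also be written with ggplot2's fuller ggplot() interface, which spells out the data, the aesthetic mapping, and the geometry as separate pieces; recent ggplot2 releases mark qplot() as deprecated in its favor. This is a sketch only: us_tx_toy and its numbers are made up, standing in for the us-trans.csv data.

```r
library(ggplot2)

# Made-up numbers standing in for the real translation counts.
us_tx_toy <- data.frame(year = 2000:2004,
                        translations = c(10, 12, 9, 15, 14))

# Data and aesthetic mapping are declared once...
base_plot <- ggplot(us_tx_toy, aes(x = year, y = translations))

# ...and each geometry is added as a layer.
p_point <- base_plot + geom_point()
p_line <- base_plot + geom_line()
p_area <- base_plot + geom_area()

# At the console, typing p_point (or print(p_point)) draws the plot.
```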

SLIDE 48

arbitrary mappings

1. countries on x axis, in alphabetical order
2. laureate count on y axis
3. point for each country

    qplot(x=country,y=count,geom="point",
          data=laureate_countries)
    qplot(x=country,y=count,
          geom="bar",        # bars but:
          stat="identity",   # don't tally y var.
          data=laureate_countries)
    qplot(x=bornCountryCode,
          geom="bar",
          data=laureates)    # bars, and do tally
    sorted_countries <- laureate_countries[order(laureate_countries$count),]
    qplot(x=count,geom="bar",binwidth=1,
          data=sorted_countries)
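
Alphabetical order on the x axis is itself an arbitrary choice. Below is a sketch of bars drawn from already-tallied counts and ordered by size instead, written with ggplot() because newer ggplot2 versions may ignore the stat argument to qplot(). The three-row counts frame is hypothetical.

```r
library(ggplot2)

# Hypothetical tallies standing in for laureate_countries.
counts <- data.frame(country = c("CA", "CN", "SE"),
                     count = c(1, 1, 3))

# reorder() sorts the countries by descending count for the x axis;
# stat = "identity" uses the counts as given instead of tallying rows.
bars <- ggplot(counts, aes(x = reorder(country, -count), y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "country")
```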

SLIDE 49

dates

Consider mags. What type is mags$date?

    poetry_articles <- poetry_titles[poetry_titles$genre == "articles", ]
    art_series <- as.data.frame(table(poetry_articles$date))
    names(art_series) <- c("date", "count")
    art_series$date <- as.Date(art_series$date)
    qplot(x = date, y = count, group = 1, geom = "line", data = art_series)
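
Why convert with as.Date() at all? A string that looks like a date is still just text; a Date can be compared, subtracted, and placed on a continuous axis. A minimal sketch with two invented dates in the default year-month-day format:

```r
# Invented example dates, written as strings.
date_strings <- c("1912-10-01", "1913-01-01")
string_class <- class(date_strings)   # "character"

dates <- as.Date(date_strings)
date_class <- class(dates)            # "Date"

# Dates support arithmetic and comparison...
gap_in_days <- as.numeric(diff(dates))
is_later <- dates[2] > dates[1]

# ...and can be reduced to a coarser unit, such as the year.
years <- format(dates, "%Y")
```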

SLIDE 50

construct the data you want to plot

    genre_series <- as.data.frame(table(mags$date,
        mags$genre,mags$journal.title))
    names(genre_series) <- c("date","genre","journal", "count")
    genre_series$date <- as.Date(genre_series$date)
    qplot(x=date,y=count,color=genre,geom="point",
          data=genre_series)
    qplot(x=date,y=count,color=genre,group=genre,
          geom="line",data=genre_series)
    qplot(x=date,y=count,fill=genre,geom="bar",
          stat="identity",position="stack",data=genre_series)

SLIDE 51

small multiples

    qplot(x=date,y=count,group=genre,
          facets=genre ~ journal,geom="bar",
          stat="identity",data=genre_series)
    qplot(x=date,y=count,group=genre,
          facets= ~ journal,geom="bar",
          stat="identity",data=genre_series)

Exercise

Generate either overlaid or small-multiples plots of the time series of genres in Poetry.

SLIDE 52

counting on

Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3.
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3.
Wilkinson, Leland. The Grammar of Graphics. 2nd ed. Springer, 2005. http://link.springer.com/book/10.1007/0-387-28695-0.
Online documentation for ggplot2. http://docs.ggplot2.org/.