Counting for Humanists
Andrew Goldstone (http://andrewgoldstone.com)
Wednesday, April 30, 2014
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
shall we count?
Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity…. Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.
James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
what shall we count?
favorite author      female (%)   male (%)
Stephen King            17.5        35.9
Wilbur Smith             3.0        23.5
Agatha Christie         11.0         7.2
Danielle Steel          13.0         0.3
Jeffrey Archer           8.1         9.1
Virginia Andrews        11.9         0.8
Catherine Cookson       11.0         0.9
Sidney Sheldon           3.7         3.1
Bryce Courtenay          3.2         2.7
Tom Clancy               1.5        11.6
Table: Australian readers’ favorite authors, by gender, from Tony Bennett et al., Accounting for Tastes (Cambridge UP, 1999), 151.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Figure 6: Book imports into India
[Line plot: years 1850 to 1900 on the x axis; thousands of pounds sterling, 50 to 300, on the y axis.]
Source: Priya Joshi, In Another Country: Colonialism, Culture, and the English Novel in India, New York 2002.
Figure reprinted in Franco Moretti, “Graphs, Maps, Trees,” NLR 24 (Nov.-Dec. 2003): 75.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
comma-separated values
"firstname","surname","bornCountry"
"Alice","Munro","Canada"
"Mo","Yan","China"
"Tomas","Tranströmer","Sweden"
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the norms of CSV
▶ plain-text file for tabular data
▶ delimiter separates columns (usually , or a tab)
▶ newline separates rows
▶ names of columns in first row (optional)
▶ tricky bits:
  ▶ what if a data point contains a comma?
  ▶ what if a data point contains a quotation mark?
  ▶ what text-encoding should be used?
  ▶ how do you know what rules have been followed? (There is RFC 4180, but no promises.)
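The first two tricky bits can be tried out directly in R. This is a minimal sketch, assuming the file follows the common RFC 4180 convention (a field containing a comma is wrapped in double quotes, and a quotation mark inside a quoted field is doubled). The `note` column and its contents are invented for illustration.

```r
# A tiny CSV held in a string instead of a file.
# The "note" column is made up to show the two tricky cases.
csv_text <- 'surname,bornCity,note
Pinter,London,plain field
Vargas Llosa,Arequipa,"a field, with a comma"
Lessing,Kermanshah,"a field with ""quotation marks"""'

# read.csv() treats the comma as delimiter and the double quote as quoting character
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

nrow(people)    # three data rows; the first line supplied the column names
people$note     # the comma and the quotation marks survive inside their fields
```

A file written under different rules (another delimiter, another escape style, another encoding) would need other arguments, which is the point of "no promises."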
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
people
id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005
Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
words
WORDCOUNTS,WEIGHT
the,766
of,482
and,305
in,259
to,224
a,195
new,101
as,101
that,86
it,75
Source: a wordcounts CSV file from a http://dfr.jstor.org request.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
affordances
What kinds of data can be accommodated in this format? And what can’t be?
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
data types: simple: numerical
▶ Whole numbers (integer scale). How many (books, people, words, genres…)?
▶ Real numbers (interval scale). How much (distance, time, money…)?
Special cases:
  ▶ percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
  ▶ dates. When? (And does the day, month, year, decade, century… matter?)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
data types: simple: categorical
▶ Unordered. Which of… (languages, nations, genders(?))?
Special cases:
  ▶ binary or Boolean category: true or false, yes or no.
  ▶ many categories (headwords in the dictionary, authors in the catalogue).
▶ Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?
Categories to numbers
▶ true: 1, false: 0
▶ like: 1, neutral: 0, dislike: -1
▶ like: 2, neutral: 1, dislike: 0
▶ a: 1, b: 2, c: 3… (character encoding)
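The category-to-number codings listed above can be sketched in R. This is one possible illustration, not the only way to do it. The `opinions` vector and the `coding` lookup are invented for the example.

```r
# Booleans already have a numeric reading
as.integer(c(TRUE, FALSE))            # 1 0

# like / neutral / dislike coded as 1 / 0 / -1 through a named lookup vector
opinions <- c("like", "dislike", "neutral", "like")
coding <- c(dislike = -1, neutral = 0, like = 1)
scores <- unname(coding[opinions])    # 1 -1 0 1

# the same categories coded as 2 / 1 / 0 through an ordered factor
ranked <- factor(opinions, levels = c("dislike", "neutral", "like"), ordered = TRUE)
shifted <- as.integer(ranked) - 1L    # 2 0 1 2

# letters by position in the alphabet, and by character code
positions <- match(c("a", "b", "c"), letters)   # 1 2 3
code_point <- utf8ToInt("a")                    # 97
```

The two codings of like/neutral/dislike keep the same order. Which numbers to choose is a decision about the data, not something R settles.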
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
data types: compound
The list / the series
17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5
The list of lists / the table
firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden

firstname   surname       bornCountry
Alice       Munro         Canada
Mo          Yan           China
Tomas       Tranströmer   Sweden
(more elaborate possibilities exist…)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
and text?
a (looooong) list of characters (a “string”):
O, n, c, e, *space*, u, p, o, n, *space*, a, *space*, t, i, m, e
other representations
▶ the bag of words (to: 2, be: 2, or: 1, not: 1)
▶ content analysis (automated, human, or semi-automated)
▶ marked-up text
<sp who="#Salinus"><speaker>Duke.</speaker> <p>Haplesse <name>Egeon</name> whom the fates haue markt...</p>
▶ parsed trees
▶ page images
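Two of these representations, the list of characters and the bag of words, can be sketched in a few lines of R. The example assumes a lowercase phrase with single spaces and no punctuation. Real text needs more careful tokenizing.

```r
text <- "to be or not to be"

# the string as a list of characters (spaces included)
chars <- strsplit(text, "")[[1]]
length(chars)     # 18

# the bag of words: split on spaces, then count each distinct word
words <- strsplit(text, " ", fixed = TRUE)[[1]]
bag <- table(words)
bag               # be: 2, not: 1, or: 1, to: 2
```

The bag throws word order away: "to be or not to be" and "be to not or be to" give the same counts.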
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
programming in a nutshell
1. A program is a formal description of a process for transforming data.
2. A computer performs calculations on numbers and stores the results of those calculations.
3. If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the R experience
The console
You type an expression, R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk).
The script
You prepare a list of expressions in a file, and R figures out their value one by one.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
first steps in the console
R is a parrot
2
"Shiver me timbers"
R gets crabby easily
Shiver
Shiver me timbers
help(
"Shiver
Press ESC.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
some hidden features
▶ history navigation with up and down arrows (or RStudio History pane)
▶ tab completion
▶ help: help("sqrt") or ?sqrt
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
R data kinds (“modes”)
Numbers
Whole, integer, real, ratio… (complex too)
Strings
"Avast"
"\"Avast,\" he said"
"Beware the \\"
Represent a newline with \n and a tab with \t.
Booleans
TRUE and FALSE or T and F for short
Factors
(For categorical data: more later)
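To check which kind a value belongs to, R provides `mode()` and `class()`. The sketch below also shows that the backslash escapes are a way of typing a string, not part of its contents. The variable names are chosen only for this example.

```r
mode(2)                  # "numeric"
mode("Avast")            # "character"
mode(TRUE)               # "logical"
class(factor("pirate"))  # "factor"

# Escapes are typed with a backslash but count as a single character
quoted <- "\"Avast,\" he said"
slashed <- "Beware the \\"
nchar(quoted)            # 16
nchar(slashed)           # 12

# cat() writes the string itself, so the escapes disappear
cat(quoted, "\n")
cat("one\ttwo\nthree\n")
```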
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Rithmetic
Try:
2 * 2
5/7
TRUE | FALSE
TRUE & FALSE
T & T
!FALSE
!TRUE
4 == 3
!(4 == 3)
4 != 3
1 < 5
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
R functions
Functions map inputs to outputs. Describe these:
sqrt(4)
nchar("Munro")
paste("Alice", "Munro")
Functions in R can have named parameters as well. Experiment with:
paste("Munro", "Alice", sep = ", ")
paste("Munro", "Alice", sep = "")
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
assignment
<- stores a value under a name which you can refer to (or change) later.
x <- 108
x
x + 2
storage <- 10
storage <- storage - 10
My_Perfectly_Good_Name2012 <- "Mo Yan"
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
R compound data types
vectors (for a series of values)
Construct a vector with the special function c (concatenate):
xs <- c(2, 4, 8)
xs
bs <- c(T, F, T)
bs
people <- c("Munro", "Mo", "Transtromer")
people
c(people, "Vargas Llosa")
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
subscripting
Choose an element or elements from a vector with []:
xs[2]
people[1]
sequences
1:3
c(1:3, 6:8)
What is the value of these expressions?
people[1:2]
people[c(1, 3)]
logical subscripting
people[bs]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
vector operations
c(1, 3, 5) + c(2, 4, 6)
paste(c("a", "b"), c("c", "d"))
c(T, F, F) | c(F, T, F)
c("a", "b", "c") %in% c("b", "c", "d", "e")
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
recycling
c(1, 3, 5) + 1
paste("The", c("beginning", "end"))
c(1, 3, 5) == 3
xs
choice <- xs > 3
choice
xs[choice]
What does Boolean-vector subscripting express?
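One way to read Boolean-vector subscripting is as a filter: keep the elements for which a condition holds. The sketch below walks through that reading with the `xs` vector from the earlier slide. The names `bumped`, `kept`, and `middle` are introduced only here.

```r
xs <- c(2, 4, 8)

# recycling: the single 1 is reused for every element of xs
bumped <- xs + 1          # 3 5 9

# a comparison against a single number is recycled the same way
choice <- xs > 3          # FALSE TRUE TRUE

# subscripting with the Boolean vector keeps only the TRUE positions
kept <- xs[choice]        # 4 8

# conditions combine element by element with & and |
middle <- xs[xs > 3 & xs < 8]   # 4

# TRUE counts as 1, so sum() answers "how many pass the test?"
sum(choice)               # 2
which(choice)             # 2 3, the positions that pass
```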
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
factors
A special type for categorical data, normally made out of strings:
nationalities <- c("American", "Canadian", "French", "French", "Chinese", "American")
nat_fact <- factor(nationalities)
nat_fact
nat_fact[1]
nat_fact[3:4]
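A short sketch of what the factor holds beyond the strings themselves: a fixed set of levels, and an integer code for each element pointing into that set. It reuses the `nationalities` vector from the slide. The names `codes` and `counts` are introduced only here.

```r
nationalities <- c("American", "Canadian", "French", "French",
                   "Chinese", "American")
nat_fact <- factor(nationalities)

levels(nat_fact)              # the distinct categories, sorted alphabetically
nlevels(nat_fact)             # 4
codes <- as.integer(nat_fact) # each element's position among the levels
codes                         # 1 2 4 4 3 1

# a subset still remembers all four levels
levels(nat_fact[3:4])

# counting by category
counts <- table(nat_fact)
counts
```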
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the data frame
A list of vectors not necessarily of the same type, but all of the same length:
laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83))
laureates
laureates$surname # Levels??
laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83),
    stringsAsFactors=F)
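Whether the "Levels" line appears depends on the R version: since R 4.0.0 the default for `stringsAsFactors` in `data.frame()` is FALSE, so strings stay strings unless you ask otherwise. The sketch below sets the argument explicitly both ways so that the difference shows on any version. The frame names `as_factors` and `as_strings` are invented for this example.

```r
as_factors <- data.frame(
  surname = c("Munro", "Yan", "Transtromer"),
  age_now = c(82, 59, 83),
  stringsAsFactors = TRUE)

as_strings <- data.frame(
  surname = c("Munro", "Yan", "Transtromer"),
  age_now = c(82, 59, 83),
  stringsAsFactors = FALSE)

class(as_factors$surname)   # "factor": printing it shows a Levels line
class(as_strings$surname)   # "character"
class(as_strings$age_now)   # "numeric" in both frames

# string functions want the character version
nchar(as_strings$surname)   # 5 3 11
```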
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
indexing by row and column
laureates[1, 1]
laureates[1, 2]
laureates[1, "firstname"]
laureates[2, "surname"]
laureates[3, c("firstname", "surname")]
Exercise
Write a single expression in terms of laureates to produce the full name of Canada’s laureate.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
omitted indices
laureates[3, ]
laureates[, 2]
laureates[, c("surname", "bornCountry")]
laureates[c(T, F, T), ]
A shorthand
laureates[, "surname"]
laureates$surname
laureates$surname[2]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
getting a real dataframe
laureates <- read.csv("laureates.csv", stringsAsFactors=F)
laureates # Scroll up!
laureates$surname
properties of the frame
names(laureates)
nrow(laureates)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the logic of the query
laureates$bornCountry == "Sweden"
swedes <- laureates$bornCountry == "Sweden"
laureates$surname[swedes]
laureates[swedes, ]
women <- laureates$gender == "female"
laureates[women, ]
laureates[women & swedes, ]
laureates[women | swedes, ]
laureates$surname[women & !swedes]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
exercise
Write an expression whose value is a dataframe containing the names and prize-years of all the laureates who died in a country other than the country of their birth.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
exiles and émigrés
laureates[laureates$bornCountryCode != laureates$diedCountryCode, c("surname","year")]
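One thing to watch in this query: in the sample rows shown earlier, living laureates have an empty diedCountryCode, and an empty string also counts as "different from" the birth country. The sketch below uses a three-row stand-in for the full file (values copied from those sample rows) and assumes that empty strings mark the living. If the real file marks them some other way, the guard would need to change.

```r
# Three-row stand-in for laureates.csv, using values from the sample rows
laureates <- data.frame(
  surname = c("Munro", "Lessing", "Pinter"),
  bornCountryCode = c("CA", "IR", "UK"),
  diedCountryCode = c("", "UK", "UK"),
  year = c(2013, 2007, 2005),
  stringsAsFactors = FALSE)

# the comparison alone also picks up the living laureate
moved <- laureates$bornCountryCode != laureates$diedCountryCode
naive <- laureates[moved, c("surname", "year")]
naive     # Munro and Lessing

# an extra condition restricts the query to laureates with a recorded death
has_died <- laureates$diedCountryCode != ""
exiles <- laureates[moved & has_died, c("surname", "year")]
exiles    # Lessing only
```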
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
counting
table(c("a", "b", "a", "c", "b"))
…and division
table(laureates$bornCountryCode)
table(laureates$bornCountryCode)/nrow(laureates) * 100
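The count-then-divide step can be checked on a small made-up vector. `prop.table()` is a base R function that does the division for you. The `codes` vector here is invented and is not the laureates data.

```r
codes <- c("CA", "CN", "SE", "UK", "UK")   # invented sample

counts <- table(codes)
counts                                 # CA 1, CN 1, SE 1, UK 2

# division by the total, as on the slide
by_hand <- counts / length(codes) * 100

# the same thing with prop.table()
shares <- prop.table(counts) * 100
shares                                 # CA 20, CN 20, SE 20, UK 40

sum(shares)                            # percentages add up to 100
```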
Exercise
Write an expression for a tabulation of the number of men and women to win the Nobel in literature.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
cross-tabulation
table(laureates$bornCountryCode, laureates$gender)
Sorting
laureate_countries <- table(laureates$bornCountryCode)
sort(laureate_countries)
sort(laureate_countries, decreasing = T)
Exercise
Write an expression for the top three countries-of-death of the Nobel laureates.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
we’ll always have…
sort(table(laureates$diedCountry), decreasing = T)[1:4]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
messier data
Metadata for every item in the TEI-encoded Poetry and Crisis from the Modernist Journals Project, from http://sourceforge.net/projects/mjplab/files/.
readLines("Poetry_2.everytitle.txt", n = 4)
!?!!!
After consulting help(read.csv)…
poetry_titles <- read.table("Poetry_2.everytitle.txt", sep = "|",
    strip.white = T, stringsAsFactors = F, quote = "", header = T)
crisis_titles <- read.table("Crisis_2.everytitle.txt", sep = "|",
    strip.white = T, stringsAsFactors = F, quote = "", header = T)
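What those read.table arguments do can be seen on a couple of invented lines. The sketch assumes a pipe-delimited layout with padded fields. The column names and rows are made up and are not the actual contents of the MJP files.

```r
# Two invented records in a pipe-delimited layout with a header line
raw <- "journal.title | date | genre | creator | title
Poetry | 1912-10-01 | poetry | A. Poet | Some Lines
Crisis | 1912-11-01 | articles | Anonymous | An Editor's Note"

titles <- read.table(text = raw,
                     sep = "|",                # fields are split at the pipe
                     strip.white = TRUE,       # drop the padding around each field
                     quote = "",               # treat ' and \" as ordinary characters
                     header = TRUE,            # first line holds the column names
                     stringsAsFactors = FALSE)

names(titles)
titles$title     # the apostrophe in the second title is kept as is
```

Without `quote = ""`, read.table would take the apostrophe as the start of a quoted field, which is one way a file like this can go wrong on a first attempt.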
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
a comparison
overall proportions
table(poetry_titles$genre)/nrow(poetry_titles)
table(crisis_titles$genre)/nrow(crisis_titles)
combine and recount
mags <- rbind(poetry_titles, crisis_titles)
table(mags$genre, mags$journal.title)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
who’s in both?
poetry_in_crisis <- poetry_titles$creator %in% crisis_titles$creator
shared_auths <- poetry_titles$creator[poetry_in_crisis]
unique(shared_auths)
Whoops!
shared_auths <- shared_auths[shared_auths != "" & shared_auths != "Anonymous"]
mags_shared <- mags[mags$creator %in% shared_auths, ]
table(mags_shared$journal.title,mags_shared$genre,
    mags_shared$creator)
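The same "who's in both?" logic on two tiny invented author lists shows where the "Whoops!" comes from: blank and "Anonymous" entries appear in both lists, so they count as shared until they are filtered out. All the names below are placeholders.

```r
poetry_creators <- c("Author A", "", "Anonymous", "Author B", "Author A", "Author D")
crisis_creators <- c("Author C", "Author B", "", "Anonymous", "Author A")

# TRUE wherever a Poetry creator also appears among the Crisis creators
in_both <- poetry_creators %in% crisis_creators

shared <- unique(poetry_creators[in_both])
shared      # includes "" and "Anonymous"

# drop the entries that are not really authors
shared <- shared[shared != "" & shared != "Anonymous"]
shared      # "Author A" "Author B"

# an equivalent formulation with the set functions
also_shared <- setdiff(intersect(poetry_creators, crisis_creators),
                       c("", "Anonymous"))
```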
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
from tables back to data frames
laur_country_tab <- table(laureates$bornCountryCode)
laureate_countries <- as.data.frame(laur_country_tab)
names(laureate_countries)
names(laureate_countries) <- c("country", "count")
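A self-contained version of this conversion, on an invented frame, shows the default column names that make the renaming step worthwhile. The `toy` frame and its values are made up.

```r
toy <- data.frame(code = c("UK", "CA", "UK", "SE", "UK", "SE"),
                  stringsAsFactors = FALSE)

tab <- table(toy$code)
countries <- as.data.frame(tab, stringsAsFactors = FALSE)

default_names <- names(countries)
default_names                        # "Var1" "Freq"

names(countries) <- c("country", "count")

# now the counts can be sorted and queried like any other data frame
countries[order(countries$count, decreasing = TRUE), ]
countries$count[countries$country == "UK"]    # 3
```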
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
visualization, grammatically
A visualization transforms data inputs into graphical outputs (sound familiar?).
A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.
library("ggplot2")
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
making a point (plot)
data: translations published in US, year by year
Source: UNESCO Index Translationum
us_tx <- read.csv("us-trans.csv")
1. years on x axis
2. counts on y axis
3. what to draw?
  ▶ point for each yearly entry
  ▶ line connecting
  ▶ shaded-in area
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
the code
qplot(x=year,            # aesthetics (mapping)
      y=translations,
      geom="point",      # geometry (shape)
      data=us_tx)        # data source
qplot(x=year,y=translations,
      group=1,           # special aesthetic: "1 line"
      geom="line",       # geometry (shape)
      data=us_tx)
qplot(x=year,y=translations,geom="area",
      data=us_tx)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
arbitrary mappings
1. countries on x axis, in alphabetical order
2. laureate count on y axis
3. point for each country
qplot(x=country,y=count,geom="point",
      data=laureate_countries)
qplot(x=country,y=count,
      geom="bar",        # bars but:
      stat="identity",   # don't tally y var.
      data=laureate_countries)
qplot(x=bornCountryCode,
      geom="bar",
      data=laureates)    # bars, and do tally
sorted_countries <- laureate_countries[order(laureate_countries$count),]
qplot(x=count,geom="bar",binwidth=1,
      data=sorted_countries)
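The same mappings can also be written with `ggplot()`, the longer form of the grammar. That may be useful on newer installations, since `qplot()` has been deprecated in recent ggplot2 releases and its `stat=` argument may not be accepted. This is a sketch on an invented four-row frame, assuming a ggplot2 version that provides `geom_col()`.

```r
library(ggplot2)

# invented stand-in for the laureate_countries frame
laureate_countries <- data.frame(
  country = c("CA", "CN", "SE", "UK"),
  count = c(1, 1, 1, 2))

# data source, then aesthetic mapping, then geometry
p_points <- ggplot(laureate_countries, aes(x = country, y = count)) +
  geom_point()

# geom_col() draws bars whose heights are the y values, with no tallying
p_bars <- ggplot(laureate_countries, aes(x = country, y = count)) +
  geom_col()

# geom_bar() with only an x aesthetic tallies the rows itself
p_tally <- ggplot(laureate_countries, aes(x = country)) +
  geom_bar()

# print(p_bars) would draw the plot; at the console, typing p_bars is enough
```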
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
dates
Consider mags. What type is mags$date?
poetry_articles <- poetry_titles[poetry_titles$genre == "articles", ]
art_series <- as.data.frame(table(poetry_articles$date))
names(art_series) <- c("date", "count")
art_series$date <- as.Date(art_series$date)
qplot(x = date, y = count, group = 1, geom = "line", data = art_series)
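The reason for the as.Date step shows up on a few invented date strings: after tabulation the dates are category labels, and only after conversion do they behave like points in time. The sketch assumes dates written as YYYY-MM-DD, which is the form as.Date reads by default.

```r
date_strings <- c("1912-10-01", "1912-10-01", "1912-11-01", "1913-01-01")

series <- as.data.frame(table(date_strings))
names(series) <- c("date", "count")

before <- class(series$date)
before                      # "factor": just labels, with no notion of elapsed time

series$date <- as.Date(as.character(series$date))
class(series$date)          # "Date"

# now arithmetic on the dates makes sense
gaps <- diff(series$date)
gaps                        # 31 and 61 days
format(series$date, "%Y")   # just the year of each entry
```

A plotting function given a Date column can space the points according to elapsed time, whereas category labels are simply placed one after another.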
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
construct the data you want to plot
genre_series <- as.data.frame(table(mags$date, mags$genre,
    mags$journal.title))
names(genre_series) <- c("date","genre","journal", "count")
genre_series$date <- as.Date(genre_series$date)
qplot(x=date,y=count,color=genre,geom="point",
      data=genre_series)
qplot(x=date,y=count,color=genre,group=genre,
      geom="line",data=genre_series)
qplot(x=date,y=count,fill=genre,geom="bar",
      stat="identity",position="stack",data=genre_series)
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
small multiples
qplot(x=date,y=count,group=genre,
      facets=genre ~ journal,geom="bar",
      stat="identity",data=genre_series)
qplot(x=date,y=count,group=genre,
      facets= ~ journal,geom="bar",
      stat="identity",data=genre_series)
Exercise
Generate either overlaid or small-multiples plots of the time series of genres in Poetry.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
counting on
Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3.
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3.