PS 405 Week 1 Section
Intro to R and Summary Statistics
D.J. Flynn
January 14, 2014
Today’s plan
◮ Preliminaries
◮ Intro to R
◮ Basic univariate and bivariate stats
◮ Plots
Preliminaries
◮ Section: Tuesday, 5:00-6:00, Scott 212
◮ Office Hours: Thursday, 12:30-2:00, Scott 230
◮ Problem Sets:
  ◮ hard copies
  ◮ include code (annotated)
  ◮ neat tables (cleaned up in Word or LaTeX)
  ◮ grades: number correct (meaningless)
◮ Questions: substantive questions to office hours, please
◮ Website: my overheads/code will be posted at www.djflynn.org/teaching
Caveats
◮ this presentation: intro to the basics
◮ a lot of helpful R guides out there (see Thomas Leeper’s: thomasleeper.com/Rcourse/Intro2R/Intro2R.pdf)
◮ 90% of R skills come from trial-and-error
◮ Google error messages
◮ pro tip: always know what you’re asking R to do (not just the code). Next quarter Jay will show you what’s going on behind the scenes.
R looks like this...
RStudio
I highly recommend using a text editor, such as RStudio:
About R
◮ Almost entirely command-based (no point-and-click)
◮ Core functionalities already loaded; if you need anything else, load a package (we’ll do this)
◮ Advantages: FREE, extremely flexible, great graphics, increasingly the norm
◮ Disadvantages: steep learning curve, tedious code, very sensitive, unhelpful error messages
Practical tips¹
◮ R is case-sensitive: x ≠ X, Data ≠ data
◮ scroll through previous commands using the up and down arrows
◮ putting a question mark before a command will bring up the relevant help file: ?summary
◮ use pound signs (#) to annotate code as you go along
◮ ALWAYS save your code in a separate file (RStudio makes this easy)
◮ when R asks if you want to save the workspace image, say yes!
¹ Most of these tips came from Salma Al-Shami’s slides from previous years (thanks, Salma!)
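A minimal sketch of the case-sensitivity and annotation tips. The object name myvalue is made up for illustration.

```r
# Create an object; the name is arbitrary and used only for this example
myvalue <- 5

# R matches names exactly, including capitalization
found_lower <- exists("myvalue")   # TRUE: same spelling and case
found_upper <- exists("MyValue")   # FALSE: different case means a different name

# In an interactive session, ?summary and help("summary") open the same help file
```

Typing MyValue at the prompt here would produce an "object not found" error, which is a common first stumbling block.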
Basic commands
◮ R works like a calculator (e.g., typing 2+2 returns 4)
◮ Creating objects in R:
  ◮ constants:
    x<-5
    constant=1
  ◮ vectors:
    myvec<-c(1,2,3,4,5)
    myothervec<-c(6,7,8,9,10)
    colors<-c("blue","green","red","purple")
  ◮ matrices:
    mymatrix<-cbind(myvec,myothervec)
    my.other.matrix<-matrix(seq(1,100),10,10)
  ◮ data frames:
    mydataframe<-cbind.data.frame(myvec,myothervec)
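Once objects exist, it helps to check what kind of object each one is. The sketch below reuses the names from the slide; the inspection calls (is.matrix, dim, is.data.frame) are base R functions not mentioned in the slides.

```r
# Calculator-style arithmetic, stored in an object
calc <- 2 + 3 * 4                 # multiplication happens first, so this is 14

# Objects from the slide
myvec <- c(1, 2, 3, 4, 5)
myothervec <- c(6, 7, 8, 9, 10)
mymatrix <- cbind(myvec, myothervec)               # columns bound into a matrix
mydataframe <- cbind.data.frame(myvec, myothervec) # same columns as a data frame

# Check what was created
is.matrix(mymatrix)          # TRUE
dim(mymatrix)                # rows, then columns: 5 2
is.data.frame(mydataframe)   # TRUE
names(mydataframe)           # column names of the data frame
```

A matrix holds a single type of value, while a data frame can mix numeric and text columns, which is why datasets are usually stored as data frames.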
Looking at data
◮ you have to tell R where to find variables: dataset$variable
◮ use attach() and detach(), but always know what dataset you’re referring to
◮ to look at an object, just type its name
◮ descriptives: mean median max min var sd range (note: mode() reports an object's storage type, not the most frequent value)
◮ distributions: table() summary() head()
◮ variables: names(dataset) dataset$variable dataset$variable[obs1:obs2]

Practice looking at variables in the pre-loaded dataset faithful. It ships with base R (in the datasets package), so no package needs to be installed first:
names(faithful)
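As a sketch of the commands above applied to faithful (which is available in base R without loading anything). The last three lines, which find the most frequent value with table(), are one possible approach and not something from the slides.

```r
# Look at the dataset
names(faithful)               # variable names
head(faithful)                # first six rows
summary(faithful$waiting)     # min, quartiles, mean, max

# Descriptives for one variable
mean_wait   <- mean(faithful$waiting)
median_wait <- median(faithful$waiting)
sd_wait     <- sd(faithful$waiting)
range_wait  <- range(faithful$waiting)   # smallest and largest value

# Observations 1 through 5 of a single variable
first_five <- faithful$waiting[1:5]

# mode() describes how the data are stored, not the most common value
storage_type <- mode(faithful$waiting)   # "numeric"

# Assumption: one way to find the most frequent value is via table()
counts <- table(faithful$waiting)
most_common <- as.numeric(names(counts)[which.max(counts)])
```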
Loading packages
install.packages("nameofpackage")
library(nameofpackage)
Loading data in R
◮ code depends on the type of file you’re attempting to load: read.table, read.dta, read.csv, read.spss, etc.
◮ two options: (1) tell R exactly where to find the dataset you want, or (2) set a working directory and then just tell it the file name
◮ I highly recommend the latter because typing long file paths can be a nightmare (e.g., typos, slashes, quotation marks)
◮ to load data saved by other statistical software (e.g., Stata .dta or SPSS files), load the foreign package
◮ MUCH easier in RStudio (and on Macs)
Example using pilot.data.csv
Option 1: Load from file path
install.packages("foreign") #only needed for formats like read.dta or read.spss; read.csv works without it
library(foreign)
pilot<-read.csv("~/Documents/TAing/winter 2014/section/week1/pilot.data.csv")
names(pilot)

Option 2: Set wd, then call up file
setwd("~/Documents/TAing/winter 2014/section/week1")
install.packages("foreign")
library(foreign)
pilot<-read.csv("pilot.data.csv")
names(pilot)

Option 3: Point-and-click open in RStudio
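The two code-based options can be tried without the course files. The sketch below writes a small made-up dataset (toy.data.csv, a hypothetical stand-in for pilot.data.csv) to a temporary folder and reads it back both ways.

```r
# Hypothetical toy data standing in for the real pilot data
toy <- data.frame(id = 1:3, gmf = c(7, 4, 1))

# Write it to a temporary folder so the example runs on any machine
tmp_dir <- tempdir()
write.csv(toy, file.path(tmp_dir, "toy.data.csv"), row.names = FALSE)

# Option 1: give read.csv the full path
toy1 <- read.csv(file.path(tmp_dir, "toy.data.csv"))

# Option 2: set the working directory, then use only the file name
old_wd <- setwd(tmp_dir)          # setwd() returns the previous directory
toy2 <- read.csv("toy.data.csv")
setwd(old_wd)                     # restore the original working directory

names(toy1)
identical(toy1, toy2)             # both routes load the same data
```

Storing the old working directory and restoring it afterwards is optional, but it avoids surprises later in the same session.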
Types of variables and why we care
◮ nominal/categorical: can’t be ordered; distance not meaningful
◮ ordinal: can be ordered; distance may/may not be meaningful
◮ continuous: can be ordered; distance meaningful

Model selection depends on type of DV.
This class: continuous and quasi-continuous DVs
Next class: categorical/limited DVs
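In R these variable types usually map onto factors, ordered factors, and numeric vectors. A minimal sketch with made-up values (the variable names and responses are illustrative only):

```r
# Nominal: categories with no ordering
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal: categories with an explicit order
interest <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"), ordered = TRUE)

# Continuous: plain numeric values
age <- c(34, 61, 47, 29)

table(party)                   # counts are meaningful for nominal data
interest[1] < interest[2]      # TRUE: comparisons work for ordered factors
mean(age)                      # 42.75: averages are meaningful for numeric data
```

Calling mean() on party would return NA with a warning, which is R's way of saying the operation does not suit that variable type.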
Re-coding
Raw data (especially secondary data, e.g., ANES) are often coded awkwardly, so we want to re-code:
load("/Users/DJF/Documents/TAing/winter 2014/section/week1/nes2008.RData")
practice<-nes08
summary(practice$partyid) #notice how responses are non-numeric

Here I code Dems as 1, Reps as 2, Inds as 3, and others as missing:
library(car)
practice$newpartyid<-recode(practice$partyid,"'1. Democrat'=1; '2. Republican'=2; '3. Independent'=3; else=NA")

It’s always a good idea to compare the distributions before and after re-coding to make sure everything was done correctly:
table(practice$partyid)
table(practice$newpartyid)

Another recoding example (this time changing already numeric responses):
library(car)
pilot$gmf.new<-recode(pilot$gmf,"7=1;6=2;5=3;4=4;3=5;2=6;1=7;else=NA")
table(pilot$gmf)
table(pilot$gmf.new)
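The same two recodes can be sketched in base R on made-up data, which is handy for checking the logic without the course files. This swaps in ifelse() and a named lookup vector in place of car's recode(); the values 9 and "4. Other party" are hypothetical stand-ins for responses that should become missing.

```r
# Reverse a 1-7 scale: 7 becomes 1, 6 becomes 2, and so on
gmf <- c(7, 6, 5, 4, 3, 2, 1, 9)                # 9 is a made-up out-of-range code
gmf_new <- ifelse(gmf %in% 1:7, 8 - gmf, NA)    # anything else becomes missing

# Check the recode by cross-tabulating old against new values
table(gmf, gmf_new, useNA = "ifany")

# Map text labels to numbers with a named lookup vector
partyid <- c("1. Democrat", "2. Republican", "3. Independent", "4. Other party")
lookup <- c("1. Democrat" = 1, "2. Republican" = 2, "3. Independent" = 3)
newpartyid <- unname(lookup[partyid])           # labels not in lookup become NA
```

Cross-tabulating the old variable against the new one shows every mapping at once, which makes a slipped value easier to spot than comparing two separate tables.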
Sub-setting
We often want to subset data based on values of one or more variables (e.g., look only at Democrats, or voters>50, etc.):
older<-subset(practice,V081104>=60)

Does partyid vary by age?
table(practice$partyid)
table(older$partyid)
CrossTable(practice$age,practice$partyid) #CrossTable() comes from the gmodels package

Subsetting on older GOP voters:
olderGOP<-subset(older,newpartyid==2)

We could now run analyses on our subsets...
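A self-contained sketch of the same subsetting steps on a made-up data frame (the voters data and its values are hypothetical). It also shows the equivalent bracket syntax, which is not used in the slides.

```r
# Hypothetical toy data: five respondents
voters <- data.frame(
  age        = c(25, 64, 71, 38, 60),
  newpartyid = c(1, 2, 2, 3, 1)      # 1 = Dem, 2 = Rep, 3 = Ind
)

# Keep respondents aged 60 and over
older <- subset(voters, age >= 60)

# Narrow further to Republicans among the older respondents
olderGOP <- subset(older, newpartyid == 2)

# The same first step with bracket indexing: rows where the condition holds
older_bracket <- voters[voters$age >= 60, ]

nrow(older)      # 3
nrow(olderGOP)   # 2
```

Note the double equals sign in newpartyid == 2: a single = would be read as an argument assignment rather than a comparison.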
Basic bivariate stats
◮ Correlation (numeric variables)
  duration<-faithful$eruptions
  waiting<-faithful$waiting
  cor(duration,waiting)
  cor.test(duration,waiting)
◮ Crosstabulation (categorical variables)
  install.packages("gmodels")
  library(gmodels)
  CrossTable(nes08$partyid,nes08$marriage)
  CrossTable(nes08$partyid,nes08$bibleview)
◮ down the road: regression models
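Both statistics can be tried with nothing beyond base R. In the sketch below the correlation uses faithful, while the crosstab uses made-up party and marriage responses and base R's table() as a stand-in for gmodels' CrossTable().

```r
# Correlation between eruption duration and waiting time
r <- cor(faithful$eruptions, faithful$waiting)
ct <- cor.test(faithful$eruptions, faithful$waiting)
ct$estimate     # same correlation, stored in the test object
ct$p.value      # p-value for the null hypothesis of zero correlation

# Crosstab of two categorical variables (hypothetical responses)
party   <- c("Dem", "Rep", "Dem", "Ind", "Rep", "Dem")
married <- c("yes", "yes", "no", "no", "yes", "no")
xt <- table(party, married)
xt
prop.table(xt, margin = 1)    # row proportions: share married within each party
```

CrossTable() reports cell counts together with row, column, and total proportions in one display; table() with prop.table() produces the same numbers piece by piece.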
Sample plots
hist(faithful$eruptions)
[Figure: "Histogram of faithful$eruptions" (x-axis: faithful$eruptions; y-axis: Frequency)]
hist(faithful$eruptions,breaks=20,col="lightblue2",
  main="Histogram of 'eruptions' variable",xlab="x",ylab="freq(x)")
[Figure: "Histogram of 'eruptions' variable" (x-axis: x; y-axis: freq(x))]
hist(faithful$eruptions,breaks=20,col="lightblue2",
  main="Histogram of 'eruptions' Variable",xlab="x",ylab="density",prob=TRUE) #prob=TRUE plots densities, not counts
curve(dnorm(x,mean=mean(faithful$eruptions),sd=sd(faithful$eruptions)),add=TRUE) #overlay a normal curve
[Figure: "Histogram of 'eruptions' Variable", scaled to densities, with a normal curve overlaid (x-axis: x)]
my.density<-density(faithful$eruptions)
plot(my.density)
[Figure: "density.default(x = faithful$eruptions)" (x-axis: N = 272, Bandwidth = 0.3348; y-axis: Density)]
plot(my.density,col="seagreen3",main="PDF of 'eruptions' variable",
  xlab="x",ylab="f(x)",lty=6,lwd=4) #the y-axis is a density f(x), not Pr(X=x)
plot(faithful$eruptions,faithful$waiting)
[Figure: scatterplot (x-axis: faithful$eruptions; y-axis: faithful$waiting)]
plot(faithful$eruptions,faithful$waiting,main="Scatterplot of faithful Data",
  xlab="Eruptions",ylab="Waiting",pch=19)
[Figure: "Scatterplot of faithful Data" (x-axis: Eruptions; y-axis: Waiting)]
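Since problem sets are handed in as hard copies, it can help to save a plot to a file. The sketch below is one way to do that with base R's pdf() device; the file name and the temporary folder are assumptions for illustration, and the dashed kernel density line is an addition not shown in the slides.

```r
# Hypothetical output location; replace with a folder of your choice
out_file <- file.path(tempdir(), "eruptions_hist.pdf")

pdf(out_file, width = 6, height = 4)     # open a PDF graphics device

# Histogram on the density scale so curves can be overlaid
hist(faithful$eruptions, breaks = 20, prob = TRUE, col = "lightblue2",
     main = "Histogram of 'eruptions' variable",
     xlab = "Eruption duration (minutes)", ylab = "Density")

# Normal curve with the sample mean and standard deviation
curve(dnorm(x, mean = mean(faithful$eruptions), sd = sd(faithful$eruptions)),
      add = TRUE)

# Kernel density estimate as a dashed line for comparison
lines(density(faithful$eruptions), lty = 2)

dev.off()                                # close the device to finish the file
file.exists(out_file)
```

The plot is only written to disk once dev.off() is called, so a missing dev.off() is the usual reason for an empty or unreadable file.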