PS 405 Week 1 Section: Intro to R and Summary Statistics (D.J. Flynn)



slide-1
SLIDE 1

PS 405 – Week 1 Section Intro to R and Summary Statistics

D.J. Flynn January 14, 2014

slide-2
SLIDE 2

Today’s plan

◮ Preliminaries
◮ Intro to R
◮ Basic univariate and bivariate stats
◮ Plots

slide-3
SLIDE 3

Preliminaries

◮ Section: Tuesday, 5:00-6:00, Scott 212
◮ Office Hours: Thursday, 12:30-2:00, Scott 230
◮ Problem Sets:
    ◮ hard copies
    ◮ include code (annotated)
    ◮ neat tables (cleaned up in Word or LaTeX)
    ◮ grades: number correct (meaningless)
◮ Questions: substantive questions to office hours, please
◮ Website: my overheads/code will be posted at www.djflynn.org/teaching

slide-4
SLIDE 4

Caveats

◮ this presentation: intro to the basics
◮ a lot of helpful R guides out there (see Thomas Leeper’s: thomasleeper.com/Rcourse/Intro2R/Intro2R.pdf)
◮ 90% of R skills come from trial-and-error
◮ Google error messages
◮ pro tip: always know what you’re asking R to do (not just the code). Next quarter Jay will show you what’s going on behind the scenes.

slide-5
SLIDE 5

R looks like this...

slide-6
SLIDE 6

RStudio

I highly recommend using a text editor, such as RStudio:

slide-7
SLIDE 7
slide-8
SLIDE 8

About R

◮ Almost entirely command-based (no point-and-click)
◮ Core functionalities already loaded; if you need anything else, load a package (we’ll do this)
◮ Advantages: FREE, extremely flexible, great graphics, increasingly the norm
◮ Disadvantages: steep learning curve, tedious code, very sensitive, unhelpful error messages

slide-9
SLIDE 9

Practical tips¹

◮ R is extremely case-sensitive: x≠X, Data≠data
◮ scroll through previously entered code using the up and down arrows
◮ putting a question mark before a command will bring up the relevant help file: ?summary
◮ use pound signs (#) to annotate code as you go along
◮ ALWAYS save your code in a separate file (RStudio makes this easy)
◮ when R asks if you want to save the workspace image, say yes!

¹ Most of these tips came from Salma Al-Shami’s slides from previous years (thanks, Salma!)

slide-10
SLIDE 10

Basic commands

◮ R works like a calculator:
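
A few lines one might type at the console to see this (a minimal sketch; the particular expressions and the name result are illustrative, not taken from the original slide):

```r
2 + 3          # addition
10 / 4         # division
2^3            # exponentiation
sqrt(16)       # built-in functions work the same way
(1 + 2) * 3    # parentheses control the order of operations

# Store the answer in an object to reuse it later
result <- (1 + 2) * 3
result
```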

slide-11
SLIDE 11

◮ Creating objects in R:

    ◮ constants:

        x<-5
        constant=1

    ◮ vectors:

        myvec<-c(1,2,3,4,5)
        myothervec<-c(6,7,8,9,10)
        colors<-c("blue","green","red","purple")

    ◮ matrices:

        mymatrix<-cbind(myvec,myothervec)
        my.other.matrix<-matrix(seq(1,100),10,10)

    ◮ data frames:

        mydataframe<-cbind.data.frame(myvec,myothervec)
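
Once objects exist, it helps to check what kind of object each one is. The sketch below rebuilds three of the objects from this slide and inspects them; the inspection calls (class, dim, str) are an addition for illustration, not part of the original slide.

```r
# Rebuild the objects so this block runs on its own
myvec <- c(1, 2, 3, 4, 5)
myothervec <- c(6, 7, 8, 9, 10)
mymatrix <- cbind(myvec, myothervec)
mydataframe <- cbind.data.frame(myvec, myothervec)

class(myvec)          # what type of object is this?
class(mymatrix)
dim(mymatrix)         # number of rows, then number of columns
str(mydataframe)      # compact overview of a data frame

mymatrix[2, ]             # second row of the matrix
mydataframe$myothervec    # one column of the data frame
```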

slide-12
SLIDE 12

Looking at data

◮ you have to tell R where to find variables: dataset$variable
◮ use attach() and detach(), but always know what dataset you’re referring to
◮ to look at an object, just type its name
◮ descriptives: mean() median() max() min() var() sd() range() (careful: mode() returns an object’s storage type, not its most frequent value)
◮ distributions: table() summary() head()
◮ variables: names(dataset) dataset$variable dataset$variable[obs1:obs2]
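
A minimal sketch of the descriptive commands, run on a made-up vector (ages and its values are hypothetical, chosen only for illustration). Because mode() reports how R stores an object, the most frequent value is computed here from table() instead.

```r
# Hypothetical data for illustration only
ages <- c(23, 35, 35, 41, 58, 62, 35)

mean(ages)
median(ages)
var(ages)
sd(ages)
min(ages)
max(ages)
range(ages)      # smallest and largest value together

mode(ages)       # storage mode of the vector, not the statistical mode

# One way to get the most frequent value: tabulate, then take the largest count
freq <- table(ages)
stat_mode <- as.numeric(names(freq)[which.max(freq)])
stat_mode

ages[2:4]        # observations 2 through 4
```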

slide-13
SLIDE 13

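Practice looking at variables in the pre-loaded dataset faithful. Access it like this:

    install.packages("car")   # only needed once; car is used later for recode()
    library(car)
    names(faithful)           # faithful ships with base R (datasets package), so this works even without car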
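
One possible way to work through the practice, using only commands already introduced (a sketch; which variables and rows to look at is an arbitrary choice):

```r
names(faithful)               # which variables does the dataset contain?
head(faithful)                # first six rows
summary(faithful$eruptions)   # five-number summary plus the mean
faithful$waiting[1:10]        # first ten observations of one variable
nrow(faithful)                # number of observations
```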

slide-14
SLIDE 14

Loading packages

    install.packages("nameofpackage")   # download and install: once per machine
    library(nameofpackage)              # load the package: once per R session

slide-15
SLIDE 15

Loading data in R

◮ code depends on the type of file you’re attempting to load: read.table, read.dta, read.csv, read.spss, etc.
◮ two options: (1) tell R exactly where to find the dataset you want, or (2) set a working directory and then just tell it the file name
◮ I highly recommend the latter because typing long file paths can be a nightmare (e.g., typos, slashes, quotation marks)
◮ to load data saved by other statistical software (e.g., read.dta for Stata, read.spss for SPSS), load the foreign package; read.table and read.csv are part of base R
◮ MUCH easier in RStudio (and on Macs)

slide-16
SLIDE 16

Example using pilot.data.csv

Option 1: Load from file path

    install.packages("foreign")   # optional here: read.csv() is part of base R
    library(foreign)
    pilot<-read.csv("~/Documents/TAing/winter 2014/section/week1/pilot.data.csv")
    names(pilot)

Option 2: Set wd, then call up file

    setwd("~/Documents/TAing/winter 2014/section/week1")
    install.packages("foreign")
    library(foreign)
    pilot<-read.csv("pilot.data.csv")
    names(pilot)

Option 3: Point-and-click open in RStudio
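
To try the first two options without access to pilot.data.csv, the sketch below writes a tiny CSV to a temporary folder and reads it back both ways. The file name demo.data.csv and its contents are made up for illustration.

```r
# Create a small hypothetical CSV in a temporary folder
demo_dir <- tempdir()
demo_file <- file.path(demo_dir, "demo.data.csv")
write.csv(data.frame(id = 1:3, score = c(4, 7, 5)), demo_file, row.names = FALSE)

# Option 1: give read.csv() the full file path
demo1 <- read.csv(demo_file)

# Option 2: set the working directory, then use just the file name
old_wd <- setwd(demo_dir)    # setwd() returns the previous directory
demo2 <- read.csv("demo.data.csv")
setwd(old_wd)                # restore the original working directory

getwd()                      # check where R is currently looking
names(demo2)
```

Both routes produce the same data frame; the working-directory route simply keeps the read.csv() call short.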

slide-17
SLIDE 17

Types of variables and why we care

◮ nominal/categorical: can’t be ordered; distance not meaningful
◮ ordinal: can be ordered; distance may/may not be meaningful
◮ continuous: can be ordered; distance meaningful

Model selection depends on type of DV.
This class: continuous and quasi-continuous DVs
Next class: categorical/limited DVs
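
In R these three types are commonly stored as unordered factors, ordered factors, and numeric vectors. A minimal sketch with made-up values (the variable names and data are hypothetical):

```r
# Nominal: categories with no inherent order
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal: categories with an order, declared through levels
interest <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)

# Continuous: plain numeric vector
income <- c(32.5, 81.0, 47.2, 55.9)

levels(party)        # categories, listed alphabetically by default
interest < "high"    # comparisons are only defined for ordered factors
mean(income)         # arithmetic is meaningful for continuous variables
```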

slide-18
SLIDE 18

Re-coding

Raw data (especially secondary data, e.g., ANES) are often coded awkwardly, so we want to re-code:

    load("/Users/DJF/Documents/TAing/winter 2014/section/week1/nes2008.RData")
    practice<-nes08
    summary(practice$partyid) #notice how responses are non-numeric

Here I code Dems as 1, Reps as 2, Inds as 3, and others as missing:

    library(car)
    practice$newpartyid<-recode(practice$partyid,"'1. Democrat'=1; '2. Republican'=2; '3. Independent'=3; else=NA")

It’s always a good idea to compare the distributions before and after re-coding to make sure everything was done correctly:

    table(practice$partyid)
    table(practice$newpartyid)
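
The same re-coding logic can be tried without the NES file or the car package. The sketch below swaps in a base-R technique, a named lookup vector, in place of recode(); the partyid values are made up to mimic the labels above.

```r
# Hypothetical responses mimicking the NES labels
partyid <- c("1. Democrat", "2. Republican", "3. Independent",
             "4. Other party", "1. Democrat")

# Named lookup vector: old label -> new numeric code
lookup <- c("1. Democrat" = 1, "2. Republican" = 2, "3. Independent" = 3)

# Labels missing from the lookup come back as NA.
# as.character() matters if partyid is a factor: indexing by a factor
# would use its integer codes rather than its labels.
newpartyid <- unname(lookup[as.character(partyid)])

# Compare before and after in one table
table(partyid, newpartyid, useNA = "ifany")
```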

slide-19
SLIDE 19

Another recoding example (this time changing already numeric responses):

    library(car)
    pilot$gmf.new<-recode(pilot$gmf,"7=1;6=2;5=3;4=4;3=5;2=6;1=7;else=NA")
    table(pilot$gmf)
    table(pilot$gmf.new)
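
For a scale that runs from 1 to 7, reversing it is the same as subtracting each response from 8. A sketch with made-up responses (gmf here is a hypothetical stand-in for pilot$gmf):

```r
# Hypothetical 7-point responses, including one missing value
gmf <- c(1, 4, 7, 2, NA, 6)

# Reverse the scale: 1 becomes 7, 2 becomes 6, and so on
gmf_new <- 8 - gmf

# Check the mapping before and after
table(gmf, gmf_new, useNA = "ifany")
```

One difference from the recode() call: subtraction leaves out-of-range values (say, a stray 9) as numbers, whereas else=NA turns them into missing values.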

slide-20
SLIDE 20

Sub-setting

We often want to subset data based on values of one or more variables (e.g., look only at Democrats, or voters>50, etc.):

    older<-subset(practice,V081104>=60)

Does partyid vary by age?

    library(gmodels)   # provides CrossTable(); install with install.packages("gmodels") if needed
    table(practice$partyid)
    table(older$partyid)
    CrossTable(practice$age,practice$partyid)

Subsetting on older GOP voters:

    olderGOP<-subset(older,newpartyid==2)

We could now run analyses on our subsets...
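
A self-contained version of the same two-step subsetting, using a made-up data frame (voters and its columns are hypothetical). The last line shows an equivalent using bracket indexing.

```r
# Hypothetical data: age in years, party coded 1 = Dem, 2 = Rep, 3 = Ind
voters <- data.frame(age = c(25, 64, 71, 38, 60),
                     newpartyid = c(1, 2, 2, 3, 1))

# Step 1: keep respondents aged 60 or older
older <- subset(voters, age >= 60)

# Step 2: among those, keep Republicans
olderGOP <- subset(older, newpartyid == 2)

nrow(older)
nrow(olderGOP)

# Same result in one step with logical indexing
olderGOP2 <- voters[voters$age >= 60 & voters$newpartyid == 2, ]
```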

slide-21
SLIDE 21

Basic bivariate stats

◮ Correlation (numeric variables)

    duration<-faithful$eruptions
    waiting<-faithful$waiting
    cor(duration,waiting)
    cor.test(duration,waiting)

◮ Crosstabulation (categorical variables)

    install.packages("gmodels")
    library(gmodels)
    CrossTable(nes08$partyid,nes08$marriage)
    CrossTable(nes08$partyid,nes08$bibleview)

◮ down the road: regression models
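
Both kinds of statistic can be tried without the NES data or gmodels. In the sketch below the correlation uses the built-in faithful data, and the crosstab uses base R's table() and prop.table() as a stand-in for CrossTable(); the party and married vectors are made up.

```r
# Correlation between two numeric variables
r <- cor(faithful$eruptions, faithful$waiting)
test <- cor.test(faithful$eruptions, faithful$waiting)
test$estimate    # the same correlation coefficient
test$p.value     # p-value for the null hypothesis of zero correlation

# Crosstab of two categorical variables (hypothetical data)
party <- c("Dem", "Rep", "Dem", "Ind", "Rep", "Dem")
married <- c("yes", "yes", "no", "no", "yes", "yes")
tab <- table(party, married)
tab
prop.table(tab, margin = 1)   # row proportions: share married within each party
```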

slide-22
SLIDE 22

Sample plots

    hist(faithful$eruptions)

[Figure: "Histogram of faithful$eruptions"; x-axis: faithful$eruptions, y-axis: Frequency]

slide-23
SLIDE 23

    hist(faithful$eruptions,breaks=20,col="lightblue2",main="Histogram of 'eruptions' variable",xlab="x",ylab="freq(x)")

[Figure: "Histogram of 'eruptions' variable"; x-axis: x, y-axis: freq(x)]

slide-24
SLIDE 24

    # prob=TRUE puts densities rather than counts on the y-axis, so the normal curve is on the same scale
    hist(faithful$eruptions,breaks=20,col="lightblue2",main="Histogram of 'eruptions' Variable",xlab="x",ylab="freq(x)",prob=TRUE)
    curve(dnorm(x,mean=mean(faithful$eruptions),sd=sd(faithful$eruptions)),add=TRUE)

[Figure: "Histogram of 'eruptions' Variable" with a normal curve overlaid; x-axis: x, y-axis: freq(x)]

slide-25
SLIDE 25

    my.density<-density(faithful$eruptions)
    plot(my.density)

[Figure: "density.default(x = faithful$eruptions)"; x-axis: N = 272, Bandwidth = 0.3348; y-axis: Density]

slide-26
SLIDE 26

    plot(my.density,col="seagreen3",main="PDF of 'eruptions' variable",xlab="x",ylab="Pr(X=x)",lty=6,lwd=4)

slide-27
SLIDE 27

    plot(faithful$eruptions,faithful$waiting)

[Figure: scatterplot; x-axis: faithful$eruptions, y-axis: faithful$waiting]

slide-28
SLIDE 28

    plot(faithful$eruptions,faithful$waiting,main="Scatterplot of faithful Data",xlab="Eruptions",ylab="Waiting",pch=19)

[Figure: "Scatterplot of faithful Data"; x-axis: Eruptions, y-axis: Waiting]

slide-29
SLIDE 29

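    # formula syntax is y ~ x: waiting on the y-axis, eruptions on the x-axis, matching the axis labels
    plot(waiting~eruptions,data=faithful,main="Scatterplot with Regression Line",xlab="Eruptions",ylab="Waiting")
    abline(lm(waiting~eruptions,data=faithful),col="blue",lwd=3)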

slide-30
SLIDE 30

    plot(faithful$eruptions,faithful$waiting,main="Scatterplot with Smoothed Regression Line",xlab="Eruptions",ylab="Waiting",pch=20)
    lines(lowess(faithful$eruptions,faithful$waiting),col="red",lwd=3)