PS 405 Week 1 Section Intro to R and Summary Statistics D.J. Flynn

PS 405 – Week 1 Section
Intro to R and Summary Statistics
D.J. Flynn
January 14, 2014

Today’s plan
◮ Preliminaries
◮ Intro to R
◮ Basic univariate and bivariate stats
◮ Plots

Preliminaries
◮ Section: Tuesday, 5:00-6:00, Scott 212
◮ Office Hours: Thursday, 12:30-2:00, Scott 230
◮ Problem Sets:
  ◮ hard copies
  ◮ include code (annotated)
  ◮ neat tables (cleaned up in Word or LaTeX)
  ◮ grades: number correct (meaningless)
◮ Questions: substantive questions to office hours, please
◮ Website: my overheads/code will be posted at www.djflynn.org/teaching

Caveats
◮ this presentation: intro to the basics
◮ a lot of helpful R guides out there (see Thomas Leeper’s: thomasleeper.com/Rcourse/Intro2R/Intro2R.pdf)
◮ 90% of R skills come from trial-and-error
◮ Google error messages
◮ pro tip: always know what you’re asking R to do (not just the code). Next quarter Jay will show you what’s going on behind the scenes.

R looks like this...
[Screenshot: the R console]

RStudio
I highly recommend writing your code in an editor such as RStudio.
[Screenshot: the RStudio window]

About R
◮ Almost entirely command-based (no point-and-click)
◮ Core functionalities already loaded; if you need anything else, load a package (we’ll do this)
◮ Advantages: FREE, extremely flexible, great graphics, increasingly the norm
◮ Disadvantages: steep learning curve, tedious code, very sensitive, unhelpful error messages

Practical tips¹
◮ R is extremely sensitive, including to case: x ≠ X, Data ≠ data
◮ scroll through previously entered commands using the up and down arrows
◮ putting a question mark before a command will bring up the relevant help file: ?summary
◮ use pound signs (#) to annotate code as you go along
◮ ALWAYS save your code in a separate file (RStudio makes this easy)
◮ when R asks if you want to save the workspace image, say yes!
¹ Most of these tips came from Salma Al-Shami’s slides from previous years (thanks, Salma!)

Basic commands ◮ R works like a calculator:
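A minimal sketch of what "works like a calculator" means in practice. The particular expressions and the object name `result` are made up for illustration:

```r
# Type an expression at the prompt and R prints the answer
2 + 3          # addition
10 / 4         # division (gives 2.5, not integer division)
2^10           # exponentiation
sqrt(16)       # built-in functions work the same way

# Parentheses control the order of operations
result <- (1 + 2) * 3
result
```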

◮ Creating objects in R:
  ◮ constants:
      x<-5
      constant=1
  ◮ vectors:
      myvec<-c(1,2,3,4,5)
      myothervec<-c(6,7,8,9,10)
      colors<-c("blue","green","red","purple")
  ◮ matrices:
      mymatrix<-cbind(myvec,myothervec)
      my.other.matrix<-matrix(seq(1,100),10,10)
  ◮ data frames:
      mydataframe<-cbind.data.frame(myvec,myothervec)
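After creating objects it helps to check what R actually built. The sketch below reuses the object names from this slide and inspects them with `class()` and `dim()`; which checks to run is my choice, not part of the original slides:

```r
# Re-create the objects from the slide
x <- 5
myvec <- c(1, 2, 3, 4, 5)
myothervec <- c(6, 7, 8, 9, 10)
mymatrix <- cbind(myvec, myothervec)              # two columns, five rows
my.other.matrix <- matrix(seq(1, 100), 10, 10)    # filled column by column
mydataframe <- cbind.data.frame(myvec, myothervec)

# Inspect them
class(x)                 # what kind of object is this?
dim(mymatrix)            # rows, then columns
class(mydataframe)
my.other.matrix[1, 2]    # row 1, column 2
```

Because `matrix()` fills column by column, the second column of `my.other.matrix` starts at 11 rather than 2.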

Looking at data
◮ you have to tell R where to find variables: dataset$variable
◮ or use attach() and detach(), but always know what dataset you’re referring to
◮ to look at an object, just type its name
◮ descriptives: mean() median() max() min() var() sd() range() (careful: mode() returns an object’s storage type, not its most frequent value; use table() to find the latter)
◮ distributions: table() summary() head()
◮ variables: names(dataset) dataset$variable dataset$variable[obs1:obs2]
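A short sketch of the descriptive functions on a made-up vector. The vector `scores` and the helper `stat_mode()` are hypothetical names introduced here; base R has no built-in function for the statistical mode, so the helper is one possible workaround rather than a standard function:

```r
# A small made-up vector to practice on
scores <- c(2, 4, 4, 5, 7, 9)

mean(scores)
median(scores)
sd(scores)
range(scores)     # returns the minimum and the maximum
table(scores)     # frequency of each value

mode(scores)      # "numeric": this is the storage type, not the most common value

# Hypothetical helper: most frequent value (first one wins if there is a tie)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(scores)
```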

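Practice
Practice looking at variables in the pre-loaded dataset faithful. It ships with R, so no package is needed to access it:
    names(faithful)
The car package is used later for re-coding; install and load it like this:
    install.packages("car")
    library(car)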
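One way the practice session might go, using only base R. The choice of commands and the object name `n_obs` are illustrative:

```r
names(faithful)              # variable names
head(faithful)               # first six rows
summary(faithful$waiting)    # five-number summary plus the mean
faithful$eruptions[1:5]      # observations 1 through 5 of one variable

n_obs <- nrow(faithful)      # number of observations
n_obs
```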

Loading packages
    install.packages("nameofpackage") #only needed once per computer
    library(nameofpackage) #needed in every new R session
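Since `install.packages()` only needs to run once, a script can check whether a package is already installed before downloading it again. This is a sketch of one common idiom; `load_pkg()` is a hypothetical helper name, and the demonstration uses the `stats` package only because it is always present:

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)   # pkg is a character string here
}

# "stats" ships with R, so nothing is downloaded in this example
load_pkg("stats")
```

In a problem-set script you would call it with the package you actually need, e.g. `load_pkg("car")`.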

Loading data in R
◮ code depends on the type of file you’re attempting to load: read.table read.dta read.csv read.spss, etc.
◮ two options: (1) tell R exactly where to find the dataset you want, or (2) set a working directory and then just tell it the file name
◮ I highly recommend the latter because typing long file paths can be a nightmare (e.g., typos, slashes, quotation marks)
◮ read.table and read.csv are built in; to load files saved by other statistical packages (e.g., read.dta for Stata, read.spss for SPSS), load the foreign package
◮ MUCH easier in RStudio (and on Macs)

Example using pilot.data.csv
Option 1: Load from file path
    install.packages("foreign") #not required for read.csv, which is built in
    library(foreign)
    pilot<-read.csv("~/Documents/TAing/winter 2014/section/week1/pilot.data.csv")
    names(pilot)
Option 2: Set wd, then call up file
    setwd("~/Documents/TAing/winter 2014/section/week1")
    install.packages("foreign") #not required for read.csv, which is built in
    library(foreign)
    pilot<-read.csv("pilot.data.csv")
    names(pilot)
Option 3: Point-and-click open in RStudio
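To try Options 1 and 2 without the course files, the sketch below writes a tiny made-up CSV to a temporary folder and reads it back both ways. The file name `toy.data.csv` and its contents are invented for illustration; only `gmf` borrows a variable name from the pilot data:

```r
# Create a small made-up dataset and save it as a CSV in a temporary folder
tmp_dir <- tempdir()
toy <- data.frame(id = 1:3, gmf = c(7, 4, 1))
write.csv(toy, file.path(tmp_dir, "toy.data.csv"), row.names = FALSE)

# Option 1: give R the full path to the file
toy1 <- read.csv(file.path(tmp_dir, "toy.data.csv"))

# Option 2: set the working directory, then use just the file name
old_wd <- setwd(tmp_dir)    # setwd() returns the previous directory
toy2 <- read.csv("toy.data.csv")
setwd(old_wd)               # restore the original working directory

names(toy1)
identical(toy1, toy2)       # both options load the same data
```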

Types of variables and why we care
◮ nominal/categorical: can’t be ordered; distance not meaningful
◮ ordinal: can be ordered; distance may/may not be meaningful
◮ continuous: can be ordered; distance meaningful
Model selection depends on type of DV.
This class: continuous and quasi-continuous DVs
Next class: categorical/limited DVs
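The three variable types map roughly onto R's factor, ordered factor, and numeric vector. A minimal sketch with made-up values (the names `party`, `interest`, and `income` are hypothetical):

```r
# Nominal: unordered factor
party <- factor(c("Democrat", "Republican", "Independent", "Democrat"))

# Ordinal: ordered factor, with the order of the levels stated explicitly
interest <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)

# Continuous: numeric vector
income <- c(42.5, 61.0, 38.2, 55.9)

is.ordered(party)            # FALSE: categories have no order
interest[1] < interest[2]    # comparisons are allowed for ordered factors
mean(income)                 # arithmetic is meaningful for continuous data
```

Asking for `mean(party)` would return NA with a warning, which is R's way of saying that distance is not meaningful for a nominal variable.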

Re-coding
Raw data (especially secondary data, e.g., ANES) are often coded awkwardly, so we want to re-code:
    load("/Users/DJF/Documents/TAing/winter 2014/section/week1/nes2008.RData")
    practice<-nes08
    summary(practice$partyid) #notice how responses are non-numeric
Here I code Dems as 1, Reps as 2, Inds as 3, and others as missing:
    library(car)
    practice$newpartyid<-recode(practice$partyid,"'1. Democrat'=1; '2. Republican'=2; '3. Independent'=3;else=NA")
It’s always a good idea to compare the distributions before and after re-coding to make sure everything was done correctly:
    table(practice$partyid)
    table(practice$newpartyid)
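If the car package or the NES file is not at hand, the same idea can be sketched in base R with a named lookup vector. This is an alternative to `recode()`, not the method used in the slides; the toy vector and the label "4. Other party" are made up:

```r
# Made-up responses in the style of the partyid labels
partyid <- c("1. Democrat", "2. Republican", "3. Independent",
             "4. Other party", "1. Democrat")

# Lookup table: old label -> new numeric code
lookup <- c("1. Democrat" = 1, "2. Republican" = 2, "3. Independent" = 3)

# Labels missing from the lookup table come back as NA
newpartyid <- unname(lookup[partyid])

# Compare before and after, keeping the NA category visible
table(partyid, newpartyid, useNA = "ifany")
```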

Another recoding example (this time changing already numeric responses):
    library(car)
    pilot$gmf.new<-recode(pilot$gmf,"7=1;6=2;5=3;4=4;3=5;2=6;1=7;else=NA")
    table(pilot$gmf)
    table(pilot$gmf.new)
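Reversing a 1 to 7 scale is the same as subtracting each response from 8, so the recode above can also be sketched with arithmetic. The toy vector below is made up, with a 9 standing in for an out-of-range code:

```r
# Made-up responses on a 1-7 scale, plus one invalid code (9)
gmf <- c(1, 2, 3, 4, 5, 6, 7, 9)

# Valid responses are flipped (7 -> 1, ..., 1 -> 7); anything else becomes NA
gmf.new <- ifelse(gmf %in% 1:7, 8 - gmf, NA)

# Check the mapping side by side
cbind(gmf, gmf.new)
```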

Sub-setting
We often want to subset data based on values of one or more variables (e.g., look only at Democrats, or voters over 50, etc.):
    older<-subset(practice,V081104>=60)
Does partyid vary by age?
    table(practice$partyid)
    table(older$partyid)
    library(gmodels) #provides CrossTable()
    CrossTable(practice$age,practice$partyid)
Subsetting on older GOP voters:
    olderGOP<-subset(older,newpartyid==2)
We could now run analyses on our subsets...
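The same two-step subsetting can be practiced on faithful, which needs no external file. The cutoffs (80 minutes, 4.5 minutes) and object names are arbitrary choices for illustration:

```r
# Step 1: keep only eruptions that followed a long wait
long_wait <- subset(faithful, waiting >= 80)

# Step 2: subset the subset, as with older -> olderGOP
long_and_big <- subset(long_wait, eruptions > 4.5)

# The same result in one step, using bracket indexing instead of subset()
same <- faithful[faithful$waiting >= 80 & faithful$eruptions > 4.5, ]

nrow(faithful)
nrow(long_wait)
nrow(long_and_big)
```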

Basic bivariate stats
◮ Correlation (numeric variables)
    duration<-faithful$eruptions
    waiting<-faithful$waiting
    cor(duration,waiting)
    cor.test(duration,waiting)
◮ Crosstabulation (categorical variables)
    install.packages("gmodels")
    library(gmodels)
    CrossTable(nes08$partyid,nes08$marriage)
    CrossTable(nes08$partyid,nes08$bibleview)
◮ down the road: regression models
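A sketch of how to pull individual results out of `cor.test()`, plus a base-R crosstab for when gmodels is not installed. The two yes/no variables are created here by splitting the faithful variables at arbitrary cutoffs (3 minutes, 70 minutes), purely to have something categorical to tabulate:

```r
duration <- faithful$eruptions
waiting <- faithful$waiting

# Correlation coefficient and the full test object
r <- cor(duration, waiting)
ct <- cor.test(duration, waiting)
ct$estimate    # the same correlation, stored inside the test result
ct$p.value     # p-value for H0: correlation = 0
ct$conf.int    # 95% confidence interval by default

# Base-R crosstab of two made-up categorical variables
long_eruption <- duration > 3
long_wait <- waiting > 70
tab <- table(long_eruption, long_wait)
tab
prop.table(tab, margin = 1)   # row proportions
```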

Sample plots
    hist(faithful$eruptions)
[Figure: "Histogram of faithful$eruptions"; x-axis: faithful$eruptions, y-axis: Frequency]

    hist(faithful$eruptions,breaks=20,col="lightblue2",main="Histogram of 'eruptions' variable",xlab="x",ylab="freq(x)")
[Figure: "Histogram of 'eruptions' variable"; x-axis: x, y-axis: freq(x)]

    hist(faithful$eruptions,breaks=20,col="lightblue2",main="Histogram of 'eruptions' Variable",xlab="x",ylab="density",prob=TRUE)
    curve(dnorm(x,mean=mean(faithful$eruptions),sd=sd(faithful$eruptions)),add=TRUE)
With prob=TRUE the bars show densities rather than counts, so a normal curve can be drawn on the same scale.
[Figure: "Histogram of 'eruptions' Variable" with a normal curve overlaid; x-axis: x]
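To see what `hist()` computes without drawing anything, ask for the histogram object with `plot=FALSE`. This sketch checks that the counts add up to the sample size and that the density-scaled bars have total area 1; the object name `h` is arbitrary:

```r
# Compute the histogram but do not draw it
h <- hist(faithful$eruptions, breaks = 20, plot = FALSE)

h$breaks          # bin edges R chose (breaks = 20 is only a suggestion)
h$counts          # number of observations in each bin
sum(h$counts)     # should equal the number of observations

# Area of the density-scaled bars: height times bin width, summed
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```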

    my.density<-density(faithful$eruptions)
    plot(my.density)
[Figure: "density.default(x = faithful$eruptions)"; y-axis: Density; N = 272, Bandwidth = 0.3348]

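    plot(my.density,col="seagreen3",main="PDF of 'eruptions' variable",xlab="x",ylab="f(x)",lty=6,lwd=4)
The y-axis is labelled f(x) because a density curve shows the height of the PDF; for a continuous variable, Pr(X=x) is zero at any single point.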
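The density object stores the numbers that `plot()` prints under the default figure. A short sketch of how to read them; the rough area check uses a simple sum over the evaluation grid, which is my own approximation rather than anything from the slides:

```r
my.density <- density(faithful$eruptions)

my.density$n      # number of observations used
my.density$bw     # bandwidth chosen by the default rule
head(my.density$x)   # grid of x values where the density is evaluated
head(my.density$y)   # estimated density at each grid point

# Rough check that the curve integrates to about 1
grid_step <- my.density$x[2] - my.density$x[1]
approx_area <- sum(my.density$y) * grid_step
approx_area
```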

    plot(faithful$eruptions,faithful$waiting)
[Figure: scatterplot; x-axis: faithful$eruptions, y-axis: faithful$waiting]

    plot(faithful$eruptions,faithful$waiting,main="Scatterplot of faithful Data",xlab="Eruptions",ylab="Waiting",pch=19)
[Figure: "Scatterplot of faithful Data"; x-axis: Eruptions, y-axis: Waiting]

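    plot(waiting ~ eruptions,data=faithful,main="Scatterplot with Regression Line",xlab="Eruptions",ylab="Waiting")
    abline(lm(waiting ~ eruptions,data=faithful),col="blue",lwd=3)
In a formula the variable on the left of ~ goes on the y-axis, so waiting ~ eruptions matches the axis labels.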
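The line that `abline()` draws comes from the coefficients of the fitted model. A sketch of how to look at them directly; the object name `fit` and the 3-minute prediction are illustrative choices:

```r
# Fit the regression of waiting time on eruption duration
fit <- lm(waiting ~ eruptions, data = faithful)

coef(fit)     # intercept and slope of the plotted line

# Predicted waiting time after a 3-minute eruption
pred <- predict(fit, newdata = data.frame(eruptions = 3))
pred

# The same prediction by hand: intercept + slope * 3
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["eruptions"]] * 3
by_hand
```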

    plot(faithful$eruptions,faithful$waiting,main="Scatterplot with Smoothed Regression Line",xlab="Eruptions",ylab="Waiting",pch=20)
    lines(lowess(faithful$eruptions,faithful$waiting),col="red",lwd=3)
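`lowess()` does not draw anything itself; it returns the coordinates of the smoothed curve, which `lines()` then connects. A quick sketch of what it returns (the object name `sm` is arbitrary):

```r
# Smoothed curve as a list of x and y coordinates
sm <- lowess(faithful$eruptions, faithful$waiting)

str(sm)          # a list with two numeric vectors, x and y
head(sm$x)       # x values, sorted from smallest to largest
head(sm$y)       # smoothed waiting time at each x
```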
