R: A Personalized Introduction (Debapriyo Majumdar, Data Mining)



SLIDE 1

R

A Personalized Introduction

Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata

August 18, 2014

SLIDE 2

About “R”

§ A suite of software tools for

– Data manipulation
– Calculations
– Graphical display

§ Largely based on the programming language S

§ Packages

– About 25 standard and recommended packages are supplied
– Many more are available for download at: http://CRAN.R-project.org

§ Free (GPL). Also BSD, MIT

SLIDE 3

Basic

§ Arithmetic

> 2+2
[1] 4

§ Assign variables

> x <- 2
> y <- 5
> z <- 2 * x + 3 * y
> z
[1] 19

§ The created objects are now stored in the workspace. List them

> ls()
[1] "x" "y" "z"

§ Also, we can remove them

> rm(x)
> ls()
[1] "y" "z"
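The same session can be sketched as a script rather than console input. The `inherits = FALSE` argument to `exists()` is an addition that is not on the slide; it restricts the lookup to the current environment.

```r
# Assign values and combine them, as on the slide
x <- 2
y <- 5
z <- 2 * x + 3 * y
print(z)  # 19

# exists() reports whether a name is bound in the current environment
had_x <- exists("x", inherits = FALSE)

# rm() removes the binding, so the name is no longer listed by ls()
rm(x)
has_x <- exists("x", inherits = FALSE)
print(c(had_x, has_x))  # TRUE FALSE
```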

SLIDE 4

Vectors

§ Creating a vector

> x <- c(2,5,9)
> y <- c(3,1,-1)
> x + y
[1] 5 6 8

§ But x * y would do an element-wise multiplication

> x * y
[1]  6  5 -9

§ But x + 2 would add 2 to all elements of x

> x + 2
[1]  4  7 11
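A short sketch of the element-wise rules above, using the slide's vectors. The dot product at the end is an added illustration, not part of the original slide.

```r
x <- c(2, 5, 9)
y <- c(3, 1, -1)

# Element-wise product: 2*3, 5*1, 9*(-1)
elementwise <- x * y
print(elementwise)  # 6 5 -9

# A scalar is a length-1 vector that R recycles to the length of x
shifted <- x + 2
print(shifted)      # 4 7 11

# The dot (inner) product is the sum of the element-wise products
dot <- sum(x * y)
print(dot)          # 2
```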

SLIDE 5

Useful functions related to vectors

§ Sequence of integers from a to b

> seq(2,9)
[1] 2 3 4 5 6 7 8 9

§ The rep (replicate) function

> rep(1,3)
[1] 1 1 1
> rep(1:3,3)
[1] 1 2 3 1 2 3 1 2 3

§ Try the help or ? command

> help(rep)
> ?rep
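Both functions take further arguments that the slide does not show. A brief sketch of two common ones, `by` for `seq()` and `each` for `rep()`:

```r
a <- seq(2, 9)             # same values as 2:9
b <- seq(2, 9, by = 2)     # step of 2: 2 4 6 8
r1 <- rep(1:3, times = 3)  # whole vector repeated: 1 2 3 1 2 3 1 2 3
r2 <- rep(1:3, each = 3)   # each element repeated: 1 1 1 2 2 2 3 3 3

print(a)
print(b)
print(r1)
print(r2)
```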

SLIDE 6

Data and Statistics – Basics

§ Many summaries are available out of the box

> x <- c(2,3,1,5,7,2,5,8,3,2,0,3,2,6,7,3,1,3,5,8,4)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    2.00    3.00    3.81    5.00    8.00

§ Selecting elements or subsets (indexing starts at 1, not 0)

> x[1]
[1] 2
> x[3:6]
[1] 1 5 7 2

§ Excluding elements by the minus sign

> x[-(2:4)]
[1] 2 7 2 5 8 3 2 0 3 2 6 7 3 1 3 5 8 4
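Besides positive and negative positions, a vector can also be indexed by a logical condition. This is not on the slide; the sketch below uses the same `x`.

```r
x <- c(2, 3, 1, 5, 7, 2, 5, 8, 3, 2, 0, 3, 2, 6, 7, 3, 1, 3, 5, 8, 4)

# Logical indexing keeps the elements where the condition is TRUE
large <- x[x > 5]
print(large)      # 7 8 6 7 8

# which() returns the positions instead of the values
positions <- which(x > 5)
print(positions)  # 5 8 14 15 20

# Negative indices drop positions 2 to 4, leaving 18 of the 21 elements
rest <- x[-(2:4)]
print(length(rest))  # 18
```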

SLIDE 7

Matrices

§ Bind columns (cbind) or rows (rbind)

> x <- c(3,5,2); y <- c(8,2,1)
> z <- cbind(x,y)
> z
     x y
[1,] 3 8
[2,] 5 2
[3,] 2 1

§ Or specify the entries and number of rows

> A <- matrix(c(3,5,2,8,2,1),nrow=3)
> B <- matrix(c(3,5,2,8,2,1),nrow=2)
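`matrix()` fills its entries column by column, which is why `A` and `B` above differ in shape but not in data. A small sketch; the `byrow = TRUE` variant `C` is an addition for contrast.

```r
A <- matrix(c(3, 5, 2, 8, 2, 1), nrow = 3)  # 3 x 2, filled column by column
B <- matrix(c(3, 5, 2, 8, 2, 1), nrow = 2)  # 2 x 3, same entries
C <- matrix(c(3, 5, 2, 8, 2, 1), nrow = 3, byrow = TRUE)  # filled row by row

print(dim(A))   # 3 2
print(dim(B))   # 2 3
print(A[1, 2])  # 8: first entry of the second column
print(C[1, 2])  # 5: second entry of the first row
```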

SLIDE 8

Matrix operations

§ Addition and scalar multiplication work element-wise, as usual

> A + 2 * A
     [,1] [,2]
[1,]    9   24
[2,]   15    6
[3,]    6    3

§ Multiplication: x * y is element-wise, not matrix multiplication

§ Matrix multiplication: %*%

> A %*% B
     [,1] [,2] [,3]
[1,]   49   70   14
[2,]   25   26   12
[3,]   11   12    5
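To see how `*` and `%*%` relate, the sketch below reuses `A` and `B` from the previous slide. The transpose `t(B)` and the comparison with `diag()` are added for illustration.

```r
A <- matrix(c(3, 5, 2, 8, 2, 1), nrow = 3)
B <- matrix(c(3, 5, 2, 8, 2, 1), nrow = 2)

# Matrix product: (3 x 2) %*% (2 x 3) gives a 3 x 3 matrix
P <- A %*% B
print(P)

# Element-wise product needs equal dimensions, so transpose B to 3 x 2
E <- A * t(B)
print(E)

# Row i of E sums to the i-th diagonal entry of P
print(rowSums(E))  # 49 26 5
print(diag(P))     # 49 26 5
```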

SLIDE 9

Inverse and Covariance of matrix

§ Compute the inverse of a matrix X, if it exists:

> solve(X)

§ Covariance matrix

> var(X)
> cov(X)

§ Covariance matrix (recall)

– $X_1, \dots, X_n$ are random variables, each with finite variance and mean $\mu_i = E[X_i]$
– $\Sigma$ is the covariance matrix, where

$$\Sigma_{ij} = \operatorname{cov}(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]$$

§ Also called var(X), the variance of the random vector X
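A minimal sketch of both operations on small made-up matrices (`M` and `X` below are hypothetical, not from the slides). Note that `cov()` treats rows as observations and columns as variables, and computes the sample estimate with denominator n - 1 rather than the expectation in the definition.

```r
# A small invertible matrix (hypothetical values; its determinant is 5)
M <- matrix(c(2, 1, 1, 3), nrow = 2)
M_inv <- solve(M)
print(M %*% M_inv)  # identity matrix, up to rounding

# Rows are observations, columns are variables (hypothetical data)
X <- cbind(a = c(1, 2, 3, 4), b = c(2, 4, 6, 9))

# Centre each column, then form the sample covariance by hand
centered <- sweep(X, 2, colMeans(X))
S <- t(centered) %*% centered / (nrow(X) - 1)
print(S)

# The hand-computed matrix agrees with cov(), and var() gives the same result
print(all.equal(S, cov(X)))
print(all.equal(cov(X), var(X)))
```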

SLIDE 10

Writing a function

§ A new function can be defined

> z <- function(x,y) 3*x + 4*y
> z(2,3)
[1] 18

§ A function with many lines

> z <- function(x,y) {
    c <- 3*x + 4*y;
    5 * c
  }

§ The value of the last expression is the output

§ Can write the function in a text file (for example prog.R) and source it

> source("/Users/deb/…/R/xTest.R")

§ Can also define a new binary operator

> "%LL%" <- function(x,y) { 3*x + 4*y }
> 5 %LL% 3
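The definitions above can be tried as a script. In this sketch the local variable is renamed from `c` to `combined` (my choice, to avoid shadowing the built-in `c()`), and the operator is defined with backticks, which is equivalent to quoting its name.

```r
# Multi-line function: the value of the last expression is returned
z <- function(x, y) {
  combined <- 3 * x + 4 * y
  5 * combined
}
print(z(2, 3))  # 90

# User-defined binary operator; the name must be wrapped in % signs
`%LL%` <- function(x, y) 3 * x + 4 * y
print(5 %LL% 3)  # 27

# Like most R arithmetic, the operator works element-wise on vectors
print(c(1, 2) %LL% c(0, 1))  # 3 10
```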

SLIDE 11

Data

§ Read an entire data frame

– The first line of the file should have a name for each variable in the data frame
– Each additional line of the file has as its first item a row label, followed by the values for each variable

   Age Income.K Owns.House
01  25        8         No
02  33        5         No
03  30      130        Yes
04  45       50        Yes
05  65        5         No
06  75        7        Yes

> H <- read.table("filename")
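To try this without a file on disk, `read.table()` also accepts the content through its `text` argument. The sketch below uses the six rows shown on the slide. Because the header line has one field fewer than the data lines, `read.table()` treats the first line as column names and the first column as row labels.

```r
# The six rows from the slide, held in a string instead of a file
raw <- "Age Income.K Owns.House
01 25 8 No
02 33 5 No
03 30 130 Yes
04 45 50 Yes
05 65 5 No
06 75 7 Yes"

H <- read.table(text = raw)

print(names(H))  # "Age" "Income.K" "Owns.House"
print(nrow(H))   # 6
print(H$Age)     # 25 33 30 45 65 75
str(H)           # column types; Owns.House is character in R >= 4.0, a factor before
```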
SLIDE 12

Using data

§ plot() tries to figure out what kind of plot will be suitable for its argument

> plot(H[1:2])

§ We want to label points based on some attribute

– Let us select a subset of the data

> H[which(H$Owns.House=='Yes'),]
   Age Income.K Owns.House
03  30      130        Yes
04  45       50        Yes
06  75        7        Yes
07  28      200        Yes
08  35       90        Yes
10  55      102        Yes
…    …        …          …
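The same selection can be written in a few equivalent ways. The sketch below rebuilds only the six rows shown on the earlier slide (the full data set is not reproduced in the deck) and compares `which()`, plain logical indexing, and `subset()`.

```r
# Only the six rows visible on the earlier slide; the full data set is not shown
H <- data.frame(
  Age        = c(25, 33, 30, 45, 65, 75),
  Income.K   = c(8, 5, 130, 50, 5, 7),
  Owns.House = c("No", "No", "Yes", "Yes", "No", "Yes"),
  row.names  = c("01", "02", "03", "04", "05", "06")
)

by_which   <- H[which(H$Owns.House == "Yes"), ]  # as on the slide
by_logical <- H[H$Owns.House == "Yes", ]         # logical indexing
by_subset  <- subset(H, Owns.House == "Yes")     # subset() helper

print(by_which)
print(rownames(by_which))  # "03" "04" "06"
```

The three forms agree here because the column has no missing values; `which()` and `subset()` drop rows where the condition is `NA`, whereas plain logical indexing keeps them as rows of `NA`.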

SLIDE 13

Using data

§ Plot one subset with blue, another with red

> HYes <- H[which(H$Owns.House=='Yes'),]
> HNo <- H[which(H$Owns.House=='No'),]
> plot(HYes[1:2], col='blue')
> points(HNo[1:2], col='red')

[Figure: scatter plot of Income.K (y-axis) against Age (x-axis); annotation: new observation (black)]

Hands on in class
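For the hands-on part, a self-contained sketch of the two-colour plot, again using only the six rows shown earlier in the deck. The `xlim`/`ylim` arguments and the legend are additions, so that points from both subsets fit inside the axes set up by the first `plot()` call.

```r
# Only the six rows visible on the earlier slide; the full data set is not shown
H <- data.frame(
  Age        = c(25, 33, 30, 45, 65, 75),
  Income.K   = c(8, 5, 130, 50, 5, 7),
  Owns.House = c("No", "No", "Yes", "Yes", "No", "Yes")
)

HYes <- H[H$Owns.House == "Yes", ]
HNo  <- H[H$Owns.House == "No", ]

# plot() sets up the axes; fix the limits from the full data so that
# the points added afterwards are not clipped
plot(HYes[1:2], col = "blue",
     xlim = range(H$Age), ylim = range(H$Income.K))

# points() adds to the existing plot instead of starting a new one
points(HNo[1:2], col = "red")

legend("topright", legend = c("Owns house", "Does not own house"),
       col = c("blue", "red"), pch = 1)
```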

SLIDE 14

References

§ The R manual: http://cran.r-project.org/doc/manuals/r-release/R-intro.html
§ A self-learn tutorial: https://www.nceas.ucsb.edu/files/scicomp/Dloads/RProgramming/BestFirstRTutorial.pdf