Data Science 101: Using R Language to get Big Insights



SLIDE 1

Data Science 101: Using R Language to get Big Insights

Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore [ Twitter - @satnam74s] India Software Developers Conference, Bangalore March 16, 2013

SLIDE 2

Motivation: Using Data to get Business Insights

[Figure: Databases & Clusters; Insights?]

SLIDE 3
Data Science Programming Languages

Ref. [kaggle.com]

Why R?

  • Popular, free
  • Open source
  • Multi-platform
  • Vectorization
  • Many statistical packages
  • Large support base
  • Object-oriented programming language

Ref. [http://www.r-project.org]
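Vectorization, listed above, means that one expression operates on a whole vector with no explicit loop. A minimal sketch, with illustrative values and variable names that are not from the slides:

```r
# Illustrative values (assumed, not from the slides)
x <- c(1, 2, 3, 4)

# Vectorized: one expression operates on every element
squares_vec <- x^2

# Equivalent explicit loop, for comparison
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

identical(squares_vec, squares_loop)  # TRUE
```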

SLIDE 4

R Language Basics

Simple Operations

> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233

Vector Operations, Function Calls

> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
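A slightly longer sketch of vector operations and function calls. The helper `rescale` is a hypothetical name introduced here for illustration only:

```r
y <- c(1, 2, 3, 4)

# Element-wise vector operations
shifted <- y + 10     # 11 12 13 14
doubled <- y * 2      # 2 4 6 8

# Built-in function calls on a vector
total <- sum(y)       # 10
avg   <- mean(y)      # 2.5
n     <- length(y)    # 4

# A user-defined function (hypothetical helper): rescale to [0, 1]
rescale <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- rescale(y)  # 0, 1/3, 2/3, 1
```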

SLIDE 5

R Language: Data Structures Examples

  • Data frame
  • Matrix
  • List
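The three structures can be sketched as follows. This is an illustration with made-up names and values, not the code shown on the original slide:

```r
# Matrix: 2 rows x 3 columns, all elements of one type
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)     # 2 3
m[2, 3]    # 6 (matrices are filled column by column)

# Data frame: a table whose columns may have different types
df <- data.frame(activity = c("Walking", "Jogging", "Sitting"),
                 yavg     = c(9.1, 5.3, 1.2),   # made-up values
                 stringsAsFactors = FALSE)
df$activity[df$yavg > 5]   # "Walking" "Jogging"

# List: a container for elements of different types and lengths
lst <- list(name = "window1", samples = c(0.1, 0.4), ok = TRUE)
lst$samples[2]   # 0.4
```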

SLIDE 6

Case Study: Activity Recognition

  • Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.

[Figures: Smartphone’s accelerometer sensor; example of accelerometer data]

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar

SLIDE 7

Data Analysis - Steps

Time Series Data: 200 samples (10 sec)

Feature Extraction: 43 Features

  • Mean for each acceleration axis (3)
  • Std. dev. for each acceleration axis (3)
  • Avg. abs. diff. from mean for each acceleration axis (3)
  • Avg. resultant acceleration (1)
  • Histogram (30)

Classifiers: Classify the Activity

  • CART: Decision Tree
  • RF: Random Forest

[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
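The feature groups listed above (3 + 3 + 3 + 1 + 30 = 40 values) could be computed for one window roughly as below. This is a sketch under assumptions: the window is synthetic, the resultant is taken as the mean of sqrt(x^2 + y^2 + z^2), and "Histogram (30)" is read as 10 equal-width bins per axis. None of these details are confirmed by the slides.

```r
set.seed(42)
n <- 200  # samples per window, as on the slide (10 sec)

# Synthetic accelerometer window (assumed data, for illustration only)
win <- data.frame(x = rnorm(n, 0, 1),
                  y = rnorm(n, 9.8, 2),
                  z = rnorm(n, 0, 1.5))

# Per-axis features: mean, std. dev., avg. absolute difference from the mean
axis_features <- function(v) {
  c(avg = mean(v), standdev = sd(v), absoldev = mean(abs(v - mean(v))))
}
per_axis <- sapply(win, axis_features)   # 3 features x 3 axes

# Average resultant acceleration (assumed definition)
resultant <- mean(sqrt(win$x^2 + win$y^2 + win$z^2))

# Binned distribution: fraction of samples in each of 10 bins per axis (assumed)
bins <- sapply(win, function(v) as.vector(table(cut(v, breaks = 10))) / length(v))

features <- c(as.vector(per_axis), resultant, as.vector(bins))
length(features)   # 9 + 1 + 30 = 40
```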

SLIDE 8

Data Visualization – Activity (Class Variable)

[Ref] Rattle R Data Mining Tool

# crs$dataset: the dataset as loaded by Rattle
library(gplots)      # provides barplot2()
library(colorspace)  # provides rainbow_hcl()

# Class counts: all rows first, then one row per activity
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))

# Order the classes by overall frequency
ord <- order(ds[1,], decreasing=TRUE)

Bar Plot

bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))

Dot Plot

dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))
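For a quick look at the class distribution without Rattle, the overall counts can also be obtained with `table()` and plotted with base graphics. A minimal sketch; the labels below are made up and stand in for `crs$dataset$class`:

```r
# Made-up activity labels (assumed data)
class <- factor(c("Walking", "Jogging", "Walking", "Sitting",
                  "Walking", "Jogging", "Upstairs"))

counts <- table(class)                    # frequency of each class
ord <- order(counts, decreasing = TRUE)   # most frequent class first
sorted_counts <- counts[ord]

barplot(sorted_counts, ylab = "Frequency", xlab = "class")
```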

SLIDE 9

Data Visualization Example – Variable YAVG

# YAVG values for all rows, then for each activity
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))

# Box plot per group, with the group means overlaid as stars
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7),
              xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)

# Histogram of YAVG over all rows
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
           col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)

[Ref] Rattle R Data Mining Tool
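The same kind of plot can be sketched with base R only, using `tapply()` for the group means instead of `doBy`. The data below are synthetic and the group parameters are assumptions chosen for illustration:

```r
set.seed(7)

# Synthetic YAVG-like values for three activities (assumed data)
ds <- data.frame(
  dat = c(rnorm(50, 9, 2), rnorm(50, 5, 3), rnorm(50, 1, 0.5)),
  grp = factor(rep(c("Walking", "Jogging", "Sitting"), each = 50))
)

# One box per group; box width reflects group size
bp <- boxplot(dat ~ grp, data = ds, xlab = "class", ylab = "YAVG",
              varwidth = TRUE)

# Overlay the group means (same order as the factor levels)
grp_means <- tapply(ds$dat, ds$grp, mean, na.rm = TRUE)
points(seq_along(grp_means), grp_means, pch = 8)
```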

SLIDE 10
Correlation Plot

  • Easy to interpret
  • Blue: Positive correlation
  • Red: Negative correlation

require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])

[Ref] Rattle R Data Mining Tool
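The expression `5*crs$cor + 6` maps a correlation in [-1, 1] to a position in the 11-color palette: -1 gives index 1 (red), 0 gives index 6 (white), and +1 gives index 11 (blue). A small sketch with hand-picked correlation values (assumed, for illustration):

```r
# 11-step palette from red through white to blue, as in the call above
pal <- colorRampPalette(c("red", "white", "blue"))(11)

# Hand-picked correlations (assumed values)
r <- c(-1, -0.5, 0, 0.5, 1)

# Linear map from [-1, 1] to [1, 11]
idx <- 5 * r + 6          # 1.0 3.5 6.0 8.5 11.0

# R truncates fractional indices toward zero, so 3.5 selects element 3
cols <- pal[idx]
```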

SLIDE 11

Data Science R Packages

Functions      Library        Description

Cluster
hclust         stats          Hierarchical cluster analysis
kmeans         stats          K-means clustering

Classifiers
glm            stats          Logistic regression
rpart          rpart          Recursive partitioning and regression trees
ksvm           kernlab        Support Vector Machine
apriori        arules         Rule based classification

Ensemble
ada            ada            Stochastic boosting
randomForest   randomForest   Random Forests classification and regression
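The three `stats` functions in the table can be tried on datasets that ship with R. A minimal sketch; the choice of `iris` and `mtcars`, and of 3 clusters, is an assumption for illustration and is unrelated to the smartphone data:

```r
# Standardize the four numeric iris columns
feat <- scale(iris[, 1:4])

# Hierarchical clustering, cut into 3 groups
hc <- hclust(dist(feat), method = "complete")
hc_groups <- cutree(hc, k = 3)

# K-means with 3 centers (several random starts for stability)
set.seed(10)
km <- kmeans(feat, centers = 3, nstart = 10)

# Logistic regression: transmission type (am) as a function of weight (wt)
fit <- glm(am ~ wt, data = mtcars, family = binomial)
prob <- predict(fit, type = "response")   # fitted probabilities
```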

SLIDE 12

Decision Tree - Visualization

[Ref] Rattle R Data Mining Tool

SLIDE 13
Decision Tree

rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

  • Decision Tree Model Results:

n= 3792

 1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
   2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
     4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
     5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
   3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
     6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)

Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG

Root node error: 2364/3792 = 0.62342
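Since `smartphone_data` is not included with the slides, the same `rpart` call can be tried on the built-in `iris` data. A sketch; the dataset and the training-set accuracy check are substitutions made here for illustration:

```r
library(rpart)

# Same settings as the call above, applied to iris instead of smartphone_data
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

print(fit)   # node listing in the same format as the results above

# Predicted class for each row, and accuracy on the training data
pred <- predict(fit, iris, type = "class")
train_accuracy <- mean(pred == iris$Species)
```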

SLIDE 14

Random Forest: Ensemble of Trees

[Figure: Tree1, Tree2, ..., Treen combined (Σ) into the Random Forest]

[Ref] Rattle R Data Mining Tool
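The idea of combining many trees can be sketched with plain bagging: fit each tree on a bootstrap sample and take a majority vote. This is a simplification and not the `randomForest` algorithm itself, which also samples a subset of candidate variables (`mtry`) at each split. The dataset (`iris`) and the number of trees are assumptions for illustration:

```r
library(rpart)
set.seed(1)

n_trees <- 25   # assumed ensemble size, for illustration

# Each column holds one tree's predicted class for every row of iris
votes <- sapply(seq_len(n_trees), function(b) {
  idx  <- sample(nrow(iris), replace = TRUE)               # bootstrap sample
  tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")
  as.character(predict(tree, iris, type = "class"))
})

# Majority vote across trees for each row
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
ensemble_accuracy <- mean(majority == iris$Species)
```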

SLIDE 15
Random Forest Package in R

randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)

  • Random Forest Model Results:

Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064
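The `class.error` column and the OOB error rate follow directly from the confusion matrix: each row's error is the fraction of off-diagonal counts in that row, and the overall rate is the fraction of off-diagonal counts in the whole matrix. A worked check using the counts reported above (variable names are illustrative):

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that is off the diagonal
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: share of all observations that are off the diagonal
oob_error <- 1 - sum(diag(cm)) / sum(cm)

sum(cm)                     # 3792 observations
round(100 * oob_error, 2)   # 11.05
```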

SLIDE 16
Summary

  • Fusion of data science and domain knowledge enables big insights from the data
  • The R language provides a platform to rapidly build prototypes and test ideas
  • Getting data insights is an outcome of intense team effort between various stakeholders

SLIDE 17
References

  • R Project: http://www.r-project.org
  • Activity Recognition Dataset: “The Impact of Personalization on Smartphone-Based Activity Recognition”, Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
  • “Activity and Gait Recognition with Time-Delay Embeddings”, Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
  • R wiki: http://rwiki.sciviews.org/doku.php
  • R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
  • Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
  • Rattle – R Data Mining Tool: http://rattle.togaware.com/
  • Sensor Platforms: http://www.sensorplatforms.com/context-aware/
  • Movea: http://www.movea.com/
  • Alohar: https://www.alohar.com