Data Science 101: Using R Language to get Big Insights
Satnam Singh, Senior Chief Engineer, Samsung Research India Bangalore [Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
Motivation: Using Data to get Business Insights
[Figure: Data Bases & Clusters; Insights?]
- Ref. [kaggle.com]
Data Science Programming Languages
Why R?
- Popular, Free
- Open source
- Multi-platform
- Vectorization
- Many statistical packages
- Large support base
- Object-oriented programming language
Ref [http://www.r-project.org]
R Language Basics
Simple Operations
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
Vector Operations
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Function Calls
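The basics above can be tried out with a short script. This is only a minimal sketch: the names `v`, `doubled`, `shifted`, `total` and `average` are made up for illustration and do not come from the slides.

```r
# Simple assignment: both `<-` and `=` bind a value to a name.
y <- 21
z = 233

# Vector creation with c(), then element-wise (vectorized) arithmetic.
v <- c(1, 2, 3, 4)
doubled <- v * 2                   # each element multiplied by 2
shifted <- v + c(10, 20, 30, 40)   # element-wise addition of two vectors

# Function calls on a vector.
total   <- sum(v)
average <- mean(v)

print(doubled)
print(shifted)
print(total)
print(average)
```

Vectorized operations like `v * 2` avoid an explicit loop, which is the point of the "Vectorization" bullet under "Why R?".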
R Language: Data Structures Examples
- Data frame
- Matrix
- List
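The example code on this slide did not survive extraction, so the following is a stand-in sketch of the three data structures rather than the original example. All column names and values are made up for illustration.

```r
# Data frame: columns of possibly different types, one row per observation.
df <- data.frame(activity = c("Walking", "Jogging", "Sitting"),
                 yavg     = c(9.1, 5.4, 1.2),
                 stringsAsFactors = FALSE)

# Matrix: a two-dimensional array holding a single type.
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column

# List: an ordered collection whose elements may have different types.
lst <- list(name = "sample", values = c(1, 2, 3), flag = TRUE)

print(nrow(df))        # number of rows in the data frame
print(dim(m))          # matrix dimensions (rows, columns)
print(m[2, 3])         # element in row 2, column 3
print(lst$values[2])   # second entry of the "values" element
```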
Case Study: Activity Recognition
- Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
[Figure: Smartphone's accelerometer sensor; example of accelerometer data]
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
Data Analysis - Steps
Time Series Data
- 200 samples (10 sec)
Feature Extraction: 43 Features
- Mean for each acc. axis (3)
- Std. dev. for each acc. axis (3)
- Avg. abs. diff. from mean for each acc. axis (3)
- Avg. resultant acc. (1)
- Histogram (30)
Classifiers
- CART: Decision Tree
- RF: Random Forest
Classify the Activity
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
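The per-window statistics listed above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the authors' pipeline: the window is random synthetic data, the names `window`, `avg_abs_dev` and `extract_features` are hypothetical, the resultant acceleration is assumed to be the Euclidean norm of the three axes, and the 30 histogram features are left out because the slides do not give the binning.

```r
set.seed(42)  # reproducible synthetic data

# Hypothetical window: 200 accelerometer samples (10 sec) on three axes.
# The values are random stand-ins, not real sensor readings.
window <- data.frame(x = rnorm(200, mean = 0,   sd = 1),
                     y = rnorm(200, mean = 9.8, sd = 2),
                     z = rnorm(200, mean = 0,   sd = 1))

# Average absolute difference from the mean for one axis.
avg_abs_dev <- function(a) mean(abs(a - mean(a)))

extract_features <- function(w) {
  c(avg       = sapply(w, mean),          # mean per axis (3)
    standdev  = sapply(w, sd),            # std. dev. per axis (3)
    absoldev  = sapply(w, avg_abs_dev),   # avg. abs. diff. from mean per axis (3)
    resultant = mean(sqrt(w$x^2 + w$y^2 + w$z^2)))  # avg. resultant acc. (1)
}

features <- extract_features(window)
print(round(features, 3))
```

Each window of raw samples collapses into one named numeric vector, which becomes one row of the table handed to the classifiers.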
Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
Bar Plot
require(gplots, quietly=TRUE)      # provides barplot2
require(colorspace, quietly=TRUE)  # provides rainbow_hcl
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
    summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
ord <- order(ds[1,], decreasing=TRUE)
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class", ylim=c(0, 2497), col=rainbow_hcl(7))
Dot Plot
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="", xlab="Frequency", ylab="class", pch=c(1:6, 19))
Data Visualization Example – Variable YAVG
require(colorspace, quietly=TRUE)  # provides rainbow_hcl
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
    data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency", col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
[Ref] Rattle R Data Mining Tool
Correlation Plot
- Easy to interpret
- Blue: Positive correlation
- Red: Negative correlation
[Ref] Rattle R Data Mining Tool
require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
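The color expression `5*crs$cor + 6` maps a correlation in [-1, 1] onto positions 1 to 11 of the red-white-blue palette. The sketch below shows that mapping with base R only. The data frame `feat` and its columns are synthetic and hypothetical, and `round()` is added here for clarity; it is not in the Rattle code above.

```r
set.seed(1)  # reproducible synthetic data

# Hypothetical numeric features: b is built to correlate positively with a,
# and c negatively with a.
a <- rnorm(100)
feat <- data.frame(a = a,
                   b = a + rnorm(100, sd = 0.5),
                   c = -a + rnorm(100, sd = 0.5))

# Pearson correlation on pairwise-complete observations.
cm <- cor(feat, use = "pairwise", method = "pearson")

# Reorder rows and columns by the first row's correlations.
ord <- order(cm[1, ])
cm <- cm[ord, ord]

# Map each correlation to a palette index:
# -1 -> 1 (red), 0 -> 6 (white), +1 -> 11 (blue).
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)
idx <- round(5 * cm + 6)
cell_colors <- matrix(palette11[idx], nrow = nrow(cm), dimnames = dimnames(cm))

print(round(cm, 2))
print(cell_colors)
```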
Data Science R Packages
Category     Function       Library        Description
Cluster      hclust         stats          Hierarchical cluster analysis
Cluster      kmeans         stats          K-means clustering
Classifiers  glm            stats          Logistic regression
Classifiers  rpart          rpart          Recursive partitioning and regression trees
Classifiers  ksvm           kernlab        Support Vector Machine
Classifiers  apriori        arules         Rule based classification
Ensemble     ada            ada            Stochastic boosting
Ensemble     randomForest   randomForest   Random Forests classification and regression
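The three `stats` functions in the table ship with base R, so they can be tried without installing anything. The sketch below uses the built-in `iris` and `mtcars` datasets as stand-ins for the smartphone data. The choice of 3 clusters and the model `am ~ wt` are illustrative assumptions only.

```r
set.seed(7)  # kmeans uses random starting centers

# Standardize the four numeric columns of the built-in iris data.
x <- scale(iris[, 1:4])

# hclust: hierarchical cluster analysis, cut into 3 groups.
hc <- hclust(dist(x), method = "complete")
hc_groups <- cutree(hc, k = 3)

# kmeans: K-means clustering with 3 centers and 10 random restarts.
km <- kmeans(x, centers = 3, nstart = 10)

# glm: logistic regression of transmission type (am, coded 0/1) on weight,
# using the built-in mtcars data.
fit <- glm(am ~ wt, data = mtcars, family = binomial)

print(table(hc_groups))
print(table(km$cluster))
print(coef(fit))
```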
Decision Tree - Visualization
[Ref] Rattle R Data Mining Tool
Decision Tree
rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
- Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
Root node error: 2364/3792 = 0.62342
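The smartphone dataset is not included with the slides, so the sketch below fits the same kind of tree on the built-in `iris` data as a stand-in. It assumes the `rpart` package is installed (it is a recommended package in standard R distributions). It also shows where the "Root node error" figure comes from: the number of observations outside the majority class at the root, divided by the total.

```r
library(rpart)

# iris stands in for the smartphone data, which is not available here.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"))

# Root node error: misclassified count at the root over total observations.
root_error <- fit$frame$dev[1] / fit$frame$n[1]

# Training accuracy of the fitted tree.
pred <- predict(fit, iris, type = "class")
accuracy <- mean(pred == iris$Species)

print(fit)
print(root_error)
print(accuracy)
```

In the model results above, the same ratio is 2364/3792: the root predicts Walking, and 2364 of the 3792 windows belong to other classes.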
Random Forest: Ensemble of Trees
[Figure: Tree1, Tree2, …, Treen combined (Σ) into the Random Forest]
[Ref] Rattle R Data Mining Tool
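For classification, a random forest combines its trees by majority vote. The toy sketch below shows only that voting step. The five votes are made up for illustration and the trees themselves are not built here.

```r
# Hypothetical class predictions from five trees for one window of data.
tree_votes <- c("Walking", "Upstairs", "Walking", "Jogging", "Walking")

# Count votes per class and pick the class with the most votes.
vote_counts <- table(tree_votes)
forest_prediction <- names(vote_counts)[which.max(vote_counts)]

print(vote_counts)
print(forest_prediction)
```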
Random Forest Package in R
randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)
- Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:
           Downstairs Jogging Sitting Standing Upstairs Walking class.error
Downstairs        204       7       0        1       64      97  0.45308311
Jogging             6    1117       0        0        8       7  0.01845343
Sitting             0       0     209        5        1       0  0.02790698
Standing            4       0       0      177        4       0  0.04324324
Upstairs           48      31       1        0      276      97  0.39072848
Walking            20       1       1        1       15    1390  0.02661064
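The class.error column and the OOB error rate can be recomputed from the confusion matrix counts alone. The sketch below re-enters the counts reported above (rows are actual classes, columns are predicted classes). The variable names are illustrative.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Confusion matrix counts as reported in the model results.
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE, dimnames = list(classes, classes))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: share of all observations that fall off the diagonal.
overall_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 8))
print(round(100 * overall_error, 2))   # percent, matches the 11.05% above
```

The two stair classes carry most of the error: Downstairs and Upstairs are mainly confused with each other and with Walking, while the other four classes stay below 5%.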
Summary
- Fusion of data science and domain knowledge enables big insights from the data
- R language provides a platform to rapidly build prototypes and test ideas
- Getting data insights is an outcome of intense team effort between various stakeholders
References
- R Project: http://www.r-project.org
- Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
- "Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
- R wiki: http://rwiki.sciviews.org/doku.php
- R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
- Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
- Rattle – R Data Mining Tool: http://rattle.togaware.com/
- Sensor Platforms: http://www.sensorplatforms.com/context-aware/
- Movea: http://www.movea.com/
- Alohar: https://www.alohar.com