Data Science 101: Using R Language to get Big Insights Satnam - PowerPoint PPT Presentation

Data Science 101: Using R Language to get Big Insights
Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore [Twitter: @satnam74s]
India Software Developers Conference, Bangalore, March 16, 2013

Motivation: Using Data to get Business Insights
[Figure: databases and clusters, each leading to the question "Insights?"]

Data Science Programming Languages: Why R?
• Popular, free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
[Ref] kaggle.com
[Ref] http://www.r-project.org

R Language Basics
Function calls and vector operations:
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Simple operations:
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
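The vectorization listed under "Why R?" can be illustrated with the same kind of vector used in the basics slide. This is a minimal sketch in base R. The names `doubled`, `total` and `above2` are illustrative and do not come from the slides.

```r
# Arithmetic on a vector applies element by element, with no explicit loop.
y <- c(1, 2, 3, 4)

doubled <- y * 2      # multiplies every element of y
total   <- sum(y)     # a function call that reduces the vector to one number
above2  <- y[y > 2]   # logical indexing keeps only the matching elements

print(doubled)
print(total)
print(above2)
```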

R Language: Data Structures
Examples:
• Data frame
• Matrix
• List
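The three structures named on this slide can be sketched as follows. The original slide examples did not survive extraction, so every value and name below (`df`, `m`, `lst`, the activity labels and numbers) is a hypothetical stand-in rather than the author's example.

```r
# Hypothetical values for illustration only.

# Data frame: a table whose columns may have different types.
df <- data.frame(activity = c("Walking", "Jogging", "Sitting"),
                 yavg     = c(9.1, 5.4, 1.2),
                 stringsAsFactors = FALSE)

# Matrix: a 2-D array of a single type, filled column by column.
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: an ordered collection of arbitrary objects, optionally named.
lst <- list(name = "sensor", axes = c("x", "y", "z"), rate_hz = 20)

print(df$yavg)       # one column of the data frame
print(dim(m))        # rows and columns of the matrix
print(lst$axes[2])   # second element of a list component
```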

Case Study: Activity Recognition
• Activity recognition: detect walking, driving, biking, climbing stairs, standing, etc. from a smartphone's accelerometer sensor
[Figure: example of accelerometer data]
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API providers: Sensor Platforms, Movea, Alohar

Data Analysis - Steps
Time series data: 200 samples (10 sec)
Feature extraction: 43 features, including
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)
Classifiers (classify the activity):
• CART: Decision Tree
• RF: Random Forest
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
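The feature extraction step can be sketched in base R for a single 200-sample window. This is an illustration under stated assumptions, not the authors' code. The window is synthetic, the function name `extract_features` is hypothetical, and the histogram is assumed to be 10 equal-width bins per axis (which gives the 30 histogram values listed). The sketch computes only the 40 features itemized on the slide.

```r
set.seed(1)
# Synthetic 10-second window: 200 samples for each of the x, y, z axes.
win <- data.frame(x = rnorm(200), y = rnorm(200, mean = 9.8), z = rnorm(200))

extract_features <- function(w, bins = 10) {
  avg    <- sapply(w, mean)                                # mean per axis (3)
  stddev <- sapply(w, sd)                                  # std. dev. per axis (3)
  absdev <- sapply(w, function(a) mean(abs(a - mean(a))))  # avg. abs. diff. from mean (3)
  resultant <- mean(sqrt(w$x^2 + w$y^2 + w$z^2))           # avg. resultant acceleration (1)
  # Assumed binning: fraction of samples in each of `bins` equal-width bins per axis.
  hist_bins <- unlist(lapply(w, function(a) {
    breaks <- seq(min(a), max(a), length.out = bins + 1)
    idx <- findInterval(a, breaks, rightmost.closed = TRUE, all.inside = TRUE)
    tabulate(idx, nbins = bins) / length(a)
  }))
  c(avg = avg, stddev = stddev, absdev = absdev,
    resultant = resultant, hist = hist_bins)
}

features <- extract_features(win)
print(length(features))
```

Each window of raw samples becomes one row of features, and those rows are what the classifiers are trained on.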

Data Visualization – Activity (Class Variable): Bar Plot and Dot Plot
[Ref] Rattle R Data Mining Tool

library(gplots)      # provides barplot2()
library(colorspace)  # provides rainbow_hcl()

# Class counts for the whole dataset, then for each activity (Rattle-generated code).
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))

# Sort classes by overall frequency.
ord <- order(ds[1,], decreasing=TRUE)

# Bar plot
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
               ylim=c(0, 2497), col=rainbow_hcl(7))

# Dot plot
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
         xlab="Frequency", ylab="class", pch=c(1:6, 19))

Data Visualization Example – Variable YAVG
[Ref] Rattle R Data Mining Tool

# YAVG values for the whole dataset and for each activity (Rattle-generated code).
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))

# Box plot per group, with group means overlaid as stars.
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7),
              xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)

# Histogram of YAVG over the whole dataset.
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
           col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)

Correlation Plot
• Easy to interpret
• Blue: positive correlation
• Red: negative correlation
[Ref] Rattle R Data Mining Tool

require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
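The colour indexing in the `plotcorr` call can be unpacked with base R alone. The sketch below uses a synthetic data frame `d` in place of `crs$dataset[, crs$numeric]`. The names `d`, `r`, `pal` and `idx` are illustrative assumptions, not part of the Rattle output.

```r
set.seed(42)
# Synthetic numeric features: b tracks a, c moves against a.
a <- rnorm(100)
d <- data.frame(a = a,
                b = a + rnorm(100, sd = 0.3),
                c = -a + rnorm(100, sd = 0.3))

r <- cor(d, use = "pairwise", method = "pearson")

# 11-step palette from red (position 1) through white (6) to blue (11).
pal <- colorRampPalette(c("red", "white", "blue"))(11)

# 5 * r + 6 maps a correlation in [-1, 1] onto palette positions 1..11.
# R truncates fractional indices when subsetting, so pal[idx] is valid.
idx <- 5 * r + 6

print(round(r, 2))
print(pal[c(1, 6, 11)])
```

A correlation of -1 lands on position 1 (red), 0 on position 6 (white) and +1 on position 11 (blue), which matches the colour legend on the slide.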

Data Science R Packages

Category     Function      Library       Description
Cluster      hclust        stats         Hierarchical cluster analysis
Cluster      kmeans        stats         K-means clustering
Classifiers  glm           stats         Logistic regression
Classifiers  rpart         rpart         Recursive partitioning and regression trees
Classifiers  ksvm          kernlab       Support Vector Machine
Classifiers  apriori       arules        Rule-based classification
Ensemble     ada           ada           Stochastic boosting
Ensemble     randomForest  randomForest  Random Forests classification and regression
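The two clustering functions from the `stats` package can be tried without installing anything. This is a minimal sketch on synthetic 2-D points. The data and the names `pts`, `km`, `hc` and `hc_groups` are illustrative and unrelated to the activity dataset.

```r
set.seed(7)
# Two well-separated synthetic groups of 20 points each (illustrative data only).
pts <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))

# K-means clustering with two centers and a few random restarts.
km <- kmeans(pts, centers = 2, nstart = 5)

# Hierarchical clustering on Euclidean distances, cut into two groups.
hc <- hclust(dist(pts), method = "complete")
hc_groups <- cutree(hc, k = 2)

print(table(km$cluster))
print(table(hc_groups))
```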

Decision Tree - Visualization [Ref] Rattle R Data Mining Tool

Decision Tree

rpart(formula = class ~ ., data = smartphone_data, method = "class",
      parms = list(split = "information"),
      control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

• Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)

Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG

Root node error: 2364/3792 = 0.62342
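The `split = "information"` setting asks the tree to choose splits by an entropy-based criterion. The idea can be sketched in base R as the information gain of one candidate split. This is an illustration of the concept, not rpart's internal implementation. The helper names `entropy` and `info_gain` and the toy data are hypothetical.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by the logical vector `left`.
info_gain <- function(labels, left) {
  n <- length(labels)
  entropy(labels) -
    (sum(left) / n) * entropy(labels[left]) -
    (sum(!left) / n) * entropy(labels[!left])
}

# Hypothetical toy data: a threshold on a made-up feature separates the classes.
activity  <- c("Jogging", "Jogging", "Jogging", "Walking", "Walking", "Walking")
yabsoldev <- c(6.2, 5.8, 7.1, 2.0, 3.3, 1.5)

gain <- info_gain(activity, yabsoldev >= 5.095)
print(gain)
```

A split that separates the classes perfectly removes all uncertainty, so its gain equals the entropy of the parent node.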

Random Forest: Ensemble of Trees
[Figure: Tree 1, Tree 2, …, Tree n combined (Σ) into a Random Forest]
[Ref] Rattle R Data Mining Tool
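For classification, the combining step (Σ) is usually a majority vote over the trees' predictions. The sketch below shows that vote in base R. The three "trees" and their predictions are hypothetical, and this is not the randomForest package's internal code.

```r
# Hypothetical predictions from three trees for four windows (rows = trees).
votes <- rbind(tree1 = c("Walking",  "Jogging", "Sitting",  "Walking"),
               tree2 = c("Walking",  "Jogging", "Standing", "Upstairs"),
               tree3 = c("Upstairs", "Jogging", "Sitting",  "Walking"))

# For each window (column), pick the class predicted by the most trees.
majority <- apply(votes, 2, function(v) names(which.max(table(v))))
print(majority)
```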

Random Forest Package in R

randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
             importance = TRUE, replace = FALSE, na.action = na.roughfix)

• Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%

Confusion matrix:
            Downstairs  Jogging  Sitting  Standing  Upstairs  Walking  class.error
Downstairs         204        7        0         1        64       97   0.45308311
Jogging              6     1117        0         0         8        7   0.01845343
Sitting              0        0      209         5         1        0   0.02790698
Standing             4        0        0       177         4        0   0.04324324
Upstairs            48       31        1         0       276       97   0.39072848
Walking             20        1        1         1        15     1390   0.02661064
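The `class.error` column and the OOB error rate follow directly from the counts in the confusion matrix. This can be checked with a few lines of base R. The counts are copied from the output above, and the variable names `cm`, `class_error` and `oob_error` are illustrative.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Rows are the actual classes, columns the predicted classes.
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(cm) / rowSums(cm)

# Overall error: share of all observations that were misclassified.
oob_error <- 1 - sum(diag(cm)) / sum(cm)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

The matrix sums to 3792 observations, of which 419 lie off the diagonal, giving the reported 11.05%. Most of the errors come from confusing Downstairs and Upstairs with each other and with Walking.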

Summary
• Fusing data science with domain knowledge enables big insights from data
• The R language provides a platform for rapidly building prototypes and testing ideas
• Getting insights from data is the outcome of an intense team effort among various stakeholders

References
• R Project: http://www.r-project.org
• Activity recognition dataset: Gary M. Weiss and Jeffrey W. Lockhart, "The Impact of Personalization on Smartphone-Based Activity Recognition", Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• Jordan Frank, "Activity and Gait Recognition with Time-Delay Embeddings", AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle, R Data Mining Tool: http://rattle.togaware.com/
• Sensor Platforms: http://www.sensorplatforms.com/context-aware/
• Movea: http://www.movea.com/
• Alohar: https://www.alohar.com
