slide-1
SLIDE 1

Introduction to Data Mining with R [1]

Yanchang Zhao

http://www.RDataMining.com

Statistical Modelling and Computing Workshop at Geoscience Australia

8 May 2015

[1] Presented at AusDM 2014 (QUT, Brisbane) in Nov 2014, at Twitter (US) in Oct 2014, at UJAT (Mexico) in Sept 2014, and at University of Canberra in Sept 2013.

1 / 44

slide-4
SLIDE 4

Questions

◮ Do you know data mining and its algorithms and techniques?
◮ Have you heard of R?
◮ Have you ever used R in your work?

2 / 44

slide-5
SLIDE 5

Outline

Introduction
Classification with R
Clustering with R
Association Rule Mining with R
Text Mining with R
Time Series Analysis with R
Social Network Analysis with R
R and Big Data
Online Resources

3 / 44

slide-6
SLIDE 6

What is R?

◮ R [2] is a free software environment for statistical computing and graphics.
◮ R can be easily extended with 6,600+ packages available on CRAN [3] (as of May 2015).
◮ Many other packages provided on Bioconductor [4], R-Forge [5], GitHub [6], etc.
◮ R manuals on CRAN [7]
  ◮ An Introduction to R
  ◮ The R Language Definition
  ◮ R Data Import/Export
  ◮ ...

[2] http://www.r-project.org/
[3] http://cran.r-project.org/
[4] http://www.bioconductor.org/
[5] http://r-forge.r-project.org/
[6] https://github.com/
[7] http://cran.r-project.org/manuals.html

4 / 44

slide-7
SLIDE 7

Why R?

◮ R is widely used in both academia and industry.
◮ R was ranked no. 1 in the KDnuggets 2014 poll on Top Languages for analytics, data mining, data science [8] (actually, no. 1 in 2011, 2012 & 2013!).
◮ The CRAN Task Views [9] provide collections of packages for different tasks.
  ◮ Machine learning & statistical learning
  ◮ Cluster analysis & finite mixture models
  ◮ Time series analysis
  ◮ Multivariate statistics
  ◮ Analysis of spatial data
  ◮ ...

[8] http://www.kdnuggets.com/polls/2014/languages-analytics-data-mining-data-science.html
[9] http://cran.r-project.org/web/views/

5 / 44

slide-9
SLIDE 9

Classification with R

◮ Decision trees: rpart, party
◮ Random forest: randomForest, party
◮ SVM: e1071, kernlab
◮ Neural networks: nnet, neuralnet, RSNNS
◮ Performance evaluation: ROCR

7 / 44

slide-10
SLIDE 10

The Iris Dataset

# iris data
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0...
## $ Species : Factor w/ 3 levels "setosa","versicolor",....

# split into training and test datasets
set.seed(1234)
ind <- sample(2, nrow(iris), replace=TRUE, prob=c(0.7, 0.3))
iris.train <- iris[ind==1, ]
iris.test <- iris[ind==2, ]

8 / 44
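
The split above draws a label (1 or 2) for every row independently, so the 70/30 proportion is only approximate and the classes are not stratified. A minimal sketch for checking what the split actually produced, using base R only; the names split.sizes, split.share and class.by.split are illustrative and not part of the original slides:

```r
# Reproduce the random train/test labels from the slide
set.seed(1234)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))

# Number of rows that ended up in each part (1 = train, 2 = test)
split.sizes <- table(factor(ind, levels = 1:2, labels = c("train", "test")))

# Realised proportions, which will be close to but not exactly 0.7 / 0.3
split.share <- prop.table(split.sizes)

# Class balance within each part; the sampling is not stratified by species
class.by.split <- table(iris$Species, ind)

print(split.sizes)
print(round(split.share, 2))
print(class.by.split)
```

If exact sizes or balanced classes matter, sampling row indices per species would be one alternative; that is a design choice outside the original example.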

slide-11
SLIDE 11

Build a Decision Tree

# build a decision tree
library(party)
iris.formula <- Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
iris.ctree <- ctree(iris.formula, data=iris.train)

9 / 44

slide-12
SLIDE 12

plot(iris.ctree)

[Figure: the fitted conditional inference tree. Node 1 splits on Petal.Length (p < 0.001) at 1.9; node 3 splits on Petal.Width (p < 0.001) at 1.7; node 4 splits on Petal.Length (p = 0.026) at 4.4. Terminal nodes show class proportions: Node 2 (n = 40), Node 5 (n = 21), Node 6 (n = 19), Node 7 (n = 32).]

10 / 44

slide-13
SLIDE 13

Prediction

# predict on test data
pred <- predict(iris.ctree, newdata = iris.test)
# check prediction result
table(pred, iris.test$Species)
##
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         12         2
##   virginica       0          0        14

11 / 44
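
The prediction table can be turned into summary figures. The sketch below is not from the original slides: it re-enters the counts shown in the table by hand (rows are predictions, columns are true species) and derives accuracy, recall and precision with base R. The object names are illustrative.

```r
# Counts copied from the table on the slide; the only off-diagonal
# count is the 2 virginica flowers predicted as versicolor
species <- c("setosa", "versicolor", "virginica")
conf.mat <- matrix(c(10,  0,  0,
                      0, 12,  2,
                      0,  0, 14),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(pred = species, actual = species))

# Overall accuracy: correctly classified cases over all test cases
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)

# Recall per true class (column-wise) and precision per predicted class (row-wise)
recall <- diag(conf.mat) / colSums(conf.mat)
precision <- diag(conf.mat) / rowSums(conf.mat)

print(round(accuracy, 3))
print(round(recall, 3))
print(round(precision, 3))
```

With these counts, 36 of the 38 test cases are classified correctly, and both errors are virginica flowers labelled as versicolor.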

slide-15
SLIDE 15

Clustering with R

◮ k-means: kmeans(), kmeansruns() [10]
◮ k-medoids: pam(), pamk()
◮ Hierarchical clustering: hclust(), agnes(), diana()
◮ DBSCAN: fpc
◮ BIRCH: birch
◮ Cluster validation: packages clv, clValid, NbClust

[10] Functions are followed with "()", and others are packages.

13 / 44

slide-16
SLIDE 16

k-means Clustering

set.seed(8953)
iris2 <- iris
# remove class IDs
iris2$Species <- NULL
# k-means clustering
iris.kmeans <- kmeans(iris2, 3)
# check result
table(iris$Species, iris.kmeans$cluster)
##
##               1  2  3
##   setosa      0 50  0
##   versicolor  2  0 48
##   virginica  36  0 14

14 / 44
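
A k-means result depends on the random starting centres and on the chosen number of clusters. As a hedged illustration that is not part of the original slides, the sketch below reruns kmeans() with several random starts (nstart = 10 is an arbitrary choice) and records the total within-cluster sum of squares for k = 1 to 6, which is the quantity usually inspected in an "elbow" plot.

```r
# Numeric columns only, as on the slide
iris2 <- iris[, 1:4]

set.seed(8953)
# Total within-cluster sum of squares for k = 1..6,
# keeping the best of 10 random starts for each k
wss <- sapply(1:6, function(k) {
  kmeans(iris2, centers = k, nstart = 10)$tot.withinss
})
names(wss) <- 1:6

print(round(wss, 1))
# plot(1:6, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")
```

A flattening of the curve after some k is commonly read as a hint for a reasonable number of clusters; it is a heuristic, not a test.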

slide-17
SLIDE 17

# plot clusters and their centers
plot(iris2[c("Sepal.Length", "Sepal.Width")], col=iris.kmeans$cluster)
points(iris.kmeans$centers[, c("Sepal.Length", "Sepal.Width")], col=1:3, pch="*", cex=5)

[Figure: scatter plot of Sepal.Width against Sepal.Length, points coloured by k-means cluster, with the three cluster centres marked by asterisks.]

15 / 44

slide-18
SLIDE 18

Density-based Clustering

library(fpc)
iris2 <- iris[-5] # remove class IDs
# DBSCAN clustering
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)
# compare clusters with original class IDs
table(ds$cluster, iris$Species)
##
##     setosa versicolor virginica
##   0      2         10        17
##   1     48          0         0
##   2      0         37         0
##   3      0          3        33

16 / 44
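
The two DBSCAN parameters describe a neighbourhood test: a point is dense ("core") when at least MinPts points lie within distance eps of it. The base-R sketch below illustrates that test only; it is not the fpc implementation, and it assumes that a point counts as its own neighbour, a convention that should be checked against the package documentation. The names n.within and is.core are illustrative.

```r
iris2 <- iris[-5]   # remove class IDs, as on the slide
eps <- 0.42
MinPts <- 5

# Pairwise Euclidean distances between all flowers
d <- as.matrix(dist(iris2))

# Number of points within eps of each point (the point itself included,
# because its distance to itself is 0)
n.within <- rowSums(d <= eps)

# Points that satisfy the density condition under this convention
is.core <- n.within >= MinPts

print(table(is.core))
```

Points that fail the test are not automatically noise: in DBSCAN they may still join a cluster as border points if they lie within eps of a core point.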

slide-19
SLIDE 19

# 1-3: clusters; 0: outliers or noise
plotcluster(iris2, ds$cluster)

[Figure: discriminant projection plot of the DBSCAN result, with axes dc 1 and dc 2 and each point drawn as its cluster number.]

17 / 44

slide-21
SLIDE 21

Association Rule Mining with R

◮ Association rules: apriori(), eclat() in package arules
◮ Sequential patterns: arulesSequence
◮ Visualisation of associations: arulesViz

19 / 44

slide-22
SLIDE 22

The Titanic Dataset

load("./data/titanic.raw.rdata")
dim(titanic.raw)
## [1] 2201 4
idx <- sample(1:nrow(titanic.raw), 8)
titanic.raw[idx, ]
##      Class    Sex   Age Survived
## 501    3rd   Male Adult       No
## 477    3rd   Male Adult       No
## 674    3rd   Male Adult       No
## 766   Crew   Male Adult       No
## 1485   3rd Female Adult       No
## 1388   2nd Female Adult       No
## 448    3rd   Male Adult       No
## 590    3rd   Male Adult       No

20 / 44

slide-23
SLIDE 23

Association Rule Mining

# find association rules with the APRIORI algorithm
library(arules)
rules <- apriori(titanic.raw,
                 control=list(verbose=FALSE),
                 parameter=list(minlen=2, supp=0.005, conf=0.8),
                 appearance=list(rhs=c("Survived=No", "Survived=Yes"),
                                 default="lhs"))
# sort rules
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="lift")
# have a look at rules
# inspect(rules.sorted)

21 / 44

slide-24
SLIDE 24

#    lhs              rhs             support confidence  lift
# 1  {Class=2nd,
#     Age=Child}   => {Survived=Yes}    0.011      1.000 3.096
# 2  {Class=2nd,
#     Sex=Female,
#     Age=Child}   => {Survived=Yes}    0.006      1.000 3.096
# 3  {Class=1st,
#     Sex=Female}  => {Survived=Yes}    0.064      0.972 3.010
# 4  {Class=1st,
#     Sex=Female,
#     Age=Adult}   => {Survived=Yes}    0.064      0.972 3.010
# 5  {Class=2nd,
#     Sex=Male,
#     Age=Adult}   => {Survived=No}     0.070      0.917 1.354
# 6  {Class=2nd,
#     Sex=Female}  => {Survived=Yes}    0.042      0.877 2.716
# 7  {Class=Crew,
#     Sex=Female}  => {Survived=Yes}    0.009      0.870 2.692
# 8  {Class=Crew,
#     Sex=Female,
#     Age=Adult}   => {Survived=Yes}    0.009      0.870 2.692
# 9  {Class=2nd,
#     Sex=Male}    => {Survived=No}     0.070      0.860 1.271
# 10 {Class=2nd,

22 / 44
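
The support, confidence and lift columns can be checked by hand for a single rule. The sketch below is an illustration rather than part of the original slides: it assumes that titanic.raw is the case-level expansion of R's built-in Titanic contingency table (both cover 2201 people) and recomputes the three measures for rule 1, {Class=2nd, Age=Child} => {Survived=Yes}, using base R only.

```r
# Built-in 4-way contingency table as a data frame with a Freq column
tt <- as.data.frame(Titanic)
n <- sum(tt$Freq)                      # total number of people

lhs <- tt$Class == "2nd" & tt$Age == "Child"
rhs <- tt$Survived == "Yes"

n.lhs  <- sum(tt$Freq[lhs])            # cases matching the left-hand side
n.rhs  <- sum(tt$Freq[rhs])            # cases matching the right-hand side
n.both <- sum(tt$Freq[lhs & rhs])      # cases matching both sides

support    <- n.both / n               # P(lhs and rhs)
confidence <- n.both / n.lhs           # P(rhs | lhs)
lift       <- confidence / (n.rhs / n) # confidence relative to P(rhs)

print(round(c(support = support, confidence = confidence, lift = lift), 3))
```

The rounded values agree with the first row of the rule listing (0.011, 1.000, 3.096). A lift above 1 means the left-hand side makes survival more likely than it is in the data overall.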

slide-25
SLIDE 25

library(arulesViz)
plot(rules, method = "graph")

[Figure: "Graph for 12 rules". The left-hand-side itemsets {Class=1st,Sex=Female,Age=Adult}, {Class=1st,Sex=Female}, {Class=2nd,Age=Child}, {Class=2nd,Sex=Female,Age=Adult}, {Class=2nd,Sex=Female,Age=Child}, {Class=2nd,Sex=Female}, {Class=2nd,Sex=Male,Age=Adult}, {Class=2nd,Sex=Male}, {Class=3rd,Sex=Male,Age=Adult}, {Class=3rd,Sex=Male}, {Class=Crew,Sex=Female,Age=Adult} and {Class=Crew,Sex=Female} are linked to {Survived=No} or {Survived=Yes}. Width: support (0.006 − 0.192); color: lift (1.222 − 3.096).]

23 / 44

slide-27
SLIDE 27

Text Mining with R

◮ Text mining: tm
◮ Topic modelling: topicmodels, lda
◮ Word cloud: wordcloud
◮ Twitter data access: twitteR

25 / 44

slide-28
SLIDE 28

Retrieve Tweets

Retrieve recent tweets by @RDataMining

## Option 1: retrieve tweets from Twitter
library(twitteR)
tweets <- userTimeline("RDataMining", n = 3200)

## Option 2: download @RDataMining tweets from RDataMining.com
url <- "http://www.rdatamining.com/data/rdmTweets.RData"
download.file(url, destfile = "./data/rdmTweets.RData")

## load tweets into R
load(file = "./data/rdmTweets.RData")
(n.tweet <- length(tweets))
## [1] 320
strwrap(tweets[[320]]$text, width = 55)
## [1] "An R Reference Card for Data Mining is now available"
## [2] "on CRAN. It lists many useful R functions and packages"
## [3] "for data mining applications."

26 / 44

slide-29
SLIDE 29

Text Cleaning

library(tm)
# convert tweets to a data frame
df <- twListToDF(tweets)
# build a corpus
myCorpus <- Corpus(VectorSource(df$text))
# convert to lower case
myCorpus <- tm_map(myCorpus, tolower)
# remove punctuations and numbers
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpus <- tm_map(myCorpus, removeNumbers)
# remove URLs, 'http' followed by non-space characters
removeURL <- function(x) gsub("http[^[:space:]]*", "", x)
myCorpus <- tm_map(myCorpus, removeURL)
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(stopwords("english"), c("r", "big"))
# remove stopwords
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

27 / 44
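
The cleaning steps can be tried on plain character vectors without any packages. The sketch below is a rough base-R analogue, not the tm pipeline from the slide: the function clean.text, the two sample texts, the example.com URL and the short stopword list are all made up for illustration, and URLs are removed before punctuation so that the pattern still sees "http://".

```r
# Hypothetical sample texts standing in for tweets
sample.text <- c("An R Reference Card for Data Mining: http://example.com/card",
                 "Big data & R, 2015!")

# A tiny illustrative stopword list; tm's stopwords("english") is much longer
my.stopwords <- c("an", "for", "is", "the", "and")

clean.text <- function(x, stopwords) {
  x <- tolower(x)                            # convert to lower case
  x <- gsub("http[^[:space:]]*", "", x)      # remove URLs
  x <- gsub("[[:punct:]]", "", x)            # remove punctuation
  x <- gsub("[[:digit:]]", "", x)            # remove numbers
  words <- strsplit(x, "[[:space:]]+")       # split into words
  # drop empty strings and stopwords, then glue the words back together
  sapply(words, function(w) {
    paste(w[nzchar(w) & !(w %in% stopwords)], collapse = " ")
  })
}

cleaned <- clean.text(sample.text, my.stopwords)
print(cleaned)
```

Keeping "r" and "big" out of the stopword list, as the slide does, matters here because both are meaningful terms in this corpus.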

slide-30
SLIDE 30

Stemming

# keep a copy of corpus
myCorpusCopy <- myCorpus
# stem words
myCorpus <- tm_map(myCorpus, stemDocument)
# stem completion
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
# replace "miners" with "mining", because "mining" was
# first stemmed to "mine" and then completed to "miners"
myCorpus <- tm_map(myCorpus, gsub, pattern="miners", replacement="mining")
strwrap(myCorpus[320], width=55)
## [1] "r reference card data mining now available cran list"
## [2] "used r functions package data mining applications"

28 / 44

slide-31
SLIDE 31

Frequent Terms

myTdm <- TermDocumentMatrix(myCorpus, control=list(wordLengths=c(1,Inf)))
# inspect frequent words
(freq.terms <- findFreqTerms(myTdm, lowfreq=20))
## [1] "analysis" "big" "computing" "data" ...
## [5] "examples" "mining" "network" "package"...
## [9] "position" "postdoctoral" "r" "research...
## [13] "slides" "social" "tutorial" "universi...
## [17] "used"

29 / 44

slide-32
SLIDE 32

Associations

# which words are associated with 'r'?
findAssocs(myTdm, "r", 0.2)
##             r
## examples 0.32
## code     0.29
## package  0.20

# which words are associated with 'mining'?
findAssocs(myTdm, "mining", 0.25)
##                mining
## data             0.47
## mahout           0.30
## recommendation   0.30
## sets             0.30
## supports         0.30
## frequent         0.26
## itemset          0.26

30 / 44
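
The association scores are correlations between the rows of the term-document matrix, with a lower limit applied. The sketch below shows that idea on a toy matrix in base R; the matrix toy.tdm and its counts are invented for illustration and the code is not tm's own implementation.

```r
# Toy term-document matrix: rows are terms, columns are documents
toy.tdm <- matrix(c(1, 1, 0, 1,
                    1, 1, 0, 0,
                    0, 0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("r", "examples", "mining"),
                                  paste0("doc", 1:4)))

# Correlation of the term "r" with every other term across documents
assoc <- cor(t(toy.tdm))["r", ]
assoc <- assoc[names(assoc) != "r"]

# Keep only terms at or above the correlation limit, strongest first
cor.limit <- 0.2
assoc.r <- sort(assoc[assoc >= cor.limit], decreasing = TRUE)

print(round(assoc.r, 2))
```

In this toy data "examples" occurs in the same documents as "r" more often than chance, so it passes the limit, while "mining" is negatively correlated and is dropped.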

slide-33
SLIDE 33

Network of Terms

library(graph)
library(Rgraphviz)
plot(myTdm, term=freq.terms, corThreshold=0.1, weighting=TRUE)

[Figure: network of the frequent terms analysis, big, computing, data, examples, mining, network, package, position, postdoctoral, r, research, slides, social, tutorial, university and used, with edges between correlated terms.]

31 / 44

slide-34
SLIDE 34

Word Cloud

library(wordcloud)
m <- as.matrix(myTdm)
freq <- sort(rowSums(m), decreasing=TRUE)
wordcloud(words=names(freq), freq=freq, min.freq=4, random.order=FALSE)

[Figure: word cloud of the terms occurring at least 4 times. The largest words are "r", "data" and "mining", followed by "analysis", "examples", "research" and "package".]

32 / 44

slide-35
SLIDE 35

Topic Modelling

library(topicmodels)
set.seed(123)
myLda <- LDA(as.DocumentTermMatrix(myTdm), k=8)
terms(myLda, 5)
##      Topic 1    Topic 2  Topic 3    Topic 4
## [1,] "mining"   "data"   "r"        "position"
## [2,] "data"     "free"   "examples" "research"
## [3,] "analysis" "course" "code"     "university"
## [4,] "network"  "online" "book"     "data"
## [5,] "social"   "ausdm"  "mining"   "postdoctoral"
##      Topic 5        Topic 6     Topic 7     Topic 8
## [1,] "data"         "data"      "r"         "r"
## [2,] "r"            "scientist" "package"   "data"
## [3,] "mining"       "research"  "computing" "clustering"
## [4,] "applications" "r"         "slides"    "mining"
## [5,] "series"       "package"   "parallel"  "detection"

33 / 44

slide-37
SLIDE 37

Time Series Analysis with R

◮ Time series decomposition: decomp(), decompose(), arima(), stl()
◮ Time series forecasting: forecast
◮ Time Series Clustering: TSclust
◮ Dynamic Time Warping (DTW): dtw

35 / 44
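
Two of the decomposition functions listed above, decompose() and stl(), ship with base R. The sketch below is an illustration added here, not taken from the slides; it applies both to the built-in AirPassengers series, and the log transform and s.window = "periodic" are choices made for this example.

```r
# Classical decomposition into trend, seasonal and random components
ap.dec <- decompose(AirPassengers)

# Loess-based decomposition; the log makes the seasonal swing roughly additive
ap.stl <- stl(log(AirPassengers), s.window = "periodic")

# The 12 estimated monthly seasonal effects from the classical decomposition
print(round(ap.dec$figure, 1))

# Component names returned by stl()
print(colnames(ap.stl$time.series))

# plot(ap.dec); plot(ap.stl)   # uncomment to view the components
```

Both functions need a series with a known seasonal frequency (12 for monthly data here) and at least two full periods.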

slide-39
SLIDE 39

Social Network Analysis with R

◮ Packages: igraph, sna
◮ Centrality measures: degree(), betweenness(), closeness(), transitivity()
◮ Clusters: clusters(), no.clusters()
◮ Cliques: cliques(), largest.cliques(), maximal.cliques(), clique.number()
◮ Community detection: fastgreedy.community(), spinglass.community()
◮ Graph database Neo4j: package RNeo4j http://nicolewhite.github.io/RNeo4j/

37 / 44

slide-41
SLIDE 41

R and Big Data Platforms

◮ Hadoop
  ◮ Hadoop (or YARN) - a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
  ◮ R Packages: RHadoop, RHIPE
◮ Spark
  ◮ Spark - a fast and general engine for large-scale data processing, which can be 100 times faster than Hadoop
  ◮ SparkR - R frontend for Spark
◮ H2O
  ◮ H2O - an open source in-memory prediction engine for big data science
  ◮ R Package: h2o
◮ MongoDB
  ◮ MongoDB - an open-source document database
  ◮ R packages: rmongodb, RMongo

39 / 44

slide-42
SLIDE 42

R and Hadoop

◮ Packages: RHadoop, RHive
◮ RHadoop [11] is a collection of R packages:
  ◮ rmr2 - perform data analysis with R via MapReduce on a Hadoop cluster
  ◮ rhdfs - connect to Hadoop Distributed File System (HDFS)
  ◮ rhbase - connect to the NoSQL HBase database
  ◮ ...
◮ You can play with it on a single PC (in standalone or pseudo-distributed mode), and your code developed on that will be able to work on a cluster of PCs (in full-distributed mode)!
◮ Step-by-Step Guide to Setting Up an R-Hadoop System
  http://www.rdatamining.com/big-data/r-hadoop-setup-guide

[11] https://github.com/RevolutionAnalytics/RHadoop/wiki

40 / 44

slide-43
SLIDE 43

An Example of MapReducing with R [12]

library(rmr2)
map <- function(k, lines) {
  words.list <- strsplit(lines, "\\s")
  words <- unlist(words.list)
  return(keyval(words, 1))
}
reduce <- function(word, counts) {
  keyval(word, sum(counts))
}
wordcount <- function(input, output = NULL) {
  mapreduce(input = input, output = output, input.format = "text",
            map = map, reduce = reduce)
}
## Submit job
out <- wordcount(in.file.path, out.file.path)

[12] From Jeffrey Breen's presentation on Using R with Hadoop
http://www.revolutionanalytics.com/news-events/free-webinars/2013/using-r-with-hadoop/

41 / 44
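
The logic of the word count can be followed without a Hadoop cluster. The sketch below is a local, base-R imitation of the map, shuffle and reduce steps; it does not use rmr2, the two input lines are invented, and the names map.out and counts are illustrative.

```r
# Hypothetical input: two lines of text
lines <- c("data mining with r", "r and big data")

# Map step: split every line into words and emit a (word, 1) pair per word
words <- unlist(strsplit(lines, "\\s+"))
map.out <- data.frame(key = words, value = 1, stringsAsFactors = FALSE)

# Shuffle and reduce steps: group the pairs by word and sum the 1s per group
counts <- tapply(map.out$value, map.out$key, sum)

print(counts)
```

On a cluster the pairs emitted by the mappers are grouped by key across machines before the reducers run; tapply() plays both roles here on a single machine.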

slide-45
SLIDE 45

Online Resources

◮ RDataMining website: http://www.rdatamining.com
  ◮ R Reference Card for Data Mining
  ◮ RDataMining Slides Series
  ◮ R and Data Mining: Examples and Case Studies
◮ RDataMining Group on LinkedIn (12,000+ members): http://group.rdatamining.com
◮ RDataMining on Twitter (2,000+ followers): @RDataMining
◮ Free online courses: http://www.rdatamining.com/resources/courses
◮ Online documents: http://www.rdatamining.com/resources/onlinedocs

43 / 44

slide-46
SLIDE 46

The End

Thanks! Email: yanchang(at)rdatamining.com

44 / 44