SLIDE 1

Spark Machine Learning

Amir H. Payberah

amir@sics.se

SICS Swedish ICT June 30, 2016

Amir H. Payberah (SICS) MLLib June 30, 2016 1 / 1

SLIDE 2

Data

SLIDE 3

Data Actionable Knowledge

SLIDE 4

Data Actionable Knowledge

That is roughly the problem that Machine Learning addresses!

SLIDE 5

Data and Knowledge Data Knowledge

◮ Is this email spam or not?

SLIDE 7

Data and Knowledge Data Knowledge

◮ Is this email spam or not?
◮ Is there a face in this picture?

SLIDE 9

Data and Knowledge Data Knowledge

◮ Is this email spam or not?
◮ Is there a face in this picture?
◮ Should I lend money to this customer, given his spending behavior?

SLIDE 11

Data and Knowledge

◮ Knowledge is not concrete:
  • Spam is an abstraction.
  • Face is an abstraction.
  • Who to lend to is an abstraction.

You do not find spam, faces, and financial advice in datasets; you just find bits!

SLIDE 12

Knowledge Discovery from Data (KDD)

◮ Preprocessing
◮ Data mining
◮ Result validation

SLIDE 13

KDD - Preprocessing

◮ Data cleaning
◮ Data integration
◮ Data reduction, e.g., sampling
◮ Data transformation, e.g., normalization

SLIDE 14

KDD - Mining Functionalities

◮ Classification and regression (supervised learning)
◮ Clustering (unsupervised learning)
◮ Mining frequent patterns
◮ Outlier detection

SLIDE 15

KDD - Result Validation

◮ Evaluate the performance of the model against suitable criteria.
◮ Depends on the application and its requirements.

SLIDE 16

MLlib - Data Types

SLIDE 17

Data Types - Local Vector

◮ Stored on a single machine
◮ Dense and sparse representations

  • Dense (1.0, 0.0, 3.0): [1.0, 0.0, 3.0]
  • Sparse (1.0, 0.0, 3.0): (3, [0, 2], [1.0, 3.0])

val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))

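The dense and sparse forms above are two encodings of the same vector. A plain-Python sketch of how the sparse form (size, indices, values) expands back to the dense form (`sparse_to_dense` is a hypothetical helper, not MLlib):

```python
def sparse_to_dense(size, indices, values):
    """Expand a sparse vector (size, [indices], [values]) into a dense list.

    Entries not listed in `indices` are implicitly 0.0.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# The slide's example: sparse (3, [0, 2], [1.0, 3.0]) is dense [1.0, 0.0, 3.0]
print(sparse_to_dense(3, [0, 2], [1.0, 3.0]))  # [1.0, 0.0, 3.0]
```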
SLIDE 18

Data Types - Labeled Point

◮ A local vector (dense or sparse) associated with a label.
◮ label: label for this data point.
◮ features: list of features for this data point.

case class LabeledPoint(label: Double, features: Vector)

val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))

SLIDE 19

MLlib - Preprocessing

SLIDE 20

Data Transformation - Normalizing Features

◮ To get data into a standard Gaussian distribution: (x − mean) / sqrt(variance)

val features = labelData.map(_.features)
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaledData = labelData.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))

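What the scaler computes can be sketched in a few lines of plain Python, as an illustration of the (x − mean) / sqrt(variance) formula rather than MLlib's implementation (which fits the statistics on the training set):

```python
import math

def standardize(xs):
    """Scale values to zero mean and unit variance: (x - mean) / sqrt(variance)."""
    mean = sum(xs) / len(xs)
    variance = sum((x - mean) ** 2 for x in xs) / len(xs)  # population variance
    return [(x - mean) / math.sqrt(variance) for x in xs]

scaled = standardize([1.0, 2.0, 3.0])  # zero mean, unit variance after scaling
```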
SLIDE 21

MLlib - Data Mining

SLIDE 22

Data Mining Functionalities

◮ Classification and regression (supervised learning)
◮ Clustering (unsupervised learning)
◮ Mining frequent patterns
◮ Outlier detection

SLIDE 23

Classification and Regression (Supervised Learning)

SLIDE 24

Supervised Learning (1/3)

◮ Right answers are given.
  • Training data (input data) is labeled, e.g., spam/not-spam or a stock price at a time.
◮ A model is prepared through a training process.
◮ The training process continues until the model achieves a desired level of accuracy on the training data.

SLIDE 25

Supervised Learning (2/3)

◮ Face recognition

Training data Testing data

[ORL dataset, AT&T Laboratories, Cambridge UK]

SLIDE 26

Supervised Learning (3/3)

◮ Set of n training examples: (x_1, y_1), ..., (x_n, y_n).
◮ x_i = (x_i1, x_i2, ..., x_im) is the feature vector of the i-th example.
◮ y_i is the label of the i-th example.
◮ A learning algorithm seeks a function f such that y_i = f(x_i).

SLIDE 27

Classification vs. Regression

◮ Classification: the output variable takes class labels.
◮ Regression: the output variable takes continuous values.

SLIDE 28

Types of Classification/Regression Models in Spark

◮ Linear models
◮ Decision trees
◮ Naive Bayes models

SLIDE 29

Linear Models

SLIDE 30

Linear Models

◮ Training dataset: (x_1, y_1), ..., (x_n, y_n).
◮ x_i = (x_i1, x_i2, ..., x_im)
◮ Model the target as a function of a linear predictor applied to the input variables: y_i = g(w^T x_i).
  • E.g., y_i = w_1 x_i1 + w_2 x_i2 + ... + w_m x_im

SLIDE 31

Linear Models

◮ Training dataset: (x_1, y_1), ..., (x_n, y_n).
◮ x_i = (x_i1, x_i2, ..., x_im)
◮ Model the target as a function of a linear predictor applied to the input variables: y_i = g(w^T x_i).
  • E.g., y_i = w_1 x_i1 + w_2 x_i2 + ... + w_m x_im
◮ Loss function: f(w) := Σ_{i=1}^{n} L(g(w^T x_i), y_i)
◮ An optimization problem: min_{w ∈ R^m} f(w)

SLIDE 32

Linear Models - Regression (1/2)

◮ g(w^T x_i) = w_1 x_i1 + w_2 x_i2 + ... + w_m x_im
◮ Loss function: minimize the squared difference between the predicted and actual value: L(g(w^T x_i), y_i) := (1/2)(w^T x_i − y_i)^2

SLIDE 33

Linear Models - Regression (1/2)

◮ g(w^T x_i) = w_1 x_i1 + w_2 x_i2 + ... + w_m x_im
◮ Loss function: minimize the squared difference between the predicted and actual value: L(g(w^T x_i), y_i) := (1/2)(w^T x_i − y_i)^2
◮ Solved with gradient descent.

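The gradient-descent step for this squared loss can be sketched in plain Python for a single feature. This is an illustration only, not Spark's distributed SGD; the learning rate and data points are made-up values:

```python
def gradient_descent(points, lr=0.01, iters=1000):
    """Minimize sum of (1/2) * (w * x - y)^2 over 1-D points by gradient descent.

    The gradient of the loss with respect to w is sum((w * x - y) * x).
    """
    w = 0.0
    for _ in range(iters):
        grad = sum((w * x - y) * x for x, y in points)
        w -= lr * grad  # step against the gradient
    return w

# Data generated by y = 2x, so w should converge to 2.0
w = gradient_descent([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```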
SLIDE 34

Linear Models - Regression (2/2)

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize)

val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

SLIDE 35

Linear Models - Classification (Logistic Regression) (1/2)

◮ Binary classification: output values between 0 and 1.
◮ g(w^T x) := 1 / (1 + e^(−w^T x)) (the sigmoid function)
◮ If g(w^T x_i) > 0.5, then y_i = 1; else y_i = 0.

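The decision rule above can be sketched in plain Python (`predict` is a hypothetical helper for illustration, not the MLlib API):

```python
import math

def sigmoid(z):
    """g(z) = 1 / (1 + e^(-z)), squashing any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x, threshold=0.5):
    """Classify as 1 if sigmoid(w . x) exceeds the threshold, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if sigmoid(z) > threshold else 0

print(predict([1.0, -1.0], [2.0, 0.5]))  # z = 1.5, sigmoid(1.5) > 0.5, so 1
```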
SLIDE 36

Linear Models - Classification (Logistic Regression) (2/2)

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

SLIDE 37

Decision Tree

SLIDE 38

Decision Tree

◮ A greedy algorithm.
◮ It performs a recursive binary partitioning of the feature space.
◮ Decision tree construction algorithm:
  • Find the best split condition (quantified based on the impurity measure).
  • Stop when no improvement is possible.

SLIDE 39

Impurity Measure

◮ Measures how well the two classes are separated.
◮ The current implementation in Spark:

  • Regression: variance
  • Classification: Gini impurity and entropy

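Both classification impurity measures are simple functions of the class proportions in a node. A plain-Python sketch (illustrative, not Spark's implementation):

```python
import math

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_c^2 (0 for a pure node)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def entropy(labels):
    """Entropy: -sum over classes of p_c * log2(p_c) (0 for a pure node)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

print(gini([0, 0, 1, 1]), entropy([0, 0, 1, 1]))  # 0.5 1.0 (maximally mixed)
```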
SLIDE 40

Stopping Rules

◮ The node depth is equal to the maxDepth training parameter.
◮ No split candidate leads to an information gain greater than minInfoGain.
◮ No split candidate produces child nodes that each have at least minInstancesPerNode training instances.

SLIDE 41

Decision Tree - Regression

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

SLIDE 42

Decision Tree - Classification

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

SLIDE 43

Random Forest

◮ Train a set of decision trees separately.
◮ The training can be done in parallel.
◮ The algorithm injects randomness into the training process, so that each decision tree is a bit different.

SLIDE 44

Random Forest - Regression

val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"
val impurity = "variance"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

SLIDE 45

Random Forest - Classification

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}

SLIDE 46

Naive Bayes

SLIDE 47

Naive Bayes (1/4)

◮ Uses probability theory to classify things.
◮ Assumption: independence between every pair of features.

SLIDE 48

Naive Bayes (1/4)

◮ Uses probability theory to classify things.
◮ Assumption: independence between every pair of features.
◮ y1: circles, y2: triangles.
◮ Does (x1, x2) belong to y1 or y2?

SLIDE 49

Naive Bayes (2/4)

◮ Does (x1, x2) belong to y1 or y2?
◮ If p(y1|x1, x2) > p(y2|x1, x2), the class is y1.
◮ If p(y1|x1, x2) < p(y2|x1, x2), the class is y2.

SLIDE 50

Naive Bayes (2/4)

◮ Does (x1, x2) belong to y1 or y2?
◮ If p(y1|x1, x2) > p(y2|x1, x2), the class is y1.
◮ If p(y1|x1, x2) < p(y2|x1, x2), the class is y2.
◮ Replace p(y|x) with p(x|y)p(y)/p(x).

SLIDE 51

Naive Bayes (3/4)

◮ Bayes' theorem: p(y|x) = p(x|y)p(y)/p(x)
◮ p(y|x): probability of instance x being in class y.
◮ p(x|y): probability of generating instance x given class y.
◮ p(y): probability of occurrence of class y.
◮ p(x): probability of instance x occurring.

SLIDE 52

Naive Bayes (4/4)

Is officer Drew male or female?

SLIDE 53

Naive Bayes (4/4)

◮ p(y|x) = p(x|y)p(y)/p(x)
◮ p(male|drew) = ?
◮ p(female|drew) = ?

SLIDE 54

Naive Bayes (4/4)

◮ p(y|x) = p(x|y)p(y)/p(x)
◮ p(male|drew) = p(drew|male)p(male)/p(drew) = (1/3 × 3/8) / (3/8) ≈ 0.33
◮ p(female|drew) = p(drew|female)p(female)/p(drew) = (2/5 × 5/8) / (3/8) ≈ 0.66

SLIDE 55

Naive Bayes (4/4)

Officer Drew is female.

◮ p(y|x) = p(x|y)p(y)/p(x)
◮ p(male|drew) = p(drew|male)p(male)/p(drew) = (1/3 × 3/8) / (3/8) ≈ 0.33
◮ p(female|drew) = p(drew|female)p(female)/p(drew) = (2/5 × 5/8) / (3/8) ≈ 0.66

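Plugging the slide's counts into Bayes' rule is one line of arithmetic; a plain-Python check (the probabilities 1/3, 3/8, 2/5, 5/8, 3/8 come from the slide's worked example):

```python
def posterior(p_x_given_y, p_y, p_x):
    """Bayes' rule: p(y|x) = p(x|y) * p(y) / p(x)."""
    return p_x_given_y * p_y / p_x

p_male = posterior(1/3, 3/8, 3/8)    # p(drew|male) * p(male) / p(drew) = 1/3
p_female = posterior(2/5, 5/8, 3/8)  # p(drew|female) * p(female) / p(drew) = 2/3
print(round(p_male, 2), round(p_female, 2))  # 0.33 0.67
```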
SLIDE 56

Naive Bayes

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = testData.map(p => (model.predict(p.features), p.label))

SLIDE 57

Clustering (Unsupervised Learning)

SLIDE 58

Clustering (1/5)

◮ Clustering is a technique for finding similarity groups in data, called clusters.

SLIDE 59

Clustering (1/5)

◮ Clustering is a technique for finding similarity groups in data, called clusters.
◮ It groups data instances that are similar to each other in one cluster, and data instances that are very different from each other into different clusters.

SLIDE 60

Clustering (1/5)

◮ Clustering is a technique for finding similarity groups in data, called clusters.
◮ It groups data instances that are similar to each other in one cluster, and data instances that are very different from each other into different clusters.
◮ Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given.

SLIDE 63

Clustering (2/5)

◮ k-means clustering is a popular method for clustering.

SLIDE 64

Clustering (3/5)

◮ K: number of clusters (given).
  • One mean per cluster.
◮ Initialize means by picking K samples at random.
◮ Iterate:
  • Assign each point to the nearest mean.
  • Move each mean to the center of its cluster.

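The loop above can be sketched in plain Python for 1-D points (Lloyd's algorithm; illustrative only, not Spark's parallel implementation or its k-means|| initialization):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D points: assign each point to the nearest
    mean, then move each mean to the center of its cluster."""
    random.seed(seed)
    means = random.sample(points, k)  # initialize by picking k samples at random
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        means = [sum(c) / len(c) if c else means[i]  # keep mean if cluster empty
                 for i, c in enumerate(clusters)]
    return sorted(means)

print(kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2))  # means near 1.0 and 10.0
```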
SLIDE 65

Clustering (4/5)

SLIDE 69

Clustering (5/5)

val data: RDD[Vector] = ...

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(data, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(data)
println("Within Set Sum of Squared Errors = " + WSSSE)

SLIDE 70

MLlib - Result Validation

SLIDE 71

Classification Model Evaluation

◮ There exists a true output and a model-generated predicted output for each data point.
◮ The results for each data point can be assigned to one of four categories:
  1. True Positive (TP): label is positive and prediction is also positive
  2. True Negative (TN): label is negative and prediction is also negative
  3. False Positive (FP): label is negative but prediction is positive
  4. False Negative (FN): label is positive but prediction is negative

SLIDE 72

Binary Classification (1/2)

◮ Precision (positive predictive value): the fraction of retrieved instances that are relevant.
◮ Recall (sensitivity): the fraction of relevant instances that are retrieved.
◮ F-measure: (1 + β^2) · (precision · recall) / (β^2 · precision + recall)

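In terms of the four counts from the previous slide, these metrics are simple ratios. A plain-Python sketch (illustrative, not MLlib's BinaryClassificationMetrics):

```python
def precision(tp, fp):
    """Fraction of positive predictions that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of truly positive instances found: TP / (TP + FN)."""
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    """(1 + beta^2) * p * r / (beta^2 * p + r); beta = 1 gives the F1 score."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

p, r = precision(tp=8, fp=2), recall(tp=8, fn=8)  # 0.8 and 0.5
print(f_measure(p, r))  # F1 for precision 0.8 and recall 0.5
```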
SLIDE 73

Binary Classification (2/2)

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }

val recall = metrics.recallByThreshold
recall.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.foreach { case (t, f) => println(s"Threshold: $t, F-score: $f") }

SLIDE 74

Regression Model Evaluation (1/2)

◮ Mean Squared Error (MSE): (1/N) Σ_{i=0}^{N−1} (y_i − ŷ_i)^2
◮ Root Mean Squared Error (RMSE): sqrt((1/N) Σ_{i=0}^{N−1} (y_i − ŷ_i)^2)
◮ Mean Absolute Error (MAE): (1/N) Σ_{i=0}^{N−1} |y_i − ŷ_i|

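The three metrics can be written directly from the formulas above; a plain-Python sketch (RegressionMetrics computes the same quantities over an RDD):

```python
import math

def mse(ys, preds):
    """Mean of squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def rmse(ys, preds):
    """Square root of the MSE, in the same units as y."""
    return math.sqrt(mse(ys, preds))

def mae(ys, preds):
    """Mean of absolute prediction errors; less sensitive to outliers."""
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)

ys, preds = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
print(mse(ys, preds), mae(ys, preds))  # 4/3 and 2/3
```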
SLIDE 75

Regression Model Evaluation (2/2)

val numIterations = 100
val model = LinearRegressionWithSGD.train(trainingData, numIterations)

val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new RegressionMetrics(valuesAndPreds)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"MAE = ${metrics.meanAbsoluteError}")

SLIDE 76

Summary

SLIDE 77

Summary

◮ Preprocessing: cleaning, integration, reduction, transformation
◮ Data mining: classification, clustering, frequent patterns, anomaly detection
◮ Result validation

SLIDE 78

Questions?
