Spark Machine Learning
Amir H. Payberah
amir@sics.se
SICS Swedish ICT June 30, 2016
Data → Actionable Knowledge
That is roughly the problem that Machine Learning addresses!
◮ Is this email spam or not spam?
◮ Is there a face in this picture?
◮ Should I lend money to this customer given their spending behaviour?
◮ Knowledge is not concrete:
◮ Spam is an abstraction.
◮ A face is an abstraction.
◮ Who to lend to is an abstraction.

You do not find spam, faces, and financial advice in datasets; you just find bits!
◮ Preprocessing
◮ Data mining
◮ Result validation
◮ Data cleaning
◮ Data integration
◮ Data reduction, e.g., sampling (see the sketch below)
◮ Data transformation, e.g., normalization
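As a quick illustration of the data reduction step, Spark's RDD API supports sampling directly. A minimal sketch, assuming an existing SparkContext sc; the file name and fraction are illustrative placeholders, not from the slides:

// Data reduction by sampling an RDD.
val rawData = sc.textFile("data.txt")
// Draw a ~10% sample without replacement; the fixed seed makes it reproducible.
val sampled = rawData.sample(withReplacement = false, fraction = 0.1, seed = 42L)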
◮ Classification and regression (supervised learning)
◮ Clustering (unsupervised learning)
◮ Mining frequent patterns
◮ Outlier detection
◮ We need to evaluate the performance of the model against some criteria.
◮ The choice of criteria depends on the application and its requirements.
◮ Stored on a single machine.
◮ Dense and sparse representations.

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Dense vector (1.0, 0.0, 3.0).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
// Sparse vector (1.0, 0.0, 3.0): size, indices of the non-zeros, and their values.
val sv1: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))
// The same sparse vector, built from a sequence of (index, value) pairs.
val sv2: Vector = Vectors.sparse(3, Seq((0, 1.0), (2, 3.0)))
◮ A local vector (dense or sparse) associated with a label.
◮ label: the label for this data point.
◮ features: the list of features for this data point.

// Declared in MLlib as: case class LabeledPoint(label: Double, features: Vector)
import org.apache.spark.mllib.regression.LabeledPoint

// A positive example with a dense feature vector.
val pos = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
// A negative example with a sparse feature vector.
val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0)))
◮ To bring features to a standard Gaussian distribution: $\frac{x - \text{mean}}{\sqrt{\text{variance}}}$

import org.apache.spark.mllib.feature.StandardScaler

val labelData: RDD[LabeledPoint] = ...
val features = labelData.map(_.features)
// Fit the scaler on the feature vectors, then rescale every data point.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(features)
val scaledData = labelData.map(lp =>
  LabeledPoint(lp.label, scaler.transform(lp.features)))
◮ The right answers are given in the training data, e.g., spam/not-spam labels, or the stock price at a given time.
◮ A model is prepared through a training process.
◮ The training process continues until the model achieves the desired level of accuracy on the training data.
◮ Example: face recognition, with separate training data and testing data. [ORL dataset, AT&T Laboratories, Cambridge UK]
◮ Set of $n$ training examples: $(x_1, y_1), \dots, (x_n, y_n)$.
◮ $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$ is the feature vector of the $i$th example.
◮ $y_i$ is the label of the $i$th example.
◮ A learning algorithm seeks a function $f$ such that $y_i = f(x_i)$.
◮ Classification: the output variable takes class labels.
◮ Regression: the output variable takes continuous values.
◮ Linear models
◮ Decision trees
◮ Naive Bayes models
◮ Training dataset: $(x_1, y_1), \dots, (x_n, y_n)$, with $x_i = (x_{i1}, x_{i2}, \dots, x_{im})$.
◮ Model the target as a function of a linear predictor applied to the input variables: $y_i = g(w^T x_i)$.
◮ Loss function: $f(w) := \sum_{i=1}^{n} L(g(w^T x_i), y_i)$
◮ An optimization problem: $\min_{w \in \mathbb{R}^m} f(w)$
◮ Linear regression: $g(w^T x_i) = w_1 x_{i1} + w_2 x_{i2} + \dots + w_m x_{im}$
◮ Loss function: minimize the squared difference between the predicted and the actual value: $L(g(w^T x_i), y_i) := \frac{1}{2}(w^T x_i - y_i)^2$
◮ Solved with gradient descent.
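Gradient descent is only named on the slide; for this squared loss, the standard update rule (with step size $\alpha$, the stepSize parameter in the code below) is:

$$w^{(t+1)} = w^{(t)} - \alpha \sum_{i=1}^{n} \left( w^{(t)T} x_i - y_i \right) x_i$$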
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val data: RDD[LabeledPoint] = ...
// Split the data into a 70% training set and a 30% test set.
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize)

// Predict on the test set, pairing each true label with its prediction.
val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
◮ Binary classification: output values between 0 and 1.
◮ $g(w^T x) := \frac{1}{1 + e^{-w^T x}}$ (the sigmoid function)
◮ If $g(w^T x_i) > 0.5$, then $y_i = 1$; otherwise $y_i = 0$.
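A minimal plain-Scala sketch of this decision rule, with $w$ and $x$ as plain arrays (illustrative only, not the MLlib implementation):

// Sigmoid decision rule for binary classification.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Predict label 1.0 or 0.0 for a feature vector x under weights w.
def predict(w: Array[Double], x: Array[Double]): Double = {
  val z = w.zip(x).map { case (wi, xi) => wi * xi }.sum  // w^T x
  if (sigmoid(z) > 0.5) 1.0 else 0.0
}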
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}
◮ A greedy algorithm.
◮ It performs a recursive binary partitioning of the feature space.
◮ Decision tree construction algorithm: at each node, select the best split among all candidates, i.e., the one that maximizes the information gain (quantified by an impurity measure), and recurse on the resulting partitions.
◮ Impurity measures how well the two classes are separated.
◮ The current implementation in Spark offers Gini impurity and entropy for classification, and variance for regression; a sketch of the first two follows.
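As a sketch of how these impurity measures are computed from a node's class proportions (plain Scala, illustrative only, not the MLlib internals):

// Gini impurity and entropy for a node, given class proportions p.
def gini(p: Seq[Double]): Double = 1.0 - p.map(pi => pi * pi).sum
def entropy(p: Seq[Double]): Double =
  -p.filter(_ > 0).map(pi => pi * math.log(pi) / math.log(2)).sum

// E.g., a node whose instances are 40% class A and 60% class B:
gini(Seq(0.4, 0.6))     // 0.48
entropy(Seq(0.4, 0.6))  // ~0.97 bits

A pure node (all instances in one class) has impurity 0 under both measures.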
The recursive construction stops at a node when one of the following conditions holds:
◮ The node depth is equal to the maxDepth training parameter.
◮ No split candidate leads to an information gain greater than minInfoGain.
◮ No split candidate produces child nodes that each have at least minInstancesPerNode training instances.
import org.apache.spark.mllib.tree.DecisionTree

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Empty map: all features are treated as continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
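A common follow-up, not shown on the slide, is to turn these (label, prediction) pairs into a test error:

// Fraction of misclassified test points.
val testErr = labelAndPreds.filter { case (label, pred) => label != pred }
  .count().toDouble / testData.count()
println(s"Test Error = $testErr")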
◮ Train a set of decision trees separately.
◮ The training can be done in parallel.
◮ The algorithm injects randomness into the training process, so that each decision tree is a bit different.
import org.apache.spark.mllib.tree.RandomForest

val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"  // let the algorithm choose
val impurity = "variance"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins)

val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
◮ Uses probability theory to classify things.
◮ Assumption: independence between every pair of features.
◮ Example: class $y_1$ contains circles and class $y_2$ contains triangles; does a new point $(x_1, x_2)$ belong to $y_1$ or $y_2$?
◮ If $p(y_1|x_1, x_2) > p(y_2|x_1, x_2)$, the class is $y_1$.
◮ If $p(y_1|x_1, x_2) < p(y_2|x_1, x_2)$, the class is $y_2$.
◮ Bayes' theorem lets us replace $p(y|x)$ with $\frac{p(x|y)p(y)}{p(x)}$:
◮ $p(y|x)$: probability of instance $x$ being in class $y$.
◮ $p(x|y)$: probability of generating instance $x$ given class $y$.
◮ $p(y)$: probability of occurrence of class $y$.
◮ $p(x)$: probability of instance $x$ occurring.
Is Officer Drew male or female? The fractions below come from a roster of 8 officers: 3 male, 1 of them named Drew; 5 female, 2 of them named Drew (so $p(\text{drew}) = \frac{3}{8}$).

◮ $p(\text{male}|\text{drew}) = \frac{p(\text{drew}|\text{male})\,p(\text{male})}{p(\text{drew})} = \frac{\frac{1}{3} \times \frac{3}{8}}{\frac{3}{8}} = 0.33$
◮ $p(\text{female}|\text{drew}) = \frac{p(\text{drew}|\text{female})\,p(\text{female})}{p(\text{drew})} = \frac{\frac{2}{5} \times \frac{5}{8}}{\frac{3}{8}} = 0.66$

Officer Drew is more likely female.
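With more than one feature, the "naive" independence assumption from the earlier slide is what keeps this computation tractable: for a feature vector $x = (x_1, \dots, x_m)$,

$$p(y \mid x_1, \dots, x_m) \propto p(y) \prod_{j=1}^{m} p(x_j \mid y)$$

so each conditional $p(x_j \mid y)$ can be estimated separately from the training data.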
import org.apache.spark.mllib.classification.NaiveBayes

val data: RDD[LabeledPoint] = ...
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// lambda is the additive (Laplace) smoothing parameter.
val model = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = testData.map(p => (model.predict(p.features), p.label))
◮ Clustering is a technique for finding similarity groups in data, called clusters.
◮ It groups data instances that are similar to each other into one cluster, and data instances that are very different from each other into different clusters.
◮ Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given.
◮ k-means clustering is a popular method for clustering.
◮ K: number of clusters (given).
◮ Initialize the means by picking K samples at random.
◮ Iterate (see the sketch below):
◮ Assign each point to the nearest mean.
◮ Move each mean to the center of the points assigned to it.
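A minimal plain-Scala sketch of one such iteration, with points as Array[Double] (illustrative only, not the MLlib implementation):

type Point = Array[Double]

// Squared Euclidean distance between two points.
def dist2(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// One k-means iteration: assign each point to its nearest mean,
// then move each mean to the centroid of its assigned points.
def step(points: Seq[Point], means: Seq[Point]): Seq[Point] = {
  val assigned = points.groupBy(p => means.indices.minBy(i => dist2(p, means(i))))
  means.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) => ps.transpose.map(_.sum / ps.size).toArray
      case None     => means(i)  // a mean with no assigned points stays put
    }
  }
}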
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector

val data: RDD[Vector] = ...  // feature vectors, without labels
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(data, numClusters, numIterations)

// Evaluate clustering by computing the Within Set Sum of Squared Errors.
val WSSSE = clusters.computeCost(data)
println("Within Set Sum of Squared Errors = " + WSSSE)
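The cost reported by computeCost is the within-set sum of squared errors: the sum, over all points, of the squared distance to the nearest cluster center,

$$\text{WSSSE} = \sum_{i=1}^{N} \min_{k} \lVert x_i - c_k \rVert^2$$

Lower values mean tighter clusters; it is commonly used to compare runs with different values of K.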
◮ There exists a true output and a model-generated predicted output for each data point.
◮ The result for each data point can be assigned to one of four categories:
1. True Positive (TP): the label is positive and the prediction is also positive.
2. True Negative (TN): the label is negative and the prediction is also negative.
3. False Positive (FP): the label is negative but the prediction is positive.
4. False Negative (FN): the label is positive but the prediction is negative.
◮ Precision (positive predictive value): the fraction of retrieved instances that are relevant: $\text{precision} = \frac{TP}{TP + FP}$
◮ Recall (sensitivity): the fraction of relevant instances that are retrieved: $\text{recall} = \frac{TP}{TP + FN}$
◮ F-measure: $F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)
model.clearThreshold()  // emit raw scores instead of 0/1 labels, so thresholded metrics are meaningful

val predictionAndLabels = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new BinaryClassificationMetrics(predictionAndLabels)

val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) => println(s"Threshold: $t, Precision: $p") }

val recall = metrics.recallByThreshold
recall.foreach { case (t, r) => println(s"Threshold: $t, Recall: $r") }

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.foreach { case (t, f) => println(s"Threshold: $t, F-score: $f") }
◮ Mean Squared Error: $\text{MSE} = \frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2$
◮ Root Mean Squared Error: $\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=0}^{N-1} (y_i - \hat{y}_i)^2}$
◮ Mean Absolute Error: $\text{MAE} = \frac{1}{N} \sum_{i=0}^{N-1} |y_i - \hat{y}_i|$
import org.apache.spark.mllib.evaluation.RegressionMetrics

val numIterations = 100
val model = LinearRegressionWithSGD.train(trainingData, numIterations)

val valuesAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (prediction, point.label)
}

val metrics = new RegressionMetrics(valuesAndPreds)
println(s"MSE = ${metrics.meanSquaredError}")
println(s"RMSE = ${metrics.rootMeanSquaredError}")
println(s"MAE = ${metrics.meanAbsoluteError}")
◮ Preprocessing: cleaning, integration, reduction, transformation
◮ Data mining: classification, clustering, frequent patterns, anomaly detection
◮ Result validation