Integrating Spark MLlib into Weka, Mark Hall, Pentaho Data Mining (PowerPoint PPT presentation)



SLIDE 1

Integrating Spark MLlib into Weka

Mark Hall, Pentaho Data Mining Architect, Hitachi Vantara

SLIDE 2

Agenda

The Spark distributed processing framework was designed with iterative machine learning in mind. This session discusses:

  • Integration of MLlib classification algorithms into Weka
  • Consistent evaluation of algorithms on the desktop and in the cluster
  • Benefits for data science practitioners
SLIDE 3

What’s Weka?

  • Weka is a library containing a large collection of machine learning algorithms, implemented in Java
  • Main types of learning problems that it can tackle:

– Classification: given a labeled set of observations, learn to predict labels for new observations
– Regression: predict a numeric value instead of a label
– Attribute selection: find the attributes of observations that are important for prediction
– Clustering: no labels, just identify groups of similar observations (clusters)
  • 190 plugin “packages”
SLIDE 4

Introduction

  • Distributed Weka for Spark package:

– Averaging for several classifiers
– “Dagging” for all the rest

  • Ensembles via Dagging can work well, but…

– Partition size is another tuning parameter
– Small partitions might lead to poor modelling power

[Diagram: Instances data is parallelized to a cached RDD&lt;Instance&gt;. A map-partitions step streams or batches the instances in each partition to the learning algorithm, yielding one model per partition (RDD&lt;Model&gt;); the models are combined into a voted ensemble.]
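The final voting step of the Dagging pipeline can be sketched in plain Java. This is a minimal, self-contained illustration; the class and method names are hypothetical, not part of the Weka or Spark APIs:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the voting step in Dagging: one model is built
// per RDD partition, and the ensemble predicts the class that
// receives the most votes across those per-partition models.
public class VotedEnsemble {

    // Returns the majority class among the per-model votes.
    public static int majorityVote(int[] votes) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : votes) {
            counts.merge(v, 1, Integer::sum);
        }
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Three partition models vote on a single instance's class.
        System.out.println(majorityVote(new int[]{1, 0, 1}));  // prints 1
    }
}
```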

SLIDE 5

MLlib

  • Small set of algorithms
  • Learning is fully distributed → a single final model
  • Could have an accuracy advantage over Dagging for complex problems
  • Definitely has an advantage when model comprehensibility is important
  • New distributedWekaSparkDev package

– No-coding MLlib integration

SLIDE 6

MLlib Integration in Weka: Desktop Mode

  • Weka wrapper classifiers for MLlib supervised learning schemes
  • Work like any other Weka classifier
  • Operate on datasets that fit into main memory on the desktop
  • Can be used within Weka’s evaluation framework, used as base classifiers in meta learners, combined with preprocessing filters, used in standard Knowledge Flow processes, and used in repeated cross-validation experiments in the Experimenter
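Because the wrappers behave like any other Weka classifier, standard Weka evaluation code applies unchanged. The sketch below uses Weka’s real `Evaluation` API, but it is not self-contained: it assumes weka.jar and the distributedWekaSparkDev package are on the classpath, and the wrapper class name `weka.classifiers.mllib.MLlibLogistic` is an assumption for illustration:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch only: requires weka.jar plus the distributedWekaSparkDev
// package. The wrapper class name below is assumed, not verified.
public class DesktopMLlibDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // An MLlib wrapper is instantiated like any other classifier
        // and dropped into Weka's standard evaluation framework.
        Classifier mllibWrapper = (Classifier) Class
                .forName("weka.classifiers.mllib.MLlibLogistic")
                .getDeclaredConstructor().newInstance();

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mllibWrapper, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```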

SLIDE 7

Under the Hood

  • Weka MLlib classifiers accept standard Instances objects
  • Weka filters are applied automatically (where necessary)

– MLlibNaiveBayes wrapper discretizes numeric fields for the Bernoulli model

  • Instances are parallelized to RDD[Instance]
  • RDD[Instance] converted to RDD[LabeledPoint]
  • Local Spark cluster started on the fly
  • Scoring only requires the LabeledPoint data structure

– no Spark cluster/infrastructure required
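The RDD[Instance] → RDD[LabeledPoint] step amounts to pulling the class value out of each Weka Instance and packing the remaining attribute values into an MLlib feature vector. A rough sketch (it assumes weka.jar and spark-mllib on the classpath, and ignores missing values and sparse instances, which real conversion code would have to handle):

```java
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import weka.core.Instance;

// Sketch of converting one Weka Instance into an MLlib LabeledPoint:
// the class attribute becomes the label, everything else becomes a
// dense feature vector.
public class InstanceToLabeledPoint {
    public static LabeledPoint convert(Instance inst) {
        double[] features = new double[inst.numAttributes() - 1];
        int j = 0;
        for (int i = 0; i < inst.numAttributes(); i++) {
            if (i != inst.classIndex()) {
                features[j++] = inst.value(i);
            }
        }
        return new LabeledPoint(inst.classValue(), Vectors.dense(features));
    }
}
```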

SLIDE 8

MLlib Integration in Weka: Distributed Weka

  • Run in a cluster
  • Data sourced from formats supported by Spark DataFrames
  • Data frames converted to RDD[Instance], then to RDD[LabeledPoint]
  • Weka filters can be applied within each RDD partition
  • Implements hold-out and cross-validation for evaluation
SLIDE 9

Cross-Validation

  • X-val folds that are consistent for both MLlib and Weka classifiers
  • Max parallelism for Weka Dagging/model averaging:

– Build all training-fold classifiers in one pass over the data
– Evaluate all classifiers in a second pass
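One simple way to get folds that are consistent across MLlib and Weka classifiers is to derive each instance’s test fold from a stable index, so every scheme sees exactly the same train/test splits. A minimal sketch (the package’s actual assignment strategy may differ):

```java
// Minimal sketch of consistent cross-validation fold assignment:
// the test fold is a pure function of a stable instance index, so
// every classifier is evaluated on identical splits.
public class FoldAssignment {

    // Instance i is tested in fold (i mod numFolds) and used for
    // training in every other fold.
    public static int testFold(long instanceIndex, int numFolds) {
        return (int) (instanceIndex % numFolds);
    }

    public static void main(String[] args) {
        int numFolds = 3;
        for (long i = 0; i < 6; i++) {
            System.out.println("instance " + i + " -> test fold "
                    + testFold(i, numFolds));
        }
    }
}
```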

SLIDE 10

Cross-Validation for MLlib

  • MLlib classifiers – a new RDD for each training fold

– Assemble the partial folds for fold k
– Each fold is processed sequentially in turn

[Diagram: each partition of the original dataset’s RDD contains a piece of every fold. For fold k, those pieces are assembled into an RDD for test fold k and an RDD for training fold k, and an MLlib model (M1) is built from the training RDD.]

SLIDE 11

Demonstration

  • MLlib classifiers running in the Weka Explorer and Knowledge Flow
  • Comparing MLlib schemes against Weka, R and Python equivalents in the Weka Experimenter
  • Deploying an MLlib model in PDI’s WekaScoring step

SLIDE 12

Summary

What we covered today:

  • Integration of MLlib algorithms continues Weka’s interoperability theme
  • Provides convenient no-coding access to MLlib algorithms for desktop and cluster-based execution
  • Simplifies the data scientist’s job when considering multiple tools

– Weka vs R vs scikit-learn vs MLlib within one unified experimental framework

SLIDE 13

Next Steps

Want to learn more?

  • http://wiki.pentaho.com/display/DATAMINING/Pentaho+Data+Mining+Community+Documentation
  • http://markahall.blogspot.co.nz/2017/07/integrating-spark-mllib-into-weka.html
