Integrating Spark MLlib into Weka, Mark Hall, Pentaho Data Mining (PowerPoint PPT presentation)



SLIDE 1

Integrating Spark MLlib into Weka

Mark Hall, Pentaho Data Mining Architect, Hitachi Vantara

SLIDE 2

Agenda

The Spark distributed processing framework was designed with iterative machine learning in mind. This session discusses:

  • Integration of MLlib classification algorithms into Weka
  • Consistent evaluation of algorithms on the desktop and in the cluster
  • Benefits for data science practitioners
SLIDE 3

What’s Weka?

  • Weka is a library containing a large collection of machine learning algorithms, implemented in Java
  • Main types of learning problems that it can tackle:

– Classification: given a labeled set of observations, learn to predict labels for new observations
– Regression: predict a numeric value instead of a label
– Attribute selection: find the attributes of observations that are important for prediction
– Clustering: no labels, just identify groups of similar observations (clusters)
  • 190 plugin “packages”
SLIDE 4

Introduction

  • Distributed Weka for Spark package:

– Averaging for several classifiers
– “Dagging” for all the rest

  • Ensembles via Dagging can work well, but…

– Partition size is another tuning parameter
– Small partitions might lead to poor modelling power

[Diagram: Instances data is parallelized to a cached RDD&lt;Instance&gt;. A map-partitions step streams or batches the instances in each partition to the learning algorithm, yielding one model per partition (RDD&lt;Model&gt;); the models are combined into a voted ensemble.]
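The final voting step of the Dagging pipeline can be sketched in plain Java. This is a minimal, self-contained illustration; the class and method names are hypothetical, not part of the Weka or Spark APIs:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the voting step in Dagging: one model is built
// per RDD partition, and the ensemble predicts the class that
// receives the most votes across those per-partition models.
public class VotedEnsemble {

    // Returns the majority class among the per-model votes.
    public static int majorityVote(int[] votes) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : votes) {
            counts.merge(v, 1, Integer::sum);
        }
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Three partition models vote on a single instance's class.
        System.out.println(majorityVote(new int[]{1, 0, 1}));  // prints 1
    }
}
```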

SLIDE 5

MLlib

  • Small set of algorithms
  • Learning is fully distributed → a single final model
  • Could have an accuracy advantage over Dagging for complex problems
  • Definitely has an advantage when model comprehensibility is important
  • New distributedWekaSparkDev package

– No-coding MLlib integration

SLIDE 6

MLlib Integration in Weka: Desktop Mode

  • Weka wrapper classifiers for MLlib supervised learning schemes
  • Work like any other Weka classifier
  • Operate on datasets that fit into main memory on the desktop
  • Can be used within Weka’s evaluation framework, used as base classifiers in meta learners, combined with preprocessing filters, used in standard Knowledge Flow processes, and used in repeated cross-validation experiments in the Experimenter
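Because the wrappers behave like any other Weka classifier, standard Weka evaluation code applies unchanged. The sketch below uses Weka’s real `Evaluation` API, but it is not self-contained: it assumes weka.jar and the distributedWekaSparkDev package are on the classpath, and the wrapper class name `weka.classifiers.mllib.MLlibLogistic` is an assumption for illustration:

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch only: requires weka.jar plus the distributedWekaSparkDev
// package. The wrapper class name below is assumed, not verified.
public class DesktopMLlibDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // An MLlib wrapper is instantiated like any other classifier
        // and dropped into Weka's standard evaluation framework.
        Classifier mllibWrapper = (Classifier) Class
                .forName("weka.classifiers.mllib.MLlibLogistic")
                .getDeclaredConstructor().newInstance();

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mllibWrapper, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```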

SLIDE 7

Under the Hood

  • Weka MLlib classifiers accept standard Instances objects
  • Weka filters are applied automatically (where necessary)

– MLlibNaiveBayes wrapper discretizes numeric fields for the Bernoulli model

  • Instances are parallelized to RDD[Instance]
  • RDD[Instance] converted to RDD[LabeledPoint]
  • Local Spark cluster started on the fly
  • Scoring only requires the LabeledPoint data structure

– no Spark cluster/infrastructure required
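The RDD[Instance] → RDD[LabeledPoint] step amounts to pulling the class value out of each Weka Instance and packing the remaining attribute values into an MLlib feature vector. A rough sketch (it assumes weka.jar and spark-mllib on the classpath, and ignores missing values and sparse instances, which real conversion code would have to handle):

```java
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;
import weka.core.Instance;

// Sketch of converting one Weka Instance into an MLlib LabeledPoint:
// the class attribute becomes the label, everything else becomes a
// dense feature vector.
public class InstanceToLabeledPoint {
    public static LabeledPoint convert(Instance inst) {
        double[] features = new double[inst.numAttributes() - 1];
        int j = 0;
        for (int i = 0; i < inst.numAttributes(); i++) {
            if (i != inst.classIndex()) {
                features[j++] = inst.value(i);
            }
        }
        return new LabeledPoint(inst.classValue(), Vectors.dense(features));
    }
}
```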

SLIDE 8

MLlib Integration in Weka: Distributed Weka

  • Run in a cluster
  • Data sourced from formats supported by Spark DataFrames
  • Data frames converted to RDD[Instance], then to RDD[LabeledPoint]
  • Weka filters can be applied within each RDD partition
  • Implements hold-out and cross-validation for evaluation
SLIDE 9

Cross-Validation

  • X-val folds that are consistent for both MLlib and Weka classifiers
  • Max parallelism for Weka Dagging/model averaging:

– Build all training-fold classifiers in one pass over the data
– Evaluate all classifiers in a second pass
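One simple way to get folds that are consistent across MLlib and Weka classifiers is to derive each instance’s test fold from a stable index, so every scheme sees exactly the same train/test splits. A minimal sketch (the package’s actual assignment strategy may differ):

```java
// Minimal sketch of consistent cross-validation fold assignment:
// the test fold is a pure function of a stable instance index, so
// every classifier is evaluated on identical splits.
public class FoldAssignment {

    // Instance i is tested in fold (i mod numFolds) and used for
    // training in every other fold.
    public static int testFold(long instanceIndex, int numFolds) {
        return (int) (instanceIndex % numFolds);
    }

    public static void main(String[] args) {
        int numFolds = 3;
        for (long i = 0; i < 6; i++) {
            System.out.println("instance " + i + " -> test fold "
                    + testFold(i, numFolds));
        }
    }
}
```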

SLIDE 10

Cross-Validation for MLlib

  • MLlib classifiers – a new RDD for each training fold

– Assemble the partial folds for fold k
– Each fold is processed sequentially in turn

[Diagram: each partition of the original dataset’s RDD contains a piece of every fold. For fold k, those pieces are assembled into an RDD for test fold k and an RDD for training fold k, and an MLlib model (M1) is built from the training RDD.]

SLIDE 11

Demonstration

  • MLlib classifiers running in the Weka Explorer and Knowledge Flow
  • Comparing MLlib schemes against Weka, R and Python equivalents in the Weka Experimenter
  • Deploying an MLlib model in PDI’s WekaScoring step

SLIDE 12

Summary

What we covered today:

  • Integration of MLlib algorithms continues Weka’s interoperability theme
  • Provides convenient no-coding access to MLlib algorithms for desktop and cluster-based execution
  • Simplifies the data scientist’s job when considering multiple tools

– Weka vs R vs scikit-learn vs MLlib within one unified experimental framework

SLIDE 13

Next Steps

Want to learn more?

  • http://wiki.pentaho.com/display/DATAMINING/Pentaho+Data+Mining+Community+Documentation
  • http://markahall.blogspot.co.nz/2017/07/integrating-spark-mllib-into-weka.html
