The Three Dimensions of Scalable Machine Learning - Reza Zadeh (PowerPoint PPT Presentation)



SLIDE 1

Reza Zadeh

The Three Dimensions of Scalable Machine Learning

@Reza_Zadeh | http://reza-zadeh.com

SLIDE 2

Outline

Data Flow Engines and Spark The Three Dimensions of Machine Learning Matrix Computations MLlib + {Streaming, GraphX, SQL} Future of MLlib

SLIDE 3

Data Flow Models

Restrict the programming interface so that the system can do more automatically. Express jobs as graphs of high-level operators.

» System picks how to split each operator into tasks and where to run each task
» Runs parts twice for fault recovery

Biggest example: MapReduce

(Diagram: three Map tasks feeding two Reduce tasks)

SLIDE 4

Spark Computing Engine

Extends a programming language with a distributed collection data-structure

» “Resilient distributed datasets” (RDD)

Open source at Apache

» Most active community in big data, with 50+ companies contributing

Clean APIs in Java, Scala, and Python
Community: SparkR, being released in 1.4!

SLIDE 5

Key Idea

Resilient Distributed Datasets (RDDs)

» Collections of objects across a cluster with user-controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, ...)
» The world only lets you make RDDs such that they can be:

Automatically rebuilt on failure

SLIDE 6

Resilient Distributed Datasets (RDDs)

Main idea: Resilient Distributed Datasets

» Immutable collections of objects, spread across cluster » Statically typed: RDD[T] has objects of type T

val sc = new SparkContext()
val lines = sc.textFile("log.txt") // RDD[String]

// Transform using standard collection operations
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))

messages.saveAsTextFile("errors.txt")

The transformations above are lazily evaluated; saveAsTextFile kicks off a computation.

SLIDE 7

MLlib: Available algorithms

classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
regression: generalized linear models (GLMs), regression tree
collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
clustering: k-means||
decomposition: SVD, PCA
optimization: stochastic gradient descent, L-BFGS
SLIDE 8

The Three Dimensions

SLIDE 9

ML Objectives

Almost all machine learning objectives are optimized using this update.
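The update formula itself is an image not reproduced in this transcript; as a hedged, minimal sketch, it is presumably the standard gradient step w ← w − step · gradient(w). The toy objective below is invented for illustration:

```python
# Minimal sketch of the gradient update almost all ML objectives share:
# w <- w - step * gradient(w). The objective used here is a hypothetical example.
def gradient_descent(grad, w, step=0.1, iterations=100):
    for _ in range(iterations):
        w = w - step * grad(w)
    return w

# Toy objective f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w_opt = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
```

The same one-line update, applied to per-example losses, is what the three scaling dimensions below distribute in different ways.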
SLIDE 10

Scaling

1) Data size 2) Number of models 3) Model size

SLIDE 11

Logistic Regression

Goal: find best line separating two sets of points

(Figure: two sets of + and – points, a random initial line, and the target separating line)

SLIDE 12

Data Scaling

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w
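For readers without a cluster, here is a hedged single-machine sketch of the same loop, with the distributed map/reduce pair replaced by plain Python loops; the toy data, labels, and step size are invented for illustration:

```python
import math
import random

random.seed(0)
D = 2
# Hypothetical toy data: label is +1 above the line x0 + x1 = 0, else -1
points = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    points.append((x, 1.0 if x[0] + x[1] > 0 else -1.0))

w = [0.0] * D
for _ in range(100):
    gradient = [0.0] * D
    for x, y in points:  # this loop plays the role of the distributed map
        margin = sum(wi * xi for wi, xi in zip(w, x))
        scale = (1.0 / (1.0 + math.exp(-y * margin)) - 1.0) * y
        for j in range(D):  # summing the per-point terms plays the role of the reduce
            gradient[j] += scale * x[j]
    w = [wi - 0.1 * gj for wi, gj in zip(w, gradient)]

# fraction of training points whose predicted sign matches the label
accuracy = sum(
    1 for x, y in points
    if (sum(wi * xi for wi, xi in zip(w, x)) > 0) == (y > 0)
) / len(points)
```

Only the inner map/reduce changes when the data is distributed; the outer iteration loop stays on the driver, exactly as in the Spark version above.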

SLIDE 13

Separable Updates

Can be generalized for:
» Unconstrained optimization
» Smooth or non-smooth objectives
» L-BFGS, Conjugate Gradient, Accelerated Gradient methods, …
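The reason these updates distribute at all is that the gradient of a sum of per-example losses is the sum of the per-example gradients; a minimal sketch, with hypothetical partitions and a hypothetical objective:

```python
from functools import reduce

# Sketch: an objective of the form sum_i f_i(w) has a separable gradient, so
# each partition computes its partial sum (the map) and a reduce adds them up.
def distributed_gradient(partitions, grad_one, w):
    partials = [sum(grad_one(w, x) for x in part) for part in partitions]
    return reduce(lambda a, b: a + b, partials)

# Hypothetical objective: f_i(w) = (w - x_i)^2, so grad_i(w) = 2 * (w - x_i)
parts = [[1.0, 2.0], [3.0, 4.0]]
g = distributed_gradient(parts, lambda w, x: 2 * (w - x), w=0.0)
```

Any solver that only consumes such summed gradients (L-BFGS, Conjugate Gradient, ...) inherits the same data-parallel structure.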

SLIDE 14

Logistic Regression Results

(Chart: running time in seconds, 500 to 4000, vs. number of iterations (1, 5, 10, 20, 30), comparing Hadoop and Spark)

Hadoop: 110 s / iteration. Spark: 80 s for the first iteration, 1 s for further iterations.

100 GB of data on 50 m1.xlarge EC2 machines


SLIDE 15

Behavior with Less RAM

(Chart: iteration time vs. % of working set in memory: 68.8 s at 0%, 58.1 s at 25%, 40.7 s at 50%, 29.7 s at 75%, 11.5 s at 100%)

SLIDE 16

Lots of little models

Training lots of little models is embarrassingly parallel. Most of the work should be handled by the data flow paradigm; the ML pipelines API does this.
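A sketch of the idea, with a hypothetical fit_and_score standing in for actual model training; each grid point is evaluated independently, which is exactly what makes the workload embarrassingly parallel:

```python
from itertools import product

# Hedged sketch: hyper-parameter search as an embarrassingly parallel map
# over parameter combinations. fit_and_score is a hypothetical stand-in for
# training and evaluating one small model.
def fit_and_score(step, reg):
    return -((step - 0.1) ** 2 + (reg - 0.01) ** 2)

grid = list(product([0.01, 0.1, 1.0], [0.001, 0.01, 0.1]))
# Every evaluation is independent, so a data flow engine can map them in parallel
best = max(grid, key=lambda params: fit_and_score(*params))
```

In a cluster setting the `max` over a mapped grid is just another map followed by a reduce.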

SLIDE 17

Hyper-parameter Tuning

SLIDE 18

Model Scaling

Linear models only need to compute the dot product of each example with the model.
Use a BlockMatrix to store the data, and use joins to compute the dot products.
Coming in 1.5
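A minimal local sketch of the join-based dot product; the indices and values are invented, and in MLlib the structures would be distributed BlockMatrix partitions rather than Python dicts:

```python
# Sketch of the join idea behind model scaling: both the example's features and
# the (sharded) model are (index, value) pairs; joining on the index yields
# partial products, and a sum reduces them to the dot product.
example = {0: 2.0, 3: 1.0, 7: 0.5}            # sparse feature vector
weights = {0: 0.1, 3: -1.0, 7: 4.0, 9: 2.0}   # model weights, keyed by index

# the "join" on shared indices, followed by the reducing sum
dot = sum(v * weights[i] for i, v in example.items() if i in weights)
```

Because the join keys partition naturally, no single machine ever needs the full weight vector.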

SLIDE 19

Model Scaling

Data joined with model (weight):

SLIDE 20

Optimization

At least two large classes of optimization problems humans can solve: » Convex » Spectral

SLIDE 21

Optimization Example: Spectral Program

SLIDE 22

Spark PageRank

Given a directed graph, compute node importance. Two RDDs:

» Neighbors (a sparse graph/matrix)
» Current guess (a vector)

Using cache(), keep the neighbor list in RAM
SLIDE 23

Spark PageRank

Using cache(), keep neighbor lists in RAM
Using partitioning, avoid repeated hashing

(Diagram: Neighbors (id, edges) partitioned with partitionBy, joined against Ranks (id, rank) on each iteration)
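The loop above can be sketched locally in plain Python; the three-node graph and the 0.85 damping factor are illustrative choices, not from the slides:

```python
# Local sketch of the two-RDD PageRank loop: the neighbor lists stay fixed
# (this is what cache() keeps in RAM), while ranks are recomputed each
# iteration by joining ranks against the edge list.
neighbors = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {n: 1.0 for n in neighbors}

for _ in range(30):
    contribs = {n: 0.0 for n in neighbors}
    for node, out in neighbors.items():          # the join of ranks with edges
        for dest in out:
            contribs[dest] += ranks[node] / len(out)
    # damped update: each node keeps a base score plus its received contributions
    ranks = {n: 0.15 + 0.85 * c for n, c in contribs.items()}
```

Partitioning Neighbors and Ranks the same way makes the per-iteration join local, which is the point of the partitionBy call in the diagram.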

SLIDE 24

PageRank Results

(Chart: time per iteration: Hadoop 171 s, basic Spark 72 s, Spark + controlled partitioning 23 s)

SLIDE 25

Spark PageRank

Generalizes to matrix multiplication, opening many algorithms from Numerical Linear Algebra

SLIDE 26

Distributing Matrix Computations

SLIDE 27

Distributing Matrices

How to distribute a matrix across machines?
» By Entries (CoordinateMatrix)
» By Rows (RowMatrix)
» By Blocks (BlockMatrix)
All of Linear Algebra to be rebuilt using these partitioning schemes

As of version 1.3
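A plain-Python sketch of the three partitioning schemes on a small local matrix; MLlib's classes are distributed, and the 4x4 matrix with 2x2 blocks is an invented example:

```python
import random

# Illustrating the three layouts MLlib uses to distribute a matrix.
random.seed(0)
M = [[random.random() for _ in range(4)] for _ in range(4)]

# By entries (CoordinateMatrix): a collection of (i, j, value) triples
entries = [(i, j, M[i][j]) for i in range(4) for j in range(4)]

# By rows (RowMatrix): one record per row
rows = [list(M[i]) for i in range(4)]

# By blocks (BlockMatrix): 2x2 tiles keyed by their block coordinates
blocks = {
    (bi, bj): [[M[2 * bi + r][2 * bj + c] for c in range(2)] for r in range(2)]
    for bi in range(2)
    for bj in range(2)
}
```

Which layout is best depends on the operation: entries suit very sparse data, rows suit tall-and-skinny computations, and blocks suit matrix-matrix multiplies.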

SLIDE 28

Distributing Matrices

Even the simplest operations require thinking about communication, e.g. multiplication. How many different matrix multiplies are needed?
» At least one per pair of {Coordinate, Row, Block, LocalDense, LocalSparse} = 10
» More, because multiplies are not commutative

SLIDE 29

Singular Value Decomposition on Spark

SLIDE 30

Singular Value Decomposition

SLIDE 31

Singular Value Decomposition

Three cases:
» Tall and Skinny
» Short and Fat (not really)
» Roughly Square
The SVD method on RowMatrix takes care of which one to call.

SLIDE 32

Tall and Skinny SVD

SLIDE 33

Tall and Skinny SVD

Gets us V and the singular values; gets us U by one matrix multiplication
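A hedged NumPy sketch of this tall-and-skinny recipe (the matrix sizes are invented): the small Gram matrix A^T A, computable by one reduce over the rows, yields V and the singular values, and a single further multiply recovers U:

```python
import numpy as np

# Tall-and-skinny SVD via the Gram matrix.
np.random.seed(0)
A = np.random.rand(1000, 5)              # n >> k: tall and skinny

gram = A.T @ A                           # tiny (5 x 5), fits on the driver
eigvals, V = np.linalg.eigh(gram)        # A^T A = V * diag(sigma^2) * V^T
order = np.argsort(eigvals)[::-1]        # eigh returns ascending order
sigma = np.sqrt(eigvals[order])          # singular values of A
V = V[:, order]
U = A @ V / sigma                        # the one distributed multiply
```

This works because A^T A shares V with A's SVD and has eigenvalues sigma^2; it is only numerically safe when A is well conditioned, since squaring the matrix squares the condition number.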

SLIDE 34

Square SVD

ARPACK: very mature Fortran77 package for computing eigenvalue decompositions
JNI interface available via netlib-java
Distributed using Spark - how?

SLIDE 35

Square SVD via ARPACK

It only interfaces with the distributed matrix via matrix-vector multiplies. The result of a matrix-vector multiply is small, and the multiplication itself can be distributed.
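A sketch of that matvec-only access pattern, with simple power iteration standing in for ARPACK's Lanczos solver; the matrix sizes and iteration count are invented:

```python
import numpy as np

# The eigensolver only ever calls a black-box matrix-vector product. Here the
# product is x -> A^T (A x): both multiplies can be distributed across the
# cluster, and the k-vector result is small enough to ship back to the driver.
np.random.seed(1)
A = np.random.rand(500, 4)

def matvec(x):
    return A.T @ (A @ x)                 # the only access the solver needs

x = np.random.rand(4)
for _ in range(300):                     # power iteration on A^T A
    x = matvec(x)
    x /= np.linalg.norm(x)

# Rayleigh quotient of the unit vector x gives the top eigenvalue of A^T A,
# whose square root is the leading singular value of A.
top_sigma = np.sqrt(x @ matvec(x))
```

ARPACK's Lanczos iteration converges far faster than plain power iteration and returns several extremal eigenpairs, but the interface is the same: hand it a matvec callback and nothing else.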

SLIDE 36

Square SVD

With 68 executors and 8GB memory in each, looking for the top 5 singular vectors

SLIDE 37

MLlib + {Streaming, GraphX, SQL}

SLIDE 38

A General Platform

Spark Core
» Spark Streaming (real-time)
» Spark SQL (structured)
» GraphX (graph)
» MLlib (machine learning)

Standard libraries included with Spark

SLIDE 39

Benefit for Users

Same engine performs data extraction, model training, and interactive queries.

(Diagram: with separate engines, each of parse, train, and query needs its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train, and query)

SLIDE 40

MLlib + Streaming

As of Spark 1.1, you can train linear models in a streaming fashion; k-means as of 1.2. Model weights are updated via SGD, thus amenable to streaming. More work is needed for decision trees.
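A minimal sketch of why SGD suits streaming: each arriving point (or mini-batch) nudges the weights, so the model is usable at any moment. The stream y = 2x and the step size are invented for illustration:

```python
import random

# Streaming-style SGD on a made-up stream of (x, y) pairs with y = 2x.
random.seed(0)
w = 0.0
for _ in range(500):                       # each loop stands in for a new batch
    x = random.uniform(-1.0, 1.0)
    y = 2.0 * x                            # hypothetical stream
    grad = 2.0 * (w * x - y) * x           # squared-loss gradient at one point
    w -= 0.1 * grad                        # one incremental update per arrival
```

Tree ensembles lack such an incremental weight update, which is why the slide notes decision trees need more work.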

SLIDE 41

MLlib + SQL

df = context.sql("select latitude, longitude from tweets")
model = pipeline.fit(df)

DataFrames in Spark 1.3! (March 2015) Powerful coupled with new pipeline API

SLIDE 42

MLlib + GraphX

SLIDE 43

Future of MLlib

SLIDE 44

Goals for next version

» Tighter integration with DataFrame and the spark.ml API
» Accelerated gradient methods & optimization interface
» Model export: PMML (model export exists in Spark 1.3, but not PMML, which lacks distributed models)
» Scaling: model scaling (e.g. via Parameter Servers)

SLIDE 45

Most active open source community in big data: 200+ developers, 50+ companies contributing

Spark Community

(Chart comparing contributor counts for Spark, Giraph, and Storm)

Contributors in past year

SLIDE 46

Continuing Growth

source: ohloh.net

Contributors per month to Spark

SLIDE 47

Spark and ML

Spark has all its roots in research, so we hope to keep incorporating new ideas!

SLIDE 48
SLIDE 49

Model Broadcast

SLIDE 50

Model Broadcast

Call sc.broadcast to ship the model; use it via .value