

SLIDE 1

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI

SLIDE 2

INTRODUCTION

MACHINE LEARNING vs STATISTICAL LEARNING

Separate evolutions, but with shared methodologies.

MERGE

  • Data Science (Hayashi 1998, Cleveland 2001)
  • “Unifying” Books (Hastie et al. 2009, Barber 2011)
SLIDE 3

WHAT’S HAPPENING NOW?

BIG DATA REVOLUTION

SOME OPEN ISSUES (Jordan 2013)

  • Inferential issues of more “logical” algorithms (e.g. testing on NNs)
  • Speed issues in computation-intensive algorithms

SLIDE 4

SPEED ISSUES

HOW TO TACKLE THEM?

“Divide et Impera” (Lucius Aemilius Paullus, 168 BC): split algorithms into simple, parallelizable steps, and then employ massive parallelization strategies!

SLIDE 5

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

A STORY ABOUT MANY DIFFERENT THINGS, SUCH AS FAST COMPUTERS, COOL ALGORITHMS AND ELEGANT STATISTICS

  • PARALLEL NETWORK ANALYSIS ON APACHE SPARK – M. FONTANA
  • DISTRIBUTED DEEP LEARNING – C. VOLPETTI
  • PLANET AND ENSEMBLE-BASED TREE LEARNING – J. DI IORIO
  • HYBRID CANOPY-FUZZY C-MEANS CLUSTERING – D. OCCHIUTO
SLIDE 6

CHAPTER 1

MATTEO FONTANA

PARALLEL NETWORK ANALYSIS WITH APACHE SPARK

SLIDE 7

PARALLEL NETWORK ANALYSIS WITH APACHE SPARK

ABOUT HOW TO PERFORM PARALLEL GRAPH ANALYTICS ON SPARK

  • STATE OF THE ART
  • GRAPH ANALYTICS SYSTEMS
  • DRAWBACKS
  • MOVE TO DATA PARALLEL SYSTEMS?
  • SOLUTION: A NEW PROGRAMMING ABSTRACTION
  • PERFORMANCE

SLIDE 8

A BRIEF STATE OF THE ART OF NETWORK ANALYSIS

The analysis of network data is becoming more and more prevalent in the ML and Statistics communities, and several attempts have been made to create novel methods:

  • Complex objects in network topology (Kolaczyk 2009)
  • Networks as statistical units (Sienkiewicz and Wang 2014; Shen et al. 2014)
  • Graph Analytics Systems
    – Pregel (Malewicz et al. 2010)
    – PowerGraph (Gonzalez, Low, and Gu 2012)
    – GraphLab (Low et al. 2012)

SLIDE 9

GRAPH ANALYTICS SYSTEM

THE IDEA BEHIND THEM IS THE IMPLEMENTATION OF A RESTRICTED PROGRAMMING ABSTRACTION THAT ALLOWS FOR THE FAST SPECIFICATION OF GRAPH ALGORITHMS, WITH ORDERS-OF-MAGNITUDE IMPROVEMENTS OVER DATA-PARALLEL SYSTEMS

SLIDE 10

DRAWBACKS OF GRAPH ANALYTICS SYSTEMS

BLESSING vs CURSE: the operations required in a network analytics pipeline (e.g. graph creation and modification, graph partition, graph comparison) need a much more general view of the data, “floating” over the graph topology.

SLIDE 11

WHY DON’T WE MOVE TO DATA PARALLEL SYSTEMS?

ONE IDEA: avoid specialized systems for graph processing and move to general-purpose data parallel platforms (e.g. Apache Spark). BUT a naive implementation of graph algorithms on data parallel infrastructure is senseless!

SLIDE 12

THE SOLUTION: FUSION!

The idea to solve this issue is to create a hybrid data- and graph-parallel infrastructure that allows the data scientist to get the best of both worlds:

  • Speed in graph computation of Graph Parallel Systems.
  • Flexibility in the pipeline of Data Parallel Systems.
SLIDE 13

A NEW PROGRAMMING ABSTRACTION

FIRST BRICK: PROGRAMMING ABSTRACTION

  • RDD based (graph as two RDDs) – GraphX
  • After the release of the DataFrame API: DataFrame based (graph as two DataFrames) – GraphFrames
  • Then: rewriting of graph operations (usually based on the Gather, Apply, Scatter paradigm) as distributed joins and aggregations (sketched below)
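Purely as an illustration of the “graph as two DataFrames” idea, the sketch below builds a tiny vertex/edge pair of Spark DataFrames and expresses one PageRank-style message step as a join followed by an aggregation. It is a minimal sketch assuming a local PySpark installation; it is not the GraphX/GraphFrames implementation, and all column names and data are illustrative.

# Sketch only: a graph stored as two ordinary DataFrames (vertices, edges),
# with a graph operation expressed as distributed joins and aggregations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("graph-as-dataframes").getOrCreate()

vertices = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])
edges = spark.createDataFrame([(1, 2), (1, 3), (2, 3), (3, 1)], ["src", "dst"])

# Out-degree: a plain aggregation over the edge DataFrame.
out_deg = edges.groupBy("src").agg(F.count("*").alias("out_degree"))

# One PageRank-style "message" step: every vertex sends 1/out_degree along each
# of its out-edges (join), and every destination sums what it receives (aggregation).
contrib = (edges.join(out_deg, on="src")
                .withColumn("msg", F.lit(1.0) / F.col("out_degree"))
                .groupBy("dst")
                .agg(F.sum("msg").alias("received")))

contrib.show()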

SLIDE 14

PERFORMANCE

Performance-wise, looking at a single graph computation (e.g. calculating PageRank), these systems are still outperformed by lower-level implementations such as GraphLab. The situation changes dramatically when the whole pipeline is taken into account: being able to deal with I/O, ETL and the other pipeline operations using the same system has been shown to dramatically increase performance.

SLIDE 15

CHAPTER 2

CLAUDIA VOLPETTI

DISTRIBUTED DEEP LEARNING

SLIDE 16

DISTRIBUTED DEEP LEARNING

A STORY ABOUT DEEP LEARNING

  • WHAT’S DEEP LEARNING?
  • DEEP LEARNING NEED FOR SPEED
  • WHAT IS BATCH GRADIENT DESCENT?
  • DISTRIBUTED BATCH GRADIENT DESCENT
  • DISTRIBUTED STOCHASTIC GRADIENT DESCENT
  • SYNCHRONOUS DISTRIBUTED SGD
  • ASYNC DISTRIBUTED SGD
  • STALE GRADIENT PROBLEM

SLIDE 17

WHAT’S DEEP LEARNING?

WHAT’S THE DIFFERENCE BETWEEN ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DEEP LEARNING?

Source: NVIDIA blog post by Michael V. Copeland, senior editor at WIRED, previously senior writer at Fortune Magazine.

SLIDE 18

WHAT’S DEEP LEARNING?

WHY NOW?

  • large amounts of training data favor deep learning
  • faster machines and multicore CPUs/GPUs
SLIDE 19

WHAT’S DEEP LEARNING?

Source: Yann LeCun, Marc’Aurelio Ranzato (ICML 2013)

SLIDE 20

DEEP LEARNING NEED FOR SPEED

LARGER TRAINING DATASETS IMPROVE ACCURACY, BUT REQUIRE MORE TIME TO TRAIN.

SLIDE 21

DISTRIBUTED DEEP LEARNING

NEED:

drastically reduce the time to train large deep learning models on even larger datasets.

GOAL OF THIS PRESENTATION:

A BRIEF OVERVIEW OF THE EFFORTS AND CHALLENGES OF DISTRIBUTED DEEP LEARNING (NEURAL NETWORK ALGORITHMS) IN A MAPREDUCE PARADIGM

SLIDE 22

WHAT’S BATCH GRADIENT DESCENT?

An optimization technique used in Deep Learning (a small numerical sketch follows the two steps below):

  • 1. For a pre-defined number of epochs, Batch Gradient Descent first computes the gradient vector ∇θJ(θ) of the loss function w.r.t. the parameters θ over the entire training dataset.
  • 2. It then updates the parameters in the direction of the gradients, with the learning rate η determining how big an update we perform, according to θ = θ − η⋅∇θJ(θ).
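As a concrete illustration of the two steps above, here is a minimal NumPy sketch of batch gradient descent on a least-squares loss; the data, learning rate and number of epochs are illustrative assumptions, not taken from the slides.

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient over the entire dataset
        theta = theta - lr * grad               # theta = theta - eta * grad J(theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(batch_gradient_descent(X, y))             # should approach [1.0, -2.0, 0.5]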

SLIDE 23

WHAT’S BATCH GRADIENT DESCENT?

Neural network algorithms using Batch Gradient Descent are easily parallelizable as follows: (1) the whole dataset is partitioned into subsets and moved through the network towards the workers; (2) every worker calculates the partial gradient on its shard of the data; (3) the reducer then sums the partial gradients from all mappers and performs one batch gradient descent update of the network weights. This has been shown to give an (almost) 2x speed-up.
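The sketch below mimics this map/reduce decomposition on a single machine: each “mapper” computes the partial gradient on its shard and the “reducer” sums them before one batch update. The shard count, loss and data are illustrative assumptions; a real deployment would run the map step on separate workers.

import numpy as np

def partial_gradient(theta, X_shard, y_shard):
    # Map step: partial gradient of the squared loss on this shard only.
    return X_shard.T @ (X_shard @ theta - y_shard)

def distributed_batch_step(theta, shards, lr=0.1):
    n = sum(len(y_s) for _, y_s in shards)
    # Reduce step: sum the partial gradients coming from every mapper.
    grad = sum(partial_gradient(theta, X_s, y_s) for X_s, y_s in shards) / n
    return theta - lr * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, 0.0, -1.0])
shards = [(X[i::4], y[i::4]) for i in range(4)]   # data partitioned over 4 mappers
theta = np.zeros(3)
for _ in range(100):
    theta = distributed_batch_step(theta, shards)
print(theta)                                      # should approach [2.0, 0.0, -1.0]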

SLIDE 24

DISTRIBUTED STOCHASTIC GRADIENT DESCENT

Stochastic Gradient Descent is a challenge for parallelization: it is sequential in nature. In contrast to Batch Gradient Descent, SGD performs a parameter update for each training example (or for every n samples, in its mini-batch version). This is much faster, but not easy to parallelize.

SLIDE 25

DISTRIBUTED STOCHASTIC GRADIENT DESCENT

LOCKED

Indeed, the problem arises when different workers compute ∇θJ(θ) at every step and then update the parameters: while one worker is updating the parameters, it essentially locks them until the update is over. Distributing the SGD optimization mechanism is essentially all about overcoming the fact that some workers finish their computations sooner than others while the common parameters are locked by continuous updates.

SLIDE 26

SYNCHRONOUS DISTRIBUTED SGD

The pitfall is that workers have to wait for all the other computations to be completed before being allowed to proceed to the next gradient computation.

SLIDE 27

ASYNCHRONOUS DISTRIBUTED SGD

  • This technique lets the workers communicate updates through a centralized parameter server, which keeps the current state of all model parameters shared across the multiple workers.
  • The main advantage of this approach is that workers do not have to wait for the common parameter update to be completed before moving on to the following computation (a minimal sketch follows).
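Below is a minimal, single-process sketch of the parameter-server idea, with Python threads standing in for workers: each worker pulls a (possibly stale) copy of the parameters, computes an SGD gradient on one example and pushes the update without waiting for the others. Class and function names are illustrative assumptions, not any particular framework’s API.

import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.theta = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        return self.theta.copy()            # workers read a (possibly stale) copy

    def push(self, grad):
        with self._lock:                    # only the update itself is serialized
            self.theta -= self.lr * grad

def worker(server, X, y, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))
        theta = server.pull()
        grad = (X[i] @ theta - y[i]) * X[i]  # SGD gradient on a single example
        server.push(grad)                    # no barrier: asynchronous update

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0])
server = ParameterServer(dim=3)
threads = [threading.Thread(target=worker, args=(server, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(server.theta)                          # should approach [1.0, 2.0, -1.0]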

SLIDE 28

STALE GRADIENT PROBLEM

Since these approaches remove any explicit synchronization between updates, they implicitly allow workers to compute gradients locally using model parameters that may be several steps behind the most up-to-date set of model parameters. The model parameters are hence at risk of slow convergence, or in some cases of divergence.

SLIDE 29

STALE GRADIENT PROBLEM

Many variants of Asynchronous Distributed SGD aim to cope with the Stale Gradient problem by applying a variety of strategies to minimize the impact of its effects. Some approaches are listed below:

  • 1. Zhang, Choromanska, & LeCun (2015) suggest separating gradients by their level of staleness, in order to limit the impact of very stale gradients by modulating the learning rate according to it (see the sketch after this list).
  • 2. Elastic Averaging SGD (Zhang et al., 2015), which has been shown to outperform Downpour SGD, is a new algorithm based on an elastic force that links the parameters computed by the local workers with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration: on the one hand it allows the local variables to fluctuate further from the center variable, and on the other hand it reduces the amount of communication between local workers and the master.
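As a toy illustration of strategy 1, the sketch below scales the learning rate down with the measured staleness of each gradient; the 1/(1+staleness) rule is an assumption chosen for simplicity, not necessarily the exact modulation proposed in the cited paper.

import numpy as np

class StalenessAwareServer:
    # Toy parameter server that tracks a global step counter; each pushed gradient
    # carries the step at which its parameters were read, and the learning rate is
    # scaled down by the measured staleness (illustrative 1/(1+s) rule).
    def __init__(self, dim, lr=0.05):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.step = 0

    def pull(self):
        return self.step, self.theta.copy()

    def push(self, read_step, grad):
        staleness = self.step - read_step      # how many updates this gradient missed
        self.theta -= self.lr / (1.0 + staleness) * grad
        self.step += 1

server = StalenessAwareServer(dim=3)
step_read, theta = server.pull()
server.push(step_read, grad=np.array([0.1, -0.2, 0.0]))   # staleness 0 in this toy call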

SLIDE 30

CHAPTER 3

JACOPO DI IORIO

PLANET AND ENSEMBLE-BASED TREE MODELS

SLIDE 31

PLANET AND ENSEMBLE-BASED TREE MODELS

A STORY ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 32

REASONS WHY

  • PERSONAL DRAMA 1
  • PERSONAL DRAMA 2
  • WORLD DRAMA

SLIDE 33

PERSONAL DRAMA 1

REASONS WHY

Stat Under the Stars 2: held in Salerno during the night between 7 and 8 June 2016, a challenge in the analysis of a dataset.

SLIDE 34

PERSONAL DRAMA 1

REASONS WHY

Stat Under the Stars 2: held in Salerno during the night between 7 and 8 June 2016, a challenge in the analysis of a dataset.

GRADIENT TREE BOOSTING

SLIDE 35

PERSONAL DRAMA 2

REASONS WHY

Seat Pagine Gialle internship: predict sex and age of unregistered web users from their online behaviour. Large amount of data: BIG DATA.

Age classes: 18-30, 30-45, 45-60, 60+

SLIDE 36

PERSONAL DRAMA 2

REASONS WHY

Seat Pagine Gialle internship: predict sex and age of unregistered web users from their online behaviour. Large amount of data: BIG DATA.

RANDOM FOREST

SLIDE 37

PERSONAL DRAMA 2

REASONS WHY

WHAT WAS I REALLY DOING? ARE THESE METHODS INTERESTING FOR THE WORLD?

SLIDE 38

REASONS WHY

WORLD DRAMA

TREE MODELS AND ENSEMBLE-BASED TREE MODELS ARE EXTENSIVELY USED TO SOLVE SUPERVISED LEARNING PROBLEMS SUCH AS:

  • CLASSIFICATION PROBLEM
  • REGRESSION PROBLEM

NOT SO TRIVIAL ON MASSIVE DATASETS: LARGE INPUTS CREATE A BOTTLENECK BECAUSE OF THE COST OF SCANNING DATA FROM SECONDARY STORAGE

PLANET

PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES

proposed by Google, implemented using MapReduce

SLIDE 39

PLANET AND ENSEMBLE-BASED TREE MODELS

A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 40

TREES, BOOSTING AND FORESTS

DATA

D* = {(x_i, y_i) | x_i ∈ D_{X_1} × · · · × D_{X_N}, y_i ∈ D_Y}

where x_i is a vector of N attributes and y_i is the output we are interested in predicting.

GOAL

F : D_{X_1} × · · · × D_{X_N} → D_Y

Find a function F able to predict the output y_i from the input x_i.

HOW?

Classification and regression trees are among the oldest and most popular methods: TREES, BOOSTING, RANDOM FOREST.

SLIDE 41

TREES, BOOSTING AND FORESTS

TREES partition the attribute space into non-overlapping regions, whose boundaries are defined by predicates (e.g. “If A > B then C”). Splits are obtained by reducing the impurity, recursively partitioning D* into subsets D_A, D_B, D_C, D_D.

BOOSTING uses a weighted combination of weak learners to form an accurate predictive model, built using an additive training strategy: F_{m+1}(x) = F_m(x) + h(x), where h(x) is fitted on the residuals of F_m.

RANDOM FOREST averages many noisy but approximately unbiased and uncorrelated tree models (e.g. Tree 1, Tree 2, Tree 3).
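The additive recursion above can be illustrated with a short sketch that fits each new tree on the residuals of the current ensemble. It uses scikit-learn regression trees as weak learners; the depth, learning rate and number of rounds are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=50, max_depth=2, lr=0.1):
    f0 = y.mean()                        # F_0: a constant model
    pred, trees = np.full(len(y), f0), []
    for _ in range(n_rounds):
        residual = y - pred              # what the current ensemble still gets wrong
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(h)
        pred += lr * h.predict(X)        # F_{m+1}(x) = F_m(x) + lr * h(x)
    return f0, trees

def predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(h.predict(X) for h in trees)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
f0, trees = boost(X, y)
print(np.mean((predict(f0, trees, X) - y) ** 2))   # training MSE of the ensemble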

SLIDE 42

PLANET AND ENSEMBLE-BASED TREE MODELS

A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 43

PLANET

TWO-PHASE DISTRIBUTED COMPUTATION:

MAP

  • the dataset is partitioned and assigned to workers named mappers
  • a map function is applied to each record in order to get a set of <key,value> pairs

REDUCE

  • pairs are then grouped by key and distributed to a series of reducers
  • a user-defined reduce function is applied to each record

CONTROLLER

  • controls the process
  • coordinates MapReduce jobs in order to split the data
  • updates the situation by accessing the MODEL FILE, which contains the entire tree built so far
  • maintains two queues: the MAPREDUCEQUEUE (nodes for which D is too large, expanded with MR_ExpandNodes) and the INMEMORYQUEUE (nodes for which D fits in memory, expanded with MR_InMemory); a single-machine sketch of this control flow follows this list
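Below is a minimal, single-machine sketch of the controller flow just described: a best split is computed for each node (standing in for one MR_ExpandNodes job), and children are routed to the in-memory queue or back to the “MapReduce” queue depending on their size. The impurity proxy (variance), thresholds and data are illustrative assumptions, not the paper’s implementation.

from collections import deque
import numpy as np

MEMORY_THRESHOLD = 100   # rows below which a node's data "fits in memory"

def best_split(X, y):
    # Best axis-aligned split by weighted variance; stands in for one MR_ExpandNodes job.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= t
            score = y[m].var() * m.sum() + y[~m].var() * (~m).sum()
            if best is None or score < best[2]:
                best = (j, float(t), score)
    return best

def controller(X, y):
    model = []                                         # the "model file": tree built so far
    mr_queue, in_mem_queue = deque([(X, y, "root")]), deque()
    while mr_queue or in_mem_queue:
        queue = mr_queue if mr_queue else in_mem_queue
        Xn, yn, name = queue.popleft()
        split = None if len(yn) < 2 or yn.var() == 0.0 else best_split(Xn, yn)
        if split is None:                              # nothing left to split: a leaf
            model.append((name, "leaf", float(yn.mean())))
            continue
        j, t, _ = split
        model.append((name, "split", j, t))
        for side, m in (("L", Xn[:, j] <= t), ("R", Xn[:, j] > t)):
            child = (Xn[m], yn[m], name + side)
            # route each child by the size of its data, as the controller does
            (in_mem_queue if m.sum() <= MEMORY_THRESHOLD else mr_queue).append(child)
    return model

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float) + 0.05 * rng.normal(size=500)
print(controller(X, y)[:5])                            # first few entries of the model file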

SLIDE 44

PLANET

EXAMPLE

  • Initially, M, MRQ and InMemQ are empty; node 0 corresponds to the entire dataset D*.
  • Node 0 is pushed onto the MRQ; MR_ExpandNodes({0}, M, D*) computes, for each candidate split, the impurity, the branches and the number of records.
  • The CONTROLLER selects the best split and updates M, producing child nodes A, B, C, D with datasets D_A, D_B, D_C, D_D.
  • NODE A: |D_A| > STOP threshold, but |D_A| < memory threshold, so it goes onto the InMemQ and is expanded by MR_InMemory (depth expansion with the classical sequential algorithm).
  • NODE B: |D_B| > STOP threshold and |D_B| > memory threshold, so it goes onto the MRQ, and so on.

SLIDE 45

PLANET

BOOSTING AND RANDOM FORESTS

BOOSTING (trees are trained one at a time): F is set; the residuals are computed from the current model’s predictions; a new tree is created simply by pushing the root node of the new tree onto MapReduce after the completion of the last tree.

RANDOM FOREST (multiple trees are trained in parallel): hash-based sampling puts the same sample in the same tree; the nodes of all trees are pushed onto the MRQ, so the queues contain nodes belonging to many different trees instead of a single tree.

SLIDE 46

CHAPTER 4

Daniele Occhiuto Hybrid Canopy-Fuzzy C-means Clustering

SLIDE 47

Canopy Algorithm and FCM Clustering

Dai, W., Yu, C., & Jiang, Z. (2016). An Improved Hybrid Canopy-Fuzzy C-Means Clustering Algorithm Based on MapReduce Model.

Canopy and FCM

Map Reduce Parallelization

Hybrid Canopy-FCM approach

SLIDE 48

CANOPY AND LARGE DATASETS

CANOPY MOTIVATIONS

There are mainly three ways datasets can be large:
  1. LARGE NUMBER OF ELEMENTS IN THE DATA
  2. EACH ELEMENT CAN HAVE MANY FEATURES
  3. THERE CAN BE MANY CLUSTERS TO DISCOVER

KEY IDEA: GREATLY REDUCE THE NUMBER OF DISTANCE COMPUTATIONS BY PERFORMING THE CLUSTERING IN 2 STAGES:
  1. A ROUGH, QUICK STAGE THAT DIVIDES THE DATA INTO OVERLAPPING SUBSETS CALLED CANOPIES
  2. A RIGOROUS FINAL STAGE, WITH THE EXPENSIVE DISTANCE MEASURED ONLY AMONG POINTS THAT OCCUR IN COMMON CANOPIES

Canopy: a subset of the elements (data points or items) that, according to the approximate similarity measure, are within some distance threshold of a central point.

Property: points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. Using only an approximate distance may not guarantee this property; overlapping canopies, an adequate distance threshold and an understanding of the properties of the approximate measure allow us to be confident the property holds.

RESTRICTION: we do not calculate the distance between two points that never appear in the same canopy.

RISK: the inexpensive clustering must not exclude the solution of the expensive clustering; otherwise valid data points would be excluded from the expensive clustering solution → ACCURACY LOSS → CHECK THE OPTIMALITY CONDITION.

SLIDE 49

CANOPY IN PRACTICE

CANOPY ALGORITHM

  • Start with a list of the data points, in any order.
  • Set two distance thresholds, T1 and T2 (T1 > T2).
  • Pick a point off the list and approximately measure its distance to all other points. (This is extremely cheap with an inverted index.)
  • Put all points that are within distance threshold T1 into a canopy.
  • Remove from the list all points that are within distance threshold T2.
  • Repeat until the list is empty (see the sketch below).

OPTIMALITY CONDITION: for each cluster there exists at least one canopy that completely contains that cluster.
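A minimal NumPy sketch of the procedure listed above, using plain Euclidean distance as the “cheap” metric purely for illustration (real applications would use a much cheaper, domain-specific approximate distance); T1, T2 and the data are assumptions.

import numpy as np

def canopy(points, t1, t2):
    assert t1 > t2
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)                       # pick any point off the list
        dists = np.linalg.norm(points[remaining] - points[center], axis=1)
        members = [center] + [p for p, d in zip(remaining, dists) if d < t1]
        canopies.append(members)                        # everything within T1
        remaining = [p for p, d in zip(remaining, dists) if d >= t2]  # drop points within T2
    return canopies

rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print([len(c) for c in canopy(pts, t1=3.0, t2=1.5)])    # canopy sizes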

SLIDE 50

FUZZY C-MEANS CLUSTERING

FCM MOTIVATIONS

There are mainly three ways datasets can be large:
  1. LARGE NUMBER OF ELEMENTS IN THE DATA
  2. EACH ELEMENT CAN HAVE MANY FEATURES
  3. THERE CAN BE MANY CLUSTERS TO DISCOVER

KEY IDEA: ALLOW ONE PIECE OF DATA TO BELONG TO TWO OR MORE CLUSTERS. EACH ELEMENT HAS A MEMBERSHIP DEGREE EXPRESSING HOW STRONGLY IT BELONGS TO A CLUSTER.

FCM is based on the minimization of the following objective function:

K_n = \sum_{i=1}^{N} \sum_{j=1}^{C} v_{ij}^{n} \, \lVert y_i - d_j \rVert^2, \qquad 1 \le n < \infty

Fuzzy partitions are carried out through iterative optimization of the above objective function, updating the membership degrees v_{ij} and the centers d_j:

v_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert y_i - d_j \rVert}{\lVert y_i - d_k \rVert} \right)^{2/(n-1)}}, \qquad d_j = \frac{\sum_{i=1}^{N} v_{ij}^{n} \, y_i}{\sum_{i=1}^{N} v_{ij}^{n}}

Termination criterion: \max_{i,j} \lvert v_{ij}^{(k+1)} - v_{ij}^{(k)} \rvert < \zeta
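A minimal NumPy sketch of the iterative optimization above: centers and membership degrees are updated in turn until the termination criterion is met. The cluster count, fuzzifier n, tolerance ζ and the data are illustrative assumptions.

import numpy as np

def fcm(Y, C=3, n=2.0, zeta=1e-5, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.dirichlet(np.ones(C), size=len(Y))              # random fuzzy memberships
    for _ in range(max_iter):
        # centers d_j = sum_i v_ij^n y_i / sum_i v_ij^n
        D = (V ** n).T @ Y / (V ** n).sum(axis=0)[:, None]
        dist = np.linalg.norm(Y[:, None, :] - D[None, :, :], axis=2) + 1e-12
        # memberships v_ij = 1 / sum_k (||y_i - d_j|| / ||y_i - d_k||)^(2/(n-1))
        ratio = dist[:, :, None] / dist[:, None, :]
        V_new = 1.0 / (ratio ** (2.0 / (n - 1.0))).sum(axis=2)
        if np.abs(V_new - V).max() < zeta:                   # termination criterion
            return V_new, D
        V = V_new
    return V, D

rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(m, 0.3, size=(40, 2)) for m in (0.0, 2.0, 4.0)])
V, D = fcm(Y)
print(np.round(D, 2))                                        # the three fuzzy cluster centers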

SLIDE 51

MAPREDUCE FCM CLUSTERING

THE MAPREDUCE VERSION OF THE CANOPY ALGORITHM WAS PRESENTED IN THE COURSE; THE HYBRID CANOPY + FCM APPROACH IS ALL BASED ON MAPREDUCE.

MapReduce of FCM Clustering:

  • Map: the map process calculates the membership degree of the data objects in the current node with respect to each cluster center; the <key, value> data structure is represented as (center, (point, weight)) (see the single-process sketch after this list).
  • Combine: values sharing the same key are combined locally (local optimization).
  • Reduce: the reduce process issues all cluster centers when the whole process terminates.
  • Iterate: iteration is necessary given the nature of FCM algorithms; when the variation of all cluster centers remains under the specified threshold, we can terminate the algorithm.
  • Classify: classification is performed with a final MapReduce process.
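A minimal single-process sketch of this map/combine/reduce decomposition: the map step emits (center, (weighted point, weight)) pairs following the key/value layout on the slide, and the reduce step sums them per center to produce the new centers. The shard count and data are assumptions; a real deployment would distribute the map step across nodes.

import numpy as np
from collections import defaultdict

def fcm_map(Y_shard, centers, n=2.0):
    dist = np.linalg.norm(Y_shard[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    V = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (n - 1.0))).sum(axis=2)
    for i, y in enumerate(Y_shard):
        for j in range(len(centers)):
            w = V[i, j] ** n
            yield j, (w * y, w)                       # key = center, value = (weighted point, weight)

def fcm_reduce(pairs):
    acc = defaultdict(lambda: [0.0, 0.0])
    for j, (wy, w) in pairs:                          # combine + reduce: sum per key
        acc[j][0] += wy
        acc[j][1] += w
    return np.array([acc[j][0] / acc[j][1] for j in sorted(acc)])

rng = np.random.default_rng(6)
Y = rng.normal(size=(300, 2))
centers = Y[rng.choice(len(Y), 3, replace=False)]     # initial centers
shards = np.array_split(Y, 4)                         # 4 mappers
pairs = [p for shard in shards for p in fcm_map(shard, centers)]
print(fcm_reduce(pairs))                              # updated centers after one iteration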

SLIDE 52

HYBRID CANOPY-FCM CLUSTERING

SLIDE 53

RESULTS IRIS DATASET

Hybrid canopy-FCM was compared with FCM clustering on the Car Evaluation and Iris datasets: it improves the efficacy of FCM clustering and shows a reduced execution time, compared with the original MapReduce-based FCM, as the dataset size increases. (Figures: classic FCM vs hybrid canopy-FCM.)

SLIDE 54

WHY?

FINAL REMARKS

FURTHER APPLICATIONS?

  • Lin, D., & Pantel, P. (2001, August). Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.
  • Dredze, M., McNamee, P., Rao, D., Gerber, A., & Finin, T. (2010, August). Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics.

A DOMAIN-SPECIFIC CHEAP DISTANCE FUNCTION IS CRUCIAL FOR ACCURACY WHEN APPLYING THE CANOPY ALGORITHM.

FUZZY SETS IN CANOPY CLUSTERING?

FCM PARALLELIZATION REQUIRES SEVERAL ITERATIONS → APACHE SPARK TO INCREASE SPEED.

SLIDE 55

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI