MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING
JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI
INTRODUCTION
MACHINE LEARNING VS STATISTICAL LEARNING
Separate evolutions, but with shared methodologies.
MERGE
WHAT’S HAPPENING NOW?
BIG DATA REVOLUTION
SOME OPEN ISSUES (Jordan 2013)
– Inferential issues of more “logical” algorithms (e.g. testing on NNs)
– Speed issues in computation-intensive algorithms
SPEED ISSUES
HOW TO TACKLE THEM?
“Divide et Impera” (Lucius Aemilius Paullus, 168 BC). So: split algorithms into simple, parallelizable steps, and then employ massive parallelization strategies!
MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING
A STORY ABOUT MANY DIFFERENT THINGS, SUCH AS FAST COMPUTERS, COOL ALGORITHMS AND ELEGANT STATISTICS:
– PARALLEL NETWORK ANALYSIS ON APACHE SPARK
– DISTRIBUTED DEEP LEARNING
– PLANET AND ENSEMBLE-BASED TREE LEARNING
– HYBRID CANOPY-FUZZY C-MEANS CLUSTERING
MATTEO FONTANA
PARALLEL NETWORK ANALYSIS WITH APACHE SPARK
ABOUT HOW TO PERFORM PARALLEL GRAPH ANALYTICS ON SPARK
STATE OF THE ART
GRAPH ANALYTICS SYSTEMS
DRAWBACKS
MOVE TO DATA-PARALLEL SYSTEMS?
SOLUTION: A NEW PROGRAMMING ABSTRACTION
PERFORMANCE
A BRIEF STATE OF THE ART OF NETWORK ANALYSIS
The analysis of network data is becoming more and more prevalent in the ML and Statistics communities, and several attempts have been made to create novel methods and systems:
– Pregel (Malewicz et al. 2010)
– PowerGraph (Gonzalez, Low, and Gu 2012)
– GraphLab (Low et al. 2012)
GRAPH ANALYTICS SYSTEM
THE IDEA BEHIND THEM IS THE IMPLEMENTATION OF A RESTRICTED PROGRAMMING ABSTRACTION THAT ALLOWS FOR THE FAST SPECIFICATION OF GRAPH ALGORITHMS, WITH ORDERS-OF-MAGNITUDE IMPROVEMENTS OVER DATA-PARALLEL SYSTEMS
DRAWBACKS OF GRAPH ANALYTICS SYSTEMS
BLESSING: the restricted abstraction is exactly what delivers the performance.
CURSE: operations required in a network analytics pipeline (e.g. graph creation and modification, graph partitioning, graph comparison) need a much more general view of the data, “floating” over the graph topology.
WHY DON’T WE MOVE TO DATA-PARALLEL SYSTEMS?
ONE IDEA: avoid specialized systems for graph processing and move to general-purpose data-parallel platforms (e.g. Apache Spark). BUT a naive implementation of graph-parallel algorithms on data-parallel infrastructure is senseless!
THE SOLUTION: FUSION!
The idea to solve this issue is to create a hybrid data- and graph-parallel infrastructure that allows the data scientist to get the best of both worlds:
A NEW PROGRAMMING ABSTRACTION
FIRST BRICK: PROGRAMMING ABSTRACTION
– RDD-based (graph as two RDDs): GraphX
– after the release of the DataFrame API, DataFrame-based (graph as two DataFrames): GraphFrames
THEN: graph operations (usually based on the Gather, Apply, Scatter paradigm) are rewritten as distributed joins and aggregations (see the sketch below).
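As an illustration of the “graph as two DataFrames” abstraction, here is a minimal sketch assuming PySpark with the third-party graphframes package installed; the data and names are illustrative.

```python
# A minimal GraphFrames sketch: a graph is just two DataFrames, and
# operations such as PageRank compile down to joins and aggregations.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

# Vertices: a DataFrame with a mandatory "id" column.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])

# Edges: a DataFrame with mandatory "src" and "dst" columns.
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# PageRank, executed as distributed joins/aggregations on Spark.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```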
PERFORMANCE
Performance-wise, and looking at a single graph computation (i.e. calculating PageRank), these systems are still outperformed by lower-level implemented systems such as GraphLab. The situation dramatically changes when the whole pipeline is taken into account: being able to deal with I/O, ETL and the other pipeline stages within a single system dramatically improves end-to-end performance.
CLAUDIA VOLPETTI
DISTRIBUTED DEEP LEARNING
A STORY ABOUT DISTRIBUTED DEEP LEARNING
WHAT’S DEEP LEARNING?
DEEP LEARNING NEED FOR SPEED
WHAT IS BATCH GRADIENT DESCENT?
DISTRIBUTED BATCH GRADIENT DESCENT
DISTRIBUTED STOCHASTIC GRADIENT DESCENT
SYNCHRONOUS DISTRIBUTED SGD
ASYNCHRONOUS DISTRIBUTED SGD
STALE GRADIENT PROBLEM
WHAT’S DEEP LEARNING?
WHAT’S THE DIFFERENCE BETWEEN ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DEEP LEARNING?
Source: NVIDIA blog post by Michael V. Copeland, senior editor at WIRED (previously senior writer at Fortune Magazine).
WHAT’S DEEP LEARNING?
Source: Yann LeCun, Marc’Aurelio Ranzato (ICML 2013)
DEEP LEARNING NEED FOR SPEED
LARGER TRAINING DATASETS IMPROVE ACCURACY
DISTRIBUTED DEEP LEARNING
NEED:
drastically reduce the time needed to train large deep learning models on even larger datasets.
GOAL OF THIS PRESENTATION:
A BRIEF OVERVIEW OF THE EFFORTS AND CHALLENGES OF DISTRIBUTED DEEP LEARNING (NEURAL NETWORK ALGORITHMS) IN A MAPREDUCE PARADIGM
WHAT’S BATCH GRADIENT DESCENT?
An optimization technique used in Deep Learning
Gradient Descent first computes the gradient vector ∇θJ(θ) of the loss function w.r.t. the parameters θ over the entire training dataset.
It then updates the parameters in the opposite direction of the gradient, with the learning rate η determining how big an update we perform: θ = θ − η⋅∇θJ(θ).
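A minimal NumPy sketch of this update rule, assuming an illustrative least-squares loss; all names are hypothetical.

```python
# Batch gradient descent for least squares, matching θ = θ − η·∇θJ(θ).
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=100):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        # gradient of the squared loss J(θ) over the WHOLE dataset
        grad = X.T @ (X @ theta - y) / len(y)
        theta -= eta * grad   # step opposite to the gradient direction
    return theta

# usage: recover the coefficients of a noiseless linear model
X = np.random.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5])
print(batch_gradient_descent(X, y))  # ≈ [1.0, -2.0, 0.5]
```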
WHAT’S BATCH GRADIENT DESCENT?
Neural network algorithms using Batch Gradient Descent are easily parallelizable as follows (see the sketch below):
(1) the whole dataset is partitioned into subsets and moved through the network towards the workers;
(2) every worker calculates the partial gradient on its shard of the data;
(3) the reducer then sums the partial gradients from the mappers and performs a batch gradient descent step to update the weights of the network.
This has been shown to yield an (almost) 2x speed-up.
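A toy sketch of this map/reduce scheme in plain NumPy; the shard count, step size and function names are illustrative, not the paper’s implementation.

```python
# Distributed batch gradient descent (simulated): mappers compute
# partial gradients on their shards, the reducer sums and updates.
import numpy as np

def partial_gradient(theta, X_shard, y_shard):
    # (2) each worker's gradient contribution on its shard
    return X_shard.T @ (X_shard @ theta - y_shard)

def distributed_batch_step(theta, shards, eta=0.1):
    # (3) the reducer sums partial gradients, then one batch update
    total = sum(partial_gradient(theta, Xs, ys) for Xs, ys in shards)
    n = sum(len(ys) for _, ys in shards)
    return theta - eta * total / n

X = np.random.randn(400, 3)
y = X @ np.array([2.0, 0.0, -1.0])
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))  # (1)
theta = np.zeros(3)
for _ in range(100):
    theta = distributed_batch_step(theta, shards)
print(theta)  # ≈ [2.0, 0.0, -1.0]
```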
DISTRIBUTED STOCHASTIC GRADIENT DESCENT
Stochastic Gradient Descent is a challenge for parallelization: it is sequential in nature. In contrast to Batch Gradient Descent, SGD performs a parameter update for each training example (or for every n samples in its mini-batch version). That is much faster, but … not easy to parallelize.
DISTRIBUTED STOCHASTIC GRADIENT DESCENT
LOCKED
Indeed, the problem arises when different workers compute ∇θ at every step and then update the parameters: a worker that is updating the parameters essentially locks them until the update is over. Distributing the SGD optimization mechanism is essentially all about overcoming the fact that some workers will finish their computations sooner than others while the common parameters are locked by continuous updates.
SYNCHRONOUS DISTRIBUTED SGD
The pitfall is that workers have to wait for all the other computations to be completed before being allowed to proceed to the next gradient computation (see the sketch below).
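A toy NumPy sketch of the synchronous scheme; the “collect a gradient from every worker before updating” barrier is the point, and all names and sizes are illustrative.

```python
# Synchronous distributed SGD (simulated): one shared update per step,
# taken only after ALL workers have contributed a mini-batch gradient.
import numpy as np

def sync_sgd_step(theta, worker_batches, eta=0.05):
    # barrier: every worker's gradient is required before the update
    grads = [Xb.T @ (Xb @ theta - yb) / len(yb) for Xb, yb in worker_batches]
    return theta - eta * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, 2.0, 3.0])
theta = np.zeros(3)
for _ in range(300):
    # each of 4 workers draws its own mini-batch
    batches = [(X[i], y[i]) for i in
               (rng.choice(len(y), size=32) for _ in range(4))]
    theta = sync_sgd_step(theta, batches)
print(theta)  # ≈ [1, 2, 3]
```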
ASYNCHRONOUS DISTRIBUTED SGD
A parameter server keeps the current state of all the model’s parameters, shared across the multiple workers. Workers do not wait for a parameter update to be completed before moving forward to the following computation (see the sketch below).
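A toy sketch of the asynchronous scheme using Python threads; all names are illustrative, and the unsynchronized read-modify-write on the shared parameters is deliberate, in the spirit of lock-free async SGD.

```python
# Asynchronous parameter-server sketch: workers read the shared
# parameters, compute a gradient, and apply it WITHOUT waiting for
# each other, so some gradients are computed on stale parameters.
import threading
import numpy as np

X = np.random.default_rng(1).standard_normal((2000, 3))
y = X @ np.array([1.0, -1.0, 0.5])
theta = np.zeros(3)   # shared state (the "parameter server")

def worker(seed, n_steps=300, eta=0.01, batch=32):
    global theta
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        snapshot = theta.copy()                  # read: may already be stale
        idx = rng.choice(len(y), size=batch)
        g = X[idx].T @ (X[idx] @ snapshot - y[idx]) / batch
        theta = theta - eta * g                  # write: no waiting, may race

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(theta)  # ≈ [1, -1, 0.5], despite stale gradients
```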
STALE GRADIENT PROBLEM
Since these approaches remove any explicit synchronization between updates, they implicitly allow workers to compute gradients locally using model parameters that may be several steps behind the most up-to-date set of model parameters. The model is hence at risk of slow convergence or, in some cases, divergence.
STALE GRADIENT PROBLEM
Many variants of Asynchronous Distributed SGD aim to cope with the Stale Gradient problem, applying a variety of strategies to minimize its effects. One such approach is described below:
Elastic Averaging SGD (EASGD, Zhang, Choromanska, and LeCun 2015) provides an algorithm based on an elastic force which links the parameters computed by the local workers with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration: on the one hand it allows the local variables to fluctuate further from the center variable, and on the other hand it reduces the amount of communication between the local workers and the master.
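A sketch of the elastic update just described, in one common formulation where a single coefficient α controls the strength of the elastic link; this is an assumption-laden toy, not the paper’s exact algorithm.

```python
# Elastic-averaging sketch: each worker takes a gradient step and is
# pulled toward the center variable; the center is pulled toward the
# workers' average (equal and opposite "elastic force").
import numpy as np

def elastic_step(x_workers, x_center, grads, eta=0.05, alpha=0.1):
    new_workers = [x - eta * g - alpha * (x - x_center)
                   for x, g in zip(x_workers, grads)]
    # center update: move toward the average of the (old) worker params
    x_center = x_center + alpha * (
        sum(x - x_center for x in x_workers) / len(x_workers))
    return new_workers, x_center
```

A small α lets the local variables fluctuate further from the center (more exploration) while tolerating infrequent communication with the master.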
JACOPO DI IORIO
PLANET AND ENSEMBLE-BASED TREE MODELS
A STORY ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.
REASONS WHY
TREES, BOOSTING AND FORESTS
PLANET
REASONS WHY
PERSONAL DRAMA 1 – PERSONAL DRAMA 2 – WORLD DRAMA
PERSONAL DRAMA 1
Stat Under the Stars 2: held in Salerno during the night between 7 and 8 June 2016; a challenge in the analysis of a dataset.
GRADIENT TREE BOOSTING
PERSONAL DRAMA 2
SEAT Pagine Gialle internship: predict sex and age (18-30, 30-45, 45-60, 60+) of unregistered web users from their online behaviour. Large amount of data: BIG DATA.
RANDOM FOREST
WORLD DRAMA
Tree models and ensemble-based tree models are extensively used to solve supervised learning problems. This is not so trivial on massive datasets: large inputs create a bottleneck because of the cost of scanning data from secondary storage.
PLANET
PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES
Proposed by Google; implemented using MapReduce.
A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.
REASONS WHY
TREES, BOOSTING AND FORESTS
PLANET
TREES, BOOSTING AND FORESTS
DATA
D∗ = {(xi, yi) | xi ∈ DX1 × · · · × DXN, yi ∈ DY}
where xi is a vector of N attributes and yi is the output we are interested in predicting.
GOAL
F : DX1 × · · · × DXN → DY
Find a function F able to predict the output yi from the input xi.
HOW?
TREES, BOOSTING AND FORESTS

TREES: classification and regression trees are among the oldest and most popular predictive models. They partition the attribute space into non-overlapping regions whose boundaries are defined by predicates (e.g. “if A > B then C”). Splits are obtained by reducing the impurity of D∗.
[Figure: an example tree with splits A, B, C, D partitioning D∗ into regions DA, DB, DC, DD]

BOOSTING: an accurate predictive model built using an additive training strategy, which uses a weighted combination of weak learners: Fm+1(x) = Fm(x) + h(x), where h(x) is an add-on fitted on the residuals of Fm (see the sketch below).

RANDOM FOREST: averages many noisy but approximately unbiased and uncorrelated tree models.
[Figure: Tree 1, Tree 2, Tree 3 averaged into a forest]
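A minimal sketch of the additive recursion Fm+1(x) = Fm(x) + h(x), assuming scikit-learn regression trees as the weak learners; the shrinkage factor and depth are illustrative.

```python
# Gradient tree boosting for squared loss: each new tree is fitted on
# the residuals of the current model, then added with a small weight.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=50, nu=0.1, max_depth=3):
    prediction = np.zeros(len(y))          # F_0 = 0
    trees = []
    for _ in range(n_trees):
        residual = y - prediction          # what F_m still misses
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        prediction += nu * h.predict(X)    # F_{m+1} = F_m + nu * h
        trees.append(h)
    return trees

# prediction on new data: y_hat = sum(nu * t.predict(X_new) for t in trees)
```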
A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.
REASONS WHY
TREES, BOOSTING AND FORESTS
PLANET
PLANET
TWO-PHASE DISTRIBUTED COMPUTATION:
MAP: the computation is distributed to workers named mappers, which apply their function to each record.
REDUCE: the mappers’ outputs are distributed to a series of reducers, which aggregate them.

PLANET components:
– CONTROLLER: orchestrates the whole computation.
– MODEL FILE (M): contains the entire tree built so far.
– MAPREDUCEQUEUE (MRQ): nodes for which D is too large to fit in memory, expanded with MR_ExpandNodes.
– INMEMORYQUEUE (InMemQ): nodes for which D fits in memory, completed with MR_InMemory.
(A sketch of the controller loop follows.)
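A pseudocode-level sketch of the controller loop; MR_ExpandNodes, MR_InMemory, the model object and the memory threshold are placeholders standing in for PLANET’s actual MapReduce jobs, so this is an outline rather than runnable code.

```python
# PLANET controller loop (sketch). Large nodes are expanded one level
# at a time by a MapReduce pass; small nodes are finished in memory.
from collections import deque

MEMORY_THRESHOLD = 100_000          # illustrative: max |D| held in memory

def controller(root, model):
    mrq, in_mem_q = deque([root]), deque()
    while mrq or in_mem_q:
        if mrq:
            nodes = [mrq.popleft() for _ in range(len(mrq))]
            splits = MR_ExpandNodes(nodes, model)     # one MapReduce pass
            for node, (best_split, children) in splits.items():
                model.add_split(node, best_split)     # controller updates M
                for child in children:
                    (mrq if child.size > MEMORY_THRESHOLD
                     else in_mem_q).append(child)
        while in_mem_q:
            # node's data fits in memory: grow the whole subtree at once
            model.add_subtree(MR_InMemory(in_mem_q.popleft(), model))
    return model
```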
PLANET
EXAMPLE
[Figure: a tree with splits A, B, C, D partitioning D∗ into DA, DB, DC, DD]
1. The root node is pushed onto the MRQ and expanded with MR_ExpandNodes({0}, M, D∗).
2. The CONTROLLER selects the best split and updates M.
3. Each resulting subset (e.g. DA, DB) is compared with the STOP threshold: if its size is above the threshold, the node goes back onto the MRQ for another MR_ExpandNodes pass; if it is below the threshold, the node goes onto the InMemQ and is completed with MR_InMemory.
4. The process repeats until both queues are empty.
PLANET
BOOSTING AND RANDOM FORESTS

BOOSTING (trees are trained one at a time): F is set; the residuals are computed from the current model’s predictions; to create a new tree, it is sufficient to push the root node of the new tree onto the queue.

RANDOM FOREST (multiple trees are trained in parallel): hash-based sampling keeps the same sample within the same tree; the nodes of all trees are pushed onto the MRQ, so the queues contain nodes belonging to many different trees instead of a single tree.
Daniele Occhiuto
Hybrid Canopy-Fuzzy C-means Clustering
Dai, W., Yu, C., & Jiang, Z. (2016). An Improved Hybrid Canopy-Fuzzy C-Means Clustering Algorithm Based on MapReduce Model.
Canopy and FCM
MapReduce Parallelization
Hybrid Canopy-FCM approach
CANOPY AND LARGE DATASETS
CANOPY MOTIVATIONS
Mainly three ways datasets can be large:
1. a large number of elements in the data;
2. each element can have many features;
3. there can be many clusters to discover.
KEY IDEA: greatly reduce the number of distance computations.
1. A rough, quick stage divides the data into overlapping subsets called canopies.
2. A rigorous final stage performs expensive distance measurements only among points that occur in common canopies.

Canopy: a subset of the elements (data points or items) that, according to the approximate similarity measure, are within some distance threshold from a central point.
Property: points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. → An only-approximate distance may not guarantee this property; overlapping canopies, an adequate distance threshold, and an understanding of the properties of the approximate metric allow us to be confident that it holds.
RESTRICTION: we do not calculate the distance between two points that never appear in the same canopy.
RISK: the inexpensive clustering must not exclude solutions of the expensive clustering; otherwise, valid data points would be excluded from the expensive clustering solution. → ACCURACY LOSS → CHECK THE OPTIMALITY CONDITION
CANOPY IN PRACTICE
CANOPY ALGORITHM
1. Start with the full list of points; pick one at random and use the cheap, approximate metric to compute its distance to all other points. (This is extremely cheap with an inverted index.)
2. Put all points within the loose distance threshold T1 into a canopy.
3. Remove from the list the picked point and all points within the tight distance threshold T2 (T2 < T1), and repeat until the list is empty.
OPTIMALITY CONDITION for each cluster there exists at least one canopy that completely contains that cluster.
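A minimal NumPy sketch of the canopy procedure, with plain Euclidean distance standing in for the cheap approximate metric; the thresholds T1 > T2 and the data are illustrative.

```python
# Canopy clustering sketch: loose threshold t1 builds (overlapping)
# canopies, tight threshold t2 removes points from further picks.
import numpy as np

def canopy(points, t1, t2):
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]   # pick a point (here: the first left)
        d = np.linalg.norm(points[remaining] - points[center], axis=1)
        members = [p for p, dist in zip(remaining, d) if dist < t1]
        canopies.append((center, members))                 # loose: T1
        # remove the center and all points within the tight threshold T2
        remaining = [p for p, dist in zip(remaining, d) if dist >= t2]
    return canopies

pts = np.random.randn(300, 2)
print([len(m) for _, m in canopy(pts, t1=2.0, t2=1.0)])
```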
FUZZY C-MEANS CLUSTERING
FCM MOTIVATIONS
Mainly three ways datasets can be large:
1. a large number of elements in the data;
2. each element can have many features;
3. there can be many clusters to discover.
KEY IDEA: allow one piece of data to belong to two or more clusters. Each element has a membership degree expressing how strongly it belongs to a cluster.
FCM is based on the minimization of the following objective function:
K_n = Σ_{i=1..N} Σ_{j=1..C} (v_ij)^n · ‖y_i − d_j‖², with 1 ≤ n < ∞

Fuzzy partitions are carried out through iterative optimization of the above objective function, updating the membership degrees and the centers:

v_ij = 1 / Σ_{k=1..C} ( ‖y_i − d_j‖ / ‖y_i − d_k‖ )^(2/(n−1))

d_j = Σ_{i=1..N} (v_ij)^n · y_i / Σ_{i=1..N} (v_ij)^n

Termination criterion: max_ij | v_ij^(k+1) − v_ij^(k) | < ζ, where k is the iteration step.
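A compact NumPy sketch of this iteration; the fuzzifier n = 2, the tolerance and the random initialization are assumptions.

```python
# Fuzzy C-Means: alternate the center update d_j and the membership
# update v_ij until memberships change by less than zeta.
import numpy as np

def fcm(Y, C, n=2.0, zeta=1e-5, max_iter=200):
    rng = np.random.default_rng(0)
    V = rng.random((len(Y), C))
    V /= V.sum(axis=1, keepdims=True)            # random fuzzy partition
    for _ in range(max_iter):
        W = V ** n
        D = W.T @ Y / W.sum(axis=0)[:, None]     # center update d_j
        dist = np.linalg.norm(Y[:, None, :] - D[None, :, :], axis=2) + 1e-12
        V_new = 1.0 / dist ** (2 / (n - 1))
        V_new /= V_new.sum(axis=1, keepdims=True)  # membership update v_ij
        if np.abs(V_new - V).max() < zeta:       # termination criterion
            return D, V_new
        V = V_new
    return D, V

centers, memberships = fcm(np.random.randn(200, 2), C=3)
```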
MAPREDUCE FCM CLUSTERING
The MapReduce version of the Canopy algorithm was presented in the course; the hybrid Canopy + FCM approach is entirely based on MapReduce.
MapReduce of FCM clustering:
The Map process calculates the membership degrees of the data objects in the current node with respect to each cluster center; the <key, value> data structure is represented as (center, (point, weight)). A Combine step locally merges values sharing the same key.
The Reduce process emits all updated cluster centers. Iteration is necessary given the iterative nature of FCM: when the change in the membership degrees falls below the specified threshold, we can terminate the algorithm. Classification is done with a final MapReduce process.
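A plain-Python sketch of one FCM iteration phrased as map and reduce steps, mirroring the (center, (point, weight)) layout above; the combine step is omitted and all names are illustrative.

```python
# One FCM iteration as map/reduce: mappers emit per-center weighted
# points, the reducer sums them per key and recomputes each center.
import numpy as np

def fcm_map(point, centers, n=2.0):
    # membership degrees of one data object w.r.t. every cluster center
    dist = np.linalg.norm(centers - point, axis=1) + 1e-12
    v = 1.0 / dist ** (2 / (n - 1))
    v /= v.sum()
    # emit (center_id, (weighted point, weight)) with weight = v_ij^n
    return [(j, (v[j] ** n * point, v[j] ** n)) for j in range(len(centers))]

def fcm_reduce(pairs, n_centers):
    # sum weighted points and weights per key, then divide: new d_j
    num = {j: 0.0 for j in range(n_centers)}
    den = {j: 0.0 for j in range(n_centers)}
    for j, (wp, w) in pairs:
        num[j] += wp
        den[j] += w
    return np.array([num[j] / den[j] for j in range(n_centers)])

# one iteration over a toy dataset
Y = np.random.randn(100, 2)
centers = Y[:3].copy()
pairs = [p for y in Y for p in fcm_map(y, centers)]   # map phase
centers = fcm_reduce(pairs, 3)                        # reduce phase
```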
HYBRID CANOPY-FCM CLUSTERING
RESULTS: IRIS DATASET
Hybrid Canopy-FCM was compared with FCM clustering on the Car Evaluation and Iris datasets: it improves the efficacy of FCM clustering, and shows reduced execution time relative to the original MapReduce-based FCM as the dataset size increases.
[Figure: execution time of CLASSIC FCM vs HYBRID CANOPY-FCM]
WHY?
FINAL REMARKS
– A domain-specific cheap distance function is crucial for accuracy when applying the Canopy algorithm.
– Fuzzy sets in canopy clustering?
– FCM parallelization requires several iterations → Apache Spark to increase speed.
FURTHER APPLICATIONS?
– Lin, D., & Pantel, P. (2001, August). Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.
– Dredze, M., McNamee, P., Rao, D., Gerber, A., & Finin, T. (2010, August). Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics.
JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI