
MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING - PowerPoint PPT Presentation



  1. JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI: MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

  2. INTRODUCTION: MACHINE LEARNING VS STATISTICAL LEARNING. Separate evolutions, but with shared methodologies, now merging: • Data Science (Hayashi 1998, Cleveland 2001) • "Unifying" books (Hastie et al. 2009, Barber 2011)

  3. WHAT'S HAPPENING NOW? THE BIG DATA REVOLUTION. Some open issues (Jordan 2013): • Inferential issues of more "logical" algorithms (e.g. testing on NNs) • Speed issues in computation-intensive algorithms

  4. SPEED ISSUES: HOW TO TACKLE THEM? "Divide et impera" (Lucius Aemilius Paullus, 168 BC): split algorithms into simple, parallelizable steps, and then employ massive parallelization strategies!

  5. MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING: a story about many different things, such as fast computers, cool algorithms and elegant statistics. Contents: • Parallel Network Analysis on Apache Spark (M. Fontana) • Distributed Deep Learning (C. Volpetti) • PLANET and Ensemble-Based Tree Learning (J. Di Iorio) • Hybrid Canopy-Fuzzy C-Means Clustering (D. Occhiuto)

  6. MATTEO FONTANA CHAPTER 1 PARALLEL NETWORK ANALYSIS WITH APACHE SPARK

  7. PARALLEL NETWORK ANALYSIS WITH APACHE SPARK: about how to perform parallel graph analytics on Spark. Outline: • State of the art • Graph analytics systems • Drawbacks • Move to data-parallel systems? • Solution • A new programming abstraction • Performance

  8. A BRIEF STATE OF THE ART OF NETWORK ANALYSIS. The analysis of network data is becoming more and more prevalent in the ML and statistics communities, and several attempts have been made to create novel methods: • Complex objects in network topology (Kolaczyk 2009) • Networks as statistical units (Sienkiewicz and Wang 2014; Shen et al. 2014) • Graph analytics systems: Pregel (Malewicz et al. 2010), PowerGraph (Gonzalez, Low, and Gu 2012), GraphLab (Low et al. 2012)

  9. GRAPH ANALYTICS SYSTEMS. The idea behind them is the implementation of a restricted programming abstraction that allows for the fast specification of graph algorithms, giving orders-of-magnitude improvements over data-parallel systems.

  10. DRAWBACKS OF GRAPH ANALYTICS SYSTEMS. The restricted abstraction is both a blessing and a curse: the operations required in a network analytics pipeline need a much more general view of the data, "floating" over the graph topology (e.g. graph creation and modification, graph partitioning, graph comparison).

  11. WHY DON'T WE MOVE TO A DATA-PARALLEL SYSTEM? One idea: avoid specialized systems for graph processing and move to general-purpose data-parallel platforms (e.g. Apache Spark). But a naive implementation of graph algorithms on a data-parallel infrastructure is senseless!

  12. THE SOLUTION: FUSION! The idea to solve this issue is to create a hybrid data- and graph-parallel infrastructure that lets the data scientist get the best of both worlds: • the speed in graph computation of graph-parallel systems • the flexibility in the pipeline of data-parallel systems.

  13. A NEW PROGRAMMING ABSTRACTION. First brick: the programming abstraction. RDD-based (the graph as two RDDs): GraphX. After the release of the DataFrame API, DataFrame-based (the graph as two DataFrames): GraphFrames. Graph operations (usually based on the Gather-Apply-Scatter paradigm) are then rewritten as distributed joins and aggregations.
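As a rough illustration of the DataFrame-based abstraction, the following sketch builds a tiny graph as two Spark DataFrames and runs PageRank through GraphFrames. It assumes a local PySpark session and that the third-party graphframes package is installed; the toy vertices and edges are made up for illustration.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame  # assumes the graphframes package is installed

spark = SparkSession.builder.appName("graphframes-sketch").getOrCreate()

# The graph is represented as two DataFrames: vertices (with an "id" column)
# and edges (with "src" and "dst" columns).
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "c"), ("c", "a")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Graph operations such as PageRank are executed internally as
# distributed joins and aggregations over the two DataFrames.
ranks = g.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```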

  14. PERFORMANCE. Performance-wise, on a single graph computation (e.g. computing PageRank) these systems are still outperformed by lower-level implementations such as GraphLab. The situation changes dramatically when the whole pipeline is taken into account: being able to handle I/O, ETL and the other pipeline operations within the same system has been shown to increase performance dramatically.

  15. CLAUDIA VOLPETTI CHAPTER 2 DISTRIBUTED DEEP LEARNING

  16. DISTRIBUTED DEEP LEARNING: a story about deep learning. Outline: • What's deep learning? • Deep learning: need for speed • Distributed deep learning • What is batch gradient descent? • Distributed batch gradient descent • Distributed stochastic gradient descent • Synchronous distributed SGD • Asynchronous distributed SGD • Stale gradient problem

  17. WHAT'S DEEP LEARNING? What's the difference between artificial intelligence, machine learning and deep learning? Source: NVIDIA blog post by Michael V. Copeland, senior editor at WIRED and previously senior writer at Fortune Magazine.

  18. WHAT'S DEEP LEARNING? Why now? • Large amounts of training data favor deep learning • Faster machines and multicore CPUs/GPUs

  19. WHAT'S DEEP LEARNING? Source: Yann LeCun, Marc'Aurelio Ranzato (ICML 2013)

  20. DEEP LEARNING: NEED FOR SPEED. Larger training datasets improve accuracy, but require more time to train: hence the need for speed.

  21. DISTRIBUTED DEEP LEARNING. Need: drastically reduce the time required to train large deep learning models on even larger datasets. Goal of this presentation: a brief overview of the efforts and challenges for distributed deep learning (neural network algorithms) in a map-reduce paradigm.

  22. WHAT'S BATCH GRADIENT DESCENT? An optimization technique used in deep learning. For a pre-defined number of epochs: 1. Batch gradient descent first computes the gradient vector ∇θJ(θ) of the loss function with respect to the parameters θ over the entire training dataset. 2. It then updates the parameters in the direction of the gradient, with the learning rate η determining how big an update we perform: θ = θ − η · ∇θJ(θ).
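A minimal NumPy sketch of these two steps for a linear least-squares loss; the data, loss function and learning rate are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    """Batch gradient descent for the least-squares loss J(theta) = ||X theta - y||^2 / (2n)."""
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / n   # gradient over the WHOLE training set
        theta -= lr * grad                 # theta = theta - eta * grad J(theta)
    return theta

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(batch_gradient_descent(X, y))
```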

  23. WHAT'S BATCH GRADIENT DESCENT? Neural network algorithms using batch gradient descent are easily parallelizable as follows: (1) the whole dataset is partitioned into subsets that are distributed to the workers; (2) every worker calculates the partial gradient on its shard of the data; (3) the reducer then sums the partial gradients from each mapper and performs a batch gradient descent update of the network weights. This has been shown to yield a speed-up of (almost) 2x.
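A sketch of the map/reduce scheme just described, with in-process data shards standing in for the workers of a real cluster; the shard count and the least-squares model are assumptions made for the example.

```python
import numpy as np

def partial_gradient(X_shard, y_shard, theta):
    # MAP step: each worker computes the (unnormalized) gradient on its own shard.
    return X_shard.T @ (X_shard @ theta - y_shard)

def parallel_batch_gd(X, y, lr=0.01, epochs=100, n_workers=4):
    theta = np.zeros(X.shape[1])
    X_shards = np.array_split(X, n_workers)   # (1) partition the data across workers
    y_shards = np.array_split(y, n_workers)
    n = len(y)
    for _ in range(epochs):
        # (3) REDUCE step: sum the partial gradients and do one batch update.
        grad = sum(partial_gradient(Xs, ys, theta)
                   for Xs, ys in zip(X_shards, y_shards)) / n
        theta -= lr * grad
    return theta
```

On a real map-reduce system the shards live on different machines and the summation is the reduce phase, but the arithmetic is identical to plain batch gradient descent.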

  24. DISTRIBUTED STOCHASTIC GRADIENT DESCENT. Stochastic gradient descent is a challenge for parallelization: it is sequential in nature. In contrast to batch gradient descent, SGD performs a parameter update for each training example (or for every n samples, in its mini-batch version), which is much faster but not easy to parallelize.
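For contrast, a sketch of mini-batch SGD on the same kind of toy least-squares problem; the sequential chain of updates in the inner loop is exactly what makes naive parallelization hard.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, epochs=10, batch_size=16, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # Each update uses the theta produced by the previous update:
            # the iterations form a sequential chain.
            grad = X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)
            theta -= lr * grad
    return theta
```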

  25. DISTRIBUTED STOCHASTIC GRADIENT DESCENT. The problem arises when different workers compute ∇θ at every step and then update the parameters: while one worker is updating the parameters, they are essentially LOCKED until the update is over. Distributing the SGD optimization mechanism is essentially all about overcoming the fact that some workers will finish their computations sooner than others, while the shared parameters are locked by continuous updates.

  26. SYNCHRONOUS DISTRIBUTED SGD. The pitfall is that each worker has to wait for all the other workers' computations to be completed before being allowed to proceed to the next gradient computation.
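A sketch of a single synchronous step, with simulated workers running in a thread pool; the call that collects the gradients only returns once every worker has finished, so the slowest worker stalls the whole step. The shard layout and model are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def worker_gradient(args):
    X_shard, y_shard, theta = args
    return X_shard.T @ (X_shard @ theta - y_shard) / len(y_shard)

def synchronous_sgd_step(theta, shards, lr=0.01):
    """One synchronous step: every worker must finish before the single shared update."""
    with ThreadPoolExecutor() as pool:
        # Implicit barrier: list(...) blocks until ALL workers have returned.
        grads = list(pool.map(worker_gradient,
                              [(Xs, ys, theta) for Xs, ys in shards]))
    return theta - lr * np.mean(grads, axis=0)
```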

  27. ASYNCHRONOUS DISTRIBUTED SGD. • This technique lets the workers communicate updates through a centralized parameter server, which keeps the current state of all the model parameters shared across the multiple workers. • Its main advantage is that workers don't have to wait for the shared parameter update to be completed before moving on to the next computation.
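A toy in-process sketch of the asynchronous scheme: a parameter server object keeps the shared state, and each worker pulls the current parameters, computes a gradient on its shard and pushes the update as soon as it is ready, without waiting for the others. Class and function names are illustrative, not from any specific framework.

```python
import threading
import numpy as np

class ParameterServer:
    """Keeps the current model parameters shared across all workers."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()   # protects only the short update, not the whole step

    def pull(self):
        with self._lock:
            return self.theta.copy()

    def push(self, grad):
        # Apply a worker's gradient as soon as it arrives: no global barrier.
        with self._lock:
            self.theta -= self.lr * grad

def async_worker(server, X_shard, y_shard, steps=100):
    for _ in range(steps):
        theta = server.pull()   # may already be stale by the time the gradient is pushed
        grad = X_shard.T @ (X_shard @ theta - y_shard) / len(y_shard)
        server.push(grad)       # no waiting for the other workers

# Usage sketch: one thread per worker, each on its own shard of the data.
# server = ParameterServer(dim=3)
# threads = [threading.Thread(target=async_worker, args=(server, Xs, ys)) for Xs, ys in shards]
```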

  28. STALE GRADIENT PROBLEM. Since these approaches remove any explicit synchronization between updates, they implicitly allow workers to compute gradients locally using model parameters that may be several steps behind the most up-to-date set of model parameters. The model parameters are therefore at risk of slow convergence or, in some cases, of divergence.

  29. STALE GRADIENT PROBLEM. Many variants of asynchronous distributed SGD aim to cope with the stale gradient problem by applying a variety of strategies to minimize its impact. Some approaches are listed below: 1. Zhang, Choromanska & LeCun (2015) suggest separating gradients by their level of staleness, limiting the impact of very stale gradients by modulating the learning rate accordingly (see the sketch after this slide). 2. Elastic Averaging SGD (Zhang et al., 2015), which has been shown to outperform Downpour SGD, introduces an elastic force that links the parameters computed by the local workers with a center variable stored by the parameter server (master). In addition, the algorithm enables the local workers to perform more exploration: on the one hand it allows the local variables to fluctuate further from the center variable, and on the other hand it reduces the amount of communication between the local workers and the master.
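A rough sketch of the first idea, staleness-aware modulation of the learning rate: the server keeps a global step counter, workers remember at which step they read the parameters, and each incoming update is damped according to how many steps behind it is. The specific damping rule (dividing the learning rate by the staleness) is an illustrative assumption in the spirit of Zhang, Choromanska & LeCun (2015), not the paper's exact formula.

```python
import numpy as np

class StalenessAwareServer:
    """Single-threaded sketch of a parameter server that damps stale gradients."""
    def __init__(self, dim, lr=0.01):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.step = 0   # global clock, incremented at every applied update

    def pull(self):
        # Workers record the clock value at which they read the parameters.
        return self.theta.copy(), self.step

    def push(self, grad, read_step):
        staleness = self.step - read_step            # updates applied since the worker's read
        effective_lr = self.lr / max(1, staleness)   # damp very stale gradients
        self.theta -= effective_lr * grad
        self.step += 1
```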

  30. JACOPO DI IORIO CHAPTER 3 PLANET AND ENSEMBLE-BASED TREE MODELS

  31. PLANET AND ENSEMBLE-BASED TREE MODELS: a story about PLANET, a Parallel Learner for Assembling Numerous Ensemble Trees. Outline: • Reasons why • Trees, tree boosting and forests • PLANET

  32. REASONS WHY: Personal drama 1, World drama.

  33. REASONS WHY, PERSONAL DRAMA 1: Stat Under the Stars 2, held in Salerno during the night between 7 and 8 June 2016; a challenge in the analysis of a dataset.

  34. REASONS WHY, PERSONAL DRAMA 1: Stat Under the Stars 2, held in Salerno during the night between 7 and 8 June 2016; a challenge in the analysis of a dataset, tackled with gradient tree boosting.
