

SLIDE 1

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI

SLIDE 2

INTRODUCTION

MACHINE LEARNING vs STATISTICAL LEARNING

Separate evolutions, but with shared methodologies.

MERGE

  • Data Science (Hayashi 1998, Cleveland 2001)
  • “Unifying” Books (Hastie et al. 2009, Barber 2011)
SLIDE 3

WHAT’S HAPPENING NOW?

BIG DATA REVOLUTION

SOME OPEN ISSUES (Jordan 2013)

  • Inferential issues of more “logical” algorithms (e.g. testing on NNs)
  • Speed issues in computation-intensive algorithms

SLIDE 4

SPEED ISSUES

HOW TO TACKLE THEM?

“Divide et Impera” (Lucius Aemilius Paullus, 168 BC): split algorithms into simple, parallelizable steps, and then employ massive parallelization strategies!

SLIDE 5

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

A STORY ABOUT MANY DIFFERENT THINGS, SUCH AS FAST COMPUTERS, COOL ALGORITHMS AND ELEGANT STATISTICS

  • PARALLEL NETWORK ANALYSIS ON APACHE SPARK – M. FONTANA
  • DISTRIBUTED DEEP LEARNING – C. VOLPETTI
  • PLANET AND ENSEMBLE-BASED TREE LEARNING – J. DI IORIO
  • HYBRID CANOPY-FUZZY C-MEANS CLUSTERING – D. OCCHIUTO
SLIDE 6

CHAPTER 1

MATTEO FONTANA

PARALLEL NETWORK ANALYSIS WITH APACHE SPARK

SLIDE 7

PARALLEL NETWORK ANALYSIS WITH APACHE SPARK

ABOUT HOW TO PERFORM PARALLEL GRAPH ANALYTICS ON SPARK

  • STATE OF THE ART
  • GRAPH ANALYTICS SYSTEMS
  • DRAWBACKS
  • MOVE TO DATA PARALLEL SYSTEMS?
  • SOLUTION: A NEW PROGRAMMING ABSTRACTION
  • PERFORMANCE

SLIDE 8

A BRIEF STATE OF THE ART OF NETWORK ANALYSIS

The analysis of network data is becoming more and more prevalent in the ML and Statistics communities, and several attempts have been made to create novel methods:

  • Complex objects in network topology (Kolaczyk 2009)
  • Networks as statistical units (Sienkiewicz and Wang 2014; Shen et al. 2014)
  • Graph Analytics Systems
    – Pregel (Malewicz et al. 2010)
    – PowerGraph (Gonzalez, Low, and Gu 2012)
    – GraphLab (Low et al. 2012)

SLIDE 9

GRAPH ANALYTICS SYSTEM

THE IDEA BEHIND THEM IS THE IMPLEMENTATION OF A RESTRICTED PROGRAMMING ABSTRACTION THAT ALLOWS FOR THE FAST SPECIFICATION OF GRAPH ALGORITHMS, WITH ORDERS-OF-MAGNITUDE IMPROVEMENTS OVER DATA-PARALLEL SYSTEMS

SLIDE 10

DRAWBACKS OF GRAPH ANALYTICS SYSTEMS

BLESSING vs CURSE: the operations required in a network analytics pipeline (e.g. graph creation and modification, graph partition, graph comparison) need a much more general view of the data, “floating” over the graph topology.

SLIDE 11

WHY DON’T WE MOVE TO DATA PARALLEL SYSTEMS?

ONE IDEA: avoid specialized systems for graph processing and move to general-purpose data parallel platforms (e.g. Apache Spark). BUT a naive implementation of graph algorithms on data parallel infrastructure is senseless!

SLIDE 12

THE SOLUTION: FUSION!

The idea to solve this issue is to create a hybrid data- and graph-parallel infrastructure that allows the data scientist to get the best of both worlds:

  • Speed in graph computation of Graph Parallel Systems.
  • Flexibility in the pipeline of Data Parallel Systems.
SLIDE 13

A NEW PROGRAMMING ABSTRACTION

FIRST BRICK: PROGRAMMING ABSTRACTION

  • RDD based (graph as two RDDs) – GraphX
  • After the release of the DataFrame API: DataFrame based (graph as two DataFrames) – GraphFrames
  • Then: rewriting of graph operations (usually based on the Gather, Apply, Scatter paradigm) as distributed joins and aggregations (sketched below)
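Purely as an illustration of the “graph as two DataFrames” idea, the sketch below builds a tiny vertex/edge pair of Spark DataFrames and expresses one PageRank-style message step as a join followed by an aggregation. It is a minimal sketch assuming a local PySpark installation; it is not the GraphX/GraphFrames implementation, and all column names and data are illustrative.

# Sketch only: a graph stored as two ordinary DataFrames (vertices, edges),
# with a graph operation expressed as distributed joins and aggregations.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("graph-as-dataframes").getOrCreate()

vertices = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])
edges = spark.createDataFrame([(1, 2), (1, 3), (2, 3), (3, 1)], ["src", "dst"])

# Out-degree: a plain aggregation over the edge DataFrame.
out_deg = edges.groupBy("src").agg(F.count("*").alias("out_degree"))

# One PageRank-style "message" step: every vertex sends 1/out_degree along each
# of its out-edges (join), and every destination sums what it receives (aggregation).
contrib = (edges.join(out_deg, on="src")
                .withColumn("msg", F.lit(1.0) / F.col("out_degree"))
                .groupBy("dst")
                .agg(F.sum("msg").alias("received")))

contrib.show()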

SLIDE 14

PERFORMANCE

Performance-wise, looking at a single graph computation (e.g. calculating PageRank), these systems are still outperformed by lower-level implementations such as GraphLab. The situation changes dramatically when the whole pipeline is taken into account: being able to deal with I/O, ETL and the other pipeline operations using the same system has been shown to dramatically increase performance.

SLIDE 15

CHAPTER 2

CLAUDIA VOLPETTI

DISTRIBUTED DEEP LEARNING

SLIDE 16

DISTRIBUTED DEEP LEARNING

A STORY ABOUT DEEP LEARNING

  • WHAT’S DEEP LEARNING?
  • DEEP LEARNING NEED FOR SPEED
  • WHAT IS BATCH GRADIENT DESCENT?
  • DISTRIBUTED BATCH GRADIENT DESCENT
  • DISTRIBUTED STOCHASTIC GRADIENT DESCENT
  • SYNCHRONOUS DISTRIBUTED SGD
  • ASYNC DISTRIBUTED SGD
  • STALE GRADIENT PROBLEM

SLIDE 17

WHAT’S DEEP LEARNING?

WHAT’S THE DIFFERENCE BETWEEN ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DEEP LEARNING?

Source: NVIDIA blog post by Michael V. Copeland, senior editor at WIRED, previously senior writer at Fortune Magazine.

SLIDE 18

WHAT’S DEEP LEARNING?

WHY NOW?

  • large amounts of training data favor deep learning
  • faster machines and multicore CPUs/GPUs
SLIDE 19

WHAT’S DEEP LEARNING?

Source: Yann LeCun, Marc’Aurelio Ranzato (ICML 2013)

SLIDE 20

DEEP LEARNING NEED FOR SPEED

LARGER TRAINING DATASETS IMPROVE ACCURACY, BUT REQUIRE MORE TIME TO TRAIN.

SLIDE 21

DISTRIBUTED DEEP LEARNING

NEED:

drastically reduce the time to train large deep learning models on even larger datasets.

GOAL OF THIS PRESENTATION:

A BRIEF OVERVIEW OF THE EFFORTS AND CHALLENGES OF DISTRIBUTED DEEP LEARNING (NEURAL NETWORK ALGORITHMS) IN A MAPREDUCE PARADIGM

SLIDE 22

WHAT’S BATCH GRADIENT DESCENT?

An optimization technique used in Deep Learning (a small numerical sketch follows the two steps below):

  • 1. For a pre-defined number of epochs, Batch Gradient Descent first computes the gradient vector ∇θJ(θ) of the loss function w.r.t. the parameters θ over the entire training dataset.
  • 2. It then updates the parameters in the direction of the gradients, with the learning rate η determining how big an update we perform, according to θ = θ − η⋅∇θJ(θ).
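As a concrete illustration of the two steps above, here is a minimal NumPy sketch of batch gradient descent on a least-squares loss; the data, learning rate and number of epochs are illustrative assumptions, not taken from the slides.

import numpy as np

def batch_gradient_descent(X, y, lr=0.1, epochs=100):
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ theta - y) / len(y)   # gradient over the entire dataset
        theta = theta - lr * grad               # theta = theta - eta * grad J(theta)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(batch_gradient_descent(X, y))             # should approach [1.0, -2.0, 0.5]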

SLIDE 23

WHAT’S BATCH GRADIENT DESCENT?

Neural network algorithms using Batch Gradient Descent are easily parallelizable as follows: (1) the whole dataset is partitioned into subsets and moved through the network towards the workers; (2) every worker calculates the partial gradient on its shard of the data; (3) the reducer then sums the partial gradients from all mappers and performs one batch gradient descent update of the network weights. This has been shown to give an (almost) 2x speed-up.
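The sketch below mimics this map/reduce decomposition on a single machine: each “mapper” computes the partial gradient on its shard and the “reducer” sums them before one batch update. The shard count, loss and data are illustrative assumptions; a real deployment would run the map step on separate workers.

import numpy as np

def partial_gradient(theta, X_shard, y_shard):
    # Map step: partial gradient of the squared loss on this shard only.
    return X_shard.T @ (X_shard @ theta - y_shard)

def distributed_batch_step(theta, shards, lr=0.1):
    n = sum(len(y_s) for _, y_s in shards)
    # Reduce step: sum the partial gradients coming from every mapper.
    grad = sum(partial_gradient(theta, X_s, y_s) for X_s, y_s in shards) / n
    return theta - lr * grad

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 3))
y = X @ np.array([2.0, 0.0, -1.0])
shards = [(X[i::4], y[i::4]) for i in range(4)]   # data partitioned over 4 mappers
theta = np.zeros(3)
for _ in range(100):
    theta = distributed_batch_step(theta, shards)
print(theta)                                      # should approach [2.0, 0.0, -1.0]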

SLIDE 24

DISTRIBUTED STOCHASTIC GRADIENT DESCENT

Stochastic Gradient Descent is a challenge for parallelization: it is sequential in nature. In contrast to Batch Gradient Descent, SGD performs a parameter update for each training example (or for every n samples, in its mini-batch version). This is much faster, but not easy to parallelize.

SLIDE 25

DISTRIBUTED STOCHASTIC GRADIENT DESCENT

LOCKED

Indeed, the problem arises when different workers compute ∇θJ(θ) at every step and then update the parameters: while one worker is updating the parameters, it essentially locks them until the update is over. Distributing the SGD optimization mechanism is essentially all about overcoming the fact that some workers finish their computations sooner than others while the common parameters are locked by continuous updates.

SLIDE 26

SYNCHRONOUS DISTRIBUTED SGD

The pitfall is that workers have to wait for all the other computations to be completed before being allowed to proceed to the next gradient computation.

SLIDE 27

ASYNCHRONOUS DISTRIBUTED SGD

  • This technique lets the workers communicate updates through a centralized parameter server, which keeps the current state of all model parameters shared across the multiple workers.
  • The main advantage of this approach is that workers do not have to wait for the common parameter update to be completed before moving on to the following computation (a minimal sketch follows).
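Below is a minimal, single-process sketch of the parameter-server idea, with Python threads standing in for workers: each worker pulls a (possibly stale) copy of the parameters, computes an SGD gradient on one example and pushes the update without waiting for the others. Class and function names are illustrative assumptions, not any particular framework’s API.

import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.theta = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        return self.theta.copy()            # workers read a (possibly stale) copy

    def push(self, grad):
        with self._lock:                    # only the update itself is serialized
            self.theta -= self.lr * grad

def worker(server, X, y, steps=200):
    rng = np.random.default_rng()
    for _ in range(steps):
        i = rng.integers(len(y))
        theta = server.pull()
        grad = (X[i] @ theta - y[i]) * X[i]  # SGD gradient on a single example
        server.push(grad)                    # no barrier: asynchronous update

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0])
server = ParameterServer(dim=3)
threads = [threading.Thread(target=worker, args=(server, X, y)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(server.theta)                          # should approach [1.0, 2.0, -1.0]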

SLIDE 28

STALE GRADIENT PROBLEM

Since these approaches remove any explicit synchronization between updates, they implicitly allow workers to compute gradients locally using model parameters that may be several steps behind the most up-to-date set of model parameters. The model parameters are hence at risk of slow convergence, or in some cases of divergence.

SLIDE 29

STALE GRADIENT PROBLEM

Many variants of Asynchronous Distributed SGD aim to cope with the Stale Gradient problem by applying a variety of strategies to minimize the impact of its effects. Some approaches are listed below:

  • 1. Zhang, Choromanska, & LeCun (2015) suggest separating gradients by their level of staleness, in order to limit the impact of very stale gradients by modulating the learning rate according to it (see the sketch after this list).
  • 2. Elastic Averaging SGD (Zhang et al., 2015), which has been shown to outperform Downpour SGD, is a new algorithm based on an elastic force that links the parameters computed by the local workers with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration: on the one hand it allows the local variables to fluctuate further from the center variable, and on the other hand it reduces the amount of communication between local workers and the master.
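As a toy illustration of strategy 1, the sketch below scales the learning rate down with the measured staleness of each gradient; the 1/(1+staleness) rule is an assumption chosen for simplicity, not necessarily the exact modulation proposed in the cited paper.

import numpy as np

class StalenessAwareServer:
    # Toy parameter server that tracks a global step counter; each pushed gradient
    # carries the step at which its parameters were read, and the learning rate is
    # scaled down by the measured staleness (illustrative 1/(1+s) rule).
    def __init__(self, dim, lr=0.05):
        self.theta = np.zeros(dim)
        self.lr = lr
        self.step = 0

    def pull(self):
        return self.step, self.theta.copy()

    def push(self, read_step, grad):
        staleness = self.step - read_step      # how many updates this gradient missed
        self.theta -= self.lr / (1.0 + staleness) * grad
        self.step += 1

server = StalenessAwareServer(dim=3)
step_read, theta = server.pull()
server.push(step_read, grad=np.array([0.1, -0.2, 0.0]))   # staleness 0 in this toy call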

SLIDE 30

CHAPTER 3

JACOPO DI IORIO

PLANET AND ENSEMBLE-BASED TREE MODELS

SLIDE 31

PLANET AND ENSEMBLE-BASED TREE MODELS

A STORY ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 32

REASONS WHY

  • PERSONAL DRAMA 1
  • PERSONAL DRAMA 2
  • WORLD DRAMA

SLIDE 33

PERSONAL DRAMA 1

REASONS WHY

Stat Under the Stars 2: held in Salerno during the night between 7 and 8 June 2016, a challenge in the analysis of a dataset.

SLIDE 34

PERSONAL DRAMA 1

REASONS WHY

Stat Under the Stars 2: held in Salerno during the night between 7 and 8 June 2016, a challenge in the analysis of a dataset.

GRADIENT TREE BOOSTING

SLIDE 35

PERSONAL DRAMA 2

REASONS WHY

Seat Pagine Gialle internship: predict sex and age of unregistered web users from their online behaviour. Large amount of data: BIG DATA.

Age classes: 18-30, 30-45, 45-60, 60+

SLIDE 36

PERSONAL DRAMA 2

REASONS WHY

Seat Pagine Gialle internship: predict sex and age of unregistered web users from their online behaviour. Large amount of data: BIG DATA.

RANDOM FOREST

SLIDE 37

PERSONAL DRAMA 2

REASONS WHY

WHAT WAS I REALLY DOING? ARE THESE METHODS INTERESTING FOR THE WORLD?

SLIDE 38

REASONS WHY

WORLD DRAMA

TREE MODELS AND ENSEMBLE-BASED TREE MODELS ARE EXTENSIVELY USED TO SOLVE SUPERVISED LEARNING PROBLEMS SUCH AS:

  • CLASSIFICATION PROBLEM
  • REGRESSION PROBLEM

NOT SO TRIVIAL ON MASSIVE DATASETS: LARGE INPUTS CREATE A BOTTLENECK BECAUSE OF THE COST OF SCANNING DATA FROM SECONDARY STORAGE

PLANET

PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES

proposed by Google, implemented using MapReduce

SLIDE 39

PLANET AND ENSEMBLE-BASED TREE MODELS

A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 40

TREES, BOOSTING AND FORESTS

DATA

D* = {(x_i, y_i) | x_i ∈ D_{X_1} × · · · × D_{X_N}, y_i ∈ D_Y}

where x_i is a vector of N attributes and y_i is the output we are interested in predicting.

GOAL

F : D_{X_1} × · · · × D_{X_N} → D_Y

Find a function F able to predict the output y_i from the input x_i.

HOW?

Classification and regression trees are among the oldest and most popular methods: TREES, BOOSTING, RANDOM FOREST.

SLIDE 41

TREES, BOOSTING AND FORESTS

TREES partition the attribute space into non-overlapping regions, whose boundaries are defined by predicates (e.g. “If A > B then C”). Splits are obtained by reducing the impurity, recursively partitioning D* into subsets D_A, D_B, D_C, D_D.

BOOSTING uses a weighted combination of weak learners to form an accurate predictive model, built using an additive training strategy: F_{m+1}(x) = F_m(x) + h(x), where h(x) is fitted on the residuals of F_m.

RANDOM FOREST averages many noisy but approximately unbiased and uncorrelated tree models (e.g. Tree 1, Tree 2, Tree 3).
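The additive recursion above can be illustrated with a short sketch that fits each new tree on the residuals of the current ensemble. It uses scikit-learn regression trees as weak learners; the depth, learning rate and number of rounds are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=50, max_depth=2, lr=0.1):
    f0 = y.mean()                        # F_0: a constant model
    pred, trees = np.full(len(y), f0), []
    for _ in range(n_rounds):
        residual = y - pred              # what the current ensemble still gets wrong
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(h)
        pred += lr * h.predict(X)        # F_{m+1}(x) = F_m(x) + lr * h(x)
    return f0, trees

def predict(f0, trees, X, lr=0.1):
    return f0 + lr * sum(h.predict(X) for h in trees)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
f0, trees = boost(X, y)
print(np.mean((predict(f0, trees, X) - y) ** 2))   # training MSE of the ensemble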

SLIDE 42

PLANET AND ENSEMBLE-BASED TREE MODELS

A CHAPTER ABOUT PLANET, A PARALLEL LEARNER FOR ASSEMBLING NUMEROUS ENSEMBLE TREES.

REASONS WHY

TREES, BOOSTING AND FORESTS

PLANET

SLIDE 43

PLANET

TWO-PHASE DISTRIBUTED COMPUTATION:

MAP

  • the dataset is partitioned and assigned to workers named mappers
  • a map function is applied to each record in order to get a set of <key,value> pairs

REDUCE

  • pairs are then grouped by key and distributed to a series of reducers
  • a user-defined reduce function is applied to each record

CONTROLLER

  • controls the process
  • coordinates MapReduce jobs in order to split the data
  • updates the situation by accessing the MODEL FILE, which contains the entire tree built so far
  • maintains two queues: the MAPREDUCEQUEUE (nodes for which D is too large, expanded with MR_ExpandNodes) and the INMEMORYQUEUE (nodes for which D fits in memory, expanded with MR_InMemory); a single-machine sketch of this control flow follows this list
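Below is a minimal, single-machine sketch of the controller flow just described: a best split is computed for each node (standing in for one MR_ExpandNodes job), and children are routed to the in-memory queue or back to the “MapReduce” queue depending on their size. The impurity proxy (variance), thresholds and data are illustrative assumptions, not the paper’s implementation.

from collections import deque
import numpy as np

MEMORY_THRESHOLD = 100   # rows below which a node's data "fits in memory"

def best_split(X, y):
    # Best axis-aligned split by weighted variance; stands in for one MR_ExpandNodes job.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            m = X[:, j] <= t
            score = y[m].var() * m.sum() + y[~m].var() * (~m).sum()
            if best is None or score < best[2]:
                best = (j, float(t), score)
    return best

def controller(X, y):
    model = []                                         # the "model file": tree built so far
    mr_queue, in_mem_queue = deque([(X, y, "root")]), deque()
    while mr_queue or in_mem_queue:
        queue = mr_queue if mr_queue else in_mem_queue
        Xn, yn, name = queue.popleft()
        split = None if len(yn) < 2 or yn.var() == 0.0 else best_split(Xn, yn)
        if split is None:                              # nothing left to split: a leaf
            model.append((name, "leaf", float(yn.mean())))
            continue
        j, t, _ = split
        model.append((name, "split", j, t))
        for side, m in (("L", Xn[:, j] <= t), ("R", Xn[:, j] > t)):
            child = (Xn[m], yn[m], name + side)
            # route each child by the size of its data, as the controller does
            (in_mem_queue if m.sum() <= MEMORY_THRESHOLD else mr_queue).append(child)
    return model

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float) + 0.05 * rng.normal(size=500)
print(controller(X, y)[:5])                            # first few entries of the model file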

SLIDE 44

PLANET

EXAMPLE

  • Initially, M, MRQ and InMemQ are empty; node 0 corresponds to the entire dataset D*.
  • Node 0 is pushed onto the MRQ; MR_ExpandNodes({0}, M, D*) computes, for each candidate split, the impurity, the branches and the number of records.
  • The CONTROLLER selects the best split and updates M, producing child nodes A, B, C, D with datasets D_A, D_B, D_C, D_D.
  • NODE A: |D_A| > STOP threshold, but |D_A| < memory threshold, so it goes onto the InMemQ and is expanded by MR_InMemory (depth expansion with the classical sequential algorithm).
  • NODE B: |D_B| > STOP threshold and |D_B| > memory threshold, so it goes onto the MRQ, and so on.

SLIDE 45

PLANET

BOOSTING AND RANDOM FORESTS

BOOSTING (trees are trained one at a time): F is set; the residuals are computed from the current model’s predictions; a new tree is created simply by pushing the root node of the new tree onto MapReduce after the completion of the last tree.

RANDOM FOREST (multiple trees are trained in parallel): hash-based sampling puts the same sample in the same tree; the nodes of all trees are pushed onto the MRQ, so the queues contain nodes belonging to many different trees instead of a single tree.

SLIDE 46

CHAPTER 4

Daniele Occhiuto Hybrid Canopy-Fuzzy C-means Clustering

SLIDE 47

Canopy Algorithm and FCM Clustering

Dai, W., Yu, C., & Jiang, Z. (2016). An Improved Hybrid Canopy-Fuzzy C-Means Clustering Algorithm Based on MapReduce Model.

Canopy and FCM

Map Reduce Parallelization

Hybrid Canopy-FCM approach

SLIDE 48

CANOPY AND LARGE DATASETS

CANOPY MOTIVATIONS

There are mainly three ways datasets can be large:
  1. LARGE NUMBER OF ELEMENTS IN THE DATA
  2. EACH ELEMENT CAN HAVE MANY FEATURES
  3. THERE CAN BE MANY CLUSTERS TO DISCOVER

KEY IDEA: GREATLY REDUCE THE NUMBER OF DISTANCE COMPUTATIONS BY PERFORMING THE CLUSTERING IN 2 STAGES:
  1. A ROUGH, QUICK STAGE THAT DIVIDES THE DATA INTO OVERLAPPING SUBSETS CALLED CANOPIES
  2. A RIGOROUS FINAL STAGE, WITH THE EXPENSIVE DISTANCE MEASURED ONLY AMONG POINTS THAT OCCUR IN COMMON CANOPIES

Canopy: a subset of the elements (data points or items) that, according to the approximate similarity measure, are within some distance threshold of a central point.

Property: points not appearing in any common canopy are far enough apart that they could not possibly be in the same cluster. Using only an approximate distance may not guarantee this property; overlapping canopies, an adequate distance threshold and an understanding of the properties of the approximate measure allow us to be confident the property holds.

RESTRICTION: we do not calculate the distance between two points that never appear in the same canopy.

RISK: the inexpensive clustering must not exclude the solution of the expensive clustering; otherwise valid data points would be excluded from the expensive clustering solution → ACCURACY LOSS → CHECK THE OPTIMALITY CONDITION.

SLIDE 49

CANOPY IN PRACTICE

CANOPY ALGORITHM

  • Start with a list of the data points, in any order.
  • Set two distance thresholds, T1 and T2 (T1 > T2).
  • Pick a point off the list and approximately measure its distance to all other points. (This is extremely cheap with an inverted index.)
  • Put all points that are within distance threshold T1 into a canopy.
  • Remove from the list all points that are within distance threshold T2.
  • Repeat until the list is empty (see the sketch below).

OPTIMALITY CONDITION: for each cluster there exists at least one canopy that completely contains that cluster.
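A minimal NumPy sketch of the procedure listed above, using plain Euclidean distance as the “cheap” metric purely for illustration (real applications would use a much cheaper, domain-specific approximate distance); T1, T2 and the data are assumptions.

import numpy as np

def canopy(points, t1, t2):
    assert t1 > t2
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining.pop(0)                       # pick any point off the list
        dists = np.linalg.norm(points[remaining] - points[center], axis=1)
        members = [center] + [p for p, d in zip(remaining, dists) if d < t1]
        canopies.append(members)                        # everything within T1
        remaining = [p for p, d in zip(remaining, dists) if d >= t2]  # drop points within T2
    return canopies

rng = np.random.default_rng(4)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
print([len(c) for c in canopy(pts, t1=3.0, t2=1.5)])    # canopy sizes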

SLIDE 50

FUZZY C-MEANS CLUSTERING

FCM MOTIVATIONS

There are mainly three ways datasets can be large:
  1. LARGE NUMBER OF ELEMENTS IN THE DATA
  2. EACH ELEMENT CAN HAVE MANY FEATURES
  3. THERE CAN BE MANY CLUSTERS TO DISCOVER

KEY IDEA: ALLOW ONE PIECE OF DATA TO BELONG TO TWO OR MORE CLUSTERS. EACH ELEMENT HAS A MEMBERSHIP DEGREE EXPRESSING HOW STRONGLY IT BELONGS TO A CLUSTER.

FCM is based on the minimization of the following objective function:

K_n = \sum_{i=1}^{N} \sum_{j=1}^{C} v_{ij}^{n} \, \lVert y_i - d_j \rVert^2, \qquad 1 \le n < \infty

Fuzzy partitions are carried out through iterative optimization of the above objective function, updating the membership degrees v_{ij} and the centers d_j:

v_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert y_i - d_j \rVert}{\lVert y_i - d_k \rVert} \right)^{2/(n-1)}}, \qquad d_j = \frac{\sum_{i=1}^{N} v_{ij}^{n} \, y_i}{\sum_{i=1}^{N} v_{ij}^{n}}

Termination criterion: \max_{i,j} \lvert v_{ij}^{(k+1)} - v_{ij}^{(k)} \rvert < \zeta
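A minimal NumPy sketch of the iterative optimization above: centers and membership degrees are updated in turn until the termination criterion is met. The cluster count, fuzzifier n, tolerance ζ and the data are illustrative assumptions.

import numpy as np

def fcm(Y, C=3, n=2.0, zeta=1e-5, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    V = rng.dirichlet(np.ones(C), size=len(Y))              # random fuzzy memberships
    for _ in range(max_iter):
        # centers d_j = sum_i v_ij^n y_i / sum_i v_ij^n
        D = (V ** n).T @ Y / (V ** n).sum(axis=0)[:, None]
        dist = np.linalg.norm(Y[:, None, :] - D[None, :, :], axis=2) + 1e-12
        # memberships v_ij = 1 / sum_k (||y_i - d_j|| / ||y_i - d_k||)^(2/(n-1))
        ratio = dist[:, :, None] / dist[:, None, :]
        V_new = 1.0 / (ratio ** (2.0 / (n - 1.0))).sum(axis=2)
        if np.abs(V_new - V).max() < zeta:                   # termination criterion
            return V_new, D
        V = V_new
    return V, D

rng = np.random.default_rng(5)
Y = np.vstack([rng.normal(m, 0.3, size=(40, 2)) for m in (0.0, 2.0, 4.0)])
V, D = fcm(Y)
print(np.round(D, 2))                                        # the three fuzzy cluster centers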

SLIDE 51

MAPREDUCE FCM CLUSTERING

THE MAPREDUCE VERSION OF THE CANOPY ALGORITHM WAS PRESENTED IN THE COURSE; THE HYBRID CANOPY + FCM APPROACH IS ALL BASED ON MAPREDUCE.

MapReduce of FCM Clustering:

  • Map: the map process calculates the membership degree of the data objects in the current node with respect to each cluster center; the <key, value> data structure is represented as (center, (point, weight)) (see the single-process sketch after this list).
  • Combine: values sharing the same key are combined locally (local optimization).
  • Reduce: the reduce process issues all cluster centers when the whole process terminates.
  • Iterate: iteration is necessary given the nature of FCM algorithms; when the variation of all cluster centers remains under the specified threshold, we can terminate the algorithm.
  • Classify: classification is performed with a final MapReduce process.
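A minimal single-process sketch of this map/combine/reduce decomposition: the map step emits (center, (weighted point, weight)) pairs following the key/value layout on the slide, and the reduce step sums them per center to produce the new centers. The shard count and data are assumptions; a real deployment would distribute the map step across nodes.

import numpy as np
from collections import defaultdict

def fcm_map(Y_shard, centers, n=2.0):
    dist = np.linalg.norm(Y_shard[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    V = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (n - 1.0))).sum(axis=2)
    for i, y in enumerate(Y_shard):
        for j in range(len(centers)):
            w = V[i, j] ** n
            yield j, (w * y, w)                       # key = center, value = (weighted point, weight)

def fcm_reduce(pairs):
    acc = defaultdict(lambda: [0.0, 0.0])
    for j, (wy, w) in pairs:                          # combine + reduce: sum per key
        acc[j][0] += wy
        acc[j][1] += w
    return np.array([acc[j][0] / acc[j][1] for j in sorted(acc)])

rng = np.random.default_rng(6)
Y = rng.normal(size=(300, 2))
centers = Y[rng.choice(len(Y), 3, replace=False)]     # initial centers
shards = np.array_split(Y, 4)                         # 4 mappers
pairs = [p for shard in shards for p in fcm_map(shard, centers)]
print(fcm_reduce(pairs))                              # updated centers after one iteration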

SLIDE 52

HYBRID CANOPY-FCM CLUSTERING

SLIDE 53

RESULTS IRIS DATASET

Hybrid canopy-FCM was compared with FCM clustering on the Car Evaluation and Iris datasets: it improves the efficacy of FCM clustering and shows a reduced execution time, compared with the original MapReduce-based FCM, as the dataset size increases. (Figures: classic FCM vs hybrid canopy-FCM.)

SLIDE 54

WHY?

FINAL REMARKS

FURTHER APPLICATIONS?

  • Lin, D., & Pantel, P. (2001, August). Induction of semantic classes from natural language text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining.
  • Dredze, M., McNamee, P., Rao, D., Gerber, A., & Finin, T. (2010, August). Entity disambiguation for knowledge base population. In Proceedings of the 23rd International Conference on Computational Linguistics.

A DOMAIN-SPECIFIC CHEAP DISTANCE FUNCTION IS CRUCIAL FOR ACCURACY WHEN APPLYING THE CANOPY ALGORITHM.

FUZZY SETS IN CANOPY CLUSTERING?

FCM PARALLELIZATION REQUIRES SEVERAL ITERATIONS → APACHE SPARK TO INCREASE SPEED.

SLIDE 55

MACHINE LEARNING, STATISTICAL LEARNING AND PARALLEL COMPUTING

JACOPO DI IORIO, MATTEO FONTANA, DANIELE OCCHIUTO, CLAUDIA VOLPETTI