

SLIDE 1

Accelerated machine learning inference as a service for particle physics computing

Nhan Tran December 8, 2019

SLIDE 2

Collaborators: Markus Atkinson, Mark Neubauer, Burt Holzman, Sergo Jindariani, Thomas Klijnsma, Ben Kreis, Mia Liu, Kevin Pedro, Nhan Tran, Phil Harris, Jeff Krupa, Sang Eon Park, Dylan Rankin, Paul Chow, Naif Tarafdar, Scott Hauck, Shih Chieh Hsu, Kelvin Mei, Cha Suaysom, Matt Trahms, Dustin Werran, Javier Duarte, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers, Suffian Khan, Brian Lee, Kalin Ovcharov, Brandon Perez, Andrew Putnam, Ted Way, Colin Versteeg, Zhenbin Wu, Giuseppe Di Guglielmo

Work based on https://arxiv.org/abs/1904.08986 and studies in the fastmachinelearning.org community

SLIDE 3

The computing conundrum


CMS detector                 LHC (current)   HL-LHC (upgraded)
Simultaneous interactions    60              200
L1 accept rate               100 kHz         750 kHz
HLT accept rate              1 kHz           7.5 kHz
Event size                   2.0 MB          7.4 MB
HLT computing power          0.5 MHS06       9.2 MHS06
Storage throughput           2.5 GB/s        61 GB/s
Event network throughput     1.6 Tb/s        44 Tb/s

[Figures: CMS offline computing profile projection; CMS online filter farm project]

Compute needs are growing by more than 10x (worked out below from the table), environments are getting more complex, and we need more sophisticated analysis techniques.
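A quick sanity check of the ">10x" claim, reading the growth factors straight off the table above; the inputs are only the numbers quoted on this slide.

```python
# Ratios of HL-LHC (upgraded) to LHC (current) figures from the table above.
ratios = {
    "L1 accept rate":           750 / 100,   # kHz / kHz  -> 7.5x
    "HLT accept rate":          7.5 / 1,     # kHz / kHz  -> 7.5x
    "Event size":               7.4 / 2.0,   # MB / MB    -> 3.7x
    "HLT computing power":      9.2 / 0.5,   # MHS06      -> 18.4x
    "Storage throughput":       61 / 2.5,    # GB/s       -> 24.4x
    "Event network throughput": 44 / 1.6,    # Tb/s       -> 27.5x
}
for name, r in ratios.items():
    print(f"{name}: {r:.1f}x")
```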

SLIDE 4

The computing conundrum


Compute needs are growing by more than 10x, environments are getting more complex, and we need more sophisticated analysis techniques.

[Figures: computing challenges; HL-LHC data volume vs. Run 2/3 reaching exabytes; a 136-pileup event (2018); 30+ petabytes/year]

SLIDE 5

The computing conundrum


SLIDE 6

Heterogeneous compute

[Diagram: spectrum of compute hardware from flexibility to efficiency, CPUs → GPUs → FPGAs → ASICs; the CPU is sketched as a control unit (CU), registers, and an arithmetic logic unit (ALU)]

Advances in heterogeneous computing driven by machine learning


SLIDE 8

Why fast inference?

  • Training has its own computing challenges
  • But training happens ~once per year and outside of the compute infrastructure
  • Inference happens on billions of events, many times a year
  • A unique challenge across HEP: massive datasets of statistically independent events


Opportunities for Accelerated Machine Learning Inference in Fundamental Physics

Javier Duarte1, Philip Harris2, Alex Himmel3, Burt Holzman3, Wesley Ketchum3, Jim Kowalkowski3, Miaoyuan Liu3, Brian Nord3, Gabriel Perdue3, Kevin Pedro3, Nhan Tran3, and Mike Williams2

1University of California San Diego, La Jolla, CA 92093, USA 2Massachusetts Institute of Technology, Cambridge, MA 02139, USA 3Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

ABSTRACT

In this brief white paper, we discuss the future computing challenges for fundamental physics experiments. The use cases for deploying machine learning across physics for simulation, reconstruction, and analysis are rapidly growing. This will lead us to many applications where exploring accelerated machine learning algorithm inference could bring valuable and necessary gains in performance. Finally, we conclude by discussing the future challenges in deploying new heterogeneous computing hardware. This community report is inspired by discussions at the Fast Machine Learning Workshop held September 10-13, 2019.

Contents: 1 Introduction (1.1 Computing model in particle physics; 1.2 Machine Learning); 2 Challenges and Applications for Accelerated Machine Learning Inference (2.1 CMS and ATLAS; 2.2 LHCb; 2.3 LSST; 2.4 LIGO; 2.5 DUNE); 3 Outlook and Opportunities

Work in progress, [link]


SLIDE 10

Pros & Cons

On how to integrate heterogeneous compute into our computing model


[Table of options to weigh: domain algorithms vs. ML; as a Service (aaS) vs. direct connect; GPU vs. FPGA vs. ASIC]

Our first study: MLaaS with FPGAs

SLIDE 11

To ML or not to ML


Re-engineer physics algorithms for new hardware:
  Language: OpenCL, OpenMP, HLS, Kokkos, …?
  Hardware: CPU, FPGA, GPU

Re-cast the physics problem as a machine learning problem:
  Language: C++, Python (TensorFlow, PyTorch, …)
  Hardware: CPU, FPGA, GPU, ASIC

Is there a way to have the best of both worlds with physics-aware ML?


SLIDE 13

aaS or direct connect

[Diagram: coprocessors (GPU, FPGA, ASIC) serving algorithms Algo 1 and Algo 2, contrasting the direct-connect and as-a-service arrangements]

Direct connect pros: less system complexity; no network latency.
As-a-service pros: scalable algorithms; scalable to the grid/cloud; heterogeneity (mixed hardware).

SLIDE 14

hardware choices

  • GPUs
  • Power hungry
  • Batching for optimal performance
  • Mature software ecosystem
  • ASICs
  • Most efficient Op/W
  • Less flexible
  • FPGAs
  • Middle solution: flexible and less power hungry than GPUs
  • Do not require batching




SLIDE 17

Brainwave on Azure ML



SLIDE 19

hardware choices



SLIDE 21

The models

  • This talk focuses on a standard CNN for top tagging: ResNet-50
  • One big network, single-to-few batch
  • Another, different example: HCal Reco, a network for per-channel reconstruction in the CMS detector
  • Small network, batch of 16000 (we will come back to this)


arXiv:1605.07678 DeepAK8


SLIDE 22

Tagging tops


Public top tagging data challenge

Averaged over 1000 jets

SLIDE 23

[Diagram: a CMSSW event processing job, with an input source (data or simulation), event setup database, configuration parameter sets, several threads running modules (MODULE 1-6), two ML inference modules (ML INFER 1, ML INFER 2) that call out to a coprocessor, and outputs (Output 1, Output 2, …)]

SONIC

Services for Optimized Network Inference on Coprocessors


SLIDE 24

SONIC

Services for Optimized Network Inference on Coprocessors


[Diagram: the CMSSW thread calls acquire(), the external processing runs on the FPGA/GPU etc. while the thread does other work, and produce() then picks up the result]

  • Convert experimental data to the neural network input (a TF tensor) and send it to the coprocessor using a communication protocol (a minimal sketch of the request pattern follows below)
  • CMSSW ExternalWork mechanism for asynchronous, non-blocking requests
  • SONIC CMSSW repository
  • Supporting gRPC with TensorFlow, working on TensorRT
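As an illustration of the first bullet, here is a minimal Python sketch of the as-a-service request pattern in the TensorFlow Serving gRPC style. It is not the actual SONIC C++ producer; the host, port, model name, and tensor keys are hypothetical placeholders.

```python
# Sketch: send one jet image to a remote inference service over gRPC.
# Assumes a TensorFlow-Serving-style endpoint; "inference.example.org", the
# model name "resnet50", and the tensor keys "input"/"scores" are made up.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def classify_jet_image(image: np.ndarray,
                       host: str = "inference.example.org",
                       port: int = 8500) -> np.ndarray:
    """Send a single jet image (batch of 1) to the service and return the scores."""
    channel = grpc.insecure_channel(f"{host}:{port}")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "resnet50"
    request.inputs["input"].CopyFrom(
        tf.make_tensor_proto(image.astype(np.float32), shape=[1, 224, 224, 3])
    )
    # Blocking call for clarity; SONIC issues the request asynchronously from
    # acquire() so the CPU thread can do other work until produce() runs.
    response = stub.Predict(request, timeout=5.0)
    return tf.make_ndarray(response.outputs["scores"])
```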
SLIDE 25

SONIC: single service


Fermilab (IL) → Azure (VA) → Fermilab (IL): <Δt> ~ 60 ms; Azure (on-prem): <Δt> ~ 10 ms. ResNet-50 time on the FPGA ~ 1.8 ms, classifier on the CPU ~ 2 ms.

SLIDE 26

SONIC: scale out


from SONIC → “worst case” scenario

Scaling Tests

[Diagram: many worker nodes, each running a JetImageProducer module, all sending requests to a single Brainwave service]

Simple scaling tests show that we can hit the maximum throughput of the FPGA, i.e. the optimal way to use the hardware is to keep it busy all the time. Roughly 50 simultaneous CPU jobs saturate one FPGA (a back-of-the-envelope check follows below). This is conservative, since these jobs only ran one module.
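A back-of-the-envelope version of that statement; the service throughput is the 660 img/s quoted on the comparison slide, while the per-job request rate is an assumed placeholder.

```python
# How many single-module CMSSW jobs does it take to saturate one FPGA service?
fpga_throughput = 660.0        # images/s sustained by the Brainwave service (from the comparison slide)
per_job_request_rate = 13.0    # images/s issued by one client job (assumption for illustration)
jobs_to_saturate = fpga_throughput / per_job_request_rate
print(round(jobs_to_saturate))  # ~50 jobs, consistent with the quoted scaling test
```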

SLIDE 27

Comparisons


Performance comparisons:

Type        Note             Latency [ms]        Throughput [img/s]
CPU*        Xeon 2.6 GHz     1750                0.6
CPU*        i7 3.6 GHz       500                 2
GPU†**      batch = 1        7                   143
GPU†**      batch = 32       1.5                 667
Brainwave   remote           60                  660
Brainwave   on-prem          10 (1.8 on FPGA)    660

*Performance depends on clock speed, TensorFlow version, and number of threads (1 here).
†Directly connected to the CPU via PCIe, not a service.
**Performance depends on batch size and the optimization of ResNet-50.

30x (cloud, remote) to 175x (edge, on-prem) faster than current CMSSW CPU inference (the ratios are worked out below). The FPGA runs at batch-of-1; the GPU is competitive at large batch size.
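For reference, the 30x and 175x factors follow directly from the latency ratios in the table above.

```python
cpu_latency_ms = 1750          # single-threaded Xeon, from the table above
remote_latency_ms = 60         # Brainwave, remote cloud service
onprem_latency_ms = 10         # Brainwave, on-prem ("edge") service
print(cpu_latency_ms / remote_latency_ms)  # ~29x, quoted as "30x"
print(cpu_latency_ms / onprem_latency_ms)  # 175x
```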

SLIDE 28

Latest studies

GPUaaS, FPGA via PCIe/on-prem, more models



Azure FPGA (Data Box Edge) installed at Fermilab

Testing latency of inference across Feynman Computing Center

Docker container on the server (PCIe): 14 ± 25 ms. From the FCC server farm: 20 ± 30 ms. N.B. no "bump-in-the-wire" network access, PCIe only.

SLIDE 29

Latest studies

GPUaaS, FPGA via PCIe/on-prem, more models


                      HCal Reco network          ResNet-50 (top tag) network
CPU (single-thread)   67 inf/s                   0.6-2 img/s (depends on CPU)
GPUaaS w/ TensorRT    333 inf/s (batch 16000)    140 img/s (batch 1), 667 img/s (batch 32)
FPGA (batch 1)        500 inf/s                  660 img/s (Brainwave, aaS)

Running both direct connect (PCIe) and aaS using TensorRT as an inference server. Running direct connect using a Xilinx Alveo (VU9P) with hls4ml (a hedged sketch of the hls4ml flow follows below); aaS via Galapagos (UToronto) and a custom REST API is work in progress.
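For the direct-connect path, a hedged sketch of an hls4ml conversion flow like the one mentioned above; the model file, output directory, and FPGA part string are illustrative placeholders (and keyword names can differ between hls4ml versions), not the exact configuration of the Alveo study.

```python
# Sketch: convert a trained Keras model into an HLS/FPGA project with hls4ml.
# The model file, output directory, and part number below are placeholders.
import hls4ml
from tensorflow import keras

model = keras.models.load_model("hcal_reco_model.h5")  # hypothetical model file

# Start from a model-level hls4ml configuration (precision, reuse factor, ...)
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",
    part="xcvu9p-flga2104-2-e",  # a VU9P-family part, as on the boards cited above
)
hls_model.compile()  # builds the C/C++ emulation for quick validation
# hls_model.build()  # would launch the full HLS synthesis (long-running)
```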

SLIDE 30

Lessons learned

  • Throughput is paramount [**don't violate storage throughput (buffer)]
  • As-a-service: latency is non-trivial with many balls in the air; networking bottlenecks need to be understood

  • Driven by on-chip computation time and batching
  • Work in progress: large-scale multi-service tests
  • Latency lessons:
  • When do you want to use a coprocessor? Short answer: when the CPU process takes > ~5 ms (see the sketch below)
  • PCIe latency is < ~1 ms
  • On-prem aaS latency is ~2-10 ms, depending on the relative location of the nodes
  • Cloud latency is fixed by the speed of light × some O(2) factor for network switching
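A tiny numerical restatement of that rule of thumb, using the illustrative latencies quoted in this deck.

```python
# Offloading to a coprocessor wins roughly when the accelerator time plus the
# link/service latency is still below the CPU time for the same task.
def offload_wins(cpu_ms: float, accel_ms: float, link_ms: float) -> bool:
    return accel_ms + link_ms < cpu_ms

print(offload_wins(cpu_ms=1750, accel_ms=1.8, link_ms=60))  # remote cloud aaS -> True
print(offload_wins(cpu_ms=1750, accel_ms=1.8, link_ms=10))  # on-prem aaS      -> True
print(offload_wins(cpu_ms=3,    accel_ms=1.8, link_ms=2))   # tiny CPU task    -> False
```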


Obvious: hardware architectures optimized for ML have large gains over CPUs

SLIDE 31

Towards abstraction: on-premises, in the cloud, oh my!


Building a network of heterogeneous resources in the cloud and on-premises. Work in progress: how to coordinate and orchestrate distributed heterogeneous resources.

[Diagram: a network of sites, each annotated with its available accelerators (GPU, FPGA, ASIC)]

SLIDE 32

wants, needs, & what’s next

  • More ML: build more use cases with existing ML algorithms
  • Not ML: try the aaS paradigm with non-ML algorithms
  • More domains: working to integrate with cosmic and neutrino software pipelines and models
  • Push the software/firmware/hardware: push GPU/FPGA optimization and customization, particularly for new network architectures such as graph networks
  • Scale out: process orchestration for multi-accelerator, multi-service, many-module jobs; understand the potential to include HPC sites in the model
  • Community availability: provide a hardware platform for interested parties to test through the IRIS-HEP SSL and FastML communities


SLIDE 33

“this is where extra material goes”


SLIDE 34

Big data challenge



Yearly data volumes:
  • Google searches: 98 PB
  • LHC science data: ~200 PB
  • SKA Phase 1 (2023): ~300 PB/year science data
  • HL-LHC (2026): ~600 PB raw data
  • HL-LHC (2026): ~1 EB physics data
  • SKA Phase 2 (mid-2020s): ~1 EB science data
  • LHC (2016): 50 PB raw data
  • Facebook uploads: 180 PB
  • Google Internet archive: ~15 EB
  • Also shown: DUNE (2026), LSST (2021)