Accelerated machine learning inference as a service for particle physics computing
Nhan Tran December 8, 2019
Work based on https://arxiv.org/abs/1904.08986 and studies in the fastmachinelearning.org community

2

Collaborators: Markus Atkinson, Mark Neubauer, Burt Holzman, Sergo Jindariani, Thomas Klijnsma, Ben Kreis, Mia Liu, Kevin Pedro, Nhan Tran, Phil Harris, Jeff Krupa, Sang Eon Park, Dylan Rankin, Paul Chow, Naif Tarafdar, Scott Hauck, Shih-Chieh Hsu, Kelvin Mei, Cha Suaysom, Matt Trahms, Dustin Werran, Javier Duarte, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers, Giuseppe Di Guglielmo, Suffian Khan, Brian Lee, Kalin Ovcharov, Brandon Perez, Andrew Putnam, Ted Way, Colin Versteeg, Zhenbin Wu
3
CMS detector parameters, LHC (current) vs. HL-LHC (upgraded):
Simultaneous interactions: 60 vs. 200
L1 accept rate: 100 kHz vs. 750 kHz
HLT accept rate: 1 kHz vs. 7.5 kHz
Event size: 2.0 MB vs. 7.4 MB
HLT computing power: 0.5 MHS06 vs. 9.2 MHS06
Storage throughput: 2.5 GB/s vs. 61 GB/s
Event network throughput: 1.6 Tb/s vs. 44 Tb/s
CMS offline computing profile projection; CMS online filter farm projection
Compute needs growing by more than 10x (a quick check of the ratios follows below)
Environments are getting more complex
Need more sophisticated analysis techniques
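As a quick sanity check of the "more than 10x" claim, the LHC-to-HL-LHC growth factors can be computed directly from the numbers in the table above (values copied from the slide):

```python
# Growth factors from LHC (current) to HL-LHC (upgraded), using the slide's table
params = {
    "L1 accept rate [kHz]":            (100, 750),
    "HLT accept rate [kHz]":           (1, 7.5),
    "Event size [MB]":                 (2.0, 7.4),
    "HLT computing power [MHS06]":     (0.5, 9.2),
    "Storage throughput [GB/s]":       (2.5, 61),
    "Event network throughput [Tb/s]": (1.6, 44),
}
for name, (lhc, hl_lhc) in params.items():
    print(f"{name}: x{hl_lhc / lhc:.1f}")
# HLT computing power grows ~18x, storage throughput ~24x, event network ~28x
```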
4
Data volume vs. Run 2/3 → exabytes
136 pileup (PU) event (2018) → 30+ petabytes/year
5
6
[Figure: the hardware spectrum from flexibility to efficiency — CPUs (Control Unit (CU), Registers, Arithmetic Logic Unit (ALU)) → GPUs → FPGAs → ASICs]
Advances in heterogeneous computing driven by machine learning
Challenges
Independent events
7
Opportunities for Accelerated Machine Learning Inference in Fundamental Physics
Javier Duarte1, Philip Harris2, Alex Himmel3, Burt Holzman3, Wesley Ketchum3, Jim Kowalkowski3, Miaoyuan Liu3, Brian Nord3, Gabriel Perdue3, Kevin Pedro3, Nhan Tran3, and Mike Williams2
1 University of California San Diego, La Jolla, CA 92093, USA
2 Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3 Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

Abstract
In this brief white paper, we discuss the future computing challenges for fundamental physics experiments. The use cases for deploying machine learning across physics for simulation, reconstruction, and analysis are rapidly growing. This will lead us to many applications where exploring accelerated machine learning algorithm inference could bring valuable and necessary gains in performance. Finally, we conclude by discussing the future challenges in deploying new heterogeneous computing hardware. This community report is inspired by discussions at the Fast Machine Learning Workshop held September 10-13, 2019.
Contents:
1 Introduction
  1.1 Computing model in particle physics
  1.2 Machine Learning
2 Challenges and Applications for Accelerated Machine Learning Inference
  2.1 CMS and ATLAS
  2.2 LHCb
  2.3 LSST
  2.4 LIGO
  2.5 DUNE
3 Outlook and Opportunities
Work in progress, [link]
On how to integrate heterogeneous compute into our computing model
8
[Diagram: the choices — domain algorithms vs. ML; as a Service (aaS) vs. direct connect; GPU, FPGA, or ASIC]
Our first study: MLaaS with FPGAs
9
Option 1: re-engineer physics algorithms for new hardware
  Language: OpenCL, OpenMP, HLS, Kokkos, …?
  Hardware: CPU, FPGA, GPU
Option 2: re-cast the physics problem as a machine learning problem
  Language: C++, Python (TensorFlow, PyTorch, …)
  Hardware: CPU, FPGA, GPU, ASIC
Is there a way to have the best of both worlds with physics-aware ML?
10
[Diagram: each CPU with a directly connected coprocessor (GPU, FPGA, ASIC) vs. coprocessors (GPU, FPGA, ASIC) deployed as a service shared by many CPU processes running Algo 1, Algo 2, …]
Direct connect pros: less system complexity, no network latency
As-a-service pros: scalable algorithms, scalable to the grid/cloud, heterogeneity (mixed hardware)
Less power hungry than GPU
11
12
13
14
[Figure: FPGAs deployed alongside CPUs in the datacenter]
15
Two benchmark networks:
Standard CNN for top tagging: ResNet-50, single-to-few batch
Per-channel reconstruction in the CMS detector, batch of 16000 (will come back to this)
16
arXiv:1605.07678 DeepAK8
17
Public top tagging data challenge
Averaged over 1000 jets
[Diagram: CMSSW event processing job — Input Source (data or simulation), Event Setup Database, Configuration Parameter Sets; multiple threads run modules (MODULE 1-6), including ML inference modules (ML INFER 1, ML INFER 2), writing Output 1, Output 2, …]
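For orientation, here is a minimal sketch of how an ML inference producer plugs into a job like the schematic above, assuming a CMSSW environment. The module label "JetImageProducer" is taken from the scaling-test slides later on; the input file name and the service-address parameter are hypothetical placeholders, not the actual configuration.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("MLINFER")

# Hypothetical input file (data or simulation)
process.source = cms.Source(
    "PoolSource",
    fileNames=cms.untracked.vstring("file:input_events.root"),
)

# ML inference producer running alongside ordinary reconstruction modules;
# the parameter below is a hypothetical placeholder for the service address
process.mlInfer1 = cms.EDProducer(
    "JetImageProducer",
    serviceAddress=cms.string("inference.example.org:50051"),
)

process.p = cms.Path(process.mlInfer1)
```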
SONIC: Services for Optimized Network Inference on Coprocessors
18
19
[Diagram: a CMSSW thread offloads external processing — acquire() sends the request to the coprocessor (FPGA, GPU, etc.) using a communication protocol, the thread does other work while the request is in flight, and produce() picks up the result]
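A minimal, self-contained sketch of this non-blocking acquire()/produce() pattern, not the actual CMSSW ExternalWork API: remote_infer() is a hypothetical stand-in for the call to the coprocessor service, and the latency and output values are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_infer(jet_image):
    """Pretend round trip to a coprocessor service (~60 ms cloud, ~10 ms on-prem)."""
    time.sleep(0.060)
    return {"top_score": 0.87}  # placeholder classifier output

executor = ThreadPoolExecutor(max_workers=4)

def acquire(jet_image):
    # Non-blocking: hand the request to the service and return immediately
    return executor.submit(remote_infer, jet_image)

def produce(pending):
    # Blocks only if the result has not come back yet
    return pending.result()

pending = acquire(jet_image=[[0.0] * 224] * 224)
# ... other reconstruction modules run here while the request is in flight ...
print(produce(pending))
```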
20
Fermilab (IL) → Azure (VA) → Fermilab (IL): <Δt> ~ 60 ms
Azure (on-prem): <Δt> ~ 10 ms
ResNet-50 time on FPGA ~ 1.8 ms, classifier on CPU ~ 2 ms
21
from SONIC → “worst case” scenario
Scaling Tests
[Diagram: scaling test setup — many worker nodes, each running JetImageProducer, all sending requests to one Brainwave service]
Simple scaling tests show we can hit the maximum throughput of the FPGA
i.e., the optimal way to use the hardware is to keep it busy all the time
50 simultaneous CPU jobs saturate 1 FPGA
This is conservative, since these jobs only ran 1 module
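A rough back-of-the-envelope check of that number, using the measured ~60 ms remote round trip and the ~660 img/s FPGA throughput quoted on the following slides; the extra per-event CPU work term is an assumption made for illustration.

```python
# Why O(50) CPU jobs are needed to saturate one FPGA behind the service
round_trip_s = 0.060        # measured Fermilab -> Azure -> Fermilab round trip
fpga_throughput = 660.0     # img/s sustained by the Brainwave ResNet-50 service
other_work_s = 0.015        # ASSUMPTION: extra per-event CPU work in each job

# A single job issuing one blocking request per event can supply at most:
per_job_rate = 1.0 / (round_trip_s + other_work_s)   # ~13 img/s

jobs_to_saturate = fpga_throughput / per_job_rate
print(f"per-job rate ~ {per_job_rate:.1f} img/s, "
      f"jobs to saturate the FPGA ~ {jobs_to_saturate:.0f}")
# -> roughly 50 simultaneous jobs, matching the scaling test
```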
22
Type      | Note         | Latency [ms]     | Throughput [img/s]
CPU*      | Xeon 2.6 GHz | 1750             | 0.6
CPU*      | i7 3.6 GHz   | 500              | 2
GPU†**    | batch = 1    | 7                | 143
GPU†**    | batch = 32   | 1.5              | 667
Brainwave | remote       | 60               | 660
Brainwave | on-prem      | 10 (1.8 on FPGA) | 660

* Performance depends on clock speed, TensorFlow version, # threads (1)
† Directly connected to CPU via PCIe – not a service
** Performance depends on batch size & optimization of ResNet-50
30x (cloud, remote) to 175x (edge, on-prem) faster than current CMSSW CPU inference
The FPGA runs at batch-of-1; the GPU is competitive at large batch size
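Where the 30x and 175x factors come from: they are just the ratios of the ResNet-50 latencies in the table above (numbers as quoted on the slide).

```python
cpu_ms = 1750.0              # Xeon 2.6 GHz, single-thread CPU inference
brainwave_remote_ms = 60.0   # cloud, Fermilab -> Azure round trip
brainwave_onprem_ms = 10.0   # edge, on-prem round trip

print(f"cloud speedup:   {cpu_ms / brainwave_remote_ms:.0f}x")  # ~29x, quoted as 30x
print(f"on-prem speedup: {cpu_ms / brainwave_onprem_ms:.0f}x")  # 175x
```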
GPUaaS, FPGA via PCIe/on-prem, more models
23
Azure FPGA (Data Box Edge) installed at Fermilab
Testing latency of inference across Feynman Computing Center
Docker container on server (PCIe): 14 ± 25 ms
From FCC server farm: 20 ± 30 ms
N.B. no "bump-in-the-wire" network access, PCIe only
GPUaaS, FPGA via PCIe/on-prem, more models
24
                    | HCal Reco Network       | ResNet-50 (Top tag) Network
CPU (single-thread) | 67 inf/s                | 0.6 - 2 img/s (depends on CPU)
GPUaaS w/ TensorRT  | 333 inf/s (batch 16000) | 140 img/s (batch 1), 667 img/s (batch 32)
FPGA (batch 1)      | 500 inf/s (batch 1)     | 660 img/s (Brainwave, aaS)
GPU: running both direct-connect PCIe and aaS, using TensorRT as an inference server
FPGA: running direct connect on a Xilinx Alveo (VU9P) with hls4ml; work in progress for aaS with Galapagos (UToronto) and a custom REST API (a sketch of the hls4ml flow follows below)
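An illustrative sketch of the hls4ml flow mentioned above: converting a (toy) Keras model into an HLS project for a direct-connect FPGA. The model, FPGA part string, and output directory here are placeholders, not the actual setup used for the studies.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import hls4ml

# Toy stand-in for a physics classifier
model = Sequential([
    Dense(64, activation="relu", input_shape=(16,)),
    Dense(32, activation="relu"),
    Dense(5, activation="softmax"),
])

# Per-model precision / parallelism configuration
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",        # placeholder project directory
    part="xcvu9p-flgb2104-2-i",     # placeholder VU9P part string
)

hls_model.compile()   # build the C simulation library for quick checks
# hls_model.build()   # run the full HLS synthesis (long)
```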
Latency is non-trivial, with many "balls in the air"; networking bottlenecks need to be understood
Short answer: offloading pays off when a CPU process takes > ~5 ms
25
Obvious: hardware architectures optimized for ML have large gains over CPUs
26
Building a network of heterogeneous resources in the cloud and on-premises
Work in progress: how to coordinate and orchestrate distributed heterogeneous resources
[Map: sites hosting heterogeneous resources — various combinations of GPU, FPGA, and ASIC at each location]
architectures such as Graph Neural Networks
FastML communities
27
“this is where extra material goes”
28
29
Yearly data volumes:
LHC – 2016: 50 PB raw data
LHC Science data: ~200 PB
Google searches: 98 PB
Facebook uploads: 180 PB
Google Internet archive: ~15 EB
SKA Phase 1 – 2023: ~300 PB/year science data
HL-LHC – 2026: ~600 PB raw data
HL-LHC – 2026: ~1 EB physics data
SKA Phase 2 – mid-2020's: ~1 EB science data
(also shown: DUNE 2026, LSST 2021)