

SLIDE 1

Accelerated machine learning inference as a service for particle physics computing

Nhan Tran December 8, 2019

SLIDE 2

Collaborators: Markus Atkinson, Mark Neubauer, Burt Holzman, Sergo Jindariani, Thomas Klijnsma, Ben Kreis, Mia Liu, Kevin Pedro, Nhan Tran, Phil Harris, Jeff Krupa, Sang Eon Park, Dylan Rankin, Paul Chow, Naif Tarafdar, Scott Hauck, Shih Chieh Hsu, Kelvin Mei, Cha Suaysom, Matt Trahms, Dustin Werran, Javier Duarte, Vladimir Loncar, Jennifer Ngadiuba, Maurizio Pierini, Sioni Summers, Suffian Khan, Brian Lee, Kalin Ovcharov, Brandon Perez, Andrew Putnam, Ted Way, Colin Versteeg, Zhenbin Wu, Giuseppe Di Guglielmo

Work based on https://arxiv.org/abs/1904.08986 and studies in the fastmachinelearning.org community

SLIDE 3

The computing conundrum


CMS detector                 LHC (current)   HL-LHC (upgraded)
Simultaneous interactions    60              200
L1 accept rate               100 kHz         750 kHz
HLT accept rate              1 kHz           7.5 kHz
Event size                   2.0 MB          7.4 MB
HLT computing power          0.5 MHS06       9.2 MHS06
Storage throughput           2.5 GB/s        61 GB/s
Event network throughput     1.6 Tb/s        44 Tb/s

[Figures: CMS offline computing profile projection; CMS online filter farm project]

Compute needs are growing by more than 10x (worked out below from the table), environments are getting more complex, and we need more sophisticated analysis techniques.
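A quick sanity check of the ">10x" claim, reading the growth factors straight off the table above; the inputs are only the numbers quoted on this slide.

```python
# Ratios of HL-LHC (upgraded) to LHC (current) figures from the table above.
ratios = {
    "L1 accept rate":           750 / 100,   # kHz / kHz  -> 7.5x
    "HLT accept rate":          7.5 / 1,     # kHz / kHz  -> 7.5x
    "Event size":               7.4 / 2.0,   # MB / MB    -> 3.7x
    "HLT computing power":      9.2 / 0.5,   # MHS06      -> 18.4x
    "Storage throughput":       61 / 2.5,    # GB/s       -> 24.4x
    "Event network throughput": 44 / 1.6,    # Tb/s       -> 27.5x
}
for name, r in ratios.items():
    print(f"{name}: {r:.1f}x")
```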

SLIDE 4

The computing conundrum


Compute needs are growing by more than 10x, environments are getting more complex, and we need more sophisticated analysis techniques.

[Figures: computing challenges; HL-LHC data volume vs. Run 2/3 reaching exabytes; a 136-pileup event (2018); 30+ petabytes/year]

SLIDE 5

The computing conundrum


SLIDE 6

Heterogeneous compute

[Diagram: spectrum of compute hardware from flexibility to efficiency, CPUs → GPUs → FPGAs → ASICs; the CPU is sketched as a control unit (CU), registers, and an arithmetic logic unit (ALU)]

Advances in heterogeneous computing driven by machine learning


SLIDE 8

Why fast inference?

  • Training has its own computing challenges
  • But training happens ~once per year and outside of the compute infrastructure
  • Inference happens on billions of events, many times a year
  • A unique challenge across HEP: massive datasets of statistically independent events


Opportunities for Accelerated Machine Learning Inference in Fundamental Physics

Javier Duarte1, Philip Harris2, Alex Himmel3, Burt Holzman3, Wesley Ketchum3, Jim Kowalkowski3, Miaoyuan Liu3, Brian Nord3, Gabriel Perdue3, Kevin Pedro3, Nhan Tran3, and Mike Williams2

1University of California San Diego, La Jolla, CA 92093, USA 2Massachusetts Institute of Technology, Cambridge, MA 02139, USA 3Fermi National Accelerator Laboratory, Batavia, IL 60510, USA

ABSTRACT

In this brief white paper, we discuss the future computing challenges for fundamental physics experiments. The use cases for deploying machine learning across physics for simulation, reconstruction, and analysis are rapidly growing. This will lead us to many applications where exploring accelerated machine learning algorithm inference could bring valuable and necessary gains in performance. Finally, we conclude by discussing the future challenges in deploying new heterogeneous computing hardware. This community report is inspired by discussions at the Fast Machine Learning Workshop held September 10-13, 2019.

Contents: 1 Introduction (1.1 Computing model in particle physics; 1.2 Machine Learning); 2 Challenges and Applications for Accelerated Machine Learning Inference (2.1 CMS and ATLAS; 2.2 LHCb; 2.3 LSST; 2.4 LIGO; 2.5 DUNE); 3 Outlook and Opportunities

Work in progress, [link]


SLIDE 10

Pros & Cons

On how to integrate heterogeneous compute into our computing model


[Table of options to weigh: domain algorithms vs. ML; as a Service (aaS) vs. direct connect; GPU vs. FPGA vs. ASIC]

Our first study: MLaaS with FPGAs

SLIDE 11

To ML or not to ML


Re-engineer physics algorithms for new hardware:
  Language: OpenCL, OpenMP, HLS, Kokkos, …?
  Hardware: CPU, FPGA, GPU

Re-cast the physics problem as a machine learning problem:
  Language: C++, Python (TensorFlow, PyTorch, …)
  Hardware: CPU, FPGA, GPU, ASIC

Is there a way to have the best of both worlds with physics-aware ML?


SLIDE 13

aaS or direct connect

[Diagram: coprocessors (GPU, FPGA, ASIC) serving algorithms Algo 1 and Algo 2, contrasting the direct-connect and as-a-service arrangements]

Direct connect pros: less system complexity; no network latency.
As-a-service pros: scalable algorithms; scalable to the grid/cloud; heterogeneity (mixed hardware).

SLIDE 14

hardware choices

  • GPUs
  • Power hungry
  • Batching for optimal performance
  • Mature software ecosystem
  • ASICs
  • Most efficient Op/W
  • Less flexible
  • FPGAs
  • Middle solution: flexible and less power hungry than GPUs
  • Do not require batching




SLIDE 17

Brainwave on Azure ML



SLIDE 19

hardware choices



SLIDE 21

The models

  • This talk focuses on a standard CNN for top tagging: ResNet-50
  • One big network, single-to-few batch
  • Another, different example: HCal Reco, a network for per-channel reconstruction in the CMS detector
  • Small network, batch of 16000 (we will come back to this)


arXiv:1605.07678 DeepAK8


SLIDE 22

Tagging tops


Public top tagging data challenge

Averaged over 1000 jets

SLIDE 23

[Diagram: a CMSSW event processing job, with an input source (data or simulation), event setup database, configuration parameter sets, several threads running modules (MODULE 1-6), two ML inference modules (ML INFER 1, ML INFER 2) that call out to a coprocessor, and outputs (Output 1, Output 2, …)]

SONIC

Services for Optimized Network Inference on Coprocessors


SLIDE 24

SONIC

Services for Optimized Network Inference on Coprocessors


[Diagram: the CMSSW thread calls acquire(), the external processing runs on the FPGA/GPU etc. while the thread does other work, and produce() then picks up the result]

  • Convert experimental data to the neural network input (a TF tensor) and send it to the coprocessor using a communication protocol (a minimal sketch of the request pattern follows below)
  • CMSSW ExternalWork mechanism for asynchronous, non-blocking requests
  • SONIC CMSSW repository
  • Supporting gRPC with TensorFlow, working on TensorRT
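As an illustration of the first bullet, here is a minimal Python sketch of the as-a-service request pattern in the TensorFlow Serving gRPC style. It is not the actual SONIC C++ producer; the host, port, model name, and tensor keys are hypothetical placeholders.

```python
# Sketch: send one jet image to a remote inference service over gRPC.
# Assumes a TensorFlow-Serving-style endpoint; "inference.example.org", the
# model name "resnet50", and the tensor keys "input"/"scores" are made up.
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

def classify_jet_image(image: np.ndarray,
                       host: str = "inference.example.org",
                       port: int = 8500) -> np.ndarray:
    """Send a single jet image (batch of 1) to the service and return the scores."""
    channel = grpc.insecure_channel(f"{host}:{port}")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "resnet50"
    request.inputs["input"].CopyFrom(
        tf.make_tensor_proto(image.astype(np.float32), shape=[1, 224, 224, 3])
    )
    # Blocking call for clarity; SONIC issues the request asynchronously from
    # acquire() so the CPU thread can do other work until produce() runs.
    response = stub.Predict(request, timeout=5.0)
    return tf.make_ndarray(response.outputs["scores"])
```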
SLIDE 25

SONIC: single service


Fermilab (IL) → Azure (VA) → Fermilab (IL): <Δt> ~ 60 ms; Azure (on-prem): <Δt> ~ 10 ms. ResNet-50 time on the FPGA ~ 1.8 ms, classifier on the CPU ~ 2 ms.

SLIDE 26

SONIC: scale out


from SONIC → “worst case” scenario

Scaling Tests

[Diagram: many worker nodes, each running a JetImageProducer module, all sending requests to a single Brainwave service]

Simple scaling tests show that we can hit the maximum throughput of the FPGA, i.e. the optimal way to use the hardware is to keep it busy all the time. Roughly 50 simultaneous CPU jobs saturate one FPGA (a back-of-the-envelope check follows below). This is conservative, since these jobs only ran one module.
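A back-of-the-envelope version of that statement; the service throughput is the 660 img/s quoted on the comparison slide, while the per-job request rate is an assumed placeholder.

```python
# How many single-module CMSSW jobs does it take to saturate one FPGA service?
fpga_throughput = 660.0        # images/s sustained by the Brainwave service (from the comparison slide)
per_job_request_rate = 13.0    # images/s issued by one client job (assumption for illustration)
jobs_to_saturate = fpga_throughput / per_job_request_rate
print(round(jobs_to_saturate))  # ~50 jobs, consistent with the quoted scaling test
```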

SLIDE 27

Comparisons


Performance comparisons:

Type        Note             Latency [ms]        Throughput [img/s]
CPU*        Xeon 2.6 GHz     1750                0.6
CPU*        i7 3.6 GHz       500                 2
GPU†**      batch = 1        7                   143
GPU†**      batch = 32       1.5                 667
Brainwave   remote           60                  660
Brainwave   on-prem          10 (1.8 on FPGA)    660

*Performance depends on clock speed, TensorFlow version, and number of threads (1 here).
†Directly connected to the CPU via PCIe, not a service.
**Performance depends on batch size and the optimization of ResNet-50.

30x (cloud, remote) to 175x (edge, on-prem) faster than current CMSSW CPU inference (the ratios are worked out below). The FPGA runs at batch-of-1; the GPU is competitive at large batch size.
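For reference, the 30x and 175x factors follow directly from the latency ratios in the table above.

```python
cpu_latency_ms = 1750          # single-threaded Xeon, from the table above
remote_latency_ms = 60         # Brainwave, remote cloud service
onprem_latency_ms = 10         # Brainwave, on-prem ("edge") service
print(cpu_latency_ms / remote_latency_ms)  # ~29x, quoted as "30x"
print(cpu_latency_ms / onprem_latency_ms)  # 175x
```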

SLIDE 28

Latest studies

GPUaaS, FPGA via PCIe/on-prem, more models



Azure FPGA (Data Box Edge) installed at Fermilab

Testing latency of inference across Feynman Computing Center

Docker container on the server (PCIe): 14 ± 25 ms. From the FCC server farm: 20 ± 30 ms. N.B. no "bump-in-the-wire" network access, PCIe only.

SLIDE 29

Latest studies

GPUaaS, FPGA via PCIe/on-prem, more models


                      HCal Reco network          ResNet-50 (top tag) network
CPU (single-thread)   67 inf/s                   0.6-2 img/s (depends on CPU)
GPUaaS w/ TensorRT    333 inf/s (batch 16000)    140 img/s (batch 1), 667 img/s (batch 32)
FPGA (batch 1)        500 inf/s                  660 img/s (Brainwave, aaS)

Running both direct connect (PCIe) and aaS using TensorRT as an inference server. Running direct connect using a Xilinx Alveo (VU9P) with hls4ml (a hedged sketch of the hls4ml flow follows below); aaS via Galapagos (UToronto) and a custom REST API is work in progress.
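For the direct-connect path, a hedged sketch of an hls4ml conversion flow like the one mentioned above; the model file, output directory, and FPGA part string are illustrative placeholders (and keyword names can differ between hls4ml versions), not the exact configuration of the Alveo study.

```python
# Sketch: convert a trained Keras model into an HLS/FPGA project with hls4ml.
# The model file, output directory, and part number below are placeholders.
import hls4ml
from tensorflow import keras

model = keras.models.load_model("hcal_reco_model.h5")  # hypothetical model file

# Start from a model-level hls4ml configuration (precision, reuse factor, ...)
config = hls4ml.utils.config_from_keras_model(model, granularity="model")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",
    part="xcvu9p-flga2104-2-e",  # a VU9P-family part, as on the boards cited above
)
hls_model.compile()  # builds the C/C++ emulation for quick validation
# hls_model.build()  # would launch the full HLS synthesis (long-running)
```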

SLIDE 30

Lessons learned

  • Throughput is paramount [**don't violate storage throughput (buffer)]
  • As-a-service: latency is non-trivial with many balls in the air; networking bottlenecks need to be understood

  • Driven by on-chip computation time and batching
  • Work in progress: large-scale multi-service tests
  • Latency lessons:
  • When do you want to use a coprocessor? Short answer: when the CPU process takes > ~5 ms (see the sketch below)
  • PCIe latency is < ~1 ms
  • On-prem aaS latency is ~2-10 ms, depending on the relative location of the nodes
  • Cloud latency is fixed by the speed of light × some O(2) factor for network switching
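A tiny numerical restatement of that rule of thumb, using the illustrative latencies quoted in this deck.

```python
# Offloading to a coprocessor wins roughly when the accelerator time plus the
# link/service latency is still below the CPU time for the same task.
def offload_wins(cpu_ms: float, accel_ms: float, link_ms: float) -> bool:
    return accel_ms + link_ms < cpu_ms

print(offload_wins(cpu_ms=1750, accel_ms=1.8, link_ms=60))  # remote cloud aaS -> True
print(offload_wins(cpu_ms=1750, accel_ms=1.8, link_ms=10))  # on-prem aaS      -> True
print(offload_wins(cpu_ms=3,    accel_ms=1.8, link_ms=2))   # tiny CPU task    -> False
```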


Obvious: hardware architectures optimized for ML have large gains over CPUs

SLIDE 31

Towards abstraction: on-premises, in the cloud, oh my!


Building a network of heterogeneous resources in the cloud and on-premises. Work in progress: how to coordinate and orchestrate distributed heterogeneous resources.

[Diagram: a network of sites, each annotated with its available accelerators (GPU, FPGA, ASIC)]

SLIDE 32

wants, needs, & what’s next

  • More ML: build more use cases with existing ML algorithms
  • Not ML: try the aaS paradigm with non-ML algorithms
  • More domains: working to integrate with cosmic and neutrino software pipelines and models
  • Push the software/firmware/hardware: push GPU/FPGA optimization and customization, particularly for new network architectures such as graph networks
  • Scale out: process orchestration for multi-accelerator, multi-service, many-module jobs; understand the potential to include HPC sites in the model
  • Community availability: provide a hardware platform for interested parties to test through the IRIS-HEP SSL and FastML communities


SLIDE 33

“this is where extra material goes”


SLIDE 34

Big data challenge



Yearly data volumes:
  • Google searches: 98 PB
  • LHC science data: ~200 PB
  • SKA Phase 1 (2023): ~300 PB/year science data
  • HL-LHC (2026): ~600 PB raw data
  • HL-LHC (2026): ~1 EB physics data
  • SKA Phase 2 (mid-2020s): ~1 EB science data
  • LHC (2016): 50 PB raw data
  • Facebook uploads: 180 PB
  • Google Internet archive: ~15 EB
  • Also shown: DUNE (2026), LSST (2021)