DAQ algorithms on CPUs. Philip Rodrigues, University of Oxford. May 24, 2018 (PowerPoint presentation).


SLIDE 1

DAQ algorithms on CPUs

Philip Rodrigues

University of Oxford

May 24, 2018

SLIDE 2

Introduction

◮ Work done in the context of comparing the processing resources needed for CPU, GPU, and FPGA
◮ I used the simplest possible pedestal subtraction, noise filtering and hit finding on FD MC, to get a “lower bound” on the resources needed

◮ I’ll try to concentrate more on the algorithms than the performance today

SLIDE 3

Back-of-envelope calculation

◮ Collection wire samples/s/APA = 2e6 × 960 ≈ 2e9
◮ On a 2 GHz CPU, that gives us about 1 clock cycle per sample if we want to handle 1 APA
◮ But all machines today have multiple cores, and we have single-instruction-multiple-data (SIMD)
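This budget arithmetic can be written down directly (the constants are from the slide; `budget` is just an illustrative helper name):

```cpp
#include <cstdint>

// Collection-wire samples per second per APA: 2 MS/s x 960 wires
constexpr std::int64_t samples_per_sec = 2'000'000LL * 960;  // 1.92e9, ~2e9

// Clock cycles available per sample on a single 2 GHz core: ~1
constexpr double cycles_per_sample = 2'000'000'000.0 / samples_per_sec;

// With 16-wide SIMD and ncores cores, the budget grows to
// ~16 * ncores cycles per sample
constexpr double budget(int ncores) { return 16.0 * ncores * cycles_per_sample; }
```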

SLIDE 4

SIMD

Credit: Decora at English Wikipedia, CC Attribution-Share Alike 3.0

◮ Act on multiple values simultaneously in one instruction
◮ Machines I can access have AVX2 with 256-bit registers, ie 16 16-bit numbers at a time
◮ Now our back-of-envelope looks better: 16 × Ncore clock cycles per sample
◮ Just got access to a system at CERN with AVX-512

SLIDE 5

How I use SIMD

(ch0, t0) (ch1, t0) (ch2, t0) (ch3, t0) ... (ch15, t0)
(ch0, t1) (ch1, t1) (ch2, t1) (ch3, t1) ... (ch15, t1)

◮ A register holds the samples for 16 channels at the same time tick
◮ Makes operations on adjacent ticks in the same channel easy
◮ (Makes operations on adjacent channels in the same tick hard)
◮ (Incidentally, this is the opposite of the order I store the input in memory, but it doesn’t seem to hurt too much?)

SLIDE 6

Extracting waveforms

◮ Much easier to work on this outside larsoft, so I extracted waveforms from non-zero-suppressed MC using gallery (http://art.fnal.gov/gallery/)

◮ Converted waveforms to text format: simple to import/plot from C++ and python

SLIDE 7

Raw waveforms

[Plot: raw waveforms (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

◮ Some channels from SN MC which I selected because they look nice
◮ Reminder: FD MC noise is low, no coherent noise

SLIDE 8

Step 1: pedestal finding

[Plot: raw waveform with the pedestal estimate overlaid (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

◮ Intuition: the pedestal is the median of the waveform
◮ “Frugal streaming”¹ gives an approximation that converges to the median:

  • 1. Start with an estimate of the median, read the next sample
  • 2. If sample > median, increase median by 1
  • 3. If sample < median, decrease median by 1

◮ Unfortunately this “follows” hits too much, so try a modification...

¹ https://arxiv.org/abs/1407.1121
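The three steps above can be sketched as scalar code (a minimal non-SIMD reference; the function name is mine):

```cpp
#include <cstdint>
#include <vector>

// Frugal-streaming median estimate: nudge the estimate by 1 ADC count
// towards each new sample. Converges to (approximately) the median.
std::int16_t frugal_median(const std::vector<std::int16_t>& samples,
                           std::int16_t estimate)
{
    for (std::int16_t s : samples) {
        if (s > estimate) ++estimate;
        else if (s < estimate) --estimate;
    }
    return estimate;
}
```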

SLIDE 9

Step 1: modified pedestal finding

[Plot: waveform with the modified pedestal estimate overlaid (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

  • 1. Start with an accumulator=0, an estimate of the median, and read the next sample
  • 2. If sample > median, increase accumulator by 1
  • 3. If sample < median, decrease accumulator by 1
  • 4. If accumulator = X, increase median by 1, reset accumulator to 0
  • 5. If accumulator = −X, decrease median by 1, reset accumulator to 0

◮ I used X = 10 because it was the first number I thought of
◮ Larger values of X mean you follow hits less, but respond less to real changes in the pedestal. For serious work, this would need some investigation
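The modified algorithm in scalar form (again just a reference sketch, with X = 10 as above):

```cpp
#include <cstdint>
#include <vector>

// Modified frugal pedestal: only move the median estimate after the
// accumulator reaches +/-X, so short hits barely move the estimate.
std::int16_t pedestal(const std::vector<std::int16_t>& samples,
                      std::int16_t median, int X = 10)
{
    int accum = 0;
    for (std::int16_t s : samples) {
        if (s > median) ++accum;
        else if (s < median) --accum;
        if (accum == X)  { ++median; accum = 0; }
        if (accum == -X) { --median; accum = 0; }
    }
    return median;
}
```

A short hit (fewer than X samples over the estimate) leaves the pedestal untouched, while a genuine baseline shift is still followed, one count per X samples.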

SLIDE 10

Step 2: noise filtering

[Plot: noise-filtered waveform (y: -10000 to 10000, x: time ticks 1000 to 4000)]

◮ I used a simple FIR lowpass filter (= a discrete convolution with a fixed function)
◮ Hardcoded filter size (7 taps), unrolled inner loop. Lowpass filter with cutoff at 0.1 of the Nyquist frequency

◮ I’m using integer coefficients, which is why the scale changed
◮ Probably need a bigger filter for more realistic noise
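A scalar version of such a 7-tap integer FIR might look like this; the coefficients here are illustrative (a symmetric lowpass shape), not the ones used for the slide, which would come from a proper filter-design tool:

```cpp
#include <cstdint>
#include <vector>

// 7-tap FIR lowpass with integer coefficients, unrolled inner loop.
// The output is scaled up by the sum of the taps, which is why the
// y-scale of the plot changes.
std::vector<std::int32_t> fir7(const std::vector<std::int16_t>& x)
{
    // Illustrative symmetric taps (sum = 16); NOT the slide's coefficients
    static const std::int32_t c[7] = {1, 2, 3, 4, 3, 2, 1};
    std::vector<std::int32_t> y;
    if (x.size() < 7) return y;
    for (std::size_t i = 6; i < x.size(); ++i) {
        y.push_back(c[0]*x[i-6] + c[1]*x[i-5] + c[2]*x[i-4] +
                    c[3]*x[i-3] + c[4]*x[i-2] + c[5]*x[i-1] + c[6]*x[i]);
    }
    return y;
}
```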

SLIDE 11

Step 3: hit finding

[Plot: filtered waveform with found hits (y: -10000 to 10000, x: time ticks 1000 to 4000)]

◮ Algorithm: the first sample over a fixed threshold starts a hit; integrate time and charge until the waveform falls below threshold again

◮ Could make the threshold depend on the pedestal RMS; require a number of samples above threshold; emit multiple primitives for long time-over-threshold
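The basic algorithm (without any of the extensions just listed) can be sketched as follows; the `Hit` fields and function name are my own:

```cpp
#include <cstdint>
#include <vector>

struct Hit { std::size_t start; std::size_t tot; std::int64_t charge; };

// Threshold hit finder: a sample over threshold opens a hit; integrate
// time-over-threshold and charge until the waveform falls below again.
std::vector<Hit> find_hits(const std::vector<std::int32_t>& w,
                           std::int32_t threshold)
{
    std::vector<Hit> hits;
    bool in_hit = false;
    for (std::size_t t = 0; t < w.size(); ++t) {
        if (w[t] > threshold) {
            if (!in_hit) { hits.push_back({t, 0, 0}); in_hit = true; }
            hits.back().tot += 1;
            hits.back().charge += w[t];
        } else {
            in_hit = false;
        }
    }
    return hits;
}
```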

SLIDE 12

Benchmarking results summary

◮ Tested with a chunk of collection channel MC large enough to not fit in cache
◮ With about 4 threads, the multicore CPU I tested on can keep up with 1 APA’s worth of data
◮ More details in the backups

SLIDE 13

Extensions/TODOs for benchmarking

◮ Consider more realistic input data, like from the electronics:
  ◮ Samples are 12-bit numbers, not 16-bit
  ◮ Ordering of channels is different
  ◮ Input is 8b/10b encoded
◮ Run on a different machine with more cores, no virtualization
◮ Time individual steps, vary parameters (eg # of taps)
◮ Check the distribution of timings (eg, do we occasionally get very long times?)
◮ Eventually would test more complex algorithms
◮ Stream data into memory, eg using a GPU (idea from Babak)

SLIDE 14

Algorithm extensions

◮ Deal with coherent noise somehow. Eg the MicroBooNE technique: subtract the median of a group of channels at the same tick
◮ MicroBooNE has “harmonic” noise at fixed frequencies. Would require a large FIR filter to deal with. Not sure if there is another technique available
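A scalar sketch of that MicroBooNE-style coherent-noise subtraction (the function name is mine, and the channel grouping is left to the caller):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// For each tick, subtract the median across a group of channels from
// every channel in the group. wf[ch][t] holds pedestal-subtracted samples;
// all channels are assumed to have the same number of ticks.
void subtract_group_median(std::vector<std::vector<std::int16_t>>& wf)
{
    if (wf.empty()) return;
    const std::size_t nch = wf.size();
    const std::size_t nticks = wf[0].size();
    for (std::size_t t = 0; t < nticks; ++t) {
        // Collect this tick's samples across the group of channels
        std::vector<std::int16_t> col(nch);
        for (std::size_t ch = 0; ch < nch; ++ch) col[ch] = wf[ch][t];
        // Median via partial sort
        std::nth_element(col.begin(), col.begin() + nch/2, col.end());
        const std::int16_t med = col[nch/2];
        for (std::size_t ch = 0; ch < nch; ++ch) wf[ch][t] -= med;
    }
}
```

Note this is awkward in the register layout from slide 5, where a register holds 16 channels at one tick: the per-tick median needs comparisons across lanes, which is exactly the "operations on adjacent channels in the same tick" case flagged as hard.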

SLIDE 15

Physics performance studies

◮ We need to understand how well any given algorithm performs, especially in the presence of more realistic noise. I haven’t done this at all
◮ This also needs a more serious noise model (which doesn’t have to be in larsoft: it can be standalone, glued to the larsoft signal simulation)

SLIDE 16

Backup slides

SLIDE 17

Detour: Memory hierarchy and bandwidth

https://software.intel.com/en-us/articles/memory-performance-in-a-nutshell

◮ Main memory bandwidth sets an upper limit on how much data we can process
◮ 100 GB/s is more than enough to handle 1 or 2 APAs
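The comparison can be made concrete; assuming collection channels only and 16-bit samples, consistent with the rest of the slides:

```cpp
// Input data rate for one APA's collection wires with 16-bit samples:
// roughly 3.84 GB/s, small compared to ~100 GB/s main-memory bandwidth
constexpr double apa_gb_per_s =
    2'000'000.0   // samples per second per wire
    * 960         // collection wires per APA
    * 2           // bytes per sample (short int)
    / 1e9;        // bytes -> GB
```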

SLIDE 18

Measuring memory bandwidth

◮ Can we actually achieve this memory bandwidth?
◮ Used the STREAM benchmark², which effectively just does memcpy
◮ Ran on dunegpvm01. With 1 thread, get ∼10 GB/s; with 4 threads, get ∼35 GB/s (17.5 GB/s in + 17.5 GB/s out)

² https://github.com/jeffhammond/STREAM, http://www.cs.virginia.edu/stream/ref.html

SLIDE 19

Strategy details

◮ Run on DUNE FD detector MC (it’s all I’ve got...)
◮ Use the simplest algorithms I can think of
◮ Use only collection channels, all calculations with short integers (16 bits)
◮ Write a simple non-SIMD code to check the results
◮ SIMD code written in C++ using “intrinsic” functions
◮ Nicer interfaces exist, though I haven’t tried them. Eg http://quantstack.net/xsimd, http://agner.org/optimize/#vectorclass

SLIDE 20

What code with intrinsics looks like

// s holds the samples in 16 channels at the same tick
// This whole block achieves the following:
//   if the sample s is greater than median, add one to accum
//   if the sample s is less than median, subtract one from accum
// For reasons that I don't understand, there's no cmplt
// for "compare less-than", so we have to compare greater,
// compare equal, and take everything else to be compared
// less-than
// 'epi16' is a type marker for '16-bit signed integer'

// Create masks for which channels are >, == median
__m256i is_gt = _mm256_cmpgt_epi16(s, median);
__m256i is_eq = _mm256_cmpeq_epi16(s, median);
// The value we add to the accumulator in each channel
__m256i to_add = _mm256_set1_epi16(-1);
// Really want an epi16 version of blendv, but the cmpgt and
// cmpeq functions set their epi16 parts to 0xffff or 0x0,
// so treating everything as epi8 works the same
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(1), is_gt);
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(0), is_eq);
// Actually do the adding
accum = _mm256_add_epi16(accum, to_add);

SLIDE 21

Test details

◮ Use DUNE FD MC waveforms, as seen above
◮ Using 4492 samples × 1835 collection wires × 16 repeats = 69 APA·ms (since 960 collection wires/APA)
◮ Using short int (2 bytes) for samples, so the size is 252 MB (big enough to not fit in cache)
◮ Start with this data in memory, ordered like: (c0, t0), (c0, t1), ..., (c0, tN), (c1, t0), (c1, t1), ...
◮ Loop over the data, and store the output hits (not the intermediate steps). Repeat 10 times, take the average
◮ Timing doesn’t include putting the input data in memory, allocating the output buffer, or compacting output hits from the SIMD code
◮ Ran with multiple threads. Each thread gets a contiguous block of 16N channels to deal with
◮ No time chunking: all 4492 ticks get processed at once

SLIDE 22

System under test: 2 (system 1 in backups)

◮ lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
CPU MHz:               1200.281
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K

SLIDE 23

Timing results: system 2

Threads          1       2       4       8      16      32      64
Non-SIMD:
  ms          1322.3   667.1   347.5   195.0   110.4    80.6    86.3
  APA/server     0.1     0.1     0.2     0.4     0.6     0.9     0.8
  GB/s           0.4     0.7     1.4     2.5     4.4     6.1     5.7
SIMD:
  ms           124.9    76.0    48.4    24.3    14.9    10.9    11.6
  APA/server     0.6     0.9     1.4     2.8     4.6     6.3     5.9
  GB/s           3.9     6.5    10.1    20.2    33.1    45.0    42.3

◮ Apologies for the gigantic table: I’ve highlighted the most interesting values
◮ “APA/server” is just the ratio of “APA·ms of data processed” (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it’s greater than 1
◮ ie 2–4 cores can keep up with the data from one APA
◮ There’s a few 10s of % variation between runs in these numbers

SLIDE 24

System under test: 1

◮ dunegpvm01, for which lscpu says (in part):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU MHz:               2299.998
BogoMIPS:              4599.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K

SLIDE 25

Timing results: system 1

Threads          1       2       4       8      16      32      64
Non-SIMD:
  ms          1453.3   722.3   480.8   419.8   400.1   379.6   360.2
  APA/server     0.0     0.1     0.1     0.2     0.2     0.2     0.2
  GB/s           0.3     0.7     1.0     1.2     1.2     1.3     1.4
SIMD:
  ms           122.9    65.4    49.9    39.7    41.3    41.1    43.1
  APA/server     0.6     1.1     1.4     1.7     1.7     1.7     1.6
  GB/s           4.0     7.5     9.8    12.4    11.9    12.0    11.4

◮ Apologies for the gigantic table: I’ve highlighted the most interesting values
◮ “APA/server” is just the ratio of “APA·ms of data processed” (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it’s greater than 1
◮ ie 2–4 cores can keep up with the data from one APA
◮ There’s a few 10s of % variation between runs in these numbers
◮ This system only has 4 cores, so don’t expect improvement past 4 threads
