DAQ algorithms on CPUs. Philip Rodrigues, University of Oxford. May 24, 2018 (PowerPoint presentation).


SLIDE 1

DAQ algorithms on CPUs

Philip Rodrigues

University of Oxford

May 24, 2018

SLIDE 2

Introduction

◮ Work done in the context of comparing the processing resources needed for CPU, GPU, and FPGA
◮ I used the simplest possible pedestal subtraction, noise filtering and hit finding on FD MC, to get a “lower bound” on the resources needed

◮ I’ll try to concentrate more on the algorithms than the performance today

SLIDE 3

Back-of-envelope calculation

◮ Collection wire samples/s/APA = 2e6 × 960 ≈ 2e9
◮ On a 2 GHz CPU, that gives us about 1 clock cycle per sample if we want to handle 1 APA
◮ But all machines today have multiple cores, and we have single-instruction-multiple-data (SIMD)
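This budget arithmetic can be written down directly (the constants are from the slide; `budget` is just an illustrative helper name):

```cpp
#include <cstdint>

// Collection-wire samples per second per APA: 2 MS/s x 960 wires
constexpr std::int64_t samples_per_sec = 2'000'000LL * 960;  // 1.92e9, ~2e9

// Clock cycles available per sample on a single 2 GHz core: ~1
constexpr double cycles_per_sample = 2'000'000'000.0 / samples_per_sec;

// With 16-wide SIMD and ncores cores, the budget grows to
// ~16 * ncores cycles per sample
constexpr double budget(int ncores) { return 16.0 * ncores * cycles_per_sample; }
```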

SLIDE 4

SIMD

Credit: Decora at English Wikipedia, CC Attribution-Share Alike 3.0

◮ Act on multiple values simultaneously in one instruction
◮ Machines I can access have AVX2 with 256-bit registers, ie 16 16-bit numbers at a time
◮ Now our back-of-envelope looks better: 16 × Ncore clock cycles per sample
◮ Just got access to a system at CERN with AVX-512

SLIDE 5

How I use SIMD

(ch0, t0) (ch1, t0) (ch2, t0) (ch3, t0) ... (ch15, t0)
(ch0, t1) (ch1, t1) (ch2, t1) (ch3, t1) ... (ch15, t1)

◮ A register holds the samples for 16 channels at the same time tick
◮ Makes operations on adjacent ticks in the same channel easy
◮ (Makes operations on adjacent channels in the same tick hard)
◮ (Incidentally, this is the opposite of the order I store the input in memory, but it doesn’t seem to hurt too much?)

SLIDE 6

Extracting waveforms

◮ Much easier to work on this outside larsoft, so I extracted waveforms from non-zero-suppressed MC using gallery (http://art.fnal.gov/gallery/)

◮ Converted waveforms to text format: simple to import/plot from C++ and python

SLIDE 7

Raw waveforms

[Plot: raw waveforms (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

◮ Some channels from SN MC which I selected because they look nice
◮ Reminder: FD MC noise is low, no coherent noise

SLIDE 8

Step 1: pedestal finding

[Plot: raw waveform with the pedestal estimate overlaid (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

◮ Intuition: the pedestal is the median of the waveform
◮ “Frugal streaming”¹ gives an approximation that converges to the median:

  • 1. Start with an estimate of the median, read the next sample
  • 2. If sample > median, increase median by 1
  • 3. If sample < median, decrease median by 1

◮ Unfortunately this “follows” hits too much, so try a modification...

¹ https://arxiv.org/abs/1407.1121
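The three steps above can be sketched as scalar code (a minimal non-SIMD reference; the function name is mine):

```cpp
#include <cstdint>
#include <vector>

// Frugal-streaming median estimate: nudge the estimate by 1 ADC count
// towards each new sample. Converges to (approximately) the median.
std::int16_t frugal_median(const std::vector<std::int16_t>& samples,
                           std::int16_t estimate)
{
    for (std::int16_t s : samples) {
        if (s > estimate) ++estimate;
        else if (s < estimate) --estimate;
    }
    return estimate;
}
```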

SLIDE 9

Step 1: modified pedestal finding

[Plot: waveform with the modified pedestal estimate overlaid (y: 400 to 600 ADC, x: time ticks 1000 to 4000)]

  • 1. Start with an accumulator=0, an estimate of the median, and read the next sample
  • 2. If sample > median, increase accumulator by 1
  • 3. If sample < median, decrease accumulator by 1
  • 4. If accumulator = X, increase median by 1, reset accumulator to 0
  • 5. If accumulator = −X, decrease median by 1, reset accumulator to 0

◮ I used X = 10 because it was the first number I thought of
◮ Larger values of X mean you follow hits less, but respond less to real changes in the pedestal. For serious work, this would need some investigation
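The modified algorithm in scalar form (again just a reference sketch, with X = 10 as above):

```cpp
#include <cstdint>
#include <vector>

// Modified frugal pedestal: only move the median estimate after the
// accumulator reaches +/-X, so short hits barely move the estimate.
std::int16_t pedestal(const std::vector<std::int16_t>& samples,
                      std::int16_t median, int X = 10)
{
    int accum = 0;
    for (std::int16_t s : samples) {
        if (s > median) ++accum;
        else if (s < median) --accum;
        if (accum == X)  { ++median; accum = 0; }
        if (accum == -X) { --median; accum = 0; }
    }
    return median;
}
```

A short hit (fewer than X samples over the estimate) leaves the pedestal untouched, while a genuine baseline shift is still followed, one count per X samples.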

SLIDE 10

Step 2: noise filtering

[Plot: noise-filtered waveform (y: -10000 to 10000, x: time ticks 1000 to 4000)]

◮ I used a simple FIR lowpass filter (= a discrete convolution with a fixed function)
◮ Hardcoded filter size (7 taps), unrolled inner loop. Lowpass filter with cutoff at 0.1 of the Nyquist frequency

◮ I’m using integer coefficients, which is why the scale changed
◮ Probably need a bigger filter for more realistic noise
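A scalar version of such a 7-tap integer FIR might look like this; the coefficients here are illustrative (a symmetric lowpass shape), not the ones used for the slide, which would come from a proper filter-design tool:

```cpp
#include <cstdint>
#include <vector>

// 7-tap FIR lowpass with integer coefficients, unrolled inner loop.
// The output is scaled up by the sum of the taps, which is why the
// y-scale of the plot changes.
std::vector<std::int32_t> fir7(const std::vector<std::int16_t>& x)
{
    // Illustrative symmetric taps (sum = 16); NOT the slide's coefficients
    static const std::int32_t c[7] = {1, 2, 3, 4, 3, 2, 1};
    std::vector<std::int32_t> y;
    if (x.size() < 7) return y;
    for (std::size_t i = 6; i < x.size(); ++i) {
        y.push_back(c[0]*x[i-6] + c[1]*x[i-5] + c[2]*x[i-4] +
                    c[3]*x[i-3] + c[4]*x[i-2] + c[5]*x[i-1] + c[6]*x[i]);
    }
    return y;
}
```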

SLIDE 11

Step 3: hit finding

[Plot: filtered waveform with found hits (y: -10000 to 10000, x: time ticks 1000 to 4000)]

◮ Algorithm: the first sample over a fixed threshold starts a hit; integrate time and charge until the waveform falls below threshold again

◮ Could make the threshold depend on the pedestal RMS; require a number of samples above threshold; emit multiple primitives for long time-over-threshold
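The basic algorithm (without any of the extensions just listed) can be sketched as follows; the `Hit` fields and function name are my own:

```cpp
#include <cstdint>
#include <vector>

struct Hit { std::size_t start; std::size_t tot; std::int64_t charge; };

// Threshold hit finder: a sample over threshold opens a hit; integrate
// time-over-threshold and charge until the waveform falls below again.
std::vector<Hit> find_hits(const std::vector<std::int32_t>& w,
                           std::int32_t threshold)
{
    std::vector<Hit> hits;
    bool in_hit = false;
    for (std::size_t t = 0; t < w.size(); ++t) {
        if (w[t] > threshold) {
            if (!in_hit) { hits.push_back({t, 0, 0}); in_hit = true; }
            hits.back().tot += 1;
            hits.back().charge += w[t];
        } else {
            in_hit = false;
        }
    }
    return hits;
}
```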

SLIDE 12

Benchmarking results summary

◮ Tested with a chunk of collection channel MC large enough to not fit in cache
◮ With about 4 threads, the multicore CPU I tested on can keep up with 1 APA’s worth of data
◮ More details in the backups

SLIDE 13

Extensions/TODOs for benchmarking

◮ Consider more realistic input data, like from the electronics:
  ◮ Samples are 12-bit numbers, not 16-bit
  ◮ Ordering of channels is different
  ◮ Input is 8b/10b encoded
◮ Run on a different machine with more cores, no virtualization
◮ Time individual steps, vary parameters (eg # of taps)
◮ Check the distribution of timings (eg, do we occasionally get very long times?)
◮ Eventually would test more complex algorithms
◮ Stream data into memory, eg using a GPU (idea from Babak)

SLIDE 14

Algorithm extensions

◮ Deal with coherent noise somehow. Eg the MicroBooNE technique: subtract the median of a group of channels at the same tick
◮ MicroBooNE has “harmonic” noise at fixed frequencies. Would require a large FIR filter to deal with. Not sure if there is another technique available
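A scalar sketch of that MicroBooNE-style coherent-noise subtraction (the function name is mine, and the channel grouping is left to the caller):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// For each tick, subtract the median across a group of channels from
// every channel in the group. wf[ch][t] holds pedestal-subtracted samples;
// all channels are assumed to have the same number of ticks.
void subtract_group_median(std::vector<std::vector<std::int16_t>>& wf)
{
    if (wf.empty()) return;
    const std::size_t nch = wf.size();
    const std::size_t nticks = wf[0].size();
    for (std::size_t t = 0; t < nticks; ++t) {
        // Collect this tick's samples across the group of channels
        std::vector<std::int16_t> col(nch);
        for (std::size_t ch = 0; ch < nch; ++ch) col[ch] = wf[ch][t];
        // Median via partial sort
        std::nth_element(col.begin(), col.begin() + nch/2, col.end());
        const std::int16_t med = col[nch/2];
        for (std::size_t ch = 0; ch < nch; ++ch) wf[ch][t] -= med;
    }
}
```

Note this is awkward in the register layout from slide 5, where a register holds 16 channels at one tick: the per-tick median needs comparisons across lanes, which is exactly the "operations on adjacent channels in the same tick" case flagged as hard.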

SLIDE 15

Physics performance studies

◮ We need to understand how well any given algorithm performs, especially in the presence of more realistic noise. I haven’t done this at all
◮ This also needs a more serious noise model (which doesn’t have to be in larsoft: it can be standalone, glued to the larsoft signal simulation)

SLIDE 16

Backup slides

SLIDE 17

Detour: Memory hierarchy and bandwidth

https://software.intel.com/en-us/articles/memory-performance-in-a-nutshell

◮ Main memory bandwidth sets an upper limit on how much data we can process
◮ 100 GB/s is more than enough to handle 1 or 2 APAs
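The comparison can be made concrete; assuming collection channels only and 16-bit samples, consistent with the rest of the slides:

```cpp
// Input data rate for one APA's collection wires with 16-bit samples:
// roughly 3.84 GB/s, small compared to ~100 GB/s main-memory bandwidth
constexpr double apa_gb_per_s =
    2'000'000.0   // samples per second per wire
    * 960         // collection wires per APA
    * 2           // bytes per sample (short int)
    / 1e9;        // bytes -> GB
```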

SLIDE 18

Measuring memory bandwidth

◮ Can we actually achieve this memory bandwidth?
◮ Used the STREAM benchmark², which effectively just does memcpy
◮ Ran on dunegpvm01. With 1 thread, get ∼10 GB/s; with 4 threads, get ∼35 GB/s (17.5 GB/s in + 17.5 GB/s out)

² https://github.com/jeffhammond/STREAM, http://www.cs.virginia.edu/stream/ref.html

SLIDE 19

Strategy details

◮ Run on DUNE FD detector MC (it’s all I’ve got...)
◮ Use the simplest algorithms I can think of
◮ Use only collection channels, all calculations with short integers (16 bits)
◮ Write a simple non-SIMD code to check the results
◮ SIMD code written in C++ using “intrinsic” functions
◮ Nicer interfaces exist, though I haven’t tried them. Eg http://quantstack.net/xsimd, http://agner.org/optimize/#vectorclass

SLIDE 20

What code with intrinsics looks like

// s holds the samples in 16 channels at the same tick
// This whole block achieves the following:
//   if the sample s is greater than median, add one to accum
//   if the sample s is less than median, subtract one from accum
// For reasons that I don't understand, there's no cmplt
// for "compare less-than", so we have to compare greater,
// compare equal, and take everything else to be compared
// less-than
// 'epi16' is a type marker for '16-bit signed integer'

// Create masks for which channels are >, == median
__m256i is_gt = _mm256_cmpgt_epi16(s, median);
__m256i is_eq = _mm256_cmpeq_epi16(s, median);
// The value we add to the accumulator in each channel
__m256i to_add = _mm256_set1_epi16(-1);
// Really want an epi16 version of blendv, but the cmpgt and
// cmpeq functions set their epi16 parts to 0xffff or 0x0,
// so treating everything as epi8 works the same
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(1), is_gt);
to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(0), is_eq);
// Actually do the adding
accum = _mm256_add_epi16(accum, to_add);

SLIDE 21

Test details

◮ Use DUNE FD MC waveforms, as seen above
◮ Using 4492 samples × 1835 collection wires × 16 repeats = 69 APA·ms (since 960 collection wires/APA)
◮ Using short int (2 bytes) for samples, so the size is 252 MB (big enough to not fit in cache)
◮ Start with this data in memory, ordered like: (c0, t0), (c0, t1), ..., (c0, tN), (c1, t0), (c1, t1), ...
◮ Loop over the data, and store the output hits (not the intermediate steps). Repeat 10 times, take the average
◮ Timing doesn’t include putting the input data in memory, allocating the output buffer, or compacting output hits from the SIMD code
◮ Ran with multiple threads. Each thread gets a contiguous block of 16N channels to deal with
◮ No time chunking: all 4492 ticks get processed at once

SLIDE 22

System under test: 2 (system 1 in backups)

◮ lscpu:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
CPU MHz:               1200.281
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4200.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K

SLIDE 23

Timing results: system 2

Threads          1       2       4       8      16      32      64
Non-SIMD:
  ms          1322.3   667.1   347.5   195.0   110.4    80.6    86.3
  APA/server     0.1     0.1     0.2     0.4     0.6     0.9     0.8
  GB/s           0.4     0.7     1.4     2.5     4.4     6.1     5.7
SIMD:
  ms           124.9    76.0    48.4    24.3    14.9    10.9    11.6
  APA/server     0.6     0.9     1.4     2.8     4.6     6.3     5.9
  GB/s           3.9     6.5    10.1    20.2    33.1    45.0    42.3

◮ Apologies for the gigantic table: I’ve highlighted the most interesting values
◮ “APA/server” is just the ratio of “APA·ms of data processed” (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it’s greater than 1
◮ ie 2–4 cores can keep up with the data from one APA
◮ There’s a few 10s of % variation between runs in these numbers

SLIDE 24

System under test: 1

◮ dunegpvm01, for which lscpu says (in part):

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             4
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU MHz:               2299.998
BogoMIPS:              4599.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K

SLIDE 25

Timing results: system 1

Threads          1       2       4       8      16      32      64
Non-SIMD:
  ms          1453.3   722.3   480.8   419.8   400.1   379.6   360.2
  APA/server     0.0     0.1     0.1     0.2     0.2     0.2     0.2
  GB/s           0.3     0.7     1.0     1.2     1.2     1.3     1.4
SIMD:
  ms           122.9    65.4    49.9    39.7    41.3    41.1    43.1
  APA/server     0.6     1.1     1.4     1.7     1.7     1.7     1.6
  GB/s           4.0     7.5     9.8    12.4    11.9    12.0    11.4

◮ Apologies for the gigantic table: I’ve highlighted the most interesting values
◮ “APA/server” is just the ratio of “APA·ms of data processed” (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it’s greater than 1
◮ ie 2–4 cores can keep up with the data from one APA
◮ There’s a few 10s of % variation between runs in these numbers
◮ This system only has 4 cores, so don’t expect improvement past 4 threads
