

  1. DAQ algorithms on CPUs
  Philip Rodrigues, University of Oxford
  May 24, 2018

  2. Introduction
  ◮ Work done in the context of comparing the processing resources needed for CPU, GPU, FPGA
  ◮ I used the simplest possible pedestal subtraction, noise filtering and hit finding on FD MC, to get a "lower bound" on the resources needed
  ◮ I'll try to concentrate more on the algorithms than the performance today

  3. Back-of-envelope calculation
  ◮ Collection-wire samples/s/APA = 2e6 × 960 ≈ 2e9
  ◮ On a 2 GHz CPU, that gives us 1 clock cycle per sample if we want to handle 1 APA
  ◮ But all machines today have multiple cores, and we have single-instruction-multiple-data (SIMD)
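The arithmetic on this slide can be checked directly (the constants are from the slide; the 2 GHz clock is the slide's round figure, not a measured value):

```cpp
#include <cstdint>

// Back-of-envelope from the slide: 2 MHz sampling x 960 collection
// wires per APA gives the total sample rate for one APA.
constexpr std::int64_t samples_per_sec_per_wire = 2'000'000;
constexpr std::int64_t collection_wires_per_apa = 960;
constexpr std::int64_t samples_per_sec_per_apa =
    samples_per_sec_per_wire * collection_wires_per_apa;  // 1.92e9, i.e. ~2e9

// On a ~2 GHz CPU this leaves roughly one clock cycle per sample
// for one APA on one core.
constexpr double cpu_hz = 2.0e9;
constexpr double cycles_per_sample =
    cpu_hz / static_cast<double>(samples_per_sec_per_apa);  // ~1.04
```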

  4. SIMD
  (Figure: SIMD diagram. Credit: Decora at English Wikipedia, CC Attribution-Share Alike 3.0)
  ◮ Act on multiple values simultaneously in one instruction
  ◮ Machines I can access have AVX2 with 256-bit registers, ie 16 × 16-bit numbers at a time
  ◮ Now our back-of-envelope looks better: 16 × Ncores clock cycles per sample
  ◮ Just got access to a system at CERN with AVX-512

  5. How I use SIMD
  (Diagram: one register holds (ch0, t0), (ch1, t0), (ch2, t0), … (ch15, t0); the next holds (ch0, t1), (ch1, t1), (ch2, t1), … (ch15, t1))
  ◮ Register holds the samples for 16 channels at the same time tick
  ◮ Makes operations on adjacent ticks in the same channel easy
  ◮ (Makes operations on adjacent channels in the same tick hard)
  ◮ (Incidentally this is the opposite of the order I store the input in memory, but it doesn't seem to hurt too much?)

  6. Extracting waveforms
  ◮ Much easier to work on this outside larsoft, so I extracted waveforms from non-zero-suppressed MC using gallery (http://art.fnal.gov/gallery/)
  ◮ Converted waveforms to text format: simple to import/plot from C++ and python

  7. Raw waveforms
  (Plot: raw waveforms, ADC counts 400–600 vs ticks 0–4000)
  ◮ Some channels from SN MC which I selected because they look nice
  ◮ Reminder: FD MC noise is low, no coherent noise

  8. Step 1: pedestal finding
  (Plot: waveforms with pedestal estimate overlaid, ADC counts 400–600 vs ticks 0–4000)
  ◮ Intuition: the pedestal is the median of the waveform
  ◮ "Frugal streaming"¹ gives an approximation that converges to the median:
    1. Start with an estimate of the median, read the next sample
    2. If sample > median, increase median by 1
    3. If sample < median, decrease median by 1
  ◮ Unfortunately this "follows" hits too much, so try a modification…
  ¹ https://arxiv.org/abs/1407.1121
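The three steps above can be sketched as a scalar loop (a minimal sketch, not the talk's SIMD implementation; the function name is mine):

```cpp
#include <cstdint>
#include <vector>

// Frugal-streaming median estimate: nudge the estimate by +/-1
// toward each incoming sample. For a long, stationary input the
// estimate converges to (a neighbourhood of) the median.
std::int16_t frugal_median(const std::vector<std::int16_t>& samples,
                           std::int16_t median /* initial estimate */) {
    for (std::int16_t s : samples) {
        if (s > median) ++median;
        else if (s < median) --median;
    }
    return median;
}
```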

  9. Step 1: modified pedestal finding
  (Plot: waveforms with modified pedestal estimate overlaid, ADC counts 400–600 vs ticks 0–4000)
  1. Start with an accumulator = 0, an estimate of the median, and read the next sample
  2. If sample > median, increase accumulator by 1
  3. If sample < median, decrease accumulator by 1
  4. If accumulator = X, increase median by 1, reset accumulator to 0
  5. If accumulator = −X, decrease median by 1, reset accumulator to 0
  ◮ I used X = 10 because it was the first number I thought of
  ◮ Larger values of X mean you follow hits less, but respond less to real changes in the pedestal. For serious work, this would need some investigation
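The modified algorithm can be sketched like this (scalar sketch; the function name and the per-tick output vector are mine, but the update rule is the one on the slide):

```cpp
#include <cstdint>
#include <vector>

// Modified frugal streaming: only move the median estimate once the
// accumulator of +1/-1 "votes" reaches +/-X, so short excursions
// (hits) pull the pedestal less. X = 10 as on the slide.
std::vector<std::int16_t> find_pedestal(const std::vector<std::int16_t>& samples,
                                        std::int16_t median, int X = 10) {
    std::vector<std::int16_t> pedestal;
    pedestal.reserve(samples.size());
    int accum = 0;
    for (std::int16_t s : samples) {
        if (s > median) ++accum;
        else if (s < median) --accum;
        if (accum == X)  { ++median; accum = 0; }
        if (accum == -X) { --median; accum = 0; }
        pedestal.push_back(median);  // pedestal estimate at this tick
    }
    return pedestal;
}
```

With X = 10, the estimate moves at most one ADC count per 10 ticks, which is the trade-off the slide describes between following hits and following real pedestal drift.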

  10. Step 2: noise filtering
  (Plot: filtered waveforms, −10000 to 10000 vs ticks 0–4000)
  ◮ I used a simple FIR lowpass filter (= a discrete convolution with a fixed function)
  ◮ Hardcoded filter size (7 taps), unrolled inner loop. Lowpass filter with cutoff at 0.1 of the Nyquist frequency
  ◮ I'm using integer coefficients, which is why the scale changed
  ◮ Probably need a bigger filter for more realistic noise
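A scalar version of such a filter might look like the following (a sketch: the coefficients here are illustrative integer taps, not the actual ones used in the talk, and the loop is left un-unrolled for clarity):

```cpp
#include <cstdint>
#include <vector>

// 7-tap FIR lowpass with integer coefficients. The output is scaled
// up by the sum of the taps (32 here), matching the slide's remark
// that integer coefficients change the y-scale.
std::vector<std::int16_t> fir_lowpass(const std::vector<std::int16_t>& x) {
    static const std::int16_t taps[7] = {1, 3, 6, 12, 6, 3, 1};  // illustrative
    std::vector<std::int16_t> y;
    if (x.size() < 7) return y;
    y.reserve(x.size() - 6);
    for (std::size_t i = 0; i + 7 <= x.size(); ++i) {
        std::int32_t acc = 0;  // widen to avoid overflow in the sum
        for (int k = 0; k < 7; ++k) acc += taps[k] * x[i + k];
        y.push_back(static_cast<std::int16_t>(acc));
    }
    return y;
}
```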

  11. Step 3: hit finding
  (Plot: filtered waveforms with found hits marked, −10000 to 10000 vs ticks 0–4000)
  ◮ Algorithm: the first sample over a fixed threshold starts a hit. Integrate time and charge until we fall below threshold again
  ◮ Could make the threshold depend on the pedestal RMS; require a number of samples above threshold; emit multiple primitives for long time-over-threshold
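The algorithm in the first bullet can be sketched as (scalar sketch; the Hit struct and names are mine):

```cpp
#include <cstdint>
#include <vector>

struct Hit {
    std::size_t start;    // first tick over threshold
    std::size_t width;    // time over threshold, in ticks
    std::int32_t charge;  // sum of samples while over threshold
};

// Threshold hit finder as described on the slide: the first sample
// over threshold opens a hit; time and charge are integrated until
// the waveform falls below threshold again.
std::vector<Hit> find_hits(const std::vector<std::int16_t>& w,
                           std::int16_t threshold) {
    std::vector<Hit> hits;
    bool in_hit = false;
    Hit h{};
    for (std::size_t i = 0; i < w.size(); ++i) {
        if (w[i] > threshold) {
            if (!in_hit) { in_hit = true; h = Hit{i, 0, 0}; }
            ++h.width;
            h.charge += w[i];
        } else if (in_hit) {
            hits.push_back(h);
            in_hit = false;
        }
    }
    if (in_hit) hits.push_back(h);  // hit still open at end of waveform
    return hits;
}
```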

  12. Benchmarking results summary
  ◮ Tested with a chunk of collection-channel MC large enough not to fit in cache
  ◮ With about 4 threads, the multicore CPU I tested on can keep up with 1 APA's worth of data
  ◮ More details in backups

  13. Extensions/TODOs for benchmarking
  ◮ Consider more realistic input data, like from the electronics:
    ◮ Samples are 12-bit numbers, not 16-bit
    ◮ Ordering of channels is different
    ◮ Input is 8b/10b encoded
  ◮ Run on a different machine with more cores, no virtualization
  ◮ Time individual steps, vary parameters (eg # of taps)
  ◮ Check distribution of timings (eg, do we occasionally get very long times?)
  ◮ Eventually would test more complex algorithms
  ◮ Stream data into memory, eg using GPU (idea from Babak)

  14. Algorithm extensions
  ◮ Deal with coherent noise somehow. Eg the MicroBooNE technique: subtract the median of a group of channels at the same tick
  ◮ MicroBooNE also has "harmonic" noise at fixed frequencies, which would require a large FIR filter to deal with. Not sure whether another technique is available
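The median-subtraction idea in the first bullet might be sketched as follows (my sketch of the technique, not MicroBooNE's actual code; the function name and waveforms[channel][tick] layout are assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Coherent-noise removal sketch: at each tick, compute the median of
// a group of channels and subtract it from every channel in the
// group. waveforms[ch][t] holds pedestal-subtracted samples.
void subtract_group_median(std::vector<std::vector<std::int16_t>>& waveforms) {
    if (waveforms.empty()) return;
    const std::size_t nticks = waveforms[0].size();
    std::vector<std::int16_t> column(waveforms.size());
    for (std::size_t t = 0; t < nticks; ++t) {
        // Gather the samples at this tick across the channel group
        for (std::size_t ch = 0; ch < waveforms.size(); ++ch)
            column[ch] = waveforms[ch][t];
        // nth_element partially sorts so the middle element is the median
        std::nth_element(column.begin(),
                         column.begin() + column.size() / 2, column.end());
        const std::int16_t med = column[column.size() / 2];
        for (auto& w : waveforms) w[t] -= med;
    }
}
```

Note this needs the "hard" cross-channel access pattern mentioned on slide 5, which is one reason it isn't in the current benchmark.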

  15. Physics performance studies
  ◮ We need to understand how well any given algorithm performs, especially in the presence of more realistic noise. I haven't done this at all
  ◮ This also needs a more serious noise model (which doesn't have to be in larsoft: it can be standalone, glued to the larsoft signal simulation)

  16. Backup slides

  17. Detour: memory hierarchy and bandwidth
  (Figure: memory-hierarchy diagram from https://software.intel.com/en-us/articles/memory-performance-in-a-nutshell)
  ◮ Main-memory bandwidth sets an upper limit on how much data we can process
  ◮ ~100 GB/s is more than enough to handle 1 or 2 APAs

  18. Measuring memory bandwidth
  ◮ Can we actually achieve this memory bandwidth?
  ◮ Used the STREAM benchmark², which effectively just does memcpy
  ◮ Ran on dunegpvm01. With 1 thread, get ∼10 GB/s; with 4 threads, get ∼35 GB/s (17.5 GB/s in + 17.5 GB/s out)
  ² https://github.com/jeffhammond/STREAM, http://www.cs.virginia.edu/stream/ref.html
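The kind of measurement STREAM's "copy" kernel makes can be mimicked with a timed large copy (a rough sketch only; real STREAM is much more careful about repetitions, cache effects, and timer resolution):

```cpp
#include <chrono>
#include <cstring>
#include <vector>

// Rough single-threaded bandwidth estimate: time one large memcpy
// and count bytes read plus bytes written, as STREAM's copy kernel
// does. Buffers should be sized well beyond the L3 cache.
double measure_copy_bandwidth_gbs(std::size_t bytes = std::size_t(1) << 26) {
    std::vector<char> src(bytes, 1), dst(bytes);
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return 2.0 * static_cast<double>(bytes) / secs / 1e9;  // in + out
}
```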

  19. Strategy details
  ◮ Run on DUNE FD detector MC (it's all I've got…)
  ◮ Use the simplest algorithms I can think of
  ◮ Use only collection channels, all calculations with short integers (16 bits)
  ◮ Write simple non-SIMD code to check the results
  ◮ SIMD code written in C++ using "intrinsic" functions
  ◮ Nicer interfaces exist, though I haven't tried them. Eg http://quantstack.net/xsimd, http://agner.org/optimize/#vectorclass

  20. What code with intrinsics looks like

    // s holds the samples in 16 channels at the same tick
    // This whole block achieves the following:
    //   if the sample s is greater than median, add one to accum
    //   if the sample s is less than median, subtract one from accum
    // For reasons that I don't understand, there's no cmplt for
    // "compare less-than", so we have to compare greater, compare
    // equal, and take everything else to be compared less-than
    // 'epi16' is a type marker for '16-bit signed integer'

    // Create masks for which channels are >, == median
    __m256i is_gt = _mm256_cmpgt_epi16(s, median);
    __m256i is_eq = _mm256_cmpeq_epi16(s, median);
    // The value we add to the accumulator in each channel
    __m256i to_add = _mm256_set1_epi16(-1);
    // Really want an epi16 version of this, but the cmpgt and
    // cmpeq functions set their epi16 parts to 0xffff or 0x0,
    // so treating everything as epi8 works the same
    to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(1), is_gt);
    to_add = _mm256_blendv_epi8(to_add, _mm256_set1_epi16(0), is_eq);
    // Actually do the adding
    accum = _mm256_add_epi16(accum, to_add);

  21. Test details
  ◮ Use DUNE FD MC waveforms, as seen above
  ◮ Using 4492 samples × 1835 collection wires × 16 repeats = 69 APA·ms (since 960 collection wires/APA)
  ◮ Using short int (2 bytes) for samples, so the size is 252 MB (big enough not to fit in cache)
  ◮ Start with this data in memory, ordered like: (c0, t0), (c0, t1), … (c0, tN), (c1, t0), (c1, t1), …
  ◮ Loop over the data, and store the output hits (not the intermediate steps). Repeat 10 times, take the average
  ◮ Timing doesn't include putting the input data in memory, allocating the output buffer, or compacting the output hits from the SIMD code
  ◮ Ran with multiple threads. Each thread gets a contiguous block of 16N channels to deal with
  ◮ No time chunking: all 4492 ticks get processed at once

  22. System under test: 2 (system 1 in backups)
  ◮ lscpu:
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                32
      Thread(s) per core:    2
      Core(s) per socket:    8
      Socket(s):             2
      NUMA node(s):          2
      Vendor ID:             GenuineIntel
      Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
      CPU MHz:               1200.281
      CPU max MHz:           3000.0000
      CPU min MHz:           1200.0000
      BogoMIPS:              4200.00
      Virtualization:        VT-x
      L1d cache:             32K
      L1i cache:             32K
      L2 cache:              256K
      L3 cache:              20480K

  23. Timing results: system 2

    Threads            1       2       4       8      16      32      64
    Non-SIMD
      ms          1322.3   667.1   347.5   195.0   110.4    80.6    86.3
      APA/server     0.1     0.1     0.2     0.4     0.6     0.9     0.8
      GB/s           0.4     0.7     1.4     2.5     4.4     6.1     5.7
    SIMD
      ms           124.9    76.0    48.4    24.3    14.9    10.9    11.6
      APA/server     0.6     0.9     1.4     2.8     4.6     6.3     5.9
      GB/s           3.9     6.5    10.1    20.2    33.1    45.0    42.3

  ◮ Apologies for the gigantic table: I've highlighted the most interesting values
  ◮ "APA/server" is just the ratio of "APA·ms data processed" (69, in this case) to ms elapsed, so take it with a pinch of salt beyond observing whether it's greater than 1
  ◮ ie 2–4 cores can keep up with the data from one APA
  ◮ There's a few 10s of % variation between runs in these numbers
