
TID-AIR Electronics: DUNE FD DAQ: ATCA RCE-based Solution



  1. TID-AIR Electronics: DUNE FD DAQ: ATCA RCE-based Solution
     Matt Graham, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell
     Oct. 30-31, 2017

  2. Readout Overview

  3. RCE Platform Clustering
     [Diagram: four COBs, each with DPM 0-3, a DTM and an Ethernet switch; off-shelf link]
     • The RCE nodes are interconnected through Ethernet
     • Each COB contains a low-latency 10/40 Gbps Ethernet switch
       - Cut-through latency < 300 ns
     • The COB supports a full-mesh 14-slot backplane
       - Each COB has a direct 10 Gbps link to every other COB in a crate
       - Any RCE in an ATCA shelf has a maximum of two switches between it and every other RCE
       - 14 * 8 = 112 RCEs in a low-latency cluster
     • Reliable UDP protocol allows direct firmware-to-firmware data sharing
     • Allows for low-latency data sharing between nodes
       - Edge channels
       - APA combining
       - Neural network data sharing
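
The cluster arithmetic on this slide can be sketched in a few lines of Python. The 14 slots and 8 RCEs per COB come from the slide's own `14 * 8 = 112`. The function names and the rule that two RCEs on the same COB cross exactly one switch are assumptions for illustration; the slide only states a maximum of two.

```python
SLOTS_PER_SHELF = 14  # full-mesh 14-slot backplane (from the slide)
RCES_PER_COB = 8      # implied by the slide's 14 * 8 = 112


def cluster_size(slots=SLOTS_PER_SHELF, rces_per_cob=RCES_PER_COB):
    """Number of RCEs reachable inside one ATCA shelf."""
    return slots * rces_per_cob


def switches_between(cob_a, cob_b):
    """Switches crossed between RCEs on two COBs of the same shelf.

    Assumption: same COB -> only that COB's switch; different COBs ->
    the source and destination switches, joined by the direct mesh link.
    """
    for cob in (cob_a, cob_b):
        if not 0 <= cob < SLOTS_PER_SHELF:
            raise ValueError(f"COB slot {cob} is outside the shelf")
    return 1 if cob_a == cob_b else 2


if __name__ == "__main__":
    print(cluster_size())           # RCEs per shelf
    print(switches_between(0, 13))  # worst case across the backplane
```

Because the backplane is a full mesh, the worst case does not grow with the number of occupied slots.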

  4. Data Flow In RCEs
     [Diagram blocks: Timing Interface; Front Ends; Filtering; Feature Extraction & Other Processing; Compression; Network Stack, SW & FW? (RUDP); Back End DAQ; SuperNova Pre-Buffer; SuperNova Post-Buffer]
     ● Flexible architecture allows front ends to be allocated across RCEs as needed
       ○ Simply add more cards and move fibers
     ● Target is 640 channels per RCE (1x APA per COB)

  5. Interleaved Data Channel Filtering
     ● Based on MicroBooNE data, an initial study suggests that we need at minimum a 16-tap FIR filter
       ○ Two notch filters + high-pass filter
       ○ More taps for more complex filters
     ● Built from Xilinx FIR IP core (v7.2)
       ○ Interleaves the channels to share FPGA resources
       ○ Unique coefficient set is specified for each interleaved data channel
         ■ 64 interleaved data channels per FIR module
         ■ Coefficients set filter shape, allows different filter approaches per channel
       ○ Assume a fast processing clock: 200 MHz
         ■ 100x faster than 2 MHz packet rate

     | Number of taps | kLUTs/IPCore (kLUTs/DPM) | kFFs/IPCore (kFFs/DPM) | DSPs/IPCore (DSP48/DPM) |
     |----------------|--------------------------|------------------------|-------------------------|
     | 16-TAP         | 2.37 (23.7)              | 2.145 (21.45)          | 16 (160)                |
     | 24-TAP         | 3.285 (32.85)            | 2.667 (26.67)          | 24 (240)                |
     | 32-TAP         | 4.201 (42.01)            | 3.188 (31.88)          | 32 (320)                |
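
As a rough software model of the interleaving idea, the sketch below runs one shared multiply-accumulate loop over all channels in turn, with a separate coefficient set and delay line per channel. It is not the Xilinx FIR IP core; the class and function names are hypothetical. The two helpers reproduce the slide's clock budget (200 MHz / 2 MHz = 100 cycles per sample) and show that 640 channels at 64 channels per core need 10 cores, which is consistent with the per-DPM figures in the table being 10 times the per-core figures.

```python
from collections import deque


class InterleavedFir:
    """Time-multiplexed FIR: one engine, per-channel coefficients and state."""

    def __init__(self, coeffs_per_channel):
        self.coeffs = [list(c) for c in coeffs_per_channel]
        # One delay line per channel, newest sample at index 0.
        self.delay = [deque([0] * len(c), maxlen=len(c)) for c in self.coeffs]

    def push_frame(self, samples):
        """Take one sample per channel for a single tick, return the outputs."""
        if len(samples) != len(self.coeffs):
            raise ValueError("need exactly one sample per channel")
        out = []
        for ch, x in enumerate(samples):
            line = self.delay[ch]
            line.appendleft(x)  # maxlen drops the oldest sample automatically
            out.append(sum(c * s for c, s in zip(self.coeffs[ch], line)))
        return out


def cycles_per_sample(clock_hz=200_000_000, sample_hz=2_000_000):
    """Processing-clock cycles available between consecutive ADC samples."""
    return clock_hz // sample_hz


def cores_per_dpm(channels_per_dpm=640, channels_per_core=64):
    """FIR cores needed to cover one DPM (ceiling division)."""
    return -(-channels_per_dpm // channels_per_core)


if __name__ == "__main__":
    fir = InterleavedFir([[1, 0], [0, 1]])  # pass-through and one-tick delay
    print(fir.push_frame([5, 7]))
    print(fir.push_frame([1, 2]))
    print(cycles_per_sample(), cores_per_dpm())
```

With 100 cycles per sample and 64 channels per core, each channel gets at least one cycle per tick on the shared engine, which is the headroom the slide relies on.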

  6. Supernova Buffering In Two Stages
     ● Pre-trigger buffer stores data in a ring buffer while waiting for a supernova trigger
       ○ 640 channels per RCE (1x APA per COB)
       ○ 2 MHz ADC sampling rate
       ○ 12 bits per ADC
       ○ Bandwidth: 15.36 Gbps (1.92 GB/s)
         ■ 640 x 2 MHz x 12b
       ○ Each DPM has 32 GB RAM:
         ■ 19.2 TB of DDR4 RAM for the whole system across 150x COBs
       ○ Total memory for supernova buffering: 31 GB
         ■ PL 16 GB + PS 15 GB (1 GB for kernel & OS)
       ○ Without compression: 16.1 seconds of pre-trigger buffer
         ■ Assuming 12-bit packing to remove the 4-bit overhead when packing into bytes
     ● Post-trigger buffer stores data in flash-based SSD before backend DAQ
       ○ Write sequence occurs once per supernova trigger
         ■ Low write wear over experiment lifetime
       ○ Support up to 2x mSATA/RCE
       ○ Low-bandwidth background readout post trigger
         ■ Does not impact normal data taking
       ○ ~$100K for mSATA SSD buffering
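
The buffer-depth figures on this slide follow from simple arithmetic, sketched below. All inputs are taken from the slide; the function names are illustrative, and the use of decimal units (1 GB = 10^9 bytes) is an assumption, chosen because it reproduces the quoted 16.1 seconds.

```python
def adc_bandwidth_bps(channels=640, sample_hz=2_000_000, bits_per_sample=12):
    """Raw ADC bit rate into one RCE."""
    return channels * sample_hz * bits_per_sample


def buffer_seconds(buffer_bytes=31e9, channels=640, sample_hz=2_000_000,
                   stored_bits_per_sample=12):
    """Seconds of data the pre-trigger ring buffer can hold.

    stored_bits_per_sample=12 models tight 12-bit packing; use 16 to see
    the cost of storing each sample in two whole bytes.
    """
    rate_bps = channels * sample_hz * stored_bits_per_sample
    return buffer_bytes * 8 / rate_bps


def system_ram_tb(cobs=150, dpms_per_cob=4, gb_per_dpm=32):
    """Total DDR4 across the system, in decimal terabytes."""
    return cobs * dpms_per_cob * gb_per_dpm / 1000


if __name__ == "__main__":
    print(adc_bandwidth_bps() / 1e9, "Gbps")
    print(round(buffer_seconds(), 1), "s with 12-bit packing")
    print(round(buffer_seconds(stored_bits_per_sample=16), 1), "s unpacked")
    print(system_ram_tb(), "TB")
```

Storing samples in 16-bit words instead of packed 12-bit fields shortens the buffer by a quarter, which is why the slide calls out the packing assumption.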

  7. Optional Compression
     ● Compression is costly in FPGA resources
       ○ Plan to compress in the ARM processor before shipping out the trigger data to ART-DAQ
       ○ Compression in the PS is made possible by the improved Zynq UltraScale+ CPU
     ● Estimated FPGA resource cost if compression were done in the PL:

     | Algorithm                       | kLUTs/DPM | kFFs/DPM  | DSP48/DPM | RAM(Mb)/DPM | Note                                |
     |---------------------------------|-----------|-----------|-----------|-------------|-------------------------------------|
     | Arithmetic Probability Encoding | 292 (86%) | 120 (18%) | 75 (<1%)  | 22.3 (38%)  | Currently implemented in proto-DUNE |
     | Huffman                         | 143 (43%) | 60 (9%)   | 75 (<1%)  | 22.3 (38%)  | Estimated from JJ Russell           |

     Callout on the Huffman row: almost half of XCZU15EG LUTs

  8. Deploying Neural Networks In DUNE DAQ
     FPGAs are a great fit for neural network deployment
     ● Current research on quantized 8-bit classification in FPGAs
     ● Layer processing can be pipelined to allow for high-frame-rate classification
     ● Layers can be deployed across multiple FPGAs
     ● Data sharing via interconnects when required
     ● First-layer processing can occur in the same FPGA as pre-filtering
     ● Later classification layers can be implemented in the back-end DAQ using GPUs or FPGA co-processors
     ● Resource estimates for possible neural network deployment approaches are ongoing
       ○ Full LeNet implemented easily in an RCE-like FPGA
       ○ Working on VGG-16 network
     [Diagram: three rows, each labeled Detector Region, Conv, Conv, Full, FPGA, FPGA]

  9. RCE Firmware Resource Estimates

     | FW Module     | kLUTs | kFFs   | DSPs | BRAM/URAM (Mb) | Notes                          |
     |---------------|-------|--------|------|----------------|--------------------------------|
     | RCE Core      | 34.76 | 43.7   | 0    | 4.68           | Based on existing RCE core     |
     | PL DDR MIG    | 18.12 | 20.33  | 3    | 0.94           | Based on Xilinx IP core        |
     | Data Receiver | 7.01  | 27.4   | 0    | 2.85           | Based on existing WIB Receiver |
     | Filtering     | 42.01 | 31.88  | 320  | 0              | Based on Xilinx IP core        |
     | Hit-finding   | TBD   | TBD    | TBD  | TBD            | More investigation needed      |
     | Total         | 101.9 | 123.31 | 323  | 8.47           | Summation of modules above     |
     | Utilization   | 30%   | 18%    | 9%   | 15%            | Based on XCZU15EG-1FFVB1156E   |

     ● Even with filtering, plenty of FPGA resources remain available for additional processing
       ○ Hit finding
       ○ Convolution/classification layers for neural network image recognition
       ○ Lower-resource compression algorithm (being investigated by JJ)
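
The Total and Utilization rows can be re-derived from the per-module figures, as in the sketch below. The module numbers are from this slide and the device capacities are the XCZU15EG values from the Programmable Logic comparison slide (341 kLUTs, 683 kFFs, 3,528 DSP48, 26.2 Mb BRAM + 31.5 Mb URAM). Combining BRAM and URAM into one capacity figure is an assumption, made because it matches the quoted 15%. Hit-finding is left out because its figures are still TBD.

```python
# (kLUTs, kFFs, DSPs, BRAM/URAM Mb) per firmware module, from the slide.
MODULES = {
    "RCE Core": (34.76, 43.7, 0, 4.68),
    "PL DDR MIG": (18.12, 20.33, 3, 0.94),
    "Data Receiver": (7.01, 27.4, 0, 2.85),
    "Filtering": (42.01, 31.88, 320, 0),
}

# XCZU15EG capacity; BRAM and URAM are summed (assumption, see lead-in).
DEVICE = (341.0, 683.0, 3528, 26.2 + 31.5)


def totals(modules=MODULES):
    """Column-wise sum over all modules, rounded to two decimals."""
    return tuple(round(sum(col), 2) for col in zip(*modules.values()))


def utilization_percent(modules=MODULES, device=DEVICE):
    """Whole-percent utilization of each resource type."""
    return tuple(round(100 * used / cap)
                 for used, cap in zip(totals(modules), device))


if __name__ == "__main__":
    print("totals:     ", totals())
    print("utilization:", utilization_percent())
```

Adding a hit-finding entry to `MODULES` once estimates exist would show directly how much of the remaining headroom it consumes.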

  10. Waveform Extraction
     • A version of waveform extraction was done for 35-ton.
       – Not used, due to deployment time limitations
     • Resource usage was modest (based on 35-ton effort).
       – Estimate ~100 kLUTs for 5 x 128 channel groups
         • Much of the logic dealt with pseudo-signal processing
         • This logic moves to the filtering section (previous slide)
         • Needs more work to minimize the resource requirements if used in DUNE
     • Trigger primitives would be a natural by-product of the extracted waveforms
       – Since the data is streaming, this will be a low-latency path to use for trigger formation.
     • Learned a lot since that implementation
       – The 35-ton implementation worked on packets of 1024 samples
         • Would go to a continuous streaming model
         • Packetizing would be done afterwards for practical reasons
     ● See slides from JJ Russell here: https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing

  11. Current RCE Upgrade Efforts

  12. New COB design for LSST: Switch Upgrade
     ● Adding 40 GbE support to the DPMs and backplane interfaces
     ● Significantly increasing front panel bandwidth
       ○ From 10 Gbps to 200 Gbps
     ● COB design embeds first-level network switch layer within the ATCA crate
       ○ Reduces facility network requirements, direct connect to InfiniBand-enabled switches
       ○ Direct link to server room per rack or COB

  13. DPM Redesign for DUNE
     ● Oxford/SLAC collaboration
     ● Optimized for large memory buffering on the DPM
     ● 32 GB per DPM for supernova buffering
       ○ 16 GB PS (only supports dual rank)
       ○ 16 GB PL
         ■ Wired up to support quad-rank SODIMM for possible 32 GB memory
     ● 20 of 24 HS RTMs supported
       ○ 80 links/COB @ 1.25 Gbps (8B/10B)
       ○ 20 links/COB @ 5 Gbps (8B/10B)
       ○ 8 links/COB @ 10 Gbps (64B/66B)
     ● Standard PL 10/40 GbE interface + 2x PS 1 GbE
     ● Supports up to 2x mSATA/DPM on the RTM

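  14. Comparing Technical Specifics: Processor Side (PS)

     | Parameter                  | proto-DUNE | DUNE      |
     |----------------------------|------------|-----------|
     | Application CPU: # CPUs    | 2          | 4         |
     | Application CPU: frequency | 800MHz     | 1.2GHz    |
     | RT CPU: # CPUs             | 0          | 2         |
     | RT CPU: frequency          | N/A        | 500MHz    |
     | GPU: # CPUs                | 0          | 2         |
     | GPU: frequency             | N/A        | 600MHz    |
     | PS DDR: type               | DDR3       | DDR4      |
     | PS DDR: size               | 1GB        | 16GB      |
     | PS DDR: width              | 32-bit     | 64-bit    |
     | PS DDR: switching speed    | 1066Mbps   | 2400Mbps  |
     | PS DDR: raw peak bandwidth | 34Gbps     | 153.6Gbps |

     DUNE: ~4 times more bandwidth
     https://www.xilinx.com/support/documentation/data_sheets/ds925-zynq-ultrascale-plus.pdf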
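
The raw peak bandwidth entries are the product of bus width and per-pin switching speed. A minimal sketch, using only the width and speed values from the comparison above (the function name is illustrative):

```python
def raw_peak_gbps(width_bits, speed_mbps):
    """Raw peak DDR bandwidth in Gbps: bus width times per-pin data rate."""
    return width_bits * speed_mbps / 1000


if __name__ == "__main__":
    proto_dune = raw_peak_gbps(32, 1066)  # DDR3, 32-bit
    dune = raw_peak_gbps(64, 2400)        # DDR4, 64-bit
    print(round(proto_dune, 1), "Gbps (proto-DUNE)")
    print(round(dune, 1), "Gbps (DUNE)")
    print(round(dune / proto_dune, 1), "x improvement")
```

These are theoretical peaks; sustained throughput depends on access pattern and controller efficiency, which the slide does not quantify.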

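  15. Comparing Technical Specifics: Programmable Logic (PL)

     | Device                           | kLUTs | kFFs  | DDR  | BRAM   | URAM   | DSP48 |
     |----------------------------------|-------|-------|------|--------|--------|-------|
     | XC7Z045-2FFG900E (proto-DUNE)    | 218   | 437.2 | 0GB  | 19.2Mb | 0Mb    | 900   |
     | XCZU15EG-1FFVB1156E (DUNE)       | 341   | 683   | 16GB | 26.2Mb | 31.5Mb | 3,528 |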

  16. Other Possible RCE/COB Modifications
     ● Move COB/RCE platform to pizza box
       ○ Possible cost reduction
       ○ High risk for maintainability and reliability
         ■ Fan failures are a high risk
         ■ Power supply failures are a second-level risk
         ■ Filtering will be required, but could be done at the rack level
         ■ Loss of telecom-class reliability and uptime
     ● Simplify COB and remove network switch
       ○ Keep some local interconnects for data sharing
       ○ Cost reduction possibilities
         ■ Shifts networking cost to outside the crate
     ● COB-like ATCA carrier for commercial Zynq modules
       ○ Currently being investigated
       ○ Cost reduction relative to current COB/DPM is unclear due to loss in density

  17. RCE/COB Based Solution Costing

  18. ATCA Packaging for DUNE
     ● 1 APA = 2560 channels
     ● 1 APA per COB
       ○ 4 DPMs per COB
       ○ 640 channels per DPM
       ○ Assuming 128 channels per high-speed link
         ■ 5x high-speed links per DPM
     ● 150 APAs for the entire system = 150 COBs
     ● Total rack space: 165U
       ○ 11x 14-slot ATCA crates
       ○ 15U per 14-slot ATCA crate
         ■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
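
The packaging numbers chain together as shown in this short sketch. Every input is from the slide; the function and key names are illustrative, and filling each crate with one COB per slot is an assumption.

```python
import math


def packaging(apas=150, channels_per_apa=2560, dpms_per_cob=4,
              channels_per_link=128, slots_per_crate=14, u_per_crate=15):
    """Derive per-DPM and rack-level figures from the slide's inputs."""
    channels_per_dpm = channels_per_apa // dpms_per_cob
    links_per_dpm = math.ceil(channels_per_dpm / channels_per_link)
    cobs = apas  # one APA per COB
    crates = math.ceil(cobs / slots_per_crate)  # assumes one COB per slot
    return {
        "channels_per_dpm": channels_per_dpm,
        "links_per_dpm": links_per_dpm,
        "cobs": cobs,
        "crates": crates,
        "rack_units": crates * u_per_crate,
    }


if __name__ == "__main__":
    for name, value in packaging().items():
        print(f"{name}: {value}")
```

With 150 COBs in 14-slot crates, the eleventh crate is only partly filled (10 of 14 slots), leaving a few spare slots.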
