 
Oct. 30-31, 2017
TID-AIR Electronics: DUNE FD DAQ: ATCA RCE-based Solution
Matt Graham, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell

Readout Overview

RCE Platform Clustering
[Figure: COBs, each containing DPM 0-3, a DTM, and an Ethernet switch, joined to each other and to an off-shelf link]
• The RCE nodes are interconnected through Ethernet
• Each COB contains a low latency 10/40Gbps Ethernet switch
   - Cut through latency < 300ns
• The COB supports a full mesh 14-slot backplane
   - Each COB has a direct 10Gbps link to every other COB in a crate
   - Any RCE in an ATCA shelf has a maximum of two switches between it and every other RCE
   - 14 * 8 = 112 RCEs in a low latency cluster
• Reliable UDP protocol allows direct firmware to firmware data sharing
• Allows for low latency data sharing between nodes
   - Edge channels
   - APA combining
   - Neural Network data sharing
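
The two-switch bound follows from the full mesh: traffic leaves through the source COB's switch and arrives through the destination COB's switch. A minimal Python sketch of that counting argument, assuming a hypothetical `(slot, rce)` addressing scheme that is not part of the RCE software:

```python
SLOTS_PER_SHELF = 14   # full mesh 14-slot backplane (from the slide)
RCES_PER_COB = 8       # taken from the slide's 14 * 8 = 112

def switches_on_path(src, dst):
    """Count Ethernet switches between two RCEs in one shelf.

    src and dst are hypothetical (slot, rce) tuples. Each COB carries one
    switch, and every COB has a direct link to every other COB.
    """
    src_slot, _ = src
    dst_slot, _ = dst
    # Same COB: only the local switch. Different COBs: local + remote switch.
    return 1 if src_slot == dst_slot else 2

all_rces = [(s, r) for s in range(SLOTS_PER_SHELF) for r in range(RCES_PER_COB)]
cluster_size = len(all_rces)
worst_case = max(
    switches_on_path(a, b) for a in all_rces for b in all_rces if a != b
)
print(cluster_size, worst_case)
```

The model ignores the off-shelf link, so it only describes RCEs within a single shelf.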

Data Flow In RCEs
[Figure: data flow block diagram. Blocks: Front Ends, Timing Interface, Filtering, Feature Extraction & Other Processing, Compression, Network Stack SW & FW (RUDP), Back End DAQ, SuperNova Pre-Buffer, SuperNova Post Buffer]
● Flexible architecture lets front ends be allocated across RCEs as needed
   ○ Simply add more cards and move fibers
● Target is 640 channels per RCE (1x APA per COB)

Interleaved Data Channel Filtering
● Based on MicroBooNE data, an initial study suggests that we need at least a 16-tap FIR filter
   ○ Two notch filters + high pass filter
   ○ More taps for more complex filters
● Built from Xilinx FIR IP core (v7.2)
   ○ Interleaves the channels to share FPGA resources
   ○ Unique coefficient set is specified for each interleaved data channel
      ■ 64 interleaved data channels per FIR module
      ■ Coefficients set filter shape, allowing different filter approaches per channel
   ○ Assume a fast processing clock: 200 MHz
      ■ 100x faster than 2 MHz packet rate

Number of taps   kLUTs/IPCore (kLUTs/DPM)   kFFs/IPCore (kFFs/DPM)   DSPs/IPCore (DSP48/DPM)
16-TAP           2.37  (23.7)               2.145 (21.45)            16 (160)
24-TAP           3.285 (32.85)              2.667 (26.67)            24 (240)
32-TAP           4.201 (42.01)              3.188 (31.88)            32 (320)
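
The interleaving idea can be sketched in software: one shared multiply-accumulate loop visits each channel in turn, using that channel's own coefficient set and delay line. This is a behavioral Python model only, not the Xilinx FIR IP core; the function name and data layout are hypothetical. The 640-channel figure comes from the channels-per-DPM target stated elsewhere in the deck.

```python
CLOCK_HZ = 200_000_000     # processing clock assumed on the slide
SAMPLE_HZ = 2_000_000      # per-channel sample rate
CHANNELS_PER_CORE = 64     # interleaved channels per FIR module
CHANNELS_PER_DPM = 640     # per-DPM channel target

# Clock cycles available between two consecutive samples of one channel.
slots_per_sample = CLOCK_HZ // SAMPLE_HZ
# FIR modules needed per DPM; this matches the x10 scaling in the table.
cores_per_dpm = CHANNELS_PER_DPM // CHANNELS_PER_CORE

def interleaved_fir(frames, coeffs):
    """Filter interleaved channels with per-channel coefficient sets.

    frames: list of time steps, each a list with one sample per channel.
    coeffs: one list of tap weights per channel (all the same length).
    """
    n_taps = len(coeffs[0])
    # One delay line per channel; the arithmetic below is shared.
    history = [[0] * n_taps for _ in coeffs]
    out = []
    for frame in frames:
        row = []
        for ch, sample in enumerate(frame):
            line = history[ch]
            line.insert(0, sample)   # newest sample first
            line.pop()               # drop the oldest sample
            row.append(sum(c * x for c, x in zip(coeffs[ch], line)))
        out.append(row)
    return out

# An impulse on each channel returns that channel's own coefficients.
impulse = [[1, 1], [0, 0], [0, 0]]
response = interleaved_fir(impulse, [[1, 2, 3], [4, 5, 6]])
print(slots_per_sample, cores_per_dpm, response)
```

With 100 clock slots per sample period, 64 interleaved channels fit with room to spare, which is why one module can serve that many channels.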

Supernova Buffering In Two Stages
● Pre-trigger buffer stores data in a ring buffer while waiting for a supernova trigger
   ○ 640 channels per RCE (1x APA per COB)
   ○ 2 MHz ADC sampling rate
   ○ 12 bits per ADC sample
   ○ Bandwidth: 15.36 Gbps (1.92 GB/s)
      ■ 640 x 2MHz x 12b
   ○ Each DPM has 32 GB RAM:
      ■ 19.2 TB DDR4 RAM for the full system across 150x COBs
   ○ Total memory for supernova buffering: 31 GB
      ■ PL 16 GB + PS 15 GB (1GB for Kernel & OS)
   ○ Without compression: 16.1 seconds pre-trigger buffer
      ■ Assuming 12-bit packing to remove 4-bit overhead when packing into bytes
● Post-trigger buffer stores data in flash based SSD before backend DAQ
   ○ Write sequence occurs once per supernova trigger
      ■ Low write wear over experiment lifetime
   ○ Support up to 2x mSATA/RCE
   ○ Low bandwidth background readout post trigger
      ■ Does not impact normal data taking
   ○ ~$100K for mSATA SSD buffering
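
The buffer-depth figures can be reproduced with a few lines of arithmetic. A minimal Python check, assuming decimal units (1 GB = 10^9 bytes) and, for the unpacked comparison, that each 12-bit sample would otherwise occupy a 16-bit word:

```python
CHANNELS = 640            # channels per RCE
SAMPLE_HZ = 2_000_000     # ADC sampling rate
BITS_PER_SAMPLE = 12
BUFFER_BYTES = 31e9       # PL 16 GB + PS 15 GB, decimal GB assumed

bandwidth_bps = CHANNELS * SAMPLE_HZ * BITS_PER_SAMPLE
bandwidth_gbps = bandwidth_bps / 1e9          # Gbps
bandwidth_gbytes = bandwidth_bps / 8 / 1e9    # GB/s

# Ring-buffer depth with tightly packed 12-bit samples.
depth_packed_s = BUFFER_BYTES / (bandwidth_bps / 8)
# Depth if every sample were stored in two bytes (assumed alternative).
depth_unpacked_s = BUFFER_BYTES / (CHANNELS * SAMPLE_HZ * 2)

# 32 GB per DPM, 4 DPMs per COB, 150 COBs.
system_ram_tb = 32 * 4 * 150 / 1000

print(bandwidth_gbps, bandwidth_gbytes)
print(round(depth_packed_s, 1), round(depth_unpacked_s, 1), system_ram_tb)
```

Under these assumptions, packing is worth roughly four extra seconds of pre-trigger history per RCE.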

Optional Compression
● Compression is costly in FPGA resources
   ○ Plan to compress in the ARM processor before shipping out the trigger data to ART-DAQ
   ○ Compressing in the PS is made possible by the improved Zynq UltraScale+ CPU
● Estimated FPGA resource cost of compression:

Algorithm                         kLUTs/DPM    kFFs/DPM    DSP48/DPM   RAM(Mb)/DPM   Note
Arithmetic Probability Encoding   292 (86%)    120 (18%)   75 (<1%)    22.3 (38%)    Currently implemented in proto-DUNE
Huffman                           143 (43%)    60 (9%)     75 (<1%)    22.3 (38%)    Estimated from JJ Russell; almost half of XCZU15EG LUTs
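
For readers unfamiliar with the Huffman option, the sketch below shows the general technique in software: frequent symbols get short codes. It is a generic textbook construction in Python, not the proto-DUNE or DUNE implementation; the choice to code sample-to-sample differences and the toy waveform are assumptions made for illustration.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes all their symbols one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths

# Toy 12-bit waveform: mostly flat baseline with one excursion (made up).
samples = [900, 901, 900, 900, 899, 900, 900, 950, 900, 900]
deltas = [b - a for a, b in zip(samples, samples[1:])]
lengths = huffman_code_lengths(deltas)
coded_bits = sum(lengths[d] for d in deltas)
raw_bits = 12 * len(deltas)
print(lengths, coded_bits, raw_bits)
```

The bit count excludes the cost of transmitting the code table, so it overstates the gain for very short blocks.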

Deploying Neural Networks In DUNE DAQ
FPGAs are a great fit for Neural Network Deployment
● Current research on quantized 8-bit classification in FPGAs
● Layer processing can be pipelined to allow for high frame rate classification
● Layers can be deployed across multiple FPGAs
● Data sharing via interconnects when required
● First layer processing can occur in same FPGA as pre-filtering
● Later classification layers can be implemented in back end DAQ using GPUs or FPGA co-processors
● Resource estimates for possible neural network deployment approaches ongoing
   ○ Full LeNet implemented easily in RCE like FPGA
   ○ Working on VGG-16 network
[Figure: three rows, each showing a Detector Region feeding Conv, Conv, and Full layers, labeled FPGA, FPGA]

RCE Firmware Resource Estimates

FW Module       kLUTs    kFFs     DSPs   BRAM/URAM (Mb)   Notes
RCE Core        34.76    43.7     0      4.68             Based on existing RCE core
PL DDR MIG      18.12    20.33    3      0.94             Based on Xilinx IP core
Data Receiver   7.01     27.4     0      2.85             Based on existing WIB Receiver
Filtering       42.01    31.88    320    0                Based on Xilinx IP core
Hit-finding     TBD      TBD      TBD    TBD              More investigation needed
Total           101.9    123.31   323    8.47             Summation of modules above
Utilization     30%      18%      9%     15%              Based on XCZU15EG-1FFVB1156E

● Even with filtering, plenty of FPGA resources remain available for additional processing
   ○ Hit finding
   ○ Convolution/classification layers for neural network image recognition
   ○ Lower Resource Compression Algorithm (being investigated by JJ)
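
The Total and Utilization rows can be cross-checked against the XCZU15EG capacities listed in the PL comparison slide. A minimal Python check, assuming the BRAM/URAM percentage is taken against the combined BRAM + URAM capacity (26.2 Mb + 31.5 Mb):

```python
# (kLUTs, kFFs, DSPs, BRAM/URAM Mb) per module; Hit-finding is TBD and omitted.
modules = {
    "RCE Core":      (34.76, 43.7, 0, 4.68),
    "PL DDR MIG":    (18.12, 20.33, 3, 0.94),
    "Data Receiver": (7.01, 27.4, 0, 2.85),
    "Filtering":     (42.01, 31.88, 320, 0),
}

# XCZU15EG capacities; memory is BRAM + URAM combined (assumption).
device = (341, 683, 3528, 26.2 + 31.5)

totals = [round(sum(col), 2) for col in zip(*modules.values())]
utilization = [round(100 * t / d) for t, d in zip(totals, device)]

print(totals)
print(utilization)
```

Both rows reproduce the slide's numbers, which supports the combined-memory reading of the 15% figure.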

Waveform Extraction
• A version of waveform extraction was done for 35-ton.
   – Not used, due to time-to-deploy limitations
• Resource usage was modest (based on 35-ton effort).
   – Estimate ~100 kLUTs for 5 x 128 channel groups
      • Much of the logic dealt with pseudo-signal processing
      • This logic moves to the filtering section (previous slide)
      • Needs more work to minimize the resource requirements if used in DUNE
• Trigger primitives would be a natural by-product of the extracted waveforms
   – Since streaming, this will be a low latency path to use for trigger formation.
• Learned a lot since that implementation
   – The 35-ton implementation worked on packets of 1024 samples
      • Would go to a continuous streaming model
      • Packetizing would be done afterwards for practical reasons
● See slides from JJ Russell here: https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing

Current RCE Upgrade Efforts

New COB design for LSST: Switch Upgrade
● Adding 40 GbE support to the DPMs and backplane interfaces
● Significantly increasing front panel bandwidth
   ○ From 10 Gbps to 200 Gbps
● COB design embeds first level network switch layer within the ATCA crate
   ○ Reduces facility network requirements, direct connect to InfiniBand-enabled switches
   ○ Direct link to server room per rack or COB

DPM Redesign for DUNE
● Oxford/SLAC Collaboration
● Optimized for large memory buffering on the DPM
● 32 GB per DPM for supernova buffering
   ○ 16 GB PS (only supports dual rank)
   ○ 16 GB PL
      ■ Wired up to support quad rank SODIMM for possible 32 GB memory
● 20 of 24 HS RTMs supported
   ○ 80 links/COB @ 1.25 Gbps (8B/10B)
   ○ 20 links/COB @ 5 Gbps (8B/10B)
   ○ 8 links/COB @ 10 Gbps (64B/66B)
● Standard PL 10/40 GbE interface + 2x PS 1GbE
● Supports up to 2x mSATA/DPM on the RTM
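
The quoted link speeds are line rates, so the usable payload is lower once the line encoding is accounted for: 8B/10B carries 8 data bits in every 10 line bits, and 64B/66B carries 64 in every 66. A short Python sketch of the per-COB payload for each listed link configuration; protocol framing overhead is ignored, and whether the three configurations are alternatives or coexist is not stated on the slide:

```python
ENCODING_EFFICIENCY = {"8B/10B": 8 / 10, "64B/66B": 64 / 66}

# (links per COB, line rate in Gbps, encoding) as listed on the slide.
options = [
    (80, 1.25, "8B/10B"),
    (20, 5.0, "8B/10B"),
    (8, 10.0, "64B/66B"),
]

def payload_gbps(links, line_rate_gbps, encoding):
    """Aggregate payload rate after removing line-encoding overhead."""
    return links * line_rate_gbps * ENCODING_EFFICIENCY[encoding]

payload = [round(payload_gbps(*option), 1) for option in options]
print(payload)
```

All three configurations land within a few percent of 80 Gbps of payload per COB.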

Comparing Technical Specifics: Processor Side (PS)

                         proto-DUNE     DUNE
Application CPU
   # CPUs                2              4
   Frequency             800MHz         1.2GHz
RT CPU
   # CPUs                0              2
   Frequency             N/A            500MHz
GPU
   # CPUs                0              2
   Frequency             N/A            600MHz
PS DDR
   Type                  DDR3           DDR4
   Size                  1GB            16GB
   Width                 32-bit         64-bit
   Switching Speed       1066Mbps       2400Mbps
   Raw Peak Bandwidth    34Gbps         153.6Gbps

~4 times more bandwidth
https://www.xilinx.com/support/documentation/data_sheets/ds925-zynq-ultrascale-plus.pdf
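
The raw peak bandwidth figures are the DDR bus width multiplied by the per-pin switching speed. A minimal Python check of both columns and of the quoted ratio (the function name is illustrative):

```python
def raw_peak_gbps(width_bits, rate_mbps):
    """Raw peak DDR bandwidth in Gbps: bus width x per-pin data rate."""
    return width_bits * rate_mbps / 1000

proto_dune = raw_peak_gbps(32, 1066)   # DDR3, 32-bit
dune = raw_peak_gbps(64, 2400)         # DDR4, 64-bit
ratio = dune / proto_dune

print(round(proto_dune, 1), dune, round(ratio, 1))
```

The ratio works out to about 4.5, consistent with the slide's "~4 times" once rounded down.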

Comparing Technical Specifics: Programmable Logic (PL)

Device                             kLUTs   kFFs    DDR    BRAM     URAM     DSP48
XC7Z045-2FFG900E (proto-DUNE)      218     437.2   0GB    19.2Mb   0Mb      900
XCZU15EG-1FFVB1156E (DUNE)         341     683     16GB   26.2Mb   31.5Mb   3,528

Other Possible RCE/COB Modifications
● Move COB/RCE platform to pizza box
   ○ Possible cost reduction
   ○ High risk for maintainability and reliability
      ■ Fan failures are a high risk
      ■ Power supply failures are a second-level risk
      ■ Filtering will be required, but could be done at the rack level
      ■ Loss of telecom class reliability and uptime
● Simplify COB and remove network switch
   ○ Keep some local interconnects for data sharing
   ○ Cost reduction possibilities
      ■ Shifts networking cost to outside the crate
● COB like ATCA carrier for commercial ZYNQ modules
   ○ Currently being investigated
   ○ Cost reduction relative to current COB/DPM is unclear due to loss in density

RCE/COB Based Solution Costing

ATCA Packaging for DUNE
● 1 APA = 2560 channels
● 1 APA per COB
   ○ 4 DPMs per COB
   ○ 640 channels per DPM
   ○ Assuming 128 channels per high speed link
      ■ 5x high speed links per DPM
● 150 APAs for the entire system = 150 COBs
● Total rack space: 165U
   ○ 11x 14-slot ATCA crates
   ○ 15U per 14-slot ATCA crate
      ■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
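
The packaging numbers chain together as simple divisions. A minimal Python check using only the figures on this slide; the spare-slot count is a derived value, not one quoted in the deck:

```python
import math

CHANNELS_PER_APA = 2560
DPMS_PER_COB = 4
CHANNELS_PER_LINK = 128
APAS = 150                   # one COB per APA
SLOTS_PER_CRATE = 14
RACK_UNITS_PER_CRATE = 15

channels_per_dpm = CHANNELS_PER_APA // DPMS_PER_COB
links_per_dpm = channels_per_dpm // CHANNELS_PER_LINK
cobs = APAS
crates = math.ceil(cobs / SLOTS_PER_CRATE)
rack_units = crates * RACK_UNITS_PER_CRATE
spare_slots = crates * SLOTS_PER_CRATE - cobs   # derived, not from the slide

print(channels_per_dpm, links_per_dpm, crates, rack_units, spare_slots)
```

Eleven crates leave four empty slots across the system under these assumptions.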