Oct. 30-31, 2017

TID-AIR Electronics: DUNE FD DAQ: ATCA RCE-based Solution
Matt Graham, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell

Readout Overview

[Diagram: RCE platform clustering, with the DPMs and DTMs on each COB interconnected via Ethernet]
RCE Platform Clustering
- The RCE nodes are interconnected through Ethernet
- Each COB contains a low-latency 10/40 Gbps Ethernet switch
- Cut-through latency < 300 ns
- The COB supports a full mesh 14-slot backplane
- Each COB has a direct 10Gbps link to every other COB in a crate
- Any RCE in an ATCA shelf has a maximum of two switches between it and every other RCE
- 14 slots x 8 RCEs/COB = 112 RCEs in a low-latency cluster
- Reliable UDP (RUDP) protocol allows direct firmware-to-firmware data sharing (a toy sketch follows this list)
- Allows for low-latency data sharing between nodes:
- Edge channels
- APA combining
- Neural Network data sharing
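To make the reliable-UDP idea concrete, here is a toy Python sketch of a sequence-number-plus-acknowledgement exchange. The header layout, timeout, and retry count are illustrative assumptions; the actual firmware RUDP protocol used on the RCEs differs in detail.

```python
# Toy reliable-UDP sender: sequence-numbered frames, retransmitted until acked.
# Illustrative only; the real firmware-to-firmware RUDP protocol differs.
import socket
import struct

SEQ_HDR = struct.Struct("!I")      # assumed 4-byte big-endian sequence number

def send_reliable(sock, addr, payloads, timeout=0.1, max_retries=5):
    """Send each payload as a sequence-numbered frame; resend on missing ACK."""
    sock.settimeout(timeout)
    for seq, data in enumerate(payloads):
        frame = SEQ_HDR.pack(seq) + data
        for _ in range(max_retries):
            sock.sendto(frame, addr)
            try:
                ack, _ = sock.recvfrom(SEQ_HDR.size)
                if SEQ_HDR.unpack(ack)[0] == seq:
                    break                  # acknowledged: move to next frame
            except socket.timeout:
                continue                   # frame or ACK lost: retransmit
        else:
            raise RuntimeError(f"frame {seq} never acknowledged")

# Usage (hypothetical peer address):
# sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
# send_reliable(sock, ("10.0.0.2", 8192), [b"edge-channel data"])
```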
[Diagram: four COBs, each with DPM 0-3, a DTM, and an on-board Ethernet switch; the switches are interconnected through the full-mesh backplane, with an off-shelf uplink]
Data Flow In RCEs
[Diagram: data flow within an RCE: front ends feed per-channel filtering, then feature extraction, the supernova pre- and post-trigger buffers, "compression & other processing?", and the network stack SW & FW (RUDP) with a timing interface out to the back-end DAQ]
- The flexible architecture allows front ends to be allocated across RCEs as needed
○ Simply add more cards and move fibers
- Target is 640 channels per RCE (1x APA per COB)
Interleaved Data Channel Filtering
- Based on MicroBooNE data, initial studies suggest that we need at minimum a 16-tap FIR filter
○ Two notch filters + high-pass filter
○ More taps for more complex filters
- Built from Xilinx FIR IP core (v7.2)
○ Interleaves the channels to share FPGA resources (see the sketch after the resource table below)
○ Unique coefficient set is specified for each interleaved data channel
■ 64 interleaved data channels per FIR module
■ Coefficients set the filter shape, allowing different filter approaches per channel
○ Assumes a fast processing clock: 200 MHz
■ 100x faster than the 2 MHz packet rate
Number of taps   kLUTs/IP core (kLUTs/DPM)   kFFs/IP core (kFFs/DPM)   DSPs/IP core (DSP48/DPM)
16-tap           2.37 (23.7)                 2.145 (21.45)             16 (160)
24-tap           3.285 (32.85)               2.667 (26.67)             24 (240)
32-tap           4.201 (42.01)               3.188 (31.88)             32 (320)
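As a concrete illustration of the interleaving idea, here is a minimal Python/NumPy sketch: one shared filtering routine applied across 64 channels, each with its own 16-tap coefficient set. The band-pass coefficients and random data are placeholders; the real per-channel coefficients (two notches plus a high-pass) would come from offline filter design.

```python
# Sketch of the interleaved per-channel FIR scheme: 64 channels share one
# filter datapath, each with a unique 16-tap coefficient set.
import numpy as np
from scipy.signal import firwin, lfilter

N_CH, N_TAPS = 64, 16                     # per slide: 64 channels, 16 taps

# Placeholder coefficients (a simple band-pass, tiled across channels); in
# practice each channel would get its own notch + high-pass design.
coeffs = np.tile(firwin(N_TAPS, [0.06, 0.45], pass_zero=False), (N_CH, 1))

def filter_interleaved(stream):
    """stream: (n_samples, N_CH) time-multiplexed ADC data.
    In the FPGA a single 200 MHz MAC datapath is time-shared over the 64
    channels (100x the 2 MHz sample rate); here each channel is simply
    filtered with its own taps."""
    out = np.empty(stream.shape)
    for ch in range(N_CH):
        out[:, ch] = lfilter(coeffs[ch], 1.0, stream[:, ch])
    return out

adc = np.random.randint(0, 4096, size=(2000, N_CH)).astype(float)  # 1 ms @ 2 MHz
filtered = filter_interleaved(adc)
```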
Supernova Buffering In Two Stages
- Pre-trigger buffer stores data in a ring buffer while waiting for a supernova trigger
○ 640 channels per RCE (1x APA per COB)
○ 2 MHz ADC sampling rate
○ 12 bits per ADC sample
○ Bandwidth: 15.36 Gbps (1.92 GB/s)
■ 640 x 2 MHz x 12b
○ Each DPM has 32 GB RAM
■ 19.2 TB DDR4 RAM for the full system across 150x COBs
○ Total memory for supernova buffering: 31 GB
■ PL 16 GB + PS 15 GB (1 GB reserved for kernel & OS)
○ Without compression: 16.1 seconds of pre-trigger buffer
■ Assuming 12-bit packing to remove the 4-bit overhead of packing into bytes
- Post-trigger buffer stores data in a flash-based SSD before the back-end DAQ (the buffer-depth arithmetic is sketched below)
○ Write sequence occurs once per supernova trigger
■ Low write wear over the experiment lifetime
○ Supports up to 2x mSATA/RCE
○ Low-bandwidth background readout post trigger
■ Does not impact normal data taking
○ ~$100K for mSATA SSD buffering
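A quick check of the pre-trigger buffer depth quoted above, using only the slide's numbers:

```python
# Supernova pre-trigger buffer arithmetic (sketch; values from the slide).
channels  = 640            # per RCE (1x APA per COB)
f_sample  = 2e6            # 2 MHz ADC sampling
bits      = 12             # 12-bit samples, packed

rate_gbps = channels * f_sample * bits / 1e9     # 15.36 Gbps
rate_gBps = rate_gbps / 8                        # 1.92 GB/s

buffer_gb = 16 + 15        # PL 16 GB + PS 15 GB usable
seconds   = buffer_gb / rate_gBps                # ~16.1 s
print(f"{rate_gbps:.2f} Gbps -> {seconds:.1f} s of pre-trigger buffer")
```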
Optional Compression
- Compression is costly in FPGA resources
○ Plan to compress in the ARM processor before shipping the trigger data out to ART-DAQ
○ Compressing in the PS is made possible by the improved Zynq UltraScale+ CPU
- Estimated FPGA resource cost if compression were done in the PL (a toy Huffman sketch follows the table):
Algorithm                        kLUTs/DPM   kFFs/DPM   DSP48/DPM   RAM (Mb)/DPM   Note
Arithmetic Probability Encoding  292 (86%)   120 (18%)  75 (<1%)    22.3 (38%)     Currently implemented in proto-DUNE
Huffman                          143 (43%)   60 (9%)    75 (<1%)    22.3 (38%)     Estimated from JJ Russell
Even Huffman, the cheaper option, would use almost half of the XCZU15EG LUTs.
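To make the PS-side option concrete, here is a toy Huffman coder for 12-bit samples in Python. This is an illustrative sketch only, not the proto-DUNE arithmetic-probability encoder or the estimated Huffman firmware.

```python
# Toy Huffman coding of 12-bit ADC samples, of the kind that could run on the
# ARM PS before shipping trigger data out (illustrative sketch).
import heapq
from collections import Counter

def huffman_table(samples):
    """Build a Huffman code table {symbol: bitstring} from sample statistics."""
    freq = Counter(samples)
    # Heap entries: (weight, tiebreak, tree); tree is a symbol or (left, right)
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, count, (t1, t2)))
        count += 1
    table = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            table[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return table

samples = [2048, 2049, 2048, 2047, 2050, 2048]   # toy 12-bit waveform
table = huffman_table(samples)
bits = "".join(table[s] for s in samples)
print(f"{len(samples) * 12} raw bits -> {len(bits)} coded bits")
```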
Deploying Neural Networks In DUNE DAQ
[Diagram: multiple detector regions, each feeding FPGAs that run convolutional (Conv) layers, whose outputs fan in to FPGAs running fully connected (Full) layers]

FPGAs are a great fit for neural network deployment:
- Current research targets quantized 8-bit classification in FPGAs (a toy quantized convolution is sketched after this list)
- Layer processing can be pipelined to allow high frame-rate classification
- Layers can be deployed across multiple FPGAs
○ Data sharing via interconnects when required
- First-layer processing can occur in the same FPGA as pre-filtering
- Later classification layers can be implemented in the back-end DAQ using GPUs or FPGA co-processors
- Resource estimates for possible neural network deployment approaches are ongoing
○ Full LeNet implemented easily in an RCE-like FPGA
○ Working on the VGG-16 network
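To make the quantized-inference point concrete, here is a toy 8-bit quantized 2-D convolution in Python/NumPy. The shapes, scale factors, and quantization scheme are illustrative assumptions; the sketch only shows the integer multiply-accumulate pattern that maps naturally onto FPGA DSP slices.

```python
# Toy 8-bit quantized convolution: int8 data and weights, int32 accumulation
# (as a DSP48 cascade would do), then requantization back to int8.
import numpy as np

def conv2d_int8(x_q, w_q, scale_x, scale_w, scale_out):
    """x_q: int8 input (H, W); w_q: int8 kernel (K, K). Valid convolution."""
    K = w_q.shape[0]
    H, W = x_q.shape
    acc = np.zeros((H - K + 1, W - K + 1), dtype=np.int32)
    for i in range(K):
        for j in range(K):
            acc += (x_q[i:i + H - K + 1, j:j + W - K + 1].astype(np.int32)
                    * int(w_q[i, j]))
    # Requantize: real value ~ q * scale, so out_q = acc * (sx * sw / s_out)
    out = np.round(acc * (scale_x * scale_w / scale_out))
    return np.clip(out, -128, 127).astype(np.int8)

x = np.random.randint(-128, 128, (28, 28), dtype=np.int8)   # toy image tile
w = np.random.randint(-128, 128, (3, 3), dtype=np.int8)     # toy 3x3 kernel
y = conv2d_int8(x, w, scale_x=0.05, scale_w=0.01, scale_out=0.1)
```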
RCE Firmware Resource Estimates
FW Module       kLUTs    kFFs     DSPs   BRAM/URAM (Mb)   Notes
RCE Core        34.76    43.7     -      4.68             Based on existing RCE core
PL DDR MIG      18.12    20.33    3      0.94             Based on Xilinx IP core
Data Receiver   7.01     27.4     -      2.85             Based on existing WIB receiver
Filtering       42.01    31.88    320    -                Based on Xilinx IP core
Hit-finding     TBD      TBD      TBD    TBD              More investigation needed
Total           101.9    123.31   323    8.47             Summation of modules above
Utilization     30%      18%      9%     15%              Based on XCZU15EG-1FFVB1156E
- With filtering in place, there are plenty of FPGA resources available for additional processing
○ Hit finding
○ Convolution/classification layers for neural-network image recognition
○ Lower-resource compression algorithm (being investigated by JJ)
Waveform Extraction
- A version of waveform extraction was done for the 35-ton prototype
○ Not used, due to time-to-deploy limitations
- Resource usage was modest (based on the 35-ton effort)
○ Estimated ~100 kLUTs for 5 x 128-channel groups
- Much of the logic dealt with pseudo-signal processing
○ This logic moves to the filtering section (previous slide)
- Needs more work to minimize the resource requirements if used in DUNE
- Trigger primitives would be a natural by-product of the extracted waveforms
○ Since streaming, this is a low-latency path to use for trigger formation
- Learned a lot since that implementation
○ The 35-ton implementation worked on packets of 1024 samples
○ Would move to a continuous streaming model (a minimal sketch follows the link below)
○ Packetizing would be done afterwards for practical reasons
- See slides from JJ Russell here:
https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing
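As an illustration of the streaming model, here is a minimal Python hit-finder that emits trigger primitives directly from a sample stream, with no packet boundaries. The threshold, primitive fields, and function names are hypothetical, not the 35-ton implementation.

```python
# Minimal streaming hit-finder sketch: emit a trigger primitive
# (channel, start time, peak, sum) whenever the filtered stream crosses a
# threshold. Sample-by-sample, so no 1024-sample packet boundary is needed.
def find_hits(samples, channel, threshold=60):
    """samples: iterable of (t, adc) after pedestal subtraction/filtering."""
    hit = None
    for t, adc in samples:
        if adc > threshold:
            if hit is None:
                hit = {"channel": channel, "t_start": t, "peak": adc, "sum": 0}
            hit["peak"] = max(hit["peak"], adc)
            hit["sum"] += adc
        elif hit is not None:
            hit["t_end"] = t
            yield hit            # trigger primitive, available at low latency
            hit = None

stream = enumerate([0, 5, 70, 120, 90, 30, 0, 0, 80, 10])   # toy waveform
for primitive in find_hits(stream, channel=42):
    print(primitive)
```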
Current RCE Upgrade Efforts
New COB design for LSST: Switch Upgrade
- Adding 40 GbE support to the DPMs and backplane interfaces
- Significantly increasing front-panel bandwidth
○ From 10 Gbps to 200 Gbps
- The COB design embeds the first-level network switch layer within the ATCA crate
○ Reduces facility network requirements; direct connect to InfiniBand-enabled switches
○ Direct link to the server room per rack or COB
DPM Redesign for DUNE

- Oxford/SLAC collaboration
- Optimized for large memory buffering on the DPM
- 32 GB per DPM for supernova buffering
○ 16 GB PS (only supports dual rank)
○ 16 GB PL
■ Wired to support quad-rank SODIMM for a possible 32 GB of memory
- 20 of 24 high-speed (HS) RTM links supported
○ 80 links/COB @ 1.25 Gbps (8B/10B)
○ 20 links/COB @ 5 Gbps (8B/10B)
○ 8 links/COB @ 10 Gbps (64B/66B)
- Standard PL 10/40 GbE interface + 2x PS 1 GbE
- Supports up to 2x mSATA/DPM on the RTM
Comparing Technical Specifics: Processing System (PS)
                      proto-DUNE          DUNE
Application CPU       2x @ 800 MHz        4x @ 1.2 GHz
RT CPU                N/A                 2x @ 500 MHz
GPU                   N/A                 2x @ 600 MHz
PS DDR type           DDR3                DDR4
PS DDR size           1 GB                16 GB
PS DDR width          32-bit              64-bit
Switching speed       1066 Mbps           2400 Mbps
Raw peak bandwidth    34 Gbps             153.6 Gbps
https://www.xilinx.com/support/documentation/data_sheets/ds925-zynq-ultrascale-plus.pdf
~4.5x more raw peak DDR bandwidth (153.6 Gbps vs 34 Gbps)
Comparing Technical Specifics: Programmable Logic (PL)
Device                          kLUTs   kFFs    DDR     BRAM      URAM      DSP48
XC7Z045-2FFG900E (proto-DUNE)   218     437.2   0 GB    19.2 Mb   0 Mb      900
XCZU15EG-1FFVB1156E (DUNE)      341     683     16 GB   26.2 Mb   31.5 Mb   3,528
Other Possible RCE/COB Modifications
- Move the COB/RCE platform to a pizza-box form factor
○ Possible cost reduction
○ High risk for maintainability and reliability
■ Fan failures are a high risk
■ Power supply failures are a second-level risk
■ Filtering will be required, but could be done at the rack level
■ Loss of telecom-class reliability and uptime
- Simplify the COB and remove the network switch
○ Keep some local interconnects for data sharing
○ Cost reduction possibilities
■ Shifts networking cost to outside the crate
- COB-like ATCA carrier for commercial ZYNQ modules
○ Currently being investigated
○ Cost reduction relative to the current COB/DPM is unclear due to the loss in density
RCE/COB Based Solution Costing
ATCA Packaging for DUNE
- 1 APA = 2560 channels
- 1 APA per COB
○ 4 DPMs per COB
○ 640 channels per DPM
○ Assuming 128 channels per high-speed link
■ 5x high-speed links per DPM
- 150 APAs for the entire system = 150 COBs
- Total rack space: 165U (the arithmetic is sketched after this list)
○ 11x 14-slot ATCA crates
○ 15U per 14-slot ATCA crate
■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
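A quick check of the packaging numbers above, using only the slide's values:

```python
# Packaging arithmetic (sketch; slide values only).
import math

apa_channels  = 2560
ch_per_link   = 128
links_per_dpm = (apa_channels // 4) // ch_per_link   # 640 / 128 = 5
n_cob         = 150                                  # 1 APA per COB, 150 APAs
n_crates      = math.ceil(n_cob / 14)                # 14-slot crates -> 11
rack_units    = n_crates * 15                        # 15U per crate -> 165U
print(links_per_dpm, n_crates, rack_units)
```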
ATCA Power/Cooling Estimates for DUNE
- COB max power: 300W
○ ~100W for the Ethernet switch
○ 36W for the RTM (limited by a 3A fuse)
○ 160W for digital processing
■ 40W per DPM
- Total max power: 45 kW (checked in the sketch below)
- Cooling via forced air (integrated into the ATCA platform)
- Power and thermal monitoring via the standard IPMI interface
- Example of an ATCA crate that supports 400W per slot:
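The power budget follows directly from the slide's per-COB numbers:

```python
# Power budget check (sketch; slide values only).
cob_power = 100 + 36 + 4 * 40     # ETH switch + RTM + 4 DPMs = 296 W (~300 W max)
total_kw  = 150 * 300 / 1000      # 150 COBs at the 300 W max -> 45 kW
print(cob_power, total_kw)
```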
ATCA Costs for DUNE
- RCE cost estimate:
○ ~$22k/unit
○ COB + DPMs + DTM + RTM
- 14-slot ATCA crates:
○ ~$8.5k/unit
○ IPMI + shelf manager + 10GbE/40GbE backplane + fans + power supplies
- Total ATCA hardware cost: $3.4M (totaled in the sketch below)
○ 11x ATCA crates
○ 150x RCE ATCA slots
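Totaling the hardware cost estimate from the unit prices quoted above:

```python
# Hardware cost total (sketch; slide values only).
rce_unit   = 22_000      # COB + DPMs + DTM + RTM
crate_unit = 8_500       # 14-slot ATCA crate with IPMI, shelf manager, etc.
total = 150 * rce_unit + 11 * crate_unit
print(f"${total / 1e6:.2f}M")   # -> $3.39M, rounded to $3.4M on the slide
```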
Packaging And Architecture Thoughts
ATCA Components
[Diagram: ATCA shelf components: air intake filter, intake fans, exit fans, power supplies (DC or AC input), shelf managers, and application cards]
- Telecom standard designed for “5 nines” uptime
- Almost all components can be replaced in the field
- Redundancy is available if desired
○ N + 1 redundancy for power supplies
○ Redundant shelf managers
- System is designed to handle one fan failure in each fan tray
○ Shelf manager generates an alarm to request fan-tray replacement
ATCA Shelf Management & IPMI
[Diagram: redundant shelf managers with Ethernet and console interfaces, connected to the power supplies, fan trays, shelf EEPROMs, and each application card's IPMC with its own EEPROM]
- ATCA uses IPMI for management purposes
○ Intelligent Platform Management Interface
- Manages and monitors all shelf based components
○ Power supply status and power
○ Shelf inlet and exit temperatures
○ Fan speed control and monitoring
○ Application card control and monitoring
- Redundant EEPROMs contain all shelf information
○ Shelf serial number, location, and ID
○ Shelf manager IP/MAC address
- Application card hosts IPMC
○ Intelligent Platform Management Controller
- The IPMC stores all application card information in a local EEPROM
○ MAC addresses
○ Serial number, card type & revision
How Do We Commission Without A Smart WIB?
- Concerns have been raised about having a "dumb" WIB and commissioning the front-end boards
○ Do we need WIB intelligence to commission a front end?
- Many toolsets and hardware platforms already exist
○ Lots of RCE-based designs started with small-scale development
- Use the RCE platform:
○ With the RCE in place, local Python scripts can interact with the front end and accept data
○ Full DAQ firmware is not required; existing tools can accept any data from the front end
- Use an RCE development board (or any Xilinx dev board, for that matter)
○ Same environment as the real RCE
○ Existing firmware & software can be used
○ Portable and can be used with a laptop
- Use a PC with a PCI-Express card
○ Firmware & software already exist
○ Also Python-capable
Reliability
- The system design must consider reliability and maintenance of its components
○ How often will a component fail?
■ How many channels are impacted by each failure possibility?
■ Yearly cost of replacing components
○ How easy is it to replace a failed component?
■ Downtime
■ What level of modularity?
■ Can a replacement item be obtained (same model)?
○ How susceptible to damage is each component?
■ Fibers between the detector and the surface
■ Patch panels and interconnects along the way
- Are there architecture choices that enable failed units to be bypassed?
○ Use network protocols that allow rerouting
○ Optical cross-connect switches to allow fiber failures to be bypassed
○ Processing units susceptible to failure (e.g., GPUs) deployed in a resource pool without a dedicated connection to a detector unit
■ A failed unit can be taken offline and a "hot" replacement re-allocated
■ The necessary network infrastructure must also be considered
- How are devices monitored for potential failures?
○ Temperature sensors & fan-speed sensors
○ Fiber-optic power measurements to detect failed optics
○ Alarms?
Reliability
- Will the spinning fans on a standard PCI-Express plug-in board fail?
- What is the replacement process?
- How many other processing units must be taken down to replace the failed unit?
- How much of the detector is lost when the card is the gateway to the front end?
- How are fan failures reported, if at all?
LCLS-2 Approach To PCI-Express Fiber Boards
- The LCLS-2 DAQ system uses KCU1500 boards as the gateway from the mid-level DAQ and cameras to back-end processing
- Processing resources are expected to change, and boards may fail over time
- Crossbar optical switches allow a flexible connection between any front end and a pool of KCU1500 boards located in back-end servers
- GPUs, when used, are accessed through Ethernet and InfiniBand