

SLIDE 1: TID-AIR Electronics: DUNE FD DAQ: ATCA RCE-based Solution

Matt Graham, Mark Convery, Ryan Herbst, Larry Ruckman, JJ Russell
Oct. 30-31, 2017

SLIDE 2: Readout Overview

SLIDE 3: RCE Platform Clustering

  • The RCE nodes are interconnected through Ethernet
  ○ Each COB contains a low-latency 10/40 Gbps Ethernet switch
  ■ Cut-through latency < 300 ns
  • The COB supports a full-mesh 14-slot backplane
  ○ Each COB has a direct 10 Gbps link to every other COB in a crate
  ○ Any RCE in an ATCA shelf has at most two switches between it and every other RCE
  ○ 14 slots x 8 RCEs = 112 RCEs in a low-latency cluster (see the check below)
  • A reliable UDP protocol allows direct firmware-to-firmware data sharing
  ○ Allows low-latency data sharing between nodes, for example:
  ■ Edge channels
  ■ APA combining
  ■ Neural-network data sharing
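To make the cluster-size and latency claims concrete, a tiny plain-Python check using only the figures quoted on this slide (the 300 ns cut-through bound and the two-hop maximum):

```python
# Cluster size and worst-case switch latency, from the bullets above.
SLOTS_PER_SHELF = 14     # full-mesh 14-slot ATCA backplane
RCES_PER_COB = 8         # RCEs per COB
CUT_THROUGH_NS = 300     # per-switch cut-through latency upper bound
MAX_SWITCH_HOPS = 2      # any RCE-to-RCE path crosses at most two switches

rces = SLOTS_PER_SHELF * RCES_PER_COB
worst_case_ns = MAX_SWITCH_HOPS * CUT_THROUGH_NS
print(f"RCEs per shelf: {rces}")                            # 112
print(f"Worst-case switch latency: < {worst_case_ns} ns")   # < 600 ns
```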

[Diagram: four COBs, each containing DPM 0-3 and a DTM behind an on-board Ethernet switch, meshed across the backplane, with an off-shelf uplink.]

SLIDE 4: Data Flow In RCEs

[Block diagram: Front Ends feed per-channel Filtering, then Feature Extraction, the SuperNova Pre-Buffer and Post-Buffer, and an optional "Compression & Other Processing?" stage; the Network Stack in SW & FW (RUDP) and a Timing Interface connect to the Back End DAQ.]

  • The flexible architecture allows front ends to be allocated across RCEs as needed
  ○ Simply add more cards and move fibers
  • Target is 640 channels per RCE (1 APA per COB)


SLIDE 5: Interleaved Data Channel Filtering

  • Based on MicroBooNE data, initial studies suggest we need at least a 16-tap FIR filter
  ○ Two notch filters + a high-pass filter
  ○ More taps for more complex filters

  • Built from the Xilinx FIR IP core (v7.2)
  ○ Interleaves the channels to share FPGA resources
  ○ A unique coefficient set is specified for each interleaved data channel
  ■ 64 interleaved data channels per FIR module
  ■ Coefficients set the filter shape, allowing different filter approaches per channel
  ○ Assumes a fast processing clock: 200 MHz
  ■ 100x faster than the 2 MHz packet rate

| Number of taps | kLUTs/IP core (kLUTs/DPM) | kFFs/IP core (kFFs/DPM) | DSPs/IP core (DSP48/DPM) |
| 16-TAP | 2.37 (23.7) | 2.145 (21.45) | 16 (160) |
| 24-TAP | 3.285 (32.85) | 2.667 (26.67) | 24 (240) |
| 32-TAP | 4.201 (42.01) | 3.188 (31.88) | 32 (320) |
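As an illustration of the interleaving scheme described above, a minimal numpy model of one FIR module serving 64 channels with per-channel coefficient sets; the coefficients here are random placeholders, not the real notch/high-pass designs:

```python
# Minimal model of one interleaved FIR module: a single MAC engine
# (the "IP core") time-multiplexed over 64 channels, each with its
# own 16-tap coefficient set. Coefficients are random placeholders.
import numpy as np

N_CH, N_TAPS, N_SAMPLES = 64, 16, 1000
rng = np.random.default_rng(0)

coeffs = rng.normal(size=(N_CH, N_TAPS))   # per-channel coefficient sets
data = rng.normal(size=(N_CH, N_SAMPLES))  # stand-in for 2 MHz ADC streams

# Delay-line view: out[c, n] = sum_k coeffs[c, k] * data[c, n - k]
out = np.zeros_like(data)
hist = np.zeros((N_CH, N_TAPS))
for n in range(N_SAMPLES):        # one 2 MHz sample period
    hist = np.roll(hist, 1, axis=1)
    hist[:, 0] = data[:, n]
    for c in range(N_CH):         # 64 serialized channel slots
        out[c, n] = coeffs[c] @ hist[c]   # one MAC burst per slot

# Timing check: at 200 MHz the core has 100 clocks per 2 MHz sample,
# enough to serialize the 64 channel slots in the inner loop.
assert 200e6 / 2e6 >= N_CH
```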

SLIDE 6: Supernova Buffering In Two Stages

  • The pre-trigger buffer stores data in a ring buffer while waiting for a supernova trigger
  ○ 640 channels per RCE (1 APA per COB)
  ○ 2 MHz ADC sampling rate
  ○ 12 bits per ADC sample
  ○ Bandwidth: 15.36 Gbps (1.92 GB/s)
  ■ 640 x 2 MHz x 12 b
  ○ Each DPM has 32 GB RAM
  ■ 19.2 TB of DDR4 RAM for the whole system across 150 COBs
  ○ Total memory for supernova buffering: 31 GB
  ■ 16 GB PL + 15 GB PS (1 GB reserved for the kernel & OS)
  ○ Without compression: 16.1 seconds of pre-trigger buffer (see the worked numbers below)
  ■ Assuming 12-bit packing to remove the 4-bit overhead of packing into bytes

  • The post-trigger buffer stores data in a flash-based SSD before the back-end DAQ
  ○ The write sequence occurs once per supernova trigger
  ■ Low write wear over the experiment lifetime
  ○ Supports up to 2x mSATA per RCE
  ○ Low-bandwidth background readout after the trigger
  ■ Does not impact normal data taking
  ○ ~$100K for mSATA SSD buffering
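The buffer-depth arithmetic above can be reproduced in a few lines of Python; all inputs are the figures quoted on this slide:

```python
# Supernova pre-trigger buffer depth, from the figures on this slide.
channels = 640        # channels per RCE
f_sample = 2e6        # ADC sampling rate, Hz
bits = 12             # bits per ADC sample (12-bit packed)

rate_bps = channels * f_sample * bits
print(f"Rate: {rate_bps / 1e9:.2f} Gbps = {rate_bps / 8e9:.2f} GB/s")
# -> 15.36 Gbps = 1.92 GB/s

buffer_gb = 16 + 15   # 16 GB PL + 15 GB PS (1 GB for kernel & OS)
seconds = buffer_gb * 1e9 / (rate_bps / 8)
print(f"Pre-trigger depth: {seconds:.1f} s")   # -> 16.1 s
```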


SLIDE 7: Optional Compression

  • Compression is costly in FPGA resources
  ○ Plan to compress in the ARM processor before shipping the trigger data to ART-DAQ
  ○ Compressing in the PS is made possible by the improved Zynq UltraScale+ CPU

  • Here’s what we estimate the compression resource cost would be:

| Algorithm | kLUTs/DPM | kFFs/DPM | DSP48/DPM | RAM (Mb)/DPM | Note |
| Arithmetic Probability Encoding | 292 (86%) | 120 (18%) | 75 (<1%) | 22.3 (38%) | Currently implemented in proto-DUNE |
| Huffman | 143 (43%) | 60 (9%) | 75 (<1%) | 22.3 (38%) | Estimated from JJ Russell |

Almost half of the XCZU15EG LUTs
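Since the table mentions a Huffman option, here is a small, illustrative Python sketch of building a Huffman code over 12-bit ADC sample deltas. It only models the idea; it is not the firmware implementation or JJ Russell's estimated design, and the fake ADC stream is invented:

```python
# Illustrative Huffman-code construction for 12-bit ADC sample deltas.
import heapq
import random
from collections import Counter

random.seed(0)
# Fake ADC stream: noisy baseline -> small sample-to-sample deltas.
samples = [2048 + int(20 * random.gauss(0, 1)) for _ in range(10_000)]
deltas = [b - a for a, b in zip(samples, samples[1:])]

def huffman_code(symbols):
    """Return {symbol: bitstring} for the given iterable of symbols."""
    freq = Counter(symbols)
    # Heap entries: (count, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(n, i, s) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, i, (t1, t2)))
        i += 1
    code = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            code[tree] = prefix or "0"
    walk(heap[0][2])
    return code

code = huffman_code(deltas)
bits = sum(len(code[d]) for d in deltas)
print(f"{bits / len(deltas):.2f} bits/sample vs 12 raw")  # well under 12
```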

SLIDE 8: Deploying Neural Networks In DUNE DAQ

[Diagram: data from detector regions flows into FPGAs running convolutional (Conv) layers, which feed FPGAs running fully connected (Full) layers.]

FPGAs are a great fit for neural network deployment:

  • Current research targets quantized 8-bit classification in FPGAs
  • Layer processing can be pipelined to allow high-frame-rate classification
  • Layers can be deployed across multiple FPGAs
  ○ Data sharing via interconnects when required
  • First-layer processing can occur in the same FPGA as the pre-filtering
  • Later classification layers can be implemented in the back-end DAQ using GPUs or FPGA co-processors
  • Resource estimates for possible neural-network deployment approaches are ongoing (a toy quantization sketch follows below)
  ○ A full LeNet is easily implemented in an RCE-like FPGA
  ○ Working on a VGG-16 network
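To make "quantized 8-bit classification" concrete, a toy numpy sketch of an int8 convolution with wide accumulation and a fixed-point requantization step, the arithmetic pattern an FPGA convolution layer would pipeline; the shapes and scale factor are invented for illustration:

```python
# Toy int8 convolution + requantization. Shapes and scales are invented.
import numpy as np

rng = np.random.default_rng(1)
img = rng.integers(-128, 128, size=(28, 28), dtype=np.int8)   # int8 input
kern = rng.integers(-128, 128, size=(3, 3), dtype=np.int8)    # int8 weights

# MACs accumulate in wide integers (int32), as in DSP48 blocks.
acc = np.zeros((26, 26), dtype=np.int32)
for dy in range(3):
    for dx in range(3):
        acc += img[dy:dy+26, dx:dx+26].astype(np.int32) * int(kern[dy, dx])

# Requantize back to int8 with a power-of-two scale (cheap in hardware).
SHIFT = 7                                    # invented fixed-point scale
out = np.clip(acc >> SHIFT, -128, 127).astype(np.int8)
print(out.shape, out.dtype)                  # (26, 26) int8
```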

SLIDE 9: RCE Firmware Resource Estimates

| FW Module | kLUTs | kFFs | DSPs | BRAM/URAM (Mb) | Notes |
| RCE Core | 34.76 | 43.7 | - | 4.68 | Based on existing RCE core |
| PL DDR MIG | 18.12 | 20.33 | 3 | 0.94 | Based on Xilinx IP core |
| Data Receiver | 7.01 | 27.4 | - | 2.85 | Based on existing WIB receiver |
| Filtering | 42.01 | 31.88 | 320 | - | Based on Xilinx IP core |
| Hit-finding | TBD | TBD | TBD | TBD | More investigation needed |
| Total | 101.9 | 123.31 | 323 | 8.47 | Summation of modules above |
| Utilization | 30% | 18% | 9% | 15% | Based on XCZU15EG-1FFVB1156E |

  • Even with filtering, there are plenty of FPGA resources available for additional processing
  ○ Hit finding
  ○ Convolution/classification layers for neural-network image recognition
  ○ Lower-resource compression algorithm (being investigated by JJ)

SLIDE 10: Waveform Extraction

  • A version of waveform extraction was done for the 35-ton prototype.
  – Not used, due to time-to-deploy limitations
  • Resource usage was modest (based on the 35-ton effort).
  – Estimate: ~100 kLUTs for 5 x 128-channel groups
  • Much of the logic dealt with pseudo-signal processing
  – This logic moves to the filtering section (previous slide)
  • Needs more work to minimize the resource requirements if used in DUNE
  • Trigger primitives would be a natural by-product of the extracted waveforms
  – Since it is streaming, this provides a low-latency path for trigger formation.
  • We have learned a lot since that implementation
  – The 35-ton implementation worked on packets of 1024 samples
  – Would move to a continuous streaming model
  – Packetizing would be done afterwards, for practical reasons
  • See slides from JJ Russell here:
  https://docs.google.com/presentation/d/1XufamuZOdFGkIlHZEw4N8nXMSUEbK9OlhQ9pcAGn4wk/edit?usp=sharing

SLIDE 11: Current RCE Upgrade Efforts

SLIDE 12: New COB Design for LSST: Switch Upgrade

  • Adding 40 GbE support to the DPMs and backplane interfaces
  • Significantly increasing front-panel bandwidth
  ○ From 10 Gbps to 200 Gbps
  • The COB design embeds the first-level network switch layer within the ATCA crate
  ○ Reduces facility network requirements; direct connection to Infiniband-enabled switches
  ○ Direct link to the server room per rack or per COB

SLIDE 13: DPM Redesign for DUNE

  • Oxford/SLAC collaboration
  • Optimized for large memory buffering on the DPM
  • 32 GB per DPM for supernova buffering
  ○ 16 GB PS (only supports dual rank)
  ○ 16 GB PL
  ■ Wired to support quad-rank SODIMM for a possible 32 GB of memory
  • 20 of 24 HS RTM channels supported
  ○ 80 links/COB @ 1.25 Gbps (8B/10B)
  ○ 20 links/COB @ 5 Gbps (8B/10B)
  ○ 8 links/COB @ 10 Gbps (64B/66B)
  • Standard PL 10/40 GbE interface + 2x PS 1 GbE
  • Supports up to 2x mSATA per DPM on the RTM

SLIDE 14: Comparing Technical Specifics: Processor Side (PS)

| | proto-DUNE (Zynq-7000) | DUNE (Zynq UltraScale+) |
| Application CPUs | 2 @ 800 MHz | 4 @ 1.2 GHz |
| RT CPUs | N/A | 2 @ 500 MHz |
| GPU | N/A | 2 @ 600 MHz |
| PS DDR type | DDR3 | DDR4 |
| PS DDR size | 1 GB | 16 GB |
| PS DDR width | 32-bit | 64-bit |
| Switching speed | 1066 Mbps | 2400 Mbps |
| Raw peak bandwidth | 34 Gbps | 153.6 Gbps |

~4.5 times more raw peak DDR bandwidth (153.6 Gbps vs 34 Gbps)
https://www.xilinx.com/support/documentation/data_sheets/ds925-zynq-ultrascale-plus.pdf

SLIDE 15: Comparing Technical Specifics: Programmable Logic (PL)

| Device | kLUTs | kFFs | DDR | BRAM | URAM | DSP48 |
| XC7Z045-2FFG900E (proto-DUNE) | 218 | 437.2 | 0 GB | 19.2 Mb | 0 Mb | 900 |
| XCZU15EG-1FFVB1156E (DUNE) | 341 | 683 | 16 GB | 26.2 Mb | 31.5 Mb | 3,528 |

SLIDE 16: Other Possible RCE/COB Modifications

  • Move the COB/RCE platform to a pizza-box form factor
  ○ Possible cost reduction
  ○ High risk for maintainability and reliability
  ■ Fan failures are a high risk
  ■ Power supply failures are a second-level risk
  ■ Filtering will be required, but could be done at the rack level
  ■ Loss of telecom-class reliability and uptime
  • Simplify the COB and remove the network switch
  ○ Keep some local interconnects for data sharing
  ○ Cost-reduction possibilities
  ■ Shifts networking cost to outside the crate
  • COB-like ATCA carrier for commercial ZYNQ modules
  ○ Currently being investigated
  ○ Unclear cost reduction relative to the current COB/DPM due to the loss in density

SLIDE 17: RCE/COB Based Solution Costing

SLIDE 18: ATCA Packaging for DUNE

  • 1 APA = 2560 channels
  • 1 APA per COB
  ○ 4 DPMs per COB
  ○ 640 channels per DPM
  ○ Assuming 128 channels per high-speed link
  ■ 5x high-speed links per DPM
  • 150 APAs for the entire system = 150 COBs
  • Total rack space: 165U (see the worked numbers below)
  ○ 11x 14-slot ATCA crates
  ○ 15U per 14-slot ATCA crate
  ■ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460
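The packaging arithmetic above can be checked with a short Python snippet; all inputs come from this slide:

```python
# ATCA packaging arithmetic from this slide.
import math

CH_PER_APA = 2560
DPM_PER_COB = 4
CH_PER_LINK = 128
APAS = 150                 # 1 APA per COB -> 150 COBs
SLOTS_PER_CRATE = 14
U_PER_CRATE = 15

ch_per_dpm = CH_PER_APA // DPM_PER_COB          # 640
links_per_dpm = ch_per_dpm // CH_PER_LINK       # 5
crates = math.ceil(APAS / SLOTS_PER_CRATE)      # 11
rack_u = crates * U_PER_CRATE                   # 165
print(ch_per_dpm, links_per_dpm, crates, rack_u)  # 640 5 11 165
```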

SLIDE 19: ATCA Power/Cooling Estimates for DUNE

  • COB max power: 300 W
  ○ ~100 W for the Ethernet switch
  ○ 36 W for the RTM (limited by a 3 A fuse)
  ○ 160 W for digital processing
  ■ 40 W per DPM
  • Total max power: 45 kW (150 COBs x 300 W)
  • Cooling via forced air (integrated into the ATCA platform)
  • Power and thermal monitoring via the standard IPMI interface
  • Example of an ATCA crate that supports 400 W per slot:
  ○ http://www.asis-pro.com/maxum-atca-systems/14-Slot-14U-MaXum-460

SLIDE 20: ATCA Costs for DUNE

  • RCE cost estimate:
  ○ ~$22k per unit
  ○ COB + DPMs + DTM + RTM
  • 14-slot ATCA crates:
  ○ ~$8.5k per unit
  ○ IPMI + shelf manager + 10 GbE/40 GbE backplane + fans + power supplies
  • Total ATCA hardware cost: $3.4M (see the worked numbers below)
  ○ 11x ATCA crates
  ○ 150x RCE ATCA slots
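A quick check of the power and cost totals from this slide and the previous one, using the quoted per-unit figures:

```python
# System-level power and cost totals from slides 19 and 20.
COBS, COB_W = 150, 300                 # 150 COBs at 300 W max each
CRATES = 11
RCE_COST, CRATE_COST = 22_000, 8_500   # ~$ per unit

print(f"Max power: {COBS * COB_W / 1e3:.0f} kW")    # 45 kW
total = COBS * RCE_COST + CRATES * CRATE_COST
print(f"Hardware cost: ${total / 1e6:.2f}M")        # $3.39M -> quoted $3.4M
```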

SLIDE 21: Packaging And Architecture Thoughts

SLIDE 22: ATCA Components

[Photo: ATCA shelf showing the air intake filter, intake fans, exit fans, power supply (DC or AC input), and shelf manager.]

  • Telecom standard designed for “5 nines” (99.999%) uptime
  • Almost all components can be replaced in the field
  • Redundancy is available if desired
  ○ N+1 redundancy for power supplies
  ○ Redundant shelf managers
  • The system is designed to handle one fan failure in each fan tray
  ○ The shelf manager generates an alarm to request fan-tray replacement


SLIDE 23: ATCA Shelf Management & IPMI

[Diagram: redundant shelf managers, reached over Ethernet and a console, manage the power supplies, fan trays, and shelf EEPROMs; each application card hosts an IPMC with its own EEPROM.]

  • ATCA uses IPMI for management purposes
  ○ Intelligent Platform Management Interface
  • Manages and monitors all shelf-based components
  ○ Power supply status and power
  ○ Shelf inlet and exit temperatures
  ○ Fan speed control and monitoring
  ○ Application card control and monitoring
  • Redundant EEPROMs contain all shelf information
  ○ Shelf serial number, location and ID
  ○ Shelf manager IP/MAC address
  • Each application card hosts an IPMC
  ○ Intelligent Platform Management Controller
  • The IPMC stores all application card information in a local EEPROM
  ○ MAC addresses
  ○ Serial number, card type & revision

SLIDE 24: How Do We Commission Without A Smart WIB?

  • Concerns have been raised about having a “dumb” WIB and commissioning the front-end boards
  ○ Do we need WIB intelligence to commission a front end?
  • Many toolsets and much hardware already exist
  ○ Lots of RCE-based designs started with small-scale development
  • Use the RCE platform:
  ○ With the RCE in place, local Python scripts can interact with the front end and accept data (a sketch of this pattern follows below)
  ○ Full DAQ firmware is not required; existing tools can accept any data from the front end
  • Use an RCE development board (or any Xilinx dev board, for that matter)
  ○ Same environment as the real RCE
  ○ Existing firmware & software can be used
  ○ Portable and can be used with a laptop
  • Use a PC with a PCI-Express card
  ○ Firmware & software already exist
  ○ Also Python-capable
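As an illustration of the "local Python scripts" workflow, a minimal sketch of reading a front-end status register over UDP. The IP address, port, register address, and packet framing here are invented placeholders, not the real RCE or front-end protocol:

```python
# Hypothetical commissioning helper: read one front-end status register
# over UDP. Address, port, and framing are invented for illustration;
# the real RCE tooling defines its own register protocol.
import socket
import struct

RCE_ADDR = ("192.168.1.10", 8192)   # placeholder RCE IP and port
STATUS_REG = 0x0004                 # placeholder register address

def read_register(addr: int, timeout: float = 1.0) -> int:
    """Send a read request and return the 32-bit register value."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(struct.pack("<BI", 0x01, addr), RCE_ADDR)  # 0x01 = read
        payload, _ = sock.recvfrom(1024)
        (value,) = struct.unpack("<I", payload[:4])
        return value

if __name__ == "__main__":
    print(f"status = 0x{read_register(STATUS_REG):08x}")
```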


SLIDE 25: Reliability

  • The system design must consider the reliability and maintenance of its components
  ○ How often will a component fail?
  ■ How many channels are impacted by each failure mode?
  ■ Yearly cost of replacing components
  ○ How easy is it to replace a failed component?
  ■ Downtime
  ■ What level of modularity?
  ■ Can a replacement item be obtained (same model)?
  ○ How susceptible to damage is each component?
  ■ Fibers between the detector and the surface
  ■ Patch panels and interconnects along the way

  • Are there architecture choices that enable failed units to be bypassed?
  ○ Use network protocols that allow rerouting
  ○ Optical cross-connect switches to allow fiber failures to be bypassed
  ○ Processing units susceptible to failure (e.g. GPUs) deployed in a resource pool without a dedicated connection to a detector unit
  ■ A failed unit can be taken offline and a “hot” replacement re-allocated
  ■ The necessary network infrastructure must also be considered

  • How are devices monitored for potential failures?
  ○ Temperature sensors & fan speed sensors
  ○ Fiber-optic power measurements to detect failed optics
  ○ Alarms?


SLIDE 26: Reliability (continued)

  • Will the spinning fans on a standard PCI-Express plug-in board fail?
  • What is the replacement process?
  • How many other processing units must be taken down to replace the failed unit?
  • How much of the detector is lost when the card is the gateway to the front end?
  • How are fan failures reported, if at all?
SLIDE 27: LCLS-2 Approach To PCI-Express Fiber Boards

  • The LCLS-2 DAQ system uses KCU1500 boards as the gateway from the mid-level DAQ and cameras to back-end processing
  • It is expected that processing resources will change and that boards may fail over time
  • Uses crossbar optical switches to allow flexible connection between any front end and a pool of KCU1500 boards located in back-end servers
  • GPUs, when used, are accessed through Ethernet and Infiniband