Firmware Projects and Hardware Demonstrators

Slide 1

Firmware Projects and Hardware Demonstrators

Yuri Gershtein, Rutgers University for Tracklet Team

Technical Review

28-Aug-2017

Slide 2

Outline

▪ Reminder about structure of Firmware

  • Barrel
  • Disk

▪ Hardware Test Stands

  • CTP7 / VC709

▪ Comparison tools

  • Emulation – Simulation – Hardware

▪ Current status / plans

  • Tracklet 2.0
  • No sector-to-sector communication (?)
  • 25G link demo board
Slide 3

High Level Overview


▪ Massively parallel track reconstruction

  • Divide detector into 28 φ sectors (a 2 GeV track spans at most 2 sectors), spanning all eta
  • Time multiplexed system (TMUX=6)

▪ Consider a SECTOR, consisting of one board (SECTOR PROCESSOR, SP), as the top-level FW unit
▪ Tracklets formed within a sector
▪ A 2 GeV track can project into its adjacent sectors
▪ Projections must be sent to other sectors for stub matching
▪ Minimal data duplication
▪ One SP board for each sector → need inter-SP communication

With 25G DTC->SP links, there is enough bandwidth to duplicate data to the SPs instead of using inter-SP communication. This saves on complexity and reduces latency.
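As a rough cross-check of the sector geometry, a sketch (not from the slides; it assumes the CMS solenoid field of 3.8 T and an outer tracker radius near 1.1 m) shows that the bend of a 2 GeV track is comparable to one sector width, which is why a track only ever shares stubs with its adjacent sectors:

```python
import math

# Back-of-envelope check of the sector geometry. The field (3.8 T) and
# outer radius (~1.1 m) are assumptions, not numbers from the slides.
PT_GEV = 2.0          # lowest pT the track finder must handle
B_T = 3.8             # CMS solenoid field (assumed)
R_OUTER_M = 1.1       # outer tracker radius (assumed)
N_SECTORS = 28

r_curv = PT_GEV / (0.3 * B_T)                  # radius of curvature in metres
dphi = math.asin(R_OUTER_M / (2.0 * r_curv))   # phi sweep from origin to outer layer

print(f"2 GeV phi sweep: {math.degrees(dphi):.1f} deg")   # ~18 deg
print(f"sector width   : {360.0 / N_SECTORS:.1f} deg")    # ~12.9 deg
```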

Slide 4

Firmware Overview


▪ Tracklet algorithm lends itself to a modular, massively parallelized, pipelined, fixed-latency implementation, ideal for FPGAs

  • Algorithm consists of 11 steps
  • Each step processes a new BX every 25ns * NTMUX

▪ High level overview: a few (simple) calculations, lots of data movement, replicated many times

  • We use aggressive partitioning to reduce occupancy (combinatorics!)
  • Massive parallelization speeds processing

▪ Most challenging part of the firmware is to move the data to where it needs to be for calculations.
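The timing arithmetic behind "a new BX every 25 ns * NTMUX" is simple enough to spell out; a sketch (the 240 MHz processing clock is quoted later in the slides):

```python
# Throughput bookkeeping for the time-multiplexed pipeline.
LHC_BX_NS = 25.0   # LHC bunch-crossing period
NTMUX = 6          # time-multiplexing factor
F_CLK_MHZ = 240.0  # processing clock

period_ns = LHC_BX_NS * NTMUX              # one new event every 150 ns per SP
clks_per_bx = period_ns * F_CLK_MHZ / 1e3  # processing window in clock ticks

print(f"new BX every {period_ns:.0f} ns -> {clks_per_bx:.0f} clks at {F_CLK_MHZ:.0f} MHz")
# -> new BX every 150 ns -> 36 clks at 240 MHz
```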

Slide 5

Firmware processing steps


  • 1. Layer Router
  • 2. VM Router
  • 3. Tracklet Engines
  • 4. Forming tracklets, initial param estimate
  • A. Projection to neighbors
  • 5. Transmit projections
  • 6. Organize projections
  • 7. Match Engines
  • 8. Match projections to stubs
  • 9. Match transmission
  • 10. Track fit
  • 11. Duplicate removal

[Diagram legend: data movement, calculations, links]
Slide 6

Two FW projects


▪ There are two firmware projects to cover half of a sector

  • ½ Barrel project
  • Hybrid + Disk project: slightly different math, and the fit is more memory intensive

▪ Infrastructure code (moving data, etc.) shared between projects
▪ Different challenges presented by each project

[Diagram: seeding layers. ½ barrel project: L1+L2, L3+L4, L5+L6. Hybrid + Disk project: L1+D1, D1+D2, D3+D4]

Slide 7

Fixed Latency Design


▪ Processing modules read from, and write to, memories (BRAMs)

  • Mechanism for hand-off between algorithm stages
  • Dual-ported memories decouple clock domains

▪ Each processing step takes a fixed time to produce its first output (latency)
▪ The pipelined design then produces new output for the next NTMUX*25 ns
▪ After that time, move to the next BX (truncate if necessary)
▪ Result: fixed latency for the entire processing chain
▪ Pipelined: multiple BX in flight at one instant in time
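A toy software model (not the actual Verilog) of one memory-processing-memory stage shows how the fixed window forces fixed latency, and where truncation happens:

```python
# Toy model of one pipeline stage: read from an input BRAM for a fixed
# 36-clock window, write results to an output BRAM, truncate the rest.
CLKS_PER_BX = 36  # TMUX=6 at 240 MHz

def process_step(input_mem, transform):
    """One stage: fixed processing window, hence fixed latency."""
    output_mem = []
    for clk in range(CLKS_PER_BX):    # one read per clock tick
        if clk >= len(input_mem):     # input exhausted: stage idles
            continue
        output_mem.append(transform(input_mem[clk]))
    # Entries beyond the window are truncated -- the price of fixed latency.
    truncated = max(0, len(input_mem) - CLKS_PER_BX)
    return output_mem, truncated

stubs = list(range(50))               # a busy BX: 50 entries, window is 36
out, lost = process_step(stubs, lambda s: s * 2)
print(len(out), "processed,", lost, "truncated")   # 36 processed, 14 truncated
```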

Slide 8

Firmware implementation structure


▪ Firmware consists of a few hand-optimized Verilog modules replicated thousands of times.
▪ Wiring up these modules is automated via Python scripts driven by configuration data.

  • Configuration stored in a master configuration file
  • Same config steers the C++ emulation and the firmware
  • Different sub-configurations allow specializations for the barrel and hybrid+disk projects while sharing code

▪ Debugging is easier as the C++ emulation has the same memory-processing-memory format
▪ FW is stored in a GitHub repo for efficient many-person development

  • Tagged FW and emulation repos are kept in sync
Slide 9

Track finder project generation flow chart


[Flow chart: the Master Config is reduced per project (SubProject.py) and drives Wires.py, which generates the top-level Verilog and project configuration files for Vivado simulation or processing in hardware (the 'bit' file); the same configuration steers the C++ tracklet emulation. Inputs: stubs, LUTs. Output: L1 tracks. Legend: project generation (C++ or Python software) vs. Verilog code.]
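For illustration, a minimal sketch of what a Wires.py-style generator does: turn configuration entries into top-level Verilog instantiations. The config format and module/port names here are invented, not the project's actual ones.

```python
# Illustrative sketch of config-driven wiring generation. The real
# master-config format is not shown in the slides; this one is made up.
config = [
    # (module type, instance name, input memory, output memory)
    ("TrackletEngine", "TE_L1L2_1", "StubPairs_L1L2_1", "TrackletPars_1"),
    ("TrackletEngine", "TE_L1L2_2", "StubPairs_L1L2_2", "TrackletPars_2"),
]

def emit_verilog(cfg):
    """Emit top-level instantiations wiring each module to its BRAMs."""
    lines = []
    for mtype, inst, in_mem, out_mem in cfg:
        lines.append(f"{mtype} {inst} (")
        lines.append(f"  .mem_in ({in_mem}),")
        lines.append(f"  .mem_out({out_mem})")
        lines.append(");")
    return "\n".join(lines)

print(emit_verilog(config))
```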

Slide 10

[Diagram: processing chain. Stub organization → forming tracklets → organize tracklet projections → match tracklet projections to stubs → track fit; plus projection transmission to neighbors and match transmission.]

Tools generate the top-level HDL, the C++ emulator configuration, and this diagram

Each step takes a predetermined amount of time (fixed latency); duplicate removal is the next step. The processing steps (red) implement the algorithm, from stub input to track output. Shown for the 1/4 barrel project; not all connections shown.

Slide 11

Nearest-neighbor communication


▪ Reminder: low-pT tracks project to neighboring sectors, so communication is needed
▪ Use serial links over fiber optics for inter-sector communication

  • 8b/10b protocol, synchronized across boards

▪ Estimated positions for projected tracklets are sent to the neighbor
▪ Corrections to the estimated position are sent back to the original board for the final fit

Implemented in hardware

Slide 12

Link Organization


▪ Projections to be sent are grouped to balance the loads on the transceivers
▪ Use a priority encoder to read many memories (almost all empty); a software sketch follows this list

  • MUX memory sources and send a single data stream
  • DeMUX on the receiving end and route to the appropriate memory

▪ Transceivers for returning matches are nearly identical
▪ Vivado simulation is done up to the edge of the transceiver

  • Latency simulated with a holding FIFO
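A minimal software sketch of the MUX/DeMUX idea (the real logic is Verilog running at the processing clock; the memory contents here are invented):

```python
# Sketch of the MUX: each cycle a priority encoder picks the first
# non-empty projection memory, and its entry goes onto the single link.
def priority_encode(memories):
    """Return index of the first non-empty memory, or None."""
    for i, mem in enumerate(memories):
        if mem:
            return i
    return None

def mux_stream(memories):
    """Drain many (mostly empty) memories into one tagged data stream."""
    stream = []
    while (i := priority_encode(memories)) is not None:
        # Tag each word with its source so the receiving end can DeMUX
        # it back into the appropriate memory.
        stream.append((i, memories[i].pop(0)))
    return stream

mems = [[], ["projA"], [], ["projB", "projC"]]
print(mux_stream(mems))   # [(1, 'projA'), (3, 'projB'), (3, 'projC')]
```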
Slide 13

Track Finder Development Workflow

[Diagram: development workflow, with DTCs feeding the sector processors]

Slide 14

Hardware Platform: Virtex-7 FPGA XC7VX690T

[Photos: CTP7 board, developed at U Wisconsin; Xilinx VC709 board (OSU, Cornell, Rutgers); 4-CTP7 test stands at Cornell, Rutgers, and CERN]

Slide 15

Demonstrator Test Stands (4 CTP7s)

▪ Hardware:

  • 4 CTP7 boards in μTCA
  • 3 sector processors (SPs)
  • 1 data source and sink
  • AMC13 card for clock/sync
  • Represents 1 of N TMux

▪ I/O:

  • Data source: emulates the DTC
    ✦ Provides stub inputs for both the central sector and its neighbors
  • Sector boards:
    ✦ Neighbor communication of projections and matches
  • Data sink: captures L1 tracks
Slide 16

Links and Synchronization

▪ Inter-board communication modules built based on Wisconsin's protocol

  • 8b/10b encoding, 10 Gbps link speed
  • We can run with the system configured as TMux = 4 or TMux = 6

▪ Each CTP7 has 48 optical outputs and 67 inputs at 10 Gbps

  • Sufficient I/O for demonstrator needs

▪ Synchronization of boards from central BC0 & clock sent from the AMC13
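For the payload budget of one such link, a quick sketch (only the line rate, the 8b/10b encoding, and the TMUX factor come from the slides):

```python
# Rough payload budget for one 8b/10b link per time-multiplexed BX slot.
LINE_RATE_GBPS = 10.0
ENCODING_EFF = 8.0 / 10.0   # 8b/10b overhead
TMUX = 6
BX_NS = 25.0

payload_gbps = LINE_RATE_GBPS * ENCODING_EFF   # 8 Gb/s of usable data
bits_per_bx = payload_gbps * TMUX * BX_NS      # Gb/s * ns = bits

print(f"{payload_gbps:.0f} Gb/s payload -> {bits_per_bx:.0f} bits per 150 ns BX slot")
# -> 8 Gb/s payload -> 1200 bits per 150 ns BX slot
```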

Slide 17

Demonstrator Tests

▪ Latency

  • Design is a fixed-latency approach
  • Time from the start of sending input stubs from the "DTC" to the arrival of the first track at the track sink
  • Measured with various designs: half barrel, hybrid + disks

▪ Performance versus emulation

  • Capture the output of the SP and compare bitwise with the emulation

▪ Stability

  • Run for an extended period of time with error checking
Slide 18

Latency measurement

▪ Single-pass, end-to-end measurement
▪ Includes

  • transmission of data
  • all processing steps

▪ Done for many events (single muon, ttbar)
▪ Measured latency includes

  • stub input links from the DTC board to the sector board
  • processing of each step, including inter-board communication
  • track output links back to the DTC board

▪ Measurement with a clock counter

  • 240 MHz clock, same as the processing clock
  • implemented on the DTC emulator
  • start: read enable of the input memory
  • counter output is written to a memory when the first stub is sent or valid tracks are received, giving a list of time stamps
  • the BX of the received track is also written into the memory

Slide 19

Latency measurement

[Scope capture, single muons: first stub sent (ev#1), bunch crossing in binary, arrival of track at sink]

Latency = Trk_Clk - Start - BX*36 (36 ticks of the 240 MHz clock = 150 ns)

Latency = 800 clks = 3.33 μs until the 1st possible track arrives at the sink
Latency = 836 clks = 3.48 μs (+36 clks / 150 ns) until the last possible track arrives at the sink
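Plugging the measured counter values into the formula, a worked check of the numbers above:

```python
# The counter readings from the scope capture, run through the formula.
F_CLK = 240e6      # processing clock, Hz
CLKS_PER_BX = 36   # 150 ns at 240 MHz

def latency_us(trk_clk, start, bx):
    """Latency = Trk_Clk - Start - BX*36, converted to microseconds."""
    return (trk_clk - start - bx * CLKS_PER_BX) / F_CLK * 1e6

# 800 clks -> 3.33 us (first possible track);
# +36 clks -> 836 clks -> 3.48 us (last possible track).
print(f"{800 / F_CLK * 1e6:.2f} us")   # 3.33
print(f"{836 / F_CLK * 1e6:.2f} us")   # 3.48
```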

Slide 20

Latency Model

TMUX 6, 240 MHz CLK

Slide 21

Latency Model

TMUX 6, 240 MHz CLK. Processing time of each module before moving to the next BX (TMUX = 6).

Slide 22

Latency Model

TMUX 6, 240 MHz CLK. Overhead in each processing module.

Slide 23

Latency Model

TMUX 6, 240 MHz CLK. Inter-board communication latency:

  • Transmission protocol for stub inputs, projections, matches, and track outputs
  • 76 clks (240 MHz) measured with ChipScope
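The model's bookkeeping can be sketched as a sum of per-step processing windows, per-step overheads, and link hops. The per-step overheads and the number of link hops below are placeholders, not the model's actual values:

```python
# Hedged sketch of the latency model's structure. Only CLKS_PER_BX,
# LINK_LATENCY_CLKS and N_STEPS come from the slides; the overheads
# and hop count are illustrative placeholders.
CLKS_PER_BX = 36          # processing window per step (TMUX = 6, 240 MHz)
LINK_LATENCY_CLKS = 76    # measured with ChipScope
N_STEPS = 11              # algorithm steps (slide 5)

overhead_clks = [10] * N_STEPS   # placeholder per-step overheads
n_link_hops = 4                  # e.g. stubs in, projections out/back, tracks out

total = N_STEPS * CLKS_PER_BX + sum(overhead_clks) + n_link_hops * LINK_LATENCY_CLKS
print(f"{total} clks = {total / 240e6 * 1e9:.0f} ns")
```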
Slide 24

Latency Model

TMUX 6, 240 MHz CLK

Slide 25

Firmware tests performed - results vs emulation

▪ Three-way comparison

  • Reality: all the hard work lies in the comparison of emulation to Vivado simulation
  • Agreement between Vivado and hardware has always been excellent

▪ Two designs (flat barrel geometry):

  • Half Barrel
  • Hybrid + Disks

▪ Samples

  • Single muon
  • ttbar + 200 PU (100 events)

✦ 26 sectors, both designs: Vivado <—> Emulation
✦ Half Barrel, one sector: Vivado <—> Hardware

Results: single muon: perfect agreement; ttbar: ~99% agreement (see the comparison sketch below)
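The bitwise comparison itself amounts to a word-by-word diff of captured and emulated output; a minimal sketch (the data words are invented):

```python
# Sketch of the three-way comparison: dump the words captured from the
# SP output (hardware or Vivado simulation) and diff them bitwise
# against the C++ emulation's output for the same events.
def compare_bitwise(captured, emulated):
    """Return (n_match, n_total) over a word-by-word comparison."""
    n_match = sum(1 for c, e in zip(captured, emulated) if c == e)
    return n_match, max(len(captured), len(emulated))

hw = [0x1A2B, 0x3C4D, 0x5E6F]
em = [0x1A2B, 0x3C4D, 0x5E60]          # last word differs
n, tot = compare_bitwise(hw, em)
print(f"{n}/{tot} words agree ({100 * n / tot:.0f}%)")
```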

Slide 26

Half Barrel - ttbar + 200 PU

[Plots: comparisons of pT, φ, z0, η (local)]

Slide 27

Hybrid + Disks: ttbar + 200PU

[Plots: comparisons of pT, φ, z0, η (local)]

Slide 28

Stable running

[Plot: error count vs. time, annotated: initial configure, input link unplugged, input link plugged, reset]

▪ Send 100 distinct single muon events from the source to the processing board; repeat many times to test stability over times long compared to BX and orbit timescales
▪ Compare final tracks to the expected tracks in the Track Sink, accumulating errors
▪ Error counter connected to a register that is read out every minute (see the sketch below)
▪ Unplug a cable to convince yourself it actually does something
▪ DEMONSTRATE: processing boards run error-free over long times
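The readout loop for that error counter might look like the following sketch; read_error_counter() is a hypothetical stand-in, since the slides do not specify the control path to the register:

```python
import time

# Sketch of the stability check's readout loop. read_error_counter() is a
# placeholder for the board-specific register access.
def read_error_counter() -> int:
    """Hypothetical: read the error-count register on the sink board."""
    raise NotImplementedError("board-specific register access goes here")

def monitor(poll_s: int = 60) -> None:
    """Poll the error counter once a minute and report any change."""
    last = 0
    while True:
        count = read_error_counter()
        if count != last:                  # e.g. after unplugging a cable
            print(f"{time.ctime()}: errors {last} -> {count}")
        last = count
        time.sleep(poll_s)                 # register is read out every minute
```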

Slide 29

FW project resource usage I

▪ 3 test projects:

  • Half-barrel
  • Disks + Hybrid
  • Half-sector

▪ Virtex-7 (XC7VX690T) FPGA

[Table: Half-barrel resource usage; includes all parts of the algorithm and seeding]

Slide 30

FW project resource usage II

[Tables: Hybrid + Disks and Half-sector resource usage; no duplicate removal or L1F1; includes all parts of the algorithm and seeding]

Half-sector is almost within reach with the current project and technology.

Slide 31

Module resource usage

Module   LUT Logic   LUT Memory   BRAM    DSP
LR       64
SL       50          100
VMR      150
AS       12          2
VMS      40/20       60/4         0/1     0/0
TE       38          2            1
SP       22          36
TC       859         184          0.5     51
Tpar     16          4
TProj    69/12       152/4        0/1.5
PT       570         6

Slide 32

Module resource usage (continued)

Module   LUT Logic   LUT Memory   BRAM    DSP
PR       260         2
AP       10          1
VMP      22          28
ME       33          1
CM       29          32
MC       600         17
FM       10/12       4/4          1/1.5
MT       747         6
FT       2491        239          18.5    32
TF       49          4
PD       11202       3

Slide 33

Full sector estimate

▪ Given the resources used in each module, we can estimate how much we need for a full sector in the ultimate system

  • This is to be compared with the resources available in the UltraScale+ (VUxP) FPGAs
  • FPGA resources used are shown as a percentage of the total available for each chip

▪ Goal: one SP in one FPGA
▪ Resource needs are compatible with UltraScale+ resources

              LUT Logic   LUT Memory   BRAM     DSP
Full sector   279733      151191       2721.5   1818
VU3P          32%         81%          85%      80%
VU5P          21%         53%          58%      52%
VU7P          16%         40%          42%      40%
VU9P          11%         27%          28%      27%
VU11P         10%         27%          29%      20%
VU13P         7%          20%          22%      15%

full sector: ±η; includes barrel, disk and hybrid
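Reading the table above as data, one can ask which chips keep every resource under a given utilization target; the 50% target below is an illustrative choice, not from the slides:

```python
# Percentage use of each resource per chip, taken from the table above.
usage = {
    "VU3P":  {"LUT Logic": 32, "LUT Memory": 81, "BRAM": 85, "DSP": 80},
    "VU5P":  {"LUT Logic": 21, "LUT Memory": 53, "BRAM": 58, "DSP": 52},
    "VU7P":  {"LUT Logic": 16, "LUT Memory": 40, "BRAM": 42, "DSP": 40},
    "VU9P":  {"LUT Logic": 11, "LUT Memory": 27, "BRAM": 28, "DSP": 27},
    "VU11P": {"LUT Logic": 10, "LUT Memory": 27, "BRAM": 29, "DSP": 20},
    "VU13P": {"LUT Logic": 7,  "LUT Memory": 20, "BRAM": 22, "DSP": 15},
}

# Chips where every resource stays under a 50% headroom target
# (the 50% target is illustrative, not from the slides).
ok = [chip for chip, u in usage.items() if max(u.values()) < 50]
print(ok)   # ['VU7P', 'VU9P', 'VU11P', 'VU13P']
```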

Slide 34

Demo platform for 25G links

▪ High speed link project

  • Explore links, connectors, layout

▪ Based on the existing g-2 project

  • Dual-height μTCA blade
  • Includes PICMG MTCA.4 RTM

▪ Explore different 25 G technologies, including fiber and copper RTM interconnects

  • μTCA: limit exposure to new tech for fast turn-around

▪ Kintex and Virtex UltraScale

  • Kintex for processing, KU115
  • Virtex for I/O, VU080
  • B2104 package, -2 speed grade, footprint-compatible

[Block diagram, serial link development RTM: UltraScale FPGA in B2104 package (32 GTH + 32 GTY; KU095/KU115/VU080/VU095/VU125/VU160/VU190 options); Zynq XC7Z010 FPGA with Atmel AVR (MMC), AXI chip-to-chip, Ethernet RJ45, IPMI, power (+12V, +3.3V); QSFP28 4x25 Gbps XMIT/RECV (4 GTY each); Firefly 12-channel transmitters and receivers on MTP-24 fiber (12 GTH/GTY in and out); AMC, RTM copper and RTM fiber connects; SFP 1 Gbps for IPbus (1 GTH); utility I/O, SMA digital buffers, USB JTAG, SPI config flash, SDHC card, DDR3 memory.]

Slide 35

Demo platform for 25G links


▪ YuuuuuGE!

Slide 36

Summary

▪ Demonstrator Test Stand

  • Based on the CTP7 (Virtex-7) board
  • Sector Processor plus nearest neighbors and data source/sink

▪ Latency Measurements

  • Single measurement: stub source to track sink
  • Excellent agreement with the latency model

▪ Emulation vs. Firmware/Hardware

  • Perfect agreement for single particle events
  • Excellent agreement (~99%) for ttbar + 200 PU

▪ Stable running
▪ Design appears to be scalable to the UltraScale+ generation of FPGAs

Latency: 3.33 μs (first trk) to 3.48 μs (last possible trk)

Slide 37

Backup

Slide 38

Latency measurement — ttbar events

[Scope capture, ttbar: first stub sent (ev#1), bunch crossing in binary, arrival of track at sink; multiple tracks reported]

Latency = Trk_Clk - Start - BX*36 (36 ticks of the 240 MHz clock = 150 ns)

Latency = 800 clks = 3.33 μs until the event's 1st track arrives at the sink
Latency = 836 clks = 3.48 μs (+36 clks / 150 ns) until the last possible track arrives at the sink

Slide 39

Comparisons (curv and t)

[Plots: curvature and t comparisons for Half Barrel and Hybrid+Disks]

Slide 40

Hardware board: CTP7

▪ CTP7 is a Virtex-7 based system (690T-2)

  • 3500 DSPs, 1400 BRAMs
  • 80 GTH optical transceivers (67 RX + 48 TX on the CTP7)
  • 14 TX+RX to the backplane

▪ XC7Z045 Zynq SoC for slow controls and configuration
▪ Three 4-board test stands (at Cornell, Rutgers, and CERN)

Developed at Univ. Wisconsin

Slide 41

Reminder: Demonstrator parameters

▪ Demonstrator project describes a flat barrel tracker

  • Segmented in 28 sectors in phi, covering all eta
  • Time multiplex factor of six (150 ns per step)

▪ Ultimate project will be with the tilted barrel:

  • Estimate 27(?) sectors with the same TMUX factor
  • Tracklet 2.0 configuration (VMs of long strips)

✦ Better load balancing leads to smaller effects of truncation
✦ Approach/firmware building blocks are the same

  • Stub rate per module will be lower in this configuration by x2

[Diagram: seeding. Half Barrel: L1+L2, L3+L4, L5+L6. Hybrid + Disk: L1+D1, D1+D2, D3+D4]