[PPT] - AM+FPGA Mezzanine Card Status Zhen Hu, Fermilab February 3, 2016 PowerPoint Presentation

SLIDE 1

AM+FPGA Mezzanine Card Status

Zhen Hu, Fermilab February 3, 2016

SLIDE 2

Scope of This Talk

§ Data Sourcing: emulate data coming from DTC
§ Data Delivery: deliver stubs to the pattern recognition engine
§ Pattern Recognition: fast PR using AM
§ Track Fitting: calculate track parameters in FPGA

§ Pattern Recognition Mezzanine (PRM) Firmware

§ Core piece of firmware implemented that interacts with the Associative Memory (AM)
§ Developed from scratch with low latency goals for L1: ~1 µs for PR & Track Fitting

1. Local address (inside a module) → global address (in the full tracker)
2. Fast Pattern Recognition with coarse resolution, while the full resolution stubs are stored in a "smart library" (data organizer)
3. When patterns are found (roads), the full resolution stubs have to be retrieved from the data organizer very fast

Director's Review – Mezzanine Card Status

Zhen Hu, February 3, 2016

SLIDE 3

Outline

§ Hardware: Pattern Recognition Mezzanine (PRM)
§ Firmware

§ Overview of PRM firmware

✓ A fast, reasonably sized AM in UltraScale FPGA
✓ Data Organizer for low latency, high speed pipelined operation
✓ Stub local to global address conversion on the fly

§ Current status

§ Firmware fully developed; now bench testing the performance with different types of events

✓ Start with single muon
✓ Then add 140/200 PU, high-energy jets …

Goal: test with processing full trigger tower stubs

SLIDE 4

PRM: UltraScale+UltraScale/VIPRAM Mezzanine

[Figure labels: Master, Slave, VIPRAM, SRAM]

§ Prototype pattern recognition engine for L1 tracking trigger demonstration

§ Dual Kintex UltraScale FPGAs
§ 36 Mb low latency DDR II+ static RAM
§ Socket for Next Generation VIPRAM ASIC (Vertically Integrated Pattern Recognition Associative Memory)

SLIDE 5

Pattern Recognition Engine Flow

[Flow diagram labels: stubs from Pulsar for a given BX and Trigger Tower; incoming stubs, ~ few x 100 per layer; AM patterns stored; roads fired (<100); Data Organizer (stubs library); roads with all associated stubs; combination clean-up; track fitting stage; Master FPGA tracks clean-up. Time intervals Δt2 and Δt3 are marked.]

SLIDE 6

Pattern and Superstrip (ss)

§ A pattern is a low resolution track

§ Made of 1 superstrip per layer
§ A superstrip is a group of adjacent strips

§ Fountain superstrip configuration

§ "Fountain" defined by natural spread in ϕ, due to multiple scattering and secondary interactions
§ Different z-segmentations

§ The working point chosen in this talk:

§ Scale factor (sf) for the fountain superstrip dϕ = 1
§ Number of z-segmentations (nz) = 4
§ Pattern bank size = 1862K @ 95%
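As a rough illustration of the two definitions above, the sketch below maps a stub to a superstrip ID and checks whether a pattern fires. The SSID encoding (4 z-segments x 256 ϕ superstrips, filling a 10-bit ID), the fractional coordinates, and the function names are assumptions made for the example; the per-layer fountain scaling is not modeled.

```python
N_LAYERS = 6      # layers per pattern, as in the 6-layer AM described in this talk
NZ = 4            # number of z-segmentations (working point nz = 4)
N_PHI_SS = 256    # assumed: 1024 superstrip IDs / 4 z-segments


def superstrip_id(phi_frac, z_frac):
    """Map a stub position to an SSID.

    phi_frac and z_frac are fractions in [0, 1) of the tower's phi and z span
    (a hypothetical normalization chosen for this sketch).
    """
    phi_ss = min(int(phi_frac * N_PHI_SS), N_PHI_SS - 1)
    z_ss = min(int(z_frac * NZ), NZ - 1)
    return z_ss * N_PHI_SS + phi_ss


def pattern_fired(pattern, stubs_per_layer):
    """A pattern (one SSID per layer) fires if every layer has a stub in its superstrip."""
    for layer in range(N_LAYERS):
        ssids = {superstrip_id(p, z) for p, z in stubs_per_layer[layer]}
        if pattern[layer] not in ssids:
            return False
    return True


# One stub per layer at the same fractional position, and the pattern that covers it.
muon = [[(0.50, 0.30)] for _ in range(N_LAYERS)]
pattern = tuple(superstrip_id(0.50, 0.30) for _ in range(N_LAYERS))
```

Two stubs that fall in the same superstrip are indistinguishable at this stage, which is why the full resolution stubs must be kept elsewhere and retrieved once a road fires.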


SLIDE 7

PRM Core Module (1) AM in FPGA

§ AM in FPGA: very closely follows the AM ASIC (chip) design

§ Match two silicon tiers in ASIC with two modules in FPGA firmware

CAM Tier -> a 2D array of Pattern Modules
– 4k patterns = 50% KU040 LUT utilization
I/O Tier -> fired roads serialization and output

§ Pipelined operation

CAM tier: processes pattern matching with stubs for current event N
I/O tier: outputs road addresses for event N-1 at the same time

§ Logic is optimized for 7-Series/UltraScale architecture

[Block diagram: Layer Data Inputs (input stubs), Clock, and Control feed the Pattern Module Array ("CAM Tier"); its Road Flags feed the Road Serialization Logic ("I/O Tier"), which outputs the Road Address.]

SLIDE 8

I/O Tier: Road Serialization

§ Synchronous, pipelined operation
§ Zero suppressed output
§ Low latency: first road out 7 clock cycles after EOE (end of event)

[Block diagram labels: 32 x 32 array, column-OR bits, priority encoder (PRI ENC), column number, column MUX, column vector, sort nodes, road address. Annotations: road fired; fast output, as in ASIC projection.]

SLIDE 9

PRM Core Module (2) Data Organizer

§ A critical module for fast stub storage and fast interactions with AM

§ Input stubs (full resolution) are temporarily stored in the data organizer while the AM does pattern matching
§ Once patterns are found (roads), they are sent back to the data organizer to retrieve the full resolution stubs for each fired road

§ This is a new design after a few iterations

§ Low latency for L1 operation

Avoids conventional read-modify-write cycles
Takes advantage of the BlockRAM "read first" mode

§ Fully pipelined operation to match AM readout

The data organizer must concurrently store stubs for event N while recalling stubs for event N-1

§ The data organizer stores stubs in RAM at the address pointed to by the ss ID

§ Super strip ID 10 bits → 1k memory locations
§ Store up to 4 stubs (local or global) per super strip

§ Masking and scrubbing logic ensures old stubs are not read out

SLIDE 10

Data Organizer: Block Diagram

[Block diagram labels: inputs WSSID, DIN, EOE, and RSSID; registers wssid_reg, din_reg, eoe_reg, count_reg, ld_count_reg, wea_reg; cascaded 2k x 64 dual-port BRAMs in "Read First" mode with ports A (write address) and B (read address); outputs dout0_reg, dout1_reg, dout2_reg, dout3_reg.]

It writes like a FIFO (first in, first out), but reads like a memory.

SLIDE 11

PRM Core Module (3) Local to Global Conversion

§ Why?

§ Stubs are received in local coordinates and converted to global coordinates (for ss, track fitting …)

Local coordinate (in red):
– ModuleID, nStubAddr (for ϕ), Zlocal

Global cylindrical coordinate (in blue):
– ϕ, Z, and R

§ How?

§ The global coordinates of the edges of a module are known given the module ID

ModuleID -> Global (ra, ϕa, Za), (rb, ϕb, Zb)

§ Global ϕ or R is approximately a linear function of nStubAddr in a small range

Fitted with 8 linear functions (FPGA friendly):
ϕ = ϕi + nStubAddr * Δϕi
R = Ri + nStubAddr * ΔRi
Fit to get ϕi, Ri, Δϕi, ΔRi in 8 sub-ranges

§ Global Z is a linear function of Zlocal

Z = Za + Zlocal * ΔZ (width per Z segment)

[Plots: global ϕ (hist_phi) and global R in cm (hist_r) versus local nStubAddr, 961 entries each.]
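A minimal sketch of the piecewise-linear conversion, assuming a flat module whose strips are evenly spaced along the chord between its two edges. The 960-address range, the equal-width sub-ranges, and the function names are assumptions made for the example; the coefficients come from the sub-range end points rather than from the fit used in the talk.

```python
import math

N_SUB = 8       # number of linear sub-ranges (from the slide)
N_ADDR = 960    # assumed address range of one module


def exact_phi_r(addr, edge_a, edge_b):
    """Exact global (phi, r) of a strip, interpolating linearly in x-y between the edges."""
    (ra, pa), (rb, pb) = edge_a, edge_b
    t = addr / (N_ADDR - 1)
    x = (1 - t) * ra * math.cos(pa) + t * rb * math.cos(pb)
    y = (1 - t) * ra * math.sin(pa) + t * rb * math.sin(pb)
    return math.atan2(y, x), math.hypot(x, y)


def build_table(edge_a, edge_b):
    """One (phi_i, dphi_i, r_i, dr_i) row per sub-range."""
    width = N_ADDR // N_SUB
    table = []
    for i in range(N_SUB):
        lo, hi = i * width, (i + 1) * width - 1
        p_lo, r_lo = exact_phi_r(lo, edge_a, edge_b)
        p_hi, r_hi = exact_phi_r(hi, edge_a, edge_b)
        dphi = (p_hi - p_lo) / (hi - lo)
        dr = (r_hi - r_lo) / (hi - lo)
        # Intercepts are shifted so that phi = phi_i + addr * dphi_i uses the full address.
        table.append((p_lo - lo * dphi, dphi, r_lo - lo * dr, dr))
    return table


def local_to_global(addr, table):
    """phi = phi_i + addr * dphi_i and r = r_i + addr * dr_i for the sub-range holding addr."""
    phi_i, dphi_i, r_i, dr_i = table[addr // (N_ADDR // N_SUB)]
    return phi_i + addr * dphi_i, r_i + addr * dr_i


# Illustrative module edges (r in cm, phi in rad), loosely following the plotted ranges.
EDGE_A, EDGE_B = (22.5, 0.05), (22.5, 0.45)
table = build_table(EDGE_A, EDGE_B)
```

In firmware this amounts to one small table lookup plus one multiply-add per coordinate, which is what makes the approach FPGA friendly.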

SLIDE 12

PRM Firmware Block Diagram

[Block diagram labels: Track Fitting; SSID = SuperStrip ID; "Example with 1 muon"; "If in a busy environment".]

SLIDE 13

Bench Test the Firmware Performance

[Event displays: x-y and r-z views (axes in cm), legend pT,0 = 5 GeV, η = 1.1/3, φ = 3π/8, CMS Preliminary Phase II Simulation.]

Board level test: select 40 modules (within the purple box) covering single muon tracks with η = 1.1/3, ϕ = 3π/8, pT > 3 GeV, |z0| < 4 cm

§ PRM Firmware is fully developed

§ Bench test: check if all the modules in PRM firmware work as expected

§ Gradually increase the complexity; carefully verify the firmware functions piece by piece (divide and conquer)

§ Single muon event
§ 4 muons in one event: AM, …
§ Single muon + PileUp: Data Org, …
§ Single muon + Jet (200 GeV): …
§ Multi events, full tower, …

SLIDE 14

Single Muon

[Waveform labels: single μ local stubs, EOE, RoadID from VIPRAM, single μ global stubs; 12 CLK and 21 CLK marked. Event displays: x-y and r-z views, CMS Preliminary Phase II Simulation.]

12 CLK from local stubs input to RoadID found
9 CLK from RoadID to global stubs output
✓ Split into more stages in backup

21 CLK cycles latency in total for a single muon event to go through the whole PRM chain
✓ If running at 200 MHz, 21 CLK ~ 105 ns
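The quoted latency follows directly from the clock period. As a worked check, assuming the whole chain runs on a single 200 MHz clock:

```latex
\begin{align}
T_{\mathrm{clk}} &= \frac{1}{200\ \mathrm{MHz}} = 5\ \mathrm{ns} \\
t_{\mathrm{stubs} \to \mathrm{RoadID}} &= 12 \times 5\ \mathrm{ns} = 60\ \mathrm{ns} \\
t_{\mathrm{RoadID} \to \mathrm{global\ stubs}} &= 9 \times 5\ \mathrm{ns} = 45\ \mathrm{ns} \\
t_{\mathrm{total}} &= 21 \times 5\ \mathrm{ns} = 105\ \mathrm{ns}
\end{align}
```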

SLIDE 15

4 Separate Muons

4 roads found
4 CLK cycles for 4 global stubs output
The 1st and 4th muons are close:
In layers 6 and 7, their stubs are located in the same SS
Additional stubs are read out for each road: multi-combinations

[Waveform labels: 4 μ local stubs, 4 RoadIDs, 4 μ global stubs, additional global stubs; 21 CLK marked. Event displays: r-z and x-y views, CMS Preliminary Phase II Simulation.]

SLIDE 16

Single Muon + PU200 (in 40 modules)

Single muon + PU200 in the 40 modules. In this event:

14 stubs in Layer 7 (PU adds 13 CLK cycles for data input)
Input clock latency -> set by the busiest layer
More modules included -> more stubs from PU in each layer
PU contributes 1 extra stub to layer 5 for output

The single muon pattern is still rather clean

[Waveform labels: single μ + PU200, RoadID, single μ global stubs, PU global stub; 21 CLK marked. Event displays: x-y and r-z views, CMS Preliminary Phase II Simulation.]

SLIDE 17

Nstub Input due to PU

Nstub per layer in barrel tower (200 PU)
Up to 200 stubs per layer for layer 5
-> input latency ~ 200 extra CLK cycles

[Histograms: # of stubs per tower per layer, TriggerTower16, Layers 05 to 10.]

Nstub Output due to PU

Based on simulation, Nstub/SS <= 4 is a good starting point for implementing the DO firmware
The DO implementation is flexible and can be extended to > 4

[Histograms: # of stubs per ss (sf1z4), TT27, Layers 05 and 10, at the edge and at the center of a module, panels 01 to 04.]

SLIDE 18

Single Muon + Jet

[Waveform labels: single μ + jet, RoadID from VIPRAM, single μ global stubs, jet global stub. Event displays: x-y and r-z views, CMS Preliminary Phase II Simulation.]

SLIDE 19

Single Muon + Jet, more statistics

§ 1000 events, Nstub/SS, sf1z4

[Histograms: # of stubs per ss (sf1z4), TT27, one column per layer (Layer 5 to Layer 10), panels 01 to 04.]

Well below the limit of 4 stubs per ss

SLIDE 20

Running Firmware in PRM Board

First time running everything in PRM board

✓ All the function blocks are implemented in the Master FPGA of ProtoPRM for simplification
✓ Event MEM and Init MEM are BlockRAM with Initial Values (ROM)

SLIDE 21

Compare Results between Board & Simulation

Test with a single muon event in PRM board:

Set "GO_out_reg" to "1" by JTAG cable
The firmware starts to run step by step
Finally we get the global ϕ of each stub successfully, shown in the Vivado GUI by JTAG

Compare with single muon event in simulator:

Global ϕ and Z of each stub
Got exactly the same values as in simulator
Successful with 100 MHz and 200 MHz SYSCLK. Will work on 250 MHz, 300 MHz

SLIDE 22

Summary

§ Hardware: great performance
§ Firmware: the pattern recognition stage is fully developed, including:

§ Data Organizer, PRAM, Local to global converter

§ Test bench: successfully processed MC events through the whole firmware simulation for board level test

§ Started with a single muon event: the performance looks good
§ Then added more and more: multi muon, single muon + PU, single muon + jet
§ PR latency is consistent with what we estimated: ~100 ns @ 200 MHz
§ Goal: test with processing full trigger tower stubs

§ Board test results agree with behavioral simulation results
§ Send the same single muon events to Track Fitting (TF)

(Details about TF in the next talk by Marco)

SLIDE 23

Backup


SLIDE 24

L1 Tracking Trigger Stages

§ Data Sourcing: emulate data coming from DTC
§ Data Delivery: deliver stubs to the pattern recognition engine
§ Pattern Recognition: fast PR using AM
§ Track Fit: calculate track parameters in FPGA

Hardware and firmware: Pulsar-2b/RTM/IPMC with data sourcing firmware; Pulsar-2b/RTM/IPMC with PRB firmware; PR Mezzanine Card + AM chip with data organizer and local to global conversion; combination cleaner; PR Mezzanine Card with track fitting FW

Simulation items: data format, StubRate/module, module overlap masking, PRB format, PRM input file, StubRate/ss, linearized track fitting simulation

Initial look with single muon and single muon + 200 PU events: 1. Latency; 2. Efficiency; 3. Precision

SLIDE 25

FNAL mezzanine card

§ The new mezzanine card, the ProtoPRM, has been designed to

§ Host the new protoVIPRAM_L1CMS ASIC
§ Emulate current (and future) VIPRAM designs using an FPGA
§ Serve as a track finding engine for L1 Tracking Trigger demonstration

SLIDE 26

PRM GTH Performance

§ Master and slave FPGAs interconnection, 15.6 Gbps achieved
§ Master FPGA to Pulsar2b FPGA through FMC, 10.0 Gbps achieved
§ FPGA to QSFP+, 10.3 Gbps achieved

[Diagram labels: 8 GTH, 3 GTH, 4 GTH, 4 GTH]

Until recently the emphasis has been on hardware development. Now that the hardware is working great, we are focusing on firmware implementation.

SLIDE 27

Module Overlap Masking

§ This is what can happen when tracks cross the overlap regions.

§ A single layer is composed of silicon detectors (detector pairs) arranged in 4 "sub"-layers
§ Up to 4 good stubs in a layer.

§ Masking

§ Mask the overlapping module areas such that a track will encounter (mostly) one unmasked module per layer.
§ The stubs in the masked regions will NOT be propagated down in the L1TT chain

§ Algorithm developed by Roberto Rossin, details in today's talk

[Plot axes: Strip# η, Strip# φ]

SLIDE 28

StubRate/ss (sf1z4, 200PU, no masking)

§ ϕ-segments, Z-segments, and masking may not be able to reduce the stub rate in some cases

local Z=3, local phi=481, simPt=2.26
local Z=3, local phi=483, simPt=0.35
local Z=4, local phi=480, simPt=2.26
local Z=4, local phi=480, simPt=-99
local Z=4, local phi=482, simPt=-99
local Z=4, local phi=482, simPt=0.35

[Histograms: # of stubs per ss (sf1z4), TT27; columns: Layer 5 ladder edge, Layer 5 ladder center, Layer 10 ladder center, Layer 10 ladder edge; rows: z #1 to z #4.]

SLIDE 29

Data Organizer Overview

§ The data organizer stores stubs in RAM at the address pointed to by the SSID
§ SSID = 10 bits → 1k memory locations
§ Store up to four stubs per SSID
§ The data organizer architecture is usually geared towards read-modify-write operations
§ A complete redesign of the data organizer was required
§ Our VIPRAM/PRAM readout is pipelined
§ The data organizer must concurrently store stubs for event N while recalling stubs for event N-1
§ Data Organizer must "ping pong" dual RAM banks
§ RAM "scrubbing" function must be implemented
§ Periodic clearing of the RAM is done with writes (no global reset)
§ Prevent stubs from old events from being read out ("masking")

SLIDE 30

Data Organizer: RAM organization

§ Early versions of the Data Organizer used a wide single bank RAM to store the stubs
§ example: store A in addr1, store B in addr2, read addr1, shift, store A,C back in addr1…
§ FPGA BlockRAMs are "double registered", which leads to a latency of two clock cycles
§ Very complex logic is needed to guard against data loss when doing pipelined read-modify-write cycles with consecutive stubs with the same SSID

SLIDE 31

Data Organizer: new RAM organization

§ We started with read-modify-write cycles, then got rid of them
§ Use 7-Series/UltraScale BlockRAM "read first" mode
§ As data is written into BlockRAM the previous data at that location is pushed to the output
§ Four cascaded RAMs hold the stub information
§ example: store A in addr1 of first RAM, store B in addr2 of first block, store C in addr1 of first block (A is pushed into addr1 of second block)
§ Simple, fast, and efficient configuration resembles an "array of FIFOs"
§ Dual port BlockRAMs simplify the "ping pong" event manipulation
§ Port A is used for writing event N stubs
§ Port B is used for reading event N-1 stubs

[Diagram: four cascaded RAM blocks, each with addresses 1 to 3, holding stubs A, B, and C.]
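The cascade can be sketched behaviorally as follows. This is an untimed model under stated assumptions: the class and method names are made up for the example, and the per-block pipeline delay of the real BlockRAMs is ignored.

```python
DEPTH = 4  # cascaded RAM blocks, so up to 4 stubs per SSID


class CascadedStore:
    """Behavioral model of four cascaded RAMs operated in "read first" mode."""

    def __init__(self, n_addr=1024):
        self.rams = [[None] * n_addr for _ in range(DEPTH)]

    def write(self, ssid, stub):
        carry = stub
        for ram in self.rams:
            # "Read first": the old word comes out as the new word goes in,
            # and the old word becomes the write data of the next block.
            ram[ssid], carry = carry, ram[ssid]
        # Whatever falls out of the last block is dropped.

    def read(self, ssid):
        """Return all stubs stored for this SSID, newest first."""
        return [ram[ssid] for ram in self.rams if ram[ssid] is not None]


store = CascadedStore()
store.write(1, "A")
store.write(2, "B")
store.write(1, "C")   # pushes A into addr1 of the second block
```

Each write touches every block in the same way, so there is no read-modify-write dependency between consecutive stubs with the same SSID.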

SLIDE 32

Data Organizer: Block Diagram

[Block diagram labels: inputs WSSID, DIN, EOE, and RSSID; registers wssid_reg, din_reg, eoe_reg, count_reg, ld_count_reg, wea_reg; cascaded 2k x 64 dual-port BRAMs in "Read First" mode with ports A (write address) and B (read address); outputs dout0_reg, dout1_reg, dout2_reg, dout3_reg.]

SLIDE 33

Data Organizer: Writing

§ RAM blocks are 2k x 64

§ wide enough to store stub information in global coordinates
§ the event number is stored with the data
§ 16 BlockRAMs per data organizer

§ Event counter increments with End-of-Event (EOE)

§ 12 bit count_reg = event N

§ dual bank "ping pong" is achieved with address partitions

§ if count_reg is even, write stubs into 0x000-0x3FF
§ if count_reg is odd, write stubs into 0x400-0x7FF

§ If EOE=1 the stub on DIN will NOT be written

§ this cycle is needed for scrubbing, more on this later…

SLIDE 34

Data Organizer: Reading

§ Read up to four stubs pointed to by RSSID
§ Stubs are output in parallel on dout3..dout0
§ Read latency is 3 cycles
§ Mask off any stubs that are not tagged with event N-1 (old_count_reg)
§ Address partitioning avoids read/write conflicts:

§ When count_reg is even, read from 0x400-0x7FF
§ When count_reg is odd, read from 0x000-0x3FF

§ After EOE is asserted, need to wait 6 clock cycles before attempting to read out stubs
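The write and read partitions and the event-tag mask can be sketched as below. The function names and the (tag, stub) word layout are assumptions made for the example; only the address ranges and the 12-bit counter come from the slides.

```python
BANK = 0x400  # 1k SSID locations per bank (10-bit SSID)


def write_address(count_reg, ssid):
    """Even events write into 0x000-0x3FF, odd events into 0x400-0x7FF."""
    return (count_reg & 1) * BANK + ssid


def read_address(count_reg, ssid):
    """Reads target the other bank, which holds the stubs of event N-1."""
    return ((count_reg & 1) ^ 1) * BANK + ssid


def mask(words, count_reg):
    """Keep only stubs tagged with event N-1 (12-bit counter, wraps at 4096).

    `words` is a list of (event_tag, stub) pairs, a hypothetical word layout.
    """
    previous = (count_reg - 1) & 0xFFF
    return [stub for tag, stub in words if tag == previous]
```

Because the two ports never address the same bank during one event, writing event N on port A cannot collide with reading event N-1 on port B.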

SLIDE 35

Data Organizer: Scrubbing

§ Event counters are 12 bits

§ wrap around after 4096 events

§ Slim possibility that a stub from a very old event (with a matching event number) could stick around and be read out
§ Solution: every time EOE is asserted, erase one address in each RAM block

§ This address increments with EOE
§ Over 2048 events the RAM is entirely zeroed out
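A small model shows why this is sufficient: a word survives at most 2047 further events before it is erased, which is less than the 4096 events needed for its tag to repeat. The starting address of the scrub pointer and the helper names are assumptions, and the interplay with the bank currently being written or read is not modeled.

```python
RAM_DEPTH = 2048     # addresses in one 2k x 64 BlockRAM
TAG_PERIOD = 4096    # the 12-bit event counter wraps after this many events


def scrub_address(event):
    """Address erased when EOE closes `event` (assumed to start at 0 and step by one)."""
    return event % RAM_DEPTH


def events_until_erased(address, event):
    """Further EOEs needed before a word written at `address` during `event` is erased."""
    return (address - event) % RAM_DEPTH


# Start from a RAM full of stale words and scrub one address per EOE.
ram = ["stale"] * RAM_DEPTH
for event in range(RAM_DEPTH):
    ram[scrub_address(event)] = None
```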

SLIDE 36

FPGA PRAM Overview

§ Fully synthesizable VHDL model of current VIPRAM_L1CMS

§ Multi-tier pipelined readout
§ "CAM tier" processes stubs for current event N
§ "I/O tier" captures road flags and outputs road addresses for event N-1

§ Design optimized for 7-Series/UltraScale architecture
§ Fairly close to "cycle accurate" timing
§ 6 input layer buses, 10 bits per layer
§ 1k to 4k patterns, fully programmable
§ ternary/'X'/DCBs in any bit position

[Block diagram: Layer Data Inputs (input stubs), Clock, and Control feed the Pattern Module Array ("CAM Tier"); its Road Flags feed the Road Serialization Logic ("I/O Tier"), which outputs the Road Address.]

SLIDE 37

CAM Tier: Building Blocks

§ 5-bit pattern match logic based on SRLC32E primitive

§ One SLICEM LUT in 7-Series/UltraScale
§ Fully programmable, serial pattern load
§ ternary or 'X' bits in any position

§ Layer Module = 2 x SRLC32E + glue logic
§ Pattern Module = 6 Layer Modules + majority logic

§ Miss-0, Miss-1, Miss-2, etc.

§ CAM Tier is a 2D array of Pattern Modules

§ 32x32 (1k) and 64x64 (4k) versions

§ Reset by EOE control signal

KU040 LUT Utilization*

Layers x bits x patterns    LUT utilization
6 x 10 x 1k                 12%
6 x 10 x 4k                 50%
8 x 15 x 1k                 22%
8 x 15 x 4k                 89%

* includes road serialization logic

SLIDE 38

CAM Tier Logic

§ 7-Series/UltraScale primitive SRLC32E is a variable-depth shift register

§ Up to 32 FFs deep
§ Tiny: uses one SLICEM LUT (112k of these in KU040)

§ We use it in a different way, to match patterns
§ Shift register contains our 32-bit pattern

§ serial load on D and Q31

§ Input A becomes the data input: a 5-bit pattern recognition "engine"

§ ternary or 'X' bits in any position

SLIDE 39

CAM Tier: Layer Module

§ Two SRLC32Es in parallel form a 10-bit pattern checker for 1 layer

§ 64-bit pattern loaded serially, daisy chained

§ Output is registered
§ Once the MATCH output is set it remains set until cleared by EOE

[Schematic labels: two SRLC32E primitives (ports A[4..0], D, CE, Q, Q31) with signals Data[9..5], Data[4..0], Pattern In, Pattern Out, LOAD, EOE, and the Match output.]
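The trick can be modeled in a few lines: the 32 shift-register bits act as a truth table, and the 5-bit data word on the address input selects one of them. The sketch below is a behavioral model under stated assumptions; the tables are computed directly instead of being shifted in serially, and the names `truth_table` and `LayerModule.clock` are made up for the example.

```python
def truth_table(pattern):
    """32-bit SRL contents for a 5-character pattern of '0', '1', or 'X' (MSB first)."""
    table = 0
    for address in range(32):
        word = format(address, "05b")
        if all(p in ("X", w) for p, w in zip(pattern, word)):
            table |= 1 << address
    return table


class LayerModule:
    """Two 5-bit matchers in parallel form a 10-bit checker with a sticky match flag."""

    def __init__(self, pattern10):
        self.hi = truth_table(pattern10[:5])   # checks Data[9..5]
        self.lo = truth_table(pattern10[5:])   # checks Data[4..0]
        self.match = False

    def clock(self, data, eoe=False):
        if eoe:
            self.match = False                 # cleared by end of event
            return
        q_hi = (self.hi >> ((data >> 5) & 0x1F)) & 1
        q_lo = (self.lo >> (data & 0x1F)) & 1
        # Once set, the flag stays set until EOE.
        self.match = self.match or bool(q_hi and q_lo)
```

A ternary 'X' bit simply sets the table entries for both values of that bit, so don't-care positions cost no extra logic.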

SLIDE 40

CAM Tier: Pattern Module

§ Pattern Module = six Layer Modules + majority logic
§ Majority logic supports Miss-0, Miss-1, and Miss-2 modes

§ Simple to add additional modes if needed

§ CAM Tier = 2D array of Pattern Modules

§ 32 x 32 → 1k patterns
§ 64 x 64 → 4k patterns

[Block diagram labels: 6 Layer Input Data, Layer Module, Mode, Majority Logic, Road Flag.]
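A minimal sketch of the majority logic, assuming Miss-k means that at most k of the six layers may fail to match; the function name and argument layout are made up for the example.

```python
def road_flag(layer_matches, max_missing):
    """Fire the road when at most `max_missing` layers failed to match.

    max_missing = 0, 1, 2 corresponds to the Miss-0, Miss-1, Miss-2 modes.
    """
    missing = sum(1 for matched in layer_matches if not matched)
    return missing <= max_missing
```

For example, `road_flag([True, True, False, True, True, True], 1)` fires, while the same inputs with `max_missing=0` do not.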

SLIDE 41

I/O Tier Logic

§ Register the road flag bits when EOE=1

§ Capture road flags corresponding to event N-1

§ Identify the addresses of the fired roads
§ Fully synchronous, pipelined operation
§ Fully zero suppressed output

§ roads are read out consecutively in one group

§ Low latency

§ First road address output 7 clock cycles after EOE

SLIDE 42

I/O Tier: Serialization (1k version)

[Block diagram labels: 32 x 32 array, column-OR bits, priority encoder (PRI ENC), column number, column MUX, column vector, sort nodes, road address.]

SLIDE 43

I/O Tier: Sorting Logic

§ When EOE=1 register road bits
§ OR the road bits in each column
§ "Peel-Away" priority encoder selects columns with 1 or more fired roads
§ A wide MUX passes each column vector into the sorting tree, which is made of nodes with inputs A and B, a FIFO, an FSM, and output C

Simple decision logic:

A  B  FIFO      Operation
0  0  Empty     nothing
0  0  NotEmpty  read FIFO
1  0  X         pass-A
0  1  Empty     pass-B
0  1  NotEmpty  read FIFO, Sto-B
1  1  X         pass-A, Sto-B
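The decision table translates almost line by line into a behavioral model of one node. This is a sketch under stated assumptions: `None` stands for "no valid input this clock", the FIFO depth is unbounded, and the class and method names are made up for the example.

```python
from collections import deque


class SortNode:
    """Merges inputs A and B into output C, at most one item per clock."""

    def __init__(self):
        self.fifo = deque()

    def clock(self, a=None, b=None):
        if a is not None:
            # A always passes straight through; a simultaneous B is stored.
            if b is not None:
                self.fifo.append(b)          # pass-A, Sto-B
            return a                         # pass-A
        if self.fifo:
            # A is idle: drain the FIFO first to preserve ordering.
            out = self.fifo.popleft()        # read FIFO
            if b is not None:
                self.fifo.append(b)          # read FIFO, Sto-B
            return out
        return b                             # pass-B, or nothing when B is idle too


node = SortNode()
outputs = [
    node.clock(a="a1", b="b1"),
    node.clock(a="a2"),
    node.clock(b="b2"),
    node.clock(),
    node.clock(),
]
```

As long as anything is pending, the node emits one item every clock, which is how the output stays zero suppressed.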

SLIDE 44

FPGA/PRAM resources and performance

§ 6 layer x 10-bit x 1k patterns

§ ~12% KU040 LUT resources

§ 6 layer x 10-bit x 4k patterns

§ ~50% KU040 LUT resources

§ 6 layer x 10-bit x 8k patterns

§ may be possible in KU040

§ Old numbers

§ 8 layer x 15-bit x 1k patterns

§ ~25% KU040 LUT resources

§ 8 layer x 15-bit x 4k patterns

§ may be possible in KU040

§ 250 MHz in slowest (-1) speed grade

§ Performance appears to be limited by skew on the single clock net
§ Clock partitioning, using PLL/MMCM, etc. should improve this


SLIDE 45

Local and Global Coordinate

§ Barrel Trigger Tower 27:

§ 136 PS modules, 279 2S modules
§ PS: 960 strips, 32 Z-segments
§ 2S: 1016 strips, 2 Z-segments
§ 4.7M numbers are needed to index all positions

§ Local coordinate:

§ 8 bits ModuleID
§ 11 bits StubAddr (first 3 bits for ChipID)
§ 5/1 bits Z-segment for PS/2S

§ Global cylindrical coordinate:

§ ϕ, Z, and R

§ Brute-force one-to-one L2G conversion:

§ ~200 Mb look-up table needed: 4.7M x (18+13+14) bits
§ 21 Mb BRAM in Kintex UltraScale KU040

Minimum number of bits for the global coordinates of the whole tracker detector:

Coord.  Bits  Range   Resolution  Reference
φ       18    2π rad  2.4e-5 rad  Half strip width in outermost layer: 4.2e-5 rad
z       13    6 m     0.73 mm     Z-segment PS/2S: 1.6 mm / 5 cm
r       14    1.2 m   73 µm       Precision of module installation: 200 µm (?)
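The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that the 4.7M positions are counted as modules x strips x Z-segments; that reading reproduces the quoted figure but is not stated on the slide.

```python
import math

N_PS, N_2S = 136, 279                 # modules in barrel trigger tower 27
PS_STRIPS, PS_ZSEG = 960, 32
S2_STRIPS, S2_ZSEG = 1016, 2
BITS = {"phi": 18, "z": 13, "r": 14}                 # bits per global coordinate
RANGES = {"phi": 2 * math.pi, "z": 6.0, "r": 1.2}    # rad, m, m


def n_positions():
    """Distinct (module, strip, Z-segment) positions in the tower."""
    return N_PS * PS_STRIPS * PS_ZSEG + N_2S * S2_STRIPS * S2_ZSEG


def lut_bits():
    """Size of a brute-force one-to-one look-up table, in bits."""
    return n_positions() * sum(BITS.values())


def resolution(coord):
    """Least-significant-bit size for a coordinate with the given range and bit count."""
    return RANGES[coord] / 2 ** BITS[coord]


print(n_positions())        # 4744848, i.e. about 4.7M
print(lut_bits() / 1e6)     # about 213.5 Mb, versus 21 Mb of BRAM in the KU040
```

The look-up table would be roughly ten times larger than the available BRAM, which is the motivation for computing the conversion on the fly.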


SLIDE 46

Local to Global Conversion: r and ϕ

§ ModuleID and StripAddr are known for each stub
§ The 2 edges of the module could be measured after installation

§ (ra, ϕa), (rb, ϕb)

§ Global ϕ or r is a function of StripAddr, and this function can be calculated from (ra, ϕa) and (rb, ϕb)
§ For each module, the StubAddr-to-ϕ and StubAddr-to-r functions are approximated by 8 linear functions

§ ϕ = ϕ0 + Δϕ*nStubAddr
§ R = R0 + ΔR*nStubAddr

[Figure: module in the X-Y plane with edges (ra, ϕa) and (rb, ϕb); local nStubAddr -> global R, ϕ]

[Plot hist_r: Global-R (cm) vs. StubAdd, 961 entries]

[Plot hist_phi: Global-φ vs. StubAdd, 961 entries]
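A minimal sketch of the 8-piece linear approximation follows. It assumes a flat module between the two measured edges, takes each linear piece from the exact values at the sub-range end points (a chord, rather than the fit mentioned in the talk), counts nStubAddr from the start of each sub-range, and ignores ϕ wrap-around. The edge values used in the usage note are illustrative, not measured.

```python
import math

N_STRIPS = 960    # strips per PS module, as quoted in the talk
N_SEGMENTS = 8    # linear sub-ranges per module


def exact_r_phi(addr, edge_a, edge_b, n_strips=N_STRIPS):
    """Exact (r, phi) at a strip address, assuming a flat module between two edges."""
    (ra, pa), (rb, pb) = edge_a, edge_b
    xa, ya = ra * math.cos(pa), ra * math.sin(pa)
    xb, yb = rb * math.cos(pb), rb * math.sin(pb)
    t = addr / n_strips
    x, y = xa + t * (xb - xa), ya + t * (yb - ya)
    return math.hypot(x, y), math.atan2(y, x)


def build_table(edge_a, edge_b, n_strips=N_STRIPS, n_seg=N_SEGMENTS):
    """One (r0, dr, phi0, dphi) entry per sub-range, from the sub-range end points."""
    step = n_strips // n_seg
    table = []
    for seg in range(n_seg):
        r_lo, p_lo = exact_r_phi(seg * step, edge_a, edge_b, n_strips)
        r_hi, p_hi = exact_r_phi((seg + 1) * step, edge_a, edge_b, n_strips)
        table.append((r_lo, (r_hi - r_lo) / step, p_lo, (p_hi - p_lo) / step))
    return table


def local_to_global(addr, table, n_strips=N_STRIPS):
    """Evaluate r = r0 + dr*n and phi = phi0 + dphi*n inside the matching sub-range."""
    step = n_strips // len(table)
    seg = min(addr // step, len(table) - 1)
    r0, dr, p0, dp = table[seg]
    n = addr - seg * step   # assumption: address counted from the sub-range start
    return r0 + dr * n, p0 + dp * n
```

For example, `table = build_table((22.5, 0.05), (22.5, 0.45))` followed by `local_to_global(480, table)` returns the approximate (r, ϕ) at mid-module. Each module then needs only 8 x 4 constants plus one multiply-add per coordinate, which is the FPGA-friendly part.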

SLIDE 47

Local to Global Conversion: Z

ModuleID and Z-segment are known for each stub

2 edges of the module could be measured after installation

– Za, Zb

Global Z is equal to

– Za + Z-segment * width per segment
– Width ΔZ = (Zb-Za)/32 for PS
– Width ΔZ = (Zb-Za)/2 for 2S

For each module, Z = Z0 + ΔZ*zSeg

[Figure: PS/2S module in the R-Z plane with edges Za and Zb; local Zlocal -> global Z]

[Plot hist_z: Global-Z (cm) vs. z-Segment, 32 entries]
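The Z conversion above is a single multiply-add per stub. A minimal sketch, taking Z0 = Za as on the slide; the function names and the edge values in the usage note are illustrative.

```python
Z_SEGMENTS = {"PS": 32, "2S": 2}   # Z-segments per module type, from the talk


def segment_width(za, zb, module_type):
    """Width of one Z-segment, dZ = (Zb - Za) / number of segments."""
    return (zb - za) / Z_SEGMENTS[module_type]


def global_z(za, zb, z_seg, module_type):
    """Global Z = Za + zSeg * dZ for a stub in Z-segment `z_seg`."""
    if not 0 <= z_seg < Z_SEGMENTS[module_type]:
        raise ValueError("Z-segment index out of range for this module type")
    return za + z_seg * segment_width(za, zb, module_type)
```

For instance, a PS module spanning 5.12 cm in Z has segments 0.16 cm (1.6 mm) wide, in line with the PS Z-segment size quoted in the coordinate table.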

SLIDE 48

Local to Global Conversion

If the module is rotated:

§ Global r, φ need correction from Z-segment
§ Global Z needs correction from StripAddr

[Figure: rotated module with X, Y, Z axes]


SLIDE 49

Firmware Simulation Test Bench

[Block diagram labels: Pulsar2b (PRBF.2); PRM Main FPGA; PRM Slave FPGA; Local to SSID Conversion; PRAM (1K patterns); Road to SSID Conversion; Data Organizer; Local to Global Conversion; COMB*; TF*. Signals: LOCx6, SSx6, RoadID, LOCx6x4, GCx6x4, Track Param. Files: Init TXT, Event TXT, Out TXT]

BlockRAM usage (600 BlockRAM in KU040):

Block                 BlockRAM
L2SS                  2.5x6
Road2SS (1K pattern)  2
DO                    8x6
L2G                   4.5x6x4
Total                 173


SLIDE 50

Firmware Simulation Test Bench

[Block diagram labels: Pulsar2b; PRM FPGA; L2G; L2SS; DO; VIPRAM (1K patterns); Road2SS; COMB*; TF*. Signals: LOCx6, SSx6, RoadID, GCx6, GCx6x4, Track Param. Files: Init TXT, Event TXT, Out TXT]

*COMB and TF are placeholders

BlockRAM usage (600 BlockRAM in KU040):

Block                 BlockRAM
L2SS                  2.5x6
Road2SS (1K pattern)  2
DO                    16x6
L2G                   4.5x6
Total                 140
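The BlockRAM totals of the two test-bench tables can be cross-checked against the 600 BlockRAMs of the KU040. The dictionary names below are hypothetical labels for the two tables.

```python
import math

KU040_BLOCK_RAMS = 600   # total quoted in the talk

# Usage factors exactly as listed in the two test-bench tables.
FIRST_TEST_BENCH = {"L2SS": (2.5, 6), "Road2SS": (2,), "DO": (8, 6), "L2G": (4.5, 6, 4)}
SECOND_TEST_BENCH = {"L2SS": (2.5, 6), "Road2SS": (2,), "DO": (16, 6), "L2G": (4.5, 6)}


def total_block_rams(usage):
    """Sum of the per-block products, e.g. 2.5 x 6 = 15 for L2SS."""
    return sum(math.prod(factors) for factors in usage.values())


def fraction_used(usage, available=KU040_BLOCK_RAMS):
    """Share of the device BlockRAM taken by the listed blocks."""
    return total_block_rams(usage) / available


print(total_block_rams(FIRST_TEST_BENCH), total_block_rams(SECOND_TEST_BENCH))  # 173.0 140.0
```

Both configurations stay below 30% of the device BlockRAM.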


SLIDE 51

Preliminary Single Muon Event in Firmware Simulation

LOC → SS → PRAM → RoadID → SS → DO → LOC → GC

21 clk cycles of latency are needed for a single muon event to go through the whole chain

If running at 250 MHz, 21 clk ~ 84 ns (multi-muon event in backup)

The same single muon event will be sent to Track Fitting

Latency breakdown (Sys clk cycles):

§ PRAM: 12 clk
§ RoadID → SS: 2 clk
§ DO: 3 clk
§ Local → global: 4 clk

[Waveform labels: local stubs input, End of Event, SSID, Road ID, SSID, local stubs from DO, global stubs]
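The latency figures can be reproduced from the per-stage cycle counts. Both clock frequencies used below, 250 MHz and 200 MHz, are quoted in the talk; the function names are illustrative.

```python
STAGE_CYCLES = {"PRAM": 12, "RoadID->SS": 2, "DO": 3, "Local->global": 4}


def total_cycles(stages=STAGE_CYCLES):
    """Clock cycles for one event to traverse the whole chain."""
    return sum(stages.values())


def latency_ns(cycles, clock_mhz):
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1000.0 / clock_mhz


print(total_cycles())                      # 21
print(latency_ns(total_cycles(), 250.0))   # 84.0
print(latency_ns(total_cycles(), 200.0))   # 105.0
```

The 9 cycles after the PRAM stage (2 + 3 + 4) correspond to the path from RoadID to global stub output.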


SLIDE 52

4 close muons

4 close muons: in each layer, all 4 stubs are located in the same SS
ü 4 CLK for sending 4 local stubs into PRM
ü 1 pattern fired
ü This is an extreme case: 4 sets of global stubs are read out in 1 clock cycle

The combination builder will handle them in a later step

[Event displays, CMS Preliminary Phase II Simulation: x-y view (x [cm], y [cm]) and r-z view (z [cm], r [cm])]

[Waveform labels: 4 μ local stubs, EOE, RoadID from VIPRAM, 4 μ global stubs; 21 CLK]


SLIDE 53

4 separate muons

4 CLK cycles for 4 local stubs input
4 CLK cycles for 4 global stubs output
The 1st and 4th muons are close:
in layers 6 and 7, their stubs are located in the same SS
Additional stubs are read out for each road: multi-combinations

[Waveform labels: 4 μ local stubs, 4 RoadID, 4 μ global stubs, additional global stubs; 21 CLK]

[Event displays, CMS Preliminary Phase II Simulation: x-y view (x [cm], y [cm]) and r-z view (z [cm], r [cm])]


SLIDE 54

Single muon + Jet

[Event displays, CMS Preliminary Phase II Simulation: x-y views (x [cm], y [cm]) and r-z views (z [cm], r [cm]) for a d quark jet with pT = 50 GeV, 200 GeV, and 500 GeV]

Jets with higher pT tend to produce more stubs.


SLIDE 55

Single muon + Jet (50 GeV)

[Waveform labels: single μ + jet, RoadID from VIPRAM, single μ global stubs]


SLIDE 56

Single muon + Jet (500 GeV)

[Waveform labels: single μ + jet, RoadID from VIPRAM, single μ global stubs]


SLIDE 57

Preliminary Single Muon Event in Firmware Simulation

[Event displays, CMS Preliminary Phase II Simulation: x-y views (x [cm], y [cm]) and r-z views (z [cm], r [cm]) of a single muon with pT,0 = 64 GeV, η = 0.258925, φ = 1.26675. Panel captions: a single muon event; pattern matched, road fired; local stubs sent to DO; global stubs sent to TF]

Initial look with single muon, in the process of adding:

Multi muon, Single muon+Jet, Single muon+PU(140,200)
Multi muon+PU, X+PU, ttH+PU

Details in the next talk, given by Marco Trovato


SLIDE 58

Event with 2 muons

First event: Single muon
Second event: Two muons


SLIDE 59

CMS outer tracker and trigger towers

6(η) × 8(ϕ) trigger towers

Trigger tower details: http://mersi.web.cern.ch/mersi/layouts/.repository/TechnicalProposal2014/trigger_cpus.html

[Figures: x-y views (x [mm], y [mm]) and r-z view (z [mm], r [mm]) of the outer tracker, with barrel layers and endcap layers labeled 5-15 and 18-22, and the full tower emulation regions: barrel, hybrid, endcap]


SLIDE 60

Module design

Up to 2^10 strips
Half precision measurement: up to 2^11 values
So 11 bits are needed for the stub address

2^3 = 8 chips
So the first 3 bits of the stub address correspond to the chip ID

[Figure label: half precision]
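The stub address layout can be sketched as a bit-field split. The sketch assumes that "the first 3 bits" are the three most significant bits of the 11-bit address; the slide does not state the bit order, and the function names are hypothetical.

```python
STUB_ADDR_BITS = 11                              # 2^11 half-strip values
CHIP_ID_BITS = 3                                 # 2^3 = 8 chips
IN_CHIP_BITS = STUB_ADDR_BITS - CHIP_ID_BITS     # 8 bits left inside a chip


def split_stub_addr(addr):
    """Split an 11-bit stub address into (chip ID, address within the chip)."""
    if not 0 <= addr < (1 << STUB_ADDR_BITS):
        raise ValueError("stub address does not fit in 11 bits")
    chip_id = addr >> IN_CHIP_BITS               # assumed: top 3 bits
    in_chip = addr & ((1 << IN_CHIP_BITS) - 1)
    return chip_id, in_chip


def join_stub_addr(chip_id, in_chip):
    """Inverse of split_stub_addr."""
    return (chip_id << IN_CHIP_BITS) | in_chip
```

With this layout, address 0b10100000011 belongs to chip 5 at position 3 within that chip.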


SLIDE 61

For demonstration, simple assumptions about DTC

The 8BX data train will be the same as directly from the modules
Assume up to 10 Gbps
The neighbor data sharing is fully taken care of at the DTC stage by simple duplication
For outer layers, the data could also be (optionally) merged at the DTC stage

[Diagram labels: fibers from Module A, fibers from Module B, fibers from Module C; DTC stage; to neighbor towers; L1A data]

[Histogram: number of modules vs. number of connections; 15428 entries, bin contents 1080, 7734, 5695, 919]


SLIDE 62

N_stub/module with 1000 events

[Plots: number of stubs per event per module; plot label: 3 GeV.
Average over ϕ: barrel layers 5-10 vs. module ID (z); endcap layers 11-15 vs. ladder ID (r) (ring).
Zoom in ϕ view: barrel modules near z=0 vs. ladder ID (ϕ), for layers 10, 9, 8 (moduleID(z)=12), layer 7 (moduleID(z)=26), layer 6 (moduleID(z)=27), layer 5 (moduleID(z)=31); endcap innermost layer (layer 11), rings 0-14, vs. module ID (ϕ)]


SLIDE 63

After merging

[Plots: r-z views of the tracker (z [mm], r [mm]) with η lines and the 2S and PS regions, marking the area where 2 neighbor modules are to be merged; number of modules vs. Trigger Tower ID]

Modules per trigger tower: < 400 everywhere
So every trigger tower can be handled by one crate


SLIDE 64

L1 readout performance

§ L1 trigger data rate: 40 MHz ~ 25 ns
§ 8 bunch crossings: 40 MHz / 8 = 5 MHz ~ 200 ns
§ CIC to GBT readout rate: 320 MHz (per link)

§ 320 MHz / 5 MHz = 64 clock cycles (in 200 ns)

§ CIC links used for trigger: 4 bits (4 links) * 64 clk = 256 bits (in 200 ns)

[Diagram labels: 4 links for trigger, 1 link for raw]
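The readout budget above is simple arithmetic and can be written out as a check. The sketch assumes that each link carries 1 bit per 320 MHz clock, as implied by "4 bits (4 links)"; the constant and function names are illustrative.

```python
BX_RATE_MHZ = 40.0       # bunch-crossing rate
BX_PER_TRAIN = 8         # bunch crossings grouped into one data train
LINK_RATE_MHZ = 320.0    # CIC to GBT readout rate per link
TRIGGER_LINKS = 4        # links used for trigger data


def train_period_ns():
    """Duration of one 8-BX train in nanoseconds."""
    return BX_PER_TRAIN * 1000.0 / BX_RATE_MHZ


def clocks_per_train():
    """Readout clock cycles available per train."""
    return int(LINK_RATE_MHZ / (BX_RATE_MHZ / BX_PER_TRAIN))


def trigger_bits_per_train():
    """Trigger payload per train, assuming 1 bit per link per clock."""
    return TRIGGER_LINKS * clocks_per_train()


print(train_period_ns(), clocks_per_train(), trigger_bits_per_train())  # 200.0 64 256
```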


SLIDE 65

Fountain superstrip


SLIDE 66

Nstub per layer in barrel tower (200PU)

[Histograms: number of stubs per tower per layer (x axis 50-300) vs. entries, for TriggerTower16, layers 5 to 10; plot label: 3 GeV]



Pattern Recognition Engine Flow

[Diagram labels: stubs from Pulsar for a given BX and Trigger Tower; incoming stubs ~ few x100 per layer; AM patterns stored; roads fired (<100); Data Organizer (stubs library); roads with all associated stubs; Comb; clean up; track fitting stage; Master FPGA; tracks; Δt2; Δt3]

Pattern and Superstrip (ss)

§ A pattern is a low resolution track

§ Made of 1 superstrip per layer
§ A superstrip is a group of adjacent strips

§ Fountain superstrip configuration

§ “Fountain” defined by natural spread in ϕ

interactions

§ Different z-segmentations

§ The working point chosen in this talk:

§ Scale factor (sf) for the fountain superstrip dϕ = 1
§ Number of z-segmentations (nz) = 4
§ Pattern bank size = 1862K @95%

§ AM in FPGA: very closely follows the AM ASIC (chip) design

§ Match two silicon tiers in ASIC with two modules in FPGA firmware

§ Pipelined operation

§ Logic is optimized for 7-Series/UltraScale architecture

PRM Core Module (1) AM in FPGA

[Diagram labels: ASIC, firmware]

I/O Tier: Road Serialization

[Diagram labels: road fired; fast output as in ASIC; projection]

§ A critical module for fast stub storage and fast interactions with AM

§ This is a new design after a few iterations

§ The data organizer stores stubs in RAM at the address pointed to by the ss ID

§ Masking and scrubbing logic ensures old stubs are not read out

PRM Core Module (2) Data Organizer

Data Organizer: Block Diagram

It writes like a FIFO (first in, first out), but reads like a memory
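A toy software model can illustrate "writes like a FIFO, reads like a memory". Stubs are appended in arrival order and are looked up by superstrip ID. The event tag used here to mask old stubs, and the separate scrub step, are assumptions about how masking and scrubbing might behave, not the firmware design.

```python
class DataOrganizer:
    """Toy model: per-superstrip stub storage with event-based masking."""

    def __init__(self):
        self.event = 0
        self.store = {}   # ss ID -> list of (event number, stub)

    def write(self, ssid, stub):
        """Append a stub in arrival order at the location given by its ss ID."""
        self.store.setdefault(ssid, []).append((self.event, stub))

    def read(self, ssid):
        """Random access by ss ID; stubs from earlier events are masked out."""
        return [stub for ev, stub in self.store.get(ssid, []) if ev == self.event]

    def end_of_event(self):
        """Move to the next event; older stubs become invisible to read()."""
        self.event += 1

    def scrub(self):
        """Physically drop stale entries left over from earlier events."""
        fresh = {}
        for ssid, entries in self.store.items():
            kept = [(ev, stub) for ev, stub in entries if ev == self.event]
            if kept:
                fresh[ssid] = kept
        self.store = fresh
```

When a road fires, each of its superstrip IDs is used as a read address, so all full-resolution stubs of that road come back without any search.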

PRM Core Module (3) Local to Global Conversion

§ Why?

§ Stubs are received in local coordinates and must be converted to global coordinates (for ss, track fitting …)

§ How?

§ The global coordinates of the edges of a module are known given the module ID

§ Global ϕ or R is an approximately linear function of nStubAddr in a small range

§ Global Z is a linear function of Zlocal

Fit to get ϕi, Ri, Δϕi, Δri in 8 sub-ranges (FPGA friendly)

[Plots: global R vs. local nStubAddr; global ϕ vs. local nStubAddr]

PRM GTH Performance

PRM Firmware Block Diagram

Example with 1 muon

Bench Test the Firmware performance

Board level test: select 40 modules (within the purple box) covering single muon tracks with:
η = 1.1/3, ϕ = 3π/8, pT > 3 GeV, |z0| < 4 cm

§ PRM Firmware is fully developed

§ Gradually increase the complexity; carefully verify the firmware functions piece by piece

Divide and conquer

Single Muon

[Waveform labels: single μ local stubs, EOE; 12 CLK; 21 CLK]

9 CLK from RoadID to global stubs output
ü Split in more stages in backup

go through the whole PRM chain
ü If running at 200 MHz, 21 CLK ~ 105 ns

4 Separate Muons

[Waveform labels: 4 μ local stubs, 4 RoadID, 4 μ global stubs, additional global stubs]

21 CLK

Single Muon + PU200 (in 40 modules)

Single muon + PU200 in the 40 modules.
In this event:

data input)

[Waveform labels: single μ + PU200, single μ global stubs, PU global stub]

single muon pattern is still rather clean

21 CLK
