SLIDE 1

Intro to SKARAB for programmers

(and how to use HMC!)

Jason Manley

2017 CASPER workshop

SLIDE 2

Hardware

SLIDE 3

Hardware

  • Virtex 7, 690T FPGA
  • 4 Mezzanine sites per SKARAB

○ 2 in front, 2 in back
○ 16 SERDES links per site
○ Designed to an early PowerMX standard

  • Fans are over-provisioned; they normally run at around 20% - 30% of rated speed.

SLIDE 4

Hardware

Mezzanine cards allow trading off memory against IO capacity. Four cards per SKARAB.

  • Only one type of off-chip memory currently available on SKARAB: HMC.

○ HMC replaces QDR/SRAM and also DRAM found on previous CASPER boards.

  • 40G mezzanine card offers 4x40G QSFP Ethernet ports, can drive optics or copper.

○ No more complicated, flaky PHY chips that need firmware loaded to function properly.

  • An ADC is now also available, with other cards to follow.
SLIDE 5

Hardware: HMC Mezzanine card

  • 1x HMC device per card
  • HMC is 2GiB or 4GiB
  • Two independent interfaces per card: 2x “half-width” (8-lane) links at 10Gbps per lane.

  • Each link is bi-directional.
  • Up to 160Gbps throughput per card.
SLIDE 6

Hardware: QSFP 40G mezzanine card

  • Quad 40G QSFP Ethernet card.
  • PHY-less (purely passive), but does have a small microprocessor for QSFP module management (power, temperature etc).
  • Able to drive optics directly.
  • Tested with up to 7m passive cables; recommend AOCs (Active Optical Cables) for anything 5m and over.
  • Does not currently work in “breakout” mode with spider/octopus cables (turning one 40G port into 4x10G ports).

SLIDE 7

Compared to existing CASPER hardware

                        iBOB      ROACH     ROACH-2    SKARAB
Logic cells             53K       94K       476K       693K
DSP slices              232       640       2016       3600
BRAM capacity           4.2Mb     8.8Mb     38Mb       53Mb
SRAM capacity           2x18Mb    2x36Mb    4x144Mb    HMC: up to 8x 32Gib
SRAM bandwidth          9Gbps     43Gbps    200Gbps    8x 30Gbps R+W
DDR capacity (max)      -         1x8Gb     1x16Gb     -
DDR bandwidth (total)   -         38Gbps    50Gbps     -
Ethernet ports          2x 10G    4x10G     8x10G      up to 16x40G

SLIDE 8

Hardware

Uses the JASPER toolflow, not the traditional CASPER toolflow. Python now forms the backend for managing:

  • busses
  • ‘yellowblocks’

The backend is Xilinx Vivado, not ISE (a hard break at Virtex-6/ROACH-2; there is no overlapping tool support). Recall Wesley’s JASPER/Vivado talk on Monday.

SKARAB incorporates all the lessons learnt from SKA-SA’s sizeable deployments of iBOB/BEE2, ROACH-1 and ROACH-2 boards.

After compiling a bitstream, interacting with a SKARAB from a network-attached control computer using any of the standard tools is the same as working with any previous CASPER hardware. But it is quite different under the hood...

SLIDE 9

Remotely controlling SKARABs

Previous CASPER boards (iBOBs, BEE2s, ROACH1s, ROACH2s) all had out-of-band management ports (100Mbps or 1G Ethernet ports, separate from the 10G data ports). SKARAB can do everything in-band: data, management, and (re)programming.

  • Eventually over any network interface,
  • But currently only over 1G port or first 40G port.
  • Work in progress!

SKARAB does not have a separate management processor.

  • It uses a lightweight on-FPGA ‘softcore’ MicroBlaze.
  • The MicroBlaze is reloaded whenever the FPGA is reprogrammed.
  • This process must be robust, and managed carefully, to avoid losing comms to boards.

Simpler setup and maintenance:

  • Just need a power cable and network cable to each SKARAB.
  • Network appliance: no need to manage boot servers, Linux filesystems etc.
  • Entire platform can be managed remotely, including upgrading all firmware over network.
  • Designed for large-scale deployments (MeerKAT, with an eye on SKA).
SLIDE 10

SKARAB startup sequencing

  • Onboard flash memory ships with two bitstreams pre-loaded (with space for up to four).

○ A “Golden Image” and a “Multiboot Image”
○ These are exactly the same bitstream.
○ The board tries to boot the multiboot image quickly; if that fails, it falls back to the golden image more slowly.
○ You can load your own images here if you want, but that’s not the idea…

  • Most large CASPER deployments have a control computer on the network to configure the FPGA boards. SKARAB is designed to work in this environment.
  • The host computer stores your various bitstreams.
  • So when a SKARAB boots, it loads the flash image and asks for DHCP. The server then knows about the new SKARAB board on the network, and can load whichever DSP gateware image is needed, configure registers and set it to work.

  • Default is for DHCP on all network ports on startup.

○ (SKARAB wants a DHCP server; hard-coding IP addresses in your bitstreams is no longer so easy.)
○ Hostname support, for example skarab020394-01.
○ LLDP support (boards announce themselves to switches).

  • MAC addresses are based on serial number and network port.

○ First 40G port has hostname skarab020302-01, with MAC 06:50:02:03:02:01

  • After loading a DSP bitstream, the network interfaces flap and a new DHCP transaction ensues. Depending on your DHCP server and network (switch), it can take a few seconds to bring the link back up.
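The serial-number scheme in the example above can be sketched as a tiny helper (hypothetical Python that only reproduces the pattern shown; the authoritative scheme is set by the SKARAB firmware):

```python
def skarab_ids(serial: str, port: int):
    """Derive (hostname, MAC) for a SKARAB network port from its 6-digit
    serial number, following the example above: serial 020302, port 1 ->
    ("skarab020302-01", "06:50:02:03:02:01").
    Hypothetical helper -- ports above 9 may be encoded differently."""
    hostname = "skarab%s-%02d" % (serial, port)
    # Fixed 06:50 prefix, then the serial split into byte pairs, then the port.
    octets = ["06", "50"] + [serial[i:i + 2] for i in range(0, len(serial), 2)]
    octets.append("%02d" % port)
    return hostname, ":".join(octets)

print(skarab_ids("020302", 1))
# -> ('skarab020302-01', '06:50:02:03:02:01')
```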

SLIDE 11

What’s working?

Working:

  • Basic JASPER toolflow
  • Polling sensors (power, temp, fans etc)
  • HMC mezzanine cards
  • First 40G Ethernet port
  • 1G Ethernet port
  • Remote reprogramming and control
  • Remote updates (flash firmware)
  • DHCP, LLDP, ARP, PING and other network services
  • Python casperfpga interfaces (mostly; WIP)

Not (yet) working:

  • Legacy CASPER toolflow (and never will)
  • Automatic fan speed control
  • Retrieval of logs for hardware errors
  • Arbitrary combinations of Ethernet and HMC cards
  • Onboard USB JTAG bridge
  • Fast (~1 second) remote reloading of FPGA gateware
  • Large wishbone bus (timing implications; WIP)
  • Comprehensive DRC during compile

SLIDE 12

Tips for designs

  • Keep all your high-speed traffic on the UDP port compiled into your yellowblock.

○ Otherwise you can overwhelm the MicroBlaze with traffic; this is especially problematic while trying to reprogram.
○ The yellowblock default is to use 7148 (the SPEAD default at SKA-SA).
○ Don’t ever use:
  ■ 7778 decimal (0x1e62); that’s for controlling the MicroBlaze.
  ■ 29000 decimal (0x7148); that’s used for reprogramming.

  • In the event of a network failure at startup, SKARAB will try indefinitely to get a DHCP lease.
  • LEDs on front panel indicate DHCP success on golden image (useful for basic/visual debugging).
  • Check for updates regularly. Development’s very fluid at the moment, and nothing is “stable” yet.
  • Current bus architecture limitations prevent very large numbers of attachments (~50 slaves ok).
  • Good news is that the V7 seems to have much better routing resources, especially when building large BRAMs. Timing is much easier for large FFTs and snapshot blocks than on the V6.
  • Large designs easily meet timing at 240MHz.
  • You’ll get to play with all this stuff during Adam’s SKARAB tutorials.
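The reserved port numbers above can be checked mechanically; a throwaway sketch for guarding your own scripts (the constant and function names here are my own):

```python
# Ports reserved by the SKARAB MicroBlaze, from the tips above.
MICROBLAZE_CTRL_PORT = 0x1E62  # 7778 decimal: MicroBlaze control
REPROGRAM_PORT = 29000         # note: 29000 decimal is 0x7148 -- its hex
                               # digits spell the default SPEAD data port!

def port_is_safe(udp_port: int) -> bool:
    """True if a UDP port will not clash with SKARAB control traffic."""
    return udp_port not in (MICROBLAZE_CTRL_PORT, REPROGRAM_PORT)
```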
SLIDE 13

HMC memory

What is Hybrid Memory Cube?

  • Stacked DRAM on a chip, with a built-in management layer.
  • Designed and optimised for very high throughput, not low latency. Perfect for RA instrumentation!
  • HMC takes care of itself, including error detection on memory cells and IO operations.

○ Don’t have to deal with refreshes, bank management etc in FPGA controller anymore.

  • HMC contains “smarts”... has buffers and a small ALU. (can build accumulator inside the memory!)
  • External interface is high speed serial (“SERDES”) links.
  • HMC supports up to 4 sets of bidirectional 16-lane links, with each lane operating at up to 15Gbps...

○ That’s up to 1.9Tbps in aggregate. It’s FAST!

  • Micron is already on its 3rd-generation HMC; SKARAB uses the 2nd generation, at lower speeds.
SLIDE 14

Accessing HMC memory

  • The yellowblock packages your instructions (read/write) into flits.
  • A flit is a “packet” containing a header (instruction) and data (see the HMC datasheet for details).
  • Fortunately, all of this is abstracted away from the user:

○ The yellowblock makes HMC look like a conventional memory interface.

  • Each HMC yellowblock offers two dual-ported interfaces.
  • Simultaneous read and write operations are combined into a single flit.
  • Memory is organised into Vaults, Banks and DRAMs.
  • The controller allows you to arbitrarily map these into your address bits.
  • By default, SKARAB’s implementation optimises for linear reads and writes.
  • The yellowblock accesses 256 bits at a time, and presents a 256-bit bus.

○ One clock cycle per read and/or write request.
○ No need for burst reads or writes: truly random access is possible.

[Diagram: default address-bit mapping — the low address bits select the vault (V0-V3), the next bits the bank (B0-B3), and the upper bits (up to a26) the DRAM address (D0 upward).]
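Under the default linear-optimised mapping, consecutive addresses stripe across vaults. A sketch of the decode, assuming the lowest four address bits select the vault and the next four the bank (bit positions inferred from the slide’s diagram; illustrative, not authoritative):

```python
def decode_hmc_address(addr: int):
    """Split a 256-bit-word address into (vault, bank, dram) fields under
    the assumed default mapping: a[3:0] = vault, a[7:4] = bank, the rest
    DRAM row/column bits. Sketch only -- the real controller lets you
    remap these arbitrarily."""
    vault = addr & 0xF
    bank = (addr >> 4) & 0xF
    dram = addr >> 8
    return vault, bank, dram

# Linear addresses visit vaults 0,1,2,3,... in turn, spreading the load.
print([decode_hmc_address(a)[0] for a in range(6)])  # -> [0, 1, 2, 3, 4, 5]
```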


SLIDE 16
HMC vaults and links

  • There are 16 vaults per HMC device.
  • Four are co-located with each link (a collection of SERDES lanes).
  • The vaults are interconnected on-chip by a switched network, so any link can access any vault.
  • Naturally, accessing co-located memory is faster than hopping through the switches to reach memory located on other links.
  • The mapping is as you’d expect:

○ Link 1: vaults 0,1,2,3
○ Link 2: vaults 4,5,6,7
○ Link 3: vaults 8,9,10,11
○ Link 4: vaults 12,13,14,15

  • SKARAB has links 2 and 3 connected.
  • Thus, half the memory can be accessed locally, incurring minimum latency.
  • Accessing “remote” vaults (0-3 and 12-15) incurs additional latency, but the switching network is a full crossbar (no reduction in bandwidth).
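The link-to-vault mapping above, with SKARAB’s two connected links, gives a quick locality test (a sketch; vault numbers are taken straight from the slide):

```python
# Vaults co-located with each HMC link, from the mapping above.
LINK_VAULTS = {1: range(0, 4), 2: range(4, 8), 3: range(8, 12), 4: range(12, 16)}
SKARAB_LINKS = (2, 3)  # only links 2 and 3 are wired on SKARAB

def is_local_vault(vault: int) -> bool:
    """True if the vault sits on one of SKARAB's connected links, i.e. is
    reachable at minimum latency without crossing the internal crossbar."""
    return any(vault in LINK_VAULTS[link] for link in SKARAB_LINKS)

print(sorted(v for v in range(16) if is_local_vault(v)))
# -> [4, 5, 6, 7, 8, 9, 10, 11]
```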

SLIDE 17

HMC: More on vaults

  • To increase throughput, data must be striped over multiple Vaults.
  • Maximum throughput performance requires you to use all vaults.
  • Each vault has a buffer for transactions.
  • If you keep accessing the same vault continuously, operations will queue and performance will degrade.

○ Especially important (NNB) for matrix transposes (corner-turners).

  • Vaults operate semi-autonomously, and respond as quickly as they can.

○ Latency, throughput and order of operations are thus not guaranteed.
○ You can issue a request to vault 1 and then another to vault 2, and get the response back from vault 2 first and the reply from vault 1 some time later.
○ Performance is heavily dependent upon your access patterns.

  • To keep track of your read requests, you issue 9-bit tags with each read request.

○ Responses contain your tags so you can sort them out again.
○ This can complicate things enormously.

  • Data is also cached in the HMC, so if you issue the same read request twice, you get the second response back very quickly, possibly before many earlier read requests.

  • Typical latency: ~80 FPGA clock cycles (230MHz) in VACC applications.
  • Typical out-of-order: ranges from 0 to ~230, depending on access patterns and speeds.
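The tag bookkeeping described above amounts to a reorder buffer: attach sequential 9-bit tags to reads, then release responses in issue order. A minimal Python sketch (the real reorder block is FPGA gateware; the names here are my own):

```python
class ReorderBuffer:
    """Re-sequence out-of-order HMC read responses using their 9-bit tags
    (tags wrap at 512). Illustrative only."""

    def __init__(self):
        self.next_issue = 0    # tag attached to the next read request
        self.next_release = 0  # tag we must hand back next, in order
        self.pending = {}      # tag -> data, for responses that arrive early

    def issue(self) -> int:
        tag = self.next_issue
        self.next_issue = (self.next_issue + 1) % 512
        return tag

    def respond(self, tag: int, data):
        """Accept one response; return whatever is now ready, in order."""
        self.pending[tag] = data
        ready = []
        while self.next_release in self.pending:
            ready.append(self.pending.pop(self.next_release))
            self.next_release = (self.next_release + 1) % 512
        return ready

rb = ReorderBuffer()
tags = [rb.issue() for _ in range(3)]
print(rb.respond(tags[2], "c"))  # arrives early -> []
print(rb.respond(tags[0], "a"))  # -> ['a']
print(rb.respond(tags[1], "b"))  # unblocks the early arrival -> ['b', 'c']
```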
SLIDE 18

HMC yellowblock

  • HMC controller automatically performs POST upon startup.
  • After POST, HMC monitors itself.
  • 6-bit error code reported in event of failure during operation.
  • Checks include:

○ Flit (SERDES comms) errors
○ ECC in the DRAM core
○ Buffer overruns
○ Internal logic errors

  • For best performance: linear access, simultaneous read and write flits
  • Higher-level HMC blocks available in DSP library:

○ Wideband programmable delay line
○ Corner-turner (matrix transpose)
○ Vector accumulator (buffered, with backpressure)

SLIDE 19
HMC conclusions & considerations

  • Latency through the chip is not guaranteed.
  • Throughput is not guaranteed, and depends on access patterns.
  • No SKARAB support yet for special instructions (just basic read & write).
  • Most applications will need a reorder block after the HMC to deal with out-of-order responses.
  • If you’re doing reads and writes, issue these instructions simultaneously.

SLIDE 20

40G ethernet core, forty_gbe

  • Yellowblock interface is exactly like the 10G Ethernet core, but with 256b interfaces instead of 64b interfaces.

  • 40G core now does proper RX CRC checking (uses a lot of HW resources, though).
  • No longer managed by the tcpborphserver and tgtap software processes on a PowerPC.

○ Microblaze softcore manages all network services.

  • Features in place already:

○ DHCP with auto-renew, and hostname support based on serial number
○ LLDP reporting and discovery
○ ARP
○ Ping
○ Multicast TX and RX, including subscription to multiple sequential addresses; IGMPv2 signalling.

  • As with 10G core, multicasting RX uses bitmask arrangement.

○ Can only subscribe to contiguous chunks of 2^N addresses.

  • Current status, limitations and work in progress:

○ At the moment, the 40G yellowblock is hard-coded to the first QSFP port on the third mezzanine site.
○ The 40G yellowblock currently pulls in the MicroBlaze infrastructure, so all designs must contain a 40G core, even if you’re not using it!
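The 2^N contiguous-subscription rule above maps naturally onto an address/mask pair, much as with IP subnets. A sketch under that assumption (the actual yellowblock register format may differ):

```python
import ipaddress

def multicast_subscription(base: str, n_addresses: int):
    """Return (base, mask) covering n_addresses contiguous multicast
    addresses. n_addresses must be a power of two and base aligned to it,
    reflecting the bitmask arrangement described above."""
    if n_addresses <= 0 or n_addresses & (n_addresses - 1):
        raise ValueError("can only subscribe to 2^N addresses")
    base_int = int(ipaddress.IPv4Address(base))
    if base_int % n_addresses:
        raise ValueError("base address must be aligned to the group size")
    mask = 0xFFFFFFFF & ~(n_addresses - 1)
    return base, "0x%08X" % mask

print(multicast_subscription("239.2.0.64", 4))  # -> ('239.2.0.64', '0xFFFFFFFC')
```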

SLIDE 21

40G Ethernet and HMC resources

  • Hardware resources for 40G ethernet and HMC cores:

          Total available   Per 40G port   Per HMC mezzanine card
Slices    108300            9448 (3.1%)    14173 (13.1%)
BRAM      1470              24.5 (1.7%)    116 (7.9%)
DSP48     3600              0 (0%)         4 (0.1%)

SLIDE 22

Questions & Comments

Jason Manley jason@ska.ac.za