SLIDE 1

FPGAs as Tools and Architectures at ETH Systems

SLIDE 2

14 September 2016 David Cock 2 | |

FPGAs as Tools and Architectures at ETH Systems

Real-Time Tracing and Verification

The FPGA as a tool.

Analysing a multi-Gb trace stream in real time. BRISC – Research Architecture for Large Systems

The FPGA as an architecture.

A platform for hardware and software research.

Expose the coherent interface to an FPGA, with lots and lots of fast IO links.

SLIDE 3

Real-Time Tracing and Verification

SLIDE 4

We're Going to Build a Large Program Collider

Collide instructions at 0.99c, and observe the decay products.

Images: CERN; Chaix & Morel et associés

SLIDE 5

Programmers Once (Thought They) Understood Computer Architecture

Image: Computer Systems, A Programmer's Perspective, Bryant & O'Hallaron, 2011

SLIDE 6

Symmetric Multiprocessors Were Fairly Simple

[Diagram: two write buffers (WB) and two caches in front of a shared RAM.]

SLIDE 7

Concurrent Code Makes Architecture Visible

Consider message passing.

Pretty much the simplest thing you can do with shared memory.

Systems like Barrelfish rely on it.

When are barriers required?

You can't write good code without understanding the hardware well enough.

We're combining components in new ways.

SLIDE 8

Message Passing with Shared Memory

[Diagram: writer CPU, RAM, reader CPU]
Write: *x = 42 | RAM: *x = 0, then *x = 42 | Read: *x = 42
Write: *y = 1 | RAM: *y = 0, then *y = 1 | Read: *y = 1

SLIDE 9

Message Passing with a Write Buffer

[Diagram: writer CPU with write buffer (WB), RAM, reader CPU]
Write: *x = 42 | RAM: *x = 0, then *x = 42 | Read: *x = 0
Write: *y = 1 | RAM: *y = 0, then *y = 1 | Read: *y = 1
WB: *y = 1

SLIDE 10

Message Passing with a Barrier

[Diagram: writer CPU with write buffer (WB), RAM, reader CPU]
Write: *x = 42 | RAM: *x = 0, then *x = 42 | Read: *x = 42
Write: *y = 1 | RAM: *y = 0, then *y = 1 | Read: *y = 1
WB: *y = 1, *x = 42
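In portable C, the barrier case is usually expressed with C11 release/acquire atomics rather than an explicit barrier instruction. The following is a minimal sketch, assuming a one-slot mailbox in which x carries the payload and y is the ready flag. The names mailbox, mailbox_send and mailbox_try_recv are invented for illustration and do not come from the talk.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical one-slot mailbox: x is the payload, y is the "ready" flag. */
struct mailbox {
    int x;          /* payload, an ordinary (non-atomic) variable */
    atomic_int y;   /* flag: 0 = empty, 1 = full */
};

/* Writer side: publish the payload, then raise the flag. */
static void mailbox_send(struct mailbox *m, int value)
{
    m->x = value;
    /* Release store: the write to x cannot become visible after the
     * write to y, which is the job the barrier does on the slide. */
    atomic_store_explicit(&m->y, 1, memory_order_release);
}

/* Reader side: returns true and fills *out only if the flag is set. */
static bool mailbox_try_recv(struct mailbox *m, int *out)
{
    /* Acquire load: if we observe y == 1, we also observe x's new value. */
    if (atomic_load_explicit(&m->y, memory_order_acquire) == 0)
        return false;
    *out = m->x;
    return true;
}
```

A reader that polls mailbox_try_recv in a loop sees either nothing or the complete message. The stale read of *x from the write-buffer slide is ruled out by the release/acquire pairing.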

SLIDE 11

Of Course, CPUs Aren't That Simple

[Diagram: four CPUs, each with a write buffer (WB) and an L1 cache; each pair of CPUs shares an L2; beyond that an L3, RAM, a coherent interconnect and PCI. Annotation: 9 hops.]

SLIDE 12

You Can't Trust the Hardware

seL4 was verified modulo a hardware model.

The Cortex A8 has bugs:

Cache flushes don't work.

As of today, these “errata” are still not public.

We rediscovered these by accident.

Non-coherent memory is coming.

Source: Chip Errata for the i.MX51, Freescale Semiconductor

SLIDE 13

And Then There's Rack Scale...

[Diagram: eight copies of the multicore node from the previous diagram (CPUs with WB and L1, shared L2 and L3, RAM, coherent interconnect, PCI), each with a NIC, connected through two TOR switches to the backhaul.]

SLIDE 14

There's a Lot of Data Available

[Diagram: the same eight-node rack, with NICs, two TOR switches and backhaul, annotated with the available data sources.]

Data sources: program trace, cache dumps, port mirroring, OpenFlow, event triggers.

SLIDE 15

ARM High-Speed Serial Trace Port

Streams from the Embedded Trace Macrocell.

Cycle-accurate control flow + events @ 6GiB/s+

Compatible with FPGA PHYs.

Well-documented protocol.

Aurora 8B/10B

Available on ARMv8

Image: Teledyne Lecroy

SLIDE 16

The HSSTP Hardware

The official tool is CHF10,000 per core.

The maximum cable run is 15cm.

It's PHY-compatible with common FPGAs.

A CHF6k FPGA could easily handle 10 cores.

15x cheaper!

We have a development prototype.

SLIDE 17

HSSTP Testbench

SLIDE 18

Fancy Triggering and Filtering

The ETM has sophisticated filtering, e.g. the Sequencer.

Bn and Fn can be just about any events on the SoC.

States can enable/disable trace, or log events.

A powerful facility for pre-filtering.

[Diagram: sequencer with State 0 to State 3, and transitions labelled F0, F1, F2 and B0, B1, B2.]
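To make the sequencer idea concrete, here is a small software model of a four-state sequencer. It is only a sketch under a guessed reading of the diagram labels: event Fn moves from state n to state n+1, and event Bn moves from state n+1 back to state n. The names seq_event, seq_step and seq_trace_enabled, and the policy of tracing only in State 3, are all hypothetical and not taken from the ETM documentation.

```c
#include <stdbool.h>

/* Hypothetical event encoding: three "F" events, three "B" events,
 * and everything else. On real hardware these would be selected
 * from events on the SoC. */
enum seq_event { EV_F0, EV_F1, EV_F2, EV_B0, EV_B1, EV_B2, EV_OTHER };

/* One sequencer step. Assumed semantics (a guess from the diagram):
 *   Fn: state n   -> state n+1
 *   Bn: state n+1 -> state n
 * Any event that does not apply in the current state is ignored. */
static int seq_step(int state, enum seq_event ev)
{
    switch (ev) {
    case EV_F0: case EV_F1: case EV_F2:
        return (state == (int)ev - (int)EV_F0) ? state + 1 : state;
    case EV_B0: case EV_B1: case EV_B2:
        return (state == (int)ev - (int)EV_B0 + 1) ? state - 1 : state;
    default:
        return state;
    }
}

/* Hypothetical pre-filter policy: only emit trace in the last state. */
static bool seq_trace_enabled(int state)
{
    return state == 3;
}
```

With this policy, trace output starts only after F0, F1 and F2 have been seen in order, which is the kind of pre-filtering that keeps uninteresting data off the trace port.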

SLIDE 19

Filtering and Offload in an FPGA

We'll need to intelligently filter high-rate data.

We're using an FPGA for the physical interface already.

How much processing could we do?

We have expertise in the group with FPGA query offloading.

We have a Master's student working on this.

SLIDE 20

What Could We Do With This Data?

SLIDE 21

Hardware Tracing for Correctness

unmap(pa); cleanDCache(); flushTLB();

Are HW operations right?

Pipeline: 5Gb/s trace → filter at line rate → check temporal assertions → log & process offline.

  • Real time pipeline trace on ARM.
  • Can halt and inspect caches.
  • HW has “errata” (bugs).
  • Check that it actually works!
  • Catch transient and race bugs.

SLIDE 22

Hardware Tracing for Performance

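Pipeline: 5Gb/s trace → filter at line rate → log & process offline.

URPC[0] = x; URPC[1] = 1;
while (!URPC[1]); x = URPC[0];

[Diagram: Core 0 and Core 1, with Cache 0 and Cache 1; the URPC line (x, 1) moves between the caches in steps 1 and 2, generating coherency messages INVAL(0), READ(1), …]

Is URPC optimal?

  • Should see N coherency messages.
  • Do we?

– The HW knows!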

SLIDE 23

Properties to Check: Security

Runtime verification is an established field.

Lots of existing work to build on.

What properties could we check efficiently?

How could we map them to the filtering pipeline?

/* A very simple TESLA assertion. */
TESLA_WITHIN(example_syscall,
    previously(security_check(ANY(ptr), o, op) == 0));

http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/

SLIDE 24

Processing Engine

That's a lot of data. How can we process it?

This is what rack-scale systems are for!

We have a software pipeline, thanks to a Master's student: Andrei Pârvu.

SLIDE 25

Properties to Check: Memory Management

Could we check this?

void *a = malloc();
...
{a is still allocated}
free(a);

Gp $free(x) −> P !$free(x) S x = $malloc;

  • Gp: "It's always been true that..."
  • $free(x) −>: "...if x is freed now, then..."
  • P: "...before this free..."
  • !$free(x) S x = $malloc: "...there were no frees of x, since it was allocated."
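One way to read the formula operationally is as two one-bit recurrences that are updated once per trace event. The sketch below does this for a single, fixed x. The names free_monitor, monitor_init and monitor_step are invented for illustration. It assumes the usual past-time semantics, in which P looks exactly one event back and is false before the first event.

```c
#include <stdbool.h>

/* Events relevant to one fixed address x. */
enum mm_event { MM_MALLOC, MM_FREE, MM_OTHER };

/* Monitor state: two bits are enough for this formula. */
struct free_monitor {
    bool since;   /* value of (!free S malloc) after the previous event */
    bool always;  /* value of the whole Gp formula so far */
};

static void monitor_init(struct free_monitor *m)
{
    m->since = false;   /* nothing has been allocated yet */
    m->always = true;   /* no violation seen yet */
}

/* Feed one event; returns false once the property has been violated. */
static bool monitor_step(struct free_monitor *m, enum mm_event ev)
{
    /* P phi: phi one step back, which is exactly the stored bit. */
    bool prev_since = m->since;

    /* free(x) -> P(!free(x) S x = malloc) */
    bool holds_now = (ev != MM_FREE) || prev_since;

    /* (a S b) now = b now, or (a now and (a S b) one step back). */
    m->since = (ev == MM_MALLOC) || (ev != MM_FREE && prev_since);

    /* Gp phi now = (Gp phi one step back) and phi now. */
    m->always = m->always && holds_now;
    return m->always;
}
```

Feeding the events malloc, free, free makes the third call return false, since the second free finds no allocation since the last free. Because the outer operator is Gp, the verdict stays false from then on.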

SLIDE 26

Checking LTL with Automata

Gp $free(x) −> P !$free(x) S x = $malloc;

This is a well-studied problem, and standard algorithms exist:

[Diagram: the checking automaton for this formula. States are labelled with bit vectors (e.g. 11000000, 00111011, 00111111, 11000111) and transitions are labelled malloc or free.]

SLIDE 27

Bound Variables and Multiple Automata

So far only one x value.

Every x needs an automaton instance.

Gp $free(1) −> P !$free(1) S 1 = $malloc;
Gp $free(2) −> P !$free(2) S 2 = $malloc;
Gp $free(3) −> P !$free(3) S 3 = $malloc;

[Diagram: three automaton instances, one per value of x, each with malloc and free transitions.]

Requires dynamic allocation.

Not trivial in HW.
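A fixed-size table of instances is one way to avoid dynamic allocation, at the price of an explicit overflow case. The sketch below is an illustration of that trade-off and not the design used in the project. SLOTS, checker_step and the verdict names are hypothetical, and each slot stands in for one automaton instance, reduced to the single bit "allocated and not yet freed".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 4   /* hypothetical capacity, fixed up front as it would be in hardware */

enum mm_kind { K_MALLOC, K_FREE };
enum verdict { V_OK, V_VIOLATION, V_OVERFLOW };

/* One automaton instance: bound to an address while it is live. */
struct slot {
    bool live;        /* true = allocated and not yet freed */
    uintptr_t addr;   /* the bound value of x */
};

struct checker {
    struct slot slots[SLOTS];
};

/* Route one event to the instance for addr, creating it on malloc. */
static enum verdict checker_step(struct checker *c, enum mm_kind kind, uintptr_t addr)
{
    struct slot *spare = NULL;

    for (size_t i = 0; i < SLOTS; i++) {
        struct slot *s = &c->slots[i];
        if (s->live && s->addr == addr) {
            /* Found the instance. A free retires it; a repeated malloc
             * is not forbidden by this particular property. */
            if (kind == K_FREE)
                s->live = false;
            return V_OK;
        }
        if (!s->live && spare == NULL)
            spare = s;
    }

    if (kind == K_FREE)
        return V_VIOLATION;   /* free of something that is not allocated */
    if (spare == NULL)
        return V_OVERFLOW;    /* more live values of x than instances */

    spare->live = true;
    spare->addr = addr;
    return V_OK;
}
```

The V_OVERFLOW verdict is where the difficulty shows up: once there are more live allocations than slots, the checker has to drop events, spill to software, or report that it can no longer give an answer.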

SLIDE 28

A Streaming Verification Engine

[Diagram: a four-stage pipeline: Sources, Capture, Processing, Properties. Components shown: HSSTP, packet capture, ETM Sequencer, FPGA capture, dataflow engine, FPGA offload, TESLA, malloc() pairing, coherence correctness. Arrows are labelled Constraints and Requirements.]

SLIDE 29

Software Pipeline Performance

LTL checking in software

[Plot: time (seconds) against number of events (1000s), one series per property: no double allocation, no double frees, no leaks.]

SLIDE 30

Software Pipeline Performance

Trace parsing in software

[Plot: time (seconds) against number of events (1000s). Legend entries: Write trace, Trace, Write trace w/ASM, ASM, Write parsed trace, Parser.]

SLIDE 31

Offloading Verification

  • Think regular expressions for infinite streams.
  • As for REs, we compile a checking automaton.
  • Run the automaton in real time and look for violations.
  • FPGAs are good at state machines.

SLIDE 32

Offloading Parsing

Currently the bulk of the runtime.

Not as straightforward on the FPGA.

Current student project.

SLIDE 33

An Instrumented Rack-Scale System

  • 64 SoCs x 5Gb/s = 320Gb/s trace output.
  • Online checkers (e.g. automata) will be essential at this scale.
  • We're going to build this:

– A rack of ARMv8 cores & FPGAs.

SLIDE 34

BRISC

SLIDE 35

A deadly embrace

Product hardware is designed for current application workloads running on Linux.

Innovation (and research) in system software is constrained by available commodity hardware.

SLIDE 36

The Gap.

For many commercially relevant workloads, cores spend much of their time in the OS.

BUT:

  • Processor architects ignore OS designers

– Simply don’t understand the OS problem
– Cores rarely evaluated with >1 app running anyway

  • HPC people try to remove the OS

– And then blow the rest of their s/w development budget putting it back in a user library.

  • and OS design people?

– Complain among themselves and try and deal with it
– Don't even try to influence hardware

SLIDE 37

A deadly embrace

Product hardware is designed for current application workloads running on Linux.

Innovation (and research) in system software is constrained by available commodity hardware.

SLIDE 38

Solution: BRISC

  • A hardware research platform for system software
  • Massively overengineered wrt. products
  • Highly configurable building block for rackscale

SLIDE 39

Sketch

[Diagram: board sketch. Labels: large server-class SoC; high-end FPGA (e.g. Xilinx Zynq ZU17EG); coherence; 100 Gb Ethernet; ≥ 0.5TB DDR4; ~ 32GB DDR4; SATA, PCIe, UART, NVMe, USB; UART, USB, SD.]

As many 100Gb QSFP+ cages as possible

SLIDE 40

All kinds of uses for this…

  • Plug lots together for rack-scale computing
  • Use the FPGA for data processing offload
  • Emulate large distributed NVRAM
  • Sequester processors using the FPGA
  • Runtime verification of program trace
  • Experiment with scaling coherency
  • Build a data-processing network switch
  • etc. etc. etc.

SLIDE 41

Higher goal: research amplification

  • Seed the research community

– Remove major barrier to innovation at a stroke

  • Precedents:

– PlanetLab
– Berkeley Unix
– …

13 September 2016

SLIDE 42

Questions?

SLIDE 43

Checking LTL with Automata

Gp $free(x) −> P !$free(x) S x = $malloc;

This is a well-studied problem, and standard algorithms exist:

  • Gp P, at t-1: "P was true until t-1"
  • P, at t: "P is still true at t"
  • Gp P, at t: "P has always been true"

[Timeline: 1 1 1 1 1]
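The three annotations describe a single update step: the value of Gp P at time t follows from its value at t-1 and the value of P at t. Written as a recurrence, and assuming the trace starts at time 0 (an assumption, since the slide does not show the base case):

```latex
\begin{align*}
(G_p\,P)_0 &= P_0 \\
(G_p\,P)_t &= (G_p\,P)_{t-1} \land P_t \qquad \text{for } t \ge 1
\end{align*}
```

This is why the checker needs only one bit of state per Gp operator: the whole past is summarised by the previous value.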