FPGAs as Tools and Architectures at ETH Systems

Real-Time Tracing and Verification
The FPGA as a tool.
● Analysing a multi-Gb trace stream in real time.
BRISC – Research Architecture for Large Systems
The FPGA as an architecture.
● A platform for hardware and software research.
● Expose the coherent interface to an FPGA, with lots and lots of fast IO links.
David Cock | 14 September 2016

Real-Time Tracing and Verification

We're Going to Build a Large Program Collider
Collide instructions at 0.99 c, and observe the decay products.
Images: CERN; Chaix & Morel et associés

Programmers Once (Thought They) Understood Computer Architecture
Image: Computer Systems, A Programmer's Perspective, Bryant & O'Hallaron, 2011

Symmetric Multiprocessors Were Fairly Simple
[Diagram: shared RAM below two caches, each with a write buffer (WB).]

Concurrent Code Makes Architecture Visible
Consider message passing.
● Pretty much the simplest thing you can do with shared memory.
● Systems like Barrelfish rely on it.
● When are barriers required?
You can't write good code without sufficiently understanding the hardware.
We're combining components in new ways.

Message Passing with Shared Memory
Writing CPU: Write: *x = 42, then Write: *y = 1
Reading CPU: Read: *y = 1, then Read: *x = 42
RAM: *x = 0 becomes *x = 42; *y = 0 becomes *y = 1

Message Passing with a Write Buffer
Writing CPU: Write: *x = 42, then Write: *y = 1
Reading CPU: Read: *y = 1, then Read: *x = 0
WB: *x = 42, *y = 1
RAM: *x = 0; *y = 0 becomes *y = 1

Message Passing with a Barrier
Writing CPU: Write: *x = 42, then Write: *y = 1
Reading CPU: Read: *y = 1, then Read: *x = 42
WB: *x = 42, *y = 1
RAM: *x = 0 becomes *x = 42; *y = 0 becomes *y = 1
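The message-passing slides can be replayed with a toy model. The sketch below is not taken from the talk: the `Core` class, its `drain` and `barrier` methods, and the choice to drain the flag `y` before the data `x` are all illustrative assumptions, standing in for a write buffer that is allowed to reorder stores.

```python
class Core:
    """Hypothetical CPU core with a private write buffer in front of shared RAM."""

    def __init__(self, ram):
        self.ram = ram
        self.wb = {}  # pending writes that have not reached RAM yet

    def write(self, addr, value):
        self.wb[addr] = value  # the store lands in the write buffer first

    def drain(self, addr):
        self.ram[addr] = self.wb.pop(addr)  # one buffered store reaches RAM

    def barrier(self):
        for addr in list(self.wb):  # force every pending store out to RAM
            self.drain(addr)

    def read(self, addr):
        # A core sees its own buffered stores, otherwise whatever is in RAM.
        return self.wb.get(addr, self.ram[addr])


def message_pass(use_barrier):
    """Writer publishes x = 42 and then sets the flag y = 1; reader polls y, then reads x."""
    ram = {"x": 0, "y": 0}
    writer, reader = Core(ram), Core(ram)
    writer.write("x", 42)
    if use_barrier:
        writer.barrier()  # x must be visible before the flag is written
    writer.write("y", 1)
    writer.drain("y")  # assumption: the buffer happens to drain the flag first
    return reader.read("y"), reader.read("x")


print(message_pass(use_barrier=False))  # flag seen, data stale
print(message_pass(use_barrier=True))   # flag seen, data valid
```

Without the barrier the reader observes the flag but a stale `*x = 0`, as on the write-buffer slide; with the barrier it observes `*x = 42`.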

Of Course, CPUs Aren't That Simple
[Diagram: four CPUs, each with a write buffer (WB) and an L1 cache; two L2 caches; an L3 cache; a coherent interconnect with RAM and PCI. Annotation: 9 hops.]

You Can't Trust the Hardware
Source: Chip Errata for the i.MX51, Freescale Semiconductor
● seL4 was verified modulo a hardware model.
● The Cortex A8 has bugs: cache flushes don't work.
● As of today, these “errata” are still not public.
● We rediscovered these by accident.
● Non-coherent memory is coming.

And Then There's Rack Scale...
[Diagram: many multi-core nodes, each with CPUs, write buffers (WB), L1/L2/L3 caches, a coherent interconnect, RAM, PCI and a NIC, connected through TOR switches to a backhaul.]

There's a Lot of Data Available
[Diagram: the rack-scale system from the previous slide, annotated with data sources: cache dumps, program trace, port mirroring, event triggers, Openflow.]

ARM High-Speed Serial Trace Port
Image: Teledyne Lecroy
● Streams from the Embedded Trace Macrocell.
● Cycle-accurate control flow + events @ 6GiB/s+
● Compatible with FPGA PHYs.
● Well-documented protocol.
● Aurora 8B/10B
● Available on ARMv8

The HSSTP Hardware
● The official tool is CHF10,000 per core.
● The cable run is maximum 15cm.
● It's PHY-compatible with common FPGAs.
● A CHF6k FPGA could easily handle 10 cores.
● 15x cheaper!
● We have a development prototype.
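A rough check of the cost claim, assuming the comparison is ten official per-core tools against a single CHF 6,000 FPGA (the slide does not spell out how the "15x" figure was reached, and the function name below is purely illustrative):

```python
def cost_ratio(cores, official_chf_per_core, fpga_chf):
    """How many times cheaper one FPGA is than buying the official tool for every core."""
    return (cores * official_chf_per_core) / fpga_chf


# Figures from the slide: CHF 10,000 per core, one CHF 6,000 FPGA, 10 trace streams.
ratio = cost_ratio(cores=10, official_chf_per_core=10_000, fpga_chf=6_000)
print(f"{ratio:.1f}x")  # 16.7x
```

Under that assumption the ratio comes out slightly above the rounded figure quoted on the slide.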

HSSTP Testbench

Fancy Triggering and Filtering
● The ETM has sophisticated filtering, e.g. the Sequencer.
● Bn and Fn can be just about any events on the SoC.
● States can enable/disable trace, or log events.
● A powerful facility for pre-filtering.
[Diagram: sequencer with State 0 to State 3; transitions labelled B0/F0, B1/F1 and B2/F2.]
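The sequencer can be pictured as a small state machine. The sketch below is a software toy, not the ETM programming interface: it assumes that Fn advances from state n to state n+1 and that Bn returns from state n+1 to state n (a reading of the diagram labels), and the class and parameter names are hypothetical.

```python
class Sequencer:
    """Toy four-state sequencer: Fn advances state n -> n+1, Bn returns n+1 -> n."""

    def __init__(self, forward, backward, trace_states):
        self.forward = forward            # forward[n]: event that leaves state n
        self.backward = backward          # backward[n]: event that returns to state n
        self.trace_states = trace_states  # states in which trace is enabled
        self.state = 0

    def step(self, event):
        """Consume one SoC event and report whether trace is enabled afterwards."""
        n = self.state
        if n < len(self.forward) and event == self.forward[n]:
            self.state = n + 1
        elif n > 0 and event == self.backward[n - 1]:
            self.state = n - 1
        return self.state in self.trace_states


# Trace is only enabled once the sequencer has reached State 3.
seq = Sequencer(forward=["F0", "F1", "F2"],
                backward=["B0", "B1", "B2"],
                trace_states={3})
enabled = [seq.step(e) for e in ["F0", "F1", "B1", "F1", "F2"]]
print(enabled)  # [False, False, False, False, True]
```

Gating trace on the final state is what makes this useful for pre-filtering: nothing is emitted until the whole event sequence of interest has occurred.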

Filtering and Offload in an FPGA
● We'll need to intelligently filter high-rate data.
● We're using an FPGA for the physical interface already.
● How much processing could we do?
● We have expertise in the group with FPGA query offloading.
● We have a Master's student working on this.

What Could We Do With This Data?

Hardware Tracing for Correctness
Are HW operations right?
unmap(pa); cleanDCache(); flushTLB();
● Real time pipeline trace on ARM.
● Can halt and inspect caches.
● HW has “errata” (bugs).
● Check that it actually works!
● Catch transient and race bugs.
[Diagram: 5Gb/s trace stream; filter at line rate; check temporal assertions; log & process offline.]

Hardware Tracing for Performance
Is URPC optimal?
● Should see N coherency messages.
● Do we? The HW knows!
Core 0: URPC[0]= x; URPC[1]= 1;
Core 1: while(!URPC[1]); x= URPC[0];
[Diagram: 5Gb/s trace stream; filter at line rate; log & process offline. Cache 0 and Cache 1 exchange INVAL(0) and READ(1) messages.]

Properties to Check: Security
● Runtime verification is an established field.
● Lots of existing work to build on.
● What properties could we check efficiently?
● How could we map them to the filtering pipeline?
/* A very simple TESLA assertion. */
TESLA_WITHIN(example_syscall, previously(security_check(ANY(ptr), o, op) == 0));
http://www.cl.cam.ac.uk/research/security/ctsrd/tesla/

Processing Engine
● That's a lot of data, how can we process it?
● This is what rack-scale systems are for!
● We have a software pipeline, thanks to a Master's student: Andrei Pârvu.

Properties to Check: Memory Management
void *a = malloc(); ... {a is still allocated} free(a);
Could we check this?
Gp $free( x ) −> P !$free( x ) S x = $malloc;
Reading the formula: it's always been true that (Gp), if x is freed now ($free( x ) −>), then before this free (P) there were no frees of x (!$free( x )) since (S) it was allocated (x = $malloc).

Checking LTL with Automata
This is a well-studied problem, and standard algorithms exist:
Gp $free( x ) −> P !$free( x ) S x = $malloc;
[Diagram: automaton for the formula; states are labelled with bit vectors (e.g. 00111011, 11000111) and transitions with malloc and free.]
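For a single x, a monitor for this formula only needs to remember one bit: whether "no free of x since it was allocated" held after the previous event. The sketch below uses the standard recurrence for the past-time "since" operator rather than the automaton construction shown on the slide; the event encoding as the strings "malloc" and "free" and the function name are assumptions made for illustration.

```python
def check_free_trace(events):
    """Monitor  Gp $free(x) -> P (!$free(x) S x = $malloc)  for one fixed x.

    events: iterable of strings; "malloc" and "free" refer to x, anything else
    is treated as an unrelated event.
    Returns the index of the first violating event, or None if the trace is fine.
    """
    since = False  # value of (!free(x) S x = malloc) after the previous event
    for i, ev in enumerate(events):
        # P (...) refers to the previous position, so check before updating.
        if ev == "free" and not since:
            return i
        # Since recurrence: allocated now, or (not freed now and held before).
        since = (ev == "malloc") or (ev != "free" and since)
    return None


print(check_free_trace(["malloc", "other", "free"]))  # None
print(check_free_trace(["malloc", "free", "free"]))   # 2
```

The double free is reported at the second `free`, and a `free` with no preceding `malloc` is reported immediately.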

Bound Variables and Multiple Automata
● So far only one x value.
● Every x needs an automaton instance.
Gp $free( 1 ) −> P !$free( 1 ) S 1 = $malloc;
Gp $free( 2 ) −> P !$free( 2 ) S 2 = $malloc;
Gp $free( 3 ) −> P !$free( 3 ) S 3 = $malloc;
● Requires dynamic allocation.
● Not trivial in HW.
[Diagram: several automaton instances with malloc and free transitions.]
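In software, one instance per bound value is cheap to express: a dictionary keyed by pointer holds the state of each live instance, created on malloc and discarded on free. This is a sketch under assumed conventions (events as `(op, ptr)` pairs, a hypothetical `check_trace` function), not the pipeline described in the talk, and it sidesteps exactly the dynamic allocation that the slide notes is hard in hardware.

```python
def check_trace(events):
    """Check the free/malloc property for every pointer value in the trace.

    events: list of (op, ptr) pairs with op in {"malloc", "free"}.
    Returns a list of (index, ptr) for each free that violates the property.
    """
    live = {}  # ptr -> since-bit; one entry per allocated automaton instance
    violations = []
    for i, (op, ptr) in enumerate(events):
        since = live.get(ptr, False)  # no instance means x was never allocated
        if op == "free" and not since:
            violations.append((i, ptr))
        if op == "malloc":
            live[ptr] = True      # allocate (or reset) the instance for ptr
        elif op == "free":
            live.pop(ptr, None)   # release the instance again
    return violations


trace = [("malloc", 1), ("malloc", 2), ("free", 1),
         ("free", 1), ("free", 3), ("free", 2)]
print(check_trace(trace))  # [(3, 1), (4, 3)]
```

The number of live entries tracks the number of outstanding allocations, which is why a fixed-size hardware implementation needs some bound or eviction policy.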

A Streaming Verification Engine
Sources: ETM Sequencer, Packet Capture
Capture: HSSTP, FPGA Capture
Processing: Dataflow Engine, FPGA Offload
Properties: TESLA, malloc() pairing, Coherence correctness
[Diagram: arrows between the stages labelled Constraints and Requirements.]

Software Pipeline Performance
[Plot: LTL checking in software. Time (seconds, 0 to 6) against number of events (1000s, 0 to 2000), for three properties: no double allocation, no double frees, no leaks.]

Software Pipeline Performance
[Plot: Trace parsing in software. Time (seconds, 0 to 160) against number of events (1000s, 0 to 2000). Series: write trace, write trace w/ASM, write parsed trace. Other labels: Trace, ASM, Parser.]
