SLIDE 1

Flexible Architecture Research Machine (FARM)

RAMP Retreat June 25, 2009

Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun

SLIDE 2

Motivation

 Why CPUs + FPGAs make sense

 Application acceleration

 Prototyping new functionality, low-volume production
 FPGAs getting computationally denser

 Simulators/Research prototypes

 Software matters: experiment with new architectures
 Combine best of both worlds

SLIDE 3

Research Challenges

 When is it a good idea to use FPGAs + CPUs?

 Coarse-grained applications are great

 Video encoding, DSP, etc.

 But what about fine-grained communication?

 Fine-grained in space?
 Fine-grained in time?

 How?

 Hardware vs. software balance
 Mechanisms to reduce/hide overheads

SLIDE 4

The Stanford FARM

 High performance, yet flexible

 Commodity CPUs, memory, and I/O for a fast system with rich SW support

 FPGAs to prototype new accelerators

 FARM in a nutshell

 A research machine

 personalize computing (threads, vectors, reconfigurable, …)
 personalize memory (shared mem, transactions, streams, …)
 personalize I/O (off-loading engines, coherent I/O, …)

 An industrial strength cluster

 State-of-the-art CPUs, GPUs, memory, and I/O
 InfiniBand or PCIe interconnect
 Scalable to 10s or 100s of nodes

SLIDE 5

[Diagram: FARM node with four quad-core CPU sockets, each with its own memory]

SLIDE 6

[Diagram: FARM node with one quad-core CPU socket replaced by an FPGA with SRAM and I/O; three CPU sockets and their memory remain]

SLIDE 7

[Diagram: FARM node with two quad-core CPU sockets, an FPGA with SRAM and I/O, and a GPU/stream processor]

SLIDE 8

[Diagram: FARM system view: nodes of quad-core CPUs and an FPGA (with SRAM and I/O) joined by a scalable InfiniBand or PCIe interconnect]

SLIDE 9

Procyon System

 Initial platform for FARM
 From A&D Technology, Inc.
 Full system board

 AMD Opteron Socket F
 Two DDR2 DIMMs
 USB/eSATA/VGA/GigE
 Sun OpenSolaris OS

 Extra CPU board

 AMD Opteron Socket F

 FPGA Board

 Altera Stratix II FPGA

 All connected via HT backplane

 Also provides PCIe and PCI

SLIDE 10

Procyon System Communication Diagram

[Diagram: Procyon communication paths: two AMD Barcelona Opterons (1.8 GHz cores, each with 64 KB L1 and 512 KB L2, sharing a 2 MB L3) linked over HyperTransport at 32 Gbps with ~60 ns latency; the Altera Stratix II FPGA (132K logic elements) attaches through a cHT Core™ (HT PHY/link) at 6.4 Gbps with ~380 ns latency, and hosts a configurable coherent cache, a data transfer engine, cache/data-stream/MMR interfaces, and the user application]

 Components to manage system communication
 Numbers from the A&D Procyon

SLIDE 11

Overhead on Procyon

 Issues to resolve

 FPGA communication latencies: also non-uniform access times from different cores

 Frequency discrepancy: 1.8 GHz CPUs vs 100 MHz FPGA

 FPGA round trip from closer Opteron: ~1400 instructions
 FPGA round trip from farther Opteron: ~1700 instructions

 Synchronization

SLIDE 12

A Simple Analytical Model

 Goals

 High-level model for predicting accelerator speedup
 Intuition into when accelerating makes sense

 Inputs: hardware requirements and application requirements

SLIDE 13

A Simple Analytical Model

 Toff: time to execute the offloaded work on the processor
 a: acceleration factor for the offloaded work (a doubled rate gives a = 0.5)
 Ton: time to execute the remaining (unaccelerated) work on the processor
 G: fraction of the offloaded work done between each communication with the accelerator
 tovlp: time the processor is doing work in parallel with communication and/or work done on the accelerator
 tovhd: communication overhead

$$\mathrm{Speedup} = \frac{G\,(T_{\mathrm{on}} + T_{\mathrm{off}})}{G\,(T_{\mathrm{on}} + a\,T_{\mathrm{off}}) + t_{\mathrm{ovhd}} - t_{\mathrm{ovlp}}}$$
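For intuition, here is a minimal C sketch of the model. The parameter values are hypothetical, chosen only to show the breakeven behavior as granularity G shrinks relative to the communication overhead:

    #include <stdio.h>

    /* Modeled speedup: G(Ton + Toff) / (G(Ton + a*Toff) + t_ovhd - t_ovlp) */
    static double speedup(double t_on, double t_off, double a,
                          double g, double t_ovhd, double t_ovlp)
    {
        return g * (t_on + t_off) /
               (g * (t_on + a * t_off) + t_ovhd - t_ovlp);
    }

    int main(void)
    {
        /* Hypothetical workload: equal accelerated/unaccelerated work, 2x
         * acceleration (a = 0.5), overhead equal to 5% of the offloaded
         * work per communication, no overlap. */
        const double gs[] = { 0.01, 0.1, 0.5, 1.0 };
        for (int i = 0; i < 4; i++)
            printf("G = %4.2f -> speedup %.2f\n",
                   gs[i], speedup(1.0, 1.0, 0.5, gs[i], 0.05, 0.0));
        return 0;
    }

With these numbers the breakeven point falls at G = 0.1 (speedup 1.0), and the speedup approaches the 1.33x limit set by the offloaded work as G grows.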

SLIDE 14

A Simple Analytical Model: Synchronization

SLIDE 15

Model Verification

 Microbenchmark

 Essentially a loop which offloads “work” to the FPGA

 Use no-ops to simulate unaccelerated work on the processor

 Each instance of communication transfers 64 bytes of data
 Used to measure speedup for varying system/application choices (sketch below)
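A hedged sketch of that microbenchmark's inner loop in C; the mapped command pointer, the 8 x 8-byte command layout, and the SSE fence are assumptions standing in for FARM's actual command path:

    #include <stdint.h>
    #include <x86intrin.h>

    /* Offload one 64-byte "work" command to the FPGA through a
     * write-combining mapping; one cache line per communication. */
    static void offload(volatile uint64_t *cmd_line, uint64_t tag)
    {
        for (int i = 0; i < 8; i++)
            cmd_line[i] = tag;        /* 8 x 8 B fills one WC buffer */
        _mm_sfence();                 /* push the buffered line to the FPGA */
    }

    /* Alternate offloaded work with `nops` no-ops standing in for the
     * unaccelerated work; varying `nops` sweeps the granularity axis. */
    void microbench(volatile uint64_t *cmd_line, long iters, long nops)
    {
        for (long i = 0; i < iters; i++) {
            offload(cmd_line, (uint64_t)i);
            for (long n = 0; n < nops; n++)
                __asm__ volatile("nop");
        }
    }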

SLIDE 16

A Simple Analytical Model: Results

[Plot: speedup (0.5 to 2.5) vs. granularity normalized by round-trip latency (0.01 to 100, log scale); modeled and measured curves for full-synch, half-synch, and asynch communication. Annotations mark the theoretical speedup limit (bounded by the offloaded work) and the breakeven points for the full-synch and half-synch models.]

SLIDE 17

Initial Application: Transactional Memory

 Accelerate STM without changing the processor

 Use FPGA in FARM to detect conflicts between transactions

 Significantly improve expensive read barriers in STM systems

 Can use FPGA to atomically perform transaction commit

 Provides strong isolation from non-transactional access
 Not used in current rendition of FARM

 Good application for varying granularity of communication

 FPGA communication on all shared memory accesses: potential worst case (lots of communication)

SLIDE 18

FPGA Hardware Overview

[Diagram: FPGA hardware blocks: HT Core, HT Interface, HT Cache, RSM, and Committer]

SLIDE 19

FPGA Utilization

CPU Frequency: 1.8 GHz
HyperTransport Frequency: HT400
FPGA Device: Stratix II EP2S130
Logic Utilization: 62%
Total Registers: 43K
Combinational LUTs: 51%
Dedicated Logic Registers: 41%
Pin Usage: 33%
Block Memory: 10% (depends on cache)
PLLs: 4/12 (33%)
Logic Frequency: 100 MHz

SLIDE 20

CPU → FPGA Communication

 Driver

 Modify system registers to create a DRAM address space mapped to the FPGA

 “Unlimited” size (40-bit addresses)

 User application maps addresses into its virtual address space using mmap (sketched below)

 No kernel changes necessary
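A minimal sketch of the user-side setup, assuming a hypothetical /dev/farm_fpga device node and window size (the real FARM driver interface is not specified here):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FPGA_WINDOW_SIZE (1UL << 20)   /* hypothetical 1 MB window */

    /* Map the FPGA-backed physical address range into user virtual memory.
     * The driver has already pointed a DRAM address range at the FPGA by
     * programming system registers; no kernel changes are needed. */
    volatile uint64_t *map_fpga(void)
    {
        int fd = open("/dev/farm_fpga", O_RDWR | O_SYNC);  /* hypothetical node */
        if (fd < 0) { perror("open"); return NULL; }

        void *p = mmap(NULL, FPGA_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }
        return (volatile uint64_t *)p;
    }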

SLIDE 21

CPU → FPGA Commands

 Uncached stores

 Half-synchronous communication
 Writes strictly ordered

 Write combining buffers

 Asynchronous until buffer overflow
 Command offset: configure addresses to maximize merging (see sketch below)

 DMA

 Fully asynchronous
 Write to cached memory and pull from FPGA
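Of the three paths, the write-combining one is the most subtle. A hedged sketch of how rotating command offsets could keep stores merging in the WC buffers; the slot count and command layout are assumptions:

    #include <stdint.h>
    #include <x86intrin.h>

    #define CMD_SLOTS 16   /* assumed number of 64-byte command slots */

    /* Post a command through write-combining memory. Rotating through
     * distinct cache-line-sized slots means successive commands target
     * different lines, so stores keep merging in the WC buffers and the
     * hardware flushes them asynchronously (until the buffers overflow).
     * The static counter is a sketch; real use would keep it per core. */
    static void post_command(volatile uint64_t *wc_base, const uint64_t cmd[8])
    {
        static unsigned next_slot;
        volatile uint64_t *slot = wc_base + 8 * (next_slot++ % CMD_SLOTS);
        for (int i = 0; i < 8; i++)
            slot[i] = cmd[i];
        /* No fence here: fully buffered. Call _mm_sfence() only when a
         * synchronization point requires the command to be visible. */
    }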

SLIDE 22

FPGA → CPU Communication

 FPGA writes to coherent memory

 Need a static physical address (e.g. pinned page cache) or a coherent TLB on the FPGA
 Asynchronous but expensive; usually involves stealing a cache line from the CPUs…

 CPU reads memory-mapped registers (MMRs) on the FPGA

 Synchronous, but efficient
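The two notification paths side by side, as a hedged C sketch (the addresses are hypothetical):

    #include <stdint.h>

    /* 1. Asynchronous: spin on a pinned, cacheable flag that the FPGA
     *    writes coherently. Cheap to poll (cache hits) until the FPGA's
     *    coherent write steals the line and the new value is fetched. */
    static uint64_t wait_coherent(volatile uint64_t *flag)
    {
        while (*flag == 0)
            ;           /* spins in L1 until the FPGA's write lands */
        return *flag;
    }

    /* 2. Synchronous: read a memory-mapped register on the FPGA directly.
     *    Every read crosses HyperTransport, but no cache line ping-pongs. */
    static uint64_t read_mmr(volatile uint64_t *mmr)
    {
        return *mmr;    /* uncached load, roughly a full round trip */
    }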

SLIDE 23

Communication in TM

 CPU → FPGA

 Use write-combining buffer
 DMA not needed, yet.

 FPGA → CPU

 Violation notification uses coherent writes

 Free incremental validation

 Final validation uses MMR

SLIDE 24

Tolerating FPGA-CPU Latency

 Challenge: unbounded latency leads to unknown ordering of commands from various processors

 Solution: decouple the timeline of CPU command firing from FPGA reception

 Embed a global time stamp in commands to the FPGA
 Software or hardware increments the time stamp when necessary

 Divides time into “epochs”
 Currently using atomic increment; looking into Lamport clocks (sketch below)

 FPGA uses time stamp to reason about ordering
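A minimal sketch of the time-stamping scheme, using the atomic increment the slides mention; the command packing format is an assumption:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Global epoch counter shared by all threads. */
    static atomic_uint_fast32_t global_epoch;

    /* Advance the epoch with an atomic increment, per the slides;
     * returns the new epoch number. */
    static uint32_t bump_epoch(void)
    {
        return (uint32_t)atomic_fetch_add(&global_epoch, 1) + 1;
    }

    /* Pack the current epoch with a command so the FPGA can reason about
     * ordering even though command arrival times are unbounded. */
    static uint64_t make_command(uint32_t opcode, uint32_t epoch)
    {
        return ((uint64_t)epoch << 32) | opcode;
    }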

SLIDE 25

Global and Local Epochs

 Global Epochs

 Finer grain, but requires global state
 Know A < B, C but nothing about B and C

 Local Epochs

 Cheaper, but coarser grain (non-overlapping epochs)
 Know C < B, but nothing about A and B or A and C

[Diagram: commands A, B, C ordered under global epochs (epoch N-1, N, N+1) vs. local epochs]

SLIDE 26

Example: Use in TM

 Read Barrier

 Send command with global timestamp and read reference to FPGA

 FPGA maintains per-txn bloom filter

 Commit

 Send commands with global timestamp and each written reference to FPGA

 FPGA notifies CPU of already known violations
 Maintains a bloom filter for this epoch

 Violates new reads with same epoch
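A software model of the FPGA-side filtering described above; the filter size and hash functions are assumptions, since the slides do not give the hardware's actual hashing:

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_BITS 1024   /* assumed per-transaction Bloom filter size */

    struct txn_filter { uint64_t bits[FILTER_BITS / 64]; };

    /* Two simple address hashes; stand-ins for the FPGA's hash logic. */
    static unsigned h1(uint64_t a) { return (unsigned)((a >> 3)  % FILTER_BITS); }
    static unsigned h2(uint64_t a) { return (unsigned)((a >> 13) % FILTER_BITS); }

    /* Read barrier (FPGA side): record the read address in the txn's filter. */
    static void record_read(struct txn_filter *f, uint64_t addr)
    {
        f->bits[h1(addr) / 64] |= 1ULL << (h1(addr) % 64);
        f->bits[h2(addr) / 64] |= 1ULL << (h2(addr) % 64);
    }

    /* Commit (FPGA side): a written address hitting another transaction's
     * filter means a possible conflict, so that transaction is violated. */
    static bool conflicts(const struct txn_filter *f, uint64_t addr)
    {
        return (f->bits[h1(addr) / 64] >> (h1(addr) % 64) & 1) &&
               (f->bits[h2(addr) / 64] >> (h2(addr) % 64) & 1);
    }

Bloom filters can produce false positives (spurious violations) but never miss a true conflict, which keeps the scheme safe.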

SLIDE 27

TM Time Stamp illustration

[Diagram: time stamp example across CPU 0, CPU 1, and the FPGA: CPU 0 reads x; CPU 1 starts its commit and locks x; the FPGA violates x]

SLIDE 28

Synchronization “Fence”

 Occasionally you need to synchronize

 E.g. TM validation before commit
 Decoupling FPGA/CPU makes this expensive; should be rare

 Send fence command to FPGA
 FPGA notifies CPU when done

 Initially used coherent write; too expensive

 Improved: CPU reads MMR
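A minimal sketch of the improved fence, assuming a hypothetical fence opcode and MMR index:

    #include <stdint.h>

    #define CMD_FENCE       0x1ULL   /* hypothetical fence opcode */
    #define MMR_FENCES_DONE 8        /* hypothetical MMR: fences completed */

    /* Fence: tell the FPGA to drain everything it has received, then poll
     * an FPGA MMR until it reports completion. Each MMR read is a
     * synchronous HT round trip, but avoids the cache-line steal that the
     * original coherent-write scheme paid on every fence. Assumes fences
     * from this CPU are issued one at a time. */
    static void fpga_fence(volatile uint64_t *cmd, volatile uint64_t *mmr)
    {
        uint64_t done = mmr[MMR_FENCES_DONE];   /* fences retired so far */
        *cmd = CMD_FENCE;                       /* post the fence command */
        while (mmr[MMR_FENCES_DONE] == done)
            ;                                   /* spin until it retires */
    }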

SLIDE 29

Results

[Chart: single-thread execution breakdown for STAMP apps]

SLIDE 30

Results

[Chart: speedup over sequential execution for STAMP apps]

SLIDE 31

Classic Lessons

 Bandwidth

 CPU vs. simulator

 In-order single-cycle CPUs do not look like modern processors (Opteron)

 Off-chip is hard

 CPUs optimized for caches, not off-chip communication

 Wish list

 Truly asynchronous “fire and forget” method of writing to the FPGA

 Accelerator writes directly into the cache

SLIDE 32

Possible Directions

 Possibility of building a much bigger system (~28 cores)

 Security

 Memory watchdog, encryption, etc.

 Traditional hardware accelerators

 Scheduling, cryptography, video encoding, etc.

 Communication accelerator

 Partially-coherent cluster with FPGA connecting coherence domains

SLIDE 33

Questions