Flexible Architecture Research Machine (FARM)
RAMP Retreat, June 25, 2009
Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson, Christos Kozyrakis, Kunle Olukotun
Motivation
Why CPUs + FPGAs make sense
Application acceleration
Prototyping new functionality, low-volume production
FPGAs getting computationally denser
Simulators/Research prototypes
Software matters: experiment with new architectures
Combine the best of both worlds
Research Challenges
When is it a good idea to use FPGAs + CPUs?
Coarse-grained applications are great
Video encoding, DSP, etc.
But what about fine-grained communication
Fine-grained in space? Fine-grained in time?
How?
Hardware vs. software balance
Mechanisms to reduce/hide overheads
The Stanford FARM
High performance, yet flexible
Commodity CPUs, memory, and I/O for a fast system with rich SW support
FPGAs to prototype new accelerators
FARM in a nutshell
A research machine
personalize computing (threads, vectors, reconfigurable, …)
personalize memory (shared mem, transactions, streams, …)
personalize I/O (off-loading engines, coherent I/O, …)
An industrial strength cluster
State-of-the-art CPUs, GPUs, memory, and I/O
Infiniband or PCIe interconnect
Scalable to 10s or 100s of nodes
FARM Node
[Diagrams: a FARM node built up incrementally: commodity quad-core CPUs (Core 0-3), each with its own memory; an FPGA with local SRAM and I/O attached to the coherent fabric; and a GPU/stream processor alongside the CPUs and FPGA]
FARM System View
[Diagram: multiple FARM nodes connected by a scalable Infiniband or PCIe interconnect]
Procyon System
Initial platform for FARM, from A&D Technology, Inc.
Full system board
AMD Opteron (Socket F)
Two DDR2 DIMMs
USB/eSATA/VGA/GigE
Sun OpenSolaris OS
Extra CPU board
AMD Opteron Socket F
FPGA Board
Altera Stratix II FPGA
All connected via HT backplane
Also provides PCIe and PCI
Procyon System Communication Diagram
[Diagram: two AMD Barcelona quad-core processors (1.8 GHz cores, 64 KB L1 and 512 KB L2 per core, 2 MB shared L3) linked by HyperTransport at 32 Gbps with ~60 ns latency; the Altera Stratix II FPGA (132K logic gates) attaches over a 6.4 Gbps HyperTransport link with ~380 ns latency through a cHTCore™ (HyperTransport PHY/LINK), a configurable coherent cache, and a data transfer engine that expose cache, data stream, and MMR interfaces to the user application]
Components to manage system communication
Numbers from the A&D Procyon
Overhead on Procyon
Issues to resolve
FPGA communication latencies: also non-uniform access times from different cores
Frequency discrepancy: 1.8 GHz CPUs vs 100 MHz FPGA
FPGA round trip from the closer Opteron: ~1400 instructions
FPGA round trip from the farther Opteron: ~1700 instructions
Synchronization
A Simple Analytical Model
Goals
High-level model for predicting accelerator speedup
Intuition into when accelerating makes sense
Hardware requirements
Application requirements
A Simple Analytical Model
T_off: time to execute the offloaded work on the processor
a: acceleration factor for the offloaded work (a doubled rate would have a = 0.5)
T_on: time to execute the remaining (unaccelerated) work on the processor
G: percentage of the offloaded work done between each communication with the accelerator
t_ovlp: time the processor is doing work in parallel with communication and/or work done on the accelerator
t_ovhd: communication overhead
Speedup = G(T_on + T_off) / (G(T_on + a·T_off) + t_ovhd - t_ovlp)
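As a quick illustration of the model (not code from the talk; the variable names are ours), a small C helper that evaluates the speedup formula above:

    #include <stdio.h>

    /* Evaluate the speedup model above.
     * t_on, t_off : processor time for unaccelerated and offloaded work
     * a           : acceleration factor (0.5 = offloaded work runs twice as fast)
     * g           : share of the work done between communications
     * t_ovhd      : communication overhead
     * t_ovlp      : time overlapped with communication/accelerator work
     */
    static double model_speedup(double t_on, double t_off, double a,
                                double g, double t_ovhd, double t_ovlp)
    {
        double baseline    = g * (t_on + t_off);
        double accelerated = g * (t_on + a * t_off) + t_ovhd - t_ovlp;
        return baseline / accelerated;
    }

    int main(void)
    {
        /* Example: half the work offloaded and doubled in speed (a = 0.5),
         * with overhead equal to 20% of the per-communication work. */
        printf("speedup = %.2f\n", model_speedup(1.0, 1.0, 0.5, 1.0, 0.2, 0.0));
        return 0;
    }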
A Simple Analytical Model: Synchronization
Model Verification
Microbenchmark
Essentially a loop which offloads “work” to the FPGA
Use no-ops to simulate unaccelerated work on the processor
Each instance of communication transfers 64 bytes of data
Used to measure speedup for varying system/application choices
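A minimal sketch of the kind of loop such a microbenchmark runs; the fpga_buf mapping and 64-byte command layout here are assumptions, not the actual FARM interface:

    #include <stdint.h>
    #include <string.h>

    #define CMD_BYTES 64                     /* each communication moves 64 bytes */

    extern volatile uint8_t *fpga_buf;       /* hypothetical mapping of the FPGA */

    static void benchmark_iteration(const uint8_t cmd[CMD_BYTES], int nop_count)
    {
        /* offload "work": one 64-byte command to the FPGA */
        memcpy((void *)fpga_buf, cmd, CMD_BYTES);

        /* simulate the unaccelerated work left on the processor with no-ops */
        for (int i = 0; i < nop_count; i++)
            __asm__ volatile("nop");
    }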
A Simple Analytical Model: Results
[Plot: speedup (0.5-2.5) vs. granularity normalized by round-trip latency (0.01-100, log scale) for modeled and measured full-synch, half-synch, and asynch cases; annotations mark the theoretical speedup limit (limited by the offloaded work) and the breakeven points for the full-synch and half-synch models]
Initial Application: Transactional Memory
Accelerate STM without changing the processor
Use FPGA in FARM to detect conflicts between transactions
Significantly improve expensive read barriers in STM systems
Can use FPGA to atomically perform transaction commit
Provides strong isolation from non-transactional access
Not used in current rendition of FARM
Good application for varying granularity of communication
FPGA communication on all shared memory accesses: potential worst case (lots of communication)
FPGA Hardware Overview
[Block diagram: HT Core, HT Interface, HT Cache, RSM, and Committer]
FPGA Utilization
CPU Frequency: 1.8 GHz
HyperTransport Frequency: HT400
FPGA Device: Stratix II EP2S130
Logic Utilization: 62%
Total Registers: 43K
Combinational LUTs: 51%
Dedicated Logic Registers: 41%
Pin Usage: 33%
Block Memory: 10% (depends on cache)
PLLs: 4/12 (33%)
Logic Frequency: 100 MHz
CPU → FPGA Communication
Driver
Modify system registers to create a DRAM address space mapped to the FPGA
"Unlimited" size (40-bit addresses)
User application maps addresses into virtual space using mmap
No kernel changes necessary
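A minimal user-space sketch of this mapping; the /dev/farm device node and region size are assumptions for illustration:

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FPGA_REGION_BYTES (1UL << 20)    /* example window size */

    int main(void)
    {
        int fd = open("/dev/farm", O_RDWR);  /* hypothetical device node */
        if (fd < 0)
            return 1;

        /* map the FPGA-backed address range into the user's virtual space */
        volatile uint8_t *fpga = mmap(NULL, FPGA_REGION_BYTES,
                                      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fpga == MAP_FAILED)
            return 1;

        fpga[0] = 0x1;                       /* this store is routed to the FPGA */

        munmap((void *)fpga, FPGA_REGION_BYTES);
        close(fd);
        return 0;
    }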
CPU → FPGA Commands
Uncached stores
Half-synchronous communication
Writes strictly ordered
Write combining buffers
Asynchronous until buffer overflow
Command offset: configure addresses to maximize merging
DMA
Fully asynchronous
Write to cached memory and pull from FPGA
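A rough sketch of sending one command through the write-combining path; the fpga_wc mapping and command layout are assumptions:

    #include <stdint.h>
    #include <emmintrin.h>                   /* _mm_sfence */

    extern volatile uint64_t *fpga_wc;       /* hypothetical write-combining mapping */

    static void send_command(const uint64_t cmd[8])
    {
        /* fill one 64-byte command; consecutive offsets let the stores
         * merge into a single write-combining buffer */
        for (int i = 0; i < 8; i++)
            fpga_wc[i] = cmd[i];

        /* flush the write-combining buffer out to the FPGA */
        _mm_sfence();
    }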
FPGA → CPU Communication
FPGA writes to coherent memory
Need a static physical address (e.g., pinned page cache) or a coherent TLB on the FPGA
Asynchronous but expensive; usually involves stealing a cache line from the CPUs…
CPU reads memory-mapped registers on the FPGA
Synchronous, but efficient
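A rough sketch of both directions on the CPU side; pinned_flag and fpga_status_mmr are hypothetical names:

    #include <stdint.h>

    extern volatile uint32_t *pinned_flag;      /* coherent memory the FPGA writes */
    extern volatile uint32_t *fpga_status_mmr;  /* memory-mapped register on the FPGA */

    /* asynchronous path: spin on a cache line the FPGA will steal and update */
    static uint32_t wait_for_fpga_notification(void)
    {
        while (*pinned_flag == 0)
            ;                                   /* updated by an FPGA coherent write */
        return *pinned_flag;
    }

    /* synchronous path: one uncached read of an FPGA register */
    static uint32_t read_fpga_status(void)
    {
        return *fpga_status_mmr;
    }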
Communication in TM
CPU → FPGA
Use write-combining buffer
DMA not needed, yet
FPGA → CPU
Violation notification uses coherent writes
Free incremental validation
Final validation uses MMR
Tolerating FPGA-CPU Latency
Challenge: unbounded latency leads to unknown ordering of commands from various processors
Solution: decouple the timeline of CPU command firing from FPGA reception
Embed a global time stamp in commands to the FPGA
Software or hardware increments the time stamp when necessary
Divides time into "epochs"
Currently using atomic increment; looking into Lamport clocks
FPGA uses time stamp to reason about ordering
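A minimal sketch of the global-epoch idea using an atomic increment (GCC built-in); the command layout is an assumption:

    #include <stdint.h>

    static volatile uint64_t global_epoch = 1;  /* shared by all threads */

    /* advance the epoch whenever ordering must be established */
    static uint64_t next_epoch(void)
    {
        return __sync_add_and_fetch(&global_epoch, 1);
    }

    /* every command to the FPGA carries the epoch it was issued in,
     * so the FPGA can reason about ordering across CPUs */
    struct fpga_cmd {
        uint64_t epoch;
        uint64_t payload[7];                    /* rest of the 64-byte command */
    };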
Global and Local Epochs
Global Epochs
Finer grain, but requires global state
Know A < B, C, but nothing about B and C
Local Epochs
Cheaper, but coarser grain (non-overlapping epochs)
Know C < B, but nothing about A and B or A and C
[Diagram: commands A, B, C from different CPUs placed into epochs N-1, N, N+1 under the global and local epoch schemes]
Example: Use in TM
Read Barrier
Send command with global timestamp and read reference to FPGA
FPGA maintains per-txn bloom filter
Commit
Send commands with global timestamp and each written reference to FPGA
FPGA notifies CPU of already known violations
Maintains a bloom filter for this epoch
Violates new reads with same epoch
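A software sketch of the Bloom-filter test the FPGA performs in hardware; the hash function and filter size are illustrative only:

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_BITS 4096

    struct txn_filter {
        uint64_t bits[FILTER_BITS / 64];        /* per-transaction read set */
    };

    static uint32_t hash_addr(uintptr_t addr)
    {
        return (uint32_t)((addr >> 3) * 2654435761u) % FILTER_BITS;
    }

    /* read barrier: record the read reference in this transaction's filter */
    static void filter_add(struct txn_filter *f, uintptr_t addr)
    {
        uint32_t h = hash_addr(addr);
        f->bits[h / 64] |= 1ull << (h % 64);
    }

    /* commit: a written reference may conflict if it hits another txn's filter */
    static bool filter_may_contain(const struct txn_filter *f, uintptr_t addr)
    {
        uint32_t h = hash_addr(addr);
        return (f->bits[h / 64] >> (h % 64)) & 1;
    }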
TM Time Stamp illustration
[Diagram: timeline of CPU 0, CPU 1, and the FPGA: CPU 0 reads x; CPU 1 starts its commit and locks x; the FPGA then violates x]
Synchronization “Fence”
Occasionally you need to synchronize
E.g., TM validation before commit
Decoupling FPGA/CPU makes this expensive; it should be rare
Send fence command to FPGA
FPGA notifies CPU when done
Initially used a coherent write: too expensive
Improved: CPU reads MMR
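A rough sketch of the improved fence; fpga_cmd, fence_done_mmr, and the command encoding are assumptions:

    #include <stdint.h>

    extern volatile uint64_t *fpga_cmd;         /* hypothetical command register */
    extern volatile uint64_t *fence_done_mmr;   /* hypothetical completion MMR */

    #define CMD_FENCE 0x1ull

    static void fpga_fence(uint64_t epoch)
    {
        /* ask the FPGA to drain all commands issued up to this epoch */
        *fpga_cmd = (epoch << 8) | CMD_FENCE;

        /* synchronous but cheap: poll an MMR rather than wait on a coherent write */
        while (*fence_done_mmr < epoch)
            ;
    }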
Results
Single thread execution breakdown for STAMP apps
Results
Speedup over sequential execution for STAMP apps
Classic Lessons
Bandwidth
CPU vs. simulator
In-order, single-cycle CPUs do not look like modern processors (Opteron)
Off-chip is hard
CPUs are optimized for caches, not off-chip communication
Wish list
Truly asynchronous "fire and forget" method of writing to the FPGA
Accelerator writes directly into the cache
Possible Directions
Possibility of building a much bigger system (~28 cores)
Security
Memory watchdog, encryption, etc.
Traditional hardware accelerators
Scheduling, cryptography, video encoding, etc.
Communication Accelerator
Partially-coherent cluster with FPGA