
Mamba: Closing the Performance Gap in Productive Hardware Development Frameworks
Shunning Jiang, Berkin Ilbeyi, Christopher Batten
School of Electrical and Computer Engineering, Cornell University

The Traditional Flow
* HDL: hardware description language
* DUT: design under test
* TB: test bench
* synth: synthesis

Traditional hardware description language (example: Verilog)
✓ Fast edit-debug-sim loop
✓ Single language for design and testbench
X Difficult to parameterize
X Requires specific ways to build a powerful testbench

~12 Grad Students Taped Out Celerity in 9 Months
C++ → Verilog
Chisel → Verilog
SystemVerilog
PyMTL → Verilog

Scott Davidson, Shaolin Xie, Christopher Torng, Khalid Al-Hawaj, Austin Rovinski, Tutu Ajayi, Luis Vega, Chun Zhao, Ritchie Zhao, Steve Dai, Aporva Amarnath, Bandhav Veluri, Paul Gao, Anuj Rao, Gai Liu, Rajesh K. Gupta, Zhiru Zhang, Ronald G. Dreslinski, Christopher Batten, and Michael B. Taylor. "The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips." IEEE Micro, 38(2):30–41, Mar/Apr. 2018. (Special issue for top picks from HOTCHIPS-29.)

Hardware Preprocessing Framework (HPF)
Example: Genesis2
✓ Better parametrization with insignificant coding style change
X Multiple languages create semantic gap
X Still difficult to build powerful testbench

Hardware Generation Framework (HGF)
Example: Chisel
✓ Powerful parametrization
✓ Single language for design
X Slower edit-debug-sim loop
X Yet still difficult to build powerful testbench (can only generate simple testbench)

Hardware Generation and Simulation Framework (HGSF)
Example: PyMTL
✓ Powerful parametrization
✓ Single language for design and testbench
✓ Powerful testbench (unleash Python’s full power!)
✓ Fast edit-sim-debug loop

Sad fact: the loop is only fast when simulating a small number of cycles on a small design!

Closing the Performance Gap in HGSFs
▪ Understanding the performance gap
▪ Background on tracing JIT compiler
▪ Co-optimizing the JIT and the HGSF
▪ Mamba performance

Simulation Performance of a 64-bit Iterative Divider
• We implement a 64-bit radix-four iterative divider to the same level of detail in all frameworks, using a control/datapath split.
• Higher is better.
• Log scale: the gap is larger than it seems.
• CVS is 20X faster than Icarus.
• Verilator requires a C++ testbench, only works with synthesizable code, and takes time to compile, but is 200+X faster than Icarus.
• Chisel (HGF) generates Verilog and simulates that Verilog, so it gets the same performance.
• Using the CPython interpreter, Python-based HGSFs are much slower than CVS and even 10X slower than Icarus.
• Simply applying the unmodified PyPy JIT interpreter brings a ~10X speedup for Python-based HGSFs, but they are still significantly slower than CVS.
• Hybrid C/C++ cosimulation improves the performance, but:
  • It only works with a subset of code.
  • It may require the user to work with C/C++ and Python at the same time.
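To make the log-scale gap concrete, the ratios quoted in the bullets can be chained together. The sketch below is only arithmetic on those rounded figures, normalized so that CPython is 1; the variable names are illustrative, and "200+X" is treated as a lower bound of 200.

```python
# Rough relative simulation speeds, normalized to CPython = 1.
# These are the rounded ratios quoted on the slides, not measured values.
cpython = 1
icarus = 10 * cpython      # CPython-based HGSFs are ~10X slower than Icarus
cvs = 20 * icarus          # CVS is 20X faster than Icarus
verilator = 200 * icarus   # Verilator is 200+X faster than Icarus (lower bound)
pypy = 10 * cpython        # unmodified PyPy gives a ~10X speedup over CPython

# Remaining gaps implied by those ratios.
gap_cpython_to_cvs = cvs // cpython
gap_pypy_to_cvs = cvs // pypy
gap_pypy_to_verilator = verilator // pypy

print(gap_cpython_to_cvs, gap_pypy_to_cvs, gap_pypy_to_verilator)
```

Under these assumptions, unmodified PyPy only brings a Python-based HGSF back to roughly Icarus-level speed, which still leaves an order-of-magnitude gap to CVS.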

Interpreter and Just-In-Time Compiler for Dynamic Languages
▪ Dynamic languages provide vast productivity features. As a result, they require an interpreter (e.g., CPython).
▪ However, interpreters are slow.
▪ A just-in-time (JIT) compiler addresses the performance gap.

How Tracing JIT Works

    def max(a, b):
        if a > b:
            return a
        else:
            return b

    # This is a hot loop
    for i in xrange(10000000):
        ... = max(..., ...)

    # The first trace is generated
    # when integers are passed as args
    # and a is actually greater than b
    guard_type(a, int)    # type check
    guard_type(b, int)    # type check
    c = int_gt(a, b)      # check if a>b
    guard_true(c)
    return(a)

    # bridge out of guard_true(c)
    # The second trace is generated
    # when guard_true(c) fails
    return(b)

    # bridge out of guard_type(a, int)
    # The third trace is generated
    # when floats are passed as args
    guard_type(a, float)  # type check
    guard_type(b, float)  # type check
    c = float_gt(a, b)    # check if a>b
    guard_true(c)
    return(a)
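The guard-and-bridge structure above can be mimicked in plain Python. This is a toy model only: a real tracing JIT such as PyPy compiles traces to machine code, and every name below (`GuardFailed`, `first_trace`, `BRIDGES`, `jit_max`) is hypothetical, chosen to mirror the three traces on the slide.

```python
class GuardFailed(Exception):
    """Raised when a trace's assumption does not hold for the current inputs."""

    def __init__(self, guard_name):
        super().__init__(guard_name)
        self.guard_name = guard_name


def guard(condition, guard_name):
    if not condition:
        raise GuardFailed(guard_name)


def first_trace(a, b):
    # Specialized for: both arguments are ints and a > b.
    guard(type(a) is int, "guard_type(a, int)")
    guard(type(b) is int, "guard_type(b, int)")
    c = a > b
    guard(c, "guard_true(c)")
    return a


def third_trace(a, b):
    # Specialized for: both arguments are floats and a > b.
    guard(type(a) is float, "guard_type(a, float)")
    guard(type(b) is float, "guard_type(b, float)")
    c = a > b
    guard(c, "guard_true(c)")
    return a


# Bridges attached to guards of the first trace, keyed by the guard that failed.
BRIDGES = {
    "guard_true(c)": lambda a, b: b,    # second trace: return(b)
    "guard_type(a, int)": third_trace,  # third trace: float arguments
}


def interpreter_max(a, b):
    # Generic slow path, used when no trace or bridge covers the inputs.
    return a if a > b else b


def jit_max(a, b):
    try:
        return first_trace(a, b)
    except GuardFailed as failure:
        bridge = BRIDGES.get(failure.guard_name)
        if bridge is None:
            return interpreter_max(a, b)
        try:
            return bridge(a, b)
        except GuardFailed:
            return interpreter_max(a, b)
```

Calling `jit_max(3, 2)` stays on the first trace, `jit_max(2, 3)` takes the bridge out of `guard_true(c)`, and `jit_max(2.5, 1.0)` takes the bridge out of `guard_type(a, int)`. The point of the model is that each guard failure costs an extra hop, so code whose guards fail often gains little from tracing.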

Challenges of HGSFs on Tracing JIT
▪ By nature, event-driven simulation is bad for tracing JIT
▪ Control flows in logic blocks turn into guards that fail often
▪ Emulating fixed-width data types using Python’s seamless BigInt is not the most efficient approach
▪ …
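As an illustration of the fixed-width point, here is a minimal sketch of how a hardware-style value might be emulated on top of Python's arbitrary-precision integers. The `FixedBits` class is hypothetical and is not the API of PyMTL or any other framework; it only shows where the extra work comes from.

```python
class FixedBits:
    """Hypothetical fixed-width unsigned value backed by a Python int."""

    def __init__(self, nbits, value=0):
        self.nbits = nbits
        self.mask = (1 << nbits) - 1
        self.value = value & self.mask  # truncate to nbits

    def __add__(self, other):
        # Python ints never overflow, so wraparound must be applied by hand
        # (inside the constructor) after every arithmetic operation.
        return FixedBits(self.nbits, self.value + int(other))

    def __int__(self):
        return self.value

    def __eq__(self, other):
        return self.value == int(other)

    def __repr__(self):
        return f"FixedBits({self.nbits}, {self.value:#x})"


x = FixedBits(8, 0xFF)
y = x + 1                                 # wraps around within 8 bits
wide = FixedBits(64, (1 << 64) - 1) + 1   # wraps around within 64 bits
```

Every operation here allocates a new object and re-applies a mask, which is work that a native fixed-width integer type would not need.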

Challenges: Event-Driven Simulation
▪ Every signal value change check is a frequently failing guard
▪ Event-driven simulation’s inner loop is a bad pattern for tracing JIT

    num_cycles = 1000000
    for i in xrange(num_cycles):
        while not event_queue.empty():
            block = event_queue.pop()
            block()

    # The first trace is for blk1
    guard_equal(block, blk1)
    < execute the code of blk1 >
    jump_to_loop(while_loop)

    # The second trace is for blk2
    guard_equal(block, blk2)
    < execute the code of blk2 >
    jump_to_loop(while_loop)

    # The third trace is for blk3
    guard_equal(block, blk3)
    < execute the code of blk3 >
    jump_to_loop(while_loop)
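The inner loop above can be made runnable with a small amount of scaffolding. The sketch below is an assumption-laden stand-in rather than any framework's scheduler: it uses `collections.deque` as the event queue, three empty placeholder blocks named after the slide's `blk1` to `blk3`, and assumes every block is triggered once per cycle.

```python
from collections import deque

call_log = []  # records which block ran, to show the mix seen at one call site


def blk1():
    call_log.append("blk1")


def blk2():
    call_log.append("blk2")


def blk3():
    call_log.append("blk3")


def simulate(num_cycles, blocks_per_cycle):
    event_queue = deque()
    for _ in range(num_cycles):
        # Assumption for this sketch: every block is triggered once per cycle.
        event_queue.extend(blocks_per_cycle)
        while event_queue:
            block = event_queue.popleft()
            block()  # a single call site whose callee keeps changing


simulate(num_cycles=2, blocks_per_cycle=[blk1, blk2, blk3])

# Count how often the callee differs from the previous iteration's callee.
callee_changes = sum(
    prev != cur for prev, cur in zip(call_log, call_log[1:])
)
```

In this run the callee changes on every iteration of the `while` loop, so a trace guarded by `guard_equal(block, blk1)` would fail its guard immediately on the next iteration and hand off to another trace.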
