Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI-Design, Diagnostics and Architecture
Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache
Martin Zabel, Thomas B. Preußer, Rainer G. Spallek
JTRES'12, 25.10.2012
Martin.Zabel@tu-dresden.de
http://vlsi-eda.inf.tu-dresden.de

Outline
1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Motivation
Why Java?
- Object orientation, portability
- Automatic memory management, security
- Support for thread parallelization
Why a Java (bytecode) processor?
- Native execution of Java bytecode: no OS, no interpretation, no re-compilation
- Real-time capable
- Suited for embedded systems with limited resources
Why multi-core processors?
- Power consumption increases over-proportionally with clock frequency.
- Use thread-level parallelism instead.

Java Multi-Core Processor
Examples: JopCMP, jamuth, REALJava and SHAP
Common property: central shared heap for all cores
SHAP Multi-Core:
- Local stack memory per core
- Method cache per core
- Pipelining of heap accesses
- Concurrent GC for real-time applications
- Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
- CLDC, constant-time interface method dispatch, …
[Figure: SHAP Multi-Core Architecture (configurable). Components shown: Core0 to Core n-1, each with stack and method cache; memory manager with garbage collector; memory controller (DDR: 16, SDR: 32) to memory (SRAM or DDR-RAM); Wishbone bus with UART, graphics unit, Ethernet MAC and DMA.]

Goal
Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.

Related Work
Common solution for object-oriented processors: a cache for objects, in analogy to data caches [2] [3]
[Figure: object-cache entry. Tag: object reference and offset. Data: object content.]
Especially for real-time systems: separate caches for different data areas [4]:
- Classic data cache for data at static addresses (e.g. class data)
- Object cache for data at dynamic addresses (e.g. objects)

Indirect Object-Addressing
Problem of JopCMP and SHAP:
- Object table stored in external memory.
- Additional latency for each heap access.
- Additional demand on memory bandwidth.
Solution:
- Translation look-aside buffer (TLB) [1]
- Virtually indexed object cache [2]
[Figure: a reference on the stack points into the object table, whose entry points to the object in the heap.]
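Sketch (not from the slides): the cost of indirect addressing can be illustrated by counting external-memory accesses with and without a small handle-to-address buffer. The class `HandleTlb`, the function `external_accesses`, the LRU replacement and the example trace are hypothetical names and choices made only for this illustration.

```python
from collections import OrderedDict


class HandleTlb:
    """Tiny LRU-replaced map from object handle to physical base address."""

    def __init__(self, entries):
        self.entries = entries
        self.map = OrderedDict()

    def lookup(self, handle):
        if handle in self.map:
            self.map.move_to_end(handle)  # mark as most recently used
            return self.map[handle]
        return None

    def insert(self, handle, base):
        self.map[handle] = base
        self.map.move_to_end(handle)
        if len(self.map) > self.entries:
            self.map.popitem(last=False)  # evict least recently used entry


def external_accesses(trace, object_table, tlb=None):
    """Count external-memory accesses for a trace of (handle, offset) reads."""
    count = 0
    for handle, _offset in trace:
        base = tlb.lookup(handle) if tlb is not None else None
        if base is None:
            base = object_table[handle]  # extra access: object-table lookup
            count += 1
            if tlb is not None:
                tlb.insert(handle, base)
        count += 1  # the heap access itself, at base + offset
    return count


object_table = {1: 0x1000, 2: 0x2000}
trace = [(1, 0), (1, 1), (2, 0), (1, 2)]
print(external_accesses(trace, object_table))                # 8
print(external_accesses(trace, object_table, HandleTlb(2)))  # 6
```

In this toy trace, every access without the buffer pays for the table lookup plus the heap access; with the buffer, only the first access to each handle pays for the lookup.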

Cache Coherence
Problem: coherence of distributed caches
Advantage of the Java Virtual Machine: synchronization is required only when [5]
- entering a critical section, or
- accessing a “volatile” variable.

Heap-Access Analysis
Evaluation:
- Benchmark suite JemBench (Version 2.0) [9], all except the microkernel benchmarks
- SHAP Multi-Core with 1 core and trace unit [1]
- Recording of executed bytecodes and memory accesses

Heap-Access Analysis
[Figure: memory bandwidth utilization [%] (axis 0 to 25) per benchmark (AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp), split into data accesses (core itself), bytecode fetches (method cache) and accesses for memory management. A value of 11% is annotated in the plot.]

Heap-Access Analysis
Results:
1. The most frequent object accesses are reads on arrays (*) and member variables.
2. 80% of all object accesses concentrate on 6 objects.
3. Frequent accesses to the first user-specific object offsets (-2 and 1).
=> Further evaluation: small fully associative cache for each core
(*) Already accounts for implicit reads of the array length.
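Sketch (not from the slides): a concentration figure such as "80% of all object accesses concentrate on 6 objects" can be derived from a recorded trace roughly as follows. The function name `objects_for_share` and the example trace are hypothetical; the trace format of the SHAP trace unit is not given in the slides.

```python
from collections import Counter


def objects_for_share(trace, share=0.8):
    """Return how many of the most-accessed objects cover `share` of all accesses."""
    counts = Counter(handle for handle, _offset in trace)
    total = sum(counts.values())
    covered = 0
    for n, (_handle, hits) in enumerate(counts.most_common(), start=1):
        covered += hits
        if covered >= share * total:
            return n
    return len(counts)


# Hypothetical trace of (handle, offset) accesses: handle 7 dominates.
trace = [(7, 1)] * 7 + [(3, -2)] * 2 + [(9, 0)]
print(objects_for_share(trace))  # 2
```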

Small Fully Associative Local Cache
Storing only invariant data:
- Would require no extra logic for cache coherency.
- In general, only the class information pointer and the array size are invariant.
- Significant reduction only for array-intensive programs: BubbleSort 33%, Sieve 20%, MatrixMul 17% of all memory accesses.
Write-through instead of write-back:
- No special GC interaction required.
- Simple cache coherence: invalidate the cache when
  • entering a critical section, or
  • accessing a “volatile” variable.
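Sketch (not from the slides): a minimal software model of the write-through plus invalidate scheme, assuming each core holds a private cache in front of one shared heap. `WriteThroughCache` and its methods are hypothetical names; the real design is hardware and is not specified at this level in the slides.

```python
class WriteThroughCache:
    """Per-core cache model: writes always reach the shared heap immediately."""

    def __init__(self, memory):
        self.memory = memory  # shared heap, modeled as a dict
        self.lines = {}       # local copies: address -> value

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]  # miss: fetch from heap
        return self.lines[addr]

    def write(self, addr, value):
        self.memory[addr] = value  # write-through: heap is always current
        self.lines[addr] = value

    def invalidate(self):
        """Model of the invalidation on critical-section entry or volatile access."""
        self.lines.clear()


heap = {0x10: 1}
core0, core1 = WriteThroughCache(heap), WriteThroughCache(heap)

core1.read(0x10)          # core1 now holds a local copy of value 1
core0.write(0x10, 2)      # core0 updates the heap via write-through
stale = core1.read(0x10)  # still 1: this model has no snooping
core1.invalidate()        # e.g. core1 enters a critical section
fresh = core1.read(0x10)  # 2: re-fetched from the heap
print(stale, fresh)
```

Because the heap always holds the latest value, dropping the local copies at a synchronization point is enough to see other cores' writes; no write-back or GC interaction is needed in this model.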

Cache Design (1)
Cache integration: into the memory manager port
Core modifications:
- Bytecode to access volatile variables
- Microcode to invalidate the cache
[Figure: SHAP Multi-Core with AOC (excerpt, configurable). Components shown: Core0 with stack and method cache; AOC on the data port between core and memory manager; garbage collector; memory controller (DDR: 16, SDR: 32); Wishbone bus; memory (SRAM or DDR-RAM).]

Cache Design (2)
Features:
- Address & Offset Cache (AOC) with write-through, LRU strategy
- 1 valid bit per cached word
- Configurable: number of cache lines, cached offsets
Disadvantage: additional latency of 1 clock cycle during a cache miss
[Figure: Address and Offset Cache (AOC). The core's stack holds an object handle; external memory holds the object table (physical address per handle) and the heap (objects 1 to 3; data W, X, Y, Z at offsets -2, -1, 0, 1). Each AOC line has a tag (object handle), an address memory entry (physical address) and a valid & data memory entry (valid bit and data for offset -2, valid bit and data for offset 1). Example line 0: object handle, physical address, data W and data Z, both valid.]
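Sketch (not from the slides): a behavioral model of an address-and-offset cache with the listed features (write-through, LRU, one valid bit per cached word, configurable lines and offsets). All names are hypothetical. Two assumptions are made: the cached offsets are treated as the set {-2, 1}, as drawn in the figure, and a word's valid bit is modeled by its presence in a dictionary.

```python
from collections import OrderedDict


class AddressOffsetCache:
    """Fully associative LRU cache: per handle, base address plus selected offsets."""

    def __init__(self, object_table, heap, num_lines=8, offsets=(-2, 1)):
        self.object_table = object_table  # handle -> physical base address
        self.heap = heap                  # physical address -> word
        self.num_lines = num_lines        # assumed >= 1
        self.offsets = offsets
        self.lines = OrderedDict()        # handle -> {"base": int, "data": dict}
        self.stats = {"table_reads": 0, "heap_reads": 0, "heap_writes": 0}

    def _line(self, handle):
        line = self.lines.get(handle)
        if line is None:
            self.stats["table_reads"] += 1  # tag miss: read the object table
            line = {"base": self.object_table[handle], "data": {}}
            self.lines[handle] = line
            if len(self.lines) > self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used line
        self.lines.move_to_end(handle)  # mark line as most recently used
        return line

    def read(self, handle, offset):
        line = self._line(handle)
        if offset in line["data"]:  # valid bit set for this word
            return line["data"][offset]
        self.stats["heap_reads"] += 1
        value = self.heap[line["base"] + offset]
        if offset in self.offsets:
            line["data"][offset] = value  # fill word and set its valid bit
        return value

    def write(self, handle, offset, value):
        line = self._line(handle)
        self.heap[line["base"] + offset] = value  # write-through to the heap
        self.stats["heap_writes"] += 1
        if offset in self.offsets:
            line["data"][offset] = value

    def invalidate(self):
        self.lines.clear()


heap = {98: 3, 99: 0, 100: 10, 101: 11}
aoc = AddressOffsetCache({5: 100}, heap)
aoc.read(5, -2)      # table read + heap read, word becomes valid
aoc.read(5, -2)      # served from the cache
aoc.read(5, 0)       # address hit, but offset 0 is not cached: heap read
aoc.write(5, 1, 42)  # write-through, cached copy updated
print(aoc.read(5, 1), aoc.stats)
```

Even when a word is not cached, a hit on the tag still saves the object-table read, which corresponds to the "address only" configuration.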

Cache Configuration
Huge configuration space:
- Cache lines: l ∈ ℕ, l > 0
- Offsets: none (address only), or range [−x, y] with x, y ∈ ℕ, x > 0
- Cores: n = 1, 2, …, 18
But synthesis for n cores is too expensive.
=> Search for a good initial configuration.
Configuration space exploration:
- Baseline design for comparison: TLB with 2 entries
- Benchmarks:
  • JemBench
  • JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
- Platform:
  • SHAP on Virtex-5 FPGA XC5VLX110T
  • Same clock frequency of 80 MHz as the baseline design
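Sketch (not from the slides): a rough count of why an exhaustive sweep is unattractive and a staged search is used instead. The line counts and offset settings below are read off the axes of the two single-core plots that follow; treating n = 1, …, 18 as the number of cores is an assumption, and the staged count only mirrors the order of those two plots.

```python
LINE_COUNTS = [2, 4, 8, 16, 32, 64]
OFFSET_SETTINGS = ["address only", (-1, 0), (-2, 1), (-4, 3), (-8, 7),
                   (-16, 15), (-8, 23), (-32, 31), (-16, 47), (-8, 55)]
CORE_COUNTS = range(1, 19)  # n = 1, 2, ..., 18 (assumed to be the core count)

# Exhaustive: every combination would need its own synthesis run.
exhaustive = len(LINE_COUNTS) * len(OFFSET_SETTINGS) * len(CORE_COUNTS)

# Staged, single core: sweep the line count with address-only caching,
# then sweep the offsets at the chosen line count. The point
# (chosen line count, address only) is shared by both sweeps.
staged = len(LINE_COUNTS) + len(OFFSET_SETTINGS) - 1

print(exhaustive, staged)  # 1080 15
```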

Setup: 1 Core, Cache Address Only
[Figure: relative performance (axis 0.96 to 1.14) over cache configuration (baseline design, 2, 4, 8, 16, 32 and 64 lines) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]
=> Use 8 lines

Setup: 1 Core, 8 Lines
[Figure: relative performance (axis 1 to 1.45) over cache configuration (only address, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]
=> Cache offsets -2 & 1
