
Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache



  1. Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache
     Martin Zabel, Thomas B. Preußer, Rainer G. Spallek
     Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI Design, Diagnostics and Architecture
     JTRES'12, 25.10.2012
     Martin.Zabel@tu-dresden.de
     http://vlsi-eda.inf.tu-dresden.de

  2. Outline: 1 Motivation; 2 Related Work; 3 Heap-Access Analysis; 4 Implementation & Results; 5 Conclusion
     [Footer: Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache, Martin Zabel, slide 2 of 28]

  3. Motivation
     Why Java?
     - Object orientation, portability
     - Automatic memory management, security
     - Support for thread parallelization
     Why a Java (bytecode) processor?
     - Native execution of Java bytecode: no OS, no interpretation, no recompilation
     - Real-time capability
     - Suited for embedded systems with limited resources
     Why multi-core processors?
     - Power consumption increases disproportionately with clock frequency.
     - Use thread-level parallelism instead.

  4. Java Multi-Core Processor
     Examples: JopCMP, jamuth, REALJava and SHAP
     Common property: central shared heap for all cores
     SHAP Multi-Core:
     - Local stack memory per core
     - Method cache per core
     - Pipelining of heap accesses
     - Concurrent GC for real-time applications
     - Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
     - CLDC, constant-time interface method dispatch, …
     [Figure: SHAP Multi-Core Architecture (configurable). Labels: Core 0 to Core n-1, each with stack and method cache; memory manager; garbage collector; controller; Wishbone bus; UART; DMA; graphics unit; Ethernet MAC; memory (SRAM or DDR-RAM; DDR: 16, SDR: 32).]

  5. Goal
     Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.

  6. Outline: 1 Motivation; 2 Related Work; 3 Heap-Access Analysis; 4 Implementation & Results; 5 Conclusion

  7. Related Work
     Common solution for object-oriented processors: a cache for objects, analogous to data caches [2] [3]
     [Figure: cache entry layout. Tag: object reference and offset. Data: object content.]
     Especially for real-time systems: separate caches for different data areas [4]:
     - Classic data cache for data at static addresses (e.g. class data)
     - Object cache for data at dynamic addresses (e.g. objects)

  8. Indirect Object Addressing
     Problem of JopCMP and SHAP:
     - Object table stored in external memory.
     - Additional latency for each heap access.
     - Additional demand on memory bandwidth.
     Solution:
     - Translation look-aside buffer (TLB) [1]
     - Virtually indexed object cache [2]
     [Figure: indirection from the stack through the object table to the heap.]
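
The cost of the indirection on slide 8 can be illustrated with a small software model. This is only a sketch under stated assumptions, not the SHAP hardware: the class `HandleHeap`, the helper `count_accesses`, the LRU replacement of the TLB, and the example handles and addresses are all hypothetical. It counts external memory accesses for the same read sequence with and without a two-entry TLB.

```python
from collections import OrderedDict


class HandleHeap:
    """Toy model: object handles are resolved through an object table."""

    def __init__(self, tlb_entries=0):
        self.object_table = {}      # handle -> physical base address
        self.heap = {}              # physical address -> word
        self.tlb = OrderedDict()    # small LRU cache: handle -> base address
        self.tlb_entries = tlb_entries
        self.external_accesses = 0  # accesses that reach external memory

    def allocate(self, handle, base, words):
        self.object_table[handle] = base
        for offset, word in enumerate(words):
            self.heap[base + offset] = word

    def _translate(self, handle):
        if handle in self.tlb:
            self.tlb.move_to_end(handle)      # mark as most recently used
            return self.tlb[handle]
        self.external_accesses += 1           # object-table lookup
        base = self.object_table[handle]
        if self.tlb_entries > 0:
            if len(self.tlb) >= self.tlb_entries:
                self.tlb.popitem(last=False)  # evict least recently used entry
            self.tlb[handle] = base
        return base

    def read(self, handle, offset):
        base = self._translate(handle)
        self.external_accesses += 1           # the data word itself
        return self.heap[base + offset]


def count_accesses(tlb_entries):
    """Run a fixed read pattern and return the external access count."""
    mem = HandleHeap(tlb_entries)
    mem.allocate(handle=1, base=100, words=[10, 11, 12])
    mem.allocate(handle=2, base=200, words=[20, 21, 22])
    for _ in range(4):
        mem.read(1, 0)
        mem.read(2, 1)
    return mem.external_accesses


print(count_accesses(0))  # every read pays for the table lookup
print(count_accesses(2))  # only the first read per handle pays for it
```

In this toy pattern the TLB removes the table lookup from all but the first read of each object, which is the latency and bandwidth saving the slide refers to.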

  9. Cache Coherence
     Problem: coherence of distributed caches
     Advantage of the Java Virtual Machine: synchronization is required only when [5]
     - entering a critical section, or
     - accessing a "volatile" variable.

  10. Outline: 1 Motivation; 2 Related Work; 3 Heap-Access Analysis; 4 Implementation & Results; 5 Conclusion

  11. Heap-Access Analysis
      Evaluation:
      - Benchmark suite JemBench (version 2.0) [9], all except microkernel benchmarks
      - SHAP Multi-Core with 1 core and trace unit [1]
      - Recording of executed bytecodes and memory accesses

  12. Heap-Access Analysis
      [Figure: bar chart of memory bandwidth utilization [%] (axis 0 to 25) per benchmark: AES, BubbleSort, Kfl, Lift, MatrixMul, N-Queens, Sieve, UdpIp. Series: data accesses (core itself), bytecode fetches (method cache), accesses for memory management. A value of 11% is annotated.]

  13. Heap-Access Analysis
      Evaluation:
      - Benchmark suite JemBench (version 2.0) [9]
      - SHAP Multi-Core with 1 core and trace unit [1]
      - Recording of executed bytecodes and memory accesses
      Results:
      1. The most frequent object accesses are reads on arrays (already accounting for implicit reads of the array length) and member variables.
      2. 80% of all object accesses concentrate on 6 objects.
      3. Frequent accesses to the first user-specific object offsets (-2 and 1)
      Further evaluation: small fully associative cache for each core
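
A concentration figure such as "80% of all object accesses concentrate on 6 objects" can be computed from a recorded trace roughly as follows. This is a sketch only: the function `top_k_coverage` and the trace format of `(object handle, offset)` pairs are assumptions, and the trace shown is made up for illustration, not data from the presentation.

```python
from collections import Counter


def top_k_coverage(trace, k):
    """Fraction of accesses in `trace` that go to the k most-accessed objects."""
    if not trace:
        return 0.0
    counts = Counter(handle for handle, _offset in trace)
    covered = sum(n for _handle, n in counts.most_common(k))
    return covered / len(trace)


# Hypothetical trace of (object handle, offset) pairs; not measured data.
trace = [(1, 0)] * 5 + [(2, 1)] * 3 + [(3, -2), (4, 0)]

print(top_k_coverage(trace, 2))  # share of accesses covered by the 2 hottest objects
```

A high coverage for a small `k` suggests that a cache with only a few lines per core can already capture most object accesses.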

  14. Small Fully Associative Local Cache
      Storing only invariant data:
      - Would require no extra logic for cache coherency.
      - In general, only the class information pointer and the array size are invariant.
      - Significant reduction only for array-intensive programs: BubbleSort 33%, Sieve 20%, MatrixMul 17% of all memory accesses.
      Write-through instead of write-back:
      - No special GC interaction required.
      - Simple cache coherence: invalidate the cache when
        • entering a critical section, or
        • accessing a "volatile" variable.
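
The coherence scheme on slide 14 (write-through plus invalidation at synchronization points) can be sketched with two software "cores" sharing one heap. The classes `SharedHeap` and `CoreCache`, their method names, and the choice to update the local copy on a write are assumptions made for this illustration; the slides do not specify them.

```python
class SharedHeap:
    """Central heap shared by all cores: (handle, offset) -> word."""

    def __init__(self):
        self.words = {}


class CoreCache:
    """Local write-through cache, invalidated at synchronization points."""

    def __init__(self, heap):
        self.heap = heap
        self.lines = {}  # (handle, offset) -> cached word

    def read(self, handle, offset):
        key = (handle, offset)
        if key not in self.lines:
            self.lines[key] = self.heap.words[key]  # miss: fetch from shared heap
        return self.lines[key]

    def write(self, handle, offset, value):
        key = (handle, offset)
        self.heap.words[key] = value  # write-through: the heap is always current
        self.lines[key] = value       # assumption: also update the local copy

    def enter_critical_section(self):
        self.lines.clear()            # drop possibly stale data

    def read_volatile(self, handle, offset):
        self.lines.clear()            # volatile access also invalidates
        return self.read(handle, offset)


heap = SharedHeap()
heap.words[(1, 0)] = 5
a, b = CoreCache(heap), CoreCache(heap)

b.read(1, 0)                # core B caches the value 5
a.write(1, 0, 7)            # core A writes through to the heap
stale = b.read(1, 0)        # B may still see 5 without synchronization
b.enter_critical_section()  # invalidation at the synchronization point
fresh = b.read(1, 0)        # B now fetches the current value from the heap

print(stale, fresh)
```

Because every write reaches the heap immediately, a core never holds the only up-to-date copy of a word, so invalidating the local cache is sufficient and no write-back step is needed.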

  15. Outline: 1 Motivation; 2 Related Work; 3 Heap-Access Analysis; 4 Implementation & Results; 5 Conclusion

  16. Cache Design (1)
      Cache integration: into the memory manager port
      Core modifications:
      - Bytecode to access volatile variables
      - Microcode to invalidate the cache
      [Figure: SHAP Multi-Core with AOC (excerpt, configurable). Labels: Core0 with stack and method cache; data ports; AOC; memory manager; garbage collector; controller; Wishbone bus; memory (SRAM or DDR-RAM; DDR: 16, SDR: 32).]

  17. Cache Design (2)
      Features:
      - Address- and Offset-Cache (AOC) with write-through and LRU strategy
      - 1 valid bit per cached word
      - Configurable: number of cache lines, cached offsets
      Disadvantage: additional latency of 1 clock cycle on a cache miss
      [Figure: the stack in the core holds an object handle; in external memory, the object table maps the handle to a physical address, and the heap holds objects 1 to 3 with data words W, X, Y, Z at offsets -2, -1, 0, 1. The Address- and Offset-Cache (AOC) consists of tags (object handle), address memory (physical address), and valid & data memory (one valid bit V and one data word per cached offset). Example line 0: object handle, physical address, offset -2 valid with data W, offset 1 valid with data Z; lines 1 to 3 are shown empty.]
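
The AOC organization on slide 17 can be approximated in software to see which accesses it saves. The following is a behavioral sketch under assumptions, not the hardware design: the class `AOC`, the counter `external_reads`, the treatment of the cached offsets as an inclusive range, and the allocate-on-write behavior are all hypothetical. A word's presence in the per-line dictionary stands in for its valid bit.

```python
from collections import OrderedDict


class AOC:
    """Toy Address- and Offset-Cache: fully associative, LRU, write-through."""

    def __init__(self, object_table, heap, num_lines=8, offsets=range(-2, 2)):
        self.object_table = object_table  # handle -> physical base address
        self.heap = heap                  # physical address -> word
        self.num_lines = num_lines
        self.offsets = set(offsets)       # offsets that have a data slot per line
        self.lines = OrderedDict()        # tag (handle) -> (base, {offset: word})
        self.external_reads = 0

    def _line(self, handle):
        if handle in self.lines:
            self.lines.move_to_end(handle)      # LRU update on hit
        else:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used line
            self.external_reads += 1            # object-table lookup
            self.lines[handle] = (self.object_table[handle], {})
        return self.lines[handle]

    def read(self, handle, offset):
        base, words = self._line(handle)
        if offset in words:                     # valid bit set: data hit
            return words[offset]
        self.external_reads += 1                # fetch the word from the heap
        word = self.heap[base + offset]
        if offset in self.offsets:
            words[offset] = word                # fill slot, set valid bit
        return word

    def write(self, handle, offset, value):
        base, words = self._line(handle)
        self.heap[base + offset] = value        # write-through
        if offset in self.offsets:
            words[offset] = value

    def invalidate(self):
        self.lines.clear()                      # e.g. on entering a critical section


# Hypothetical object at base address 100 with words at offsets -2 .. 3.
object_table = {1: 100}
heap = {100 + o: w for o, w in zip(range(-2, 4), "WXYZAB")}
aoc = AOC(object_table, heap, num_lines=4)

aoc.read(1, -2)  # address miss and data miss
aoc.read(1, -2)  # full hit, no external access
aoc.read(1, 1)   # address hit, data miss
aoc.read(1, 3)   # offset outside the cached range: address hit only
aoc.read(1, 3)
print(aoc.external_reads)
```

Without any cache the same five reads would each need a table lookup and a data access, so even accesses outside the cached offset range benefit from the cached physical address.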

  18. Cache Configuration
      Huge configuration space:
      - Cache lines: $l \in \mathbb{N},\ l > 0$
      - Offsets: none (address only), or range $[-x, y]$ with $x, y \in \mathbb{N},\ x > 0$
      But synthesis for $n = 1, 2, \ldots, 18$ cores is too expensive.
      - Search for a good initial configuration.
      Configuration space exploration:
      - Baseline design for comparison: TLB with 2 entries
      - Benchmarks:
        • JemBench
        • JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
      - Platform:
        • SHAP on Virtex-5 FPGA XC5VLX110T
        • Same clock frequency of 80 MHz as the baseline design

  19. Setup: 1 Core, Cache Address Only
      [Figure: relative performance (axis 0.96 to 1.14) over cache configuration (baseline design, 2, 4, 8, 16, 32, 64 lines) for the benchmarks SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve, BS.]
      Use 8 lines.

  20. Setup: 1 Core, 8 Lines
      [Figure: relative performance (axis 1 to 1.45) over cache configuration (address only, [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for the benchmarks SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve, BS.]
      Cache offsets -2 & 1.
