SLIDE 1

Faculty of Computer Science, Institute for Computer Engineering, Chair for VLSI Design, Diagnostics and Architecture

Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache

Martin Zabel, Thomas B. Preußer, Rainer G. Spallek

JTRES'12

Martin.Zabel@tu-dresden.de http://vlsi-eda.inf.tu-dresden.de

25.10.2012


SLIDE 2

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

Martin Zabel, Increasing the Efficiency of an Embedded Multi-Core Bytecode Processor Using an Object Cache, slide 2 of 28

SLIDE 3

Motivation

Why Java?
  • Object orientation, portability
  • Automatic memory management, security
  • Support for thread parallelization

Why a Java (bytecode) processor?
  • Native execution of Java bytecode: no OS, no interpretation, no re-compilation
  • Real-time capable
  • Suited for embedded systems with limited resources

Why multi-core processors?
  • Power consumption increases more than proportionally with clock frequency.
  • Use thread-level parallelism instead.

SLIDE 4

Java Multi-Core Processor

Examples: JopCMP, jamuth, REALJava and SHAP
Common property: central shared heap for all cores

SHAP Multi-Core:
  • Local stack memory per core
  • Method cache per core
  • Pipelining of heap accesses
  • Concurrent GC for real-time apps
  • Maximum speed-up of 8 for programs with an above-average number of memory accesses [1]
  • CLDC, constant-time interface method dispatch, …

[Figure: SHAP Multi-Core Architecture. Cores 0 to n-1 (configurable), each with method cache and stack; memory manager with garbage collector; memory controller for SRAM or DDR-RAM (DDR: 16, SDR: 32); Wishbone bus with UART, graphics unit, Ethernet MAC and DMA.]

SLIDE 5

Goal

Further reduce the demands on the heap memory interface to achieve higher speed-ups through thread-level parallelism.

SLIDE 6

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

SLIDE 7

Related Work

Common solution for object-oriented processors:
  • Cache for objects in analogy to data caches [2] [3]

[Figure: object-cache line. Tag: object reference and offset. Data: object content.]

Especially for real-time systems:
  • Separate caches for different data areas [4]:
      • Classic data cache for data at static addresses (e.g. class data)
      • Object cache for data at dynamic addresses (e.g. objects)

SLIDE 8

Indirect Object-Addressing

Problem of JopCMP and SHAP:
  • Object table stored in external memory.
  • Additional latency for each heap access.
  • Additional demand on memory bandwidth.

[Figure: indirect addressing from the stack via the object table to the heap.]

Solution:
  • Translation look-aside buffer (TLB) [1]
  • Virtually indexed object cache [2]
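The cost of the indirection can be illustrated with a small Python model. This is only a sketch under simplifying assumptions, not the SHAP implementation: the class `HandleHeap`, its fields, and the unbounded dictionary standing in for a TLB are hypothetical names chosen for illustration.

```python
class HandleHeap:
    """Toy heap with an object table; counts external-memory accesses."""

    def __init__(self):
        self.object_table = {}  # object handle -> physical base address
        self.memory = {}        # physical address -> data word
        self.accesses = 0       # number of external-memory accesses so far

    def allocate(self, handle, base, words):
        """Place an object at `base` and register it in the object table."""
        self.object_table[handle] = base
        for i, word in enumerate(words):
            self.memory[base + i] = word

    def read(self, handle, offset, tlb=None):
        """Read one word; `tlb` is an optional on-chip handle->address map."""
        if tlb is not None and handle in tlb:
            base = tlb[handle]                # translation found on-chip
        else:
            base = self.object_table[handle]  # object-table lookup in memory
            self.accesses += 1
            if tlb is not None:
                tlb[handle] = base            # simplification: never evicts
        self.accesses += 1                    # the actual data access
        return self.memory[base + offset]
```

Without a translation buffer, every read costs two external accesses (table plus data). With a buffered translation, repeated reads of the same object cost one.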

SLIDE 9

Cache Coherence

Problem: Coherence of distributed caches

Advantage of the Java Virtual Machine: synchronization is required only when [5]
  • entering a critical section, or
  • accessing a "volatile" variable.

SLIDE 10

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

SLIDE 11

Heap-Access Analysis

Evaluation:
  • Benchmark suite JemBench (Version 2.0) [6], all except the microkernel benchmarks
  • SHAP Multi-Core with 1 core and trace unit [1]
  • Recording of executed bytecodes and memory accesses

SLIDE 12

Heap-Access Analysis

[Figure: memory bandwidth utilization [%] per benchmark (AES, Bubble Sort, Kfl, Lift, Matrix Mul, N-Queens, Sieve, UdpIp), split into data accesses (core itself), bytecode fetches (method cache) and accesses for memory management. One value is annotated with 11%.]

SLIDE 13

Heap-Access Analysis

Evaluation:
  • Benchmark suite JemBench (Version 2.0) [6]
  • SHAP Multi-Core with 1 core and trace unit [1]
  • Recording of executed bytecodes and memory accesses

Results:
  • 1. The most frequent object accesses are reads on arrays¹ and member variables.
  • 2. 80% of all object accesses concentrate on 6 objects.
  • 3. Frequent accesses to the first user-specific object offsets (-2 and 1)

Further evaluation: small fully associative cache for each core

¹ Already accounts for implicit reads of the array length.
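A figure such as "80% of all object accesses concentrate on 6 objects" can be derived from a recorded access trace roughly as follows. This is a minimal sketch: the trace format (a list of `(handle, offset)` pairs) and the function name `objects_covering` are assumptions made for illustration, not the format of the SHAP trace unit.

```python
from collections import Counter

def objects_covering(trace, fraction=0.8):
    """Return how many of the most-accessed objects are needed to cover
    at least `fraction` of all accesses in `trace`.

    `trace` is an iterable of (handle, offset) pairs (hypothetical format).
    """
    counts = Counter(handle for handle, _offset in trace)
    total = sum(counts.values())
    covered = 0
    # Walk the objects from most to least frequently accessed.
    for n, (_handle, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return n
    return 0  # empty trace
```

For a trace with 6, 3 and 1 accesses to three different objects, two objects already cover 80% of the accesses.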

SLIDE 14

Small Fully Associative Local Cache

Storing only invariant data:
  • Would require no extra logic for cache coherency.
  • In general, only the class information pointer and the array size are invariant.
  • Significant reduction only for array-intensive programs:
      • BubbleSort: 33% of all memory accesses
      • Sieve: 20%
      • MatrixMul: 17%

Write-through instead of write-back:
  • No special GC interaction required.
  • Simple cache coherence: invalidate the cache when
      • entering a critical section, or
      • accessing a "volatile" variable.
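The combination of write-through and invalidation at synchronization points can be sketched in a few lines of Python. The class `WriteThroughCache` and its unbounded capacity are simplifications for illustration only; they are not taken from the slides.

```python
class WriteThroughCache:
    """Per-core cache in front of a shared memory (a dict: address -> word)."""

    def __init__(self, memory):
        self.memory = memory  # shared by all cores
        self.lines = {}       # local copies; capacity is not modeled here

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from shared memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, word):
        self.memory[addr] = word            # write-through: memory stays current
        self.lines[addr] = word

    def invalidate(self):
        """To be called when entering a critical section or accessing a
        volatile variable, so that later reads see other cores' writes."""
        self.lines.clear()
```

Because memory is always current, a core never holds dirty data, so invalidation is a plain clear and needs no write-back. Between synchronization points a core may still read a stale copy, which the Java memory model permits.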

SLIDE 15

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

SLIDE 16

Cache Design (1)

Cache integration: into the memory manager port

Core modifications:
  • Bytecode to access volatile variables
  • Microcode to invalidate the cache

[Figure: SHAP Multi-Core with AOC (excerpt). Core 0 with method cache and stack; AOC at the data port; memory manager with garbage collector; memory controller for SRAM or DDR-RAM (DDR: 16, SDR: 32); Wishbone bus.]

SLIDE 17

Cache Design (2)

Features:
  • Address and Offset Cache (AOC) with write-through and LRU strategy
  • 1 valid bit per cached word
  • Configurable: number of cache lines, cached offsets

Disadvantage:
  • Additional latency of 1 clock cycle during a cache miss

[Figure: Address and Offset Cache (AOC). The core supplies an object handle and an offset. Each cache line holds the object handle as tag, the physical address (address memory), and a valid bit plus data word per cached offset (valid & data memory, shown for offsets -2 and 1). The object table and the heap reside in external memory.]
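The behavior described on this slide can be approximated by the following Python model. It is a sketch under assumptions: the class `AddressOffsetCache`, the dictionaries used for object table and heap, and the access counter are hypothetical, and timing (such as the extra miss cycle) is not modeled.

```python
from collections import OrderedDict

class AddressOffsetCache:
    """Toy address-and-offset cache: LRU lines tagged by object handle."""

    def __init__(self, object_table, heap, lines=8, offsets=(-2, 1)):
        self.object_table = object_table   # handle -> physical base address
        self.heap = heap                   # physical address -> data word
        self.capacity = lines
        self.offsets = frozenset(offsets)  # offsets whose data may be cached
        self.lines = OrderedDict()         # handle -> (base, {offset: word})
        self.memory_accesses = 0

    def _lookup_line(self, handle):
        if handle in self.lines:
            self.lines.move_to_end(handle)      # mark as most recently used
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
            self.memory_accesses += 1           # object-table read
            self.lines[handle] = (self.object_table[handle], {})
        return self.lines[handle]

    def read(self, handle, offset):
        base, data = self._lookup_line(handle)
        if offset in data:                      # valid bit set: no memory access
            return data[offset]
        self.memory_accesses += 1               # heap read
        word = self.heap[base + offset]
        if offset in self.offsets:
            data[offset] = word                 # fill word and set valid bit
        return word

    def write(self, handle, offset, word):
        base, data = self._lookup_line(handle)
        self.memory_accesses += 1               # write-through to the heap
        self.heap[base + offset] = word
        if offset in self.offsets:
            data[offset] = word

    def invalidate(self):
        self.lines.clear()
```

A presence check in the per-line dictionary plays the role of the valid bit. An "address only" configuration corresponds to `offsets=()`, where only the handle-to-address translation is cached.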

SLIDE 18

Cache Configuration

Huge configuration space:
  • Cache lines: $l \in \mathbb{N},\ l > 0$
  • Offsets: none (address only), or a range $[-x, y]$ with $x, y \in \mathbb{N},\ x > 0$
  • But synthesis for $n = 1, 2, \ldots, 18$ cores is too expensive.
  • Search for a good initial configuration.

Configuration space exploration:
  • Baseline design for comparison: TLB with 2 entries
  • Benchmarks:
      • JemBench
      • JavaGrande Framework [7]: HeapSort, SparseMatMult (with integers)
  • Platform:
      • SHAP on Virtex-5 FPGA XC5VLX110T
      • Same clock frequency of 80 MHz as the baseline design

SLIDE 19

Setup: 1 Core, Cache Address Only

[Figure: relative performance over cache configuration (baseline design, 2, 4, 8, 16, 32 and 64 lines) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

  • Use 8 lines

SLIDE 20

Setup: 1 Core, 8 Lines

[Figure: relative performance over cache configuration (address only, offsets [-1, 0], [-2, 1], [-4, 3], [-8, 7], [-16, 15], [-8, 23], [-32, 31], [-16, 47], [-8, 55]) for SparseMatmultInt, MatrixMul (N=20), Lift, AES, UdpIp, NQueens, Kfl, HS, Sieve and BS.]

  • Cache Offsets -2 & 1

SLIDE 21

Speed-Up of JemBench MatrixMul

Matrix multiplication: default with 20x20 matrix
Reference value: single-core performance
Ideal speed-up: 10 for n ≥ 10 cores
Speed-up maximum: 8.3 → 9.4 @ 10 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]

SLIDE 22

Speed-Up of JemBench MatrixMul (N=90)

Matrix multiplication: extended to 90x90 matrix
Reference value: single-core performance
Ideal speed-up: 15 for 15 to 17 cores
Speed-up maximum: 9.5 → 14.0 @ 15 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]
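The plateaus in the ideal speed-up (10 for the 20x20 matrix, 15 for the 90x90 matrix at 15 to 17 cores) are consistent with a static distribution of whole matrix rows over the cores. The slides do not state how the benchmark partitions its work, so the following Python sketch rests on that assumption; the function names are hypothetical.

```python
def ideal_speedup(rows, cores):
    """Ideal speed-up if `rows` equal-sized work items are spread over
    `cores` and the core with the most rows determines the run time."""
    rows_on_busiest_core = (rows + cores - 1) // cores  # ceiling division
    return rows / rows_on_busiest_core

def parallel_efficiency(measured_speedup, ideal):
    """Fraction of the ideal speed-up that was actually reached."""
    return measured_speedup / ideal
```

Under this assumption, 20 rows give an ideal speed-up of 10 for every core count from 10 to 18, and 90 rows give 15 for 15 to 17 cores. The reported maxima of 9.4 and 14.0 then correspond to about 94% and 93% of the ideal value.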

SLIDE 23

Speed-Up of JGF SparseMatMult with Integer

Matrix multiplication: reduced to 4,000x4,000 matrix with 16,000 non-zero integer elements
Reference value: single-core performance
Speed-up maximum: 7.6 → 14.7 / 15.5 @ 17 cores

[Figure: speed-up S over processor cores n (2 to 18) for: ideal; AOC with 8 lines, offsets -8 to 7; AOC with 8 lines, offsets -2 and 1; AOC with 8 lines, address only; baseline design.]

SLIDE 24

FPGA Resource Usage

Synthesis results for 8 cache lines, offsets -2 & 1:
  • Up to 17 cores at 80 MHz (same as baseline design) fit on the XC5VLX110T.

Additional hardware resources per core:
  • For up to 15 cores:

    Configuration     Registers   LUTs    RAMB36
    Address only      +8%         +3%*
    Offsets -2 & 1    +9%         +8%*
    Offsets [-8, 7]   +17%        +9%     +1

    * = includes distributed RAM

  • 16 and 17 cores: more than proportional increase due to high FPGA utilization.

SLIDE 25

Outline

1 Motivation
2 Related Work
3 Heap-Access Analysis
4 Implementation & Results
5 Conclusion

SLIDE 26

Conclusion

Baseline design: SHAP Multi-Core with 2-line TLB

Realized object cache:
  • Fully associative combined address and offset cache
  • Configurable in number of lines and cached offsets

Results:
  • With 4 cache lines: amortization of the additional latency
  • With 8 cache lines:
      • Maximum speed-up increased from formerly 8 to 9 to now 14.
      • Single-core performance increased by 12 to 21%.
  • Up to twice the absolute compute performance.
  • Requires only 9% more hardware resources.

SLIDE 27

Thank you for your attention! Questions?

SLIDE 28

Selected Literature

1. Zabel, Martin: Effiziente Mehrkernarchitektur für eingebettete Java-Bytecode-Prozessoren. Dissertation, Technische Universität Dresden, 2012
2. Vijaykrishnan, N.; Ranganathan, N.: Supporting object accesses in a Java processor. In: IEE Proc. Computers and Digital Techniques 147 (2000), no. 6, pp. 435–443. ISSN 1350–2387
3. Wright, G.; Seidl, M. L.; Wolczko, M.: An object-aware memory architecture. In: Sci. Comput. Program. 62 (2006), no. 2, pp. 145–163. ISSN 0167–6423
4. Huber, B.; Puffitsch, W.; Schoeberl, M.: WCET driven design space exploration of an object cache. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10). ACM, 2010, pp. 26–35
5. Lindholm, T.; Yellin, F.: The Java(TM) Virtual Machine Specification. 2nd edition. Amsterdam: Addison-Wesley Longman, 1999. ISBN 978–0201432947
6. Schoeberl, Martin; Preusser, Thomas B.; Uhrig, Sascha: The embedded Java benchmark suite JemBench. In: Proc. 8th Int'l Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES'10), pp. 120–127
7. Java Grande Forum Benchmark Suite. http://www2.epcc.ed.ac.uk/computing/research_activities/java_grande/index_1.html