
SLIDE 1

Fakultät Informatik Institut für Technische Informatik, Professur für VLSI-Entwurfssysteme, Diagnostik und Architektur

APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES

JTRES 2010

Martin Zabel, Rainer G. Spallek

Prague, 19.08.2010

SLIDE 2

Itinerary

Motivation
Application Requirements
SHAP Multi-Core Design
Performance Evaluation
Summary

TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16

SLIDE 3

Motivation

Complexity of applications increases:

– Raise computational throughput.
– Decrease latency.

Previous Approaches: smarter Java bytecode single-cores.

– Just-In-Time compilation.
– Instruction-level parallelism: bytecode folding, VLIW packets.
– Bit-level parallelism.

Now: thread-level parallelism exploited by multi-cores.


SLIDE 4

Application Requirements

Address Spaces

Code Area:

– Accessed frequently.
– Duplication ⇒ Chip-space intensive.
– Sharing ⇒ Efficient method caching.

Shared Java Heap:

– UMA / NUMA.
– Fast atomic operations for monitor lock/unlock.
– Independent locks, otherwise performance is degraded.
– Memory bus utilization.

Shared Peripherals.


SLIDE 5

Application Requirements

Memory Bandwidth

SHAP single-core already includes multi-port memory manager:

– DMA and cache-line filling.
– Pipelined transactions using outstanding reads.
– Maximum bandwidth with pipelined memory (ZBT SRAM).

[Timing diagram: clock cycles 1 to 7. Request row: Data, Cache, Cache, DMA, DMA. Reply row: Data, Cache, Cache, DMA, DMA.]


SLIDE 6

Application Requirements

Memory Bandwidth Utilization

Application    ML505     DE2
Queens         3.09%     6.35%
Lift           10.18%    22.12%
FScriptME      8.46%     17.60%
SMMI           13.26%    30.43%
El-Kharashi    ≈ 10%     ≈ 21%

Utilization: $u = \frac{m_{\mathrm{Total}}}{e_{\mathrm{Total}}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}$

[Timing diagram: clock cycles 1 to 10. ML505 requests: C0, C1, C2, C3, C0. ML505 replies: C0, C1, C2, C3. DE2 requests: C0, W, C1, W, C2, W, C3, W, C0. DE2 replies: C0, C1, C2, C3.]
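The utilization formula can be sketched in a few lines of Java. This is only an illustration: the slides do not define the symbols, so the sketch assumes that f_b is how often bytecode b is executed, m_b is its memory-bus cycle count, and e_b is its total execution cycle count. The class name, method name, and all numbers in `main` are hypothetical and are not measurements from the talk.

```java
/** Sketch of u = sum(f_b * m_b) / sum(f_b * e_b); symbol meanings are assumed. */
class BandwidthUtilization {

    /**
     * @param frequency       assumed f_b: execution count of each bytecode
     * @param memoryCycles    assumed m_b: memory-bus cycles per execution
     * @param executionCycles assumed e_b: total cycles per execution
     */
    static double utilization(long[] frequency, int[] memoryCycles, int[] executionCycles) {
        if (frequency.length != memoryCycles.length || frequency.length != executionCycles.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long memoryTotal = 0;     // m_Total
        long executionTotal = 0;  // e_Total
        for (int b = 0; b < frequency.length; b++) {
            memoryTotal += frequency[b] * memoryCycles[b];
            executionTotal += frequency[b] * executionCycles[b];
        }
        if (executionTotal == 0) {
            throw new IllegalArgumentException("profile has no execution cycles");
        }
        return (double) memoryTotal / executionTotal;
    }

    public static void main(String[] args) {
        // Made-up three-bytecode profile, for illustration only.
        long[] f = {600, 300, 100};
        int[] m = {0, 1, 2};
        int[] e = {1, 4, 20};
        System.out.printf("u = %.2f%%%n", 100.0 * utilization(f, m, e));
    }
}
```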


SLIDE 7

Application Requirements

Conclusion

⇒ UMA setup is suitable if whole memory subsystem can be operated in a pipelined fashion.
⇒ Bandwidth sufficient for up to 10 cores on ML505.
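The 10-core figure is consistent with a simple back-of-the-envelope bound, sketched below. It assumes that each core adds the same single-core utilization to a shared bus and that the bus saturates at 100%. That linear model is an assumption made here for illustration, not a derivation stated on the slides. The class and method names are hypothetical.

```java
/** Rough bound on core count from single-core bus utilization (assumed linear model). */
class CoreBound {

    /** Largest n with n * utilizationPercent <= 100. */
    static int maxCores(double utilizationPercent) {
        if (utilizationPercent <= 0.0 || utilizationPercent > 100.0) {
            throw new IllegalArgumentException("utilization must be in (0, 100]");
        }
        return (int) Math.floor(100.0 / utilizationPercent);
    }

    public static void main(String[] args) {
        // Single-core utilization on ML505, taken from the table on slide 6.
        String[] apps = {"Queens", "Lift", "FScriptME", "SMMI", "El-Kharashi (approx.)"};
        double[] ml505 = {3.09, 10.18, 8.46, 13.26, 10.0};
        for (int i = 0; i < apps.length; i++) {
            System.out.println(apps[i] + ": at most " + maxCores(ml505[i]) + " cores under this model");
        }
    }
}
```

Under this model the bound depends strongly on the application, so it is best read as an order-of-magnitude check rather than a prediction of measured speed-up.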


SLIDE 8

SHAP Multi-Core Microarchitecture

MMU w. variable port count.
Full-duplex memory bus with pipelined transactions.
Multi-threaded, real-time capable cores with local on-chip stack and method cache.
Exact and fully concurrent non-blocking garbage collector.
Native execution of Java bytecode.
Fast atomic operations for independent locks.
Synthesizable for a variable number of cores.

[Figure: SHAP Multi-Core Architecture. Block diagram showing Core 0, Core 1, ..., Core n−1 (each with stack and method cache), memory manager, garbage collector, memory controller (SRAM or DDR-RAM; labels DDR: 16, SDR: 32), DMA controller, and a Wishbone bus with configurable UART, graphics unit, and Ethernet MAC. Bus-width labels as extracted: 32, 32, 32, 8, 32.]


SLIDE 9

SHAP Implementation on ML505

Main evaluation platform: Xilinx ML505 Development Board

– Virtex-5 XC5VLX50T
– 1 MB external ZBT SRAM with 32-bit data bus.
– 80 MHz clock frequency for up to 9 cores.

Setup: 8 KB stack and 4 KB method cache per core. Minimum I/O.
Chip-space scales linearly with the number of cores n:

LUTs(n) ≈ 2794 + 2831 · n
FFs(n) ≈ 1933 + 1447 · n
18 kbit BRAMs(n) = 1 + 2 · n
36 kbit BRAMs(n) = 1 + 3 · n
Multipliers(n) = 2 + 3 · n
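The linear resource model is easy to evaluate for any core count. The sketch below only encodes the coefficients given on this slide. The class and method names are hypothetical, and the LUT and FF values are approximations, as on the slide.

```java
/** Evaluates the per-resource linear scaling formulas from slide 9 (ML505 setup). */
class ShapResources {

    static long luts(int n)        { return 2794L + 2831L * n; } // approximate
    static long ffs(int n)         { return 1933L + 1447L * n; } // approximate
    static int  bram18k(int n)     { return 1 + 2 * n; }
    static int  bram36k(int n)     { return 1 + 3 * n; }
    static int  multipliers(int n) { return 2 + 3 * n; }

    public static void main(String[] args) {
        System.out.println("n  LUTs   FFs    BRAM18  BRAM36  Mult");
        for (int n = 1; n <= 9; n++) {
            System.out.printf("%d  %-6d %-6d %-7d %-7d %d%n",
                    n, luts(n), ffs(n), bram18k(n), bram36k(n), multipliers(n));
        }
    }
}
```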

Other platforms available: Xilinx Spartan-3/3E Starter Kit, Altera DE2 Development Board.


SLIDE 10

Relative Speed-Up

Measured on: Xilinx ML505 Virtex-5 Development Board.
Pipelined ZBT SRAM with 32-bit data bus.

[Plot: speedup S versus number of cores n (1 to 9) for Queens, Lift, FScriptME, and SparseMatmultInt.]


SLIDE 11

Area Efficiency

Absolute / relative area of LUTs, FFs, BRAMs and multipliers is unknown.
Speed-Up in relation to count of BRAMs on ML505.

[Plot: area efficiency AE (axis from 0.2 to 1.4) versus number of cores n (1 to 9) for Queens, Lift, FScriptME, and SparseMatmultInt.]
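One plausible reading of "speed-up in relation to count of BRAMs" is AE(n) = S(n) / (B(n) / B(1)), where B(n) is the BRAM count for n cores. The sketch below follows that reading. The slides do not give the exact definition, so the formula is an assumption. So is the way the two BRAM sizes are combined: each 36 kbit BRAM is counted here as two 18 kbit units. The speed-up value in `main` is a placeholder, not a measured result.

```java
/** Assumed area-efficiency metric: speed-up divided by relative BRAM usage. */
class AreaEfficiency {

    /** BRAM usage in 18 kbit units; counting a 36 kbit BRAM as two units is an assumption. */
    static int bramUnits18k(int n) {
        int bram18 = 1 + 2 * n; // 18 kbit BRAMs(n), from slide 9
        int bram36 = 1 + 3 * n; // 36 kbit BRAMs(n), from slide 9
        return bram18 + 2 * bram36;
    }

    /** BRAM usage of n cores relative to the single-core design. */
    static double relativeArea(int n) {
        return (double) bramUnits18k(n) / bramUnits18k(1);
    }

    /** Assumed definition: AE(n) = S(n) / relativeArea(n). */
    static double areaEfficiency(int n, double speedup) {
        return speedup / relativeArea(n);
    }

    public static void main(String[] args) {
        int n = 9;
        double hypotheticalSpeedup = 8.0; // placeholder, not taken from the talk
        System.out.printf("relative area(%d) = %.3f, AE = %.3f%n",
                n, relativeArea(n), areaEfficiency(n, hypotheticalSpeedup));
    }
}
```

Because the shared base logic is paid for only once, relative area grows more slowly than n, which is how an area efficiency above 1 can arise even with sub-linear speed-up.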


SLIDE 12

Comparison Against Related Projects

Other Java bytecode multi-cores: JopCMP, jamuth IP multi-core, REALJava.

Comparison of absolute performance, etc.
Using same platform.

⇒ JopCMP on Altera DE2 Development Board.


SLIDE 13

Comparison Against JopCMP

Absolute Performance on DE2 Board

SHAP @ 60 MHz, JopCMP @ 90 MHz
Asynchronous SRAM with only 16-bit data bus.

[Plots: executions per second (axis from 20000 to 140000) versus number of cores n (1 to 9). Series: SHAP ML505, SHAP DE2, jopCMP DE2 with 2 KB I-Cache, jopCMP DE2 with 1 KB I-Cache.]


SLIDE 14

Comparison Against JopCMP

Chip-Space on DE2 Board

SHAP implements GC in hardware.
SHAP-core requires about 23% more LEs than a JopCMP core.


SLIDE 15

Synchronization Performance

Sync. limits speed-up ⇒ Keep as short / rare as possible.
Highly application / API dependent.
Typical: synchronized blocks.

– Long blocking periods for field update(s).
– Only small amount for atomic bus access.

Example: java.util.concurrent.LinkedBlockingQueue.put()
[Timing figure, phase labels and values as extracted: Invoke 46, Prepare 119, A 6, Update 118, 1, R 10.]

Alternative: compare and swap.

– Short blocking period: only atomic bus access.
– Complex code.
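The trade-off between the two styles can be illustrated with a shared counter. The sketch below uses the standard `java.util.concurrent.atomic.AtomicInteger` class. Whether a given embedded runtime such as SHAP provides it is an assumption here. The class names are hypothetical, and the example shows code structure only, not cycle counts.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Two ways to increment a shared counter: monitor lock versus compare-and-swap. */
class CounterStyles {

    /** Synchronized block style: the lock is held for the whole field update. */
    static final class LockedCounter {
        private int value;
        synchronized int increment() { return ++value; }
        synchronized int get() { return value; }
    }

    /** CAS style: only the compareAndSet itself is atomic; a failed attempt is retried. */
    static final class CasCounter {
        private final AtomicInteger value = new AtomicInteger();
        int increment() {
            while (true) {
                int current = value.get();
                int next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
                // Another thread changed the value in between: loop and retry.
            }
        }
        int get() { return value.get(); }
    }

    /** Runs both counters under contention and returns {lockedTotal, casTotal}. */
    static int[] runBoth(int threads, int incrementsPerThread) {
        LockedCounter locked = new LockedCounter();
        CasCounter cas = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    locked.increment();
                    cas.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", ex);
        }
        return new int[] {locked.get(), cas.get()};
    }

    public static void main(String[] args) {
        int[] totals = runBoth(4, 10_000);
        System.out.println("locked = " + totals[0] + ", cas = " + totals[1]);
    }
}
```

The retry loop is the "complex code" cost: it is trivial for a single integer but grows quickly for multi-field structures such as a linked queue.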


SLIDE 16

Summary

Multi-core Java bytecode processor w. shared heap.
Multi-port MMU w. pipelined transactions.
Chip-space scales linearly.
Application mix:

– Typical: 10% memory bandwidth utilization.
– Almost linear speed-up for up to 9 cores.
– Area efficiency of ≈ 120% (> 1 core).

Better absolute performance than related project JopCMP.

http://shap.inf.tu-dresden.de
