SLIDE 1

Fakultät Informatik Institut für Technische Informatik, Professur für VLSI-Entwurfssysteme, Diagnostik und Architektur

APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES

JTRES 2010

Martin Zabel, Rainer G. Spallek

Prague, 19.08.2010

SLIDE 2

Itinerary

  • Motivation
  • Application Requirements
  • SHAP Multi-Core Design
  • Performance Evaluation
  • Summary

TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16

SLIDE 3

Motivation

  • Complexity of applications increases:

    – Raise computational throughput.
    – Decrease latency.

  • Previous Approaches: smarter Java bytecode single-cores.

    – Just-In-Time compilation.
    – Instruction-level parallelism: bytecode folding, VLIW packets.
    – Bit-level parallelism.

  • Now: thread-level parallelism exploited by multi-cores.

SLIDE 4

Application Requirements

Address Spaces

  • Code Area:

    – Accessed frequently.
    – Duplication ⇒ Chip-space intensive.
    – Sharing ⇒ Efficient method caching.

  • Shared Java Heap:

    – UMA / NUMA.
    – Fast atomic operations for monitor lock/unlock.
    – Independent locks, otherwise performance is degraded.
    – Memory bus utilization.

  • Shared Peripherals.
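
The requirement for independent locks can be illustrated at the Java source level. The following is a minimal sketch, assuming a hypothetical class `IndependentLocks` with two unrelated counters (none of these names come from the slides). Each counter is guarded by its own monitor object, so threads that update different counters never contend for the same lock. It says nothing about how SHAP implements monitors in hardware.

```java
// Hypothetical illustration; class, field and method names are assumptions.
final class IndependentLocks {
    private final Object lockA = new Object(); // monitor guarding only counterA
    private final Object lockB = new Object(); // monitor guarding only counterB
    private long counterA;
    private long counterB;

    void incrementA() { synchronized (lockA) { counterA++; } }
    void incrementB() { synchronized (lockB) { counterB++; } }
    long getA() { synchronized (lockA) { return counterA; } }
    long getB() { synchronized (lockB) { return counterB; } }

    /** Runs two threads, each updating a different counter, and returns the result. */
    static IndependentLocks runDemo(int iterations) {
        IndependentLocks c = new IndependentLocks();
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) c.incrementA(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) c.incrementB(); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while joining", ex);
        }
        return c;
    }

    public static void main(String[] args) {
        IndependentLocks c = runDemo(100_000);
        System.out.println(c.getA() + " " + c.getB());
    }
}
```

Had both methods synchronized on one shared object, the two threads would serialize even though they never touch the same data.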

SLIDE 5

Application Requirements

Memory Bandwidth

  • SHAP single-core already includes multi-port memory manager:

    – DMA and cache-line filling.
    – Pipelined transactions using outstanding reads.
    – Maximum bandwidth with pipelined memory (ZBT SRAM).

[Timing diagram over clock cycles 1 to 7. Request Data row: Cache, Cache, DMA, DMA. Reply Data row: Cache, Cache, DMA, DMA.]

SLIDE 6

Application Requirements

Memory Bandwidth Utilization

Application    ML505     DE2
Queens          3.09%    6.35%
Lift           10.18%   22.12%
FScriptME       8.46%   17.60%
SMMI           13.26%   30.43%
El-Kharashi    ≈ 10%    ≈ 21%

Utilization: $u = \frac{m_{\mathrm{Total}}}{e_{\mathrm{Total}}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}$

[Timing diagram over clock cycles 1 to 10. ML505 Req.: C0, C1, C2, C3, C0. ML505 Repl.: C0, C1, C2, C3. DE2 Req.: C0, W, C1, W, C2, W, C3, W, C0. DE2 Repl.: C0, C1, C2, C3.]
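
The utilization formula on this slide is a ratio of two weighted sums and can be evaluated as in the sketch below. The slides do not define the symbols, so the reading used here is an assumption: f_b is how often bytecode b is executed, m_b is the number of memory-bus cycles per execution of b, and e_b is the total number of cycles per execution of b. The class name and the three-entry profile in `main` are hypothetical and are not measured data.

```java
// Sketch of u = sum_b(f_b * m_b) / sum_b(f_b * e_b).
// Symbol meanings are assumed (see lead-in); the sample profile is made up.
final class BandwidthUtilization {
    static double utilization(long[] f, long[] m, long[] e) {
        if (f.length != m.length || f.length != e.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long mTotal = 0; // weighted memory-bus cycles
        long eTotal = 0; // weighted execution cycles
        for (int b = 0; b < f.length; b++) {
            mTotal += f[b] * m[b];
            eTotal += f[b] * e[b];
        }
        if (eTotal == 0) {
            throw new IllegalArgumentException("total execution cycles must be positive");
        }
        return (double) mTotal / eTotal;
    }

    public static void main(String[] args) {
        long[] f = {700, 200, 100}; // hypothetical execution counts
        long[] m = {0, 1, 2};       // hypothetical memory cycles per execution
        long[] e = {1, 4, 10};      // hypothetical total cycles per execution
        System.out.println(utilization(f, m, e)); // 400 / 2500
    }
}
```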

SLIDE 7

Application Requirements

Conclusion

⇒ UMA setup is suitable if whole memory subsystem can be operated in a pipelined fashion. ⇒ Bandwidth sufficient for up to 10 cores on ML505.
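
One way to read the second conclusion is as a rough saturation estimate: if a single core occupies a fraction u of the memory bandwidth and the pipelined bus interleaves the cores' requests without loss, roughly ⌊1/u⌋ cores fit before the bus saturates. The sketch below assumes this simplified model, which is not stated on the slides; the method name `maxCores` is hypothetical. Utilization is passed in hundredths of a percent so that the arithmetic stays in integers.

```java
// Simplified saturation model (an assumption, not the authors' derivation):
// cores supported = floor(100% / single-core utilization).
final class BandwidthBudget {
    /** @param utilizationBasisPoints single-core utilization in 1/100 percent, e.g. 1018 for 10.18% */
    static int maxCores(int utilizationBasisPoints) {
        if (utilizationBasisPoints <= 0) {
            throw new IllegalArgumentException("utilization must be positive");
        }
        return 10_000 / utilizationBasisPoints;
    }

    public static void main(String[] args) {
        // Single-core ML505 utilization values taken from the table on slide 6.
        System.out.println("Queens: " + maxCores(309));
        System.out.println("Lift:   " + maxCores(1018));
        System.out.println("SMMI:   " + maxCores(1326));
        System.out.println("10%:    " + maxCores(1000));
    }
}
```

Under this model a typical utilization of 10% gives 10 cores, while a more memory-intensive application such as SMMI would saturate the bus earlier.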

SLIDE 8

SHAP Multi-Core Microarchitecture

  • MMU w. variable port count.
  • Full-duplex memory bus with pipelined transactions.
  • Multi-threaded, real-time capable cores with local on-chip stack and method cache.
  • Exact and fully concurrent non-blocking garbage collector.
  • Native execution of Java bytecode.
  • Fast atomic operations for independent locks.
  • Synthesizable for a variable number of cores.

[Figure: SHAP Multi-Core Architecture. Block labels: Core0, Core1, ..., Core n−1, each with Stack and Method Cache; Memory Manager; Garbage Collector; Memory Controller; Memory (SRAM, DDR-RAM; DDR: 16, SDR: 32); DMA Ctrl; Wishbone Bus; configurable UART, Graphics Unit, Ethernet MAC. Connection labels: Data, Code; 32, 32, 32, 8, 32.]

SLIDE 9

SHAP Implementation on ML505

  • Main evaluation platform: Xilinx ML505 Development Board

    – Virtex-5 XC5VLX50T.
    – 1 MB external ZBT SRAM with 32-bit data bus.
    – 80 MHz clock frequency for up to 9 cores.

  • Setup: 8 KB stack and 4 KB method cache per core. Minimum I/O.
  • Chip-space scales linearly with the number of cores n:

    LUTs(n) ≈ 2794 + 2831 · n
    FFs(n) ≈ 1933 + 1447 · n
    18 kbit BRAMs(n) = 1 + 2 · n
    36 kbit BRAMs(n) = 1 + 3 · n
    Multipliers(n) = 2 + 3 · n

  • Other platforms available: Xilinx Spartan-3/3E Starter Kit, Altera DE2 Development Board.
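
The linear chip-space formulas on this slide can be evaluated directly. The sketch below encodes them in a hypothetical helper class `ChipSpace` and adds a small function that inverts the LUT formula for a given LUT budget. The budget of 28,800 LUTs used in `main` is an assumption about the XC5VLX50T taken from memory of the Virtex-5 family data, not from the slides, and should be checked against the vendor documentation.

```java
// Resource estimates for n cores, using the coefficients from the slide.
// Class and method names are hypothetical.
final class ChipSpace {
    static int luts(int n)        { return 2794 + 2831 * n; }
    static int flipFlops(int n)   { return 1933 + 1447 * n; }
    static int bram18k(int n)     { return 1 + 2 * n; }
    static int bram36k(int n)     { return 1 + 3 * n; }
    static int multipliers(int n) { return 2 + 3 * n; }

    /** Largest n whose estimated LUT count still fits into the given budget. */
    static int maxCoresForLuts(int lutBudget) {
        return Math.max(0, (lutBudget - 2794) / 2831);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 9; n++) {
            System.out.printf("n=%d LUTs=%d FFs=%d BRAM18=%d BRAM36=%d MULT=%d%n",
                    n, luts(n), flipFlops(n), bram18k(n), bram36k(n), multipliers(n));
        }
        // 28,800 LUTs is an assumed device capacity, see the lead-in.
        System.out.println("max cores by LUTs: " + maxCoresForLuts(28_800));
    }
}
```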

SLIDE 10

Relative Speed-Up

  • Measured on: Xilinx ML505 Virtex-5 Development Board.
  • Pipelined ZBT SRAM with 32-bit data bus.

[Figure: speed-up S (1 to 9) versus number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]

SLIDE 11

Area Efficiency

  • Absolute / relative area of LUTs, FFs, BRAMs and multipliers is unknown.
  • Speed-Up in relation to count of BRAMs on ML505.

[Figure: area efficiency AE (axis ticks 0.2 to 1.4) versus number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]
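
The slide defines area efficiency only as speed-up in relation to the BRAM count. One plausible reading, given here as an assumption, is AE(n) = S(n) · A(1) / A(n), where A(n) is the BRAM usage of an n-core design. The sketch below further assumes that the two block sizes are combined by counting a 36 kbit BRAM as two 18 kbit units, using the BRAM formulas from the ML505 implementation slide. The speed-up passed in `main` is a hypothetical value, not a measurement.

```java
// Assumed definition: AE(n) = S(n) * A(1) / A(n), with A(n) in 18 kbit units.
// Both the definition and the unit conversion are assumptions (see lead-in).
final class AreaEfficiency {
    /** BRAM usage in 18 kbit units: (1 + 2n) + 2 * (1 + 3n). */
    static int bramUnits(int n) {
        return (1 + 2 * n) + 2 * (1 + 3 * n);
    }

    static double areaEfficiency(int n, double speedUp) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        return speedUp * bramUnits(1) / bramUnits(n);
    }

    public static void main(String[] args) {
        // Hypothetical case: perfectly linear speed-up on 9 cores.
        System.out.println(areaEfficiency(9, 9.0));
    }
}
```

Because the fixed part of the design (the constant terms of the formulas) is shared by all cores, this ratio can exceed 1 when the speed-up is close to linear.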

SLIDE 12

Comparison Against Related Projects

  • Other Java bytecode multi-cores: JopCMP, jamuth IP multi-core, REALJava.

  • Comparison of absolute performance, etc.
  • Using same platform.

⇒ JopCMP on Altera DE2 Development Board.

SLIDE 13

Comparison Against JopCMP

Absolute Performance on DE2 Board

  • SHAP @ 60 MHz, JopCMP @ 90 MHz
  • Asynchronous SRAM with only 16-bit data bus.

[Figure: executions per second (axis ticks 20000 to 140000) versus number of cores n (1 to 9) for SHAP ML505, SHAP DE2, jopCMP DE2 with 2 KB I-cache and jopCMP DE2 with 1 KB I-cache.]

SLIDE 14

Comparison Against JopCMP

Chip-Space on DE2 Board

  • SHAP implements GC in hardware.
  • SHAP-core requires about 23% more LEs than a JopCMP core.

SLIDE 15

Synchronization Performance

  • Sync. limits speed-up ⇒ Keep as short / rare as possible.
  • Highly application / API dependent.
  • Typical: synchronized blocks.

    – Long blocking periods for field update(s).
    – Only small amount for atomic bus access.

Example: java.util.concurrent.LinkedBlockingQueue.put()
[Diagram of phases with labels: Invoke 46, Prepare 119, A 6, Update 118, 1, R 10.]

  • Alternative: compare and swap.

    – Short blocking period: only atomic bus access.
    – Complex code.
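
The trade-off between the two styles can be seen in a small counter example. This is a minimal sketch with hypothetical names, using the standard `java.util.concurrent.atomic.AtomicInteger` class. It only illustrates the programming model and makes no claim about cycle counts on SHAP. The synchronized variant holds the monitor for the whole update, while the compare-and-swap variant prepares the new value without blocking and needs a retry loop in case another thread interfered.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical comparison of a synchronized update and a compare-and-swap update.
final class CounterStyles {
    private int lockedValue;
    private final AtomicInteger casValue = new AtomicInteger();

    // Typical style: the monitor is held for the entire field update.
    synchronized void incrementLocked() { lockedValue++; }
    synchronized int getLocked() { return lockedValue; }

    // Alternative: only compareAndSet is atomic; a failed attempt is retried.
    void incrementCas() {
        int current;
        do {
            current = casValue.get();
        } while (!casValue.compareAndSet(current, current + 1));
    }
    int getCas() { return casValue.get(); }

    /** Lets several threads update both counters concurrently. */
    static CounterStyles runDemo(int threads, int iterations) {
        CounterStyles c = new CounterStyles();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    c.incrementLocked();
                    c.incrementCas();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", ex);
            }
        }
        return c;
    }

    public static void main(String[] args) {
        CounterStyles c = runDemo(4, 10_000);
        System.out.println(c.getLocked() + " " + c.getCas());
    }
}
```

The retry loop is the "complex code" cost: for a single integer it is short, but for multi-field updates such as a queue insertion it becomes considerably harder to get right.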

SLIDE 16

Summary

  • Multi-core Java bytecode processor w. shared heap.
  • Multi-port MMU w. pipelined transactions.
  • Chip-space scales linearly.
  • Application mix:

    – Typical: 10% memory bandwidth utilization.
    – Almost linear speed-up for up to 9 cores.
    – Area efficiency of ≈ 120% (> 1 core).

  • Better absolute performance than related project JopCMP.

http://shap.inf.tu-dresden.de
