Fakultät Informatik Institut für Technische Informatik, Professur für VLSI-Entwurfssysteme, Diagnostik und Architektur
APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES
JTRES 2010
Martin Zabel, Rainer G. Spallek
TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16
– Raise computational throughput.
– Decrease latency.
– Just-In-Time compilation.
– Instruction-level parallelism: bytecode folding, VLIW packets.
– Bit-level parallelism.
– Accessed frequently.
– Duplication ⇒ Chip-space intensive.
– Sharing ⇒ Efficient method caching.
– UMA / NUMA.
– Fast atomic operations for monitor lock/unlock.
– Independent locks, otherwise performance is degraded.
– Memory bus utilization.
– DMA and cache-line filling.
– Pipelined transactions using outstanding reads.
– Maximum bandwidth with pipelined memory (ZBT SRAM).
[Timing diagram over clock cycles 1 to 7. Request Data: Cache, Cache, DMA, DMA. Reply Data: Cache, Cache, DMA, DMA.]
Application    ML505     DE2
Queens         3.09%     6.35%
Lift           10.18%    22.12%
FScriptME      8.46%     17.60%
SMMI           13.26%    30.43%
El-Kharashi    ≈ 10%     ≈ 21%

Utilization: $u = \frac{m_{Total}}{e_{Total}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}$

[Timing diagram over clock cycles 1 to 10. ML505 Req.: C0 C1 C2 C3 C0. ML505 Repl.: C0 C1 C2 C3. DE2 Req.: C0 W C1 W C2 W C3 W C0. DE2 Repl.: C0 C1 C2 C3.]
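The utilization formula can be sketched in a few lines of Java. The slide does not define its symbols, so the reading used here is an assumption: f_b is how often bytecode b is executed, m_b is the number of cycles in which b occupies the memory bus, and e_b is the number of cycles b takes to execute. The class name, method name and the three-entry profile are hypothetical and are not measured data.

```java
final class BusUtilization {
    /**
     * Computes u = sum_b(f[b] * m[b]) / sum_b(f[b] * e[b]).
     * Assumed meaning: f = execution count, m = memory-bus cycles,
     * e = execution cycles, each indexed by bytecode b.
     */
    static double utilization(long[] f, long[] m, long[] e) {
        if (f.length != m.length || f.length != e.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long mTotal = 0;
        long eTotal = 0;
        for (int b = 0; b < f.length; b++) {
            mTotal += f[b] * m[b];
            eTotal += f[b] * e[b];
        }
        if (eTotal == 0) {
            throw new IllegalArgumentException("total execution cycles must be non-zero");
        }
        return (double) mTotal / eTotal;
    }

    public static void main(String[] args) {
        // Hypothetical profile of three bytecodes, chosen only for illustration.
        long[] f = {1000, 200, 50};
        long[] m = {0, 1, 4};
        long[] e = {1, 3, 8};
        System.out.println("u = " + utilization(f, m, e));
    }
}
```

With this profile the weighted memory cycles sum to 400 and the weighted execution cycles to 2000, so the sketch yields a utilization of 0.2.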
⇒ UMA setup is suitable if whole memory subsystem can be operated in a pipelined fashion.
⇒ Bandwidth sufficient for up to 10 cores on ML505.
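A rough way to relate per-core utilization to a core count is sketched below. The slides do not say how the figure of 10 cores was derived; the sketch assumes that utilization adds up linearly across cores and that the shared bus saturates at 100%, which ignores contention effects. The class and method names are hypothetical.

```java
final class BusBudget {
    /**
     * Returns how many cores fit into the bus bandwidth under the linear
     * assumption. Utilization is passed in hundredths of a percent
     * (10.18% -> 1018) so that the division is exact integer arithmetic.
     */
    static int maxCores(int perCoreUtilization) {
        if (perCoreUtilization <= 0) {
            throw new IllegalArgumentException("utilization must be positive");
        }
        return 10_000 / perCoreUtilization;
    }

    public static void main(String[] args) {
        // Rounded per-core figures from the table: about 10% on ML505, about 21% on DE2.
        System.out.println("ML505, approx. 10% per core: " + maxCores(1000) + " cores");
        System.out.println("DE2, approx. 21% per core: " + maxCores(2100) + " cores");
    }
}
```

Under these assumptions, a per-core utilization of about 10% leaves room for 10 cores, which matches the ML505 statement above.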
pipelined transactions.
cores with local on-chip stack and method cache.
non-blocking garbage collector.
independent locks.
[Figure: SHAP Multi-Core Architecture. Labeled blocks: Core 0, Core 1, ..., Core n−1, each with Stack and Method Cache; Memory Manager with Garbage Collector; Memory Controller (SRAM or DDR-RAM; DDR: 16, SDR: 32); DMA Ctrl; Wishbone Bus with configurable UART, Graphics Unit and Ethernet MAC. Bus width labels: 32, 32, 32, 8, 32.]
– Virtex-5 XC5VLX50T
– 1 MB external ZBT SRAM with 32-bit data bus.
– 80 MHz clock frequency for up to 9 cores.
LUTs(n) ≈ 2794 + 2831 · n
FFs(n) ≈ 1933 + 1447 · n
18 kbit BRAMs(n) = 1 + 2 · n
36 kbit BRAMs(n) = 1 + 3 · n
Multiplier(n) = 2 + 3 · n
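The resource formulas above are linear in the number of cores n, so they are easy to tabulate. The following sketch evaluates them for 1 to 9 cores; the class and method names are hypothetical, and the LUT and flip-flop figures are approximations as stated on the slide.

```java
final class ShapResourceModel {
    // Constant terms cover the shared infrastructure, linear terms the per-core cost.
    static long luts(int n)        { return 2794 + 2831L * n; }  // approximate
    static long flipFlops(int n)   { return 1933 + 1447L * n; }  // approximate
    static int  bram18k(int n)     { return 1 + 2 * n; }
    static int  bram36k(int n)     { return 1 + 3 * n; }
    static int  multipliers(int n) { return 2 + 3 * n; }

    public static void main(String[] args) {
        System.out.println("n  LUTs  FFs  BRAM18  BRAM36  Mult");
        for (int n = 1; n <= 9; n++) {
            System.out.println(n + "  " + luts(n) + "  " + flipFlops(n) + "  "
                    + bram18k(n) + "  " + bram36k(n) + "  " + multipliers(n));
        }
    }
}
```

For n = 9 the model gives roughly 28273 LUTs and 14956 flip-flops, together with 19 BRAMs of 18 kbit, 28 BRAMs of 36 kbit and 29 multipliers.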
Altera DE2 Development Board.
[Figure: Speedup S (1 to 9) over Number of Cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]
[Figure: Area Efficiency AE (axis 0.2 to 1.4) over Number of Cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]
, jamuth IP multi-core, REALJava.
⇒ JopCMP on Altera DE2 Development Board.
[Figure, two panels: Executions / s (axis 20000 to 140000) over Number of Cores n (1 to 9) for SHAP ML505, SHAP DE2, jopCMP DE2 with 2 KB I-Cache and jopCMP DE2 with 1 KB I-Cache.]
– Long blocking periods for field update(s).
– Only small amount for atomic bus access.
Example: java.util.concurrent.LinkedBlockingQueue.put()
[Timeline: Invoke 46, Prepare 119, A 6, Update 118, 1, R 10]
– Short blocking period: only atomic bus access.
– Complex code.
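The trade-off in the two bullet groups above can be illustrated with a small Java sketch. The slides do not name the second approach; reading it as compare-and-swap style code is an assumption. The first counter holds a monitor for the whole field update, while the second only blocks others during a single atomic operation but needs a retry loop. All class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LockingStyles {
    interface Counter {
        int increment();
        int get();
    }

    /** Monitor-based: the lock is held for the entire field update. */
    static final class LockedCounter implements Counter {
        private int value;
        public synchronized int increment() { return ++value; }
        public synchronized int get() { return value; }
    }

    /** CAS-based: only the compareAndSet itself is atomic; the update is retried on conflict. */
    static final class CasCounter implements Counter {
        private final AtomicInteger value = new AtomicInteger();
        public int increment() {
            while (true) {
                int current = value.get();
                int next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
            }
        }
        public int get() { return value.get(); }
    }

    /** Runs the given number of threads, each incrementing perThread times, and returns the final value. */
    static int hammer(Counter counter, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(ex);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("locked: " + hammer(new LockedCounter(), 4, 1000));
        System.out.println("cas:    " + hammer(new CasCounter(), 4, 1000));
    }
}
```

Both variants produce the same final count; the difference lies in how long other threads can be blocked and in how much care the retry logic needs once more than one field is involved.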
– Typical: 10% memory bandwidth utilization.
– Almost linear speed-up for up to 9 cores.
– Area efficiency of ≈ 120% (> 1 core).
http://shap.inf.tu-dresden.de