
APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES

  1. Fakultät Informatik, Institut für Technische Informatik, Professur für VLSI-Entwurfssysteme, Diagnostik und Architektur
     APPLICATION REQUIREMENTS AND EFFICIENCY OF EMBEDDED JAVA BYTECODE MULTI-CORES
     JTRES 2010
     Martin Zabel, Rainer G. Spallek
     Prague, 19.08.2010

  2. Itinerary
     • Motivation
     • Application Requirements
     • SHAP Multi-Core Design
     • Performance Evaluation
     • Summary
     TU Dresden, 19.08.2010 Embedded Java Bytecode Multi-Cores slide 2 of 16

  3. Motivation
     • Complexity of applications increases:
       – Raise computational throughput.
       – Decrease latency.
     • Previous approaches: smarter Java bytecode single-cores.
       – Just-In-Time compilation.
       – Instruction-level parallelism: bytecode folding, VLIW packets.
       – Bit-level parallelism.
     • Now: thread-level parallelism exploited by multi-cores.

  4. Application Requirements: Address Spaces
     • Code Area:
       – Accessed frequently.
       – Duplication ⇒ chip-space intensive.
       – Sharing ⇒ efficient method caching.
     • Shared Java Heap:
       – UMA / NUMA.
       – Fast atomic operations for monitor lock/unlock.
       – Independent locks, otherwise performance is degraded.
       – Memory bus utilization.
     • Shared Peripherals.

  5. Application Requirements: Memory Bandwidth
     • SHAP single-core already includes a multi-port memory manager:
       – DMA and cache-line filling.
       – Pipelined transactions using outstanding reads.
       – Maximum bandwidth with pipelined memory (ZBT SRAM).
     [Timing diagram: request data and reply data over clock cycles 0 to 7, with cache and DMA transfers issued back to back.]

  6. Application Requirements: Memory Bandwidth Utilization
     Utilization:

     $u = \frac{m_{\mathrm{Total}}}{e_{\mathrm{Total}}} = \frac{\sum_b f_b \cdot m_b}{\sum_b f_b \cdot e_b}$

     | Application | ML505  | DE2    |
     |-------------|--------|--------|
     | Queens      | 3.09%  | 6.35%  |
     | Lift        | 10.18% | 22.12% |
     | FScriptME   | 8.46%  | 17.60% |
     | SMMI        | 13.26% | 30.43% |
     | El-Kharashi | ≈ 10%  | ≈ 21%  |

     [Timing diagram: request and reply slots of cores C0 to C3 over clock cycles 0 to 10 for ML505 and DE2; the ML505 rows include four slots marked W.]
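
The utilization formula on slide 6 can be evaluated mechanically once per-bytecode figures are available. The following is a minimal sketch, assuming that f_b is the execution count of bytecode b, m_b its memory-bus cycles, and e_b its total execution cycles; the slides do not define these symbols, and the class name and sample numbers are hypothetical.

```java
// Sketch only: the class name and the sample numbers are hypothetical and not
// taken from the slides. The meaning of f, m and e is an assumption.
final class BusUtilization {

    /**
     * Computes u = sum_b(f_b * m_b) / sum_b(f_b * e_b).
     *
     * @param f execution count of each bytecode b (assumed meaning)
     * @param m memory-bus cycles per execution of b (assumed meaning)
     * @param e total execution cycles per execution of b (assumed meaning)
     */
    static double utilization(long[] f, int[] m, int[] e) {
        if (f.length != m.length || f.length != e.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        long memCycles = 0;   // m_Total
        long execCycles = 0;  // e_Total
        for (int b = 0; b < f.length; b++) {
            memCycles += f[b] * m[b];
            execCycles += f[b] * e[b];
        }
        if (execCycles == 0) {
            throw new IllegalArgumentException("no execution cycles");
        }
        return (double) memCycles / execCycles;
    }

    public static void main(String[] args) {
        // Three made-up bytecode classes: no memory access, one access, two accesses.
        long[] f = {700, 200, 100};
        int[] m = {0, 1, 2};
        int[] e = {1, 4, 12};
        System.out.println("u = " + utilization(f, m, e));
    }
}
```

Weighting by f_b means that frequently executed bytecodes dominate the result, which is why the measured values differ per application.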

  7. Application Requirements: Conclusion
     ⇒ UMA setup is suitable if the whole memory subsystem can be operated in a pipelined fashion.
     ⇒ Bandwidth sufficient for up to 10 cores on ML505.
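
One way to read the 10-core figure is as a simple bandwidth bound: if a single core keeps the bus busy for a fraction u of the time, at most floor(1 / u) cores fit before the bus saturates. The sketch below assumes exactly that reading; it ignores arbitration overhead and contention, the slides do not state how the bound was derived, and the class and method names are hypothetical.

```java
// Sketch only: a saturation bound under the assumption that per-core bus
// utilization simply adds up. Names are hypothetical.
final class CoreBound {

    /** Largest n with n * u <= 1, i.e. floor(1 / u). */
    static int maxCores(double utilizationPerCore) {
        if (utilizationPerCore <= 0.0 || utilizationPerCore > 1.0) {
            throw new IllegalArgumentException("utilization must be in (0, 1]");
        }
        // The small epsilon guards against 1.0 / u landing just below an integer.
        return (int) Math.floor(1.0 / utilizationPerCore + 1e-9);
    }

    public static void main(String[] args) {
        // Typical utilization values from slide 6 (El-Kharashi row).
        System.out.println("ML505, u = 10%: " + maxCores(0.10) + " cores");
        System.out.println("DE2,   u = 21%: " + maxCores(0.21) + " cores");
    }
}
```

With the typical 10% utilization on the ML505 this bound gives 10 cores, which agrees with the conclusion on slide 7.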

  8. SHAP Multi-Core Microarchitecture
     • MMU with variable port count.
     • Full-duplex memory bus with pipelined transactions.
     • Multi-threaded, real-time capable cores with local on-chip stack and method cache.
     • Exact and fully concurrent non-blocking garbage collector.
     • Native execution of Java bytecode.
     • Fast atomic operations for independent locks.
     • Synthesizable for a variable number of cores.
     [Block diagram: configurable SHAP multi-core architecture. Labels: Core0, Core1, ..., Core n−1, each with stack and method cache; memory manager with controller and garbage collector; code and data paths; Wishbone bus; UART; graphics unit; Ethernet MAC; DMA controller; memory (SRAM or DDR-RAM); bus widths 32 and 8, DDR: 16, SDR: 32.]

  9. SHAP Implementation on ML505
     • Main evaluation platform: Xilinx ML505 Development Board
       – Virtex-5 XC5VLX50T
       – 1 MB external ZBT SRAM with 32-bit data bus.
       – 80 MHz clock frequency for up to 9 cores.
     • Setup: 8 KB stack and 4 KB method cache per core. Minimum I/O.
     • Chip-space scales linearly with the number of cores n:
       – LUTs(n) ≈ 2794 + 2831 · n
       – FFs(n) ≈ 1933 + 1447 · n
       – 18 kbit BRAMs(n) = 1 + 2 · n
       – 36 kbit BRAMs(n) = 1 + 3 · n
       – Multiplier(n) = 2 + 3 · n
     • Other platforms available: Xilinx Spartan-3/3E Starter Kit, Altera DE2 Development Board.
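
The linear chip-space formulas on slide 9 are easy to tabulate. The sketch below evaluates them for a given core count and inverts the LUT formula for a given LUT budget. The class and method names are hypothetical, and the 28,800-LUT budget in the usage example is an assumption about the XC5VLX50T device, not a figure from the slides.

```java
// Sketch only: evaluates the resource formulas from slide 9.
// Names are hypothetical; the LUT budget in main() is an assumption.
final class ShapChipSpace {

    static int luts(int n)        { return 2794 + 2831 * n; } // approximate fit
    static int flipFlops(int n)   { return 1933 + 1447 * n; } // approximate fit
    static int bram18k(int n)     { return 1 + 2 * n; }
    static int bram36k(int n)     { return 1 + 3 * n; }
    static int multipliers(int n) { return 2 + 3 * n; }

    /** Largest core count whose estimated LUT usage stays within lutBudget. */
    static int maxCoresForLuts(int lutBudget) {
        if (lutBudget < 2794) {
            return 0; // not even the shared part fits
        }
        return (lutBudget - 2794) / 2831; // integer division rounds down
    }

    public static void main(String[] args) {
        int n = 9;
        System.out.println("cores:         " + n);
        System.out.println("LUTs:          " + luts(n));
        System.out.println("FFs:           " + flipFlops(n));
        System.out.println("18 kbit BRAMs: " + bram18k(n));
        System.out.println("36 kbit BRAMs: " + bram36k(n));
        System.out.println("multipliers:   " + multipliers(n));
        // Assumed LUT budget, see the note above the listing.
        System.out.println("max cores for 28800 LUTs: " + maxCoresForLuts(28800));
    }
}
```

Under the assumed budget the LUT formula alone would cap the design at 9 cores, which is consistent with the 9-core configuration on the slide; other limits such as timing closure at 80 MHz are not modelled here.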

  10. Relative Speed-Up
     • Measured on: Xilinx ML505 Virtex-5 Development Board.
     • Pipelined ZBT SRAM with 32-bit data bus.
     [Plot: speedup S (1 to 9) over number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]

  11. Area Efficiency
     • Absolute / relative area of LUTs, FFs, BRAMs and multipliers is unknown.
     • Speed-up in relation to count of BRAMs on ML505.
     [Plot: area efficiency AE (0 to 1.4) over number of cores n (1 to 9) for Queens, Lift, FScriptME and SparseMatmultInt.]
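
Slide 11 relates speed-up to BRAM count but does not spell out the formula. One plausible reading, assumed here and not confirmed by the slides, is AE(n) = S(n) / (B(n) / B(1)), where B(n) is the total number of 18 kbit and 36 kbit BRAMs from slide 9. The sketch below implements that assumed definition; the names and the speed-up value in the usage example are hypothetical.

```java
// Sketch only: an ASSUMED definition of area efficiency, not the authors' formula.
// AE(n) = speedup(n) / (brams(n) / brams(1)).
final class AreaEfficiency {

    /** Total BRAM count per slide 9: (1 + 2n) 18 kbit blocks plus (1 + 3n) 36 kbit blocks. */
    static int brams(int n) {
        return (1 + 2 * n) + (1 + 3 * n);
    }

    /** Speed-up divided by the BRAM growth relative to a single core. */
    static double areaEfficiency(int n, double speedup) {
        if (n < 1) {
            throw new IllegalArgumentException("need at least one core");
        }
        double relativeArea = (double) brams(n) / brams(1);
        return speedup / relativeArea;
    }

    public static void main(String[] args) {
        // Hypothetical speed-up of 8.0 on 9 cores, chosen for illustration only.
        System.out.println("AE(9) = " + areaEfficiency(9, 8.0));
    }
}
```

Because part of the BRAM cost is shared by all cores, the relative area grows more slowly than n, so a near-linear speed-up yields an efficiency above 1 under this definition.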

  12. Comparison Against Related Projects
     • Other Java bytecode multi-cores: JopCMP, jamuth IP multi-core, REALJava.
     • Comparison of absolute performance, etc.
     • Using same platform.
     ⇒ JopCMP on Altera DE2 Development Board.

  13. Comparison Against JopCMP: Absolute Performance on DE2 Board
     • SHAP @ 60 MHz, JopCMP @ 90 MHz
     • Asynchronous SRAM with only 16-bit data bus.
     [Plot: executions per second (0 to 140000) over number of cores n (1 to 9) for SHAP ML505, SHAP DE2, jopCMP DE2 with 2 KB I-cache, and jopCMP DE2 with 1 KB I-cache.]

  14. Comparison Against JopCMP: Chip-Space on DE2 Board
     • SHAP implements GC in hardware.
     • A SHAP core requires about 23% more LEs than a JopCMP core.

  15. Synchronization Performance
     • Synchronization limits speed-up ⇒ keep it as short / rare as possible.
     • Highly application / API dependent.
     • Typical: synchronized blocks.
       – Long blocking periods for field update(s).
       – Only a small share is spent on the atomic bus access.
     • Example: java.util.concurrent.LinkedBlockingQueue.put()
       [Timing figure: segments labelled Prepare, Update, Invoke, A and R with the values 46, 119, 6, 118, 1 and 10.]
     • Alternative: compare and swap.
       – Short blocking period: only atomic bus access.
       – Complex code.
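
The trade-off on slide 15 can be illustrated with a shared counter. The sketch below is generic Java SE code, not SHAP-specific and not taken from the slides: one method holds a monitor for the whole field update, the other retries a compare-and-swap via java.util.concurrent.atomic.AtomicInteger. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: contrasts a synchronized update with a compare-and-swap loop.
// Names are hypothetical; this is plain Java SE, not SHAP-specific code.
final class SyncVsCas {
    private int lockedCounter = 0;
    private final AtomicInteger casCounter = new AtomicInteger();

    // Monitor is held for the whole read-modify-write of the field.
    synchronized void incrementLocked() {
        lockedCounter++;
    }

    synchronized int lockedValue() {
        return lockedCounter;
    }

    // Only the compareAndSet itself is atomic; on conflict the loop retries.
    void incrementCas() {
        int seen;
        do {
            seen = casCounter.get();
        } while (!casCounter.compareAndSet(seen, seen + 1));
    }

    int casValue() {
        return casCounter.get();
    }

    /** Runs both variants from several threads and returns the shared object. */
    static SyncVsCas runDemo(int threads, int perThread) {
        SyncVsCas shared = new SyncVsCas();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int k = 0; k < perThread; k++) {
                    shared.incrementLocked();
                    shared.incrementCas();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", ex);
        }
        return shared;
    }

    public static void main(String[] args) {
        SyncVsCas result = runDemo(4, 10_000);
        System.out.println("locked: " + result.lockedValue() + ", cas: " + result.casValue());
    }
}
```

Both variants produce the same count. The retry loop is the "complex code" cost named on the slide: the caller has to handle a failed swap explicitly instead of relying on the monitor.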

  16. Summary
     • Multi-core Java bytecode processor with shared heap.
     • Multi-port MMU with pipelined transactions.
     • Chip-space scales linearly.
     • Application mix:
       – Typical: 10% memory bandwidth utilization.
       – Almost linear speed-up for up to 9 cores.
       – Area efficiency of ≈ 120% (> 1 core).
     • Better absolute performance than related project JopCMP.
     http://shap.inf.tu-dresden.de
