Investigating Hardware Micro-Instruction Folding in a Java Embedded Processor

SLIDE 1

Introduction Folding BlueJEP Implementation and Experiments Discussion Conclusion

Investigating Hardware Micro-Instruction Folding in a Java Embedded Processor

Flavius Gruian, Lund University, Sweden (flavius.gruian@cs.lth.se)

Mark Westmijze, University of Twente, The Netherlands (m.westmijze@student.utwente.nl)

Java Technologies for Real-time and Embedded Systems, 2010

SLIDE 2

Outline

  1. Introduction
  2. Folding BlueJEP
  3. Implementation and Experiments
  4. Discussion
  5. Conclusion

SLIDE 4

Goal

What are we trying to do?

Implement bytecode folding on an existing Java embedded processor and evaluate the results with respect to:

  • theoretical estimates
  • absolute speed-up
  • performance w.r.t. device area

Finally... Is it worth it?

SLIDE 6

Starting Point

Original Processor Architecture

BlueJEP: BlueSpec System Verilog Java Embedded Processor, a redesign of JOP [M. Schöberl]

  • micro-programmed, stack machine core
  • predictable rather than high-performance (RT systems)
  • JOP micro-instruction set (for ease of programming)
  • specified in BSV [see JTRES 2007]

SLIDE 7

BlueJEP Architecture

Six Stages Micro-Programmed Pipeline

[Figure: the BlueJEP six-stage pipeline. Stage 1: Fetch Bytecode; Stage 2: Fetch micro-I; Stage 3: Decode & Fetch Register; Stage 4: Fetch Stack; Stage 5: Execute; Stage 6: Write-back. FIFOs shown between stages: bcfifo, decfifo, fsfifo, exfifo, wbfifo. Other labels in the figure: BC-Cache, CacheCtl, BC2 microA jump table, micro-ROM, Stack, Registers, SP, VP, MD, MrA, MwA, PC, jpc, OPD, const, load cache, MMU, bus interface (OPB), bypass, forward, rollback, access registers.]

SLIDE 10

Folding Theory

Bytecode Folding Theory

Stack machine (JVM) code can be expressed more compactly on the multi-address machines that emulate it.

stack code (≈7 bytes)    3-address code (≈4 bytes)
iload a                  add a, b, c
iload b
iadd
istore c

Folding pattern length depends on the available resources (ALUs, memory ports).

Bytecodes are grouped in classes by resource access, e.g.:

  P  producer: pushes a value onto the stack
  C  consumer: pops a value from the stack
  O  operation: uses the top two values and pushes back a result
  S  special: not foldable (breaks a pattern)
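The classification and the byte count of the example can be reproduced with a small sketch. The table below is a hypothetical excerpt, not the authors' full classification: it covers only a handful of JVM bytecodes, and the sizes assume the plain encodings (opcode plus a one-byte local-variable index for `iload` and `istore`).

```python
# Hypothetical excerpt of a bytecode table: name -> (folding class, size in bytes).
# Classes follow the P/C/O/S scheme above; the selection of bytecodes is illustrative.
BYTECODES = {
    "iload": ("P", 2),          # opcode + local-variable index, pushes a value
    "istore": ("C", 2),         # opcode + local-variable index, pops a value
    "iadd": ("O", 1),           # pops two values, pushes the sum
    "isub": ("O", 1),           # pops two values, pushes the difference
    "invokevirtual": ("S", 3),  # opcode + two-byte constant-pool index, not foldable
}


def classify(sequence):
    """Return the class string (e.g. 'PPOC') for a list of bytecode names."""
    return "".join(BYTECODES[name][0] for name in sequence)


def stack_code_size(sequence):
    """Return the total encoded size in bytes of a list of bytecode names."""
    return sum(BYTECODES[name][1] for name in sequence)


example = ["iload", "iload", "iadd", "istore"]
pattern = classify(example)      # "PPOC"
size = stack_code_size(example)  # 7
```

The four bytecodes of the example map to the class string PPOC and occupy 7 bytes, against roughly 4 bytes (one opcode and three operands) for the equivalent 3-address instruction.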

SLIDE 11

Folding Scheme

Adopted Folding Scheme

  • fixed folding pattern approach [picoJava-II]
  • micro-instruction level (rather than bytecode level)
  • maximum length of four micro-instructions (at most four single-instruction bytecodes)

Folding Pattern    Length
ppoc               4
poc                3
ppc                3
pc                 2
oc                 2
po                 2
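A fixed-pattern scheme of this kind can be sketched in software as a matcher over a string of micro-instruction classes. The sketch assumes a greedy, longest-pattern-first policy at each position; the slides do not describe the decode logic in that detail, so the policy and the function name `fold` are illustrative only.

```python
# The six fixed patterns from the table above (p = producer, o = operation,
# c = consumer). Anything else, such as 's' (special), is issued on its own.
PATTERNS = ["ppoc", "poc", "ppc", "pc", "oc", "po"]


def fold(classes, patterns=PATTERNS):
    """Split a class string into issue groups, longest matching pattern first.

    This greedy policy is an assumption made for illustration.
    """
    by_length = sorted(patterns, key=len, reverse=True)
    groups = []
    i = 0
    while i < len(classes):
        for pattern in by_length:
            if classes.startswith(pattern, i):
                groups.append(pattern)
                i += len(pattern)
                break
        else:
            # No pattern starts here: issue the micro-instruction unfolded.
            groups.append(classes[i])
            i += 1
    return groups


groups = fold("ppocspc")  # ["ppoc", "s", "pc"]
```

Here seven micro-instructions are issued as three groups; the special instruction breaks the pattern and is issued alone.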

SLIDE 13

Pre-design Estimates

How much is the number of executed clock cycles reduced?

Processed cycle-accurate simulation traces say:

  • ≈ 30% fewer cycles for 0-delay memory
  • ≈ 25% fewer cycles for realistic memory
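As a rough guide to what these estimates would mean if the clock frequency stayed unchanged (an assumption the estimates themselves do not make), a cycle reduction translates into a speed-up as follows. The function name is hypothetical.

```python
def speedup_from_cycle_reduction(fraction_fewer_cycles, relative_clock=1.0):
    """Speed-up implied by executing fewer cycles at a given relative clock.

    Execution time is proportional to cycles / frequency, so the speed-up is
    relative_clock / relative_cycles.
    """
    relative_cycles = 1.0 - fraction_fewer_cycles
    return relative_clock / relative_cycles


best_case = speedup_from_cycle_reduction(0.30)  # 0-delay memory, about 1.43
realistic = speedup_from_cycle_reduction(0.25)  # realistic memory, about 1.33
```

These figures are upper bounds for the estimate: any drop in the achievable clock frequency enters through `relative_clock` and reduces the speed-up proportionally.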

SLIDE 14

Design

Architectural Changes

Increase fetch parallelism to allow folding:

  • wider fetch-bytecode stage: up to four bytecodes must be available simultaneously.
  • multiple bytecode FIFOs: to feed the next stage with sequences of bytecodes.
  • wider fetch-instruction stage: up to four different micro-addresses must be read simultaneously.
  • multiple micro-instruction FIFOs: to provide patterns to the decode stage.
  • folding schemes in the decode stage: to identify and handle foldable patterns.

SLIDE 16

Configurability

Highly configurable architecture:

  1. bytecode bandwidth (1, 2, 4)
  2. micro-instruction bandwidth (1, 2, 4)
  3. foldable patterns

[Figure: Handling 2 bytecodes, 4 micro-instructions simultaneously. Shows Stage 1 (Fetch Bytecode), Stage 2 (Fetch micro-I) and Stage 3 (Decode & Fetch Register), with the BC2 microA jump table and micro-ROM, connected through multiple bcfifos and decfifos.]

SLIDE 18

Setup and Tools

Synthesis → device area, maximum clock frequency

  • FPGA, Xilinx Virtex-5 (XC5VLX30-3)
  • BSV compiler 2006.11, BSV → Verilog
  • Xilinx EDK 9.1i, Verilog + IPs → System
  • Xilinx ISE 9.1i, System → FPGA
  • Chipscope, to calibrate simulation

Simulation → executed clock cycles

  • Desktop, Linux
  • BSV compiler 2006.11, BSV → Executable
  • custom tools for parsing the output from instrumented code

SLIDE 19

Results

Original vs. Folding Configurations (2,2; 2,4)

[Figure: chart on a relative scale (axis ticks from 0.5 to 2.5) comparing the original processor with the folding configurations on relative clock cycles, relative clock frequency, relative device area, relative performance, and relative performance per area unit.]

SLIDE 20

Original vs. Folding Configurations (4,4)

[Figure: chart on a relative scale (axis ticks from 0.5 to 2.5) comparing the original processor with the (4,4) folding configurations on relative clock cycles, relative clock frequency, relative device area, relative performance, and relative performance per area unit.]

SLIDE 22

Discussion

Introducing folding and more patterns:

  + reduces the executed clock cycles (as in theory), but
  - ... greatly reduces the maximum clock frequency
  - ... and also greatly increases the required device area

Relative performance per area unit gets as low as 1/4 for some designs with maximal folding! Adding more simple processors instead of using folding would be more efficient.
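The way the three measured quantities combine can be sketched as below. It assumes the usual relation that execution time is proportional to cycles divided by clock frequency; the input values are hypothetical and are not taken from the result charts.

```python
def relative_performance(rel_cycles, rel_frequency):
    """Performance relative to the original design.

    Execution time scales with cycles / frequency, and performance is its inverse.
    """
    return rel_frequency / rel_cycles


def relative_performance_per_area(rel_cycles, rel_frequency, rel_area):
    """Relative performance divided by relative device area."""
    return relative_performance(rel_cycles, rel_frequency) / rel_area


# Hypothetical inputs for illustration only: 25% fewer cycles,
# half the clock frequency, twice the device area.
perf = relative_performance(0.75, 0.5)                        # about 0.67
perf_per_area = relative_performance_per_area(0.75, 0.5, 2.0)  # about 0.33
```

With inputs like these, the cycle savings are more than cancelled by the slower clock, and the larger area lowers the per-area figure further, which is the effect described above.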

SLIDE 23

Provisions

Reservations:

  • using RT-level VHDL instead of BSV may offer better control over the critical path
  • introducing more stages may increase clock frequency
  • multi-method caches instead of a one-method cache would improve overall performance
  • other applications than the one we used (GC) could exhibit more folding potential
  • more elaborate folding schemes may be more effective

SLIDE 26

Finally...

Summary: We evaluated folding schemes for BlueJEP and conclude that performance decreases greatly even though the number of executed cycles is reduced.

Observation: Theoretical gains are not enough to show efficiency. Complete implementations must be evaluated!

Recommendation: For our case, using several simple processors is potentially more efficient.

SLIDE 28

Thank you! Questions?