SLIDE 1: HW/SW Codesign w/ FPGAs -- Microprocessors/Embedded Cores, ECE 495/595, ECE UNM (4/7/09)
(Slides from Patrick Schaumont's course notes)

The most successful programmable component of the past decades is, without doubt, the microprocessor. Just about any electronic device more complicated than a pushbutton seems to contain a microprocessor. There have been a number of drivers for the popularity of the microprocessor.
- Microprocessors come with tools (compilers and assemblers) that help a designer create applications.
The availability of a compiler that automatically translates source code into a binary for a microprocessor is an enormous advantage for development. It de-couples the design of the application software from the application hardware. An embedded software designer can therefore be proficient in one programming language, such as C, and that alone allows them to move seamlessly across different microprocessor architectures.
SLIDE 2: Microprocessors/Embedded Cores

No hardware development technique has ever been able to de-couple, in a similar way, the design of an application from its implementation. Even micro-programming requires significant knowledge of the underlying architecture.
- Very few devices have been able to support reuse as efficiently as microprocessors have.
A general-purpose embedded core is, by itself, an excellent example of reuse. Moreover, microprocessors have also dictated the rules of reuse for electronic system design in general. They have provided bus protocols that enable the physical integration of an entire system, and their compilers have enabled the development of standard software libraries.
SLIDE 3: Microprocessors/Embedded Cores
- No other programmable component has the same scalability as the microprocessor.
The same concept (the stored-program computer) has been implemented across a large range of word-lengths (4-bit ... 64-bit) and basic architecture types. In addition, microprocessors have extended their reach to entire chips, containing many other components, while staying 'in command' of the system. This approach is commonly called System-on-Chip (SoC).

Given that entire courses exist on the topic of microprocessors, we need to be very selective about the topics we can cover in a single lecture. We will therefore focus specifically on issues that are relevant to hardware-software codesign, including:
- Forms of parallelism in microprocessors
- RISC Pipeline and pipeline hazards
- Brief overview of the Microblaze processor
- Dynamic program analysis, with examples in the SA-110 (StrongARM)
SLIDE 4: Forms of Parallelism in Microprocessors

A microprocessor is a machine for sequential execution of operations. Internally, the microprocessor architecture enables parallel execution of these operations where possible. This parallelism can be classified in three categories.

- Bit-level parallelism
Standard microprocessors are 32-bit or larger, even though shorter word-lengths (4-bit, 8-bit, 16-bit) are still in use for low-performance/low-power applications. The standard operation on such a microprocessor therefore processes 32 bits at a time. In an actual application, the typical word-length may match, exceed, or fall short of those 32 bits. For example, multimedia applications working on 8 bits-per-pixel, and internet packet formatting, are naturally formulated at byte length. Certain cryptographic applications, on the other hand, may require hundreds of bits per number.
SLIDE 5: Forms of Parallelism in Microprocessors
- Bit-level parallelism (cont)
In either case, it is up to the designer to match the natural word-lengths of the application to the word-length provided by the processor. This is not an easy task. It is well known, for example, that microprocessors perform badly on 'bit twiddling' operations, while custom hardware excels at them.
- Operator-level parallelism
After an application is decomposed into operations, one wants to exploit all opportunities for concurrent execution. Microprocessors offer several mechanisms to map these operations onto parallel hardware. Of the three described below, in most practical processors these techniques tend to be exclusive (i.e., a processor will use one of them, but not all).
SLIDE 6: Forms of Parallelism in Microprocessors
- Operator-level parallelism (cont)
- SIMD: Single-Instruction Multiple Data
This type of parallelism is used to process multiple data values in parallel with the same instruction. High-end Intel processors provide a SIMD unit (called MMX) for multimedia processing. SIMD instructions are usually not generated automatically by a compiler -- a designer typically has to write assembly code to invoke these instructions explicitly.
- Pipelining
Pipelined operators enable overlapped execution of instructions by means of pipeline registers inserted in the datapath. Modern compilers are well equipped to deal with pipelining and its complexities.
SLIDE 7: Forms of Parallelism in Microprocessors
- Operator-level parallelism (cont)
- VLIW: Very Long Instruction Word
Here, a set of operators is available in the microprocessor, and VLIW instructions invoke these operators in parallel. It is non-trivial to map a program into VLIW instructions such that they take full advantage of the available parallelism. This is usually done by a compiler.
- Task-level parallelism
While real systems consist of concurrent tasks, there are almost no generally accepted techniques to implement parallel tasks in a microprocessor, with the exception of hyper-threading. As a result, microprocessors rely on threading software to provide a sequential implementation of concurrent tasks.
SLIDE 8: Forms of Parallelism in Microprocessors
- Task-level parallelism (cont)
On the other hand, it is generally agreed that task-level parallelism is the next big target for microprocessors. This type of parallelism will leverage multiple-processors-on-a-chip (MPSoC) technology. A very convincing argument driving this hypothesis is provided by Deszo Sima, in a figure showing the efficiency of Intel processors after normalization to their clock frequency.

[Figure: Efficiency of the microarchitecture (SPECint92/100 MHz) versus year of first volume shipment (1985-1999), plotted for the i386, i486, Pentium, Pentium Pro, Pentium II, and Pentium III on a 0.5-2.0 scale.]
SLIDE 9: Forms of Parallelism in Microprocessors

It can be seen that since the mid-90's, this efficiency (i.e., the number of operations completed per clock cycle) has remained relatively constant. In other words, processor performance improvements of the past decade are not due to better architectures, but rather to increasing the processor clock speed. Multi-processor architectures, which implement task-level parallelism, are a solution to this problem.

RISC Pipeline: Operation and Hazards

We will now focus on one particular form of processor parallelism: the RISC pipeline. The following is an overview of a generic RISC pipeline structure. The figure below shows a five-stage pipeline, in which standard instructions take 5 clock cycles to complete (this is the instruction latency).
SLIDE 10: RISC Pipeline: Operation and Hazards

Each dashed line in the figure represents a pipeline register.

Instruction Fetch: an instruction is retrieved from memory or the instruction cache.
Instruction Decode: the instruction is decoded and its register operands are fetched. Branch instructions modify the PC during this phase.

[Figure: five-stage pipeline. Fetch: instruction memory read. Decode: instruction decode, next-PC evaluation, register read. Execute: datapath operation (including any custom datapath). Buffer: data memory read/write. Write-back: register write.]
SLIDE 11: RISC Pipeline: Operation and Hazards

Execute: the operands are input to the datapath operators and the operation is executed.
Buffer: the data memory is accessed using an address generated in the execute phase.
Write-back: registers are updated to reflect the final result of the instruction execution.

In an ideal situation, the architecture above completes 1 instruction per clock cycle (this is the instruction throughput). Even though the instruction latency is 5 clock cycles, the pipeline enables overlapped execution of these instructions to increase throughput. The clock cycle time is limited by the slowest component in the pipeline, plus the overhead of the pipeline registers (clock skew and setup time).
If a pipeline stage is too slow, additional pipeline stages can be added, spreading the computation over multiple clock cycles. Doing so also extends the instruction latency.
SLIDE 12: RISC Pipeline: Operation and Hazards

The ideal case of one instruction per clock cycle is the best-case scenario. A pipeline stall occurs when the progress of instructions through the pipeline is temporarily halted. The cause of such a stall is a pipeline hazard. Pipeline interlock hardware can detect and resolve pipeline hazards automatically. There are several types of pipeline hazards. We discuss them in the context of examples from the SA-110 microprocessor (the processor modeled by SimIt-ARM). The following generalizations can be made:
- Data hazards are caused by unfulfilled data dependencies
- Control hazards are caused by branches
- Structural hazards are caused by resource conflicts and cache misses
SLIDE 13: Control Hazards

Branches are the most common cause of pipeline stalls. As shown in the pipeline architecture, a branch is only executed in stage 2 of the pipeline. By the time this happens, another instruction has already entered the pipeline. This instruction follows the branch instruction sequentially, so if the branch is taken, its execution must be cancelled. A branch instruction at location X therefore introduces a one-cycle stall that ripples through the pipeline:

Cycle  Fetch    Decode     Execute  Buffer  Write-back
0      F(X)
1      F(X+4)   Branch Y
2      F(Y)     stall      idle
3      F(Y+4)   D(Y)       stall    idle
4      ...      D(Y+4)     E(Y)     stall
5      ...      ...        E(Y+4)   B(Y)    stall
SLIDE 14: Control Hazards

Some processors (including the Microblaze) have a branch-delay slot: the instruction immediately following the branch is allowed to complete. The instruction inserted into the branch-delay slot must be one that would execute regardless of the branch outcome. The compiler can automatically select a candidate instruction for the slot -- if none exists, a NO-OP instruction is inserted. This fills the stall slot with the execution of the instruction at X+4.

int accumulate() {
  int i, j;
  for (i = 0; i < 100; i++)
    j += i;
  return j;
}

(Note that j is never initialized in this example; the compiled code below simply accumulates into whatever r3 holds on entry.)
SLIDE 15: Control Hazards

Compiling generates the following assembly code for Microblaze:

        addk  r4,r0,r0    ; clear r4 (holds i)
        addk  r3,r3,r4    ; j = j + i
$L9:
        addik r4,r4,1     ; i = i + 1
        addik r18,r0,99   ; r18 <- 99
        cmp   r18,r4,r18  ; r18 <- 99 - i
        bgeid r18,$L9     ; delayed branch if r18 >= 0 (i.e., i <= 99)
        addk  r3,r3,r4    ; j = j + i (branch delay slot)

Obviously, this code is not optimal; for example, the load of r18 inside the loop can be moved above the loop. Also, the first addk r3,r3,r4 is not needed if r18 is instead loaded with 100, because the instruction in the branch delay slot (addk r3,r3,r4) is always executed. This type of manipulation of the instruction sequence can be done at compile time and can increase utilization of the pipeline.
SLIDE 16: Data Hazards

Registers are updated only during the write-back phase. However, it is possible that a register value is required before it has reached the write-back phase. This situation is detected by the pipeline interlock, which stalls part of the pipeline. When the data becomes available, it is forwarded directly to the execution stage where it is needed. Consider the StrongARM SA-110 as it executes the following instructions:

    LDR R1, [R0,+4]!   ; R1 <- mem[R0+4], R0 <- R0 + 4
    MOV R2, R1         ; R2 <- R1
SLIDE 17: Data Hazards

After MOV R2, R1 is decoded, the pipeline interlock senses that a source operand of this instruction (R1) is produced by the previous instruction. It therefore stalls the first three stages of the pipeline to allow the previous instruction to move forward to the Buffer stage. After the Buffer stage, the value of R1 has been fetched from memory. The R1 value is forwarded directly to the waiting MOV instruction, and the pipeline unfreezes.

Cycle  Fetch    Decode           Execute    Buffer      Write-back
0      F(X)
1      F(X+4)   LDR R1,[R0,+4]!
2      F(X+8)   MOV R2, R1       Calc R0+4
3      stall    stall            stall      Read(R0+4)
4      F(X+12)  D(X+8)           E(X)       stall       Write R0, R1
5      ...      ...              ...        B(X)        stall
6      ...      ...              ...        ...         Write R2
SLIDE 18: Structural Hazards

Structural hazards are caused by instructions that require more resources than the processor has available. These introduce stalls into the pipeline, similar to data and control hazards. Consider the execution of a 'load-multiple' instruction on the SA-110. This instruction performs a number of register loads from consecutive memory addresses:

    LDMIA R1, {R2, R3, R4}  ; R2 <- mem[R1], R3 <- mem[R1+4], R4 <- mem[R1+8]

Since the pipeline has only a single memory port, these loads proceed over multiple clock cycles, with the pipeline stalled as necessary.
SLIDE 19: Structural Hazards

Two stalls are introduced to leave time for the two additional reads required, calc/read[R1+4] and calc/read[R1+8], beyond the first read. While cache misses are technically not a structural hazard, they also introduce stalls.
- An instruction-cache miss causes the Fetch phase to be extended.
- A data-cache miss causes the Buffer phase to be extended.
We will treat cache misses as a structural hazard.

Cycle  Fetch    Decode    Execute     Buffer      Write-back
0      F(X)
1      F(X+4)   LDMIA R1, {R2, R3, R4}
2      stall    stall     calc[R1]
3      stall    stall     calc[R1+4]  read[R1]
4      F(X+8)   D(X+4)    calc[R1+8]  read[R1+4]  write R2
5      F(X+12)  D(X+8)    E(X+4)      read[R1+8]  write R3
6      F(X+16)  D(X+12)   E(X+8)      B(X+4)      write R4
SLIDE 20: The Microblaze Processor

The Microblaze processor is a configurable processor for the FPGA. A key document describing the architecture is the Microblaze Processor Reference Guide (from XPS: Help -> EDK Online Documentation, then select Processor Reference Guides -> Microblaze Processor Reference Guide).

[Figure: Microblaze block diagram. Instruction-side interfaces (IXCL_M, IXCL_S, IOPB, ILMB) feed the I-cache bus interface, instruction buffer, and program counter. The datapath contains special-purpose registers, an ALU with optional barrel shifter, multiplier, divider, and FPU, instruction decode logic, and a 32 x 32 register file. Data-side interfaces are DXCL_S, DXCL_M, DOPB, and DLMB, plus the FSL coprocessor links MFSL 0..7 and SFSL 0..7.]
SLIDE 21: The Microblaze Processor

The processor has the following features:
- Harvard architecture (separate data-bus and instruction-bus)
- 32 registers of 32-bit each
- 32-bit instruction word (3 operands, 2 addressing modes)
- 32-bit address space
- Single-issue pipeline, configurable as a three-stage or a five-stage pipeline.
The processor provides several optional features (the shaded areas in the figure), including:
- Optional data path features (hardware shifter, multiplier, divider, floating-point unit)
- Several instruction-side interfaces
- On-chip Peripheral Bus Interface (IOPB)
- Local Memory-Bus Interface (ILMB)
- Cache-Link Interfaces (IXCL_S and IXCL_M)
- Several data-side interfaces
- On-chip Peripheral Bus Interface (DOPB)
- Local Memory-Bus Interface (DLMB)
- Cache-Link Interfaces (DXCL_S and DXCL_M)
- Coprocessor interfaces (MFSL and SFSL)
SLIDE 22: Static Program Analysis for Microblaze

It is important to be able to analyze the performance of a program at the level of assembly language. To do this, you must have access to the processor architecture documentation. You will encounter many different processor architectures over your career, and moving easily from one architecture to another is an important skill. We will perform static program analysis on a small example for the Microblaze processor. The objective of the analysis is to quantify the performance of a given C program on the Microblaze. You accomplish this by analyzing the assembly language produced by the C compiler and referencing the processor documentation.
SLIDE 23: Static Program Analysis for Microblaze

Consider the following C code, which implements a convolution operation (note that c[256 - i] reads one element past the end of c when i == 0):

int array[256];
int c[256];

int main() {
  int i, a;
  a = 0;
  for (i = 0; i < 256; i++)
    a += array[i] * c[256 - i];
  return a;
}

The following EDK shell command produces an assembly file conv.s from conv.c; the -O2 flag enables optimization and -S generates assembly code:

    mb-gcc -O2 -S conv.c

The format of the assembly code is described in the Application Binary Interface (ABI) for the Microblaze, which specifies:
- How the processor uses registers and how the memory is organized
- How C functions are called (including parameter passing and local variable storage)
SLIDE 24: Static Program Analysis for Microblaze

Chapter 3 of the Microblaze Processor Reference Guide describes the ABI.
 1.        .text
 2.        .align  2
 3.        .globl  main
 4.        .ent    main
    main:
 5.        .frame  r1,44,r15
 6.        .mask   0x01c88000
 7.        addik   r1,r1,-44
 8.        swi     r22,r1,32
 9.        swi     r23,r1,36
10.        addik   r22,r0,array
11.        addik   r23,r0,c+1024
12.        swi     r19,r1,28
13.        swi     r24,r1,40
14.        swi     r15,r1,0
15.        addk    r24,r0,r0
16.        addik   r19,r0,255
    $L5:
17.        lwi     r5,r22,0
18.        lwi     r6,r23,0
19.        brlid   r15,__mulsi3
20.        addik   r19,r19,-1
21.        addk    r24,r24,r3
22.        addik   r22,r22,4
23.        bgeid   r19,$L5
24.        addik   r23,r23,-4
25.        addk    r3,r24,r0
26.        lwi     r15,r1,0
27.        lwi     r19,r1,28
28.        lwi     r22,r1,32
29.        lwi     r23,r1,36
30.        lwi     r24,r1,40
31.        rtsd    r15,8
32.        addik   r1,r1,44
33.        .end    main
SLIDE 25: Static Program Analysis for Microblaze

The ABI indicates that register r1 is used as the stack pointer and that r0 always reads as zero. Registers r19 - r31 are callee-saved. Lines 7-9 and 12-14 manipulate the stack.
- Line 7 grows the stack by 44 bytes (11 words). Note that the Microblaze stack grows downwards.
- Lines 8, 9, 12, 13 and 14 save registers on the stack. These registers (r22, r23, r19, r24, r15) will be used as temporary variables by the program. They are restored just before the main function terminates, in lines 26-30.
    $Lfe1:
34.        .size   main,$Lfe1-main
35.        .bss
36.        .comm   array,1024,4
37.        .type   array, @object
38.        .comm   c,1024,4
39.        .type   c, @object
SLIDE 26: Static Program Analysis for Microblaze

Other inferences we can make:
- r22 is initialized with array, the starting address of the array variable (line 10).
- r23 is initialized with c+1024, the start address of c plus an offset of 1024 bytes (line 11); r23 therefore points to the end of the c variable (it holds 256 words of storage).
- r24 is initialized to 0, and could be the loop counter or the accumulator (line 15).
- r19 is initialized to 255, which is the loop count minus 1 (line 16).

The loop body is given on lines 17-24. Loops always start with a label (here $L5) and terminate with a branch instruction to that label. In this case, the last instruction of the loop body is on line 24, because the branch on line 23 is a delayed branch (its mnemonic ends with a 'd'). Lines 17 and 18 read from the array and c variables and store the results in r5 and r6.
SLIDE 27: Static Program Analysis for Microblaze

The instruction on line 19 is a function call, implemented as a branch that saves the return address; r15 is used to hold the return address. The function's name, __mulsi3, indicates that it performs a multiplication. The Microblaze processor used here does not contain multiplier hardware, so the compiler must provide a software implementation for the multiplication in the statement:

    a += array[i] * c[256 - i];

Functions starting with a double underscore are such compiler-provided functions. According to the ABI, the result of a function is returned in registers r3 and r4. Line 21 shows that r3 is accumulated into r24; therefore, r24 is the accumulator.
SLIDE 28: Static Program Analysis for Microblaze

There are three adjustments to the counter values in the loop body:
- r19 is decremented by 1 (line 20, loop counter)
- r22 is incremented by 4 (line 22, pointer to array)
- r23 is decremented by 4 (line 24, pointer to c, in the branch delay slot)
The compiler was able to determine that, for the statement

    a += array[i] * c[256 - i];

the accesses track increasing locations in array and decreasing locations in c. In other words, the compiler automatically performed the following very effective pointer transformation:

int array[256];
int c[256];

int main() {
  int i, a;
  int *p1, *p2;
  p1 = array;
  p2 = &(c[255]);
  a = 0;
  for (i = 0; i < 256; i++)
    a += (*(p1++)) * (*(p2--));
  return a;
}

SLIDE 29: Static Program Analysis for Microblaze

Through static program analysis, we established that the C compiler carried out a fairly advanced dataflow analysis and optimization. Static program analysis does not by itself reduce cycle counts or improve performance numbers. Rather, it provides a qualitative appreciation of a program and allows you to estimate its performance.
SLIDE 30: Static Program Analysis for Microblaze

For example, by analyzing the assembly we determined that the compiler uses a software routine to implement multiplication. The add-and-shift logic used in such routines is always slower than a hardware multiplier. Suppose you noticed no performance improvement when running this code on a Microblaze processor that has a hardware multiplier. By analyzing the assembly, you could easily determine whether the compiler mistakenly used the software routine instead.
SLIDE 31: Dynamic Program Analysis for StrongARM

We can also analyze program execution at runtime and determine, cycle by cycle, what is happening. This is called dynamic program analysis. To demonstrate this technique, we take a simple C program, compile it for the SA-110, and observe how it executes on the RISC pipeline. The SA-110, simulated with the SimIt-ARM simulator, has the following architectural characteristics:
- 16 KByte D-cache, organized as a 32-way set-associative cache with 32-byte lines
- 16 KByte I-cache, organized as a 32-way set-associative cache with 32-byte lines
- Memory access latency = 64 cycles, cache access latency = 1 cycle
Consider the GCD program, compiled without optimization:

#include <stdio.h>

int gcd(int a, int b) {
  while (a != b) {
    if (a > b)
      a = a - b;
    else
      b = b - a;
  }
  return a;
}

extern void instructiontrace(unsigned);

int main() {
  int a;
  instructiontrace(1);
  a = gcd(6, 8);
  instructiontrace(0);
  printf("GCD = %d\n", a);
  return 0;
}

SLIDE 32: Dynamic Program Analysis for StrongARM

The instructiontrace() system call enables dynamic instruction tracing.
SLIDE 33: Dynamic Program Analysis for StrongARM

To generate the assembly code:

    /usr/local/arm/bin/arm-linux-gcc -S gcd.c -o gcd.S

gcd:
        mov     ip, sp
        stmfd   sp!, {fp, ip, lr, pc}
        sub     fp, ip, #4
        sub     sp, sp, #8
        str     r0, [fp, #-16]
        str     r1, [fp, #-20]
.L2:
        ldr     r2, [fp, #-16]
        ldr     r3, [fp, #-20]
        cmp     r2, r3
        bne     .L4
        b       .L3
.L4:
        ldr     r2, [fp, #-16]
SLIDE 34: Dynamic Program Analysis for StrongARM

        ldr     r3, [fp, #-20]
        cmp     r2, r3
        ble     .L5
        ldr     r3, [fp, #-16]
        ldr     r2, [fp, #-20]
        rsb     r3, r2, r3
        str     r3, [fp, #-16]
        b       .L2
.L5:
        ldr     r3, [fp, #-20]
        ldr     r2, [fp, #-16]
        rsb     r3, r2, r3
        str     r3, [fp, #-20]
        b       .L2
.L3:
        ldr     r3, [fp, #-16]
        mov     r0, r3
        ldmea   fp, {fp, sp, pc}
SLIDE 35: Dynamic Program Analysis for StrongARM

Stack Frame for GCD

The instructions at the start and the end of the gcd procedure are used to build and destroy the stack frame. The StrongARM compiler creates subroutines according to the APCS (the ARM Procedure Call Standard). When gcd is called, the following activities occur. First, the actual call instruction (not shown in the assembly listing) is of the form:

    bl gcd            ; lr <- pc, pc <- gcd

This instruction copies the current PC into the link register (lr), and then transfers control to the code starting at gcd. The instruction:

    stmfd sp!, {fp, ip, lr, pc}   ; mem[sp-16] <- fp, mem[sp-12] <- ip,
                                  ; mem[sp-8]  <- lr, mem[sp-4]  <- pc,
                                  ; sp <- sp - 16

pushes the current frame pointer, the old stack pointer (copied into ip), the return address and the program counter onto the stack.

SLIDE 36: Dynamic Program Analysis for StrongARM

This instruction performs multiple register transfers (the 'm' in the mnemonic) on a full, descending stack (fd); it assumes the stack pointer points to the last non-free location. The frame pointer points to the start of the APCS frame, and the APCS frames of nested procedure calls are linked together through these frame pointers. The APCS also specifies that integer arguments are passed and returned through registers (starting with r0, r1, etc.).
SLIDE 37: Dynamic Program Analysis for StrongARM

Hence the two instructions

    str r0, [fp, #-16]
    str r1, [fp, #-20]

store copies of the arguments of gcd into the stack frame. The body of the gcd code is clearly non-optimized assembly code.

[Figure: stack frame construction by the prologue (mov ip, sp; stmfd sp!, {fp, ip, lr, pc}; sub fp, ip, #4; sub sp, sp, #8; str r0, [fp,#-16]; str r1, [fp,#-20]). Addresses grow downward from the old sp: the new fp points near the top of the saved area (the stored pc), with the return address and saved registers below it, then the arguments a and b, and the new sp at the bottom.]
SLIDE 38: Dynamic Program Analysis for StrongARM

For each access to a or b, a load from or store to the stack frame is performed, using instructions such as:

    ldr r2, [fp, #-16]
    ldr r3, [fp, #-20]
    str r3, [fp, #-20]

Dynamic Analysis of Non-Optimized GCD

We can now run this program with instruction tracing turned on. The output of the first iteration of the while-loop is given below. The columns of this dynamic program analysis have the following meaning:
- Cycle: the current cycle count
- Address: location of the current instruction
- Opcode: binary code of the current instruction
- P: pipeline mis-speculation. If 1, the current instruction will not complete and will be removed from the pipeline
- I: I-cache miss. If 1, there is an instruction-cache miss
- D: D-cache miss. If 1, there is a data-cache miss
- C: the number of clock cycles consumed by the instruction
SLIDE 39: Dynamic Program Analysis for StrongARM

cycle  address  opcode    P I D  C   Mnemonic
44105  81e4     e1a0c00d  0 1 0  70  mov ip, sp
44171  81e8     e92dd800  0 0 0  8   stmdb sp!, {fp, ip, lr, pc}
44172  81ec     e24cb004  0 0 0  8   sub fp, ip, #4
44176  81f0     e24dd008  0 0 0  5   sub sp, sp, #8
44177  81f4     e50b0010  0 0 0  5   str r0, [fp, #-16]
44178  81f8     e50b1014  0 0 0  5   str r1, [fp, #-20]
44179  81fc     e51b2010  0 0 0  5   ldr r2, [fp, #-16]
44180  8200     e51b3014  0 1 0  70  ldr r3, [fp, #-20]
44246  8204     e1520003  0 0 0  6   cmp r2, r3
44247  8208     1a000000  0 0 0  3   bne 0x8210
44249  820c     ea00000d  1 0 0  1   b 0x8248
44250  8210     e51b2010  0 0 0  5   ldr r2, [fp, #-16]
44251  8214     e51b3014  0 0 0  5   ldr r3, [fp, #-20]
44252  8218     e1520003  0 0 0  6   cmp r2, r3
44253  821c     da000004  0 0 0  3   ble 0x8234
44255  8220     e51b3010  1 1 0  1   ldr r3, [fp, #-16]
44256  8234     e51b3014  0 1 0  69  ldr r3, [fp, #-20]
SLIDE 40: Dynamic Program Analysis for StrongARM

cycle  address  opcode    P I D  C   Mnemonic
44321  8238     e51b2010  0 0 0  5   ldr r2, [fp, #-16]
44322  823c     e0623003  0 0 0  6   rsb r3, r2, r3
44323  8240     e50b3014  0 0 0  6   str r3, [fp, #-20]
44325  8244     eaffffec  0 0 0  2   b 0x81fc
44326  8248     e51b3010  1 0 0  1   ldr r3, [fp, #-16]

The instructions with P = 1 are mis-speculated, meaning they do not complete execution because of pipeline hazards. Note that all of these instructions follow a taken branch instruction, so they are the result of a pipeline control hazard. Next, the instructions with I = 1 cause an instruction-cache miss. This happens because they start just beyond a 32-byte boundary, and thus are located on a new I-cache line, which may cause a miss.
SLIDE 41: Dynamic Program Analysis for StrongARM

Note that the second example, address 0x8234 at cycle 44256, does not start at a 32-byte boundary (an address evenly divisible by 0x20). The instruction before it would have caused the first I-cache miss in that line (0x8220 is evenly divisible by 0x20), but that instruction was cancelled due to a control hazard. In other cases, it can be determined that an I-cache miss is impossible. For example, it cannot happen for the instruction at address 0x81fc, because it would already have happened at the previous instruction, 0x81f8: the cache-line boundary is at 0x81E0. Note that most of the data-processing instructions take 5 clock cycles (corresponding to the 5-stage pipeline), while some require 6. The cmp instruction at address 0x8204 is an example of a 6-cycle instruction: one of its operands, r3, is produced by the previous instruction, which creates a data hazard and stalls the pipeline for one cycle.
SLIDE 42: Dynamic Program Analysis for StrongARM

Consider the same GCD program compiled with optimization turned on:

    /usr/local/arm/bin/arm-linux-gcc -O2 -static gcd.c cycle.s -o gcd

The assembly code is considerably shorter:

gcd:
        cmp     r0, r1
        moveq   pc, lr
.L7:
        cmp     r0, r1
        rsbgt   r0, r1, r0
        rsble   r1, r0, r1
        cmp     r0, r1
        moveq   pc, lr
        b       .L7

First, we note that there is no stack frame: the compiler has optimized it out!
SLIDE 43: Dynamic Program Analysis for StrongARM

This type of optimization is possible for procedures that do not call other procedures and that keep all local variables in registers. The return instruction of the procedure has a peculiar form:

    moveq pc, lr

This instruction copies the link register (the return address) into the program counter when the zero flag is set. This kind of instruction is called a predicated instruction: it uses a flag (a predicate) to determine whether the instruction should execute. If the zero flag is not set, the instruction has no effect. The instructions

    rsbgt r0, r1, r0   ; if gt then r0 <- r0 - r1
    rsble r1, r0, r1   ; if le then r1 <- r1 - r0

are also predicated instructions.
SLIDE 44: Dynamic Program Analysis for StrongARM

They implement a reverse subtraction predicated on the conditions gt (greater than) and le (less than or equal).

cycle  address  opcode    P I D  C  Mnemonic
43773  81e8     01a0f00e  0 0 0  5  moveq pc, lr
43774  81ec     e1500001  0 0 0  5  cmp r0, r1
43775  81f0     c0610000  0 0 0  5  rsbgt r0, r1, r0
43776  81f4     d0601001  0 0 0  5  rsble r1, r0, r1
43777  81f8     e1500001  0 0 0  5  cmp r0, r1
43778  81fc     01a0f00e  0 0 0  5  moveq pc, lr
43779  8200     eafffff9  0 0 0  2  b 0x81ec
43780  8204     e92d4010  1 0 0  1  stmdb sp!, {r4, lr}

This is the first iteration of the while-loop in the dynamic trace. Here, all instructions consume 5 clock cycles; the only stall in the pipeline occurs when the while-loop iterates, where the branch instruction at 0x8200 causes a control hazard.