Dynamic Translation for EPIC Architectures, David R. Ditzel

Dynamic Translation for EPIC Architectures. David R. Ditzel, Chief Architect for Hybrid Computing, VP IAG, Intel Corporation. Presentation for the 8th Workshop on EPIC Architectures, April 24, 2010.


slide-1
SLIDE 1

Dynamic Translation for EPIC, CGO 2010

Dynamic Translation for EPIC Architectures

David R. Ditzel

Chief Architect for Hybrid Computing, VP IAG
Intel Corporation

Presentation for 8th Workshop on EPIC Architectures
April 24, 2010

slide-2
SLIDE 2

Thesis: The future of computing belongs to EPIC Architectures

EPIC:

  • Explicitly Parallel Instruction Computer, or
  • Exposed Parallelism Instruction Computer
  • Parallelism exposed for software to exploit
  • Examples – Itanium, GPGPUs, Transmeta Efficeon/Crusoe

My belief:

  • EPIC is a more power efficient approach
  • Dynamic translation will improve power advantages
  • May be a different EPIC than we know today
slide-3
SLIDE 3

Power is the limiter

We must move to more efficient computing structures, or the number of cores could be limited

Biggest challenge

slide-4
SLIDE 4

Simple Power Scaling Example

Power = Cdyn × Voltage² × Frequency + Leakage (33%)

Moore’s Law says # devices can double every node

  • 4 cores go to 128 cores over 10 years
  • How does power limit this expectation?

With an upper power limit of ~100 Watts, how many cores?

Easy to calculate scaling per node:

  • Voltage scaling about 0.9x
  • Cdyn scaling about 0.8x
  • Assume frequency increase of 1.2x

From this data we can see how many cores we can have if we do not change to a more efficient approach

slide-5
SLIDE 5

Power Limits # of Big Cores

Year                    2008  2010  2012  2014  2016  2018
Technology Node (nm)      45    32    22    15    11     8
Total Power              100
Power/core                25
Freq                     3.0
Voltage                  1.0
Cdyn/Core                5.6
Expected #Cores            4     8    16    32    64   128
Power Limited #Cores       4

slide-6
SLIDE 6

Power Limits # of Big Cores

Year                    2008  2010  2012  2014  2016  2018
Technology Node (nm)      45    32    22    15    11     8
Total Power              100   100   100   100   100   100
Power/core                25
Freq                     3.0   3.6   4.3   5.2   6.2   7.5
Voltage                  1.0   0.9   0.8   0.7   0.7   0.6
Cdyn/Core                5.6   4.4   3.6   2.8   2.3   1.8
Expected #Cores            4     8    16    32    64   128
Power Limited #Cores       4

slide-7
SLIDE 7

Power Limits # of Big Cores

Year                    2008  2010  2012  2014  2016  2018
Technology Node (nm)      45    32    22    15    11     8
Total Power              100   100   100   100   100   100
Power/core                25    19    15    12     9     7
Freq                     3.0   3.6   4.3   5.2   6.2   7.5
Voltage                  1.0   0.9   0.8   0.7   0.7   0.6
Cdyn/Core                5.6   4.4   3.6   2.8   2.3   1.8
Expected #Cores            4     8    16    32    64   128
Power Limited #Cores       4

slide-8
SLIDE 8

Power Limits # of Big Cores

Year                    2008  2010  2012  2014  2016  2018
Technology Node (nm)      45    32    22    15    11     8
Total Power              100   100   100   100   100   100
Power/core                25    19    15    12     9     7
Freq                     3.0   3.6   4.3   5.2   6.2   7.5
Voltage                  1.0   0.9   0.8   0.7   0.7   0.6
Cdyn/Core                5.6   4.4   3.6   2.8   2.3   1.8
Expected #Cores            4     8    16    32    64   128
Power Limited #Cores       4     5     7     9    11    14

slide-9
SLIDE 9

Power Limits # of Big Cores

Year                    2008  2010  2012  2014  2016  2018
Technology Node (nm)      45    32    22    15    11     8
Total Power              100   100   100   100   100   100
Power/core                25    19    15    12     9     7
Freq                     3.0   3.6   4.3   5.2   6.2   7.5
Voltage                  1.0   0.9   0.8   0.7   0.7   0.6
Cdyn/Core                5.6   4.4   3.6   2.8   2.3   1.8
Expected #Cores            4     8    16    32    64   128
Power Limited #Cores       4     5     7     9    11    14

We need to improve the efficiency of each core, or we will suffer a severe performance reduction
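
The per-core power and power-limited core counts in the table can be reproduced with a few lines of arithmetic. The sketch below uses the per-node scaling factors from the Simple Power Scaling Example slide (voltage 0.9x, Cdyn 0.8x, frequency 1.2x), the 100 W budget, and the 25 W per-core starting point for 2008. It assumes that leakage stays a fixed fraction of total power, so total per-core power scales by the same factor as dynamic power; the slides do not state this explicitly, and the function and constant names are illustrative only.

```python
# Per-node scaling factors taken from the slides.
VOLTAGE_SCALE = 0.9
CDYN_SCALE = 0.8
FREQ_SCALE = 1.2

POWER_BUDGET_W = 100.0       # upper power limit from the slides
POWER_PER_CORE_2008 = 25.0   # 2008 starting point from the table


def power_limited_cores(nodes):
    """Return a (power_per_core, core_count) pair for each node, starting at 2008.

    Assumption of this sketch: leakage stays a fixed fraction of total power,
    so per-core power scales by Cdyn * Voltage^2 * Frequency at every node.
    """
    per_node = CDYN_SCALE * VOLTAGE_SCALE ** 2 * FREQ_SCALE
    rows = []
    for n in range(nodes):
        power = POWER_PER_CORE_2008 * per_node ** n
        # Core count is the budget divided by per-core power, rounded to nearest.
        rows.append((power, round(POWER_BUDGET_W / power)))
    return rows


for year, (power, cores) in zip(range(2008, 2020, 2), power_limited_cores(6)):
    print(year, round(power), cores)
```

Under these assumptions each node multiplies per-core power by about 0.78, which is why the affordable core count grows far more slowly than the doubling per node that Moore's Law would otherwise allow.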
slide-10
SLIDE 10

So how do we build improved cores?

slide-11
SLIDE 11

Change of perspective needed

  • Software should be part of the picture
  • Hardware co-designed with software increases the available options
  • Software needs a simple model of the “cost” of an instruction
  • Out-of-order processors made this impossible
  • In-order EPIC processor can provide this simple model
  • Software can do a very good job of scheduling, but only if the scheduling blocks are large enough
  • Let’s look at an example of how to increase block size and improve scheduling

Premise

slide-12
SLIDE 12

    tst.ne p1, ecx, ecx
    brc p1, D

    or eax, zero, 1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    ld edx, [esp + 112]
    st ebx, [edx]
    tst.ne p1, ecx, ecx
    brc p1, F

    or eax, zero, 1
    ld r32, [esp + 112]
    or ebx, zero, 0
    st ebx, [r32]
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    brc p1, E

    or eax, zero, 1
    ld edx, [esp + 112]
    or ebx, zero, 0
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    assert ~p1
    tst.ne p1, ecx, ecx
    assert ~p1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    st ebx, [edx]

Conditional branches tend to have a very biased program behavior

  • Exploitable by compiler

Correctness makes it difficult

  • Fixup code for cold exits
  • Exceptions

A little special purpose hardware can make it much easier

Compiler optimization example

    or eax, zero, 1
    ld edx, [esp + 112]
    st ebx, [r32]
    tst.ne p1, ecx, ecx
    brc p1, F

slide-13
SLIDE 13

    tst.ne p1, ecx, ecx
    brc p1, D

    or eax, zero, 1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    ld edx, [esp + 112]
    st ebx, [edx]
    tst.ne p1, ecx, ecx
    brc p1, F

    or eax, zero, 1
    ld r32, [esp + 112]
    or ebx, zero, 0
    st ebx, [r32]
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    brc p1, E

Hardware executes a region of code completely or not at all

Common case is fast

Uncommon case rolls back

  • Resume in non-specialized code

Hardware atomicity

    or eax, zero, 1
    ld edx, [esp + 112]
    or ebx, zero, 0
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    assert ~p1
    tst.ne p1, ecx, ecx
    assert ~p1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    st ebx, [edx]
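
The "completely or not at all" behaviour can be modelled in software to make the control flow concrete. The following is only a toy sketch: a dict stands in for registers and memory, the checkpoint is a plain copy, and the names `run_atomic`, `specialized`, `general` and `AssertFailed` are invented for illustration. Real hardware would checkpoint registers and buffer stores rather than copy state.

```python
class AssertFailed(Exception):
    """Raised when a speculated condition turns out to be false."""


def run_atomic(state, region, fallback):
    """Run `region` atomically: commit all of its effects, or none of them."""
    checkpoint = dict(state)        # speculative copy made at region entry
    try:
        region(checkpoint)          # speculative execution touches only the copy
    except AssertFailed:
        fallback(state)             # rollback: resume in non-specialized code
        return "rollback"
    state.clear()
    state.update(checkpoint)        # commit: all effects become visible together
    return "commit"


def specialized(s):
    """Straight-line code for the common path, guarded by asserts."""
    s["eax"] = 1
    if s["edi"] != 72:              # plays the role of: cmp.ne p1 edi, 72 / assert ~p1
        raise AssertFailed
    if s["ecx"] != 0:               # plays the role of: tst.ne p1, ecx, ecx / assert ~p1
        raise AssertFailed
    s["ebx"] = 7                    # placeholder for the loads and store


def general(s):
    """Non-specialized code that handles every case."""
    s["eax"] = 1
    s["ebx"] = 7 if (s["edi"] == 72 and s["ecx"] == 0) else -1


hot = {"eax": 0, "ebx": 0, "ecx": 0, "edi": 72}
cold = {"eax": 0, "ebx": 0, "ecx": 0, "edi": 5}
print(run_atomic(hot, specialized, general), hot["ebx"])    # commit 7
print(run_atomic(cold, specialized, general), cold["ebx"])  # rollback -1
```

On the cold path the write to `eax` made before the failing assert is discarded, so the fallback code starts from exactly the state the region was entered with.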

slide-14
SLIDE 14

    or eax, zero, 1
    ld edx, [esp + 112]
    or ebx, zero, 0
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    assert ~p1
    tst.ne p1, ecx, ecx
    assert ~p1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    st ebx, [edx]

    tst.ne p1, ecx, ecx
    brc p1, D

    or eax, zero, 1
    ld ebx, [ebp]
    ld ebx, [ebx + esi*4]
    or edx, r32, 0
    st ebx, [edx]
    tst.ne p1, ecx, ecx
    brc p1, F

    or eax, zero, 1
    ld r32, [esp + 112]
    or ebx, zero, 0
    st ebx, [r32]
    ld esi, [ebp + 0x878]
    cmp.ne p1 edi, 72
    brc p1, E

Interpreter Runtime Translations

Code Morphing Software

Dynamic binary translation

x86 processor

x86 ISA

x86 Applications x86 OS

EPIC Processor

RISC ISA

    test ecx, ecx
    jne D
    mov eax, 1
    mov esi, [esp + 112]
    xor ebx, ebx
    mov [esi], ebx
    mov esi, [ebp + 0x878]
    cmp edi, 72
    jne E
    mov eax, 1
    mov ebx, [ebp]
    mov ebp, [ebx + esi*4]
    mov edx, [esp + 112]
    mov [edx], ebx
    test ecx, ecx
    jne F

slide-15
SLIDE 15

Up to 6-issue/clock EPIC style architecture

  • 2 loads or stores
  • 2 integer ALU
  • 2 SIMD
  • 1 branch/call or other control

Co-designed with CMS

Includes hardware atomicity under software control

  • Commit
  • Rollback

Efficeon Processor Example

slide-16
SLIDE 16

Efficeon Hardware Example

Functional Units:

  • Load or Store or 32-bit add
  • Load or Store or 32-bit add
  • Integer ALU-1
  • Integer ALU-2
  • Alias
  • Control
  • FP / SIMD
  • FP / SIMD
  • Branch
  • Exec-1
  • Exec-2

Each clock, processor can issue from one to six 32-bit instruction “atoms” to 11 functional units

Instruction: atom1 atom2 atom3 atom4 atom5 atom6 atom7 atom8
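
To illustrate what issuing several atoms per clock means for the software scheduler, here is a minimal sketch of greedy, in-order packing of atoms into issue groups. It uses the per-clock limits listed for the Efficeon example (2 loads or stores, 2 integer ALU, 2 SIMD, 1 branch, at most 6 atoms). The unit names, the tuple layout and the `pack` function are assumptions of this sketch, not CMS internals; it ignores operation latencies and memory dependences and never reorders atoms.

```python
from collections import Counter

# Per-clock issue limits from the Efficeon slide; unit names are this sketch's own.
SLOTS = {"mem": 2, "alu": 2, "simd": 2, "branch": 1}
MAX_ISSUE = 6


def pack(atoms):
    """Greedily pack atoms, in program order, into issue groups.

    Each atom is (text, unit, dest, sources). A new group is started when the
    atom's unit is already full, the group already holds MAX_ISSUE atoms, or the
    atom reads or rewrites a register written earlier in the same group.
    """
    groups, current, used, written = [], [], Counter(), set()
    for text, unit, dest, sources in atoms:
        conflict = written & (set(sources) | {dest})
        full = used[unit] >= SLOTS[unit] or len(current) >= MAX_ISSUE
        if current and (conflict or full):
            groups.append(current)
            current, used, written = [], Counter(), set()
        current.append(text)
        used[unit] += 1
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


atoms = [
    ("or eax, zero, 1",       "alu", "eax", []),
    ("ld esi, [ebp + 0x878]", "mem", "esi", ["ebp"]),
    ("ld ebx, [ebx + esi*4]", "mem", "ebx", ["ebx", "esi"]),
    ("st ebx, [edx]",         "mem", None,  ["ebx", "edx"]),
]
for group in pack(atoms):
    print(group)
```

In this example the first two atoms are independent and share a group, while each of the remaining two depends on the atom before it and therefore starts a new group. A real scheduler would reorder independent atoms to fill those otherwise empty slots.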

slide-17
SLIDE 17

1st Gear: no startup cost, lowest speed

Executes 1 instruction at a time

  • Profiles code at runtime
  • Gathers data for flow analysis
  • Gathers branch frequencies and directions
  • Detects load/store typing (IO vs memory)

Filters out infrequently executed code

Code Morphing Software

4 Gear System Significantly Improved Responsiveness and Overall Performance

slide-18
SLIDE 18

1st Gear 2nd Gear

Uses profile data to create initial translations after code reaches 1st threshold.

  • Translates a “Region” of up to 100 x86 instructions.
  • Adds flow graph “Shape” information
  • Light Optimization
  • “Greedy” scheduling

Low translation overhead, fast execution

Code Morphing Software

4 Gear System Significantly Improved Responsiveness and Overall Performance

slide-19
SLIDE 19

1st Gear 2nd Gear

Further optimizes the 2nd gear regions

  • Common sub-expression elimination
  • Memory re-ordering
  • Significant code optimization
  • Critical path scheduling

Medium translation overhead, faster execution

3rd Gear

Code Morphing Software

4 Gear System Significantly Improved Responsiveness and Overall Performance

slide-20
SLIDE 20

1st Gear 2nd Gear

Most advanced optimizations for “hottest” code regions.

  • Splices together multiple regions
  • Optimizes across region boundaries
  • Uses advanced behavioral data
  • Critical path scheduling

Highest translation overhead, fastest execution

3rd Gear 4th Gear

Code Morphing Software

4 Gear System Significantly Improved Responsiveness and Overall Performance
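
The gear system amounts to a policy of spending more translation effort on code the more often it runs. A minimal sketch of such a policy follows. The execution-count thresholds are hypothetical (the slides mention a first threshold but give no values), and `gear_for` and `Profiler` are names invented for this illustration.

```python
# Hypothetical execution-count thresholds; the slides give no actual values.
GEAR_THRESHOLDS = [(4, 10_000), (3, 1_000), (2, 50)]


def gear_for(exec_count):
    """Pick the highest gear whose threshold the execution count has reached."""
    for gear, threshold in GEAR_THRESHOLDS:
        if exec_count >= threshold:
            return gear
    return 1  # 1st gear: interpret one instruction at a time, no startup cost


class Profiler:
    """Counts executions per code address, as 1st-gear profiling might."""

    def __init__(self):
        self.counts = {}

    def execute(self, address):
        self.counts[address] = self.counts.get(address, 0) + 1
        return gear_for(self.counts[address])


profiler = Profiler()
gears = [profiler.execute(0x400000) for _ in range(1000)]
print(gears[0], gears[49], gears[999])
```

Code that runs only a few times never leaves the interpreter, so it never pays a translation cost, while the hottest regions earn the most expensive optimizations.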

slide-21
SLIDE 21

Sophisticated translation optimizer

  • Quickly applies many optimizations
  • if-conversion, loop unrolling, constant folding and propagation, common-subexpression elimination, dead-code elimination, loop-invariant code motion, superblock scheduling

  • New optimizations must have low overhead

CMS Translator Optimization

slide-22
SLIDE 22

Dynamic opportunities in CMS translation

  • Hottest method in Vortex: OaGetObject

Potential to eliminate x86 insts

  • 56 -> 47 dynamic x86 insts
  • 19% reduction

CMS relies on superblock abstraction

  • Does not expose available opportunity

Example from Vortex

[Figure: annotated code listing. Annotations: same condition; dead store killed by; partially redundant load (three times); partially dead op; store->load forward]

slide-23
SLIDE 23

Atomic regions trivially expose opportunity

Convert biased control flow into assert operations

  • Represent as dataflow op in IR
  • Traditional optimizations can now exploit speculative opportunity
  • Emit as conditional branch to jump to rollback and recover

Retry in the interpreter or another translation

Atomic region abstraction
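
Converting biased control flow into asserts can be sketched as a simple pass over a list of operations. In this illustration each operation is an (opcode, operands) tuple, `taken_fraction` holds profiled branch behaviour keyed by target label, and the 1% cutoff is a made-up value. None of these names or numbers come from CMS.

```python
MAX_TAKEN_FRACTION = 0.01  # hypothetical cutoff for "rarely taken", not from the slides


def to_atomic_region(ops, taken_fraction):
    """Rewrite rarely taken conditional branches as assert operations.

    ops: list of (opcode, operands) tuples in program order.
    taken_fraction: maps a branch target label to its observed taken fraction.
    Returns the rewritten list, or None if any branch is not biased enough
    (branches with no profile data are treated as always taken).
    """
    region = []
    for opcode, operands in ops:
        if opcode == "brc":
            predicate, target = operands
            if taken_fraction.get(target, 1.0) > MAX_TAKEN_FRACTION:
                return None
            # The branch becomes a dataflow op asserting it is not taken.
            region.append(("assert", ("~" + predicate,)))
        else:
            region.append((opcode, operands))
    return region


ops = [
    ("tst.ne", ("p1", "ecx", "ecx")),
    ("brc", ("p1", "D")),
    ("or", ("eax", "zero", "1")),
]
print(to_atomic_region(ops, {"D": 0.001}))
print(to_atomic_region(ops, {"D": 0.4}))
```

Once the branch is an ordinary operation with no control-flow successor, the region is a single straight-line block, which is what lets standard optimizations such as redundancy elimination work across what used to be block boundaries.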

slide-24
SLIDE 24

Dependence graph shown

Atomic region trivially enables optimizer to eliminate operations

  • 88 -> 73 Efficeon operations
  • 17% reduction

Relaxes scheduling constraints

  • 26 -> 19 cycles
  • 27% reduction

Atomic region benefits

[Figure: schedule with columns MEMORY, MEMORY, INT ALU, INT ALU, BRANCH]
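
The percentages quoted for operations and cycles are relative to the original counts. A short check, using an illustrative helper name:

```python
def reduction_percent(before, after):
    """Percentage reduction relative to the original count."""
    return 100.0 * (before - after) / before


print(round(reduction_percent(88, 73)))  # Efficeon operations: prints 17
print(round(reduction_percent(26, 19)))  # cycles: prints 27
```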

slide-25
SLIDE 25

Why we need EPIC architectures

EPIC architectures offer many advantages

Simplified hardware

  • Simpler to design
  • Smaller cores means more cores per die

Enables software scheduling

  • EPIC architectures are easier for DBT to schedule
  • Better scheduling is the key to future performance gains

Power

  • In-order pipelines for EPIC are power efficient
  • Less hardware for OOO means lower power
  • More amenable to new power saving techniques
slide-26
SLIDE 26

Why we need Dynamic Translation

Good reasons for Dynamic Binary Translation (DBT)

Innovation

  • To allow processor innovation not tied to particular instruction sets
  • Using DBT to provide backwards compatibility
  • DBT system hidden from standard software – CMS as microcode

Performance

  • To enable new means to improve processor performance
  • DBT can provide access to new performance features

Power

  • Dynamic optimization is good for power
  • e.g., optimizing away half the instructions is twice as energy efficient
slide-27
SLIDE 27

Conclusions

Binary translation and EPIC architectures are a good combination.

Special purpose hardware support is needed, co-designed with software, in order to provide good performance and power efficiency.

Special care is needed to keep translation overhead low.

There are many opportunities for clever hardware/software co-design tradeoffs.

This is a technological approach still in its infancy.

Prediction: Dynamic Binary Translation will become a basic technique used in future processor design, as integral as logic gates and microcode are today.
