SLIDE 1

7th June 2003, Workshop on Complexity Effective Design

Evaluating Compiler Support for Complexity Effective Network Processing

Pradeep Rao and S.K. Nandy
Computer Aided Design Laboratory, SERC, Indian Institute of Science

pradeep,nandy@cadl.iisc.ernet.in
http://www.serc.iisc.ernet.in/cadl/

SLIDE 2

Outline

Why network processors (NP)?
– Why complexity effective NPs?
NP design issues
Statically scheduled processors for NPs
– Compiler optimizations
    Classical
    Superblock
    Hyperblock
– Performance data

SLIDE 3

Network Processors

Why do we need network processors?
– Significant time spent in protocol stack
– Increasing data rates
    Increased performance requirements
– New protocols and services
    Software based functionality
Flexible (vs. ASIC)
Faster time to market
Players
– Cisco, Intel IXP, IBM PowerNP, Motorola (C-Port) C5, Broadcom, ClearWater ...

SLIDE 4

Complexity Effective NPs

Complexity-effective hardware
– Low design, verification and testing times
– Impacts time to market
– Low power
    Fixed power budgets for line cards
    Network enabled mobile devices
– Performance goals met?
Performance
– Exploit parallelism
– Push clock frequencies

SLIDE 5

NP Design Issues

System Design:
Organization of memory, interconnection, processing element (PE) and its local memory …
Inadequate performance data for the design of future network processors

SLIDE 6

Static Scheduling for NPs

Keep hardware simple by offloading complexity onto the compiler
The compiler has a ‘global’ view of the program
Performance data for
– In-order superscalar (IOS)
– VLIW

SLIDE 7

Methodology

IMPACT Toolset (UIUC)
Architectures
– In-order superscalar
– VLIW
Compiler optimizations
– Classical
– Superblock
– Hyperblock
Applications
– Checksum computation: crc
– Deficit round robin scheduling: drr
– Shortest path computation: dijkstra
– Diffie-Hellman public key encryption/decryption: dh
– Reed-Solomon codec: reed_enc, reed_dec
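As an illustration of the kind of kernel in this application set, here is a minimal sketch of deficit round robin scheduling. It follows the textbook form of the algorithm and is not the benchmark code used in the study; the function name `drr_schedule` and its parameters are assumptions made for this example.

```python
from collections import deque

def drr_schedule(queues, quantum, rounds):
    """Textbook deficit round robin (illustrative, not the benchmark source).

    queues  : list of deques, one per flow, holding packet sizes in bytes
    quantum : credit in bytes added to each backlogged flow per round
    rounds  : number of passes over all flows
    Returns the transmission order as (flow_index, packet_size) tuples.
    """
    deficit = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0          # idle flows do not bank credit
                continue
            deficit[i] += quantum
            # Send head-of-line packets while the credit covers them.
            while q and q[0] <= deficit[i]:
                size = q.popleft()
                deficit[i] -= size
                sent.append((i, size))
            if not q:
                deficit[i] = 0          # queue drained: reset its credit
    return sent

flows = [deque([300, 300]), deque([500]), deque([100, 100, 100])]
order = drr_schedule(flows, quantum=300, rounds=2)
```

With a 300-byte quantum, the flow holding a 500-byte packet has to wait one round to build up enough credit, while the flow of small packets drains in the first round.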

SLIDE 8

The Superblock

Essentially a trace with single entry, multiple exits
Reduces bookkeeping required to support side entrances
Code motion with compiler controlled speculation
General speculation model for minimal hardware support.

SLIDE 9

The Hyperblock

Adds predicated execution for superblocks
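Predicated execution rests on if-conversion: conditional branches are replaced by predicates that guard individual operations. The sketch below only models the semantics in Python; real predication happens at the instruction level with hardware support. The function names are hypothetical and chosen for this example.

```python
def clamp_branchy(x, lo, hi):
    # Original form: two data-dependent conditional branches.
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_predicated(x, lo, hi):
    # If-converted form: compute the predicates first, then "execute"
    # every operation, nullifying those whose predicate is 0.
    p_lo = int(x < lo)
    p_hi = int(x > hi) & (1 - p_lo)   # guarded by the complement of p_lo
    p_mid = 1 - p_lo - p_hi           # neither of the other cases holds
    return p_lo * lo + p_hi * hi + p_mid * x

# Both forms agree on every input; only the control flow differs.
same = all(clamp_branchy(x, 0, 9) == clamp_predicated(x, 0, 9)
           for x in range(-5, 15))
```

The predicated form removes the branches but always executes all three guarded operations, which is one way to picture the predication overhead seen at low issue widths.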

SLIDE 10

Application Characteristics

Op-code Frequencies

– 40% integer operations
    Addition and shifts account for > 80% of ops
– SB optimizations do not change the op freq.
    No additional stress on resources
– HB optimizations reduce conditional branches by if-conversion.
    Predicate instructions account for 0-37%

SLIDE 11

Application Characteristics…

Branch Statistics

– Avg. branch prediction accuracy: 92.32%, with < 9% deviation
– Branch prediction accuracy for SB and HB is higher

SLIDE 12

Application Characteristics…

Block Size

– Indicative of potential parallelism
– BB Avg: 5 instructions
– SB/HB Avg: 13 instructions

SLIDE 13

Application Characteristics…

Cache Performance

– Effect of SB/HB on cache performance
– D$ unaffected
– I$: for equivalent cache sizes, the miss rate increases by 40%

SLIDE 14

Architectural Evaluation…

Speedup plots with perfect caches for VLIW
Up to 2.4x speedup with SB/HB optimization
Predication overhead at low issue widths
Performance gain from HB (over SB) at high issue < 8%
Leveling indicates decrease in processor utilization

SLIDE 15

Architectural Evaluation…

Effect of real cache

– Greater impact on VLIW than IOS
– However, the performance benefit of IOS over VLIW is less than 1.8%, suggesting VLIW for complexity effective designs
– Average network rates of 6.6Gbps @ 500MHz for drr

        IOS      VLIW
BB      1.06%    1.08%
SB      5.6%     6.8%
HB      5.6%     7.4%
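The 6.6Gbps figure quoted for drr can be restated per clock cycle with simple arithmetic. The sketch below does that conversion; the 64-byte packet size is an assumption introduced purely for illustration and does not come from the slides, and the function names are hypothetical.

```python
def bits_per_cycle(rate_bps, clock_hz):
    # Sustained bits processed per clock cycle.
    return rate_bps / clock_hz

def cycles_per_packet(packet_bytes, rate_bps, clock_hz):
    # Cycle budget per packet at the given line rate and clock.
    return packet_bytes * 8 / bits_per_cycle(rate_bps, clock_hz)

bpc = bits_per_cycle(6.6e9, 500e6)           # about 13.2 bits per cycle
budget = cycles_per_packet(64, 6.6e9, 500e6) # 64-byte packet is an assumption
```

Under that assumed packet size, the budget comes to roughly 39 cycles per packet.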

SLIDE 16

Frequency Effects

Increase in memory/FU latency (empirical)
Increase in performance not commensurate with frequency increase
– Performance improvement with doubled frequency (B-M1) is < 37%, (M1-M2) < 31%
Need for efficient latency hiding techniques
– (SMT, TCP) ?
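One way to read the sub-linear gains is a simple two-part model, which is an assumption of this sketch and not the model used in the study: part of the baseline execution time scales with clock frequency, and the rest (for example, memory latency that is fixed in absolute time) does not. The function names are hypothetical.

```python
def speedup(freq_ratio, fixed_fraction):
    # fixed_fraction: share of baseline time that does not shrink
    # when the clock frequency is raised by freq_ratio.
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / freq_ratio)

def fixed_fraction_for(observed_speedup, freq_ratio):
    # Invert speedup() to find the fixed share implied by a measurement.
    inv = 1.0 / freq_ratio
    return (1.0 / observed_speedup - inv) / (1.0 - inv)

# A 37% improvement from a doubled clock implies that, under this model,
# about 46% of the baseline time did not scale with frequency.
implied = fixed_fraction_for(1.37, 2.0)
```

Because the slides report improvements below 37%, this model would put the non-scaling share at 46% or more, which is consistent with the call for latency hiding techniques.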

SLIDE 17

Conclusions

This study provides performance data for statically scheduled processors, for networking applications
Operation frequencies differ from SPEC and Media applications
– Organization of FUs
High static branch prediction rates
– Make static scheduling attractive for networking applications
Speedup due to SB and HB optimizations can be as high as 2.4x

SLIDE 18

Conclusions…

HB optimizations improve performance by < 8%
– The additional complexity might not be justified
The performance advantage of an IOS over VLIW is less than 1.8%
– VLIW being CE might be more attractive
Simulation results show average network rates of 6.6Gbps for drr, at 500MHz for 8-issue VLIW with SB optimization
Need to exploit packet level parallelism

SLIDE 19

Thank You