SLIDE 1

Analysis of Techniques to Improve Protocol Processing Latency

David Mosberger, Patrick Bridges, Larry L. Peterson, and Sean O’Malley

The University of Arizona

e-mail: {davidm,bridges,llp}@cs.arizona.edu, sean@netapp.com

www: http://www.cs.arizona.edu/scout

SIGCOMM ’96

SLIDE 2

Latency: Where does it come from?

Speed of light?

Data-touching overheads?

– No: messages (data) are small.

Execution overheads?

– Too much code.
– Badly structured code.

SLIDE 3

Test Environment

Protocol stacks

– TCP/IP – RPC

Hardware platform

– 175 MHz Alpha
– 100 MB/s memory
– TURBOchannel bus
– 10 Mbps Ethernet

[Figure: x-kernel protocol graphs — TCP stack: TCPTEST / TCP / IP / VNET / ETH / LANCE; RPC stack: XRPCTEST / MSELECT / VCHAN / CHAN / BID / BLAST / IP / VNET / ETH / LANCE]

SLIDE 4

Starting Point

Data cache footprint

– padding – stack switching – info duplication

Tiny functions

Machine idiosyncrasies

– byte load/store – integer division

[Bar charts, Orig vs. Opt — cycle count: 18941 vs. 15688; instruction count: 5821 vs. 4750]

SLIDE 5

How fast is TCP/IP?

[Bar chart: instruction counts (scale 250–1500) for xk/Alpha, DUX/Alpha, and BSD/386, broken down into ipintr, other IP input, tcp_input, and other TCP input]

SLIDE 6

Latency Bottlenecks

Suspects:

– Frequent branching
– Instruction-cache gaps
– Cache collisions
– Layering overheads

Not the instruction/data translation buffers (TLB).

SLIDE 7

Techniques

Outlining attacks:

– frequent branching
– i-cache gaps

Cloning attacks:

– cache collisions

Path-inlining attacks:

– layering overheads

SLIDE 8

Outlining

Exception-handling code

– lots of it (up to 50%)
– dilutes the instruction cache
– causes taken branches

Remove from fast path

– annotate if-statements with branch probabilities
– move unlikely code to the end of the function

SLIDE 9

Outlining Example

Source:

    if (bad_case) {
        panic("bad day");
    }
    printf("good day");

Standard layout (error code sits on the fast path):

        load r0, (bad_case)
        jump_if_not r0, good_day    ; taken branch in the common case
        load_addr a0, "bad day"
        call panic
    good_day:
        load_addr a0, "good day"
        call printf

After outlining (error code moved to the end of the function):

        load r0, (bad_case)
        jump_if r0, bad_day         ; falls through in the common case
        load_addr a0, "good day"
        call printf
    continue:
        ...
        return
    bad_day:
        load_addr a0, "bad day"
        call panic
        jump continue

SLIDE 10

Cloning

Make copy of functions on fast path

– relocate to avoid conflict misses
– specialize for a particular use (partial evaluation)

Alternative layout algorithms

– micro-positioning
– bipartite layout

SLIDE 11

Outlining & Cloning Summary

[Figure: code layout of functions A and B — the standard layout interleaves frequently and infrequently executed instructions within each function; after outlining, the infrequently executed instructions are moved to the end of each function; after cloning, the frequently executed code of A and B is copied and relocated into adjacent clones]

SLIDE 12

Path-Inlining

Collapse deeply-nested functions

Assume the fast path is known

Compile the entire path as a single unit

Advantages

– Removes call overheads
– Increases context for the optimizer

SLIDE 13

End-to-End Latency

Round-trip time in µs:

[Bar charts — TCP: OPT 310.8, STD 351, BAD 498.8; RPC: OPT 365.5, STD 399.2, BAD 457.1]

SLIDE 14

Processing Latency

Processing time per round trip in µs:

[Bar charts — TCP: OPT 100.8, STD 141, BAD 288.8; RPC: OPT 155.5, STD 189.2, BAD 247.1]

SLIDE 15

Memory System Performance

[Bar charts: cycles per instruction for the OPT, STD, DUX, and BAD configurations, split into instruction-fetch (iCPI) and memory (mCPI) components — TCP values: 1.57, 1.72, 2, 1.61, 1.17, 1.58, 2.3, 4.58; RPC values: 1.67, 1.78, 1.69, 0.81, 1.69, 4.66]

SLIDE 16

Outlining Effectiveness

TCP

– No outlining: 79% of fetched instructions used, 21% unused
– With outlining: 85% used, 15% unused

RPC

– Essentially identical performance.

SLIDE 17

Conclusions

Instruction-cache bandwidth is a major bottleneck

Cache collisions are not particularly bad

Processor/memory gap still growing; now:

– 300 MHz processor
– 100 Mbps Ethernet
– 80 MB/s memory system

SLIDE 18

Conclusions

Outlining

– Readily applicable
– Relatively convenient

Cloning and path-inlining

– Require a “path” notion: see the Scout OS
– Need better (automatic) tools

SLIDE 19

Dynamics

[Figure: message-processing dynamics over the protocol graphs of slide 3 — processFrame(), xCall(), semSignal(), and semWait() interactions (labeled a, b, c) across the TCP stack (TCPTEST / TCP / IP / VNET / ETH / LANCE) and the RPC stack (XRPCTEST / MSELECT / VCHAN / CHAN / BID / BLAST)]
