


Eidgenössische Technische Hochschule Zürich Ecole polytechnique fédérale de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich

Ninth IEEE International Symposium on High Performance Distributed Computing, Pittsburgh, Pennsylvania, August 1-4, 2000

Speculative Defragmentation –

A Technique to Improve the Communication Software Efficiency for Gigabit Ethernet

  • Ch. Kurmann, F. Rauch, M. Müller, T. Stricker

Laboratory for Computer Systems ETHZ - Swiss Institute of Technology CH-8092 Zurich

Comm. Speeds of Commodity PCs

→ For Gigabit Ethernet with TCP/IP, the OS software cannot keep up with the hardware speed

[Bar chart: transfer rates in MByte/s for MPI-Linux 2.0-BIP, MPI-Linux 2.2, TCP-Linux 2.2 and TCP-Windows NT. Gigabit Ethernet (32-bit PCI): roughly 20 and 35 MByte/s; Myrinet (32-bit PCI): roughly 42 and 126 MByte/s.]

Overview

  • Why Gigabit Ethernet
  • Packet Defragmentation
  • TCP/IP Overheads
  • Speculative Packet Defragmentation
  • Performance Analysis
  • Conclusion
Problem Statement

How can we sustain network bandwidths of 75-100 MByte/s on a commodity PC cluster node, given:

  • memory copy 90 MByte/s
  • 32-bit PCI I/O-bus 132 MByte/s
  • commodity Gigabit Ethernet adapter 100 MByte/s
  • standard TCP/IP protocol
  • fully transparent standard socket-API
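To see why the listed component speeds make this hard, a back-of-envelope model (not from the slides; a deliberately simple serial-stage model) helps: if every received byte must cross the PCI bus and then be copied once by the driver and once across the user/kernel boundary, each copy bounded by the 90 MByte/s memory-copy rate, the sustainable rate is the harmonic combination of the stage rates.

```python
def effective_rate(stage_rates_mbs):
    """Sustained rate in MByte/s when each byte passes through every
    stage in sequence (a deliberately simple serial pipeline model)."""
    return 1.0 / sum(1.0 / r for r in stage_rates_mbs)

# 132 MByte/s PCI transfer plus a driver copy and a user/kernel copy,
# each limited by the 90 MByte/s memory-copy rate:
with_two_copies = effective_rate([132, 90, 90])   # ~33.6 MByte/s
bus_limited     = effective_rate([132])           # no copies: PCI-limited
```

Under this model the two software copies alone push the node far below the 75-100 MByte/s target, which is why the rest of the talk focuses on eliminating them.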
Papers 10 Years Ago

The same problem — 30-100 times slower

  • memory copy < 3 MByte/s
  • VME I/O-bus < 3 MByte/s
  • commodity 10BaseT Ethernet adapter 1 MByte/s
  • special purpose blast transfer protocol [Zwaenepoel85]
  • optimistic bulk transfers [Carter89]
  • transparent blasts by header padding [Peterson90]

Not a standard protocol & not fully transparent → the solutions did not find their way into current systems!

Why Gigabit Ethernet

  • Compatible with Ethernet and Fast Ethernet (UTP Cat 5)
  • Uncomplicated technology, which results in high reliability and low cost
  • Switched Ethernet provides link-level flow control on full-duplex channels
  • In larger networks only an unacknowledged, connectionless datagram delivery service → TCP needed
  • Standard frame size is still limited to 46-1500 bytes of data

Alternatives / Extensions

  • Dedicated network hardware with customized lightweight protocols: Myrinet, SCI, Giganet, ServerNet → primarily designed for internal communication in server farms
  • Jumbo Frames (9 KByte) for Gigabit Ethernet to reach a Maximal Transfer Unit (MTU) of a memory page:
  • change of the standard
  • higher latencies in store-and-forward switches
  • do not solve the header/payload separation
Packet [De]Fragmentation

  • IP standard technique
  • Data to be sent is fragmented into small chunks < network MTU (Maximal Transfer Unit)
  • Network protocols enclose the frames with a header/trailer
  • The receiver separates the headers from the payload and defragments the data again
  • Implications for Ethernet:
  • MTU < memory page
  • DMA logic not optimal

→ Therefore a memory copy is needed for packet [de]fragmentation
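The fragment sizes that standard IP fragmentation produces for one 4 KByte page can be sketched as follows (a model of the arithmetic, not driver code): each fragment carries at most MTU − 20 bytes of IP payload, rounded down to a multiple of 8 because fragment offsets are counted in 8-byte units, and the first fragment's payload also contains the 20-byte TCP header.

```python
def fragment_payloads(data_len=4096, mtu=1500, ip_hdr=20, tcp_hdr=20):
    """User-data bytes carried by each IP fragment of one TCP segment."""
    # Max IP payload per fragment, aligned to the 8-byte fragment-offset unit
    max_ip_payload = (mtu - ip_hdr) // 8 * 8          # 1480 for Ethernet
    payloads, remaining, first = [], data_len, True
    while remaining > 0:
        room = max_ip_payload - (tcp_hdr if first else 0)
        payloads.append(min(room, remaining))
        remaining -= payloads[-1]
        first = False
    return payloads

print(fragment_payloads())   # [1460, 1480, 1156] for a 4 KByte page
```

These are exactly the three fragment sizes (1460, 1480, 1156 bytes) that appear in the later slides.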

TCP/IP Host Overheads

  • Single largest overhead: copying and checksums → zero-copy techniques
  • Per-packet processing and interrupt overhead are also high → interrupt coalescing

[Bar chart: host CPU overhead for TCP/IP over Gigabit Ethernet (PII 400 MHz, Linux 2.2), in percent of CPU time, broken down into copy & checksum, interrupts, TCP/IP protocol processing and driver/DMA initialization.]
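The leverage of the two remedies named above can be illustrated with a trivial cost model (the numbers below are invented for illustration, not the slide's measurements): per-byte copy/checksum cost is attacked by zero-copy, while the per-interrupt cost is divided by the number of packets handled per interrupt.

```python
def cpu_us_per_kbyte(copy_us_per_kb, per_packet_us, irq_us,
                     mtu_bytes=1500, coalesce=1):
    """Host CPU microseconds per KByte received: per-byte copy/checksum
    cost plus per-packet protocol cost plus interrupt cost amortized
    over 'coalesce' packets per interrupt."""
    packets_per_kb = 1024 / mtu_bytes
    return copy_us_per_kb + packets_per_kb * (per_packet_us + irq_us / coalesce)

# Illustrative numbers only: 8 us/KB copying, 5 us/packet protocol, 10 us/interrupt
baseline  = cpu_us_per_kbyte(8.0, 5.0, 10.0, coalesce=1)
coalesced = cpu_us_per_kbyte(8.0, 5.0, 10.0, coalesce=8)   # 8 packets per interrupt
zero_copy = cpu_us_per_kbyte(0.0, 5.0, 10.0, coalesce=8)   # copying eliminated too
```

Even in this toy model, zero-copy attacks the largest term while coalescing only shrinks the interrupt term, matching the breakdown on this slide.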
OS Environment

[Diagram: data and control paths through the OS. User space: user application and middleware (CORBA, MPI; ORB marshalling and buffering) above the socket layer and user-mapped data pages. Kernel space: TCP/IP stack (protocol handling, packet generation), NIC driver, send and receive buffers, system page pool. NIC firmware sits behind the PCI bus with DMA. Previous work targets the protection-boundary copies; speculative defragmentation targets the driver copies.]

Required Technologies

  • Well-known solutions to eliminate the user/kernel copy:
  • User-Level Network Interface (U-Net) or Virtual Interface Architecture (VIA)
  • User/kernel shared memory (FBufs, IO-Lite), copy emulation or page remapping with copy-on-write
  • The driver copy remains for Gigabit Ethernet

→ Goal: elimination of the driver copy for packet defragmentation and header separation → true zero-copy

Commodity GE-Adapters

  • Until now, zero-copy support has only been available for "intelligent" network adapters (ATM, SiliconTCP)
  • Today's Gigabit Ethernet adapters are too simple:
  • no processor or TLBs on board
  • limited DMA capabilities
  • no protocol filtering implemented

→ A deterministic zero-copy implementation with commodity GE adapters is not possible!

  • Approach: make just the common case fast

→ Speculation techniques for defragmentation

Speculative Defragmentation I

  • Our driver manages to send/receive entire 4 KByte pages
  • Decomposition of 4 KByte IP packets into 3 IP fragments at driver level (standard IP fragmentation)
  • Attachment of headers to the payload data with a separate DMA descriptor

[Diagram: a 4 KByte page sent as three fragments carrying 1460, 1480 and 1156 bytes of payload. The first fragment carries Ethernet (14), IP (20) and TCP (20) headers; the second and third fragments carry Ethernet (14) and IP (20) headers. Each header and each payload chunk has its own DMA descriptor (status/length pair).]
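The descriptor setup in the figure can be modeled as follows (illustrative Python with hypothetical field names; the real driver programs the NIC's descriptor ring in C): for each expected fragment, one small descriptor receives the headers into a kernel buffer while a second descriptor points directly into the destination page, so the payload lands in place without a defragmentation copy.

```python
def build_rx_descriptors(page_base=0x0, payloads=(1460, 1480, 1156),
                         hdr_lens=(14 + 20 + 20, 14 + 20, 14 + 20)):
    """Two DMA descriptors per expected fragment: headers go to a small
    kernel buffer, payload bytes go straight into the user page."""
    descs, offset = [], 0
    for i, (hdr, data) in enumerate(zip(hdr_lens, payloads)):
        descs.append({"buf": f"hdr_buf_{i}", "len": hdr})       # headers -> kernel buffer
        descs.append({"buf": page_base + offset, "len": data})  # payload -> page, in place
        offset += data
    return descs

ring = build_rx_descriptors()
```

The first fragment's header descriptor covers Ethernet+IP+TCP (54 bytes), the later ones Ethernet+IP (34 bytes), matching the figure.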

Speculative Defragmentation II

What are we speculating about?

  • Speculation that all fragments of a whole page will be received in order
  • Speculation about the precise packet format (header lengths, data fields)
  • The receiver has to fix the DMA descriptors without knowledge about the next packets to arrive
  • In clusters with one or two switches, the probability is high that the three fragments arrive in order
  • Software cleanup on mis-speculation
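The receive-side decision can be sketched as a small check (illustrative Python; in the real driver this logic runs in the receive path): each arriving fragment is compared against the speculated source, offset and length, and the driver falls back to the ordinary copying path as soon as the speculation fails. Offsets here are IP fragment offsets, so the second fragment starts at 1480 even though the first carries only 1460 data bytes (its 1480-byte IP payload includes the 20-byte TCP header).

```python
def receive_page(frames, expected=((0, 1460), (1480, 1480), (2960, 1156))):
    """Return 'zero-copy' if all fragments arrive exactly as speculated
    (same source, in order, expected offset and length), else 'fallback':
    the driver cleans up and defragments by copying."""
    if len(frames) != len(expected):
        return "fallback"
    src = frames[0]["src"]
    for frame, (off, length) in zip(frames, expected):
        if frame["src"] != src or frame["offset"] != off or frame["len"] != length:
            return "fallback"
    return "zero-copy"
```

The common case (all three fragments in order from one sender) takes the fast path; any reordering or interleaved sender triggers the software cleanup.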
Speculative Defragmentation III

Fragmentation/Defragmentation of a 4 KByte memory page by the DMA of the network interface

Ethernet Network

zcdata header ... ... zcdata header sk_buff sk_buff Protocol Headers 4 KByte Page 1460 2nd 1480 1156 3rd

Fragmentation

1st 1460 2nd 1480 1156 1st

Defragmentation

3rd

Performance Evaluation

  • Gains by Successful Speculation
  • Penalty for Speculation Misses
  • Speculation Success Rates in Applications
  • Consequences:
  • Network Control Architecture
  • Suggested Hardware Improvements
Gains with Speculation

→ 80 % increase in performance (bandwidth)

[Bar chart: TCP/IP performance of Gigabit Ethernet, transfer rate in MByte/s. Linux 2.2 standard socket API: 42; speculative defragmentation with copying and zero-copy remapping with a copying driver (1 copy): 45 and 46; speculative defragmentation with zero-copy remapping: 65; speculative defragmentation with zero-copy FBufs (0 copy): 75.]

Penalty with Speculation Misses

→ The common case is fast, the fallback not much slower

[Bar chart: TCP/IP performance of Gigabit Ethernet, transfer rate in MByte/s, Linux 2.2. Compatibility operation (zero-copy sender, standard receiver): 45; standard operation (standard sender, standard receiver): 42; fallback (standard sender, zero-copy receiver): 35.]

Evaluation of Success Rates

  • Application traces show the success of speculative transfers
  • TreadMarks has an inherent scheduling that prevents interference
  • TPC-D needs a control architecture or hardware changes

Ethernet frames (total / large / zcopy / ok, with speculation success rate):

  TreadMarks SOR, Master:   68182 / 44010 / 44004 / 44004  (100 %)
  TreadMarks SOR, Host1:    50731 / 30419 / 30405 / 30399  (> 99 %)
  TreadMarks SOR, Host2:    51095 / 30707 / 30693 / 30675  (> 99 %)
  Oracle TPC-D, Host1:      62311 / 44848 / 41682 / 41682  (100 %)
  Oracle TPC-D, Host2:      67524 / 45877 / 37833 / 37833  (100 %)
  Oracle TPC-D, Master:    129835 / 90725 / 79515 / 38235  (48 %)

Network Control Architecture

  • Problem: multiple synchronous, fast receives may garble the zero-copy frames
  • Solution: admission control at the Ethernet driver level, with negotiation for one single sender to blast
  • Implicit channel allocation by the OS works
  • Fully transparent
  • No explicit scheduling of transfers through a special interface → the API remains the same

Suggested Hardware Improvements

  • Additional control path between the checksumming and DMA logic for detection of protocol & header fields → reliable header/payload separation
  • Stream detection with a simple matching register and a separate DMA descriptor chain for fast transfers:
  → detection of at least one high-performance stream
  → separation of this stream with its DMA descriptors
  → improvement of the speculation rate
  → lower driver complexity
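The proposed stream detection boils down to one match register that steers frames of a single flow onto the dedicated fast descriptor chain. A sketch of that classification logic (illustrative only; the flow-tuple fields are assumptions, not the chipset's actual register layout):

```python
def classify_frame(frame, match_register):
    """Steer a frame to the fast (zero-copy) descriptor chain if its
    flow tuple matches the single match register, else to the slow
    chain handled by the ordinary copying path."""
    key = (frame["src_ip"], frame["src_port"], frame["dst_port"])
    return "fast" if key == match_register else "slow"

# One high-performance stream is pinned in the match register:
match_reg = ("10.0.0.1", 5000, 5001)
```

With one stream separated in hardware, frames from other senders can no longer overwrite the speculated descriptors, which is how this raises the speculation rate while simplifying the driver.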

Conclusion

  • Speculation techniques open a new horizon for optimized network drivers and permit an "almost"-zero-copy implementation for TCP/IP over Gigabit Ethernet.
  • The performance of our implementation was raised from 42 to 75 MByte/s (80 %) using the standard Linux TCP stack and commodity network interface hardware.
  • Speculation works in network interfaces just as it does in instruction-level parallelism, and should be considered to find simple and effective hardware improvements for network interfaces.
  • Existing Ethernet protocols and standard network interface chipsets prevent an accurate, fully deterministic defragmentation in hardware.