

SLIDE 1

Eidgenössische Technische Hochschule Zürich École polytechnique fédérale de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich

EMMSEC 98: European Multimedia, Microprocessor Systems and Electronic Commerce Conference and Exposition

A Comparison of Two Gigabit SAN/LAN Technologies: SCI versus Myrinet

  • Ch. Kurmann, T. Stricker

Laboratory for Computer Systems, ETHZ - Swiss Federal Institute of Technology, CH-8092 Zurich

SLIDE 2

Motivation

■ Evaluation and comparison of Gigabit/s interconnects needs a common architectural denominator
■ We propose three different levels:
◆ highly optimized remote load/store operation
◆ optimized standard message passing library
◆ connection oriented LAN emulation

[Figure: transfer rate by protocol and API for large blocks (MByte/s) on CoPs-SCI, CoPs-Myrinet and the Cray T3D, comparing Raw Deposit, MPI (restricted semantics), MPI (full semantics) and TCP/IP]

SLIDE 3

Overview

■ Levels of comparison
■ Previous work
■ Technologies overview:
◆ PC Platform
◆ Myricom Myrinet
◆ Dolphin CluStar SCI
◆ SGI / Cray T3D
■ Typical transfer modes
■ Measurement results
■ Conclusion

SLIDE 4

Levels of Comparison

■ Three levels with different amounts of support by the operating system:
◆ DIRECT DEPOSIT:
✦ simple remote load/store operation
✦ performance is expected to be closest to the actual hardware peak performance
◆ MPI/PVM:
✦ optimized standard message passing library
✦ carefully coded parallel applications are expected to see this performance
◆ TCP/IP:
✦ connection oriented TCP/IP LAN emulation
◆ ....
■ Common architectural denominator

SLIDE 5

Previous Work

■ Previous studies report:
◆ maximum bandwidth numbers
◆ minimal latency numbers
◆ performance results for an entire application
■ The performance of an application depends on:
◆ redistribution of data stored in distributed arrays
◆ migration of data in a fine grain object store
■ We need a benchmark that covers data types beyond contiguous blocks of data (e.g. strided remote stores).

SLIDE 6

Direct Deposit

■ The deposit model requires a clean separation and different mechanisms for:
◆ control messages,
◆ data messages.
■ Data is "dropped" directly into the receiver's address space by the hardware, without active participation of the receiver process.
■ Allows copying of fine grained data involving complex access patterns such as strides (see the sketch below).
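
A minimal sketch of such a deposit-style strided store, assuming the interconnect exposes remote memory as a mapped address range (as the SCI load/store interface does); map_remote_segment is a hypothetical helper, not a real driver call:

    /* Sketch: deposit-style strided remote store.  The receiver
     * process is not involved; the hardware carries every store to
     * the remote memory.  map_remote_segment() is hypothetical. */
    #include <stddef.h>

    extern volatile double *map_remote_segment(int node, size_t bytes);

    /* Scatter n contiguous source elements into every stride-th slot
     * of the receiver's address space. */
    void deposit_strided(int node, const double *src,
                         size_t n, size_t stride)
    {
        volatile double *remote =
            map_remote_segment(node, n * stride * sizeof(double));
        for (size_t i = 0; i < n; i++)
            remote[i * stride] = src[i];   /* plain remote store */
    }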

SLIDE 7

Message Passing Libraries

■ Sender can send messages at any time, without waiting for the receiver to be ready
■ Buffering is often done at a higher level and involves the memory system of the end-points
■ Fine grain data accesses are implemented through buffer-packing / -unpacking

SLIDE 8

Message Passing Model

■ Different flavors for restricted and full postal semantics (see the MPI sketch below):
◆ non-buffering semantics: can be mapped directly to a fast direct deposit including synchronization
◆ buffering semantics: non-blocking operation allows sending at any time and leads to an additional copy operation

[Diagram: ping-pong message timelines (prog / lib / net on each node) contrasting the two flavors of send(B1,P1) / recv(B4) exchanges: non-buffering sends complete against pre-posted receives, buffering sends complete before the matching receives, separated by a barrier]
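
In MPI terms, a minimal sketch of the two flavors (assuming any standard MPI implementation; whether the non-buffering path really avoids the copy is up to the library):

    #include <mpi.h>
    #include <stdlib.h>

    #define N (1 << 20)            /* 1 MByte message */

    int main(int argc, char **argv)
    {
        int rank;
        char *data = malloc(N);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Non-buffering flavor: the synchronous send completes
             * only when the matching receive is posted, so the library
             * may deposit the payload directly (no extra copy). */
            MPI_Ssend(data, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);

            /* Buffering (full postal) flavor: the buffered send may
             * return before the receiver is ready, at the cost of one
             * additional copy into the attached intermediate buffer. */
            int bufsize = N + MPI_BSEND_OVERHEAD;
            char *tmp = malloc(bufsize);
            MPI_Buffer_attach(tmp, bufsize);
            MPI_Bsend(data, N, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
            MPI_Buffer_detach(&tmp, &bufsize);
            free(tmp);
        } else if (rank == 1) {
            MPI_Recv(data, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Recv(data, N, MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        free(data);
        MPI_Finalize();
        return 0;
    }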

SLIDE 9

Protocol Emulation

■ Popular APIs with a large base of existing software:
◆ UDP/IP - unreliable, connectionless network service
◆ TCP/IP - allows reliable connection-oriented communication
◆ NFS/IP - network file system
■ Protocol stacks are provided by the OS
■ Socket API and streams API are ubiquitous (see the sketch below)
■ It is unrealistic to recode all commercial web servers, databases or middleware systems for message passing APIs like MPI.
■ With IP support, gigabit networks can speed up much more than just scientific applications!
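
A minimal sketch of that ubiquitous socket API; the function name and port 5000 are arbitrary examples. Once the adapter's driver offers IP (LAN emulation), unmodified code like this runs over the gigabit SAN:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <string.h>

    /* Send a buffer over TCP/IP: reliable, connection-oriented. */
    int send_over_ip(const char *server_ip, const void *buf, size_t len)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(5000);
        addr.sin_addr.s_addr = inet_addr(server_ip);

        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            close(fd);
            return -1;
        }
        /* The kernel stack copies the data through its socket
         * buffers -- the copy overhead the TCP/IP results expose. */
        ssize_t sent = write(fd, buf, len);
        close(fd);
        return sent == (ssize_t)len ? 0 : -1;
    }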

SLIDE 10

PC Platform for this Talk

■ Single/Twin Pentium Pro 200 MHz
■ Intel 440 FX Chipset
■ 64-bit 66 MHz main memory interface, 0.5 GByte/s
■ 32-bit 33 MHz PCI bus, 132 MByte/s
■ ~ 3000 per node

SLIDE 11

Myricom Myrinet

■ Two 1.28 Gbit/s channels (duplex) connecting hosts and (4, 8, 16-port) switches point-to-point
■ Supports any topology with switches, hot configurable
■ Wormhole routing with link level flow control guarantees the delivery of messages
■ Checksumming for error detection
■ Packets of arbitrary length (unlimited MTU) can encapsulate any type of packet

SLIDE 12

Myricom Myrinet Adapter

■ RISC processor (LANai)
■ 1 MB SRAM to store the MCP and to act as staging memory for buffering packets
■ Bus master DMA adapter-to-host (for the PCI)
■ Two DMAs between memory and network FIFOs
■ Concurrent operation of DMAs

[Diagram: Myrinet adapter - LANai RISC and local memory on the PCI bus, connected through the PCI-Memory bridge and host bus to the Pentium Pro and main memory, with DMA engines in between]

SLIDE 13

Myrinet Control Program

■ The LANai is a 32-bit dual-context RISC processor with 24 general purpose registers that runs the Myrinet Control Program (MCP)
■ A typical MCP provides:
◆ routing table management
◆ send / receive operation
◆ gathering operation
◆ scattering operation
◆ checksumming
◆ control message generation
◆ interrupt generation upon arrival

SLIDE 14

Dolphin CluStar SCI

■ Two unidirectional 1.6 Gbit/s links (CluStar: 3.2 Gbit/s)
■ Multidimensional rings and switched ringlets
■ Protocol uses data sizes of 16, 64, 256 Bytes
■ Transparent PCI-to-PCI bridge operation through memory mapped load/store interface
■ Possibility for fully coherent shared memory on high end implementations beyond PCI products
■ Per word remote memory and block transfers for message passing operation

SLIDE 15

Dolphin CluStar SCI Adapter

■ Protocol engine
◆ 8 x 64 Byte stream buffers
◆ PCI-SCI memory address mapping by ATT
◆ Bus master DMA
■ Link controller
◆ contains 3 FIFOs (TX, RX, Transit)
■ The PCI-adapter supports a subset of IEEE-SCI without hardware cache coherency

[Diagram: Dolphin SCI adapter - PCI-SCI bridge on the PCI bus, connected through the PCI-Memory bridge and host bus to the Pentium Pro and main memory]

SLIDE 16

SGI / Cray T3D as Reference Point

■ 150 MHz 64-bit DEC Alpha
■ No virtual memory
■ ca. 1.28 Gbit/s per link
■ 3D torus topology
■ Memory mapped network interface to send remote stores
■ Fetch/deposit engine with separate memory bus (no involvement of the processor)

[Diagram: Cray T3D node - DEC Alpha 21064 with send annex and deposit/fetch engine attached to the memory bus and NI]

SLIDE 17

Typical Transfer Modes

■ Peak bandwidth for large block transfers (zero-copy)
■ Reduced bandwidth for remote memory operation including fine grain accesses to the memory system
■ There are two modes for fine grain transfers, processor driven versus DMA driven (see the sketch below):
◆ remote loads/stores by either the processor or the DMA (Direct Deposit Model)
◆ buffer-packing/-unpacking at the sender/receiver by either the processor or the DMA (Messaging Model)
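
A sketch of the buffer-packing half of the messaging model: the sender gathers strided data into a contiguous staging buffer and the receiver scatters it back, costing one extra pass over the memory system on each side (function names are illustrative):

    #include <stddef.h>

    /* Sender side: gather every stride-th element into a contiguous
     * buffer so the network can move one large block. */
    void pack(double *contig, const double *src, size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)
            contig[i] = src[i * stride];
    }

    /* Receiver side: scatter the contiguous block back into its
     * strided destination. */
    void unpack(double *dst, const double *contig, size_t n, size_t stride)
    {
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = contig[i];
    }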

SLIDE 18

Myricom Myrinet

[Diagram: direct mapped transfer versus buffer-packing transfer through the Myrinet adapter (host memory, PCI-Memory bridge, LANai RISC and adapter memory, network)]

SLIDE 19

Deposit on Myrinet

[Plot: Intel Pentium Pro (200 MHz) with Myrinet - throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for local memory; remote memory, direct; and remote memory, DMA plus unpack (one point annotated at 126 MByte/s)]
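
A sketch of the kind of micro-benchmark loop behind such curves, timing strided stores into a destination buffer (local or remote-mapped); the buffer sizes and timing harness are illustrative, not the original benchmark:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    /* Time n strided stores and return the achieved rate in MByte/s. */
    double store_throughput(volatile double *dst, const double *src,
                            size_t n, size_t stride)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = src[i];
        gettimeofday(&t1, NULL);
        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_usec - t0.tv_usec) * 1e-6;
        return n * sizeof(double) / secs / 1e6;
    }

    int main(void)
    {
        size_t strides[] = { 1, 2, 4, 8, 16, 32, 64 };
        size_t n = 1 << 17;                            /* 1 MByte of doubles */
        double *src = calloc(n, sizeof(double));
        double *dst = calloc(n * 64, sizeof(double));  /* room for max stride */
        for (size_t s = 0; s < sizeof strides / sizeof *strides; s++)
            printf("stride %2zu: %6.1f MByte/s\n", strides[s],
                   store_throughput(dst, src, n, strides[s]));
        free(src);
        free(dst);
        return 0;
    }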

SLIDE 20

Deposit Dolphin CluStar SCI

[Diagram: direct mapped transfer versus buffer-packing transfer through the SCI adapter (host memory, PCI-Memory bridge, PCI-SCI bridge, network)]

SLIDE 21

Deposit on SCI

[Plot: Intel Pentium Pro (200 MHz) with SCI (CluStar) interconnect - throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for local memory; remote memory, direct; and remote memory, DMA plus unpack]

SLIDE 22

SGI / Cray T3D

[Diagram: direct mapped transfer versus buffer-packing transfer on the Cray T3D (DEC Alpha 21064, send annex, deposit/fetch engine, memory bus, NI)]

SLIDE 23

Deposit on SGI / Cray T3D

[Plot: Cray T3D, copies to local and remote memory - throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for local memory; remote memory, direct; and remote memory, unpack at the receiver]

SLIDE 24

Raw Block Transfers Myrinet, SCI

[Plot: transfers of different sized blocks (raw, contiguous) - throughput (MByte/s) versus block size (4 Byte to 512 KByte) for Dolphin CluStar SCI, Dolphin SCI 1 and Myrinet]

SLIDE 25

MPI Transfer Myrinet

[Plot: Myrinet, fastest MPI block transfers of different sizes - throughput (MByte/s) versus block size (4 Byte to 1 MByte) for restricted semantics and full postal semantics]

SLIDE 26

MPI Transfer SCI (Scali MPI)

[Plot: SCI, fastest MPI block transfers of different sizes - throughput (MByte/s) versus block size (4 Byte to 1 MByte) for restricted semantics and full postal semantics]

SLIDE 27

Summary and Comparison

■ Raw deposit bandwidth compared to MPI (blocking and non-blocking)
■ Blocking MPI matches direct deposit bandwidth
■ Non-blocking calls suffer from buffering
■ TCP/IP performance results confirm the overhead of copies

[Figure: transfer rate by protocol and API for large blocks (MByte/s) on CoPs-SCI, CoPs-Myrinet and the Cray T3D, comparing Raw Deposit, MPI (restricted semantics), MPI (full semantics) and TCP/IP]

SLIDE 28

Conclusion

■ Three different levels of system software support for Gbit/s networks permit a good comparison between different networking technologies based on micro-benchmarks.
■ SCI, Myrinet and MPPs have excellent performance for contiguous blocks, but for strided data the performance of PCI-adapters collapses.
■ MPI with buffering semantics suffers from the poor memory copy performance, whereas ‘zero copy’ MPI offers better speed.
■ PCI card interconnects will get into difficulties with applications that require complex remote memory operations or high-level networking protocols.