
A Comparison of Two Gigabit SAN/LAN Technologies: SCI versus Myrinet (PowerPoint PPT Presentation)



  1. EMMSEC 98: European Multimedia, Microprocessor Systems and Electronic Commerce Conference and Exposition
     A Comparison of Two Gigabit SAN/LAN Technologies: SCI versus Myrinet
     Ch. Kurmann, T. Stricker
     Laboratory for Computer Systems, ETHZ - Swiss Federal Institute of Technology, CH-8092 Zurich
     (Eidgenössische Technische Hochschule Zürich / Ecole polytechnique fédérale de Zurich / Politecnico federale di Zurigo / Swiss Federal Institute of Technology Zurich)

  2. Motivation
     ■ Evaluation and comparison of Gigabit/sec interconnects need a common architectural denominator
     ■ We propose three different levels:
       ◆ highly optimized remote load/store operation
       ◆ optimized standard message passing library
       ◆ connection oriented LAN emulation
     [Chart: transfer rate (MByte/s) by protocol and API for large blocks on CoPs-SCI, CoPs-Myrinet and the Cray T3D, comparing Raw Deposit, MPI (restricted semantics), MPI (full semantics) and TCP/IP]

  3. Overview
     ■ Levels of comparison
     ■ Previous work
     ■ Technologies overview:
       ◆ PC Platform
       ◆ Myricom Myrinet
       ◆ Dolphin CluStar SCI
       ◆ SGI / Cray T3D
     ■ Typical transfer modes
     ■ Measurement results
     ■ Conclusion

  4. Levels of Comparison
     ■ Three levels with different amounts of support by the operating system:
       ◆ DIRECT DEPOSIT:
         ✦ simple remote load/store operation
         ✦ performance is expected to be closest to the actual hardware peak performance
       ◆ MPI/PVM:
         ✦ optimized standard message passing library
         ✦ carefully coded parallel applications are expected to see this performance
       ◆ TCP/IP:
         ✦ connection oriented TCP/IP LAN emulation
       ◆ ...
     ■ Common architectural denominator

  5. Previous Work
     ■ Previous studies report:
       ◆ maximum bandwidth numbers
       ◆ minimal latency numbers
       ◆ performance results for an entire application
     ■ Application performance depends on:
       ◆ redistribution of data stored in distributed arrays
       ◆ migration of data in a fine grain object store
     ■ We need a benchmark that covers data types beyond contiguous blocks of data (e.g. strided remote stores).

  6. Direct Deposit
     ■ The deposit model requires a clean separation and different mechanisms for:
       ◆ control messages,
       ◆ data messages.
     ■ Data is “dropped” directly into the receiver's address space by the hardware without active participation of the receiver process.
     ■ Allows copying fine grained data involving complex access patterns like strides.
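
     The deposit model is easiest to picture when the remote node's memory is exposed through a memory-mapped window, as the SCI adapter described later provides. The following is a minimal C sketch under that assumption; the segment type and names are illustrative, not a particular driver API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical handle for a remote memory segment that the NIC has
 * mapped into our address space (e.g. through a PCI-SCI bridge).    */
typedef struct {
    volatile uint64_t *base;   /* local view of the remote segment   */
} remote_segment_t;

/* Direct deposit of a strided column: every word is stored straight
 * into the receiver's address space; the receiving process does not
 * participate and no intermediate buffer is packed or unpacked.      */
static void deposit_strided(remote_segment_t *seg, size_t dst_off,
                            const uint64_t *src, size_t nwords,
                            size_t stride /* in words */)
{
    for (size_t i = 0; i < nwords; i++) {
        /* each plain store becomes a remote write issued by the NIC */
        seg->base[dst_off + i * stride] = src[i];
    }
}
```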

  7. Message Passing Libraries
     ■ Sender can send messages at any time, without waiting for the receiver to be ready
     ■ Buffering is often done at a higher level and involves the memory system of the end-points
     ■ Fine grain data accesses are implemented through buffer-packing / -unpacking
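
     In contrast to direct deposit, a library that only ships contiguous messages has to pack strided data at the sender and unpack it at the receiver. A small sketch of those two extra copies (the helper names are made up for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a strided source into a contiguous message buffer at the sender. */
static void pack_strided(uint64_t *msgbuf, const uint64_t *src,
                         size_t nwords, size_t stride)
{
    for (size_t i = 0; i < nwords; i++)
        msgbuf[i] = src[i * stride];        /* copy #1, at the sender   */
}

/* Unpack the contiguous message back into a strided layout at the receiver. */
static void unpack_strided(uint64_t *dst, const uint64_t *msgbuf,
                           size_t nwords, size_t stride)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i * stride] = msgbuf[i];        /* copy #2, at the receiver */
}
```

     Both copies go through the memory systems of the end-points, which is exactly the overhead the buffer-packing curves in the measurements below quantify.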

  8. Message Passing Model
     ■ Different flavors for restricted and full postal semantics:
       ◆ non-buffering semantics: can be mapped directly to a fast direct deposit including synchronization
       ◆ buffering semantics: non-blocking operation allows sending at any time and leads to an additional copy operation
     [Diagram: two ping-pong timelines (prog / lib / net / lib / prog). Restricted semantics: send(B1,P1) is matched by a posted recv(B2), and recv(B4) by send(B3,P0). Buffering semantics: both sides issue send(B1,P1) / send(B3,P0) first, synchronize at a barrier, then issue recv(B4) / recv(B2).]
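
     The two timelines can be written down as ordinary MPI ping-pong code. The sketch below uses standard MPI calls; buffer sizes and tags are arbitrary and the code is illustrative, not the benchmark actually used:

```c
#include <mpi.h>

#define N 4096     /* arbitrary message length (doubles) */

/* Restricted (non-buffering) semantics: the receive is posted before the
 * matching message arrives, so the library can deposit it directly into
 * the user buffer, synchronization included.                             */
static void pingpong_restricted(int rank, double *sendbuf, double *recvbuf)
{
    int peer = 1 - rank;
    if (rank == 0) {
        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    }
}

/* Full (buffering) semantics: both sides send before either receives,
 * so the library must stage the data in an intermediate buffer, which
 * costs an additional copy.                                             */
static void pingpong_buffered(int rank, double *sendbuf, double *recvbuf)
{
    int peer = 1 - rank;
    MPI_Request req;

    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Barrier(MPI_COMM_WORLD);               /* as in the slide timeline */
    MPI_Recv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```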

  9. Protocol Emulation
     ■ Popular API → much software
       ◆ UDP/IP - unreliable, connectionless network service
       ◆ TCP/IP - allows reliable connection-oriented communication
       ◆ NFS/IP - network file system
     ■ Protocol stacks are provided by the OS
     ■ Socket API and streams API are ubiquitous
     ■ It is unrealistic to recode all commercial web servers, databases or middleware systems for message passing APIs like MPI.
     ■ With IP support, gigabit networks can speed up much more than just scientific applications!
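
     The point of the IP emulation is that unmodified socket code keeps working: the sketch below is an ordinary BSD-socket TCP client that would run over the SCI or Myrinet IP emulation just as it does over Ethernet (address and port are placeholders):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int send_over_lan_emulation(const char *msg)
{
    struct sockaddr_in peer;
    int fd = socket(AF_INET, SOCK_STREAM, 0);     /* ordinary TCP socket */
    if (fd < 0)
        return -1;

    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                     /* placeholder port    */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr); /* placeholder address */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        close(fd);
        return -1;
    }
    write(fd, msg, strlen(msg));   /* travels over SCI or Myrinet via IP */
    close(fd);
    return 0;
}
```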

  10. PC Platform for this Talk
      ■ Single/Twin Pentium Pro 200 MHz
      ■ Intel 440 FX Chipset
      ■ 64-bit 66 MHz main memory interface, 0.5 GByte/s
      ■ 32-bit 33 MHz PCI bus, 132 MB/s
      ■ ~ 3000 per node

  11. Myricom Myrinet
      ■ Two 1.28 Gbit/s channels (duplex) connecting hosts and (4, 8, 16-port) switches point-to-point
      ■ Supports any topology with switches, hot configurable
      ■ Wormhole routing with link level flow control guarantees the delivery of messages
      ■ Checksumming for error detection
      ■ Packets of arbitrary length (unlimited MTU) → can encapsulate any type of packet

  12. Myricom Myrinet Adapter
      ■ RISC processor (LANai)
      ■ 1 MB SRAM to store the MCP and to act as staging memory for buffering packets
      ■ Bus master DMA adapter-to-host (for the PCI)
      ■ Two DMAs between memory and network FIFOs
      ■ Concurrent operation of DMAs
      [Diagram: Pentium Pro host with PCI-memory bridge and memory bus; the Myrinet NI on the PCI bus contains the LANai RISC, local memory and DMA engines]

  13. Myrinet Control Program
      ■ The LANai is a 32-bit dual-context RISC processor with 24 general purpose registers that runs the Myrinet Control Program (MCP)
      ■ A typical MCP provides:
        ◆ control message generation,
        ◆ routing table management,
        ◆ scattering operation,
        ◆ gathering operation,
        ◆ interrupt generation upon arrival,
        ◆ checksumming,
        ◆ send / receive operation
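
      The cooperation between host and MCP can be pictured as a descriptor written into the adapter's memory-mapped SRAM, which the LANai then services with its DMA engines. The layout and handshake below are purely illustrative assumptions, not the real MCP interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Purely illustrative send descriptor: the real MCP defines its own
 * layout in the adapter's 1 MB SRAM; these fields are made up.       */
struct send_desc {
    volatile uint32_t host_addr;   /* physical address of the payload   */
    volatile uint32_t length;      /* bytes to transfer                 */
    volatile uint32_t route;       /* source route to the destination   */
    volatile uint32_t valid;       /* set last: tells the LANai to go   */
};

/* The host fills in a descriptor in the memory-mapped SRAM; the LANai
 * polls it, programs the host-to-SRAM DMA and then the SRAM-to-network
 * DMA, and finally clears the valid flag.                              */
static void post_send(volatile struct send_desc *d,
                      uint32_t phys_addr, uint32_t len, uint32_t route)
{
    d->host_addr = phys_addr;
    d->length    = len;
    d->route     = route;
    d->valid     = 1;              /* written last, so the MCP only ever
                                      sees a complete descriptor         */
    while (d->valid)               /* wait until the MCP has consumed it */
        ;
}
```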

  14. Dolphin CluStar SCI
      ■ Two unidirectional 1.6 Gbit/s links (CluStar: 3.2 Gbit/s)
      ■ Multidimensional rings and switched ringlets
      ■ Protocol uses data sizes of 16, 64, 256 Bytes
      ■ Transparent PCI-to-PCI bridge operation through a memory mapped load/store interface
      ■ Possibility for fully coherent shared memory on high end implementations beyond PCI products
      ■ Per word remote memory and block transfers for message passing operation

  15. Dolphin CluStar SCI Adapter
      ■ Protocol engine
        ◆ 8 64-Byte stream buffers
        ◆ PCI-SCI memory address mapping by ATT
        ◆ Busmaster DMA
      ■ Link controller
        ◆ Contains 3 FIFOs (TX, RX, Transit)
      ■ The PCI adapter supports a subset of IEEE-SCI without hardware cache coherency
      [Diagram: Pentium Pro host with PCI-memory bridge and memory bus; the SCI NI on the PCI bus contains the PCI-SCI bridge and a DMA engine]

  16. SGI / Cray T3D as Reference Point
      ■ 150 MHz 64-bit DEC Alpha 21064
      ■ No virtual memory
      ■ ca. 1.28 Gbit/s per link
      ■ 3D torus topology
      ■ Memory mapped network interface to send remote stores
      ■ Fetch/deposit engine with separate memory bus (no involvement of the processor)
      [Diagram: DEC Alpha 21064 with send annex on the host bus; the NI's deposit/fetch engine connects directly to memory]

  17. Typical Transfer Modes
      ■ Peak bandwidth for large block transfers (zero-copy)
      ■ Reduced bandwidth for remote memory operation including fine grain accesses to the memory system
      ■ There are two modes for fine grain transfers, processor driven versus DMA driven:
        ◆ Remote loads/stores by either the processor or the DMA (Direct Deposit Model)
        ◆ Buffer-packing/-unpacking at the sender/receiver by either the processor or the DMA (Messaging Model)
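
      The two modes differ mainly in where the fine grain work happens. In the DMA-driven messaging mode the processor first packs the strided data into a contiguous, DMA-able buffer and only then hands the whole block to the adapter. The sketch below illustrates that hand-off; dma_block_send() is a hypothetical name standing in for whatever block-transfer primitive the adapter driver provides:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver primitive: ship one contiguous block to a remote
 * node via the adapter's DMA engine (not an actual API).               */
int dma_block_send(int node, const void *buf, size_t bytes);

/* DMA-driven transfer of a strided column: the processor packs the data
 * into a contiguous staging buffer, then a single block DMA moves it.
 * The receiver has to unpack it again, which adds one more copy.        */
static int send_strided_via_dma(int node, const uint64_t *src,
                                size_t nwords, size_t stride,
                                uint64_t *staging /* contiguous, DMA-able */)
{
    for (size_t i = 0; i < nwords; i++)
        staging[i] = src[i * stride];        /* pack (processor copy)     */

    return dma_block_send(node, staging, nwords * sizeof(uint64_t));
}
```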

  18. Myricom Myrinet
      [Diagram: two Pentium Pro nodes connected over the Myrinet network, each with a PCI-memory bridge, memory bus and a Myrinet NI (LANai RISC, local memory, DMA engine) on the PCI bus. Left: direct mapped transfer. Right: buffer-packing transfer.]

  19. Deposit on Myrinet
      [Chart: Intel Pentium Pro (200 MHz) with Myrinet. Throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for copies to local memory, to remote memory direct, and to remote memory via DMA plus unpack.]
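
      The curves come from measuring copy throughput as a function of the store stride. A sketch of such a microbenchmark loop is shown below; the timer and loop are illustrative, not the authors' code, and dst points either at a local buffer or at a mapped remote segment:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Copy nwords 64-bit words from a contiguous source into dst with the
 * given store stride and return the achieved rate in MByte/s.
 * dst must have room for nwords * stride words.                        */
static double strided_store_mbps(volatile uint64_t *dst, const uint64_t *src,
                                 size_t nwords, size_t stride)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < nwords; i++)
        dst[i * stride] = src[i];            /* stride 1 = contiguous    */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (nwords * sizeof(uint64_t)) / (secs * 1e6);
}
```

      Sweeping the stride from 1 to 64 reproduces the x-axis of the plots on this and the following measurement slides.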

  20. Deposit Dolphin CluStar SCI
      [Diagram: two Pentium Pro nodes connected over the SCI network, each with a PCI-memory bridge, memory bus and a PCI-SCI bridge NI with DMA on the PCI bus. Left: direct mapped transfer. Right: buffer-packing transfer.]

  21. Deposit on SCI
      [Chart: Intel Pentium Pro (200 MHz) with SCI CluStar interconnect. Throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for copies to local memory, to remote memory direct, and to remote memory via DMA plus unpack.]

  22. SGI / Cray T3D
      [Diagram: two Cray T3D nodes (DEC Alpha 21064 with send annex, NI with deposit/fetch engine, memory) connected over the network. Left: direct mapped transfer. Right: buffer-packing transfer.]

  23. Deposit on SGI / Cray T3D
      [Chart: Cray T3D, copies to local and remote memory. Throughput (MByte/s) versus store stride (1: contiguous, 2-64: strided) for copies to local memory, to remote memory direct, and to remote memory with unpack at the receiver.]
