Programmable NICs: What they mean for parallel middleware (and are they here to stay?)
Anthony Skjellum, Vijay Velusamy, ChangZheng Rao, Boris Protopopov*
Department of Computer Science, Mississippi State University
* Now with Mercury Computer
6/20/2002 2
MSU Motivations
- Long-term efforts in parallel programming design (MPI-1, MPI-2, MPI/RT, PacketWay)
- Interested in the performance triple (latency, overhead, bandwidth), not (latency, bandwidth) alone, for real applications
- Emphasis on lowering overhead and delivering
  - Predictability
  - Overlap of communication and computation
- Interested in offloading work to NICs (security, pt2pt primitives, collective operations, ...)
- Long-term interest in programmable NICs
Outline
- Introduction
- Survey of Programmable NICs
- Similarities, Differences
- Advantages, Disadvantages
- Economics of Programmability
- Conclusions
- Future work
Introduction
- Programmable NICs are great for research
  - Hosts
  - Switches
  - New algorithms, strategies, ideas
- Programmable NICs are flexible
- Often difficult to program
- More expensive per unit than ASICs
- Where do they fit in, when, and for how long?
Some Programmable NICs
- Myrinet LANai 2.x – 9.x (Myricom)
- Alteon (Netgear, Farallon)
- Quadrics ELAN 3 (successor to Meiko Computing Surface, which used ELAN)
- Sitera Prism
- IBM IB HCA (PPC-based)
Myrinet LANai
- Very popular NIC/network
- Simultaneous DMAs on host side (1 active) [since LANai 5]
- Simultaneous DMAs on network side (2 active) [since LANai 2]
- Chainable DMA (host + network send, host + network recv) [LANai 9]
- Programmable LANai processor (no cache, high-speed SRAM)
- No support for non-contiguous physical page DMA
- Not necessarily enough memory bandwidth for the LANai to execute many instructions while all DMAs are running (the degree varies with the generation of the NIC)
- Clock resolution for real-time purposes: 0.5 µs
- Supports user-space communication
- NICs moderately expensive; overall solution “cost effective/upscale”
- For the last few years, official firmware (GM, Glenn’s Messages) has been promoted in lieu of end users programming the NIC themselves
- GM-based overhead has not proven as low as that of ASIC-based SANs, such as Giganet cLAN...
Alteon NIC
- 2 MIPS processors
- Dual DMA engines, DMA assist engine
- Supports TCP/IP checksum offloading, interrupt coalescing, failover
- Features jumbo frames (bigger packets of data, which offload the CPU by reducing interrupts)
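The interrupt savings from jumbo frames and interrupt coalescing can be estimated with simple arithmetic. The sketch below is illustrative only: the 1500-byte and 9000-byte payload sizes are common Ethernet values rather than figures from these slides, header overhead is ignored, and the function names, the transfer size, and the coalescing factor are hypothetical.

```python
def frames_needed(transfer_bytes: int, payload_per_frame: int) -> int:
    """Frames required to carry transfer_bytes (headers ignored)."""
    # Integer ceiling division avoids floating-point rounding.
    return (transfer_bytes + payload_per_frame - 1) // payload_per_frame


def interrupts_needed(frames: int, frames_per_interrupt: int = 1) -> int:
    """Host interrupts raised when the NIC signals once per batch of frames."""
    return (frames + frames_per_interrupt - 1) // frames_per_interrupt


if __name__ == "__main__":
    transfer = 9_000_000  # assumed transfer size in bytes
    for label, payload in (("standard", 1500), ("jumbo", 9000)):
        frames = frames_needed(transfer, payload)
        coalesced = interrupts_needed(frames, frames_per_interrupt=10)
        print(f"{label}: {frames} frames, {coalesced} interrupts when coalescing 10")
```

Under these assumptions, larger frames and coalescing both reduce the interrupt count, and the two effects multiply.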
Quadrics ELAN 3 [4]
- Two primary processing engines
  - Microcode processor
  - Thread processor
- Supports four hardware threads (each can individually issue pipelined memory requests to the memory system)
- Full programmability with user thread(s)
- 32-bit RISC thread processor
- MMU contains a 16-entry, fully associative look-aside buffer and a small data path and state machine
- DMA engine prioritizes outstanding DMAs and time-slices large DMAs to prevent adverse blocking of small DMAs
- Only one packet in flight in the end-to-end protocol for ELAN 3; ELAN 4 fixes this (stop&wait vs. go-back-N link-layer protocols)
- Very expensive NICs; overall solution “high end”
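Why "only one packet in flight" matters can be seen from the textbook error-free sliding-window utilization formula. This is a generic sketch, not a model of the ELAN hardware: it assumes zero-length acknowledgements and no losses, and the function name and the timing values are hypothetical.

```python
def link_utilization(window: int, t_frame: float, t_prop: float) -> float:
    """Fraction of time the sender keeps the link busy.

    window  -- packets allowed in flight (1 = stop-and-wait)
    t_frame -- time to transmit one packet
    t_prop  -- one-way propagation delay
    """
    # One packet goes out, then its acknowledgement returns after 2 * t_prop.
    cycle = t_frame + 2.0 * t_prop
    return min(1.0, window * t_frame / cycle)


if __name__ == "__main__":
    # Assumed timings (arbitrary units): propagation delay twice the packet time.
    for window in (1, 4, 8):
        print(window, link_utilization(window, t_frame=1.0, t_prop=2.0))
```

With a window of one, the sender idles for a full round trip after every packet; a larger window (as in go-back-N) fills that idle time until the link saturates.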
Other Noteworthy NICs
- Sitera Prism
- Intel EtherExpressPro
- Alacritech
- VI/IP from Emulex
- IBM 1st generation IB HCA (PPC)
- Myrinet FI32
- Other NICs that hold host protocols
Similarities, Differences
- DMA engines
- Handling virtual memory, pages, MMU
- Division of labor/resources
  - what part of the protocol executes where
  - how notification is accomplished
  - how resources are divided among users of the NIC
Advantages, Disadvantages
- Advantages
  - Programmability means flexible uses in non-TCP/IP situations
  - Programming models: Message Passing, DSM, Transaction, …
  - Experimentation with different kinds of transports
  - Scale-relevant transport choices (performance vs. resource consumption)
  - Evolution (IB)
- Disadvantages
  - Processor overhead still incurred for things that don’t fit on the NIC
  - Relatively slow NICs in most cases
Economics of NIC Programmability
- Cost to develop/maintain/upgrade transports
- Cost to deliver NIC with programmability
- First generation parts: flexibility and low volume win out
- Next generation parts: flexibility not needed, high volume demands savings (read: ASIC)
- Even programmable parts may not be programmable in practice: a vendor may choose not to reveal the feature, because supporting it costs money and effort
Qualitative 1st vs. 2nd generation cost tradeoff
[Figure: aggregate whole cost vs. number of parts or volume, with one line for a 1st generation part (e.g., µP + FPGA(s)) and one for a 2nd generation part (e.g., ASIC), and the break-even point marked]
- Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds...
- Programmability a plus for first generation prototyping
- Programmability most often a casualty of economy of scale sought in 2nd generation
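The break-even point in this tradeoff can be worked out with a linear cost model: a fixed development cost plus a per-unit cost times volume. The sketch below is a minimal illustration using straight-line costs, which the slide treats as qualitative upper bounds; the function names and all dollar figures are hypothetical, not data from the talk.

```python
def total_cost(fixed_cost: float, unit_cost: float, volume: float) -> float:
    """Aggregate cost of fielding `volume` parts under a linear model."""
    return fixed_cost + unit_cost * volume


def break_even_volume(fixed_gen1: float, unit_gen1: float,
                      fixed_gen2: float, unit_gen2: float) -> float:
    """Volume at which the 2nd generation part becomes the cheaper option.

    Assumes the 2nd generation part (e.g., an ASIC) has the higher fixed
    cost and the lower per-unit cost.
    """
    if unit_gen1 <= unit_gen2:
        raise ValueError("2nd generation part must have the lower unit cost")
    return (fixed_gen2 - fixed_gen1) / (unit_gen1 - unit_gen2)


if __name__ == "__main__":
    # Hypothetical figures: programmable part vs. ASIC.
    volume = break_even_volume(100_000, 500, 2_000_000, 100)
    print(f"break-even at {volume:.0f} parts")
```

Below the break-even volume the programmable part is cheaper overall; above it, the high fixed cost of the ASIC is amortized and the ASIC wins.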
Myricom-like tradeoff
[Figure: aggregate whole cost vs. number of parts or volume across increasing product generations, ending with “ASIC someday?”]
- Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds...
Qualitative Costs of Programmability ($ and/or Performance)
- Capital cost:
  - additional cost of fielded system as a function of the nearest equivalent-cost non-programmable system
  - additional hardware needed to reach “level X of required performance” for acceptability of system (some app. suite)
- Cost of ownership:
  - additional maintenance of extra hardware/software needed because of any inefficiency, including extra failures of a larger system
  - lower efficacy of a higher-overhead system causes more cycles to be used on overhead-intolerant applications
Security Implications of Programmable NICs
- DMA memory operations potentially mean cross-SAN access to memory of disparate processes (security important)
- Memory of the NIC holds contents that can persist between sessions and co-exist for multi-level security (headache)
- Potential new opportunity for covert channels (the absence of covert channels cannot be proven, but the NIC appears to be a likely source)
- Some intrusion attacks: denial of service by messing up the NIC, malicious NIC code, using the NIC to damage a local or remote resource
- Some insider threats: copying data not belonging to your process, obtaining more than a fair share of resources in a QoS setting, executing code improperly on the NIC, using the NIC as a means to move data between unrelated processes in one system
Real Time Implications of Programmable NICs
- Known problem of coordinating two CPUs with a single schedule
- If the CPU drives the NIC in non-programmable mode, then all need for programmability is moot
- For example, Xinyan Zan found in her MS thesis that GM is not at all predictable
- Chakravarthi et al. implemented stringent Myrinet control programs to achieve high predictability (less bandwidth, higher latency, better predictability)
When is Offloading to NIC better than using the HOST?
- Limiting cases
  - zero-performance NIC -> use host [e.g., case of Meiko CS-1 ELAN]
  - very fast integer-performance NIC relative to host -> use NIC; host becomes like a “second level” (as with the Myricom 2-level multicomputer) or attached processor
- In between, a legitimately interesting constrained combinatorial optimization of (latency, overhead, bandwidth), aimed at minimizing application runtime
- Studying, modeling, and describing this problem is a key area of our current research effort (B. Protopopov is completing a PhD in this area at MSU)
When is Offloading to NIC better than using the HOST?
- Memory hierarchy also impacts whether a host or a NIC should do certain work
- A concurrency metric, (h/w_latency) * (h/w_bandwidth) / quantum_work, provides an upper bound on the number of concurrent activities needed in a host to mask latency from the network… how big is this number?
- quantum_work is the number of bytes for something useful
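To get a feel for how big the concurrency metric can be, it helps to plug in numbers. The sketch below evaluates (h/w_latency) * (h/w_bandwidth) / quantum_work directly; the function name and the sample latency, bandwidth, and quantum values are hypothetical and chosen only for illustration.

```python
def concurrency_bound(hw_latency_s: float,
                      hw_bandwidth_bytes_per_s: float,
                      quantum_work_bytes: float) -> float:
    """Upper bound on concurrent host activities needed to mask network latency.

    latency * bandwidth is the number of bytes in flight on the network;
    dividing by quantum_work gives how many useful units that represents.
    """
    bytes_in_flight = hw_latency_s * hw_bandwidth_bytes_per_s
    return bytes_in_flight / quantum_work_bytes


if __name__ == "__main__":
    # Assumed values: 10 microseconds latency, 250 MB/s bandwidth, 64-byte quantum.
    bound = concurrency_bound(10e-6, 250e6, 64)
    print(f"about {bound:.1f} concurrent activities")
```

The bound grows with the latency-bandwidth product and shrinks as each activity moves more useful bytes.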
What should be left for host?
- Connection management, flow control, per-connection-sized work/space [UNM/Sandia/Portals philosophy]
- Items that naturally travel up the memory hierarchy anyway
- Some of these items may be handled cheaply now, given additional thread concurrency (e.g., hyperthreading), whose cycles may be resources in excess anyway
Protopopov’s Analysis, Pt 1
- If CPU does protocol stack
  - There are potentially several cache-assisted memory copies of data
    - checksum, protocol-layer packet formatting
- If NIC does protocol stack
  - Does the NIC do the work in place (in host DRAM, in local memory)?
  - What is the implication of the cache flush/invalidate on moving data between NIC and host?
Protopopov’s Analysis, Pt 2
- System achieves a lower bound on performance when a component (e.g., the NIC) becomes a bottleneck, i.e., is 100% utilized
- System achieves a certain level of (higher) performance when the utilizations of its components are balanced; this provides a constraint curve to follow toward higher utilization
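One textbook way to reason about bottlenecks and balance is the utilization law: at a message throughput X, a component that spends D time units per message is utilized X * D, so the component with the largest D saturates first. The sketch below applies that law; it is not taken from Protopopov's analysis, and the component names and per-message demands are hypothetical.

```python
from typing import Dict, Tuple


def bottleneck(demands: Dict[str, float]) -> Tuple[str, float]:
    """Return the saturating component and the throughput at which it hits 100%."""
    name = max(demands, key=demands.get)
    return name, 1.0 / demands[name]


def utilizations(demands: Dict[str, float], throughput: float) -> Dict[str, float]:
    """Utilization of every component at the given throughput (utilization law)."""
    return {name: throughput * demand for name, demand in demands.items()}


if __name__ == "__main__":
    # Assumed per-message service demands in microseconds.
    demands = {"host": 2.0, "nic": 5.0, "link": 4.0}
    name, max_rate = bottleneck(demands)
    print(name, max_rate, utilizations(demands, max_rate))
```

In this example the NIC saturates while the host sits at 40% utilization; shifting work so that the demands are closer to equal raises the achievable throughput.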
Protopopov’s Analysis, Pt 3
- “Computational slack” [cf. BSP model] available to the application helps establish whether a NIC is specified
- Slack is the amount of computation available in the processors to manage communication overhead without increasing the overall application critical path (time to solution)
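The definition of slack suggests a simple check: communication overhead that fits within the slack is free, and only the excess lengthens the critical path. The sketch below encodes that reading; the function names and timing values are hypothetical, and treating "overhead exceeds slack" as the signal for NIC offload is an assumption made here for illustration.

```python
def time_to_solution(critical_path: float, comm_overhead: float, slack: float) -> float:
    """Critical path after absorbing as much overhead as the slack allows."""
    excess = max(0.0, comm_overhead - slack)
    return critical_path + excess


def offload_indicated(comm_overhead: float, slack: float) -> bool:
    """True when the host cannot hide the overhead within its slack."""
    return comm_overhead > slack


if __name__ == "__main__":
    # Assumed values in milliseconds.
    print(time_to_solution(100.0, 5.0, 8.0), offload_indicated(5.0, 8.0))
    print(time_to_solution(100.0, 12.0, 8.0), offload_indicated(12.0, 8.0))
```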
Summary/Conclusions
- Programmable NICs seem to disappear regularly, except for Myrinet
- Would Myrinet/Quadrics be better for production use without programmability?
- Only with “good” non-programmable choice(s)
  - high-performance, connection-oriented (e.g., VIPL, uDAPL)
  - large-scale, quasi-connectionless (e.g., Portals)
- How much do Myrinet/Quadrics customers pay for the flexibility (cost, performance)?
Summary/Conclusions
- Can R&D that depends on programmability be transitioned to industry? Maybe.
- What incentives exist for COTS products to be programmable in the long term?
Future Work
- Evolvable model of when to invest in programmable network parts for given middleware projects
- Detailed models of when NIC offloading benefits an application (or higher middleware like MPI)
- Understanding of where scalability is impacted by programmability (positively, negatively)
- Expanding models to programmable switches
Acknowledgements
- Srigurunath Chakravarthi (MSU/MSTI)
- Rossen Dimitrov (MSU/MSTI)
- Barney Maccabe (UNM, Albuquerque)
- Ron Brightwell (Sandia)
Selected References, I
- Protopopov, Boris V., and Anthony Skjellum. 2001. A Multithreaded Message Passing Interface (MPI) Architecture: Performance and Program Issues. Journal of Parallel and Distributed Computing, Vol. 61, No. 4, Apr 2001, pp. 449-466.
- Protopopov, Boris V., and Anthony Skjellum. 2000. Shared-memory communication approaches for an MPI message-passing library. Concurrency: Practice and Experience, 12(9), Aug 2000, pp. 799-820.
- Skjellum, A., B. Protopopov, and S. Hebert. 1996. A Thread Taxonomy for MPI. Proc. of MPIDC, 1996.
- Chowdappa, A., A. Skjellum, and N. E. Doss. 1994. Thread-Safe Message Passing with P4 and MPI. Proc. of SC’94, 1994.
Selected References, II
- Dimitrov, Rossen. 2001. Overlapping of communication and computation and early binding: Fundamental mechanisms for improving parallel performance on clusters of workstations. Ph.D. dissertation, Mississippi State University. http://library.msstate.edu/etd/show.asp?etd=etd-04092001-231941
- Dimitrov, R., and A. Skjellum. 1999. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-enabled Cluster Computing. In Proc. MPIDC'99, Message Passing Interface Developer's and User's Conference, pp. 15-24, Atlanta, GA, March 1999.
- Skjellum, A. 1998. High Performance MPI: Extending the Message Passing Interface for Higher Performance and Higher Predictability. In Proc. of PDPTA’98, Las Vegas, July 1998.
Selected References, III
- Brightwell, Ron, and Arthur B. Maccabe. 2000. Scalability limitations of VIA-based technologies in supporting MPI. In Proceedings of the fourth MPI developer’s and user’s conference, March 2000.
- Brightwell, Ron, Tramm Hudson, Rolf Riesen, and Arthur B. Maccabe. 2001. The portals message passing interface: Revision 1.1. 24 May 2001. Sandia Technical Report SAND99-2959. ftp://ftp.cs.sandia.gov/pub/papers/bright/portals3v1-1.pdf
- Gropp, W., E. Lusk, N. Doss, and A. Skjellum. 1996. A High-performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22(6):789-828, September 1996.
Selected References, IV
- Valiant, Leslie G. A Bridging Model for Parallel Computation. Communications of the ACM, Vol. 33, No. 8, August 1990.
- Petrini, Fabrizio, Adolfy Hoisie, Wu-chun Feng, and Richard Graham. Performance Evaluation of the Quadrics Interconnection Network. In Workshop on Communication Architecture for Clusters (CAC '01), San Francisco, CA, April 2001.
- Boden, Nan, et al. Myrinet, a Gigabit per Second Local Area Network. IEEE Micro, Vol. 15, No. 1, February 1995, pp. 29-36.