

SLIDE 1

Programmable NICs: What they mean for parallel middleware (and are they here to stay?)

Anthony Skjellum, Vijay Velusamy, ChangZheng Rao, Boris Protopopov* Department of Computer Science Mississippi State University * Now with Mercury Computer Systems June 18, 2002

SLIDE 2

6/20/2002 2

MSU Motivations

- Long-term efforts in parallel programming design (MPI-1, MPI-2, MPI/RT, PacketWay)
- Interested in the performance triple (latency, overhead, bandwidth), not (latency, bandwidth) alone, for real applications
- Emphasis on lowering overhead and delivering:
  - Predictability
  - Overlap of communication and computation
- Interested in offloading work to NICs (security, pt2pt primitives, collective operations, ...)
- Long-term interest in programmable NICs
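The emphasis on the triple (latency, overhead, bandwidth) can be made concrete with a toy LogP/LogGP-style cost sketch; the parameter values below are hypothetical, chosen only for illustration:

```python
# LogP/LogGP-style sketch: overhead (o), not latency (L), is what the
# host CPU cannot hide by overlapping communication with computation.
# All parameter values are hypothetical.
o = 5e-6   # per-message CPU overhead, seconds (send + receive processing)
L = 10e-6  # network latency, seconds (CPU-free once the NIC takes over)
G = 1e-9   # gap per byte, seconds (reciprocal bandwidth: 1 GB/s)

def msg_time(nbytes):
    """End-to-end time for one message of nbytes."""
    return o + L + nbytes * G

def cpu_busy(nbytes):
    """CPU time consumed per message when the NIC handles the transfer:
    only the overhead o; L and the byte stream overlap with computation."""
    return o

n = 64 * 1024
print(f"message time: {msg_time(n) * 1e6:.3f} us")
print(f"CPU time    : {cpu_busy(n) * 1e6:.3f} us")
```

With offload, the host gives up only `o` per message and can compute during `L + n*G`; without it, the whole message time is charged to the CPU, which is why (latency, bandwidth) alone understates the cost to real applications.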

SLIDE 3

Outline

- Introduction
- Survey of Programmable NICs
- Similarities, Differences
- Advantages, Disadvantages
- Economics of Programmability
- Conclusions
- Future Work

SLIDE 4

Introduction

- Programmable NICs are great for research:
  - Hosts
  - Switches
  - New algorithms, strategies, ideas
- Programmable NICs are flexible
- Often difficult to program
- More expensive per unit than ASICs
- Where do they fit in, when, and for how long?

SLIDE 5

Some Programmable NICs

- Myrinet LANai 2.x–9.x (Myricom)
- Alteon (Netgear, Farallon)
- Quadrics ELAN 3 (successor to the Meiko Computing Surface, which used ELAN)
- Sitera Prism
- IBM IB HCA (PPC-based)

SLIDE 6

Myrinet LANai

- Very popular NIC/network
- Simultaneous DMAs on the host side (1 active) [since LANai 5]
- Simultaneous DMAs on the network side (2 active) [since LANai 2]
- Chainable DMA (host + network send, host + network receive) [LANai 9]
- Programmable LANai processor (no cache, high-speed SRAM)
- No support for non-contiguous physical-page DMA
- Not necessarily enough memory bandwidth for the LANai to execute many instructions while all DMAs are running (the degree varies with NIC generation)
- Clock resolution for real-time purposes: 0.5 µs
- Supports user-space communication
- NICs moderately expensive; overall solution "cost effective/upscale"
- For the last few years, official firmware (GM, "Glenn's Messages") has been promoted in lieu of end users programming the NIC themselves
- GM-based overhead has not proven as low as that of ASIC-based SANs, such as the Giganet cLAN

SLIDE 7

Alteon NIC

- Two MIPS processors
- Dual DMA engines, DMA assist engine
- Supports TCP/IP checksum offloading, interrupt coalescing, failover
- Features jumbo frames (bigger packets of data, which offload the CPU by reducing interrupts)

SLIDE 8

Quadrics ELAN 3 [4]

Two primary processing engines:

- Microcode processor
- Thread processor

Further characteristics:

- Supports four hardware threads (each can individually issue pipelined memory requests to the memory system)
- Full programmability with user thread(s)
- 32-bit RISC thread processor
- MMU contains a 16-entry, fully associative look-aside buffer and a small data path and state machine
- DMA engine prioritizes outstanding DMAs and time-slices large DMAs to prevent adverse blocking of small DMAs
- Only one packet in flight in the end-to-end protocol for ELAN 3; ELAN 4 fixes this (stop-and-wait vs. go-back-N link-layer protocols)
- Very expensive NICs; overall solution "high end"

SLIDE 9

Other Noteworthy NICs

- Sitera Prism
- Intel EtherExpress Pro
- Alacritech
- VI/IP from Emulex
- IBM 1st-generation IB HCA (PPC)
- Myrinet FI32
- Other NICs that hold host protocols

SLIDE 10

Similarities, Differences

- DMA engines
- Handling virtual memory, pages, MMU
- Division of labor/resources:
  - what part of the protocol executes where
  - how notification is accomplished
  - how resources are divided among users of the NIC

SLIDE 11

Advantages, Disadvantages

Advantages:

- Programmability means flexible uses in non-TCP/IP situations
- Programming models: message passing, DSM, transactions, ...
- Experimentation with different kinds of transports
- Scale-relevant transport choices (performance vs. resource consumption)
- Evolution (IB)

Disadvantages:

- Processor overhead remains for things that don't fit on the NIC
- Relatively slow NICs in most cases

SLIDE 12

Economics of NIC Programmability

- Cost to develop/maintain/upgrade transports
- Cost to deliver a NIC with programmability
- First-generation parts: flexibility wins out at low volume
- Next-generation parts: flexibility not needed; high volume demands savings (read: ASIC)
- Even programmable parts may be closed to end-user programming when the vendor won't expose the feature, because supporting it costs them money and effort

SLIDE 13

Qualitative 1st vs. 2nd generation cost tradeoff

[Figure: aggregate whole cost vs. number of parts (volume), comparing a 1st-generation part (e.g., µP + FPGA(s)) against a 2nd-generation part (e.g., ASIC), with a break-even point where the lines cross.]

- Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds
- Programmability is a plus for first-generation prototyping
- Programmability is most often a casualty of the economy of scale sought in the 2nd generation
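The break-even point follows from a simple linear cost model; the dollar figures below are hypothetical, chosen only to illustrate the shape of the tradeoff:

```python
# Linear model of "aggregate whole cost" vs. volume for the two parts.
# All cost figures are hypothetical.
def total_cost(nre, unit_cost, volume):
    """Non-recurring engineering cost plus per-part cost times volume."""
    return nre + unit_cost * volume

FPGA_NRE, FPGA_UNIT = 0.5e6, 800  # 1st-gen part: cheap to design, dear per unit
ASIC_NRE, ASIC_UNIT = 5.0e6, 80   # 2nd-gen part: dear to design, cheap per unit

# Break-even volume where the two cost lines cross:
#   FPGA_NRE + FPGA_UNIT * v == ASIC_NRE + ASIC_UNIT * v
break_even = (ASIC_NRE - FPGA_NRE) / (FPGA_UNIT - ASIC_UNIT)
print(f"break-even at {break_even:.0f} parts")  # 6250 parts

for volume in (1_000, 10_000):
    fpga = total_cost(FPGA_NRE, FPGA_UNIT, volume)
    asic = total_cost(ASIC_NRE, ASIC_UNIT, volume)
    winner = "1st-gen (FPGA)" if fpga < asic else "2nd-gen (ASIC)"
    print(f"volume {volume:>6}: {winner} wins")
```

Below break-even the programmable part's low NRE wins; above it, ASIC unit savings dominate, which is the economy of scale that tends to kill programmability in second-generation parts.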

SLIDE 14

Myricom-like tradeoff

[Figure: aggregate whole cost vs. number of parts (volume) across increasing product generations, with "ASIC someday?" marking a possible endpoint.]

Economies of scale within a generation/technology probably make for sublinear growth, so our lines are qualitative upper bounds.

SLIDE 15

Qualitative Costs of Programmability ($ and/or Performance)

Capital cost:

- additional cost of the fielded system relative to the nearest equivalent-cost non-programmable system
- additional hardware needed to reach "level X of required performance" for acceptability of the system (some application suite)

Cost of ownership:

- additional maintenance of the extra hardware/software needed because of any inefficiency, including extra failures of the larger system
- lower efficacy of a higher-overhead system on applications: more cycles are used up on overhead-intolerant applications

SLIDE 16

Security Implications of Programmable NICs

- DMA memory operations potentially mean cross-SAN access to the memory of disparate processes (security important)
- NIC memory holds contents that can persist between sessions and co-exist across multi-level security (a headache)
- Potential new opportunity for covert channels (covert channels cannot be disproven, and the NIC appears a likely source)
- Some intrusion attacks: denial of service by corrupting the NIC, malicious NIC code, using the NIC to damage a local or remote resource
- Some insider threats: copying data not belonging to your process, obtaining more than a fair share of resources in a QoS setting, executing code improperly on the NIC, using the NIC as a means to move data between unrelated processes on one system

SLIDE 17

Real Time Implications of Programmable NICs

- Known problem of coordinating two CPUs with a single schedule
- If the CPU drives the NIC in non-programmable mode, then all need for programmability is moot
- For example, Xinyan Zan found in her MS thesis that GM is not at all predictable
- Chakravarthi et al. implemented stringent Myrinet control programs to achieve high predictability (less bandwidth, higher latency, better predictability)

SLIDE 18

When is Offloading to NIC better than using the HOST?

Limiting cases

- zero-performance NIC -> use the host [e.g., the case of the Meiko CS-1 ELAN]
- very fast integer performance on the NIC relative to the host -> use the NIC; the host becomes like a "second level" (as with the Myricom 2-level multicomputer) or an attached processor

In between lies a legitimately interesting constrained combinatorial optimization over (latency, overhead, bandwidth), aimed at minimizing application runtime. Studying, modeling, and describing this problem is a key area of our current research effort. (B. Protopopov is completing a PhD in this area at MSU.)

SLIDE 19

When is Offloading to NIC better than using the HOST?

- The memory hierarchy also impacts whether the host or the NIC should do certain work
- A concurrency metric, (hw_latency × hw_bandwidth) / quantum_work, provides an upper bound on the number of concurrent activities a host needs to mask latency from the network... how big is this number?
- quantum_work is the number of bytes for something useful
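The metric is just the bandwidth-delay product divided by the useful unit of work, and is easy to evaluate; the numbers below are hypothetical cluster parameters, not measurements:

```python
# Concurrency metric from the slide:
#   (hw_latency * hw_bandwidth) / quantum_work
# Upper bound on the number of concurrent activities the host needs in
# flight to mask network latency. Parameter values are hypothetical.
def concurrency(latency_s, bandwidth_bytes_per_s, quantum_bytes):
    return (latency_s * bandwidth_bytes_per_s) / quantum_bytes

# 10 us latency, 1 GB/s link, 4 KiB of bytes per useful quantum of work:
c = concurrency(10e-6, 1e9, 4096)
print(f"concurrency upper bound: {c:.2f}")  # ~2.44: a handful of activities
```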

SLIDE 20

What should be left for host?

- Connection management, flow control, per-connection-sized work/space [UNM/Sandia/Portals philosophy]
- Items that naturally travel up the memory hierarchy anyway
- Some of these items may be handled cheaply now, given additional thread concurrency (e.g., hyperthreading), whose cycles may be excess resources anyway

SLIDE 21

Protopopov’s Analysis, Pt 1

If the CPU does the protocol stack:

- There are potentially several cache-assisted memory copies of data
  - checksum, protocol-layer packet formatting

If the NIC does the protocol stack:

- Does the NIC do the work in place (in host DRAM, in local memory)?
- What is the implication of cache flush/invalidate on moving data between NIC and host?

SLIDE 22

Protopopov’s Analysis, Pt 2

- The system hits a lower bound on performance when one component becomes the bottleneck, utilized 100% (e.g., the NIC)
- The system achieves a certain (higher) level of performance when the utilizations of components are balanced; this provides a constraint curve to follow toward higher utilization
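A minimal sketch of this bottleneck-and-balance argument, with hypothetical component capacities:

```python
# Message rate is capped by the slowest component in the path; balanced
# component utilizations mark the constraint curve toward higher
# performance. Capacities (messages/s) are hypothetical.
capacities = {"host CPU": 2.0e6, "PCI/DMA": 1.2e6, "NIC CPU": 0.8e6}

offered_rate = 1.0e6  # messages/s the application tries to push
achieved = min(offered_rate, min(capacities.values()))
utilization = {name: achieved / cap for name, cap in capacities.items()}

bottleneck = min(capacities, key=capacities.get)
print(f"achieved: {achieved:.2e} msg/s (bottleneck: {bottleneck})")
for name, u in utilization.items():
    print(f"  {name:8s} {100 * u:5.1f}% utilized")
```

Here the NIC saturates at 100% while the host idles; shifting work toward the underutilized components (e.g., moving part of the stack back to the host) is what following the constraint curve means.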

SLIDE 23

Protopopov’s Analysis, Pt 3

- "Computational slack" [cf. the BSP model] available to the application helps establish whether a NIC should be specified
- Slack is the amount of computation available in the processors to absorb communication overhead without increasing the overall application critical path (time to solution)
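The slack criterion reduces to a one-line test: communication overhead lengthens time to solution only when it exceeds the slack. A minimal sketch, with hypothetical per-superstep times:

```python
# BSP-style slack test (times in seconds, values hypothetical):
# communication overhead is hidden iff per-superstep computational slack
# covers it; any excess lands on the application's critical path.
def exposed_overhead(slack_s, comm_overhead_s):
    """Extra critical-path time per superstep after overlapping."""
    return max(0.0, comm_overhead_s - slack_s)

print(exposed_overhead(100e-6, 40e-6))  # 0.0 -> overhead fully hidden
print(exposed_overhead(20e-6, 40e-6))   # 2e-05 -> critical path lengthens
```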

SLIDE 24

Summary/Conclusions

- Programmable NICs seem to disappear regularly, except for Myrinet
- Would Myrinet/Quadrics be better for production use without programmability? Only with "good" non-programmable choice(s):
  - high-performance, connection-oriented (e.g., VIPL, uDAPL)
  - large-scale, quasi-connectionless (e.g., Portals)
- How much do Myrinet/Quadrics customers pay for the flexibility (cost, performance)?

SLIDE 25

Summary/Conclusions

- Can R&D that depends on programmability be transitioned to industry? Maybe.
- What incentives exist for COTS products to be programmable in the long term?

SLIDE 26

Future Work

- An evolvable model of when to invest in programmable network parts for given middleware projects
- Detailed models of when NIC offloading benefits the application (or higher middleware like MPI)
- Understanding of where scalability is impacted by programmability (positively, negatively)
- Expanding the models to programmable switches

SLIDE 27

Acknowledgements

- Srigurunath Chakravarthi (MSU/MSTI)
- Rossen Dimitrov (MSU/MSTI)
- Barney Maccabe (UNM, Albuquerque)
- Ron Brightwell (Sandia)

SLIDE 28

Selected References, I

- Protopopov, Boris V., and Anthony Skjellum. 2001. "A Multithreaded Message Passing Interface (MPI) Architecture: Performance and Program Issues." Journal of Parallel and Distributed Computing, Vol. 61, No. 4, Apr 2001, pp. 449-466.
- Protopopov, Boris V., and Anthony Skjellum. 2000. "Shared-memory communication approaches for an MPI message-passing library." Concurrency: Practice and Experience, 12(9), Aug 2000, pp. 799-820.
- Skjellum, A., B. Protopopov, and S. Hebert. 1996. "A Thread Taxonomy for MPI." In Proc. of MPIDC, 1996.
- Chowdappa, A., A. Skjellum, and N. E. Doss. 1994. "Thread-Safe Message Passing with P4 and MPI." In Proc. of SC'94, 1994.

SLIDE 29

Selected References, II

- Dimitrov, Rossen. 2001. "Overlapping of communication and computation and early binding: Fundamental mechanisms for improving parallel performance on clusters of workstations." Ph.D. dissertation, Mississippi State University. http://library.msstate.edu/etd/show.asp?etd=etd-04092001-231941
- Dimitrov, R., and A. Skjellum. 1999. "An Efficient MPI Implementation for Virtual Interface (VI) Architecture-enabled Cluster Computing." In Proc. of MPIDC'99, Message Passing Interface Developer's and User's Conference, pages 15-24, Atlanta, GA, March 1999.
- Skjellum, A. 1998. "High Performance MPI: Extending the Message Passing Interface for Higher Performance and Higher Predictability." In Proc. of PDPTA'98, Las Vegas, July 1998.

SLIDE 30

Selected References, III

- Brightwell, Ron, and Arthur B. Maccabe. 2000. "Scalability limitations of VIA-based technologies in supporting MPI." In Proceedings of the Fourth MPI Developer's and User's Conference, March 2000.
- Brightwell, Ron, Tramm Hudson, Rolf Riesen, and Arthur B. Maccabe. 2001. "The Portals message passing interface: Revision 1.1." 24 May 2001. Sandia Technical Report SAND99-2959. ftp://ftp.cs.sandia.gov/pub/papers/bright/portals3v1-1.pdf
- Gropp, W., E. Lusk, N. Doss, and A. Skjellum. 1996. "A High-performance, Portable Implementation of the MPI Message Passing Interface Standard." Parallel Computing, 22(6):789-828, September 1996.

SLIDE 31

Selected References, IV

- Valiant, Leslie G. 1990. "A Bridging Model for Parallel Computation." Communications of the ACM, August 1990, Vol. 33, No. 8.
- Petrini, Fabrizio, Adolfy Hoisie, Wu-chun Feng, and Richard Graham. 2001. "Performance Evaluation of the Quadrics Interconnection Network." In Workshop on Communication Architecture for Clusters (CAC '01), San Francisco, CA, April 2001.
- Boden, Nan, et al. 1995. "Myrinet: A Gigabit-per-Second Local Area Network." IEEE Micro, Vol. 15, No. 1, February 1995, pp. 29-36.