
PASTE: A Network Programming Interface for Non-Volatile Main Memory - PowerPoint PPT Presentation



  1. PASTE: A Network Programming Interface for Non-Volatile Main Memory Michio Honda (NEC Laboratories Europe) Giuseppe Lettieri (Università di Pisa) Lars Eggert and Douglas Santry (NetApp) USENIX NSDI 2018

  2. Review: Memory Hierarchy
     Slow, block-oriented persistence
     [Hierarchy: CPU → Caches (5-50 ns) → Main Memory (byte access w/ load/store, 70 ns) → HDD/SSD (block access w/ system calls, 100-1000s us)]

  3. Review: Memory Hierarchy
     Fast, byte-addressable persistence
     [Hierarchy: CPU → Caches (5-50 ns) → Main Memory / NVMM (byte access w/ load/store, 70 ns-1000s ns) → HDD/SSD (block access w/ system calls, 100-1000s us)]

  4. Networking is faster than disks/SSDs
     1.2KB durable write over TCP/HTTP:
     Client → Server (cables, NICs, TCP/IP, socket API): 23us
     Server → SSD (syscall, PCIe bus, physical media): 1300us

  5. Networking is slower than NVMM
     1.2KB durable write over TCP/HTTP:
     Client → Server (cables, NICs, TCP/IP, socket API): 23us
     Server → NVMM (memcpy, memory bus, physical media): 2us

  6. Networking is slower than NVMM
     1.2KB durable write over TCP/HTTP (clients → server → NVMM via cables, NICs, TCP/IP, socket API, memcpy, memory bus, physical media). The server-side loop:
       nevts = epoll_wait(fds);
       for (i = 0; i < nevts; i++) {
           read(fds[i], buf);
           ...
           memcpy(nvmm, buf);
           ...
           write(fds[i], reply);
       }
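     What the slide's snippet leaves implicit is that the memcpy() alone does not make the data durable; the copied cache lines still have to be flushed to the NVDIMM. A minimal sketch of this conventional path, assuming nvmm points into an mmap()ed NVMM-backed file and the epoll set is already populated; names and sizes are illustrative, not from the talk:

       #include <sys/epoll.h>
       #include <unistd.h>
       #include <string.h>
       #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */

       #define MAX_EVENTS 64
       #define BUF_SIZE   2048
       #define CACHELINE  64

       /* One pass of the conventional loop: copy each request into NVMM,
        * flush it out of the CPU cache, then acknowledge the client. */
       void serve_once(int epfd, char *nvmm, size_t *off)
       {
           struct epoll_event evts[MAX_EVENTS];
           char buf[BUF_SIZE];
           int nevts = epoll_wait(epfd, evts, MAX_EVENTS, -1);

           for (int i = 0; i < nevts; i++) {
               int fd = evts[i].data.fd;
               ssize_t len = read(fd, buf, sizeof(buf));    /* copy #1: socket -> buf */
               if (len <= 0)
                   continue;
               memcpy(nvmm + *off, buf, len);               /* copy #2: buf -> NVMM */
               for (ssize_t c = 0; c < len; c += CACHELINE) /* make the write durable */
                   _mm_clflush(nvmm + *off + c);
               _mm_sfence();
               *off += len;
               write(fd, "OK", 2);                          /* reply over the socket */
           }
       }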

  7. Innovations at both stacks
     Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], StackMap [ATC’16]
     Storage stack: NVTree [FAST’15], NVWAL [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

  8. Stacks are isolated
     Costs of moving data between the two stacks
     Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], StackMap [ATC’16]
     Storage stack: NVTree [FAST’15], NVWAL [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

  9. Bridging the gap
     PASTE sits between the two stacks
     Network stack: MegaPipe [OSDI’12], Seastar, mTCP [NSDI’14], IX [OSDI’14], StackMap [ATC’16]
     Storage stack: NVTree [FAST’15], NVWAL [ASPLOS’16], NOVA [FAST’16], Decibel [NSDI’17], LSNVMM [ATC’17]

  10. PASTE Design Goals
      ● Durable zero copy
        ○ DMA to NVMM
      ● Selective persistence
        ○ Exploit modern NICs’ DMA to the L3 cache
      ● Persistent data structures
        ○ Indexed, named packet buffers backed by a file
      ● Generality and safety
        ○ TCP/IP in the kernel and the netmap API
      ● Best practices from modern network stacks
        ○ Run-to-completion, blocking, busy-polling, batching, etc.

  11. PASTE in Action
      [Diagram: an app thread in user space holds a Plog (/mnt/pm/plog, columns pbuf/len/off); the Ppool is shared memory backed by /mnt/pm/pp on a file system mounted at /mnt/pm and holds a Pring (slots, cur pointer) plus numbered Pbufs; the kernel runs TCP/IP over the NIC; data crosses the user/kernel boundary with zero copy]

  12. PASTE in Action
      [Diagram as in slide 11]

  13. PASTE in Action
      ● poll() system call
      1. Run NIC I/O and TCP/IP
      [Diagram as in slide 11]

  14. PASTE in Action
      ● poll() system call
      1. Run NIC I/O and TCP/IP
         ○ Got 6 in-order TCP segments
      [Diagram as in slide 11]

  15. PASTE in Action
      ● poll() system call
      1. Run NIC I/O and TCP/IP
         ○ They are set to Pring slots
      [Diagram: the received segments now sit in Pbufs linked into the Pring slots; tail advanced past them]

  16. PASTE in Action
      ● Return from poll()
      1. Run NIC I/O and TCP/IP
      [Diagram as in slide 15]

  17. PASTE in Action
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      [Diagram as in slide 15]

  18. PASTE in Action
      ● Flush Pbuf data from the CPU cache to the DIMM
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
         ○ clflush(opt) instruction
      [Diagram as in slide 15]
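      The flush in step 3 is an ordinary x86 cache-line flush over the Pbuf payload. A minimal sketch, assuming the payload is directly addressable at buf with length len and the CPU supports clflushopt; the helper name is ours, not part of the API:

        #include <stddef.h>
        #include <stdint.h>
        #include <immintrin.h>   /* _mm_clflushopt, _mm_sfence; compile with -mclflushopt */

        #define CACHELINE 64

        /* Flush a Pbuf's payload from the CPU cache to the NVDIMM.
         * clflushopt is weakly ordered, so fence once after the loop. */
        static void pbuf_persist(const char *buf, size_t len)
        {
            uintptr_t p = (uintptr_t)buf & ~(uintptr_t)(CACHELINE - 1);
            for (; p < (uintptr_t)buf + len; p += CACHELINE)
                _mm_clflushopt((void *)p);
            _mm_sfence();
        }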

  19. PASTE in Action
      ● Pbuf is a persistent data representation
         ○ Base address is static, i.e., a file (/mnt/pm/pp)
         ○ Buffers can be recovered after reboot
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      [Diagram: the Plog (/mnt/pm/plog) now holds the entry (pbuf=1, len=96, off=120)]
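      The slide's (pbuf, len, off) columns suggest a very small on-media record. A sketch of what such an entry and its recovery could look like; the struct, field names, and fixed buffer size are assumptions for illustration, not the actual PASTE definitions:

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical layout of one Plog record, following the slide's
         * (pbuf, len, off) columns.  Because the Ppool is a named file
         * (/mnt/pm/pp) that is remapped at recovery time, a buffer index
         * plus an offset is enough to find the data again after a reboot. */
        struct plog_entry {
            uint32_t pbuf;   /* index of the Pbuf in the Ppool file      */
            uint16_t len;    /* bytes of payload, e.g. 96                */
            uint16_t off;    /* payload offset inside the Pbuf, e.g. 120 */
        };

        /* Recover a pointer to logged data: pool_base is the remapped
         * /mnt/pm/pp, pbuf_size is the fixed per-buffer size (assumed). */
        static inline char *plog_data(char *pool_base, size_t pbuf_size,
                                      const struct plog_entry *e)
        {
            return pool_base + (size_t)e->pbuf * pbuf_size + e->off;
        }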

  20. PASTE in Action
      ● Prevent the kernel from recycling the buffer
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      5. Swap out Pbuf(s)
      [Diagram: the Pring slot that held Pbuf 1 now points to a free Pbuf (8), so Pbuf 1 stays with the application]

  21. PASTE in Action
      ● Same for Pbufs 2 and 6
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      5. Swap out Pbuf(s)
      [Diagram: Plog entries (1, 96, 120), (2, 96, 768), (6, 96, 987); their Pring slots now point to free Pbufs]

  22. PASTE in Action
      ● Advance cur
         ○ Return buffers in slots 0-6 to the kernel at the next poll()
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      5. Swap out Pbuf(s)
      6. Update Pring
      [Diagram: cur advanced past the consumed slots]

  23. PASTE in Action: Write-Ahead Logs
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      5. Swap out Pbuf(s)
      6. Update Pring
      [Diagram: the Plog acts as a write-ahead log of (pbuf, len, off) entries]
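      Putting steps 1-6 together, a hypothetical application loop for this write-ahead log use case; the real interface is an extension of the netmap API, and every identifier below (paste_ring, pbuf_payload, append_plog, alloc_free_pbuf, and the ring/slot fields) is illustrative only:

        /* Hypothetical WAL loop over a PASTE-style ring (names are ours). */
        for (;;) {
            poll(&pfd, 1, -1);                    /* 1. kernel runs NIC I/O + TCP/IP  */
            struct ring *r = paste_ring(pfd.fd);
            for (uint32_t i = r->cur; i != r->tail; i = next(r, i)) {
                struct slot *s = &r->slot[i];
                char *data = pbuf_payload(pool, s->pbuf) + s->off;
                handle_request(data, s->len);     /* 2. read data in place on Pring   */
                pbuf_persist(data, s->len);       /* 3. flush the Pbuf (clflushopt)   */
                append_plog(plog, s->pbuf, s->len, s->off);  /* 4. flush Plog entry   */
                s->pbuf = alloc_free_pbuf(pool);  /* 5. swap out the Pbuf so the
                                                        kernel cannot recycle it      */
            }
            r->cur = r->tail;                     /* 6. update Pring; buffers behind
                                                        cur return to the kernel at
                                                        the next poll()               */
            /* send replies, then loop */
        }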

  24. PASTE in Action
      ● We can organize various data structures in the Plog
      1. Run NIC I/O and TCP/IP
      2. Read data on Pring
      3. Flush Pbuf(s)
      4. Flush Plog entry(ies)
      5. Swap out Pbuf(s)
      6. Update Pring
      [Diagram: a B+tree kept in the Plog whose leaves reference data in place as (pbuf, len, off) entries, e.g., (1, 96, 120), (2, 96, 987), (6, 96, 512)]

  25. Evaluation
      1. How does PASTE outperform existing systems?
      2. Is PASTE applicable to existing applications?
      3. Is PASTE useful for systems other than file/DB storage?

  26. How does PASTE outperform existing systems?
      [Graphs: WAL throughput for 64B and 1280B durable writes (all writes)]
      What if we use more complex data structures?

  27. How does PASTE outperform existing systems?
      [Graphs: B+tree throughput for 64B and 1280B durable writes (all writes)]

  28. Is PASTE applicable to existing applications?
      ● Redis
      [Graphs: YCSB (read mostly) and YCSB (update heavy)]

  29. Is PASTE useful for systems other than DB/file storage?
      ● Packet logging prior to forwarding
         ○ Fault-tolerant middlebox [SIGCOMM’15]
         ○ Traffic recording
      ● Extend mSwitch [SOSR’15]
         ○ Scalable NFV backend switch

  30. Conclusion
      ● PASTE is a network programming interface that:
         ○ Enables durable zero copy to NVMM
         ○ Helps apps organize persistent data structures on NVMM
         ○ Lets apps use TCP/IP and be protected
         ○ Offers a high-performance network stack even w/o NVMM
      https://github.com/luigirizzo/netmap/tree/paste
      micchie@sfc.wide.ad.jp or @michioh

  31. Multicore Scalability
      ● WAL throughput

  32. Further Opportunity with Co-designed Stacks
      ● What if we use higher access latency NVMM?
         ○ e.g., 3D XPoint
      ● Overlap flushes and processing with clflushopt, and mfence before the system call (which triggers packet I/O)
         ○ See the paper for results
      [Timeline: examine request → clflushopt → examine request → clflushopt → mfence → system call; flushes complete while later requests are processed, and the mfence before the system call waits for all flushes to be done before responses are sent]
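      A sketch of that overlap, reusing the hypothetical helpers from the loop after slide 23: each request's flush is started with weakly ordered clflushopt while processing continues, and a single mfence right before the system call waits for all of them at once:

        /* Hypothetical batch loop: flushes overlap with request processing. */
        for (uint32_t i = r->cur; i != r->tail; i = next(r, i)) {
            struct slot *s = &r->slot[i];
            char *data = pbuf_payload(pool, s->pbuf) + s->off;
            examine_request(data, s->len);        /* application work           */
            for (size_t c = 0; c < s->len; c += 64)
                _mm_clflushopt(data + c);         /* start flush, don't wait    */
            prepare_response(s);
        }
        _mm_mfence();                             /* wait once for all flushes  */
        poll(&pfd, 1, 0);                         /* system call triggers packet
                                                     I/O and sends the queued
                                                     responses                  */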

  33. Experiment Setup
      ● Intel Xeon E5-2640v4 (2.4 GHz)
      ● HPE 8GB NVDIMM (NVDIMM-N)
      ● Intel X540 10 GbE NIC
      ● Comparison
         ○ Linux and StackMap [ATC’16] (the current state of the art)
         ○ Fair comparison: all use the same kernel TCP/IP implementation
