
  1. Make the Most out of Last Level Cache in Intel Processors – Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić* (* KTH Royal Institute of Technology (EECS/COM), + Ericsson Research)

  2. Motivation

  3. Motivation – Some of these services demand bounded low latency and predictable service time.

  4. Motivation

  5. Motivation – A server receiving 64 B packets at 100 Gbps has only 5.12 ns to process a packet before the next packet arrives.
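
The 5.12 ns budget follows directly from the packet size and the line rate:

\[
  \frac{64\,\text{B} \times 8\,\text{bit/B}}{100 \times 10^{9}\,\text{bit/s}}
  = \frac{512\,\text{bit}}{10^{11}\,\text{bit/s}}
  = 5.12\,\text{ns}
\]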

  6. Motivation – [Figure: transistors (thousands) and single-thread performance (SpecINT x 10^3) over 1980–2020.]

  7. Motivation – It is essential to use our current hardware more efficiently. [Same trend figure as the previous slide.]

  8. Memory Hierarchy – Registers: <4 CPU cycles; cache (L1, L2, LLC): 4–40 cycles; DRAM: >200 cycles (>60 ns); each level down the hierarchy is slower. For a CPU running at 3.2 GHz, every 4 cycles is around 1.25 ns.

  9. Memory Hierarchy – To keep up with the 100 Gbps time budget (5.12 ns), cache becomes valuable, as every access to DRAM is expensive (>200 cycles, >60 ns).

  10. Memory Hierarchy – To keep up with the 100 Gbps time budget (5.12 ns), cache becomes valuable, as every access to DRAM is expensive. We focus on better management of the cache.

  11. Better Cache Management – Reduce tail latencies of NFV service chains running at 100 Gbps by up to 21.5%.

  12. Last Level Cache (LLC) – [Figure: Intel processor with its LLC.]

  13. Non-uniform Cache Architecture (NUCA) – Since Sandy Bridge (~2011), the LLC is not unified any more! [Figure: Intel processor with the LLC split into slices.]

  14. Non-uniform Cache Architecture (NUCA) – Intel's Complex Addressing determines the mapping between the memory address space and the LLC slices. Almost every cache line (64 B) maps to a different LLC slice. Known methods for recovering the mapping: performance counters (Clémentine Maurice et al. [RAID '15]*). * Clémentine Maurice, Nicolas Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.

  15. Measuring Access Time to LLC Slices – Different access times to different LLC slices. Intel Xeon E5-2667 v3 (Haswell).

  16. Measuring Access Time to LLC Slices – Measuring read access time from Core 0 to all LLC slices.
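
The slides do not include the measurement code; the following is a minimal sketch of one way to time reads served by a single LLC slice from a fixed core. It assumes a hypothetical helper buf_for_slice(s, n) that returns n cache lines already known to map to slice s (classified beforehand, e.g., with the performance-counter method of Maurice et al. cited above), and that the process is pinned to Core 0.

```c
/* Minimal sketch (not the authors' code): time dependent loads that hit one
 * LLC slice from a fixed core.  buf_for_slice() is hypothetical; it must
 * return cache lines whose physical addresses map to the requested slice. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                  /* __rdtscp() */

#define LINE    64
#define NLINES  8192                    /* 8192 x 64 B = 512 KB working set */

extern void *buf_for_slice(int slice, size_t nlines);   /* hypothetical helper */

static uint64_t cycles_per_load(void **head, size_t steps)
{
    unsigned aux;
    void **p = head;
    uint64_t start = __rdtscp(&aux);
    for (size_t i = 0; i < steps; i++)
        p = (void **)*p;                /* dependent loads: miss L1/L2, hit LLC */
    uint64_t end = __rdtscp(&aux);
    __asm__ volatile("" :: "r"(p));     /* keep the chain alive */
    return (end - start) / steps;
}

int main(void)
{
    for (int s = 0; s < 8; s++) {       /* E5-2667 v3: 8 cores, 8 LLC slices */
        char *buf = buf_for_slice(s, NLINES);
        /* Link the lines into a chain (a real test would permute the order
         * to defeat the hardware prefetcher). */
        for (size_t i = 0; i < NLINES; i++)
            *(void **)(buf + i * LINE) = buf + ((i + 1) % NLINES) * LINE;
        cycles_per_load((void **)buf, NLINES);          /* warm the LLC */
        printf("slice %d: ~%llu cycles per load\n", s,
               (unsigned long long)cycles_per_load((void **)buf, NLINES));
    }
    return 0;
}
```

The 512 KB working set is chosen to exceed the 256 KB L2 but stay well under one 2.5 MB slice, so the loads miss L1/L2 and are served by the LLC.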

  17. Opportunity – Accessing the closer LLC slice can save up to ~20 cycles, i.e., 6.25 ns for a CPU running at 3.2 GHz.
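
At 3.2 GHz, 20 cycles translate to:

\[
  \frac{20\ \text{cycles}}{3.2 \times 10^{9}\ \text{cycles/s}} = 6.25\ \text{ns}
\]

i.e., more than the entire 5.12 ns per-packet budget at 100 Gbps.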

  18. Slice-aware Memory Management – Allocate memory from physical memory (DRAM) in such a way that it maps to the appropriate LLC slice(s). A sketch of the idea is given below.
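
The slides describe the concept rather than code; the following is a sketch, under stated assumptions, of how a slice-aware pool might be built: walk a physically contiguous (e.g., hugepage-backed) region in 64 B steps and keep only the cache lines that map to the desired slice. virt_to_phys() and slice_of_phys() are hypothetical helpers, not real APIs: physical addresses can be obtained from /proc/self/pagemap on Linux, and the slice mapping must be recovered beforehand (performance counters or timing).

```c
/* Sketch of slice-aware allocation (not the paper's implementation): keep
 * only the 64 B lines of a hugepage-backed region that map to target_slice.
 * virt_to_phys() and slice_of_phys() are hypothetical helpers (see above). */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

extern uint64_t virt_to_phys(const void *va);   /* e.g., via /proc/self/pagemap */
extern int      slice_of_phys(uint64_t pa);     /* recovered complex-addressing map */

/* Collect up to max cache-line-sized chunks from [region, region+len)
 * whose physical addresses map to target_slice; returns how many were found. */
size_t collect_slice_chunks(void *region, size_t len, int target_slice,
                            void **out, size_t max)
{
    size_t n = 0;
    for (size_t off = 0; off + CACHE_LINE <= len && n < max; off += CACHE_LINE) {
        void *line = (char *)region + off;
        if (slice_of_phys(virt_to_phys(line)) == target_slice)
            out[n++] = line;    /* this 64 B chunk will be cached in target_slice */
    }
    return n;
}
```

Since complex addressing spreads consecutive lines roughly evenly over the slices, an 8-slice CPU leaves only about one line in eight usable for a given slice, so a slice-aware pool trades raw memory capacity for faster access.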

  19. Slice-aware Memory Management – Use Cases: Isolation.

  20. Slice-aware Memory Management – Use Cases: Isolation, Shared Data.

  21. Slice-aware Memory Management – Use Cases: Isolation, Shared Data, Performance.

  22. Slice-aware Memory Management – Use Cases: Isolation, Shared Data, Performance. Every core is associated with its closest LLC slice.

  23. Slice-aware Memory Management – [Figure: cache sizes – 256 KB (L2 per core), 2.5 MB (one LLC slice), 20 MB (total LLC).]

  24. Slice-aware Memory Management – Beneficial when the working set can fit into a slice. [Same cache-size figure as the previous slide.]

  25. Slice-aware Memory Management – There are many applications that have this characteristic.

  26. Slice-aware Memory Management – There are many applications that have this characteristic. Key-value stores: frequently accessed keys.

  27. Slice-aware Memory Management – There are many applications that have this characteristic. Key-value stores: frequently accessed keys. Virtualized network functions: packet headers.

  28. Slice-aware Memory Management – There are many applications that have this characteristic. Key-value stores: frequently accessed keys. Virtualized network functions: packet headers. These can fit into a slice.

  29. Slice-aware Memory Management – We focus on virtualized network functions in this talk!

  30. CacheDirector – A network I/O solution that extends Data Direct I/O (DDIO) by employing slice-aware memory management.

  31. Traditional I/O – 1. NICs DMA* packets to DRAM. 2. The CPU fetches them into the LLC. (* Direct Memory Access (DMA))

  32. Data Direct I/O (DDIO) – DMA*-ing packets directly to the LLC rather than to DRAM. [Figure: sending/receiving packets via DDIO.] (* Direct Memory Access (DMA))

  33. Data Direct I/O (DDIO) – Packets go to random slices! [Figure: sending/receiving packets via DDIO.]

  34. Data Direct I/O (DDIO) – Packets go to random slices! [Figure: sending/receiving packets via DDIO.]

  35. CacheDirector – [Figure: sending/receiving packets via DDIO vs. via CacheDirector.]

  36. CacheDirector – [Figure: sending/receiving packets via DDIO vs. via CacheDirector.]

  37. CacheDirector – • Sends the packet's header to the appropriate LLC slice. • Implemented as part of the user-space NIC drivers in the Data Plane Development Kit (DPDK). • Introduces dynamic headroom in DPDK data structures (sketched below).
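
The slides only name the mechanism; below is a sketch of what "dynamic headroom" could look like, not the actual CacheDirector patch. The idea: before a receive buffer is handed to the NIC, slide the data offset within the (enlarged) headroom in 64 B steps until the cache line that will hold the packet header maps to the desired slice. slice_of_phys() is the same hypothetical classifier as before; the mbuf fields used (buf_iova, data_off) follow recent DPDK releases, and the mempool is assumed to reserve extra, cache-line-aligned headroom.

```c
/* Sketch of the dynamic-headroom idea (not the actual CacheDirector code). */
#include <stdint.h>
#include <rte_mbuf.h>

extern int slice_of_phys(uint64_t pa);          /* hypothetical classifier */

static void place_header_in_slice(struct rte_mbuf *m, int target_slice,
                                  uint16_t extra_lines)
{
    /* Try candidate start offsets one cache line (64 B) apart; with 8 slices,
     * at most 8 candidates are needed to reach any slice. */
    for (uint16_t i = 0; i <= extra_lines; i++) {
        uint16_t off = RTE_PKTMBUF_HEADROOM + i * 64;
        uint64_t header_pa = m->buf_iova + off; /* where the NIC writes the header */
        if (slice_of_phys(header_pa) == target_slice) {
            m->data_off = off;                  /* dynamic headroom for this mbuf */
            return;
        }
    }
    m->data_off = RTE_PKTMBUF_HEADROOM;         /* fall back to the default */
}
```

In a poll-mode driver this adjustment would happen on the RX refill path, before the descriptor (pointing at buf_iova + data_off) is handed to the NIC, so DDIO then writes the header straight into the chosen slice.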

  38. Evaluation – Testbed: packet generator connected at 100 Gbps to the device under test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4).

  39. Evaluation – Testbed: the packet generator replays an actual campus trace at 100 Gbps and timestamps packets; the device under test runs VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4).

  40. Evaluation – Testbed: the device under test runs a stateful NFV service chain built with Metron [NSDI '18]*. * Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.

  41. Evaluation – 100 Gbps, Stateful NFV Service Chain: achieved throughput ~76 Gbps.

  42. Evaluation – 100 Gbps, Stateful NFV Service Chain: achieved throughput ~76 Gbps; 21.5% improvement.

  43. Evaluation – 100 Gbps, Stateful NFV Service Chain: 21.5% improvement. Faster access to the packet header → faster processing time per packet → reduced queueing time.

  44. Evaluation – 100 Gbps, Stateful NFV Service Chain: 21.5% improvement. Faster access to the packet header → faster processing time per packet → reduced queueing time; more predictable performance and fewer SLO* violations. (* Service Level Objective)

  45. Read More … • More NFV results • Slice-aware key-value store • Portability of our solution to the Skylake architecture • Slice isolation vs. Cache Allocation Technology (CAT) • More …

  46. Conclusion • A hidden opportunity that can decrease the average access time to the LLC by ~20% • Useful for other applications • https://github.com/aliireza/slice-aware • Meet us at the poster session. This work is supported by WASP, SSF, and ERC.

  47. Backup

  48. Portability • Intel Xeon Gold 6134 (Skylake) • Mesh architecture • 8 cores and 18 slices • Non-inclusive LLC • Does not affect DDIO

  49. Packet Header Sizes • IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) = 54 B < 64 B • IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) = 74 B > 64 B. Any 64 B portion of the packet can be placed in the appropriate slice.

  50. Limitations and Considerations • Data larger than 64 B: use linked lists and scatter the data (sketched below) • Future H/W features: bigger chunks (e.g., 4 KB pages), programmability • Slice imbalance: limit our application to a smaller portion of the LLC, but with faster access.
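
As an illustration of the "linked list and scatter" workaround, here is a sketch of one possible layout (an assumption, not the paper's code): an object larger than 64 B is split into 56-byte payload chunks, each stored in a 64 B node drawn from a slice-local pool such as the hypothetical collect_slice_chunks() pool sketched earlier, so every node stays in the close slice.

```c
/* Sketch: scatter an object larger than 64 B across slice-local 64 B nodes. */
#include <stddef.h>
#include <string.h>

struct slice_node {                 /* exactly one 64 B cache line (on LP64) */
    struct slice_node *next;        /* 8 B link */
    unsigned char payload[56];      /* 56 B of user data */
};

/* Hypothetical allocator returning a node placed on a line of the close slice. */
extern struct slice_node *alloc_slice_node(void);

struct slice_node *scatter_store(const void *data, size_t len)
{
    struct slice_node *head = NULL, **tail = &head;
    const unsigned char *src = data;

    while (len > 0) {
        struct slice_node *n = alloc_slice_node();
        size_t chunk = len < sizeof n->payload ? len : sizeof n->payload;
        memcpy(n->payload, src, chunk);
        n->next = NULL;
        *tail = n;                  /* append, preserving the original order */
        tail = &n->next;
        src  += chunk;
        len  -= chunk;
    }
    return head;
}
```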

  51. Relevant and Future Work • NUCA • Cache-aware memory management (e.g., partitioning and page coloring) • Extending CacheDirector to the whole packet • Slice-aware hypervisor

  52. Slice-aware Memory Management

  53. Evaluation – Low Rate: simple forwarding application, 1000 packets/s.

  54. Evaluation – Tail Latency vs. Throughput: slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.

  55. Evaluation – Tail Latency vs. Throughput: slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.
