SLIDE 1

Make the Most out of Last Level Cache in Intel Processors

Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*

* KTH Royal Institute of Technology (EECS/COM)
+ Ericsson Research

SLIDE 2

Motivation

SLIDE 3

Motivation

Some of these services demand bounded low latency and predictable service time.

SLIDE 4

Motivation

SLIDE 5

Motivation

A server receiving 64 B packets at 100 Gbps has only 5.12 ns to process a packet before the next packet arrives.
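A quick way to sanity-check that budget (a minimal sketch in C; it ignores Ethernet framing overhead such as preamble and inter-frame gap, which is how the 5.12 ns figure on the slide is obtained):

```c
/* Back-of-the-envelope per-packet time budget: 64 B at 100 Gbps.
 * Framing overhead (preamble, inter-frame gap) is ignored, matching
 * the 5.12 ns figure quoted on the slide. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 100e9;   /* 100 Gbps line rate */
    const double frame_bits = 64 * 8;  /* 64 B packet */

    printf("per-packet budget = %.2f ns\n",
           frame_bits / link_bps * 1e9);   /* -> 5.12 ns */
    return 0;
}
```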

SLIDE 6

Motivation

[Figure: transistors (thousands) and single-thread performance (SpecINT × 10³) vs. year, 1980–2020, log scale]

SLIDE 7

Motivation

[Figure: transistors (thousands) and single-thread performance (SpecINT × 10³) vs. year, 1980–2020, log scale]

It is essential to use our current hardware more efficiently.

SLIDE 8

Memory Hierarchy

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

SLIDE 9

Memory Hierarchy

To keep up with the 100 Gbps time budget (5.12 ns), the cache becomes valuable, as every access to DRAM is expensive.

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

SLIDE 10

Memory Hierarchy

To keep up with the 100 Gbps time budget (5.12 ns), the cache becomes valuable, as every access to DRAM is expensive.

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

We focus on better management of the cache.

SLIDE 11

Better Cache Management

Reduce tail latencies of NFV service chains running at 100 Gbps by up to 21.5%.

SLIDE 12

Last Level Cache (LLC)

Intel Processor

SLIDE 13

Non-uniform Cache Architecture (NUCA)

Since Sandy Bridge (~2011), the LLC is no longer unified!

Intel Processor

SLIDE 14

Non-uniform Cache Architecture (NUCA)

Intel’s Complex Addressing determines the mapping between the memory address space and LLC slices. Almost every cache line (64 B) maps to a different LLC slice.

Known methods: performance counters (Clémentine Maurice et al. [RAID ’15]*).

Intel Processor

* Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.

SLIDE 15

Measuring Access Time to LLC Slices

Intel Xeon E5-2667 v3 (Haswell)

Different access time to different LLC slices

SLIDE 16

Measuring Access Time to LLC Slices

Measuring Read Access Time from Core 0 to all LLC slices
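As an illustration of how such a measurement can be taken, here is a minimal, hedged C sketch that times a single read with rdtscp. It omits everything the real experiment needs (pinning the thread to core 0, keeping the line in the LLC while evicting it from L1/L2, mapping addresses to slices, and averaging over many samples); the buffer and function names are illustrative only.

```c
/* Minimal sketch: time one memory read in cycles using rdtscp.
 * The real slice-latency experiment additionally pins the thread to a core,
 * ensures the line is served by the LLC (not L1/L2 or DRAM), and averages
 * many samples per slice; none of that is shown here. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static inline uint64_t time_read(volatile uint64_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);  /* timestamp before the load */
    (void)*addr;                      /* the load being measured */
    uint64_t end = __rdtscp(&aux);    /* timestamp after the load */
    return end - start;
}

int main(void)
{
    static uint64_t buf[8];           /* illustrative target cache line */
    printf("read latency ~= %llu cycles\n",
           (unsigned long long)time_read(buf));
    return 0;
}
```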

SLIDE 17

Opportunity

Accessing the closer LLC slice can save up to ~20 cycles, i.e., 6.25 ns, for a CPU running at 3.2 GHz.
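For reference, the cycle-to-nanosecond conversions used throughout these slides follow directly from the 3.2 GHz clock (a minimal sketch):

```c
/* Cycle-to-time conversion at a 3.2 GHz core clock. */
#include <stdio.h>

int main(void)
{
    const double freq_hz = 3.2e9;                           /* 3.2 GHz */
    printf(" 4 cycles = %.2f ns\n",  4.0 / freq_hz * 1e9);  /* ~1.25 ns */
    printf("20 cycles = %.2f ns\n", 20.0 / freq_hz * 1e9);  /* ~6.25 ns */
    return 0;
}
```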

SLIDE 18

Slice-aware Memory Management

Allocate memory from physical memory such that it maps to the appropriate LLC slice(s) (sketched below).

DRAM
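A minimal sketch of that idea in C: carve a buffer into cache-line-sized chunks and keep only the chunks whose addresses map to the desired slice. slice_of() is a hypothetical placeholder for a slice-mapping oracle (e.g., derived from the reverse-engineered hash or from timing measurements), not a real API, and a real implementation would operate on physical addresses.

```c
/* Sketch of slice-aware allocation: scan a pool in 64 B steps and collect
 * only the cache lines that map to the target LLC slice.
 * slice_of() is a hypothetical oracle (address -> slice id); in practice it
 * would be applied to physical addresses, e.g., inside hugepage-backed memory. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

extern int slice_of(const void *addr);   /* hypothetical: returns LLC slice id */

size_t collect_lines_for_slice(uint8_t *pool, size_t pool_size,
                               int target_slice,
                               void **out, size_t max_out)
{
    size_t n = 0;
    for (size_t off = 0; off + CACHE_LINE <= pool_size && n < max_out;
         off += CACHE_LINE) {
        if (slice_of(pool + off) == target_slice)
            out[n++] = pool + off;        /* keep lines landing on the target slice */
    }
    return n;
}
```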

SLIDE 19

Slice-aware Memory Management

Use Cases:

  • Isolation
SLIDE 20

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
SLIDE 21

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
  • Performance
SLIDE 22

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
  • Performance

Every core is associated with its closest LLC slice.

SLIDE 23

Slice-aware Memory Management

256 KB / 2.5 MB / 20 MB

SLIDE 24

Slice-aware Memory Management

Beneficial when the working set can fit into a slice.

256 KB / 2.5 MB / 20 MB

SLIDE 25

Slice-aware Memory Management

There are many applications that have this characteristic.

SLIDE 26

Slice-aware Memory Management

There are many applications that have this characteristic: key-value stores (frequently accessed keys).

SLIDE 27

Slice-aware Memory Management

There are many applications that have this characteristic: key-value stores (frequently accessed keys), virtualized network functions (packet headers).

SLIDE 28

Slice-aware Memory Management

There are many applications that have this characteristic, and their hot data can fit into a slice: key-value stores (frequently accessed keys), virtualized network functions (packet headers).

SLIDE 29

Slice-aware Memory Management

There are many applications that have this characteristic, and their hot data can fit into a slice: key-value stores (frequently accessed keys), virtualized network functions (packet headers). We focus on virtualized network functions in this talk!
SLIDE 30

CacheDirector

A network I/O solution that extends Data Direct I/O (DDIO) by employing Slice-aware Memory Management.

SLIDE 31

Traditional I/O

  • 1. NICs DMA* packets to DRAM.
  • 2. The CPU will then fetch them into the LLC.

* Direct Memory Access (DMA)

SLIDE 32

Data Direct I/O (DDIO)

DMA*-ing packets directly to the LLC rather than DRAM.

Sending/Receiving Packets via DDIO

* Direct Memory Access (DMA)

SLIDE 33

Data Direct I/O (DDIO)

Sending/Receiving Packets via DDIO

Packets go to random slices!

SLIDE 34

Data Direct I/O (DDIO)

Sending/Receiving Packets via DDIO

Packets go to random slices!

SLIDE 35

CacheDirector

Sending/Receiving Packets via DDIO

SLIDE 36

CacheDirector

Sending/Receiving Packets via DDIO

SLIDE 37

CacheDirector

  • Sends the packet’s header to the appropriate LLC slice.
  • Implemented as part of user-space NIC drivers in the Data Plane Development Kit (DPDK).
  • Introduces dynamic headroom in DPDK data structures (see the sketch below).
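The following is a minimal, hypothetical sketch of the dynamic-headroom idea, not the actual CacheDirector code: before a buffer is handed to the NIC, shift the data offset inside the mbuf so that the cache line receiving the packet header maps to the LLC slice closest to the processing core. slice_of() and closest_slice_of_core() are assumed helpers (e.g., built from the complex-addressing hash or from timing measurements); only struct rte_mbuf and its buf_addr/buf_len/data_off fields are real DPDK items here.

```c
/* Hypothetical sketch of CacheDirector's dynamic headroom (not the real code):
 * choose the mbuf data offset so the header's cache line maps to the slice
 * closest to the core that will process the packet. */
#include <stdint.h>
#include <rte_mbuf.h>

extern int slice_of(const void *addr);            /* assumed slice oracle      */
extern int closest_slice_of_core(unsigned lcore); /* assumed core -> slice map */

void place_header_on_closest_slice(struct rte_mbuf *m, unsigned lcore)
{
    int want = closest_slice_of_core(lcore);

    /* Try candidate offsets one cache line (64 B) apart within the buffer.
     * A real implementation works on physical (IO) addresses and respects
     * the driver's minimum headroom; both are glossed over here. */
    for (uint16_t off = 0; off + 64 <= m->buf_len; off += 64) {
        if (slice_of((char *)m->buf_addr + off) == want) {
            m->data_off = off;   /* the NIC will DMA the header to this line */
            return;
        }
    }
    /* No matching line found: leave the default offset unchanged. */
}
```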

SLIDE 38

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps.

SLIDE 39

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps; actual campus trace; timestamping.

SLIDE 40

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps; actual campus trace.

Stateful NFV Service Chain: Metron [NSDI ’18]*

* Georgios P. Katsikas, Tom Barbette, Dejan Kostic, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.

SLIDE 41

Evaluation – 100 Gbps

Achieved Throughput ~76 Gbps

Stateful NFV Service Chain

SLIDE 42

Evaluation – 100 Gbps

21.5% Improvement

Achieved Throughput ~76 Gbps

Stateful NFV Service Chain

SLIDE 43

Evaluation – 100 Gbps

Faster access to packet headers → faster processing time per packet → reduced queueing time.

Achieved Throughput ~76 Gbps

21.5% Improvement

Stateful NFV Service Chain

SLIDE 44

Evaluation – 100 Gbps

Achieved Throughput ~76 Gbps

Faster access to packet headers → faster processing time per packet → reduced queueing time.

21.5% Improvement

More Predictable → Fewer SLO* Violations

* Service Level Objective (SLO)

Stateful NFV Service Chain

SLIDE 45

Read More …

  • More NFV results
  • Slice-aware key-value store
  • Portability of our solution on the Skylake architecture
  • Slice isolation vs. Cache Allocation Technology (CAT)
  • More …
SLIDE 46

Conclusion

  • Hidden opportunity that can decrease average access time to LLC by ~20%
  • Useful for other applications
  • Meet us at the poster session

https://github.com/aliireza/slice-aware

This work is supported by WASP, SSF, and ERC.

SLIDE 47

Backup

SLIDE 48

Portability

  • Intel Xeon Gold 6134 (Skylake)
  • Mesh architecture
  • 8 cores and 18 slices
  • Non-inclusive LLC
  • Does not affect DDIO

SLIDE 49

Packet Header Sizes

  • IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) < 64 B
  • IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) > 64 B

Any 64 B of the packet can be placed in the appropriate slice.

SLIDE 50

Limitations and Considerations

  • Data larger than 64 B
  • Slice imbalance

Limiting our application to a smaller portion of the LLC, but with faster access.

  • Using linked lists and scattered data
  • Future H/W features: bigger chunks (e.g., 4 KB pages); programmability
SLIDE 51

Relevant and Future Works

  • NUCA
  • Cache-aware Memory Management (e.g., Partitioning and Page Coloring)
  • Extending CacheDirector for the whole packet
  • Slice-aware Hypervisor
SLIDE 52

Slice-aware Memory Management

SLIDE 53

Evaluation – Low Rate

Simple Forwarding Application, 1,000 packets/s

SLIDE 54

Evaluation – Tail vs. Throughput

Slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.

SLIDE 55

Evaluation – Tail vs. Throughput

Slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.