SLIDE 1

Make the Most out of Last Level Cache in Intel Processors

Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*

* KTH Royal Institute of Technology (EECS/COM)
+ Ericsson Research

SLIDE 2

Motivation

SLIDE 3

Motivation

Some of these services demand bounded low latency and predictable service time.

SLIDE 4

Motivation

SLIDE 5

Motivation

A server receiving 64 B packets at 100 Gbps has only 5.12 ns to process a packet before the next packet arrives.
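A quick way to sanity-check that budget (a minimal sketch in C; it ignores Ethernet framing overhead such as preamble and inter-frame gap, which is how the 5.12 ns figure on the slide is obtained):

```c
/* Back-of-the-envelope per-packet time budget: 64 B at 100 Gbps.
 * Framing overhead (preamble, inter-frame gap) is ignored, matching
 * the 5.12 ns figure quoted on the slide. */
#include <stdio.h>

int main(void)
{
    const double link_bps   = 100e9;   /* 100 Gbps line rate */
    const double frame_bits = 64 * 8;  /* 64 B packet */

    printf("per-packet budget = %.2f ns\n",
           frame_bits / link_bps * 1e9);   /* -> 5.12 ns */
    return 0;
}
```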

SLIDE 6

Motivation

[Figure: transistors (thousands) and single-thread performance (SpecINT × 10³) vs. year, 1980–2020, log scale]

SLIDE 7

Motivation

[Figure: transistors (thousands) and single-thread performance (SpecINT × 10³) vs. year, 1980–2020, log scale]

It is essential to use our current hardware more efficiently.

SLIDE 8

Memory Hierarchy

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

SLIDE 9

Memory Hierarchy

To keep up with the 100 Gbps time budget (5.12 ns), the cache becomes valuable, as every access to DRAM is expensive.

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

SLIDE 10

Memory Hierarchy

To keep up with the 100 Gbps time budget (5.12 ns), the cache becomes valuable, as every access to DRAM is expensive.

CPU Registers → Cache (L1, L2, LLC) → DRAM

Getting slower: <4 cycles → 4–40 cycles → >200 cycles (>60 ns)

For a CPU running at 3.2 GHz, 4 cycles is around 1.25 ns.

We focus on better management of the cache.

SLIDE 11

Better Cache Management

Reduce tail latencies of NFV service chains running at 100 Gbps by up to 21.5%.

SLIDE 12

Last Level Cache (LLC)

Intel Processor

SLIDE 13

Non-uniform Cache Architecture (NUCA)

Since Sandy Bridge (~2011), the LLC is no longer unified!

Intel Processor

SLIDE 14

Non-uniform Cache Architecture (NUCA)

Intel’s Complex Addressing determines the mapping between the memory address space and LLC slices. Almost every cache line (64 B) maps to a different LLC slice.

Known methods: performance counters (Clémentine Maurice et al. [RAID ’15]*).

Intel Processor

* Clémentine Maurice, Nicolas Le Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien Francillon. 2015. Reverse Engineering Intel Last-Level Cache Complex Addressing Using Performance Counters.

SLIDE 15

Measuring Access Time to LLC Slices

Intel Xeon E5-2667 v3 (Haswell)

Different access time to different LLC slices

SLIDE 16

Measuring Access Time to LLC Slices

Measuring Read Access Time from Core 0 to all LLC slices
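As an illustration of how such a measurement can be taken, here is a minimal, hedged C sketch that times a single read with rdtscp. It omits everything the real experiment needs (pinning the thread to core 0, keeping the line in the LLC while evicting it from L1/L2, mapping addresses to slices, and averaging over many samples); the buffer and function names are illustrative only.

```c
/* Minimal sketch: time one memory read in cycles using rdtscp.
 * The real slice-latency experiment additionally pins the thread to a core,
 * ensures the line is served by the LLC (not L1/L2 or DRAM), and averages
 * many samples per slice; none of that is shown here. */
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

static inline uint64_t time_read(volatile uint64_t *addr)
{
    unsigned aux;
    uint64_t start = __rdtscp(&aux);  /* timestamp before the load */
    (void)*addr;                      /* the load being measured */
    uint64_t end = __rdtscp(&aux);    /* timestamp after the load */
    return end - start;
}

int main(void)
{
    static uint64_t buf[8];           /* illustrative target cache line */
    printf("read latency ~= %llu cycles\n",
           (unsigned long long)time_read(buf));
    return 0;
}
```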

SLIDE 17

Opportunity

Accessing the closer LLC slice can save up to ~20 cycles, i.e., 6.25 ns, for a CPU running at 3.2 GHz.
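For reference, the cycle-to-nanosecond conversions used throughout these slides follow directly from the 3.2 GHz clock (a minimal sketch):

```c
/* Cycle-to-time conversion at a 3.2 GHz core clock. */
#include <stdio.h>

int main(void)
{
    const double freq_hz = 3.2e9;                           /* 3.2 GHz */
    printf(" 4 cycles = %.2f ns\n",  4.0 / freq_hz * 1e9);  /* ~1.25 ns */
    printf("20 cycles = %.2f ns\n", 20.0 / freq_hz * 1e9);  /* ~6.25 ns */
    return 0;
}
```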

SLIDE 18

Slice-aware Memory Management

Allocate memory from physical memory such that it maps to the appropriate LLC slice(s) (sketched below).

DRAM
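A minimal sketch of that idea in C: carve a buffer into cache-line-sized chunks and keep only the chunks whose addresses map to the desired slice. slice_of() is a hypothetical placeholder for a slice-mapping oracle (e.g., derived from the reverse-engineered hash or from timing measurements), not a real API, and a real implementation would operate on physical addresses.

```c
/* Sketch of slice-aware allocation: scan a pool in 64 B steps and collect
 * only the cache lines that map to the target LLC slice.
 * slice_of() is a hypothetical oracle (address -> slice id); in practice it
 * would be applied to physical addresses, e.g., inside hugepage-backed memory. */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

extern int slice_of(const void *addr);   /* hypothetical: returns LLC slice id */

size_t collect_lines_for_slice(uint8_t *pool, size_t pool_size,
                               int target_slice,
                               void **out, size_t max_out)
{
    size_t n = 0;
    for (size_t off = 0; off + CACHE_LINE <= pool_size && n < max_out;
         off += CACHE_LINE) {
        if (slice_of(pool + off) == target_slice)
            out[n++] = pool + off;        /* keep lines landing on the target slice */
    }
    return n;
}
```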

SLIDE 19

Slice-aware Memory Management

Use Cases:

  • Isolation
SLIDE 20

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
SLIDE 21

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
  • Performance
SLIDE 22

Slice-aware Memory Management

Use Cases:

  • Isolation
  • Shared Data
  • Performance

Every core is associated with its closest LLC slice.

SLIDE 23

Slice-aware Memory Management

256 KB / 2.5 MB / 20 MB

SLIDE 24

Slice-aware Memory Management

Beneficial when the working set can fit into a slice.

256 KB / 2.5 MB / 20 MB

SLIDE 25

Slice-aware Memory Management

There are many applications that have this characteristic.

SLIDE 26

Slice-aware Memory Management

There are many applications that have this characteristic: key-value stores (frequently accessed keys).

SLIDE 27

Slice-aware Memory Management

There are many applications that have this characteristic: key-value stores (frequently accessed keys), virtualized network functions (packet headers).

SLIDE 28

Slice-aware Memory Management

There are many applications that have this characteristic, and their hot data can fit into a slice: key-value stores (frequently accessed keys), virtualized network functions (packet headers).

SLIDE 29

Slice-aware Memory Management

There are many applications that have this characteristic, and their hot data can fit into a slice: key-value stores (frequently accessed keys), virtualized network functions (packet headers). We focus on virtualized network functions in this talk!
SLIDE 30

CacheDirector

A network I/O solution that extends Data Direct I/O (DDIO) by employing Slice-aware Memory Management.

SLIDE 31

Traditional I/O

  • 1. NICs DMA* packets to DRAM.
  • 2. The CPU will then fetch them into the LLC.

* Direct Memory Access (DMA)

SLIDE 32

Data Direct I/O (DDIO)

DMA*-ing packets directly to the LLC rather than DRAM.

Sending/Receiving Packets via DDIO

* Direct Memory Access (DMA)

SLIDE 33

Data Direct I/O (DDIO)

Sending/Receiving Packets via DDIO

Packets go to random slices!

SLIDE 34

Data Direct I/O (DDIO)

Sending/Receiving Packets via DDIO

Packets go to random slices!

SLIDE 35

CacheDirector

Sending/Receiving Packets via DDIO

SLIDE 36

CacheDirector

Sending/Receiving Packets via DDIO

SLIDE 37

CacheDirector

  • Sends the packet’s header to the appropriate LLC slice.
  • Implemented as part of user-space NIC drivers in the Data Plane Development Kit (DPDK).
  • Introduces dynamic headroom in DPDK data structures (see the sketch below).
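The following is a minimal, hypothetical sketch of the dynamic-headroom idea, not the actual CacheDirector code: before a buffer is handed to the NIC, shift the data offset inside the mbuf so that the cache line receiving the packet header maps to the LLC slice closest to the processing core. slice_of() and closest_slice_of_core() are assumed helpers (e.g., built from the complex-addressing hash or from timing measurements); only struct rte_mbuf and its buf_addr/buf_len/data_off fields are real DPDK items here.

```c
/* Hypothetical sketch of CacheDirector's dynamic headroom (not the real code):
 * choose the mbuf data offset so the header's cache line maps to the slice
 * closest to the core that will process the packet. */
#include <stdint.h>
#include <rte_mbuf.h>

extern int slice_of(const void *addr);            /* assumed slice oracle      */
extern int closest_slice_of_core(unsigned lcore); /* assumed core -> slice map */

void place_header_on_closest_slice(struct rte_mbuf *m, unsigned lcore)
{
    int want = closest_slice_of_core(lcore);

    /* Try candidate offsets one cache line (64 B) apart within the buffer.
     * A real implementation works on physical (IO) addresses and respects
     * the driver's minimum headroom; both are glossed over here. */
    for (uint16_t off = 0; off + 64 <= m->buf_len; off += 64) {
        if (slice_of((char *)m->buf_addr + off) == want) {
            m->data_off = off;   /* the NIC will DMA the header to this line */
            return;
        }
    }
    /* No matching line found: leave the default offset unchanged. */
}
```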

SLIDE 38

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps.

SLIDE 39

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps; actual campus trace; timestamping.

SLIDE 40

Evaluation - Testbed

Packet Generator → Device under Test running VNFs (Intel Xeon E5-2667 v3, Mellanox ConnectX-4), connected at 100 Gbps; actual campus trace.

Stateful NFV Service Chain: Metron [NSDI ’18]*

* Georgios P. Katsikas, Tom Barbette, Dejan Kostic, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.

SLIDE 41

Evaluation – 100 Gbps

Achieved Throughput ~76 Gbps

Stateful NFV Service Chain

SLIDE 42

Evaluation – 100 Gbps

21.5% Improvement

Achieved Throughput ~76 Gbps

Stateful NFV Service Chain

SLIDE 43

Evaluation – 100 Gbps

Faster access to packet headers → faster processing time per packet → reduced queueing time.

Achieved Throughput ~76 Gbps

21.5% Improvement

Stateful NFV Service Chain

SLIDE 44

Evaluation – 100 Gbps

Achieved Throughput ~76 Gbps

Faster access to packet headers → faster processing time per packet → reduced queueing time.

21.5% Improvement

More Predictable → Fewer SLO* Violations

* Service Level Objective (SLO)

Stateful NFV Service Chain

SLIDE 45

Read More …

  • More NFV results
  • Slice-aware key-value store
  • Portability of our solution on the Skylake architecture
  • Slice isolation vs. Cache Allocation Technology (CAT)
  • More …
SLIDE 46

Conclusion

  • Hidden opportunity that can decrease average access time to LLC by ~20%
  • Useful for other applications
  • Meet us at the poster session

https://github.com/aliireza/slice-aware

This work is supported by WASP, SSF, and ERC.

SLIDE 47

Backup

SLIDE 48

Portability

  • Intel Xeon Gold 6134 (Skylake)
  • Mesh architecture
  • 8 cores and 18 slices
  • Non-inclusive LLC
  • Does not affect DDIO

SLIDE 49

Packet Header Sizes

  • IPv4: 14 B (Ethernet) + 20 B (IPv4) + 20 B (TCP) < 64 B
  • IPv6: 14 B (Ethernet) + 40 B (IPv6) + 20 B (TCP) > 64 B

Any 64 B of the packet can be placed in the appropriate slice.

SLIDE 50

Limitations and Considerations

  • Data larger than 64 B
  • Slice imbalance

Limiting our application to a smaller portion of the LLC, but with faster access.

  • Using linked lists and scattered data
  • Future H/W features: bigger chunks (e.g., 4 KB pages); programmability
SLIDE 51

Relevant and Future Works

  • NUCA
  • Cache-aware Memory Management (e.g., Partitioning and Page Coloring)
  • Extending CacheDirector for the whole packet
  • Slice-aware Hypervisor
SLIDE 52

Slice-aware Memory Management

SLIDE 53

Evaluation – Low Rate

Simple Forwarding Application, 1,000 packets/s

SLIDE 54

Evaluation – Tail vs. Throughput

Slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.

SLIDE 55

Evaluation – Tail vs. Throughput

Slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.