Make the Most out of Last Level Cache in Intel Processors
Alireza Farshin*, Amir Roozbeh*+, Gerald Q. Maguire Jr.*, Dejan Kostić*
* KTH Royal Institute of Technology (EECS/COM)
+ Ericsson Research
[Figure: Transistors (thousands) and Single-Thread Performance (SpecINT x 10^3) versus Year, 1980 to 2020; logarithmic vertical axis up to 10^7]
Memory Hierarchy (getting slower from top to bottom)
CPU Registers: <4 cycles
Cache (L1, L2, LLC): 4-40 cycles
DRAM: >200 cycles (>60 ns)
For a CPU running at 3.2 GHz, every 4 cycles take around 1.25 ns.
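The cycle counts above can be converted to time with a one-line calculation. The following is a minimal sketch; the helper name `cycles_to_ns` is an illustrative choice and not part of the original slides.

```python
def cycles_to_ns(cycles, freq_ghz):
    """Convert a cycle count to nanoseconds; one cycle lasts 1/freq_ghz ns."""
    return cycles / freq_ghz


FREQ_GHZ = 3.2  # clock frequency quoted on the slide

# Boundary values taken from the memory hierarchy slide.
for label, cycles in [("registers", 4), ("cache", 40), ("DRAM", 200)]:
    print(f"{label}: {cycles} cycles = {cycles_to_ns(cycles, FREQ_GHZ):.2f} ns")
```

At 3.2 GHz, 4 cycles come to 1.25 ns and 200 cycles to 62.5 ns, which matches the ">60 ns" figure given for DRAM.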
Intel Processor
Since Sandy Bridge (~2011), the LLC is no longer unified!
Intel's Complex Addressing
Determines the mapping between the memory address space and LLC slices.
Almost every cache line (64 B) maps to a different LLC slice.
Known methods: Clémentine Maurice et al. [RAID '15]*
* Clémentine Maurice, Nicolas Scouarnec, Christoph Neumann, Olivier Heen, and Aurélien [...] Performance Counters.
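To make the idea of an address-to-slice mapping concrete, here is a minimal sketch. It assumes the mapping is a hash in which each bit of the slice index is the parity (XOR) of a selected set of physical address bits. The masks below are hypothetical values chosen only for illustration; they are not Intel's actual function, and the name `slice_of` is likewise an assumption.

```python
CACHE_LINE_BYTES = 64

# HYPOTHETICAL masks: one per output bit of the slice index (3 bits = 8 slices).
# They are made up for illustration and are NOT the real Complex Addressing hash.
MASKS = (0x1040, 0x2080, 0x4100)


def slice_of(phys_addr):
    """Return the slice index for a physical address under the assumed hash."""
    slice_id = 0
    for bit, mask in enumerate(MASKS):
        # Parity of the selected address bits gives one bit of the slice index.
        parity = bin(phys_addr & mask).count("1") & 1
        slice_id |= parity << bit
    return slice_id


# Eight consecutive cache lines and the slices they map to.
for line in range(8):
    addr = line * CACHE_LINE_BYTES
    print(f"address {addr:#06x} -> slice {slice_of(addr)}")
```

Because no mask covers the low 6 address bits, all bytes of one 64 B cache line share a slice, while neighbouring cache lines land in different slices.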
Intel Xeon E5-2667 v3 (Haswell)
For a CPU that is running at 3.2 GHz.
256 KB, 2.5 MB, 20 MB
Beneficial when the working set can fit into a slice.
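The condition "the working set can fit into a slice" can be checked with simple arithmetic. The sketch below assumes that the three sizes on the slide are the L2 capacity (256 KB), the capacity of one LLC slice (2.5 MB), and the total LLC capacity (20 MB). The slide does not label them, so this reading and the helper name `smallest_fit` are assumptions.

```python
KB = 1024
MB = 1024 * KB

# ASSUMPTION: the slide's three sizes correspond to these levels.
LEVELS = [
    ("L2", 256 * KB),
    ("LLC slice", int(2.5 * MB)),
    ("LLC", 20 * MB),
]


def smallest_fit(working_set_bytes):
    """Return the smallest level whose capacity holds the working set."""
    for name, capacity in LEVELS:
        if working_set_bytes <= capacity:
            return name
    return "DRAM"


for size in (128 * KB, 1 * MB, 4 * MB, 64 * MB):
    print(f"{size // KB} KB working set fits in: {smallest_fit(size)}")
```

Under this reading, slice-aware placement pays off for working sets that are larger than 256 KB but no larger than 2.5 MB.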
Sending/Receiving Packets via DDIO
[Figure labels: LLC, DRAM]
* Direct Memory Access (DMA)
CacheDirector
Stateful NFV Service Chain
* Georgios P. Katsikas, Tom Barbette, Dejan Kostić, Rebecca Steinert, and Gerald Q. Maguire Jr. 2018. Metron: NFV Service Chains at the True Speed of the Underlying Hardware.
Achieved Throughput ~76 Gbps
Faster access to packet header
Faster processing time per packet
Reduced queueing time
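The link between per-packet processing time and queueing time can be illustrated with a textbook queueing model. The sketch below assumes an M/M/1 queue, which the slides do not mention. The arrival rate and service times are hypothetical numbers, not measurements from this work.

```python
def mm1_queueing_time(arrival_rate, service_time):
    """Mean time a packet waits in an M/M/1 queue before being served.

    arrival_rate is in packets per time unit; service_time is in the same time unit.
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("unstable queue: utilization must be below 1")
    return utilization * service_time / (1.0 - utilization)


ARRIVAL_RATE = 0.9  # hypothetical: packets per microsecond

slow = mm1_queueing_time(ARRIVAL_RATE, 1.00)  # hypothetical baseline service time
fast = mm1_queueing_time(ARRIVAL_RATE, 0.95)  # service time reduced by 5%

print(f"queueing time, baseline: {slow:.2f} us")
print(f"queueing time, 5% faster processing: {fast:.2f} us")
```

In this model, at high utilization, a small cut in processing time per packet shrinks the queueing time by a much larger fraction.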
* Service Level Objective (SLO)
This work is supported by WASP, SSF, and ERC.
Limiting our application to a smaller portion of the LLC, but with faster access.
(e.g., Partitioning and Page Coloring)
Simple Forwarding Application 1000 Packets/s
Slightly shifts the knee, which means CacheDirector is still beneficial when the system is experiencing a moderate load.