Augustus: a CCN router for programmable networks (ACM ICN 2016, Kyoto)


SLIDE 1

Augustus: a CCN router for programmable networks

ACM ICN 2016, Kyoto

Davide Kirchner¹*, Raihana Ferdous²*, Renato Lo Cigno³, Leonardo Maccari³, Massimo Gallo⁴, Diego Perino⁵*, and Lorenzo Saino⁶

September 27, 2016

¹Google Inc., Dublin, Ireland; ²Create-Net, Trento, Italy; ³DISI – University of Trento, Italy; ⁴Bell Labs – Nokia, Paris, France; ⁵Telefonica Research, Spain; ⁶Fastly, London, UK

*This work was done while D. Kirchner and R. Ferdous were at the University of Trento, and D. Perino and L. Saino at Bell Labs.

SLIDE 2

Outline

  • 1. Introduction
  • 2. The Augustus CCN router
  • 3. Performance evaluation
  • 4. Conclusions and lessons learned


SLIDE 3

Introduction

SLIDE 4–5

Objectives

The main goal is to explore the possibilities offered by modern general-purpose hardware in the context of information-centric networking:

  • Implement a CCN data-plane forwarder fully in software
  • Run on a commodity x86-64 machine
  • Performance-oriented, open-source, and extensible
  • Analyze performance in a worst-case scenario

Why a software router? Flexibility:

  • Quicker development/deployment cycle and (re)configuration
  • Hardware can be dynamically allocated to network functions

Tools:

  • Off-the-shelf high-performance hardware
  • High-speed packet I/O libraries [Int, Riz12]
  • Software routing frameworks built on top [BSM15, KJL+15]

SLIDE 6–9

Forwarding flow

  • Focus on the Content Centric Networking approach [JST+09]
  • Interests hold the full content name
  • Similar to CCNx (vs. NDN)
  • CS and PIT: exact match
  • Longest-prefix match at the FIB

Example: get /com/updates/sw/v4.2.5.tar.gz at router R2:

  • Forwarding Information Base (FIB): /com/updates → eth0
  • Pending Interest Table (PIT): /com/updates/sw/v4.2.5.tar.gz → {eth1}
  • Content Store (CS): /com/updates/sw/v4.2.5.tar.gz → (data…)

[Figure: example topology with endpoints A, B, C and routers R1, R2, R3; R2 has ports eth0, eth1, eth2]
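The FIB's longest-prefix match operates on '/'-separated name components. As a minimal sketch of that lookup (the array-based FIB and the helper below are hypothetical simplifications for illustration, not Augustus's actual data structures, which are the hash tables described on the following slides), the match can be done by repeatedly stripping the last component:

```c
#include <string.h>

/* Hypothetical, simplified FIB: an array of (prefix, output port) pairs. */
struct fib_entry { const char *prefix; int port; };

/* Longest-prefix match on '/'-separated name components: look for an exact
 * prefix match, and on failure strip the last component and retry.
 * Returns 1 and sets *port on a hit, 0 on a miss. */
int fib_lpm(const struct fib_entry *fib, int n, const char *name, int *port)
{
    char buf[256];
    strncpy(buf, name, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (;;) {
        for (int i = 0; i < n; i++) {
            if (strcmp(fib[i].prefix, buf) == 0) {
                *port = fib[i].port;
                return 1;
            }
        }
        char *slash = strrchr(buf, '/');
        if (!slash || slash == buf)
            return 0;        /* no more components to strip */
        *slash = '\0';       /* drop the last name component */
    }
}
```

For the example above, looking up /com/updates/sw/v4.2.5.tar.gz in a FIB holding /com/updates → eth0 strips components until it hits the /com/updates entry.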

SLIDE 10

The Augustus CCN router

SLIDE 11

Design principles

  • Exploit parallelism at all possible levels:
      • Hardware multi-queue at the NIC
      • DRAM memory channels
      • Multiple cores per chip
      • Multiple NUMA sockets
  • Data structures designed to match the x86 cache system
  • Shared read-only FIB, duplicated in all NUMA sockets
  • Sharded, thread-private CS and PIT
  • Exploit the NIC's Receive Side Scaling capabilities to dispatch incoming packets to threads
  • Zero-copy packet processing
  • Based on DPDK for fast packet I/O [Int]
  • Two trade-offs explored: maximum performance or more flexibility
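The sharding idea can be sketched in a few lines: hash the content name once and use the hash to pick the thread (and therefore the thread-private PIT/CS shard) that owns it, so the per-thread structures never need locks. The FNV-1a hash below is chosen purely for illustration; in the actual design the NIC's RSS hardware performs the equivalent dispatch on incoming packets.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a: a simple illustrative hash, standing in for whatever hash the
 * NIC's Receive Side Scaling or the forwarder actually computes. */
uint64_t name_hash(const char *name)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; name[i]; i++) {
        h ^= (uint8_t)name[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Each worker thread owns exactly one PIT/CS shard; a given name always
 * maps to the same shard, so no locking is needed on those structures. */
unsigned shard_for(const char *name, unsigned n_threads)
{
    return (unsigned)(name_hash(name) % n_threads);
}
```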

SLIDE 12

Design - standalone

Low-level standalone C implementation:

  • Based on low-level optimized APIs
  • Pushes the platform to its limits
  • Architecture based on Caesar [PVL+14]


SLIDE 13

Design - modular

  • Based on (Fast)Click [KMC+00, BSM15]
  • Easy to extend and experiment with
  • Same optimized data structures
  • Can be deployed alongside other routing components

[Figure: Click element pipeline: FromDPDKDevice(n) → InputMux → CheckICNHeader → ICN_CS → ICN_PIT → ICN_FIB → OutputDemux → ToDPDKDevice(n), with a Discard sink for invalid packets; I = Interest packet, D = Data packet]

SLIDE 14

Performance evaluation

SLIDE 15

Experimental setup

  • Two twin machines, each with two 10 Gbps Ethernet ports
  • Measurements expressed in data packets per second
  • Work in slight overload conditions

Worst-case assumptions:

  • Every interest packet has a unique name: no CS hits, no PIT aggregation
  • Minimal-sized packets, to stress the forwarding engine

[Figure: the Augustus router connects via eth0/eth1 to the traffic generator and sink machine, which runs an interest generator and an echo server; interests flow toward the echo server and data flows back]
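Minimal-sized packets maximize the packet rate a link can carry, which is what makes this the worst case for the forwarding engine. A small worked computation, using the standard per-frame Ethernet overheads (14 B header + 4 B FCS + 8 B preamble + 12 B inter-frame gap; these figures are standard Ethernet, not stated on the slide):

```c
/* Packets per second needed to saturate a link of the given bit rate,
 * as a function of the Ethernet payload size. Each frame occupies
 * payload + 14 (header) + 4 (FCS) + 8 (preamble) + 12 (IFG) bytes
 * on the wire. */
double line_rate_pps(double link_bps, unsigned payload_bytes)
{
    unsigned wire_bytes = payload_bytes + 14 + 4 + 8 + 12;
    return link_bps / (wire_bytes * 8.0);
}
```

An 87-byte payload occupies 125 bytes (1000 bits) on the wire, so a saturated 10 Gbit/s link carries exactly 10 Mpps; this is consistent with the "more than 10 million data packets per second" figure in the conclusions.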

SLIDE 16–23

Threads and core mapping

Threads are pinned to processing cores. Test servers: 2 sockets × 8 cores × 2 (hyperthreading).

[Figure: cache hierarchy of the test server. Each socket has one shared L3 cache; each physical core has private L2, L1-D, and L1-I caches and hosts two hyperthreads. Logical CPUs n and n+16 are siblings on the same physical core; even-numbered CPUs sit on one socket, odd-numbered CPUs on the other. The slide's animation steps through successive thread placements.]
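In a DPDK application the pinning normally comes from the EAL, which launches one worker per lcore; purely as an illustration of what "threads are pinned to processing cores" means, the bare-pthreads (Linux-specific) way to pin the calling thread to one logical CPU is:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single logical CPU (Linux-specific).
 * Returns 0 on success. DPDK applications usually get this for free
 * from the EAL's per-lcore worker launch. */
int pin_to_core(unsigned core_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}
```

Pinning one worker per physical core (and placing hyperthread siblings deliberately) is what makes the single-socket vs. dual-socket vs. hyperthreading comparisons on the next slides meaningful.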

SLIDE 24

Standalone performance

[Figure: data throughput [Mpps] and L3 cache miss ratio vs. number of threads (1–32), for hyperthreading, single-socket, and dual-socket configurations]

  • 2 threads: large gap between hyperthreaded and physical cores
  • Best performance: 4 threads (dual socket), 8 threads (single/dual)

SLIDE 25

Click module performance

[Figure: data throughput [Mpps] and L3 cache miss ratio vs. number of threads (1–32), for hyperthreading, single-socket, and dual-socket configurations]

  • 1 thread: same cache miss ratio as standalone, half the performance
  • Best performance: 16 threads

SLIDE 26

FIB size scaling

[Figure: data throughput [Mpps] and cache miss ratio vs. number of FIB buckets (2¹² to 2²⁶), for standalone with 8, 4, and 1 threads and the Click module with 16 and 1 threads]
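Sweeping the bucket count over powers of two is the natural choice for a hash-table FIB: the bucket index then reduces to a bit mask (a single AND instead of a division), while doubling the bucket count trades fewer collisions for a larger table footprint and, eventually, more cache misses, which is the trend the figure shows. A sketch of the indexing, assuming a conventional power-of-two table (this is an illustration, not Augustus's actual code):

```c
#include <stdint.h>

/* With a power-of-two bucket count, the bucket index is a single AND
 * with (n_buckets - 1); n_buckets must be a power of two. */
uint32_t bucket_index(uint64_t hash, uint32_t n_buckets)
{
    return (uint32_t)(hash & (n_buckets - 1));
}
```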

SLIDE 27

Conclusions and lessons learned

SLIDE 28–29

Conclusions and lessons learned

We present Augustus, a CCN software router that:

  • Forwards more than 10 million data packets per second, supports a FIB with up to 2²⁶ entries, and saturates a 10 Gbit/s link with Ethernet payloads as small as 87 bytes
  • Was tested with a thorough worst-case-oriented performance evaluation
  • Runs either as a standalone system, achieving the best performance, or as a set of elements in the Click modular router framework
  • Is open source and can be used in software-based networks for fast and incremental ICN deployment

Lessons learned:

  • Manual configuration is needed for best performance
  • Abstraction hides critical low-level properties
  • Zero-copy is complex to achieve in a modular framework

SLIDE 30

Augustus: a CCN router for programmable networks

ACM ICN 2016, Kyoto

September 27, 2016

Thanks for your attention

davkir@google.com

SLIDE 31

Bibliography

SLIDE 32

References I

[BSM15] Tom Barbette, Cyril Soldani, and Laurent Mathy. Fast userspace packet processing. In Proceedings of the Eleventh ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS '15, pages 5–16, Washington, DC, USA, 2015. IEEE Computer Society.

[Int] Intel. DPDK: Data Plane Development Kit. http://dpdk.org.

[JST+09] Van Jacobson, Diana K. Smetters, James D. Thornton, Michael F. Plass, Nicholas H. Briggs, and Rebecca L. Braynard. Networking named content. In Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, CoNEXT '09, pages 1–12, New York, NY, USA, 2009. ACM.


SLIDE 33

References II

[KJL+15] Joongi Kim, Keon Jang, Keunhong Lee, Sangwook Ma, Junhyun Shim, and Sue Moon. NBA (network balancing act): A high-performance packet processing framework for heterogeneous processors. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 22:1–22:14, New York, NY, USA, 2015. ACM.

[KMC+00] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. The Click modular router. ACM Trans. Comput. Syst., 18(3):263–297, August 2000.

[PVL+14] Diego Perino, Matteo Varvello, Leonardo Linguaglossa, Rafael Laufer, and Roger Boislaigue. Caesar: A content router for high-speed forwarding on content names. In Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ANCS '14, pages 137–148, New York, NY, USA, 2014. ACM.


SLIDE 34

References III

[Riz12] Luigi Rizzo. netmap: A novel framework for fast packet I/O. In 21st USENIX Security Symposium (USENIX Security 12), pages 101–112, Bellevue, WA, August 2012. USENIX Association.
