Continuous Distributed Monitoring in the Evolved Packet Core - PowerPoint PPT Presentation
slide-1
SLIDE 1

Continuous Distributed Monitoring in the Evolved Packet Core

Industry Experience Report

Romaric Duvignau 1, Marina Papatriantafilou 1, Konstantinos Peratinos 3, Eric Nordström 2, Patrik Nyman 2

DEBS 2019, Darmstadt (June 26).

1 Chalmers University of Technology, 2 Ericsson, 3 Chalmers student and Ericsson intern.

slide-2
SLIDE 2

Introduction

slide-3
SLIDE 3

Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[Diagram: User Equipment (UE) connects over LTE to a base station; control-plane (CP) signaling uses PFCP and user-plane (UP) traffic uses GTP with TEIDs; inside the EPC, the EPG's workers W1, W2, ..., WN (Sxa/Sxb interfaces) forward traffic to servers in the PDN.]

slide-4
SLIDE 4

Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[EPC diagram repeated from the previous slide.]

MME, QoS, billing, ...

slide-5
SLIDE 5

Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[EPC diagram repeated from the previous slides.]

MME, QoS, billing, ... Packet Gateway

slide-6
SLIDE 6

Context: Monitoring the Evolved Packet Core (EPC) in 4G

The Evolved Packet Core

[EPC diagram repeated from the previous slides.]

  • Large-scale, distributed, performance-critical system.
  • Strong need to continuously monitor the EPC: e.g., detection of under- or over-used subcomponents.

MME, QoS, billing, ... Packet Gateway

slide-7
SLIDE 7

Continuous Distributed Monitoring

slide-8
SLIDE 8

Continuous Distributed Monitoring (CDM) Model


slide-9
SLIDE 9

Continuous Distributed Monitoring (CDM) Model


f (S1, S2, · · · , Sk)

slide-10
SLIDE 10

Continuous Distributed Monitoring (CDM) Model


f (S1, S2, · · · , Sk) There exist variants (unidirectional, relay nodes, etc).

slide-11
SLIDE 11

Continuous Distributed Monitoring (CDM) Model


f (S1, S2, · · · , Sk). There exist variants (unidirectional, relay nodes, etc.).

  • Instant computation & communication assumed
  • f depends on ∪Si
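The model on this slide can be sketched as follows: k sites each observe a local stream S_i and report to a coordinator that continuously maintains f(S1, ..., Sk). This is an illustrative sketch, not code from the paper; the class names and the choice of f (a global event count) are ours, and the naive "forward every event" strategy is exactly what CDM algorithms try to improve on.

```python
# Minimal CDM-model sketch (illustrative): k sites stream events to a
# coordinator, which continuously maintains f(S1, ..., Sk) -- here f is
# the global event count over the union of the streams.

class Coordinator:
    def __init__(self):
        self.total = 0          # f(S1, ..., Sk) = |S1| + ... + |Sk|

    def receive(self, site_id, event):
        self.total += 1         # instant computation & communication assumed

class Site:
    def __init__(self, site_id, coordinator):
        self.site_id = site_id
        self.coordinator = coordinator

    def observe(self, event):
        # Naive variant: forward every event (flooding). Real CDM
        # algorithms track f while sending far fewer messages.
        self.coordinator.receive(self.site_id, event)

coord = Coordinator()
sites = [Site(i, coord) for i in range(3)]
for s in sites:
    for e in range(10):
        s.observe(e)
print(coord.total)  # 30 events observed across the 3 sites
```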
slide-12
SLIDE 12

System Architecture

slide-13
SLIDE 13

System Architecture Overview

[Diagram: incoming traffic passes a load balancer to aggregators Agg1, Agg2, ...; each aggregator Agg_i manages workers w_i^1, w_i^2, ...; the aggregators connect to the coordinator C.]

slide-14
SLIDE 14

System Architecture Overview

[Diagram as above, now annotated: the aggregators fetch statistics from their workers.]

slide-15
SLIDE 15

System Architecture Overview

[Diagram as above; the aggregators send monitoring messages to the coordinator C.]

slide-16
SLIDE 16

System Architecture Overview

[Diagram as above; the coordinator C feeds a display for analysts.]

slide-17
SLIDE 17

System Architecture Overview

[Diagram as on the previous slides.]

Differences with CDM models

  • Site identity matters, performance statistics = "events", etc.
  • Need to account for computation and communication delays!

slide-18
SLIDE 18

System Architecture Overview

[Diagram as above; a timeline below shows the monitoring period.]

slide-19
SLIDE 19

System Architecture Overview

[Diagram as above; the timeline marks the fetches within each monitoring period.]

slide-20
SLIDE 20

System Architecture Overview

[Diagram as above; the timeline adds a sliding window over the fetched values.]


slide-23
SLIDE 23

System Architecture Overview

[Diagram and timeline as on the previous slides.]

→ At the aggregator: monitoring decisions are taken, then (at most) one monitoring message is sent.

slide-24
SLIDE 24

Monitoring Algorithms

slide-25
SLIDE 25

Selected CDM Algorithms for Counting problems

Basic Mode: Exact Monitoring

  • Send an update if the last value sent differs from the measured value
  • Keep an exact sliding window of the last n values
slide-26
SLIDE 26

Selected CDM Algorithms for Counting problems

Basic Mode: Exact Monitoring

  • Send an update if the last value sent differs from the measured value
  • Keep an exact sliding window of the last n values

Approximation Mode: Relative Error of ε

  • Uses Exponential Histograms for approximate counting
  • Send the approximate count when it is beyond some error bound from the last value sent
  • Requires O(log(nε)/ε) words in total
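The basic mode above can be sketched in a few lines (a sketch under our naming, not the paper's implementation): send an update only when the measured value differs from the last value sent, while keeping an exact sliding window of the last n values.

```python
from collections import deque

# Sketch of "basic mode" exact monitoring (names are ours): an update is
# emitted only when the new measurement differs from the last value sent;
# an exact sliding window of the last n values is maintained.

class BasicMonitor:
    def __init__(self, n):
        self.window = deque(maxlen=n)  # exact window of the last n values
        self.last_sent = None

    def measure(self, value):
        """Return the update to send, or None if the value is unchanged."""
        self.window.append(value)
        if value != self.last_sent:
            self.last_sent = value
            return value
        return None

m = BasicMonitor(n=4)
updates = [m.measure(v) for v in [5, 5, 5, 7, 7, 5]]
print(updates)          # [5, None, None, 7, None, 5]
print(list(m.window))   # [5, 7, 7, 5] -- the last 4 measurements
```

Repeated measurements of a stable value therefore cost no communication, which is what makes this mode attractive for slowly varying statistics.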
slide-27
SLIDE 27

Results

slide-28
SLIDE 28

Experimental setup

  • EPG setup: 2 aggregators, 72 workers per aggregator
  • 2 phases: increasing load (20 min), then stable load (15 min)

[Plots: CPU utilization (%) and packet rate (packets/s) over the run; Max, p95, Median, p5, Min curves.]

slide-29
SLIDE 29

Experimental setup

  • EPG setup: 2 aggregators, 72 workers per aggregator
  • 2 phases: increasing load (20 min), then stable load (15 min)

[Plots as above.]

1000 fetches/s – high precision; 1 fetch/s – low precision

slide-30
SLIDE 30
No. of Monitoring Updates per Round

  • 5–10% of data sent for the packet processing rate; 30–70% for CPU.

[Plots: updates vs. Basic for CPU and packet rate, at 5%, 10%, 20%, and 5%W60.]

slide-31
SLIDE 31
No. of Monitoring Updates per Round

  • 5–10% of data sent for the packet processing rate; 30–70% for CPU.

[Plots as above.]

  • Max relative error < 5ε/9 and average < ε/5.

slide-32
SLIDE 32

Monitoring Availability

  • 8 runs (ca. 4 h of data) with a monitoring round of 1 s

[Plots: update time (s, MA300) and availability (MA300) for B, 5–20%, B5%–B20%, and B5%W60 configurations on Agg1/2.]

slide-33
SLIDE 33

Conclusion

slide-34
SLIDE 34

Conclusions

  • Adjusted state-of-the-art CDM implementations in the EPC
  • Keys to popularizing CDM within a production-level system
  • From experiments, only 6% of data sent for 1.6% avg. error
  • Useful for the upcoming transition to the 5G architecture

slide-35
SLIDE 35

Thank you!


slide-36
SLIDE 36

Error Analysis

  • Max relative error is always close to 5ε/9
  • A larger window influences the absolute error on CPU

[Plots: absolute error over time for the 5% and 5%W60 configurations; Max, p90, Median, p10, Min curves.]

slide-37
SLIDE 37

Comparison with Simple Approximation

  • Simple Approximation: keep an exact window and send updates when the last count is beyond some predefined relative bound

[Plots: no. of updates (pkt) and relative errors (%, pkt) for B, 5%, 10%, 20%, and 5%W60.]

  • The ε-approximate algorithm presents similar tradeoffs to the simple approximation with bound 5ε/9

slide-38
SLIDE 38

CDM approaches

Simple approaches

  • Flooding: does not scale!
  • Polling: hard to choose the right polling interval!
  • Sampling: does not capture scarce under- or over-used components!

Solutions

  • Communication-optimal algorithms
  • Geometric Monitoring → efficient network-wide aggregates.
  • Tailored algorithms for particular tasks → e.g., computing the frequency of items or the most popular ones.
  • Heuristics → e.g., adaptive filters.
  • Compromises: Magpie, Dapper, Ganglia, ...
slide-39
SLIDE 39

Proposed Monitoring Solutions

[Timeline: monitoring period, fetches, sliding window; monitoring logic runs for each monitored value.]

  • Implemented as part of the aggregator nodes
  • Once all fetches have been collected, a monitoring decision is taken before propagating the update
  • Aggregation of all monitoring updates: (up to) a single monitoring message is sent per aggregator
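One monitoring round at an aggregator can be sketched as follows (an illustrative sketch; the function and class names are ours, and a simple change detector stands in for the per-value monitoring logic):

```python
# Sketch of one aggregator round: collect all worker fetches, take a
# monitoring decision per value, then send at most one aggregated
# monitoring message to the coordinator.

class ChangeDetector:
    def __init__(self):
        self.last_sent = None

    def decide(self, value):
        if value != self.last_sent:
            self.last_sent = value
            return value
        return None

def monitoring_round(fetched, detectors, send):
    """fetched: {worker_id: measured value} for this round."""
    updates = {w: u for w, v in fetched.items()
               if (u := detectors[w].decide(v)) is not None}
    if updates:
        send(updates)   # (up to) one monitoring message per aggregator
    return updates

detectors = {w: ChangeDetector() for w in ("w1", "w2")}
sent = []
monitoring_round({"w1": 10, "w2": 20}, detectors, sent.append)
monitoring_round({"w1": 10, "w2": 25}, detectors, sent.append)
print(sent)  # [{'w1': 10, 'w2': 20}, {'w2': 25}]
```

Batching all per-value updates into a single message is what keeps the monitoring traffic to the coordinator bounded per round.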

slide-40
SLIDE 40

Selected CDM Algorithms

Basic Mode

  • Send an update if the last value sent differs
  • Keep an exact sliding window of length n

ε-Approximation Mode

  • Maintains an (ε/9)-approximate Exponential Histogram counting the approximate sum ĉ of items over a sliding window of the last n events
  • Whenever ĉ > (1 + 4ε/9)·c or ĉ < (1 − 4ε/9)·c, where c is the last value sent, send an update
  • Requires O(log(nε)/ε) words of memory in total
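The update rule of the ε-approximation mode can be sketched as below. This is a sketch of the decision thresholds only: a real implementation would obtain ĉ from the (ε/9)-approximate Exponential Histogram, which we do not reproduce here, and the class name is ours.

```python
# Sketch of the ε-approximation update rule: send an update whenever the
# current (approximate) count c_hat drifts beyond (1 ± 4ε/9)·c, where c
# is the last value sent. c_hat would come from an Exponential Histogram
# in a full implementation; here it is simply passed in.

class ApproxMonitor:
    def __init__(self, eps):
        self.eps = eps
        self.last_sent = None   # c: last value sent to the coordinator

    def decide(self, c_hat):
        """Return the update to send, or None if c_hat is within bounds."""
        c = self.last_sent
        if (c is None
                or c_hat > (1 + 4 * self.eps / 9) * c
                or c_hat < (1 - 4 * self.eps / 9) * c):
            self.last_sent = c_hat
            return c_hat
        return None

m = ApproxMonitor(eps=0.09)   # thresholds at c * (1 ± 0.04)
print(m.decide(100))   # 100  (the first value is always sent)
print(m.decide(103))   # None (within 4% of 100)
print(m.decide(105))   # 105  (beyond the upper threshold)
```

The multiplicative band around the last sent value is what turns a small relative error budget ε into large communication savings when counts are stable.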
slide-41
SLIDE 41

Measuring Metrics of Interests: 2 modes

With high granularity: CPU usage

  • 1. P fetches of CPU usage (each covering the past 1 ms) within one monitoring period
  • 2. A frequency chart (histogram of F bins) is built for the P fetches
  • 3. The sliding windows are updated: each bin is monitored
  • 4. For each changed (basic) or out-of-bounds (approx.) value, a monitoring update is sent
  • 5. Upon receiving an update, C updates its frequency counts for the respective observer and CPU bin, and may then display the average CPU over the window as (Σ_{1≤i≤F} i·f_i) / (Σ_{1≤i≤F} f_i)
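Step 5's windowed average is a weighted mean over the F histogram bins; a minimal sketch (our function name, and treating the bin index itself as the CPU level, which is a simplification):

```python
# Compute the average CPU displayed by the coordinator from the per-bin
# frequency counts f_1..f_F: (sum_i i*f_i) / (sum_i f_i), where bin i
# stands for CPU level i (the bin-to-CPU mapping is assumed here).

def average_cpu(freq):
    """freq[i-1] = f_i, the count of fetches that fell into CPU bin i."""
    total = sum(freq)
    if total == 0:
        return 0.0
    return sum(i * f for i, f in enumerate(freq, start=1)) / total

# 10 bins; most fetches landed in bins 4-8:
print(average_cpu([0, 0, 0, 1, 4, 6, 4, 1, 0, 0]))  # 6.0 (in bin units)
```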

With low granularity: Packet Processing Rate

  • Only the number of processed packets per monitoring period is tracked