
SLIDE 1

Learning Queuing Networks by Recurrent Neural Networks

Giulio Garbi, Emilio Incerto and Mirco Tribastone
IMT School for Advanced Studies Lucca, Lucca, Italy
giulio.garbi@imtlucca.it
ICPE 2020 Virtual Conference, April 20–24, 2020

SLIDE 2

Motivation

  • Performance means revenue
  • «We are not the fastest retail site on the internet today» [Walmart, 2012]
  • «[…] page speed will be a ranking factor for mobile searches.» [Google]

→ It’s worth investing in system performance. How?

Garbi, Incerto, Tribastone 2

SLIDE 3

Motivation

  • Question: where to invest?
  • Performance estimation approaches:
  • Profiling: easy, but does not predict
  • Modeling: needs an expert and continuous updates, but gives predictions


SLIDE 4

Motivation: our vision

  • If we had a model, we could try all possible choices, forecast the outcomes, and choose the best option.

→ Automate model generation!


SLIDE 5

Our Main Contribution

  • Direct association between:
  • Model: Fluid Approximation of Closed Queuing Networks
  • Automation: Recurrent Neural Networks
  • Automatic generation of models from data


SLIDE 6

Model: Queuing Networks

  • A model representing contention for resources by clients
  • Clients request work from stations (resources)
  • Stations have a maximum concurrency level and a speed
  • Once served, clients request another resource according to a routing matrix

[Figure: three-station queuing network with stations ⟨µ1, s1⟩, ⟨µ2, s2⟩, ⟨µ3, s3⟩, queue lengths x1, x2, x3, and routing probabilities P1,2, P1,3, P2,1, P3,1]
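The fluid approximation mentioned on the contribution slide replaces the stochastic network with an ODE over the queue lengths. A minimal numerical sketch of the dynamics this figure depicts, with illustrative parameter values that are not from the talk:

```python
import numpy as np

# Three-station closed QN as in the figure; all numbers are assumptions.
mu = np.array([10.0, 4.0, 6.0])   # service rates (clients/time unit)
s = np.array([2.0, 1.0, 1.0])     # maximum concurrency per station
P = np.array([[0.0, 0.7, 0.3],    # routing matrix: P[i, j] = prob. i -> j
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

def fluid_step(x, dt):
    # Station i serves at rate mu_i * min(x_i, s_i); the served flow is
    # redistributed to the other stations via the routing matrix.
    served = mu * np.minimum(x, s)
    return x + dt * (P.T @ served - served)

x = np.array([20.0, 0.0, 0.0])    # all 20 clients start at station 1
for _ in range(50000):            # integrate to (near) steady state
    x = fluid_step(x, 1e-3)
```

Because every row of P sums to 1, `x.sum()` stays constant over time, which is exactly the closed-network invariant: clients move between stations but never leave.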


SLIDE 7

Model of a system

  • Resources → hardware
  • Routing matrix → program code
  • Clients → program instances

[Figure: the same three-station queuing network, with stations ⟨µi, si⟩, queue lengths xi, and routing probabilities Pi,j]


SLIDE 8

How our procedure works


Profiling → Learning → Model → Prediction → Changes

SLIDE 9

Recurrent Neural Networks

  • Recurrent neural networks (RNNs) work with sequences (e.g. time series)
  • We encode the model as an RNN with a custom structure.


[Figure: the RNN cell Cell_h, built from min and ∑ units; it takes throughputs t1, t2, …, tM and routed terms such as t1·P1,2 and t2·P2,1, and is unrolled over steps 1, 2, …, H−1]
SLIDE 10

Recurrent Neural Networks

  • The system parameters are directly encoded in the RNN cell

→ The learned model explains the system! (explainable neural network)

  • We can modify the learned model afterwards to make predictions!
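A minimal sketch of this encoding (my own reconstruction under stated assumptions, not the authors' exact cell): each cell performs one discretized step of the fluid equation, so the trainable weights are precisely the QN parameters (µ, s, P).

```python
import numpy as np

def rnn_cell(x, mu, s, P, dt):
    # One cell = one discretized step of the fluid equation; the min
    # units give the cell its queuing semantics.
    served = mu * np.minimum(x, s)
    return x + dt * (P.T @ served - served)

def unroll(x0, mu, s, P, dt, H):
    # Forward pass: H applications of the same cell (steps 1..H-1 in the
    # diagram). The weights (mu, s, P) are the QN parameters, so the
    # trained network is directly interpretable as a queuing model.
    xs = [x0]
    for _ in range(H):
        xs.append(rnn_cell(xs[-1], mu, s, P, dt))
    return np.stack(xs)

def loss(trace, mu, s, P, dt):
    # Fit predicted queue lengths to a measured trace.
    pred = unroll(trace[0], mu, s, P, dt, len(trace) - 1)
    return float(np.mean((pred - trace) ** 2))

# Toy check with made-up parameters: the loss vanishes at the true
# parameters and is positive for perturbed ones.
mu0 = np.array([5.0, 3.0]); s0 = np.array([2.0, 1.0])
P0 = np.array([[0.0, 1.0], [1.0, 0.0]])
trace = unroll(np.array([10.0, 0.0]), mu0, s0, P0, 0.01, 50)
```

In practice the loss would be minimized by gradient descent in an autodiff framework, with constraints keeping µ, s nonnegative and the rows of P stochastic.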


SLIDE 11

Synthetic case studies: setting

  • 10 random systems: five with M=5 stations, five with M=10 stations
  • Concurrency levels between 15 and 30
  • Service rates between 4 and 30 clients/time unit
  • 100 traces, each the average of 500 executions, with [0, 40M] clients
  • Learning time: 74 min for M=5 and 86 min for M=10
  • Error function: % of clients wrongly placed
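The slides do not spell out the formula for "% clients wrongly placed"; one plausible reading (an assumption on my part, not taken from the talk) is the total absolute queue-length difference, halved so that each misplaced client is counted once, as a percentage of the population:

```python
import numpy as np

def pct_misplaced(x_pred, x_real):
    # Assumed metric: a client missing from one station appears as an
    # extra client at another, so the absolute differences are halved
    # before normalizing by the total population N.
    n = x_real.sum()
    return 100.0 * np.abs(x_pred - x_real).sum() / (2.0 * n)
```

For example, `pct_misplaced(np.array([8.0, 2.0]), np.array([6.0, 4.0]))` gives 20.0: two of the ten clients sit at the wrong station.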


SLIDE 12

Synthetic case studies: prediction with different #clients

[Plot: prediction error (err) vs. number of clients N, from 100 to 800, for M=5 and M=10]

No significant difference across network sizes and numbers of clients. → Good predictive power under different conditions



SLIDE 13

Synthetic case studies: prediction with different concurrency levels

[Plot: prediction error (err) vs. number of clients N, from 50 to 250, for M=5 and M=10]

Concurrency was increased so as to resolve the bottleneck. → The learning outcome is resilient to changes in part of the network



SLIDE 14

Real case study: setting

  • node.js web application, replicated 3 times
  • Python script simulates N clients
  • Learning time: 27 min for N=26


[Figure: the deployed system (load balancer LB, replicas C1, C2, and W) side by side with the learned QN model, stations M1, M2, M3, M4]

SLIDE 15

Real case study: prediction with different #clients

[Plots: queue length over t(s), RNN-learned QN vs. real system, for stations M1–M4]
M3 is the bottleneck, and this affects the UX. We need to solve it…


[The plots are repeated for four population sizes]

N = 52: err = 6.46%; N = 78: err = 5.03%; N = 104: err = 6.45%; N = 130: err = 9.05%

SLIDE 16

Real case study: prediction with different structure

…by increasing the concurrency level of M3 (err: 5.98%)

[Plot: queue length over t(s), RNN-learned QN vs. real system, stations M1–M4]

…by changing the LB scheduling policy (err: 6.10%)

[Plot: queue length over t(s), RNN-learned QN vs. real system, stations M1–M4]


Bottleneck solved. Nice results also on a real HW+SW system.
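The what-if step above can be sketched on the learned fluid model: once (µ, s, P) are learned, a structural change such as raising a station's concurrency level is just an edit to the parameter arrays. A sketch with hypothetical numbers, not the values actually learned in the talk:

```python
import numpy as np

# Hypothetical learned parameters for a 4-station model in which
# M3 (index 2) is the bottleneck; all values are illustrative only.
mu = np.array([8.0, 8.0, 3.0, 8.0])
s = np.array([4.0, 4.0, 1.0, 4.0])
P = np.array([[0.0, 0.4, 0.4, 0.2],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])

def steady_state(x0, mu, s, P, dt=1e-2, steps=20000):
    # Integrate the learned fluid model forward until it settles.
    x = x0.copy()
    for _ in range(steps):
        served = mu * np.minimum(x, s)
        x = x + dt * (P.T @ served - served)
    return x

x0 = np.array([52.0, 0.0, 0.0, 0.0])
before = steady_state(x0, mu, s, P)
# "What if we triple M3's concurrency level?" is just an array edit:
after = steady_state(x0, mu, s * np.array([1.0, 1.0, 3.0, 1.0]), P)
```

Comparing `before[2]` and `after[2]` shows the backlog at M3 shrinking once its concurrency grows, which is the kind of prediction the slide validates against the real system.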

SLIDE 17

Limits

  • Many traces required to learn the system.
  • System must be observed at high frequency.
  • Layered systems currently not supported.
  • Resilient to limited changes, not extensive ones.


SLIDE 18

Related work

  • Performance models from code (e.g. PerfPlotter; not predictive)
  • Modelling black-box systems (e.g. Siegmund et al., tree-structured models)
  • Program-driven generation of models (e.g. Hrischuk et al., distributed components communicating via RPC)
  • Estimation of service demands in QNs through several techniques (we estimate both service demands and the routing matrix)


SLIDE 19

Conclusions

  • We provided a method to estimate QN parameters using an RNN that converges to feasible parameters.
  • With the estimated parameters, it is possible to predict the evolution of the system with a population different from the one used during learning, or after structural modifications.
  • Future work: apply the technique to more complex systems (e.g. LQN, multiclass), use other learning methodologies (e.g. neural ODEs), and improve the accuracy of the results.


SLIDE 20

Thank you!