Algorithms and Applications for the Estimation of Stream Statistics in Networks
Aviv Yehezkel
Ph.D. Research Proposal
Supervisor: Prof. Reuven Cohen
– Use a small fixed-size storage to store only the “most important” information about the stream elements; this summary of the data is the sketch
– Process the stream of data (packets) in one pass
– No need to store per-flow state for each flow
– Employ a probabilistic algorithm on the sketch to get an accurate estimation of the wanted quantity
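As a concrete illustration of a one-pass, fixed-memory sketch, here is a minimal K-Minimum-Values style cardinality sketch in Python. The class name, hash choice (SHA-1), and parameters are illustrative, not part of the proposal:

```python
import hashlib

class KMVSketch:
    """Keep only the k smallest hash values seen: a fixed-size summary."""
    def __init__(self, k=64):
        self.k = k
        self.mins = []  # sorted list of the k smallest hashes in [0, 1)

    def _hash(self, item):
        h = hashlib.sha1(str(item).encode()).hexdigest()
        return int(h, 16) / 16**40  # map the digest to [0, 1)

    def add(self, item):
        v = self._hash(item)
        if len(self.mins) < self.k:
            if v not in self.mins:
                self.mins.append(v)
                self.mins.sort()
        elif v < self.mins[-1] and v not in self.mins:
            self.mins[-1] = v   # replace the largest of the k minima
            self.mins.sort()

    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)
        return (self.k - 1) / self.mins[-1]  # classic KMV estimator

sketch = KMVSketch(k=256)
for i in range(10000):
    sketch.add(i % 5000)   # stream with 5000 distinct elements
print(round(sketch.estimate()))
```

The sketch uses 256 storage units regardless of the stream length, processes the stream in one pass, and keeps no per-flow state.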
– To determine in real-time the Application layer load imposed on its end server – To detect Application layer attacks
Element  Multiplicity
C        1
D        3
B        3
Z        1
– On average, n uniformly random points in [0, 1] are 1/(n+1) apart from each other
– Hence the expected maximum hash value is n/(n+1)
– Example: the stream C, D, B, Z yields a maximum hash value of 0.773; solving n/(n+1) = 0.773 gives n̂ = 0.773/(1 − 0.773) ≈ 3.4, close to the true cardinality 4
– Repeat with m independent hash functions to obtain h_1^+, h_2^+, …, h_m^+
– Since h_i^+ ∼ n/(n+1), each term satisfies E[1 − h_i^+] = 1/(n+1)
– Hence E[∑(1 − h_i^+)] = m · 1/(n+1) = m/(n+1)
– The estimator n̂ = m / ∑(1 − h_i^+) − 1 ∼ n
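The maximum-based estimator can be checked with a short simulation. Assuming the estimator n̂ = m/∑(1 − h_i^+) − 1, with fresh uniform draws standing in for m independent hash functions applied to n distinct elements:

```python
import random

def max_estimate(n, m, seed=0):
    """Estimate the cardinality n from the maxima of m independent hashes.

    E[max of n uniforms] = n/(n+1), so E[1 - h_i^+] = 1/(n+1) and
    n_hat = m / sum(1 - h_i^+) - 1."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        h_plus = max(rng.random() for _ in range(n))  # max hash over the stream
        total += 1.0 - h_plus
    return m / total - 1

print(round(max_estimate(n=1000, m=400)))
```

With m = 400 repetitions the relative error is roughly 1/√m ≈ 5%, so the printed estimate lands near the true cardinality 1000.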
– h_i^+ = max_j h_i(x_j)
– Use (h_1^+, h_2^+, …, h_m^+) to estimate n
Element  Weight
C        0.5
D        0.25
B        1
Z        1.25
– Each flow f_j imposes a load w_j on the server
– The weighted sum w = ∑ w_j represents the total load imposed on the server
– Let w_j be the weight of flow f_j
– Goal: estimate w = ∑ w_j using only m storage units, where m ≪ n
– Previous solutions transform each hash value so that h(x_j) ∼ Exp(w_j)
– Then h^+ ∼ Exp(∑ w_j) = Exp(w)
Limitations of previous solutions:
– Integer weights only
– Storage is not fixed
– ĥ(x_j) ∼ Beta(w_j, 1), where w_j is the element’s weight
– Transform the hash values so that ĥ(x_j) ∼ Beta(w_j, 1)
– h_i^+ = max_j h_i(ĥ(x_j))
– Use (h_1^+, h_2^+, …, h_m^+) to estimate the value of w
– Key property: h(x_j)^{1/w_j} ∼ Beta(w_j, 1)
– In the unweighted case, h^+ ∼ Beta(n, 1)
– In the weighted case, h^+ ∼ Beta(w = ∑ w_j, 1)
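This Beta property is easy to verify numerically: raising a uniform hash to the power 1/w_j yields Beta(w_j, 1), and the maximum over all elements is distributed as if ∑w_j unit-weight elements had been seen, so the unweighted max estimator returns the total weight. A sketch with illustrative weight values:

```python
import random

def weighted_max_estimate(weights, m, seed=1):
    """For each element j, h(x_j)^(1/w_j) ~ Beta(w_j, 1); the max over all
    elements is then Beta(sum(w_j), 1), exactly as if we had observed
    sum(w_j) unit-weight elements."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):  # m independent hash functions
        h_plus = max(rng.random() ** (1.0 / w) for w in weights)
        total += 1.0 - h_plus
    return m / total - 1  # same estimator as the unweighted case

weights = [0.5, 0.25, 1.0, 1.25] * 250   # total weight = 750
print(round(weighted_max_estimate(weights, m=400)))
```

The check behind the code: P(U^{1/w} ≤ y) = P(U ≤ y^w) = y^w, which is the Beta(w, 1) CDF, and the maximum of independent Beta(w_j, 1) variables has CDF ∏ y^{w_j} = y^{∑w_j}.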
– A first hash H1 maps each element to one of m buckets; bucket i stores h_i^+ = max{H2(x_j) | H1(x_j) = i}
– Each bucket receives b_i = n/m ± O(√(n/m)) distinct elements
– Use (h_1^+, h_2^+, …, h_m^+) to estimate n: estimating n/m in each bucket, as done by Algorithm 3, is equivalent to estimating the unweighted cardinality of each bucket’s substream
– ĥ(x_j) ∼ Beta(w_j, 1), where w_j is the element’s weight
– w_i is the sum of the weights of the elements in the i’th bucket
– Each bucket receives a total weight of w/m ± O(√((1/m) ∑ w_j²))
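The bucket-splitting idea can be sketched as follows: H1 (here the first two digest bytes of SHA-1, an arbitrary choice) selects a bucket, H2 supplies the value whose maximum each bucket retains, and the combined estimator m(m−1)/∑(1−h_i^+) recovers n:

```python
import hashlib

def bucketed_estimate(items, m):
    """Split the stream into m buckets with H1 and keep the max of H2 per
    bucket; each bucket then sees about n/m distinct elements."""
    h_plus = [0.0] * m
    for it in items:
        d = hashlib.sha1(str(it).encode()).digest()
        i = (d[0] * 256 + d[1]) % m                   # H1: bucket index
        v = int.from_bytes(d[2:10], 'big') / 2**64    # H2: value in [0, 1)
        h_plus[i] = max(h_plus[i], v)
    # E[1 - h_i^+] ~ m/n per bucket, so sum(1 - h_i^+) ~ m^2/n
    s = sum(1.0 - h for h in h_plus)
    return m * (m - 1) / s

print(round(bucketed_estimate(range(100000), m=1024)))
```

Only one hash computation per packet is needed, and the two “hash functions” are disjoint bit ranges of the same digest, which matches the one-pass, cheap-per-packet requirement stated earlier.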
Algorithm 4
– Transform each hash so that ĥ(x_j) = h(x_j)^{1/w_j} ∼ Beta(w_j, 1)
– h_i^+ = max{H2(ĥ(x_j)) | H1(x_j) = i}
– Use (h_1^+, h_2^+, …, h_m^+) to estimate w
– h_i^+ ∼ Beta(w_i, 1), where w_i is the total weight of the elements mapped to the i’th bucket
– Example (unweighted): normalized standard deviation √Var(n̂)/E[n̂], with m/n = 10⁻³
– Example (weighted): if ∑ w_j² / w² = 10⁻⁶, then the normalized standard deviation √Var(ŵ)/E[ŵ] ≈ √(∑ w_j² / w²) = 10⁻³
– The weights w_j are drawn from a random distribution
The unified scheme can deal with an unbounded number of weights as long as:
– E[w_j²] / E²[w_j] is a small constant
– Some estimators transform h(x_j) into another distribution
– The unified scheme transforms h(x_j) into a Beta distribution: ĥ(x_j) ∼ Beta(w_j, 1)
– Inverse-Transform Method
– Without transformation: in this case, G(y) = y
Beta(w_j, 1):
– CDF: H_max(y) = y^{w_j}
– CDF inverse: H_max⁻¹(v) = v^{1/w_j}
– By the Inverse-Transform Method: ĥ(x_j) = h(x_j)^{1/w_j} ∼ Beta(w_j, 1)
– ĥ(x_j) = h(x_j)^{1/w_j} ∼ Beta(w_j, 1)
– No transformation is needed: G⁻¹(v) = v
– Estimator: m(m−1) / ∑(1 − h_i^+)
– Estimator: m(m−1) / ∑(1 − h_i^+), where h_i^+ = max{h_i(ĥ(x_j))} = max{h_i(x_j)^{1/w_j}}
Exponential transformation:
– G⁻¹(v) = −ln(v) ∼ Exp(1)
– Estimator: m(m−1) / ∑ h_i^+, where h_i^+ = max{−ln(h_i(x_j))}
– Estimator: m(m−1) / ∑ h_i^+, where h_i^+ = max{−ln(h_i(ĥ(x_j)))} = max{−ln(h_i(x_j)^{1/w_j})}
– This generalization is identical to the algorithm presented by Cohen, 1995
Geometric transformation:
– G⁻¹(v) = ⌊−log2(v)⌋ ∼ Geom(1/2)
– Estimator: α_m m² / ∑ 2^{−h_i^+}, where h_i^+ = max{⌊−log2(H2(x_j))⌋}
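The geometric transformation yields the familiar HyperLogLog algorithm. A compact, unoptimized sketch (standard α_m approximation, no small- or large-range corrections; register indexing via SHA-1 digest bytes is an arbitrary choice):

```python
import hashlib
import math

def hyperloglog(items, m=1024):
    """Plain HyperLogLog: register i keeps the max rank of H2, where
    rank(v) = floor(-log2(v)) + 1, and n_hat = alpha_m * m^2 / sum 2^-M_i."""
    M = [0] * m
    for it in items:
        d = hashlib.sha1(str(it).encode()).digest()
        i = (d[0] * 256 + d[1]) % m                   # H1: register index
        v = int.from_bytes(d[2:10], 'big') / 2**64    # H2: value in [0, 1)
        rank = math.floor(-math.log2(v)) + 1 if v > 0 else 64
        M[i] = max(M[i], rank)
    alpha = 0.7213 / (1 + 1.079 / m)  # bias-correction constant for large m
    return alpha * m * m / sum(2.0 ** -r for r in M)

print(round(hyperloglog(range(100000))))
```

With m = 1024 registers the relative error is about 1.04/√m ≈ 3%, while each register only needs to store a small integer rank.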
– Estimator: α_m m² / ∑ 2^{−h_i^+}, where h_i^+ = max{⌊−log2(H2(x_j)^{1/w_j})⌋}
Summary:
– The unified scheme extends known cardinality estimators to the weighted problem in a generic way
– It manipulates the input using properties of the Beta distribution, so that estimating the weighted sum is equivalent to estimating the unweighted cardinality
– In particular, it yields a weighted HyperLogLog algorithm to solve the weighted problem with the best trade-off between statistical accuracy and memory storage, among all the other known algorithms for the weighted problem
– The overload prevents/slows the resource from responding to legitimate traffic
– Multiple attackers to defend against
– HTTP request attacks
– HTTPS/SSL request attacks
– DNS request attacks
– Typically, it is enough for the attacker to send only hundreds of resource intensive requests, instead of flooding the server with millions of TCP SYNs, as in a volumetric DDoS attack
– Often this weakest link is a tier-2 or tier-3 device – It will be the first to collapse in a targeted Application layer DDoS attack
– Many devices – Does not have flow awareness, cannot perform per-flow tasks – Dedicated to fast performance, its processing tasks must be simple and cheap – Lacks deep knowledge of the end applications, and is unable to keep track of the association between packets-flows-applications
– Cardinality estimation problem
– Possibly a DDoS attack – Alternative: monitor the entropy of selected attributes in the received packets and compare it to a pre-computed profile
– This is clearly not true in a realistic case where high-workload requests require significantly more server efforts than simple ones – We solve this problem by preclassifying the incoming flows and associating them with different weights according to their load
– Triggers the opening of more tier-2 and tier-3 devices – Triggers the invocation of special tier-1 packet-based filtering rules, which will reduce the load
– The scheme examines the destination port number in the packet’s header in order to know the load imposed on the server by the flow to which the packet belongs
– Each TCP or UDP flow is associated with one application layer instance
– Allows the client to send multiple HTTP requests over the same TCP connection (flow) – Cannot tell in advance which or how many requests will be sent over the same connection
– The weight associated with the light requests will take into account their resource consumption and the possibility that multiple light requests may share the same connection
– Instead of solving the cardinality estimation problem once per each class, the enhanced scheme solves the weighted cardinality estimation problem – The total load is estimated directly, without estimating the number of flows in each class
– Moreover, the enhanced scheme is agnostic to the distribution of the weights and does not need a priori information about the distribution of the weight classes
– If the number of weight classes satisfies C > m/2, then the variance of the basic scheme, w³/(m − 3C), exceeds the variance of the enhanced scheme, w³/(m − 3)
– Moreover, even if there are only a few classes, and the statistical inefficiency can be tolerated, the basic scheme needs a priori information on the distribution of the weights, while the enhanced scheme does not
– The condition C > m/2 is usually satisfied, because m is usually very small
– The weighted algorithm is useful for performing management tasks
– Not useful for detecting an extreme and sudden increase in the load imposed on the server due to an Application layer attack.
– n(t) = number of active flows sampled at time t over the last T units of time – w(t) = weighted sum of these flows
– Actual determines the real load imposed on the web server during every considered time interval by computing the server’s average response time – Actual is expected to outperform our scheme – Of course, such a scheme cannot be employed by a stateless intermediate device
– Uses HyperLogLog to estimate the number of distinct flows during each time period.
– Because we do not have access to the server, but only to its log files, we assign weights according to the average size of the response file sent by the server to each request
We can see a strong correlation between the load estimated by our scheme and Actual.
– To quantify this, we compute the correlation coefficient between Actual and our scheme.
– the closer it is to either (−1) or to 1, the stronger the correlation between the variables; – the closer it is to 0, the weaker the correlation.
– In the first trace we find that the correlation coefficient is 0.85, which indicates a very strong correlation between Actual and our scheme. – In the second trace, the correlation coefficient is 0.92, indicating even stronger correlation
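For reference, the coefficient used here is the standard Pearson correlation. A small self-contained computation on hypothetical load series (the numbers below are illustrative, not measured values):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

actual   = [12.0, 15.5, 11.2, 30.8, 14.9, 16.3]  # hypothetical Actual load
estimate = [11.4, 16.2, 12.0, 28.9, 15.5, 15.8]  # hypothetical scheme output
print(round(pearson(actual, estimate), 2))
```

A value close to 1 means the two series rise and fall together, which is the behavior claimed for our scheme versus Actual.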
– The correlation between Number-of-Flows and Actual is very weak – In the first trace, the correlation coefficient is only 0.38 – In the second trace, the correlation coefficient is 0.23
– In the first trace, the peak after 22 minutes is not identified by the Number-of-Flows scheme – Moreover, the Number-of-Flows scheme identifies false heavy loads, for example after 1 minute
a) attack-1 is represented by 30 downloads of a 1-minute video stream starting at 10:00; b) attack-2 is represented by 40 downloads of a 1-minute video stream starting at 20:00; c) attack-3 is represented by 50 downloads of a 1-minute video stream starting at 06:00.
One can easily see that the Normalized Load scheme does not detect any of the attacks.
The three other schemes successfully detect the three attacks.
– X_attack is the minimal value computed by the scheme during an attack
– X_false is the maximal value computed by the scheme during normal times
– X_attack = 23.85 (for the attack at 10:00)
– The maximal value at a normal time is X_false = 20.21 (at 18:00)
– Hence μ = (23.85 − 20.21) / 20.21 ≈ 0.18
– As μ grows, the scheme more accurately distinguishes between attack and normal traffic
– The Normalized Load Variance scheme attains the largest μ
– In particular, the two other schemes detect a false attack at 18:00
In this scenario we change attack-2 to consist of 100 downloads of a 1-minute video stream starting at 20:00. The main difference between the two figures is that this time the Normalized Load scheme successfully detects attack-2 (at 20:00). However, this scheme still does not detect the two other attacks, at 10:00 and at 06:00. Among the other schemes, the Normalized Load Variance scheme still has the best performance, because it again attains the largest value of μ.
– The sliding-window extension maintains (U/Δ) · m buckets
– For instance, for U = 10, Δ = 1, if a packet is received at t = 15, its sketch is compared to the maximum obtained for the intervals [6, 15], [7, 16], . . . , [15, 24]
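The interval bookkeeping from this example can be written directly; assuming windows of length U stepped by Δ that contain the arrival time t (matching the Δ = 1 case shown):

```python
def covering_intervals(t, U=10, delta=1):
    """All sliding windows of length U (stepped by delta) containing time t."""
    return [(s, s + U - 1) for s in range(t - U + 1, t + 1, delta)]

print(covering_intervals(15))  # first window (6, 15), last window (15, 24)
```

For U = 10 and Δ = 1 this reproduces the ten intervals [6, 15] through [15, 24] of the example above.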
– t_k is the arrival time of the packet
– R_k is the hashed value of the packet’s flow ID
– Extract the relevant packets from the LFPM, and compute the highest hashed value among them
– The rest of the estimation is as specified in the weighted HyperLogLog algorithm
The update procedure of the LFPM list does not affect the computation of the maximal hash values. It is simply an efficient method for computing its exact value at any time, by storing only a short list of packets. Therefore, the extended algorithm has the same bias and variance as the weighted HyperLogLog
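The LFPM’s short list behaves like the classic monotonic structure for a sliding-window maximum: a packet dominated by a later, larger hash can be discarded immediately without ever affecting a future maximum. A generic sketch of that idea (not the thesis’s exact data structure):

```python
from collections import deque

class WindowMax:
    """Exact maximum over the last T time units, storing only a
    monotonically decreasing list of (time, value) pairs."""
    def __init__(self, T):
        self.T = T
        self.q = deque()  # values strictly decreasing from front to back

    def add(self, t, value):
        while self.q and self.q[-1][1] <= value:
            self.q.pop()   # dominated entries can never be the max again
        self.q.append((t, value))

    def max(self, now):
        while self.q and self.q[0][0] <= now - self.T:
            self.q.popleft()  # expire entries outside the window
        return self.q[0][1] if self.q else None

wm = WindowMax(T=5)
for t, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6]):
    wm.add(t, v)
print(wm.max(now=7))  # window covers times 3..7, so the maximum is 9
```

Each entry is inserted and removed at most once, so the amortized cost per packet is O(1), and the stored list is exactly the set of packets that could still become the maximum, mirroring the LFPM argument above.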
– Each bucket requires only O(log log (n/m)) bits
– We showed how a stateless intermediate device can estimate the total load imposed on the Application layer of a server
– We compare the performance of the enhanced scheme and the basic scheme and show that the enhanced scheme provides a much better variance – However, they do not detect an extreme and sudden increase in the load due to an attack
– The first scheme estimates the variance of the weighted sum of the flows – The second estimates the normalized variance of the weighted sum of the flows
– Networks generate huge volumes of monitored data, making it impractical to collect and analyze the entire stream
– For example, IP packets over a high-speed link; 100 Gbps link creates a 1 TB log file in < 1.5 minutes – Must sample and process only a small part of the stream – sFlow
– The sample is then used to estimate the characteristics of the full stream
The sampled stream is D, B, D, D. What is the cardinality of the full stream?
– Is it 2 (equal to the sample cardinality)?
– Is it 2 · 2 = 4 (inverse to the sampling rate)?
– Something else?
– X = the full (unsampled) stream
– Y = the sampled stream
– n = cardinality of the full stream
– n_s = cardinality of the sampled stream
The estimation consists of two steps:
a) Cardinality estimation of the sampled stream, using any known cardinality estimator
b) Estimation of the sampling ratio n/n_s
Good-Turing frequency estimation: estimates the probability that the next drawn word will be one that does not appear in a document.
– P_j = the probability that an element of X appears in the sample j times
– |F_j| = the number of elements that appear exactly j times in the sample
– P_0 = the probability that an element of X does not appear in the sample at all (an unseen element)
– The Good-Turing estimate of P_0 is |F_1| / l, where |F_1| is the number of elements that appear exactly once in the sample Y.
– P_0 = (n − n_s) / n, and therefore 1 / (1 − P_0) = n / n_s
– Using Good-Turing to estimate P_0, we need to find the number of elements that appear exactly once in the sampled stream
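The whole pipeline (estimate n_s on the sample, estimate P_0 with Good-Turing, scale by 1/(1 − P_0)) fits in a few lines; the stream and sampling rate below are synthetic:

```python
import random

def estimate_cardinality(sample):
    """n_hat = n_s / (1 - P0), with the Good-Turing estimate P0 ~= |F1| / l."""
    l = len(sample)
    counts = {}
    for x in sample:
        counts[x] = counts.get(x, 0) + 1
    n_s = len(counts)                               # distinct elements seen
    f1 = sum(1 for c in counts.values() if c == 1)  # seen exactly once
    p0 = f1 / l                                     # Good-Turing estimate of P0
    return n_s / (1.0 - p0)

rng = random.Random(42)
full = [rng.randrange(20000) for _ in range(200000)]  # ~20000 distinct elements
sample = [x for x in full if rng.random() < 0.05]     # Bernoulli sampling, p = 0.05
print(round(estimate_cardinality(sample)))
```

Even though only about 40% of the distinct elements survive the 5% sampling, the Good-Turing correction recovers an estimate close to the true cardinality of the full stream.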
To compute the value of |F_1| precisely, one should keep track of all the elements in Y and ignore each previously encountered element. This requires O(l) storage units.
– O(l) storage is linear in the sample size
– We hope to reduce this cost by estimating the value of |F_1| / l, instead of computing it precisely
– Well known problem, arises in network monitoring, database systems, data integration and information retrieval – Many application scenarios are in the context of Network Functions Virtualization (NFV)
– Does not scale if storage is limited
1. Using the inclusion-exclusion principle 2. Using Jaccard similarity (1) 3. Using Jaccard similarity (2)
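The first direction can be sketched immediately: with cardinality estimates for A, B, and A ∪ B (the union sketch is obtained by merging the two sketches), inclusion-exclusion gives the intersection. The sets below are illustrative, and exact cardinalities stand in for estimator outputs:

```python
def intersection_via_inclusion_exclusion(card_a, card_b, card_union):
    """|A ∩ B| = |A| + |B| - |A ∪ B|; works with any cardinality estimator,
    but the estimation error is amplified when the intersection is small."""
    return card_a + card_b - card_union

# With exact cardinalities the identity is exact:
A = set(range(0, 6000))
B = set(range(4000, 10000))
print(intersection_via_inclusion_exclusion(len(A), len(B), len(A | B)))  # 2000
```

When the three inputs come from sketches, the relative error of the union estimate is measured against |A ∪ B|, so a small intersection inherits a much larger relative error, which is one motivation for comparing this estimator against the Jaccard-based alternatives.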
– Efficiency – Statistical performance (bias and variance) – Compare between them
– Cramér-Rao bound – The variance of any unbiased estimator is at least as high as the inverse of the Fisher information
– Maximum Likelihood (ML) method – We hope to prove analytically, and/or using simulations, that our new ML estimator outperforms all previously known schemes
– Uses O(l) storage units, linear in the sample size.
– O(l + u · log l) = O(l), similar to the proposed algorithm
– Algorithm 2: m + u storage units; Algorithm 1: m + l storage units, where u ≪ l