Scalable Machine Learning
- 3. Data Streams
Alex Smola Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 Stat 260 SP 12
Outline: building realtime analytics at home; data streams; data & applications; moments
Data & Applications
- Observe instances (x_t, t): stock symbols, acceleration data, video, server logs, surveillance
- Observe instances x_i (weighted), always positive increments: query stream, user activity, network traffic, revenue, clicks
- Increments and decrements (possibly requiring nonnegativity): caching, windowed statistics
- Limited memory footprint
- Information gathering
- Prediction (news, quarterly reports, financial background)
Sum of a stream, with deletions (turnstile updates):

$$s := \sum_{i=1}^{N} x_i, \qquad s \leftarrow s - x_i \text{ on deletion}$$

Higher moments work the same way:

$$s_p := \sum_{i=1}^{N} x_i^p, \qquad s_p \leftarrow s_p - x_i^p \text{ on deletion}$$
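Each moment is a single scalar supporting both increments and decrements; a minimal sketch (class and method names are my own, for illustration only):

```python
class StreamMoments:
    """Running power sums s_p = sum_i x_i^p over a stream."""
    def __init__(self, powers=(1, 2)):
        self.s = {p: 0.0 for p in powers}

    def insert(self, x):
        for p in self.s:
            self.s[p] += x ** p

    def delete(self, x):
        # turnstile update s_p <- s_p - x^p undoes an earlier insert exactly
        for p in self.s:
            self.s[p] -= x ** p

m = StreamMoments()
for x in [1.0, 2.0, 3.0]:
    m.insert(x)
m.delete(2.0)
print(m.s)  # {1: 4.0, 2: 10.0}: sum and sum of squares of {1.0, 3.0}
```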
Frequency moments over the item counts $n_x$:

$$F_p := \sum_{x \in X} n_x^p$$
Probabilistic counting (Flajolet-Martin): hash items so that

$$\Pr(h(x) = j) = 2^{-j}, \qquad F(j) = (1 - 2^{-j})^n, \qquad \max_{x \in X} h(x) \approx \log |X|$$

Concentration:

$$\Pr\left\{ \left| \max_{x \in X} h(x) - \log |X| \right| > \log c \right\} \leq \frac{2}{c}$$

The overshoot follows from a union bound, $|X| \cdot 2^{-j} \leq \frac{1}{c}$ whenever $2^j \geq c|X|$; the undershoot from $(1 - 2^{-j})^{|X|} \leq \exp(-|X| \cdot 2^{-j}) \leq e^{-c}$ for $2^j \leq \frac{|X|}{c}$.
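A sketch of the idea in code. This is a simplified Flajolet-Martin style estimator, not the exact 1985 algorithm: the salted CRC hash and plain averaging over hash functions are my own assumptions for illustration.

```python
import zlib

def fm_distinct(stream, num_hashes=64):
    """Distinct-count estimate: for each salted hash, Pr(h(x) = j) = 2^-j
    where j is the 1-based position of the lowest set bit; track max j."""
    max_j = [0] * num_hashes
    for x in stream:
        for i in range(num_hashes):
            h = zlib.crc32(f"{i}:{x}".encode()) or 1  # avoid h = 0
            j = (h & -h).bit_length()                 # lowest set bit, Pr(j) = 2^-j
            if j > max_j[i]:
                max_j[i] = j
    # 2^(average max) estimates |X| up to a constant factor (~1/0.77)
    return 2 ** (sum(max_j) / num_hashes)

est = fm_distinct(str(i) for i in range(1000))
```

Averaging the exponents over many hash functions (rather than the estimates themselves) keeps the variance of the final estimate manageable.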
AMS sketch for $F_2$: draw random signs $\sigma(x,i,j) \in \{\pm 1\}$ and accumulate

$$X_{ij} := \left[ \sum_{x \in \text{stream}} \sigma(x,i,j) \right]^2, \qquad \bar{X}_i := \frac{1}{a} \sum_{j=1}^{a} X_{ij}, \qquad \bar{X} := \operatorname{med}\left[ \bar{X}_1, \ldots, \bar{X}_b \right]$$

Then $\mathbb{E}[X_{ij}] = F_2$ and, with $\mu = F_2$,

$$\Pr\left\{ |\bar{X} - \mu| \geq \epsilon \mu \right\} \leq \delta \quad \text{for } a = 16\epsilon^{-2} \text{ and } b = \tfrac{8}{3} \log \tfrac{1}{\delta}$$

Proof sketch:
- Apply the Chebyshev bound with $\operatorname{Var}[X_{ij}] \leq 2F_2^2$ (computed below) to see that $\Pr\{|\bar{X}_i - \mu| > \epsilon\mu\} \leq \frac{2}{a\epsilon^2} \leq \frac{1}{8}$ for $a = 16\epsilon^{-2}$.
- The median fails only if at least $b/2$ of the $\bar{X}_i$ fail. Plug into the Chernoff bound $\Pr\{x \geq (1+\delta')\mu'\} \leq e^{-\mu'\delta'^2/3}$ with $\delta' = 3$ and $\mu' = b/8$: the failure probability is at most $\exp\left(-\tfrac{3b}{8}\right)$, which is $\leq \delta$ for $b \geq \tfrac{8}{3} \log \tfrac{1}{\delta}$.
$$\mathbb{E}[X_{ij}] = \mathbb{E}\left[ \sum_{x \in \text{stream}} \sigma(x,i,j) \right]^2 = \mathbb{E}\left[ \sum_{x \in X} \sigma(x,i,j)\, n_x \right]^2 = \sum_{x \in X} n_x^2 = F_2$$

$$\mathbb{E}\left[ X_{ij}^2 \right] = \mathbb{E}\left[ \sum_{x \in \text{stream}} \sigma(x,i,j) \right]^4 = 3 \sum_{x,x' \in X} n_x^2 n_{x'}^2 - 2 \sum_{x \in X} n_x^4$$

$$\operatorname{Var}[X_{ij}] = \mathbb{E}\left[ X_{ij}^2 \right] - \left[ \mathbb{E}[X_{ij}] \right]^2 = 2 \sum_{x,x' \in X} n_x^2 n_{x'}^2 - 2 \sum_{x \in X} n_x^4 \leq 2F_2^2$$
Memory: $O(\epsilon^{-2} \log(1/\delta) \log(|X| \cdot n))$ bits.
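A compact implementation sketch of the median-of-means AMS estimator. The salted CRC sign hash is an assumption for illustration (AMS only needs 4-wise independent signs):

```python
import zlib

def ams_f2(stream, a=32, b=5):
    """AMS F2 estimator: median over b groups of the mean of a
    independent sign-sum squares."""
    sums = [[0] * a for _ in range(b)]
    for x in stream:
        for i in range(b):
            for j in range(a):
                sign = 1 if zlib.crc32(f"{i}:{j}:{x}".encode()) & 1 else -1
                sums[i][j] += sign
    means = sorted(sum(s * s for s in row) / a for row in sums)
    return means[b // 2]  # median of means

# sanity check: a single item repeated 7 times gives sum = +-7 in every
# cell, so the estimate is exactly 7^2 = F2 regardless of the signs
print(ams_f2(["a"] * 7))  # 49.0
```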
Estimating $F_k$ by sampling: pick a uniformly random position in a stream of length $m$, and let $r_{ij}$ be the number of occurrences of the sampled item from that position to the end. Set

$$X_{ij} = m\left[ r_{ij}^k - (r_{ij} - 1)^k \right]$$

The expectation telescopes over the occurrences of each item:

$$\mathbb{E}[X_{ij}] = \left[ 1^k + (2^k - 1^k) + \ldots + (n_1^k - (n_1 - 1)^k) + \ldots + \left(n_{|X|}^k - (n_{|X|} - 1)^k\right) \right] = \sum_{x \in X} n_x^k = F_k$$

$$\operatorname{Var}[X_{ij}] \leq \mathbb{E}\left[X_{ij}^2\right] \leq k|X|^{1-1/k} F_k^2$$

Memory: $O\left(k|X|^{1-1/k} \epsilon^{-2} \log(1/\delta) (\log m + \log |X|)\right)$
no better than brute force for large k
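One such sample can be drawn in a single pass with reservoir sampling; a sketch (function names and the averaging at the end are my own):

```python
import random

def ams_fk_sample(stream, k, rng):
    """One AMS sample for F_k: reservoir-pick a uniform stream position,
    count occurrences r of that item from the sample point onward, and
    return m * (r^k - (r-1)^k), which is unbiased for F_k."""
    sample, r, m = None, 0, 0
    for t, x in enumerate(stream, start=1):
        m = t
        if rng.random() * t < 1.0:      # keep position t with probability 1/t
            sample, r = x, 1
        elif x == sample:
            r += 1
    return m * (r ** k - (r - 1) ** k)

rng = random.Random(0)
stream = ["a"] * 3 + ["b"] * 2          # F_3 = 3^3 + 2^3 = 35
ests = [ams_fk_sample(stream, 3, rng) for _ in range(4000)]
mean = sum(ests) / len(ests)            # close to 35
```

Averaging many independent samples (and taking medians of group means, as for $F_2$) reduces the large variance of a single sample.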
Counting via the minimum hash value: maintain the current minimum $g$ and a counter $c$:
- if $h(x) = g$: increment $c \leftarrow c + 1$
- if $h(x) < g$: set $c = 1$ and $g = h(x)$

(ignoring collisions) $c$ tracks the multiplicity of a uniformly random distinct item, and the minimum hash value itself estimates $1/|X|$ (automatically).
(see papers by Li, Hastie, Church; Broder's shingles)
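The two update rules above in code, under the assumption of a CRC-based hash mapped into $[0,1)$:

```python
import zlib

def min_hash_count(stream):
    """Track the minimum hash value g over the stream and a counter c
    for how often it occurs; c is the multiplicity of one uniformly
    random distinct item (ignoring collisions), and g estimates 1/|X|
    since it is the minimum of |X| roughly uniform draws."""
    g, c = None, 0
    for x in stream:
        h = zlib.crc32(str(x).encode()) / 2 ** 32   # hash into [0, 1)
        if g is None or h < g:
            g, c = h, 1                             # new minimum
        elif h == g:
            c += 1                                  # same minimum again
    return g, c

g, c = min_hash_count(["a", "b", "a", "c", "a"])
```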
Space-Saving: keep $k$ counters, initialized $(\text{count}_i = 0, \text{label}_i = \emptyset)$:
- if $x = \text{label}_i$ for some $i$: $\text{count}_i \leftarrow \text{count}_i + 1$
- otherwise overwrite the minimum counter: $\text{count}_{\min} \leftarrow \text{count}_{\min} + 1$, $\text{label}_{\min} = x$

Example (inserting e, then b, then f):
(a,4) (b,4) (c,2) (d,2) → (a,4) (b,4) (e,3) (c,2) → (b,5) (a,4) (e,3) (c,2) → (b,5) (a,4) (e,3) (f,3)

A bidirectional map keeps both lookups (by label and by count) fast:
http://www.boost.org/doc/libs/1_48_0/boost/bimap/bimap.hpp

Guarantees (from Metwally, Agrawal, El Abbadi 2005): any item overestimates its frequency by at most $O(n/k)$, and

$$n_x \leq \text{count}_x \leq n_x + \frac{n}{k} \qquad \text{and} \qquad n_x \leq \text{count}_x \leq n_x + \frac{F_1^{(k)}}{n - k} \quad \text{where } F_1^{(k)} = \sum_{i > k} n_i$$
Proof idea: an item with true rank at most $i$ keeps a counter at or below position $i$; monotonicity proves the claim.
Variant: instead of overwriting, decrement all counters by 1 when no counter is free; the accuracy threshold is again $n/k$.
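The Space-Saving update rule is a few lines of Python (a sketch; a real implementation would use a heap or the bimap above instead of a linear scan for the minimum):

```python
def space_saving(stream, k):
    """Space-Saving (Metwally, Agrawal, El Abbadi 2005): keep at most k
    (label, count) pairs. A new item with no counter evicts the minimum
    counter and inherits its count plus one, which is exactly why
    n_x <= count_x <= n_x + n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1  # inherit + 1
    return counters

c = space_saving("aababcabdabe", k=4)
print(c)  # heavy items a, b keep exact counts; rare items are overestimated
```

Note that the counts always sum to the stream length, since every event adds exactly one to some counter.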
Bloom filter with $n$ bits, $m$ items, $k$ hash functions:

$$\Pr\{b[i] = 1\} = 1 - \left(1 - \frac{1}{n}\right)^{mk} \approx 1 - e^{-\frac{mk}{n}}$$

False positive probability:

$$\Pr\{b[h(x,1)] = \ldots = b[h(x,k)] = 1\} \approx \left(1 - e^{-\frac{mk}{n}}\right)^k$$

Minimizing over $k$:

$$\partial_k \left[ k \log\left(1 - e^{-mk/n}\right) \right] = \log\left(1 - e^{-mk/n}\right) + \frac{mk}{n} \cdot \frac{e^{-mk/n}}{1 - e^{-mk/n}} = 0 \quad \text{at } \frac{mk}{n} = \log 2, \text{ hence } k = \frac{n}{m} \log 2$$
Estimating intersections: a bit is set in both filters iff it was set by $S_1$ and by $S_2$, so

$$\Pr\{b = 1\} = \Pr\{b = 1|S_1\} + \Pr\{b = 1|S_2\} - \Pr\{b = 1|S_1 \cup S_2\} \approx 1 - e^{-\frac{k|S_1|}{m}} - e^{-\frac{k|S_2|}{m}} + e^{-\frac{k|S_1 \cup S_2|}{m}}$$

(here $m$ bits, $k$ hashes). Deletion is the problem: we don't know whether a bit was set before.
Counting Bloom filter (supports deletion):
- insert: for all $i$, $b[h(x,i)] \leftarrow b[h(x,i)] + 1$
- delete: for all $i$, $b[h(x,i)] \leftarrow b[h(x,i)] - 1$

Approximate counters need only $\log \log m$ bits.
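A plain Bloom filter is short enough to sketch in full (salted CRC hashes are an assumption for illustration):

```python
import zlib

class BloomFilter:
    """Bloom filter: n bits, k salted hashes per item. With m items
    stored, the false-positive rate is about (1 - exp(-m*k/n))^k,
    minimized at k = (n/m) * ln 2."""
    def __init__(self, n_bits, k):
        self.n, self.k, self.bits = n_bits, k, 0

    def _positions(self, x):
        return [zlib.crc32(f"{i}:{x}".encode()) % self.n
                for i in range(self.k)]

    def add(self, x):
        for p in self._positions(x):
            self.bits |= 1 << p

    def __contains__(self, x):
        # never a false negative; false positives at the rate above
        return all(self.bits >> p & 1 for p in self._positions(x))

bf = BloomFilter(n_bits=1024, k=7)
for w in ["alpha", "beta", "gamma"]:
    bf.add(w)
```

Replacing the bit array with an array of small counters (incremented on insert, decremented on delete) gives the counting variant above.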
Count-Min sketch: $d$ hash functions $h_1(x), \ldots, h_d(x)$, each into $m$ bins; supports turnstile updates. Query: $c_x = \min_i w[i, h_i(x)]$.

Guarantee:

$$n_x \leq c_x \leq n_x + \epsilon \sum_{x'} n_{x'} \quad \text{for } m = \left\lceil \frac{e}{\epsilon} \right\rceil \text{ with probability } 1 - e^{-d}$$

Proof: $\mathbb{E}[w[i, h_i(x)] - n_x] = \frac{n}{m}$, hence by Markov's inequality $\Pr\left\{ w[i, h_i(x)] - n_x > e \cdot \frac{n}{m} \right\} \leq \frac{1}{e}$, and taking the minimum over $d$ independent rows, $\Pr\left\{ c_x - n_x > e \cdot \frac{n}{m} \right\} \leq e^{-d}$.
accuracy penalty only on nodes
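The whole data structure is a $d \times m$ array; a sketch (again with salted CRC hashes standing in for pairwise-independent hash functions):

```python
import zlib

class CountMin:
    """Count-Min sketch: d rows of m counters; a point query takes the
    row-wise minimum, so query(x) >= n_x always, with overestimate at
    most eps * n w.h.p. for m = ceil(e / eps)."""
    def __init__(self, d=4, m=64):
        self.d, self.m = d, m
        self.w = [[0] * m for _ in range(d)]

    def _cols(self, x):
        return [zlib.crc32(f"{i}:{x}".encode()) % self.m
                for i in range(self.d)]

    def update(self, x, delta=1):           # turnstile: delta can be < 0
        for i, j in enumerate(self._cols(x)):
            self.w[i][j] += delta

    def query(self, x):
        return min(self.w[i][j] for i, j in enumerate(self._cols(x)))

cm = CountMin()
for x in ["a"] * 5 + ["b"] * 2:
    cm.update(x)
```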
For Zipf-distributed data,

$$\Pr\{x\} = \frac{c}{(a + x)^z}$$

the residual mass beyond rank $k$ is bracketed by

$$\frac{c z\, k^{1-z}}{z-1} \leq \sum_{i=k}^{U} f_i \leq \frac{c z\, (k-1)^{1-z}}{z-1}$$

Conditioning on no collision with the heavy hitters (take $k = m/3$):

$$\mathbb{E}[c_x \mid \text{no heavy}] = n_x + \frac{3}{2m} \sum_{i=k+1,\, i \neq x} n_i$$

and bounding the residual sum by the Zipf tail above gives $\Pr\{c_x > n_x + \epsilon n\}$ small using only

$$O\left( \epsilon^{-\min\{1, 1/z\}} \log \frac{1}{\delta} \right)$$

bins, i.e. $O(\epsilon^{-1/z})$ space for skew $z > 1$, better than the generic $O(\epsilon^{-1})$.
d hash functions $h_1(x), \ldots, h_4(x)$, $m$ bins
- a priori lower bound
- if we know all inserts we can get a new lower bound
$$w[i,j] = \sum_{h(i,x) = j} n_x \leq \sum_{h(i,x) = j} c_x \quad \text{hence} \quad n_x \geq l_x := w[i,j] - \sum_{h(i,x') = j,\, x' \neq x} c_{x'}$$

$$w[i,j] \geq \sum_{h(i,x) = j} l_x \quad \text{hence} \quad n_x \leq u_x := w[i,j] - \sum_{h(i,x') = j,\, x' \neq x} l_{x'}$$

Iterating these two bounds tightens the interval $[l_x, u_x]$ around the true count $n_x$.
many bins (almost empty) per counter
Distributing the sketch: assign item $x$ to machine

$$m(x) := \operatorname*{argmin}_{m \in M} h(m, x)$$
d hash functions $h_1(x), \ldots, h_4(x)$, $n$ bins, machines 1-4
(if a machine fails we can use the others)
(use set hashing on $C(x)$ with the client ID)
For $k$-fold replication, pick the $k$ machines with the smallest hashes:

$$C(x) := \operatorname*{argmin}_{C \subseteq M,\, |C| = k} \sum_{m \in C} h(m, x)$$
Decaying counts over time: keep buckets of sizes $1, 1, 2, 4, 8, 16, 32, 64, \ldots$, merging older buckets so that recent events stay at full resolution while old events are aggregated.
- Decreasing temporal resolution, e.g. a single count $n(x, \text{last year})$
- Decreasing accuracy at fine time resolution
- Factorize items and time: $p(i,t) \approx p(i)\, p(t) \Rightarrow n(i,t) \approx \frac{n(i)\, n(t)}{n}$
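The exponential bucket sizes above can be maintained with a simple merge rule; this is a simplified sketch in the spirit of the Datar-Gionis-Indyk-Motwani exponential histograms (the `max_same` threshold and list representation are my own assumptions):

```python
def add_event(buckets, max_same=2):
    """Bucket sizes grow as 1, 1, 2, 4, 8, ... from newest to oldest.
    Whenever more than max_same buckets share a size, the two oldest of
    that size merge, so recent events keep full resolution and older
    ones coarsen automatically."""
    buckets.insert(0, 1)                    # newest event: a size-1 bucket
    size = 1
    while buckets.count(size) > max_same:
        oldest = len(buckets) - 1 - buckets[::-1].index(size)
        del buckets[oldest]                 # merge the two oldest...
        oldest = len(buckets) - 1 - buckets[::-1].index(size)
        buckets[oldest] = 2 * size          # ...into one of double size
        size *= 2                           # cascade to the next size
    return buckets

b = []
for _ in range(16):
    add_event(b)
```

The invariants are easy to check: bucket sizes stay nondecreasing with age, at most `max_same` buckets share any size, and the sizes always sum to the total event count.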
References
http://www.cs.rutgers.edu/~muthu/stream-1-1.ps
http://www.sciencedirect.com/science/article/pii/S0022000097915452
https://sites.google.com/site/countminsketch/
http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
http://www.cs.ucsb.edu/research/tech_reports/reports/2005-23.pdf
http://www.research.att.com/people/Cormode_Graham/library/publications/BerindeCormodeIndykStrauss10.pdf
http://dimacs.rutgers.edu/~graham/pubs/papers/sk.pdf
http://algo.inria.fr/flajolet/Publications/FlMa85.pdf