Scalable Machine Learning
12. Tail Bounds and Averages
Geoff Gordon and Alex Smola, CMU
http://alex.smola.org/teaching/10-701x
Estimating Probabilities

Binomial Distribution
Two outcomes (head, tail), encoded as {0, 1}. Data likelihood:
p(X; \pi) = \pi^{n_1} (1 - \pi)^{n_0}, \quad \pi \in [0, 1]

Maximum likelihood in the exponential-family parametrization:

\theta = \log \frac{n_1}{n_0} \Longleftrightarrow p(x = 1) = \frac{n_1}{n_0 + n_1}

p(x; \theta) = \frac{e^{x\theta}}{1 + e^{\theta}}
Maximum likelihood estimation:

p(X; \theta) = \prod_{i=1}^n p(x_i; \theta) = \prod_{i=1}^n \frac{e^{\theta x_i}}{1 + e^{\theta}}

\Rightarrow \log p(X; \theta) = \theta \sum_{i=1}^n x_i - n \log\left[1 + e^{\theta}\right]

\Rightarrow \partial_\theta \log p(X; \theta) = \sum_{i=1}^n x_i - n \frac{e^{\theta}}{1 + e^{\theta}}

Setting the derivative to zero:

\frac{1}{n} \sum_{i=1}^n x_i = \frac{e^{\theta}}{1 + e^{\theta}} = p(x = 1)
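A minimal numerical sketch of this estimator (the function name and sample data are illustrative, not from the slides):

```python
import numpy as np

def binomial_mle(x):
    """MLE for coin flips x in {0, 1}: the stationarity condition above
    gives p(x = 1) = e^theta / (1 + e^theta) = mean(x)."""
    x = np.asarray(x)
    n1 = x.sum()                 # number of ones (heads)
    n0 = len(x) - n1             # number of zeros (tails)
    pi_hat = n1 / (n0 + n1)      # p(x = 1), the sample mean
    theta_hat = np.log(n1 / n0)  # natural parameter (assumes n0, n1 > 0)
    return pi_hat, theta_hat

# 7 heads out of 10 flips: pi_hat = 0.7, theta_hat = log(7/3)
print(binomial_mle([1, 1, 1, 1, 1, 1, 1, 0, 0, 0]))
```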
The same works for the discrete (multinomial) distribution:

p(x; \theta) = \frac{\exp \theta_x}{\sum_{x'} \exp \theta_{x'}}

p(X; \pi) = \prod_i \pi_i^{n_i}

\theta_i = \log \frac{n_i}{\sum_j n_j} \Longleftrightarrow p(x = i) = \frac{n_i}{\sum_j n_j}
Tail Bounds: Chebyshev, Chernoff, Hoeffding

Expectations and probabilities:

E[f(x)] = \int f(x) \, dp(x) \quad \text{and} \quad \Pr\{x = c\} = E[\mathbf{1}\{x = c\}] = \int \mathbf{1}\{x = c\} \, dp(x)

Their empirical counterparts:

E_{\mathrm{emp}}[f(x)] = \frac{1}{n} \sum_{i=1}^n f(x_i) \quad \text{and} \quad \Pr_{\mathrm{emp}}\{x = c\} = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = c\}

(same trick works for intervals)
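These empirical estimates are one-liners in code; a quick sketch with simulated die rolls (all names and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=1000)  # simulated rolls of a fair die

# Empirical expectation E_emp[f(x)] with f(x) = x
emp_mean = rolls.mean()

# Empirical probabilities Pr_emp{x = c} = (1/n) sum_i 1{x_i = c}
emp_probs = {c: np.mean(rolls == c) for c in range(1, 7)}
print(emp_mean, emp_probs)
```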
IS THE DIE TAINTED?

Roll a die n = 100 times and estimate the frequency of sixes:

\hat{P}(X = 6) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{x_i = 6\}

Suppose we see only 11 sixes (a fair die gives about 17 on average). The probability of seeing that few is

\Pr(X \le 11) = \sum_{i=0}^{11} p(i) = \sum_{i=0}^{11} \binom{100}{i} \left(\frac{1}{6}\right)^i \left(\frac{5}{6}\right)^{100 - i} \approx 7.0\%

It's probably OK ... can we develop a general theory?
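A quick numerical check of this tail probability (assuming scipy is available):

```python
from scipy.stats import binom

# Probability of at most 11 sixes in 100 rolls of a fair die
print(binom.cdf(11, 100, 1 / 6))  # ~0.070, i.e. about 7%
```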
- is the ad campaign working?
- is the new page layout better?
- is the drug working?
Law of large numbers:

\mu = E[x_i], \quad \hat{\mu}_n := \frac{1}{n} \sum_{i=1}^n x_i

Weak law: \lim_{n \to \infty} \Pr(|\hat{\mu}_n - \mu| \le \epsilon) = 1 for any \epsilon > 0

Strong law: \Pr\left(\lim_{n \to \infty} \hat{\mu}_n = \mu\right) = 1

[Figure: 5 sample traces of the running average \hat{\mu}_n of die rolls for n up to 10^3, with the band \mu \pm \sqrt{\mathrm{Var}(x)/n}]
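A small simulation reproducing the figure's setup (a sketch; the seed and trace count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# 5 sample traces of the running average of fair die rolls
for trace in range(5):
    rolls = rng.integers(1, 7, size=n)
    running_mean = np.cumsum(rolls) / np.arange(1, n + 1)
    print(trace, running_mean[-1])  # each trace ends near mu = 3.5

# Half-width of the band mu +/- sqrt(Var(x)/n); Var(x) = 35/12 for a fair die
print(np.sqrt((35 / 12) / n))
```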
Central limit theorem: for independent x_i with means \mu_i and variances \sigma_i^2,

z_n := \left[\sum_{i=1}^n \sigma_i^2\right]^{-\frac{1}{2}} \left[\sum_{i=1}^n (x_i - \mu_i)\right] \to N(0, 1)

In particular, for iid x_i with mean \mu and variance \sigma^2,

\frac{\sqrt{n}}{\sigma} \left[\frac{1}{n} \sum_{i=1}^n x_i - \mu\right] \to N(0, 1)

so the average converges at rate O\left(n^{-\frac{1}{2}}\right).
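A quick sanity check that the standardized average is nearly standard normal (uniform variables are an arbitrary choice here):

```python
import numpy as np

rng = np.random.default_rng(2)

# Averages of n uniform(0,1) variables: mu = 0.5, sigma^2 = 1/12
n, trials = 100, 10_000
z = np.sqrt(n / (1 / 12)) * (rng.random((trials, n)).mean(axis=1) - 0.5)

print(z.mean(), z.std())  # approximately 0 and 1, as the CLT predicts
```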
Delta method: if a_n^{-2}(X_n - b) \to N(0, \Sigma) with a_n^2 \to 0 for n \to \infty, then

a_n^{-2}(g(X_n) - g(b)) \to N(0, [\nabla_x g(b)] \Sigma [\nabla_x g(b)]^\top)

Proof idea: by the mean value theorem, a_n^{-2}[g(X_n) - g(b)] = [\nabla_x g(\xi_n)]^\top a_n^{-2}(X_n - b) for some \xi_n between X_n and b.
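A Monte Carlo check of the delta method for a simple case (X_n the mean of n uniforms, b = 0.5, g(x) = x^2; since g'(0.5) = 1 the limiting standard deviation is sqrt(1/12)):

```python
import numpy as np

rng = np.random.default_rng(3)

# sqrt(n) (X_n - 0.5) -> N(0, 1/12); with g(x) = x^2 and g'(0.5) = 1,
# the delta method gives sqrt(n) (X_n^2 - 0.25) -> N(0, 1/12) as well.
n, trials = 400, 5_000
xbar = rng.random((trials, n)).mean(axis=1)
lhs = np.sqrt(n) * (xbar**2 - 0.25)

print(lhs.std(), np.sqrt(1 / 12))  # both approximately 0.289
```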
Fourier transform:

F[f](\omega) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} f(x) \exp(-i \langle \omega, x \rangle) \, dx

F^{-1}[g](x) := (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} g(\omega) \exp(i \langle \omega, x \rangle) \, d\omega

Useful properties:

F^{-1} \circ F = F \circ F^{-1} = \mathrm{Id}, \quad F[f \ast g] = (2\pi)^{\frac{d}{2}} F[f] \cdot F[g], \quad F[\partial_x f] = i \omega F[f]

Characteristic function (the inverse Fourier transform of the density, up to normalization):

\varphi_X(\omega) := \int \exp(i \langle \omega, x \rangle) \, dp(x)

For independent X and Y,

p_{X+Y}(z) = \int p_X(z - y) \, p_Y(y) \, dy = (p_X \ast p_Y)(z) \quad \text{and hence} \quad \varphi_{X+Y}(\omega) = \varphi_X(\omega) \cdot \varphi_Y(\omega)

(need to assume that we can bound the tail)
Weak law of large numbers via characteristic functions. Taylor expansion:

\exp(i \omega x) = 1 + i \langle \omega, x \rangle + o(|\omega|) \quad \text{and hence} \quad \varphi_X(\omega) = 1 + i \omega E_X[x] + o(|\omega|)

The average \hat{\mu}_m is a convolution of rescaled copies, so its characteristic function is a product:

\varphi_{\hat{\mu}_m}(\omega) = \left(1 + \frac{i}{m} \omega \mu + o(m^{-1} |\omega|)\right)^m \to \exp(i \omega \mu) = 1 + i \omega \mu + \dots

The higher-order terms vanish, and the limit is the characteristic function of a point mass at the mean \mu.
This fails when the mean does not exist, e.g. for the Cauchy distribution:

p(x) = \frac{1}{\pi} \frac{1}{1 + x^2} \quad \text{with} \quad \int |x| \, dp(x) \ge \frac{2}{\pi} \int_1^\infty \frac{x}{1 + x^2} \, dx \ge \frac{1}{\pi} \int_1^\infty \frac{1}{x} \, dx = \infty
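One can watch this failure numerically: the running average of Cauchy samples never settles down (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Running average of Cauchy samples: no mean, so no convergence
n = 100_000
x = rng.standard_cauchy(n)
running_mean = np.cumsum(x) / np.arange(1, n + 1)
print(running_mean[[999, 9_999, 99_999]])  # keeps jumping, no limit
```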
Central limit theorem via characteristic functions. Second-order expansion:

\exp(i \omega x) = 1 + i \omega x - \frac{1}{2} \omega^2 x^2 + o(|\omega|^2) \quad \text{and hence} \quad \varphi_X(\omega) = 1 + i \omega E_X[x] - \frac{1}{2} \omega^2 \mathrm{var}_X[x] + o(|\omega|^2)

Applying this to the standardized sum

z_n := \left[\sum_{i=1}^n \sigma_i^2\right]^{-\frac{1}{2}} \left[\sum_{i=1}^n (x_i - \mu_i)\right]

yields

\varphi_{Z_m}(\omega) = \left(1 - \frac{1}{2m} \omega^2 + o(m^{-1} |\omega|^2)\right)^m \to \exp\left(-\frac{1}{2} \omega^2\right) \quad \text{for } m \to \infty,

the characteristic function of N(0, 1).
[Figure: two rows of five panels showing the distribution of averages approaching the Gaussian limit as the sample size grows]
Markov's inequality: for a nonnegative random variable X with mean \mu,

\Pr(X \ge \epsilon) \le \mu / \epsilon

Proof: decompose the expectation:

\Pr(X \ge \epsilon) = \int_\epsilon^\infty dp(x) \le \int_\epsilon^\infty \frac{x}{\epsilon} \, dp(x) \le \epsilon^{-1} \int_0^\infty x \, dp(x) = \frac{\mu}{\epsilon}

Chebyshev's inequality: for a random variable X with mean \mu and variance \sigma^2,

\Pr(|X - \mu| \ge \epsilon) \le \sigma^2 / \epsilon^2

Proof: applying Markov's inequality to Y = (X - \mu)^2 with confidence \epsilon^2 yields the result.

For the empirical average of m iid draws,

\Pr(|\hat{\mu}_m - \mu| > \epsilon) \le \sigma^2 m^{-1} \epsilon^{-2}, \quad \text{or equivalently} \quad \epsilon \propto 1/\sqrt{m}.

Solving at confidence level \delta gives \epsilon \le \sigma / \sqrt{\delta m} from Chebyshev, versus \epsilon \le \mu / \delta from Markov.
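A small experiment comparing the Chebyshev bound to the actual tail probability (distribution and constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Average of m uniform(0,1) draws: sigma^2 = 1/12
m, eps, trials = 100, 0.05, 100_000
mu_hat = rng.random((trials, m)).mean(axis=1)

empirical = np.mean(np.abs(mu_hat - 0.5) > eps)
chebyshev = (1 / 12) / (m * eps**2)
print(empirical, chebyshev)  # ~0.08 vs bound ~0.33: valid but loose
```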
Chernoff bound. Define the KL divergence between Bernoulli parameters

K(p, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q}

W.l.o.g. q > p; for k \ge qn,

\frac{\Pr\{\sum_i x_i = k \mid q\}}{\Pr\{\sum_i x_i = k \mid p\}} = \frac{q^k (1 - q)^{n - k}}{p^k (1 - p)^{n - k}} \ge \frac{q^{qn} (1 - q)^{n - qn}}{p^{qn} (1 - p)^{n - qn}} = \exp(n K(q, p))

Hence

\sum_{k \ge nq} \Pr\left\{\sum_i x_i = k \,\middle|\, p\right\} \le \sum_{k \ge nq} \Pr\left\{\sum_i x_i = k \,\middle|\, q\right\} \exp(-n K(q, p)) \le \exp(-n K(q, p))

and therefore

\Pr\left\{\sum_i x_i \ge nq\right\} \le \exp(-n K(q, p)) \le \exp(-2n (q - p)^2)
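A numerical comparison of the exact binomial tail, the KL form, and the weaker quadratic form, for the die example (the threshold q = 1/4 is illustrative; assumes scipy):

```python
import numpy as np
from scipy.stats import binom

def kl_bernoulli(p, q):
    """K(p, q) for Bernoulli distributions, as defined above."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Pr(at least nq sixes) for a fair die: p = 1/6, q = 1/4, n = 100
n, p, q = 100, 1 / 6, 1 / 4
exact = binom.sf(n * q - 1, n, p)            # Pr(X >= 25)
kl_bound = np.exp(-n * kl_bernoulli(q, p))
quad_bound = np.exp(-2 * n * (q - p) ** 2)
print(exact, kl_bound, quad_bound)           # ~0.02 <= ~0.11 <= ~0.25
```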
McDiarmid's inequality: for f : \mathcal{X}^m \to \mathbb{R} with bounded differences

|f(x_1, \dots, x_i, \dots, x_m) - f(x_1, \dots, x'_i, \dots, x_m)| \le c_i

we have

\Pr(|f(x_1, \dots, x_m) - E_{X_1, \dots, X_m}[f(x_1, \dots, x_m)]| > \epsilon) \le 2 \exp\left(-\frac{2 \epsilon^2}{C^2}\right) \quad \text{where} \quad C^2 = \sum_{i=1}^m c_i^2

Hoeffding's inequality: for the empirical average of variables with range c, each c_i = c/m, hence C^2 = c^2/m and

\Pr(|\hat{\mu}_m - \mu| > \epsilon) \le 2 \exp\left(-\frac{2 m \epsilon^2}{c^2}\right).

Solving for the deviation at confidence \delta:

\delta := \Pr(|\hat{\mu}_m - \mu| > \epsilon) \le 2 \exp\left(-\frac{2 m \epsilon^2}{c^2}\right) \Rightarrow \log(\delta/2) \le -\frac{2 m \epsilon^2}{c^2} \Rightarrow \epsilon \le c \sqrt{\frac{\log 2 - \log \delta}{2 m}}
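In code, this is a one-line conversion from sample size and confidence to a deviation guarantee (function name is illustrative):

```python
import numpy as np

def hoeffding_eps(m, delta, c=1.0):
    """Deviation eps with Pr(|mu_hat - mu| > eps) <= delta, range c."""
    return c * np.sqrt((np.log(2) - np.log(delta)) / (2 * m))

# e.g. m = 10,000 samples in [0, 1] at 95% confidence
print(hoeffding_eps(10_000, 0.05))  # ~0.0136
```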
Bernstein's inequality controls exponential sums (hence an exponential inequality) using the variance rather than only the range: for independent zero-mean X_i,

\Pr(\hat{\mu}_m - \mu \ge \epsilon) = \Pr\left(\sum_i X_i \ge t\right) \le \exp\left(-\frac{t^2/2}{\sum_i E[X_i^2] + M t / 3}\right)

with X_i = x_i - \mu and t = m \epsilon; here M upper-bounds the random variables X_i.
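A sketch of the resulting bound for a low-variance case, where Bernstein is far tighter than Hoeffding (all constants are illustrative):

```python
import numpy as np

def bernstein_bound(m, eps, var, M=1.0):
    """One-sided Bernstein bound on Pr(mu_hat - mu >= eps)."""
    t = m * eps
    return np.exp(-(t**2 / 2) / (m * var + M * t / 3))

# Centered Bernoulli(0.01): variance 0.0099, |X_i| <= 1
print(bernstein_bound(10_000, 0.005, var=0.01 * 0.99))  # ~2e-5
# Hoeffding for the same case: 2 exp(-2 m eps^2) ~ 1.2 (vacuous)
```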
Example: does the new page layout generate more clicks? Say we want to detect a difference of \epsilon = 0.01 in the click-through rate with 95% confidence (\delta = 0.05). How many users do we need?

Chebyshev with the worst-case variance bound \sigma^2 = 0.25:

m \le \frac{\sigma^2}{\delta \epsilon^2} = \frac{0.25}{0.01^2 \cdot 0.05} = 50{,}000

If we know the click-through rate is about 15%, we can bound the variance at 0.15 \cdot 0.85 = 0.1275. This requires at most 25,500 users.

Hoeffding (range c = 1):

m \le \frac{-c^2 \log(\delta/2)}{2 \epsilon^2} = \frac{-\log 0.025}{2 \cdot 0.01^2} < 18{,}445

Normal approximation via the CLT (rather than a variance bound): requiring 95% of the Gaussian mass within \mu \pm \epsilon,

\frac{1}{\sqrt{2 \pi \sigma^2}} \int_{\mu - \epsilon}^{\mu + \epsilon} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right) dx = 0.95 \Rightarrow m \le \frac{2.96^2 \sigma^2}{\epsilon^2} = \frac{2.96^2 \cdot 0.1275}{0.01^2} \le 11{,}172

(more details at the end of this class)
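These sample-size calculations are easy to reproduce (numbers match the slides up to rounding):

```python
import numpy as np

eps, delta = 0.01, 0.05

# Chebyshev: m <= sigma^2 / (delta * eps^2)
print(0.25 / (delta * eps**2))             # 50,000 (worst-case variance)
print(0.15 * 0.85 / (delta * eps**2))      # 25,500 (click rate ~15%)

# Hoeffding: m <= -c^2 log(delta / 2) / (2 eps^2), with c = 1
print(-np.log(delta / 2) / (2 * eps**2))   # ~18,444

# Normal approximation with z = 2.96 as on the slide
print(2.96**2 * 0.1275 / eps**2)           # ~11,171
```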