First order optimization. Last time. Other scenarios. min f ( x ) - - PowerPoint PPT Presentation

SLIDE 1

First order optimization.

$\min f(x)$

Convexity: $f(x) + \nabla f(x) \cdot (y - x) \le f(y)$.
Lipschitz: $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$.
$\nabla f(x)$: gradient or subgradient.

Gradient Descent: $x_{t+1} = x_t - \alpha \nabla f(x_t)$.
One bound (Lipschitz, step size $\alpha = 1/L$): $f(x_t) - f(x_{t+1}) \ge \frac{\|\nabla f(x_t)\|^2}{2L}$.

"Mirror" Descent: $x_{t+1} = x_t - \alpha \nabla f(x_t)$ for the "Euclidean proximity function."
Output: average point.
One bound: total difference from optimal, or "regret," is at most $\frac{\sum_t \alpha \|\nabla f(x_t)\|^2 + w(u)}{T}$.
No Lipschitz condition. Works for subgradients.
Idea: average lower bound is average of linear lower bounds.
$R(u) - w(x) = \sum_t \nabla f(x_t) \cdot (x - u) - w(u)$.

What is $w(x)$? One option: Euclidean norm of $x$.
Another: $w(x) = \sum_i x_i \log x_i$. Get multiplicative weight update!
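To make the link between the entropy function $w(x) = \sum_i x_i \log x_i$ and multiplicative weights concrete, here is a minimal Python sketch. It assumes the feasible set is the probability simplex and that $f$ is linear; the function names, the cost vector, and the step size are illustrative choices, not taken from the slides.

```python
import math

def entropy_mirror_step(x, grad, alpha):
    """One mirror descent step on the simplex with w(x) = sum_i x_i log x_i."""
    # The entropy mirror map turns the additive gradient step into a
    # multiplicative one: each weight is scaled by exp(-alpha * gradient).
    scaled = [xi * math.exp(-alpha * gi) for xi, gi in zip(x, grad)]
    total = sum(scaled)
    # Renormalize so the new point stays on the simplex.
    return [s / total for s in scaled]

def mirror_descent_linear(cost, steps, alpha):
    """Minimize f(x) = cost . x over the simplex; output the average point."""
    n = len(cost)
    x = [1.0 / n] * n          # start at the uniform distribution
    avg = [0.0] * n
    for _ in range(steps):
        avg = [a + xi / steps for a, xi in zip(avg, x)]
        x = entropy_mirror_step(x, cost, alpha)  # gradient of linear f is cost
    return avg

# Hypothetical costs: coordinate 1 is the cheapest, so weight should move there.
avg = mirror_descent_linear([0.9, 0.1, 0.5], steps=200, alpha=0.5)
```

The averaged point concentrates on the cheapest coordinate, which is the behaviour the regret bound predicts for the average iterate.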

Last time.

Gradient Descent: $x_{t+1} = x_t - \alpha \nabla f(x_t)$.
One bound (Lipschitz, step size $\alpha = 1/L$): $f(x_t) - f(x_{t+1}) \ge \frac{\|\nabla f(x_t)\|^2}{2L}$.

"Mirror" Descent: $x_{t+1} = x_t - \alpha \nabla f(x_t)$ for the "Euclidean proximity function."
Output: average point.
One bound: total difference from optimal, or "regret," is at most $\frac{\sum_t \alpha \|\nabla f(x_t)\|^2 + w(u)}{T}$.

Accelerated Gradient Descent: $x_{t+1} = x_t + \alpha_i (x_t - x_{t-1}) - \beta_i \nabla f(x_t)$.
Momentum term: $(x_t - x_{t+1}) = \sum_i \nu_i \nabla f(x_i)$, where $\sum_i \nu_i = 1$.
Mirror Descent point!
Idea of analysis: the benefit from the gradient step cancels some of the regret term of MD.
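The per-step progress of gradient descent can be checked numerically. The sketch below is an illustration under assumptions: it uses a diagonal quadratic $f(x) = \frac{1}{2}\sum_i a_i x_i^2$ (whose gradient is Lipschitz with $L = \max_i a_i$), step size $\alpha = 1/L$, and compares the actual decrease with the standard guarantee $\|\nabla f(x_t)\|^2 / (2L)$. The function name and test data are hypothetical.

```python
def grad_descent_quadratic(a, x0, steps):
    """Gradient descent on f(x) = 0.5 * sum_i a[i] * x[i]**2 with step 1/L."""
    L = max(a)  # the gradient of this f is L-Lipschitz with L = max_i a[i]

    def f(x):
        return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))

    x = list(x0)
    history = []  # one (actual decrease, guaranteed decrease) pair per step
    for _ in range(steps):
        g = [ai * xi for ai, xi in zip(a, x)]            # gradient at x
        x_next = [xi - gi / L for xi, gi in zip(x, g)]   # x - (1/L) * gradient
        bound = sum(gi * gi for gi in g) / (2 * L)       # ||grad||^2 / (2L)
        history.append((f(x) - f(x_next), bound))
        x = x_next
    return x, history

x_final, history = grad_descent_quadratic([1.0, 2.0, 4.0], [1.0, -2.0, 3.0], steps=50)
```

Every recorded decrease is at least the guaranteed amount, and the iterates approach the minimizer at the origin.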

Other scenarios.

Don't you dual norm me!
Norm: $\|x\|$. Dual norm: $\|y\|_*$.
$\|y\|_* = \max_{\|x\| = 1} \langle x, y \rangle$.
For Euclidean norm, what is dual norm?
For $\ell_1$ or Hamming norm, what is dual norm?
$\|x\|_1 = \sum_i |x_i|$.
$\|x\|_\infty = \max_i |x_i|$.
Can be Lipschitz in different norms: $\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|$.
Gradient step: $x_{t+1} = x_t - \alpha \arg\max_{\|y\| = 1} \langle \nabla f(x_t), y \rangle$.
Lipschitz in $\ell_1$, when optimizing $\sum_i |x_i|$.
E.g. Max Flow or tolls.
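The two questions on the slide have standard answers: the Euclidean norm is its own dual, and the dual of $\ell_1$ is $\ell_\infty$ (and vice versa). The Python sketch below, with hypothetical function names, computes the dual norm and a maximizing unit vector for these three cases, assuming a nonzero input vector.

```python
import math

def dual_norm(y, p):
    """Dual of the l_p norm evaluated at y, for p in {1, 2, math.inf}."""
    if p == 1:
        return max(abs(v) for v in y)             # dual of l_1 is l_inf
    if p == 2:
        return math.sqrt(sum(v * v for v in y))   # l_2 is self-dual
    if p == math.inf:
        return sum(abs(v) for v in y)             # dual of l_inf is l_1
    raise ValueError("only p in {1, 2, inf} is handled in this sketch")

def steepest_direction(g, p):
    """A vector y with l_p norm 1 maximizing <g, y> (assumes g is nonzero)."""
    if p == 1:
        # Put all mass on the largest-magnitude coordinate.
        i = max(range(len(g)), key=lambda j: abs(g[j]))
        return [math.copysign(1.0, g[i]) if j == i else 0.0 for j in range(len(g))]
    if p == 2:
        norm = math.sqrt(sum(v * v for v in g))
        return [v / norm for v in g]
    if p == math.inf:
        # Match the sign of every coordinate.
        return [math.copysign(1.0, v) for v in g]
    raise ValueError("only p in {1, 2, inf} is handled in this sketch")

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

g = [3.0, -4.0, 1.0]
```

In each case the inner product of `g` with its steepest direction equals the dual norm of `g`, which is exactly the definition $\|y\|_* = \max_{\|x\|=1} \langle x, y \rangle$.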

Next Topic

Streaming. Frequent Items.

Streaming

Stream: $x_1, x_2, x_3, \ldots, x_n$.
Resources: $O(\log^c n)$ storage.
Today's goal: find frequent items.

Frequent Items: deterministic.

Additive $\frac{n}{k}$ error.
Accurate count for the $(k+1)$st item? Yes? No?
The $(k+1)$st most frequent item occurs $< \frac{n}{k+1}$ times.
Off by 100%. 0 estimate is fine.
No item more frequent than $\frac{n}{k}$? 0 estimate is fine.
Only reasonable for frequent items.

SLIDE 2

Deterministic Algorithm.

Alg:
(1) Set, S, of k counters, initially 0.
(2) If $x_i \in S$, increment $x_i$'s counter.
(3) If $x_i \notin S$: if S has space, add $x_i$ to S with value 1. Otherwise decrement all counters. Delete zero-count elements.

Example with k = 3, state after each stream prefix:
Stream (empty): []
Stream 1: [(1, 1)]
Stream 1, 2: [(1, 1), (2, 1)]
Stream 1, 2, 3: [(1, 1), (2, 1), (3, 1)]
Stream 1, 2, 3, 1: [(1, 2), (2, 1), (3, 1)]
Stream 1, 2, 3, 1, 2: [(1, 2), (2, 2), (3, 1)]
Stream 1, 2, 3, 1, 2, 4: [(1, 1), (2, 1), (3, 0)]
Continuing the stream, the last state shown is [(1, 2), (2, 2), (3, 0)].

Deterministic Algorithm.

Alg:
(1) Set, S, of k counters, initially 0.
(2) If $x_i \in S$, increment $x_i$'s counter.
(3) If $x_i \notin S$: if S has space, add $x_i$ to S with value 1. Otherwise decrement all counters.
Estimate for item: if in S, value of counter; otherwise 0.

Underestimate clearly. Increment once when we see an item, might decrement.
Total decrements, T? n? n/k? k?
Decrement k counters on each decrement: Tk total decrementing over n items, n total incrementing.
$\implies T \le \frac{n}{k}$.
Off by at most $\frac{n}{k}$.
Space? $O(k \log n)$.
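The three steps of the algorithm translate almost directly into code. The following Python sketch is one possible reading of the slide, using a dictionary for the counter set S; the names `frequent_items` and `estimate` are hypothetical. It replays the stream 1, 2, 3, 1, 2, 4 with k = 3.

```python
from collections import Counter

def frequent_items(stream, k):
    """Deterministic frequent-items summary using at most k counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1                 # step (2): item already tracked
        elif len(counters) < k:
            counters[x] = 1                  # step (3): room for a new counter
        else:
            # Step (3), no room: decrement every counter, drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def estimate(counters, item):
    """Counter value if the item is tracked, otherwise 0."""
    return counters.get(item, 0)

stream = [1, 2, 3, 1, 2, 4]
k = 3
counters = frequent_items(stream, k)

# Underestimation error per item: true count minus estimate.
errors = {x: c - estimate(counters, x) for x, c in Counter(stream).items()}
```

On this stream the arrival of item 4 triggers a decrement, leaving counters for items 1 and 2 only, and every error stays between 0 and n/k = 2.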

Turnstile Model and Randomization

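Stream: $\ldots, (i, c_i), \ldots$ with item $i$ and count $c_i$ (possibly negative).
Positive total for each item!
Estimate frequency of item: $f_j = \sum c_j$, the total of the counts seen for item $j$.
$|f|_1 = \sum_j |f_j|$.
Smaller than $\sum_i |c_i|$.
Approximation: additive $\epsilon |f|_1$ with probability $1 - \delta$.
Space: $O(\frac{1}{\epsilon} \log \frac{1}{\delta} \log n)$.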

Count Min Sketch

Sketch: summary of stream.
(1) t arrays, A[i], of k counters. $h_1, \ldots, h_t$ from a 2-wise independent family.
(2) Process element $(j, c_j)$: A[i][h_i(j)] += c_j for each i.
(3) Item j estimate: $\min_i A[i][h_i(j)]$.
Intuition: $|f|_1 / k$ other "counts" in same bucket.
→ Additive $|f|_1 / k$ error on average for each of the t arrays.
Why t arrays? To get high probability.
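A compact Python sketch of the data structure may help. It is an illustration under assumptions: items are integers, and the hash functions are drawn from the family $h(x) = ((ax + b) \bmod p) \bmod k$ with a fixed large prime $p$, a common stand-in for a 2-wise independent family. The class name, the seed parameter, and the demo updates are hypothetical.

```python
import random

P = (1 << 61) - 1  # a large prime used by the assumed hash family

class CountMin:
    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k = k
        # One (a, b) pair per array defines h_i(x) = ((a*x + b) mod P) mod k.
        self.hashes = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.A = [[0] * k for _ in range(t)]   # t arrays of k counters

    def _bucket(self, i, j):
        a, b = self.hashes[i]
        return ((a * j + b) % P) % self.k

    def update(self, j, c):
        """Process element (j, c): add c to one bucket in each array."""
        for i in range(len(self.A)):
            self.A[i][self._bucket(i, j)] += c

    def estimate(self, j):
        """Item j estimate: the minimum over the t arrays."""
        return min(self.A[i][self._bucket(i, j)] for i in range(len(self.A)))

# Demo stream with one negative count; every item keeps a positive total.
updates = [(1, 5), (2, 3), (1, -2), (7, 10), (3, 1)]
cm = CountMin(k=16, t=4, seed=1)
truth = {}
for j, c in updates:
    cm.update(j, c)
    truth[j] = truth.get(j, 0) + c
```

Because every item's total is positive, each bucket holds the true frequency plus a nonnegative collision term, so the estimate never falls below the true frequency; taking the minimum picks the array with the least collision noise.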

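Count Min Sketch: analysis

(1) t arrays, A[i], of k counters. $h_1, \ldots, h_t$ from a 2-wise independent family.
(2) Process element $(j, c_j)$: A[i][h_i(j)] += c_j for each i.
(3) Item j estimate: $\min_i A[i][h_i(j)]$.

$A[1][h_1(j)] = f_j + X$, where $X$ is a random variable.
$Y_i$: indicator that $h_1(i) = h_1(j)$.
$X = \sum_{i \ne j} Y_i f_i$
$E[X] = \sum_{i \ne j} E[Y_i] f_i = \sum_{i \ne j} \frac{1}{k} f_i \le \frac{|f|_1}{k}$
Markov: $\Pr[X > 2 \frac{|f|_1}{k}] \le \frac{1}{2}$.
Exercise: proof of Markov. (All above average?)
t independent trials, pick smallest.
$\Pr[X > 2 \frac{|f|_1}{k} \text{ in all } t \text{ trials}] \le (\frac{1}{2})^t \le \delta$ when $t = \log \frac{1}{\delta}$.
Error $\epsilon |f|_1$ if $\epsilon = \frac{2}{k}$.
Space? $O(k \log \frac{1}{\delta} \log n) = O(\frac{1}{\epsilon} \log \frac{1}{\delta} \log n)$.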

Count sketch.

Error in terms of $|f|_2 = \sqrt{\sum_i f_i^2}$.
$\frac{|f|_1}{\sqrt{n}} \le |f|_2 \le |f|_1$.
Could be much better. E.g., uniform frequency: $\frac{|f|_1}{\sqrt{n}} = |f|_2$.

Alg:
(1) t arrays, A[i]: t hash functions $h_i : U \to [k]$, t hash functions $g_i : U \to \{-1, +1\}$.
(2) Element $(j, c_j)$: A[i][h_i(j)] = A[i][h_i(j)] + g_i(j) c_j.
(3) Item j estimate: median of $g_i(j) A[i][h_i(j)]$.

Buckets contain signed counts (estimate cancels sign).
Other items cancel each other out!
Tight! (Not an asymptotic statement.)
Do t times and average? No! Median!
Two ideas! One simple algorithm!
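The same structure with random signs and a median can be sketched in Python as follows. As before, this is an illustration under assumptions: integer items, bucket hashes of the form $((ax + b) \bmod p) \bmod k$, and sign hashes obtained from the parity of a second hash of the same form. The class name and demo data are hypothetical.

```python
import random
import statistics

P = (1 << 61) - 1  # a large prime used by the assumed hash families

class CountSketch:
    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k = k
        self.h = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.g = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.A = [[0] * k for _ in range(t)]   # t arrays of k signed counters

    def _bucket(self, i, j):
        a, b = self.h[i]
        return ((a * j + b) % P) % self.k

    def _sign(self, i, j):
        a, b = self.g[i]
        return 1 if ((a * j + b) % P) % 2 == 0 else -1

    def update(self, j, c):
        """Element (j, c): add g_i(j) * c to bucket h_i(j) of each array."""
        for i in range(len(self.A)):
            self.A[i][self._bucket(i, j)] += self._sign(i, j) * c

    def estimate(self, j):
        """Median over arrays of g_i(j) * A[i][h_i(j)]; the sign cancels."""
        values = [self._sign(i, j) * self.A[i][self._bucket(i, j)]
                  for i in range(len(self.A))]
        return statistics.median(values)

updates = [(1, 5), (2, 3), (1, -2), (7, 10), (3, 1)]
cs = CountSketch(k=16, t=5, seed=1)
truth = {}
for j, c in updates:
    cs.update(j, c)
    truth[j] = truth.get(j, 0) + c

# With a single item in the stream there are no collisions to cancel.
single = CountSketch(k=8, t=3, seed=2)
single.update(42, 7)
```

Multiplying the bucket by $g_i(j)$ recovers $f_j$ exactly (since $g_i(j)^2 = 1$) plus signed contributions from colliding items, and the median across arrays guards against the occasional array with a bad collision.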

SLIDE 3

Analysis

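(1) · · · $g_i : U \to \{-1, +1\}$, $h_i : U \to [k]$
(2) Element $(j, c_j)$: A[i][h_i(j)] = A[i][h_i(j)] + g_i(j) c_j.
(3) Item j estimate: median of $g_i(j) A[i][h_i(j)]$.

Notice: $A[1][h_1(j)] = g_1(j) f_j + X$, where $X = \sum_i Y_i$.
$Y_i = \pm f_i$ if $h_1(i) = h_1(j)$; $Y_i = 0$ otherwise.
$E[Y_i] = 0$, $\mathrm{Var}(Y_i) = \frac{f_i^2}{k}$.
$E[X] = 0$. Expected drift is 0!
$\mathrm{Var}[X] = \sum_{i \in [m]} \mathrm{Var}(Y_i) = \sum_i \frac{f_i^2}{k} = \frac{|f|_2^2}{k}$
Chebyshev: $\Pr[|X - \mu| > \Delta] \le \frac{\mathrm{Var}(X)}{\Delta^2}$
Choose $k = \frac{4}{\epsilon^2}$: $\Pr[|X| > \epsilon |f|_2] \le \frac{|f|_2^2 / k}{\epsilon^2 |f|_2^2} \le \frac{\epsilon^2 |f|_2^2 / 4}{\epsilon^2 |f|_2^2} \le \frac{1}{4}$.

Each trial is close with probability 3/4.
If more than half of the tosses are close, the median is close!
There exists $t = \Theta(\log \frac{1}{\delta})$ where $\ge \frac{1}{2}$ are correct with probability $\ge 1 - \delta$.
Total space: $O(\frac{\log \frac{1}{\delta}}{\epsilon^2} \log n)$.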

Sum up

Deterministic: stream has items.
Count within additive $\frac{n}{k}$ with $O(k \log n)$ space.
Within $\epsilon n$ with $O(\frac{1}{\epsilon} \log n)$ space.

Count Min: stream has ± counts.
Count within additive $\epsilon |f|_1$ with probability at least $1 - \delta$.
Space: $O(\frac{\log n \log \frac{1}{\delta}}{\epsilon})$.

Count Sketch: stream has ± counts.
Count within additive $\epsilon |f|_2$ with probability at least $1 - \delta$.
Space: $O(\frac{\log n \log \frac{1}{\delta}}{\epsilon^2})$.

See you on Thursday.