10701 Machine Learning Recitation 7 - Tail bounds and Averages


SLIDE 1

10701 Machine Learning

Recitation 7 - Tail bounds and Averages

Ahmed Hefny (slides mostly by Alex Smola)

SLIDE 2

Why this stuff?

  • Can machine learning work?
  • Yes, otherwise:
  • No Google
  • No spam filters
  • No face detectors
  • No 701 midterm
  • I’d be living my life
SLIDE 4

Why this stuff?

  • Will machine learning always work?
  • No:

[Figure: two classes of points, with query points marked “?” that could belong to either Class 1 or Class 2]

SLIDE 6

Why this stuff?

  • We need some theory to analyze machine learning algorithms.
  • We will go through the basic tools used to build that theory.
  • How well can we estimate quantities from data?
  • What is the convergence behavior of empirical averages?

SLIDE 7

Outline

  • Estimation Example
  • Convergence of Averages
  • Law of Large Numbers
  • Central Limit Theorem
  • Inequalities and Tail Bounds
  • Markov’s Inequality
  • Chebyshev’s Inequality
  • Hoeffding’s and McDiarmid’s Inequalities
  • Proof Tools
  • Union Bound
  • Fourier Analysis
  • Characteristic Functions
SLIDE 8

Estimating Probabilities

SLIDE 9

Discrete Distribution

  • n outcomes (e.g. faces of a die)
  • Data likelihood
  • Maximum Likelihood Estimation
  • Constrained optimization problem ... or ...
  • Incorporate the constraint that the probabilities sum to one via a Lagrange multiplier
  • Taking derivatives yields the empirical frequencies as the estimate
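The formulas on this slide were images and did not survive extraction. As a hedged illustration of the result the bullets describe (the maximum likelihood estimate of a discrete distribution is the vector of empirical frequencies), here is a minimal sketch; the function name `mle_discrete` is ours, not from the slides:

```python
import numpy as np

def mle_discrete(samples, n_outcomes):
    """Maximum likelihood estimate of a discrete distribution:
    the empirical frequency of each outcome."""
    counts = np.bincount(samples, minlength=n_outcomes)
    return counts / counts.sum()

# Example: estimate face probabilities of a fair die from 120 rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(0, 6, size=120)  # outcomes encoded as 0..5
print(mle_discrete(rolls, 6))         # each entry should be near 1/6
```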
SLIDE 10

Tossing a Die

[Figure: empirical frequency estimates of the six faces for different numbers of tosses: 24, 120, 60, 12]
SLIDE 14

Empirical average for a die

SLIDE 15

Empirical average for a die

Is it guaranteed to converge? How quickly does it converge?
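A quick simulation makes the question concrete; a minimal sketch of our own, not from the slides:

```python
import numpy as np

# Running empirical average of fair die rolls; the true mean is 3.5.
rng = np.random.default_rng(1)
rolls = rng.integers(1, 7, size=10_000)
running_avg = np.cumsum(rolls) / np.arange(1, rolls.size + 1)
print(running_avg[[9, 99, 999, 9999]])  # drifts toward 3.5 as n grows
```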

SLIDE 16

Convergence of Empirical Averages

SLIDE 17

Expectations

  • Random variable x with probability measure p
  • Expected value of f(x)
  • Special case - discrete probability mass (same trick works for intervals)
  • Draw xi identically and independently from p
  • Empirical average
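The defining formulas were images on the slide; the standard definitions the bullets refer to are presumably:

$$\mathbf{E}[f(x)] = \int f(x)\,dp(x), \qquad \mathbf{E}[f(x)] = \sum_x p(x)\,f(x) \ \text{(discrete case)}, \qquad \hat{\mathbf{E}}_m[f(x)] = \frac{1}{m}\sum_{i=1}^m f(x_i).$$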
SLIDE 18

Law of Large Numbers

  • Random variables xi with mean μ
  • Empirical average
  • Weak Law of Large Numbers (convergence in probability)
  • Strong Law of Large Numbers (almost sure convergence)
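Stated formally (the equations were images; these are the standard statements for the empirical average of m draws):

$$\text{Weak LLN:}\ \lim_{m\to\infty}\Pr\left(\left|\hat{\mu}_m - \mu\right| \ge \epsilon\right) = 0 \ \text{for every } \epsilon > 0, \qquad \text{Strong LLN:}\ \Pr\left(\lim_{m\to\infty}\hat{\mu}_m = \mu\right) = 1.$$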

SLIDE 19

Empirical average for a die

[Figure: 5 sample traces of the running empirical average]

SLIDE 20

Central Limit Theorem

  • Independent random variables xi with mean μi and standard deviation σi
  • The standardized sum converges to a Normal Distribution
  • Special case - IID random variables & convergence of the average
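The displayed random variable was an image; the standard statement it presumably shows is:

$$Z_m = \frac{\sum_{i=1}^m (x_i - \mu_i)}{\sqrt{\sum_{i=1}^m \sigma_i^2}} \xrightarrow{d} \mathcal{N}(0,1), \qquad \sqrt{m}\,\frac{\hat{\mu}_m - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0,1) \ \text{(IID case)}.$$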

SLIDE 21

Central Limit Theorem in Practice

[Figure: distributions of empirical averages, unscaled and scaled]
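A simulation of the scaled case; a minimal sketch of our own, not from the slides:

```python
import numpy as np

# The scaled deviation sqrt(m) * (avg - mu) / sigma of die-roll
# averages is approximately standard normal for large m.
rng = np.random.default_rng(2)
mu, sigma = 3.5, np.sqrt(35 / 12)  # mean and std of a single die roll
m, trials = 1000, 5000
avgs = rng.integers(1, 7, size=(trials, m)).mean(axis=1)
scaled = np.sqrt(m) * (avgs - mu) / sigma
print(scaled.mean(), scaled.std())  # both close to 0 and 1
```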

SLIDE 22

Tail Bounds

SLIDE 23

Simple tail bounds

  • Gauss-Markov inequality: non-negative random variable X with mean μ
  • Proof - decompose the expectation
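The inequality and its one-line proof were images; the standard statement and the decomposition the proof refers to are:

$$\Pr(X \ge \epsilon) \le \frac{\mu}{\epsilon}, \qquad \text{since} \quad \mu = \mathbf{E}[X] \ge \mathbf{E}\left[X\,\mathbf{1}\{X \ge \epsilon\}\right] \ge \epsilon\,\Pr(X \ge \epsilon).$$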
SLIDE 24

Simple tail bounds

  • Chebyshev inequality: random variable X with mean μ and variance σ²

$$\Pr\left(|X - \mu| \ge \epsilon\right) \le \frac{\sigma^2}{\epsilon^2}$$

  • Proof - applying Gauss-Markov to Y = (X - μ)² with confidence ε² yields the result.

Probably Approximately Correct!

SLIDE 28

Scaling behavior

  • Gauss-Markov: scales properly in μ but expensive in δ
  • Chebyshev: proper scaling in σ but still bad in δ
  • Can we get logarithmic scaling in δ?
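Making the scaling explicit (our reconstruction, not from the slides): fix a failure probability δ, set the right-hand side of each bound to δ, and solve for the deviation ε. Then

$$\text{Markov: } \epsilon(\delta) = \frac{\mu}{\delta}, \qquad \text{Chebyshev: } \epsilon(\delta) = \frac{\sigma}{\sqrt{\delta}}, \qquad \text{Hoeffding: } \epsilon(\delta) = c\,\sqrt{\frac{1}{2n}\log\frac{2}{\delta}},$$

so only Hoeffding-type bounds pay logarithmically in 1/δ.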

SLIDE 29

Chernoff bound

  • For a Bernoulli random variable with P(x=1) = p
  • Ex: n independent tosses of a biased coin with probability p of heads

$$\Pr\left(\hat{p}_n - p \ge \epsilon\right) \le \exp\left(-2n\epsilon^2\right)$$

where p̂n is the empirical frequency of heads.
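A quick numeric check of the rate (our example): with n = 1000 tosses and ε = 0.05,

$$\Pr\left(\hat{p}_{1000} - p \ge 0.05\right) \le \exp\left(-2 \cdot 1000 \cdot 0.05^2\right) = e^{-5} \approx 0.0067.$$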

SLIDE 30

Chernoff bound

  • Proof: We show that the tail probability is bounded by an exponential of a KL divergence
  • The KL divergence is then lower-bounded using Pinsker’s inequality
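The two displayed steps were images; the standard argument they presumably correspond to is:

$$\Pr\left(\hat{p}_n \ge p + \epsilon\right) \le \exp\left(-n\,\mathrm{KL}(p + \epsilon \,\|\, p)\right), \qquad \mathrm{KL}(q \,\|\, p) \ge 2(q - p)^2 \ \text{(Pinsker)},$$

which together give the exp(−2nε²) bound of the previous slide.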

SLIDE 31

Hoeffding’s Inequality

  • If the Xi have bounded range c
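The inequality itself was an image; for independent Xi with range c and empirical average X̄n, the standard statement is:

$$\Pr\left(\left|\bar{X}_n - \mathbf{E}\left[\bar{X}_n\right]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2n\epsilon^2}{c^2}\right).$$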
SLIDE 32

Hoeffding’s Inequality

  • Scaling behavior: the deviation shrinks as O(1/√n) and the bound pays only logarithmically in the confidence δ
  • This helps when we need to combine several tail bounds, since we only pay logarithmically in terms of their combination.

SLIDE 33

McDiarmid’s Inequality

  • Generalization of Hoeffding’s Inequality
  • Independent random variables Xi
  • Function f(X1, ..., Xn) whose value changes by a bounded amount when any single argument changes
  • Bounds the deviation of f from its expected value
  • When f is the average and the Xi have bounded range c, we recover Hoeffding’s Inequality
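The formulas were images; the standard statement, with bounded-difference constants ci (the C of the slide), is:

$$\left|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)\right| \le c_i \ \text{for all } i \ \Longrightarrow\ \Pr\left(\left|f(X) - \mathbf{E}[f(X)]\right| \ge \epsilon\right) \le 2\exp\left(-\frac{2\epsilon^2}{\sum_i c_i^2}\right).$$

For the average of variables with range c we have ci = c/n, so the exponent becomes −2nε²/c², recovering Hoeffding’s inequality.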

SLIDE 34

More tail bounds

  • Higher order moments
  • Bernstein inequality (needs variance bound)
  • Absolute / relative error bounds
  • Bounds for (weakly) dependent random variables

Bernstein inequality, for variables with range c:

$$\Pr\left(\left|\bar{X}_n - \mu\right| \ge \epsilon\right) \le 2\exp\left(-\frac{n^2\epsilon^2 / 2}{\sum_j \mathrm{Var}[X_j] + c\,n\epsilon / 3}\right)$$

SLIDE 35

Summary

  • Markov [X is non-negative]
  • Chebyshev [finite variance]
  • Hoeffding [bound on range]
  • Bernstein [bound on range + bound on second moment]

Tighter bounds require more assumptions.

SLIDE 36

Tools for the proof

SLIDE 37

Union Bound

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) \le P(A) + P(B)$$

In general:

$$P\Big(\bigcup_j A_j\Big) \le \sum_j P(A_j)$$
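A typical use, anticipating the classifier-selection question at the end of this deck (our reconstruction, not from the slides): if each of M error estimates individually satisfies a Hoeffding bound on N points, the union bound controls all of them simultaneously,

$$\Pr\left(\exists\, j \le M : \left|\widehat{\mathrm{err}}_j - \mathrm{err}_j\right| \ge \epsilon\right) \le \sum_{j=1}^M 2\exp\left(-2N\epsilon^2\right) = 2M\exp\left(-2N\epsilon^2\right).$$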

SLIDE 38

Fourier Transform

  • Fourier transform relations
  • Useful identities
  • Identity
  • Derivative
  • Convolution (also holds for inverse transform)
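The relations themselves were images; up to the normalization convention used in the deck, the standard identities are:

$$\hat{f}(\omega) = \int f(x)\,e^{-i\omega x}\,dx, \qquad \widehat{f'}(\omega) = i\omega\,\hat{f}(\omega), \qquad \widehat{f * g}(\omega) = \hat{f}(\omega)\,\hat{g}(\omega).$$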
SLIDE 39

The Characteristic Function Method

  • Characteristic function
  • For X and Y independent we have:
  • The distribution of the sum is the convolution of the distributions
  • The characteristic function of the sum is the product of the characteristic functions
  • Proof - plug in the definition of the Fourier transform
  • The characteristic function is unique
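In symbols (the slide’s formulas were images; these are the standard definitions):

$$\phi_X(\omega) = \mathbf{E}\left[e^{i\omega X}\right], \qquad p_{X+Y} = p_X * p_Y, \qquad \phi_{X+Y}(\omega) = \phi_X(\omega)\,\phi_Y(\omega) \quad \text{(X, Y independent)}.$$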
SLIDE 40

Proof - Weak law of large numbers

  • Require that the expectation exists
  • Taylor expansion of the exponential (need to assume that we can bound the tail)
  • Average of random variables: a convolution of distributions, hence a product of characteristic functions
  • Higher order terms vanish in the limit, leaving the constant distribution concentrated at the mean
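Sketching the computation the slide displays (our reconstruction): for IID xi with mean μ,

$$\phi_x(\omega) = 1 + i\omega\mu + o(\omega), \qquad \phi_{\bar{x}_n}(\omega) = \left[\phi_x\!\left(\tfrac{\omega}{n}\right)\right]^n = \left(1 + \tfrac{i\omega\mu}{n} + o\!\left(\tfrac{\omega}{n}\right)\right)^{n} \xrightarrow{n\to\infty} e^{i\omega\mu},$$

which is the characteristic function of the constant μ.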

SLIDE 41

Warning

  • Moments may not always exist
  • Cauchy distribution
  • For the mean to exist the following integral would have to converge
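The density and integral were images; the standard Cauchy example is:

$$p(x) = \frac{1}{\pi\left(1 + x^2\right)}, \qquad \int_{-\infty}^{\infty} \frac{|x|}{\pi\left(1 + x^2\right)}\,dx = \infty,$$

so the mean does not exist and the law of large numbers does not apply.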

SLIDE 42

Proof - Central limit theorem

  • Require that the second order moment exists (we assume they’re all identical WLOG)
  • Characteristic function
  • Subtract out the mean (centering)
  • The limit is the FT of a Normal Distribution
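Sketching the computation (our reconstruction): after centering, the second-order Taylor expansion gives

$$\phi_{x-\mu}(\omega) = 1 - \frac{\sigma^2\omega^2}{2} + o(\omega^2), \qquad \left[\phi_{x-\mu}\!\left(\tfrac{\omega}{\sqrt{n}}\right)\right]^n = \left(1 - \frac{\sigma^2\omega^2}{2n} + o\!\left(\tfrac{\omega^2}{n}\right)\right)^{n} \xrightarrow{n\to\infty} e^{-\sigma^2\omega^2 / 2},$$

the Fourier transform of a N(0, σ²) density.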
SLIDE 43

Conclusion & what’s next?

We looked at basic building blocks of learning theory

  • Convergence of empirical averages
  • Tail bounds
  • Union bound
SLIDE 44

Conclusion & what’s next?

Evaluate classifier C on N data points and estimate its accuracy. Can we upper-bound the estimation error?

Yes, Chernoff bound / Hoeffding’s inequality

SLIDE 46

Conclusion & what’s next?

Evaluate a set of classifiers on N data points and pick the one with the best accuracy. Can we upper-bound the estimation error?

Yes, Chernoff bound / Hoeffding’s inequality + union bound
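Concretely (our reconstruction using the tools above): for M classifiers with 0/1 loss, combining Hoeffding’s inequality with the union bound and setting 2M exp(−2Nε²) = δ gives, with probability at least 1 − δ,

$$\max_{j \le M}\left|\widehat{\mathrm{err}}_j - \mathrm{err}_j\right| \le \sqrt{\frac{\log(2M/\delta)}{2N}}.$$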

SLIDE 48

Conclusion & what’s next?

What if the set of classifiers is infinite?