Lecture 10: Review. S. Cheng (OU-Tulsa), October 17, 2017. PowerPoint presentation, 28 slides.


slide-1
SLIDE 1

Lecture 10

Review

  • S. Cheng (OU-Tulsa)

October 17, 2017 1 / 28

slide-2
SLIDE 2

Lecture 10

Review

  • Conditioning reduces entropy: H(X|Y) ≤ H(X)
  • Chain rules:

H(X, Y, Z) = H(X) + H(Y|X) + H(Z|X, Y)
H(X, Y, U|V) = H(X|V) + H(Y|X, V) + H(U|Y, X, V)
I(X, Y, Z; U) = I(X; U) + I(Y; U|X) + I(Z; U|X, Y)
I(X, Y, Z; U|V) = I(X; U|V) + I(Y; U|V, X) + I(Z; U|V, X, Y)

  • Data processing inequality: if X⊥Z|Y (i.e., X → Y → Z form a Markov chain), then I(X; Y) ≥ I(X; Z)
  • Independence and mutual information:

X⊥Y ⇔ I(X; Y) = 0
X⊥Y|Z ⇔ I(X; Y|Z) = 0

  • KL-divergence: KL(p||q) = Σ_x p(x) log [p(x)/q(x)]

  • S. Cheng (OU-Tulsa)

October 17, 2017 2 / 28
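
The identities above are easy to check numerically. Below is a small Python sketch (the joint pmf is made up purely for illustration) that computes H(X), H(X|Y), I(X;Y) and KL(p||q) for discrete distributions, and confirms that conditioning reduces entropy and that I(X;Y) equals the KL divergence between the joint and the product of the marginals:

```python
import numpy as np

def entropy(p):
    """H(p) in bits; ignores zero-probability entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def kl(p, q):
    """KL(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# A made-up joint pmf p(x, y) on a 2x3 alphabet (rows: x, columns: y)
pxy = np.array([[0.10, 0.20, 0.10],
                [0.25, 0.05, 0.30]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_X = entropy(px)
H_X_given_Y = entropy(pxy.ravel()) - entropy(py)   # chain rule: H(X,Y) = H(Y) + H(X|Y)
I_XY = H_X - H_X_given_Y                           # mutual information

print(f"H(X) = {H_X:.3f},  H(X|Y) = {H_X_given_Y:.3f}  (conditioning reduces entropy)")
print(f"I(X;Y) = {I_XY:.3f}")
# I(X;Y) is also the KL divergence between the joint and the product of marginals
print(f"KL(p(x,y) || p(x)p(y)) = {kl(pxy.ravel(), np.outer(px, py).ravel()):.3f}")
```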

slide-12
SLIDE 12

Lecture 10 Overview

This time

  • Identification/Decision trees
  • Random forests
  • Law of Large Numbers
  • Asymptotic equipartition (AEP) and typical sequences

  • S. Cheng (OU-Tulsa)

October 17, 2017 3 / 28

slide-13
SLIDE 13

Lecture 10 Identification/Decision tree

Vampire database

(https://www.youtube.com/watch?v=SXBG3RGr Rc)

  • S. Cheng (OU-Tulsa)

October 17, 2017 4 / 28

slide-14
SLIDE 14

Lecture 10 Identification/Decision tree

Identifying vampire

Goal: Design a set of tests to identify vampires

Potential difficulties:
  • Non-numerical data
  • Some information may not matter
  • Some may matter only sometimes
  • Tests may be costly ⇒ conduct as few as possible

  • S. Cheng (OU-Tulsa)

October 17, 2017 5 / 28

slide-15
SLIDE 15

Lecture 10 Identification/Decision tree

Test trees

[Tree diagrams of the four candidate tests applied to the database: Shadow (branches ?, Y, N), Garlic (Y, N), Complexion (A, P, R) and Accent (N, H, O), with the individuals reaching each leaf marked + : Vampire, − : Not vampire]

How to pick a good test? Pick the test that identifies the most vampires (and non-vampires)!

  • S. Cheng (OU-Tulsa)

October 17, 2017 6 / 28

slide-18
SLIDE 18

Lecture 10 Identification/Decision tree

Sizes of homogeneous sets

[Same four test trees as before: Shadow (?, Y, N), Garlic (Y, N), Complexion (A, P, R), Accent (N, H, O); + : Vampire, − : Not vampire]

Number of individuals each test places into homogeneous (single-class) sets:
  • Shadow: 4
  • Garlic: 3
  • Complexion: 2
  • Accent: 0

  • S. Cheng (OU-Tulsa)

October 17, 2017 7 / 28

slide-20
SLIDE 20

Lecture 10 Identification/Decision tree

Picking second test

Let's say we pick “shadow” as the first test after all. Then, for the remaining unclassified individuals (those in the “?” branch):

[Tree diagrams of the remaining tests applied to those individuals: Garlic (Y, N), Complexion (A, P, R), Accent (N, H, O)]

Sizes of homogeneous sets:
  • Garlic: 4
  • Complexion: 2
  • Accent: 0

  • S. Cheng (OU-Tulsa)

October 17, 2017 8 / 28

slide-21
SLIDE 21

Lecture 10 Identification/Decision tree

Combined tests

[Combined tree: the Shadow test is applied first; its Y and N branches are already homogeneous (not vampire / vampire), and its "?" branch is passed to the Garlic test, whose Y and N branches are likewise homogeneous (not vampire / vampire)]

Problem
When our database grows, no single test is likely to completely separate vampires from non-vampires, so every test will then score 0 under the counting criterion above.

Entropy comes to the rescue!

  • S. Cheng (OU-Tulsa)

October 17, 2017 9 / 28

slide-24
SLIDE 24

Lecture 10 Identification/Decision tree

Conditional entropy as a measure of test efficiency

Consider the database as randomly sampled from a distribution. A set that is
  • very homogeneous ≈ high certainty
  • not so homogeneous ≈ high randomness
and this can be measured with entropy.

[Shadow test tree: 8 individuals; the "?" branch holds 4 of them (2 vampires, 2 non-vampires), while the Y branch (3 individuals) and the N branch (1 individual) are homogeneous]

H(V|S = ?) = 1, H(V|S = Y) = 0, H(V|S = N) = 0

Remaining uncertainty given the test:

(4/8) H(V|S = ?) + (3/8) H(V|S = Y) + (1/8) H(V|S = N) = 0.5
= Pr(S = ?) H(V|S = ?) + Pr(S = Y) H(V|S = Y) + Pr(S = N) H(V|S = N)
= H(V|S)

[Plot: binary entropy as a function of Pr(Head)]

  • S. Cheng (OU-Tulsa)

October 17, 2017 10 / 28
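
The weighted average above is easy to reproduce. A minimal sketch, assuming the split implied by the slide (the "?" branch holds 4 of the 8 individuals, 2 vampires and 2 non-vampires, and the Y and N branches are homogeneous):

```python
from math import log2

def binary_entropy(p):
    """Entropy of a Bernoulli(p) variable in bits (the curve in the plot above)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Shadow test: the "?" branch holds 4 people (2 vampires, 2 not); Y and N are homogeneous
H_given_branch = {"?": binary_entropy(2 / 4), "Y": 0.0, "N": 0.0}
weight = {"?": 4 / 8, "Y": 3 / 8, "N": 1 / 8}

H_V_given_S = sum(weight[b] * H_given_branch[b] for b in weight)
print(H_V_given_S)   # 0.5, matching H(V|S) on the slide
```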

slide-30
SLIDE 30

Lecture 10 Identification/Decision tree

Remaining uncertainty

[Trees for the remaining tests, annotated with the entropy of each branch: Garlic: H(V|G = Y) = 0, H(V|G = N) = 0.97; Complexion: branch entropies 0.92, 0, 0.92; Accent: branch entropies 0.92, 0.92, 1]

H(V|S) = 0.5
H(V|G) = (3/8) · 0 + (5/8) · 0.97 = 0.61
H(V|C) = (3/8) · 0.92 + (2/8) · 0 + (3/8) · 0.92 = 0.69
H(V|A) = (3/8) · 0.92 + (3/8) · 0.92 + (2/8) · 1 = 0.94

Order of tests to pick: S ≻ G ≻ C ≻ A

  • S. Cheng (OU-Tulsa)

October 17, 2017 11 / 28
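
The same bookkeeping extends to all four tests. The per-branch (vampire, non-vampire) counts below are not given explicitly in the slides; they are an assumed reconstruction chosen only to reproduce the quoted branch entropies, so treat the table as illustrative:

```python
from math import log2

def entropy2(k, n):
    """Binary entropy (bits) of a branch with k vampires out of n individuals."""
    if n == 0 or k in (0, n):
        return 0.0
    p = k / n
    return -p * log2(p) - (1 - p) * log2(1 - p)

# (vampires, non-vampires) per branch; counts are assumed, chosen to match the slide's numbers
tests = {
    "Shadow":     {"?": (2, 2), "Y": (3, 0), "N": (0, 1)},
    "Garlic":     {"Y": (3, 0), "N": (2, 3)},
    "Complexion": {"A": (2, 1), "P": (2, 0), "R": (1, 2)},
    "Accent":     {"N": (2, 1), "H": (2, 1), "O": (1, 1)},
}

total = 8  # size of the database
remaining = {}
for name, branches in tests.items():
    H = 0.0
    for vamp, non in branches.values():
        m = vamp + non
        H += (m / total) * entropy2(vamp, m)   # Pr(branch) * H(V | branch)
    remaining[name] = H

for name, H in sorted(remaining.items(), key=lambda kv: kv[1]):
    print(f"H(V|{name[0]}) = {H:.2f}")
# Prints roughly 0.50, 0.61, 0.69, 0.94, i.e. the order S > G > C > A
```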

slide-35
SLIDE 35

Lecture 10 Identification/Decision tree

Potential extensions

  • The test does not need to return a discrete result. Let X be the test outcome; it can be continuous as well
  • We should simply pick the test i such that H(V|X_i) is as small as possible
  • Equivalently, I(V; X_i) = H(V) − H(V|X_i) should be as large as possible. This is intuitive because we want to pick the attribute that is most relevant to V (shares the most information with it)
  • Build a number of trees instead of a single tree ⇒ random forests

  • S. Cheng (OU-Tulsa)

October 17, 2017 12 / 28
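
To see the equivalence numerically, the H(V|X_i) values above can be reused. Assuming the database splits 5 vampires to 3 non-vampires (an assumption consistent with the branch entropies quoted earlier), H(V) ≈ 0.95 bits and the information gains I(V;X_i) preserve the same ordering:

```python
from math import log2

# Assumed prior: 5 vampires out of 8 (consistent with the branch entropies above)
H_V = -(5/8) * log2(5/8) - (3/8) * log2(3/8)              # ≈ 0.95 bits
remaining = {"S": 0.50, "G": 0.61, "C": 0.69, "A": 0.94}  # H(V|X_i) from the previous slide

for name, H_cond in remaining.items():
    print(f"I(V;{name}) = {H_V - H_cond:.2f}")
# Largest mutual information corresponds to smallest remaining uncertainty: same ranking S > G > C > A
```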

slide-39
SLIDE 39

Lecture 10 Identification/Decision tree

Random forests

  • Pick random subsets of the training samples
  • Train a tree on each random subset, but limit each tree to a subset of the features/attributes
  • Given a test sample:
      • Classify the sample using each of the trees
      • Make the final decision based on a majority vote

  • S. Cheng (OU-Tulsa)

October 17, 2017 13 / 28
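
As a concrete, library-based illustration (not part of the lecture itself), scikit-learn's RandomForestClassifier follows this recipe: bootstrap resampling of the training data, a random feature subset at each split, entropy-based splitting if requested, and aggregation of the trees at prediction time. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vampire database (binary labels, a handful of attributes)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,       # number of trees
    criterion="entropy",    # split by information gain, as in the slides
    max_features="sqrt",    # each split only sees a random subset of the attributes
    bootstrap=True,         # each tree trains on a random resample of the data
    random_state=0,
)
forest.fit(X_train, y_train)

# score() aggregates the trees' predictions (scikit-learn averages their class probabilities)
print("test accuracy:", forest.score(X_test, y_test))
```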

slide-40
SLIDE 40

Lecture 10 Law of Large Numbers

Law of Large Numbers (LLN)

If we randomly sample x_1, x_2, · · · , x_N from an i.i.d. (independent and identically distributed) source, the average of f(x_i) approaches the expected value as N → ∞. That is,

(1/N) Σ_{i=1}^N f(x_i) → E[f(X)] as N → ∞

Example
This is precisely how polling is supposed to work! A pollster randomly draws a sample from a portion of the population but expects the prediction to match the outcome.

  • S. Cheng (OU-Tulsa)

October 17, 2017 14 / 28
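
A quick simulation illustrates the statement; the source and the function f are made up (X ~ Uniform(0,1), f(x) = x²), but any i.i.d. source with finite variance behaves the same way:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2                  # any fixed function of the sample
true_mean = 1.0 / 3.0                 # E[X^2] for X ~ Uniform(0, 1)

for N in (10, 100, 10_000, 1_000_000):
    x = rng.uniform(0.0, 1.0, size=N)          # i.i.d. samples
    print(N, abs(f(x).mean() - true_mean))     # gap between sample average and E[f(X)]
# The gap shrinks (typically like 1/sqrt(N)) as the LLN predicts
```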

slide-42
SLIDE 42

Lecture 10 Law of Large Numbers

Proof of LLN

The LLN is a rather strong result. We will only show a weak version here:

Pr( |(1/N) Σ_{i=1}^N f(X_i) − E[f(X)]| ≥ a ) ≤ Var(f(X)) / (N a²) ∝ 1/N

Markov's Inequality
Pr(X ≥ b) ≤ E[X]/b if X ≥ 0

Proof: X = I(X ≥ b) · X + I(X < b) · X ≥ I(X ≥ b) · b ⇒ E[X] ≥ Pr(X ≥ b) · b

  • S. Cheng (OU-Tulsa)

October 17, 2017 15 / 28
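
A quick numerical sanity check of Markov's inequality, using an exponential random variable as an arbitrary non-negative example:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative samples, E[X] = 1
b = 3.0
print("Pr(X >= b) ≈", (x >= b).mean(), "  bound E[X]/b =", x.mean() / b)
# Empirically about 0.05, comfortably below the Markov bound of about 0.33
```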

slide-46
SLIDE 46

Lecture 10 Law of Large Numbers

Proof of LLN

Markov's Inequality
Pr(X ≥ b) ≤ E[X]/b if X ≥ 0

Chebyshev's Inequality
Pr(|Y − E[Y]| ≥ a) ≤ Var(Y)/a²

Proof: Take X = |Y − E[Y]|² and b = a². By Markov's Inequality,

Pr(|Y − E[Y]| ≥ a) = Pr(|Y − E[Y]|² ≥ a²) ≤ E[|Y − E[Y]|²]/a² = Var(Y)/a²

  • S. Cheng (OU-Tulsa)

October 17, 2017 16 / 28

slide-50
SLIDE 50

Lecture 10 Law of Large Numbers

Proof of LLN

Chebyshev's Inequality
Pr(|Y − E[Y]| ≥ a) ≤ Var(Y)/a²

Proof of weak LLN
Let Z = (1/N) Σ_{i=1}^N f(X_i). Clearly E[Z] = E[f(X)], and

Var(Z) = (1/N²) Σ_{i=1}^N Var(f(X)) = Var(f(X))/N

By Chebyshev's Inequality,

Pr( |(1/N) Σ_{i=1}^N f(X_i) − E[f(X)]| ≥ a ) = Pr(|Z − E[Z]| ≥ a) ≤ Var(Z)/a² = Var(f(X))/(N a²)

  • S. Cheng (OU-Tulsa)

October 17, 2017 17 / 28
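
The final bound can also be checked empirically: estimate Pr(|Z − E[Z]| ≥ a) by forming many independent N-sample averages and compare against Var(f(X))/(N a²). The source below (Uniform(0,1) with f(x) = x) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, a, trials = 200, 0.05, 20_000
mean, var = 0.5, 1.0 / 12.0            # E[X] and Var(X) for X ~ Uniform(0, 1), f(x) = x

Z = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)   # many independent N-sample averages
empirical = np.mean(np.abs(Z - mean) >= a)
bound = var / (N * a ** 2)

print(f"empirical Pr = {empirical:.4f}   Chebyshev bound = {bound:.4f}")
# The empirical probability sits below the (often loose) Chebyshev bound
```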

slide-53
SLIDE 53

Lecture 10 Asymptotic equipartition

Main idea

Consider a sequence of symbols x_1, x_2, · · · , x_N sampled from a DMS, and consider the sample average of the log-probabilities of the sampled symbols:

(1/N) Σ_{i=1}^N log(1/p(x_i)) → E[log(1/p(X))] = H(X)

by LLN. But for the LHS,

(1/N) Σ_{i=1}^N log(1/p(x_i)) = (1/N) log(1/Π_{i=1}^N p(x_i)) = −(1/N) log p(x^N), where x^N = x_1, x_2, · · · , x_N

Rearranging the terms, this implies that for any sequence sampled from the source, the probability of the sampled sequence satisfies p(x^N) → 2^{−N H(X)}!

  • S. Cheng (OU-Tulsa)

October 17, 2017 18 / 28
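
A numerical illustration with an assumed Bernoulli(0.3) source: the per-symbol quantity −(1/N) log₂ p(x^N) concentrates around H(X) ≈ 0.881 bits as N grows:

```python
import numpy as np

p = 0.3                                          # Pr(symbol == 1) for the DMS
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # ≈ 0.881 bits

rng = np.random.default_rng(3)
for N in (10, 100, 10_000):
    x = rng.random(N) < p                        # one sampled sequence x^N
    k = x.sum()                                  # number of ones
    # log2 p(x^N) = k*log2(p) + (N-k)*log2(1-p) for an i.i.d. source
    per_symbol = -(k * np.log2(p) + (N - k) * np.log2(1 - p)) / N
    print(N, per_symbol, "vs H(X) =", round(H, 3))
# -(1/N) log2 p(x^N) -> H(X), i.e. p(x^N) ≈ 2^(-N H(X)) for the sequences you actually see
```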

slide-57
SLIDE 57

Lecture 10 Asymptotic equipartition

Set of typical sequences

Let's call a sequence x^N with p(x^N) ∼ 2^{−N H(X)} typical, and define the set of typical sequences

A^N_ε(X) = {x^N | 2^{−N(H(X)+ε)} ≤ p(x^N) ≤ 2^{−N(H(X)−ε)}}

For any ε > 0, we can find a sufficiently large N such that a sampled sequence from the source is typical with probability arbitrarily close to 1

Since all typical sequences have probability ∼ 2^{−N H(X)} and they fill up essentially the entire probability space, there should be approximately 1 / 2^{−N H(X)} = 2^{N H(X)} typical sequences

  • S. Cheng (OU-Tulsa)

October 17, 2017 19 / 28
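
For a Bernoulli source the typical set can be enumerated exactly, because p(x^N) depends only on the number of ones in the sequence. A sketch with assumed parameters (p = 0.3, N = 100, ε = 0.1):

```python
from math import comb, log2

p, N, eps = 0.3, 100, 0.1                 # assumed source, block length and tolerance
H = -p * log2(p) - (1 - p) * log2(1 - p)  # ≈ 0.881 bits/symbol

size, prob = 0, 0.0
for k in range(N + 1):                    # k = number of ones in the sequence
    logp = k * log2(p) + (N - k) * log2(1 - p)    # log2 p(x^N) for any sequence with k ones
    if -N * (H + eps) <= logp <= -N * (H - eps):  # the typicality condition
        size += comb(N, k)
        prob += comb(N, k) * 2.0 ** logp

print(f"|A_eps| ≈ {size:.3g}   bounds: {2**(N*(H-eps)):.3g} ... {2**(N*(H+eps)):.3g}")
print(f"Pr(X^N typical) = {prob:.3f}")    # already above 0.9 for N = 100
```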

slide-60
SLIDE 60

Lecture 10 Asymptotic equipartition

Precise bounds on the size of the typical set

(1 − δ) 2^{N(H(X)−ε)} ≤ |A^N_ε(X)| ≤ 2^{N(H(X)+ε)}

Upper bound:

1 ≥ Pr(X^N ∈ A^N_ε(X)) = Σ_{x^N ∈ A^N_ε(X)} p(x^N) ≥ Σ_{x^N ∈ A^N_ε(X)} 2^{−N(H(X)+ε)} = |A^N_ε(X)| 2^{−N(H(X)+ε)}

Lower bound: for a sufficiently large N, we have

1 − δ ≤ Pr(X^N ∈ A^N_ε(X)) = Σ_{x^N ∈ A^N_ε(X)} p(x^N) ≤ Σ_{x^N ∈ A^N_ε(X)} 2^{−N(H(X)−ε)} = |A^N_ε(X)| 2^{−N(H(X)−ε)}

Rearranging the two chains gives the upper and lower bounds on |A^N_ε(X)| stated above

  • S. Cheng (OU-Tulsa)

October 17, 2017 20 / 28

slide-66
SLIDE 66

Lecture 10 Asymptotic equipartition

AEP

[Diagram of the probability space, labeled: "Set of typical sequences", "Sequences are equally probable", "Sequences that won't happen"]

Asymptotic equipartition refers to the fact that the probability space is equally partitioned by the typical sequences

  • S. Cheng (OU-Tulsa)

October 17, 2017 21 / 28

slide-67
SLIDE 67

Lecture 10 Asymptotic equipartition

AEP and compression limit

  • Consider coin flipping again; let's say Pr(Head) = 0.3 and N = 1000
  • The typical sequences will be those with approximately 300 heads and 700 tails
  • AEP (LLN) tells us that it is almost impossible to get, say, a sequence of 100 heads and 900 tails
  • AEP also tells us that the number of typical sequences is approximately 2^{N H(X)}
  • Therefore, we can simply assign an index to each typical sequence and ignore the rest. Then we only need log 2^{N H(X)} = N H(X) bits to store a sequence of N symbols, i.e., on average H(X) bits per symbol, as before!

  • S. Cheng (OU-Tulsa)

October 17, 2017 22 / 28
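
Putting numbers to this example: with Pr(Head) = 0.3, H(X) ≈ 0.881 bits, so indexing the typical set takes about 881 bits for N = 1000 flips instead of the 1000 bits of the raw sequence. A short check (the count of sequences with exactly 300 heads gives a comparable figure):

```python
from math import comb, log2

p, N = 0.3, 1000
H = -p * log2(p) - (1 - p) * log2(1 - p)

print(f"H(X) = {H:.3f} bits/symbol")
print(f"bits to index the typical set ≈ N*H(X) = {N * H:.0f} (vs {N} raw bits)")
# Sanity check: even just the sequences with exactly 300 heads need about this many bits
print(f"log2 C(1000, 300) = {log2(comb(1000, 300)):.0f} bits")
```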