Learning Additive Noise Channels: Generalization Bounds and Algorithms

Nir Weinberger

Massachusetts Institute of Technology, MA, USA

IEEE International Symposium on Information Theory, June 2020

In a nutshell

An additive noise channel: Y (output) = X (input) + Z (noise), with Z ⊥⊥ X ∈ R^d.

Z ∼ µ, but µ is unknown and non-parametric.

Can we learn to communicate efficiently from (Z_1, ..., Z_n) i.i.d. ∼ µ?

Generalization bounds for:

1. Learning under the error probability loss.
   • Applies to empirical risk minimization (ERM).
2. Learning under a surrogate to the error probability loss.
   • New alternating optimization algorithm.
3. A "codeword-expurgating" Gibbs learning algorithm.

Caveat: a distilled learning-theoretic framework.

Motivation

Why? Justification of learning-based methods:

1. The success of deep neural networks (DNNs) [OH17; Gru+17].
2. Avoid channel modeling [Wan+17; OH17; FG17; Shl+19]:
   • Interference, jamming signals, non-linearities [Sch08], finite-resolution quantization.
   • High-dimensional parameters, e.g., massive MIMO.
3. Existing theory on learning-based quantizer design [LLZ94; LLZ97; BLL98; Lin02].
4. Exploit efficient optimization methods, e.g., for the design of low-latency codes [Kim+18; Jia+19].

Outline

1. Learning to Minimize Error Probability
2. Learning to Minimize a Surrogate to the Error Probability
3. Learning by Codebook Expurgation

Model

Channel: Y (output) = X (input) + Z (noise), with Z ⊥⊥ X ∈ R^d.

Encoder: a codebook C = {x_j}_{j∈[m]} ⊂ 𝒞 ⊆ (R^d)^m.

Decoder: the minimal (Mahalanobis) distance decoder ĵ(y) ∈ argmin_{j∈[m]} ‖x_j − y‖_S, w.r.t. an inverse covariance matrix S ∈ 𝒮 ⊆ S^d_+.
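The decoding rule can be made concrete with a short NumPy sketch (illustrative only; the codebook, S, and noise level below are placeholder values, not from the paper):

import numpy as np

def mahalanobis_decode(y, codebook, S):
    # Return the index j minimizing ||x_j - y||_S^2 = (x_j - y)^T S (x_j - y).
    diffs = codebook - y                                   # (m, d)
    dists = np.einsum('md,de,me->m', diffs, S, diffs)      # squared S-distances
    return int(np.argmin(dists))

# Illustrative usage with placeholder values.
rng = np.random.default_rng(0)
d, m = 2, 4
codebook = rng.normal(size=(m, d))
S = np.eye(d)                                              # Euclidean decoding as a special case
y = codebook[1] + 0.1 * rng.normal(size=d)                 # codeword 1 plus small noise
print(mahalanobis_decode(y, codebook, S))                  # typically prints 1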

Expected and empirical error probability

Expected average error probability:
p_µ(C,S) := (1/m) Σ_{j=1}^{m} p_µ(C,S | j), with p_µ(C,S | j) := E_µ[ 𝟙{ min_{j′∈[m], j′≠j} ‖x_j + Z − x_{j′}‖_S < ‖Z‖_S } ].

Ultimate goal: find argmin_{C,S} p_µ(C,S).

Empirical average error probability: replace E_µ[ℓ(Z)] → E_Z[ℓ(Z)] := (1/n) Σ_{i=1}^{n} ℓ(Z_i), so that
p_Z(C,S) := (1/m) Σ_{j=1}^{m} E_Z[ 𝟙{ min_{j′∈[m]\{j}} ‖x_j + Z − x_{j′}‖_S < ‖Z‖_S } ].
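As an illustration, the empirical average error probability p_Z(C,S) can be estimated directly from noise samples; a minimal NumPy sketch (an assumed helper, not the paper's code):

import numpy as np

def empirical_error_prob(codebook, S, Z):
    # codebook: (m, d) codewords, S: (d, d) PSD matrix, Z: (n, d) noise samples.
    # For message j, an error occurs on sample z if some other codeword is closer
    # (in the S-metric) to x_j + z than x_j is: min_{j' != j} ||x_j + z - x_j'||_S < ||z||_S.
    m = codebook.shape[0]
    err = 0.0
    z_norm = np.einsum('nd,de,ne->n', Z, S, Z)                        # ||z||_S^2
    for j in range(m):
        diffs = (codebook[j] + Z)[:, None, :] - codebook[None, :, :]  # (n, m, d)
        dists = np.einsum('nmd,de,nme->nm', diffs, S, diffs)          # ||x_j + z - x_j'||_S^2
        dists[:, j] = np.inf                                          # exclude the true codeword
        err += np.mean(dists.min(axis=1) < z_norm)
    return err / m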

Uniform error bound and ERM

Theorem. Assume that n ≥ d+1. With probability at least 1−δ,
sup_{C⊂(R^d)^m, S∈S^d_+} |p_µ(C,S) − p_Z(C,S)| ≤ 4m ( √( 2(d+1) log( en/(d+1) ) / n ) + √( 2 log(2/δ) / n ) ).

Holds for the output (C_Z, S_Z) of any learning algorithm.

Specifically, for ERM, (C_Z, S_Z)_ERM ∈ argmin_{C,S} p_Z(C,S), n = Õ( (m²d + log(1/δ)) / ε² ) samples guarantee p_µ((C_Z, S_Z)_ERM) ≤ inf_{C,S} p_µ(C,S) + ε.

Uniform error bound and ERM – cont.

Open questions:
• The term Õ( log(1/δ) / ε² ) can be shown to be minimax tight.
• Is the dependence Õ( m²d / ε² ) optimal?
• Typically m = 2^{dR} ≫ 1, but the codebook has structure.
• What is the right complexity measure for learnability?

Outline

1. Learning to Minimize Error Probability
2. Learning to Minimize a Surrogate to the Error Probability
3. Learning by Codebook Expurgation

A Surrogate to the Error Probability

p_Z(C,S) is difficult to optimize – it is discontinuous in (C,S).

A hinge-loss upper bound: 𝟙(t < 0) ≤ (1 − t) ∨ 0.

Bound the error-event indicator:
𝟙{ min_{j′∈[m]\{j}} [ ‖x_j − x_{j′}‖²_S − 2(x_j − x_{j′})ᵀ S z ] < 0 } ≤ ( 1 − min_{j′∈[m]\{j}} [ ‖x_j − x_{j′}‖²_S − 2(x_j − x_{j′})ᵀ S z ] ) ∨ 0.

The corresponding expected and empirical losses satisfy p̄_µ(C,S) ≥ p_µ(C,S) and p̄_Z(C,S) ≥ p_Z(C,S).
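A minimal sketch of the corresponding empirical surrogate loss p̄_Z(C,S), using the hinge bound with the sign convention written above (an illustrative helper, not the paper's implementation):

import numpy as np

def empirical_surrogate_loss(codebook, S, Z):
    # codebook: (m, d), S: (d, d), Z: (n, d).
    # Averages the hinge bound (1 - min_{j'} [||x_j - x_j'||_S^2 - 2 (x_j - x_j')^T S z]) v 0
    # over messages j and noise samples z.
    m = codebook.shape[0]
    total = 0.0
    for j in range(m):
        others = np.delete(np.arange(m), j)
        dx = codebook[j] - codebook[others]             # (m-1, d)
        quad = np.einsum('kd,de,ke->k', dx, S, dx)      # ||x_j - x_j'||_S^2
        cross = 2.0 * Z @ S @ dx.T                      # (n, m-1): 2 (x_j - x_j')^T S z
        margin = quad[None, :] - cross
        total += np.mean(np.maximum(1.0 - margin.min(axis=1), 0.0))
    return total / m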

Uniform error bound

Theorem. Under some boundedness assumptions on 𝒞, 𝒮, and the support of µ, with probability at least 1−δ,
sup_{C⊂𝒞, S∈𝒮} |p̄_µ(C,S) − p̄_Z(C,S)| ≲ √( (d+1) m log(m) / n ) + √( log(1/δ) / n ).

Sample complexity is n = Õ( (dm + log(1/δ)) / ε² ).

Improvement from a quadratic to a linear dependence on m.

Boundedness assumptions are common for analytical simplicity.

Alternating optimization algorithm

Problem: p̄_Z(C,S) is non-convex in (C,S).

Idea: add auxiliary variables.

Let {α_{j,j′}}_{j′∈[m]\{j}} be such that α_{j,j′} ∈ [0,1], α_{j,j} = 0, and Σ_{j′∈[m]\{j}} α_{j,j′} ≤ 1.

For a given z ∈ R^d and j ∈ [m],
( 1 − min_{j′∈[m]\{j}} [ ‖x_j − x_{j′}‖²_S − 2(x_j − x_{j′})ᵀ S z ] ) ∨ 0 = max_{{α_{j,j′}}_{j′∈[m]\{j}}} Σ_{j′∈[m]\{j}} α_{j,j′} ( 1 − [ ‖x_j − x_{j′}‖²_S − 2(x_j − x_{j′})ᵀ S z ] ).

Given (C,S), the variables {α_{j,j′}}_{j′∈[m]} describe the nearest neighbors of x_j w.r.t. z (the identity is checked numerically in the sketch below).
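The max-over-α identity can be verified numerically: for any margins v_{j′} (standing in for ‖x_j − x_{j′}‖²_S − 2(x_j − x_{j′})ᵀ S z), the optimal α places its whole (at most unit) mass on the smallest margin whenever the hinge is active. A small sketch with placeholder values:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
v = rng.normal(size=5)                        # placeholder margins for the m-1 competitors

lhs = max(1.0 - v.min(), 0.0)                 # (1 - min_j' v_j') v 0

# Maximize sum_j' alpha_j' (1 - v_j') over alpha_j' in [0, 1] with sum_j' alpha_j' <= 1.
res = linprog(c=-(1.0 - v), A_ub=np.ones((1, v.size)), b_ub=[1.0], bounds=(0.0, 1.0))
rhs = -res.fun

print(np.isclose(lhs, rhs))                   # True: both sides of the identity agree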

Alternating optimization algorithm – cont.

For Z, extend the loss function: p̄_Z(C,S) = max_A p̄_Z(C,S,A), where A := {α^{(i)}_{j,j′}}_{j∈[m], j′∈[m]\{j}, i∈[n]}.

Algorithm: fix (C^{(0)}, S^{(0)}), and for t ≥ 1 alternate between:

1. Optimization over A: A^{(t)} ∈ argmax_A p̄_Z(C^{(t−1)}, S^{(t−1)}, A).
2. Optimization over (C,S): (C^{(t)}, S^{(t)}) ∈ argmin_{C,S} p̄_Z(C, S, A^{(t)}).

A stochastic gradient descent version of this idea appears in the paper (the two alternating steps are sketched below).

Open problems:
• Convergence analysis with a finite number of iterations and samples.
• Tightened generalization bounds based on the algorithm.
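A minimal sketch of the two alternating steps for the surrogate loss, with two simplifications that are assumptions of this sketch rather than choices of the paper: S is fixed to the identity, and the (C,S)-step is replaced by a single projected gradient step on the codebook under a unit power constraint (the boundedness assumption):

import numpy as np

def alternating_step(C, Z, step=0.05, power=1.0):
    # C: (m, d) codebook, Z: (n, d) noise samples; S is fixed to I_d in this sketch.
    m, n = C.shape[0], Z.shape[0]
    # Step 1: optimization over A (closed form). For each (j, i), put unit mass on the
    # competitor with the smallest margin whenever the hinge is active, otherwise alpha = 0.
    active = []
    for j in range(m):
        dx = C[j] - C                                    # (m, d)
        quad = np.sum(dx * dx, axis=1)                   # ||x_j - x_j'||^2
        v = quad[None, :] - 2.0 * Z @ dx.T               # (n, m) margins
        v[:, j] = np.inf
        nearest = v.argmin(axis=1)
        for i in range(n):
            if v[i, nearest[i]] < 1.0:                   # hinge active
                active.append((i, j, nearest[i]))
    # Step 2: one gradient step on the codebook for the fixed-A objective, i.e. the sum over
    # active (i, j, j') of (1 - ||x_j - x_j'||^2 + 2 (x_j - x_j')^T z_i).
    grad = np.zeros_like(C)
    for i, j, jp in active:
        dx = C[j] - C[jp]
        grad[j] += -2.0 * dx + 2.0 * Z[i]
        grad[jp] += 2.0 * dx - 2.0 * Z[i]
    C_new = C - step * grad / (m * n)
    # Project codewords back onto a power-constrained ball (boundedness assumption).
    norms = np.maximum(np.linalg.norm(C_new, axis=1, keepdims=True), 1e-12)
    return C_new * np.minimum(1.0, power / norms), active

# Illustrative usage with placeholder values.
rng = np.random.default_rng(0)
C, Z = rng.normal(size=(4, 2)), 0.3 * rng.normal(size=(200, 2))
for _ in range(50):
    C, _ = alternating_step(C, Z)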

Outline

1. Learning to Minimize Error Probability
2. Learning to Minimize a Surrogate to the Error Probability
3. Learning by Codebook Expurgation

The codebook expurgation problem

Assume a standard minimal-distance decoder, S = I_d.

Let C_0 be a super-codebook of m_0 > m codewords.

An expurgated codebook of m codewords:
• Population version: C_{*,µ} = argmin_{C={x_1,...,x_m}⊂C_0} p_µ(C).
• Empirical version: C_{*,Z} = argmin_{C={x_1,...,x_m}⊂C_0} p_Z(C) (a brute-force version is sketched below).

A complex task: removing x_j should take into account both errors x_j → x_{j′} and x_{j′} → x_j.
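For small m_0 the empirical version can be solved exactly by enumerating all size-m sub-codebooks; the brute-force sketch below (with the Euclidean decoder, S = I_d, as assumed on this slide) illustrates the combinatorial cost that the Gibbs relaxation on the next slide avoids:

import numpy as np
from itertools import combinations

def empirical_error_prob(C, Z):
    # Empirical average error probability with the Euclidean decoder (S = I_d).
    m = C.shape[0]
    err = 0.0
    z_norm = np.sum(Z * Z, axis=1)
    for j in range(m):
        diffs = (C[j] + Z)[:, None, :] - C[None, :, :]   # (n, m, d)
        dists = np.sum(diffs * diffs, axis=2)
        dists[:, j] = np.inf
        err += np.mean(dists.min(axis=1) < z_norm)
    return err / m

def expurgate_brute_force(C0, Z, m):
    # Exhaustive search over all (m_0 choose m) sub-codebooks of the super-codebook C0.
    best_idx, best_err = None, np.inf
    for idx in combinations(range(C0.shape[0]), m):
        err = empirical_error_prob(C0[list(idx)], Z)
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err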

Gibbs selection algorithm

A relaxation: remove codewords step-by-step, k at each step.

Algorithm (a minimal sketch with k = 1 follows below):
1. Initialization: a codebook C_0 with |C_0| = m_0, β > 0, and Q ∈ P(R^d).
2. For t = 1, ..., T := (m_0 − m)/k:
   P( C_{t+1} = C_t \ {x_j}_{j∈{j_1,...,j_k}} | Z, C_t ) ∝ Q(C_{t+1}) · exp[ −β · p_Z(C_{t+1}) ].

Finite inverse temperature β:
• β = 0: random expurgation according to Q.
• β → ∞: may not generalize well to µ.
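A minimal sketch of the Gibbs expurgation with k = 1 and a uniform Q, reusing the empirical_error_prob helper from the brute-force sketch above (all of these choices are illustrative assumptions, not the paper's implementation):

import numpy as np

def gibbs_expurgate(C0, Z, m, beta, rng):
    # Remove one codeword at a time (k = 1) until m codewords remain. At each step the
    # removed codeword is drawn with probability proportional to exp(-beta * p_Z(C_{t+1}))
    # (Q uniform); empirical_error_prob is the helper from the previous sketch.
    C = C0.copy()
    while C.shape[0] > m:
        cand_err = np.array([empirical_error_prob(np.delete(C, j, axis=0), Z)
                             for j in range(C.shape[0])])
        logits = -beta * cand_err
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        j = rng.choice(C.shape[0], p=probs)
        C = np.delete(C, j, axis=0)
    return C

# beta = 0 reduces to purely random expurgation; a large beta approaches greedy removal
# of the empirically worst codeword, which may not generalize well to mu.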

Gibbs algorithm error types

Sequence generated by the population Gibbs algorithm: C_µ = (C_0, C_{µ,1}, ..., C_{µ,T}).

Sequence generated by the empirical Gibbs algorithm: C_Z = (C_0, C_{Z,1}, ..., C_{Z,T}).

Excess error:
p_µ(C_{Z,T}) − p_µ(C_{*,µ}) = [ p_µ(C_{Z,T}) − p_µ(C_{µ,T}) ] (empirical error) + [ p_µ(C_{µ,T}) − p_µ(C_{*,µ}) ] (approximation error).

Generalization error: p_µ(C_{Z,T}) − p_Z(C_{Z,T}).

Theoretical guarantees

Theorem. Under mild assumptions:
• Average empirical error: E[ p_µ(C_{Z,T}) − p_µ(C_{µ,T}) ] ≤ 3 √( Tβ² ( log n + m_0 log m_0 − log k ) / n ).
• Average generalization error: E[ p_µ(C_{Z,T}) − p_Z(C_{Z,T}) ] ≤ T ( β/n ∧ β²/(4n²) ).

Reducing β improves both the empirical and the generalization error.

Main open problem: characterizing the approximation error and optimizing β.

Summary

An additive noise channel: Y (output) = X (input) + Z (noise), with Z ⊥⊥ X ∈ R^d.

Generalization bounds for:

1. Learning under the error probability loss.
   • Applies to empirical risk minimization (ERM).
2. Learning under a surrogate error probability loss.
   • New alternating optimization algorithm.
3. A "codeword-expurgating" Gibbs learning algorithm.

Questions? Comments? nir.wein@gmail.com

References

[BLL98] Bartlett, Peter L., Tamás Linder, and Gábor Lugosi (1998). "The minimax distortion redundancy in empirical quantizer design". In: IEEE Transactions on Information Theory 44.5, pp. 1802–1813.
[FG17] Farsad, Nariman and Andrea Goldsmith (2017). "Detection algorithms for communication systems using deep learning". In: arXiv preprint arXiv:1705.08044.
[Gru+17] Gruber, Tobias et al. (2017). "On deep learning-based channel decoding". In: 2017 51st Annual Conference on Information Sciences and Systems (CISS). IEEE, pp. 1–6.
[Jia+19] Jiang, Yihan et al. (2019). "LEARN codes: Inventing low-latency codes via recurrent neural networks". In: ICC 2019 - IEEE International Conference on Communications (ICC). IEEE, pp. 1–7.
[Kim+18] Kim, Hyeji et al. (2018). "Communication algorithms via deep learning". In: arXiv preprint arXiv:1805.09317.
[Lin02] Linder, Tamás (2002). "Learning-theoretic methods in vector quantization". In: Principles of Nonparametric Learning. Springer, pp. 163–210.
[LLZ94] Linder, Tamás, Gábor Lugosi, and Kenneth Zeger (1994). "Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding". In: IEEE Transactions on Information Theory 40.6, pp. 1728–1740.
[LLZ97] Linder, Tamás, Gábor Lugosi, and Kenneth Zeger (1997). "Empirical quantizer design in the presence of source noise or channel noise". In: IEEE Transactions on Information Theory 43.2, pp. 612–623.
[OH17] O'Shea, Timothy and Jakob Hoydis (2017). "An introduction to deep learning for the physical layer". In: IEEE Transactions on Cognitive Communications and Networking 3.4, pp. 563–575.
[Sch08] Schenk, Tim (2008). RF Imperfections in High-Rate Wireless Systems: Impact and Digital Compensation. Springer Science & Business Media.
[Shl+19] Shlezinger, Nir et al. (2019). "ViterbiNet: Symbol detection using a deep learning based Viterbi algorithm". In: 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, pp. 1–5.
[Wan+17] Wang, Tianqi et al. (2017). "Deep learning for wireless physical layer: Opportunities and challenges". In: China Communications 14.11, pp. 92–111.