
Learning Additive Noise Channels: Generalization Bounds and Algorithms - PowerPoint PPT Presentation



1. Learning Additive Noise Channels: Generalization Bounds and Algorithms. Nir Weinberger, Massachusetts Institute of Technology, MA, USA. IEEE International Symposium on Information Theory, June 2020. 1/22

2. In a nutshell. An additive noise channel: Y = X + Z, where X ∈ R^d is the input, Y is the output, and the noise Z is independent of the input (Z ⊥ X). Z ∼ µ, but µ is unknown and non-parametric.
Question: can we learn to efficiently communicate from (Z_1, ..., Z_n) drawn i.i.d. ∼ µ?
Generalization bounds for:
1. learning under the error probability loss; applies to empirical risk minimization (ERM).
2. learning under a surrogate error probability loss; a new alternating optimization algorithm.
3. a "codeword-expurgating" Gibbs learning algorithm.
Caveat: a distilled learning-theoretic framework. 2/22

3. Motivation. Why? Justification of learning-based methods:
1. The success of deep neural networks (DNN) [OH17; Gru+17].
2. Avoid channel modeling [Wan+17; OH17; FG17; Shl+19]: interference, jamming signals, non-linearities [Sch08], finite-resolution quantization; high-dimensional parameters, e.g., massive MIMO.
3. Existing theory on learning-based quantizer design [LLZ94; LLZ97; BLL98; Lin02].
4. Exploit efficient optimization methods, e.g., for the design of low-latency codes [Kim+18; Jia+19].
3/22

4. Outline. 1. Learning to Minimize Error Probability. 2. Learning to Minimize a Surrogate to the Error Probability. 3. Learning by Codebook Expurgation. 4/22

5. Model. Channel: Y = X + Z, where X ∈ R^d is the input, Y is the output, and the noise Z is independent of X (Z ⊥ X).
Encoder: a codebook C = {x_j}_{j ∈ [m]}, drawn from a class 𝒞 ⊆ (R^d)^m.
Decoder: the minimal (Mahalanobis) distance decoder ĵ(y) ∈ argmin_{j ∈ [m]} ‖x_j − y‖_S, w.r.t. an inverse covariance matrix S ∈ 𝒮 ⊆ S^d_+ (the positive semi-definite cone).
5/22
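To make the decoding rule concrete, here is a minimal numpy sketch of minimum Mahalanobis-distance decoding of a channel output y against a codebook with weighting matrix S. The function and variable names (decode, codebook) are illustrative and not from the talk.

    import numpy as np

    def decode(y, codebook, S):
        """Return the index j minimizing the Mahalanobis distance ||x_j - y||_S.

        codebook: (m, d) array of codewords x_1, ..., x_m
        S:        (d, d) positive semi-definite weighting (inverse covariance) matrix
        y:        (d,) received channel output
        """
        diff = codebook - y                                # rows x_j - y, shape (m, d)
        dists = np.einsum("md,dk,mk->m", diff, S, diff)    # squared ||x_j - y||_S^2
        return int(np.argmin(dists))

    # Example: two antipodal codewords in d = 2 with identity weighting
    codebook = np.array([[1.0, 1.0], [-1.0, -1.0]])
    S = np.eye(2)
    y = np.array([0.7, 1.2])            # e.g., x_1 plus noise
    print(decode(y, codebook, S))       # -> 0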

6. Expected and empirical error probability.
Expected average error probability:
p_µ(C, S) := (1/m) Σ_{j=1}^m p_µ(C, S | j),
with
p_µ(C, S | j) := E_µ[ 𝟙{ min_{j' ∈ [m], j' ≠ j} ‖x_j + Z − x_{j'}‖_S < ‖Z‖_S } ].
Ultimate goal: find argmin_{C, S} p_µ(C, S).
Empirical average error probability: replace E_µ[ℓ(Z)] with the empirical average E_Z[ℓ(Z)] := (1/n) Σ_{i=1}^n ℓ(Z_i), so that
p_Z(C, S) := (1/m) Σ_{j=1}^m E_Z[ 𝟙{ min_{j' ∈ [m] \ {j}} ‖x_j + Z − x_{j'}‖_S < ‖Z‖_S } ].
6/22
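The empirical average error probability can be evaluated directly from the noise samples. Below is a minimal numpy sketch, with illustrative names (empirical_error_prob, codebook, noise), that computes p_Z(C, S) by averaging the indicator event of the slide over the samples; it is a brute-force illustration, not an algorithm from the talk.

    import numpy as np

    def empirical_error_prob(codebook, S, noise):
        """Empirical average error probability p_Z(C, S).

        codebook: (m, d) codewords; S: (d, d) weighting matrix; noise: (n, d) samples Z_i.
        For codeword j and sample Z_i, an error occurs if some other codeword x_{j'}
        satisfies ||x_j + Z_i - x_{j'}||_S < ||Z_i||_S.
        """
        m, _ = codebook.shape
        total = 0.0
        for j in range(m):
            y = codebook[j] + noise                              # received vectors, (n, d)
            diff = y[:, None, :] - codebook[None, :, :]          # (n, m, d)
            dists = np.einsum("nmd,de,nme->nm", diff, S, diff)   # squared ||y_i - x_{j'}||_S^2
            own = dists[:, j].copy()                             # equals ||Z_i||_S^2
            dists[:, j] = np.inf                                 # exclude j' = j from the minimum
            total += np.mean(dists.min(axis=1) < own)
        return total / m

    # Example: two antipodal codewords, i.i.d. Gaussian noise samples
    rng = np.random.default_rng(0)
    codebook = np.array([[1.5, 0.0], [-1.5, 0.0]])
    noise = rng.normal(scale=1.0, size=(10_000, 2))
    print(empirical_error_prob(codebook, np.eye(2), noise))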

7. Uniform error bound and ERM.
Theorem. Assume that n ≥ d + 1. Then, with probability at least 1 − δ,
sup_{C ⊂ (R^d)^m, S ∈ S^d_+} | p_µ(C, S) − p_Z(C, S) | ≤ 4m √( 2(d+1) log( en/(d+1) ) / n ) + √( 2 log(2/δ) / n ).
The bound holds for the output (C_Z, S_Z) of any learning algorithm.
Specifically, for ERM, (C_Z, S_Z)_ERM ∈ argmin_{C, S} p_Z(C, S), and n = Õ( (m² d + log(1/δ)) / ε² ) samples guarantee
p_µ( (C_Z, S_Z)_ERM ) ≤ inf_{C, S} p_µ(C, S) + ε.
7/22
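As a rough numerical illustration (not part of the talk), the right-hand side of the uniform deviation bound, as reconstructed above, can be evaluated for concrete m, d, n, and δ to see how it decays with the sample size; the helper name deviation_bound is hypothetical.

    import numpy as np

    def deviation_bound(m, d, n, delta):
        """RHS of the uniform deviation bound (as reconstructed above), valid for n >= d+1:
        4*m*sqrt(2*(d+1)*log(e*n/(d+1))/n) + sqrt(2*log(2/delta)/n)."""
        assert n >= d + 1
        return (4 * m * np.sqrt(2 * (d + 1) * np.log(np.e * n / (d + 1)) / n)
                + np.sqrt(2 * np.log(2 / delta) / n))

    # Example: m = 4 codewords in d = 2, confidence delta = 0.05
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, round(deviation_bound(4, 2, n, 0.05), 4))
    # The first term decays roughly like m*sqrt(d*log(n)/n), consistent with the
    # n = O~((m^2 d + log(1/delta)) / eps^2) sample complexity stated for ERM.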

8. Uniform error bound and ERM - cont. Open questions: the term Õ( log(1/δ) / ε² ) can be shown to be minimax tight. 8/22
