SLIDE 1

Interference and Generalization in Temporal Difference Learning

Emmanuel Bengio, Joelle Pineau, Doina Precup (ICML 2020)

slide-2
SLIDE 2

Overview

The setting:

  • Deep Neural Networks
  • Interference: ρ = ⟨∇θf(u1), ∇θf(u2)⟩ (see the sketch below)
  • Data: classification, regression, interactive environments
  • Training: supervised vs reinforcement (TD, TD(λ), & PG)

We wish to understand the relation between interference and generalization, and how Temporal Difference learning affects both.
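
As a concrete illustration, here is a minimal PyTorch sketch of how ρ could be measured for two inputs. The toy network f, its size, and the helper names are illustrative assumptions, not the authors' code:

    import torch
    import torch.nn as nn

    # Illustrative toy network standing in for f; not the paper's architecture.
    f = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

    def grad_vector(x):
        # Flattened gradient of the scalar prediction f(x) w.r.t. all parameters.
        f.zero_grad()
        f(x).squeeze().backward()
        return torch.cat([p.grad.reshape(-1) for p in f.parameters()])

    u1, u2 = torch.randn(4), torch.randn(4)
    rho = torch.dot(grad_vector(u1), grad_vector(u2))
    # rho > 0: updates at u1 also move f(u2) the same way;
    # rho ~ 0: independent; rho < 0: conflicting updates.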

SLIDE 3

Key Takeaways

For the same data:

  • TD tends to induce unaligned (ρ = 0 ± ε) representations
  • SL tends to induce aligned (ρ > 0) representations
  • increased alignment is correlated with:
      – a reduced generalization gap in TD
      – an increased generalization gap in SL
  • TD and SL generalize differently! Even for RL data
  • TD(λ) controls this behaviour (λ = 1 being ≈ SL)

SLIDE 4

Key Takeaways

More intuitively, as a conjecture, for the same data:

  • TD tends to memorize its data
  • SL tends to generalize
  • further training:
      – breaks memorized structures in TD
      – creates memorized structures in SL (overfitting)
  • TD and SL generalize differently! Even for RL data
  • TD(λ) controls this behaviour (λ = 1 being ≈ SL)

SLIDE 5

Interference

  • ρ > 0: ∇θf(x1) and ∇θf(x2) point in similar directions
  • ρ = 0: ∇θf(x1) and ∇θf(x2) are orthogonal
  • ρ < 0: ∇θf(x1) and ∇θf(x2) point in opposing directions

[Diagrams: gradient vector pairs for each case]

After a gradient step of size α taken at x1, the first-order change in the prediction at x2 is:

Δf(x2) = α ∇θf(x2)ᵀ ∇θf(x1)
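
A hedged numeric check of this first-order relation, reusing f and grad_vector from the earlier sketch (the step size and the ascent direction are illustrative assumptions):

    alpha = 1e-4
    x1, x2 = torch.randn(4), torch.randn(4)
    g1, g2 = grad_vector(x1), grad_vector(x2)
    predicted = alpha * torch.dot(g2, g1)    # alpha * rho
    before = f(x2).item()
    with torch.no_grad():
        # One step theta' = theta + alpha * grad f(x1).
        offset = 0
        for p in f.parameters():
            n = p.numel()
            p += alpha * g1[offset:offset + n].reshape(p.shape)
            offset += n
    actual = f(x2).item() - before
    # For small alpha, `actual` should closely match `predicted`.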

SLIDE 6

Interference

  • Taylor expansion:

f(x, θ′) = f(x, θ) + ∇θf(x)ᵀ(θ′ − θ) + ½(θ′ − θ)ᵀ ∇²θf(x) (θ′ − θ) + …

  • stiffness (Fort et al., 2019):

cos∠(∇f(x1), ∇f(x2)) = ∇f(x1)ᵀ∇f(x2) / (‖∇f(x1)‖ ‖∇f(x2)‖)
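
Stiffness is just the magnitude-normalized version of ρ; continuing the previous sketch:

    import torch.nn.functional as F

    # Cosine between the two gradient vectors, in [-1, 1].
    stiffness = F.cosine_similarity(g1, g2, dim=0)

Normalizing away the gradient norms isolates the alignment of the two updates from their sizes.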

SLIDE 7

Classification

Overfitting manifests differently

SLIDE 8

Supervised Data

SLIDE 9

Atari

Measuring gain (effective loss interference) for nearby states:
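
One plausible way to implement such a measurement is sketched below; the paper's exact definition of gain may differ, and model, opt, loss_fn, and the state/target arguments are illustrative placeholders:

    def effective_gain(model, opt, loss_fn, s, y_s, s_near, y_near):
        # Loss at the nearby state before updating on s.
        before = loss_fn(model(s_near), y_near).item()
        # One optimization step on s only.
        opt.zero_grad()
        loss_fn(model(s), y_s).backward()
        opt.step()
        # Positive gain: the update on s also reduced the loss nearby.
        after = loss_fn(model(s_near), y_near).item()
        return before - after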

SLIDE 10

Atari

Measuring gain (effective loss interference) for nearby states:

SLIDE 11

Understanding interference in TD

SLIDE 12

Understanding interference in TD

  • Test TD(λ), which “smooths” those wiggles
  • Test for correlation between wiggles and performance

SLIDE 13

TD(λ)

TD(λ) smooths the TD target by taking into account (weighted) future predictions:

Gλ(St) = (1 − λ) Σ_{n=1..∞} λ^{n−1} Gn(St)    (1)

Gn(St) = γ^n V(St+n) + Σ_{j=0..n−1} γ^j R(St+j)    (2)
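
A minimal sketch of computing these λ-returns for an episodic trajectory, using the standard backward recursion that equations (1) and (2) telescope into. Input names are illustrative: values[t] = V(St), with values holding one extra entry for the bootstrap at the final state:

    def lambda_returns(rewards, values, gamma, lam):
        # Backward recursion:
        # G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).
        T = len(rewards)                 # values has length T + 1
        returns = [0.0] * T
        g = values[T]                    # bootstrap from the final state
        for t in reversed(range(T)):
            g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
            returns[t] = g
        return returns

At λ = 0 this reduces to the one-step TD target R(St) + γV(St+1); at λ = 1 it approaches the Monte Carlo return, which is the sense in which TD(1) behaves like supervised regression.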

SLIDE 14

TD(λ)

SLIDE 15

TD(λ)

Increasing λ increases how fast the loss decreases (around St)

SLIDE 16

Local prediction variance

SLIDE 17

Local prediction variance

SLIDE 18

Interference update decomposition

Two extra terms appear in the TD update's interference time derivative:

ρ′reg;AB = −ρ̄²AB δ²B − 2 δA δB ρ̄AB ρ̄BB − δA δ²B ∇fBᵀ(H̄A ∇fB + H̄B ∇fA)

ρ′TD;AB = −δ²B ρ̄AB (ρ̄AB − γ ρ̄A′B) − δA δB ρ̄AB (ρ̄BB − γ ρ̄B′B) − δA δ²B ∇fBᵀ(H̄A ∇fB + H̄B ∇fA)

→ the gradient variance induced by errors in predictions will be much larger for a high-capacity, high-variance model

SLIDE 19

Interference update decomposition

DDQN and QL (no frozen target) have unstable updates, unlike Regression and DQN (frozen target):
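
A hedged sketch of the distinction (illustrative value-network code, not the paper's implementation): with a frozen target network the bootstrap term stays fixed between syncs, so each update behaves like a plain regression step; with the online network as its own target, the regression target moves after every step.

    import copy
    import torch
    import torch.nn as nn

    value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
    target_net = copy.deepcopy(value_net)   # frozen copy, synced only occasionally

    def td_loss(s, r, s_next, gamma, use_frozen_target=True):
        net = target_net if use_frozen_target else value_net
        with torch.no_grad():               # semi-gradient: target detached either way
            target = r + gamma * net(s_next)
        return (value_net(s) - target).pow(2).mean()

Even though the target is detached in both cases, only the frozen variant keeps it constant between syncs, which is what stabilizes the updates above.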

SLIDE 20

Recap & Conclusion

  • generalization dynamics in SL and RL → different parameterizations
  • in RL tasks, TD doesn't generalize as well as SL (even when the f to approximate is the same)
  • we find a link between the complexity and variance of TD targets and interference
  • TD(λ) has generalization potential
  • better optimizers for TD might improve things quite a lot!
