SLIDE 1

Interference and Generalization in Temporal Difference Learning

Emmanuel Bengio, Joelle Pineau, Doina Precup (ICML 2020)

slide-2
SLIDE 2

Overview

The setting:

  • Deep Neural Networks
  • Interference: ρ = ⟨∇θf(u1), ∇θf(u2)⟩ (see the sketch below)
  • Data: classification, regression, interactive environments
  • Training: supervised vs reinforcement (TD, TD(λ), & PG)

We wish to understand the relation between interference and generalization, and how Temporal Difference learning affects both.
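
As a concrete illustration, here is a minimal PyTorch sketch of how ρ could be measured for two inputs. The toy network f, its size, and the helper names are illustrative assumptions, not the authors' code:

    import torch
    import torch.nn as nn

    # Illustrative toy network standing in for f; not the paper's architecture.
    f = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))

    def grad_vector(x):
        # Flattened gradient of the scalar prediction f(x) w.r.t. all parameters.
        f.zero_grad()
        f(x).squeeze().backward()
        return torch.cat([p.grad.reshape(-1) for p in f.parameters()])

    u1, u2 = torch.randn(4), torch.randn(4)
    rho = torch.dot(grad_vector(u1), grad_vector(u2))
    # rho > 0: updates at u1 also move f(u2) the same way;
    # rho ~ 0: independent; rho < 0: conflicting updates.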

SLIDE 3

Key Takeaways

For the same data:

  • TD tends to induce unaligned (ρ = 0 ± ε) representations
  • SL tends to induce aligned (ρ > 0) representations
  • increased alignment is correlated with:
      – a reduced generalization gap in TD
      – an increased generalization gap in SL
  • TD and SL generalize differently! Even for RL data
  • TD(λ) controls this behaviour (λ = 1 being ≈ SL)

SLIDE 4

Key Takeaways

More intuitively, as a conjecture, for the same data:

  • TD tends to memorize its data
  • SL tends to generalize
  • further training:
      – breaks memorized structures in TD
      – creates memorized structures in SL (overfitting)
  • TD and SL generalize differently! Even for RL data
  • TD(λ) controls this behaviour (λ = 1 being ≈ SL)

SLIDE 5

Interference

  • ρ > 0: ∇θf(x1) and ∇θf(x2) point in similar directions
  • ρ = 0: ∇θf(x1) and ∇θf(x2) are orthogonal
  • ρ < 0: ∇θf(x1) and ∇θf(x2) point in opposing directions

[Diagrams: gradient vector pairs for each case]

After a gradient step of size α taken at x1, the first-order change in the prediction at x2 is:

Δf(x2) = α ∇θf(x2)ᵀ ∇θf(x1)
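
A hedged numeric check of this first-order relation, reusing f and grad_vector from the earlier sketch (the step size and the ascent direction are illustrative assumptions):

    alpha = 1e-4
    x1, x2 = torch.randn(4), torch.randn(4)
    g1, g2 = grad_vector(x1), grad_vector(x2)
    predicted = alpha * torch.dot(g2, g1)    # alpha * rho
    before = f(x2).item()
    with torch.no_grad():
        # One step theta' = theta + alpha * grad f(x1).
        offset = 0
        for p in f.parameters():
            n = p.numel()
            p += alpha * g1[offset:offset + n].reshape(p.shape)
            offset += n
    actual = f(x2).item() - before
    # For small alpha, `actual` should closely match `predicted`.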

SLIDE 6

Interference

  • Taylor expansion:

f(x, θ′) = f(x, θ) + ∇θf(x)ᵀ(θ′ − θ) + ½(θ′ − θ)ᵀ ∇²θf(x) (θ′ − θ) + …

  • stiffness (Fort et al., 2019):

cos∠(∇f(x1), ∇f(x2)) = ∇f(x1)ᵀ∇f(x2) / (‖∇f(x1)‖ ‖∇f(x2)‖)
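
Stiffness is just the magnitude-normalized version of ρ; continuing the previous sketch:

    import torch.nn.functional as F

    # Cosine between the two gradient vectors, in [-1, 1].
    stiffness = F.cosine_similarity(g1, g2, dim=0)

Normalizing away the gradient norms isolates the alignment of the two updates from their sizes.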

SLIDE 7

Classification

Overfitting manifests differently

SLIDE 8

Supervised Data

SLIDE 9

Atari

Measuring gain (effective loss interference) for nearby states:
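
One plausible way to implement such a measurement is sketched below; the paper's exact definition of gain may differ, and model, opt, loss_fn, and the state/target arguments are illustrative placeholders:

    def effective_gain(model, opt, loss_fn, s, y_s, s_near, y_near):
        # Loss at the nearby state before updating on s.
        before = loss_fn(model(s_near), y_near).item()
        # One optimization step on s only.
        opt.zero_grad()
        loss_fn(model(s), y_s).backward()
        opt.step()
        # Positive gain: the update on s also reduced the loss nearby.
        after = loss_fn(model(s_near), y_near).item()
        return before - after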

SLIDE 10

Atari

Measuring gain (effective loss interference) for nearby states:

SLIDE 11

Understanding interference in TD

SLIDE 12

Understanding interference in TD

  • Test TD(λ), which “smooths” those wiggles
  • Test for correlation between wiggles and performance

SLIDE 13

TD(λ)

TD(λ) smooths the TD target by taking into account (weighted) future predictions:

Gλ(St) = (1 − λ) Σ_{n=1..∞} λ^{n−1} Gn(St)    (1)

Gn(St) = γ^n V(St+n) + Σ_{j=0..n−1} γ^j R(St+j)    (2)
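
A minimal sketch of computing these λ-returns for an episodic trajectory, using the standard backward recursion that equations (1) and (2) telescope into. Input names are illustrative: values[t] = V(St), with values holding one extra entry for the bootstrap at the final state:

    def lambda_returns(rewards, values, gamma, lam):
        # Backward recursion:
        # G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).
        T = len(rewards)                 # values has length T + 1
        returns = [0.0] * T
        g = values[T]                    # bootstrap from the final state
        for t in reversed(range(T)):
            g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
            returns[t] = g
        return returns

At λ = 0 this reduces to the one-step TD target R(St) + γV(St+1); at λ = 1 it approaches the Monte Carlo return, which is the sense in which TD(1) behaves like supervised regression.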

SLIDE 14

TD(λ)

SLIDE 15

TD(λ)

Increasing λ increases how fast the loss decreases (around St)

SLIDE 16

Local prediction variance

SLIDE 17

Local prediction variance

SLIDE 18

Interference update decomposition

Two extra terms appear in the TD update's interference time derivative:

ρ′reg;AB = −ρ̄²AB δ²B − 2 δA δB ρ̄AB ρ̄BB − δA δ²B ∇fBᵀ(H̄A ∇fB + H̄B ∇fA)

ρ′TD;AB = −δ²B ρ̄AB (ρ̄AB − γ ρ̄A′B) − δA δB ρ̄AB (ρ̄BB − γ ρ̄B′B) − δA δ²B ∇fBᵀ(H̄A ∇fB + H̄B ∇fA)

→ the gradient variance induced by errors in predictions will be much larger for a high-capacity, high-variance model

SLIDE 19

Interference update decomposition

DDQN and QL (no frozen target) have unstable updates, unlike Regression and DQN (frozen target):
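
A hedged sketch of the distinction (illustrative value-network code, not the paper's implementation): with a frozen target network the bootstrap term stays fixed between syncs, so each update behaves like a plain regression step; with the online network as its own target, the regression target moves after every step.

    import copy
    import torch
    import torch.nn as nn

    value_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
    target_net = copy.deepcopy(value_net)   # frozen copy, synced only occasionally

    def td_loss(s, r, s_next, gamma, use_frozen_target=True):
        net = target_net if use_frozen_target else value_net
        with torch.no_grad():               # semi-gradient: target detached either way
            target = r + gamma * net(s_next)
        return (value_net(s) - target).pow(2).mean()

Even though the target is detached in both cases, only the frozen variant keeps it constant between syncs, which is what stabilizes the updates above.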

SLIDE 20

Recap & Conclusion

  • generalization dynamics in SL and RL → different parameterizations
  • in RL tasks, TD doesn't generalize as well as SL (even when the f to approximate is the same)
  • we find a link between the complexity and variance of TD targets and interference
  • TD(λ) has generalization potential
  • better optimizers for TD might improve things quite a lot!
