SLIDE 1


Neural Fitted Actor-Critic

Matthieu Zimmer, Alain Dutech, Yann Boniface

University of Lorraine, LORIA

8th July 2016

SLIDE 2

Outline

1. Background
2. Neural Fitted Actor-Critic
3. Future works

SLIDE 3

Reinforcement Learning

SLIDE 4

Reinforcement Learning

Optimization problem: find a policy π : S → A that maximizes the expected discounted rewards

E_π[ Σ_{t=0}^{∞} γ^t r_t ]
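To make the objective concrete, here is a minimal Python sketch (not from the talk) that estimates the discounted return Σ_t γ^t r_t from one sampled episode of rewards:

```python
# Minimal sketch (not from the talk): the discounted return
# sum_t gamma^t r_t of one episode, accumulated backwards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Horner-style accumulation
    return g

# Example: a reward of 1 received after two steps, gamma = 0.9 -> 0.81
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))
```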
SLIDE 5

Constraints and Motivations

Reinforcement learning + developmental robotics:

1. Continuous environments
2. No prior models of the agent or the environment
3. Use non-linear approximators (neural networks)
4. No prior goal states or trajectories

SLIDES 6-8

How to solve reinforcement learning problems?

Actor-only: learn a policy π : S → A directly. Play π_k, observe the rewards Σ_t γ^t r_t, and update to π_{k+1}.

Critic-only: learn a value function Q : S × A → R. Deduce a policy π_k from Q_k, play it, and update to Q_{k+1} from the observed rewards.

Actor-critic: learn both a policy π : S → A and a value function V : S → R. Play π_k, update V_k to V_{k+1} from the observed rewards, and use V_{k+1} to update π_k to π_{k+1}.
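As a rough illustration of the third scheme, here is a generic actor-critic loop in Python (a sketch, not the talk's exact algorithm; the env, actor and critic interfaces are hypothetical):

```python
# Generic actor-critic loop (illustrative sketch; env, actor and critic
# are hypothetical objects, not an API from the talk).
def actor_critic_episode(env, actor, critic, gamma=0.99):
    s = env.reset()
    done = False
    while not done:
        a = actor.act(s)                      # play: a ~ pi_k(s)
        s_next, r, done = env.step(a)
        v_target = r + (0.0 if done else gamma * critic.value(s_next))
        delta = v_target - critic.value(s)    # temporal-difference error
        critic.update(s, v_target)            # V_k -> V_{k+1}
        actor.update(s, a, delta)             # pi_k -> pi_{k+1} using the critic
        s = s_next
```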

SLIDE 9

State of the art

Critic-only

Fitted Q Iteration; Q-Learning, Sarsa

Actor-only

Evolutionary algorithms (CMA-ES, ...); PI2

Actor-critic

Natural Actor-Critic; CACLA

SLIDE 10

State of the art

Annotations mark which constraint a method fails to satisfy: (1) no continuous environments, (2) requires prior models of the agent or environment, (3) linear approximators only, (4) requires prior goal states or trajectories.

Critic-only

Fitted Q Iteration (1); Q-Learning, Sarsa (1)

Actor-only

Evolutionary algorithms (CMA-ES) → poor data efficiency; PI2 (3)(4)

Actor-critic

Natural Actor-Critic (3)(4); CACLA → poor data efficiency, many meta-parameters

SLIDE 11

Landscape of algorithms

[Figure: algorithms placed along two axes, decisional complexity vs. data required, with the "ideal algorithm" marked.]

SLIDE 12

Landscape of algorithms

[Same figure, now with NFQ placed on it.]

SLIDE 13

Neural Fitted Q (NFQ)

Q_{k+1} = argmin_{Q ∈ F_c} Σ_{t=1}^{N} [ Q(s_t, a_t) − (r_{t+1} + γ max_{a′ ∈ A} Q_k(s_{t+1}, a′)) ]²

π*(s) = argmax_{a ∈ A} Q(s, a)

[Diagram: feed-forward network with inputs s1, s2, s3, a1, a2, one hidden layer, and output Q(s, a).]
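A possible reading of one NFQ iteration in Python (a sketch under assumptions: a finite action set, vector-encoded states and actions, and a generic supervised regressor with a fit method; none of these names come from the slides):

```python
import numpy as np

# One NFQ iteration as batch regression (illustrative sketch).
# transitions: list of (s, a, r, s_next); actions: finite action set;
# q_k(s, a): previous Q estimate; regressor: any supervised model
# with fit(X, y), e.g. a multilayer perceptron.
def nfq_iteration(regressor, transitions, actions, q_k, gamma=0.99):
    X, y = [], []
    for s, a, r, s_next in transitions:
        best_next = max(q_k(s_next, a2) for a2 in actions)  # max_a' Q_k(s', a')
        X.append(np.concatenate([s, a]))  # (state, action) as the input
        y.append(r + gamma * best_next)   # fitted target
    regressor.fit(np.array(X), np.array(y))  # Q_{k+1} = argmin of squared error
    return regressor
```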

SLIDE 14

CACLA

Temporal-difference error: δ_t = r_t + γ V(s_{t+1}) − V(s_t)

Critic: V_{k+1}(s_t) = V_k(s_t) + α_v δ_t, implemented as θ^V_{i,k+1} = θ^V_{i,k} + α_v δ_t ∂V_k(s_t) / ∂θ^V_{i,k}

Actor: θ_{t+1} = θ_t + α_a (a_t − u_t) ∂u_t(s_t) / ∂θ_t if δ_t > 0, and θ_{t+1} = θ_t otherwise
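A per-step sketch of these updates (the critic/actor interfaces are hypothetical; step_towards stands in for the gradient steps written above, nudging the model's output at s toward a target):

```python
# One CACLA step (illustrative sketch; critic.value / step_towards and
# actor.output / step_towards are hypothetical helpers standing in for
# the gradient updates on the slide).
def cacla_step(critic, actor, s, a, r, s_next, alpha_v, alpha_a, gamma=0.99):
    u = actor.output(s)                               # deterministic output u_t
    delta = r + gamma * critic.value(s_next) - critic.value(s)  # TD error
    critic.step_towards(s, critic.value(s) + alpha_v * delta)   # critic update
    if delta > 0:                                     # learn only on improvement
        actor.step_towards(s, u + alpha_a * (a - u))  # pull u_t toward a_t
```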

SLIDE 15

Neural Fitted Actor-Critic

[Diagram: the NFAC loop. 1) Interactions: the agent plays a ∼ π in the environment, receiving s, r and collecting a dataset D_π of exploratory actions {s_t, a_t} and deterministic actor outputs {s_t, u_t}. 2a) Actor update: depending on the sign of δ, the actor π is refit with Rprop toward a_t (δ > 0) or u_t (δ ≤ 0). 2b) Critic update: the critic is refit with Rprop on targets {s_t, v_{k,t}}, producing V_{k+1} from V_k. Repeat.]

SLIDE 16

Neural Fitted Actor-Critic

V_{k+1} ← argmin_{V ∈ F_c} Σ_{s_t ∈ D_π} [ V(s_t) − (r_{t+1} + γ V_k(s_{t+1})) ]²

[Diagram: critic network with inputs s1, s2, s3, one hidden layer, and output V(s).]

π_{k+1} ← argmin_{π ∈ F_a} Σ_{s_t ∈ D_π} [ π(s_t) − y_t ]², where y_t = a_t if δ_t > 0, and y_t = u_t otherwise

[Diagram: actor network with inputs s1, s2, s3, one hidden layer, and outputs a1, a2.]
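Putting the two regressions together, one NFAC iteration could look like the following sketch (fit/predict regressor interfaces are assumed, e.g. MLPs trained with Rprop as on the previous slide; the episode format is my assumption):

```python
import numpy as np

# One NFAC iteration (illustrative sketch): after an episode, refit the
# critic on fitted targets and the actor on CACLA-style targets.
# episode: list of (s, u, a, r, s_next) with u = actor output, a = played action.
def nfac_iteration(critic, actor, episode, gamma=0.99):
    states, v_targets, pi_targets = [], [], []
    for s, u, a, r, s_next in episode:
        v_target = r + gamma * critic.predict(s_next)   # r_{t+1} + gamma V_k(s')
        delta = v_target - critic.predict(s)            # delta_t
        states.append(s)
        v_targets.append(v_target)
        pi_targets.append(a if delta > 0 else u)        # a_t if delta_t > 0 else u_t
    critic.fit(np.array(states), np.array(v_targets))   # V_{k+1}
    actor.fit(np.array(states), np.array(pi_targets))   # pi_{k+1}
```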

SLIDE 17

Experimental Results

SLIDE 18

Landscape of algorithms

[Same figure, now with NFQ, CACLA, CMA-ES and NFAC placed on it.]

SLIDE 19

Landscape of algorithms

[Same figure, with DDPG and NAF added; their exact placement is marked with a question mark.]

SLIDE 20

Methods landscape

[Same figure, with NFAC+ added.]

SLIDE 21

Toward a better data efficiency

Fitted Actor-Critic

Q^π_{k+1} = argmin_{Q ∈ F_c} Σ_{t=1}^{N} c(a_t | s_t) [ Q(s_t, a_t) − (r_{t+1} + γ Q^π_k(s_{t+1}, π(s_{t+1}))) ]²

π_{k+1} = argmax_{π ∈ F_a} Σ_{t=1}^{N} Q_{k+1}(s_t, π(s_t))

c(a_t | s_t) = min(1, π(a_t | s_t) / π_0(a_t | s_t))
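The clipped importance weight c(a_t | s_t) can be computed directly; a small sketch (density functions for the current policy π and the behaviour policy π_0 are assumed):

```python
# Clipped importance weight from the slide:
# c(a|s) = min(1, pi(a|s) / pi0(a|s)); pi_density and pi0_density are
# assumed probability densities of the current and behaviour policies.
def clipped_weight(pi_density, pi0_density, s, a):
    return min(1.0, pi_density(a, s) / pi0_density(a, s))
```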
SLIDE 22

Conclusion & Further Works

Further works on Neural Fitted Actor-Critic:

• Compare to DDPG
• Don't forget previous data
• Guided exploration of the sensorimotor space
• Increase the dimension of states/actions
• Redefine the reward function for the new sub-goal
