SLIDE 1

COMP 138: Reinforcement Learning

Instructor: Jivko Sinapov Webpage: https://www.eecs.tufts.edu/~jsinapov/teaching/comp150_RL_Fall2020/

SLIDE 2

Announcements

SLIDE 3

Reading Assignment

  • Chapter 6 of Sutton and Barto

SLIDE 4

Research Article Topics

  • Transfer learning
  • Learning with human demonstrations and/or advice
  • Approximating Q-functions with neural networks

SLIDE 5

Reading Assignment

  • Chapter 6 of Sutton and Barto
  • Matthew E. Taylor, Peter Stone, and Yaxin Liu. Transfer Learning via Inter-Task Mappings for Temporal Difference Learning. Journal of Machine Learning Research, 8(1):2125-2167, 2007.
  • Responses should discuss both readings
  • You get extra credit for answering others’ questions!

SLIDE 6

Programming Assignment #2

  • Homework 2 is out

SLIDE 7

Class Project Discussion

  • What makes a good project?
  • What makes a good team?

SLIDE 8

Reading Responses

“What are some real world applications of DP?”

– Boriana

“Since there are at least four ways Monte Carlo methods are advantageous over DP mentioned, are there any problems in which using DP is more practical?”

– Catherine

SLIDE 9

Reading Responses

“How can we define the stopping conditions for value iteration or the Monte-Carlo method (how many iterations is enough)?”

– Tung

SLIDE 10

Reading Responses

“Are DP methods dependent on initial states?”

– Eric

SLIDE 11

Reading Responses

“In the Asynchronous Dynamic Programming method, according to what do we choose which states should be updated more frequently?”

– Pandong

SLIDE 12

Any other questions about DP?

SLIDE 13

Dynamic Programming

SLIDE 14

[Gridworld figure: states A, B, C, D with successive value estimates V0 through V5; reward +X; π = random policy, γ = 0.5; sample backup shown on the slide: 1/3 · 0.5 · x/12 + 1/3 · 0.5 · x/6 + x/3]
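
A hedged reconstruction: the expression appears to be one iterative policy-evaluation backup (Sutton and Barto, Chapter 4) under the equiprobable random policy with γ = 0.5. The general update, and the slide's instance copied verbatim, in LaTeX:

```latex
% Iterative policy evaluation backup (Sutton & Barto, Ch. 4):
% expected one-step return under pi, bootstrapping on v_k.
\[
  v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)
  \bigl[ r + \gamma\, v_k(s') \bigr]
\]
% The slide's instance, copied verbatim:
\[
  \tfrac{1}{3} \cdot 0.5 \cdot \tfrac{x}{12}
  + \tfrac{1}{3} \cdot 0.5 \cdot \tfrac{x}{6}
  + \tfrac{x}{3}
\]
```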

SLIDE 15

[Gridworld figure, continued: states A–D with value estimates V0 through V5 and 1st/2nd markers; same sample backup 1/3 · 0.5 · x/12 + 1/3 · 0.5 · x/6 + x/3; π = random policy, γ = 0.5]

SLIDE 16

[Gridworld exercise: reward +10; states A–D with value estimates V0 through V5; π = optimal policy, γ = 0.5. At each state, the agent has one or more actions allowing it to move to neighboring states; moving in the direction of a wall is not allowed.]
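
The grid itself is not recoverable from the deck. As a stand-in, here is a minimal Python MDP with the same ingredients (deterministic moves between neighboring states, a +10 reward, an absorbing goal); the `P[s][a]` transition-table format is an assumption of these notes, reused by the sketches on the next few slides:

```python
# Hypothetical stand-in for the slide's gridworld (the real layout did
# not survive extraction). P[s][a] is a list of
# (probability, next_state, reward) triples.
P = {
    "A": {"right": [(1.0, "B", 0.0)]},
    "B": {"left": [(1.0, "A", 0.0)], "right": [(1.0, "C", 0.0)]},
    "C": {"left": [(1.0, "B", 0.0)], "right": [(1.0, "D", 10.0)]},  # +10 on reaching D
    "D": {"stay": [(1.0, "D", 0.0)]},  # absorbing goal state
}
GAMMA = 0.5  # discount factor from the slide
```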

SLIDE 17

Policy Improvement

  • Main idea: if, for a particular state s, we can do better than following the current policy by taking a different action, then the current policy is not optimal, and changing it to take that different action at state s improves it (a sketch of this greedy improvement step follows)
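
A minimal sketch of the greedy improvement step, assuming the hypothetical `P[s][a]` transition table from the stand-in MDP above and a current value estimate `V`:

```python
def improve_policy(P, V, gamma=0.5):
    """Greedy policy improvement: at every state, pick the action whose
    one-step lookahead value under V is largest."""
    policy = {}
    for s in P:
        # Expected one-step return of each action, bootstrapping on V.
        action_values = {
            a: sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
            for a, transitions in P[s].items()
        }
        policy[s] = max(action_values, key=action_values.get)
    return policy
```

By the policy improvement theorem, the resulting greedy policy is at least as good as the policy V was computed for.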

SLIDE 18

Policy Iteration

  • evaluate → improve → evaluate → improve → … (see the sketch below)
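
A minimal sketch of that loop under the same assumed `P[s][a]` format, with evaluation run to a small tolerance and the greedy improvement step inlined:

```python
def policy_iteration(P, gamma=0.5, theta=1e-6):
    """Alternate full policy evaluation with greedy improvement until
    the policy stops changing (the classical policy iteration scheme)."""
    states = list(P)
    policy = {s: next(iter(P[s])) for s in states}  # arbitrary initial policy
    while True:
        # Evaluate: iterative policy evaluation for the current policy.
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2])
                        for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Improve: act greedily with respect to V.
        new_policy = {}
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                 for a, trans in P[s].items()}
            new_policy[s] = max(q, key=q.get)
        if new_policy == policy:
            return policy, V  # policy is stable
        policy = new_policy
```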

SLIDE 19

Value Iteration

  • Main idea:
    – Do one sweep of policy evaluation under the current greedy policy
    – Repeat until values stop changing (relative to some small Δ); a minimal sketch follows
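
A minimal sketch under the same assumptions, with `theta` playing the role of the slide's small Δ:

```python
def value_iteration(P, gamma=0.5, theta=1e-6):
    """Back up each state with a max over actions (one sweep of
    evaluation under the current greedy policy) until no value moves by
    more than theta."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in trans)
                    for trans in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```

The greedy policy can then be read off V with one final improvement step.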

SLIDE 20

[Gridworld exercise, revisited: reward +10; states A–D with value estimates V0 through V5; π = greedy policy, γ = 0.5. At each state, the agent has one or more actions allowing it to move to neighboring states; moving in the direction of a wall is not allowed.]

SLIDE 21

Monte Carlo Methods

SLIDE 22

SLIDE 23

SLIDE 24

Code Demo
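
The demo code itself is not in the deck. As a stand-in, here is a minimal first-visit Monte Carlo policy-evaluation sketch; `sample_episode` is a hypothetical callable that rolls out one episode under the policy being evaluated and returns a list of `(state, reward)` pairs:

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes=10_000, gamma=0.5):
    """First-visit Monte Carlo prediction: estimate V(s) as the average
    of the returns observed after the first visit to s in each episode."""
    returns = defaultdict(list)
    for _ in range(num_episodes):
        episode = sample_episode()  # [(state, reward), ...]
        # Record the first time step at which each state is visited.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards so G is the discounted return from step t onward.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:
                returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```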

SLIDE 25

Reading Responses

“What are the advantages and disadvantages of model-free methods? What are the advantages and disadvantages of model-based methods?”

– Tung

SLIDE 26

Reading Responses

“In theory, both DP and Monte Carlo will find the optimal policy, but since our implementation of the method won't iterate infinitely, will there be chances that the result is only a locally optimal value?”

– Erli

SLIDE 27

Reading Responses

“Are there situations when on-policy methods are preferred over off-policy for reasons other than ease of implementation?”

– Eric

SLIDE 28

SLIDE 29

SLIDE 30

Finding Project Partner(s) Breakout

SLIDE 31

Monte Carlo Tree Search Video

SLIDE 32

THE END
