Journal of Artificial Intelligence Research

Reinforcement Learning: A Survey

Leslie Pack Kaelbling                    lpk@cs.brown.edu
Michael L. Littman                       mlittman@cs.brown.edu
Computer Science Department, Brown University, Providence, RI, USA

Andrew W. Moore                          awm@cs.cmu.edu
Smith Hall, Carnegie Mellon University, Forbes Avenue, Pittsburgh, PA, USA
Abstract

This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
1. Introduction

Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years, it has attracted rapidly increasing interest in the machine learning and artificial intelligence communities. Its promise is beguiling: a way of programming agents by reward and punishment without needing to specify how the task is to be achieved. But there are formidable computational obstacles to fulfilling the promise. This paper surveys the historical basis of reinforcement learning and some of the current work from a computer science perspective. We give a high-level overview of the field and a taste of some specific approaches. It is, of course, impossible to mention all of the important work in the field; this should not be taken to be an exhaustive account.

Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. The work described here has a strong family resemblance to eponymous work in psychology, but differs considerably in the details and in the use of the word "reinforcement." It is appropriately thought of as a class of problems, rather than as a set of techniques.

There are two main strategies for solving reinforcement-learning problems. The first is to search in the space of behaviors in order to find one that performs well in the environment. This approach has been taken by work in genetic algorithms and genetic programming, as well as some more novel search techniques (Schmidhuber). The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world.
This paper is devoted almost entirely to the second set of techniques, because they take advantage of the special structure of reinforcement-learning problems that is not available in optimization problems in general. It is not yet clear which set of approaches is best in which circumstances.

The rest of this section is devoted to establishing notation and describing the basic reinforcement-learning model. Section 2 explains the tradeoff between exploration and exploitation and presents some solutions to the most basic case of reinforcement-learning problems, in which we want to maximize the immediate reward. Section 3 considers the more general problem in which rewards can be delayed in time from the actions that were crucial to gaining them. Section 4 considers some classic model-free algorithms for reinforcement learning from delayed reward: adaptive heuristic critic, TD(λ), and Q-learning. Section 5 demonstrates a continuum of algorithms that are sensitive to the amount of computation an agent can perform between actual steps of action in the environment. Generalization, the cornerstone of mainstream machine learning research, has the potential of considerably aiding reinforcement learning, as described in Section 6. Section 7 considers the problems that arise when the agent does not have complete perceptual access to the state of the environment. Section 8 catalogs some of reinforcement learning's successful applications. Finally, Section 9 concludes with some speculations about important open problems and the future of reinforcement learning.
1.1 Reinforcement-Learning Model

In the standard reinforcement-learning model, an agent is connected to its environment via perception and action, as depicted in Figure 1. On each step of interaction the agent receives as input, i, some indication of the current state, s, of the environment; the agent then chooses an action, a, to generate as output. The action changes the state of the environment, and the value of this state transition is communicated to the agent through a scalar reinforcement signal, r. The agent's behavior, B, should choose actions that tend to increase the long-run sum of values of the reinforcement signal. It can learn to do this over time by systematic trial and error, guided by a wide variety of algorithms that are the subject of later sections of this paper.

Figure 1: The standard reinforcement-learning model.
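To make the model concrete, here is a minimal Python sketch of the perception-action loop just described. The two-state environment and the random choice rule are illustrative inventions, not constructs from the survey.

    import random

    class TwoStateEnv:
        """A made-up environment with two states and two actions;
        action 1 in state 0 tends to pay off."""
        def reset(self):
            self.s = 0
            return self.s

        def actions(self, state):
            return [0, 1]

        def step(self, a):
            r = 1.0 if (self.s == 0 and a == 1) else 0.0  # scalar reinforcement r
            self.s = random.choice([0, 1])                # state transition
            return r, self.s

    def run_agent(env, num_steps):
        """The standard interaction loop: receive state s, emit action a,
        receive reinforcement r. This agent picks actions at random; the
        algorithms surveyed later replace this choice rule with learning."""
        s = env.reset()
        total = 0.0
        for _ in range(num_steps):
            a = random.choice(env.actions(s))  # behavior B chooses an action
            r, s = env.step(a)
            total += r
        return total

    print(run_agent(TwoStateEnv(), 100))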
Formally, the model consists of

- a discrete set of environment states, S;
- a discrete set of agent actions, A; and
- a set of scalar reinforcement signals, typically {0, 1}, or the real numbers.

The figure also includes an input function I, which determines how the agent views the environment state; we will assume that it is the identity function (that is, the agent perceives the exact state of the environment) until we consider partial observability in Section 7.
An intuitive way to understand the relation between the agent and its environment is with the following example dialogue:

    Environment: You are in state 65. You have 4 possible actions.
    Agent: I'll take action 2.
    Environment: You received a reinforcement of 7 units. You are now in state 15. You have 2 possible actions.
    Agent: I'll take action 1.
    Environment: You received a reinforcement of -4 units. You are now in state 65. You have 4 possible actions.
    Agent: I'll take action 2.
    Environment: You received a reinforcement of 5 units. You are now in state 44. You have 5 possible actions.

The agent's job is to find a policy π, mapping states to actions, that maximizes some long-run measure of reinforcement. We expect, in general, that the environment will be non-deterministic; that is, that taking the same action in the same state on two different occasions may result in different next states and/or different reinforcement values. This happens in our example above: from state 65, applying action 2 produces differing reinforcements and differing states on two occasions. However, we assume the environment is stationary; that is, that the probabilities of making state transitions or receiving specific reinforcement signals do not change over time.
Reinforcement learning differs from the more widely studied problem of supervised learning in several ways. The most important difference is that there is no presentation of input/output pairs. Instead, after choosing an action the agent is told the immediate reward and the subsequent state, but is not told which action would have been in its best long-term interests. It is necessary for the agent to gather useful experience about the possible system states, actions, transitions and rewards actively to act optimally. Another difference from supervised learning is that on-line performance is important: the evaluation of the system is often concurrent with learning.

The stationarity assumption may be disappointing; after all, operation in non-stationary environments is one of the motivations for building learning systems. In fact, many of the algorithms described in later sections are effective in slowly-varying non-stationary environments, but there is very little theoretical analysis in this area.
Some aspects of reinforcement learning are closely related to search and planning issues in artificial intelligence. AI search algorithms generate a satisfactory trajectory through a graph of states. Planning operates in a similar manner, but typically within a construct with more complexity than a graph, in which states are represented by compositions of logical expressions instead of atomic symbols. These AI algorithms are less general than the reinforcement-learning methods, in that they require a predefined model of state transitions, and with a few exceptions assume determinism. On the other hand, reinforcement learning, at least in the kind of discrete cases for which theory has been developed, assumes that the entire state space can be enumerated and stored in memory, an assumption to which conventional search algorithms are not tied.
1.2 Models of Optimal Behavior

Before we can start thinking about algorithms for learning to behave optimally, we have to decide what our model of optimality will be. In particular, we have to specify how the agent should take the future into account in the decisions it makes about how to behave now. There are three models that have been the subject of the majority of work in this area.

The finite-horizon model is the easiest to think about: at a given moment in time, the agent should optimize its expected reward for the next h steps,

    E\left(\sum_{t=0}^{h} r_t\right) ;

it need not worry about what will happen after that. In this and subsequent expressions, r_t represents the scalar reward received t steps into the future. This model can be used in two ways. In the first, the agent will have a non-stationary policy; that is, one that changes over time. On its first step it will take what is termed a h-step optimal action. This is defined to be the best action available given that it has h steps remaining in which to act and gain reinforcement. On the next step it will take a (h-1)-step optimal action, and so on, until it finally takes a 1-step optimal action and terminates. In the second, the agent does receding-horizon control, in which it always takes the h-step optimal action. The agent always acts according to the same policy, but the value of h limits how far ahead it looks in choosing its actions. The finite-horizon model is not always appropriate. In many cases we may not know the precise length of the agent's life in advance.

The infinite-horizon discounted model takes the long-run reward of the agent into account, but rewards that are received in the future are geometrically discounted according to discount factor γ (where 0 ≤ γ < 1):

    E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) .

We can interpret γ in several ways. It can be seen as an interest rate, a probability of living another step, or as a mathematical trick to bound the infinite sum. The model is conceptually similar to receding-horizon control, but the discounted model is more mathematically tractable than the finite-horizon model. This is a dominant reason for the wide attention this model has received.
Another optimality criterion is the average-reward model, in which the agent is supposed to take actions that optimize its long-run average reward:

    \lim_{h \to \infty} E\left(\frac{1}{h} \sum_{t=0}^{h} r_t\right) .

Such a policy is referred to as a gain optimal policy; it can be seen as the limiting case of the infinite-horizon discounted model as the discount factor approaches 1 (Bertsekas). One problem with this criterion is that there is no way to distinguish between two policies, one of which gains a large amount of reward in the initial phases and the other of which does not. Reward gained on any initial prefix of the agent's life is overshadowed by the long-run average performance. It is possible to generalize this model so that it takes into account both the long-run average and the amount of initial reward that can be gained. In the generalized, bias optimal model, a policy is preferred if it maximizes the long-run average and ties are broken by the initial extra reward.

Figure 2 contrasts these models of optimality by providing an environment in which changing the model of optimality changes the optimal policy. In this example, circles represent the states of the environment and arrows are state transitions. There is only a single action choice from every state except the start state, which is in the upper left and marked with an incoming arrow. All rewards are zero except where marked. Under a finite-horizon model with h = 4, the first action should be chosen; under an infinite-horizon discounted model with γ = 0.9, the second action should be chosen; and under the average-reward model, the third action should be chosen, since it leads to an average reward of +11. If we change h and γ, then the second action can become optimal for the finite-horizon model and the first for the infinite-horizon discounted model; however, the average-reward model will always prefer the best long-term average.

Figure 2: Comparing models of optimality. All unlabeled arrows produce a reward of zero. (The three choices lead to rewards of +2, +10, and +11; the panels are labeled "Finite horizon, h = 4," "Infinite horizon, γ = 0.9," and "Average reward.")

Since the choice of optimality model and parameters matters so much, it is important to choose it carefully in any application. The finite-horizon model is appropriate when the agent's lifetime is known; one important aspect of this model is that, as the length of the remaining lifetime decreases, the agent's policy may change. A system with a hard deadline would be appropriately modeled this way.
The relative usefulness of infinite-horizon discounted and bias-optimal models is still under debate. Bias-optimality has the advantage of not requiring a discount parameter; however, algorithms for finding bias-optimal policies are not yet as well-understood as those for finding optimal infinite-horizon discounted policies.

1.3 Measuring Learning Performance

The criteria given in the previous section can be used to assess the policies learned by a given algorithm. We would also like to be able to evaluate the quality of learning itself. There are several incompatible measures in use.

- Eventual convergence to optimal. Many algorithms come with a provable guarantee of asymptotic convergence to optimal behavior (Watkins & Dayan). This is reassuring, but useless in practical terms. An agent that quickly reaches a plateau
  at some level of optimality may, in many applications, be preferable to an agent that has a guarantee of eventual optimality but a sluggish early learning rate.
- Speed of convergence to optimality. Optimality is usually an asymptotic result, and so convergence speed is an ill-defined measure. More practical is the speed of convergence to near-optimality. This measure begs the definition of how near to optimality is sufficient. A related measure is level of performance after a given time, which similarly requires that someone define the given time.

  It should be noted that here we have another difference between reinforcement learning and conventional supervised learning. In the latter, expected future predictive accuracy or statistical efficiency are the prime concerns. For example, in the well-known PAC framework (Valiant), there is a learning period during which mistakes do not count, then a performance period during which they do. The framework provides bounds on the necessary length of the learning period in order to have a probabilistic guarantee on the subsequent performance. That is usually an inappropriate view for an agent with a long existence in a complex environment. In spite of the mismatch between embedded reinforcement learning and the train/test perspective, Fiechter provides a PAC analysis for Q-learning (described in Section 4.2) that sheds some light on the connection between the two views.

  Measures related to speed of learning have an additional weakness. An algorithm that merely tries to achieve optimality as fast as possible may incur unnecessarily large penalties during the learning period. A less aggressive strategy, taking longer to achieve optimality but gaining greater total reinforcement during its learning, might be preferable.

- Regret. A more appropriate measure, then, is the expected decrease in reward gained due to executing the learning algorithm instead of behaving optimally from the very beginning. This measure is known as regret (Berry & Fristedt). It penalizes mistakes wherever they occur during the run. Unfortunately, results concerning the regret of algorithms are quite hard to obtain.
1.4 Reinforcement Learning and Adaptive Control

Adaptive control (Burghes & Graham; Stengel) is also concerned with algorithms for improving a sequence of decisions from experience. Adaptive control is a much more mature discipline that concerns itself with dynamic systems in which states and actions are vectors and system dynamics are smooth: linear or locally linearizable around a desired trajectory. A very common formulation of cost functions in adaptive control are quadratic penalties on deviation from desired state and action vectors. Most importantly, although the dynamic model of the system is not known in advance, and must be estimated from data, the structure of the dynamic model is fixed, leaving model estimation as a parameter estimation problem. These assumptions permit deep, elegant and powerful mathematical analysis, which in turn leads to robust, practical, and widely deployed adaptive control algorithms.
2. Exploitation versus Exploration: The Single-State Case

One major difference between reinforcement learning and supervised learning is that a reinforcement-learner must explicitly explore its environment. In order to highlight the problems of exploration, we treat a very simple case in this section. The fundamental issues and approaches described here will, in many cases, transfer to the more complex instances of reinforcement learning discussed later in the paper.

The simplest possible reinforcement-learning problem is known as the k-armed bandit problem, which has been the subject of a great deal of study in the statistics and applied mathematics literature (Berry & Fristedt). The agent is in a room with a collection of k gambling machines (each called a "one-armed bandit" in colloquial English). The agent is permitted a fixed number of pulls, h. Any arm may be pulled on each turn. The machines do not require a deposit to play; the only cost is in wasting a pull playing a suboptimal machine. When arm i is pulled, machine i pays off 1 or 0, according to some underlying probability parameter p_i, where payoffs are independent events and the p_i's are unknown. What should the agent's strategy be?

This problem illustrates the fundamental tradeoff between exploitation and exploration. The agent might believe that a particular arm has a fairly high payoff probability; should it choose that arm all the time, or should it choose another one that it has less information about, but seems to be worse? Answers to these questions depend on how long the agent is expected to play the game; the longer the game lasts, the worse the consequences of prematurely converging on a suboptimal arm, and the more the agent should explore.

There is a wide variety of solutions to this problem. We will consider a representative selection of them, but for a deeper discussion and a number of important theoretical results, see the book by Berry and Fristedt. We use the term "action" to indicate the agent's choice of arm to pull. This eases the transition into delayed reinforcement models in Section 3. It is very important to note that bandit problems fit our definition of a reinforcement-learning environment with a single state with only self-transitions.

Section 2.1 discusses three solutions to the basic one-state bandit problem that have formal correctness results. Although they can be extended to problems with real-valued rewards, they do not apply directly to the general multi-state delayed-reinforcement case.
Section 2.2 presents three techniques that are not formally justified, but that have had wide use in practice, and can be applied (with similar lack of guarantee) to the general case.

2.1 Formally Justified Techniques

There is a fairly well-developed formal theory of exploration for very simple problems. Although it is instructive, the methods it provides do not scale well to more complex problems.
2.1.1 Dynamic-Programming Approach

If the agent is going to be acting for a total of h steps, it can use basic Bayesian reasoning to solve for an optimal strategy (Berry & Fristedt). This requires an assumed prior joint distribution for the parameters {p_i}, the most natural of which is that each p_i is independently uniformly distributed between 0 and 1. We compute a mapping from belief states (summaries of the agent's experiences during this run) to actions. Here, a belief state can be represented as a tabulation of action choices and payoffs: {n_1, w_1, n_2, w_2, ..., n_k, w_k} denotes a state of play in which each arm i has been pulled n_i times with w_i payoffs. We write V^*(n_1, w_1, ..., n_k, w_k) as the expected payoff remaining, given that a total of h pulls are available, and we use the remaining pulls optimally.

If \sum_i n_i = h, then there are no remaining pulls, and V^*(n_1, w_1, ..., n_k, w_k) = 0. This is the basis of a recursive definition. If we know the V^* value for all belief states with t pulls remaining, we can compute the V^* value of any belief state with t + 1 pulls remaining:

    V^*(n_1, w_1, \ldots, n_k, w_k)
        = \max_i E(\text{future payoff if agent takes action } i \text{, then acts optimally for remaining pulls})
        = \max_i \Big( \rho_i \big( 1 + V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k) \big)
                     + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \Big),

where \rho_i is the posterior subjective probability of action i paying off given n_i, w_i and our prior probability. For the uniform priors, which result in a beta distribution, \rho_i = (w_i + 1)/(n_i + 2).

The expense of filling in the table of V^* values in this way for all attainable belief states is linear in the number of belief states times actions, and thus exponential in the horizon.
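The recursion above can be written down directly with memoization. The sketch below is a minimal rendering for small h and k (the table is exponential in the horizon, exactly as just noted); `rho` implements the uniform-prior posterior (w_i + 1)/(n_i + 2).

    from functools import lru_cache

    def optimal_bandit_value(k, h):
        """Expected payoff of the Bayes-optimal strategy for a k-armed
        Bernoulli bandit with h pulls and independent uniform priors."""

        @lru_cache(maxsize=None)
        def V(belief):
            # belief is a tuple ((n_1, w_1), ..., (n_k, w_k)).
            if sum(n for n, _ in belief) == h:
                return 0.0                  # no pulls remaining
            best = 0.0
            for i, (n, w) in enumerate(belief):
                rho = (w + 1) / (n + 2)     # posterior P(arm i pays off)
                win = list(belief);  win[i] = (n + 1, w + 1)
                lose = list(belief); lose[i] = (n + 1, w)
                value = rho * (1 + V(tuple(win))) + (1 - rho) * V(tuple(lose))
                best = max(best, value)
            return best

        return V(tuple((0, 0) for _ in range(k)))

    # Example: with 2 arms and 10 pulls, the optimal expected payoff
    # exceeds the h/2 = 5.0 obtained by pulling arms blindly.
    print(optimal_bandit_value(2, 10))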
2.1.2 Gittins Allocation Indices

Gittins gives an "allocation index" method for finding the optimal choice of action at each step in k-armed bandit problems (Gittins). The technique only applies under the discounted expected reward criterion. For each action, consider the number of times it has been chosen, n, versus the number of times it has paid off, w. For certain discount factors, there are published tables of "index values," I(n, w), for each pair of n and w. Look up the index value for each action i, I(n_i, w_i). It represents a comparative measure of the combined value of the expected payoff of action i (given its history of payoffs) and the value of the information that we would get by choosing it. Gittins has shown that choosing the action with the largest index value guarantees the optimal balance between exploration and exploitation.
Because of the guarantee of optimal exploration, and the simplicity of the technique (given the table of index values), this approach holds a great deal of promise for use in more complex applications. This method proved useful in an application to robotic manipulation with immediate reward (Salganicoff & Ungar). Unfortunately, no one has yet been able to find an analog of index values for delayed reinforcement problems.

Figure 3: A Tsetlin automaton with 2N states (states 1 through N on the left; states N+1 through 2N on the right). The top row shows the state transitions that are made when the previous action resulted in a reward of 1; the bottom row shows transitions after a reward of 0. In states in the left half of the figure, action 0 is taken; in those on the right, action 1 is taken.
2.1.3 Learning Automata

A branch of the theory of adaptive control is devoted to learning automata, surveyed by Narendra and Thathachar, which were originally described explicitly as finite state automata. The Tsetlin automaton shown in Figure 3 provides an example that solves a 2-armed bandit arbitrarily near optimally as N approaches infinity.

It is inconvenient to describe algorithms as finite-state automata, so a move was made to describe the internal state of the agent as a probability distribution according to which actions would be chosen. The probabilities of taking different actions would be adjusted according to their previous successes and failures. An example, which stands among a set of algorithms independently developed in the mathematical psychology literature (Hilgard & Bower), is the linear reward-inaction algorithm. Let p_i be the agent's probability of taking action i. When action a_i succeeds,

    p_i := p_i + \alpha (1 - p_i)
    p_j := p_j - \alpha p_j \quad \text{for } j \neq i .

When action a_i fails, p_j remains unchanged (for all j). This algorithm converges with probability 1 to a vector containing a single 1 and the rest 0's (choosing a particular action with probability 1). Unfortunately, it does not always converge to the correct action; but the probability that it converges to the wrong one can be made arbitrarily small by making α small (Narendra & Thathachar). There is no literature on the regret of this algorithm.
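Here is a compact sketch of the linear reward-inaction update on a Bernoulli bandit, with the step size written as `alpha`; the payoff probabilities are made-up illustration values.

    import random

    def linear_reward_inaction(payoff_probs, alpha=0.05, steps=20000, seed=0):
        """Linear reward-inaction: on a success, move probability mass
        toward the chosen action; on a failure, change nothing."""
        rng = random.Random(seed)
        k = len(payoff_probs)
        p = [1.0 / k] * k                   # uniform initial action probabilities
        for _ in range(steps):
            i = rng.choices(range(k), weights=p)[0]
            if rng.random() < payoff_probs[i]:           # action a_i succeeded
                p = [p_j - alpha * p_j for p_j in p]     # p_j := p_j - alpha * p_j
                p[i] += alpha                            # net: p_i += alpha * (1 - p_i)
            # on failure, the probabilities remain unchanged
        return p

    # Illustrative two-armed bandit: arm 1 is better and usually wins out,
    # but with nonzero probability the vector converges on arm 0 instead.
    print(linear_reward_inaction([0.4, 0.8]))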
2.2 Ad-Hoc Techniques

In reinforcement-learning practice, some simple, ad hoc strategies have been popular. They are rarely, if ever, the best choice for the models of optimality we have used, but they may be viewed as reasonable, computationally tractable heuristics. Thrun has surveyed a variety of these techniques.

2.2.1 Greedy Strategies

The first strategy that comes to mind is to always choose the action with the highest estimated payoff. The flaw is that early unlucky sampling might indicate that the best action's reward is less than the reward obtained from a suboptimal action. The suboptimal action will always be picked, leaving the true optimal action starved of data and its superiority never discovered. An agent must explore to ameliorate this outcome.

A useful heuristic is optimism in the face of uncertainty, in which actions are selected greedily, but strongly optimistic prior beliefs are put on their payoffs so that strong negative evidence is needed to eliminate an action from consideration. This still has a measurable danger of starving an optimal but unlucky action, but the risk of this can be made arbitrarily small. Techniques like this have been used in several reinforcement-learning algorithms, including the interval exploration method (Kaelbling) described shortly, the exploration bonus in Dyna (Sutton), curiosity-driven exploration (Schmidhuber), and the exploration mechanism in prioritized sweeping (Moore & Atkeson).
2.2.2 Randomized Strategies

Another simple exploration strategy is to take the action with the best estimated expected reward by default, but with probability p, choose an action at random. Some versions of this strategy start with a large value of p to encourage initial exploration, which is slowly decreased.

An objection to the simple strategy is that, when it experiments with a non-greedy action, it is no more likely to try a promising alternative than a clearly hopeless alternative. A slightly more sophisticated strategy is Boltzmann exploration. In this case, the expected reward for taking action a, ER(a), is used to choose an action probabilistically according to the distribution

    P(a) = \frac{e^{ER(a)/T}}{\sum_{a' \in A} e^{ER(a')/T}} .

The temperature parameter T can be decreased over time to decrease exploration. This method works well if the best action is well separated from the others, but suffers somewhat when the values of the actions are close. It may also converge unnecessarily slowly unless the temperature schedule is manually tuned with great care.
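A minimal sketch of both strategies over a table of estimated action values; the value table and parameter settings are illustrative.

    import math
    import random

    def epsilon_greedy(values, p):
        """With probability p pick a uniformly random action,
        otherwise pick the action with the best estimated reward."""
        if random.random() < p:
            return random.randrange(len(values))
        return max(range(len(values)), key=lambda a: values[a])

    def boltzmann(values, T):
        """Sample an action with probability proportional to e^{ER(a)/T}.
        High T explores broadly; as T -> 0 this approaches greedy choice."""
        weights = [math.exp(v / T) for v in values]
        return random.choices(range(len(values)), weights=weights)[0]

    estimated_rewards = [0.1, 0.5, 0.45]        # illustrative estimates ER(a)
    print(epsilon_greedy(estimated_rewards, p=0.1))
    print(boltzmann(estimated_rewards, T=0.1))  # mostly picks action 1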
2.2.3 Interval-Based Techniques

Exploration is often more efficient when it is based on second-order information about the certainty or variance of the estimated values of actions. Kaelbling's interval estimation algorithm stores statistics for each action a_i: w_i is the number of successes and n_i the number of trials. An action is chosen by computing the upper bound of a 100(1 - α)% confidence interval on the success probability of each action, and choosing the action with the highest upper bound. Smaller values of the α parameter encourage greater exploration. When payoffs are boolean, the normal approximation to the binomial distribution can be used to construct the confidence interval (though the binomial should be used for small n). Other payoff distributions can be handled using their associated statistics or with non-parametric methods. The method works very well in empirical trials. It is also related to a certain class of statistical techniques known as experiment design methods (Box & Draper), which are used for comparing multiple treatments (for example, fertilizers or drugs) to determine which treatment, if any, is best in as small a set of experiments as possible.
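A sketch of the interval estimation rule for boolean payoffs using the normal approximation; `z` is the critical value corresponding to the chosen α (1.96 below corresponds to α = 0.05 and is an illustrative choice).

    import math

    def interval_estimation_choice(successes, trials, z=1.96):
        """Pick the action with the highest upper confidence bound on its
        success probability, using the normal approximation to the binomial.
        Untried actions get an unbounded upper bound, so they go first."""
        best_action, best_bound = None, -math.inf
        for a, (w, n) in enumerate(zip(successes, trials)):
            if n == 0:
                return a                    # never starve an untried action
            p_hat = w / n
            bound = p_hat + z * math.sqrt(p_hat * (1 - p_hat) / n)
            if bound > best_bound:
                best_action, best_bound = a, bound
        return best_action

    # Illustrative statistics: action 0 (3/4) has a higher estimate and a
    # wide interval, so it is chosen over action 1 (20/30).
    print(interval_estimation_choice([3, 20], [4, 30]))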
2.3 More General Problems

When there are multiple states, but reinforcement is still immediate, then any of the above solutions can be replicated, once for each state. However, when generalization is required, these solutions must be integrated with generalization methods (see Section 6); this is straightforward for the simple ad-hoc methods, but it is not understood how to maintain theoretical guarantees.

Many of these techniques focus on converging to some regime in which exploratory actions are taken rarely or never; this is appropriate when the environment is stationary. However, when the environment is non-stationary, exploration must continue to take place, in order to notice changes in the world. Again, the more ad-hoc techniques can be modified to deal with this in a plausible manner (keep temperature parameters from going to 0; decay the statistics in interval estimation), but none of the theoretically guaranteed methods can be applied.
3. Delayed Reward

In the general case of the reinforcement learning problem, the agent's actions determine not only its immediate reward, but also (at least probabilistically) the next state of the environment. Such environments can be thought of as networks of bandit problems, but the agent must take into account the next state as well as the immediate reward when it decides which action to take. The model of long-run optimality the agent is using determines exactly how it should take the value of the future into account. The agent will have to be able to learn from delayed reinforcement: it may take a long sequence of actions, receiving insignificant reinforcement, then finally arrive at a state with high reinforcement. The agent must be able to learn which of its actions are desirable based on reward that can take place arbitrarily far in the future.

3.1 Markov Decision Processes

Problems with delayed reinforcement are well modeled as Markov decision processes (MDPs). An MDP consists of

- a set of states S,
- a set of actions A,
- a reward function R : S × A → ℝ, and
- a state transition function T : S × A → Π(S), where a member of Π(S) is a probability distribution over the set S (i.e., it maps states to probabilities). We write T(s, a, s') for the probability of making a transition from state s to state s' using action a.

The state transition function probabilistically specifies the next state of the environment as a function of its current state and the agent's action. The reward function specifies expected instantaneous reward as a function of the current state and action. The model is Markov if the state transitions are independent of any previous environment states or agent actions. There are many good references to MDP models (Bellman; Bertsekas; Howard; Puterman).

Although general MDPs may have infinite (even uncountable) state and action spaces, we will only discuss methods for solving finite-state and finite-action problems. In Section 6, we discuss methods for solving problems with continuous input and output spaces.
3.2 Finding a Policy Given a Model

Before we consider algorithms for learning to behave in MDP environments, we will explore techniques for determining the optimal policy given a correct model. These dynamic programming techniques will serve as the foundation and inspiration for the learning algorithms to follow. We restrict our attention mainly to finding optimal policies for the infinite-horizon discounted model, but most of these algorithms have analogs for the finite-horizon and average-case models as well. We rely on the result that, for the infinite-horizon discounted model, there exists an optimal deterministic stationary policy (Bellman).

We will speak of the optimal value of a state; it is the expected infinite discounted sum of reward that the agent will gain if it starts in that state and executes the optimal policy. Using π as a complete decision policy, it is written

    V^*(s) = \max_{\pi} E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right) .

This optimal value function is unique and can be defined as the solution to the simultaneous equations

    V^*(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \Big), \qquad \forall s \in S,

which assert that the value of a state s is the expected instantaneous reward plus the expected discounted value of the next state, using the best available action. Given the optimal value function, we can specify the optimal policy as

    \pi^*(s) = \arg\max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s') \Big) .

3.2.1 Value Iteration

One way, then, to find an optimal policy is to find the optimal value function. It can be determined by a simple iterative algorithm called value iteration that can be shown to converge to the correct V^* values (Bellman; Bertsekas):
    initialize V(s) arbitrarily
    loop until policy good enough
        loop for s ∈ S
            loop for a ∈ A
                Q(s, a) := R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V(s')
            V(s) := max_a Q(s, a)
        end loop
    end loop
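For concreteness, here is a straightforward Python rendering of the loop above; the MDP is passed in as dictionaries R[s][a] and T[s][a], a representation chosen for this sketch rather than anything prescribed by the survey.

    def value_iteration(S, A, R, T, gamma, tol=1e-6):
        """Compute V by repeated full backups until the value function
        changes by less than `tol` (a simple stand-in for the Bellman
        residual stopping criterion discussed below).
        R[s][a] is the expected reward; T[s][a] is a dict {s2: prob}."""
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                q = {a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                     for a in A}
                new_v = max(q.values())
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < tol:
                # greedy policy with respect to the (near-)optimal values
                pi = {s: max(A, key=lambda a: R[s][a] + gamma *
                             sum(p * V[s2] for s2, p in T[s][a].items()))
                      for s in S}
                return V, pi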
It is not obvious when to stop the value iteration algorithm. One important result bounds the performance of the current greedy policy as a function of the Bellman residual of the current value function (Williams & Baird). It says that if the maximum difference between two successive value functions is less than ε, then the value of the greedy policy (the policy obtained by choosing, in every state, the action that maximizes the estimated discounted reward, using the current estimate of the value function) differs from the value function of the optimal policy by no more than 2εγ/(1 − γ) at any state. This provides an effective stopping criterion for the algorithm. Puterman discusses another stopping criterion, based on the span semi-norm, which may result in earlier termination. Another important result is that the greedy policy is guaranteed to be optimal in some finite number of steps, even though the value function may not have converged (Bertsekas). And, in practice, the greedy policy is often optimal long before the value function has converged.

Value iteration is very flexible. The assignments to V need not be done in strict order as shown above, but instead can occur asynchronously in parallel, provided that the value of every state gets updated infinitely often on an infinite run. These issues are treated extensively by Bertsekas, who also proves convergence results.
Updates of the form above are known as full backups, since they make use of information from all possible successor states. It can be shown that updates of the form

    Q(s, a) := Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)

can also be used, as long as each pairing of a and s is updated infinitely often, s' is sampled from the distribution T(s, a, ·), r is sampled with mean R(s, a) and bounded variance, and the learning rate α is decreased slowly. This type of sample backup (Singh) is critical to the operation of the model-free methods discussed in the next section.

The computational complexity of the value-iteration algorithm with full backups, per iteration, is quadratic in the number of states and linear in the number of actions. Commonly, the transition probabilities T(s, a, s') are sparse. If there are on average a constant number of next states with non-zero probability, then the cost per iteration is linear in the number of states and linear in the number of actions. The number of iterations required to reach the optimal value function is polynomial in the number of states and the magnitude of the largest reward if the discount factor is held constant. However, in the worst case the number of iterations grows polynomially in 1/(1 − γ), so the convergence rate slows considerably as the discount factor approaches 1 (Littman, Dean, & Kaelbling).
3.2.2 Policy Iteration

The policy iteration algorithm manipulates the policy directly, rather than finding it indirectly via the optimal value function. It operates as follows:

    choose an arbitrary policy π'
    loop
        π := π'
        compute the value function of policy π:
            solve the linear equations
                V_π(s) = R(s, π(s)) + γ Σ_{s' ∈ S} T(s, π(s), s') V_π(s')
        improve the policy at each state:
            π'(s) := arg max_a ( R(s, a) + γ Σ_{s' ∈ S} T(s, a, s') V_π(s') )
    until π = π'
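A direct rendering of this loop, reusing the dictionary MDP representation from the value-iteration sketch above and solving the policy-evaluation equations with numpy; the representation is again this sketch's assumption.

    import numpy as np

    def policy_iteration(S, A, R, T, gamma):
        """Alternate exact policy evaluation (a linear solve) with greedy
        policy improvement until the policy stops changing."""
        states = list(S)
        idx = {s: i for i, s in enumerate(states)}
        pi = {s: A[0] for s in states}          # arbitrary initial policy
        while True:
            # Policy evaluation: solve (I - gamma * T_pi) V = R_pi.
            n = len(states)
            M = np.eye(n)
            b = np.zeros(n)
            for s in states:
                b[idx[s]] = R[s][pi[s]]
                for s2, p in T[s][pi[s]].items():
                    M[idx[s], idx[s2]] -= gamma * p
            V = np.linalg.solve(M, b)
            # Policy improvement: act greedily with respect to V_pi.
            new_pi = {s: max(A, key=lambda a: R[s][a] + gamma *
                             sum(p * V[idx[s2]] for s2, p in T[s][a].items()))
                      for s in states}
            if new_pi == pi:
                return pi, {s: V[idx[s]] for s in states}
            pi = new_pi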
The value function of a policy is just the expected infinite discounted reward that will be gained, at each state, by executing that policy. It can be determined by solving a set of linear equations. Once we know the value of each state under the current policy, we consider whether the value could be improved by changing the first action taken. If it can, we change the policy to take the new action whenever it is in that situation. This step is guaranteed to strictly improve the performance of the policy. When no improvements are possible, then the policy is guaranteed to be optimal.

Since there are at most |A|^{|S|} distinct policies, and the sequence of policies improves at each step, this algorithm terminates in at most an exponential number of iterations (Puterman). However, it is an important open question how many iterations policy iteration takes in the worst case. It is known that the running time is pseudo-polynomial and that, for any fixed discount factor, there is a polynomial bound in the total size of the MDP (Littman et al.).
3.2.3 Enhancement to Value Iteration and Policy Iteration

In practice, value iteration is much faster per iteration, but policy iteration takes fewer iterations. Arguments have been put forth to the effect that each approach is better for large problems. Puterman's modified policy iteration algorithm (Puterman & Shin) provides a method for trading iteration time for iteration improvement in a smoother way. The basic idea is that the expensive part of policy iteration is solving for the exact value of V_π. Instead of finding an exact value for V_π, we can perform a few steps of a modified value-iteration step where the policy is held fixed over successive iterations. This can be shown to produce an approximation to V_π that converges linearly in γ. In practice, this can result in substantial speedups.

Several standard numerical-analysis techniques that speed the convergence of dynamic programming can be used to accelerate value and policy iteration. Multigrid methods can be used to quickly seed a good initial approximation to a high-resolution value function by initially performing value iteration at a coarser resolution (Rüde). State aggregation works by collapsing groups of states to a single meta-state, solving the abstracted problem (Bertsekas & Castañon).
3.2.4 Computational Complexity

Value iteration works by producing successive approximations of the optimal value function. Each iteration can be performed in O(|A||S|^2) steps, or faster if there is sparsity in the transition function. However, the number of iterations required can grow exponentially in the discount factor (Condon); as the discount factor approaches 1, the decisions must be based on results that happen farther and farther into the future.

In practice, policy iteration converges in fewer iterations than value iteration, although the per-iteration costs of O(|A||S|^2 + |S|^3) can be prohibitive. There is no known tight worst-case bound available for policy iteration (Littman et al.). Modified policy iteration (Puterman & Shin) seeks a trade-off between cheap and effective iterations, and is preferred by some practitioners (Rust).

Linear programming (Schrijver) is an extremely general problem, and MDPs can be solved by general-purpose linear-programming packages (Derman; D'Epenoux; Hoffman & Karp). An advantage of this approach is that commercial-quality linear-programming packages are available, although the time and space requirements can still be quite high. From a theoretic perspective, linear programming is the only known algorithm that can solve MDPs in polynomial time, although the theoretically efficient algorithms have not been shown to be efficient in practice.
4. Learning an Optimal Policy: Model-free Methods

In the previous section we reviewed methods for obtaining an optimal policy for an MDP, assuming that we already had a model. The model consists of knowledge of the state transition probability function T(s, a, s') and the reinforcement function R(s, a). Reinforcement learning is primarily concerned with how to obtain the optimal policy when such a model is not known in advance. The agent must interact with its environment directly to obtain information which, by means of an appropriate algorithm, can be processed to produce an optimal policy.

At this point, there are two ways to proceed.

- Model-free: Learn a controller without learning a model.
- Model-based: Learn a model, and use it to derive a controller.

Which approach is better? This is a matter of some debate in the reinforcement-learning community. A number of algorithms have been proposed on both sides. This question also appears in other fields, such as adaptive control, where the dichotomy is between direct and indirect adaptive control.

This section examines model-free learning, and Section 5 examines model-based methods.

The biggest problem facing a reinforcement-learning agent is temporal credit assignment. How do we know whether the action just taken is a good one, when it might have far-reaching effects? One strategy is to wait until the "end" and reward the actions taken if the result was good and punish them if the result was bad. In ongoing tasks, it is difficult to know what the "end" is, and this might require a great deal of memory. Instead, we will use insights from value iteration to adjust the estimated value of a state based on
SLIDE 16 Kaelbling Littman
  • Moore

AHC RL v s r a

Figure
  • Arc
hitecture for the adaptiv e heuristic critic the immediate rew ard and the estimated v alue
  • f
the next state This class
  • f
algorithms is kno wn as temp
  • r
al dier enc e metho ds Sutton
  • W
e will consider t w
  • dieren
t temp
  • raldierence
learning strategies for the discoun ted innitehorizon mo del
4.1 Adaptive Heuristic Critic and TD(λ)

The adaptive heuristic critic algorithm is an adaptive version of policy iteration (Barto, Sutton, & Anderson), in which the value-function computation is no longer implemented by solving a set of linear equations, but is instead computed by an algorithm called TD(0). A block diagram for this approach is given in Figure 4. It consists of two components: a critic (labeled AHC) and a reinforcement-learning component (labeled RL).

Figure 4: Architecture for the adaptive heuristic critic.

The reinforcement-learning component can be an instance of any of the k-armed bandit algorithms, modified to deal with multiple states and non-stationary rewards. But instead of acting to maximize instantaneous reward, it will be acting to maximize the heuristic value, v, that is computed by the critic. The critic uses the real external reinforcement signal to learn to map states to their expected discounted values, given that the policy being executed is the one currently instantiated in the RL component.

We can see the analogy with modified policy iteration if we imagine these components working in alternation. The policy π implemented by RL is fixed and the critic learns the value function V_π for that policy. Now we fix the critic and let the RL component learn a new policy π' that maximizes the new value function, and so on. In most implementations, however, both components operate simultaneously. Only the alternating implementation can be guaranteed to converge to the optimal policy, under appropriate conditions. Williams and Baird explored the convergence properties of a class of AHC-related algorithms they call "incremental variants of policy iteration" (Williams & Baird).
It remains to explain how the critic can learn the value of a policy. We define ⟨s, a, r, s'⟩ to be an experience tuple summarizing a single transition in the environment. Here s is the agent's state before the transition, a is its choice of action, r the instantaneous reward it receives, and s' its resulting state. The value of a policy is learned using Sutton's TD(0) algorithm, which uses the update rule

    V(s) := V(s) + \alpha \big( r + \gamma V(s') - V(s) \big) .

Whenever a state s is visited, its estimated value is updated to be closer to r + γV(s'), since r is the instantaneous reward received and V(s') is the estimated value of the actually occurring next state. This is analogous to the sample-backup rule from value iteration; the only difference is that the sample is drawn from the real world rather than by simulating a known model. The key idea is that r + γV(s') is a sample of the value of V(s), and it is more likely to be correct because it incorporates the real r. If the learning rate α is adjusted properly (it must be slowly decreased) and the policy is held fixed, TD(0) is guaranteed to converge to the optimal value function.
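A sketch of the critic's TD(0) update as a standalone function; the experience tuples would come from whatever policy the RL component is currently executing (the ones below are made up).

    def td0_update(V, s, r, s_next, alpha, gamma):
        """TD(0): move V(s) toward the sample r + gamma * V(s').
        V is a dict from states to estimated values."""
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

    V = {"s0": 0.0, "s1": 0.0}
    for s, a, r, s_next in [("s0", 0, 1.0, "s1"), ("s1", 0, 0.0, "s0")]:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9)
    print(V)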
The TD(0) rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0. TD(0) looks only one step ahead when adjusting value estimates; although it will eventually arrive at the correct answer, it can take quite a while to do so. The general TD(λ) rule is similar to the TD(0) rule given above,

    V(u) := V(u) + \alpha \big( r + \gamma V(s') - V(s) \big) e(u) ,

but it is applied to every state according to its eligibility e(u), rather than just to the immediately previous state, s. One version of the eligibility trace is defined to be

    e(s) = \sum_{k=1}^{t} (\lambda \gamma)^{t-k} \delta_{s, s_k}, \quad \text{where } \delta_{s, s_k} = 1 \text{ if } s = s_k \text{ and } 0 \text{ otherwise.}

The eligibility of a state s is the degree to which it has been visited in the recent past; when a reinforcement is received, it is used to update all the states that have been recently visited, according to their eligibility. When λ = 0, this is equivalent to TD(0). When λ = 1, it is roughly equivalent to updating all the states according to the number of times they were visited by the end of a run. Note that we can update the eligibility online as follows:

    e(s) := \gamma \lambda e(s) + 1   if s = the current state,
    e(s) := \gamma \lambda e(s)       otherwise.

It is computationally more expensive to execute the general TD(λ), though it often converges considerably faster for large λ (Dayan; Dayan & Sejnowski). There has been some recent work on making the updates more efficient (Cichosz & Mulawka), and on changing the definition to make TD(λ) more consistent with the certainty-equivalent method (Singh & Sutton), which is discussed in Section 5.1.
4.2 Q-learning

The work of the two components of AHC can be accomplished in a unified manner by Watkins' Q-learning algorithm (Watkins; Watkins & Dayan). Q-learning is typically easier to implement. In order to understand Q-learning, we have to develop some additional notation. Let Q^*(s, a) be the expected discounted reinforcement of taking action a in state s, then continuing by choosing actions optimally. Note that V^*(s) is the value of s assuming the best action is taken initially, and so V^*(s) = \max_a Q^*(s, a). Q^*(s, a) can hence be written recursively as

    Q^*(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s') \max_{a'} Q^*(s', a') .

Note also that, since V^*(s) = \max_a Q^*(s, a), we have \pi^*(s) = \arg\max_a Q^*(s, a) as an optimal policy.

Because the Q function makes the action explicit, we can estimate the Q values on-line using a method essentially the same as TD(0), but also use them to define the policy,
because an action can be chosen just by taking the one with the maximum Q value for the current state. The Q-learning rule is

    Q(s, a) := Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big) ,

where ⟨s, a, r, s'⟩ is an experience tuple as described earlier. If each action is executed in each state an infinite number of times on an infinite run and α is decayed appropriately, the Q values will converge with probability 1 to Q^* (Watkins; Tsitsiklis; Jaakkola, Jordan, & Singh). Q-learning can also be extended to update states that occurred more than one step previously, as in TD(λ) (Peng & Williams).
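A minimal tabular Q-learning sketch, combining the update rule with the ε-greedy selection of Section 2.2.2; the environment interface and parameter values are illustrative assumptions.

    import random
    from collections import defaultdict

    def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning with epsilon-greedy exploration.
        `env` is assumed to expose reset(), actions(s), and step(a),
        the latter returning (reward, next_state, done)."""
        Q = defaultdict(float)      # Q[(s, a)], initialized to 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                acts = env.actions(s)
                if random.random() < epsilon:
                    a = random.choice(acts)                  # explore
                else:
                    a = max(acts, key=lambda x: Q[(s, x)])   # exploit
                r, s2, done = env.step(a)
                # backup toward a sample of r + gamma * max_a' Q(s', a')
                if done:
                    target = r
                else:
                    target = r + gamma * max(Q[(s2, x)] for x in env.actions(s2))
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q

Note that the backup uses the max over next-state actions regardless of which action the exploration rule will actually take next; this is the exploration insensitivity discussed below.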
When the Q values are nearly converged to their optimal values, it is appropriate for the agent to act greedily, taking, in each situation, the action with the highest Q value. During learning, however, there is a difficult exploitation versus exploration tradeoff to be made. There are no good, formally justified approaches to this problem in the general case; standard practice is to adopt one of the ad hoc methods discussed in Section 2.2.

AHC architectures seem to be more difficult to work with than Q-learning on a practical level. It can be hard to get the relative learning rates right in AHC so that the two components converge together. In addition, Q-learning is exploration insensitive: that is, the Q values will converge to the optimal values, independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough). This means that, although the exploration-exploitation issue must be addressed in Q-learning, the details of the exploration strategy will not affect the convergence of the learning algorithm. For these reasons, Q-learning is the most popular and seems to be the most effective model-free algorithm for learning from delayed reinforcement. It does not, however, address any of the issues involved in generalizing over large state and/or action spaces. In addition, it may converge quite slowly to a good policy.
4.3 Model-free Learning With Average Reward

As described, Q-learning can be applied to discounted infinite-horizon MDPs. It can also be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a reward-free absorbing state and the state is periodically reset.

Schwartz examined the problem of adapting Q-learning to an average-reward framework. Although his R-learning algorithm seems to exhibit convergence problems for some MDPs, several researchers have found the average-reward criterion closer to the true problem they wish to solve than a discounted criterion, and therefore prefer R-learning to Q-learning (Mahadevan).

With that in mind, researchers have studied the problem of learning optimal average-reward policies. Mahadevan surveyed model-based average-reward algorithms from a reinforcement-learning perspective and found several difficulties with existing algorithms. In particular, he showed that existing reinforcement-learning algorithms for average reward (and some dynamic programming algorithms) do not always produce bias-optimal policies. Jaakkola, Jordan and Singh described an average-reward learning algorithm with guaranteed convergence properties. It uses a Monte-Carlo component to estimate the expected future reward for each state as the agent moves through the environment.
slide-19
SLIDE 19 Reinf
  • r
cement Learning A Sur vey addition Bertsek as presen ts a Qlearninglik e algorithm for a v eragecase rew ard in his new textb
  • k
  • Although
this recen t w
  • rk
pro vides a m uc h needed theoretical foundation to this area
  • f
reinforcemen t learning man y imp
  • rtan
t problems remain unsolv ed
Computing Optimal Policies by Learning Models

The previous section showed how it is possible to learn an optimal policy without knowing the models T(s,a,s') or R(s,a), and without even learning those models en route. Although many of these methods are guaranteed to find optimal policies eventually, and use very little computation time per experience, they make extremely inefficient use of the data they gather, and therefore often require a great deal of experience to achieve good performance. In this section we still begin by assuming that we don't know the models in advance, but we examine algorithms that do operate by learning these models. These algorithms are especially important in applications in which computation is considered to be cheap and real-world experience costly.
Certainty Equivalent Methods

We begin with the most conceptually straightforward method: first, learn the T and R functions by exploring the environment and keeping statistics about the results of each action; next, compute an optimal policy using one of the methods described earlier for solving MDPs given a model. This method is known as certainty equivalence (Kumar & Varaiya, 1986).

There are some serious objections to this method. It makes an arbitrary division between the learning phase and the acting phase. How should it gather data about the environment initially? Random exploration might be dangerous, and in some environments is an immensely inefficient method of gathering data, requiring exponentially more data (Whitehead, 1991) than a system that interleaves experience gathering with policy-building more tightly (Koenig & Simmons, 1993); the figure below gives an example. The possibility of changes in the environment is also problematic: breaking up an agent's life into a pure learning phase and a pure acting phase runs a considerable risk that the optimal controller based on early life becomes, without detection, a suboptimal controller if the environment changes.

A variation on this idea is to apply certainty equivalence continually: the model is learned throughout the agent's lifetime and, at each step, the current model is used to compute an optimal policy and value function. This method makes very effective use of available data, but still ignores the question of exploration and is extremely computationally demanding, even for fairly small state spaces. Fortunately, there are a number of other model-based algorithms that are more practical.
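A minimal sketch of certainty equivalence for a finite MDP, assuming experience is available as a list of ⟨s, a, r, s'⟩ tuples: maximum-likelihood estimates of T and R are computed from counts, and value iteration is run on the estimated model. All names and data formats here are illustrative.

```python
def estimate_model(experience, states, actions):
    """Maximum-likelihood model from a list of (s, a, r, s2) tuples."""
    counts = {(s, a): {} for s in states for a in actions}
    reward_sum = {(s, a): 0.0 for s in states for a in actions}
    n = {(s, a): 0 for s in states for a in actions}
    for s, a, r, s2 in experience:
        counts[(s, a)][s2] = counts[(s, a)].get(s2, 0) + 1
        reward_sum[(s, a)] += r
        n[(s, a)] += 1
    T = {(s, a): {s2: c / n[(s, a)] for s2, c in counts[(s, a)].items()}
         for (s, a) in counts if n[(s, a)] > 0}
    R = {(s, a): reward_sum[(s, a)] / n[(s, a)] for (s, a) in n if n[(s, a)] > 0}
    return T, R

def value_iteration(T, R, states, actions, gamma=0.9, eps=1e-6):
    """Solve the estimated model for a value function, as in the planning methods."""
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R.get((s, a), 0.0)
                        + gamma * sum(p * V[s2]
                                      for s2, p in T.get((s, a), {}).items())
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new
```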
Figure: In this environment, due to Whitehead (1991), random exploration would take O(2^n) steps to reach the goal even once, whereas a more intelligent exploration strategy (e.g., "assume any untried action leads directly to the goal") would require only O(n^2) steps. (The figure shows a chain of states 1, 2, 3, ..., n leading to a goal.)

Dyna

Sutton's Dyna architecture (Sutton, 1990) exploits a middle ground, yielding strategies that are both more effective than model-free learning and more computationally efficient than the certainty-equivalence approach. It simultaneously uses experience to build a model (T̂ and R̂), uses experience to adjust the policy, and uses the model to adjust the policy.

Dyna operates in a loop of interaction with the environment. Given an experience tuple ⟨s, a, s', r⟩, it behaves as follows (a code sketch follows the list):

- Update the model, incrementing statistics for the transition from s to s' on action a and for receiving reward r for taking action a in state s. The updated models are T̂ and R̂.

- Update the policy at state s based on the newly updated model, using the rule

$$Q(s,a) := \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') \max_{a'} Q(s',a'),$$

which is a version of the value-iteration update for Q values.

- Perform k additional updates: choose k state-action pairs at random and update them according to the same rule as before:

$$Q(s_k,a_k) := \hat{R}(s_k,a_k) + \gamma \sum_{s'} \hat{T}(s_k,a_k,s') \max_{a'} Q(s',a').$$

- Choose an action a' to perform in state s', based on the Q values, but perhaps modified by an exploration strategy.
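The following sketch compresses one pass of this loop into Python, using a deterministic single-outcome model in place of the full statistics T̂ and R̂ for brevity; the environment interface and the ε-greedy choice are assumptions of the sketch.

```python
import random

def dyna_step(env, s, Q, model, actions, k, gamma=0.9, epsilon=0.1):
    """One pass of the Dyna loop, with a deterministic learned model for brevity
    (the full algorithm keeps transition counts T-hat and average rewards R-hat)."""
    # choose an action: epsilon-greedy stands in for "an exploration strategy"
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda act: Q.get((s, act), 0.0))
    s2, r = env.step(a)
    model[(s, a)] = (s2, r)                    # update the model
    # value-iteration-style backup at the real experience, plus k random backups
    pairs = [(s, a)] + [random.choice(list(model)) for _ in range(k)]
    for (sk, ak) in pairs:
        s_pred, r_pred = model[(sk, ak)]
        Q[(sk, ak)] = r_pred + gamma * max(Q.get((s_pred, act), 0.0)
                                           for act in actions)
    return s2
```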
The Dyna algorithm requires about k times the computation of Q-learning per instance, but this is typically vastly less than for the naive model-based method. A reasonable value of k can be determined based on the relative speeds of computation and of taking action.

The figure below shows a grid world in which, in each cell, the agent has four actions (N, S, E, W), and transitions are made deterministically to an adjacent cell, unless there is a block, in which case no movement occurs. As the accompanying table shows, Dyna requires an order of magnitude fewer steps of experience than does Q-learning to arrive at an optimal policy. Dyna requires about six times more computational effort, however.
Figure: A 3277-state grid world. This was formulated as a shortest-path reinforcement-learning problem, which yields the same result as if a reward of 1 were given at the goal, a reward of zero elsewhere, and a discount factor were used.

                          Steps before convergence   Backups before convergence
    Q-learning                       …                          …
    Dyna                             …                          …
    prioritized sweeping             …                          …

Table: The performance of three algorithms described in the text. All methods used the exploration heuristic of "optimism in the face of uncertainty": any state not previously visited was assumed by default to be a goal state. Q-learning used its optimal learning rate parameter for a deterministic maze. Dyna and prioritized sweeping were permitted to take k backups per transition. For prioritized sweeping, the priority queue often emptied before all backups were used.
Prioritized Sweeping / Queue-Dyna

Although Dyna is a great improvement on previous methods, it suffers from being relatively undirected. It is particularly unhelpful when the goal has just been reached or when the agent is stuck in a dead end: it continues to update random state-action pairs, rather than concentrating on the interesting parts of the state space. These problems are addressed by prioritized sweeping (Moore & Atkeson, 1993) and Queue-Dyna (Peng & Williams, 1993), which are two independently-developed but very similar techniques. We will describe prioritized sweeping in some detail.

The algorithm is similar to Dyna, except that updates are no longer chosen at random, and values are now associated with states (as in value iteration) instead of state-action pairs (as in Q-learning). To make appropriate choices, we must store additional information in the model. Each state remembers its predecessors: the states that have a non-zero transition probability to it under some action. In addition, each state has a priority, initially set to zero.

Instead of updating k random state-action pairs, prioritized sweeping updates the k states with the highest priority. For each high-priority state s, it works as follows (a code sketch follows the list):

- Remember the current value of the state: V_old = V(s).

- Update the state's value:

$$V(s) := \max_a \left( \hat{R}(s,a) + \gamma \sum_{s'} \hat{T}(s,a,s') V(s') \right).$$

- Set the state's priority back to 0.

- Compute the value change Δ = |V_old − V(s)|.

- Use Δ to modify the priorities of the predecessors of s.

If we have updated the V value for state s' and it has changed by amount Δ, then the immediate predecessors of s' are informed of this event: any state s for which there exists an action a such that T̂(s,a,s') ≠ 0 has its priority promoted to Δ · T̂(s,a,s'), unless its priority already exceeded that value.

The global behavior of this algorithm is that, when a real-world transition is surprising (the agent happens upon a goal state, for instance), lots of computation is directed to propagate this new information back to relevant predecessor states. When the real-world transition is boring (the actual result is very similar to the predicted result), computation continues in the most deserving part of the space.

Running prioritized sweeping on the grid-world problem above, we see a large improvement over Dyna. The optimal policy is reached in about half the number of steps of experience and one-third the computation as Dyna required, and therefore about twenty times fewer steps and twice the computational effort of Q-learning.
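A sketch of the backup loop in Python, assuming the model estimates T̂ and R̂ and the predecessor lists described above are available; `heapq` supplies the priority queue (priorities are negated because Python's heap is a min-heap, and re-pushing duplicate entries stands in for promoting a priority in place).

```python
import heapq

def prioritized_sweeping(V, T, R, predecessors, queue, k, gamma=0.9, theta=1e-4):
    """Perform up to k highest-priority state backups.

    T[(s, a)] maps successor -> probability, R[(s, a)] is the estimated reward,
    predecessors[s2] is a list of (s, a, T(s, a, s2)) triples, and queue holds
    (-priority, state) pairs; states are assumed orderable for tie-breaking."""
    for _ in range(k):
        if not queue:
            break                      # the queue often empties before k backups
        _, s = heapq.heappop(queue)
        v_old = V.get(s, 0.0)
        # value-iteration backup: V(s) = max_a [ R(s,a) + gamma * sum_s' T V(s') ]
        V[s] = max(R[(s, a)]
                   + gamma * sum(p * V.get(s2, 0.0)
                                 for s2, p in T[(s, a)].items())
                   for (ss, a) in T if ss == s)
        delta = abs(v_old - V[s])
        # promote each predecessor in proportion to delta * T(s_pre, a, s)
        for (s_pre, a, p) in predecessors.get(s, []):
            if delta * p > theta:
                heapq.heappush(queue, (-delta * p, s_pre))
```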
Other Model-Based Methods

Methods proposed for solving MDPs given a model can be used in the context of model-based reinforcement-learning methods as well.

RTDP (real-time dynamic programming) (Barto, Bradtke, & Singh, 1995) is another model-based method that uses Q-learning to concentrate computational effort on the areas of the state space that the agent is most likely to occupy. It is specific to problems in which the agent is trying to achieve a particular goal state and the reward everywhere else is zero. By taking into account the start state, it can find a short path from the start to the goal, without necessarily visiting the rest of the state space.

The Plexus planning system (Dean, Kaelbling, Kirman, & Nicholson, 1993; Kirman, 1994) exploits a similar intuition. It starts by making an approximate version of the MDP that is much smaller than the original one. The approximate MDP contains a set of states, called the envelope, that includes the agent's current state and the goal state, if there is one. States that are not in the envelope are summarized by a single "out" state. The planning process alternates between finding an optimal policy on the approximate MDP and adding useful states to the envelope. Action may take place in parallel with planning, in which case irrelevant states are also pruned out of the envelope.
Generalization

All of the previous discussion has tacitly assumed that it is possible to enumerate the state and action spaces and store tables of values over them. Except in very small environments, this means impractical memory requirements. It also makes inefficient use of experience. In a large, smooth state space we generally expect similar states to have similar values and similar optimal actions. Surely, therefore, there should be some more compact representation than a table. Most problems will have continuous or large discrete state spaces; some will have large or continuous action spaces. The problem of learning in large spaces is addressed through generalization techniques, which allow compact storage of learned information and transfer of knowledge between similar states and actions.

The large literature on generalization techniques from inductive concept learning can be applied to reinforcement learning. However, techniques often need to be tailored to the specific details of the problem. In the following sections, we explore the application of standard function-approximation techniques, adaptive resolution models, and hierarchical methods to the problem of reinforcement learning.

The reinforcement-learning architectures and algorithms discussed above have included the storage of a variety of mappings, including S → A (policies), S → ℝ (value functions), S × A → ℝ (Q functions and rewards), S × A → S (deterministic transitions), and S × A × S → [0, 1] (transition probabilities). Some of these mappings, such as transitions and immediate rewards, can be learned using straightforward supervised learning, and can be handled by any of the wide variety of function-approximation techniques for supervised learning that support noisy training examples. Popular techniques include various neural-network methods (Rumelhart & McClelland, 1986), fuzzy logic (Berenji, 1991; Lee, 1991), CMAC (Albus, 1981), and local memory-based methods (Moore, Atkeson, & Schaal), such as generalizations of nearest-neighbor methods. Other mappings, especially the policy mapping, typically need specialized algorithms, because training sets of input-output pairs are not available.
Generalization over Input

A reinforcement-learning agent's current state plays a central role in its selection of reward-maximizing actions. Viewing the agent as a state-free black box, a description of the current state is its input. Depending on the agent architecture, its output is either an action selection or an evaluation of the current state that can be used to select an action. The problem of deciding how the different aspects of an input affect the value of the output is sometimes called the structural credit-assignment problem. This section examines approaches to generating actions or evaluations as a function of a description of the agent's current state.

The first group of techniques covered here is specialized to the case when reward is not delayed; the second group is more generally applicable.

Immediate Reward

When the agent's actions do not influence state transitions, the resulting problem becomes one of choosing actions to maximize immediate reward as a function of the agent's current state. These problems bear a resemblance to the bandit problems discussed earlier, except that the agent should condition its action selection on the current state. For this reason, this class of problems has been described as associative reinforcement learning.

The algorithms in this section address the problem of learning from immediate boolean reinforcement, where the state is vector-valued and the action is a boolean vector. Such algorithms can be, and have been, used in the context of delayed reinforcement, for instance as the RL component in the AHC architecture described earlier. They can also be generalized to real-valued reward through reward comparison methods (Sutton, 1984).
CRBP. The complementary reinforcement backpropagation algorithm (CRBP) (Ackley & Littman, 1990) consists of a feed-forward network mapping an encoding of the state to an encoding of the action. The action is determined probabilistically from the activation of the output units: if output unit i has activation y_i, then bit i of the action vector has value 1 with probability y_i, and 0 otherwise. Any neural-network supervised training procedure can be used to adapt the network, as follows. If the result of generating action a is r = 1, then the network is trained with input-output pair ⟨s, a⟩. If the result is r = 0, then the network is trained with input-output pair ⟨s, ā⟩, where ā = (1 − a_1, ..., 1 − a_n).

The idea behind this training rule is that, whenever an action fails to generate reward, CRBP will try to generate an action that is different from the current choice. Although it seems like the algorithm might oscillate between an action and its complement, that does not happen: one step of training the network will only change the action slightly, and since the output probabilities will tend to move toward 0.5, this makes action selection more random and increases search. The hope is that the random distribution will generate an action that works better, and then that action will be reinforced.
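The heart of CRBP is how it constructs training targets; a sketch, where `net.predict`, `net.train`, and `execute` are assumed stand-ins for the network and the environment:

```python
import random

def crbp_trial(net, s, execute):
    """One CRBP trial. `net` is any supervised network with predict/train,
    and `execute` runs the action and returns boolean reinforcement r."""
    y = net.predict(s)                            # output activations in [0, 1]
    a = [1 if random.random() < yi else 0 for yi in y]
    r = execute(s, a)
    if r == 1:
        target = a                                # reinforce the chosen action
    else:
        target = [1 - bit for bit in a]           # train toward the complement
    net.train(s, target)                          # any supervised procedure
    return a, r
```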
ARC. The associative reinforcement comparison (ARC) algorithm (Sutton, 1984) is an instance of the AHC architecture for the case of boolean actions, consisting of two feed-forward networks. One learns the value of situations; the other learns a policy. These can be simple linear networks or can have hidden units.

In the simplest case, the entire system learns only to optimize immediate reward. First, let us consider the behavior of the network that learns the policy, a mapping from a vector describing s to a 0 or 1. If the output unit has activation y, then a, the action generated, will be 1 if y + ν > 0, where ν is normal noise, and 0 otherwise.

The adjustment for the output unit is, in the simplest case,

$$e = r \left( a - \tfrac{1}{2} \right),$$

where the first factor is the reward received for taking the most recent action and the second encodes which action was taken. The actions are encoded as 0 and 1, so a − 1/2 always has the same magnitude; if the reward and the action have the same sign, then action 1 will be made more likely, otherwise action 0 will be.

As described, the network will tend to seek actions that give positive reward. To extend this approach to maximize reward, we can compare the reward to some baseline, b. This changes the adjustment to

$$e = (r - b) \left( a - \tfrac{1}{2} \right),$$

where b is the output of the second network. The second network is trained in a standard supervised mode to estimate r as a function of the input state s.

Variations of this approach have been used in a variety of applications (Anderson, 1986; Barto et al., 1983; Lin, 1993b; Sutton, 1984).
REINFORCE Algorithms. Williams (1992) studied the problem of choosing actions to maximize immediate reward. He identified a broad class of update rules that perform gradient descent on the expected reward, and showed how to integrate these rules with backpropagation. This class, called REINFORCE algorithms, includes the linear reward-inaction algorithm described earlier as a special case. The generic REINFORCE update for a parameter w_ij can be written

$$\Delta w_{ij} = \alpha_{ij} (r - b_{ij}) \frac{\partial \ln g_i}{\partial w_{ij}},$$

where α_ij is a non-negative factor, r the current reinforcement, b_ij a reinforcement baseline, and g_i the probability density function used to randomly generate actions based on unit activations. Both α_ij and b_ij can take on different values for each w_ij; however, when α_ij is constant throughout the system, the expected update is exactly in the direction of the expected reward gradient. Otherwise, the update is in the same half-space as the gradient, but not necessarily in the direction of steepest increase. Williams points out that the choice of baseline b_ij can have a profound effect on the convergence speed of the algorithm.
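For a single Bernoulli output unit with sigmoid activation, ∂ ln g / ∂ w_i reduces to (a − y) x_i, giving the concrete update sketched below; the function names and the constant baseline are assumptions of the sketch.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_step(w, x, reward_fn, alpha=0.1, baseline=0.0):
    """One REINFORCE update for a single Bernoulli unit with y = sigmoid(w . x):
    d ln g / d w_i = (a - y) * x_i, so
    delta w_i = alpha * (r - b) * (a - y) * x_i."""
    y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # P(action = 1)
    a = 1 if random.random() < y else 0                # stochastic action
    r = reward_fn(a)                                   # immediate reinforcement
    return [wi + alpha * (r - baseline) * (a - y) * xi for wi, xi in zip(w, x)]
```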
Logic-Based Methods. Another strategy for generalization in reinforcement learning is to reduce the learning problem to an associative problem of learning boolean functions. A boolean function has a vector of boolean inputs and a single boolean output. Taking inspiration from mainstream machine-learning work, Kaelbling developed two algorithms for learning boolean functions from reinforcement: one uses the bias of k-DNF to drive the generalization process (Kaelbling, 1994b); the other searches the space of syntactic descriptions of functions using a simple generate-and-test method (Kaelbling, 1994a).

The restriction to a single boolean output makes these techniques difficult to apply. In very benign learning situations, it is possible to extend this approach to use a collection of learners to independently learn the individual bits that make up a complex output. In general, however, that approach suffers from the problem of very unreliable reinforcement: if a single learner generates an inappropriate output bit, all of the learners receive a low reinforcement value. The cascade method (Kaelbling, 1994b) allows a collection of learners to be trained collectively to generate appropriate joint outputs; it is considerably more reliable, but can require additional computational effort.
Delayed Reward

Another method for allowing reinforcement-learning techniques to be applied in large state spaces is modeled on value iteration and Q-learning: a function approximator is used to represent the value function, by mapping a state description to a value. Many researchers have experimented with this approach: Boyan and Moore (1995) used local memory-based methods in conjunction with value iteration; Lin used backpropagation networks for Q-learning; Watkins (1989) used CMAC for Q-learning; Tesauro (1992, 1995) used backpropagation for learning the value function in backgammon (described in the applications section below); Zhang and Dietterich (1995) used backpropagation and TD(λ) to learn good strategies for job-shop scheduling.

Although there have been some positive examples, in general there are unfortunate interactions between function approximation and the learning rules. In discrete environments there is a guarantee that any operation that updates the value function (according to the Bellman equations) can only reduce the error between the current value function and the optimal value function. This guarantee no longer holds when generalization is used. These issues are discussed by Boyan and Moore (1995), who give some simple examples of value-function errors growing arbitrarily large when generalization is used with value iteration. Their solution to this, applicable only to certain classes of problems, discourages such divergence by only permitting updates whose estimated values can be shown to be near-optimal via a battery of Monte-Carlo experiments.
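As a concrete instance of the scheme, the sketch below applies the Q-learning error term to a linear approximator over state-action features; the feature map φ is an assumption of the sketch, and, as the counterexamples just discussed illustrate, nothing in this rule by itself guarantees convergence.

```python
def linear_q_update(w, phi, s, a, r, s2, actions, alpha=0.01, gamma=0.9):
    """One Q-learning step on the approximation Q(s, a) = w . phi(s, a).
    phi is a user-supplied feature map returning a list of floats."""
    f = phi(s, a)
    q_sa = sum(wi * fi for wi, fi in zip(w, f))
    q_next = max(sum(wi * fi for wi, fi in zip(w, phi(s2, a2)))
                 for a2 in actions)
    td_error = r + gamma * q_next - q_sa        # same error term as tabular rule
    return [wi + alpha * td_error * fi for wi, fi in zip(w, f)]
```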
Thrun and Schwartz (1993) theorize that function approximation of value functions is also dangerous because the errors in value functions due to generalization can become compounded by the max operator in the definition of the value function.

Several recent results (Gordon, 1995; Tsitsiklis & Van Roy, 1996) show how the appropriate choice of function approximator can guarantee convergence, though not necessarily to the optimal values. Baird's residual gradient technique (Baird, 1995) provides guaranteed convergence to locally optimal solutions.

Perhaps the gloominess of these counterexamples is misplaced. Boyan and Moore (1995) report that their counterexamples can be made to work with problem-specific hand-tuning, despite the unreliability of untuned algorithms that provably converge in discrete domains. Sutton (1996) shows how modified versions of Boyan and Moore's examples can converge successfully. An open question is whether general principles, ideally supported by theory, can help us understand when value-function approximation will succeed. In Sutton's comparative experiments with Boyan and Moore's counterexamples, he changes four aspects of the experiments:

- small changes to the task specifications;
- a very different kind of function approximator, CMAC (Albus, 1981), which has weak generalization;
- a different learning algorithm, SARSA (Rummery & Niranjan, 1994), instead of value iteration; and
- a different training regime: Boyan and Moore sampled states uniformly in state space, whereas Sutton's method sampled along empirical trajectories.

There are intuitive reasons to believe that the fourth factor is particularly important, but more careful research is needed.
Adaptive Resolution Models

In many cases, what we would like to do is partition the environment into regions of states that can be considered the same for the purposes of learning and generating actions. Without detailed prior knowledge of the environment, it is very difficult to know what granularity or placement of partitions is appropriate. This problem is overcome in methods that use adaptive resolution: during the course of learning, a partition is constructed that is appropriate to the environment.

Decision Trees. In environments that are characterized by a set of boolean or discrete-valued variables, it is possible to learn compact decision trees for representing Q values. The G-learning algorithm (Chapman & Kaelbling, 1991) works as follows. It starts by assuming that no partitioning is necessary, and tries to learn Q values for the entire environment as if it were one state. In parallel with this process, it gathers statistics based on individual input bits: it asks whether there is some bit b in the state description such that the Q values for states in which b = 1 are significantly different from the Q values for states in which b = 0. If such a bit is found, it is used to split the decision tree. Then the process is repeated in each of the leaves. This method was able to learn very small representations of the Q function in the presence of an overwhelming number of irrelevant, noisy state attributes. It outperformed Q-learning with backpropagation in a simple video-game environment, and was used by McCallum (1995), in conjunction with other techniques for dealing with partial observability, to learn behaviors in a complex driving simulator. It cannot, however, acquire partitions in which attributes are only significant in combination (such as those needed to solve parity problems).
Variable Resolution Dynamic Programming. The VRDP algorithm (Moore, 1991) enables conventional dynamic programming to be performed in real-valued multivariate state spaces where straightforward discretization would fall prey to the curse of dimensionality. A kd-tree (similar to a decision tree) is used to partition state space into coarse regions. The coarse regions are refined into detailed regions, but only in parts of the state space which are predicted to be important. This notion of importance is obtained by running mental trajectories through state space. The algorithm proved effective on a number of problems for which full high-resolution arrays would have been impractical. It has the disadvantage of requiring a guess at an initially valid trajectory through state space.
Figure: (a) A two-dimensional maze problem. The point robot must find a path from start to goal without crossing any of the barrier lines. (b) The path taken by Parti-game during the entire first trial. It begins with intense exploration to find a route out of the almost entirely enclosed start region. Having eventually reached a sufficiently high resolution, it discovers the gap and proceeds greedily towards the goal, only to be temporarily blocked by the goal's barrier region. (c) The second trial.
Parti-game Algorithm. Moore's Parti-game algorithm (Moore, 1994) is another solution to the problem of learning to achieve goal configurations in deterministic high-dimensional continuous spaces by learning an adaptive-resolution model. It also divides the environment into cells; but in each cell, the actions available consist of aiming at the neighboring cells (this aiming is accomplished by a local controller, which must be provided as part of the problem statement). The graph of cell transitions is solved for shortest paths in an online, incremental manner, but a minimax criterion is used to detect when a group of cells is too coarse to prevent movement between obstacles or to avoid limit cycles. The offending cells are split to higher resolution. Eventually, the environment is divided up just enough to choose appropriate actions for achieving the goal, but no unnecessary distinctions are made. An important feature is that, as well as reducing memory and computational requirements, it also structures exploration of state space in a multi-resolution manner. Given a failure, the agent will initially try something very different to rectify the failure, and only resort to small local changes when all the qualitatively different strategies have been exhausted.

Panel (a) of the figure above shows a two-dimensional continuous maze; panel (b) shows the performance of a robot using the Parti-game algorithm during the very first trial; panel (c) shows the second trial, started from a slightly different position. This is a very fast algorithm, learning policies in spaces of up to nine dimensions in less than a minute. The restriction of the current implementation to deterministic environments limits its applicability, however. McCallum (1995) suggests some related tree-structured methods.
Generalization over Actions

The networks described earlier in this section generalize over state descriptions presented as inputs. They also produce outputs in a discrete, factored representation, and thus could be seen as generalizing over actions as well. In cases such as this, when actions are described combinatorially, it is important to generalize over actions to avoid keeping separate statistics for the huge number of actions that can be chosen. In continuous action spaces, the need for generalization is even more pronounced.

When estimating Q values using a neural network, it is possible to use either a distinct network for each action, or a network with a distinct output for each action. When the action space is continuous, neither approach is possible. An alternative strategy is to use a single network with both the state and action as input and the Q value as output. Training such a network is not conceptually difficult, but using the network to find the optimal action can be a challenge. One method is to do a local gradient-ascent search on the action, in order to find one with high value (Baird & Klopf, 1993).

Gullapalli (1990) has developed a "neural" reinforcement-learning unit for use in continuous action spaces. The unit generates actions with a normal distribution, and adjusts the mean and variance based on previous experience. When the chosen actions are not performing well, the variance is high, resulting in exploration of the range of choices. When an action performs well, the mean is moved in that direction and the variance decreased, resulting in a tendency to generate more action values near the successful one. This method was successfully employed to learn to control a robot arm with many continuous degrees of freedom.
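A caricature of such a unit in Python; the specific adaptation rules below are illustrative assumptions (Gullapalli's exact rules differ), with reward assumed scaled to [0, 1]:

```python
import random

class GaussianActionUnit:
    """Stochastic real-valued unit: action ~ Normal(mu, sigma)."""
    def __init__(self, mu=0.0, sigma=1.0, alpha=0.1):
        self.mu, self.sigma, self.alpha = mu, sigma, alpha
        self.r_baseline = 0.0                # running estimate of expected reward

    def act(self):
        return random.gauss(self.mu, self.sigma)

    def update(self, action, reward):
        # move the mean toward actions that beat the current reward estimate
        self.mu += self.alpha * (reward - self.r_baseline) * (action - self.mu)
        # good performance -> narrow the distribution; poor -> keep exploring
        self.sigma = max(0.01, 1.0 - reward)
        self.r_baseline += self.alpha * (reward - self.r_baseline)
```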
Hierarchical Methods

Another strategy for dealing with large state spaces is to treat them as a hierarchy of learning problems. In many cases, hierarchical solutions introduce slight sub-optimality in performance, but potentially gain a good deal of efficiency in execution time, learning time, and space.

Hierarchical learners are commonly structured as gated behaviors, as shown in the figure below. There is a collection of behaviors that map environment states into low-level actions, and a gating function that decides, based on the state of the environment, which behavior's actions should be switched through and actually executed. Maes and Brooks (1990) used a version of this architecture in which the individual behaviors were fixed a priori and the gating function was learned from reinforcement. Mahadevan and Connell (1991b) used the dual approach: they fixed the gating function, and supplied reinforcement functions for the individual behaviors, which were learned. Lin (1993a) and Dorigo and Colombetti (1994) both used this approach, first training the behaviors and then training the gating function. Many of the other hierarchical learning methods can be cast in this framework.
Feudal Q-learning

Feudal Q-learning (Dayan & Hinton, 1993; Watkins, 1989) involves a hierarchy of learning modules. In the simplest case, there is a high-level master and a low-level slave. The master receives reinforcement from the external environment. Its actions consist of commands that it can give to the low-level learner. When the master generates a particular command to the slave, it must reward the slave for taking actions that satisfy the command, even if they do not result in external reinforcement. The master then learns a mapping from states to commands. The slave learns a mapping from commands and states to external actions. The set of commands and their associated reinforcement functions are established in advance of the learning.

Figure: A structure of gated behaviors. (The figure shows the state s feeding behaviors b1, b2, b3 and a gating function g, which selects the action a.)

This is really an instance of the general gated-behaviors approach, in which the slave can execute any of the behaviors depending on its command. The reinforcement functions for the individual behaviors (commands) are given, but learning takes place simultaneously at both the high and low levels.
Compositional Q-learning

Singh's compositional Q-learning (Singh, 1992a, 1992b) (C-QL) consists of a hierarchy based on the temporal sequencing of subgoals. The elemental tasks are behaviors that achieve some recognizable condition. The high-level goal of the system is to achieve some set of conditions in sequential order. The achievement of the conditions provides reinforcement for the elemental tasks, which are trained first to achieve individual subgoals. Then the gating function learns to switch the elemental tasks in order to achieve the appropriate high-level sequential goal. This method was used by Tham and Prager (1994) to learn to control a simulated multi-link robot arm.
Hierarchical Distance to Goal

Especially if we consider reinforcement-learning modules to be part of larger agent architectures, it is important to consider problems in which goals are dynamically input to the learner. Kaelbling's HDG algorithm (Kaelbling, 1993a) uses a hierarchical approach to solving problems when goals of achievement (the agent should get to a particular state as quickly as possible) are given to an agent dynamically.

The HDG algorithm works by analogy with navigation in a harbor. The environment is partitioned (a priori, though more recent work (Ashar, 1994) addresses the case of learning the partition) into a set of regions whose centers are known as landmarks. If the agent is currently in the same region as the goal, then it uses low-level actions to move to the goal. If not, then high-level information is used to determine the next landmark on the shortest path from the agent's closest landmark to the goal's closest landmark. Then the agent uses low-level information to aim toward that next landmark. If errors in action cause deviations in the path, there is no problem; the best aiming point is recomputed on every step.

Figure: An example of a partially observable environment. (The figure shows an office with two indistinguishable "hall" states on the way to a printer worth +100; the arcs are labeled with probabilities 2/5, 1/5, and 2/5.)
Partially Observable Environments

In many real-world environments, it will not be possible for the agent to have perfect and complete perception of the state of the environment. Unfortunately, complete observability is necessary for learning methods based on MDPs. In this section, we consider the case in which the agent makes observations of the state of the environment, but these observations may be noisy and provide incomplete information. In the case of a robot, for instance, it might observe whether it is in a corridor, an open room, a T-junction, etc., and those observations might be error-prone. This problem is also referred to as the problem of incomplete perception, perceptual aliasing, or hidden state.

In this section, we will consider extensions to the basic MDP framework for solving partially observable problems. The resulting formal model is called a partially observable Markov decision process, or POMDP.
State-Free Deterministic Policies

The most naive strategy for dealing with partial observability is to ignore it: that is, to treat the observations as if they were the states of the environment, and try to learn to behave. The figure above shows a simple environment in which the agent is attempting to get to the printer from an office. If it moves from the office, there is a good chance that the agent will end up in one of two places that look like "hall", but that require different actions for getting to the printer. If we consider these states to be the same, then the agent cannot possibly behave optimally. But how well can it do?

The resulting problem is not Markovian, and Q-learning cannot be guaranteed to converge. Small breaches of the Markov requirement are well handled by Q-learning, but it is possible to construct simple environments that cause Q-learning to oscillate (Chrisman & Littman, 1993). It is possible to use a model-based approach, however: act according to some policy and gather statistics about the transitions between observations, then solve for the optimal policy based on those observations. Unfortunately, when the environment is not Markovian, the transition probabilities depend on the policy being executed, so this new policy will induce a new set of transition probabilities. This approach may yield plausible results in some cases, but again there are no guarantees.

It is reasonable, though, to ask what the optimal policy (a mapping from observations to actions, in this case) is. It is NP-hard (Littman, 1994b) to find this mapping, and even the best mapping can have very poor performance. In the case of our agent trying to get to the printer, for instance, any deterministic state-free policy takes an infinite number of steps to reach the goal on average.
State-Free Stochastic Policies

Some improvement can be gained by considering stochastic policies: mappings from observations to probability distributions over actions. If there is randomness in the agent's actions, it will not get stuck in the hall forever. Jaakkola, Singh, and Jordan (1995) have developed an algorithm for finding locally-optimal stochastic policies, but finding a globally optimal policy is still NP-hard.

In our example, it turns out that the optimal stochastic policy is for the agent, when in a state that looks like a hall, to go east with probability 2 − √2 and west with probability √2 − 1. This policy can be found by solving a simple (in this case, quadratic) program. The fact that such a simple example can produce irrational numbers gives some indication that it is a difficult problem to solve exactly.
Policies with Internal State

The only way to behave truly effectively in a wide range of environments is to use memory of previous actions and observations to disambiguate the current state. There are a variety of approaches to learning policies with internal state.

Recurrent Q-learning. One intuitively simple approach is to use a recurrent neural network to learn Q values. The network can be trained using backpropagation through time (or some other suitable technique), and learns to retain "history features" to predict value. This approach has been used by a number of researchers (Meeden, McGraw, & Blank, 1993; Lin & Mitchell, 1992; Schmidhuber, 1991b). It seems to work effectively on simple problems, but can suffer from convergence to local optima on more complex problems.

Classifier Systems. Classifier systems (Holland, 1975; Goldberg, 1989) were explicitly developed to solve problems with delayed reward, including those requiring short-term memory. The internal mechanism typically used to pass reward back through chains of decisions, called the bucket brigade algorithm, bears a close resemblance to Q-learning. In spite of some early successes, the original design does not appear to handle partially observed environments robustly.

Recently, this approach has been re-examined using insights from the reinforcement-learning literature, with some success. Dorigo did a comparative study of Q-learning and classifier systems (Dorigo & Bersini, 1994).
Cliff and Ross (1994) start with Wilson's zeroth-level classifier system (Wilson, 1994) and add one- and two-bit memory registers. They find that, although their system can learn to use short-term memory registers effectively, the approach is unlikely to scale to more complex environments. Dorigo and Colombetti applied classifier systems to a moderately complex problem of learning robot behavior from immediate reinforcement (Dorigo, 1995; Dorigo & Colombetti, 1994).

Finite-History-Window Approach. One way to restore the Markov property is to allow decisions to be based on the history of recent observations, and perhaps actions. Lin and Mitchell (1992) used a fixed-width finite history window to learn a pole-balancing task. McCallum (1995) describes the "utile suffix memory", which learns a variable-width window that serves simultaneously as a model of the environment and a finite-memory policy. This system has had excellent results in a very complex driving-simulation domain (McCallum, 1995). Ring (1994) has a neural-network approach that uses a variable history window, adding history when necessary to disambiguate situations.
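The fixed-width variant amounts to running any of the earlier algorithms on an augmented state; a small sketch, where the window length k is the design parameter:

```python
from collections import deque

class HistoryWindow:
    """Present the last k observation/action pairs as a single hashable 'state',
    so that the earlier tabular algorithms can be applied unchanged."""
    def __init__(self, k):
        self.window = deque(maxlen=k)

    def observe(self, observation, last_action=None):
        self.window.append((observation, last_action))
        return tuple(self.window)     # usable as a Q-table key
```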
POMDP Approach. Another strategy consists of using hidden Markov model (HMM) techniques to learn a model of the environment, including the hidden state, then to use that model to construct a perfect-memory controller (Cassandra, Kaelbling, & Littman, 1994; Lovejoy, 1991; Monahan, 1982).

Chrisman (1992) showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum (1995), also gave heuristic state-splitting rules to attempt to learn the smallest possible model for a given environment. The resulting model can then be used to integrate information from the agent's observations in order to make decisions.

Figure: Structure of a POMDP agent. (The figure shows the observation i and action a entering the state estimator SE, whose belief state b feeds the policy π.)

The figure illustrates the basic structure for a perfect-memory controller. The component on the left is the state estimator, which computes the agent's belief state b as a function of the old belief state, the last action a, and the current observation i. In this context, a belief state is a probability distribution over states of the environment, indicating the likelihood, given the agent's past experience, that the environment is actually in each of those states. The state estimator can be constructed straightforwardly using the estimated world model and Bayes' rule.
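The state estimator is just Bayes' rule applied to the learned model; a sketch for finite state and observation sets, where O(s', o), the probability of observing o in state s', is notation assumed for the sketch:

```python
def update_belief(b, a, o, states, T, O):
    """State estimator SE via Bayes' rule:
    b'(s') is proportional to O(s', o) * sum_s T(s, a, s') * b(s)."""
    b_new = {s2: O[(s2, o)] * sum(T[(s, a)].get(s2, 0.0) * b[s] for s in states)
             for s2 in states}
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation impossible under the current model")
    return {s2: p / norm for s2, p in b_new.items()}
```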
Now we are left with the problem of finding a policy mapping belief states into action. This problem can be formulated as an MDP, but it is difficult to solve using the techniques described earlier, because the input space is continuous. Chrisman's approach (1992) does not take into account future uncertainty, but yields a policy after a small amount of computation. A standard approach from the operations-research literature is to solve for the optimal policy (or a close approximation thereof) based on its representation as a piecewise-linear and convex function over the belief space. This method is computationally intractable, but may serve as inspiration for methods that make further approximations (Cassandra et al., 1994; Littman, Cassandra, & Kaelbling, 1995a).
Reinforcement Learning Applications

One reason that reinforcement learning is popular is that it serves as a theoretical tool for studying the principles of agents learning to act. But it is unsurprising that it has also been used by a number of researchers as a practical computational tool for constructing autonomous systems that improve themselves with experience. These applications have ranged from robotics, to industrial manufacturing, to combinatorial search problems such as computer game playing.

Practical applications provide a test of the efficacy and usefulness of learning algorithms. They are also an inspiration for deciding which components of the reinforcement-learning framework are of practical importance. For example, a researcher with a real robotic task can provide a data point to questions such as:

- How important is optimal exploration? Can we break the learning period into exploration phases and exploitation phases?
- What is the most useful model of long-term reward: finite horizon? discounted? infinite horizon?
- How much computation is available between agent decisions, and how should it be used?
- What prior knowledge can we build into the system, and which algorithms are capable of using that knowledge?

Let us examine a set of practical applications of reinforcement learning, while bearing these questions in mind.
Game Playing

Game playing has dominated the Artificial Intelligence world as a problem domain ever since the field was born. Two-player games do not fit into the established reinforcement-learning framework, since the optimality criterion for games is not one of maximizing reward in the face of a fixed environment, but one of maximizing reward against an optimal adversary (minimax). Nonetheless, reinforcement-learning algorithms can be adapted to work for a very general class of games (Littman, 1994a), and many researchers have used reinforcement learning in these environments. One application, spectacularly far ahead of its time, was Samuel's checkers playing system (Samuel, 1959). This learned a value function represented by a linear function approximator, and employed a training scheme similar to the updates used in value iteration, temporal differences, and Q-learning.

More recently, Tesauro (1992, 1994, 1995) applied the temporal difference algorithm to backgammon. Backgammon has approximately 10^20 states, making table-based reinforcement learning impossible.
Instead, Tesauro used a backpropagation-based three-layer neural network as a function approximator for the value function

    Board Position → Probability of victory for current player.

                  Training Games   Hidden Units   Results
    Basic               …               …         Poor
    TD 1.0              …               …         Lost by … points in … games
    TD 2.0              …               …         Lost by … points in … games
    TD 2.1              …               …         Lost by … point in … games

Table: TD-Gammon's performance in games against the top human professional players. A backgammon tournament involves playing a series of games for points, until one player reaches a set target. TD-Gammon won none of these tournaments, but came sufficiently close that it is now considered one of the best few players in the world.
ersions
  • f
the learning algorithm w ere used The rst whic h w e will call Basic TD Gammon used v ery little predened kno wledge
  • f
the game and the represen tation
  • f
a b
  • ard
p
  • sition
w as virtually a ra w enco ding sucien tly p
  • w
erful
  • nly
to p ermit the neural net w
  • rk
to distinguish b et w een conceptually dieren t p
  • sitions
The second TDGammon w as pro vided with the same ra w state information supplemen ted b y a n um b er
  • f
hand crafted features
  • f
bac kgammon b
  • ard
p
  • sitions
Pro viding handcrafted features in this manner is a go
  • d
example
  • f
ho w inductiv e biases from h uman kno wledge
  • f
the task can b e supplied to a learning algorithm The training
  • f
b
  • th
learning algorithms required sev eral mon ths
  • f
computer time and w as ac hiev ed b y constan t selfpla y
  • No
exploration strategy w as usedthe system alw a ys greedily c hose the mo v e with the largest exp ected probabilit y
  • f
victory
  • This
naiv e explo ration strategy pro v ed en tirely adequate for this en vironmen t whic h is p erhaps surprising giv en the considerable w
  • rk
in the reinforcemen tlearning literature whic h has pro duced n umerous coun terexamples to sho w that greedy exploration can lead to p
  • r
learning p er formance Bac kgammon ho w ev er has t w
  • imp
  • rtan
t prop erties Firstly
  • whatev
er p
  • licy
is follo w ed ev ery game is guaran teed to end in nite time meaning that useful rew ard information is
  • btained
fairly frequen tly
  • Secondly
  • the
state transitions are sucien tly sto c hastic that indep enden t
  • f
the p
  • licy
  • all
states will
  • ccasionally
b e visiteda wrong initial v alue function has little danger
  • f
starving us from visiting a critical part
  • f
state space from whic h imp
  • rtan
t information could b e
  • btained
The results T able
  • f
TDGammon are impressiv e It has comp eted at the v ery top lev el
  • f
in ternational h uman pla y
  • Basic
TDGammon pla y ed resp ectably
  • but
not at a professional standard
Although experiments with other games have in some cases produced interesting learning behavior, no success close to that of TD-Gammon has been repeated. Other games that have been studied include Go (Schraudolph, Dayan, & Sejnowski, 1994) and Chess (Thrun, 1995). It is still an open question as to if and how the success of TD-Gammon can be repeated in other domains.
Robotics and Control

In recent years there have been many robotics and control applications that have used reinforcement learning. Here we will concentrate on the following four examples, although many other interesting ongoing robotics investigations are underway.

1. Schaal and Atkeson (1994) constructed a two-armed robot, shown in the figure below, that learns to juggle a device known as a devil-stick. This is a complex non-linear control task involving a six-dimensional state space and less than 200 msec per control decision. After about 40 initial attempts, the robot learns to keep juggling for hundreds of hits. A typical human learning the task requires an order of magnitude more practice to achieve proficiency at mere tens of hits. The juggling robot learned a world model from experience, which was generalized to unvisited states by a function-approximation scheme known as locally weighted regression (Cleveland & Delvin, 1988; Moore & Atkeson). Between each trial, a form of dynamic programming specific to linear control policies and locally linear transitions was used to improve the policy. The form of dynamic programming is known as linear-quadratic-regulator design (Sage & White, 1977).

Figure: Schaal and Atkeson's devil-sticking robot. The tapered stick is hit alternately by each of the two hand sticks. The task is to keep the devil stick from falling for as many hits as possible. The robot has three motors, indicated by torque vectors.
2. Mahadevan and Connell (1991a) discuss a task in which a mobile robot pushes large boxes for extended periods of time. Box-pushing is a well-known difficult robotics problem, characterized by immense uncertainty in the results of actions. Q-learning was used in conjunction with some novel clustering techniques designed to enable a higher-dimensional input than a tabular approach would have permitted. The robot learned to perform competitively with the performance of a human-programmed solution. Another aspect of this work, mentioned earlier, was a pre-programmed breakdown of the monolithic task description into a set of lower-level tasks to be learned.

3. Mataric (1994) describes a robotics experiment with, from the viewpoint of theoretical reinforcement learning, an unthinkably high-dimensional state space, containing many dozens of degrees of freedom. Four mobile robots traveled within an enclosure, collecting small disks and transporting them to a destination region. There were three enhancements to the basic Q-learning algorithm. Firstly, pre-programmed signals called progress estimators were used to break the monolithic task into subtasks. This was achieved in a robust manner in which the robots were not forced to use the estimators, but had the freedom to profit from the inductive bias they provided. Secondly, control was decentralized: each robot learned its own policy independently, without explicit communication with the others. Thirdly, state space was brutally quantized into a small number of discrete states according to the values of a small number of pre-programmed boolean features of the underlying sensors. The performance of the Q-learned policies was almost as good as that of a simple hand-crafted controller for the job.

4. Q-learning has been used in an elevator dispatching task (Crites & Barto, 1996). The problem, which has been implemented in simulation only at this stage, involved four elevators servicing ten floors. The objective was to minimize the average squared wait time for passengers, discounted into future time. The problem can be posed as a discrete Markov system, but there are about 10^22 states even in the most simplified version of the problem. Crites and Barto used neural networks for function approximation and provided an excellent comparison study of their Q-learning approach against the most popular and the most sophisticated elevator dispatching algorithms. The squared wait time of their controller was approximately 7% less than that of the best alternative algorithm (the "Empty the System" heuristic with a receding-horizon controller), and less than half the squared wait time of the controller most frequently used in real elevator systems.
The final example concerns an application of reinforcement learning by one of the authors of this survey to a packaging task from a food processing industry. The problem involves filling containers with variable numbers of non-identical products. The product characteristics also vary with time, but can be sensed. Depending on the task, various constraints are placed on the container-filling procedure. Here are three examples:

• The mean weight of all containers produced by a shift must not be below the manufacturer's declared weight W.

• The number of containers below the declared weight must be less than P.

• No containers may be produced below weight W'.

Such tasks are controlled by machinery which operates according to various setpoints. Conventional practice is that setpoints are chosen by human operators, but this choice is not easy, as it is dependent on the current product characteristics and the current task constraints. The dependency is often difficult to model and highly nonlinear. The task was posed as a finite-horizon Markov decision task in which the state of the system is a function of the product characteristics, the amount of time remaining in the production shift, and the mean wastage and percent below declared in the shift so far. The system was discretized into 200,000 discrete states, and locally weighted regression was used to learn and generalize a transition model. Prioritized sweeping was used to maintain an optimal value function as each new piece of transition information was obtained. In simulated experiments the savings were considerable, typically with wastage reduced by a factor of ten. Since then the system has been deployed successfully in several factories within the United States.
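A much-simplified sketch of the prioritized-sweeping loop follows: each new piece of transition data updates the model, the affected state is backed up, and large value changes are propagated to predecessor states in order of priority. The tabular model, budget, and tolerance below are illustrative assumptions; in particular, the deployed system generalized its model with locally weighted regression, which is omitted here.

    import heapq
    from collections import defaultdict

    gamma, theta, budget = 0.95, 1e-4, 20   # illustrative constants
    V = defaultdict(float)        # state -> value estimate
    model = defaultdict(dict)     # model[s][a] = (reward, {s2: probability})
    predecessors = defaultdict(set)

    def backup(s):
        # One Bellman backup; returns how much V[s] changed.
        if not model[s]:
            return 0.0
        best = max(r + gamma * sum(p * V[s2] for s2, p in ns.items())
                   for r, ns in model[s].values())
        change = abs(best - V[s])
        V[s] = best
        return change

    def observe(s, a, reward, next_state_probs):
        # Record one observed transition, then sweep: states whose values
        # changed most are processed first, and their predecessors follow.
        model[s][a] = (reward, next_state_probs)
        for s2 in next_state_probs:
            predecessors[s2].add(s)
        queue = [(-backup(s), s)]   # states assumed comparable (e.g. ints)
        for _ in range(budget):
            if not queue:
                break
            neg_change, s1 = heapq.heappop(queue)
            if -neg_change < theta:
                break
            for sp in predecessors[s1]:
                heapq.heappush(queue, (-backup(sp), sp))

The priority queue is what lets a small budget of backups per observation keep the value function close to optimal without sweeping the entire state space.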
Some interesting aspects of practical reinforcement learning come to light from these examples. The most striking is that in all cases, to make a real system work it proved necessary to supplement the fundamental algorithm with extra preprogrammed knowledge. Supplying extra knowledge comes at a price: more human effort and insight are required, and the system is subsequently less autonomous. But it is also clear that for tasks such as these, a knowledge-free approach would not have achieved worthwhile performance within the finite lifetime of the robots.
What forms did this preprogrammed knowledge take? It included an assumption of linearity for the juggling robot's policy and a manual breaking up of the task into subtasks for the two mobile-robot examples, while the box-pusher also used a clustering technique for the Q values which assumed locally consistent Q values. The four disk-collecting robots additionally used a manually discretized state space. The packaging example had far fewer dimensions and so required correspondingly weaker assumptions, but there too, the assumption of local piecewise continuity in the transition model enabled massive reductions in the amount of learning data required.
The exploration strategies are interesting, too. The juggler used careful statistical analysis to judge where to profitably experiment. However, both mobile robot applications were able to learn well with greedy exploration (always exploiting, without deliberate exploration). The packaging task used optimism in the face of uncertainty. None of these strategies mirrors theoretically optimal (but computationally intractable) exploration, and yet all proved adequate.
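Of these strategies, optimism in the face of uncertainty is the simplest to sketch: if value estimates are initialized optimistically, even a purely greedy agent explores, because untried actions still look best. The bound and constants below are illustrative assumptions.

    from collections import defaultdict

    R_MAX = 10.0                        # assumed upper bound on return
    Q = defaultdict(lambda: R_MAX)      # unseen (state, action) pairs look best
    alpha, gamma = 0.1, 0.9
    actions = range(4)

    def greedy(state):
        # Pure exploitation: untried actions still carry the optimistic value
        # R_MAX, so the agent is drawn toward unexplored parts of the space.
        return max(actions, key=lambda a: Q[(state, a)])

    def update(state, action, reward, next_state):
        target = reward + gamma * max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (target - Q[(state, action)])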
Finally, it is also worth considering the computational regimes of these experiments. They were all very different, which indicates that the differing computational demands of various reinforcement learning algorithms do indeed have an array of differing applications. The juggler needed to make very fast decisions with low latency between each hit, but had long periods (seconds and more) between each trial to consolidate the experiences collected on the previous trial and to perform the more aggressive computation necessary to produce a new reactive controller on the next trial. The box-pushing robot was meant to operate autonomously for hours and so had to make decisions with a uniform-length control cycle. The cycle was sufficiently long for quite substantial computations beyond simple Q-learning backups. The four disk-collecting robots were particularly interesting. Each robot had a short life of less than 20 minutes (due to battery constraints), meaning that substantial number crunching was impractical, and any significant combinatorial search would have used a significant fraction of the robot's learning lifetime. The packaging task had easy constraints: one decision was needed every few minutes. This provided opportunities for fully computing the optimal value function for the 200,000-state system between every control cycle, in addition to performing massive cross-validation-based optimization of the transition model being learned.

A great deal of further work is currently in progress on practical implementations of reinforcement learning. The insights and task constraints that they produce will have an important effect on shaping the kind of algorithms that are developed in the future.
Conclusions

There are a variety of reinforcement-learning techniques that work effectively on a variety of small problems. But very few of these techniques scale well to larger problems. This is not because researchers have done a bad job of inventing learning techniques, but because it is very difficult to solve arbitrary problems in the general case. In order to solve highly complex problems, we must give up tabula rasa learning techniques and begin to incorporate bias that will give leverage to the learning process. The necessary bias can come in a variety of forms, including the following (a small code sketch of the local-reinforcement-signal and reflex biases follows the list):

shaping: The technique of shaping is used in training animals (Hilgard & Bower): a teacher presents very simple problems to solve first, then gradually exposes the learner to more complex problems. Shaping has been used in supervised-learning systems, and can be used to train hierarchical reinforcement-learning systems from the bottom up (Lin), and to alleviate problems of delayed reinforcement by decreasing the delay until the problem is well understood (Dorigo & Colombetti; Dorigo).

local reinforcement signals: Whenever possible, agents should be given reinforcement signals that are local. In applications in which it is possible to compute a gradient, rewarding the agent for taking steps up the gradient, rather than just for achieving the final goal, can speed learning significantly (Mataric).

imitation: An agent can learn by "watching" another agent perform the task (Lin). For real robots, this requires perceptual abilities that are not yet available. But another strategy is to have a human supply appropriate motor commands to a robot through a joystick or steering wheel (Pomerleau).

problem decomposition: Decomposing a huge learning problem into a collection of smaller ones, and providing useful reinforcement signals for the subproblems, is a very powerful technique for biasing learning. Most interesting examples of robotic reinforcement learning employ this technique to some extent (Connell & Mahadevan).

reflexes: One thing that keeps agents that know nothing from learning anything is that they have a hard time even finding the interesting parts of the space; they wander around at random, never getting near the goal, or they are always "killed" immediately. These problems can be ameliorated by programming a set of "reflexes" that cause the agent to act initially in some way that is reasonable (Mataric; Singh, Barto, Grupen, & Connolly). These reflexes can eventually be overridden by more detailed and accurate learned knowledge, but they at least keep the agent alive and pointed in the right direction while it is trying to learn. Recent work by Millan explores the use of reflexes to make robot learning safer and more efficient.
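As promised above, here is a small sketch of the reflex and local-reinforcement-signal biases: a preprogrammed reflex policy acts until the learner has enough experience to be trusted, and the reward function pays for progress up a gradient rather than only for reaching the goal. The threshold, helper names, and constants are illustrative assumptions.

    CONFIDENCE_THRESHOLD = 5    # illustrative: visits before trusting learning

    def act(state, Q, visits, actions, reflex_policy):
        # Reflex bias: fall back on a preprogrammed, reasonable behavior in
        # states the learner has barely seen; otherwise act on learned values.
        if visits[state] < CONFIDENCE_THRESHOLD:
            return reflex_policy(state)
        return max(actions, key=lambda a: Q[(state, a)])

    def local_reward(old_distance, new_distance, reached_goal):
        # Local signal bias: pay for each step of progress up the gradient
        # (here, reduced distance to goal) instead of only for arrival.
        return 100.0 if reached_goal else old_distance - new_distance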
With appropriate biases, supplied by human programmers or teachers, complex reinforcement-learning problems will eventually be solvable. There is still much work to be done and many interesting questions remaining, for learning techniques and especially regarding methods for approximating, decomposing, and incorporating bias into problems.

Acknowledgements

Thanks to Marco Dorigo and three anonymous reviewers for comments that have helped to improve this paper. Also thanks to our many colleagues in the reinforcement-learning community who have done this work and explained it to us. Leslie Pack Kaelbling was supported in part by NSF grants. Michael Littman was supported in part by Bellcore. Andrew Moore was supported in part by an NSF Research Initiation Award and by 3M Corporation.
References

Ackley, D. H., & Littman, M. L. Generalization and scaling in reinforcement learning. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Albus, J. S. A new approach to manipulator control: Cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement, and Control.

Albus, J. S. Brains, Behavior, and Robotics. BYTE Books (subsidiary of McGraw-Hill), Peterborough, New Hampshire.

Anderson, C. W. Learning and Problem Solving with Multilayer Connectionist Systems. PhD thesis, University of Massachusetts, Amherst, MA.

Ashar, R. R. Hierarchical learning in stochastic domains. Master's thesis, Brown University, Providence, Rhode Island.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Baird, L. C., & Klopf, A. H. Reinforcement learning with high-dimensional continuous actions. Tech. rep., Wright Laboratory, Wright-Patterson Air Force Base, Ohio.
Barto, A. G., Bradtke, S. J., & Singh, S. P. Learning to act using real-time dynamic programming. Artificial Intelligence.

Barto, A. G., Sutton, R. S., & Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics.

Bellman, R. Dynamic Programming. Princeton University Press, Princeton, NJ.

Berenji, H. R. Artificial neural networks and approximate reasoning for intelligent control in space. In American Control Conference.

Berry, D. A., & Fristedt, B. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, UK.

Bertsekas, D. P. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, NJ.

Bertsekas, D. P. Dynamic Programming and Optimal Control (Volumes 1 and 2). Athena Scientific, Belmont, Massachusetts.

Bertsekas, D. P., & Castañon, D. A. Adaptive aggregation for infinite horizon dynamic programming. IEEE Transactions on Automatic Control.

Bertsekas, D. P., & Tsitsiklis, J. N. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ.

Box, G. E. P., & Draper, N. R. Empirical Model-Building and Response Surfaces. Wiley.

Boyan, J. A., & Moore, A. W. Generalization in reinforcement learning: Safely approximating the value function. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA. The MIT Press.

Burghes, D., & Graham, A. Introduction to Control Theory, including Optimal Control. Ellis Horwood.

Cassandra, A. R., Kaelbling, L. P., & Littman, M. L. Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA.

Chapman, D., & Kaelbling, L. P. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence, Sydney, Australia.

Chrisman, L. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA. AAAI Press.
Chrisman, L., & Littman, M. Hidden state and short-term memory. Presentation at the Reinforcement Learning Workshop, Machine Learning Conference.

Cichosz, P., & Mulawka, J. J. Fast and efficient reinforcement learning with truncated temporal differences. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Cleveland, W. S., & Devlin, S. J. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association.

Cliff, D., & Ross, S. Adding temporary memory to ZCS. Adaptive Behavior.

Condon, A. The complexity of stochastic games. Information and Computation.

Connell, J., & Mahadevan, S. Rapid task learning for real robots. In Robot Learning. Kluwer Academic Publishers.

Crites, R. H., & Barto, A. G. Improving elevator performance using reinforcement learning. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems.

Dayan, P. The convergence of TD(λ) for general λ. Machine Learning.

Dayan, P., & Hinton, G. E. Feudal reinforcement learning. In Hanson, S. J., Cowan, J. D., & Giles, C. L. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Dayan, P., & Sejnowski, T. J. TD(λ) converges with probability 1. Machine Learning.

Dean, T., Kaelbling, L. P., Kirman, J., & Nicholson, A. Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence, Washington, DC.

D'Epenoux, F. A probabilistic production and inventory problem. Management Science.

Derman, C. Finite State Markovian Decision Processes. Academic Press, New York.

Dorigo, M., & Bersini, H. A comparison of Q-learning and classifier systems. In From Animals to Animats: Proceedings of the Third International Conference on the Simulation of Adaptive Behavior, Brighton, UK.

Dorigo, M., & Colombetti, M. Robot shaping: Developing autonomous agents through learning. Artificial Intelligence.
Dorigo, M. Alecsys and the AutonoMouse: Learning to control a real robot by distributed classifier systems. Machine Learning.

Fiechter, C.-N. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory. Association for Computing Machinery.

Gittins, J. C. Multi-armed Bandit Allocation Indices. Wiley-Interscience Series in Systems and Optimization. Wiley, Chichester, NY.

Goldberg, D. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, MA.

Gordon, G. J. Stable function approximation in dynamic programming. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Gullapalli, V. A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks.

Gullapalli, V. Reinforcement Learning and Its Application to Control. PhD thesis, University of Massachusetts, Amherst, MA.

Hilgard, E. R., & Bower, G. H. Theories of Learning (fourth edition). Prentice-Hall, Englewood Cliffs, NJ.

Hoffman, A. J., & Karp, R. M. On nonterminating stochastic games. Management Science.

Holland, J. H. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.

Howard, R. A. Dynamic Programming and Markov Processes. The MIT Press, Cambridge, MA.

Jaakkola, T., Jordan, M. I., & Singh, S. P. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation.

Jaakkola, T., Singh, S. P., & Jordan, M. I. Monte-Carlo reinforcement learning in non-Markovian decision problems. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA. The MIT Press.

Kaelbling, L. P. (a). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA. Morgan Kaufmann.

Kaelbling, L. P. (b). Learning in Embedded Systems. The MIT Press, Cambridge, MA.

Kaelbling, L. P. (a). Associative reinforcement learning: A generate and test algorithm. Machine Learning.
Kaelbling, L. P. (b). Associative reinforcement learning: Functions in k-DNF. Machine Learning.

Kirman, J. Predicting Real-Time Planner Performance by Domain Characterization. PhD thesis, Department of Computer Science, Brown University.

Koenig, S., & Simmons, R. G. Complexity analysis of real-time reinforcement learning. In Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park, California. AAAI Press/MIT Press.

Kumar, P. R., & Varaiya, P. P. Stochastic Systems: Estimation, Identification, and Adaptive Control. Prentice Hall, Englewood Cliffs, New Jersey.

Lee, C. C. A self-learning rule-based controller employing approximate reasoning and neural net concepts. International Journal of Intelligent Systems.

Lin, L.-J. Programming robots using reinforcement learning and teaching. In Proceedings of the Ninth National Conference on Artificial Intelligence.

Lin, L.-J. (a). Hierarchical learning of robot skills by reinforcement. In Proceedings of the International Conference on Neural Networks.

Lin, L.-J. (b). Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, PA.

Lin, L.-J., & Mitchell, T. M. Memory approaches to reinforcement learning in non-Markovian domains. Tech. rep. CMU-CS, Carnegie Mellon University, School of Computer Science.

Littman, M. L. (a). Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Littman, M. L. (b). Memoryless policies: Theoretical limitations and practical results. In Cliff, D., Husbands, P., Meyer, J.-A., & Wilson, S. W. (Eds.), From Animals to Animats: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, Cambridge, MA. The MIT Press.

Littman, M. L., Cassandra, A., & Kaelbling, L. P. (a). Learning policies for partially observable environments: Scaling up. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Littman, M. L., Dean, T. L., & Kaelbling, L. P. (b). On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI), Montréal, Québec, Canada.

Lovejoy, W. S. A survey of algorithmic methods for partially observable Markov decision processes. Annals of Operations Research.
Maes, P., & Brooks, R. A. Learning to coordinate behaviors. In Proceedings of the Eighth National Conference on Artificial Intelligence. Morgan Kaufmann.

Mahadevan, S. To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Mahadevan, S. Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning.

Mahadevan, S., & Connell, J. (a). Automatic programming of behavior-based robots using reinforcement learning. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA.

Mahadevan, S., & Connell, J. (b). Scaling reinforcement learning to robotics by exploiting the subsumption architecture. In Proceedings of the Eighth International Workshop on Machine Learning.

Mataric, M. J. Reward functions for accelerated learning. In Cohen, W. W., & Hirsh, H. (Eds.), Proceedings of the Eleventh International Conference on Machine Learning. Morgan Kaufmann.

McCallum, A. K. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester.

McCallum, R. A. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann.

McCallum, R. A. Instance-based utile distinctions for reinforcement learning with hidden state. In Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Meeden, L., McGraw, G., & Blank, D. Emergent control and planning in an autonomous vehicle. In Touretzky, D. (Ed.), Proceedings of the Fifteenth Annual Meeting of the Cognitive Science Society. Lawrence Erlbaum Associates, Hillsdale, NJ.

Millan, J. del R. Rapid, safe, and incremental learning of navigation strategies. IEEE Transactions on Systems, Man, and Cybernetics.

Monahan, G. E. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science.

Moore, A. W. Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued spaces. In Proceedings of the Eighth International Machine Learning Workshop.
Moore, A. W. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Moore, A. W., & Atkeson, C. G. An investigation of memory-based function approximators for learning control. Tech. rep., MIT Artificial Intelligence Laboratory, Cambridge, MA.

Moore, A. W., & Atkeson, C. G. Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning.

Moore, A. W., Atkeson, C. G., & Schaal, S. Memory-based learning for control. Tech. rep. CMU-RI-TR, CMU Robotics Institute.

Narendra, K., & Thathachar, M. A. L. Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs, NJ.

Narendra, K. S., & Thathachar, M. A. L. Learning automata: A survey. IEEE Transactions on Systems, Man, and Cybernetics.

Peng, J., & Williams, R. J. Efficient learning and planning within the Dyna framework. Adaptive Behavior.

Peng, J., & Williams, R. J. Incremental multi-step Q-learning. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Pomerleau, D. A. Neural Network Perception for Mobile Robot Guidance. Kluwer Academic Publishing.

Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, NY.

Puterman, M. L., & Shin, M. C. Modified policy iteration algorithms for discounted Markov decision processes. Management Science.

Ring, M. B. Continual Learning in Reinforcement Environments. PhD thesis, University of Texas at Austin, Austin, Texas.

Rüde, U. Mathematical and Computational Techniques for Multilevel Adaptive Methods. Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania.

Rumelhart, D. E., & McClelland, J. L. (Eds.). Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Volume 1: Foundations. The MIT Press, Cambridge, MA.

Rummery, G. A., & Niranjan, M. On-line Q-learning using connectionist systems. Tech. rep. CUED/F-INFENG/TR, Cambridge University.
Rust, J. Numerical dynamic programming in economics. In Handbook of Computational Economics. Elsevier, North Holland.

Sage, A. P., & White, C. C. Optimum Systems Control. Prentice Hall.

Salganicoff, M., & Ungar, L. H. Active exploration and learning in real-valued spaces using multi-armed bandit allocation indices. In Prieditis, A., & Russell, S. (Eds.), Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Samuel, A. L. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development. Reprinted in E. A. Feigenbaum and J. Feldman (Eds.), Computers and Thought, McGraw-Hill, New York.

Schaal, S., & Atkeson, C. Robot juggling: An implementation of memory-based learning. Control Systems Magazine.

Schmidhuber, J. A general method for multi-agent learning and incremental self-improvement in unrestricted environments. In Yao, X. (Ed.), Evolutionary Computation: Theory and Applications. Scientific Publ. Co., Singapore.

Schmidhuber, J. H. (a). Curious model-building control systems. In Proceedings of the International Joint Conference on Neural Networks, Singapore. IEEE.

Schmidhuber, J. H. (b). Reinforcement learning in Markovian and non-Markovian environments. In Lippman, D. S., Moody, J. E., & Touretzky, D. S. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Schraudolph, N. N., Dayan, P., & Sejnowski, T. J. Temporal difference learning of position evaluation in the game of Go. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Schrijver, A. Theory of Linear and Integer Programming. Wiley-Interscience, New York, NY.

Schwartz, A. A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, Amherst, Massachusetts. Morgan Kaufmann.

Singh, S. P., Barto, A. G., Grupen, R., & Connolly, C. Robust reinforcement learning in motion planning. In Cowan, J. D., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems, San Mateo, CA. Morgan Kaufmann.

Singh, S. P., & Sutton, R. S. Reinforcement learning with replacing eligibility traces. Machine Learning.
Singh, S. P. (a). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the Tenth National Conference on Artificial Intelligence, San Jose, CA. AAAI Press.

Singh, S. P. (b). Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning.

Singh, S. P. Learning to Solve Markovian Decision Processes. PhD thesis, Department of Computer Science, University of Massachusetts. Also CMPSCI Technical Report.

Stengel, R. F. Stochastic Optimal Control. John Wiley and Sons.

Sutton, R. S. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Touretzky, D., Mozer, M., & Hasselmo, M. (Eds.), Neural Information Processing Systems.

Sutton, R. S. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA.

Sutton, R. S. Learning to predict by the method of temporal differences. Machine Learning.

Sutton, R. S. Integrated architectures for learning, planning and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, Austin, TX. Morgan Kaufmann.

Sutton, R. S. Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning. Morgan Kaufmann.

Tesauro, G. Practical issues in temporal difference learning. Machine Learning.

Tesauro, G. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation.

Tesauro, G. Temporal difference learning and TD-Gammon. Communications of the ACM.

Tham, C.-K., & Prager, R. W. A modular Q-learning architecture for manipulator task decomposition. In Proceedings of the Eleventh International Conference on Machine Learning, San Francisco, CA. Morgan Kaufmann.

Thrun, S. Learning to play the game of chess. In Tesauro, G., Touretzky, D. S., & Leen, T. K. (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA. The MIT Press.
Thrun, S., & Schwartz, A. Issues in using function approximation for reinforcement learning. In Mozer, M., Smolensky, P., Touretzky, D., Elman, J., & Weigend, A. (Eds.), Proceedings of the Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ.

Thrun, S. B. The role of exploration in learning control. In White, D. A., & Sofge, D. A. (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches. Van Nostrand Reinhold, New York, NY.

Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning.

Tsitsiklis, J. N., & Van Roy, B. Feature-based methods for large scale dynamic programming. Machine Learning.

Valiant, L. G. A theory of the learnable. Communications of the ACM.

Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK.

Watkins, C. J. C. H., & Dayan, P. Q-learning. Machine Learning.

Whitehead, S. D. Complexity and cooperation in Q-learning. In Proceedings of the Eighth International Workshop on Machine Learning, Evanston, IL. Morgan Kaufmann.

Williams, R. J. A class of gradient-estimating algorithms for reinforcement learning in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks, San Diego, CA.

Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.

Williams, R. J., & Baird III, L. C. (a). Analysis of some incremental variants of policy iteration: First steps toward understanding actor-critic learning systems. Tech. rep. NU-CCS, Northeastern University, College of Computer Science, Boston, MA.

Williams, R. J., & Baird III, L. C. (b). Tight performance bounds on greedy policies based on imperfect value functions. Tech. rep. NU-CCS, Northeastern University, College of Computer Science, Boston, MA.

Wilson, S. Classifier fitness based on accuracy. Evolutionary Computation.

Zhang, W., & Dietterich, T. G. A reinforcement learning approach to job-shop scheduling. In Proceedings of the International Joint Conference on Artificial Intelligence.