Foundations of Machine Learning
Reinforcement Learning
Mehryar Mohri


Reinforcement Learning

Agent exploring its environment. Interactions with the environment: at each step, the agent observes the current state, takes an action, and receives a reward.

(Diagram: agent-environment loop labeled with state, action, and reward.)

Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.


Key Features

Contrast with supervised learning:

  • no explicit labeled training data.
  • distribution defined by actions taken.

Delayed rewards or penalties.

RL trade-off:

  • exploration (of unknown states and actions) to gain more reward information; vs.
  • exploitation (of known information) to optimize reward.


Applications

  • Robot control, e.g., RoboCup soccer teams (Stone et al., 1999).
  • Board games, e.g., TD-Gammon (Tesauro, 1995).
  • Elevator scheduling (Crites and Barto, 1996).
  • Ads placement.
  • Telecommunications.
  • Inventory management.
  • Dynamic radio channel assignment.


This Lecture

  • Markov Decision Processes (MDPs)
  • Planning
  • Learning
  • Multi-armed bandit problem


Markov Decision Process (MDP)

Definition: a Markov Decision Process is defined by:

  • a set of decision epochs $\{0, \ldots, T\}$.
  • a set of states $S$, possibly infinite.
  • a start state or initial state $s_0$;
  • a set of actions $A$, possibly infinite.
  • a transition probability $\Pr[s' \mid s, a]$: distribution over destination states $s' = \delta(s, a)$.
  • a reward probability $\Pr[r' \mid s, a]$: distribution over rewards returned $r' = r(s, a)$.


Model

State observed at time $t$: $s_t \in S$.
Action taken at time $t$: $a_t \in A$.
State reached: $s_{t+1} = \delta(s_t, a_t)$.
Reward received: $r_{t+1} = r(s_t, a_t)$.

(Diagram: agent-environment loop with state, action, and reward; trajectory $s_t \xrightarrow{a_t / r_{t+1}} s_{t+1} \xrightarrow{a_{t+1} / r_{t+2}} s_{t+2}$.)


MDPs - Properties

Finite MDPs: $S$ and $A$ finite sets.
Finite horizon when $T < \infty$.
Reward $r(s, a)$: often a deterministic function.


Example - Robot Picking up Balls

(Diagram: MDP for a robot picking up balls, with a start state and transitions labeled action/[probability, reward]: search/[.1, R1], search/[.9, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, -1].)


Policy

Definition: a policy is a mapping $\pi \colon S \to A$.

Objective: find a policy $\pi$ maximizing the expected return.

  • finite horizon return: $\sum_{t=0}^{T-1} r(s_t, \pi(s_t))$.
  • infinite horizon return: $\sum_{t=0}^{+\infty} \gamma^t r(s_t, \pi(s_t))$.

Theorem: there exists an optimal policy from any start state.

Policy Value

Definition: the value of a policy $\pi$ at state $s$ is

  • finite horizon: $V_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{T-1} r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$.
  • infinite horizon: with discount factor $\gamma \in [0, 1)$, $V_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$.

Problem: find a policy $\pi$ with maximum value for all states.

Policy Evaluation

Analysis of policy value: Bellman equations (system of linear equations):

$$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s').$$

Derivation:

$$V_\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big] = \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\Big[\sum_{t=0}^{+\infty} \gamma^t r(s_{t+1}, \pi(s_{t+1})) \,\Big|\, s_0 = s\Big] = \mathbb{E}[r(s, \pi(s))] + \gamma\, \mathbb{E}\big[V_\pi(\delta(s, \pi(s)))\big].$$

Bellman Equation - Existence and Uniqueness

Notation:

  • transition probability matrix $\mathbf{P}$: $\mathbf{P}_{s, s'} = \Pr[s' \mid s, \pi(s)]$.
  • value column matrix $\mathbf{V}$: $\mathbf{V}_s = V_\pi(s)$.
  • expected reward column matrix $\mathbf{R}$: $\mathbf{R}_s = \mathbb{E}[r(s, \pi(s))]$.

Theorem: for a finite MDP, Bellman's equation admits a unique solution given by

$$\mathbf{V}_0 = (\mathbf{I} - \gamma \mathbf{P})^{-1} \mathbf{R}.$$
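To make the closed form concrete, here is a minimal NumPy sketch (the two-state transition matrix, rewards, and discount factor are made-up illustration values, not taken from the slides) that evaluates a fixed policy by solving the Bellman linear system:

```python
import numpy as np

# Hypothetical 2-state example: P[s, s'] = Pr[s' | s, pi(s)] under the fixed
# policy, R[s] = E[r(s, pi(s))], and a discount factor gamma < 1.
P = np.array([[0.75, 0.25],
              [0.00, 1.00]])
R = np.array([2.0, 3.0])
gamma = 0.5

# Unique solution of V = R + gamma P V, i.e., V = (I - gamma P)^{-1} R.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```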


Bellman Equation - Existence and Uniqueness

Proof: Bellman's equation can be rewritten as $(\mathbf{I} - \gamma \mathbf{P})\mathbf{V} = \mathbf{R}$, i.e., $\mathbf{V} = \mathbf{R} + \gamma \mathbf{P} \mathbf{V}$.

  • $\mathbf{P}$ is a stochastic matrix, thus $\|\mathbf{P}\|_\infty = \max_s \sum_{s'} |\mathbf{P}_{s s'}| = \max_s \sum_{s'} \Pr[s' \mid s, \pi(s)] = 1$.
  • This implies that $\|\gamma \mathbf{P}\|_\infty = \gamma < 1$. The eigenvalues of $\gamma \mathbf{P}$ are all less than one in magnitude and $(\mathbf{I} - \gamma \mathbf{P})$ is invertible.

Notes: general shortest-distance problem (MM, 2002).


Optimal Policy

Definition: policy $\pi^*$ with maximal value for all states $s \in S$.

  • value of $\pi^*$ (optimal value): $\forall s \in S,\ V_{\pi^*}(s) = \max_\pi V_\pi(s)$.
  • optimal state-action value function: expected return for taking action $a$ at state $s$ and then following the optimal policy,

$$Q^*(s, a) = \mathbb{E}[r(s, a)] + \gamma\, \mathbb{E}[V^*(\delta(s, a))] = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s').$$


Optimal Values - Bellman Equations

Property: the following equalities hold:

$$\forall s \in S, \quad V^*(s) = \max_{a \in A} Q^*(s, a).$$

Proof: by definition, for all $s$, $V^*(s) \le \max_{a \in A} Q^*(s, a)$.

  • If for some $s$ we had $V^*(s) < \max_{a \in A} Q^*(s, a)$, then choosing the maximizing action at $s$ would define a better policy. Thus,

$$V^*(s) = \max_{a \in A} \Big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V^*(s') \Big\}.$$

This Lecture

  • Markov Decision Processes (MDPs)
  • Planning
  • Learning
  • Multi-armed bandit problem


Known Model

Setting: environment model known.
Problem: find the optimal policy.
Algorithms:

  • value iteration.
  • policy iteration.
  • linear programming.


Value Iteration Algorithm

ValueIteration(V0)
 1  V ← V0                                  (V0 arbitrary value)
 2  while ‖V − Φ(V)‖ ≥ (1 − γ)ε/γ do
 3      V ← Φ(V)
 4  return Φ(V)

where $\Phi(V) = \max_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V\}$, that is,

$$\Phi(V)(s) = \max_{a \in A} \Big\{ \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') \Big\}.$$
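A minimal NumPy sketch of this procedure for a finite MDP; the (action, state, state) tensor layout, the stopping test, and the encoding of the two-state example (action indices chosen arbitrarily) are assumptions of the illustration:

```python
import numpy as np

def value_iteration(P, R, gamma, eps):
    """P[a, s, s1]: transition probabilities; R[a, s]: expected rewards E[r(s, a)]."""
    V = np.zeros(P.shape[1])                        # arbitrary V0
    while True:
        # Phi(V)(s) = max_a { R[a, s] + gamma * sum_s1 P[a, s, s1] V(s1) }
        phi_V = np.max(R + gamma * P @ V, axis=0)
        if np.max(np.abs(V - phi_V)) < (1 - gamma) * eps / gamma:
            return np.max(R + gamma * P @ phi_V, axis=0)   # return Phi(V)
        V = phi_V

# Two-state example from the VI example slide (action indices chosen arbitrarily):
P = np.array([[[0.75, 0.25], [1.0, 0.0]],   # action 0: "a" in state 1, "d" in state 2
              [[0.0, 1.0], [0.0, 1.0]]])    # action 1: "b" in state 1, "c" in state 2
R = np.array([[2.0, 3.0], [2.0, 2.0]])
print(value_iteration(P, R, gamma=0.5, eps=1e-6))   # close to [14/3, 16/3]
```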

VI Algorithm - Convergence

Theorem: for any initial value $V_0$, the sequence defined by $V_{n+1} = \Phi(V_n)$ converges to $V^*$.

Proof: we show that $\Phi$ is $\gamma$-contracting for $\|\cdot\|_\infty$, which gives existence and uniqueness of the fixed point of $\Phi$.

  • for any $s \in S$, let $a^*(s)$ be the maximizing action defining $\Phi(V)(s)$. Then, for $s \in S$ and any $U$,

$$\Phi(V)(s) - \Phi(U)(s) \le \Phi(V)(s) - \Big( \mathbb{E}[r(s, a^*(s))] + \gamma \sum_{s'} \Pr[s' \mid s, a^*(s)]\, U(s') \Big) = \gamma \sum_{s'} \Pr[s' \mid s, a^*(s)]\, [V(s') - U(s')] \le \gamma \|V - U\|_\infty.$$


Complexity and Optimality

Complexity: convergence in $O\big(\log \frac{1}{\epsilon}\big)$ iterations. Observe that

$$\|V_{n+1} - V_n\|_\infty \le \gamma \|V_n - V_{n-1}\|_\infty \le \gamma^n \|\Phi(V_0) - V_0\|_\infty.$$

Thus, $\gamma^n \|\Phi(V_0) - V_0\|_\infty \le \frac{(1-\gamma)\epsilon}{\gamma} \Rightarrow n = O\big(\log \frac{1}{\epsilon}\big)$.

Optimality: let $V_{n+1}$ be the value returned. Then,

$$\|V^* - V_{n+1}\|_\infty \le \|V^* - \Phi(V_{n+1})\|_\infty + \|\Phi(V_{n+1}) - V_{n+1}\|_\infty \le \gamma \|V^* - V_{n+1}\|_\infty + \gamma \|V_{n+1} - V_n\|_\infty.$$

Thus,

$$\|V^* - V_{n+1}\|_\infty \le \frac{\gamma}{1 - \gamma} \|V_{n+1} - V_n\|_\infty \le \epsilon.$$


VI Algorithm - Example

(Diagram: two-state MDP with states 1 and 2, transitions labeled action/[probability, reward]: from state 1, a/[3/4, 2] back to 1 and a/[1/4, 2] to 2, b/[1, 2] to 2; from state 2, c/[1, 2] back to 2 and d/[1, 3] to 1.)

$$V_{n+1}(1) = \max\Big\{ 2 + \gamma\big( \tfrac{3}{4} V_n(1) + \tfrac{1}{4} V_n(2) \big),\; 2 + \gamma V_n(2) \Big\}, \qquad V_{n+1}(2) = \max\big\{ 3 + \gamma V_n(1),\; 2 + \gamma V_n(2) \big\}.$$

For $V_0(1) = -1$, $V_0(2) = 1$, $\gamma = 1/2$: $V_1(1) = V_1(2) = 5/2$.

But $V^*(1) = 14/3$, $V^*(2) = 16/3$.


Policy Iteration Algorithm

PolicyIteration(π0)
 1  π ← π0                                  (π0 arbitrary policy)
 2  π′ ← nil
 3  while (π ≠ π′) do
 4      V ← Vπ                              (policy evaluation: solve (I − γPπ)V = Rπ)
 5      π′ ← π
 6      π ← argmaxπ {Rπ + γPπV}             (greedy policy improvement)
 7  return π
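A matching NumPy sketch of policy iteration, under the same assumed (action, state, state) tensor layout as the value-iteration example above:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P[a, s, s1]: transition probabilities; R[a, s]: expected rewards E[r(s, a)]."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy pi_0
    while True:
        # policy evaluation: solve (I - gamma * P_pi) V = R_pi
        P_pi = P[pi, np.arange(n_states)]        # P_pi[s, :] = P[pi(s), s, :]
        R_pi = R[pi, np.arange(n_states)]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # greedy policy improvement
        new_pi = np.argmax(R + gamma * P @ V, axis=0)
        if np.array_equal(new_pi, pi):
            return pi, V
        pi = new_pi
```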


PI Algorithm - Convergence

Theorem: let $(V_n)_{n \in \mathbb{N}}$ be the sequence of policy values computed by the algorithm; then $V_n \le V_{n+1} \le V^*$.

Proof: let $\pi_{n+1}$ be the policy improvement at the $n$th iteration; then, by definition,

$$\mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_n \ge \mathbf{R}_{\pi_n} + \gamma \mathbf{P}_{\pi_n} V_n = V_n.$$

  • therefore, $\mathbf{R}_{\pi_{n+1}} \ge (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}}) V_n$.
  • note that $(\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1}$ preserves ordering: $X \ge 0 \Rightarrow (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} X = \sum_{k=0}^{\infty} (\gamma \mathbf{P}_{\pi_{n+1}})^k X \ge 0$.
  • thus, $V_{n+1} = (\mathbf{I} - \gamma \mathbf{P}_{\pi_{n+1}})^{-1} \mathbf{R}_{\pi_{n+1}} \ge V_n$.


Notes

Two consecutive policy values can be equal only at the last iteration. The total number of possible policies is $|A|^{|S|}$; thus, this is the maximal possible number of iterations.

  • best upper bound known: $O\big(\tfrac{|A|^{|S|}}{|S|}\big)$.


PI Algorithm - Example

(Diagram: same two-state MDP as in the VI example.)

Initial policy: $\pi_0(1) = b$, $\pi_0(2) = c$.

Evaluation: $V_{\pi_0}(1) = 1 + \gamma V_{\pi_0}(2)$, $V_{\pi_0}(2) = 2 + \gamma V_{\pi_0}(2)$.

Thus, $V_{\pi_0}(1) = \dfrac{1 + \gamma}{1 - \gamma}$ and $V_{\pi_0}(2) = \dfrac{2}{1 - \gamma}$.


VI and PI Algorithms - Comparison

Theorem: let $(U_n)_{n \in \mathbb{N}}$ be the sequence of policy values generated by the VI algorithm, and $(V_n)_{n \in \mathbb{N}}$ the one generated by the PI algorithm. If $U_0 = V_0$, then

$$\forall n \in \mathbb{N}, \quad U_n \le V_n \le V^*.$$

Proof: we first show that $\Phi$ is monotonic. Let $U$ and $V$ be such that $U \le V$ and let $\pi$ be the policy such that $\Phi(U) = \mathbf{R}_\pi + \gamma \mathbf{P}_\pi U$. Then,

$$\Phi(U) \le \mathbf{R}_\pi + \gamma \mathbf{P}_\pi V \le \max_{\pi'} \{\mathbf{R}_{\pi'} + \gamma \mathbf{P}_{\pi'} V\} = \Phi(V).$$


VI and PI Algorithms - Comparison

  • The proof is by induction on $n$. Assume $U_n \le V_n$; then, by the monotonicity of $\Phi$,

$$U_{n+1} = \Phi(U_n) \le \Phi(V_n) = \max_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V_n\}.$$

  • Let $\pi_{n+1}$ be the maximizing policy: $\pi_{n+1} = \operatorname{argmax}_\pi \{\mathbf{R}_\pi + \gamma \mathbf{P}_\pi V_n\}$.
  • Then,

$$\Phi(V_n) = \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_n \le \mathbf{R}_{\pi_{n+1}} + \gamma \mathbf{P}_{\pi_{n+1}} V_{n+1} = V_{n+1}.$$


Notes

The PI algorithm converges in a smaller number of iterations than the VI algorithm, due to the optimal (greedy) policy improvement at each step. But each iteration of the PI algorithm requires computing a policy value, i.e., solving a system of linear equations, which is more expensive than an iteration of the VI algorithm.


Primal Linear Program

LP formulation: choose $\alpha > 0$, with $\sum_s \alpha(s) = 1$:

$$\min_V \sum_{s \in S} \alpha(s) V(s) \quad \text{subject to} \quad \forall s \in S, \forall a \in A, \ V(s) \ge \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s').$$

Parameters:

  • number of rows: $|S||A|$.
  • number of columns: $|S|$.
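One possible way to set this primal LP up with SciPy's linprog (a sketch under the same assumed (action, state, state) tensor layout as the earlier sketches; the uniform choice of alpha is an illustrative assumption):

```python
import numpy as np
from scipy.optimize import linprog

def lp_optimal_values(P, R, gamma):
    """min_V sum_s alpha(s) V(s)  s.t.  V(s) >= R[a, s] + gamma * sum_s1 P[a, s, s1] V(s1)."""
    n_actions, n_states, _ = P.shape
    alpha = np.full(n_states, 1.0 / n_states)        # any alpha > 0 summing to 1
    # One inequality row per (s, a) pair, written in A_ub @ V <= b_ub form:
    #   gamma * sum_s1 P[a, s, s1] V(s1) - V(s) <= -R[a, s]
    A_ub = np.concatenate([gamma * P[a] - np.eye(n_states) for a in range(n_actions)])
    b_ub = np.concatenate([-R[a] for a in range(n_actions)])
    res = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * n_states)
    return res.x                                      # one optimal value per state
```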

Dual Linear Program

LP formulation:

$$\max_x \sum_{s \in S, a \in A} \mathbb{E}[r(s, a)]\, x(s, a) \quad \text{subject to} \quad \forall s' \in S, \ \sum_{a \in A} x(s', a) = \alpha(s') + \gamma \sum_{s \in S, a \in A} \Pr[s' \mid s, a]\, x(s, a); \quad \forall s \in S, \forall a \in A, \ x(s, a) \ge 0.$$

Parameters: more favorable number of rows.

  • number of rows: $|S|$.
  • number of columns: $|S||A|$.


This Lecture

  • Markov Decision Processes (MDPs)
  • Planning
  • Learning
  • Multi-armed bandit problem


Problem

Unknown model:

  • transition and reward probabilities not known.
  • realistic scenario in many practical problems, e.g., robot control.

Training information: sequence of immediate rewards based on actions taken.

Learning approaches:

  • model-free: learn the policy directly.
  • model-based: learn the model, use it to learn the policy.


Problem

How do we estimate the reward and transition probabilities?

  • use the equations derived for the policy value and Q-functions.
  • but those equations are given in terms of expectations.
  • this is an instance of a stochastic approximation problem.


Stochastic Approximation

Problem: find a solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while

  • $H(x)$ cannot be computed, e.g., $H$ not accessible;
  • an i.i.d. sample of noisy observations $H(x_i) + w_i$, $i \in [1, m]$, is available, with $\mathbb{E}[w] = 0$.

Idea: algorithm based on an iterative technique:

$$x_{t+1} = (1 - \alpha_t) x_t + \alpha_t [H(x_t) + w_t] = x_t + \alpha_t [H(x_t) + w_t - x_t].$$

  • more generally, $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$.


Mean Estimation

Theorem: let $X$ be a random variable taking values in $[0, 1]$ and let $x_0, \ldots, x_m$ be i.i.d. values of $X$. Define the sequence $(\mu_m)_{m \in \mathbb{N}}$ by

$$\mu_{m+1} = (1 - \alpha_m)\mu_m + \alpha_m x_m, \quad \text{with } \mu_0 = x_0.$$

Then, for $\alpha_m \in [0, 1]$, with $\sum_{m \ge 0} \alpha_m = +\infty$ and $\sum_{m \ge 0} \alpha_m^2 < +\infty$,

$$\mu_m \xrightarrow{\text{a.s.}} \mathbb{E}[X].$$
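A small simulation sketch of this recursion; the Bernoulli distribution and the step sizes $\alpha_m = 1/(m+1)$ (which satisfy both conditions) are illustrative choices, not prescribed by the theorem:

```python
import random

random.seed(0)
mu, true_mean = 0.0, 0.3
for m in range(100000):
    x = 1.0 if random.random() < true_mean else 0.0   # i.i.d. draws of X ~ Bernoulli(0.3), values in [0, 1]
    alpha = 1.0 / (m + 1)                             # sum(alpha_m) = +inf, sum(alpha_m^2) < +inf
    mu = (1 - alpha) * mu + alpha * x                 # mu_{m+1} = (1 - alpha_m) mu_m + alpha_m x_m
print(mu)   # close to E[X] = 0.3
```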


Proof

Proof: by the independence assumption, for $m \ge 0$,

$$\operatorname{Var}[\mu_{m+1}] = (1 - \alpha_m)^2 \operatorname{Var}[\mu_m] + \alpha_m^2 \operatorname{Var}[x_m] \le (1 - \alpha_m) \operatorname{Var}[\mu_m] + \alpha_m^2.$$

  • We have $\alpha_m \to 0$ since $\sum_{m \ge 0} \alpha_m^2 < +\infty$.
  • Let $\epsilon > 0$ and suppose there exists $N \in \mathbb{N}$ such that for all $m \ge N$, $\operatorname{Var}[\mu_m] \ge \epsilon$. Then, for $m \ge N$,

$$\operatorname{Var}[\mu_{m+1}] \le \operatorname{Var}[\mu_m] - \alpha_m \epsilon + \alpha_m^2,$$

which implies

$$\operatorname{Var}[\mu_{m+N}] \le \operatorname{Var}[\mu_N] - \epsilon \sum_{n=N}^{m+N} \alpha_n + \sum_{n=N}^{m+N} \alpha_n^2 \;\longrightarrow\; -\infty \quad \text{when } m \to \infty,$$

contradicting $\operatorname{Var}[\mu_{m+N}] \ge 0$.


Mean Estimation

  • Thus, for all $\epsilon > 0$ there exists $N \in \mathbb{N}$ such that $\forall m \ge N$, $\alpha_m \le \epsilon$. Choose $m_0 \ge N$ large enough so that $\operatorname{Var}[\mu_{m_0}] < \epsilon$. Then, $\operatorname{Var}[\mu_{m_0+1}] \le (1 - \alpha_{m_0})\epsilon + \alpha_{m_0}\epsilon = \epsilon$.
  • Therefore, $\operatorname{Var}[\mu_m] \le \epsilon$ for all $m \ge m_0$ ($L^2$ convergence).


Notes

Special case: $\alpha_m = \frac{1}{m}$.

  • Strong law of large numbers.

Connection with stochastic approximation.


TD(0) Algorithm

Idea: recall Bellman's linear equations giving $V_\pi$:

$$V_\pi(s) = \mathbb{E}[r(s, \pi(s))] + \gamma \sum_{s'} \Pr[s' \mid s, \pi(s)]\, V_\pi(s') = \mathbb{E}_{s'}\big[ r(s, \pi(s)) + \gamma V_\pi(s') \mid s \big].$$

Algorithm: temporal difference (TD).

  • sample a new state $s'$.
  • update ($\alpha$ depends on the number of visits of $s$):

$$V(s) \leftarrow (1 - \alpha)V(s) + \alpha[r(s, \pi(s)) + \gamma V(s')] = V(s) + \alpha \underbrace{[r(s, \pi(s)) + \gamma V(s') - V(s)]}_{\text{temporal difference of } V \text{ values}}.$$


TD(0) Algorithm

TD(0)()
 1  V ← V0                                  (initialization)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          r ← Reward(s, π(s))
 6          s′ ← NextState(π, s)
 7          V(s) ← (1 − α)V(s) + α[r + γV(s′)]
 8          s ← s′
 9  return V
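A Python sketch of the same loop; the environment interface (`env.reset`, `env.step`) and the constant step size are assumptions made for the illustration, not part of the slides:

```python
from collections import defaultdict

def td0(env, policy, gamma=0.9, alpha=0.1, n_epochs=1000, epoch_len=100):
    """Estimate V_pi by the TD(0) update V(s) <- (1 - alpha) V(s) + alpha [r + gamma V(s')]."""
    V = defaultdict(float)                   # V0 = 0 for every state
    for _ in range(n_epochs):
        s = env.reset()                      # SelectState()
        for _ in range(epoch_len):
            a = policy(s)
            s_next, r = env.step(s, a)       # Reward and NextState
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V
```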


Q-Learning Algorithm

Idea: assume deterministic rewards. Recall that

$$Q(s, a) = \mathbb{E}[r(s, a)] + \gamma \sum_{s' \in S} \Pr[s' \mid s, a]\, V(s') = \mathbb{E}_{s'}\big[ r(s, a) + \gamma \max_{a' \in A} Q(s', a') \big].$$

Algorithm: $\alpha \in [0, 1]$ depends on the number of visits.

  • sample a new state $s'$.
  • update: $Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big[ r(s, a) + \gamma \max_{a' \in A} Q(s', a') \big]$.


Q-Learning Algorithm

(Watkins, 1989; Watkins and Dayan, 1992)

Q-Learning(π)
 1  Q ← Q0                                  (initialization, e.g., Q0 = 0)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      for each step of epoch t do
 5          a ← SelectAction(π, s)          (policy π derived from Q, e.g., ε-greedy)
 6          r ← Reward(s, a)
 7          s′ ← NextState(s, a)
 8          Q(s, a) ← Q(s, a) + α[r + γ maxa′ Q(s′, a′) − Q(s, a)]
 9          s ← s′
10  return Q
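A Python sketch of tabular Q-learning with an ε-greedy behavior policy; the environment interface and the constant step size are again illustrative assumptions:

```python
import random
from collections import defaultdict

def q_learning(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1,
               n_epochs=1000, epoch_len=100):
    """Q-learning with an epsilon-greedy behavior policy derived from Q."""
    Q = defaultdict(float)                                   # Q0 = 0
    for _ in range(n_epochs):
        s = env.reset()                                      # SelectState()
        for _ in range(epoch_len):
            if random.random() < epsilon:                    # SelectAction: epsilon-greedy
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r = env.step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```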


Notes

Can be viewed as a stochastic formulation of the value iteration algorithm. Convergence holds for any policy, so long as all states and actions are visited infinitely often. How to choose the action at each iteration? Maximize reward? Explore other actions? Q-learning is an off-policy method: no control over the policy.


Policies

Epsilon-greedy strategy:

  • with probability $1 - \epsilon$: greedy action from $s$;
  • with probability $\epsilon$: random action.

Epoch-dependent strategy (Boltzmann exploration):

$$p_t(a \mid s, Q) = \frac{e^{\frac{Q(s, a)}{\tau_t}}}{\sum_{a' \in A} e^{\frac{Q(s, a')}{\tau_t}}},$$

  • $\tau_t \to 0$: greedy selection.
  • larger $\tau_t$: random action.
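A small sketch of Boltzmann action selection, assuming (for illustration) a Q table stored as a dict keyed by (state, action) pairs:

```python
import math
import random

def boltzmann_action(Q, s, actions, tau):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    q_max = max(Q[(s, a)] for a in actions)                       # subtract max for numerical stability
    weights = [math.exp((Q[(s, a)] - q_max) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

As the temperature tau goes to 0 the weights concentrate on the greedy action, while a large tau makes the choice nearly uniform.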


Convergence of Q-Learning

Theorem: consider a finite MDP. Assume that for all $s \in S$ and $a \in A$,

$$\sum_{t=0}^{\infty} \alpha_t(s, a) = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2(s, a) < \infty,$$

with $\alpha_t(s, a) \in [0, 1]$. Then, the Q-learning algorithm converges to the optimal value $Q^*$ (with probability one).

  • note: the conditions on $\alpha_t(s, a)$ impose that each state-action pair is visited infinitely many times.


SARSA: On-Policy Algorithm

SARSA(π)
 1  Q ← Q0                                  (initialization, e.g., Q0 = 0)
 2  for t ← 0 to T do
 3      s ← SelectState()
 4      a ← SelectAction(π(Q), s)           (policy derived from Q, e.g., ε-greedy)
 5      for each step of epoch t do
 6          r ← Reward(s, a)
 7          s′ ← NextState(s, a)
 8          a′ ← SelectAction(π(Q), s′)     (policy derived from Q, e.g., ε-greedy)
 9          Q(s, a) ← Q(s, a) + αt(s, a)[r + γQ(s′, a′) − Q(s, a)]
10          s ← s′
11          a ← a′
12  return Q
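A Python sketch mirroring the Q-learning sketch above, but with the on-policy SARSA target; the environment interface and constant parameters remain illustrative assumptions:

```python
import random
from collections import defaultdict

def sarsa(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1,
          n_epochs=1000, epoch_len=100):
    """SARSA: on-policy; bootstraps on the action actually taken at the next state."""
    def select_action(Q, s):                                 # epsilon-greedy policy derived from Q
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])

    Q = defaultdict(float)                                   # Q0 = 0
    for _ in range(n_epochs):
        s = env.reset()
        a = select_action(Q, s)
        for _ in range(epoch_len):
            s_next, r = env.step(s, a)
            a_next = select_action(Q, s_next)
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```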


Notes

Differences with Q-learning:

  • two states are involved in each update: the current state and the next state.
  • the maximum Q value at the next state is not used; instead, the update uses the new action actually selected there.

SARSA: name derived from the sequence of updates (state, action, reward, state, action).


TD(λ) Algorithm

Idea:

  • TD(0) or Q-learning only use the immediate reward.
  • use multiple steps ahead instead; for $n$ steps:

$$R_t^n = r_{t+1} + \gamma r_{t+2} + \ldots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n}), \qquad V(s) \leftarrow V(s) + \alpha \big( R_t^n - V(s) \big).$$

  • TD($\lambda$) uses $R_t^\lambda = (1 - \lambda) \sum_{n=0}^{\infty} \lambda^n R_t^n$.

Algorithm: $V(s) \leftarrow V(s) + \alpha \big( R_t^\lambda - V(s) \big)$.


TD(λ) Algorithm

TD(λ)()
 1  V ← V0                                  (initialization)
 2  e ← 0
 3  for t ← 0 to T do
 4      s ← SelectState()
 5      for each step of epoch t do
 6          s′ ← NextState(π, s)
 7          δ ← r(s, π(s)) + γV(s′) − V(s)
 8          e(s) ← e(s) + 1
 9          for each u ∈ S do
10              if u ≠ s then
11                  e(u) ← γλ e(u)
12              V(u) ← V(u) + αδ e(u)
13          s ← s′
14  return V
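A Python sketch of this loop with accumulating eligibility traces, following the reconstructed pseudocode above; the environment interface, the explicit state list, and the constant step size are assumptions of the illustration:

```python
from collections import defaultdict

def td_lambda(env, policy, states, lam=0.8, gamma=0.9, alpha=0.1,
              n_epochs=100, epoch_len=100):
    """TD(lambda) policy evaluation with accumulating eligibility traces."""
    V = defaultdict(float)                            # V0 = 0
    e = defaultdict(float)                            # eligibility traces
    for _ in range(n_epochs):
        s = env.reset()                               # SelectState()
        for _ in range(epoch_len):
            s_next, r = env.step(s, policy(s))
            delta = r + gamma * V[s_next] - V[s]      # temporal difference
            e[s] += 1.0                               # accumulate the trace of the visited state
            for u in states:
                if u != s:
                    e[u] *= gamma * lam               # decay the traces of the other states
                V[u] += alpha * delta * e[u]          # each state updated in proportion to its trace
            s = s_next
    return V
```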


TD-Gammon

Large state space or costly actions: use a regression algorithm to estimate $Q$ for unseen values.

Backgammon:

  • large number of positions: 30 pieces, 24-26 locations,
  • large number of moves.

TD-Gammon: used neural networks.

  • non-linear form of TD($\lambda$), 1.5M games played,
  • almost as good as world-class humans (master level).

(Tesauro, 1995)


This Lecture

  • Markov Decision Processes (MDPs)
  • Planning
  • Learning
  • Multi-armed bandit problem


Multi-Armed Bandit Problem

(Robbins, 1952)

Problem: a gambler must decide which arm of an $N$-slot machine to pull to maximize the total reward in a series of trials.

  • stochastic setting: $N$ lever reward distributions.
  • adversarial setting: reward selected by an adversary aware of all the past.


Applications

  • Clinical trials.
  • Adaptive routing.
  • Ads placement on pages.
  • Games.


Multi-Armed Bandit Game

For $t = 1$ to $T$ do

  • adversary determines outcome $y_t \in Y$.
  • player selects probability distribution $p_t$ and pulls lever $I_t \in \{1, \ldots, N\}$, $I_t \sim p_t$.
  • player incurs loss $L(I_t, y_t)$ (adversary is informed of $p_t$ and $I_t$).

Objective: minimize the regret

$$\text{Regret}(T) = \sum_{t=1}^{T} L(I_t, y_t) - \min_{i=1, \ldots, N} \sum_{t=1}^{T} L(i, y_t).$$


Notes

Player is informed only of the loss (or reward) corresponding to his own action. Adversary knows the past but not the action selected.

Stochastic setting: loss vector $(L(1, y_t), \ldots, L(N, y_t))$ drawn according to some distribution $D = D_1 \otimes \cdots \otimes D_N$. Regret definition modified by taking expectations.

Exploration/exploitation trade-off: playing the best arm found so far versus seeking to find an arm with a better payoff.


Notes

Equivalent views:

  • special case of learning with partial information.
  • one-state MDP learning problem.

Simple strategy: $\epsilon$-greedy: play the arm with best empirical reward with probability $1 - \epsilon_t$, a random arm with probability $\epsilon_t$.


Exponentially Weighted Average

Algorithm: Exp3, defined for $\eta, \gamma > 0$ by

$$p_{i,t} = (1 - \gamma)\, \frac{\exp\big( -\eta \sum_{s=1}^{t-1} \widehat{l}_{i,s} \big)}{\sum_{j=1}^{N} \exp\big( -\eta \sum_{s=1}^{t-1} \widehat{l}_{j,s} \big)} + \frac{\gamma}{N}, \quad \text{with } \forall i \in [1, N], \ \widehat{l}_{i,t} = \frac{L(I_t, y_t)}{p_{I_t,t}}\, 1_{I_t = i}.$$

Guarantee: expected regret of $O\big(\sqrt{NT \log N}\big)$.
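A short Python sketch of Exp3 under these definitions; the loss callback and the fixed choice of eta and gamma are assumptions of the illustration:

```python
import math
import random

def exp3(n_arms, T, loss, eta, gamma):
    """Exp3: exponential weights over importance-weighted loss estimates, mixed with
    uniform exploration. `loss(i, t)` must return L(i, y_t) in [0, 1] for the pulled arm."""
    cum = [0.0] * n_arms                                    # cumulative estimated losses
    for t in range(T):
        m = min(cum)                                        # shift for numerical stability
        w = [math.exp(-eta * (c - m)) for c in cum]
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / n_arms for wi in w]
        arm = random.choices(range(n_arms), weights=p)[0]   # I_t ~ p_t
        cum[arm] += loss(arm, t) / p[arm]                   # importance-weighted loss estimate
    return cum
```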

Exponentially Weighted Average

Proof: similar to the one for the Exponentially Weighted Average algorithm, with the additional observation that:

$$\mathbb{E}_{I_t \sim p_t}\big[ \widehat{l}_{i,t} \big] = \sum_{j=1}^{N} p_{j,t}\, \frac{L(j, y_t)}{p_{j,t}}\, 1_{j = i} = L(i, y_t).$$


References

  • Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007.
  • Mehryar Mohri. Semiring Frameworks and Algorithms for Shortest-Distance Problems. Journal of Automata, Languages and Combinatorics, 7(3):321-350, 2002.
  • Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, New York, 1994.
  • Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527-535, 1952.
  • Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.


References

  • Gerald Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 38(3), 1995.
  • Christopher J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
  • Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4), 1992.


Appendix


Stochastic Approximation

Problem: find a solution of $x = H(x)$ with $x \in \mathbb{R}^N$, while

  • $H(x)$ cannot be computed, e.g., $H$ not accessible;
  • an i.i.d. sample of noisy observations $H(x_i) + w_i$, $i \in [1, m]$, is available, with $\mathbb{E}[w] = 0$.

Idea: algorithm based on an iterative technique:

$$x_{t+1} = (1 - \alpha_t) x_t + \alpha_t [H(x_t) + w_t] = x_t + \alpha_t [H(x_t) + w_t - x_t].$$

  • more generally, $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$.


Supermartingale Convergence

Theorem: let $X_t$, $Y_t$, $Z_t$ be non-negative random variables such that $\sum_{t=0}^{\infty} Y_t < \infty$. If the condition $\mathbb{E}[X_{t+1} \mid \mathcal{F}_t] \le X_t + Y_t - Z_t$ holds, then,

  • $X_t$ converges to a limit (with probability one).
  • $\sum_{t=0}^{\infty} Z_t < \infty$.


Convergence Analysis

Convergence of $x_{t+1} = x_t + \alpha_t D(x_t, w_t)$, with history $\mathcal{F}_t$ defined by

$$\mathcal{F}_t = \{(x_{t'})_{t' \le t}, (\alpha_{t'})_{t' \le t}, (w_{t'})_{t' < t}\}.$$

Theorem: let $\Psi \colon x \mapsto \frac{1}{2}\|x - x^*\|_2^2$ for some $x^*$ and assume that

  • $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$;
  • $\exists K_1, K_2 \colon \mathbb{E}\big[ \|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t \big] \le K_1 + K_2 \Psi(x_t)$;
  • $\exists c > 0 \colon \nabla \Psi(x_t)^\top \mathbb{E}\big[ D(x_t, w_t) \mid \mathcal{F}_t \big] \le -c\, \Psi(x_t)$.

Then, $x_t \xrightarrow{\text{a.s.}} x^*$.


Convergence Analysis

Proof: since $\Psi$ is a quadratic function,

$$\Psi(x_{t+1}) = \Psi(x_t) + \nabla \Psi(x_t)^\top (x_{t+1} - x_t) + \tfrac{1}{2}(x_{t+1} - x_t)^\top \nabla^2 \Psi(x_t)\, (x_{t+1} - x_t).$$

Thus,

$$\mathbb{E}\big[ \Psi(x_{t+1}) \mid \mathcal{F}_t \big] = \Psi(x_t) + \alpha_t \nabla \Psi(x_t)^\top \mathbb{E}\big[ D(x_t, w_t) \mid \mathcal{F}_t \big] + \frac{\alpha_t^2}{2} \mathbb{E}\big[ \|D(x_t, w_t)\|_2^2 \mid \mathcal{F}_t \big] \le \Psi(x_t) - \alpha_t c\, \Psi(x_t) + \frac{\alpha_t^2}{2}\big(K_1 + K_2 \Psi(x_t)\big) = \Psi(x_t) + \frac{\alpha_t^2 K_1}{2} - \Big( \alpha_t c - \frac{\alpha_t^2 K_2}{2} \Big) \Psi(x_t).$$

By the supermartingale convergence theorem, $\Psi(x_t)$ converges and

$$\sum_{t=0}^{\infty} \Big( \alpha_t c - \frac{\alpha_t^2 K_2}{2} \Big) \Psi(x_t) < \infty \qquad \text{(the coefficient is non-negative for large } t\text{)}.$$

Since $\alpha_t > 0$, $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, $\Psi(x_t)$ must converge to 0.