SLIDE 1

Reinforcement Learning

An (almost) quick (and very incomplete) introduction

Slides from David Silver, Dan Klein, Mausam, Dan Weld

SLIDE 2

SLIDE 3

Reinforcement Learning

At each time step t:

  • Agent executes an action A_t
  • Environment emits a reward R_t
  • Agent transitions to state S_t
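
A minimal Python sketch of this interaction loop (the env object, its reset/step interface, and the policy callable are illustrative assumptions, not from the slides):

```python
# Sketch of the agent-environment loop: at each step t the agent executes
# an action A_t, the environment emits a reward R_t, and the agent
# transitions to a new state S_t. Interfaces are assumed, Gym-style.
def run_episode(env, policy, max_steps=1000):
    state = env.reset()                          # initial state S_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent executes action A_t
        state, reward, done = env.step(action)   # environment emits R_t and next state S_t
        total_reward += reward
        if done:
            break
    return total_reward
```
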
SLIDE 4

Rat Example

SLIDE 5

Rat Example

  • What if agent state = last 3 items in sequence?
SLIDE 6

Rat Example

  • What if agent state = last 3 items in sequence?
  • What if agent state = counts for lights, bells and levers?
SLIDE 7

Rat Example

  • What if agent state = last 3 items in sequence?
  • What if agent state = counts for lights, bells and levers?
  • What if agent state = complete sequence?
SLIDE 8

Major Components of RL

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function
  • Value function: how good is each state and/or action
  • Model: agent’s representation of the environment
SLIDE 9

Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action
  • Deterministic policy: a = π(s)
  • Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
SLIDE 10

Value function

  • Value function is a prediction of future reward
  • Used to evaluate the goodness/badness of states…
  • …and therefore to select between actions
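
The defining equation was not part of the extracted text; the standard form, assuming the usual discounted-return definition, is:

```latex
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s \right]
```
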
SLIDE 11

Model

  • A model predicts what the environment will do next
  • It predicts the next state…
  • …and predicts the next (immediate) reward
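
In the usual notation (not shown in the extracted slide), the model therefore has a state-transition part and a reward part:

```latex
\mathcal{P}^{a}_{ss'} = \Pr\left[ S_{t+1} = s' \mid S_t = s,\, A_t = a \right], \qquad
\mathcal{R}^{a}_{s} = \mathbb{E}\left[ R_{t+1} \mid S_t = s,\, A_t = a \right]
```
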
SLIDE 12

Dimensions of RL

Model-based vs. Model-free

  • Model-based: have/learn action models (i.e. transition probabilities)
  • Uses Dynamic Programming
  • Model-free: skip the model and directly learn what action to take in each state (without necessarily recovering the exact model of the actions)
  • e.g. Q-learning

On Policy vs. Off Policy

  • On-policy: estimates values under the policy being followed, and improves that same policy from the estimates
  • "Learning on the job"
  • e.g. SARSA
  • Off-policy: learns about one policy while following another (or re-using experience from old policies)
  • "Looking over someone's shoulder"
  • e.g. Q-learning
SLIDE 13

Markov Decision Process

  • Set of states S = {s_i}
  • Set of actions for each state A(s) = {a_i} (often independent of state)
  • Transition model T(s → s' | a) = Pr(s' | a, s)
  • Reward model R(s, a, s')
  • Discount factor γ

MDP = ⟨S, A, T, R, γ⟩
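
As an illustration of the tuple above, a tiny two-state MDP might be encoded like this in Python (the state/action names and numbers are made up for the example):

```python
# Illustrative encoding of a 2-state MDP <S, A, T, R, gamma>; values are invented.
S = ["s0", "s1"]
A = ["stay", "move"]
# T[s][a] maps a next state s' to Pr(s' | s, a)
T = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}
# R[(s, a, s')] is the reward for that transition; unlisted transitions give 0
R = {("s0", "move", "s1"): 1.0}
gamma = 0.9
```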

SLIDE 14

Bellman Equation for Value Function
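
(The equation itself was an image and is missing from the extracted text; the standard Bellman expectation equation, written with the MDP components from the previous slide, is:)

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \Pr(s' \mid s, a)\, \bigl[ R(s, a, s') + \gamma\, v_\pi(s') \bigr]
```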

SLIDE 15

Bellman Equation for Action-Value Function
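
(Again the equation is missing from the extraction; the standard Bellman expectation equation for the action-value function is:)

```latex
q_\pi(s, a) = \sum_{s'} \Pr(s' \mid s, a)\, \Bigl[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Bigr]
```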

SLIDE 16

Q vs V
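
(The slide body was not extracted; the usual relationship between the two is:)

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a), \qquad v_*(s) = \max_{a} q_*(s, a)
```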

SLIDE 17

Exploration vs Exploitation

  • Restaurant Selection
      • Exploitation: Go to your favourite restaurant
      • Exploration: Try a new restaurant
  • Online Banner Advertisements
      • Exploitation: Show the most successful advert
      • Exploration: Show a different advert
  • Oil Drilling
      • Exploitation: Drill at the best known location
      • Exploration: Drill at a new location
  • Game Playing
      • Exploitation: Play the move you believe is best
      • Exploration: Play an experimental move
SLIDE 18

ε-Greedy solution

  • Simplest idea for ensuring continual exploration
  • All m actions are tried with non-zero probability
  • With probability 1 − ε choose the greedy action
  • With probability ε choose an action at random
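
A minimal Python sketch of this rule (the Q table and action list are assumed to exist; every one of the m actions ends up being tried with probability at least ε/m):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy action w.r.t. Q(state, action)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```
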
SLIDE 19

Off Policy Learning

  • Evaluate target policy π(a|s) to compute v_π(s) or q_π(s,a) while following behaviour policy μ(a|s): {S_1, A_1, R_2, ..., S_T} ∼ μ

  • Why is this important?
  • Learn from observing humans or other agents
  • Re-use experience generated from old policies π1, π2, ..., πt−1
  • Learn about optimal policy while following exploratory policy
  • Learn about multiple policies while following one policy
SLIDE 20

Q - Learning

  • We now consider off-policy learning of action-values Q(s,a)
  • The next action is chosen using the behaviour policy: A_{t+1} ∼ μ(·|S_{t+1})
  • But we consider an alternative successor action A′ ∼ π(·|S_{t+1})
  • And update Q(S_t, A_t) towards the value of the alternative action
SLIDE 21

Q - Learning

  • We now allow both behaviour and target policies to improve
  • The target policy π is greedy w.r.t. Q(s,a)
  • The behaviour policy μ is e.g. ε-greedy w.r.t. Q(s,a)
  • The Q-learning target then simplifies:
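
(The simplified target was an equation image that did not survive extraction; with a greedy target policy it reduces to the familiar Q-learning update:)

```latex
R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'), \qquad
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Bigl[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \Bigr]
```
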
SLIDE 22

Q - Learning

SLIDE 23

Q - Learning
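
These slides presumably stepped through the algorithm itself; as a hedged illustration (the environment interface, hyperparameters, and structure are assumptions, not the slides' own code), a tabular version looks roughly like this:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy tabular Q-learning: behave epsilon-greedily,
    but bootstrap from the greedy (max) successor action value."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # behaviour policy: epsilon-greedy w.r.t. Q
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # target policy: greedy w.r.t. Q
            bootstrap = 0.0 if done else gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + bootstrap - Q[(state, action)])
            state = next_state
    return Q
```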

SLIDE 24

Deep RL

  • We seek a single agent which can solve any human-level task
  • RL defines the objective
  • DL gives the mechanism
  • RL + DL = general intelligence (David Silver)
SLIDE 25

Function Approximators
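
(The slide content, likely a figure, was not extracted; the idea in deep RL is to replace the lookup table with a parameterised approximator, e.g. a neural network with weights w:)

```latex
\hat{q}(s, a, \mathbf{w}) \approx q_*(s, a)
```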

SLIDE 26

Deep Q-Networks

  • Q-learning diverges when combined with neural networks, due to:
  • Correlations between samples
  • Non-stationary targets
SLIDE 27

Solution: Experience Replay

  • Fancy biological analogy
  • In reality, quite simple
SLIDE 28

Solution: Experience Replay
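
(The slide, presumably pseudocode or a diagram, was not extracted; a minimal sketch of the idea, which stores transitions and samples random minibatches to break the correlation between consecutive samples, might look like this, with all names illustrative:)

```python
import random
from collections import deque

class ReplayBuffer:
    """Store (s, a, r, s', done) transitions; sample uniform random minibatches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are dropped

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform sampling decorrelates updates (one of the two DQN fixes)
        return random.sample(self.buffer, batch_size)
```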

SLIDE 29

Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning

Karthik Narasimhan, Adam Yala, Regina Barzilay CSAIL, MIT

Slides from Karthik Narasimhan

SLIDES 30–35 (no text extracted from these slides)
SLIDE 36

Why try to reason, when someone else can do it for you

SLIDES 37–70 (no text extracted from these slides)

Doubts*

  • Algorithm 1, line 19: the process should end when d == "end_episode", not q. [Prachi] Error.
  • The dimension of the match vector should equal the number of columns to be extracted, but Fig. 3 shows twice that many dimensions. [Prachi] Error.
  • Is RL the best approach? [Non-believers]
  • Experience Replay [Anshul]. Hope it is clear now.
  • Why is RL-Extract better than the meta-classifier? The paper's explanation about the "long tail of noisy, irrelevant documents" is unclear. [Yash]
  • To be completely fair, the meta-classifier should also cut off at the top 20 results per search, like the RL system does. [Anshul]

* most of these are really questions

SLIDE 71

Discussions

  • Experiments
      • People are happy!
  • Queries
      • Cluster documents and learn queries [Yashoteja]
      • Many other query formulations [Surag (lowest-confidence entity), Barun (LSTM), Gagan (highest-confidence entity), DineshR]
      • Fixed set of queries [Akshay]
      • Simplicity: search engines are robust.
  • Reliance on news articles [Gagan]
      • Where else would you get news from?
  • Domain limitations
      • Too narrow [Barun, Himanshu]; domain-specific [Happy]; small ontology [Akshay]
      • It is not Open IE; it is task-specific, but can be applied to any domain.
  • Better meta-classifiers [Surag]
  • Effect of more sophisticated RL algorithms (A3C, TRPO) on performance and training time [especially if the action space grows with LSTM-generated queries]