SLIDE 1

Introduction to Artificial Intelligence (Lecture 6)

  • Intro. on Artificial Intelligence from the perspective of probability theory

罗智凌 (Zhiling Luo)

luozhiling@zju.edu.cn
College of Computer Science, Zhejiang University
http://www.bruceluo.net

SLIDE 2

OUTLINE

  • Intro on Reinforcement Learning
  • Learning with Reward

– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm

  • Learning without Reward

– Inverse Reinforcement Learning

  • AlphaGo
SLIDE 3

Intro on Reinforcement Learning

  • Data: (x, y). x is data, y is label
  • Goal: Learn a function to map x -> y
  • Examples: Classification, regression, object detection, semantic segmentation, image captioning, etc.

So far... Supervised Learning

[Figure: example image classified as "Cat"]

SLIDE 4

Intro on Reinforcement Learning

  • Data: x. Just data, no labels!
  • Goal: Learn some underlying hidden structure of the data
  • Examples: Clustering, dimensionality reduction, feature learning, density estimation, etc.

So far... Unsupervised Learning

[Figures: 1-d and 2-d density estimation]

SLIDE 5

Today: Reinforcement Learning

Problems involving an agent interacting with an environment, which provides numeric reward signals.

Goal: Learn how to take actions in order to maximize reward.

Generalization: learn how to take actions in order to reach a goal.

SLIDE 6

OUTLINE

  • Intro on Reinforcement Learning
  • Learning with Reward

– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm

  • Learning without Reward

– Inverse Reinforcement Learning

  • AlphaGo
SLIDE 7

[Diagram: the agent interacts with the environment]

Reinforcement Learning

SLIDE 8

[Diagram: the environment provides the agent with state s_t]

Reinforcement Learning

SLIDE 9

[Diagram: the agent responds with action a_t]

Reinforcement Learning

SLIDE 10

[Diagram: the environment returns reward r_t]

Reinforcement Learning

SLIDE 11

[Diagram: the environment returns reward r_t and next state s_{t+1}]

Reinforcement Learning

SLIDE 12

Cart-Pole Problem

Objective: Balance a pole on top of a movable cart
State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

SLIDE 13

Robot Locomotion

Objective: Make the robot move forward
State: Angle and position of the joints
Action: Torques applied on the joints
Reward: 1 at each time step for being upright + forward movement

SLIDE 14

Atari Games

Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

SLIDE 15

Go

Objective: Win the game!
State: Position of all pieces
Action: Where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise

SLIDE 16

Cart-Pole, Robot Locomotion, Atari Games, Go: sequential decision making on a specified goal.

Evaluation of how good the near future is.
SLIDE 17

[Diagram: agent-environment loop with state s_t, action a_t, reward r_t, next state s_{t+1}]

How can we mathematically formalize the RL problem?

SLIDE 18

Markov Decision Process

  • Mathematical formulation of the RL problem
  • Markov property: the current state completely characterises the state of the world

Defined by the tuple (S, A, R, P, γ):
  S: set of possible states
  A: set of possible actions
  R: distribution of reward given a (state, action) pair
  P: transition probability, i.e. distribution over the next state given a (state, action) pair
  γ: discount factor

SLIDE 19

Markov Decision Process (Generative Process)

  • At time step t=0, the environment samples an initial state s_0 ~ p(s_0)
  • Then, for t = 0 until done:
  • Agent selects action a_t
  • Environment samples reward r_t ~ R( . | s_t, a_t)
  • Environment samples next state s_{t+1} ~ P( . | s_t, a_t)
  • Agent receives reward r_t and next state s_{t+1}
  • A (deterministic) policy π is a function from S to A that specifies what action to take in each state. A (stochastic) policy π(a|s) gives the probability Pr(a|s) of taking action a in state s.
  • Objective: find the policy π* that maximizes the cumulative discounted reward Σ_{t≥0} γ^t r_t, with a_t ~ π(·|s_t)
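A minimal sketch of this generative process in Python. The two-state MDP, its rewards, and the random policy below are made-up illustrations, not anything defined on the slides:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, used only to illustrate the loop above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]: expected reward
gamma = 0.9

rng = np.random.default_rng(0)
s = int(rng.integers(2))                  # s_0 ~ p(s_0)
ret = 0.0
for t in range(100):                      # for t until done
    a = int(rng.integers(2))              # agent selects a_t (random policy)
    r = R[s, a]                           # environment samples reward r_t
    s = rng.choice(2, p=P[s, a])          # next state s_{t+1} ~ P(.|s_t, a_t)
    ret += gamma ** t * r                 # accumulate discounted reward
print("discounted return:", ret)
```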

SLIDE 20

A simple MDP: Grid World

actions = { 1. right, 2. left, 3. up, 4. down }

Objective: reach one of the terminal states (marked ★, greyed out) in the least number of actions.

Set a negative "reward" for each transition (e.g. r = -1).

SLIDE 21

A simple MDP: Grid World

Random Policy vs. Optimal Policy

[Figure: action arrows on the grid world under each policy, with ★ terminal states]

SLIDE 22

The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, …)?

SLIDE 23

The optimal policy π*

We want to find the optimal policy π* that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probabilities, …)? Maximize the expected sum of rewards! Formally:

π* = argmax_π E[ Σ_{t≥0} γ^t r_t | π ], with s_0 ~ p(s_0), a_t ~ π(·|s_t), s_{t+1} ~ p(·|s_t, a_t)

SLIDE 24

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

SLIDE 25

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

SLIDE 26

Definitions: Value function and Q-value function

Following a policy produces sample trajectories (or paths) s_0, a_0, r_0, s_1, a_1, r_1, …

How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:

V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]

How good is a state-action pair? The Q-value function (state-action value function) at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:

Q^π(s, a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

SLIDE 27

Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:

Q*(s, a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]

SLIDE 28

Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair. Q* satisfies the following Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ max_{a'} Q*(s', a').

SLIDE 29

Bellman equation

The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair, and it satisfies the Bellman equation

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ].

Intuition: if the optimal state-action values for the next time-step Q*(s', a') are known, then the optimal strategy is to take the action that maximizes the expected value of r + γ max_{a'} Q*(s', a'). The optimal policy π* corresponds to taking the best action in any state as specified by Q*:

π*(s) = argmax_a Q*(s, a)

SLIDE 30

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update,

Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ].

Q_i will converge to Q* as i -> infinity.

SLIDE 31

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity.

What's the problem with this?

SLIDE 32

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity.

What's the problem with this? Not scalable. We must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game state's pixels, it is computationally infeasible to compute this for the entire state space!

SLIDE 33

Solving for the optimal policy

Value iteration algorithm: use the Bellman equation as an iterative update, Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ]. Q_i will converge to Q* as i -> infinity.

What's the problem with this? Not scalable. We must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game state's pixels, it is computationally infeasible to compute this for the entire state space!

Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
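For small tabular problems the update can be run directly. A sketch of Q-value iteration on the toy two-state MDP used earlier (P, R, and the tolerance are illustrative assumptions):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # R[s, a]
gamma = 0.9

Q = np.zeros((2, 2))                      # Q_0
for i in range(1000):
    # Bellman backup: Q_{i+1}(s,a) = E_{s'}[ r + gamma * max_{a'} Q_i(s',a') ]
    Q_next = R + gamma * (P @ Q.max(axis=1))
    if np.abs(Q_next - Q).max() < 1e-8:   # Q_i -> Q* as i -> infinity
        break
    Q = Q_next
pi_star = Q.argmax(axis=1)                # greedy policy pi*(s) = argmax_a Q*(s,a)
print(Q, pi_star)
```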

SLIDE 34

Solving for the optimal policy: Q-learning

Q-learning: Use a function approximator to estimate the action-value function

SLIDE 35

Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function. If the function approximator is a deep neural network => deep Q-learning!

SLIDE 36

Solving for the optimal policy: Q-learning

Q-learning: use a function approximator to estimate the action-value function, Q(s, a; θ) ≈ Q*(s, a), where θ are the function parameters (weights).

If the function approximator is a deep neural network => deep Q-learning (DQN)!

SLIDE 37

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation:

Q*(s, a) = E_{s'}[ r + γ max_{a'} Q*(s', a') | s, a ]

SLIDE 38

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass. Loss function:

L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

SLIDE 39

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass. Loss function:

L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Backward pass. Gradient update (with respect to the Q-function parameters θ):

∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

SLIDE 40

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass. Loss function:

L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Backward pass. Gradient update (with respect to the Q-function parameters θ):

∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*).
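A sketch of one such forward/backward step with the simplest possible approximator, a linear Q(s, a; θ) = θ[a]·s standing in for the deep network; the transition data here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_actions, gamma, lr = 4, 2, 0.99, 1e-2
theta = rng.normal(size=(n_actions, d))   # Q(s, a; theta) = theta[a] . s
theta_old = theta.copy()                  # frozen parameters theta_{i-1}

def q(s, th):
    return th @ s                         # Q-values for all actions at state s

# One illustrative transition (s, a, r, s'); in practice it comes from the game.
s, a, r, s2 = rng.normal(size=d), 1, 0.5, rng.normal(size=d)

# Forward pass: target y = r + gamma * max_a' Q(s', a'; theta_{i-1})
y = r + gamma * q(s2, theta_old).max()
td_error = y - q(s, theta)[a]
loss = td_error ** 2                      # (y_i - Q(s, a; theta_i))^2

# Backward pass: the gradient of Q(s, a; theta) w.r.t. row a is s, so this
# step pushes Q(s, a; theta) toward the target value y.
theta[a] += lr * td_error * s
print(loss)
```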

SLIDE 41

Case Study: Playing Atari Games

Objective: Complete the game with the highest score
State: Raw pixel inputs of the game state
Action: Game controls, e.g. Left, Right, Up, Down
Reward: Score increase/decrease at each time step

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 42

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Current state s_t: 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping).

Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 43

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Input: state s_t, an 84x84x4 stack of the last 4 frames (after RGB -> grayscale conversion, downsampling, and cropping).

Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 44

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Current state s_t: 84x84x4 stack of the last 4 frames. Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values): familiar conv layers and FC layers.

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 45

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Current state s_t: 84x84x4 stack of the last 4 frames. Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values).

The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4).

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 46

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Current state s_t: 84x84x4 stack of the last 4 frames. Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values).

The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game.

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 47

Q-network Architecture

Q(s, a; θ): a neural network with weights θ.

Current state s_t: 84x84x4 stack of the last 4 frames. Layers: 16 8x8 conv, stride 4 -> 32 4x4 conv, stride 2 -> FC-256 -> FC-4 (Q-values).

The last FC layer has a 4-d output (if 4 actions), corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game. A single feedforward pass computes the Q-values for all actions from the current state => efficient!

[Mnih et al. NIPS Workshop 2013; Nature 2015]
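The slide's architecture, sketched in PyTorch. Layer sizes follow the slide; the code itself is a reconstruction for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Atari Q-network: 84x84x4 frame stack in, one Q-value per action out."""
    def __init__(self, n_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 conv, stride 4
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 conv, stride 2
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # FC-256
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # FC-4: Q(s_t, a_1..a_4)
        )

    def forward(self, s):
        return self.net(s)   # a single pass yields Q-values for all actions

q_net = QNetwork(n_actions=4)
s_t = torch.zeros(1, 4, 84, 84)   # stack of the last 4 preprocessed frames
print(q_net(s_t).shape)           # torch.Size([1, 4])
```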

SLIDE 48

Solving for the optimal policy: Q-learning

Remember: we want to find a Q-function that satisfies the Bellman equation.

Forward pass. Loss function:

L_i(θ_i) = E_{s,a}[ (y_i - Q(s, a; θ_i))² ], where y_i = E_{s'}[ r + γ max_{a'} Q(s', a'; θ_{i-1}) | s, a ]

Backward pass. Gradient update (with respect to the Q-function parameters θ):

∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (y_i - Q(s, a; θ_i)) ∇_{θ_i} Q(s, a; θ_i) ]

Iteratively try to make the Q-value close to the target value (y_i) it should have, if the Q-function corresponds to the optimal Q* (and optimal policy π*).

SLIDE 49

Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:

  • Samples are correlated => inefficient learning
  • Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 50

Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:

  • Samples are correlated => inefficient learning
  • Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:

  • Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
  • Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 51

Training the Q-network: Experience Replay

Learning from batches of consecutive samples is problematic:

  • Samples are correlated => inefficient learning
  • Current Q-network parameters determine the next training samples (e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops

Address these problems using experience replay:

  • Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
  • Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples

Each transition can also contribute to multiple weight updates => greater data efficiency.

[Mnih et al. NIPS Workshop 2013; Nature 2015]
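A minimal replay memory sketch. The fixed capacity and uniform sampling follow the bullets above; the class itself is an illustrative assumption:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores transitions (s_t, a_t, r_t, s_{t+1}) and samples random minibatches."""
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        # Random minibatches break the correlation between consecutive samples.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```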

SLIDE 52

Putting it together: Deep Q-Learning with Experience Replay

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 53

Putting it together: Deep Q-Learning with Experience Replay

Initialize replay memory, Q-network

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 54

Putting it together: Deep Q-Learning with Experience Replay

Play M episodes (full games)

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 55

Putting it together: Deep Q-Learning with Experience Replay

Initialize state (starting game screen pixels) at the beginning of each episode

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 56

Putting it together: Deep Q-Learning with Experience Replay

For each timestep t of the game

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 57

Putting it together: Deep Q-Learning with Experience Replay

With small probability, select a random action (explore); otherwise select the greedy action from the current policy.

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 58

Putting it together: Deep Q-Learning with Experience Replay

Take the action a_t, and observe the reward r_t and next state s_{t+1}

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 59

Putting it together: Deep Q-Learning with Experience Replay

Store transition in replay memory

[Mnih et al. NIPS Workshop 2013; Nature 2015]

SLIDE 60

Putting it together: Deep Q-Learning with Experience Replay

Experience Replay: Sample a random minibatch of transitions from replay memory and perform a gradient descent step

[Mnih et al. NIPS Workshop 2013; Nature 2015]
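Putting slides 52-60 together as one sketch. Everything passed in (the gym-style env, q_net, the ReplayMemory above, optimizer, and the td_loss helper computing the squared TD error of slide 48) is an assumed interface for illustration, not code from the lecture:

```python
import random
import numpy as np

def dqn_train(env, q_net, replay, optimizer, td_loss, M=100, T=1000,
              epsilon=0.1, gamma=0.99, batch_size=32):
    """Deep Q-learning with experience replay: an illustrative sketch.

    Assumes env.reset() -> s, env.step(a) -> (s', r, done), env.sample_action(),
    q_net(s) -> Q-values for all actions, optimizer.step(loss), and
    td_loss(q_net, batch, gamma) -> squared TD error on a minibatch.
    """
    for episode in range(M):                    # play M episodes (full games)
        s = env.reset()                         # initialize state (game screen pixels)
        for t in range(T):                      # for each timestep t of the game
            if random.random() < epsilon:       # with small probability, explore
                a = env.sample_action()
            else:                               # otherwise act greedily w.r.t. Q
                a = int(np.argmax(q_net(s)))
            s_next, r, done = env.step(a)       # take a_t, observe r_t and s_{t+1}
            replay.push(s, a, r, s_next, done)  # store the transition in memory
            s = s_next
            if len(replay) >= batch_size:
                batch = replay.sample(batch_size)            # random minibatch,
                optimizer.step(td_loss(q_net, batch, gamma)) # not consecutive samples
            if done:
                break
```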

SLIDE 61

SLIDE 62

Policy Gradients

What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair

SLIDE 63

Policy Gradients

What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn exact value of every (state, action) pair But the policy can be much simpler: just close your hand Can we learn a policy directly, e.g. finding the best policy from a collection of policies?

SLIDE 64

Formally, let's define a class of parametrized policies Π = {π_θ, θ ∈ R^m}. For each policy, define its value:

J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]

Policy Gradients

SLIDE 65

Formally, let's define a class of parametrized policies Π = {π_θ, θ ∈ R^m}, each with value J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ]. We want to find the optimal policy θ* = argmax_θ J(θ). How can we do this?

Policy Gradients

SLIDE 66

Formally, let's define a class of parametrized policies Π = {π_θ, θ ∈ R^m}, each with value J(θ) = E[ Σ_{t≥0} γ^t r_t | π_θ ].

Policy Gradients

We want to find the optimal policy θ* = argmax_θ J(θ). How can we do this?

Gradient ascent on policy parameters!

SLIDE 67

REINFORCE algorithm

Mathematically, we can write J(θ) = E_{τ~p(τ;θ)}[ r(τ) ], where r(τ) is the reward of a trajectory τ = (s_0, a_0, r_0, s_1, …).

SLIDE 68

REINFORCE algorithm

Expected reward: J(θ) = E_{τ~p(τ;θ)}[ r(τ) ] = ∫ r(τ) p(τ; θ) dτ

SLIDE 69

REINFORCE algorithm

Now let’s differentiate this: Expected reward:

SLIDE 70

REINFORCE algorithm

Expected reward: J(θ) = ∫ r(τ) p(τ; θ) dτ. Differentiating gives ∇_θ J(θ) = ∫ r(τ) ∇_θ p(τ; θ) dτ.

Intractable! The gradient of an expectation is problematic when p depends on θ.

SLIDE 71

REINFORCE algorithm

Intractable! The gradient of an expectation is problematic when p depends on θ. However, we can use a nice trick:

∇_θ p(τ; θ) = p(τ; θ) (∇_θ p(τ; θ) / p(τ; θ)) = p(τ; θ) ∇_θ log p(τ; θ)

SLIDE 72

REINFORCE algorithm

If we inject this back into the integral:

∇_θ J(θ) = ∫ (r(τ) ∇_θ log p(τ; θ)) p(τ; θ) dτ = E_{τ~p(τ;θ)}[ r(τ) ∇_θ log p(τ; θ) ]

The gradient of the expectation was intractable when p depends on θ, but this expectation can be estimated with Monte Carlo sampling.

SLIDE 73

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have:

p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t)

SLIDE 74

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have p(τ; θ) = Π_{t≥0} p(s_{t+1} | s_t, a_t) π_θ(a_t | s_t). Thus:

log p(τ; θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]

SLIDE 75

REINFORCE algorithm

Can we compute those quantities without knowing the transition probabilities? We have log p(τ; θ) = Σ_{t≥0} [ log p(s_{t+1} | s_t, a_t) + log π_θ(a_t | s_t) ]. And when differentiating:

∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t)

Doesn't depend on the transition probabilities!

SLIDE 76

REINFORCE algorithm

We have ∇_θ log p(τ; θ) = Σ_{t≥0} ∇_θ log π_θ(a_t | s_t), which doesn't depend on the transition probabilities! Therefore, when sampling a trajectory τ, we can estimate ∇_θ J(θ) with

∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)
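A sketch of this Monte Carlo estimator for a linear softmax policy over two actions. Only the gradient formula comes from the slides; the policy form, the toy trajectory, and the step size are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Softmax policy pi_theta(a|s) over 2 actions with logits theta @ s, 4-d states.
rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 4))

# One sampled trajectory tau (made-up stand-in for real env interaction).
states = [rng.normal(size=4) for _ in range(5)]
actions = [int(rng.integers(2)) for _ in range(5)]
r_tau = 3.0                                 # total reward r(tau) of the trajectory

# grad_theta J(theta) ~= sum_t r(tau) * grad_theta log pi_theta(a_t | s_t)
grad = np.zeros_like(theta)
for s, a in zip(states, actions):
    p = softmax(theta @ s)
    for k in range(2):
        # grad of log softmax: d log pi(a|s) / d theta[k] = (1[k==a] - p[k]) * s
        grad[k] += r_tau * ((1.0 if k == a else 0.0) - p[k]) * s

theta += 1e-2 * grad                        # gradient ascent step on J(theta)
```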

SLIDE 77

Intuition

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation:

  • If r(τ) is high, push up the probabilities of the actions seen
  • If r(τ) is low, push down the probabilities of the actions seen
SLIDE 78

Intuition

Gradient estimator: ∇_θ J(θ) ≈ Σ_{t≥0} r(τ) ∇_θ log π_θ(a_t | s_t)

Interpretation:

  • If r(τ) is high, push up the probabilities of the actions seen
  • If r(τ) is low, push down the probabilities of the actions seen

It might seem simplistic to say that if a trajectory is good then all of its actions were good. But in expectation, it averages out!
SLIDE 79

Actor-Critic Algorithm

Problem: we don’t know Q and V. Can we learn them? Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).

  • The actor decides which action to take, and the critic tells the actor

how good its action was and how it should adjust

  • Also alleviates the task of the critic as it only has to learn the values
  • f (state, action) pairs generated by the policy
  • Can also incorporate Q-learning tricks e.g. experience replay
  • 1. Actor看到游戏目前的state,做出一个action。
  • 2. Critic根据state和action两者,对actor刚才的表现打一个分数。
  • 3. Actor依据critic(评委)的打分,调整自己的策略(actor神经网络参数),

争取下次做得更好。

  • 4. Critic根据系统给出的reward(相当于ground truth)来调整自己的打分策略

(critic神经网络参数)
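A sketch of one actor-critic update following the four numbered steps, with a linear softmax actor and a linear critic; all data and functional forms here are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_actions = 4, 2
actor = rng.normal(size=(n_actions, d))    # policy pi(a|s) = softmax(actor @ s)
critic = rng.normal(size=(n_actions, d))   # Q_w(s, a) = critic[a] . s
gamma, lr = 0.99, 1e-2

s = rng.normal(size=d)
p = softmax(actor @ s)
a = rng.choice(n_actions, p=p)             # 1. actor sees the state, takes an action
r, s2 = 1.0, rng.normal(size=d)            # made-up environment reward / next state

score = critic[a] @ s                      # 2. critic scores the (state, action) pair
# 3. actor adjusts its policy in the direction the critic's score suggests
grad_log = np.outer(-p, s)                 # grad of log softmax, rows != a
grad_log[a] += s                           # row a gets (1 - p[a]) * s
actor += lr * score * grad_log
# 4. critic adjusts its scoring using the reward ("ground truth"): TD update
td_target = r + gamma * max(critic @ s2)
critic[a] += lr * (td_target - score) * s
```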

SLIDE 80

REINFORCE in action: Recurrent Attention Model (RAM)

Objective: Image classification. Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class.

  • Inspiration from human perception and eye movements
  • Saves computational resources => scalability
  • Able to ignore clutter / irrelevant parts of the image

State: Glimpses seen so far
Action: (x, y) coordinates (the center of the glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

[Mnih et al. 2014]

SLIDE 81

REINFORCE in action: Recurrent Attention Model (RAM)

Objective: Image classification. Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class.

  • Inspiration from human perception and eye movements
  • Saves computational resources => scalability
  • Able to ignore clutter / irrelevant parts of the image

State: Glimpses seen so far
Action: (x, y) coordinates (the center of the glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise

Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.

[Mnih et al. 2014]

SLIDE 82

REINFORCE in action: Recurrent Attention Model (RAM)

[Diagram: an RNN takes the input image and emits the first glimpse location (x1, y1)]

[Mnih et al. 2014]

SLIDE 83

REINFORCE in action: Recurrent Attention Model (RAM)

[Diagram: the RNN emits glimpse locations (x1, y1), then (x2, y2)]

[Mnih et al. 2014]

SLIDE 84

REINFORCE in action: Recurrent Attention Model (RAM)

[Diagram: glimpse locations (x1, y1) through (x3, y3)]

[Mnih et al. 2014]

SLIDE 85

REINFORCE in action: Recurrent Attention Model (RAM)

[Diagram: glimpse locations (x1, y1) through (x4, y4)]

[Mnih et al. 2014]

SLIDE 86

[Diagram: after glimpses (x1, y1) through (x5, y5), a softmax over the final RNN state classifies the input image as y=2]

REINFORCE in action: Recurrent Attention Model (RAM)

[Mnih et al. 2014]

SLIDE 87

REINFORCE in action: Recurrent Attention Model (RAM)

[Mnih et al. 2014]

Has also been used in many other tasks including fine-grained image recognition, image captioning, and visual question-answering!

SLIDE 88

More policy gradients: AlphaGo

Overview:

  • Featurize the board (stone color, move legality, bias, …)
  • Initialize the policy network with supervised training from professional Go games, then continue training using policy gradient (play against itself from random previous iterations, +1 / -1 reward for winning / losing)
  • Also learn a value network (critic)
  • Finally, combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search

How to beat the Go world champion:

  • Mix of supervised learning and reinforcement learning
  • Mix of old methods (Monte Carlo Tree Search) and recent ones (deep RL)

[Silver et al., Nature 2016]

SLIDE 89

Summary

  • Policy gradients: very general, but suffers from high variance, so it requires many samples. Challenge: sample efficiency
  • Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration

  • Guarantees:
  • Policy Gradients: converges to a local optimum of J(θ), which is often good enough!
  • Q-learning: zero guarantees, since you are approximating the Bellman equation with a complicated function approximator

SLIDE 90

OUTLINE

  • Intro on Reinforcement Learning
  • Learning with Reward

– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm

  • Learning without Reward

– Inverse Reinforcement Learning

  • AlphaGo
SLIDE 91

SLIDE 92

Motivation

[Diagram: a dynamics model P_sa (a probability distribution over next states given the current state and action) and a reward function R (describing the desirability of being in a state) feed into reinforcement learning / optimal control, which outputs a controller/policy π (prescribing the action to take in each state)]

Key challenges:
  • Providing a formal specification of the control task
  • Building a good dynamics model
  • Finding closed-loop controllers

SLIDE 93

  • Inverse Reinforcement Learning algorithms
– Leverage expert demonstrations to learn to perform a desired task
  • Formal guarantees
– Running time
– Sample complexity
– Performance of the resulting controller
  • Enabled us to solve highly challenging, previously unsolved, real-world control problems in
– Quadruped locomotion
– Autonomous helicopter flight

SLIDE 94

Example task: driving

SLIDE 95

Problem setup

  • Input:
– Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
– No reward function
– Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, … (= a trace of the teacher's policy π*)
  • Desired output:
– Policy π̃ which (ideally) has performance guarantees, i.e., E[ Σ_t γ^t R*(s_t) | π̃ ] ≥ E[ Σ_t γ^t R*(s_t) | π* ] - ε
– Note: R* is unknown.

SLIDE 96

Prior work: behavioral cloning

  • Formulate it as a standard machine learning problem
– Fix a policy class (e.g., support vector machine, neural network, decision tree, deep belief net, …)
– Estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), …
– E.g., Pomerleau, 1989; Sammut et al., 1992; Kuniyoshi et al., 1994; Demiris & Hayes, 1994; Amit & Mataric, 2002
  • Limitations:
– Fails to provide strong performance guarantees
– Underlying assumption: policy simplicity

SLIDE 97

Main Idea

[Diagram: as before, the dynamics model P_sa and reward function R feed into reinforcement learning / optimal control, which outputs a controller/policy π]

The controller/policy, which prescribes the action to take in each state, is typically very complex; the reward function is often fairly succinct.

SLIDE 98

Method

  • Assume the reward is linear in known features: R_w(s) = wᵀφ(s)
  • Initialize: pick some controller π_0
  • Iterate for i = 1, 2, … :
– "Guess" the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers
– Find the optimal control policy π_i for the current guess of the reward function R_w
– If the teacher's margin over the controllers found so far falls below a threshold, exit the algorithm

Learning through reward functions rather than directly learning policies. (A skeleton of this loop follows below.)
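A skeleton of this iteration in the spirit of Abbeel & Ng (2004). The helpers solve_mdp and mu are assumed, and the simple "closest feature expectations" reward guess is a stand-in for the max-margin step, so treat this as a sketch rather than the algorithm from the paper:

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, mu, eps=1e-3, iters=50):
    """Sketch of apprenticeship learning via feature expectation matching.

    Assumed helpers (not given in the slides):
      mu_expert      feature expectations of the teacher's demonstrations
      solve_mdp(w)   returns an optimal policy for the reward R_w(s) = w . phi(s)
      mu(pi)         estimates a policy's feature expectations
    """
    pi = solve_mdp(np.zeros_like(mu_expert))   # pick some initial controller pi_0
    mus = [mu(pi)]
    w = mu_expert - mus[0]
    for i in range(iters):
        # "Guess" the reward: the direction in which the teacher most outperforms
        # the controllers found so far (simplified stand-in for max-margin)
        closest = min(mus, key=lambda m: np.linalg.norm(mu_expert - m))
        w = mu_expert - closest
        if np.linalg.norm(w) <= eps:           # teacher no longer outperforms: exit
            break
        pi = solve_mdp(w)                      # optimal policy for current guess R_w
        mus.append(mu(pi))
    return pi, w
```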

SLIDE 99

Experiment: highway driving

Teacher in the training world; learned policy in the testing world.

  • Input:
– Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
– Teacher's demonstration: 1 minute in the "training world"
– Note: R* is unknown
– Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances

SLIDE 100

More driving examples

In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration.

[Videos: driving demonstrations and the corresponding learned behaviors]

SLIDE 101

[Diagram: the teacher's play (s_0, a_0, s_1, a_1, …) is used to learn R and P_sa; autonomous play (s_0, a_0, s_1, a_1, …) is used to learn P_sa; the dynamics model P_sa and reward function R feed reinforcement learning / optimal control, which outputs the controller π]

Apprenticeship learning summary

Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.

SLIDE 102

OUTLINE

  • Intro on Reinforcement Learning
  • Learning with Reward

– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm

  • Learning without Reward

– Inverse Reinforcement Learning

  • AlphaGo
SLIDE 103

Understanding AlphaGo

Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354.

SLIDE 104

Go Overview

  • Originated in ancient China, 2,500 years ago
  • Two-player game
  • Goal: surround more territory than the opponent
  • 19x19 grid board
  • Playing pieces: "stones"
  • Turn = place a stone or pass
  • The game ends when both players pass

SLIDE 105

Go Overview

Only two basic rules:
1. Capture rule: stones that have no liberties are captured and removed from the board
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position

SLIDE 106

Go Overview

Final position. Who won? White score: 12, Black score: 13.

SLIDE 107

Go in a Reinforcement Learning Setup

  • Environment states: S = board configurations
  • Actions: A = legal moves
  • Transitions between states
  • Reinforcement function: r(s) = 0 if s is not a terminal state, 1 otherwise

Goal: find the policy that maximizes the expected total payoff

SLIDE 108

Why is it hard for computers to play Go?

  • The number of possible board configurations is extremely high, ~10^700
  • Impossible to use brute-force exhaustive search
  • Main challenges:
– Branching factor
– Value function

SLIDE 109

Training the Deep Neural Networks

Human expert data (state, action) is used to train the SL policy network p_σ and the rollout policy network p_π; self-play data (state, win/loss) is used to train the RL policy network p_ρ and the value network v_θ. All of them feed into Monte Carlo Tree Search.

SLIDE 110

Training the Deep Neural Networks

[Figure: architectures of the policy networks and the value network]

SLIDE 111

SL Policy Network p_σ

  • ~30 million (state, action) pairs
  • Goal: maximize the log likelihood of the expert action
  • Input: 48 feature planes (19x19x48)
  • Output: action probability map

Architecture: 12 convolutional + rectifier layers, then a softmax producing the probability map.

SLIDE 112

SL Policy Network p_σ

Bigger -> better, but slower.

Accuracy:
  AlphaGo (all input features): 57.0%
  AlphaGo (only raw board position): 55.7%
  Previous state of the art: 44.4%

SLIDE 113

Training the Rollout Policy Network p_π

  • Similar to the SL policy p_σ
  • Output: probability map over actions
  • Goal: maximize the log likelihood
  • Input: not the full grid (9x9x48), handcrafted local features

Forwarding time and accuracy:
  SL policy net p_σ: 3 milliseconds, 55.4%
  Rollout policy network p_π: 2 microseconds, 24.2%

SLIDE 114

Training the RL Policy Network p_ρ

  • A refined version of the SL policy p_σ (same 19x19x48 input, 12 convolutional + rectifier layers, softmax probability map)
  • Initialize the weights to ρ = σ
  • Keep an opponent pool {ρ⁻ | ρ⁻ is an old version of ρ}
  • Play p_ρ vs. p_ρ⁻; sampling old opponents prevents overfitting
  • The RL policy won more than 80% of the games against the SL policy
SLIDE 115

Training the Deep Neural Networks

Human expert data (state, action) trains the SL policy network p_σ and the rollout policy network p_π; self-play data (state, win/loss) trains the RL policy network p_ρ and the value network v_θ, all feeding into Monte Carlo Tree Search.

SLIDE 116

Training the Value Network v_θ

  • Position evaluation: approximating the optimal value function
  • Input: state; output: probability of winning
  • Goal: minimize the MSE
  • Overfitting risk: positions within a game are strongly correlated

Architecture: 19x19x48 input, convolutional + rectifier layers, a fully connected layer, scalar output.

SLIDE 117

Training the Deep Neural Networks

Human expert data (state, action) trains the SL policy network p_σ and the rollout policy network p_π; self-play data (state, win/loss) trains the RL policy network p_ρ and the value network v_θ, all feeding into Monte Carlo Tree Search.

SLIDE 118

Monte Carlo Tree Search

  • Monte Carlo experiments: repeated random sampling to obtain numerical results
  • A search method
  • A method for making optimal decisions in artificial intelligence (AI) problems
  • The strongest Go AIs (Fuego, Pachi, Zen, and Crazy Stone) all rely on MCTS

SLIDE 119

Monte Carlo Tree Search

Each round of Monte Carlo tree search consists of four steps (sketched in code after this list):

  • 1. Selection
  • 2. Expansion
  • 3. Simulation
  • 4. Backpropagation
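A compact UCT-style sketch of the four steps for a generic two-player game. The game interface (legal_moves, next_state, winner) and the exploration constant are assumptions, and a full implementation would also alternate the value's sign between the two players:

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}            # move -> Node
        self.N, self.W = 0, 0.0       # visit count, total value

    def ucb(self, c=1.4):
        return self.W / self.N + c * math.sqrt(math.log(self.parent.N) / self.N)

def mcts(root_state, game, n_sims=1000):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: descend by UCB until a node with untried moves
        while node.children and not [m for m in game.legal_moves(node.state)
                                     if m not in node.children]:
            node = max(node.children.values(), key=Node.ucb)
        # 2. Expansion: add one untried child
        untried = [m for m in game.legal_moves(node.state) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(game.next_state(node.state, m), parent=node)
            node = node.children[m]
        # 3. Simulation: random rollout until a terminal state
        s = node.state
        while game.legal_moves(s):
            s = game.next_state(s, random.choice(game.legal_moves(s)))
        v = game.winner(s)            # e.g. +1 / -1 / 0
        # 4. Backpropagation: update counts and values up to the root
        while node:
            node.N += 1
            node.W += v
            node = node.parent
    # choose the move with the maximum visit count
    return max(root.children.items(), key=lambda kv: kv[1].N)[0]
```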
SLIDE 120

AlphaGo MCTS

Selection, Expansion, Evaluation, Backpropagation

  • Each edge (s, a) stores:
  • Q(s, a): action value (the average value of the subtree)
  • N(s, a): visit count
  • P(s, a): prior probability

(The selection rule combining these is sketched below.)
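In AlphaGo, selection descends the tree by maximizing Q(s, a) plus an exploration bonus u(s, a) that is proportional to the prior P(s, a) and decays with the visit count. A sketch, where the constant c_puct and the edge representation are illustrative assumptions:

```python
import math

def select_action(edges, c_puct=5.0):
    """edges: dict mapping action a -> (Q, N, P) stored on edge (s, a)."""
    total_n = sum(n for _, n, _ in edges.values())
    def score(a):
        q, n, p = edges[a]
        u = c_puct * p * math.sqrt(total_n) / (1 + n)   # exploration bonus u(s, a)
        return q + u                                    # select argmax_a Q(s,a)+u(s,a)
    return max(edges, key=score)
```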
SLIDE 121

AlphaGo MCTS

Selection, Expansion, Evaluation, Backpropagation

SLIDE 122

AlphaGo MCTS

Selection, Expansion, Evaluation, Backpropagation

Leaf evaluation:

  • 1. Value network
  • 2. Random rollout played until terminal

SLIDE 123

AlphaGo MCTS

Selection, Expansion, Evaluation, Backpropagation

How to choose the next move?

  • Maximum visit count
  • Less sensitive to outliers than the maximum action value

SLIDE 124

AlphaGo

SLIDE 125

AlphaGo vs. Experts

AlphaGo beat Fan Hui 5:0.

SLIDE 126

AlphaGo vs. Experts

AlphaGo beat Lee Sedol 4:1 (game 4 featured Lee Sedol's famous "divine move", 神之一手).

SLIDE 127

AlphaGo vs. Experts

AlphaGo Master beat top professionals 60:0, including Ke Jie (9 dan), Chen Yaoye (9 dan), Park Junghwan (9 dan), Mi Yuting (9 dan), Tang Weixing (9 dan), …

SLIDE 128

AlphaGo vs. Experts

AlphaGo Master beat Ke Jie (9 dan) 3:0.

SLIDE 129

AlphaGo Zero

  • Learning without human knowledge
  • A combined network (p, v) = f_θ(s)
– p: the probability of selecting each action
– v: an evaluation score for the position
– s: the position (state)

MCTS self-play produces training examples { (s, π, z) }.

SLIDE 130

Training

Self-play with MCTS generates examples { (s, π, z) }; the network f_θ is trained so that p matches the search probabilities π and v predicts the game outcome z.

SLIDE 131

AlphaGo Zero


SLIDE 132

so far so good…

  • Intro on AI (week 1)
  • Statistical Learning (weeks 2~3)
– Preliminaries on Bayesian reasoning (Bayes' rule)
– Generative/Discriminative Models (LDA)
– Strategies (Log-likelihood, MAP, MLE)
– Algorithms (GD, EM, Sampling)
– Applications (Markov models, GMM)
  • Deep Learning (weeks 4~5)
– Biological Motivation and Connections (Neural Cell, MP-neuron)
– Neural Networks and Back Propagation (Perceptron, MLP, BP)
– Convolutional Neural Networks (Conv, Pooling)
– Recurrent Neural Networks (LSTM)
– Stochastic Models in Neural Networks (Hopfield Nets, RBM, Wake/Sleep Model)
– Hybrid Models (DBN, AutoEncoder, GAN)
  • Reinforcement Learning (week 6)
– Learning with Reward (MDP, Q-Learning, Policy Gradient, Actor-Critic Algorithm)
– Learning without Reward (IRL)
– AlphaGo (Policy net, Value net, MCTS)

SLIDE 133

At Last …

  • About this course:
– This is only one corner of the world of artificial intelligence; please keep exploring
  • Miso:
– A course will be offered in the fall semester of 2019 (Intro to AI? Statistical Learning?)
– Thanks to every student for getting up early to attend the lectures

The most incomprehensible thing about the world is that it is comprehensible.

SLIDE 134

罗智凌 (Zhiling Luo)

luozhiling@zju.edu.cn http://www.bruceluo.net