Introduction to Artificial Intelligence (VI)
Intro. On Artificial Intelligence from the perspective of probability theory
luozhiling@zju.edu.cn, College of Computer Science, Zhejiang University
http://www.bruceluo.net
Outline:
– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm
– Inverse Reinforcement Learning
Supervised learning recap: cat classification.
Unsupervised learning recap: density estimation (1-d and 2-d).
Reinforcement learning: problems involving an agent interacting with an environment, which provides numeric reward signals.
Goal: learn how to take actions in order to maximize reward.
Generalization: learn how to take actions in order to reach a goal.
Outline (next: Markov Decision Process):
– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm
– Inverse Reinforcement Learning
The agent–environment loop: the environment gives the agent a state s_t; the agent chooses an action a_t; the environment returns a reward r_t and the next state s_{t+1}; repeat.
Example: Cart-Pole
Objective: balance a pole on top of a movable cart
State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright

Example: Robot Locomotion
Objective: make the robot move forward
State: angle and position of the joints
Action: torques applied on joints
Reward: 1 at each time step upright + forward movement

Example: Atari Games
Objective: complete the game with the highest score
State: raw pixel inputs of the game state
Action: game controls, e.g. Left, Right, Up, Down
Reward: score increase/decrease at each time step

Example: Go
Objective: win the game!
State: position of all pieces
Action: where to put the next piece down
Reward: 1 if win at the end of the game, 0 otherwise
Cart-Pole, Robot Locomotion, Atari Games, Go: all of these are sequential decision making toward a specified goal, with an evaluation of how good each state/action is (the reward). Recall the agent–environment loop: state s_t → action a_t → reward r_t and next state s_{t+1}.
Markov Decision Process (MDP): the mathematical formulation of the RL problem. Defined by $(S, A, R, P, \gamma)$:
– $S$: set of possible states
– $A$: set of possible actions
– $R$: distribution of reward given (state, action) pair
– $P$: transition probability, i.e., distribution over the next state given (state, action) pair
– $\gamma$: discount factor
A policy $\pi$ specifies what action to take in each state. A (stochastic) policy $\pi$ denotes $\Pr(a \mid s)$; actions are sampled as $a \sim \Pr(a \mid s)$.
Example: Grid World. [Figure: a small grid of states; greyed-out cells are terminal states.]
actions = { 1. right, 2. left, 3. up, 4. down }
Objective: reach one of the terminal states (greyed out) in the least number of actions.
Set a negative "reward" for each transition (e.g. r = -1).
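As a concrete illustration, here is a minimal Python sketch of this grid world as an MDP; the 4x4 size, the two terminal corner cells, and the coordinates are assumptions, since the slide's exact layout is not recoverable.

```python
# A minimal sketch of the grid-world MDP above (layout is assumed).
import random

ROWS, COLS = 4, 4
TERMINALS = {(0, 0), (3, 3)}        # assumed greyed-out terminal cells
ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def step(state, action):
    """Deterministic transition: move one cell, staying inside the grid."""
    dr, dc = ACTIONS[action]
    nxt = (min(max(state[0] + dr, 0), ROWS - 1),
           min(max(state[1] + dc, 0), COLS - 1))
    return nxt, -1.0                # r = -1 for each transition, as above

# One random-policy step from the middle of the grid
s = (2, 1)
a = random.choice(list(ACTIONS))
print(s, a, "->", step(s, a))
```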
[Figure: arrows over the grid comparing a random policy with the optimal policy.]
We want to find the optimal policy $\pi^*$ that maximizes the sum of rewards. How do we handle the randomness (initial state, transition probability, ...)? Maximize the expected sum of rewards! Formally:
$\pi^* = \arg\max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$, with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.
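To make the objective concrete, a small sketch of estimating the expected discounted return by averaging over sampled episodes; the reward sequences below are invented purely for illustration.

```python
# Monte Carlo estimate of E[sum_t gamma^t r_t] from sampled episodes.
GAMMA = 0.9

def discounted_return(rewards, gamma=GAMMA):
    return sum(gamma**t * r for t, r in enumerate(rewards))

episodes = [[-1, -1, -1, 0], [-1, -1, 0], [-1, -1, -1, -1, 0]]
print(sum(discounted_return(ep) for ep in episodes) / len(episodes))
```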
Following a policy produces sample trajectories (or paths) s0, a0, r0, s1, a1, r1, ...
How good is a state? The value function at state s is the expected cumulative reward from following the policy from state s:
$V^\pi(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
How good is a state-action pair? The Q-value function (state-action value function) at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
The optimal Q-value function Q* is the maximum expected cumulative reward achievable from a given (state, action) pair:
$Q^*(s, a) = \max_\pi \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
Q* satisfies the following Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
Intuition: if the optimal state-action values for the next time-step, Q*(s', a'), are known, then the optimal strategy is to take the action that maximizes the expected value of $r + \gamma Q^*(s', a')$.
The optimal policy $\pi^*$ corresponds to taking the best action in any state as specified by Q*:
$\pi^*(s) = \arg\max_a Q^*(s, a)$
Value iteration algorithm: use the Bellman equation as an iterative update:
$Q_{i+1}(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$
$Q_i$ will converge to $Q^*$ as $i \to \infty$.
What's the problem with this? Not scalable: we must compute Q(s, a) for every state-action pair. If the state is, e.g., the current game-state pixels, it is computationally infeasible to compute this for the entire state space!
Solution: use a function approximator to estimate Q(s, a), e.g. a neural network!
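Here is a tabular sketch of this value-iteration update on the grid world from earlier (same layout assumptions); each sweep applies the Bellman update to every state-action pair.

```python
# Tabular Q-value iteration on the assumed 4x4 grid world.
ROWS, COLS, GAMMA = 4, 4, 0.9
TERMINALS = {(0, 0), (3, 3)}
ACTIONS = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def step(s, a):
    dr, dc = ACTIONS[a]
    nxt = (min(max(s[0] + dr, 0), ROWS - 1), min(max(s[1] + dc, 0), COLS - 1))
    return nxt, -1.0                         # r = -1 per transition

STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(100):                         # Q_i -> Q* as i grows
    Q = {(s, a): 0.0 if s in TERMINALS       # terminal states have value 0
         else step(s, a)[1] + GAMMA * max(Q[(step(s, a)[0], a2)]
                                          for a2 in ACTIONS)
         for s in STATES for a in ACTIONS}

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
print(greedy[(2, 1)])                        # greedy action from state (2, 1)
```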
Q-learning: use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$, where $\theta$ are the function parameters (weights). If the function approximator is a deep neural network => deep Q-learning (DQN)!
Remember: we want to find a Q-function that satisfies the Bellman equation.
Forward pass — loss function:
$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[(y_i - Q(s, a; \theta_i))^2\right]$, where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$
Backward pass — gradient update (with respect to Q-function parameters $\theta$):
$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot);\, s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$
Iteratively try to make the Q-value close to the target value ($y_i$) it should have, if the Q-function corresponds to the optimal Q* (and satisfies the Bellman equation).
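A minimal PyTorch sketch of one such forward/backward step follows; the tiny fully connected Q-network and the random transition batch are placeholders, not the slide's architecture, and `no_grad` stands in for holding the target parameters $\theta_{i-1}$ fixed.

```python
# One Q-learning forward/backward step in PyTorch (placeholder network).
import torch
import torch.nn as nn

GAMMA = 0.99
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

# One batch of (s, a, r, s') transitions (random placeholders)
s = torch.randn(32, 4)
a = torch.randint(0, 2, (32,))
r = torch.randn(32)
s_next = torch.randn(32, 4)

with torch.no_grad():               # target y = r + gamma * max_a' Q(s', a')
    y = r + GAMMA * q_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
loss = nn.functional.mse_loss(q_sa, y)                # (y - Q(s, a; theta))^2
optimizer.zero_grad()
loss.backward()                     # gradient w.r.t. theta
optimizer.step()
```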
Case study: playing Atari games [Mnih et al. NIPS Workshop 2013; Nature 2015]. Objective: complete the game with the highest score. State: raw pixel inputs of the game state. Action: game controls, e.g. Left, Right, Up, Down. Reward: score increase/decrease at each time step.
Q-network architecture: $Q(s, a; \theta)$, a neural network with weights $\theta$.
Input: current state s_t — an 84x84x4 stack of the last 4 frames (after RGB->grayscale conversion, downsampling, and cropping).
Layers: 16 8x8 conv, stride 4 → 32 4x4 conv, stride 2 → FC-256 → FC-4 (Q-values). Familiar conv layers and an FC layer.
The last FC layer has a 4-d output corresponding to Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4). The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
[Mnih et al. NIPS Workshop 2013; Nature 2015]
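A sketch of this network in PyTorch, following the shapes stated on the slide (84x84x4 input; 16 8x8 conv stride 4; 32 4x4 conv stride 2; FC-256; FC-4):

```python
# DQN-style Q-network matching the stated layer shapes.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),  # -> 16x20x20
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(), # -> 32x9x9
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256), nn.ReLU(),                 # FC-256
            nn.Linear(256, n_actions),                             # FC-4: Q-values
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.net(x)           # one pass -> Q for all actions

print(DQN()(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```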
Training the Q-network: learning from batches of consecutive samples is problematic:
– Samples are correlated => inefficient learning
– The current Q-network parameters determine the next training samples (e.g., if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side) => can lead to bad feedback loops
Address these problems using experience replay:
– Continually update a replay memory of transitions (s_t, a_t, r_t, s_{t+1}) as game (experience) episodes are played
– Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
– Each transition can also contribute to multiple weight updates => greater data efficiency
[Mnih et al. NIPS Workshop 2013; Nature 2015]
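A minimal sketch of such a replay memory, assuming a fixed capacity and uniform random minibatch sampling:

```python
# Experience-replay buffer: fixed capacity, uniform random sampling.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest transitions drop off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random transitions break the correlation of consecutive samples
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```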
Putting it together: deep Q-learning with experience replay [Mnih et al. NIPS Workshop 2013; Nature 2015].
– Initialize replay memory and Q-network.
– Play M episodes (full games).
– Initialize the state (starting game screen pixels) at the beginning of each episode.
– For each timestep t:
  – With small probability, select a random action (explore); otherwise select the greedy action from the current policy.
  – Take the action a_t, and observe the reward r_t and next state s_{t+1}.
  – Store the transition (s_t, a_t, r_t, s_{t+1}) in replay memory.
  – Experience replay: sample a random minibatch of transitions from replay memory and perform a gradient descent step.
A skeleton of this loop is sketched below.
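The skeleton below assumes `env` follows a Gym-style reset()/step() interface returning (4, 84, 84) float tensors, and uses a hypothetical `gradient_step` helper standing in for the loss/gradient step sketched earlier.

```python
# Deep Q-learning training-loop skeleton (env and gradient_step are
# assumptions/hypothetical helpers, not from the slides).
import random
import torch

EPSILON, M, T = 0.1, 1000, 10_000
memory = ReplayMemory()            # replay buffer sketched above
dqn = DQN()                        # Q-network sketched above
opt = torch.optim.RMSprop(dqn.parameters(), lr=2.5e-4)

for episode in range(M):           # play M episodes (full games)
    s = env.reset()                # initial game screen pixels (assumed tensor)
    for t in range(T):
        if random.random() < EPSILON:          # explore with small prob.
            a = random.randrange(4)
        else:                                  # otherwise act greedily
            with torch.no_grad():
                a = dqn(s.unsqueeze(0)).argmax().item()
        s_next, r, done = env.step(a)          # observe r_t and s_{t+1}
        memory.push(s, a, r, s_next, done)     # store the transition
        if len(memory) >= 32:                  # experience replay:
            batch = memory.sample(32)          # random minibatch,
            gradient_step(dqn, opt, batch)     # one descent step (hypothetical)
        s = s_next
        if done:
            break
```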
Policy Gradients. What is a problem with Q-learning? The Q-function can be very complicated! Example: a robot grasping an object has a very high-dimensional state => hard to learn the exact value of every (state, action) pair. But the policy can be much simpler: just close your hand. Can we learn a policy directly, e.g., find the best policy from a collection of policies?
Formally, let's define a class of parametrized policies $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$. For each policy, define its value:
$J(\theta) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$
We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$. How can we do this? Gradient ascent on the policy parameters!
REINFORCE algorithm. Mathematically, we can write the expected reward as:
$J(\theta) = \mathbb{E}_{\tau \sim p(\tau; \theta)}[r(\tau)] = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$.
Now let's differentiate this:
$\nabla_\theta J(\theta) = \int_\tau r(\tau)\, \nabla_\theta p(\tau; \theta)\, d\tau$
Intractable! The gradient of an expectation is problematic when p depends on $\theta$. However, we can use a nice trick:
$\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \frac{\nabla_\theta p(\tau; \theta)}{p(\tau; \theta)} = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$
If we inject this back:
$\nabla_\theta J(\theta) = \int_\tau \left(r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right) p(\tau; \theta)\, d\tau = \mathbb{E}_{\tau \sim p(\tau; \theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$
This expectation can be estimated with Monte Carlo sampling.
Can we compute those quantities without knowing the transition probabilities? We have:
$p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
Thus:
$\log p(\tau; \theta) = \sum_{t \ge 0} \log p(s_{t+1} \mid s_t, a_t) + \sum_{t \ge 0} \log \pi_\theta(a_t \mid s_t)$
And when differentiating:
$\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Doesn't depend on the transition probabilities!
Therefore, when sampling a trajectory $\tau$, we can estimate $\nabla_\theta J(\theta)$ with
$\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Gradient estimator: $\nabla_\theta J(\theta) \approx \sum_{t \ge 0} r(\tau)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
Interpretation: if $r(\tau)$ is high, push up the probabilities of the actions seen; if $r(\tau)$ is low, push them down. It might seem simplistic to say that if a trajectory is good then all its actions were good, but in expectation it averages out!
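A small autograd sketch of this estimator for a tabular softmax policy; the state/action counts are placeholders.

```python
# REINFORCE update: ascend r(tau) * sum_t grad log pi(a_t|s_t) via autograd.
import torch

N_STATES, N_ACTIONS, GAMMA = 16, 4, 0.99
logits = torch.zeros(N_STATES, N_ACTIONS, requires_grad=True)  # policy params
opt = torch.optim.Adam([logits], lr=1e-2)

def reinforce_update(trajectory):
    """trajectory: list of (state_index, action_index, reward) tuples."""
    r_tau = sum(GAMMA**t * r for t, (_, _, r) in enumerate(trajectory))
    log_probs = torch.stack(
        [torch.log_softmax(logits[s], dim=0)[a] for s, a, _ in trajectory])
    loss = -r_tau * log_probs.sum()   # minimize negative of the objective
    opt.zero_grad()
    loss.backward()                   # yields r(tau) * grad log pi terms
    opt.step()

# Example: one update from a single sampled trajectory
reinforce_update([(0, 2, -1.0), (5, 1, -1.0), (6, 3, 0.0)])
```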
Actor-Critic algorithm. Problem: we don't know Q and V. Can we learn them? Yes, using Q-learning! We can combine Policy Gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function). The critic tells the actor how good its action was and how it should adjust, so that it does better next time. (The critic is a neural network with its own parameters.)
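A minimal sketch of one actor-critic update along these lines; the network sizes are placeholders, and the critic uses a one-step TD target here, which is a common choice rather than the slide's exact prescription.

```python
# One actor-critic update (placeholder networks; TD-target variant assumed).
import torch
import torch.nn as nn

actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_update(s, a, r, s_next, gamma=0.99):
    # Critic: move Q(s,a) toward the TD target r + gamma * max_a' Q(s',a')
    with torch.no_grad():
        target = r + gamma * critic(s_next).max()
    critic_loss = (target - critic(s)[a]) ** 2
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # Actor: push up log pi(a|s) in proportion to the critic's score
    score = critic(s).detach()[a]
    actor_loss = -score * torch.log_softmax(actor(s), dim=0)[a]
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()

# Example with a 4-d toy state
actor_critic_update(torch.randn(4), 1, -1.0, torch.randn(4))
```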
Case study: Recurrent Attention Model (RAM) [Mnih et al. 2014].
Objective: image classification. Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class.
State: glimpses seen so far
Action: (x, y) coordinates (center of glimpse) of where to look next in the image
Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise
Glimpsing is a non-differentiable operation => learn the policy for how to take glimpse actions using REINFORCE. Given the state of glimpses seen so far, use an RNN to model the state and output the next action.
[Figure: an RNN emits glimpse locations (x1, y1) → (x2, y2) → (x3, y3) → (x4, y4) → (x5, y5) over the input image; a final softmax predicts the class, e.g. y = 2.]
This approach has also been used in many other tasks, including fine-grained image recognition, image captioning, and visual question answering! [Mnih et al. 2014]
Case study: AlphaGo [Silver et al., Nature 2016].
How to beat the Go world champion: a mix of older ideas (supervised learning, Monte Carlo Tree Search) and recent ones (deep RL).
Overview:
– Train a policy network with supervised learning on professional Go games; then continue training using policy gradient (play against itself from random previous iterations, +1 / -1 reward for winning / losing).
– Also learn a value network.
– Combine the policy and value networks in a Monte Carlo Tree Search algorithm to select actions by lookahead search.
Summary of RL methods:
– Policy gradients: very general, but suffer from high variance, so training requires a lot of samples. Challenge: sample-efficiency.
– Q-learning: does not always work, but when it works it is usually more sample-efficient. Challenge: exploration.
– Guarantees: policy gradients converge to a local optimum of J(θ), which is often good enough; Q-learning offers no guarantees, since you are approximating the Bellman equation with a complicated function approximator.
Outline (next: Inverse Reinforcement Learning):
– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm
– Inverse Reinforcement Learning
Inverse Reinforcement Learning.
[Diagram: the standard RL / optimal control setup. A dynamics model P_sa gives the probability distribution over next states given the current state and action; a reward function R describes the desirability of states; RL / optimal control produces a controller/policy π, which prescribes the action to take for each state.]
Key challenges:
– Providing a formal specification of the control task
– Building a good dynamics model
– Finding closed-loop controllers
Apprenticeship learning: leverage expert demonstrations to learn to perform a desired task. Relevant measures: running time, sample complexity, and the performance of the resulting controller. Example applications: quadruped locomotion, autonomous helicopter flight.
Problem setup. Given:
– Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
– No reward function
– Teacher's demonstration: s_0, a_0, s_1, a_1, s_2, a_2, ...
Desired output: a policy $\tilde{\pi}$ which (ideally) has performance guarantees, i.e., $\mathbb{E}\left[\sum_t \gamma^t R^*(s_t) \mid \tilde{\pi}\right] \ge \mathbb{E}\left[\sum_t \gamma^t R^*(s_t) \mid \pi_{\text{teacher}}\right] - \varepsilon$. Note: R* is unknown.
A first attempt: learn the policy directly by supervised learning.
– Fix a policy class
– Estimate a policy from the training examples (s_0, a_0), (s_1, a_1), (s_2, a_2), ...
Limitations:
– Fails to provide strong performance guarantees
– Underlying assumption: policy simplicity
Why inverse RL? [Diagram, as before: dynamics model P_sa and reward function R → RL / optimal control → controller/policy π.] The controller/policy, which prescribes the action to take for each state, is typically very complex; the reward function R is often fairly succinct. So instead of imitating the policy directly, recover the reward.
Inverse RL algorithm (sketch):
– "Guess" the reward function: find a reward function such that the teacher maximally outperforms all previously found controllers.
– Find the optimal control policy π for the current guess of the reward function R_w.
– If the teacher's margin over the best found policy is small enough, exit the algorithm.
Learning through reward functions rather than directly learning policies.
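Schematically, the loop looks like this; `fit_reward`, `solve_mdp`, and `performance` are hypothetical helpers standing in for the real subroutines.

```python
# Apprenticeship-learning loop, in the spirit of Abbeel & Ng (2004).
# fit_reward, solve_mdp, and performance are hypothetical placeholders.
def apprenticeship_learning(teacher_demo, epsilon=1e-2, max_iter=50):
    policies = []
    for _ in range(max_iter):
        # "Guess" a reward R_w under which the teacher maximally
        # outperforms all previously found policies
        R_w = fit_reward(teacher_demo, policies)
        pi = solve_mdp(R_w)              # optimal policy for the guess
        policies.append(pi)
        # Stop once some found policy is within epsilon of the teacher
        if performance(teacher_demo, R_w) - performance(pi, R_w) <= epsilon:
            return pi
    return policies[-1]
```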
Case study: highway driving [Abbeel & Ng, 2004]. Teacher in the training world; learned policy in the testing world.
– Dynamics model / simulator P_sa(s_{t+1} | s_t, a_t)
– Teacher's demonstration: 1 minute in the "training world"
– Note: R* is unknown
– Reward features: 5 features corresponding to lanes/shoulders; 10 features corresponding to the presence of another car in the current lane at different distances
In each video, the left sub-panel shows a demonstration of a different driving "style", and the right sub-panel shows the behavior learned from watching the demonstration. [Video stills: driving demonstration vs. learned behavior.]
[Diagram: the apprenticeship learning loop. From the teacher's play (s_0, a_0, s_1, a_1, ...), learn the reward function R and the dynamics model P_sa; RL / optimal control then yields a controller π, whose autonomous play (s_0, a_0, s_1, a_1, ...) is used to refine the learned models.]
Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
Outline (next: AlphaGo):
– Markov Decision Process (MDP)
– Q-Learning
– Policy Gradient
– Actor-Critic Algorithm
– Inverse Reinforcement Learning
AlphaGo.
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354.
Go has only two basic rules:
1. Capture rule: stones that have no liberties are captured and removed from the board.
2. Ko rule: a player is not allowed to make a move that returns the game to the previous position.
Why is Go hard for computers?
– Branching factor: the game tree is enormous.
– Value function: board positions are very hard to evaluate.
[Diagram: AlphaGo's training pipeline. Human-expert (state, action) pairs train the policy networks (parameters θ); self-play (state, win/loss) pairs train the value network.]
SL policy network: input is a 19x19x48 stack of feature planes; 12 convolutional + rectifier layers; a softmax outputs a probability map over moves.
Rollout policy network: a much smaller, faster network (the slide shows a 9x9x48-style input alongside the 12-layer net), trading accuracy for speed:
– SL policy network p_σ: forward pass 3 milliseconds, accuracy 55.4%
– Rollout policy network p_π: forward pass 2 microseconds, accuracy 24.2%
RL policy network: the same architecture as the SL policy network (19x19x48 input, 12 convolutional + rectifier layers, softmax probability map), further trained with policy gradient by self-play: the current network plays against a randomly selected previous iteration of itself.
Value network: input is a 19x19x48 stack of feature planes; convolutional + rectifier layers followed by a fully connected layer, outputting a scalar (the predicted value of the position).
Monte Carlo Tree Search (MCTS): a heuristic search algorithm that uses random sampling to obtain numerical results; widely applied to artificial intelligence (AI) problems. Strong pre-AlphaGo Go programs (such as Crazy Stone) all rely on MCTS.
Each round of Monte Carlo tree search consists of four steps:
1. Selection: starting from the root, repeatedly choose the most promising child (within the current tree) until a leaf is reached.
2. Expansion: add child nodes for unexplored moves at the reached leaf.
3. Evaluation: leaf evaluation, e.g., play a random rollout from the leaf until a terminal position.
4. Backpropagation: propagate the result back up the visited path, updating node statistics.
How to choose the next move? After many rounds, pick the move at the root with the best statistics (e.g. the most visits).
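A bare-bones sketch of one such round follows; `legal_moves`, `play`, and `rollout_result` are hypothetical game hooks, and the selection rule here is a simple greedy mean-value choice, not AlphaGo's full selection formula.

```python
# One round of MCTS: selection, expansion, evaluation, backpropagation.
# legal_moves, play, and rollout_result are hypothetical game hooks.
class Node:
    def __init__(self, state):
        self.state, self.children = state, {}
        self.visits, self.value_sum = 0, 0.0

def mcts_round(root):
    path, node = [root], root
    while node.children:                       # 1. Selection: descend the tree
        _, node = max(node.children.items(),
                      key=lambda kv: kv[1].value_sum / (kv[1].visits + 1))
        path.append(node)
    for a in legal_moves(node.state):          # 2. Expansion: add children
        node.children[a] = Node(play(node.state, a))
    value = rollout_result(node.state)         # 3. Evaluation: random rollout
    for n in path:                             # 4. Backpropagation: update stats
        n.visits += 1
        n.value_sum += value

# After many rounds, choose the most-visited move at the root:
# best_move = max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```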
AlphaGo in competition:
– Fan Hui vs. AlphaGo (2015)
– Lee Sedol vs. AlphaGo (2016), including the famous "divine move" (神之一手)
– AlphaGo Master vs. top professionals, including Ke Jie (9 dan), Chen Yaoye (9 dan), Park Junghwan (9 dan), Mi Yuting (9 dan), and Tang Weixing (9 dan)
– Ke Jie (9 dan) vs. AlphaGo Master (2017)
AlphaGo Zero [Silver et al., Nature 2017]: a single network $f_\theta(s) = (p, v)$ replaces the separate policy and value networks, where
– p: the probability of selecting each action
– v: an evaluation score for the position
– s: the position state
MCTS self-play generates training tuples (s, π, z), where π is the search-improved move distribution and z is the game outcome, and the network is trained to match them.
Course summary.
Probabilistic methods:
– Preliminaries about Bayesian inference (Bayes rule)
– Generative/discriminative models (LDA)
– Strategies (log-likelihood, MAP, MLE)
– Algorithms (GD, EM, sampling)
– Applications (Markov, GMM)
Neural networks:
– Biological motivation and connections (neural cell, MP neuron)
– Neural networks and back propagation (perceptron, MLP, BP)
– Convolutional neural networks (conv, pooling)
– Recurrent neural networks (LSTM)
– Stochastic models in neural networks (Hopfield nets, RBM, wake/sleep model)
– Hybrid models (DBN, AutoEncoder, GAN)
Reinforcement learning:
– Learning with reward (MDP, Q-Learning, Policy Gradient, Actor-Critic algorithm)
– Learning without reward (IRL)
– AlphaGo (policy net, value net, MCTS)
Closing remarks:
– This is only one corner of the world of artificial intelligence; please keep exploring.
– A follow-up course will be offered in the fall semester of 2019 (Introduction to AI? Statistical Learning?).
– Thanks to every student for getting up early to attend the lectures.