CS440/ECE448 Lecture 12: Stochastic Games, Stochastic Search, and Learned Evaluation Functions
Slides by Svetlana Lazebnik, 9/2016 Modified by Mark Hasegawa-Johnson, 2/2019
Types of game environments: deterministic vs. stochastic, perfect vs. imperfect information
Value(state) = max_{my moves} min_{opp's moves} Reward

Value(state) = max_{my moves} min_{opp's moves} E[Reward]

E[Reward] = Σ_{outcomes} Probability(outcome) × Reward(outcome)
Value(node) =
§ Utility(node) if node is terminal
§ max_action Value(Succ(node, action)) if type = MAX
§ min_action Value(Succ(node, action)) if type = MIN
§ Σ_action P(Succ(node, action)) × Value(Succ(node, action)) if type = CHANCE
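The four cases above can be sketched as a direct recursion. The `Node` class and its fields here are illustrative stand-ins, not from the lecture:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str = "TERMINAL"      # "MAX", "MIN", "CHANCE", or "TERMINAL"
    utility: float = 0.0             # payoff; used only at terminal nodes
    children: list = field(default_factory=list)  # successor nodes
    probs: list = field(default_factory=list)     # outcome probabilities (CHANCE only)

def value(node):
    if node.node_type == "TERMINAL":
        return node.utility                              # Utility(node)
    vals = [value(c) for c in node.children]
    if node.node_type == "MAX":
        return max(vals)                                 # best move for MAX
    if node.node_type == "MIN":
        return min(vals)                                 # best reply for MIN
    # CHANCE: probability-weighted sum of successor values
    return sum(p * v for p, v in zip(node.probs, vals))
```

A chance node with equally likely payoffs of +1 and −5, for instance, evaluates to −2.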
Example: she picks a coin to match the one she has.

§ If the outcomes are equally likely, one choice gives expected reward ½ × 1 + ½ × (−5) = −2, and the other gives ½ × (−1) + ½ × 5 = 2.
§ But expectation is the wrong model if she picks the coin she wants. She knows that I will pick Jefferson – then she will pick the penny!
§ The minimax value is my worst-case outcome (e.g., to decide if I should even play this game…): the minimax strategy maximizes my minimum reward.
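A quick check of the expected-reward arithmetic in this example (the ±1/±5 payoffs and the ½ probabilities are taken from the slide; the helper name is made up):

```python
def expected_reward(probs, rewards):
    # E[Reward] = sum over outcomes of Probability(outcome) * Reward(outcome)
    return sum(p * r for p, r in zip(probs, rewards))

# One choice: outcomes +1 and -5, equally likely -> expected reward -2.
# The other:  outcomes -1 and +5, equally likely -> expected reward +2.
```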
§ Max's moves
§ Min's moves
§ Missing info

States are grouped into information sets for each player.
E[Reward] = Σ_{outcomes} Probability(outcome) × Reward(outcome)

Monte Carlo approximation:

E[Reward] ≈ (1/N) Σ_{i=1..N} Reward(i-th random game)
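The Monte Carlo approximation above can be sketched as follows; the toy game (terminal rewards +1 and −5, equally likely, so the true expectation is −2) is an assumed stand-in for a real random playout:

```python
import random

def play_random_game(rng):
    # Stand-in for a full random playout: terminal reward is +1 or -5,
    # each with probability 1/2, so the true expectation is -2.
    return rng.choice([1.0, -5.0])

def estimate_expected_reward(n_games, seed=0):
    rng = random.Random(seed)
    total = sum(play_random_game(rng) for _ in range(n_games))
    return total / n_games       # (1/N) * sum_i Reward(i-th random game)
```

As N grows, the estimate converges to the true expected reward.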
Monte Carlo Tree Search:
§ What about games with enormous branching factors and no good heuristics – like Go?
§ Instead of exhaustive search, use randomized simulations, doing as much computation as you can afford.
§ Selection: descend the tree using a tree policy (trading off exploration and exploitation).
§ Simulation: play out a default policy (e.g., random moves) until a terminal state is reached.
§ Backpropagation: use the simulation outcomes to update the value estimates.
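A minimal sketch of these steps on a one-level tree, assuming a UCB1 tree policy and treating each `rollouts[i]` function as the default-policy simulation for move i; all names and constants are illustrative:

```python
import math
import random

class ArmStats:
    """Per-child statistics kept by the search tree."""
    def __init__(self):
        self.n = 0        # visit count
        self.value = 0.0  # running mean of simulation outcomes

def ucb1(stats, t, c=1.4):
    # Tree policy score: exploit (mean value) + explore (a bonus that
    # shrinks as the node is visited more often).
    if stats.n == 0:
        return float("inf")
    return stats.value + c * math.sqrt(math.log(t) / stats.n)

def mcts(rollouts, iterations, seed=0):
    """rollouts[i](rng) simulates a game after move i to a terminal state."""
    rng = random.Random(seed)
    children = [ArmStats() for _ in rollouts]
    for t in range(1, iterations + 1):
        # 1. Selection: tree policy trades off exploration and exploitation.
        i = max(range(len(children)), key=lambda j: ucb1(children[j], t))
        # 2.-3. Expansion + simulation: default policy to a terminal state.
        outcome = rollouts[i](rng)
        # 4. Backpropagation: update visit count and value estimate.
        s = children[i]
        s.n += 1
        s.value += (outcome - s.value) / s.n
    # Recommend the most-visited move.
    return max(range(len(children)), key=lambda j: children[j].n)
```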
Learned evaluation functions:
§ Training phase: play, e.g., a billion random games and record the resulting data.
§ Generalization: there are billions of possible starting states, and some will never be achieved during actual game play. Some starting states will be missed, so a generalized evaluation function is necessary.
§ Testing phase: during actual game play, evaluate the state at your horizon using Value(state) ≈ a1*x1+a2*x2+…
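The generalization step can be sketched as fitting the weights of Value(state) ≈ a1*x1+a2*x2+… by least squares; the feature vectors, data, and gradient-descent settings below are made up for illustration:

```python
def fit_linear_value(features, outcomes, lr=0.1, epochs=300):
    """Fit weights a so that sum_i a[i]*x[i] approximates the game outcome."""
    a = [0.0] * len(features[0])
    for _ in range(epochs):
        for x, y in zip(features, outcomes):
            err = sum(ai * xi for ai, xi in zip(a, x)) - y
            for i, xi in enumerate(x):
                a[i] -= lr * err * xi     # gradient step on squared error
    return a

def value(a, x):
    # Evaluate a state at the search horizon from its feature vector x.
    return sum(ai * xi for ai, xi in zip(a, x))
```

Once fit, `value` can score states at the horizon that never appeared in the training games.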
“… should not waste their time … play go.”
Slides credit: Anton Ninno and Roy Laird, Ph.D. (antonninno@yahoo.com, roylaird@gmail.com); special thanks to Kiseido Publications.
Deep neural networks:
§ Treat the board position as an image.
§ Use the networks as function approximation machinery.
§ Output: a distribution over possible moves (policy) or the expected value of the position.
January 2016
§ Policy network: trained by self-play to maximize the expected final outcome; it beat the open-source Pachi Go program 85% of the time.
§ Value network: trained to predict the outcome starting from position s and following the learned policy for both players; its output is the predicted outcome.
AlphaGo's search: MCTS over edges storing visit counts N(s,a), action values Q(s,a), and prior probabilities P(s,a):
§ Selection: pick actions with high Q(s,a) plus an exploration bonus (proportional to P but inversely proportional to N).
§ Evaluation: score leaf positions by combining the value network estimate and the outcome of a simulated game using the rollout network.
§ Backup: set Q(s,a) to the mean of the values of all simulations passing through that edge.
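A sketch of the selection rule, assuming a PUCT-style bonus proportional to P(s,a) and inversely proportional to 1 + N(s,a); the constant `c_puct` and all numbers below are illustrative:

```python
import math

def select_action(Q, P, N, c_puct=1.0):
    """Pick argmax_a Q(s,a) + u(s,a), with u proportional to P / (1 + N)."""
    total_n = sum(N)
    scores = [
        q + c_puct * p * math.sqrt(total_n) / (1 + n)
        for q, p, n in zip(Q, P, N)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

A rarely visited action with a strong prior can outrank a heavily visited action with a higher value estimate, which is exactly the exploration behavior described above.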
Reached International Master level (arXiv, September 2015) by learning a good evaluation function.
alpha-beta search
in 2016
in a lifetime of playing
Texas Hold’em poker (2017)
http://xkcd.com/1002/ See also: http://xkcd.com/1263/