CS885 Reinforcement Learning Lecture 13c: June 13, 2018 Adversarial Search

SLIDE 1

CS885 Reinforcement Learning Lecture 13c: June 13, 2018

Adversarial Search [RusNor] Sec. 5.1-5.4

CS885 Spring 2018, Pascal Poupart, University of Waterloo

SLIDE 2

Outline

  • Minimax search
  • Evaluation functions
  • Alpha-beta pruning

SLIDE 3

Game search challenge

  • What makes game search challenging?
    – There is an opponent!
    – The opponent is malicious: it wants to win (i.e. it is trying to make you lose)
    – We need to take this into account when choosing moves
  • Simulate the opponent’s behaviour in our search
  • Notation: one player is called MAX (who wants to maximize its utility) and the other is called MIN (who wants to minimize MAX’s utility)

SLIDE 4

Example: Tic-Tac-Toe

MAX’s job is to use the search tree to determine the best move

SLIDE 5

Optimal strategies

  • Want to find the optimal strategy
    – One that leads to outcomes at least as good as any other strategy, given that MIN is playing optimally
    – Equilibrium (game theory)
    – Zero-sum game of perfect information

SLIDE 6

Minimax Value

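$$
\text{MINIMAX-VALUE}(n) =
\begin{cases}
\text{Utility}(n) & \text{if } n \text{ is a terminal state} \\
\max_{s \in \text{Succ}(n)} \text{MINIMAX-VALUE}(s) & \text{if } n \text{ is a MAX node} \\
\min_{s \in \text{Succ}(n)} \text{MINIMAX-VALUE}(s) & \text{if } n \text{ is a MIN node}
\end{cases}
$$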


SLIDE 7

Minimax algorithm

Returns action corresponding to best possible move
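
As a rough sketch of how the minimax decision could look for Tic-Tac-Toe, the code below picks the action for MAX with the highest minimax value. The board encoding (a 9-character string), the choice of "X" as MAX, and all function names are assumptions made for this example; the slides do not prescribe them.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
             (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def is_terminal(board):
    return winner(board) is not None or " " not in board


def utility(board):
    """Utility from MAX's point of view (MAX is assumed to play "X")."""
    w = winner(board)
    return 1 if w == "X" else -1 if w == "O" else 0


def actions(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def result(board, action, player):
    return board[:action] + player + board[action + 1:]


def minimax_value(board, player):
    """Minimax value of a board when `player` is about to move."""
    if is_terminal(board):
        return utility(board)
    other = "O" if player == "X" else "X"
    values = [minimax_value(result(board, a, player), other)
              for a in actions(board)]
    return max(values) if player == "X" else min(values)


def minimax_decision(board):
    """Return the action for MAX ("X") with the highest minimax value."""
    return max(actions(board),
               key=lambda a: minimax_value(result(board, a, "X"), "O"))


# X holds cells 0 and 8, O holds 3 and 4 and threatens to complete 3-4-5.
block = minimax_decision("X  OO   X")
print(block)  # 5: every other move lets O win immediately
```

Searching from the empty board works the same way but visits several hundred thousand positions, which is slow in pure Python.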

SLIDE 8

Properties of Minimax

  • Time complexity:
    – $O(b^d)$
  • Space complexity:
    – $O(bd)$: we just need to keep in memory the current branch with its children

Where $b$ is the branching factor and $d$ is the depth of the tree

SLIDE 9

Minimax and multi-player games

SLIDE 10

Chess

  • Can we write a minimax program that will play chess reasonably well?
    – For chess, $b \approx 35$ and $d \approx 100$
    – Do we really need to look at all those nodes?
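
To get a feel for these numbers, here is a small back-of-the-envelope calculation. It assumes that full minimax visits on the order of b^d nodes and simply plugs in the chess estimates from this slide.

```python
import math

b, d = 35, 100  # approximate branching factor and depth for chess

nodes = b ** d                      # exact integer, on the order of b^d
exponent = d * math.log10(b)        # log10 of the node count

print(f"b^d is roughly 10^{exponent:.0f}")       # roughly 10^154
print(f"number of decimal digits: {len(str(nodes))}")  # 155
```

Even at a billion nodes per second this count is far beyond any feasible computation, which is what motivates pruning and cutting off search early.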

SLIDE 11

Alpha-Beta Pruning

  • No, we do not need to look at all the nodes!
    – If we are smart (and careful) we can do pruning
        • Eliminate large parts of the tree from consideration
  • Alpha-beta pruning applied to a minimax tree
    – Returns the same decision as minimax
    – Prunes branches that cannot influence the final decision

SLIDE 12

Alpha-Beta Pruning

  • Alpha:
    – Value of the best (highest value) choice we have found so far on the path for MAX
  • Beta:
    – Value of the best (lowest value) choice we have found so far on the path for MIN
  • Update alpha and beta as search continues
  • Prune as soon as the value of the current node is known to be worse than the current alpha or beta values for MAX or MIN
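
The alpha and beta bookkeeping described above might be sketched as follows. The nested-list tree encoding and the `visited` list are assumptions added for illustration. In the sample tree, the leaves 4 and 6 are hypothetical placeholders: they stand in for leaves that get pruned and therefore never influence the result.

```python
import math


def alphabeta(node, is_max_node, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax value of `node`, skipping branches that cannot matter.

    alpha: best (highest) value found so far on the path for MAX.
    beta:  best (lowest) value found so far on the path for MIN.
    visited: optional list that records every leaf actually evaluated.
    """
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if is_max_node:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # MIN above would never let play reach this node
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, value)
        if beta <= alpha:
            break  # MAX above already has a better option elsewhere
    return value


tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]  # 4 and 6 are placeholders
seen = []
root_value = alphabeta(tree, True, visited=seen)
print(root_value)  # 3
print(seen)        # [3, 12, 8, 2, 14, 5, 2]: leaves 4 and 6 were pruned
```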

SLIDE 13

Alpha-Beta example

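[Figure: alpha-beta search tree. The MAX root has interval [-inf, inf]. After seeing leaf 3, the first MIN node has interval [-inf, 3].]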

SLIDE 14

Alpha-Beta example

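[Figure: alpha-beta search tree. After leaves 3 and 12, the first MIN node still has interval [-inf, 3]; the MAX root has interval [-inf, inf].]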

SLIDE 15

Alpha-Beta example

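[Figure: alpha-beta search tree. After leaves 3, 12 and 8, the first MIN node has interval [3, 3]; the MAX root has interval [3, inf].]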

SLIDE 16

Alpha-Beta example

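[Figure: alpha-beta search tree. First MIN node: leaves 3, 12, 8, interval [3, 3]. MAX root: [3, inf]. After seeing leaf 2, the second MIN node has interval [-inf, 2].]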

SLIDE 17

Alpha-Beta example

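[Figure: alpha-beta search tree. First MIN node: leaves 3, 12, 8, interval [3, 3]. MAX root: [3, inf]. Second MIN node: leaf 2, interval [-inf, 2]. Prune the remaining children of the second MIN node.]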

SLIDE 18

Alpha-Beta example

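[Figure: alpha-beta search tree. First MIN node: [3, 3]. Second MIN node: [-inf, 2]. After seeing leaf 14, the third MIN node has interval [-inf, 14]; the MAX root has interval [3, 14].]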

SLIDE 19

Alpha-Beta example

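[Figure: alpha-beta search tree. First MIN node: [3, 3]. Second MIN node: [-inf, 2]. After leaves 14 and 5, the third MIN node has interval [-inf, 5]; the MAX root has interval [3, 5].]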

SLIDE 20

Alpha-Beta example

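[Figure: alpha-beta search tree. First MIN node: [3, 3]. Second MIN node: [-inf, 2]. After leaves 14, 5 and 2, the third MIN node has interval [2, 2]; the MAX root has interval [3, 3].]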

SLIDE 21

Properties of Alpha-Beta

  • Pruning does not affect the final result
    – Prune parts of the tree that would never be reached in actual play
  • The order in which moves are evaluated is important
    – A bad move ordering will prune nothing
    – A perfect node ordering can reduce time complexity to $O(b^{d/2})$
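
The effect of move ordering can be illustrated on a tiny tree by counting how many leaves alpha-beta evaluates. This is a sketch under assumptions: the tree is encoded as nested lists with made-up leaf values, and `reorder` uses the true minimax values as an oracle, which a real program would not have (it would rely on a heuristic ordering instead).

```python
import math


def minimax(node, is_max_node):
    if not isinstance(node, list):
        return node
    values = [minimax(c, not is_max_node) for c in node]
    return max(values) if is_max_node else min(values)


def alphabeta(node, is_max_node, alpha=-math.inf, beta=math.inf):
    """Return (minimax value, number of leaves evaluated)."""
    if not isinstance(node, list):
        return node, 1
    best = -math.inf if is_max_node else math.inf
    visited = 0
    for child in node:
        value, n = alphabeta(child, not is_max_node, alpha, beta)
        visited += n
        if is_max_node:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # remaining children cannot change the decision
    return best, visited


def reorder(node, is_max_node, best_first):
    """Sort children by true minimax value (oracle, for illustration only)."""
    if not isinstance(node, list):
        return node
    children = [reorder(c, not is_max_node, best_first) for c in node]
    # MAX's best child has the highest value, MIN's best child the lowest.
    descending = (is_max_node == best_first)
    return sorted(children, key=lambda c: minimax(c, not is_max_node),
                  reverse=descending)


tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]  # hypothetical leaf values
as_given = alphabeta(tree, True)
best = alphabeta(reorder(tree, True, best_first=True), True)
worst = alphabeta(reorder(tree, True, best_first=False), True)
print(as_given, best, worst)  # (3, 7) (3, 5) (3, 9)
```

All three orderings return the same value, 3. Only the number of evaluated leaves changes: 5 with the best ordering, 7 as given, and all 9 with the worst ordering.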

SLIDE 22

Real-time decisions

  • Alpha-beta can be a huge improvement over minimax
    – Still not good enough, as we need to search all the way to terminal states for at least part of the search space
    – Need to make a decision about a move quickly
  • Heuristic evaluation function + cutoff test

SLIDE 23

Evaluation functions

  • Apply an evaluation function to a state
    – If terminal state, function returns actual utility
    – If non-terminal, function returns estimate of the expected utility (i.e. the chance of winning from that state)
    – Function must be fast to compute

SLIDE 24

Evaluation functions

  • Evaluation functions can be given by the designer of the program (using expert knowledge) or learned from experience
  • If features can be judged independently, a weighted linear function is good
    – $w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$, with $s$ as board state
  • Neural networks are commonly used today
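
A weighted linear evaluation function could look like the sketch below. Everything specific here is an assumption for illustration: the features are material differences (MAX's piece count minus MIN's), the weights are the commonly quoted chess material values, and the board state is reduced to a dictionary of piece counts.

```python
# Illustrative weights w_i: commonly quoted chess material values.
WEIGHTS = {"pawn": 1.0, "knight": 3.0, "bishop": 3.0, "rook": 5.0, "queen": 9.0}


def feature(state, piece):
    """f_i(s): MAX's count of a piece type minus MIN's count."""
    return state["max"].get(piece, 0) - state["min"].get(piece, 0)


def evaluate(state, weights=WEIGHTS):
    """Weighted linear evaluation: sum over i of w_i * f_i(s)."""
    return sum(w * feature(state, piece) for piece, w in weights.items())


# Hypothetical position, described only by piece counts.
state = {
    "max": {"pawn": 8, "knight": 2, "queen": 1},
    "min": {"pawn": 7, "rook": 1, "queen": 1},
}
score = evaluate(state)
print(score)  # 2.0 = 1*(+1) + 3*(+2) + 5*(-1) + 9*0
```

Each feature is computed on its own and then combined, which is only reasonable when the features really can be judged independently.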

SLIDE 25

Cutting off search

  • Instead of searching until we find a terminal state, we can cut search sooner and apply the evaluation function
  • When?
    – Arbitrarily (but deeper is better)
    – Quiescent states
        • States that are “stable”: not going to change value (by a lot) in the near future
    – Singular extensions
        • Searching deeper when you have a move that is “clearly better” (e.g. moving the king out of check)
        • Can be used to avoid the horizon effect
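
A depth-limited minimax with a fixed-depth cutoff test might be sketched as follows. The toy game is an assumption chosen to keep the example short: players alternately take 1 or 2 stones from a pile and whoever takes the last stone wins. The heuristic in `evaluate` is hypothetical as well, and so are all function names.

```python
def successors(state):
    """States reachable by taking 1 or 2 stones; state = (stones, max_to_move)."""
    stones, max_to_move = state
    return [(stones - take, not max_to_move) for take in (1, 2) if take <= stones]


def is_terminal(state):
    return state[0] == 0


def utility(state):
    # The player who just moved took the last stone and wins.
    _, max_to_move = state
    return -1.0 if max_to_move else 1.0


def evaluate(state):
    """Hypothetical heuristic: a pile that is a multiple of 3 is bad for the mover."""
    stones, max_to_move = state
    score = -0.5 if stones % 3 == 0 else 0.5
    return score if max_to_move else -score


def h_minimax(state, depth, max_depth):
    if is_terminal(state):
        return utility(state)   # actual utility at terminal states
    if depth >= max_depth:      # cutoff test: arbitrary fixed depth
        return evaluate(state)  # estimate instead of searching deeper
    values = [h_minimax(s, depth + 1, max_depth) for s in successors(state)]
    return max(values) if state[1] else min(values)


def best_move(state, max_depth):
    """Successor state that MAX should move to, searching max_depth plies."""
    return max(successors(state), key=lambda s: h_minimax(s, 1, max_depth))


choice = best_move((7, True), max_depth=2)
print(choice)  # (6, False): MAX takes one stone and leaves a pile of 6
```

A quiescence check or a singular extension would replace the plain `depth >= max_depth` test with a condition that also looks at the state itself.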

SLIDE 26

Cutting off search

  • How deep do we need to search?
    – Novice human chess player
        • 5-ply (minimax)
    – Master human chess player
        • 10-ply (alpha-beta)
    – Grandmaster human chess player
        • 14-ply + a fantastic evaluation function, opening and endgame databases
