

1. Analysis of Evaluation-Function Learning by Comparison of Sibling Nodes
   Tomoyuki Kaneko (1) and Kunihito Hoki (2)
   (1) University of Tokyo, Japan, kaneko@acm.org
   (2) University of Electro-Communications
   Advances in Computer Games 13

2. Outline
   - Background: machine learning of evaluation functions; recent success in shogi
   - Analysis of the (partial) gradient of the Minmax value
     - When is it differentiable?
     - Is it equal to the gradient of the leaf evaluation? (implicitly assumed in previous work)
   - Experiments in shogi
     - How frequently is the Minmax value non-differentiable?
     - Upper bounds via multiple PVs
     - Different gradients in multiple PVs

3. Minmax search (Tilburg photo)

4. Minmax search
   - Minmax value: the result of Minmax search
     - the minimum or maximum of the children's values (for an internal node)
     - the value of the evaluation function (for a leaf node)
   - PV (principal variation): the left-most such branch (see the sketch below)
     - a path from the root to a leaf such that Minmax(child) = Minmax(parent) at every step
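As a concrete reference for these definitions, here is a minimal Minmax sketch in Python that also returns the PV leaf; the Node representation and the tie-breaking toward the left-most child are illustrative assumptions, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    leaf_value: float = 0.0               # used only when the node is a leaf

def minmax(node: Node, maximizing: bool) -> Tuple[float, Node]:
    """Return (Minmax value, PV leaf). Ties break toward the left-most child."""
    if not node.children:                  # leaf node: value comes from evaluation
        return node.leaf_value, node
    best_value: Optional[float] = None
    best_leaf: Optional[Node] = None
    for child in node.children:            # children examined left to right
        value, leaf = minmax(child, not maximizing)
        if best_value is None or (value > best_value if maximizing else value < best_value):
            best_value, best_leaf = value, leaf   # strict comparison keeps the left-most PV on ties
    return best_value, best_leaf

# Example: a MAX root over leaf values -5 and 3 -> Minmax value 3, PV ends at the second leaf
root = Node(children=[Node(leaf_value=-5.0), Node(leaf_value=3.0)])
print(minmax(root, maximizing=True))
```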

5. Evaluation function
   - Definition: eval(p, θ)
     - p: a game position
     - θ ∈ R^N: a parameter vector
   - Assumption: eval(p, θ) is differentiable with respect to θ
   - Example (sketched in code below): θ = (a, b)
     - eval(p, θ) = a · #pawns(p) + b · #pieces(p)
     - ∂/∂a eval(p, θ) = #pawns(p)
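The slide's two-parameter example, sketched in Python; the dictionary-based position and the feature counters are placeholders for whatever representation an engine actually uses.

```python
import numpy as np

def features(position):
    """Feature vector (#pawns(p), #pieces(p)) of a position, as in the slide's example."""
    return np.array([position["pawns"], position["pieces"]], dtype=float)

def evaluate(position, theta):
    """eval(p, theta) = a * #pawns(p) + b * #pieces(p) for theta = (a, b)."""
    return float(features(position) @ theta)

def eval_gradient(position, theta):
    """For this linear evaluation the gradient w.r.t. theta is the feature vector itself,
    e.g. d eval / d a = #pawns(p)."""
    return features(position)

p = {"pawns": 9, "pieces": 20}
theta = np.array([1.0, 3.0])               # a = 1, b = 3
print(evaluate(p, theta))                   # 9*1 + 20*3 = 69
print(eval_gradient(p, theta))              # [ 9. 20.]
```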

6. Motivation: machine learning
   - Goal of learning evaluation functions: adjust the Minmax value via θ
   - Comparison training: make the Minmax value of a grandmaster's move better than that of the other legal moves (Nowatzyk 2000, Tesauro 2001, Hoki 2006); a schematic objective is given below
     - Success in shogi: outperformed all hand-tuned evaluation functions
     - How it works: first talk of Session 10 (tomorrow)
   - TDLeaf: make the Minmax value similar to that of future positions (Baxter et al. 2000)
   - Common problem: how to obtain the gradient of the Minmax value?
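One common way to write such a comparison objective down (a schematic form for orientation only; the exact loss and search values used in the cited papers may differ):

$$
J(\theta) \;=\; \sum_{p \in P} \; \sum_{m \neq m^{*}_{p}} T\!\left( s(p \cdot m, \theta) \;-\; s(p \cdot m^{*}_{p}, \theta) \right)
$$

Here P is the set of training positions, m*_p is the grandmaster's move in p, p·m is the position reached by playing m, s(·, θ) is the Minmax value returned by a (shallow) search, and T is a monotone penalty such as a sigmoid, so siblings that search better than the grandmaster's move are penalized. Minimizing J by gradient descent requires ∂s/∂θ_i, i.e. the gradient of a Minmax value discussed next.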

7. Partial derivative of the Minmax value
   [Figure: plot of x², illustrating adjustment by gradient descent]
   - Goal: adjust the Minmax value R of the root
   - Ideal method: update θ along ∂R/∂θ_i
   - Known problem: R is not always partially differentiable
   - Workaround: update θ along ∂L/∂θ_i, the gradient at the leaf L of the PV, instead of ∂R/∂θ_i (see the sketch below)
   - Observation: the Minmax values are equal, R = L (by definition of the PV)
   - Expectation: the gradients are similar, ∂R/∂θ_i = ∂L/∂θ_i
   - Question: how different are ∂R/∂θ_i (root) and ∂L/∂θ_i (PV leaf)?
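The workaround, combined with the earlier linear-evaluation example, can be sketched as follows; the tree encoding and the use of a plain feature vector as the leaf gradient are simplifying assumptions.

```python
import numpy as np

def minmax_with_pv(node, theta, maximizing):
    """Minmax over a tree whose leaves carry feature vectors (leaf value = features . theta).
    Returns (Minmax value, features of the PV leaf); ties break toward the left-most child."""
    if not node["children"]:
        return float(node["features"] @ theta), node["features"]
    best = None
    for child in node["children"]:
        value, feats = minmax_with_pv(child, theta, not maximizing)
        if best is None or (value > best[0] if maximizing else value < best[0]):
            best = (value, feats)
    return best

def surrogate_root_gradient(root, theta):
    """Workaround: use the gradient of eval at the PV leaf L as a stand-in for dR/d theta_i.
    For a linear evaluation that gradient is simply the PV leaf's feature vector."""
    _, pv_features = minmax_with_pv(root, theta, maximizing=True)
    return pv_features

leaf = lambda f: {"children": [], "features": np.array(f, dtype=float)}
root = {"children": [leaf([1.0, 0.0]), leaf([0.0, 1.0])], "features": None}
theta = np.array([2.0, 5.0])
print(minmax_with_pv(root, theta, True))     # (5.0, array([0., 1.])): second leaf is the PV
print(surrogate_root_gradient(root, theta))  # [0. 1.]
```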

8. Example and informal discussion: one child
   - OK: ∂R/∂θ_i = ∂L/∂θ_i
   - With a single child, L is always the PV, whatever its Minmax value
   - The Minmax value of R therefore always equals that of L: if L changes by δ, so does R (L + δ ⇒ R + δ)

9. Example: two children (different leaf values)
   - OK: ∂R/∂θ_i = ∂L/∂θ_i
   - Suppose the sibling has value n = −5: L remains better than n for any perturbation δ with L + δ > n, i.e. δ > n − L
   - L stays the PV over that range, so the Minmax value follows L: R becomes L + δ whenever L + δ > n

10. Example: two children (tie)
    - NG: ∂R/∂θ_i is not defined
    - Suppose n = 0, a tie with L; then for a perturbation δ of L:
      - L is better than n while L + δ > n (δ > 0), so L is the PV and R follows L: R = L + δ
      - n is better than L while L + δ < n (δ < 0), so n is the PV and R no longer follows L: R = n ≠ L + δ
    - Either L or n becomes the PV for δ ≈ 0, and R has a kink at δ = 0
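Restated as a formula (assuming, as the word "better" suggests, that the root is a MAX node and L = n = 0): the root value is R(δ) = max(L + δ, n) = max(δ, 0), and its two one-sided derivatives at δ = 0 disagree:

$$
\lim_{\delta \to 0^{+}} \frac{R(\delta) - R(0)}{\delta} = 1,
\qquad
\lim_{\delta \to 0^{-}} \frac{R(\delta) - R(0)}{\delta} = 0 .
$$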

11. Unique PV ↔ differentiable?
    - True, as expected: a unique PV implies ∂R/∂θ_i = ∂L/∂θ_i (in particular, ∂R/∂θ_i is defined)
    - False: "∂R/∂θ_i is defined" does not imply "the PV is unique and ∂R/∂θ_i = ∂L/∂θ_i"
    - A counterexample exists in which ∂R/∂θ_i is defined but ∂R/∂θ_i ≠ ∂L/∂θ_i (see slide 15)

12. Example: two children (different leaf values)
    - OK: ∂R/∂θ_i = ∂L/∂θ_i
    - If θ_i is changed by Δ (θ_i ← θ_i + Δ), all leaf values (both L and n) change
    - Whatever the gradients of L and n, L remains better than n for sufficiently small |Δ|: there exists a > 0 such that, for |Δ| < a,
      L + (∂L/∂θ_i) · Δ > n + (∂n/∂θ_i) · Δ,
      and hence R(θ_i ← θ_i + Δ) ≈ L + (∂L/∂θ_i) · Δ.

13. Example: two children (tie, same leaf gradient)
    - OK: ∂R/∂θ_i = ∂L/∂θ_i
    - Even if L and n have the same value, R is still differentiable provided L and n also have the same gradient:
      R(θ_i ← θ_i + Δ) ≈ L + (∂L/∂θ_i) · Δ

14. Example: two children (tie, different leaf gradients)
    - NG: ∂R/∂θ_i is not defined
    - When L and n have the same value but different gradients, the change of R depends on whether Δ → +0 or Δ → −0 (numerical check below):
      R(θ_i ← θ_i + Δ) ≈ L + (∂L/∂θ_i) · Δ   (Δ > 0; L is the PV)
      R(θ_i ← θ_i + Δ) ≈ n + (∂n/∂θ_i) · Δ   (Δ < 0; n is the PV)
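A small numerical check of this case, with made-up linear leaves L(θ_i) = θ_i and n(θ_i) = −θ_i that tie at θ_i = 0 with gradients +1 and −1 (illustrative values, not taken from the slide):

```python
def leaf_L(theta_i):          # dL/d theta_i = +1
    return theta_i

def leaf_n(theta_i):          # dn/d theta_i = -1
    return -theta_i

def root(theta_i):            # MAX node over the two tied leaves; equals |theta_i|
    return max(leaf_L(theta_i), leaf_n(theta_i))

delta = 1e-6
right = (root(0.0 + delta) - root(0.0)) / delta      # quotient from the right: +1
left = (root(0.0 - delta) - root(0.0)) / (-delta)    # quotient from the left:  -1
print(right, left)   # the two one-sided limits disagree, so dR/d theta_i is undefined at 0
```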

15. Example: ∂L/∂θ_i hidden by others
    - NG: ∂R/∂θ_i ≠ ∂L/∂θ_i (both are defined, but they differ)
    - In the slide's tree, ∂L/∂θ_i = 1 while ∂R/∂θ_i = 0
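One tree that produces exactly these numbers (an illustrative construction, not necessarily the tree drawn on the slide): a MAX root over two MIN nodes, where the left MIN node contains the PV leaf L (gradient 1) together with a tied constant sibling, and the right MIN child is a constant leaf. Whichever way θ_i moves, the root value stays at 0, so ∂R/∂θ_i = 0 even though ∂L/∂θ_i = 1.

```python
def leaf_L(d):      # PV leaf: value d, so dL/d theta_i = 1 (value 0 at d = 0)
    return d

def leaf_a(d):      # tied sibling of L under the same MIN node: constant 0, gradient 0
    return 0.0

def leaf_b(d):      # leaf under the root's other MIN child: constant 0, gradient 0
    return 0.0

def root(d):        # R = MAX( MIN(L, a), b ) = max(min(d, 0), 0) = 0 for every d
    return max(min(leaf_L(d), leaf_a(d)), leaf_b(d))

delta = 1e-6
print((root(+delta) - root(0.0)) / +delta)   # 0.0: right derivative of R
print((root(-delta) - root(0.0)) / -delta)   # 0.0: left derivative of R
# Both one-sided derivatives are 0, so dR/d theta_i = 0, while the left-most PV
# (root -> MIN(L, a) -> L) ends in L with dL/d theta_i = 1.
```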

16. Practical issues and experiments
    - (1) How frequently does a non-differentiable R occur?
      - Upper bounds are estimated over training positions by counting multiple PVs, and different gradients within multiple PVs
    - (2) Is 1 small enough as the update step Δ?
      - Δ ≥ 1 is forced for integer parameters, whereas the "∀ε > 0, ∃δ > 0" argument behind gradient descent assumes real-valued parameters and arbitrarily small steps
      - Measured: how frequently the objective function J improves when updated along ∂J/∂θ_i with Δ = 1, 2, 4, and 8 (see the proceedings)

17. Experiments in shogi: evaluation functions
    - Practical evaluation functions
      - Learnt: the main evaluation function of GPSShogi revision 2590; near optimal, obtained by learning ≈ 1.4 (8) million parameters
      - Hand-tuned: the old evaluation function used until 2008; reasonable, but far from optimal
    - Poor evaluation functions
      - Piece: the initial values used for learning; the same piece values as Learnt, 0 for all other parameters
      - Piece128: extreme initial values; 128 for piece values, 0 for all other parameters
    - GPSShogi: open source, winner of CO 2011
      http://gps.tanaka.ecc.u-tokyo.ac.jp/gpsshogi/index.php?GPSShogiEn

18. Statistics: number of legal moves and number of moves with similar evaluation
    [Figure: average number of moves vs. move number; series: all legal moves, and sibling moves within an αβ window for Learnt, Hand-tuned, Piece, and Piece128]
    - Legal moves: ≈ 20 in the opening, ≈ 130 in the endgame
    - Practical evaluation functions (Learnt, Hand-tuned): ≈ 20 moves fall within an αβ window of 2 pawns, in both the opening and the endgame
    - Poor evaluation functions (Piece, Piece128): ≈ 40 moves or more fall within an αβ window of 2 pawns

19. Frequency: number of PVs
    [Figure: cumulative frequency (%) of the number of PVs (1 to 4) for Learnt, Hand-tuned, Piece, and Piece128]
    - Practical evaluation functions: the PV is almost always unique
      - Learnt: a unique PV in almost all positions
      - Hand-tuned: a unique PV in more than 80% of positions; more than 2 PVs in fewer than 4% of positions
    - Poor evaluation functions: the PV is rarely unique
      - Piece: multiple PVs in more than 86% of positions
      - Piece128: multiple PVs in more than 99% of positions
