  1. An Adaptive Pursuit Strategy for Allocating Operator Probabilities
     Dirk Thierens
     Department of Computer Science, Universiteit Utrecht, The Netherlands
     Dirk Thierens (Universiteit Utrecht) Adaptive Pursuit Allocation 1 / 26

  2. Outline
     1. Adaptive Operator Allocation
     2. Probability Matching
     3. Adaptive Pursuit Strategy
     4. Experiments
     5. Conclusion

  3. Adaptive Operator Allocation: What?
     Given:
     1. A set of K operators A = {a_1, ..., a_K}
     2. A probability vector P(t) = {P_1(t), ..., P_K(t)}: operator a_i is applied at time t in proportion to probability P_i(t)
     3. The environment returns rewards R_i(t) ≥ 0
     Goal: adapt P(t) such that the expected value of the cumulative reward E[R] = E[Σ_{t=1}^{T} R_i(t)] is maximized

  4. Adaptive Operator Allocation: Why?
     1. The probability of applying an operator is difficult to determine a priori
     2. It depends on the current state of the search process
     → An adaptive allocation rule specifies how the probabilities are adapted according to the performance of the operators

  5. Adaptive Operator Allocation: Requirements
     1. Non-stationary environment ⇒ operator probabilities need to be adapted continuously
     2. Stationary environment ⇒ operator probabilities should converge to the best-performing operator
     → Conflicting goals!

  6. Probability Matching: Main Idea
     An adaptive allocation rule often applied in the GA literature: the probability matching strategy
     Main idea: update P(t) such that the probability of applying operator a_i matches the proportion of its estimated reward Q_i(t) to the sum of all reward estimates Σ_{a=1}^{K} Q_a(t)

  7. Probability Matching: Reward Estimate
     The adaptive allocation rule computes an estimate of the reward received when applying an operator
     In non-stationary environments, older rewards should have less influence
     Exponential, recency-weighted average (0 < α < 1):
     Q_a(t+1) = Q_a(t) + α [R_a(t) − Q_a(t)]
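The recency-weighted update above can be sketched in a few lines of Python. The function name, the α value, and the reward stream are illustrative choices, not taken from the slides:

```python
# Sketch of the exponential, recency-weighted reward estimate.
# alpha and the constant reward stream are made-up illustration values.
def update_estimate(q, reward, alpha=0.8):
    """Q_a(t+1) = Q_a(t) + alpha * (R_a(t) - Q_a(t))."""
    return q + alpha * (reward - q)

q = 1.0                      # initial estimate Q_a(0), as in the algorithm
for r in [10, 10, 10]:       # a constant reward stream
    q = update_estimate(q, r)
print(round(q, 3))           # the estimate converges geometrically toward 10
```

Because each update moves the estimate a fixed fraction α of the remaining gap, old rewards decay with weight (1 − α)^k, which is exactly the recency weighting the slide asks for.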

  8. Probability Matching: Probability Adaptation
     In non-stationary environments, the probability of applying any operator should never drop below some minimal threshold P_min > 0
     For K operators the maximal probability is P_max = 1 − (K − 1) P_min
     Updating rule for P(t):
     P_a(t) = P_min + (1 − K · P_min) Q_a(t) / Σ_{i=1}^{K} Q_i(t)

  9. Probability Matching: Algorithm
     ProbabilityMatching(P, Q, K, P_min, α)
       for i ← 1 to K do P_i(0) ← 1/K; Q_i(0) ← 1.0
       while NotTerminated?() do
         a_s ← ProportionalSelectOperator(P)
         R_{a_s}(t) ← GetReward(a_s)
         Q_{a_s}(t+1) ← Q_{a_s}(t) + α [R_{a_s}(t) − Q_{a_s}(t)]
         for a ← 1 to K do
           P_a(t+1) ← P_min + (1 − K · P_min) Q_a(t+1) / Σ_{i=1}^{K} Q_i(t+1)
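The pseudocode above translates directly to Python. This is a minimal sketch; the function signature, the default α, and the `get_reward` callback are illustrative, not part of the slides:

```python
import random

def probability_matching(get_reward, K, p_min=0.1, alpha=0.8, steps=1000):
    """Probability matching: select an operator in proportion to P,
    update its reward estimate Q, then renormalize P so that
    P_a = p_min + (1 - K*p_min) * Q_a / sum(Q)."""
    P = [1.0 / K] * K          # P_i(0) = 1/K
    Q = [1.0] * K              # Q_i(0) = 1.0
    for _ in range(steps):
        a = random.choices(range(K), weights=P)[0]   # proportional selection
        Q[a] += alpha * (get_reward(a) - Q[a])       # recency-weighted estimate
        total = sum(Q)
        P = [p_min + (1 - K * p_min) * q / total for q in Q]
    return P, Q

# Two operators with constant rewards 10 and 9, as in the slide's example:
P, Q = probability_matching(lambda a: [10, 9][a], K=2)
print([round(p, 2) for p in P])   # close to [0.52, 0.48]
```

Note that P always sums to one: the P_min terms contribute K·P_min and the normalized Q terms contribute 1 − K·P_min.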

  10. Probability Matching: Problem
     Assume one operator is consistently better: for instance, 2 operators a_1 and a_2 with constant rewards R_1 = 10 and R_2 = 9
     If P_min = 0.1 we would like to apply operator a_1 with probability P_1 = 0.9 and operator a_2 with P_2 = 0.1
     Yet the probability matching allocation rule converges to P_1 = 0.52 and P_2 = 0.48!
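The converged probabilities in this example follow from a quick fixed-point calculation: with constant rewards the estimates Q converge to (10, 9), and plugging them into the update rule gives the values the slide quotes.

```python
# Fixed point of the probability-matching rule for the slide's example:
# constant rewards R1 = 10, R2 = 9, so Q converges to (10, 9).
p_min, K = 0.1, 2
Q = [10.0, 9.0]
P = [p_min + (1 - K * p_min) * q / sum(Q) for q in Q]
print([round(p, 2) for p in P])   # [0.52, 0.48], far from the desired (0.9, 0.1)
```

The gap is the whole point of the slide: probability matching only reflects the *ratio* of rewards (10:9), not which operator is best, so a consistently better operator is barely favored.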

  11. Adaptive Pursuit Strategy: Pursuit Method
     The pursuit algorithm is a rapidly converging algorithm applied in learning automata
     Main idea: update P(t) such that the operator a* that currently has the maximal estimated reward Q_{a*}(t) is pursued
     To achieve this, the pursuit method increases the selection probability P_{a*}(t) and decreases all other probabilities P_a(t), ∀a ≠ a*
     The adaptive pursuit algorithm extends the pursuit algorithm to make it applicable in non-stationary environments

  12. Adaptive Pursuit Strategy: Adaptive Pursuit Method
     Similar to probability matching:
     1. The adaptive pursuit algorithm proportionally selects an operator to execute according to the probability vector P(t)
     2. The estimated reward of the selected operator is updated with: Q_a(t+1) = Q_a(t) + α [R_a(t) − Q_a(t)]
     Different from probability matching:
     1. The selection probability vector P(t) is adapted in a greedy way

  13. Adaptive Pursuit Strategy: Probability Adaptation
     The selection probability of the current best operator a* = argmax_a [Q_a(t+1)] is increased (0 < β < 1):
     P_{a*}(t+1) = P_{a*}(t) + β [P_max − P_{a*}(t)]
     The selection probability of the other operators is decreased:
     ∀a ≠ a*: P_a(t+1) = P_a(t) + β [P_min − P_a(t)]

  14. Adaptive Pursuit Strategy: Probability Adaptation
     Note that the probabilities still sum to one:
     Σ_{a=1}^{K} P_a(t+1)
       = P_{a*}(t) + β [P_max − P_{a*}(t)] + Σ_{a=1, a≠a*}^{K} (P_a(t) + β [P_min − P_a(t)])
       = (1 − β) Σ_{a=1}^{K} P_a(t) + β [P_max + (K − 1) P_min]
       = (1 − β) Σ_{a=1}^{K} P_a(t) + β
       = 1

  15. Adaptive Pursuit Strategy: Algorithm
     AdaptivePursuit(P, Q, K, P_min, α, β)
       P_max ← 1 − (K − 1) P_min
       for i ← 1 to K do P_i(0) ← 1/K; Q_i(0) ← 1.0
       while NotTerminated?() do
         a_s ← ProportionalSelectOperator(P)
         R_{a_s}(t) ← GetReward(a_s)
         Q_{a_s}(t+1) ← Q_{a_s}(t) + α [R_{a_s}(t) − Q_{a_s}(t)]
         a* ← Argmax_a(Q_a(t+1))
         P_{a*}(t+1) ← P_{a*}(t) + β [P_max − P_{a*}(t)]
         for a ← 1 to K do
           if a ≠ a* then P_a(t+1) ← P_a(t) + β [P_min − P_a(t)]
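The adaptive pursuit pseudocode can likewise be sketched in Python. The signature and the default α and β values are illustrative assumptions, not the paper's experimental settings:

```python
import random

def adaptive_pursuit(get_reward, K, p_min=0.1, alpha=0.8, beta=0.8, steps=1000):
    """Adaptive pursuit: update Q for the selected operator, then push
    P toward p_max for the current best operator and toward p_min for
    all others (each move covers a fraction beta of the remaining gap)."""
    p_max = 1 - (K - 1) * p_min
    P = [1.0 / K] * K          # P_i(0) = 1/K
    Q = [1.0] * K              # Q_i(0) = 1.0
    for _ in range(steps):
        a = random.choices(range(K), weights=P)[0]   # proportional selection
        Q[a] += alpha * (get_reward(a) - Q[a])       # recency-weighted estimate
        best = max(range(K), key=lambda i: Q[i])     # a* = argmax_a Q_a
        for i in range(K):
            target = p_max if i == best else p_min
            P[i] += beta * (target - P[i])
    return P, Q

# Same 2-operator example with constant rewards 10 and 9:
P, Q = adaptive_pursuit(lambda a: [10, 9][a], K=2)
print([round(p, 2) for p in P])   # close to [0.9, 0.1]
```

Unlike probability matching, the greedy update converges to the (P_max, P_min) allocation the earlier slide argued for, while the P_min floor keeps every operator sampled occasionally.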

  16. Adaptive Pursuit Strategy: Example
     Consider again the 2-operator stationary environment with R_1 = 10 and R_2 = 9 (P_min = 0.1)
     As opposed to the probability matching rule, the adaptive pursuit method plays the better operator a_1 with maximum probability P_max = 0.9
     It also keeps playing the poorer operator a_2 with minimal probability P_min = 0.1 in order to maintain its ability to adapt to any change in the reward distribution

  17. Experiments: Environment
     We consider an environment with 5 operators a_i, i = 1...5
     Each operator a_i receives a uniformly distributed reward R_i = U[i − 1 ... i + 1]:
       Operator:  R_1       R_2       R_3       R_4       R_5
       Reward:    U[0..2]   U[1..3]   U[2..4]   U[3..5]   U[4..6]
     After a fixed time interval ∆T the reward distributions are randomly reassigned to the operators
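A minimal sketch of this switching environment, assuming one reward query per time step; the factory function, the default ∆T, and the seeding are illustrative choices:

```python
import random

def make_switching_env(K=5, delta_T=50, seed=0):
    """Switching environment from the slides: operator a_i draws a reward
    uniformly from [m_i - 1, m_i + 1], where the means m start as 1..K;
    every delta_T reward queries the means are randomly reassigned."""
    rng = random.Random(seed)
    state = {"t": 0, "means": list(range(1, K + 1))}

    def get_reward(a):
        if state["t"] > 0 and state["t"] % delta_T == 0:
            rng.shuffle(state["means"])   # reassign reward distributions
        state["t"] += 1
        m = state["means"][a]
        return rng.uniform(m - 1, m + 1)

    return get_reward

env = make_switching_env()
print(round(env(4), 2))   # one reward draw for operator a_5
```

An allocation strategy only sees the rewards, so after each switch it must rediscover which operator currently owns the U[4..6] distribution; that is what the P_min floor makes possible.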

  18. Experiments: Upper Bounds on Performance
     If we had full knowledge of the reward distributions and their switching pattern, we could always pick the optimal operator a* and achieve an expected reward E[R_Opt] = 5
     The performance in the stationary (non-switching) environment of a correctly converged operator allocation scheme represents an upper bound on the performance attainable in the switching environment
     Three allocation strategies are compared:
     1. Non-adaptive, equal-probability allocation rule
     2. Probability matching allocation rule (P_min = 0.1)
     3. Adaptive pursuit allocation rule (P_min = 0.1)

  19. Experiments: Non-Adaptive, Equal-Probability Allocation Rule
     The probability of choosing the optimal operator a*_Fixed:
     Prob[a_s = a*_Fixed] = 1/K = 0.2
     The expected reward:
     E[R_Fixed] = Σ_{a=1}^{K} Prob[a_s = a] E[R_a] = Σ_{a=1}^{K} E[R_a] / K = 3
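This expected reward is a two-line calculation: each of the 5 operators has mean reward E[R_a] = a, and the fixed rule weights them equally.

```python
# Expected reward of the non-adaptive, equal-probability rule:
# operator means are 1..5 (midpoints of U[a-1, a+1]).
means = [1, 2, 3, 4, 5]
K = len(means)
e_fixed = sum(means) / K   # each operator chosen with probability 1/K
print(e_fixed)             # 3.0
```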

  20. Experiments: Probability Matching Allocation Rule
     The probability of choosing the optimal operator a*_ProbMatch:
     Prob[a_s = a*_ProbMatch] = P_min + (1 − K · P_min) E[R_{a*}] / Σ_{a=1}^{K} E[R_a] = 0.2666...
     The expected reward:
     E[R_ProbMatch] = Σ_{a=1}^{K} Prob[a_s = a] E[R_a]
                    = Σ_{a=1}^{K} [P_min + (1 − K · P_min) E[R_a] / Σ_{i=1}^{K} E[R_i]] E[R_a]
                    = 3.333...
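Both numbers on this slide follow from the converged probability-matching allocation over the operator means:

```python
# Expected reward of the probability-matching rule in the stationary
# 5-operator environment (operator means 1..5, P_min = 0.1).
means = [1, 2, 3, 4, 5]
K, p_min = len(means), 0.1
total = sum(means)
probs = [p_min + (1 - K * p_min) * m / total for m in means]
e_pm = sum(p * m for p, m in zip(probs, means))
print(round(probs[-1], 4))   # 0.2667 = Prob[a_s = a*], matching the slide
print(round(e_pm, 3))        # 3.333, only modestly above the fixed rule's 3.0
```

The comparison is the takeaway: probability matching lifts the expected reward from 3.0 to only about 3.33, while a scheme converging to (P_max, P_min) would play the best operator 90% of the time.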
