  1. Reinforcement Learning: Course Outline
  Context
  Algorithms: value functions; optimal policy; temporal differences and eligibility traces; Q-learning
  Playing Go: MoGo
  Feature Selection as a Game: problem setting; Monte-Carlo Tree Search; Feature Selection: the FUSE algorithm; experimental validation
  Active Learning as a Game: problem setting; the BAAL algorithm; experimental validation
  Constructive Induction

  2. Go as an AI Challenge
  Features
  ◮ Number of games: about 2 × 10^170, far more than the number of atoms in the observable universe
  ◮ Branching factor: ~200 (vs. ~30 for chess)
  ◮ How to assess a game position?
  ◮ Local and global features (symmetries, liberties, ...)
  Principles of MoGo (Gelly & Silver, 2007)
  ◮ A weak but unbiased assessment function: Monte-Carlo based
  ◮ Let the machine play against itself and build its own strategy

  3. Weak Unbiased Assessment: Monte-Carlo
  Brügmann (1993)
  1. While possible, add a stone (white and black alternately)
  2. Compute Win(black)
  3. Average over many such playouts
  Remark: the point is to be unbiased. If there exist situations where you (wrongly) think you are in good shape, then you go there, and you are in bad shape...
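
To make the idea concrete, here is a minimal sketch of Monte-Carlo position assessment. The game interface (`copy`, `legal_moves`, `play`, `winner`) is a hypothetical stand-in for a real Go engine, not the MoGo code.

```python
import random

def playout(position):
    """Play uniformly random legal moves until the game ends; return 1 if Black wins, else 0.
    `position` is assumed to expose copy(), legal_moves(), play(move) and winner()."""
    pos = position.copy()
    while True:
        moves = pos.legal_moves()
        if not moves:
            break
        pos.play(random.choice(moves))
    return 1 if pos.winner() == "black" else 0

def monte_carlo_value(position, n_playouts=1000):
    """Unbiased (but noisy) estimate of Black's winning probability from `position`."""
    wins = sum(playout(position) for _ in range(n_playouts))
    return wins / n_playouts
```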

  4. Build a Strategy: Monte-Carlo Tree Search
  In a given situation: select a move (Multi-Armed Bandit)
  At the end of the game:
  1. Assess the final position (Monte-Carlo)
  2. Update the reward of all moves played

  5. Select a Move: the Exploration vs Exploitation Dilemma
  Multi-Armed Bandits (Lai & Robbins, 1985)
  ◮ In a casino, one wants to maximize one's gains while playing
  ◮ Play the best arms so far? Exploitation
  ◮ But there might exist better arms... Exploration

  6. Multi-Armed Bandits, cont'd
  Auer et al. 2001, 2002; Kocsis & Szepesvári 2006
  For each arm (move):
  ◮ Reward: Bernoulli variable with mean $\mu_i$, $0 \le \mu_i \le 1$
  ◮ Empirical estimate: $\hat{\mu}_i \pm \mathrm{Confidence}(n_i)$, with $n_i$ the number of trials
  Decision: optimism in front of the unknown!
  Select $i^* = \operatorname{argmax}_i \left( \hat{\mu}_i + C \sqrt{\frac{\log\left(\sum_j n_j\right)}{n_i}} \right)$

  7. Multi-Armed Bandits, cont'd
  Auer et al. 2001, 2002; Kocsis & Szepesvári 2006
  For each arm (move):
  ◮ Reward: Bernoulli variable with mean $\mu_i$, $0 \le \mu_i \le 1$
  ◮ Empirical estimate: $\hat{\mu}_i \pm \mathrm{Confidence}(n_i)$, with $n_i$ the number of trials
  Decision: optimism in front of the unknown!
  Select $i^* = \operatorname{argmax}_i \left( \hat{\mu}_i + C \sqrt{\frac{\log\left(\sum_j n_j\right)}{n_i}} \right)$ (sketched below)
  Variants
  ◮ Take into account the standard deviation of $\hat{\mu}_i$
  ◮ Trade-off controlled by $C$
  ◮ Progressive widening
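
A minimal sketch of the selection rule above; `counts` and `rewards` are illustrative names, and the exploration constant `C` is the trade-off parameter of the slide.

```python
import math

def ucb_select(counts, rewards, C=1.0):
    """Pick the arm maximizing the empirical mean plus a UCB exploration bonus.
    counts[i]  -- number of times arm i was played (n_i)
    rewards[i] -- cumulative reward collected by arm i
    """
    total = sum(counts)
    best_arm, best_score = None, float("-inf")
    for i, n_i in enumerate(counts):
        if n_i == 0:
            return i  # play every arm once before trusting the formula
        score = rewards[i] / n_i + C * math.sqrt(math.log(total) / n_i)
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm
```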

  8. Monte-Carlo Tree Search
  Comments: MCTS grows an asymmetric tree
  ◮ The most promising branches are explored more often,
  ◮ so their assessment becomes more precise
  ◮ Needs heuristics to deal with many arms...
  ◮ Shares information among branches
  MoGo: world champion in 2006, 2007, 2009; first program to beat a 7th-dan player on the 19 × 19 board

  9. Reinforcement Learning: Course Outline
  Context
  Algorithms: value functions; optimal policy; temporal differences and eligibility traces; Q-learning
  Playing Go: MoGo
  Feature Selection as a Game: problem setting; Monte-Carlo Tree Search; Feature Selection: the FUSE algorithm; experimental validation
  Active Learning as a Game: problem setting; the BAAL algorithm; experimental validation
  Constructive Induction

  10. When Learning Amounts to Feature Selection
  Bioinformatics
  ◮ 30,000 genes
  ◮ few examples (they are expensive)
  ◮ goal: find the relevant genes

  11. Problem Setting
  Goals
  • Selection: find a subset of features
  • Ranking: order the features
  Formulation
  Let $\mathcal{F} = \{f_1, \ldots, f_d\}$ be the set of features, and define
  $G : \mathcal{P}(\mathcal{F}) \to \mathbb{R}$, $F \subseteq \mathcal{F} \mapsto \mathrm{Err}(F)$ = minimal error of the hypotheses built on $F$.
  Find $\operatorname{argmin}(G)$.
  Difficulties
  • A combinatorial optimization problem ($2^d$ subsets)
  • over an unknown function $G$...

  12. Approaches
  Filter (univariate method)
  Define $\mathrm{score}(f_i)$; iteratively add the features maximizing the score, or iteratively remove the features minimizing it (a greedy sketch follows this slide).
  + simple and cheap
  − very local optima
  Note: one can backtrack: better optima, but more expensive.
  Wrapper (multivariate method)
  Measure the quality of features together with other features: estimate $G(f_{i_1}, \ldots, f_{i_k})$.
  − expensive: one estimate = one learning problem
  + better optima
  Hybrid methods.
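
A minimal sketch of greedy forward selection under the views above; the name and signature `score(f, selected)` are illustrative, and `selected` may simply be ignored by a univariate filter score.

```python
def greedy_forward_selection(features, score, k):
    """Greedy forward selection: repeatedly add the feature that maximizes `score`.
    `score(f, selected)` may ignore `selected` (filter style) or retrain a model
    on selected + [f] (wrapper style)."""
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(f, selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```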

  13. Filter Approaches
  Notation
  Training set: $E = \{(x_i, y_i),\ i = 1..n,\ y_i \in \{-1, 1\}\}$; $f(x_i)$ = value of feature $f$ for example $x_i$.
  Information gain (decision trees)
  $p([f = v]) = \Pr(y = 1 \mid f(x_i) = v)$
  $QI([f = v]) = -p \log p - (1 - p) \log(1 - p)$
  $QI = \sum_v p(v)\, QI([f = v])$
  Correlation
  $\mathrm{corr}(f) = \frac{\sum_i f(x_i)\, y_i}{\sqrt{\sum_i f(x_i)^2 \times \sum_i y_i^2}} \propto \sum_i f(x_i)\, y_i$
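
A minimal sketch of these two filter scores with NumPy, for a discrete feature column `f` and labels `y` in {-1, 1} (both NumPy arrays; the function names are illustrative).

```python
import numpy as np

def conditional_entropy_score(f, y):
    """Entropy of P(y=1 | f=v), weighted by P(f=v) -- the QI quantity of the slide.
    Lower values mean a more informative feature."""
    score = 0.0
    for v in np.unique(f):
        mask = (f == v)
        p = np.mean(y[mask] == 1)
        if 0 < p < 1:
            score += np.mean(mask) * (-p * np.log(p) - (1 - p) * np.log(1 - p))
    return score

def correlation_score(f, y):
    """Cosine-style correlation between the feature column and the labels."""
    return np.dot(f, y) / np.sqrt(np.dot(f, f) * np.dot(y, y))
```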

  14. Wrapper Approaches
  Principle: generate and test
  Given a list of candidates $L = \{f_1, \ldots, f_p\}$:
  • Generate a candidate subset $F$
  • Compute $G(F)$:
    • learn $h_F$ from $E_{|F}$
    • test $h_F$ on a test set, giving $\hat{G}(F)$
  • Update $L$.
  Algorithms
  • hill-climbing / multiple restart
  • genetic algorithms (Vafaie & De Jong, IJCAI'95)
  • (*) genetic programming & feature construction (Krawiec, GPEH'01)
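
A minimal sketch of a hill-climbing wrapper, where `estimate_error(F)` is a hypothetical callback that trains a model on the feature subset `F` and returns its held-out error (the learn-then-test step of the slide).

```python
import random

def hill_climbing_wrapper(features, estimate_error, n_iter=100):
    """Local search over feature subsets: flip one feature at a time and
    keep the move if the estimated generalization error improves."""
    features = list(features)
    current = set(random.sample(features, k=max(1, len(features) // 2)))
    best_err = estimate_error(current)
    for _ in range(n_iter):
        f = random.choice(features)
        candidate = current ^ {f}          # add f if absent, remove it otherwise
        if not candidate:
            continue
        err = estimate_error(candidate)
        if err < best_err:
            current, best_err = candidate, err
    return current, best_err
```

Multiple restarts simply rerun this search from different random initial subsets and keep the best result.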

  15. A Posteriori Approaches
  Principle
  • Build hypotheses
  • Deduce the important features from them
  • Eliminate the others
  • Repeat
  Algorithm: SVM Recursive Feature Elimination (Guyon et al. 2002)
  • Linear SVM → $h(x) = \operatorname{sign}\left(\sum_i w_i f_i(x) + b\right)$
  • If $|w_i|$ is small, $f_i$ is not important
  • Eliminate the $k$ features with the smallest weights
  • Repeat.
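
A minimal sketch of this recursive elimination loop using a linear SVM from scikit-learn (assumed available); the step size `k` and the stopping size `n_keep` are illustrative parameters, not part of the original algorithm statement.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=10, k=1):
    """Recursive Feature Elimination: repeatedly fit a linear SVM and drop the
    features with the smallest absolute weights, until n_keep features remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        clf = LinearSVC().fit(X[:, active], y)
        weights = np.abs(clf.coef_).ravel()                      # one weight per active feature
        drop = set(np.argsort(weights)[:min(k, len(active) - n_keep)])
        active = [f for i, f in enumerate(active) if i not in drop]
    return active
```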

  16. Limitations
  Linear hypotheses
  • One weight per feature.
  Number of examples
  • The feature weights are coupled.
  • The dimension of the system is tied to the number of examples.
  Yet the feature selection problem typically arises precisely when there are not enough examples.

  17. Some References
  ◮ Filter approaches [1]
  ◮ Wrapper approaches
    ◮ Tackling combinatorial optimization [2,3,4]
    ◮ Exploration vs Exploitation dilemma
  ◮ Embedded approaches
    ◮ Using the learned hypothesis [5,6]
    ◮ Using a regularization term [7,8]
    ◮ Restricted to linear models [7] or linear combinations of kernels [8]
  [1] K. Kira and L. A. Rendell, ML'92
  [2] D. Margaritis, NIPS'09
  [3] T. Zhang, NIPS'08
  [4] M. Boullé, J. Mach. Learn. Res., 2007
  [5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Mach. Learn., 2002
  [6] J. Rogers and S. R. Gunn, SLSFS'05
  [7] R. Tibshirani, Journal of the Royal Statistical Society, '94
  [8] F. Bach, NIPS'08

  18. Feature Selection
  Notation
  $\mathcal{F}$: set of features; $F$: feature subset; $E$: training data set; $\mathcal{A}$: machine learning algorithm; $\mathrm{Err}$: generalization error
  Optimization problem
  Find $F^* = \operatorname{argmin}_{F \subseteq \mathcal{F}} \mathrm{Err}(\mathcal{A}, F, E)$
  Feature Selection goals
  ◮ Reduced generalization error
  ◮ More cost-effective models
  ◮ More understandable models
  Bottlenecks
  ◮ Combinatorial optimization problem: find $F \subseteq \mathcal{F}$
  ◮ Generalization error unknown

  19. FS as a Markov Decision Process
  [Figure: lattice of feature subsets over {f_1, f_2, f_3}]
  ◮ Set of features: $\mathcal{F}$
  ◮ Set of states: $\mathcal{S} = 2^{\mathcal{F}}$
  ◮ Initial state: $\emptyset$
  ◮ Set of actions: $\mathcal{A} = \{\text{add } f,\ f \in \mathcal{F}\}$
  ◮ Final state: any state
  ◮ Reward function: $V : \mathcal{S} \to [0, 1]$
  Goal: find $\operatorname{argmin}_{F \subseteq \mathcal{F}} \mathrm{Err}(\mathcal{A}(F, \mathcal{D}))$

  20. Optimal Policy
  [Figure: lattice of feature subsets over {f_1, f_2, f_3}]
  ◮ Policy: $\pi : \mathcal{S} \to \mathcal{A}$
  ◮ Final state reached by following a policy: $F_\pi$
  ◮ Optimal policy: $\pi^\star = \operatorname{argmin}_\pi \mathrm{Err}(\mathcal{A}(F_\pi, E))$
  Bellman's optimality principle
  $\pi^\star(F) = \operatorname{argmin}_{f \in \mathcal{F}} V^\star(F \cup \{f\})$
  $V^\star(F) = \begin{cases} \mathrm{Err}(\mathcal{A}(F)) & \text{if } \mathrm{final}(F) \\ \min_{f \in \mathcal{F}} V^\star(F \cup \{f\}) & \text{otherwise} \end{cases}$
  In practice
  ◮ $\pi^\star$ is intractable ⇒ approximation using UCT
  ◮ Computing $\mathrm{Err}(F)$ using a fast estimate
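
For a handful of features the Bellman recursion above can be solved exactly; a minimal sketch, where `err(F)` is a hypothetical fast error estimate for the subset `F` and `stop(F)` decides whether `F` is treated as final. This is only feasible for very small $d$, which is exactly why UCT is used instead.

```python
from functools import lru_cache

def optimal_value(features, err, stop):
    """Exact Bellman recursion on the subset lattice.
    err(F) -> estimated error of frozenset F; stop(F) -> True if F is a final state."""
    features = frozenset(features)

    @lru_cache(maxsize=None)
    def V(F):
        if stop(F) or F == features:
            return err(F)
        return min(V(F | {f}) for f in features - F)

    return V(frozenset())
```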

  21. FS as a Game
  [Figure: lattice of feature subsets over {f_1, f_2, f_3}]
  Exploration vs Exploitation tradeoff
  ◮ Virtually explore the whole lattice
  ◮ Gradually focus the search on the most promising subsets $F$
  ◮ Use a frugal, unbiased assessment of $F$
  How?
  ◮ Upper Confidence Tree (UCT) [1]
  ◮ UCT ⊂ Monte-Carlo Tree Search
  ◮ UCT tackles tree-structured optimization problems
  [1] L. Kocsis and C. Szepesvári, ECML'06

  22. Reinforcement Learning: Course Outline
  Context
  Algorithms: value functions; optimal policy; temporal differences and eligibility traces; Q-learning
  Playing Go: MoGo
  Feature Selection as a Game: problem setting; Monte-Carlo Tree Search; Feature Selection: the FUSE algorithm; experimental validation
  Active Learning as a Game: problem setting; the BAAL algorithm; experimental validation
  Constructive Induction

  23. The UCT Scheme
  ◮ Upper Confidence Tree (UCT) [1]
  ◮ Gradually grow the search tree
  ◮ Building blocks:
    ◮ Select the next action (bandit-based phase, inside the search tree)
    ◮ Add a node (a new leaf of the search tree)
    ◮ Select the next actions again (random phase, outside the tree)
    ◮ Compute the instant reward
    ◮ Update the information in all visited nodes
  ◮ Returned solution: the path of the explored tree visited most often
  [1] L. Kocsis and C. Szepesvári, ECML'06
  (A skeleton of one UCT simulation is sketched below.)
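
A minimal sketch of one UCT simulation following these building blocks. The `Node` structure and the environment interface (`initial_state`, `actions`, `step`, `rollout_reward`) are illustrative assumptions, not the MoGo or FUSE implementation.

```python
import math
import random

class Node:
    def __init__(self):
        self.children = {}      # action -> Node
        self.visits = 0
        self.total_reward = 0.0

def ucb(parent, child, C):
    """Bandit score of a child node: empirical mean plus exploration bonus."""
    if child.visits == 0:
        return float("inf")
    return child.total_reward / child.visits + C * math.sqrt(math.log(parent.visits) / child.visits)

def uct_simulation(root, env, C=1.0):
    """One UCT iteration: bandit-based descent, expansion of one leaf,
    random rollout, and backpropagation of the reward."""
    state, node, path = env.initial_state(), root, [root]
    # 1. Bandit-based phase: descend while every action of the node is already expanded
    while node.children and len(node.children) == len(env.actions(state)):
        action = max(node.children, key=lambda a: ucb(node, node.children[a], C))
        state, node = env.step(state, action), node.children[action]
        path.append(node)
    # 2. Expansion: add one new leaf to the search tree
    untried = [a for a in env.actions(state) if a not in node.children]
    if untried:
        action = random.choice(untried)
        child = node.children[action] = Node()
        state, node = env.step(state, action), child
        path.append(node)
    # 3. Random phase: finish the episode with random actions and compute the instant reward
    reward = env.rollout_reward(state)
    # 4. Backpropagation: update the information in all visited nodes
    for n in path:
        n.visits += 1
        n.total_reward += reward
```

Running `uct_simulation` many times grows the asymmetric tree described on slide 8; the returned solution is then read off as the most visited path from the root.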
