CS885 Reinforcement Learning Lecture 8a: May 25, 2018
Multi-armed Bandits [SutBar] Sec. 2.1-2.7, [Sze] Sec. 4.2.1-4.2.2
CS885 Spring 2018 Pascal Poupart 1 University of Waterloo
Outline
– Exploration/exploitation tradeoff
– Regret
– Single state: S = {s}
– A: set of actions (also known as arms)
– Space of rewards (often re-scaled to be [0,1])
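The three ingredients above (a single state, a set of arms, rewards in [0,1]) can be sketched as a small simulator. This is a minimal illustration only, not taken from the lecture: the class name `BernoulliBandit`, the helper `average_rewards`, and the choice of Bernoulli (0 or 1) rewards are all assumptions made for the example.

```python
import random


class BernoulliBandit:
    """Hypothetical single-state bandit: each arm pays 1 with a fixed probability, else 0."""

    def __init__(self, success_probs, seed=0):
        # success_probs[a] is the assumed mean reward of arm a, so it must lie in [0, 1].
        if not all(0.0 <= p <= 1.0 for p in success_probs):
            raise ValueError("each success probability must lie in [0, 1]")
        self.success_probs = list(success_probs)
        self.rng = random.Random(seed)

    @property
    def num_arms(self):
        # Size of the action set A.
        return len(self.success_probs)

    def pull(self, arm):
        # There is only one state, so the reward depends on the chosen arm alone.
        return 1.0 if self.rng.random() < self.success_probs[arm] else 0.0


def average_rewards(bandit, pulls_per_arm):
    """Pull every arm the same number of times; return the empirical mean reward per arm."""
    return [
        sum(bandit.pull(arm) for _ in range(pulls_per_arm)) / pulls_per_arm
        for arm in range(bandit.num_arms)
    ]
```

For instance, `average_rewards(BernoulliBandit([0.2, 0.8]), 1000)` returns one estimate per arm; an agent that does not know the success probabilities has to form such estimates from its own pulls.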