Deep RL + K-Means
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University
Matt Gormley, Lecture 25, Apr. 17, 2019

Reminders: Homework 8 (Reinforcement Learning) is out.
How well learned behavior transfers from one setting to another is determined by your design of the state representation. Two examples:
- State Space A: <x, y> position on the map, e.g., s_t = <74, 152>
- State Space B: window of pixel colors centered at the current Pac-Man location, e.g., s_t = a small binary matrix of pixel values
THEN: “…the world’s top computer program for backgammon, TD-GAMMON (Tesauro, 1992, 1995), learned its strategy by playing over one million practice games against itself…” (Mitchell, 1997)
NOW: deep RL systems such as DQN playing Atari games and AlphaGo playing Go, as shown on the following slides.
[Figure: the agent-environment loop; at each time step the agent receives observation O_t and reward R_t and takes action A_t.]
Figures from David Silver (Intro RL lecture)
Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider
Figures from Mnih et al. (2013)
                 B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
Random                354       1.2       0  −20.4     157       110          179
Sarsa [3]             996       5.2     129    −19     614       665          271
Contingency [4]      1743         6     159    −17     960       723          268
DQN                  4092       168     470     20    1952      1705          581
Human                7456        31     368     −3   18900     28010         3690

HNeat Best [8]       3616        52     106     19    1800       920         1720
HNeat Pixel [8]      1332         4      91    −16    1325       800         1145
DQN Best             5184       225     661     21    4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an ε-greedy policy with ε = 0.05.
Figures from Mnih et al. (2013)
Question: What if our state space S is too large to represent with a table? Examples: raw-pixel states like the Atari screens above, or board configurations in Go.
Answer: Use a parametric function to approximate the table entries.
Key Idea:
1. Use a neural network Q(s, a; θ) to approximate Q*(s, a)
2. Learn the parameters θ via SGD with training examples <s_t, a_t, r_t, s_{t+1}>

- Strawman loss function (i.e., what we cannot compute)
- Approximating the Q function with a neural network
- Approximating the Q function with a linear model
- Deep Q-Learning
- Function approximators: <state, action_i> → q-value vs. state → all action q-values (see the sketch below)
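A minimal sketch of this key idea, using the "state → all action q-values" style of approximator; the network sizes, learning rate, and discount factor below are illustrative assumptions, not values from the lecture:

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 8, 4, 0.9   # illustrative sizes (assumptions)

# Q(s, a; θ): one network maps a state to the q-values of all actions at once.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))
opt = torch.optim.SGD(q_net.parameters(), lr=1e-2)

def q_learning_step(s, a, r, s_next):
    """One SGD step on one training example <s_t, a_t, r_t, s_{t+1}>."""
    with torch.no_grad():
        # Bootstrapped target: r + γ max_a' Q(s', a'; θ).
        target = r + GAMMA * q_net(s_next).max()
    pred = q_net(s)[a]               # Q(s_t, a_t; θ)
    loss = (pred - target) ** 2      # squared temporal-difference error
    opt.zero_grad()
    loss.backward()
    opt.step()
```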
Problem: updating on experiences in the order they occur means they are
- not i.i.d., as SGD would assume
- quickly forgotten: rare experiences that might later be useful to learn from are lost

Uniform Experience Replay:
- Keep a replay memory D = {e_1, e_2, …, e_N} of the N most recent experiences e_t = <s_t, a_t, r_t, s_{t+1}>
- Alternate two steps (a sketch of the memory follows below):
  1. Repeat T times: randomly sample e_i from D and apply a Q-Learning update to e_i
  2. The agent selects an action using an epsilon-greedy policy to receive a new experience, which is added to D

Prioritized Experience Replay:
- similar to Uniform ER, but sample so as to prioritize experiences with high error
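A minimal sketch of the uniform replay memory; the class name, capacity, and method names are illustrative assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Uniform experience replay: keep the N most recent experiences."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off the end

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))  # e_t = <s_t, a_t, r_t, s_{t+1}>

    def sample(self):
        # Uniform sampling breaks the correlation between consecutive
        # experiences, so SGD's i.i.d. assumption is less badly violated.
        return random.choice(self.buffer)
```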
Figure from Silver et al. (2016)
[Figure: Go board showing all moves of the game; moves 234, 245, and 250 were played at the points previously occupied by moves 179, 122, and 59.]
Game 1: Fan Hui (Black) vs. AlphaGo (White). AlphaGo wins by 2.5 points.
The number of sequences of moves is O(b^d), where b is the branching factor and d is the depth of the game:
- Go: b = 250 and d = 150
- Chess: b = 35 and d = 80
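A quick back-of-the-envelope computation with the values above shows why exhaustive search over the game tree is hopeless:

```python
import math

# Number of move sequences is roughly b^d; compare the magnitudes via log10.
log10_go = 150 * math.log10(250)    # ≈ 359.7, i.e., about 10^360 sequences
log10_chess = 80 * math.log10(35)   # ≈ 123.5, i.e., about 10^124 sequences
print(f"Go: ~10^{log10_go:.0f}, Chess: ~10^{log10_chess:.0f}")
```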
AlphaGo's approach:
- Define a neural network to approximate the value function
- Train by policy gradient (a sketch follows below)
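A minimal sketch of the policy-gradient idea in the REINFORCE style: scale the gradient of the log-probability of each chosen move by the game outcome. The sizes here are illustrative assumptions, far smaller than Go, and this is not AlphaGo's exact update:

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_MOVES = 16, 8   # illustrative sizes (assumptions)
policy = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_MOVES))
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)

def policy_gradient_step(states, actions, outcome):
    """One update from one self-play game; outcome is +1 for a win, -1 for a loss."""
    log_probs = torch.log_softmax(policy(states), dim=1)     # (T, NUM_MOVES)
    chosen = log_probs[torch.arange(len(actions)), actions]  # log p(a_t | s_t)
    loss = -(outcome * chosen).sum()   # ascend log-probability of winning moves
    opt.zero_grad()
    loss.backward()
    opt.step()
```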
[Figure 1: neural network training pipeline and architecture. A fast rollout policy p_π(a|s) and a policy network p_σ(a|s) are trained by classification on human expert positions; an RL policy network p_ρ is then trained by policy gradient on self-play positions; and a value network v_θ(s′) is trained by regression to predict whether the current player wins in positions from the self-play data set.]
Figure from Silver et al. (2016)
Figure from Silver et al. (2016)
[Figure: Elo ratings (500 to 3,500) of GnuGo, Fuego, Pachi, Fan Hui, AlphaGo, and AlphaGo distributed, shown against the beginner kyu (k), amateur dan (d), and professional dan (p) rank scales.]
Question: For the R(s,a) values shown on the arrows below, which are the corresponding Q*(s,a) values? Assume discount factor = 0.5.
[Figure: diagram of states with R(s,a) values on the arrows and candidate Q*(s,a) values.]
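As a warm-up for this style of question, here is a worked instance of the recursion Q*(s,a) = R(s,a) + γ max_{a'} Q*(s',a') on a made-up two-step chain, not the diagram from the slide:

```python
# Hypothetical chain (an assumption for illustration): s0 --a--> s1 --a--> terminal,
# with reward R = 8 on each arrow and discount factor gamma = 0.5.
GAMMA = 0.5
q_s1 = 8 + GAMMA * 0      # terminal next state contributes no future reward
q_s0 = 8 + GAMMA * q_s1   # 8 + 0.5 * 8 = 12
print(q_s0, q_s1)         # 12.0 8.0
```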
Outline:
- Optimization background: Coordinate Descent, Block Coordinate Descent
- Clustering: Inputs and Outputs, Objective-based Clustering
- K-Means: the K-Means Objective, Computational Complexity, the K-Means Algorithm (Lloyd's Method)
- Initialization: Random, Farthest Point, K-Means++
Goal: Automatically partition unlabeled data into groups of similar datapoints.
Question: When and why would we want to do this?
Useful for:
- representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes)
Slides in this clustering section courtesy of Nina Balcan.
Example application: clustering users of social networks by profile, e.g., in the Facebook network or the Twitter network.
Example: Given a set of datapoints, Lloyd's method proceeds as follows (an implementation sketch follows below):
1. Select initial centers at random.
2. Assign each point to its nearest center.
3. Recompute the optimal centers given the fixed clustering.
4. Repeat steps 2 and 3 until the clustering stops changing.
We get a good quality solution in this example.
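A minimal NumPy sketch of Lloyd's method as just illustrated; the function name and convergence test are illustrative choices, and the empty-cluster edge case is ignored:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Lloyd's method: X has shape (n, d); returns centers and assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        z = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned points.
        new_centers = np.array([X[z == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # clustering has stabilized
            break
        centers = new_centers
    return centers, z
```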
Lloyd's method always converges, but it may converge to a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.
Local optimum: every point is assigned to its nearest center, and every center is the mean value of its points. Such a local optimum can be arbitrarily worse than the optimal solution.
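For reference, the objective these two alternating updates perform block coordinate descent on is the standard K-Means objective (not written out on the slides themselves):

```latex
\min_{\mu_1,\dots,\mu_k} \;\min_{z_1,\dots,z_n}\; \sum_{i=1}^{n} \lVert x_i - \mu_{z_i} \rVert^2
```

Fixing the centers μ and minimizing over the assignments z gives the "assign each point to its nearest center" step; fixing z and minimizing over μ gives the "recompute the means" step.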
This bad performance can happen even with well-separated Gaussian clusters: some of the Gaussians get combined into a single cluster.
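The outline above lists K-Means++ as one remedy; here is a minimal sketch of its seeding rule, which samples each new center with probability proportional to the squared distance from the nearest center chosen so far (the function name is an illustrative choice):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """K-Means++ seeding: spread initial centers out to avoid bad local optima."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # first center: uniform at random
    for _ in range(k - 1):
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # P(x) ∝ D(x)^2
    return np.array(centers)
```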
K-Means: You should be able to…
1. Distinguish between coordinate descent and block coordinate descent
2. Define an objective function that gives rise to a "good" clustering
3. Apply block coordinate descent to an objective function preferring each point to be close to its nearest cluster center
4. Implement the K-Means algorithm
5. Connect the nonconvexity of the K-Means objective function with the (possibly) poor performance of random initialization