SLIDE 1

Deep RL + K-Means

10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 25
Apr. 17, 2019

SLIDE 2

Reminders

  • Homework 8: Reinforcement Learning
    – Out: Wed, Apr 10
    – Due: Wed, Apr 24 at 11:59pm
  • Today’s In-Class Poll
    – http://p25.mlcourse.org

SLIDE 3

Q&A

Q: Do we have to retrain our RL agent every time we change our state space?

A: Yes. But whether your state space changes from one setting to another is determined by your design of the state representation. Two examples:
  – State Space A: <x, y> position on the map, e.g. s_t = <74, 152>
  – State Space B: a window of pixel colors centered at the current Pac-Man location, e.g. s_t = a small grid of pixel values (shown as a 0/1 matrix on the slide)

SLIDE 4

DEEP RL EXAMPLES

SLIDE 5

TD-Gammon → AlphaGo

Learning to beat the masters at board games

“…the world’s top computer program for backgammon, TD-GAMMON (Tesauro, 1992, 1995), learned its strategy by playing over one million practice games against itself…” (Mitchell, 1997)

[Figure: “THEN” (TD-Gammon) vs. “NOW” (AlphaGo).]

SLIDE 6

Playing Atari with Deep RL

  • Setup: the RL system observes the pixels on the screen
  • It receives rewards as the game score
  • Actions decide how to move the joystick / buttons

[Figure: the agent-environment loop, with action A_t, reward R_t, and observation O_t.]

Figures from David Silver (Intro RL lecture)

SLIDE 7

Playing Atari with Deep RL

Videos:
  – Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk
  – Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU

Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

Figures from Mnih et al. (2013)

SLIDE 8

Playing Atari with Deep RL

                  B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
Random                 354       1.2       0  −20.4     157       110          179
Sarsa [3]              996       5.2     129    −19     614       665          271
Contingency [4]       1743         6     159    −17     960       723          268
DQN                   4092       168     470     20    1952      1705          581
Human                 7456        31     368     −3   18900     28010         3690

HNeat Best [8]        3616        52     106     19    1800       920         1720
HNeat Pixel [8]       1332         4      91    −16    1325       800         1145
DQN Best              5184       225     661     21    4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an ε-greedy policy with ε = 0.05.

Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

Figures from Mnih et al. (2013)

SLIDE 9

Deep Q-Learning

Question: What if our state space S is too large to represent with a table?

Examples:
  • s_t = pixels of a video game
  • s_t = continuous values of the sensors in a manufacturing robot
  • s_t = sensor output from a self-driving car

Answer: Use a parametric function to approximate the table entries.

Key Idea (see the sketch below):
  1. Use a neural network Q(s, a; θ) to approximate Q*(s, a)
  2. Learn the parameters θ via SGD with training examples <s_t, a_t, r_t, s_{t+1}>
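To make step 2 concrete, here is a minimal sketch of the SGD update, assuming (for brevity) a linear model Q(s, a; θ) = θ_a · φ(s) in place of the neural network; the feature map φ, the sizes, the discount factor, and the step size are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# A sketch of the Key Idea with a linear function approximator standing
# in for the neural network. All constants below are assumed values.

GAMMA = 0.9      # discount factor (assumed)
ALPHA = 0.01     # SGD step size (assumed)
N_ACTIONS = 4
N_FEATURES = 8

theta = np.zeros((N_ACTIONS, N_FEATURES))  # one weight vector per action

def phi(s):
    """Hypothetical feature map from a raw state to a feature vector."""
    return np.asarray(s, dtype=float)

def q(s, a):
    return theta[a] @ phi(s)

def sgd_update(s_t, a_t, r_t, s_next):
    """One SGD step on the squared TD error for the example <s, a, r, s'>."""
    # Bootstrapped regression target, treated as a constant.
    target = r_t + GAMMA * max(q(s_next, a) for a in range(N_ACTIONS))
    td_error = q(s_t, a_t) - target
    # Gradient of 0.5 * td_error^2 w.r.t. theta[a_t] is td_error * phi(s_t).
    theta[a_t] -= ALPHA * td_error * phi(s_t)
```

Note how this is just regression onto a moving, bootstrapped target, which is the connection to regression named in the learning objectives.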

SLIDE 10

Deep Q-Learning

Whiteboard:
  – Strawman loss function (i.e. what we cannot compute)
  – Approximating the Q function with a neural network
  – Approximating the Q function with a linear model
  – Deep Q-Learning
  – Function approximators (<state, action_i> → q-value vs. state → all action q-values)

SLIDE 11

Experience Replay

  • Problems with online updates for Deep Q-learning:
    – the updates are not i.i.d., as SGD would assume
    – the agent quickly forgets rare experiences that might later be useful to learn from
  • Uniform Experience Replay (Lin, 1992):
    – Keep a replay memory D = {e_1, e_2, …, e_N} of the N most recent experiences e_t = <s_t, a_t, r_t, s_{t+1}>
    – Alternate two steps (see the sketch below):
      1. Repeat T times: randomly sample e_i from D and apply a Q-Learning update to e_i
      2. Agent selects an action using an epsilon-greedy policy to receive a new experience, which is added to D
  • Prioritized Experience Replay (Schaul et al., 2016):
    – similar to Uniform ER, but sample so as to prioritize experiences with high error
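A minimal sketch of the Uniform Experience Replay loop above, reusing the hypothetical q and sgd_update helpers from the earlier sketch; EPSILON, T, the memory capacity, and the env.step interface are all assumptions for illustration.

```python
import random
from collections import deque

EPSILON = 0.1          # exploration rate (assumed)
T = 32                 # replay updates per interaction (assumed)
replay_memory = deque(maxlen=10_000)  # keeps only the N most recent experiences

def epsilon_greedy(s, n_actions):
    if random.random() < EPSILON:
        return random.randrange(n_actions)               # explore
    return max(range(n_actions), key=lambda a: q(s, a))  # exploit

def replay_step(env, s, n_actions):
    """One alternation of the two steps from the slide."""
    # Step 1: repeat T times: sample uniformly from D, apply a Q-update.
    for _ in range(min(T, len(replay_memory))):
        s_i, a_i, r_i, s_i_next = random.choice(replay_memory)
        sgd_update(s_i, a_i, r_i, s_i_next)
    # Step 2: act epsilon-greedily; add the new experience to D.
    a = epsilon_greedy(s, n_actions)
    s_next, r = env.step(a)  # hypothetical environment interface
    replay_memory.append((s, a, r, s_next))
    return s_next
```

The bounded deque implements the "N most recent experiences" policy: once full, appending a new experience silently evicts the oldest one.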

SLIDE 12

Alpha Go

Game of Go:
  • 19x19 board
  • Players alternately play black/white stones
  • Goal is to fully encircle the largest region on the board
  • Simple rules, but extremely complex game play

[Figure: the full move record of Game 1, Fan Hui (Black) vs. AlphaGo (White); AlphaGo wins by 2.5 points.]

Figure from Silver et al. (2016)

SLIDE 13

Alpha Go

  • State space is too large to represent explicitly, since the # of sequences of moves is O(b^d)
    – Go: b = 250 and d = 150
    – Chess: b = 35 and d = 80
  • Key idea:
    – Define a neural network to approximate the value function
    – Train by policy gradient (see the sketch below)

[Figure 1 from Silver et al. (2016): neural network training pipeline and architecture. A rollout policy and an SL policy network are trained by classification on human expert positions; an RL policy network is trained by policy gradient on self-play; a value network is trained by regression to predict whether the current player wins in positions from the self-play data set.]

Figure from Silver et al. (2016)
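As a toy illustration of training by policy gradient (not AlphaGo's actual training pipeline), here is a minimal REINFORCE-style sketch with a softmax policy over linear scores; the sizes, feature map, and step size are illustrative assumptions.

```python
import numpy as np

# Softmax policy pi(a|s) proportional to exp(w[a] . phi(s)), updated by
# gradient ascent on the expected return. All constants are assumed.

N_ACTIONS, N_FEATURES = 3, 5
ALPHA = 0.05
w = np.zeros((N_ACTIONS, N_FEATURES))

def phi(s):
    return np.asarray(s, dtype=float)

def policy(s):
    scores = w @ phi(s)
    p = np.exp(scores - scores.max())  # numerically stable softmax
    return p / p.sum()

def reinforce_update(s, a, ret):
    """Step along grad log pi(a|s), scaled by the observed return."""
    p = policy(s)
    # grad_{w[a']} log pi(a|s) = (1[a' == a] - pi(a'|s)) * phi(s)
    grad = -np.outer(p, phi(s))
    grad[a] += phi(s)
    w[:] += ALPHA * ret * grad
```

Actions that led to higher returns have their probability pushed up; the subtracted pi(a'|s) term pushes the other actions down so the distribution stays normalized.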

SLIDE 14

Alpha Go

  • Results of a tournament
  • From Silver et al. (2016): “a 230 point gap corresponds to a 79% probability of winning”

[Figure from Silver et al. (2016): Elo ratings (roughly 500 to 3,500) for GnuGo, Fuego, Pachi, Crazy Stone, Fan Hui, AlphaGo, and distributed AlphaGo, annotated with beginner kyu (k), amateur dan (d), and professional dan (p) ranks.]

SLIDE 15

Learning Objectives

Reinforcement Learning: Q-Learning

You should be able to…
  1. Apply Q-Learning to a real-world environment
  2. Implement Q-learning
  3. Identify the conditions under which the Q-learning algorithm will converge to the true value function
  4. Adapt Q-learning to Deep Q-learning by employing a neural network approximation to the Q function
  5. Describe the connection between Deep Q-Learning and regression

SLIDE 16

Q-Learning

Question: For the R(s,a) values shown on the arrows below, which are the corresponding Q*(s,a) values? Assume a discount factor of 0.5.

[Figure: a small deterministic MDP with rewards on the arrows; the answer annotates each arrow with its Q*(s,a) value (values such as 2, 4, 8, 9, 16, and 18 appear). A worked sketch of the underlying recursion follows below.]
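The grid itself lives in the figure, but the recursion behind the answer is Q*(s, a) = R(s, a) + γ max_{a′} Q*(s′, a′). Here is a small worked sketch on a hypothetical 3-state deterministic MDP with γ = 0.5 (not the slide's actual MDP):

```python
# Value iteration on a tiny assumed MDP to illustrate the Q* recursion.

GAMMA = 0.5

# transitions[s][a] = (next_state, reward); state 2 is terminal.
transitions = {
    0: {"right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 8.0)},
    2: {},
}

Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
for _ in range(50):  # iterate the Bellman optimality update to convergence
    for s, acts in transitions.items():
        for a, (s2, r) in acts.items():
            best_next = max(Q[s2].values(), default=0.0)
            Q[s][a] = r + GAMMA * best_next

print(Q)
# Q*(1, right) = 8
# Q*(0, right) = 0 + 0.5 * 8 = 4
# Q*(1, left)  = 0 + 0.5 * Q*(0, right) = 0.5 * 4 = 2
```

Each Q* value is the immediate reward plus the discounted best Q* value at the next state, which is why halving patterns (8, 4, 2, …) show up with γ = 0.5.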

SLIDE 17

K-MEANS

SLIDE 18

K-Means Outline

  • Clustering: Motivation / Applications
  • Optimization Background
    – Coordinate Descent
    – Block Coordinate Descent
  • Clustering
    – Inputs and Outputs
    – Objective-based Clustering
  • K-Means
    – K-Means Objective
    – Computational Complexity
    – K-Means Algorithm / Lloyd’s Method
  • K-Means Initialization
    – Random
    – Farthest Point
    – K-Means++

SLIDE 19

Clustering, Informal Goals

Goal: Automatically partition unlabeled data into groups of similar datapoints.

Question: When and why would we want to do this?

Useful for:
  • Automatically organizing data.
  • Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
  • Understanding hidden structure in data.
  • Preprocessing for further analysis.

Slide courtesy of Nina Balcan

SLIDE 20

Applications (Clustering comes up everywhere…)

  • Cluster news articles or web pages or search results by topic.
  • Cluster protein sequences by function, or genes according to expression profile.
  • Cluster users of social networks by interest (community detection).

[Figure: Facebook network, Twitter network.]

Slide courtesy of Nina Balcan

SLIDE 21

Applications (Clustering comes up everywhere…)

  • Cluster customers according to purchase history.
  • Cluster galaxies or nearby stars (e.g. Sloan Digital Sky Survey).
  • And many, many more applications….

Slide courtesy of Nina Balcan

SLIDE 22

Optimization Background

Whiteboard:
  – Coordinate Descent
  – Block Coordinate Descent

(A minimal coordinate descent sketch follows below.)
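As a concrete toy example (the whiteboard itself is not transcribed here), this sketch exactly minimizes a quadratic over one coordinate at a time; the matrix, vector, and iteration count are illustrative assumptions. Block coordinate descent generalizes this by minimizing over whole groups of variables at once, which is exactly how K-Means will alternate between cluster assignments and cluster centers.

```python
import numpy as np

# Coordinate descent on f(x) = 0.5 * x^T A x - b^T x, for an assumed
# symmetric positive definite A, by exact minimization per coordinate.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)

for _ in range(20):
    for i in range(len(x)):
        # Setting df/dx_i = A[i] @ x - b[i] = 0 and solving for x_i:
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(x)                      # -> approximately [0.2, 0.4]
print(np.linalg.solve(A, b))  # global minimum, for comparison
```

For a strictly convex quadratic like this one, coordinate descent reaches the global minimum; the K-Means objective is not convex, which is why Lloyd's method can get stuck (see the later performance slides).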

SLIDE 23

Clustering

Question: Which of these partitions is “better”?

SLIDE 24

K-Means

Whiteboard:
  – Clustering: Inputs and Outputs
  – Objective-based Clustering
  – K-Means Objective
  – Computational Complexity
  – K-Means Algorithm / Lloyd’s Method

(A sketch of Lloyd’s method follows below.)
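A minimal sketch of Lloyd's method as block coordinate descent on the K-Means objective Σ_i ||x_i − μ_{z_i}||², alternating the two steps illustrated on the following slides; the random-datapoint initialization and iteration cap are illustrative choices (initialization is the next slide's topic).

```python
import numpy as np

def lloyds(X, K, n_iters=100, seed=0):
    """Lloyd's method on data X (n x d), returning centers and assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Step 2: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break  # converged, to a local optimum (see the later slides)
        centers = new_centers
    return centers, z
```

Each step can only decrease the objective, which is why the method always converges; nothing guarantees it converges to the global optimum.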

SLIDE 25

K-Means Initialization

Whiteboard:
  – Random
  – Furthest Traversal
  – K-Means++

(A K-Means++ sketch follows below.)
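A minimal sketch of K-Means++ initialization (Arthur & Vassilvitskii, 2007): the first center is chosen uniformly at random, and each later center is sampled with probability proportional to D(x)², the squared distance from x to its nearest already-chosen center. The function name and interface are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers from data X (n x d) via the D^2 weighting."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Sample the next center proportionally to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The D² weighting makes far-away regions likely to receive a center, which avoids the merged-cluster failures of purely random initialization shown later.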

SLIDE 26

Lloyd’s method: Random Initialization

Example: Given a set of datapoints.

Slide courtesy of Nina Balcan

SLIDE 27

Lloyd’s method: Random Initialization

Select initial centers at random.

Slide courtesy of Nina Balcan

SLIDE 28

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 29

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering.

Slide courtesy of Nina Balcan

SLIDE 30

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 31

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering.

Slide courtesy of Nina Balcan

SLIDE 32

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 33

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering. We get a good quality solution in this example.

Slide courtesy of Nina Balcan

SLIDE 34

Lloyd’s method: Performance

It always converges, but it may converge at a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.

Slide courtesy of Nina Balcan

SLIDE 35

Lloyd’s method: Performance

Local optimum: every point is assigned to its nearest center and every center is the mean value of its points.

Slide courtesy of Nina Balcan

SLIDE 36

Lloyd’s method: Performance

It can be arbitrarily worse than the optimal solution….

Slide courtesy of Nina Balcan

SLIDE 37

Lloyd’s method: Performance

This bad performance can happen even with well-separated Gaussian clusters.

Slide courtesy of Nina Balcan

SLIDE 38

Lloyd’s method: Performance

This bad performance can happen even with well-separated Gaussian clusters: some Gaussians get combined….

Slide courtesy of Nina Balcan

SLIDE 39

Learning Objectives

K-Means

You should be able to…
  1. Distinguish between coordinate descent and block coordinate descent
  2. Define an objective function that gives rise to a "good" clustering
  3. Apply block coordinate descent to an objective function preferring each point to be close to its nearest cluster center, to obtain the K-Means algorithm
  4. Implement the K-Means algorithm
  5. Connect the nonconvexity of the K-Means objective function with the (possibly) poor performance of random initialization