SLIDE 1

Deep RL + K-Means

10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Matt Gormley
Lecture 25
Apr. 17, 2019

SLIDE 2

Reminders

  • Homework 8: Reinforcement Learning
    – Out: Wed, Apr 10
    – Due: Wed, Apr 24 at 11:59pm
  • Today’s In-Class Poll
    – http://p25.mlcourse.org

SLIDE 3

Q&A

Q: Do we have to retrain our RL agent every time we change our state space?

A: Yes. But whether your state space changes from one setting to another is determined by your design of the state representation. Two examples:
  – State Space A: <x, y> position on the map, e.g. s_t = <74, 152>
  – State Space B: a window of pixel colors centered at the current Pac-Man location, e.g. s_t = a small grid of pixel values (shown as a 0/1 matrix on the slide)

SLIDE 4

DEEP RL EXAMPLES

SLIDE 5

TD-Gammon → AlphaGo

Learning to beat the masters at board games

“…the world’s top computer program for backgammon, TD-GAMMON (Tesauro, 1992, 1995), learned its strategy by playing over one million practice games against itself…” (Mitchell, 1997)

[Figure: “THEN” (TD-Gammon) vs. “NOW” (AlphaGo).]

SLIDE 6

Playing Atari with Deep RL

  • Setup: the RL system observes the pixels on the screen
  • It receives rewards as the game score
  • Actions decide how to move the joystick / buttons

[Figure: the agent-environment loop, with action A_t, reward R_t, and observation O_t.]

Figures from David Silver (Intro RL lecture)

SLIDE 7

Playing Atari with Deep RL

Videos:
  – Atari Breakout: https://www.youtube.com/watch?v=V1eYniJ0Rnk
  – Space Invaders: https://www.youtube.com/watch?v=ePv0Fs9cGgU

Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

Figures from Mnih et al. (2013)

SLIDE 8

Playing Atari with Deep RL

                  B. Rider  Breakout  Enduro   Pong  Q*bert  Seaquest  S. Invaders
Random                 354       1.2       0  −20.4     157       110          179
Sarsa [3]              996       5.2     129    −19     614       665          271
Contingency [4]       1743         6     159    −17     960       723          268
DQN                   4092       168     470     20    1952      1705          581
Human                 7456        31     368     −3   18900     28010         3690

HNeat Best [8]        3616        52     106     19    1800       920         1720
HNeat Pixel [8]       1332         4      91    −16    1325       800         1145
DQN Best              5184       225     661     21    4500      1740         1075

Table 1: The upper table compares average total reward for various learning methods by running an ε-greedy policy with ε = 0.05 for a fixed number of steps. The lower table reports results of the single best performing episode for HNeat and DQN. HNeat produces deterministic policies that always get the same score, while DQN used an ε-greedy policy with ε = 0.05.

Figure 1: Screen shots from five Atari 2600 Games: (Left-to-right) Pong, Breakout, Space Invaders, Seaquest, Beam Rider

Figures from Mnih et al. (2013)

SLIDE 9

Deep Q-Learning

Question: What if our state space S is too large to represent with a table?

Examples:
  • s_t = pixels of a video game
  • s_t = continuous values of the sensors in a manufacturing robot
  • s_t = sensor output from a self-driving car

Answer: Use a parametric function to approximate the table entries.

Key Idea (see the sketch below):
  1. Use a neural network Q(s, a; θ) to approximate Q*(s, a)
  2. Learn the parameters θ via SGD with training examples <s_t, a_t, r_t, s_{t+1}>
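To make step 2 concrete, here is a minimal sketch of the SGD update, assuming (for brevity) a linear model Q(s, a; θ) = θ_a · φ(s) in place of the neural network; the feature map φ, the sizes, the discount factor, and the step size are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# A sketch of the Key Idea with a linear function approximator standing
# in for the neural network. All constants below are assumed values.

GAMMA = 0.9      # discount factor (assumed)
ALPHA = 0.01     # SGD step size (assumed)
N_ACTIONS = 4
N_FEATURES = 8

theta = np.zeros((N_ACTIONS, N_FEATURES))  # one weight vector per action

def phi(s):
    """Hypothetical feature map from a raw state to a feature vector."""
    return np.asarray(s, dtype=float)

def q(s, a):
    return theta[a] @ phi(s)

def sgd_update(s_t, a_t, r_t, s_next):
    """One SGD step on the squared TD error for the example <s, a, r, s'>."""
    # Bootstrapped regression target, treated as a constant.
    target = r_t + GAMMA * max(q(s_next, a) for a in range(N_ACTIONS))
    td_error = q(s_t, a_t) - target
    # Gradient of 0.5 * td_error^2 w.r.t. theta[a_t] is td_error * phi(s_t).
    theta[a_t] -= ALPHA * td_error * phi(s_t)
```

Note how this is just regression onto a moving, bootstrapped target, which is the connection to regression named in the learning objectives.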

SLIDE 10

Deep Q-Learning

Whiteboard:
  – Strawman loss function (i.e. what we cannot compute)
  – Approximating the Q function with a neural network
  – Approximating the Q function with a linear model
  – Deep Q-Learning
  – Function approximators (<state, action_i> → q-value vs. state → all action q-values)

SLIDE 11

Experience Replay

  • Problems with online updates for Deep Q-learning:
    – the updates are not i.i.d., as SGD would assume
    – the agent quickly forgets rare experiences that might later be useful to learn from
  • Uniform Experience Replay (Lin, 1992):
    – Keep a replay memory D = {e_1, e_2, …, e_N} of the N most recent experiences e_t = <s_t, a_t, r_t, s_{t+1}>
    – Alternate two steps (see the sketch below):
      1. Repeat T times: randomly sample e_i from D and apply a Q-Learning update to e_i
      2. Agent selects an action using an epsilon-greedy policy to receive a new experience, which is added to D
  • Prioritized Experience Replay (Schaul et al., 2016):
    – similar to Uniform ER, but sample so as to prioritize experiences with high error
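A minimal sketch of the Uniform Experience Replay loop above, reusing the hypothetical q and sgd_update helpers from the earlier sketch; EPSILON, T, the memory capacity, and the env.step interface are all assumptions for illustration.

```python
import random
from collections import deque

EPSILON = 0.1          # exploration rate (assumed)
T = 32                 # replay updates per interaction (assumed)
replay_memory = deque(maxlen=10_000)  # keeps only the N most recent experiences

def epsilon_greedy(s, n_actions):
    if random.random() < EPSILON:
        return random.randrange(n_actions)               # explore
    return max(range(n_actions), key=lambda a: q(s, a))  # exploit

def replay_step(env, s, n_actions):
    """One alternation of the two steps from the slide."""
    # Step 1: repeat T times: sample uniformly from D, apply a Q-update.
    for _ in range(min(T, len(replay_memory))):
        s_i, a_i, r_i, s_i_next = random.choice(replay_memory)
        sgd_update(s_i, a_i, r_i, s_i_next)
    # Step 2: act epsilon-greedily; add the new experience to D.
    a = epsilon_greedy(s, n_actions)
    s_next, r = env.step(a)  # hypothetical environment interface
    replay_memory.append((s, a, r, s_next))
    return s_next
```

The bounded deque implements the "N most recent experiences" policy: once full, appending a new experience silently evicts the oldest one.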

SLIDE 12

Alpha Go

Game of Go:
  • 19x19 board
  • Players alternately play black/white stones
  • Goal is to fully encircle the largest region on the board
  • Simple rules, but extremely complex game play

[Figure: the full move record of Game 1, Fan Hui (Black) vs. AlphaGo (White); AlphaGo wins by 2.5 points.]

Figure from Silver et al. (2016)

SLIDE 13

Alpha Go

  • State space is too large to represent explicitly, since the # of sequences of moves is O(b^d)
    – Go: b = 250 and d = 150
    – Chess: b = 35 and d = 80
  • Key idea:
    – Define a neural network to approximate the value function
    – Train by policy gradient (see the sketch below)

[Figure 1 from Silver et al. (2016): neural network training pipeline and architecture. A rollout policy and an SL policy network are trained by classification on human expert positions; an RL policy network is trained by policy gradient on self-play; a value network is trained by regression to predict whether the current player wins in positions from the self-play data set.]

Figure from Silver et al. (2016)
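As a toy illustration of training by policy gradient (not AlphaGo's actual training pipeline), here is a minimal REINFORCE-style sketch with a softmax policy over linear scores; the sizes, feature map, and step size are illustrative assumptions.

```python
import numpy as np

# Softmax policy pi(a|s) proportional to exp(w[a] . phi(s)), updated by
# gradient ascent on the expected return. All constants are assumed.

N_ACTIONS, N_FEATURES = 3, 5
ALPHA = 0.05
w = np.zeros((N_ACTIONS, N_FEATURES))

def phi(s):
    return np.asarray(s, dtype=float)

def policy(s):
    scores = w @ phi(s)
    p = np.exp(scores - scores.max())  # numerically stable softmax
    return p / p.sum()

def reinforce_update(s, a, ret):
    """Step along grad log pi(a|s), scaled by the observed return."""
    p = policy(s)
    # grad_{w[a']} log pi(a|s) = (1[a' == a] - pi(a'|s)) * phi(s)
    grad = -np.outer(p, phi(s))
    grad[a] += phi(s)
    w[:] += ALPHA * ret * grad
```

Actions that led to higher returns have their probability pushed up; the subtracted pi(a'|s) term pushes the other actions down so the distribution stays normalized.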

SLIDE 14

Alpha Go

  • Results of a tournament
  • From Silver et al. (2016): “a 230 point gap corresponds to a 79% probability of winning”

[Figure from Silver et al. (2016): Elo ratings (roughly 500 to 3,500) for GnuGo, Fuego, Pachi, Crazy Stone, Fan Hui, AlphaGo, and distributed AlphaGo, annotated with beginner kyu (k), amateur dan (d), and professional dan (p) ranks.]

SLIDE 15

Learning Objectives

Reinforcement Learning: Q-Learning

You should be able to…
  1. Apply Q-Learning to a real-world environment
  2. Implement Q-learning
  3. Identify the conditions under which the Q-learning algorithm will converge to the true value function
  4. Adapt Q-learning to Deep Q-learning by employing a neural network approximation to the Q function
  5. Describe the connection between Deep Q-Learning and regression

SLIDE 16

Q-Learning

Question: For the R(s,a) values shown on the arrows below, which are the corresponding Q*(s,a) values? Assume a discount factor of 0.5.

[Figure: a small deterministic MDP with rewards on the arrows; the answer annotates each arrow with its Q*(s,a) value (values such as 2, 4, 8, 9, 16, and 18 appear). A worked sketch of the underlying recursion follows below.]
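The grid itself lives in the figure, but the recursion behind the answer is Q*(s, a) = R(s, a) + γ max_{a′} Q*(s′, a′). Here is a small worked sketch on a hypothetical 3-state deterministic MDP with γ = 0.5 (not the slide's actual MDP):

```python
# Value iteration on a tiny assumed MDP to illustrate the Q* recursion.

GAMMA = 0.5

# transitions[s][a] = (next_state, reward); state 2 is terminal.
transitions = {
    0: {"right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 8.0)},
    2: {},
}

Q = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
for _ in range(50):  # iterate the Bellman optimality update to convergence
    for s, acts in transitions.items():
        for a, (s2, r) in acts.items():
            best_next = max(Q[s2].values(), default=0.0)
            Q[s][a] = r + GAMMA * best_next

print(Q)
# Q*(1, right) = 8
# Q*(0, right) = 0 + 0.5 * 8 = 4
# Q*(1, left)  = 0 + 0.5 * Q*(0, right) = 0.5 * 4 = 2
```

Each Q* value is the immediate reward plus the discounted best Q* value at the next state, which is why halving patterns (8, 4, 2, …) show up with γ = 0.5.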

SLIDE 17

K-MEANS

SLIDE 18

K-Means Outline

  • Clustering: Motivation / Applications
  • Optimization Background
    – Coordinate Descent
    – Block Coordinate Descent
  • Clustering
    – Inputs and Outputs
    – Objective-based Clustering
  • K-Means
    – K-Means Objective
    – Computational Complexity
    – K-Means Algorithm / Lloyd’s Method
  • K-Means Initialization
    – Random
    – Farthest Point
    – K-Means++

SLIDE 19

Clustering, Informal Goals

Goal: Automatically partition unlabeled data into groups of similar datapoints.

Question: When and why would we want to do this?

Useful for:
  • Automatically organizing data.
  • Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
  • Understanding hidden structure in data.
  • Preprocessing for further analysis.

Slide courtesy of Nina Balcan

SLIDE 20

Applications (Clustering comes up everywhere…)

  • Cluster news articles or web pages or search results by topic.
  • Cluster protein sequences by function, or genes according to expression profile.
  • Cluster users of social networks by interest (community detection).

[Figure: Facebook network, Twitter network.]

Slide courtesy of Nina Balcan

SLIDE 21

Applications (Clustering comes up everywhere…)

  • Cluster customers according to purchase history.
  • Cluster galaxies or nearby stars (e.g. Sloan Digital Sky Survey).
  • And many, many more applications….

Slide courtesy of Nina Balcan

SLIDE 22

Optimization Background

Whiteboard:
  – Coordinate Descent
  – Block Coordinate Descent

(A minimal coordinate descent sketch follows below.)
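As a concrete toy example (the whiteboard itself is not transcribed here), this sketch exactly minimizes a quadratic over one coordinate at a time; the matrix, vector, and iteration count are illustrative assumptions. Block coordinate descent generalizes this by minimizing over whole groups of variables at once, which is exactly how K-Means will alternate between cluster assignments and cluster centers.

```python
import numpy as np

# Coordinate descent on f(x) = 0.5 * x^T A x - b^T x, for an assumed
# symmetric positive definite A, by exact minimization per coordinate.

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = np.zeros(2)

for _ in range(20):
    for i in range(len(x)):
        # Setting df/dx_i = A[i] @ x - b[i] = 0 and solving for x_i:
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]

print(x)                      # -> approximately [0.2, 0.4]
print(np.linalg.solve(A, b))  # global minimum, for comparison
```

For a strictly convex quadratic like this one, coordinate descent reaches the global minimum; the K-Means objective is not convex, which is why Lloyd's method can get stuck (see the later performance slides).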

SLIDE 23

Clustering

Question: Which of these partitions is “better”?

SLIDE 24

K-Means

Whiteboard:
  – Clustering: Inputs and Outputs
  – Objective-based Clustering
  – K-Means Objective
  – Computational Complexity
  – K-Means Algorithm / Lloyd’s Method

(A sketch of Lloyd’s method follows below.)
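A minimal sketch of Lloyd's method as block coordinate descent on the K-Means objective Σ_i ||x_i − μ_{z_i}||², alternating the two steps illustrated on the following slides; the random-datapoint initialization and iteration cap are illustrative choices (initialization is the next slide's topic).

```python
import numpy as np

def lloyds(X, K, n_iters=100, seed=0):
    """Lloyd's method on data X (n x d), returning centers and assignments."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        z = dists.argmin(axis=1)
        # Step 2: recompute each center as the mean of its assigned points.
        new_centers = np.array([
            X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break  # converged, to a local optimum (see the later slides)
        centers = new_centers
    return centers, z
```

Each step can only decrease the objective, which is why the method always converges; nothing guarantees it converges to the global optimum.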

SLIDE 25

K-Means Initialization

Whiteboard:
  – Random
  – Furthest Traversal
  – K-Means++

(A K-Means++ sketch follows below.)
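A minimal sketch of K-Means++ initialization (Arthur & Vassilvitskii, 2007): the first center is chosen uniformly at random, and each later center is sampled with probability proportional to D(x)², the squared distance from x to its nearest already-chosen center. The function name and interface are illustrative.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Pick K initial centers from data X (n x d) via the D^2 weighting."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniform at random
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min(
            ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2),
            axis=1,
        )
        # Sample the next center proportionally to D(x)^2.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The D² weighting makes far-away regions likely to receive a center, which avoids the merged-cluster failures of purely random initialization shown later.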

SLIDE 26

Lloyd’s method: Random Initialization

Example: Given a set of datapoints.

Slide courtesy of Nina Balcan

SLIDE 27

Lloyd’s method: Random Initialization

Select initial centers at random.

Slide courtesy of Nina Balcan

SLIDE 28

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 29

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering.

Slide courtesy of Nina Balcan

SLIDE 30

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 31

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering.

Slide courtesy of Nina Balcan

SLIDE 32

Lloyd’s method: Random Initialization

Assign each point to its nearest center.

Slide courtesy of Nina Balcan

SLIDE 33

Lloyd’s method: Random Initialization

Recompute optimal centers given a fixed clustering. We get a good quality solution in this example.

Slide courtesy of Nina Balcan

SLIDE 34

Lloyd’s method: Performance

It always converges, but it may converge at a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.

Slide courtesy of Nina Balcan

SLIDE 35

Lloyd’s method: Performance

Local optimum: every point is assigned to its nearest center and every center is the mean value of its points.

Slide courtesy of Nina Balcan

SLIDE 36

Lloyd’s method: Performance

It can be arbitrarily worse than the optimal solution….

Slide courtesy of Nina Balcan

SLIDE 37

Lloyd’s method: Performance

This bad performance can happen even with well-separated Gaussian clusters.

Slide courtesy of Nina Balcan

SLIDE 38

Lloyd’s method: Performance

This bad performance can happen even with well-separated Gaussian clusters: some Gaussians get combined….

Slide courtesy of Nina Balcan

SLIDE 39

Learning Objectives

K-Means

You should be able to…
  1. Distinguish between coordinate descent and block coordinate descent
  2. Define an objective function that gives rise to a "good" clustering
  3. Apply block coordinate descent to an objective function preferring each point to be close to its nearest cluster center, to obtain the K-Means algorithm
  4. Implement the K-Means algorithm
  5. Connect the nonconvexity of the K-Means objective function with the (possibly) poor performance of random initialization