Reinforcement Learning: a gentle introduction & industrial application (PowerPoint PPT Presentation)



SLIDE 1

Reinforcement Learning

a gentle introduction & industrial application

Christian Hidber

SLIDE 2

Learning: learning from children

REINFORCEMENT LEARNING

SLIDE 3

The game: demo


SLIDE 4

The game: setup

[Diagram: the learner sends actions to the game engine; the game engine returns the game state and a step reward.]

"This photo" by an unknown author, licensed under CC BY-NC

Goal: maximize the sum of rewards
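The loop on this slide can be sketched in a few lines of Python. Everything here (the toy engine, the action names, the reward values) is illustrative, not the talk's actual game:

```python
import random

# Toy version of the loop on this slide. The "game engine" consumes an
# action and returns (next state, step reward, done); the learner's goal
# is to maximize the sum of rewards. All names and numbers are illustrative.

def game_engine(state, action):
    """Moving 'right' makes progress (+1); anything else costs (-1)."""
    if action == "right":
        next_state, reward = state + 1, 1
    else:
        next_state, reward = state, -1
    done = next_state >= 3          # episode ends at the goal
    return next_state, reward, done

def play_episode(policy, max_steps=20):
    state, total_reward = 0, 0
    for _ in range(max_steps):
        action = policy(state)                            # learner picks an action
        state, reward, done = game_engine(state, action)  # engine answers: state + reward
        total_reward += reward                            # the sum we want to maximize
        if done:
            break
    return total_reward

random_policy = lambda state: random.choice(["up", "down", "left", "right"])
always_right = lambda state: "right"

print(play_episode(always_right))   # 3: the best policy collects +1 three times
```

Comparing `always_right` with `random_policy` shows why the sum of rewards is the quantity to maximize: the better policy collects strictly more of it.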
SLIDE 5

The game: positive feedback


SLIDE 6

The game: negative feedback


SLIDE 7

Policy

(rules learned, how to play the game)

what is learned => the policy


SLIDE 8

Policy

(rules learned, how to play the game)

policy improvement => learning


SLIDE 9

Policy

(rules learned, how to play the game)

policy improvement => learning


SLIDE 10

Policy

(rules learned, how to play the game)

Reinforcement learning

[Diagram: the RL loop, with the learner replaced by an RL algorithm.]

Key idea: continuously improve policy to increase total reward


SLIDE 11

Episode 1 : play with 1st policy (random)


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward this step: -1]

SLIDE 12

Episode 1 : play with 1st policy (random)


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward this step: +100]

SLIDE 13

Episode 1 : play with 1st policy (random)


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward this step: -50]

Episode Over

SLIDE 14

Episode 1 : improve 1st policy for state in step 3


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 3: -50
SLIDE 15

Episode 1 : improve 1st policy for state in step 2


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future Reward
(sum of all rewards from the current state until ‘game over’)

Future reward for step 2: 50 (= +100 - 50)

SLIDE 16

Episode 1 : improve 1st policy for state in step 1

M3, OCTOBER 2018

Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 1: 49 (= -1 + 100 - 50)
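The future rewards worked out on slides 13-16 (-50, 50, 49) come from one backward pass over the recorded episode. A small sketch, using episode 1's rewards from the slides:

```python
# "Future Reward" = sum of all rewards from the current step until game over.
# One backward pass over the recorded episode computes it for every step.

def future_rewards(rewards):
    out, running = [], 0
    for r in reversed(rewards):
        running += r                # this step's reward plus everything after it
        out.append(running)
    return list(reversed(out))

episode1 = [-1, 100, -50]           # episode 1 on the slides: steps 1, 2, 3
print(future_rewards(episode1))     # [49, 50, -50], matching slides 13-16
```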

SLIDE 17

Episode 2 : play with 2nd policy


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward: -1. Already learned: go left is ok]

SLIDE 18

Episode 2 : play with 2nd policy


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward: +100. Already learned: go left is ok]

SLIDE 19

Episode 2 : play with 2nd policy


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward: +100. Already learned: don’t go up]

SLIDE 20

Episode 2 : play with 2nd policy


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward: -1]

SLIDE 21

Episode 2 : play with 2nd policy


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy | Reward | Next State; reward: -50]

Episode Over

SLIDE 22

Episode 2 : improve 2nd policy for state in step 5


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 5: -50 (= -50)
SLIDE 23

Episode 2 : improve 2nd policy for state in step 4


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 4: -51 (= -1 - 50)
SLIDE 24

Episode 2 : improve 2nd policy for state in step 3


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 3: 49 (= +100 - 1 - 50)

SLIDE 25

Episode 2 : improve 2nd policy for state in step 2


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 2: 149 (= +100 + 100 - 1 - 50)

SLIDE 26

Episode 2 : improve 2nd policy for state in step 2


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 2: 149 (= +100 + 100 - 1 - 50)

New table value = some running average of the old and new value
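The "running average of old and new value" on this slide can be sketched concretely. One common choice (an assumption here, the talk does not pin the formula down) is an exponential moving average with step size alpha:

```python
# "= some running average of old and new value": one common choice (an
# assumption here, the talk does not pin it down) is an exponential moving
# average with a step size alpha.

def running_average(old_value, new_value, alpha=0.1):
    return old_value + alpha * (new_value - old_value)

value = 0.0                           # initial table entry
for observed in [149, 149, 149]:      # repeated future-reward observations
    value = running_average(value, observed)
print(round(value, 2))                # creeps toward 149 without jumping there
```

Averaging rather than overwriting keeps one lucky (or unlucky) episode from dominating the table entry.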
SLIDE 27

Episode 2 : improve 2nd policy for state in step 1


Policy

(rules learned, how to play the game)

[Table: Step # 1-7 | State | Action from Policy; episode over]

Future reward for step 1: 148 (= -1 + 100 + 100 - 1 - 50)

SLIDE 28

So far …


a policy is a map from states to action probabilities

SLIDE 29

…updated by the reinforcement learning algorithm


a policy is a map from states to action probabilities
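Taken literally, such a policy is just a dictionary. A minimal sketch, assuming a small discrete state space and the four actions used elsewhere in the talk; states and numbers are illustrative:

```python
import random

# "A policy is a map from states to action probabilities" - taken literally,
# for a small discrete game, it is just a dictionary. The states and numbers
# below are illustrative.

policy = {
    (0, 0): {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},  # untrained: uniform
    (0, 1): {"up": 0.05, "down": 0.05, "left": 0.80, "right": 0.10},  # learned: go left is ok
}

def sample_action(policy, state):
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return random.choices(actions, weights=weights)[0]   # draw by probability

assert abs(sum(policy[(0, 1)].values()) - 1.0) < 1e-9    # probabilities sum to 1
print(sample_action(policy, (0, 1)))                     # most often 'left'
```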

SLIDE 30

…updated by the reinforcement learning algorithm


a policy is a map from states to action probabilities

SLIDE 31

After many, many episodes, for each state…


SLIDE 32

Algorithm sketch


Policy

(rules learned, how to play the game)

Initialize table with random action probabilities for each state
Repeat:
  play episode with policy given by table
  Record (state1, action1, reward1), (state2, action2, reward2), … for the episode
  For each step i:
    compute FutureReward_i = reward_i + reward_i+1 + …
    update table[state_i] s.t.
      • action_i becomes more likely for state_i if FutureReward_i is “high”
      • action_i becomes less likely for state_i if FutureReward_i is “low”
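The sketch above is runnable almost as written. Below, a toy 1-D game stands in for the real one, and the "more/less likely" update is implemented as a small probability nudge plus renormalization; both are illustrative choices, not the talk's exact rule:

```python
import random
from collections import defaultdict

# Runnable version of the sketch above, on a toy 1-D game: states 0..4,
# reaching state 4 pays +100, every other step costs -1. The table update
# ("more likely if FutureReward is high") is implemented as a small
# probability nudge plus renormalization - an illustrative choice.

ACTIONS = ["left", "right"]
GOAL, STEP_REWARD, GOAL_REWARD = 4, -1, 100

def play_episode(table, max_steps=30):
    state, history = 0, []
    for _ in range(max_steps):
        probs = table[state]
        action = random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
        next_state = max(0, state + (1 if action == "right" else -1))
        reward = GOAL_REWARD if next_state == GOAL else STEP_REWARD
        history.append((state, action, reward))            # record (state, action, reward)
        state = next_state
        if state == GOAL:
            break
    return history

def train(episodes=500, step=0.05, seed=0):
    random.seed(seed)
    # initialize table with (here: uniform) action probabilities per state
    table = defaultdict(lambda: {a: 1.0 / len(ACTIONS) for a in ACTIONS})
    for _ in range(episodes):
        history = play_episode(table)
        future = 0
        for state, action, reward in reversed(history):
            future += reward                        # FutureReward_i, backward pass
            nudge = step if future > 0 else -step   # "high" -> more likely, "low" -> less
            probs = table[state]
            probs[action] = min(max(probs[action] + nudge, 0.01), 0.99)
            total = sum(probs.values())             # renormalize: keep a distribution
            for a in probs:
                probs[a] /= total
    return table

table = train()
print(table[0])   # 'right' should dominate at the starting state
```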
SLIDE 33

Algorithm sketch


(same algorithm sketch as on SLIDE 32)
SLIDE 34

Algorithm sketch


(same algorithm sketch as on SLIDE 32)
SLIDE 35

The game: demo


SLIDE 36

The bad news: nice idea, but…

"Image" licensed according to CC BY-SA

SLIDE 37

The bad news: nice idea, but… too many states… too many actions

  • Too much memory needed
  • Too much time


SLIDE 38

The solution


Idea: replace the lookup table with a neural network that approximates the action probabilities contained in the table.

Instead of: Table[state] = action probabilities
Do: NeuralNet(state) ~ action probabilities

Policy
(rules learned, how to play the game)

Change “play episode with policy given by table” to “play episode with policy given by NeuralNet”
Change “update table” to “update weights of NeuralNet”

SLIDE 39

Neural nets to the rescue


Idea: replace the lookup table with a neural network that approximates the action probabilities contained in the table.

Instead of: Table[state] = action probabilities
Do: NeuralNet(state) ~ action probabilities

Encode the state as a vector s → apply a neural network with “the right” weights → outputs (out_up, out_down, out_left, out_right) → use softmax ~ Policy (rules learned, how to play the game)
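A minimal version of NeuralNet(state) ~ action probabilities, assuming a 2-feature state vector and one hidden layer; all sizes are arbitrary:

```python
import numpy as np

# NeuralNet(state) ~ action probabilities, in miniature: a 2-feature state
# vector, one hidden layer of 8 units, and a softmax over the four outputs
# out_up, out_down, out_left, out_right. All sizes are arbitrary assumptions.

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 2))   # hidden-layer weights
W2 = rng.normal(size=(4, 8))   # output-layer weights, one row per action

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def neural_net(state):
    s = np.asarray(state, dtype=float)   # encode state as vector
    hidden = np.tanh(W1 @ s)             # apply the network ...
    out = W2 @ hidden                    # ... raw scores for the 4 actions
    return softmax(out)                  # use softmax -> action probabilities

probs = neural_net([0.5, -1.0])
print(round(float(probs.sum()), 6))   # 1.0: a valid distribution over 4 actions
```

The softmax is what turns the four raw outputs into a proper probability distribution, so the rest of the algorithm can stay unchanged.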

SLIDE 40

Policy Gradient Algorithm sketch


Policy

(rules learned, how to play the game)

Initialize neuralNet with random weights W
Repeat:
  play episode(s) with policy given by weights W
  Record (state1, action1, reward1), (state2, action2, reward2), … for the episode(s)
  For each step i:
    compute FutureReward_i = reward_i + reward_i+1 + …
    update weights W: W = W + ????

Encode the state as a vector s → weights W → outputs (out_up, out_down, out_left, out_right) → use softmax

SLIDE 41

Policy Gradient Algorithm sketch


(same sketch as on SLIDE 40)

SLIDE 42

Policy Gradient Algorithm sketch


(same sketch as on SLIDE 40; increasing out_i makes action_i more likely)

SLIDE 43

Policy Gradient Algorithm sketch


Policy

(rules learned, how to play the game)

Initialize neuralNet with random weights W
Repeat:
  play episode(s) with policy given by weights W
  Record (state1, action1, reward1), (state2, action2, reward2), … for the episode(s)
  For each step i:
    compute FutureReward_i = reward_i + reward_i+1 + …
    update weights W:
      W = W + learning rate alpha * FutureReward_i * Gradient_W(neuralNet_W(state_i, action_i))   (increases out_i)

Encode the state as a vector s → weights W → outputs (out_up, out_down, out_left, out_right) → use softmax
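The update rule above can be sketched for a linear softmax policy. One hedge: the textbook policy-gradient (REINFORCE) update uses the gradient of log pi(action | state), which is what the code below implements, on the assumption that this is what the slide's Gradient_W(neuralNet_W(state_i, action_i)) abbreviates:

```python
import numpy as np

# REINFORCE-style weight update for a linear softmax policy. Assumption:
# the slide's Gradient_W(neuralNet_W(state_i, action_i)) stands for the
# gradient of log pi(action_i | state_i), as in the textbook policy
# gradient; that is what grad_log_pi computes here.

rng = np.random.default_rng(1)
N_ACTIONS, STATE_DIM = 4, 2
W = rng.normal(scale=0.1, size=(N_ACTIONS, STATE_DIM))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_log_pi(W, state, action):
    """Gradient of log softmax(W @ state)[action] with respect to W."""
    probs = softmax(W @ state)
    grad = -np.outer(probs, state)   # every action's row gets -pi(a|s) * s
    grad[action] += state            # the chosen action's row gets + s
    return grad

def reinforce_update(W, state, action, future_reward, alpha=0.01):
    # W = W + learning rate * FutureReward_i * gradient
    return W + alpha * future_reward * grad_log_pi(W, state, action)

state = np.array([1.0, -0.5])
before = softmax(W @ state)[2]
W = reinforce_update(W, state, action=2, future_reward=100.0)
after = softmax(W @ state)[2]
print(after > before)   # True: a high future reward made action 2 more likely
```

A positive future reward pushes the chosen action's output up (and, via the softmax, the others down); a negative one does the reverse, exactly the behavior the tabular version nudged by hand.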

SLIDE 44

What for? The real world


no feasible, deterministic algorithm

SLIDE 45

What for?

Traditional Heuristics: automatic solution found in 93.4%
Classic Machine Learning
Reinforcement Learning

SLIDE 46

The challenges:
  • manage the water level on the roof
  • control & steer the water flow
  • find the right dimensions
  • safe & reliable

SLIDE 47

SLIDE 48

Finding the right dimensions

SLIDE 49

Finding the “right” dimensions: demo

SLIDE 50

What if…
  • Collapsing pipes
  • Collapsing roofs
  • Clogged pipes
  • Façade damages

SLIDE 51

Turning the problem into a game


SLIDE 52

Designing the Action-Space

Snake game | Roof drainage systems

  • What actions would a human expert like to have?
  • Are these actions sufficient?
  • Would more / other actions be helpful?
  • Can we drop any actions?
slide-53
SLIDE 53

FOLIE 53

Designing the State-Space


  • What does a human expert look at?
  • Can you switch the experts between 2 steps?
  • Full state vs partial state
  • Designing features

Snake game | Roof drainage systems

SLIDE 54

Designing the Reward Function


  • How would you rate the result of an expert?
  • As simple as possible
  • Positive feedback during the game
  • Beware of “surprising policies”
  • Game over if TotalReward too low

Snake game: Fruit +100 · Death -50 · Success +1000 · Step -1
Roof drainage systems: Change error count ±1 per error · Success +100 · Step -0.01
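The two reward schemes in the table, written out as plain functions. The event names and values come from the slide; encoding them as dicts, and the roof "error" event names in particular, are illustrative assumptions:

```python
# The two reward schemes from the table as plain functions. Event names
# follow the slide; the dict encoding and the roof "error" events are
# illustrative assumptions.

SNAKE_REWARDS = {"fruit": 100, "death": -50, "success": 1000, "step": -1}
ROOF_REWARDS = {"error_added": -1, "error_fixed": 1, "success": 100, "step": -0.01}

def episode_reward(events, table):
    """Sum event rewards, charging one 'step' penalty per event."""
    return sum(table[e] for e in events) + table["step"] * len(events)

# Snake episode: eat two fruits, then die (3 steps total).
print(episode_reward(["fruit", "fruit", "death"], SNAKE_REWARDS))   # 147
```

Note the design advice from the bullets in action: both schemes are as simple as possible, and both give some positive feedback during the game rather than only at the end.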
SLIDE 55

Policy

(rules learned, how to play the game)

Turning the problem into a game

[Diagram: the RL loop (actions → game engine → game state, step reward → RL algorithm) applied to this game.]

SLIDE 56

Finding the dimensions with reinforcement learning: demo

SLIDE 57

Hydraulics Calculation Pipeline: Traditional Heuristics → Classic Machine Learning → Reinforcement Learning

Traditional Heuristics: automatic solution found in 93.4%
Classic Machine Learning: finds a solution in 70.7% of the remaining 6.6%
Overall: automatic solution found in 98.1%

SLIDE 58

Wrap Up


  • Turning the problem into a game
  • Continuous policy improvement
  • No training dataset
  • Complements supervised learning
SLIDE 59

Thank you!


Christian.Hidber@bSquare.ch

W +41 44 260 54 00 M +41 76 558 41 48 https://www.linkedin.com/in/christian-hidber/

About Geberit: The globally operating Geberit Group is a European leader in the field of sanitary products. Geberit operates with a strong local presence in most European countries, providing unique added value when it comes to sanitary technology and bathroom ceramics. The production network encompasses 30 production facilities, of which 6 are located overseas. The Group is headquartered in Rapperswil-Jona, Switzerland. With around 12,000 employees in around 50 countries, Geberit generated net sales of CHF 2.9 billion in 2017. The Geberit shares are listed on the SIX Swiss Exchange and have been included in the SMI (Swiss Market Index) since 2012.