Regret, Video of Demo Q-learning Exploration Function - PDF document

CS 473: Artificial Intelligence
Reinforcement Learning II
Dieter Fox / University of Washington
[Most slides were taken from Dan Klein and Pieter Abbeel / CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

Exploration vs. Exploitation

How to Explore?
§ Several schemes for forcing exploration
  § Simplest: random actions (ε-greedy)
    § Every time step, flip a coin
    § With (small) probability ε, act randomly
    § With (large) probability 1-ε, act on current policy
  § Problems with random actions?
    § You do eventually explore the space, but keep thrashing around once learning is done
    § One solution: lower ε over time
    § Another solution: exploration functions

Video of Demo Q-learning - Manual Exploration - Bridge Grid

Video of Demo Q-learning - Epsilon-Greedy - Crawler

Exploration Functions
§ When to explore?
  § Random actions: explore a fixed amount
  § Better idea: explore areas whose badness is not (yet) established, eventually stop exploring
§ Exploration function
  § Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. $f(u, n) = u + k/n$
  Regular Q-Update: $Q(s, a) \leftarrow_{\alpha} R(s, a, s') + \gamma \max_{a'} Q(s', a')$
  Modified Q-Update: $Q(s, a) \leftarrow_{\alpha} R(s, a, s') + \gamma \max_{a'} f(Q(s', a'), N(s', a'))$
  § Note: this propagates the "bonus" back to states that lead to unknown states as well!
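The two exploration schemes on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function names, the dictionary-based Q-table, and the state and action labels are all hypothetical. The exploration function is assumed to have the form u + k/n, and the sketch uses n + 1 in the denominator (an assumption made here) so that unvisited pairs get a finite bonus.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """With probability epsilon act randomly, otherwise act greedily on q."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def exploration_value(u, n, k=1.0):
    """Optimistic utility: estimate u plus a bonus that shrinks with visits n.

    Assumption: the bonus is k / (n + 1) so that n == 0 stays finite.
    """
    return u + k / (n + 1)


def q_update(q, counts, s, a, r, s_next, next_actions,
             alpha=0.5, gamma=0.9, k=1.0):
    """One modified Q-update: successors are scored optimistically."""
    counts[(s, a)] += 1
    best_next = max(
        (exploration_value(q[(s_next, a2)], counts[(s_next, a2)], k)
         for a2 in next_actions),
        default=0.0,  # terminal successor: no actions, no future value
    )
    sample = r + gamma * best_next
    # Running average with learning rate alpha.
    q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * sample


q = defaultdict(float)     # Q-values, defaulting to 0
counts = defaultdict(int)  # visit counts N(s, a), defaulting to 0

# Moving from "A" to the never-visited state "B" earns no reward, yet
# Q("A", "right") still rises because "B" carries an exploration bonus.
q_update(q, counts, "A", "right", 0.0, "B", ["left", "right"],
         alpha=0.5, gamma=1.0, k=2.0)
print(q[("A", "right")])
```

Because the bonus enters through the successor's value, it propagates backward to states that lead to rarely visited states, which is the effect the note on the slide describes.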

Video of Demo Q-learning - Exploration Function - Crawler

Regret
§ Even if you learn the optimal policy, you still make mistakes along the way!
§ Regret is a measure of your total mistake cost: the difference between your (expected) rewards, including youthful suboptimality, and optimal (expected) rewards
§ Minimizing regret goes beyond learning to be optimal: it requires optimally learning to be optimal
§ Example: random exploration and exploration functions both end up optimal, but random exploration has higher regret

Approximate Q-Learning

Generalizing Across States
§ Basic Q-Learning keeps a table of all q-values
§ In realistic situations, we cannot possibly learn about every single state!
  § Too many states to visit them all in training
  § Too many states to hold the q-tables in memory
§ Instead, we want to generalize:
  § Learn about some small number of training states from experience
  § Generalize that experience to new, similar situations
  § This is a fundamental idea in machine learning, and we'll see it over and over again
[demo - RL pacman]

Example: Pacman
Let's say we discover through experience that this state is bad:
In naïve q-learning, we know nothing about this state:
Or even this one!
[Demo: Q-learning - pacman - tiny - watch all (L11D5)]
[Demo: Q-learning - pacman - tiny - silent train (L11D6)]
[Demo: Q-learning - pacman - tricky - watch all (L11D7)]

Video of Demo Q-Learning Pacman - Tiny - Watch All
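To make the definition on the Regret slide concrete, here is a small sketch that accumulates regret as the gap between the optimal expected reward and the expected reward actually earned at each step. The function name and the two reward sequences are made up for illustration; they are not measurements from the demos.

```python
def cumulative_regret(optimal_expected_reward, expected_rewards):
    """Return the running total of (optimal - earned) expected reward."""
    total = 0.0
    curve = []
    for earned in expected_rewards:
        total += optimal_expected_reward - earned
        curve.append(total)
    return curve


# Two hypothetical learners that both end up optimal (reward 1.0 per step),
# but one of them takes longer to get there.
slow_learner = [0.0, 0.0, 0.5, 1.0, 1.0]
fast_learner = [0.0, 0.5, 1.0, 1.0, 1.0]

slow_curve = cumulative_regret(1.0, slow_learner)
fast_curve = cumulative_regret(1.0, fast_learner)
print(slow_curve[-1], fast_curve[-1])
```

Both learners reach the same final policy, yet the slower one accumulates more regret, which is the distinction the slide draws between random exploration and exploration functions.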

Video of Demo Q-Learning Pacman - Tiny - Silent Train

Video of Demo Q-Learning Pacman - Tricky - Watch All

Feature-Based Representations
§ Solution: describe a state using a vector of features (aka "properties")
  § Features are functions from states to real numbers (often 0/1) that capture important properties of the state
  § Example features:
    § Distance to closest ghost
    § Distance to closest dot
    § Number of ghosts
    § 1 / (dist to dot)^2
    § Is Pacman in a tunnel? (0/1)
    § ... etc.
    § Is it the exact state on this slide?
  § Can also describe a q-state (s, a) with features (e.g. action moves closer to food)

Linear Value Functions
§ Using a feature representation, we can write a q function (or value function) for any state using a few weights:
  $V(s) = w_1 f_1(s) + w_2 f_2(s) + \dots + w_n f_n(s)$
  $Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + \dots + w_n f_n(s, a)$
§ Advantage: our experience is summed up in a few powerful numbers
§ Disadvantage: states may share features but actually be very different in value!

Approximate Q-Learning
§ Q-learning with linear Q-functions:
  transition $= (s, a, r, s')$
  difference $= [r + \gamma \max_{a'} Q(s', a')] - Q(s, a)$
  Exact Q's: $Q(s, a) \leftarrow Q(s, a) + \alpha\,[\text{difference}]$
  Approximate Q's: $w_i \leftarrow w_i + \alpha\,[\text{difference}]\, f_i(s, a)$
§ Intuitive interpretation:
  § Adjust weights of active features
  § E.g., if something unexpectedly bad happens, blame the features that were on: disprefer all states with that state's features
§ Formal justification: online least squares

Example: Q-Pacman
[Demo: approximate Q-learning pacman (L11D10)]
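The weight update for linear Q-functions described above can be sketched as follows. This is an illustrative sketch rather than the course's implementation: the feature names, the dictionary representation, and all numbers are assumptions chosen for the example.

```python
def q_value(weights, features):
    """Linear Q-value: the weighted sum of the feature values."""
    return sum(weights.get(name, 0.0) * value
               for name, value in features.items())


def approx_q_update(weights, feats_sa, reward, next_feats_list, alpha, gamma):
    """Apply one approximate Q-learning update in place; return the difference.

    feats_sa        -- feature dict for the q-state (s, a) just taken
    next_feats_list -- one feature dict per action available in s'
                       (empty when s' is terminal)
    """
    best_next = max((q_value(weights, f) for f in next_feats_list),
                    default=0.0)
    difference = (reward + gamma * best_next) - q_value(weights, feats_sa)
    # Only features that were active (non-zero) change their weights.
    for name, value in feats_sa.items():
        weights[name] = weights.get(name, 0.0) + alpha * difference * value
    return difference


# Hypothetical situation: Pacman steps toward a dot but into a ghost
# and receives a large negative reward in a terminal state.
weights = {"dist_dot": 4.0, "dist_ghost": -1.0}
features = {"dist_dot": 0.5, "dist_ghost": 1.0}
difference = approx_q_update(weights, features, reward=-500.0,
                             next_feats_list=[], alpha=0.004, gamma=0.9)
print(difference, weights)
```

Both weights move down after the bad outcome, and the ghost feature moves further because its value was larger; every state sharing those features is now dispreferred.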

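Video of Demo Approximate Q-Learning - Pacman

Q-Learning and Least Squares

Linear Approximation: Regression*
[Figure: regression plots with one and with two features]
Prediction: $\hat{y} = w_0 + w_1 f_1(x)$
Prediction: $\hat{y}_i = w_0 + w_1 f_1(x) + w_2 f_2(x)$

Optimization: Least Squares*
[Figure: an observation, its prediction, and the error or "residual" between them]
$\text{total error} = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \left( y_i - \sum_k w_k f_k(x_i) \right)^2$

Minimizing Error*
Imagine we had only one point x, with features f(x), target value y, and weights w:
$\text{error}(w) = \frac{1}{2} \left( y - \sum_k w_k f_k(x) \right)^2$
$\frac{\partial\, \text{error}(w)}{\partial w_m} = -\left( y - \sum_k w_k f_k(x) \right) f_m(x)$
$w_m \leftarrow w_m + \alpha \left( y - \sum_k w_k f_k(x) \right) f_m(x)$
Approximate q update explained:
$w_m \leftarrow w_m + \alpha \left[ r + \gamma \max_{a} Q(s', a) - Q(s, a) \right] f_m(s, a)$
where $r + \gamma \max_{a} Q(s', a)$ is the "target" and $Q(s, a)$ is the "prediction"

Overfitting: Why Limiting Capacity Can Help*
[Figure: degree 15 polynomial fit]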
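The single-point least-squares step from the Minimizing Error slide can be tried on a toy regression problem. The sketch below is an assumption-laden illustration: the helper names, the data points (which lie exactly on y = 1 + 2x), the learning rate, and the number of passes are all chosen here for the example.

```python
def predict(weights, features):
    """Linear prediction: the dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def least_squares_step(weights, features, target, alpha):
    """One online step: move each weight along residual * feature."""
    residual = target - predict(weights, features)
    return [w + alpha * residual * f for w, f in zip(weights, features)]


def fit_online(points, alpha=0.05, epochs=2000):
    """Repeatedly sweep over (x, y) pairs using features [1, x]."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in points:
            weights = least_squares_step(weights, [1.0, x], y, alpha)
    return weights


# Noise-free points on the line y = 1 + 2x.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w0, w1 = fit_online(points)
print(round(w0, 3), round(w1, 3))
```

Substituting the Q-learning target for y and Q(s, a) for the prediction turns `least_squares_step` into the approximate Q-learning weight update.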

