Simultaneous Acquisition of Task and Feedback Models
Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer
INRIA, Bordeaux Sud-Ouest
manuel.lopes@inria.fr flowers.inria.fr/mlopes
Outline: Interactive Learning
– Exoskeleton, joystick, Wiimote,…
– Acquired with vision or 3D cameras from someone’s execution
– Verbal commands, gestures, …
– Users also use the reward channel to guide the robot’s attention and exploration
– Users adapt their rewarding behavior, suggesting both instrumental and motivational intents with their communication channel
– Teachers change their strategy as they develop a mental model of how the agent learns
– Teachers are not optimal, even when they try to be (Cakmak, Thomaz)
User Preferences (Mason)
– Quantitative evaluation
– Yes/No classifications of behavior
Simultaneous Acquisition of Task and Feedback Models, Manuel Lopes, Thomas Cederborg and Pierre-Yves Oudeyer, ICDL, 2011.
Set of possible states of the world and actions: X = {1, ..., |X|}, A = {1, ..., |A|}
Transition model: P[X_{t+1} = y | X_t = x, A_t = a] = P_a(x, y)
Policy: P[A_t = a | X_t = x] = π(x, a)
Value of a policy: V^π(x) = E_π[ ∑_t γ^t r_t | X_0 = x ]
Optimality: V*(x) = r(x) + γ max_a E_a[ V*(y) ]
Q*(x, a) = r(x) + γ E_a[ V*(y) ]
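As a concrete reference for these definitions, here is a minimal value-iteration sketch in Python/numpy; the array layout and the function name are my own choices, not taken from the slides.

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """Compute V* and Q* for a finite MDP by value iteration.

    P: transition array of shape (|A|, |X|, |X|), P[a, x, y] = P_a(x, y)
    r: state reward vector of shape (|X|,)
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q*(x, a) = r(x) + gamma * E_a[V*(y)]
        Q = r[:, None] + gamma * np.einsum('axy,y->xa', P, V)
        V_new = Q.max(axis=1)            # V*(x) = max_a Q*(x, a)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q
        V = V_new
```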
The goal of the task is unknown.
RL: from the world model T and the reward r, find the optimal policy π*.
IRL: from samples of the policy π̂ and the world model T, estimate the reward r̂.
Ng et al., ICML 2000; Abbeel et al., ICML 2004; Neu et al., UAI 2007; Ramachandran et al., IJCAI 2007; Lopes et al., IROS 2007
Given a demonstration: D = {(x1, a1), ..., (xn, an)}
Likelihood of the demonstration: L(D) = ∏_i π_r(x_i, a_i)
The demonstrator sometimes makes mistakes, so assume a noisily optimal (softmax) policy:
π'(x, a) = e^{η Q*(x, a)} / ∑_b e^{η Q*(x, b)}
Likelihood of the demo: L(D) = ∏_i π'(x_i, a_i)
Posterior over rewards: P[r | D] ∝ P[r] P[D | r]
Sample P[r | D] (Ramachandran)
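A minimal sketch of this pipeline, assuming the transition-array layout of the value-iteration sketch above: a softmax likelihood plus a random-walk Metropolis-Hastings sampler. The Gaussian prior, η, step size and helper names are illustrative assumptions, not the exact setup of Ramachandran et al.

```python
import numpy as np

def soft_q(P, r, gamma=0.95, iters=200):
    """Q* for state reward r; P has shape (|A|, |X|, |X|)."""
    n_a, n_x, _ = P.shape
    V = np.zeros(n_x)
    for _ in range(iters):
        Q = r[:, None] + gamma * np.einsum('axy,y->xa', P, V)
        V = Q.max(axis=1)
    return Q

def log_likelihood(r, D, P, eta=5.0, gamma=0.95):
    """log L(D) = sum_i log pi'(x_i, a_i), with pi' the softmax of eta * Q*."""
    Q = eta * soft_q(P, r, gamma)
    m = Q.max(axis=1, keepdims=True)
    logZ = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
    return sum(Q[x, a] - logZ[x] for x, a in D)

def sample_posterior(D, P, n_x, n_samples=500, step=0.1):
    """Metropolis-Hastings sampling of P[r | D] ∝ P[r] P[D | r]."""
    r = np.zeros(n_x)
    ll = log_likelihood(r, D, P)
    samples = []
    for _ in range(n_samples):
        r_new = r + step * np.random.randn(n_x)          # random-walk proposal
        ll_new = log_likelihood(r_new, D, P)
        log_prior_ratio = 0.5 * (r @ r - r_new @ r_new)  # standard normal prior on r
        if np.log(np.random.rand()) < ll_new - ll + log_prior_ratio:
            r, ll = r_new, ll_new
        samples.append(r.copy())
    return samples
```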
Gradient-based alternative: given the demonstration D, maximize the likelihood by gradient ascent
r_{t+1} = r_t + ∇_r L(D)
Policy Loss (Neu et al.), Maximum likelihood (Lopes et al.)
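The gradient step can be sketched generically; the finite-difference gradient below is only illustrative (Neu et al. and Lopes et al. derive closed-form gradients), and `log_L` is assumed to be any callable, e.g. the log_likelihood sketch above with D and P bound.

```python
import numpy as np

def gradient_ascent(log_L, r0, lr=0.05, eps=1e-4, iters=100):
    """r_{t+1} = r_t + lr * grad_r log L(r), gradient by central finite differences."""
    r = r0.copy()
    for _ in range(iters):
        grad = np.zeros_like(r)
        for i in range(len(r)):
            dr = np.zeros_like(r)
            dr[i] = eps
            grad[i] = (log_L(r + dr) - log_L(r - dr)) / (2 * eps)
        r = r + lr * grad
    return r
```

For example: gradient_ascent(lambda r: log_likelihood(r, D, P), np.zeros(n_x)).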
Distribution P[r | D] induces a distribution on Π
μ_xa(p) = P[π(x, a) = p | D]
Compute the entropy H(μ_xa) of each state-action pair, then the per-state mean entropy:
H(x) = 1/|A| ∑_a H(μ_xa)
Active Learning for Reward Estimation in Inverse Reinforcement Learning, Manuel Lopes, Francisco Melo and Luis Montesano. ECML/PKDD, 2009.
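One way to estimate H(x) from Monte Carlo samples of the reward posterior is to histogram the induced policy values per state-action pair; a minimal sketch, where the histogram estimator and bin count are my own choices.

```python
import numpy as np

def state_entropy(policy_samples, n_bins=10):
    """Per-state mean entropy H(x) from Monte Carlo policy samples.

    policy_samples: array of shape (n_samples, |X|, |A|), where each slice
    is the softmax policy induced by one reward drawn from P[r | D].
    """
    n_samples, n_x, n_a = policy_samples.shape
    H = np.zeros(n_x)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for x in range(n_x):
        for a in range(n_a):
            # Histogram estimate of mu_xa(p) = P[pi(x, a) = p | D]
            counts, _ = np.histogram(policy_samples[:, x, a], bins=bins)
            p = counts / counts.sum()
            p = p[p > 0]
            H[x] += -(p * np.log(p)).sum()
    return H / n_a          # H(x) = 1/|A| * sum_a H(mu_xa)
```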
Require: Initial demonstration D
1. Estimate P[π | D] using MC
2. for all x ∈ X
3.   Compute H(x)
4. endfor
5. Solve MDP with R = H(x)
6. Query trajectory following the optimal policy
7. Add new trajectory to D

[Plot: comparison of random vs. active trajectory queries]
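A sketch of the full loop, reusing the soft_q, sample_posterior and state_entropy helpers from the sketches above (those names are mine, not the authors' code); the expert is modeled as a callable returning the demonstrated action for a queried state.

```python
import numpy as np

def active_irl(D, P, expert, gamma=0.95, eta=5.0, n_iters=5, horizon=10):
    """Active IRL loop: query trajectories through the most uncertain states.

    D: list of (state, action) pairs from the initial demonstration.
    P: transition array of shape (|A|, |X|, |X|).
    expert: callable state -> demonstrated action (the human teacher).
    """
    n_a, n_x, _ = P.shape
    rng = np.random.default_rng(0)
    for _ in range(n_iters):
        # 1. Estimate P[pi | D] by Monte Carlo over posterior reward samples
        rewards = sample_posterior(D, P, n_x, n_samples=200)
        policies = []
        for r in rewards:
            Q = eta * soft_q(P, r, gamma)
            e = np.exp(Q - Q.max(axis=1, keepdims=True))
            policies.append(e / e.sum(axis=1, keepdims=True))
        # 2-4. Per-state mean entropy H(x) of the induced policy distribution
        H = state_entropy(np.stack(policies))
        # 5. Solve the MDP whose reward is the uncertainty H(x)
        explore_policy = soft_q(P, H, gamma).argmax(axis=1)
        # 6-7. Follow that policy, query the expert along the way, grow D
        x = rng.integers(n_x)
        for _ in range(horizon):
            D.append((x, expert(x)))          # expert labels each visited state
            a = explore_policy[x]
            x = rng.choice(n_x, p=P[a, x])
    return D
```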
[Figure: Demonstration (ambiguous) vs. Binary Reward (ambiguous)]
The user might use different words to provide feedback: Up, Go, Forward, ...
An intuitive interface should allow the interaction to be as free as possible.
Even if the user does not follow a strict vocabulary, can the robot still make use of such extra signals?
States: TO, TR, TT, RT, OT

Init State | Action    | Next State | Feedback F1 (_/A) | Feedback F2 (A/_)
TT         | Grasp1    | RT         | _                 | +
RT         | Grasp2    | RT         | RelOnObj          | ++
RT         | RelOnObj  | OT         | _                 | +++
TT         | Grasp2    | TR         | AgarraVer         | +-+

Assuming (F1, OT), "AgarraVer" means Grasp1.
Actions: Up, Down, Left, Right, Pick, Release
The task consists in finding which object to pick and where to take it.
The robot tries an action (possibly none); the user provides feedback.
8 known symbols, 8 unknown ones. The robot must learn the task goal, how the user provides feedback, and the meaning of the unknown signs.
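As an illustration of how unknown feedback symbols could be interpreted, here is a simplified counting rule: given the current task estimate, classify each unknown symbol by whether it tends to follow correct or incorrect actions. This is a stand-in for the joint task/feedback estimation in the paper, not the authors' algorithm; the function and argument names are hypothetical.

```python
from collections import defaultdict

def interpret_symbols(log, is_correct):
    """Classify unknown feedback symbols as positive or negative feedback.

    log: list of (state, action, symbol) interaction records.
    is_correct: callable (state, action) -> bool under the current task estimate.
    """
    counts = defaultdict(lambda: [0, 0])   # symbol -> [after correct, after incorrect]
    for x, a, s in log:
        counts[s][0 if is_correct(x, a) else 1] += 1
    meaning = {}
    for s, (pos, neg) in counts.items():
        meaning[s] = 'positive feedback' if pos >= neg else 'negative feedback'
    return meaning
```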
[Plots: passive vs. active learning]
Active queries decrease the number of demonstrated samples needed.
Experimental results indicate the usefulness of active IRL: active querying is not worse than random querying.
If we learn the feedback structure, we can learn even faster.
We can learn the task, the feedback and (some) guidance symbols simultaneously.
Include more sources of information, e.g. speech prosody.