SLIDE 1

Towards a Unified Framework for Learning from Observation

Santiago Ontañón (IIIA-CSIC, Spain) José L. Montaña (Universidad de Cantabria, Spain) Avelino J. Gonzalez (University of Central Florida, USA)

SLIDE 2

Motivation

  • Many disconnected approaches in the literature
  • Lack of a common framework to compare them
SLIDE 3

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 4

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 5

Learning from Observation

  • Learn to perform a task solely by observing the external behavior of another agent

SLIDE 6

Learning from Observation

  • Supervised learning: learning a mapping from input variables to output variables
  • LfO: learning a control function (which might have internal state)
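
To make the distinction concrete, a minimal sketch in Python (purely illustrative; the perceptions, actions, and decision rule are assumptions, not from the paper) contrasting a stateless input-to-output mapping with a control function that keeps internal state:

from typing import List

def reactive_mapping(perception: str) -> str:
    # Supervised-learning style: the output depends only on the current input.
    return "brake" if perception == "obstacle" else "accelerate"

class StatefulController:
    """LfO-style control function: the action may also depend on what was seen before."""
    def __init__(self) -> None:
        self.history: List[str] = []          # internal state
    def act(self, perception: str) -> str:
        self.history.append(perception)
        # brake if an obstacle appeared in any of the last three perceptions
        return "brake" if "obstacle" in self.history[-3:] else "accelerate"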

SLIDE 7

Many Approaches

  • Can be traced back to 1979, with different names:

  • Learning from Observation
  • Learning from Demonstration
  • Imitation Learning
  • Apprenticeship Learning
  • Programming by Demonstration
SLIDE 8

Many Approaches

  • Reinforcement Learning Techniques
  • Case-based Reasoning
  • Decision Trees, Neural Networks, etc.
  • Genetic Algorithms
  • Inductive Logic Programming
  • Cognitive Architectures (SOAR, etc.)
  • etc.

[Argall et al. 2009] “A survey of robot learning from demonstration”

SLIDE 9

Applications

  • Domains with complex behaviors:
  • Robotics
  • Computer games
  • Training and simulation
  • Automated programming
  • etc.
SLIDE 10

Related Problems

  • Inverse Reinforcement Learning:
  • Given behavior (optimal policy, or trajectories), learn the reward function
  • Workflow reconstruction / Automata discovery

SLIDE 11

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 12

Vocabulary

  • An environment E
  • An expert (or actor) C
  • A task T
  • A learning agent A

[Diagram: the expert C and the learning agent A interact with the environment E through actions and perceptions while performing task T]

SLIDE 13

Learning Traces

  • The learning agent A can only observe the interaction of the expert C with the environment E, not the internal state of C:
  • perceptions (the state of E as observed by A): X
  • actions: Y

LT = [(t1, x1, y1), ..., (tn, xn, yn)]
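
As a minimal sketch of this representation (the field names and example values are illustrative assumptions, not from the paper), a learning trace can be held as a time-indexed list of perception/action entries:

from dataclasses import dataclass
from typing import Any, List

@dataclass
class TraceEntry:
    t: int      # time stamp
    x: Any      # perception: the state of E as observed by A
    y: Any      # action executed by the expert C at time t

LearningTrace = List[TraceEntry]   # LT = [(t1, x1, y1), ..., (tn, xn, yn)]

lt: LearningTrace = [
    TraceEntry(t=1, x="obstacle-ahead", y="turn-left"),
    TraceEntry(t=2, x="clear-road", y="accelerate"),
]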

SLIDE 14

LFO Task

  • Given:
  • A set of learning traces LT1, ..., LTk
  • An environment E (characterized by a set of input variables X, and a set of control variables Y)
  • Optionally, a description of the task T
  • Learn:
  • A behavior B that “behaves like” C in achieving task T in E
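
A hedged sketch of this task as a Python signature; the names learn_behavior, Behavior, and environment are hypothetical placeholders chosen only to mirror the definition above:

from typing import Any, Callable, List, Optional, Tuple

TraceEntry = Tuple[int, Any, Any]        # (time, perception x, expert action y)
LearningTrace = List[TraceEntry]
Behavior = Callable[[Any], Any]          # maps perceptions to actions (may keep internal state)

def learn_behavior(traces: List[LearningTrace],
                   environment: Any,
                   task: Optional[Any] = None) -> Behavior:
    """Given traces LT1, ..., LTk of the expert C acting in E (with input
    variables X and control variables Y), and optionally the task T,
    return a behavior B that 'behaves like' C when performing T in E."""
    raise NotImplementedError   # any concrete LfO algorithm would go here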

SLIDE 15

“Behaves like”

  • If no T is specified:
  • LFO is equivalent to learning to predict C’s actions
  • If T is specified:
  • LFO’s performance must take into account both predicting C’s actions and accomplishing T

SLIDE 16

Measuring Performance

  • In traditional ML, performance is measured by leaving some examples out of the training set: the test set
  • In LFO, the test set would be a set of traces
  • Comparing traces is not trivial
  • Achievement of task T must be taken into account

SLIDE 17

Measuring Performance

  • Evaluate performance: how well T is achieved
  • Evaluate output: how well the model predicts the expert’s actions, like traditional ML (sketched below)
  • Evaluate model: inspect the learned model (typically by human inspection)
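
For the “evaluate output” option, a minimal sketch of how action-prediction accuracy over a held-out trace could be computed, following the (t, x, y) trace format above; the function name and comparison rule are assumptions:

def action_prediction_accuracy(behavior, test_trace):
    """Fraction of entries (t, x, y) in a held-out trace for which the learned
    behavior predicts the same action y as the expert did."""
    if not test_trace:
        return 0.0
    hits = sum(1 for _, x, y in test_trace if behavior(x) == y)
    return hits / len(test_trace)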

SLIDE 18

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 19

Types of LFO Problems

  • Not all LFO algorithms work for all LFO problems
  • Common differences:
  • Continuous/discrete variables
  • Observable environment or not
  • etc.
SLIDE 20

Types of LFO Problems

  • LFO problems can be characterized depending on whether:
  • They require generalization or not
  • They require planning or not
  • We have a model of the environment or not
SLIDE 21

Types of LFO Problems

Generalization?  Planning?  Known Env.?  Level
no               no         -            Level 1: Strict Imitation
yes              no         -            Level 2: Reactive Behavior
yes              yes        yes          Level 3: Tactical Behavior
yes              yes        no           Level 4: Tactical Behavior in unknown environment

SLIDE 22

Level 1: Strict Imitation

  • No feedback required from environment
  • No need for generalization nor planning
  • The learned behavior is a strict function of time

  • Algorithms required: pure memorization
  • Example: robots in factories
SLIDE 23

Level 2: Reactive Behavior

  • Behavior is a “perception-to-action mapping”
  • No need for planning
  • Standard (classification/regression) machine learning algorithms can be used at this level
  • Example: simple complete-information games like Pong or Space Invaders

SLIDE 24

Level 3: Tactical Behavior

  • Perception is not enough to determine behavior:
  • Behavior to be learned has internal state
  • Standard (classification/regression) machine learning algorithms cannot be used directly
  • Example: driving a car, or complex games (e.g. Stratego)

SLIDE 25

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 26
Statistical Formulation of LFO

  • Behavior as a stochastic process: I = {I1, ..., In}, with Ik = (Xk, Yk)
  • LFO consists in estimating the probability distribution of the stochastic process: ρ(Yk | xk, ik−1, ..., i1)
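
As one possible illustration of what estimating this distribution could mean (a simple counting estimate, assumed here for concreteness and not the paper's proposal), ρ can be approximated from traces by counting which actions follow each observed history:

from collections import Counter, defaultdict

def estimate_rho(traces):
    """Empirical estimate of rho(Y_k | x_k, i_{k-1}, ..., i_1): for each observed
    (current perception, full history) pair, count how often each action follows."""
    counts = defaultdict(Counter)
    for trace in traces:                    # each trace: [(t1, x1, y1), ..., (tn, xn, yn)]
        history = ()
        for _, x, y in trace:
            counts[(x, history)][y] += 1
            history += ((x, y),)
    return {key: {y: n / sum(c.values()) for y, n in c.items()}
            for key, c in counts.items()}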

SLIDE 27

Level 1: Strict Imitation

  • Only the sequence of actions in the training trace has non-zero probability:

ρ(I1 = (x1, y1), ..., In = (xn, yn)) = 1

BT = [(x1, y1), ..., (xn, yn)]
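
A minimal sketch of Level 1 as pure memorization and playback (the class and method names are illustrative assumptions):

class StrictImitation:
    """Level 1: memorize the training trace and replay its actions as a function of time."""
    def __init__(self, trace):
        # keep only the action sequence; perceptions are ignored (no feedback from E)
        self.actions = [y for _, _, y in trace]
    def act(self, t):
        return self.actions[t]     # the learned behavior is a strict function of time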

SLIDE 28
Level 2: Reactive Behavior

  • Reactive behavior only depends on perceptions:

ρ(Yk | xk, ik−1, ..., i1) = ρ(Yk | xk)

  • In this case, LFO is equivalent to the traditional supervised learning problem, and each entry in a trace is one training example
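
A minimal sketch of this reduction, assuming perceptions are already numeric feature vectors; the choice of scikit-learn's DecisionTreeClassifier is arbitrary and only illustrative, since the formulation admits any classifier or regressor:

from sklearn.tree import DecisionTreeClassifier

def learn_reactive_behavior(traces):
    """Level 2: every trace entry (t, x, y) becomes one supervised example x -> y."""
    X = [x for trace in traces for _, x, _ in trace]   # perceptions as feature vectors
    Y = [y for trace in traces for _, _, y in trace]   # expert actions as labels
    model = DecisionTreeClassifier().fit(X, Y)
    return lambda x: model.predict([x])[0]             # the learned reactive behavior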

SLIDE 29

Level 3: Tactical Behavior

  • The behavior needs some internal state (i.e. memory). Assuming only a finite amount of memory is required to learn a task:

ρ(Yk | xk, ik−1, ..., i1) = ρ(Yk | xk, ik−1, ..., ik−l)

  • Where l plays a role similar to the order of a Markov process

SLIDE 30

Level 3: Tactical Behavior

  • Given a fixed l:
  • A Markov process of order l can be reduced to one of order 1
  • We could use supervised learning algorithms (sketched below)
  • With an explosion in the set of input features
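
A minimal sketch of that reduction (the windowing scheme and padding value are assumptions): append the previous l trace entries to the current perception so a standard supervised learner can be applied, at the cost of the larger input space noted above.

def to_order_one_examples(trace, l):
    """Rewrite one trace so that each example's input is the current perception x_k
    together with the previous l entries i_{k-1}, ..., i_{k-l} (larger input space)."""
    examples = []
    for k, (_, x, y) in enumerate(trace):
        window = [(xi, yi) for _, xi, yi in trace[max(0, k - l):k]]   # previous l entries
        padding = [None] * (l - len(window))                          # pad at trace start
        examples.append(((x, *padding, *window), y))
    return examples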

SLIDE 31

Outline

  • Learning from Observation
  • A Unified Framework
  • Levels of Difficulty of LFO
  • Statistical Formulation
  • Conclusions
SLIDE 32

Conclusions

  • Large amount of existing work in LFO
  • Each author uses a different framework and vocabulary
  • Need for unification for easy comparison of research and results

SLIDE 33

Conclusions

  • We presented a proposal for a unified vocabulary
  • Classification of LFO tasks into a series of levels
  • Our goal was to classify the types of algorithms needed for different types of tasks

SLIDE 34

Future Work

  • Performance evaluation methodology
  • Standard testbeds for comparison:
  • E.g. computer games?
SLIDE 35

Thank you!