SLIDE 1 INTRODUCTION TO AUTONOMOUS MOBILE ROBOTS
Lesson 5 – Low-level control and learning
Anders Lyhne Christensen, D6.05, anders.christensen@iscte.pt
SLIDE 2
Overview
Low-level control
Ad-hoc
Sense-think-act loop
Event-driven control
Learning and adaptation I
Types of learning
Issues in learning
Example: Q-learning applied to a real robot
(next time, we will discuss an interesting approach to learning called evolutionary robotics in more detail)
SLIDE 3
Low-level control
We will cover three types of low-level control:
Stream of instructions
Classic control loop
Event-driven languages
Other approaches such as logic programming exist, but we will not cover those in this course.
SLIDE 4 Stream of instructions
Example:
// Move forward for 2 seconds:
moveForward(speed = 10)
sleep(2000)
if (obstacleAhead()) {
    turnLeft(speed = 10)
    sleep(1000)
} else {
    …
}
Suitable for industrial, assembly-line robots:
Easy to describe a fixed, predefined task as a recipe
Little branching
SLIDE 5 Classic control loop
Sense → Think → Act
The loop usually has a fixed duration, e.g. 100 ms, and is executed repeatedly
SLIDE 7 Classic control loop
while (!Button.ESCAPE.isPressed()) {
    long startTime = System.currentTimeMillis();
    sense();   // read sensors
    think();   // plan next action
    act();     // do next action
    try {
        // Sleep for the remainder of the 100 ms period (skip if the loop body overran):
        long remaining = 100 - (System.currentTimeMillis() - startTime);
        if (remaining > 0) {
            Thread.sleep(remaining);
        }
    } catch (InterruptedException e) {}
}
SLIDE 8 Event-driven languages
URBI script – examples:
Ball tracking:
whenever (ball.visible) {
    headYaw.val   += camera.xfov * ball.x &
    headPitch.val += camera.yfov * ball.y
};
Interaction:
at (speech.hear("hello")) { voice.say("How are you?") & robot.standup(); }
SLIDE 9 Distributed and event-driven
[Diagram: right-motor, left-motor, and proximity-sensor microcontrollers, each connected to a shared event bus]
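To make the event-bus idea concrete, below is a minimal publish/subscribe sketch in Java. It is illustrative only: the EventBus class, topic names, and handler signatures are assumptions, not the API of any real robot middleware; on a real robot, each microcontroller would publish its sensor events and subscribe to actuation commands over such a bus.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal publish/subscribe event bus (illustrative, not a real robot API):
class EventBus {
    private final Map<String, List<Consumer<Object>>> handlers = new HashMap<>();

    // Register a handler for a topic, e.g. "obstacle" or "motor.left":
    void subscribe(String topic, Consumer<Object> handler) {
        handlers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    // Deliver an event to every handler subscribed to its topic:
    void publish(String topic, Object payload) {
        for (Consumer<Object> h : handlers.getOrDefault(topic, new ArrayList<>())) {
            h.accept(payload);
        }
    }
}

// Example: the proximity-sensor controller publishes, a motor controller reacts:
//   bus.subscribe("obstacle", distance -> stopMotors());   // stopMotors() is hypothetical
//   bus.publish("obstacle", 0.12);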
SLIDE 10 LEARNING AND ADAPTATION
Based on slides from Prof. Lynn E. Parker
SLIDE 11
What is Learning/Adaptation?
Many definitions:
“Modification of behavioral tendency by experience.” (Webster 1984)
“A learning machine, broadly defined, is any device whose actions are influenced by past experiences.” (Nilsson 1965)
“Any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population.” (Simon 1983)
“An improvement in information processing ability that results from information processing activity.” (Tanimoto 1990)
Our operational definition:
Learning produces changes within an agent that over time enable it to perform more effectively within its environment.
SLIDE 12
What is the Relationship between Learning and Adaptation?
Evolutionary adaptation: Descendants change over long time scales based on the success or failure of their ancestors in the environment
Structural adaptation: Agents adapt their morphology with respect to the environment
Sensor adaptation: An agent’s perceptual system becomes more attuned to its environment
Behavioral adaptation: An agent’s individual behaviors are adjusted relative to one another
Learning: Essentially anything else that results in a more ecologically fit agent (can include adaptation).
SLIDE 13 Habituation and Sensitization
Adaptation may produce habituation
Habituation: An eventual decrease in or cessation of a behavioral response when a stimulus is presented numerous times
Useful for eliminating spurious or unnecessary responses
Generally associated with relatively insignificant stimuli, such as loud noise
Sensitization: The opposite – an increase in the probability of a behavioral response when a stimulus is repeated frequently
Generally associated with “dire” stimuli, like electrical shocks
[Figure: response curves illustrating sensitization and an example of habituation]
SLIDE 14
Learning
Learning, on the other hand, can improve performance in additional ways:
Introducing new knowledge (facts, behaviors, rules) into the system
Generalizing concepts from multiple examples
Specializing concepts for particular instances that are in some way different from the mainstream
Reorganizing the information within the system to be more efficient
Creating or discovering new concepts
Creating explanations of how things function
Reusing past experiences
SLIDE 15 AI Research has Generated Several Learning Approaches
- Reinforcement learning: Rewards and/or punishments are used to alter numeric values in a controller
- Evolutionary learning: Genetic operators such as crossover and mutation are applied over populations of controllers, leading to more efficient control strategies
- Neural networks: A form of learning that uses specialized architectures in which learning occurs as the result of alterations in synaptic weights
- Learning from experience:
  - Memory-based learning: Myriad individual records of past experiences are used to derive function approximators for control laws
  - Case-based learning: Specific experiences are organized and stored as a case structure, then retrieved and adapted as needed based on the current situational context
SLIDE 16
Learning Approaches (cont’d.)
Inductive learning: Specific training examples are used, each in turn, to generalize and/or specialize concepts or controllers
Explanation-based learning: Specific domain knowledge is used to guide the learning process
Multistrategy learning: Multiple learning methods compete and cooperate with each other, each specializing in what it does best
SLIDE 17
Challenges with Learning
Credit assignment problem: How is credit or blame assigned to a particular piece or pieces of knowledge in a large knowledge base, or to the components of a complex system, responsible for either the success or failure of an attempt to accomplish a task?
Saliency problem: What features in the available input stream are relevant to the learning task?
New term problem: When does a new representational construct (concept) need to be created to capture some useful feature effectively?
Indexing problem: How can a memory be efficiently organized to provide effective and timely recall to support learning and improved performance?
Utility problem: How does a learning system determine that the information it contains is still relevant and useful? When is it acceptable to forget things?
SLIDE 18 Example: Q-Learning Algorithm
Provides the ability to learn by determining which behavioral actions are most appropriate for a given situation
State-action table Q(x,a): one row per state x, one column per action a; each entry stores the learned utility of taking action a in state x. E(y) denotes the utility of the resulting next state y.
SLIDE 19 Update function for Q(x,a)
Q(x,a) ← Q(x,a) + β(r + λE(y) − Q(x,a))
Where:
- β is the learning rate parameter
- r is the payoff (reward or punishment)
- λ is a parameter, called the discount factor, ranging between 0 and 1
- E(y) is the utility of the state y that results from the action, computed by: E(y) = max(Q(y,a)) for all actions a
Rewards are propagated across states so that rewards from similar states can facilitate learning, too.
What is a “similar state”? One approach: weighted Hamming distance
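As a concrete illustration, a minimal Java sketch of this tabular update follows, assuming states and actions are indexed by integers; the names (QTable, beta, lambda) and the parameter values are illustrative assumptions, not taken from the original work.

// Minimal sketch of the tabular Q-update above (illustrative names and values):
class QTable {
    final double[][] q;            // q[state][action], initialized to 0
    final double beta = 0.5;       // learning rate (assumed value)
    final double lambda = 0.9;     // discount factor (assumed value)

    QTable(int numStates, int numActions) {
        q = new double[numStates][numActions];
    }

    // E(y) = max over all actions a of Q(y,a):
    double utility(int y) {
        double best = q[y][0];
        for (double v : q[y]) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Q(x,a) <- Q(x,a) + beta * (r + lambda * E(y) - Q(x,a)):
    void update(int x, int a, double r, int y) {
        q[x][a] += beta * (r + lambda * utility(y) - q[x][a]);
    }
}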
SLIDE 20 Utility Function Used to Modify Robot’s Behavioral Responses
Initialize all Q(x,a) to 0.
Do Forever
- Determine current world state x via sensing
- 90% of the time, choose the action a that maximizes Q(x,a); else pick a random action
- Execute a
- Determine reward r
- Update Q(x,a) as described
- Update Q(x′,a) for all states x′ similar to x
End Do
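A Java sketch of this loop, reusing the QTable class from the sketch above; sense(), executeAction(), computeReward(), and similarStates() are placeholders for robot-specific code, not a real API.

import java.util.Random;

// Illustrative sketch of the learning loop (builds on the QTable sketch above):
class QLearner {
    final QTable table;
    final int numActions;
    final Random rng = new Random();

    QLearner(QTable table, int numActions) {
        this.table = table;
        this.numActions = numActions;
    }

    void step() {
        int x = sense();                          // current world state
        int a;
        if (rng.nextDouble() < 0.9) {             // 90%: pick the best-known action
            a = 0;
            for (int i = 1; i < numActions; i++) {
                if (table.q[x][i] > table.q[x][a]) { a = i; }
            }
        } else {                                  // 10%: pick a random action
            a = rng.nextInt(numActions);
        }
        executeAction(a);
        int y = sense();                          // resulting state
        double r = computeReward(x, a, y);
        table.update(x, a, r, y);                 // update Q(x,a)
        for (int xs : similarStates(x)) {         // propagate to similar states
            table.update(xs, a, r, y);
        }
    }

    // Robot-specific placeholders (assumptions for illustration only):
    int sense() { return 0; }
    void executeAction(int a) {}
    double computeReward(int x, int a, int y) { return 0.0; }
    int[] similarStates(int x) { return new int[0]; }
}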
SLIDE 21 Example of Using Q-Learning: Teaching Box-Pushing
8 sonars (4 looking forward, 2 looking right, 2 looking left)
Sonar readings quantized into two ranges: NEAR (9-18 inches) and FAR (18-30 inches)
Forward-looking infrared (IR): binary response at 4 inches, indicating when the robot is in the BUMP state
Current to the drive motors is monitored to determine whether the robot is STUCK (i.e., input current exceeds a threshold)
Total of 18 bits of sensor information available: 16 sonar bits (NEAR, FAR), plus two for BUMP and STUCK
Motor control outputs – five choices:
Moving forward
Turning left 22 degrees
Turning right 22 degrees
Turning more sharply left at 45 degrees
Turning more sharply right at 45 degrees
[Photo: Obelix robot and box, 1991]
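To see how these 18 bits form a single perceptual state, here is a small Java sketch that packs them into an int; the bit layout is an illustrative assumption, not necessarily the encoding used in the original work.

// Packing the 18-bit perceptual state into an int (bit layout is assumed):
// bits 0-7: NEAR flag per sonar, bits 8-15: FAR flag per sonar,
// bit 16: BUMP, bit 17: STUCK.
static int encodeState(boolean[] near, boolean[] far, boolean bump, boolean stuck) {
    int state = 0;
    for (int i = 0; i < 8; i++) {
        if (near[i]) { state |= 1 << i; }
        if (far[i])  { state |= 1 << (8 + i); }
    }
    if (bump)  { state |= 1 << 16; }
    if (stuck) { state |= 1 << 17; }
    return state;
}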
SLIDE 22 Robot’s Learning Problem
- Learning Problem:
- Deciding, for any of the approximately 250,000 perceptual states (2^18 = 262,144), which of the 5 possible actions will enable it to find and push boxes around a room without getting stuck
- 250,000 perceptual states × 5 actions = 1,250,000 state/action pairs to explore!
SLIDE 23 State Diagram of Behavior Transitions
[State diagram of behavior transitions among Finder, Pusher, and Unwedger: the Finder moves toward possible boxes; a BUMP resulting from finding a box activates the Pusher; STUCK activates the Unwedger, which frees the robot when the box is no longer pushable; transitions are triggered by BUMP, STUCK, BUMP + ∆t, STUCK + ∆t, and “anything else”]
SLIDE 24
Measurement of “State Nearness”
Use the 18-bit representation of state (16 bits for sonar, two for BUMP and STUCK)
Compute the Hamming distance between states
Recall: Hamming distance = number of bits in which the two states differ
For this example, states were considered “near” if Hamming distance < 3
SLIDE 25 Robotic Results
Q-learning strategy tested on the Obelix robot
Observations:
Using Q-learning substantially improved box pushing over a random agent
Performance was also compared to a hand-coded solution: the Q-learning approach was close to or better than the hand-coded solution
Importance of this work:
Its empirical demonstration of Q-learning’s feasibility as a useful approach to behavior-based robotic learning
SLIDE 26 Summary of Learning/Adaptation so far
Robots need to learn in order to adapt effectively to a changing and dynamic environment
Behavior-based robots can learn in a variety of ways:
They can learn entirely new behaviors
They can learn more effective responses
They can learn to associate more appropriate or broader stimuli with a particular response
They can learn new combinations of behaviors
They can learn more effective coordination of existing behaviors
Learning can be either continuous (on-line) or batch (off-line)
Q-learning is a form of reinforcement learning in which actions and states are evaluated together.
SLIDE 27
A Challenge: Getting RL to Work on Real Robots
When is learning appropriate?
When the task is originally under-specified or difficult to code exactly by hand
When the task has parameters that are likely to change over time in unpredictable ways
When the time taken to learn a control policy is less than that for hand-coding a comparable policy
When the learned policy can be executed more efficiently than a hand-coded one
SLIDE 28 Problems with RL on Robots
Huge number of states to explore, with a large number of possible actions in each state
E.g., 24 sonar sensors quantized into 3 range bands → 3^24 ≈ 282 billion possible states
If the possible actions in each state are to go forwards or backwards → more than 560 billion state-action combinations to try
Robot is physical, thus it takes time to perform an action
At 1 second per action → about 20,000 years to try each combination
During early learning, the robot’s actions may be dangerous
“Let’s try rolling down the stairwell to see what next state I end up in…”
One possible safeguard: give the robot reflexes that stop dangerous actions
SLIDE 29
Today’s task
Work on projects