SLIDE 1

AAAI 2019 Tutorial

Knowledge-based Sequential Decision-Making under Uncertainty

Shiqi Zhang (SUNY Binghamton, USA) Mohan Sridharan (University of Birmingham, UK)

szhang@cs.binghamton.edu; m.sridharan@bham.ac.uk

SLIDE 2

Tutorial Objectives

  • Motivate knowledge-based sequential decision making under uncertainty
  • Describe related concepts in knowledge representation, reasoning and learning, with simple robotics examples
  • Draw on our own work and work by others to describe architectures that illustrate knowledge-based sequential decision making under uncertainty
  • Explore the interplay between knowledge representation, reasoning and learning with architecture examples
  • We will not discuss specific “solvers” for logical or probabilistic reasoning; the architectures described use such solvers

SLIDE 3

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 4

Knowledge-based Sequential Decision-making under Uncertainty

  • Sequential decision-making (SDM):
      ○ More than one action is often required to complete complex tasks
      ○ Subsequent actions often depend on the effects of actions that precede them
  • Reasoning (planning, diagnostics) under uncertainty:
      ○ Actions in complex, practical domains are non-deterministic
      ○ Local, unreliable observations; partial observability
  • Knowledge-based:
      ○ Considerable commonsense knowledge is available in practical applications
      ○ Reasoning with this knowledge can improve decision making and guide learning

SLIDE 5

Knowledge Representation, Reasoning and Learning

  • How is knowledge represented?
      ○ Knowledge representation (KR) is a fundamental research area in AI
      ○ Representations include logic, probability, graphs, etc.
  • How to reason with knowledge?
      ○ Different reasoning mechanisms based on the underlying representation
  • Why learning?
      ○ Reasoning with incomplete knowledge results in incorrect or suboptimal outcomes
      ○ Exploit the ability to observe the domain and action outcomes; learn from trial and error
  • Representation, reasoning and learning are inter-dependent!

[Figure: a KRR system receives a query and produces conclusions]
SLIDE 6

Overview of Knowledge-based SDM

SLIDE 7

SDM Applications

  • Robotics (used often in this tutorial)
  • Finance
  • Urban planning
  • Healthcare
  • Games
  • Transportation
  • E-commerce
  • … and many more ...

Image from Sergey Levine

SLIDE 8

Motivating Example

Consider a robot assisting humans in an indoor domain.

  • The robot has to find and move objects to locations or people.
  • It has some prior knowledge of locations, objects and object properties.
  • Humans provide limited feedback.
  • Sensing and actuation are noisy.

SLIDE 9

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 10

SDM paradigms: Broad Classification

  • Logic-based commonsense reasoning
      ○ Logics to represent uncertainty, commonsense knowledge and theories of action
      ○ Challenges: comprehensive domain knowledge, quantitative models of uncertainty
  • Probabilistic reasoning or decision-theoretic planning
      ○ Compute an action policy when the domain model is known and probabilistic
      ○ Challenges: long planning horizons, large state and action spaces
  • Reinforcement learning (RL)
      ○ Learn an action policy through trial and error when the domain model is unknown
      ○ Challenges: exploration/exploitation tradeoff, credit assignment, structured knowledge

SLIDE 11

Logic-based Knowledge Representation

  • Many different logics: first-order, non-monotonic, temporal
  • We discuss non-monotonic logics, often written as Prolog-style statements:

Head :- Body.    % "Head is true if Body is true"

  • Particular example: Answer Set Prolog [Gelfond, Kahl 2014]
  • Action language: a formal model of the part of natural language used to describe transition diagrams [Gelfond, Lifschitz 1998]; many options, e.g., AL, B, C
  • In AL: a hierarchy of basic sorts, statics, fluents, and actions
  • Statements: causal laws, state constraints, executability conditions
  • Statements of AL provide a system description: signature and axioms.

SLIDE 12

Declarative Knowledge: Answer Set Prolog

  • Signature:
      ○ Basic sorts: robot, place, object, cup, book, printer
      ○ Statics: next_to(place, place), obj_weight(O, weight)
      ○ Fluents: loc(robot) = place, in_hand(robot, object)
      ○ Actions: move(robot, place), pickup(robot, object), serve(robot, object, person)
  • Axioms:
      ○ Causal laws:
            move(rob, Pl) causes loc(rob) = Pl
            pickup(rob, O) causes in_hand(rob, O)
      ○ State constraints:
            loc(O) = Pl if loc(rob) = Pl, in_hand(rob, O)
      ○ Executability conditions:
            impossible pickup(rob, O) if loc(rob) = Pl1, loc(O) = Pl2, Pl1 != Pl2
            impossible pickup(rob, O) if obj_weight(O, heavy)

SLIDE 13

Declarative Knowledge: Answer Set Prolog

  • Appealing properties of ASP:
      ○ Default negation and epistemic disjunction; things can be true, false, or unknown
            -p : p is believed to be false
            not p : p is not believed to be true
      ○ Only believe what you are forced to believe!
      ○ Represents recursive definitions, defaults, causal relations, self-reference, and language constructs occurring in non-mathematical domains
      ○ Unlike classical first-order logic, supports non-monotonic logical reasoning, i.e., revising previously held conclusions
  • Domain representation: system description D and history H.
  • History contains records of the form:
      ○ obs(fluent, boolean, timestep)
      ○ hpd(action, timestep)
  • Translate D and H to an ASP program (automatic tools) for reasoning.

SLIDE 14

Probabilistic Knowledge Representation

  • Many representations possible; we focus on Probabilistic Graphical Models (PGMs) that probabilistically model state transitions, causal relationships, etc.
  • PGMs use a graph to express conditional independence between random variables
  • We are particularly interested in directed acyclic PGMs (also called Bayesian networks)

SLIDE 15

Probabilistic Knowledge Representation

  • Many representations possible; we focus on Probabilistic Graphical Models (PGMs) that probabilistically model state transitions, causal relationships, etc.
  • Joint probability as a product of conditional probabilities and marginals:

P(C, S, R, W) = P(W | S, R) * P(S | C) * P(R | C) * P(C)

  • We only discuss PGMs that are:
      ○ Learned by the agent/robot from the environment; or
      ○ Constructed using human input or feedback

[Figure: PGMs learned from a dataset and/or constructed from input by humans, the world, or both]
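To make the factorization concrete, here is a minimal Python sketch of the classic cloudy/sprinkler/rain/wet-grass network; the CPT values are illustrative assumptions, not numbers from the tutorial.

```python
# Joint probability of the Cloudy/Sprinkler/Rain/WetGrass network, factored
# as P(C,S,R,W) = P(W|S,R) * P(S|C) * P(R|C) * P(C). CPTs are assumed values.
P_C = 0.5                                         # P(C = true)
P_S_given_C = {True: 0.1, False: 0.5}             # P(S = true | C)
P_R_given_C = {True: 0.8, False: 0.2}             # P(R = true | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}  # P(W = true | S, R)

def bernoulli(p_true, value):
    """Probability of a boolean variable taking `value`."""
    return p_true if value else 1.0 - p_true

def joint(c, s, r, w):
    return (bernoulli(P_W_given_SR[(s, r)], w)
            * bernoulli(P_S_given_C[c], s)
            * bernoulli(P_R_given_C[c], r)
            * bernoulli(P_C, c))

print(joint(True, False, True, True))   # P(C=t, S=f, R=t, W=t)
```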

SLIDE 16
Hybrid Knowledge Representation

  • Combine logics and probabilities
  • Literals hold true with some probability
  • Markov Logic Networks (MLN) [Richardson, Domingos 2006], ProbLog [De Raedt, Kimmig, Toivonen 2007], P-log [Baral, Gelfond, Rushton 2009], PSL [Bach, Broecheler, Huang, Getoor 2015], etc.

Example (an MLN): compute the probability of
  • Anna and Bob being friends given their smoking habits
  • Bob having cancer given his friendship with Anna and the likelihood of Anna having cancer
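To make the MLN semantics concrete, here is a minimal Python sketch that enumerates possible worlds and weighs them by exp(Σᵢ wᵢ nᵢ), where nᵢ counts the true groundings of formula i. The single weighted formula Smokes(x) → Cancer(x) and its weight are illustrative assumptions, simpler than the friendship example above.

```python
# A tiny MLN by brute-force enumeration of possible worlds.
import itertools, math

people = ["anna", "bob"]
w = 1.5   # assumed weight of the formula Smokes(x) -> Cancer(x)

def n_true_groundings(world):
    # count groundings of Smokes(x) -> Cancer(x) that hold in this world
    return sum(1 for p in people
               if (not world[("Smokes", p)]) or world[("Cancer", p)])

atoms = [("Smokes", p) for p in people] + [("Cancer", p) for p in people]
worlds = [dict(zip(atoms, vals))
          for vals in itertools.product([False, True], repeat=len(atoms))]
weight = [math.exp(w * n_true_groundings(wd)) for wd in worlds]

# P(Cancer(bob) | Smokes(bob)), by conditioning over weighted worlds
num = sum(wt for wd, wt in zip(worlds, weight)
          if wd[("Smokes", "bob")] and wd[("Cancer", "bob")])
den = sum(wt for wd, wt in zip(worlds, weight) if wd[("Smokes", "bob")])
print("P(Cancer(bob) | Smokes(bob)) =", num / den)
```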

SLIDE 17

Representation of Probabilistic Planning Domains

  • PDDL was developed for, and is maintained by, the International Planning Competition (IPC) community [McDermott, Ghallab, et al. 1998]; it is (arguably) the most popular declarative language for classical planning
  • PPDDL was developed in 2004 for describing MDP settings [Younes, Littman 2004]
  • In 2011, the Relational Dynamic Influence Diagram Language (RDDL) was developed for better expressiveness (cf. PPDDL) [Sanner 2010]
  • pBC+ was developed for probabilistic reasoning about transition systems [Lee, Wang 2018]

These and other similar action languages are limited in terms of representing and reasoning with different descriptions of knowledge and uncertainty.

SLIDE 18

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 19

Logics for Reasoning

  • Reasoning includes planning, diagnostics and inference.
  • The strategy depends on the representation; many solvers have been developed.
  • Map the reasoning task to:
      ○ Resolution and theorem proving, e.g., with first-order logic
      ○ A constraint satisfaction problem (CSP)
      ○ A satisfiability (SAT) problem, e.g., with ASP
  • We do not focus on solvers in this tutorial; instead, we explore how they can be used to formulate and solve problems.
  • Let us explore how reasoning is accomplished using CR-Prolog, a variant of ASP with consistency-restoring (CR) rules [Balduccini, Gelfond, 2003].

SLIDE 20

CR-Prolog Program

  • Convert D and H into a program: Π(D, H)
  • Signature and axioms of D, plus inertia axioms:

holds(F, I+1) :- holds(F, I), not -holds(F, I+1)
-holds(F, I+1) :- -holds(F, I), not holds(F, I+1)

  • Reality checks, closed world assumptions for defined fluents and actions:

:- holds(F, I), obs(F, false, I)
:- -holds(F, I), obs(F, true, I)

  • Observations, actions and defaults from H, e.g., an initial-state default plus a CR rule:

holds(loc(X) = library, 0) :- textbook(X), not -holds(loc(X) = library, 0)
-holds(loc(X) = library, 0) :+ textbook(X)   % CR rule: applied only to restore consistency

  • Planning and diagnosis are reduced to computing answer sets of the program.
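As a rough illustration of how such programs are run in practice, the sketch below feeds the inertia axioms to an ASP solver through the clingo Python API. It assumes the clingo package is installed; the toy door_open fluent and the two-step horizon are assumptions for illustration, not part of the tutorial's domain.

```python
# A minimal sketch: encode the inertia axioms and an initial observation,
# then enumerate the answer sets with clingo.
import clingo

program = """
step(0..1).
% inertia: a fluent keeps its value unless forced to change
holds(F, I+1) :- holds(F, I), step(I), not -holds(F, I+1).
-holds(F, I+1) :- -holds(F, I), step(I), not holds(F, I+1).
holds(door_open, 0).
"""

ctl = clingo.Control(["0"])          # "0" = enumerate all answer sets
ctl.add("base", [], program)
ctl.ground([("base", [])])
ctl.solve(on_model=lambda m: print("Answer set:", m))
```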

SLIDE 21

CR-Prolog Planning Example

  • Goal: loc(book1) = office2, -in_hand(rob, book1)
  • Given: textbook(book1), loc(rob) = kitchen, …, next_to(kitchen, office2), next_to(library, kitchen), ...
  • Based on default knowledge (textbooks are in the library), the computed plan is:

move(rob, library), pickup(rob, book1), move(rob, kitchen), move(rob, office2), putdown(rob, book1)

SLIDE 22

Challenges in using Logics for Reasoning

  • Modeling and reasoning with sensing and actuation uncertainty.
  • Domain knowledge is often incomplete and may change.
  • Fine-grained reasoning is sometimes necessary (e.g., for grasping) but computationally expensive.

Will return to these later

SLIDE 23

Probabilistic Reasoning: Bayes Rule and Filter

  • Joint and conditional probability of random variables: P(A, B), P(A|B)
  • Basic Bayes rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A), so P(A|B) = P(B|A) P(A) / P(B)
  • Bayes filter for state estimation (prediction and correction):
      ○ X (or S) = state, U (or A) = action, Z = observation (i.e., measurement)
  • The Bayes filter is the basis of most probabilistic reasoning systems
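A minimal Python sketch of the prediction-correction cycle; the door-domain motion and sensor models below are assumptions for illustration.

```python
# Discrete Bayes filter on a toy door domain.
states = ["open", "closed"]
T = {("push", "open"):   {"open": 1.0, "closed": 0.0},   # P(x' | u, x)
     ("push", "closed"): {"open": 0.8, "closed": 0.2}}
O = {"open":   {"see_open": 0.6, "see_closed": 0.4},     # P(z | x)
     "closed": {"see_open": 0.2, "see_closed": 0.8}}

def bayes_filter(belief, u, z):
    # prediction: bel_bar(x') = sum_x P(x' | u, x) * bel(x)
    predicted = {x2: sum(T[(u, x)][x2] * belief[x] for x in states)
                 for x2 in states}
    # correction: bel(x') proportional to P(z | x') * bel_bar(x')
    unnorm = {x2: O[x2][z] * predicted[x2] for x2 in states}
    eta = sum(unnorm.values())
    return {x2: p / eta for x2, p in unnorm.items()}

belief = {"open": 0.5, "closed": 0.5}
belief = bayes_filter(belief, "push", "see_open")
print(belief)
```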

SLIDE 24

Probabilistic Reasoning: Markov Decision Process (MDP)

  • The Markov property is assumed to hold for MDPs (and later RL):
      ○ First-order: given the current state, the next state is conditionally independent of previous states
      ○ Simplifies the computation of policies for complex real-world problems
  • An MDP is an SDM framework under the Markov assumption [Puterman 2014]
  • An MDP is a 4-tuple <S, A, T, R>:
      ○ States, Actions, Transitions, and Rewards
      ○ T: S x A x S’ ↦ [0, 1]
      ○ R: S x A x S’ ↦ ℝ
  • Solving an MDP produces a policy:
      ○ π: S ↦ A
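For concreteness, here is a minimal value-iteration sketch in Python; the two-state MDP, and the simplification of the reward to R(s, a), are assumptions for illustration.

```python
# Value iteration: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ].
S = ["s0", "s1"]
A = ["stay", "go"]
T = {("s0", "stay"): {"s0": 1.0}, ("s0", "go"): {"s1": 0.9, "s0": 0.1},
     ("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s0": 1.0}}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): -1.0}
gamma = 0.9

V = {s: 0.0 for s in S}
for _ in range(100):   # repeated Bellman backups until (approximate) convergence
    V = {s: max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in A) for s in S}

# extract a greedy policy pi: S -> A from the value function
policy = {s: max(A, key=lambda a: R[(s, a)] +
                 gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
          for s in S}
print(V, policy)
```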

SLIDE 25
Probabilistic Reasoning: Partially Observable MDPs (POMDPs) [Kaelbling, Littman, Cassandra. 1998]

  • Partial observability and non-determinism
  • A POMDP is a tuple <S, A, Z, T, O, R>:
      ○ Z: set of observations
      ○ O: observation function, O: S x A x Z ↦ [0, 1], i.e., P(z ∊ Z | s ∊ S, a ∊ A)
  • Maintain a belief state (or belief), a probability distribution over states, updated using observations
  • Solving a POMDP produces a policy mapping beliefs to actions:
      ○ π: B ↦ A
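The belief update itself is a Bayes filter over states, b'(s') ∝ O(s', a, z) Σ_s T(s, a, s') b(s). A minimal Python sketch on a tiger-style toy problem; the models are illustrative assumptions.

```python
# POMDP belief update on a two-state toy problem.
S = ["tiger_left", "tiger_right"]

def T(s, a, s2):    # "listen" leaves the hidden state unchanged
    return 1.0 if s == s2 else 0.0

def O(s2, a, z):    # noisy hearing: correct with probability 0.85
    correct = (s2 == "tiger_left" and z == "hear_left") or \
              (s2 == "tiger_right" and z == "hear_right")
    return 0.85 if correct else 0.15

def update(b, a, z):
    unnorm = {s2: O(s2, a, z) * sum(T(s, a, s2) * b[s] for s in S) for s2 in S}
    eta = sum(unnorm.values())
    return {s2: p / eta for s2, p in unnorm.items()}

b = {"tiger_left": 0.5, "tiger_right": 0.5}
print(update(b, "listen", "hear_left"))   # belief shifts toward tiger_left
```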

SLIDE 26

MDPs and POMDPs

[Figure: the observability spectrum, from POMDPs (partial observability, planning over belief states) to MDPs (full observability); probabilistic planning over a long, unspecified horizon, t = 0, 1, 2, …]

SLIDE 27

MDPs and POMDPs as DBN

  • MDPs and POMDPs are essentially Dynamic Bayesian Networks (DBNs)

[Figure: DBN view of an MDP]

SLIDE 28

MDPs and POMDPs as DBN

  • MDPs and POMDPs are essentially Dynamic Bayesian Networks (DBNs)
  • POMDPs use observations (z1, z2, …) for state estimation

[Figure: DBN view of a POMDP]

SLIDE 29

MDPs and POMDPs Algorithms

  • Many MDP and POMDP algorithms:
      ○ Bellman equation, Value Iteration (VI); classical solvers
      ○ Monte Carlo tree search (MCTS), point-based (approximate) methods [Shani, Pineau, Kaplow 2013]
      ○ And many more…

[Figure: a world model and a goal are input to MDP/POMDP algorithms, which output a policy that interacts with the world]

SLIDE 30

Challenges in MDPs and POMDPs Algorithms

  • MDP/POMDP algorithms are computationally expensive for large, complex domains.
  • The policy is often assumed to be stationary.
  • By themselves, not well-suited for commonsense reasoning.

Will return to these later

SLIDE 31

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 32

Learning for Decision Making

  • Domain knowledge is incomplete and can become inconsistent
  • Decisions made can be incorrect or sub-optimal, e.g.:
      ○ Moving on a newly polished surface
      ○ Inaccurate models of sensors or domain objects
  • Different ways to learn knowledge and use it for decision making:
      ○ Supervised learning from labeled training samples
      ○ Unsupervised learning
      ○ ...
      ○ Learning through trial and error
  • We focus on reinforcement learning for decision making

SLIDE 33

Reinforcement learning (RL)

  • Basic idea:
      ○ State fully observable, actions non-deterministic
      ○ Attempt different actions, receive feedback in the form of rewards
      ○ The agent learns to act so as to maximize the expected cumulative reward
  • Still an MDP:
      ○ Set of states and actions
      ○ Learn a policy π: S ↦ A
      ○ No knowledge of the domain models (T, R); trial-and-error approach

[Figure: agent-environment loop; the agent sends an action, the environment returns the next state and a reward]
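A minimal tabular Q-learning sketch in Python; the toy chain environment and all hyperparameters are assumptions for illustration.

```python
# Tabular Q-learning: learn Q(s, a) from (s, a, r, s') samples.
import random

n_states, actions = 5, ["left", "right"]
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

def step(s, a):
    """Toy chain: deterministic moves, reward 1 only at the right end."""
    s2 = min(s + 1, n_states - 1) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for episode in range(500):
    s = 0
    for _ in range(20):
        # epsilon-greedy action selection (exploration vs. exploitation)
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s2, r = step(s, a)
        # TD update toward r + gamma * max_a' Q(s', a')
        target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)})
```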

SLIDE 34

Reinforcement learning (RL) [Sutton 2018]

  • Different “threads” of RL:
      ○ Trial-and-error approach; origins in psychology
      ○ Dynamic programming approach for stochastic control problems
      ○ Temporal difference methods
  • Challenges:
      ○ Exploration/exploitation, generalization
      ○ Credit assignment
      ○ Model design, reward specification
      ○ Delayed consequences

Image from David Silver

SLIDE 35

RL Algorithms Taxonomy

Image from David Silver

  • Model-based:
      ○ Compute the model parameters T, R; solve the MDP for the value function V(s) or Q-value function Q(s, a)
  • Model-free:
      ○ Directly compute V(s) or Q(s, a) from samples (s, a, r, s’)
  • Policy-based:
      ○ Directly compute the state-action mapping
  • Advanced algorithms:
      ○ State-action abstractions, function approximation through deep learning

SLIDE 36

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 37

Logical Inference Guides Probabilistic Planning

  • Logical reasoning computes informative priors for planning under partial observability
  • Components:
      ○ ASP-based inference with commonsense knowledge sets the probabilistic priors
      ○ Probabilistic planning with these priors using hierarchical POMDPs
      ○ Reasoning about domain-level priors

Zhang, Sridharan, Wyatt. 2015

SLIDE 38

Logical Inference Guides Probabilistic Planning

Looking for a printer… Where to move? Where to look?

  • Early work on commonsense (logical) reasoning guiding probabilistic state estimation
  • Computing probabilistic priors from logical knowledge uses postulates (e.g., objects from a class are often co-located) and psychophysics
  • Knowledge from similar domains provides priors for early termination

Zhang, Sridharan, Wyatt. 2015

SLIDE 39

Logical-Probabilistic Reasoning about Belief State

  • Algorithm CORPP: (logical-probabilistic) commonsense reasoning and probabilistic planning
  • Logical reasoning filters out the irrelevant states
  • Probabilistic reasoning associates a probability with each remaining state
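A minimal Python sketch of this two-step idea; the states, the commonsense rule, and the weights are illustrative assumptions. Logical reasoning prunes impossible states, and probabilistic reasoning normalizes weights over the survivors to yield an informative prior.

```python
# CORPP-style prior construction: logical filtering, then probabilistic weighing.
all_states = [("alice", "coffee"), ("alice", "beer"),
              ("bob", "coffee"), ("bob", "beer")]    # (requester, item)

def logically_possible(state):
    person, item = state
    # e.g., a commonsense rule: no alcohol can be served in the morning
    return item != "beer"

# weights from, e.g., past request frequencies (assumed numbers)
weights = {("alice", "coffee"): 3.0, ("alice", "beer"): 1.0,
           ("bob", "coffee"): 2.0, ("bob", "beer"): 1.0}

possible = [s for s in all_states if logically_possible(s)]
Z = sum(weights[s] for s in possible)
prior = {s: weights[s] / Z for s in possible}   # informative prior for the POMDP
print(prior)
```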

Zhang, Stone. 2015

SLIDE 40
Logical-Probabilistic Reasoning about Belief State

  • CORPP was used with a spoken dialog system for sequential decision-making
  • The dialog manager (a planner) maintains a belief distribution over possible service requests
  • Reasoning initializes the belief distributions with informative priors

Zhang, Stone. 2015

SLIDE 41

Dynamically Factored Belief State

  • The robot receives both sensory information and human-provided declarative knowledge
  • How to accurately incorporate this (noisy, relational) information to achieve goals in a POMDP setting?

Chitnis, Kaelbling, and Lozano-Perez. 2018

SLIDE 42

Dynamically Factored Belief State

  • Idea:
      ○ Join factors when their variables are correlated through observational information
      ○ Separate factors when uncorrelated
  • Robotic cooking domains:
      ○ Involve both locations and ingredients
      ○ The robot is tasked with gathering ingredients and using them to cook a meal

Chitnis, Kaelbling, and Lozano-Perez. 2018

SLIDE 43

Knowledge-based Belief Estimation

[Figure: the observability spectrum from POMDPs (partial observability, belief-state planning) to MDPs (full observability); these approaches use knowledge to estimate the belief state]

Zhang, Sridharan, Wyatt. 2015; Zhang, Stone. 2015; Chitnis, Kaelbling, and Lozano-Perez. 2018

SLIDE 44
Logical-Probabilistic Reasoning about Dynamics

  • Interleaved CORPP (iCORPP):
      ○ Reasons about world dynamics with logical-probabilistic knowledge
      ○ Dynamically constructs transition systems (MDPs/POMDPs) for adaptive planning
  • The transition probability of a navigation action depends on many factors (weather, near-window status, time, human positions, etc.); it is infeasible to consider them all in the (PO)MDPs

Zhang, Khandelwal, Stone. 2017
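A minimal sketch of the idea in Python; the attributes, the probabilities, and the hypothetical build_navigation_mdp function are assumptions for illustration. The point is to reason about the current context first, and then construct only the small MDP that the context warrants.

```python
# Dynamically construct a context-specific MDP instead of one monolithic model.
def build_navigation_mdp(crowded, sunny):
    # success probability of a move action depends on the reasoned context
    p = 0.95
    if crowded:
        p -= 0.25   # humans nearby make motion less reliable
    if sunny:
        p -= 0.10   # glare near windows degrades the range sensor
    T = {("hall", "move"): {"goal": p, "hall": 1.0 - p}}
    R = {("hall", "move"): -1.0}
    return T, R

T, R = build_navigation_mdp(crowded=True, sunny=False)
print(T)   # a small MDP tailored to the current context, solved as usual
```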

SLIDE 45

Logical-Probabilistic Reasoning about Dynamics

  • iCORPP dynamically builds (PO)MDPs by reasoning with knowledge about world dynamics

Zhang, Khandelwal, Stone. 2017

SLIDE 46

Knowledge-based Dynamics Estimation

[Figure: the observability spectrum from POMDPs to MDPs; iCORPP uses knowledge to estimate the dynamics]

Zhang, Khandelwal, Stone. 2017

SLIDE 47

Switching Planner

  • Switches between a classical planner and a probabilistic planner depending on the level of uncertainty [Hanheide et al., 2017]
  • Classical planner: Continual Planning [Brenner, Nebel, 2009]
      ○ Interleaves planning, plan execution and plan monitoring
      ○ Actions assert that preconditions will be met when that point in plan execution is reached
      ○ Replanning is triggered if preconditions are not met during execution, or are met earlier
  • The probabilistic planner computes the actions executed in the physical world.

[Figure: given a task, a PDDL-style classical planner is used when uncertainty is low, and a POMDP-style probabilistic planner when uncertainty is high]

SLIDE 48

Switching Planner

  • Overall architecture:
      ○ Three-layered organization of knowledge (instance, default, diagnostic)
      ○ Three-layered architecture (competence, belief, deliberative)
      ○ Combines first-order logic and probabilistic reasoning for planning
  • Decision-Theoretic PDDL (DTPDDL) is used to represent action preconditions and effects, as well as probabilistic transitions
  • Weak coupling (transfer of information) between the two planning systems

SLIDE 49

REBA: Refinement-based KRR

  • Represent and reason with tightly-coupled transition diagrams at two different resolutions [Sridharan et al., 2018, 2019]
  • For any given goal, non-monotonic logical reasoning with commonsense knowledge at coarse resolution provides a sequence of abstract actions
  • Each abstract transition is implemented as a sequence of fine-resolution concrete actions; the robot automatically zooms to, and reasons probabilistically with, the part of the fine-resolution diagram relevant to the coarse-resolution transition
  • The result of executing each fine-resolution action updates the coarse-resolution history for subsequent reasoning
  • CR-Prolog is used for logical reasoning, and hierarchical POMDPs for probabilistic reasoning

SLIDE 50

REBA: Refinement-based KRR (Example)

  • Examine the transition of a robot moving between two rooms at coarse resolution and at fine resolution

Sridharan, Gelfond, Zhang, Wyatt. 2018

SLIDE 51
REBA: Refinement-based KRR (Example)

  • Goal: loc(B) = kitchen, -in_hand(rob, B), box(B)
  • Initial state: loc(rob) = office, obj_weight(box1, heavy), arm(rob, pneumatic)
  • Based on default knowledge: loc(box1) = office
  • One coarse-resolution plan from ASP-based inference:

move(rob, office), pickup(rob, box1), move(rob, kitchen), putdown(rob, box1)

  • Assume rob is in office; implement pickup(rob, box1), i.e., find and pick up box1
  • Relevant literals: loc(rob) = C1, loc(box1) = C2, where C1, C2 can be any cell in office
  • Possible fine-resolution action sequence (executed probabilistically):

… mov(rob, c3), test(rob, loc(box1), c3),   % box1 observed!
pickup(rob, box1)

  • Subsequent plan steps succeed

Sridharan, Gelfond, Zhang, Wyatt. 2018

SLIDE 52

REBA: Refinement-based KRR

  • Key contributions:
      ○ Tight coupling between transition diagrams
      ○ Theory of observations; formal definitions of refinement and zooming
      ○ Automatic construction of data structures for probabilistic reasoning
      ○ General methodology for the design of software for robots; Dijkstra’s step-wise refinement
      ○ Combines the strengths of declarative programming and probabilistic reasoning
  • Advantages:
      ○ Simplifies and speeds up design; increases confidence in the correctness of the robot’s behavior
      ○ Separation of concerns; reuse of representations on other robots and in other domains
      ○ A single framework for planning, diagnostics and inference; trades off accuracy and efficiency
      ○ Significant improvement in reliability and efficiency; scales to complex domains

Sridharan, Gelfond, Zhang, Wyatt. 2018

SLIDE 53

Comparative Summary of Architectures

Algorithm                      Logical    Probabilistic  Tight     Reasons about  Interleaved reasoning
                               knowledge  knowledge      coupling  dynamics       & planning
Switching planner (2017)       Yes        No             No        No             Yes
ASP-POMDP (2015)               Yes        No             No        No             No
CORPP (2015)                   Yes        Yes            No        No             No
iCORPP (2017)                  Yes        Yes            No        Yes            Yes
Dynamic Factorization (2018)   No         Yes            No        No             Yes
REBA (2018)                    Yes        No             Yes       Yes            Yes

  • Here “knowledge” refers only to declarative knowledge
  • Tight coupling refers to the transfer of all (and only) the relevant information between the logical and probabilistic reasoning components

SLIDE 54

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 55

Domain Approximation for Reinforcement LearnING (DARLING)

  • The reasoner provides a rational way to constrain exploration, while RL eases the requirements on model accuracy.
  • DARLING is composed of three steps (see the sketch after this list):
      1. Plan generation: find all reasonable plans (cost < threshold)
      2. Plan filtering: exclude “certainly-suboptimal” plans, e.g., those with redundant actions, and generate a partial policy
      3. Execution and learning: during exploration, try only the actions returned by the partial policy
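A minimal Python sketch of steps 1-2 and the resulting partial policy; the plans, costs, threshold and redundancy test are illustrative assumptions.

```python
# DARLING-style plan filtering: keep low-cost plans, drop redundant ones,
# and merge the survivors into a partial policy that constrains exploration.
plans = [(["move(lab)", "pickup(book)", "move(office)"], 3.0),
         (["move(lab)", "move(lab)", "pickup(book)", "move(office)"], 4.0),
         (["move(kitchen)", "move(lab)", "pickup(book)", "move(office)"], 9.0)]
threshold = 5.0

def has_redundancy(plan):
    # a cheap proxy: repeating the same action back-to-back is certainly suboptimal
    return any(a == b for a, b in zip(plan, plan[1:]))

reasonable = [p for p, cost in plans
              if cost < threshold and not has_redundancy(p)]

# partial policy: at step i, the agent may only try actions some plan allows
partial_policy = {}
for plan in reasonable:
    for i, action in enumerate(plan):
        partial_policy.setdefault(i, set()).add(action)
print(partial_policy)   # RL explores only within these actions
```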

Leonetti, Iocchi, Stone. 2016

SLIDE 56

Domain Approximation for Reinforcement LearnING (DARLING)

Leonetti, Iocchi, Stone. 2016

[Figure: domain map, and the states traversed during the first and last 50 episodes by the RL (Sarsa) and PRL (knowledge-based RL) agents]

  • Door status is initially unknown; the door is open with increasing probability

SLIDE 57

Domain Approximation for Reinforcement LearnING (DARLING)

Leonetti, Iocchi, Stone. 2016

  • DARLING uses declarative action knowledge to guide robot exploration in reinforcement learning: the robot only tries the reasonable actions

SLIDE 58

Symbolic Deep Reinforcement Learning (SDRL)

  • Symbolic planner: action knowledge for long-term planning
  • Controller: DRL learns each subtask based on intrinsic rewards
  • Meta-controller: learns extrinsic rewards from the controller’s performance, and proposes new intrinsic goals to the planner

[Figure: architecture combining a classical planner, an R-learning meta-controller, and a DQN controller]

Lyu, Yang, Liu, Gustafson. 2019

SLIDE 59

Symbolic Deep Reinforcement Learning (SDRL)

Lyu, Yang, Liu, Gustafson. 2019

  • On Montezuma’s Revenge, hDQN cannot reach a score of 400 within 2.5M samples
  • The variance of SDRL is smaller than hDQN’s
  • The symbolic planner guides primitive sub-policy learning

[Figure: Montezuma’s Revenge, and the optimal policy]

SLIDE 60

Symbolic Deep Reinforcement Learning (SDRL)

Lyu, Yang, Liu, Gustafson. 2019

  • SDRL uses an RL agent to interact with the “real world”, and reports to the task-level agent (task planner) via abstraction.
  • The refinement idea is similar to the REBA architecture [Sridharan et al. 2018, 2019], while SDRL learns from task-completion experience
  • SDRL is follow-up work to PEORL [Yang, Lyu, Liu, Gustafson, 2018], where the perception of RL is symbolic.

SLIDE 61

KRR-RL: integrated logical-probabilistic KRR and model-based RL

  • Logical-probabilistic KRR allows:
      ○ Human (logical) knowledge to specify the transition dependencies
      ○ Model-based RL (R-Max) to fill in the transition probabilities
  • The KRR-RL agent learns domain dynamics from “small” tasks to be prepared for “large” tasks.

Lu, Zhang, Stone, Chen. 2018
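A minimal sketch of the model-learning piece in Python, in the spirit of R-Max-style counting; the KNOWN threshold and the interfaces are assumptions for illustration. Humans specify which factors a transition depends on; visit counts fill in the probabilities.

```python
# Count-based transition estimation, R-Max style.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visits}
KNOWN = 10                                       # visits before (s, a) is "known"

def record(s, a, s2):
    counts[(s, a)][s2] += 1

def transition_estimate(s, a):
    total = sum(counts[(s, a)].values())
    if total < KNOWN:
        return None   # treat as unknown: R-Max plans optimistically here
    return {s2: n / total for s2, n in counts[(s, a)].items()}

for _ in range(12):
    record("door1", "gothrough", "room2")
record("door1", "gothrough", "door1")            # occasional failure
print(transition_estimate("door1", "gothrough"))
```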

SLIDE 62

KRR-RL: integrated logical-probabilistic KRR and model-based RL

  • In spare time, the agent learns from navigation tasks to prepare for upcoming delivery tasks
  • The robot is more cautious on delivery tasks that require significant navigation effort
  • A delivery task requires both dialog and navigation actions

Lu, Zhang, Stone, Chen. 2018

SLIDE 63

KRR-RL: integrated logical-probabilistic KRR and model-based RL

KRR-RL assumptions:

  • Domain experts (humans) are good at providing qualitative action preconditions and effects
  • Model-based RL algorithms do well at learning the quantitative uncertainty of action knowledge

Lu, Zhang, Stone, Chen. 2018

SLIDE 64

TMP-RL: Integrated Task-Motion Planning and RL

  • Task and motion planning (TMP) algorithms generate plans in both symbolic and continuous spaces
      ○ TMP solutions are sensitive to unexpected domain uncertainty and changes
  • TMP-RL features two nested planning-learning loops:
      ○ In the inner TMP loop, the robot generates a low-cost, feasible task-motion plan
      ○ In the outer loop, the plan is executed, and the robot learns from the execution experience via model-free RL

Jiang, Yang, Zhang, Stone. 2018

SLIDE 65

TMP-RL: Integrated Task-Motion Planning and RL

  • TMP-RL performs best in terms of learning rate
  • TMP and TMP-RL have smaller variance during execution
  • TMP does not improve over time

Jiang, Yang, Zhang, Stone. 2018

SLIDE 66

Summary of Knowledge-based RL

Algorithm         Prob. KR  Different    Lookahead  Representation  Model-based  Motion
                            resolutions  in KR      learning        RL           planning
DARLING (2016)    No        No           Yes        No              No           No
SDRL (2018)       No        Yes          Yes        Yes             No           No
KRR-RL (2018)     Yes       No           No         No              Yes          No
PEORL (2018)      No        Yes          Yes        No              No           No
TMP-RL (2018)     No        Yes          Yes        No              No           Yes

There is also research on integrating cognitive architectures with reinforcement learning, such as SHARSHA (2001) and Soar-RL (2004). These (and other such) cognitive architectures support learning and inference.

SLIDE 67

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 68

Learning for Knowledge Revision

  • Many approaches are possible for revising domain knowledge:
      ○ Learning action models from observed effects [Gil, 1994]
      ○ Searching the joint space of hypotheses and observations [Simon, Lea, 1974]
  • Our focus is on declarative knowledge:
      ○ Inductive learning of causal laws [Otero, 2003]
      ○ Expanding the theory of actions, revising ASP system descriptions [Balduccini, 2007; Law et al., 2018]
      ○ Processing perceptual input to learn in a cognitive architecture [Laird, 2012]
  • Interactive task learning [Chai et al., 2018; Laird et al., 2017]:
      ○ Labeled examples or reinforcement; relational RL [Driessens, Ramon, 2003]
      ○ Learning task knowledge using RRL [Bloch, Laird, 2017]
  • Challenges:
      ○ Generalization, e.g., over equivalent axioms with redundant parts
      ○ Actions with delayed effects
      ○ Observations from active exploration and reactive action execution

SLIDE 69

Relational Reinforcement Learning

  • Combines RL with relational/inductive learning, e.g., the Q-RRL algorithm
  • Relational representation of states and actions
  • Typically uses logical decision trees:
      ○ Learn relationally equivalent states and actions
      ○ Each example is a relational database, e.g., a state description in a planning task
      ○ First-order logic instead of attribute-value representations
      ○ Prolog-style queries as tests in internal nodes; binary decision trees (BDTs)
  • Declarative bias for learning relational representations of policies
  • Challenges:
      ○ RRL is typically for a particular planning task (e.g., stacking blocks); it is difficult to learn generic knowledge across tasks (and MDPs)
      ○ Computationally expensive in most practical robotics domains

SLIDE 70
REBA-Interactive Learning

  • Combines declarative programming, probabilistic reasoning and relational reinforcement learning [Sridharan, Meadows, 2017, 2018]
  • Learns parts of the system description (represented as CR-Prolog programs):
      ○ Action descriptions (i.e., actions, preconditions, effects), action capabilities (affordances)
      ○ Axioms, including causal laws and executability conditions

SLIDE 71
REBA-Interactive Learning

  • Non-monotonic logical reasoning (with or without probabilistic reasoning) is used for planning and diagnostics (as in REBA)
  • Interactive learning:
      ○ Verbal input to learn action relations and causal laws
      ○ Active exploration (RRL) of action preconditions and effects
      ○ Reactive exploration (RRL) of unexpected action outcomes
  • ASP-based reasoning guides learning:
      ○ Determines which transitions to explore further
      ○ Selects and defines the relevant MDPs for RRL (active/reactive exploration)
  • Learned domain knowledge is used for subsequent reasoning
  • Tight coupling: bidirectional flow of control and relevant information between reasoning and learning

SLIDE 72

Our Binary Decision Tree

  • Generalizes over MDPs; provides the policy for subsequent Q-learning
  • Computationally efficient, more reliable, scales better
  • Nodes: tests of domain literals
  • Path from root to leaf: a partial state-action pair
  • A leaf is expanded if adding a test reduces the variance of the Q-values (see the sketch below)
  • Generates candidate axioms
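A minimal Python sketch of the expansion test; the examples, literals and Q-values are illustrative assumptions. A leaf splits on a candidate literal only if the split reduces the weighted variance of the Q-values stored at that leaf.

```python
# Variance-reduction test for expanding a leaf of the binary decision tree.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# examples at a leaf: (set of state literals, Q-value)
examples = [({"obj_surface(brittle)"}, -5.0), ({"obj_surface(brittle)"}, -4.5),
            ({"obj_surface(hard)"}, 2.0), ({"obj_surface(hard)"}, 2.5)]

def variance_reduction(examples, literal):
    qs = [q for _, q in examples]
    yes = [q for lits, q in examples if literal in lits]
    no = [q for lits, q in examples if literal not in lits]
    if not yes or not no:
        return 0.0           # the test does not separate the examples
    split = (len(yes) * variance(yes) + len(no) * variance(no)) / len(qs)
    return variance(qs) - split

print(variance_reduction(examples, "obj_surface(brittle)"))  # large -> expand leaf
```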

SLIDE 73
Learning for Knowledge Revision (Example)

  • Goal: loc(C) = office, -in_hand(rob, C), cup(C)
  • Initial state: loc(rob) = office, obj_weight(cup1, light), obj_surface(cup1, brittle)
  • Based on default knowledge: loc(cup1) = kitchen
  • One coarse-resolution plan from ASP-based inference:

move(rob, kitchen), pickup(rob, cup1), move(rob, office), putdown(rob, cup1)

  • Assume rob moves successfully to the kitchen.
  • Next action to implement: pickup(rob, cup1), i.e., find and pick up cup1

SLIDE 74

Learning for Knowledge Revision (Example)

  • Relevant literals: loc(rob) = C1, loc(cup1) = C2, where C1, C2 can be any cell in kitchen
  • Possible fine-resolution action sequence (executed probabilistically):

… mov(rob, c3), test(rob, loc(cup1), c3),   % cup1 observed!
pickup(rob, cup1), ...

  • The robot moves to office and puts the cup down; the cup is then observed to be broken:

obs(obj_status(cup1, damaged), true, 4)

  • This unexpected outcome triggers RRL to learn a previously unknown generic axiom:

putdown(rob, C) causes obj_status(C, damaged) if obj_surface(C, brittle)

SLIDE 75

Tutorial Outline

  • Introduction
  • Basics:
      ○ Knowledge representation: declarative, probabilistic, hybrid
      ○ Reasoning: logic-based, MDP, POMDP
      ○ Learning: reinforcement
  • Example architectures:
      ○ Knowledge guides reasoning
      ○ Knowledge guides learning
      ○ Learning for knowledge revision
  • Discussion

SLIDE 76

Discussion

  • Key capabilities supported by knowledge-based SDM under uncertainty:
      ○ Non-deterministic action outcomes, partial observability
      ○ Reasoning with (incomplete) declarative knowledge
      ○ Efficient learning from interaction experience
  • Important challenges to be addressed by future work:
      ○ Representation for KRR: logical, probabilistic, hybrid? Integration takes considerable effort if different components use different representations
      ○ Benchmark problems and algorithms; comparing and evaluating architectures is difficult
      ○ Formal analysis for trustworthy behavior: completeness and soundness guarantees
      ○ Scaling to large knowledge bases/ontologies and complex relationships
      ○ Explainable decision making

SLIDE 77
References

  • Bach SH, Broecheler M, Huang B, Getoor L (2017). Hinge-loss Markov Random Fields and Probabilistic Soft Logic. Journal of Machine Learning Research, 18(1):3846-3912.
  • Balduccini M, Gelfond M (2003). Logic Programs with Consistency-Restoring Rules. AAAI Spring Symposium on Logical Formalization of Commonsense Reasoning, pages 9-18.
  • Balduccini M (2007). Learning Action Descriptions with A-Prolog: Action Language C. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning.
  • Baral C, Gelfond M, Rushton N (2009). Probabilistic Reasoning with Answer Sets. Theory and Practice of Logic Programming, 9(1):57-144.
  • Bloch MK, Laird JE (2017). Deciding to Specialize and Respecialize a Value Function for Relational Reinforcement Learning. Multi-disciplinary Conference on Reinforcement Learning and Decision Making. Ann Arbor, USA.
  • Brenner M, Nebel B (2009). Continual Planning and Acting in Dynamic Multiagent Environments. Journal of Autonomous Agents and Multiagent Systems, 19(3):297-331.
  • Chai JY, Gao Q, She L, Yang S, Saba-Sadiya S, Xu G (2018). Language to Action: Towards Interactive Task Learning with Physical Agents. International Joint Conference on Artificial Intelligence, Stockholm, Sweden.
  • De Raedt L, Kimmig A, Toivonen H (2007). ProbLog: A Probabilistic Prolog and its Application in Link Discovery.
  • Driessens K, Ramon J (2003). Relational Instance-Based Regression for Relational Reinforcement Learning. International Conference on Machine Learning, pages 123-130. AAAI Press.
  • Gelfond M, Inclezan D (2013). Some Properties of System Descriptions of ALd. Journal of Applied Non-Classical Logics, Special Issue on Equilibrium Logic and Answer Set Programming, 23:105-120.

SLIDE 78

References

  • Gelfond M, Kahl Y (2014). Knowledge Representation, Reasoning, and the Design of Intelligent Agents: The Answer-Set Programming Approach. Cambridge University Press.
  • Gelfond M, Lifschitz V (1998). Action Languages. Computer and Information Science, 3(16).
  • Gil Y (1994). Learning by Experimentation: Incremental Refinement of Incomplete Planning Domains. International Conference on Machine Learning, pages 87-95. New Brunswick, USA.
  • Hanheide M, Göbelbecker M, Horn GS, Pronobis A, Sjöö K, Aydemir A, Jensfelt P, Gretton C, Dearden R, Janicek M, Zender H, Kruijff G-J, Hawes N, Wyatt JL (2017). Robot Task Planning and Explanation in Open and Uncertain Worlds. Artificial Intelligence, 247:119-150.
  • Jiang Y, Yang F, Zhang S, Stone P (2018). Integrating Task-Motion Planning with Reinforcement Learning for Robust Decision Making in Mobile Robots. arXiv preprint arXiv:1811.08955.
  • Kaelbling LP, Littman ML, Cassandra AR (1998). Planning and Acting in Partially Observable Stochastic Domains. Artificial Intelligence, 101(1-2):99-134.
  • Laird JE (2012). The Soar Cognitive Architecture. The MIT Press.
  • Laird JE, et al. (2017). Interactive Task Learning. IEEE Intelligent Systems, 32:6-21.
  • Law M, Russo A, Broda K (2018). The Complexity and Generality of Learning Answer Set Programs. Artificial Intelligence, 259:110-146.
  • Leonetti M, Iocchi L, Stone P (2016). A Synthesis of Automated Planning and Reinforcement Learning for Efficient, Robust Decision-Making. Artificial Intelligence, 241:103-130.
  • Lee J, Wang Y (2018). A Probabilistic Extension of Action Language BC+. Theory and Practice of Logic Programming, 18(3-4):607-622.

SLIDE 79

References

  • Lyu D, Yang F, Liu B, Gustafson S (2019). SDRL: Interpretable and Data-efficient Deep Reinforcement Learning Leveraging Symbolic Planning. AAAI.
  • McDermott D, Ghallab M, Howe A, Knoblock C, Ram A, Veloso M, Weld D, Wilkins D (1998). PDDL: The Planning Domain Definition Language.
  • Otero RP (2003). Induction of the Effects of Actions by Monotonic Methods. International Conference on Inductive Logic Programming, pages 299-310.
  • Puterman ML (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  • Richardson M, Domingos P (2006). Markov Logic Networks. Machine Learning, 62(1-2):107-136.
  • Shani G, Pineau J, Kaplow R (2013). A Survey of Point-based POMDP Solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1-51.
  • Sanner S (2010). Relational Dynamic Influence Diagram Language (RDDL): Language Description. Unpublished ms., Australian National University.
  • Sridharan M, Gelfond M, Zhang S, Wyatt JL (2019). REBA: Refinement-based Architecture for Knowledge Representation and Reasoning in Robotics. To appear in Journal of Artificial Intelligence Research.
  • Sridharan M, Meadows B (2018). Knowledge Representation and Interactive Learning of Domain Knowledge for Human-Robot Interaction. Advances in Cognitive Systems, 7:69-88.
  • Sridharan M, Meadows B (2017). A Combined Architecture for Discovering Affordances, Causal Laws, and Executability Conditions. International Conference on Advances in Cognitive Systems (ACS). Troy, USA.
  • Sutton RS, Barto AG (2018). Reinforcement Learning: An Introduction. MIT Press.

SLIDE 80

References

  • Yang F, Lyu D, Liu B, Gustafson S (2018). PEORL: Integrating Symbolic Planning and Hierarchical Reinforcement Learning for Robust Decision-Making. IJCAI.
  • Younes HL, Littman ML (2004). PPDDL1.0: An Extension to PDDL for Expressing Planning Domains with Probabilistic Effects. Technical Report CMU-CS-04-162.
  • Zhang S, Sridharan M, Wyatt JL (2015). Mixed Logical Inference and Probabilistic Planning for Robots in Unreliable Worlds. IEEE Transactions on Robotics, 31(3):699-713.
  • Zhang S, Stone P (2015). CORPP: Commonsense Reasoning and Probabilistic Planning, as Applied to Dialog with a Mobile Robot. AAAI.
  • Zhang S, Khandelwal P, Stone P (2017). Dynamically Constructed (PO)MDPs for Adaptive Robot Planning. AAAI.

SLIDE 81

Questions and comments
