Game Theoretic Learning for Verification and Control - Sanjit A. Seshia - PowerPoint PPT Presentation



SLIDE 1

Game‐Theoretic Learning for Verification and Control

Sanjit A. Seshia

Professor EECS, UC Berkeley

Dagstuhl Seminar March 16, 2017

Joint work with Dorsa Sadigh, Jon Kotker, Daniel Bundala, Anca Dragan, Alexander Rakhlin, S. Shankar Sastry

SLIDE 2

Two Stories: 1 Control, 1 Verification

Control: Human cyber-physical systems (e.g., autonomous/semi-autonomous driving); learning (synthesizing) models of human behavior.

Verification: Timing analysis of embedded software; learning (synthesizing) a model of the platform (how the platform impacts a program's timing behavior).

S. A. Seshia

SLIDE 3

Challenge: Interactions with Humans and Human‐Controlled Systems outside the Vehicle


“One of the biggest challenges facing automated cars is blending them into a world in which humans don’t behave by the book.”

SLIDE 4

How can we make an autonomous vehicle behave/ communicate “naturally” with (possibly adversarial) humans in its environment?

SLIDE 5

Interaction‐Aware Control

• D. Sadigh, S. Sastry, S. A. Seshia, A. Dragan. Planning for Autonomous Cars that Leverage Effects on Human Actions. In RSS, 2016.
• D. Sadigh, S. Sastry, S. A. Seshia, A. Dragan. Information Gathering Actions over Internal Human State. In IROS, 2016.

SLIDE 6

SLIDE 7

SLIDE 8

Interaction as a Dynamical System

The robot has direct control over its own actions, but only indirect control over the human's actions.

Model the problem as a Stackelberg game: the robot moves first.
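Assuming a toy point-mass model (not the vehicle model from the paper), the interaction dynamics and the Stackelberg move order can be sketched as follows; `step`, `human_response`, and all constants are illustrative stand-ins.

```python
# Minimal sketch of interaction as a dynamical system (illustrative only).
# State x = ((pos, vel) of robot, (pos, vel) of human); the robot directly
# controls its own acceleration u_r and only indirectly influences the
# human's action u_h through the human's response.

def step(x, u_r, u_h, dt=0.1):
    """One step of the joint dynamics x_{t+1} = f(x_t, u_r, u_h)."""
    (pr, vr), (ph, vh) = x
    return ((pr + vr * dt, vr + u_r * dt),
            (ph + vh * dt, vh + u_h * dt))

def human_response(x, u_r):
    """Toy 'rational' human: eases off when the robot accelerates."""
    return -0.5 * u_r  # stands in for the argmax of the human's reward

x = ((0.0, 1.0), (1.0, 1.0))
for _ in range(3):  # robot moves first (Stackelberg), human responds
    u_r = 1.0
    x = step(x, u_r, human_response(x, u_r))
```

The key structural point is that the human's action is a function of the robot's action, so the robot's choice shapes the joint trajectory.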

SLIDE 9

Model Predictive (Receding Horizon) Control: plan for a short time horizon N, replan at every step t. Assume a deterministic "rational" human model: the human optimizes a reward function that is a linear combination of "features".

Assumptions/Simplifications:

• Human has full access to the robot's planned actions for the short time horizon.
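The receding-horizon loop can be sketched in a few lines. Everything here is a toy stand-in under stated assumptions: 1-D dynamics x' = x + u, a reward that prefers position 5, and exhaustive search over a small discrete action set in place of a real trajectory optimizer.

```python
# Sketch of model predictive (receding horizon) control on a toy 1-D model:
# plan N steps ahead, execute only the first action, then replan.

import itertools

def plan(x0, horizon, actions=(-1.0, 0.0, 1.0)):
    """Pick the action sequence maximizing a toy reward over the horizon."""
    def rollout_reward(seq):
        x, r = x0, 0.0
        for u in seq:
            x = x + u             # trivial dynamics stand-in
            r += -(x - 5.0) ** 2  # toy reward: stay close to position 5
        return r
    return max(itertools.product(actions, repeat=horizon), key=rollout_reward)

x, N = 0.0, 3
trajectory = [x]
for t in range(6):          # replan at every step t
    u = plan(x, N)[0]       # execute only the first planned action
    x = x + u
    trajectory.append(x)
```

Executing only the first action of each N-step plan before replanning is the defining feature of receding horizon control.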

SLIDE 10

Interaction as a Dynamical System

Find optimal actions for the autonomous vehicle while accounting for the human response u_H*:

u_R* = argmax_{u_R} R_R(x, u_R, u_H*(x, u_R))

u_H*(x, u_R) = argmax_{u_H} R_H(x, u_R, u_H)

Model the human response u_H* as optimizing the human reward function R_H.

SLIDE 11

Learning (Human) Driver Models

Learn the human's reward function using Inverse Reinforcement Learning [Ziebart et al., AAAI'08; Levine & Koltun, 2012].

Assume structure of the human reward function:

R_H(x, u_R, u_H) = w^T φ(x, u_R, u_H)

(a) Features for the boundaries of the road. (b) Feature for staying inside the lanes. (c) Features for avoiding other vehicles.

• B. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In AAAI, 2008.
• S. Levine, V. Koltun. Continuous inverse optimal control with locally optimal examples. arXiv, 2012.
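The assumed reward structure R_H = w^T φ(x, u_R, u_H) can be sketched directly. The feature definitions below are illustrative stand-ins for features like (b) and (c), not those from the paper, and the weights are made up; IRL's job is to recover w from demonstrations.

```python
# Sketch of a reward as a linear combination of hand-designed features.
# Feature definitions and weights are hypothetical.

def features(x, u_r, u_h):
    human_pos, lane_center, other_pos = x
    return (
        -(human_pos - lane_center) ** 2,             # stay inside the lane
        -1.0 / (abs(human_pos - other_pos) + 0.1),   # avoid other vehicles
        -(u_h ** 2),                                 # smooth control effort
    )

def reward(w, x, u_r, u_h):
    """R_H = w . phi(x, u_R, u_H)."""
    return sum(wi * fi for wi, fi in zip(w, features(x, u_r, u_h)))

w = (1.0, 0.5, 0.1)  # the weights w are what IRL learns from demonstrations
r = reward(w, x=(0.2, 0.0, 3.0), u_r=0.0, u_h=0.5)
```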

SLIDE 12

Solution of Nested Optimization

u_R* = argmax_{u_R} R_R(x, u_R, u_H*(x, u_R))

where u_H*(x, u_R) = argmax_{u_H} R_H(x, u_R, u_H).

• Gradient-based method (quasi-Newton): solve using the L-BFGS technique, computing the gradient of R_R with respect to u_R through the human's best response u_H*.
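A minimal sketch of the nested optimization, substituting plain gradient ascent for the L-BFGS solver named above. The quadratic rewards are illustrative assumptions chosen so the problem has a known closed-form optimum (u_R* = 0.8), which lets the sketch be checked.

```python
# Sketch of the nested optimization: the outer loop optimizes the robot's
# action, differentiating (numerically) through the inner best response.
# Gradient ascent stands in for a quasi-Newton (L-BFGS) solver.

def best_response(u_r, steps=200, lr=0.1):
    """Inner problem: u_H* = argmax_{u_H} R_H, with R_H = -(u_H - u_R/2)^2."""
    u_h = 0.0
    for _ in range(steps):
        grad = -2.0 * (u_h - 0.5 * u_r)
        u_h += lr * grad
    return u_h

def robot_objective(u_r):
    """Outer problem: R_R(u_R, u_H*(u_R)) = -(u_R - 1)^2 - u_H*^2."""
    u_h = best_response(u_r)
    return -(u_r - 1.0) ** 2 - u_h ** 2

def solve(steps=200, lr=0.05, eps=1e-4):
    u_r = 0.0
    for _ in range(steps):
        # numerical gradient through the human's best response
        g = (robot_objective(u_r + eps) - robot_objective(u_r - eps)) / (2 * eps)
        u_r += lr * g
    return u_r

u_r_star = solve()  # analytic optimum of -(u-1)^2 - (u/2)^2 is u = 0.8
```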

SLIDE 13

Implication: Efficiency

[Figure: trajectories of the robot and human vehicles.]

SLIDE 14

Implication: Efficiency

SLIDE 15

Implication: Efficiency

SLIDE 16

Implication: Coordination

SLIDE 17

Implication: Coordination

SLIDE 18

SLIDE 19

[Plot: y-position of the human vehicle vs. x-position of the autonomous vehicle, comparing the cases where the human crosses first and where the human crosses second.]

SLIDE 20

Summary

• Model the control problem as a Stackelberg game.
• Data-driven approach to learning a model of the human as a rational agent maximizing their reward function.
  – Next steps: more realistic human model ("bounded rational" model).
• Combine with the receding horizon control approach to obtain an interaction-aware controller.
  – Next steps: combine with previous work on correct-by-construction control with temporal logic specifications.
• Temporal logic compiled into constraints.
  – Need to improve constrained optimization methods!

SLIDE 21

Two Stories: 1 Verification, 1 Control

Control: Human cyber-physical systems (e.g., autonomous/semi-autonomous driving); learning (synthesizing) models of human behavior.

Verification: Timing analysis of embedded software; learning (synthesizing) a model of the platform (how the platform impacts a program's timing behavior).

SLIDE 22

Game‐Theoretic Timing Analysis


• S. A. Seshia and A. Rakhlin. Quantitative Analysis of Systems Using Game-Theoretic Learning. In ACM Transactions on Embedded Computing Systems (TECS), 2012.
• S. A. Seshia and A. Rakhlin. Game-Theoretic Timing Analysis. In ICCAD, 2008.

SLIDE 23

Challenge in Timing Analysis

Does the brake-by-wire software always actuate the brakes within 1 ms? NASA's Toyota UA report (2011) mentions: "In practice…there are significant limitations" (in the state of the art in timing analysis).

CHALLENGE: ENVIRONMENT MODELING

Need a good model of the platform (processor, memory hierarchy, network, I/O devices, etc.)

SLIDE 24

Complexity of a Timing Model: Path Space x Platform State Space

[Figure: program CFG unrolled to a DAG, with basic blocks such as `flag != 0`, `flag = 1; (*x)++;`, and `*x += 2;`]

On a processor with a data cache, the timing of an edge (basic block) depends on:

• The path it lies on
• The initial platform state

Challenges:

• Exponential number of paths and platform states!
• Lack of visibility into the platform state
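The exponential blow-up in the number of paths is easy to see by counting source-to-sink paths in a DAG. The chained-diamond CFG below is a made-up example, sized so its path count lands at the same ~10^16 scale mentioned on the next slide.

```python
# Sketch: counting source-to-sink paths in a DAG by dynamic programming.
# A chain of n if-then-else "diamonds" has 2^n paths.

def count_paths(adj, src, sink):
    """Number of distinct src->sink paths in a DAG (memoized DFS)."""
    memo = {sink: 1}
    def go(v):
        if v not in memo:
            memo[v] = sum(go(w) for w in adj.get(v, ()))
        return memo[v]
    return go(src)

def diamond_chain(n):
    adj = {}
    for i in range(n):
        adj[3 * i] = (3 * i + 1, 3 * i + 2)  # branch node
        adj[3 * i + 1] = (3 * (i + 1),)      # then-edge
        adj[3 * i + 2] = (3 * (i + 1),)      # else-edge
    return adj

n_paths = count_paths(diamond_chain(54), 0, 3 * 54)  # 2**54, about 1.8e16
```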

SLIDE 25

Example: Automotive Window Controller

• ~1000 lines of C code
• ~10^16 paths

SLIDE 26

Our Approach and Contributions

• Model the estimation problem as a game: Tool vs. Platform.
• Measurement-based, but minimal instrumentation: perform end-to-end measurements of selected (linearly many) paths on the platform.
• Learn an environment model: similar to the online shortest path problem in the 'bandit' setting.
• Online, randomized algorithm: GameTime.
  – Theoretical guarantee: can predict worst-case timing with arbitrarily high probability under model assumptions.
• Uses satisfiability modulo theories (SMT) solvers for test generation.

[S. A. Seshia & A. Rakhlin, ICCAD '08, ACM TECS]

SLIDE 27

The Game Formulation

Complexity = Path Space (controllable) x Platform State Space (uncontrollable)

• Model as a 2-player game: Tool vs. Platform.
  – Tool selects program paths.
  – Platform 'selects' its state (possibly adversarially).
• Questions:
  – What is a good platform model?
  – How can we select paths so that we can learn an accurate platform model from executing them?

SLIDE 28

Platform Model

Weight on each edge of the unrolled CFG: w + π

• Nominal weight w models path-independent timing.
• Path-specific perturbation π models path-dependent timing.

The Platform selects the weights for the edges of the CFG.

SLIDE 29

A Path is a Vector x ∈ {0,1}^m

Each coordinate indicates whether the corresponding edge lies on the path (m = #edges).

Insight: Only need to sample a basis of the space of paths.
SLIDE 30

Basis Paths

• #(basis paths) ≤ m
• Useful to compute certain special bases called "barycentric spanners".
• < 200 basis paths for the automotive window controller.
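The basis idea can be illustrated with ordinary Gaussian elimination over path vectors: since paths live in {0,1}^m, at most m of them are linearly independent. This sketch extracts *some* maximal independent subset; it does not compute the barycentric spanners GameTime actually uses, and the example paths are made up.

```python
# Sketch: extract a basis of the path space by row reduction over rationals.

from fractions import Fraction

def extract_basis(paths):
    """Return a maximal linearly independent subset of the path vectors."""
    basis, rows = [], []
    for p in paths:
        v = [Fraction(c) for c in p]
        for r in rows:  # reduce v against each stored row's pivot
            pivot = next(i for i, c in enumerate(r) if c != 0)
            if v[pivot] != 0:
                f = v[pivot] / r[pivot]
                v = [a - f * b for a, b in zip(v, r)]
        if any(c != 0 for c in v):  # v is independent of the rows so far
            rows.append(v)
            basis.append(p)
    return basis

paths = [
    (1, 1, 0, 0),  # each tuple marks which of the m = 4 edges the path uses
    (0, 0, 1, 1),
    (1, 1, 1, 1),  # dependent: sum of the first two
    (1, 0, 0, 1),
]
basis = extract_basis(paths)
```

Measuring only the basis paths suffices to predict the timing of every other path, since any path vector is a linear combination of the basis.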

SLIDE 31

Timing Analysis Game (Our Model)

Played over several rounds t = 1, 2, 3, …, τ.

At each round t:

• Tool picks a path x_t.
• Platform picks edge weights w_t and a path-dependent perturbation π_t(x_t).
• Tool observes only the end-to-end length ℓ_t = x_t · (w_t + π_t).

Example: with edge weights (1, 5, 7, 11) and perturbation (−1, −1, −1, −1) on a path using all four edges, the tool observes (5 + 7 + 1 + 11) − 4 = 20.

At round τ, the Tool makes a prediction (the longest path x*). The Tool wins iff its prediction is correct.
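The observation model of one round can be sketched directly; the numbers below reproduce the slide's example, and the variable names are just illustrative.

```python
# Sketch of one round of the timing game: the tool sees only the
# end-to-end path length, never the individual edge weights.

def observe(x, w, pi):
    """End-to-end measurement l_t = x_t . (w_t + pi_t)."""
    return sum(xi * (wi + pii) for xi, wi, pii in zip(x, w, pi))

x_t = (1, 1, 1, 1)             # path using all four edges
w_t = (1, 5, 7, 11)            # nominal edge weights chosen by the platform
pi_t = (-1, -1, -1, -1)        # adversarial perturbation on this path
l_t = observe(x_t, w_t, pi_t)  # (1 + 5 + 7 + 11) - 4 = 20
```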

SLIDE 32

Theorem about Estimating the Distribution (pictorial view)

[Plot: distribution of execution times; the estimation error ε is O(b · μ_max), where b is the number of basis paths.]

Mean Perturbation Assumption: ∀x ∈ Paths, |E[x · π_t]| ≤ μ_max.

SLIDE 33

Some Experimental Results

• GameTime is efficient.
  – E.g.: 7 × 10^16 total paths vs. < 200 basis paths.
• Accurately predicts WCET for complex platforms.
  – I & D caches, pipeline, branch prediction, …
• Basis paths effectively encode information about the timing of other paths.
  – Found paths 25% longer than the sampled basis.
• GameTime can accurately estimate the distribution of execution times with few measurements.
  – Measure basis paths, predict other paths.

(details in the ICCAD'08, ACM TECS, and FMCAD'11 papers)

SLIDE 34

Discussion: Qualitative Characterization of the Problems Described

The two problems occupy different points along the axes Adversarial vs. Cooperative and Full Information vs. No Information.

Control/Synthesis:

• Know only the structure of the human reward function beforehand; observe the entire system state.
• The human can behave arbitrarily, albeit only as a rational agent, not actively violating the robot's objective.

Verification/Analysis:

• Almost black-box (w + π) platform model.
• The platform is only constrained by the assumptions on w and π.