Learning Structured Decision Problems with Unawareness
Craig Innes (craig.innes@ed.ac.uk), Alex Lascarides (alex@inf.ed.ac.uk)
Institute for Language, Cognition and Computation, University of Edinburgh
Why Unawareness?
[Figure: the agent's initial decision network, with nodes Fertiliser, Grain, Precipitation, Protein, Yield, and reward R]
X = {Prec, Protein, Yield}, A = {Grain, Fert}, scope(R) = {Yield, Protein}
Pa_Protein = {Grain}, P(Protein = p | Grain = g) = θ_p|g
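As a concrete reading of the conditional probability parameters θ_p|g, here is a minimal Python sketch (a hypothetical representation for illustration; the slides do not prescribe one):

```python
# CPT for P(Protein = p | Grain = g), stored as theta[g][p] = θ_p|g.
theta = {
    0: {0: 0.9, 1: 0.1},  # distribution over Protein when Grain = 0
    1: {0: 0.3, 1: 0.7},  # distribution over Protein when Grain = 1
}
# Each conditional distribution must sum to one.
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in theta.values())
```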
[Figure: the true decision network, with nodes Fertiliser, Nitrogen, Pesticide, Fungicide, Harrow, Grain, Insect Prevalence, Precipitation, Soil Type, Temperature, Infestation, Fungus, Weeds, Protein, Yield, Gross Crops, Local Concern, Bad Press, and reward R]
The agent is initially aware of only a fragment of the true problem:
X⁰ ⊆ X⁺, A⁰ ⊆ A⁺, scope⁰(R) ⊆ scope⁺(R)
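One way to picture this relationship is a model object whose awareness grows as evidence arrives; an illustrative Python sketch with hypothetical names, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class DecisionProblem:
    variables: set     # X: chance variables the agent is aware of
    actions: set       # A: action variables the agent is aware of
    reward_scope: set  # scope(R): variables the reward depends on

# The agent's initial view (X⁰, A⁰) versus the true problem (X⁺, A⁺).
agent = DecisionProblem({"Prec", "Protein", "Yield"},
                        {"Grain", "Fert"},
                        {"Yield", "Protein"})
true_problem = DecisionProblem(
    agent.variables | {"Fungus", "Weeds", "Nitrogen", "Soil Type"},
    agent.actions | {"Fungicide", "Harrow"},
    agent.reward_scope | {"Gross Crops"})

# Unawareness: the agent's sets are strict subsets of the true ones.
assert agent.variables < true_problem.variables
assert agent.actions < true_problem.actions
assert agent.reward_scope < true_problem.reward_scope
```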
Contributions
Our agent learns an interpretable model of a decision problem incrementally via evidence from domain trials and expert advice. Evidence may reveal actions/variables the agent was completely unaware of prior to learning.
Contextual Advice
Types of Advice
1. Advice on Better Actions
2. Resolving Misunderstandings
3. Unexpected Rewards
4. Unknown Effects
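These four message types could be modelled as simple tagged records; the payload fields below are illustrative guesses, not the authors' definitions:

```python
from dataclasses import dataclass

@dataclass
class BetterAction:      # 1. "At time t you should have done a′ rather than a_t"
    t: int
    action: dict         # e.g. {"A1": 0, "A2": 1, "A3": 0}

@dataclass
class Misunderstanding:  # 2. corrects a mismatch between agent and expert vocabulary
    term: str

@dataclass
class UnexpectedReward:  # 3. a reward the agent's current scope(R) cannot explain
    state: dict
    reward: float

@dataclass
class UnknownEffect:     # 4. reveals a dependency the agent was unaware of
    cause: str
    effect: str
```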
Contextual Advice - Better Action

If the agent's performance over the last k trials falls below threshold β of the true policy π⁺, the expert says:
"At time t you should have done a′ = (A1 = 0, A2 = 1, A3 = 0) rather than a_t"

From this single piece of advice the agent can infer:
- Action variable A3 is part of the problem: A3 ∈ A
- A3 is relevant: ∃X ∈ scope(R) such that anc(A3, X)
- There exists a better reward: ∃s, s[B_t] = s_t[B_t] ∧ R⁺(s) > r_t
- a′ has a greater expected utility than a_t: EU(a′ | s) > EU(a_t | s)
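A minimal sketch of how such advice might be consumed (hypothetical helper in Python; here the relevance and expected-utility conditions are simply recorded as constraints rather than checked against a full model):

```python
def process_better_action(actions: set, constraints: list,
                          t: int, better: dict, taken: dict) -> None:
    """Digest 'at time t you should have done a′ rather than a_t'."""
    for var in better:
        # Every action variable the expert mentions is part of the
        # problem (A3 ∈ A) and, by assumption, relevant to scope(R).
        actions.add(var)
    # Record that some state s agreeing with s_t on B_t has R⁺(s) > r_t,
    # and that EU(better | s) > EU(taken | s), for later model fitting.
    constraints.append((t, better, taken))

actions, constraints = {"A1", "A2"}, []
process_better_action(actions, constraints, t=12,
                      better={"A1": 0, "A2": 1, "A3": 0},
                      taken={"A1": 1, "A2": 1})
assert "A3" in actions and constraints
```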
Conserving Previous Beliefs
[Figure: the agent's network, with nodes Fertiliser, Grain, Precipitation, Protein, Yield, and reward R]
P(Pa_Yield | D_0:t) is a distribution over candidate parent sets for Yield: Pa_Yield = ∅, Pa_Yield = {Fert}, ..., Pa_Yield = {Fert, Prec, Grain}
[Figure: the network after discovering Fungus, with nodes Fertiliser, Grain, Precipitation, Protein, Fungus, Yield, and reward R]
Discovering Fungus doubles the hypothesis space: Pa_Yield = ∅, {Fungus}, {Fert}, {Fert, Fungus}, ..., {Fert, Prec, Grain}, {Fert, Prec, Grain, Fungus}
The new prior conserves the agent's old beliefs, moving probability mass ρ onto the parent sets that contain Fungus:
P_new(Pa_X) = (1 − ρ) · P_old(Pa_X | D_0:t)   if Fungus ∉ Pa_X
P_new(Pa_X) = ρ · P_old(Pa′_X | D_0:t)   if Pa_X = Pa′_X ∪ {Fungus}
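This update rule translates directly into code; a minimal Python sketch, with the distribution stored as a dict from frozensets to probabilities:

```python
def conserve_beliefs(prior: dict, new_var: str, rho: float) -> dict:
    """Reweight P(Pa_X | D_0:t) after becoming aware of new_var.

    Mass (1 - rho) stays on each old parent set; mass rho moves to
    its copy extended with new_var, exactly as in the rule above.
    """
    posterior = {}
    for pa, p in prior.items():
        posterior[pa] = (1 - rho) * p          # new_var ∉ Pa_X
        posterior[pa | {new_var}] = rho * p    # Pa_X = Pa′_X ∪ {new_var}
    return posterior

prior = {frozenset(): 0.2,
         frozenset({"Fert"}): 0.5,
         frozenset({"Fert", "Prec", "Grain"}): 0.3}
posterior = conserve_beliefs(prior, "Fungus", rho=0.3)
assert abs(sum(posterior.values()) - 1.0) < 1e-9   # still a distribution
assert posterior[frozenset({"Fert", "Fungus"})] == 0.3 * 0.5
```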
Experiments
- Randomly generated networks: 12–36 variables
- 3000 trials
- ε-greedy exploration strategy (sketched below)
- Expert aid threshold β = 0.1
[Figure: an example randomly generated network, from the agent's initial model ("Start") to the full true network ("Learning Goal"), with action variables A1–A12, variables B1–B12 and O1–O12, and reward R]
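The ε-greedy strategy above, in a generic Python sketch (the authors' exact action-selection details are not reproduced here):

```python
import random

def epsilon_greedy(candidate_actions: list, expected_utility, epsilon: float):
    """Explore a random joint action with probability epsilon,
    otherwise exploit the one with highest estimated EU."""
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    return max(candidate_actions, key=expected_utility)

# Hypothetical usage over two joint actions of the running example.
candidates = [{"Fert": 0, "Grain": 1}, {"Fert": 1, "Grain": 1}]
chosen = epsilon_greedy(candidates, lambda a: 0.7 * a["Fert"], epsilon=0.1)
```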
Results

[Plot: cumulative reward over trials t (500–3000), comparing default, truePolicy, and random]
[Plot: cumulative reward over trials t, comparing default, nonCon, and nonRelevant]