

SLIDE 1

Learning and Reasoning With Incomplete Data: Foundations and Algorithms

Manfred Jaeger

Machine Intelligence Group, Aalborg University. Tutorial, UAI 2010.

SLIDE 2

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 3

Introduction

Key References

1. D. Rubin. Inference and Missing Data. Biometrika 63, 1976.
2. D.F. Heitjan and D. Rubin. Ignorability and Coarse Data. Ann. Stats. 19, 1991.
3. R.D. Gill, M.J. van der Laan and J.M. Robins. Coarsening at Random: Characterizations, Conjectures, Counter-Examples. Proc. 1st Seattle Symposium in Biostatistics, 1997.
4. P.D. Grünwald and J.Y. Halpern. Updating Probabilities. JAIR 19, 2003.
5. M. Jaeger. Ignorability for Categorical Data. Ann. Stats. 33, 2005.
6. M. Jaeger. Ignorability in Statistical and Probabilistic Inference. JAIR 24, 2005.
7. M. Jaeger. The AI&M Procedure for Learning from Incomplete Data. UAI 2006.
8. M. Jaeger. On Testing the Missing at Random Assumption. ECML 2006.
9. R.D. Gill and P.D. Grünwald. An Algorithmic and a Geometric Characterization of Coarsening at Random. Ann. Stats. 36, 2008.

SLIDE 4

Introduction

Learning from Incomplete Data

Partially observed sequence of 10 coin tosses: h, t, ?, h, ?, h, ?, h, t, ?

"Face-value" likelihood function for estimating the probability θ of heads:

L(θ) = Pθ(data) = ∏_{i=1}^{10} Pθ(d_i) = θ^4 · (1 − θ)^2 · 1^4

Maximized by θ = 2/3. Is this correct if "?" means: not reported because . . .

◮ . . . the coin rolled off the table?
◮ . . . one observer does not know whether "harp" is heads or tails of the Irish Euro?
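The face-value estimate above can be reproduced directly; a minimal sketch in which each "?" contributes a factor of 1 to the likelihood:

```python
# Face-value likelihood for the partially observed coin sequence
# h, t, ?, h, ?, h, ?, h, t, ?: four heads, two tails, and four "?" terms
# that each contribute probability 1 regardless of theta.

def face_value_likelihood(theta):
    return theta**4 * (1 - theta)**2 * 1**4

# A grid search over theta recovers the maximizer theta = 2/3.
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=face_value_likelihood)
print(theta_hat)  # 0.667
```

Whether this estimate is trustworthy is exactly the question the two bullets raise: it depends on why the "?" values are missing.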

SLIDE 5

Introduction

Inference by Conditioning

The famous Monty Hall problem.

Argument for staying with the chosen door (conditioning on "prize is not behind door 2"):

P(prize = 1 | prize ≠ 2) = P(prize = 1) / P(prize ∈ {1, 3}) = 1/2

Argument for switching to door 3:

"door 3 'inherits' the probability mass of door 2, and thus P(prize = 3) = 2/3"
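The two arguments can be compared by exact enumeration; a sketch assuming the standard protocol (prize uniform over the three doors, contestant picks door 1, Monty opens a goat door other than door 1, choosing uniformly when he has a choice):

```python
# Exact posterior after Monty opens door 2, versus naive conditioning.
from fractions import Fraction

third = Fraction(1, 3)
# P(Monty opens door 2 AND prize = d) for each door d:
joint = {
    1: third * Fraction(1, 2),  # prize behind 1: Monty opens 2 or 3 at random
    2: third * 0,               # Monty never reveals the prize
    3: third * 1,               # prize behind 3: door 2 is his only option
}
p_open2 = sum(joint.values())
posterior = {d: p / p_open2 for d, p in joint.items()}
print(posterior[1], posterior[3])  # 1/3 2/3

# Naive conditioning on the event "prize is not behind door 2":
naive = third / (third + third)
print(naive)  # 1/2
```

The protocol-aware posterior (2/3 for switching) disagrees with naive conditioning (1/2), which is the point of the slide: conditioning is only valid under assumptions on the observation process.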

SLIDE 6

Introduction

The Common Problem

Can we identify "X is observed" with "X has happened"?

◮ Coin tossing example: X = either h or t
◮ Monty Hall: X = goat behind door 2

SLIDE 7

Coarse Data

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 8

Coarse Data

Missing Values and Coarse Data

Data set with missing values:

     X1     X2     X3
d1   true   ?      high
d2   false  false  ?
d3   true   ?      medium

Other types of incompleteness:
◮ Partly observed values: X3 = high
◮ Constraints on multiple variables: X1 = true or X2 = true

Coarse data model [2]: incomplete observations can correspond to any subset of complete observations
◮ More general than missing values
◮ Same as partial information in probability updating (cf. prize ∈ {1, 3})
◮ Simplifies theoretical analysis

SLIDE 9

Coarse Data

Coarse Data Model

◮ Finite set of states (possible worlds): W = {x1, . . . , xn}
◮ Complete data variable X with values in W, governed by distribution Pθ (θ ∈ Θ)
◮ Incomplete data variable Y with values in 2^W, governed by conditional distribution Pλ(· | X) (λ ∈ Λ)

(Figure: X-space and Y-space, with Pθ = (0.4, 0.4, 0.2) on {x1, x2, x3}, a table of coarsening probabilities Pλ(Y = U | X = x) over the subsets U ⊆ W, and the resulting joint Pθ,λ; e.g. the outcome X = x2 may be reported as Y = {x2, x3}.)
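The model is easy to compute with; a minimal sketch (the numbers below are illustrative, not the slide's table) building the joint Pθ,λ and the marginal over observations:

```python
# Coarse-data model: complete-data distribution P_theta on W and a
# coarsening kernel P_lambda(Y = U | X = x) whose rows sum to 1.
# Each state x can only be reported as a subset U containing x.

P_theta = {"x1": 0.4, "x2": 0.4, "x3": 0.2}

# P_lambda[x][U]: probability of reporting subset U when the truth is x.
P_lambda = {
    "x1": {frozenset({"x1"}): 0.5, frozenset({"x1", "x2"}): 0.5},
    "x2": {frozenset({"x1", "x2"}): 0.5, frozenset({"x2", "x3"}): 0.5},
    "x3": {frozenset({"x2", "x3"}): 1.0},
}

# Joint P_{theta,lambda}(X = x, Y = U) and observation marginal P(Y = U).
joint = {(x, U): P_theta[x] * p
         for x, row in P_lambda.items() for U, p in row.items()}
P_Y = {}
for (x, U), p in joint.items():
    P_Y[U] = P_Y.get(U, 0.0) + p

print(P_Y[frozenset({"x1", "x2"})])  # 0.4*0.5 + 0.4*0.5 = 0.4
```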

SLIDE 10

The CAR Assumption

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 11

The CAR Assumption

Learning from Coarse Data

Data: observations of Y: U = U1, U2, . . . , UN, with Ui ∈ 2^W.

From correct to face-value likelihood:

L(θ, λ | U) = ∏_i Pθ,λ(Y = Ui)
            = ∏_i Σ_{x ∈ Ui} Pθ,λ(Y = Ui, X = x)
            = ∏_i Σ_{x ∈ Ui} Pθ(X = x) Pλ(Y = Ui | X = x)
  [assumption: Pλ(Y = Ui | X = x) is constant for x ∈ Ui]
            = ∏_i Pλ(Y = Ui | X ∈ Ui) Σ_{x ∈ Ui} Pθ(X = x)
            = ∏_i Pλ(Y = Ui | X ∈ Ui) Pθ(Ui)

Profile likelihood:

max_λ L(θ, λ | U) ∼ ∏_i Pθ(Ui)   (the face-value likelihood)

SLIDE 12

The CAR Assumption

Inference by Conditioning

Observation: value of Y: U ∈ 2^W. Updating to posterior belief:

Pθ,λ(X = x | Y = U) = Pθ(X = x) Pλ(Y = U | X = x) / Pθ,λ(Y = U)
  [assumption: Pλ(Y = U | X = x) is constant for x ∈ U]
                    = Pθ(X = x) Pλ(Y = U | X ∈ U) / Pθ,λ(Y = U)
                    = Pθ(X = x) Pθ,λ(X ∈ U | Y = U) / Pθ(X ∈ U)
                    = Pθ(X = x | X ∈ U)

SLIDE 13

The CAR Assumption

Essential CAR

Data (observation) is coarsened at random (CAR) [1,2] if

for all U: Pλ(Y = U | X = x) is constant for x ∈ U   (e-CAR)

The CAR assumption justifies
◮ learning by maximization of the face-value likelihood (EM algorithm)
◮ belief updating by conditioning

Is that it? . . . not quite . . . what does (e-CAR) mean:
◮ for all U: Pλ(Y = U | X = x) is constant on {x | x ∈ U}, or
◮ for all U: Pλ(Y = U | X = x) is constant on {x | x ∈ U, Pθ(X = x) > 0}?

SLIDE 14

The CAR Assumption

Conditioning and Weak CAR

In the justification for conditioning:

Pθ(X = x) Pλ(Y = U | X = x) / Pθ,λ(Y = U) = Pθ(X = x) Pλ(Y = U | X ∈ U) / Pθ,λ(Y = U)

Needed:

for all U: Pλ(Y = U | X = x) is constant on {x | x ∈ U, Pθ(X = x) > 0}   (w-CAR)

SLIDE 15

The CAR Assumption

Profile Likelihood and Strong CAR

In the derivation of the face-value likelihood:

max_λ L(θ, λ | U) = max_λ ∏_i Σ_{x ∈ Ui} Pθ(X = x) Pλ(Y = Ui | X = x)
                  = max_λ ∏_i Pλ(Y = Ui | X ∈ Ui) Pθ(Ui)
                  ≈ ∏_i Pθ(Ui)

◮ Only if the domain of λ-maximization is independent of θ
◮ "Parameter distinctness" [1]
◮ The domain of λ-maximization must not depend on support(Pθ)
◮ If we assume only weak CAR, then the domain of λ-maximization does depend on support(Pθ)
◮ Need:

for all U: Pλ(Y = U | X = x) is constant on {x | x ∈ U}   (s-CAR)

SLIDE 16

The CAR Assumption

Examples

Strong CAR: a model on (x1, x2, x3) with Pθ = (0.4, 0.0, 0.6), whose coarsening probabilities Pλ(Y = U | X = x) are constant on all of {x | x ∈ U}.

Weak CAR, not strong CAR: the same Pθ, with coarsening probabilities that are constant only on {x | x ∈ U, Pθ(X = x) > 0}; the rows for the null state x2 break the s-CAR constancy.
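The distinction can be checked mechanically; a small sketch of an s-CAR/w-CAR checker (the kernel below is illustrative, not the slide's table):

```python
# s-CAR: for every U, P(Y=U | X=x) is constant over all x in U;
# w-CAR: constant only over x in U with P_theta(x) > 0.

def is_car(kernel, support):
    """kernel[x][U] = P(Y=U | X=x); constancy checked over U intersected
    with the given support set."""
    sets = {U for row in kernel.values() for U in row}
    for U in sets:
        vals = [kernel[x].get(U, 0.0) for x in U if x in support]
        if any(abs(v - vals[0]) > 1e-12 for v in vals):
            return False
    return True

W = ("x1", "x2", "x3")
kernel = {
    "x1": {frozenset({"x1", "x2"}): 0.5, frozenset({"x1"}): 0.5},
    "x2": {frozenset({"x1", "x2"}): 0.9, frozenset({"x2"}): 0.1},
    "x3": {frozenset({"x3"}): 1.0},
}
print(is_car(kernel, support=set(W)))        # s-CAR check: False
print(is_car(kernel, support={"x1", "x3"}))  # w-CAR when P(x2) = 0: True
```

With x2 a null state, the non-constant probabilities on {x1, x2} are harmless for w-CAR but violate s-CAR.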

SLIDE 17

The CAR Assumption

Example: Data

State space with 4 states, parametric model, and empirical probabilities from 13 observations:

States:        AB    A¬B        ¬AB        ¬A¬B
Pθ:            ab    a(1 − b)   (1 − a)b   (1 − a)(1 − b)

Observations:  A     A ↔ B      A ↔ ¬B     ¬AB
Empirical:     6/13  3/13       3/13       1/13

SLIDE 18

The CAR Assumption

Example: Face-Value Likelihood

Face-value likelihood function for parameters a, b:

(Figure: surface plot of the face-value likelihood over (a, b) ∈ [0, 1]²; values on the order of 10⁻⁴.)

Maximum at (a, b) ≈ (0.845, 0.636)
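The reported maximum can be checked numerically; a minimal grid-search sketch over the face-value likelihood of the previous slide's data:

```python
# Face-value likelihood for slide 17's data: 6 observations of
# A = {AB, A not-B}, 3 of A<->B = {AB, not-A not-B},
# 3 of A<->not-B = {A not-B, not-A B}, and 1 of the singleton not-A B.

def face_value(a, b):
    p_A = a                               # ab + a(1-b)
    p_iff = a * b + (1 - a) * (1 - b)     # AB or both-false
    p_xor = a * (1 - b) + (1 - a) * b     # exactly one of A, B
    p_nAB = (1 - a) * b
    return p_A**6 * p_iff**3 * p_xor**3 * p_nAB

best = max(((a / 200, b / 200) for a in range(1, 200) for b in range(1, 200)),
           key=lambda ab: face_value(*ab))
print(best)  # close to (0.845, 0.636)
```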

SLIDE 19

The CAR Assumption

Example: Learned Distribution

Distribution learned under the s-CAR assumption:

AB: 0.54   A¬B: 0.31   ¬AB: 0.1   ¬A¬B: 0.05

Observed: A: 6/13, A ↔ B: 3/13, A ↔ ¬B: 3/13, ¬AB: 1/13 (with coarsening parameters λ1, . . . , λ4 on the state-observation edges)

Question: are there s-CAR λ parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W?

No: λ2 = 1 ⇒ λ1 = 0 ⇒ P(Y = A) = 0 ≠ 6/13

SLIDE 20

The CAR Assumption

Example: w-CAR Likelihood

The profile likelihood under w-CAR differs from the face-value likelihood by set-of-support specific constants [5]:

(Figure: surface plot of the w-CAR profile likelihood over (a, b); values on the order of 10⁻⁷.)

Maximum at (a, b) = (9/13, 1.0)

SLIDE 21

The CAR Assumption

Example: Learned Distribution

Distribution learned under the w-CAR assumption:

AB: 9/13   A¬B: 0.0   ¬AB: 4/13   ¬A¬B: 0.0

Observed: A: 6/13, A ↔ B: 3/13, A ↔ ¬B: 3/13, ¬AB: 1/13 (with coarsening parameters 2/3, 1/3 on the edges from AB, and 3/4, 1/4 on the edges from ¬AB)

Question: are there w-CAR λ parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W? Yes!
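The slide's reconstruction is easy to verify with exact arithmetic: the learned marginal together with the stated coarsening parameters reproduces the observed empirical distribution.

```python
# Learned marginal: P(AB) = 9/13, P(not-A B) = 4/13, zero elsewhere;
# coarsening parameters 2/3, 1/3 from AB and 3/4, 1/4 from not-A B.
from fractions import Fraction as F

P_X = {"AB": F(9, 13), "nAB": F(4, 13)}  # "nAB" stands for the state not-A B
kernel = {
    "AB": {"A": F(2, 3), "A<->B": F(1, 3)},
    "nAB": {"A<->-B": F(3, 4), "nAB": F(1, 4)},
}
P_Y = {}
for x, row in kernel.items():
    for U, p in row.items():
        P_Y[U] = P_Y.get(U, F(0)) + P_X[x] * p

print(P_Y["A"], P_Y["A<->B"], P_Y["A<->-B"], P_Y["nAB"])  # 6/13 3/13 3/13 1/13
```

Note that the w-CAR constancy conditions are satisfied vacuously on the zero-probability states, which is exactly why this works under w-CAR but not s-CAR.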

SLIDE 22

The CAR Assumption

Example: Summary

The following were jointly inconsistent:
◮ Observed empirical distribution of Y
◮ Learned distribution of X under the s-CAR assumption
◮ The s-CAR assumption

Jointly consistent were:
◮ Observed empirical distribution of Y
◮ Learned distribution of X under the w-CAR assumption
◮ The w-CAR assumption

SLIDE 23

The CAR Assumption

CAR is everything?

Gill, van der Laan, Robins [3]: "CAR is everything". That is: for every distribution P of Y there exists a joint distribution of X, Y such that
◮ the marginal for Y is P
◮ the joint is s-CAR

(Worked example: the observed distribution A: 6/13, A ↔ B: 3/13, A ↔ ¬B: 3/13, ¬AB: 1/13 is reproduced by an s-CAR joint with complete-data marginal AB: 7/14, A¬B: 5/14, ¬AB: 2/14, ¬A¬B: 0, and coarsening parameters built from the values 6/13 and 7/13.)

SLIDE 24

Testing CAR

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 25

Testing CAR

Testing the Assumptions

(Flowchart: from Data, estimate P(Y); with a parametric model for P(X) and the CAR assumption, learn P(X); test whether P(X), P(Y) are CAR-compatible. If yes: done. If no: reject, or relax the model.)

SLIDE 26

Testing CAR

“Testing CAR”

Testing CAR relative to a parametric model:
◮ Set-of-support analysis
◮ Likelihood-based tests

Absolute CAR "tests":
◮ Compare to canonical models

SLIDE 27

Support Analysis

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 28

Support Analysis

Sets of support

Assume:
◮ P(X) has a given set of support; for simplicity: all of W

SLIDE 29

Support Analysis

CAR and sets of support

Specifications of the set-of-support structure (for X and Y):
◮ CARacterizing matrix [4]: a 0/1 matrix indicating which states x can produce which observations U
◮ Evidence hypergraph [6]

Criteria for CAR-compatibility:
◮ Linear and affine relationships between rows of the CARacterizing matrix
◮ 'Graphical' criteria on the evidence hypergraph
◮ In particular: nested edges ⇒ not CAR

SLIDE 30

Support Analysis

Monty Hypergraphs

The evidence hypergraphs for two possible door-opening scenarios (assuming a fixed chosen door and 3 possible worlds: prize = i, i = 1, 2, 3): Monty's protocol vs. random opening.

For Monty's protocol: the hypergraph is not CAR-compatible.
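CAR-compatibility of a support structure can be phrased as a feasibility question: s-CAR requires constants c_U with, for every state x, the c_U of the sets containing x summing to 1. A brute-force sketch (grid search over c, not an exact solver):

```python
from itertools import product

def car_compatible(states, family, steps=50):
    """Search for constants c_U in [0, 1] with, for each state x,
    sum over U containing x of c_U equal to 1."""
    grid = [i / steps for i in range(steps + 1)]
    for c in product(grid, repeat=len(family)):
        if all(abs(sum(ci for U, ci in zip(family, c) if x in U) - 1) < 1e-9
               for x in states):
            return True
    return False

# Monty's protocol (door 1 chosen): observations {1,3} (door 2 opened)
# and {1,2} (door 3 opened). States 2 and 3 each force their constant
# to 1, which overshoots at state 1: not CAR-compatible.
print(car_compatible({1, 2, 3}, [{1, 3}, {1, 2}]))  # False

# A partition (grouped data) is always CAR-compatible.
print(car_compatible({1, 2, 3}, [{1}, {2, 3}]))  # True
```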

SLIDE 31

Canonical Models

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 32

Canonical Models

Our Question

◮ Which types of data-coarsening processes or protocols generate CAR data?
◮ Is there a general process model that can explain all and only CAR data?

SLIDE 33

Canonical Models

Grouped Data: Examples

Example 1: In a medical study the cigarette consumption of subjects is recorded. Initially the data is collected in packs per day, which is then translated into cigarettes per day. (Table of subject ids with interval-valued counts such as 1-20 and 21-40, some entries missing.)

Example 2: The data consists of always-observed and latent (never observed) variables:

Word:  To   be   or   not   to   be   . . .
POS:   ?    ?    ?    ?     ?    ?    . . .

SLIDE 34

Canonical Models

Grouped Data: Model

◮ A partition 𝒲 of W
◮ P(Y = U | X) = 1 if U ∈ 𝒲 and X ∈ U, and 0 otherwise

Always s-CAR: the kernel is deterministic, and its value (namely 1) is the same for every x in a given block U.
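A minimal sketch of this deterministic grouped-data coarsening, using the cigarettes-per-day example (the ranges are the ones from the previous slide):

```python
# Grouped-data coarsening: W partitioned into blocks; each complete
# value x is reported as its block. This is s-CAR because
# P(Y = U | X = x) = 1 for every x in U, hence constant on U.

W = range(1, 41)  # cigarettes per day
blocks = [frozenset(x for x in W if x <= 20),   # 1 pack  -> "1-20"
          frozenset(x for x in W if x > 20)]    # 2 packs -> "21-40"

def coarsen(x):
    """Report x's partition block."""
    return next(U for U in blocks if x in U)

# Every x in a block is reported as that block, with probability 1.
for U in blocks:
    assert all(coarsen(x) == U for x in U)
print(coarsen(7) == blocks[0], coarsen(25) == blocks[1])  # True True
```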

SLIDE 35

Canonical Models

Multiple Grouped Data: Examples

All or nothing: each case is fully observed or completely unobserved: h, t, ?, h, ?, h, ?, h, t, ?
CAR if observed/unobserved does not depend on the true value.

Cigarettes: some protocols record cigarette counts as # cigarettes, others as # packs. (Table of subject ids mixing exact counts such as 12 and 7 with intervals such as 1-20 and 21-40.)
CAR if cigarette-count vs. pack-count does not depend on # cigarettes.

SLIDE 36

Canonical Models

Multiple Grouped Data: Examples

Patient dropout: a patient's health condition is measured at the end of treatment and at two follow-up examinations:

pid  treat  cond0  cond1  cond2
1    y      +      +      +
2    y      +      −      −
3    n      −      −      ?
4    y      −      ?      ?
5    n      +      +      +

CAR if dropout does not depend on treatment and initial exams.

Missing completely at random: by some random process independent of the values of X1, . . . , Xn it is decided which variables are observed:

id  X1  X2  X3  X4
1   t   f   ?   t
2   ?   f   t   f
3   f   ?   ?   ?
4   t   t   f   f
5   ?   t   f   t

SLIDE 37

Canonical Models

Multiple Grouped Data: Model

◮ Partitions 𝒲1, . . . , 𝒲k of W
◮ Probabilities λ1, . . . , λk
◮ P(Y = U | X) = Σ_{i: X ∈ U ∈ 𝒲i} λi

Always s-CAR. (Figure: example with 𝒲1 chosen with probability 0.3 and 𝒲2 with probability 0.7.)
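A minimal sketch of the multiple-grouped-data kernel with two hypothetical partitions, checking the s-CAR property directly:

```python
# Multiple grouped data (CCAR): pick partition W_i with probability
# lambda_i, then report X's block. P(Y = U | X = x) is the sum of
# lambda_i over i with U in W_i and x in U, which does not depend on
# which x in U occurred: always s-CAR.

W = {1, 2, 3, 4}
partitions = [
    [frozenset({1, 2}), frozenset({3, 4})],
    [frozenset({1}), frozenset({2, 3, 4})],
]
lams = [0.3, 0.7]

def kernel(U, x):
    return sum(l for Wi, l in zip(partitions, lams)
               if U in Wi and x in U)

all_blocks = {U for Wi in partitions for U in Wi}
s_car = all(len({kernel(U, x) for x in U}) == 1 for U in all_blocks)
print(s_car)  # True
```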

SLIDE 38

Canonical Models

Coarsened Completely at Random

Interpretation of Multiple Grouped Data Model:

◮ Random choice of one of k available sensors or tests ◮ Report “coarse measurement” from chosen sensor/test

But beware: nonstandard "sensors" are not allowed. A sensor reporting overlapping windows (e.g. X = 0 ↦ {0, 1}, 1 ↦ {0, 1, 2}, 2 ↦ {1, 2, 3}, 4 ↦ {3, 4, 5}, 5 ↦ {4, 5}) is not based on a partition.

CCAR: data is coarsened completely at random if it is generated by a Multiple Grouped Data model.

SLIDE 39

Canonical Models

Canonical Models 1.0

Proposals for canonical models for CAR processes:

◮ Randomized Monotone Coarsening [3]
◮ CARgen [4]
◮ Both equivalent to CCAR

CAR models not generated by an MGD process do exist, e.g. the earlier "weak CAR, not strong CAR" example.

SLIDE 40

Canonical Models

Beyond MGD

Strong CAR, not MGD: (table with coarsening probabilities of 0.5 throughout).

A challenge . . . "The authors cannot conceive of a more general mechanism than a randomized monotone coarsening scheme for constructing the CAR mechanisms which one would expect to meet with in practice, but is this just a lack of imagination?" (Gill et al., 1997)

SLIDE 41

Canonical Models

Uniform Noise Model

Generating U conditional on X:

U = {X}
for i = 1, . . . , m:
    addnoise = random boolean
    if addnoise = true:
        x = random uniformly selected from W
        U = U ∪ {x}
return U

◮ Generates s-CAR data
◮ Can generate non-MGD data
◮ Cannot generate all s-CAR data
◮ Relies on uniform sampling over W
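A direct Python rendering of the pseudocode above (a sketch; the state space and parameters are illustrative):

```python
# Uniform noise model: start from {x}; m times, flip a fair coin and,
# on heads, add a uniformly chosen state from W (possibly one already
# present, in which case U is unchanged).
import random

def uniform_noise(x, W, m):
    U = {x}
    for _ in range(m):
        if random.random() < 0.5:      # addnoise = random boolean
            U.add(random.choice(W))    # uniform over W
    return U

random.seed(0)
W = ["x1", "x2", "x3", "x4"]
U = uniform_noise("x2", W, m=3)
print("x2" in U, U <= set(W))  # True True
```

By construction the report always contains the true state, so it is a valid coarsening.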

SLIDE 42

Canonical Models

Propose and Test Model

Two solutions for a general (s-)CAR process model:
◮ CARgen* [4]
◮ Propose & Test [6]

Propose & Test:

repeat until success:
    sample U ⊆ W according to Q
    if X ∈ U: success = true
return U

Equivalent are:
◮ data generated by a P&T process where Σ_{U: x ∈ U} Q(U) is constant on {x ∈ W | P(X = x) > 0}
◮ data w-CAR
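A runnable sketch of the Propose & Test loop, with a hypothetical proposal distribution Q over subsets:

```python
# Propose & Test: repeatedly sample a subset U from the proposal Q,
# accepting only when the true state x lies in U.
import random

def propose_and_test(x, proposals, weights, rng):
    while True:
        U = rng.choices(proposals, weights=weights)[0]  # U ~ Q
        if x in U:
            return U

rng = random.Random(1)
proposals = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 3})]
weights = [0.5, 0.3, 0.2]  # an illustrative Q, not tuned to the condition
U = propose_and_test(2, proposals, weights, rng)
print(2 in U)  # True
```

Whether the resulting data is w-CAR depends on Q satisfying the constancy condition above; the proposal here is only for illustration.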

SLIDE 43

Canonical Models

Robustness

Parameter conditions in the models seen so far:
◮ MGD: select a partition according to any probabilities λ1, . . . , λk
◮ Uniform Noise: x = random uniformly selected from W
◮ Propose & Test: sample U ⊆ W according to Q, where Q must satisfy the constancy condition
◮ A similar parameter condition is included in the CARgen* model

Robust procedures: a class of coarsening procedures is robust if it is defined only by its qualitative protocol, not by constraints on its probabilistic parameters.

Robustness and CAR. Equivalent are [6]:
◮ a CAR procedure is robust
◮ a CAR procedure is CCAR (i.e., MGD)

SLIDE 44

Canonical Models

Gill & Grünwald: Uniform Multicovers

Gill & Grünwald [9] construct a general CAR procedure "that does not require fine-tuning of parameters": the more complicated parameter constraints of CARgen* or P&T are replaced by uniform sampling over a complex combinatorial space (multicovers).

SLIDE 45

AI&M and EM

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 46

AI&M and EM

Profile Likelihood

Assumptions on the coarsening mechanism ⇔ constraints on the domain of optimization for λ in the profile likelihood max_λ L(θ, λ | U).

Under the s-CAR assumption:

max_λ L(θ, λ | U) ∼ ∏_i Pθ(Ui)

Under no assumptions [7]:

max_λ L(θ, λ | U) ∼ −min_{c ∈ C(U)} CE(Pc, Pθ)

where
◮ C(U): space of fractional completions of the data U
◮ Pc: empirical distribution defined by c ∈ C(U)

SLIDE 47

AI&M and EM

AI&M Algorithm

Minimize CE(Pc, Pθ) by alternating:

ct := argmin_{c ∈ C(U)} CE(Pc, Pθt)    (AI step)
θt+1 := argmin_{θ ∈ Θ} CE(Pct, Pθ)     (M step)

SLIDE 48

AI&M and EM

AI&M: Example

(Worked example on the four-state model with state probabilities ab, a(1 − b), (1 − a)b, (1 − a)(1 − b): starting from a = 0.5, b = 0.2, the AI step selects a fractional completion Pc of the coarse data, and the M step refits (a, b) to Pc.)

SLIDE 49

AI&M and EM

EM: Example

(Worked example, same model, for EM: at a = 0.5, b = 0.2727, the E step spreads each coarse observation over its compatible states in proportion to the current Pθ, giving the fractional completion Pc used in the M step.)
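The EM iteration for this model can be sketched end to end on the coarse data of slides 17-18; under the s-CAR (face-value) objective it should approach the maximum (a, b) ≈ (0.845, 0.636) reported there:

```python
# EM for the face-value likelihood. States AB, A not-B, not-A B,
# not-A not-B have probabilities ab, a(1-b), (1-a)b, (1-a)(1-b);
# coarse counts: A x6, A<->B x3, A<->not-B x3, not-A B x1.

states = ["AB", "AnB", "nAB", "nAnB"]
obs = [({"AB", "AnB"}, 6), ({"AB", "nAnB"}, 3),
       ({"AnB", "nAB"}, 3), ({"nAB"}, 1)]

def p_state(a, b):
    return {"AB": a * b, "AnB": a * (1 - b),
            "nAB": (1 - a) * b, "nAnB": (1 - a) * (1 - b)}

a, b = 0.5, 0.5
for _ in range(2000):
    p = p_state(a, b)
    counts = dict.fromkeys(states, 0.0)
    for U, n in obs:                      # E step: split each coarse count
        z = sum(p[s] for s in U)          # over its states according to p
        for s in U:
            counts[s] += n * p[s] / z
    total = sum(counts.values())          # M step: refit a = P(A), b = P(B)
    a = (counts["AB"] + counts["AnB"]) / total
    b = (counts["AB"] + counts["nAB"]) / total
print(round(a, 3), round(b, 3))  # near (0.845, 0.636)
```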

SLIDE 50

Statistical CAR Tests

Outline

Part 1: Coarsened At Random
◮ Introduction
◮ Coarse Data
◮ The CAR Assumption
Part 2: CAR Models
◮ Testing CAR
◮ Support Analysis
◮ Canonical Models
Part 3: Learning Without CAR
◮ AI&M and EM
◮ Statistical CAR Tests

SLIDE 51

Statistical CAR Tests

Likelihood Ratios

The AI&M-learned parameters provide a better fit of the data, reflected in the likelihood ratio:

log( Profile-Lik-CAR-Ass(0.5, 0.272) / Profile-Lik-No-Ass(0.5, 0.2) ) = −0.072

Experiment with the 'Asia' network [8]: calculated log-likelihood differences from incomplete 'Asia' data (256 states).

(Figure: histograms of the log-likelihood differences, for non-CAR data and for CAR data.)

SLIDE 52

Statistical CAR Tests

Testing CAR

Likelihood-ratio tests can in principle be used to test CAR relative to a parametric model:

(Flowchart: from Data, estimate P(Y); with a parametric model for P(X) and the CAR assumption, learn P(X); test whether P(X), P(Y) are CAR-compatible. If yes: done. If no: reject, or relax the model.)

Challenges:
◮ Computation of likelihood ratios
◮ Analysis of the distribution of the test statistic

SLIDE 53

Statistical CAR Tests

Summary

The CAR assumption is instrumental for
◮ learning from incomplete data with EM
◮ belief updating by conditioning

"Qualitative CAR tests":
◮ support analysis
◮ canonical models: most (all?) natural CAR models are CCAR, i.e. MGD

Learning without CAR:
◮ maximize the profile likelihood under no assumptions using AI&M
◮ can be the basis for quantitative statistical CAR tests