

1. Learning and Reasoning With Incomplete Data: Foundations and Algorithms
Manfred Jaeger, Machine Intelligence Group, Aalborg University
Tutorial, UAI 2010

2. Outline
Part 1: Coarsened At Random
    Introduction · Coarse Data · The CAR Assumption
Part 2: CAR Models
    Testing CAR · Support Analysis · Canonical Models
Part 3: Learning Without CAR
    AI&M and EM · Statistical CAR Tests

3. Key References (Introduction)
1. D. Rubin, Inference and Missing Data. Biometrika 63, 1976.
2. D.F. Heitjan and D. Rubin, Ignorability and Coarse Data. Ann. Stats. 19, 1991.
3. R.D. Gill, M.J. van der Laan and J.M. Robins, Coarsening at Random: Characterizations, Conjectures, Counter-Examples. Proc. 1st Seattle Symposium in Biostatistics, 1997.
4. P.D. Grünwald and J.Y. Halpern, Updating Probabilities. JAIR 19, 2003.
5. M. Jaeger, Ignorability for Categorical Data. Ann. Stats. 33, 2005.
6. M. Jaeger, Ignorability in Statistical and Probabilistic Inference. JAIR 24, 2005.
7. M. Jaeger, The AI&M Procedure for Learning from Incomplete Data. UAI 2006.
8. M. Jaeger, On Testing the Missing at Random Assumption. ECML 2006.
9. R.D. Gill and P.D. Grünwald, An Algorithmic and a Geometric Characterization of Coarsening at Random. Ann. Stats. 36, 2008.

4. Learning from Incomplete Data (Introduction)
Partially observed sequence of 10 coin tosses: h, t, ?, h, ?, h, ?, h, t, ?
"Face-value" likelihood function for estimating the probability θ of heads:
L(θ) = P_θ(data) = ∏_{i=1}^{10} P_θ(d_i) = θ^4 · (1 − θ)^2 · 1^4
Maximized by θ = 2/3. Is this correct if "?" means: not reported because ...
◮ ... the coin rolled off the table?
◮ ... one observer does not know whether "harp" is heads or tails of the Irish Euro?
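A minimal sketch (not part of the tutorial) of the face-value computation above: it evaluates L(θ) = θ^4 · (1 − θ)^2 on a grid and recovers the maximizer θ = 2/3; the function and variable names are illustrative only.

```python
# Face-value likelihood for the partially observed coin sequence: each "?" contributes
# a factor P(X in {h, t}) = 1 and is effectively ignored.
import numpy as np

data = ['h', 't', '?', 'h', '?', 'h', '?', 'h', 't', '?']

def face_value_likelihood(theta):
    l = 1.0
    for d in data:
        if d == 'h':
            l *= theta
        elif d == 't':
            l *= 1 - theta
    return l

thetas = np.linspace(0.0, 1.0, 1001)
print(thetas[np.argmax([face_value_likelihood(t) for t in thetas])])  # ~0.667 = 2/3
```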

5. Inference by Conditioning (Introduction)
The famous Monty Hall problem (the contestant has picked door 1; the host opens door 2, revealing a goat).
Argument for staying with the chosen door:
P(prize = 1 | prize ≠ 2) = P(prize = 1) / P(prize ∈ {1, 3}) = 1/2
Argument for switching to door 3: "door 3 'inherits' the probability mass of door 2, and thus P(prize = 3) = 2/3"
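To make the slide's point concrete, here is a small simulation sketch (not from the tutorial, assuming the standard Monty Hall protocol in which the host always opens a goat door different from the contestant's): conditioning on the actual observation "the host opened door 2" gives 1/3 vs. 2/3, not the 1/2 obtained by naively conditioning on the event "the prize is not behind door 2".

```python
import random

def simulate(n=200_000, seed=0):
    rng = random.Random(seed)
    stay_wins = switch_wins = opened_2 = 0
    for _ in range(n):
        prize = rng.choice([1, 2, 3])                          # contestant always picks door 1
        host = rng.choice([d for d in (2, 3) if d != prize])   # host opens a goat door != 1
        if host != 2:
            continue                                           # condition on observing "door 2 opened"
        opened_2 += 1
        stay_wins += (prize == 1)
        switch_wins += (prize == 3)
    return stay_wins / opened_2, switch_wins / opened_2

print(simulate())  # roughly (0.333, 0.667): switching wins with probability 2/3
```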

6. The Common Problem (Introduction)
Can we identify "X is observed" with "X has happened"?
Coin tossing example: X: either h or t
Monty Hall: X: goat behind door 2

7. Outline (Coarse Data)
Part 1: Coarsened At Random
    Introduction · Coarse Data · The CAR Assumption
Part 2: CAR Models
    Testing CAR · Support Analysis · Canonical Models
Part 3: Learning Without CAR
    AI&M and EM · Statistical CAR Tests

8. Missing Values and Coarse Data (Coarse Data)
Data set with missing values:

       X1      X2      X3
d1     true    ?       high
d2     false   false   ?
d3     true    ?       medium

Other types of incompleteness:
◮ Partly observed values: X3 ≠ high
◮ Constraints on multiple variables: X1 = true or X2 = true
Coarse data model [2]: incomplete observations can correspond to any subset of complete observations
◮ More general than missing values
◮ Same as partial information in probability updating (cf. prize ∈ {1, 3})
◮ Simplifies theoretical analysis
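As a small illustration (not from the tutorial), a record with missing values can be mapped to the coarse observation it induces, i.e. the set of all complete records compatible with it; the variable domains and helper name below are made up for the example. Constraints such as "X1 = true or X2 = true" also define subsets of complete records, but cannot be expressed as missing values.

```python
from itertools import product

domains = {'X1': [True, False], 'X2': [True, False], 'X3': ['low', 'medium', 'high']}

def compatible_completions(record):
    """record: dict variable -> observed value, or None for a missing value."""
    choices = [[record[v]] if record.get(v) is not None else domains[v] for v in domains]
    return [dict(zip(domains, vals)) for vals in product(*choices)]

d1 = {'X1': True, 'X2': None, 'X3': 'high'}
print(compatible_completions(d1))  # the coarse observation {(true, false, high), (true, true, high)}
```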

9. Coarse Data Model (Coarse Data)
◮ Finite set of states (possible worlds): W = {x1, ..., xn}
◮ Complete data variable X with values in W, governed by distribution P_θ (θ ∈ Θ)
◮ Incomplete data variable Y with values in 2^W, governed by conditional distribution P_λ(· | X) (λ ∈ Λ)
Example with W = {x1, x2, x3}:

P_θ:   P(x1) = 0.4,   P(x2) = 0.4,   P(x3) = 0.2

P_λ(Y = U | X = x) ('-' marks U not containing x):

       {x1,x2,x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1}   {x2}   {x3}
x1     0.5          0.1       0.2       -         0.2    -      -
x2     1.0          0         -         0         -      0      -
x3     0            -         0.5       0.5       -      -      0

Joint P_{θ,λ}(X = x, Y = U) = P_θ(X = x) · P_λ(Y = U | X = x):

       {x1,x2,x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1}   {x2}   {x3}
x1     0.2          0.04      0.08      -         0.08   -      -
x2     0.4          0         -         0         -      0      -
x3     0            -         0.1       0.1       -      -      0

Note the distinction between the event X ∈ {x2, x3} (complete data) and the observation Y = {x2, x3} (coarse data).
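A sketch (not from the tutorial) of the model on this slide in code, with P_θ and P_λ as plain dictionaries: it recovers the joint P_{θ,λ} and the induced distribution of the observation Y, and contrasts P(Y = {x2, x3}) with P(X ∈ {x2, x3}).

```python
p_theta = {'x1': 0.4, 'x2': 0.4, 'x3': 0.2}

# P_lambda(Y = U | X = x); only subsets U containing x get positive probability.
p_lambda = {
    'x1': {('x1', 'x2', 'x3'): 0.5, ('x1', 'x2'): 0.1, ('x1', 'x3'): 0.2, ('x1',): 0.2},
    'x2': {('x1', 'x2', 'x3'): 1.0},
    'x3': {('x1', 'x3'): 0.5, ('x2', 'x3'): 0.5},
}

# Joint P_{theta,lambda}(X = x, Y = U) and the marginal distribution of Y.
joint = {(x, U): p_theta[x] * q for x, row in p_lambda.items() for U, q in row.items()}
p_y = {}
for (x, U), p in joint.items():
    p_y[U] = p_y.get(U, 0.0) + p

print(p_y[('x2', 'x3')])               # 0.1: probability of observing Y = {x2, x3}
print(p_theta['x2'] + p_theta['x3'])   # 0.6: probability of the event X in {x2, x3}
```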

10. Outline (The CAR Assumption)
Part 1: Coarsened At Random
    Introduction · Coarse Data · The CAR Assumption
Part 2: CAR Models
    Testing CAR · Support Analysis · Canonical Models
Part 3: Learning Without CAR
    AI&M and EM · Statistical CAR Tests

11. Learning from Coarse Data (The CAR Assumption)
Data: observations of Y: U = U_1, U_2, ..., U_N with U_i ∈ 2^W
From correct to face-value likelihood:
L(θ, λ | U) = ∏_i P_{θ,λ}(Y = U_i)
            = ∏_i Σ_{x ∈ U_i} P_{θ,λ}(Y = U_i, X = x)
            = ∏_i Σ_{x ∈ U_i} P_θ(X = x) P_λ(Y = U_i | X = x)
            = ∏_i P_λ(Y = U_i | X ∈ U_i) Σ_{x ∈ U_i} P_θ(X = x)    [assuming P_λ(Y = U_i | X = x) constant for x ∈ U_i]
            = ∏_i P_λ(Y = U_i | X ∈ U_i) P_θ(U_i)
Profile likelihood: max_λ L(θ, λ | U) ∼ ∏_i P_θ(U_i)    (face-value likelihood)
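A minimal sketch (not from the tutorial) of the face-value likelihood ∏_i P_θ(U_i) for a list of coarse observations; the toy data below are made up.

```python
def face_value_likelihood(p_theta, observations):
    """p_theta: dict state -> probability; observations: list of sets of states U_i."""
    l = 1.0
    for U in observations:
        l *= sum(p_theta[x] for x in U)   # P_theta(X in U_i)
    return l

p_theta = {'x1': 0.4, 'x2': 0.4, 'x3': 0.2}
obs = [{'x1'}, {'x1', 'x2'}, {'x2', 'x3'}, {'x1', 'x2', 'x3'}]
print(face_value_likelihood(p_theta, obs))   # 0.4 * 0.8 * 0.6 * 1.0 = 0.192
```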

12. Inference by Conditioning (The CAR Assumption)
Observation: value of Y: U ∈ 2^W. Updating to posterior belief:
P_{θ,λ}(X = x | Y = U) = P_θ(X = x) P_λ(Y = U | X = x) / P_{θ,λ}(Y = U)
                       = P_θ(X = x) P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)    [assuming P_λ(Y = U | X = x) constant for x ∈ U]
                       = P_θ(X = x) P_{θ,λ}(X ∈ U | Y = U) / P_θ(X ∈ U)
                       = P_θ(X = x | X ∈ U)    [since Y = U implies X ∈ U, so P_{θ,λ}(X ∈ U | Y = U) = 1]
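A small numeric check (not from the tutorial, with a made-up CAR mechanism) that under CAR the exact posterior P(X = x | Y = U) coincides with naive conditioning P_θ(X = x | X ∈ U), as the derivation above shows.

```python
p_theta = {'x1': 0.5, 'x2': 0.3, 'x3': 0.2}
# A strong-CAR mechanism: P_lambda(Y = U | X = x) depends only on U, not on x in U.
lam = {('x1', 'x2', 'x3'): 0.4, ('x1', 'x2'): 0.3, ('x1', 'x3'): 0.3, ('x2', 'x3'): 0.3}

U = ('x1', 'x3')
post = {x: p_theta[x] * lam[U] for x in U}                      # P(X = x, Y = U)
z = sum(post.values())
post = {x: p / z for x, p in post.items()}                      # Bayes: P(X = x | Y = U)
cond = {x: p_theta[x] / sum(p_theta[y] for y in U) for x in U}  # P_theta(X = x | X in U)
print(post)   # {'x1': 0.714..., 'x3': 0.285...}
print(cond)   # identical
```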

13. Essential CAR (The CAR Assumption)
Data (an observation) is coarsened at random (CAR) [1,2] if
    for all U: P_λ(Y = U | X = x) is constant for x ∈ U    (e-CAR)
The CAR assumption justifies
◮ learning by maximization of the face-value likelihood (EM algorithm)
◮ belief updating by conditioning
Is that it? ... not quite: what does (e-CAR) mean?
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0}
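As an illustration (not part of the tutorial), a checker for the two readings of (e-CAR); the representation of the coarsening mechanism as a dict of dicts is an arbitrary choice.

```python
def is_car(p_lambda, p_theta=None, tol=1e-9):
    """Strong CAR if p_theta is None (constancy over all x in U);
    weak CAR if p_theta is given (constancy only over x in U with P_theta(x) > 0)."""
    subsets = {U for row in p_lambda.values() for U in row}
    for U in subsets:
        xs = [x for x in U if p_theta is None or p_theta.get(x, 0.0) > 0]
        vals = [p_lambda.get(x, {}).get(U, 0.0) for x in xs]
        if vals and max(vals) - min(vals) > tol:
            return False
    return True

# The mechanism from the conditioning example satisfies both readings:
p_lambda = {
    'x1': {('x1', 'x2', 'x3'): 0.4, ('x1', 'x2'): 0.3, ('x1', 'x3'): 0.3},
    'x2': {('x1', 'x2', 'x3'): 0.4, ('x1', 'x2'): 0.3, ('x2', 'x3'): 0.3},
    'x3': {('x1', 'x2', 'x3'): 0.4, ('x1', 'x3'): 0.3, ('x2', 'x3'): 0.3},
}
print(is_car(p_lambda))                                      # True (strong CAR)
print(is_car(p_lambda, {'x1': 0.5, 'x2': 0.3, 'x3': 0.2}))   # True (weak CAR)
```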

14. Conditioning and Weak CAR (The CAR Assumption)
In the justification for conditioning:
P_θ(X = x) P_λ(Y = U | X = x) / P_{θ,λ}(Y = U) = P_θ(X = x) P_λ(Y = U | X ∈ U) / P_{θ,λ}(Y = U)
(for x with P_θ(X = x) = 0 both sides vanish, so constancy is only needed on the support)
Needed:
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U, P_θ(X = x) > 0}    (w-CAR)

15. Profile Likelihood and Strong CAR (The CAR Assumption)
In the derivation of the face-value likelihood:
max_λ L(θ, λ | U) = max_λ ∏_i Σ_{x ∈ U_i} P_θ(X = x) P_λ(Y = U_i | X = x)
                  = max_λ ∏_i P_λ(Y = U_i | X ∈ U_i) P_θ(U_i)
                  ≈ ∏_i P_θ(U_i)
◮ Only valid if the domain of λ-maximization is independent of θ ("parameter distinctness" [1])
◮ The domain of λ-maximization must not depend on support(P_θ)
◮ If we assume only weak CAR, then the domain of λ-maximization does depend on support(P_θ)
◮ Need:
    for all U: P_λ(Y = U | X = x) is constant on {x | x ∈ U}    (s-CAR)

16. Examples (The CAR Assumption)
Strong CAR (P_θ and the coarsening probabilities P_λ(Y = U | X = x); '-' marks U not containing x):

       P_θ   {x1}   {x2}   {x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1,x2,x3}
x1     0.4   0.3    -      -      0.2       0.4       -         0.1
x2     0.0   -      0.7    -      0.2       -         0         0.1
x3     0.6   -      -      0.5    -         0.4       0         0.1

Weak CAR, not strong CAR (P_λ(· | x2) is left unspecified, since P_θ(x2) = 0; no completion of the x2 row can make this strong CAR, because 0.6 + 0.5 + 0.1 already exceeds 1):

       P_θ   {x1}   {x2}   {x3}   {x1,x2}   {x1,x3}   {x2,x3}   {x1,x2,x3}
x1     0.4   0.1    -      -      0.6       0.2       -         0.1
x2     0.0
x3     0.6   -      -      0.2    -         0.2       0.5       0.1

17. Example: Data (The CAR Assumption)
State space with 4 states (the joint values of two binary variables A, B), a parametric model, and the empirical probabilities of the coarse observations from 13 observations:

Observed coarse value:   A      A↔B    ¬A↔B   ¬AB
Empirical frequency:     6/13   3/13   3/13   1/13

Parametric model (A and B independent with P(A) = a, P(B) = b):
P(AB) = ab,   P(A¬B) = a(1−b),   P(¬AB) = (1−a)b,   P(¬A¬B) = (1−a)(1−b)

18. Example: Face-Value Likelihood (The CAR Assumption)
Face-value likelihood function for parameters a, b:
[3D surface plot of the face-value likelihood over (a, b)]
Maximum at (a, b) ≈ (0.845, 0.636)
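A grid-search sketch (not from the tutorial) over the face-value likelihood of this example; the observation labels in the dictionaries follow the reconstruction of the data slide above.

```python
import numpy as np

counts = {'A': 6, 'A<->B': 3, '¬A<->B': 3, '¬AB': 1}   # the 13 coarse observations

def face_value_loglik(a, b):
    p = {
        'A':      a,                           # {AB, A¬B}
        'A<->B':  a * b + (1 - a) * (1 - b),   # {AB, ¬A¬B}
        '¬A<->B': a * (1 - b) + (1 - a) * b,   # {A¬B, ¬AB}
        '¬AB':    (1 - a) * b,                 # {¬AB}
    }
    return sum(n * np.log(p[U]) for U, n in counts.items())

grid = np.linspace(0.005, 0.995, 199)
best = max((face_value_loglik(a, b), a, b) for a in grid for b in grid)
print(best[1:])   # close to (0.845, 0.636), the maximum reported on the slide
```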

19. Example: Learned Distribution (The CAR Assumption)
Distribution learned under the s-CAR assumption (rows: states with their learned marginal; columns: observed coarse values with their empirical frequencies; λ_i: coarsening probabilities to be found):

              A (6/13)   A↔B (3/13)   ¬A↔B (3/13)   ¬AB (1/13)
AB     0.54   λ1         λ2
A¬B    0.31   λ1                      λ3
¬AB    0.1                            λ3            λ4
¬A¬B   0.05              λ2

Question: are there s-CAR λ parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W?
No: the only observed value containing ¬A¬B is A↔B, so λ2 = 1; by s-CAR this forces P(Y = A↔B | X = AB) = 1 as well, hence λ1 = 0, and therefore P(Y = A) = 0 ≠ 6/13.

20. Example: w-CAR Likelihood (The CAR Assumption)
The profile likelihood under w-CAR differs from the face-value likelihood by set-of-support specific constants [5]:
[3D surface plot of the w-CAR profile likelihood over (a, b)]
Maximum at (a, b) = (9/13, 1.0)

21. Example: Learned Distribution (The CAR Assumption)
Distribution learned under the w-CAR assumption (rows: states with their learned marginal; entries: coarsening probabilities, only needed on the support):

              A (6/13)   A↔B (3/13)   ¬A↔B (3/13)   ¬AB (1/13)
AB     9/13   2/3        1/3
A¬B    0.0
¬AB    4/13                           3/4            1/4
¬A¬B   0.0

Question: are there w-CAR λ parameters defining a joint distribution of X, Y with the learned marginal on W and the observed empirical distribution on 2^W? Yes!
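A quick arithmetic check (not from the tutorial) that the w-CAR solution above reproduces the observed empirical distribution of Y; exact fractions are used to avoid rounding.

```python
from fractions import Fraction as F

p_x = {'AB': F(9, 13), '¬AB': F(4, 13)}       # states with positive learned probability
p_lam = {                                     # coarsening probabilities on the support
    'AB':  {'A': F(2, 3), 'A<->B': F(1, 3)},
    '¬AB': {'¬A<->B': F(3, 4), '¬AB': F(1, 4)},
}

p_y = {}
for x, row in p_lam.items():
    for U, q in row.items():
        p_y[U] = p_y.get(U, F(0)) + p_x[x] * q

print(p_y)   # A: 6/13, A<->B: 3/13, ¬A<->B: 3/13, ¬AB: 1/13, i.e. the empirical distribution
```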

22. Example: Summary (The CAR Assumption)
The following were jointly inconsistent:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the s-CAR assumption
◮ the s-CAR assumption
Jointly consistent were:
◮ the observed empirical distribution of Y
◮ the distribution of X learned under the w-CAR assumption
◮ the w-CAR assumption

23. CAR is everything? (The CAR Assumption)
Gill, van der Laan, Robins [3]: "CAR is everything". That is: for every distribution P of Y there exists a joint distribution of X, Y such that
◮ the marginal for Y is P
◮ the joint is s-CAR
For the example data:

              A (6/13)   A↔B (3/13)   ¬A↔B (3/13)   ¬AB (1/13)
AB     7/14   7/13       6/13
A¬B    5/14   7/13                    6/13
¬AB    2/14                           6/13           7/13
¬A¬B   0                 6/13

(the remaining coarsening probability 7/13 of ¬A¬B goes to the unobserved singleton {¬A¬B})
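A verification sketch (not from the tutorial) of the "CAR is everything" construction on this slide: the coarsening probabilities depend only on the observed set (strong CAR), each state's probabilities sum to 1, and the induced Y-marginal is exactly the empirical distribution.

```python
from fractions import Fraction as F

subsets = {                       # observed coarse values plus the unobserved singleton {¬A¬B}
    'A':      ('AB', 'A¬B'),
    'A<->B':  ('AB', '¬A¬B'),
    '¬A<->B': ('A¬B', '¬AB'),
    '¬AB':    ('¬AB',),
    '{¬A¬B}': ('¬A¬B',),
}
p_x = {'AB': F(7, 14), 'A¬B': F(5, 14), '¬AB': F(2, 14), '¬A¬B': F(0)}
lam = {'A': F(7, 13), 'A<->B': F(6, 13), '¬A<->B': F(6, 13), '¬AB': F(7, 13), '{¬A¬B}': F(7, 13)}

# Strong CAR holds by construction (lam depends on the subset only); rows must still sum to 1:
for x in p_x:
    assert sum(lam[U] for U, members in subsets.items() if x in members) == 1

p_y = {U: sum(p_x[x] for x in members) * lam[U] for U, members in subsets.items()}
print(p_y)   # A: 6/13, A<->B: 3/13, ¬A<->B: 3/13, ¬AB: 1/13, {¬A¬B}: 0
```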
