SLIDE 1

Uncertainty

10

AI Slides (5e) © Lin Zuoquan@PKU 2003-2019

SLIDE 2

10 Uncertainty

10.1 Uncertainty
10.2 Probability

  • Syntax and semantics
  • Inference
  • Independence
  • Bayes’ rule

10.3 Bayesian networks
10.4 Probabilistic reasoning∗

  • enumeration
  • variable elimination
  • stochastic simulation
  • Markov chain Monte Carlo

10.5 Dynamic Bayesian networks∗
10.6 Probabilistic logic∗

SLIDE 3

Uncertainty

Let action At = leave for airport t minutes before flight
Will At get me there on time?

Problems:
1) partial observability (road state, other drivers’ plans, etc.)
2) noisy sensors (traffic radio)
3) uncertainty in action outcomes (flat tire, etc.)
4) immense complexity of modelling and predicting traffic

Hence a purely logical approach either
1) risks falsehood: “A25 will get me there on time”, or
2) leads to conclusions that are too weak for decision making:
“A25 will get me there on time if there’s no accident on the bridge and it doesn’t rain and my tires remain intact, etc.”
(A1440 might reasonably be said to get me there on time, but I’d have to stay overnight in the airport . . .)

SLIDE 4

Uncertainty in knowledge representation and reasoning

Nonmonotonic logic: assume A25 works unless contradicted by evidence
  Issues: How to handle quantification? Which assumptions are reasonable?
Rules with fudge factors:
  A25 →0.3 AtAirportOnTime
  Sprinkler →0.99 WetGrass
  WetGrass →0.7 Rain
  Issues: problems with combination, e.g., does Sprinkler cause Rain?
Probability
  Given the available evidence, A25 will get me there on time with probability 0.04
Fuzzy logic handles degree of truth, NOT uncertainty
  e.g., WetGrass is true to degree 0.2
Qualitative vs. quantitative ⇒ Logic vs. probability ⇐ Prob. logics

SLIDE 5

Probability

Probabilistic assertions summarize effects of
  laziness: failure to enumerate exceptions, qualifications, etc.
  ignorance: lack of relevant facts, initial conditions, etc.
Subjective or Bayesian probability
Probabilities relate propositions to one’s own state of knowledge
  e.g., P(A25|no reported accidents) = 0.06
These are not claims of a “probabilistic tendency” in the current situation
(but might be learned from past experience of similar situations)
Probabilities of propositions change with new evidence
  e.g., P(A25|no reported accidents, 5 a.m.) = 0.15
(Analogous to logical entailment KB ⊨ α: not truth, but nonmonotonic in nature)

SLIDE 6

Why use probability?

The definitions imply that certain logically related events must have related probabilities E.g., P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

(Venn diagram: events A and B as overlapping regions within the set of all possible worlds, True)

de Finetti (1931): an agent who bets according to probabilities that violate these axioms can be forced to bet so as to lose money regardless of outcome

SLIDE 7

Syntax and semantics

Traditional probability theory has an informal language; it needs to be formalized for agents
Begin with a set Ω, the sample space
  e.g., the 6 possible rolls of a die
ω ∈ Ω is a sample point (outcome/possible world/atomic event)
A probability space or probability model is a sample space with an assignment P(ω) for every ω ∈ Ω s.t.
  (1) 0 ≤ P(ω) ≤ 1
  (2) Σω P(ω) = 1
  e.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
An event A is any subset of Ω
  P(A) = Σ{ω∈A} P(ω)
  e.g., P(die roll < 4) = P(1) + P(2) + P(3) = 1/6 + 1/6 + 1/6 = 1/2
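A minimal Python sketch (mine, not from the slides) of this die-roll probability model:

  from fractions import Fraction

  # Sample space: the six faces of a fair die, each with probability 1/6
  P = {omega: Fraction(1, 6) for omega in range(1, 7)}

  def prob(event):
      """P(A) = sum of P(omega) over the sample points in event A."""
      return sum(P[omega] for omega in event)

  assert sum(P.values()) == 1                    # axiom: the probabilities sum to 1
  print(prob({w for w in P if w < 4}))           # P(die roll < 4) = 1/2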

SLIDE 8

Random variables

A random variable is a function from sample points to some range
– Booleans (propositions)
  e.g., Cavity (do I have a cavity?)
  Cavity = true is a proposition, also written Cavity
– Discrete (finite or infinite)
  e.g., Weather is one of ⟨sunny, rain, cloudy, snow⟩
  Weather = rain is a proposition
  Values must be exhaustive and mutually exclusive
– Continuous or real (bounded or unbounded)
  e.g., Temp = 21.6; also allow, e.g., Temp < 22.0
Arbitrary Boolean combinations of basic propositions

SLIDE 9

Probability distribution

P induces a (probability) distribution for any r.v. (random variable) X:
  P(X = xi) = Σ{ω : X(ω) = xi} P(ω)
which gives values for all possible assignments
E.g., P(Odd = true) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2
The probability of a proposition such as Odd = true is the sum of the probabilities of the worlds in which it holds

SLIDE 10

Propositions

Think of a proposition as the event (set of sample points) where the proposition is true
Given Boolean r.v.s A and B:
  event a = set of sample points where A(ω) = true
  event ¬a = set of sample points where A(ω) = false
  event a ∧ b = points where A(ω) = true and B(ω) = true
The sample points are defined by the values of a set of r.v.s
i.e., the sample space is the Cartesian product of the ranges of the r.v.s

SLIDE 11

Propositions

With Boolean r.v.s, sample point (possible world) = propositional logic model
  e.g., A = true, B = false, or a ∧ ¬b
A possible world is defined to be an assignment of values to all of the r.v.s under consideration
– possible worlds are mutually exclusive and exhaustive (why??)
Proposition = disjunction of atomic events in which it is true
  e.g., (a ∨ b) ≡ (¬a ∧ b) ∨ (a ∧ ¬b) ∨ (a ∧ b)
  ⇒ P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
For any proposition φ, ω ⊨ φ for every possible world ω where φ is true
Hint: (propositional) logic + probability ⇒ probabilistic logic

SLIDE 12

Axioms of probability

For any propositions A, B

  • 1. 0 ≤ P(A) ≤ 1
  • 2. P(True) = 1 and P(False) = 0
  • 3. P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

(Venn diagram: events A and B as overlapping regions within True)

A probability is a measure over a set of events that satisfies three axioms
⇒ probability theory is analogous to logical theory (axioms)
e.g., P(¬a) = 1 − P(a) is derived from the axioms
P(a ∨ b) = P(a) + P(b) − P(a ∧ b) (inclusion-exclusion principle)

SLIDE 13

Prior probability

Prior or unconditional probabilities of propositions
  e.g., P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72
correspond to belief prior to the arrival of any (new) evidence
Probability distribution gives values for all possible assignments:
  P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩ (normalized, i.e., sums to 1)

SLIDE 14

Joint probability distribution

Joint probability distribution for a set of r.v.s gives the probability of every atomic event on those r.v.s (i.e., every sample point)
P(Weather, Cavity) = a 4 × 2 matrix of values:

                  Weather =  sunny   rain   cloudy   snow
  Cavity = true              0.144   0.02   0.016    0.02
  Cavity = false             0.576   0.08   0.064    0.08

Every question about a domain can be answered by the joint distribution because every event is a sum of sample points
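The same joint distribution as a Python dictionary, with events answered by summing sample points (a sketch; the names are mine):

  # The 4 x 2 joint distribution P(Weather, Cavity) from the slide, as a dict.
  joint = {
      ("sunny", True): 0.144, ("rain", True): 0.02, ("cloudy", True): 0.016, ("snow", True): 0.02,
      ("sunny", False): 0.576, ("rain", False): 0.08, ("cloudy", False): 0.064, ("snow", False): 0.08,
  }

  def P(pred):
      """Probability of any event: sum the sample points where pred holds."""
      return sum(p for (weather, cavity), p in joint.items() if pred(weather, cavity))

  print(P(lambda w, c: c))                   # P(Cavity = true)  = 0.2
  print(P(lambda w, c: w == "sunny" or c))   # P(sunny or cavity)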

SLIDE 15

Probability for continuous variables

Express the distribution as a parameterized function of value:
  P(X = x) = U[18, 26](x) = uniform density between 18 and 26

(Plot: a uniform density of height 0.125 over the interval [18, 26])

Here P is a density; it integrates to 1
P(X = 20.5) = 0.125 really means
  lim dx→0 P(20.5 ≤ X ≤ 20.5 + dx)/dx = 0.125

SLIDE 16

Conditional probability

Conditional or posterior probabilities
  e.g., P(cavity|toothache) = 0.8
i.e., given that toothache is all I know
NOT “if toothache then 80% chance of cavity”
Conditional distribution for all values of the r.v.s:
  P(Cavity|Toothache) = 2-element vector of 2-element vectors
If we know more, e.g., cavity is also given, then we have
  P(cavity|toothache, cavity) = 1
Note: the less specific belief remains valid after more evidence arrives, but is not always useful
New evidence may be irrelevant, allowing simplification, e.g.,
  P(cavity|toothache, 49ersWin) = P(cavity|toothache) = 0.8
This kind of inference, sanctioned by domain knowledge, is crucial

SLIDE 17

Conditional probability

Definition of conditional probability in terms of unconditional probabilities:
  P(a|b) = P(a ∧ b) / P(b) if P(b) ≠ 0
Product rule gives an alternative formulation:
  P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)
A general version holds for whole distributions, e.g.,
  P(Weather, Cavity) = P(Weather|Cavity)P(Cavity)
(View as a 4 × 2 set of equations, not matrix multiplication)
Chain rule is derived by successive application of the product rule:
  P(X1, . . . , Xn) = P(X1, . . . , Xn−1) P(Xn|X1, . . . , Xn−1)
                    = P(X1, . . . , Xn−2) P(Xn−1|X1, . . . , Xn−2) P(Xn|X1, . . . , Xn−1)
                    = . . .
                    = Πi=1..n P(Xi|X1, . . . , Xi−1)

SLIDE 18

Inference

Probabilistic inference is the computation of posterior probabilities for query propositions given observed evidence
The full joint distribution can be viewed as the KB from which answers to all questions may be derived
Start with the joint distribution:

                 toothache             ¬toothache
               catch   ¬catch        catch   ¬catch
  cavity       .108    .012          .072    .008
  ¬cavity      .016    .064          .144    .576

For any proposition φ, sum the atomic events where it is true:
  P(φ) = Σ{ω : ω ⊨ φ} P(ω)

SLIDE 19

Inference by enumeration

Start with the joint distribution (table as on Slide 18)

For any proposition φ, sum the atomic events where it is true:
  P(φ) = Σ{ω : ω ⊨ φ} P(ω)

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2

SLIDE 20

Inference by enumeration

Start with the joint distribution (table as on Slide 18)

For any proposition φ, sum the atomic events where it is true:
  P(φ) = Σ{ω : ω ⊨ φ} P(ω)

P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

SLIDE 21

Inference by enumeration

Start with the joint distribution (table as on Slide 18)

Can also compute conditional probabilities:
  P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                       = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

SLIDE 22

Normalization

(Joint distribution table as on Slide 18)

The denominator can be viewed as a normalization constant α:
  P(Cavity|toothache) = α P(Cavity, toothache)
    = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
    = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
    = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩
Idea: compute the distribution on the query variable by fixing evidence variables and summing over hidden variables
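A sketch (mine) of inference by enumeration and normalization over the dentistry joint distribution, using the values from Slide 18:

  # Full joint P(Toothache, Catch, Cavity), keyed by (toothache, catch, cavity).
  joint = {
      (True, True, True): 0.108,   (True, False, True): 0.012,
      (False, True, True): 0.072,  (False, False, True): 0.008,
      (True, True, False): 0.016,  (True, False, False): 0.064,
      (False, True, False): 0.144, (False, False, False): 0.576,
  }

  def P(pred):
      return sum(p for w, p in joint.items() if pred(*w))

  # P(Cavity | toothache): sum out the hidden variable Catch, then normalize.
  unnorm = [P(lambda t, c, cav: t and cav), P(lambda t, c, cav: t and not cav)]
  alpha = 1 / sum(unnorm)
  print([alpha * x for x in unnorm])   # -> [0.6, 0.4]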

SLIDE 23

Inference by enumeration contd.

Let X be all the variables. Ask for the posterior joint distribution of the query variables Y given specific values e for the evidence variables E
Let the hidden variables be H = X − Y − E
Then the required summation of joint entries is done by summing out the hidden variables:
  P(Y|E = e) = α P(Y, E = e) = α Σh P(Y, E = e, H = h)
The terms in the summation are joint entries because Y, E, and H together exhaust the set of random variables
Problems
1) Worst-case time complexity O(d^n), where d is the largest arity
2) Space complexity O(d^n) to store the joint distribution
3) How to find the numbers for O(d^n) entries?

SLIDE 24

Independence

A and B are independent iff
  P(A|B) = P(A)  or  P(B|A) = P(B)  or  P(A, B) = P(A)P(B)

(Figure: the joint over {Weather, Toothache, Catch, Cavity} decomposes into {Toothache, Catch, Cavity} and {Weather})

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity)P(Weather)
32 entries reduced to 12; for n independent biased coins, 2^n → n
Absolute independence is powerful but rare
Dentistry is a large field with hundreds of variables, none of which are independent. What to do?

SLIDE 25

Conditional independence

P(Toothache, Cavity, Catch) has 2^3 − 1 = 7 independent entries
If I have a cavity, the probability that the probe catches in it doesn’t depend on whether I have a toothache:
  (1) P(catch|toothache, cavity) = P(catch|cavity)
The same independence holds if I haven’t got a cavity:
  (2) P(catch|toothache, ¬cavity) = P(catch|¬cavity)
Catch is conditionally independent of Toothache given Cavity:
  P(Catch|Toothache, Cavity) = P(Catch|Cavity)
Equivalent statements:
  P(Toothache|Catch, Cavity) = P(Toothache|Cavity)
  P(Toothache, Catch|Cavity) = P(Toothache|Cavity)P(Catch|Cavity)

SLIDE 26

Conditional independence

Write out the full joint distribution using the chain rule:
  P(Toothache, Catch, Cavity)
    = P(Toothache|Catch, Cavity)P(Catch, Cavity)
    = P(Toothache|Catch, Cavity)P(Catch|Cavity)P(Cavity)
    = P(Toothache|Cavity)P(Catch|Cavity)P(Cavity)
i.e., 2 + 2 + 1 = 5 independent numbers
In most cases, the use of conditional independence reduces the size of the representation of the joint distribution from exponential in n to linear in n
Conditional independence is our most basic and robust form of knowledge about uncertainty

SLIDE 27

Bayes’ Rule

Product rule P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)
⇒ Bayes’ rule P(a|b) = P(b|a)P(a) / P(b)
or in distribution form
  P(Y|X) = P(X|Y)P(Y) / P(X) = α P(X|Y)P(Y)
Useful for assessing diagnostic probability from causal probability:
  P(Cause|Effect) = P(Effect|Cause)P(Cause) / P(Effect)
E.g., let M be meningitis, S be stiff neck:
  P(m|s) = P(s|m)P(m) / P(s) = 0.8 × 0.0001 / 0.1 = 0.0008

SLIDE 28

Bayes’ Rule and conditional independence

P(Cavity|toothache ∧ catch)
  = α P(toothache ∧ catch|Cavity)P(Cavity)
  = α P(toothache|Cavity)P(catch|Cavity)P(Cavity)
This is an example of a naive Bayes model (Bayesian classifier):
  P(Cause, Effect1, . . . , Effectn) = P(Cause) Πi P(Effecti|Cause)

(Figure: Cavity → {Toothache, Catch}, and in general Cause → {Effect1, . . . , Effectn})

Total number of parameters is linear in n
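A small naive Bayes sketch (mine); the CPT numbers below are the ones implied by the dentistry joint distribution on Slide 18:

  import math

  p_cavity = 0.2
  p_effect_given_cavity = {          # P(effect | Cavity) for each observable effect
      True:  {"toothache": 0.6, "catch": 0.9},
      False: {"toothache": 0.1, "catch": 0.2},
  }

  def naive_bayes_posterior(observed_effects):
      # alpha * P(Cause) * prod_i P(effect_i | Cause), for Cause in {cavity, not cavity}
      unnorm = {}
      for cavity in (True, False):
          prior = p_cavity if cavity else 1 - p_cavity
          unnorm[cavity] = prior * math.prod(p_effect_given_cavity[cavity][e] for e in observed_effects)
      alpha = 1 / sum(unnorm.values())
      return {c: alpha * v for c, v in unnorm.items()}

  print(naive_bayes_posterior(["toothache", "catch"]))   # P(cavity | toothache, catch) ~ 0.871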

SLIDE 29

Example: Wumpus World

(Figure: 4 × 4 wumpus-world grid; squares [1,1], [1,2], and [2,1] are visited and OK, with breezes observed in [1,2] and [2,1])

Pi,j = true iff [i, j] contains a pit
Bi,j = true iff [i, j] is breezy
Include only B1,1, B1,2, B2,1 in the probability model

SLIDE 30

Specifying the probability model

The full joint distribution is P(P1,1, . . . , P4,4, B1,1, B1,2, B2,1)
Apply the product rule: P(B1,1, B1,2, B2,1 | P1,1, . . . , P4,4) P(P1,1, . . . , P4,4)
(Do it this way to get P(Effect|Cause))
First term: 1 if pits are adjacent to breezes, 0 otherwise
Second term: pits are placed randomly, probability 0.2 per square:
  P(P1,1, . . . , P4,4) = Πi,j P(Pi,j) = 0.2^n × 0.8^(16−n) for n pits

SLIDE 31

Observations and query

We know the following facts:
  b = ¬b1,1 ∧ b1,2 ∧ b2,1
  known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
Query is P(P1,3|known, b)
Define Unknown = the Pi,j’s other than P1,3 and Known
For inference by enumeration, we have
  P(P1,3|known, b) = α Σunknown P(P1,3, unknown, known, b)
This grows exponentially with the number of squares

SLIDE 32

Using conditional independence

Basic insight: observations are conditionally independent of other hidden squares given neighbouring hidden squares

(Figure: the 4 × 4 grid partitioned into KNOWN squares, the QUERY square, its FRINGE, and the OTHER squares)

Define Unknown = Fringe ∪ Other
  P(b|P1,3, Known, Unknown) = P(b|P1,3, Known, Fringe)
Manipulate the query into a form where we can use this

SLIDE 33

Using conditional independence

P(P1,3|known, b)
  = α Σunknown P(P1,3, unknown, known, b)
  = α Σunknown P(b|P1,3, known, unknown) P(P1,3, known, unknown)
  = α Σfringe Σother P(b|known, P1,3, fringe, other) P(P1,3, known, fringe, other)
  = α Σfringe Σother P(b|known, P1,3, fringe) P(P1,3, known, fringe, other)
  = α Σfringe P(b|known, P1,3, fringe) Σother P(P1,3, known, fringe, other)
  = α Σfringe P(b|known, P1,3, fringe) Σother P(P1,3)P(known)P(fringe)P(other)
  = α P(known) P(P1,3) Σfringe P(b|known, P1,3, fringe) P(fringe) Σother P(other)
  = α′ P(P1,3) Σfringe P(b|known, P1,3, fringe) P(fringe)

SLIDE 34

Using conditional independence

(Figure: the three fringe models consistent with P1,3 = true, with probabilities 0.2 × 0.2 = 0.04, 0.2 × 0.8 = 0.16, 0.8 × 0.2 = 0.16, and the two fringe models consistent with P1,3 = false, with probabilities 0.2 × 0.2 = 0.04 and 0.2 × 0.8 = 0.16)

P(P1,3|known, b) = α′ ⟨0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16)⟩ ≈ ⟨0.31, 0.69⟩
P(P2,2|known, b) ≈ ⟨0.86, 0.14⟩

SLIDE 35

Bayesian networks

BNs: a graphical notation for conditional independence assertions and hence for compact specification of full joint distributions
(alias Probabilistic Graphical Models, PGMs)
Syntax:
  a set of nodes, one per variable
  a directed, acyclic graph (link ≈ “directly influences”)
  a conditional distribution for each node given its parents: P(Xi|Parents(Xi))
In the simplest case, the conditional distribution is represented as a conditional probability table (CPT) giving the distribution over Xi for each combination of parent values

SLIDE 36

Example

Topology of the network encodes conditional independence assertions:

(Figure: Weather as an isolated node; Cavity with children Toothache and Catch)

Weather is independent of the other variables
Toothache and Catch are conditionally independent given Cavity

SLIDE 37

Example

I’m at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn’t call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Network topology reflects “causal” knowledge:
– A burglar can set the alarm off
– An earthquake can set the alarm off
– The alarm can cause Mary to call
– The alarm can cause John to call

SLIDE 38

Example

(Burglary network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls)

  P(B) = .001        P(E) = .002

  B   E   P(A|B,E)
  T   T   .95
  T   F   .94
  F   T   .29
  F   F   .001

  A   P(J|A)         A   P(M|A)
  T   .90            T   .70
  F   .05            F   .01

SLIDE 39

Compactness

A CPT for Boolean Xi with k Boolean parents has 2^k rows for the combinations of parent values

(Figure: the burglary network B, E → A → J, M)

Each row requires one number p for Xi = true (the number for Xi = false is just 1 − p)
If each variable has no more than k parents, the complete network requires O(n · 2^k) numbers
i.e., grows linearly with n, vs. O(2^n) for the full joint distribution
For the burglary net, 1 + 1 + 4 + 2 + 2 = 10 numbers (vs. 2^5 − 1 = 31)
In certain cases (assumptions of conditional independence), BNs make O(2^n) ⇒ O(kn) (NP ⇒ P!)

SLIDE 40

Global semantics

Global semantics defines the full joint distribution

(Figure: the burglary network B, E → A → J, M)

as the product of the local conditional distributions:
  P(x1, . . . , xn) = Πi=1..n P(xi|parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e) =

SLIDE 41

Global semantics

Global semantics defines the full joint distribution

(Figure: the burglary network B, E → A → J, M)

as the product of the local conditional distributions:
  P(x1, . . . , xn) = Πi=1..n P(xi|parents(Xi))

e.g., P(j ∧ m ∧ a ∧ ¬b ∧ ¬e)
  = P(j|a) P(m|a) P(a|¬b, ¬e) P(¬b) P(¬e)
  = 0.9 × 0.7 × 0.001 × 0.999 × 0.998 ≈ 0.00063

SLIDE 42

Local semantics

Local semantics: each node is conditionally independent of its nondescendants (the Zi,j) given its parents (the Ui)

(Figure: node X with parents U1, . . . , Um, children Y1, . . . , Yn, and nondescendants Z1,j, . . . , Zn,j)

Theorem: local semantics ⇔ global semantics

SLIDE 43

Markov blanket

Each node is conditionally independent of all others given its Markov blanket: parents + children + children’s parents

(Figure: node X with its Markov blanket shaded: parents Ui, children Yi, and the children’s other parents Zi,j)

SLIDE 44

Constructing Bayesian networks

Algorithm: a series of locally testable assertions of conditional independence guarantees the required global semantics

  • 1. Choose an ordering of variables X1, . . . , Xn
  • 2. For i = 1 to n
        add Xi to the network
        select parents from X1, . . . , Xi−1 such that P(Xi|Parents(Xi)) = P(Xi|X1, . . . , Xi−1)

This choice of parents guarantees the global semantics:
  P(X1, . . . , Xn) = Πi=1..n P(Xi|X1, . . . , Xi−1)   (chain rule)
                    = Πi=1..n P(Xi|Parents(Xi))        (by construction)
Each node is conditionally independent of its other predecessors in the node (partial) ordering, given its parents

SLIDE 45

Example

Suppose we choose the ordering M, J, A, B, E

(Network so far: MaryCalls, JohnCalls)

P(J|M) = P(J)?

SLIDE 46

Example

Suppose we choose the ordering M, J, A, B, E

(Network so far: MaryCalls, JohnCalls, Alarm)

P(J|M) = P(J)? No P(A|J, M) = P(A|J)? P(A|J, M) = P(A)?

SLIDE 47

Example

Suppose we choose the ordering M, J, A, B, E

(Network so far: MaryCalls, JohnCalls, Alarm, Burglary)

P(J|M) = P(J)? No P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No P(B|A, J, M) = P(B|A)? P(B|A, J, M) = P(B)?

SLIDE 48

Example

Suppose we choose the ordering M, J, A, B, E

(Network so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)

P(J|M) = P(J)? No P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No P(B|A, J, M) = P(B|A)? Yes P(B|A, J, M) = P(B)? No P(E|B, A, J, M) = P(E|A)? P(E|B, A, J, M) = P(E|A, B)?

SLIDE 49

Example

Suppose we choose the ordering M, J, A, B, E

(Network so far: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)

P(J|M) = P(J)? No P(A|J, M) = P(A|J)? P(A|J, M) = P(A)? No P(B|A, J, M) = P(B|A)? Yes P(B|A, J, M) = P(B)? No P(E|B, A, J, M) = P(E|A)? No P(E|B, A, J, M) = P(E|A, B)? Yes

SLIDE 50

Example

(Final network under the ordering M, J, A, B, E: MaryCalls, JohnCalls, Alarm, Burglary, Earthquake)

Assessing conditional probabilities is hard in noncausal directions Network can be far more compact than the full joint distribution But, this network is less compact: 1 + 2 + 4 + 2 + 4 = 13 (due to the ordering of the variables)

SLIDE 51

Probabilistic reasoning

  • Exact inference by enumeration
  • Exact inference by variable elimination
  • Approximate inference by stochastic simulation
  • Approximate inference by Markov chain Monte Carlo

SLIDE 52

Reasoning tasks in BNs (PGMs)

Simple queries: compute the posterior marginal P(Xi|E = e)
  e.g., P(NoGas|Gauge = empty, Lights = on, Starts = false)
Conjunctive queries: P(Xi, Xj|E = e) = P(Xi|E = e) P(Xj|Xi, E = e)
Optimal decisions: decision networks include utility information; probabilistic inference is required for P(outcome|action, evidence)
Value of information: which evidence to seek next?
Sensitivity analysis: which probability values are most critical?
Explanation: why do I need a new starter motor?

SLIDE 53

Inference by enumeration

A slightly intelligent way to sum out variables from the joint without actually constructing its explicit representation
Simple query on the burglary network:

(Figure: the burglary network B, E → A → J, M)

  P(B|j, m) = P(B, j, m)/P(j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)
Rewrite full joint entries using products of CPT entries:
  P(B|j, m) = α Σe Σa P(B)P(e)P(a|B, e)P(j|a)P(m|a)
            = α P(B) Σe P(e) Σa P(a|B, e)P(j|a)P(m|a)
Recursive depth-first enumeration: O(n) space, O(d^n) time

SLIDE 54

Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network with variables {X} ∪ E ∪ Y
  Q(X) ← a distribution over X, initially empty
  for each value xi of X do
    Q(xi) ← Enumerate-All(bn.Vars, e_xi)
      where e_xi is e extended with X = xi
  return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
  if Empty?(vars) then return 1.0
  Y ← First(vars)
  if Y has value y in e
    then return P(y | parents(Y)) × Enumerate-All(Rest(vars), e)
    else return Σy P(y | parents(Y)) × Enumerate-All(Rest(vars), e_y)
      where e_y is e extended with Y = y
SLIDE 55

Evaluation tree

Summing at the “+” nodes

(Figure: the evaluation tree for P(b|j, m), branching on e and then a, with leaves such as P(j|a)P(m|a))

Enumeration is inefficient: repeated computation
  e.g., it computes P(j|a)P(m|a) for each value of e
Improved by eliminating such repeated computation

SLIDE 56

Inference by variable elimination

Variable elimination: carry out summations right-to-left, storing intermediate results (factors) to avoid recomputation

P(B|j, m) = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)
(one factor per variable: B, E, A, J, M)
  = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) fM(a)
  = α P(B) Σe P(e) Σa P(a|B, e) fJ(a) fM(a)
  = α P(B) Σe P(e) Σa fA(a, b, e) fJ(a) fM(a)
  = α P(B) Σe P(e) fĀJM(b, e)    (sum out A)
  = α P(B) fĒĀJM(b)               (sum out E)
  = α fB(b) × fĒĀJM(b)

SLIDE 57

Variable elimination: Basic operations

Summing out a variable from a product of factors:
  move any constant factors outside the summation
  add up submatrices in the pointwise product of the remaining factors
    Σx f1 × · · · × fk = f1 × · · · × fi Σx fi+1 × · · · × fk = f1 × · · · × fi × fX̄
  assuming f1, . . . , fi do not depend on X
Pointwise product of factors f1 and f2:
  f1(x1, . . . , xj, y1, . . . , yk) × f2(y1, . . . , yk, z1, . . . , zl) = f(x1, . . . , xj, y1, . . . , yk, z1, . . . , zl)
  e.g., f1(a, b) × f2(b, c) = f(a, b, c)
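A sketch of these two factor operations in Python (the factor representation and names are mine):

  from itertools import product

  # A factor is (list of variable names, {tuple of Boolean values: number}).
  def pointwise_product(f1, f2):
      v1, t1 = f1
      v2, t2 = f2
      vars_ = v1 + [v for v in v2 if v not in v1]
      table = {}
      for vals in product([True, False], repeat=len(vars_)):
          a = dict(zip(vars_, vals))
          table[vals] = t1[tuple(a[v] for v in v1)] * t2[tuple(a[v] for v in v2)]
      return vars_, table

  def sum_out(var, f):
      vars_, table = f
      i = vars_.index(var)
      new_vars = vars_[:i] + vars_[i + 1:]
      new_table = {}
      for vals, p in table.items():
          key = vals[:i] + vals[i + 1:]
          new_table[key] = new_table.get(key, 0.0) + p
      return new_vars, new_table

  # e.g., f1(A, B) x f2(B, C), then sum out B:
  f1 = (["A", "B"], {(True, True): 0.3, (True, False): 0.7, (False, True): 0.9, (False, False): 0.1})
  f2 = (["B", "C"], {(True, True): 0.2, (True, False): 0.8, (False, True): 0.6, (False, False): 0.4})
  print(sum_out("B", pointwise_product(f1, f2)))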

SLIDE 58

Variable elimination algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
  inputs: X, the query variable
          e, observed values for variables E
          bn, a belief network specifying joint distribution P(X1, . . . , Xn)
  factors ← [ ]
  for each var in Order(bn.Vars) do
    factors ← [Make-Factor(var, e) | factors]
    if var is a hidden variable then factors ← Sum-Out(var, factors)
  return Normalize(Pointwise-Product(factors))

SLIDE 59

Irrelevant variables

Consider the query P(JohnCalls|Burglary = true)

(Figure: the burglary network B, E → A → J, M)

  P(J|b) = α P(b) Σe P(e) Σa P(a|b, e) P(J|a) Σm P(m|a)
The sum over m is identically 1; M is irrelevant to the query
Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)
Here, X = JohnCalls, E = {Burglary}, and Ancestors({X} ∪ E) = {Alarm, Earthquake},
so MaryCalls is irrelevant
(Compare this to backward chaining from the query in Horn clause KBs)

SLIDE 60

Irrelevant variables

Defn: the moral graph of a Bayes net: marry all parents and drop arrows
Defn: A is m-separated from B by C iff they are separated by C in the moral graph
Thm 2: Y is irrelevant if it is m-separated from X by E

(Figure: the burglary network B, E → A → J, M)

For P(JohnCalls|Alarm = true), both Burglary and Earthquake are irrelevant

SLIDE 61

Complexity of exact inference

Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)
Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

(Figure: a network encoding a 3-CNF formula with an AND node over the clauses
  1. A ∨ B ∨ C
  2. C ∨ D ∨ ¬A
  3. B ∨ C ∨ ¬D
 where the root variables A, B, C, D each have prior 0.5)

SLIDE 62

Inference by stochastic simulation

Idea
1) Draw N samples from a sampling distribution S

(Figure: a coin with probability 0.5)

2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P
Methods
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process whose stationary distribution is the true posterior

SLIDE 63

Sampling from an empty network

Direct sampling from a network that has no evidence associated (sampling each variable in turn, in topological order)

function Prior-Sample(bn) returns an event sampled from P(X1, . . . , Xn) specified by bn
  inputs: bn, a Bayesian network specifying joint distribution P(X1, . . . , Xn)
  x ← an event with n elements
  for each variable Xi in X1, . . . , Xn do
    xi ← a random sample from P(Xi | Parents(Xi))
  return x
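A Python sketch (mine) of direct prior sampling for the sprinkler network used on the following slides; the CPT entry P(w|¬s, ¬r) = 0.01 is as read from the slide:

  import random

  # Cloudy -> Sprinkler, Cloudy -> Rain, {Sprinkler, Rain} -> WetGrass
  P_C = 0.5
  P_S = {True: 0.10, False: 0.50}                  # P(Sprinkler=true | Cloudy)
  P_R = {True: 0.80, False: 0.20}                  # P(Rain=true | Cloudy)
  P_W = {(True, True): 0.99, (True, False): 0.90,  # P(WetGrass=true | Sprinkler, Rain)
         (False, True): 0.90, (False, False): 0.01}

  def prior_sample():
      c = random.random() < P_C
      s = random.random() < P_S[c]
      r = random.random() < P_R[c]
      w = random.random() < P_W[(s, r)]
      return c, s, r, w

  # The fraction of samples equal to a given event approaches its joint probability.
  N = 100_000
  count = sum(1 for _ in range(N) if prior_sample() == (True, False, True, True))
  print(count / N)   # ~ 0.5 * 0.9 * 0.8 * 0.9 = 0.324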

SLIDE 64

Example

(Sprinkler network: Cloudy → Sprinkler, Cloudy → Rain, {Sprinkler, Rain} → WetGrass)

  P(C) = .50

  C   P(S|C)       C   P(R|C)
  T   .10          T   .80
  F   .50          F   .20

  S   R   P(W|S,R)
  T   T   .99
  T   F   .90
  F   T   .90
  F   F   .01

SLIDE 65

Example

(Network and CPTs as on Slide 64.)

SLIDE 66

Example

(Network and CPTs as on Slide 64.)

SLIDE 67

Example

(Network and CPTs as on Slide 64.)

SLIDE 68

Example

(Network and CPTs as on Slide 64.)

SLIDE 69

Example

(Network and CPTs as on Slide 64.)

SLIDE 70

Example

(Network and CPTs as on Slide 64.)

SLIDE 71

Sampling from an empty network contd.

Probability that PriorSample generates a particular event:
  SPS(x1 . . . xn) = Πi=1..n P(xi|parents(Xi)) = P(x1 . . . xn)
i.e., the true prior probability
E.g., SPS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)
Let NPS(x1 . . . xn) be the number of samples generated for the event x1, . . . , xn
Then we have
  limN→∞ P̂(x1, . . . , xn) = limN→∞ NPS(x1, . . . , xn)/N = SPS(x1, . . . , xn) = P(x1 . . . xn)
That is, estimates derived from PriorSample are consistent
Shorthand: P̂(x1, . . . , xn) ≈ P(x1 . . . xn)

SLIDE 72

Rejection sampling

P̂(X|e) is estimated from the samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: C, a vector of counts for each value of X, initially zero
  for j = 1 to N do
    x ← Prior-Sample(bn)
    if x is consistent with e then      /* samples that do not match the evidence are discarded */
      C[x] ← C[x] + 1 where x is the value of X in x
  return Normalize(C[X])

SLIDE 73

Example

Estimate P(Rain|Sprinkler = true) using 100 samples
  27 samples have Sprinkler = true
  of these, 8 have Rain = true and 19 have Rain = false
P̂(Rain|Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
Similar to a basic real-world empirical estimation procedure
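A self-contained Python sketch (mine) of rejection sampling for this query on the sprinkler network of Slide 64:

  import random

  P_C, P_S = 0.5, {True: 0.10, False: 0.50}
  P_R = {True: 0.80, False: 0.20}
  P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.01}

  def prior_sample():
      c = random.random() < P_C
      s = random.random() < P_S[c]
      r = random.random() < P_R[c]
      w = random.random() < P_W[(s, r)]
      return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": w}

  def rejection_sampling(query, evidence, n):
      counts = {True: 0, False: 0}
      for _ in range(n):
          x = prior_sample()
          if all(x[var] == val for var, val in evidence.items()):   # keep only consistent samples
              counts[x[query]] += 1
      total = sum(counts.values())
      return {v: c / total for v, c in counts.items()}

  print(rejection_sampling("Rain", {"Sprinkler": True}, 100_000))   # roughly <0.3, 0.7>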

SLIDE 74

Rejection sampling contd.

P̂(X|e) = α NPS(X, e)          (algorithm defn.)
        = NPS(X, e)/NPS(e)     (normalized by NPS(e))
        ≈ P(X, e)/P(e)         (property of PriorSample)
        = P(X|e)               (defn. of conditional probability)
Hence rejection sampling returns consistent posterior estimates
Problem: hopelessly expensive if P(e) is small
P(e) drops off exponentially with the number of evidence variables

SLIDE 75

Likelihood weighting

Idea
– fix evidence variables
– sample only nonevidence variables
– weight each sample by the likelihood it accords the evidence

SLIDE 76

Likelihood weighting

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
  inputs: X, the query variable
          e, observed values for variables E
          bn, a Bayesian network
          N, the total number of samples to be generated
  local variables: W, a vector of weighted counts for each value of X, initially zero
  for j = 1 to N do
    x, w ← Weighted-Sample(bn, e)
    W[x] ← W[x] + w where x is the value of X in x
  return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
  x ← an event with n elements, initialized from e; w ← 1
  for each variable Xi in X1, . . . , Xn do
    if Xi is an evidence variable with value xi in e
      then w ← w × P(Xi = xi | Parents(Xi))
      else x[i] ← a random sample from P(Xi | Parents(Xi))
  return x, w
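A Python sketch (mine) of likelihood weighting, specialized to the query P(Rain|Sprinkler = true, WetGrass = true) on the sprinkler network of Slide 64:

  import random

  P_C, P_S = 0.5, {True: 0.10, False: 0.50}
  P_R = {True: 0.80, False: 0.20}
  P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.01}

  def weighted_sample(evidence):
      w = 1.0
      c = random.random() < P_C                      # Cloudy is never evidence here
      if "Sprinkler" in evidence:
          s = evidence["Sprinkler"]; w *= P_S[c] if s else 1 - P_S[c]
      else:
          s = random.random() < P_S[c]
      r = random.random() < P_R[c]                    # Rain is the query, so it is sampled
      if "WetGrass" in evidence:
          wg = evidence["WetGrass"]; w *= P_W[(s, r)] if wg else 1 - P_W[(s, r)]
      else:
          wg = random.random() < P_W[(s, r)]
      return {"Cloudy": c, "Sprinkler": s, "Rain": r, "WetGrass": wg}, w

  def likelihood_weighting(query, evidence, n):
      W = {True: 0.0, False: 0.0}
      for _ in range(n):
          x, w = weighted_sample(evidence)
          W[x[query]] += w
      total = sum(W.values())
      return {v: c / total for v, c in W.items()}

  print(likelihood_weighting("Rain", {"Sprinkler": True, "WetGrass": True}, 100_000))  # roughly <0.32, 0.68>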

SLIDE 77

Example

(Network and CPTs as on Slide 64.)

w = 1.0

SLIDE 78

Example

(Network and CPTs as on Slide 64.)

w = 1.0

SLIDE 79

Example

(Network and CPTs as on Slide 64.)

w = 1.0

SLIDE 80

Example

(Network and CPTs as on Slide 64.)

w = 1.0 × 0.1

SLIDE 81

Example

(Network and CPTs as on Slide 64.)

w = 1.0 × 0.1

SLIDE 82

Example

(Network and CPTs as on Slide 64.)

w = 1.0 × 0.1

SLIDE 83

Example

(Network and CPTs as on Slide 64.)

w = 1.0 × 0.1 × 0.99 = 0.099

SLIDE 84

Likelihood weighting contd.

Sampling probability for WeightedSample is
  SWS(z, e) = Πi=1..l P(zi|parents(Zi))
Note: it pays attention to evidence in ancestors only

(Sprinkler network figure)

⇒ somewhere “in between” the prior and the posterior distribution
Weight for a given sample z, e is
  w(z, e) = Πi=1..m P(ei|parents(Ei))
Weighted sampling probability is
  SWS(z, e) w(z, e) = Πi=1..l P(zi|parents(Zi)) Πi=1..m P(ei|parents(Ei)) = P(z, e)
(by the standard global semantics of the network)
Hence likelihood weighting returns consistent estimates
but performance still degrades with many evidence variables
because a few samples have nearly all the total weight

SLIDE 85

Inference by Markov chain Monte Carlo (MCMC)

“State” of the network = current assignment to all variables
⇒ generate the next state by making random changes to the current state
Generate the next state by sampling one variable given its Markov blanket
  (recall Markov blanket: parents, children, and children’s parents)
Sample each variable in turn, keeping the evidence fixed
The specific transition probability with which the stochastic process moves from one state to another is defined by the conditional distribution given the Markov blanket of the variable being sampled

SLIDE 86

MCMC Gibbs sampling

function MCMC-Gibbs-Ask(X, e, bn, N) returns an estimate of P(X|e)
  local variables: C, a vector of counts for each value of X, initially zero
                   Z, the nonevidence variables in bn
                   x, the current state of the network, initially copied from e
  initialize x with random values for the variables in Z
  for j = 1 to N do
    for each Zi in Z do                                      /* can also choose Zi at random */
      set the value of Zi in x by sampling from P(Zi|mb(Zi)) /* mb = Markov blanket */
      C[x] ← C[x] + 1 where x is the value of X in x
  return Normalize(C)
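A Python sketch (mine) of Gibbs sampling for P(Rain|Sprinkler = true, WetGrass = true) on the sprinkler network; the Markov-blanket distributions are computed as described on Slide 89:

  import random

  P_C, P_S = 0.5, {True: 0.10, False: 0.50}
  P_R = {True: 0.80, False: 0.20}
  P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.01}

  def bern(p_true, p_false):
      """Sample True with probability p_true / (p_true + p_false)."""
      return random.random() < p_true / (p_true + p_false)

  def gibbs_rain(n):
      s = w = True                                    # evidence, kept fixed
      c, r = random.choice([True, False]), random.choice([True, False])
      counts = {True: 0, False: 0}
      for _ in range(n):
          # P(Cloudy | mb) ~ P(Cloudy) P(Sprinkler|Cloudy) P(Rain|Cloudy)
          def pc(cv):
              return (P_C if cv else 1 - P_C) * (P_S[cv] if s else 1 - P_S[cv]) * (P_R[cv] if r else 1 - P_R[cv])
          c = bern(pc(True), pc(False))
          # P(Rain | mb) ~ P(Rain|Cloudy) P(WetGrass|Sprinkler,Rain)
          def pr(rv):
              return (P_R[c] if rv else 1 - P_R[c]) * (P_W[(s, rv)] if w else 1 - P_W[(s, rv)])
          r = bern(pr(True), pr(False))
          counts[r] += 1
      total = sum(counts.values())
      return {v: k / total for v, k in counts.items()}

  print(gibbs_rain(100_000))   # roughly <0.32, 0.68>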

SLIDE 87

The Markov chain

With Sprinkler = true, WetGrass = true, there are four states (Cloudy and Rain each true or false)

(Figure: the four states of the Markov chain and the transitions between them)

Wander about for a while

SLIDE 88

Example

Estimate P(Rain|Sprinkler = true, WetGrass = true)
Sample Cloudy or Rain given its Markov blanket, repeat
Count the number of times Rain is true and false in the samples
E.g., visit 100 states; 31 have Rain = true, 69 have Rain = false
P̂(Rain|Sprinkler = true, WetGrass = true) = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩
Theorem: the chain approaches its stationary distribution: the long-run fraction of time spent in each state is exactly proportional to its posterior probability

SLIDE 89

Markov blanket sampling

The Markov blanket of Cloudy is Sprinkler and Rain
The Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

(Sprinkler network figure)

The probability given the Markov blanket is calculated as follows:
  P(x′i|mb(Xi)) = P(x′i|parents(Xi)) ΠZj∈Children(Xi) P(zj|parents(Zj))
Easily implemented in message-passing parallel systems, brains
Main computational problems:
1) difficult to tell if convergence has been achieved
2) can be wasteful if the Markov blanket is large:
   P(Xi|mb(Xi)) won’t change much (law of large numbers)

SLIDE 90

Approximate inference

Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
Approximate inference by LW (likelihood weighting) and MCMC (Markov chain Monte Carlo):
– LW does poorly when there is lots of (downstream) evidence
– LW and MCMC are generally insensitive to topology
– convergence can be very slow with probabilities close to 1 or 0
– both can handle arbitrary combinations of discrete and continuous variables

SLIDE 91

Dynamic Bayesian networks

DBNs are Bayesian networks that represent temporal probability models
Basic idea: copy state and evidence variables for each time step
  Xt = set of unobservable state variables at time t
    e.g., BloodSugart, StomachContentst, etc.
  Et = set of observable evidence variables at time t
    e.g., MeasuredBloodSugart, PulseRatet, FoodEatent
This assumes discrete time; step size depends on the problem
Notation: Xa:b = Xa, Xa+1, . . . , Xb−1, Xb
Xt, Et can contain arbitrarily many variables in a replicated Bayes net

SLIDE 92

Hidden Markov models (HMMs)

Every HMM is a single-variable DBN; every discrete DBN is an HMM
(combine all the state variables of the DBN into a single one)

(Figure: a DBN slice Xt → Xt+1 with per-slice variables Yt, Zt, versus the combined single-variable HMM)

Sparse dependencies ⇒ exponentially fewer parameters;
e.g., with 20 state variables and three parents each, the DBN has 20 × 2^3 = 160 parameters, the HMM has 2^20 × 2^20 ≈ 10^12

SLIDE 93

Markov processes (Markov chains)

Construct a Bayes net from these variables: what are the parents?
Markov assumption: Xt depends on a bounded subset of X0:t−1
First-order Markov process: P(Xt|X0:t−1) = P(Xt|Xt−1)
Second-order Markov process: P(Xt|X0:t−1) = P(Xt|Xt−2, Xt−1)

(Figure: first-order and second-order Markov chains over . . . , Xt−2, Xt−1, Xt, Xt+1, Xt+2, . . .)

Sensor Markov assumption: P(Et|X0:t, E0:t−1) = P(Et|Xt)
Stationary process: transition model P(Xt|Xt−1) and sensor model P(Et|Xt) fixed for all t

SLIDE 94

Example

(Figure: the umbrella DBN, Raint−1 → Raint → Raint+1, with Umbrellat observed at each step)

  Rt−1  P(Rt)         Rt   P(Ut)
  t     0.7           t    0.9
  f     0.3           f    0.2

First-order Markov assumption not exactly true in the real world! Possible fixes:

  • 1. Increase order of Markov process
  • 2. Augment state, e.g., add Tempt, Pressuret

SLIDE 95

HMMs

Xt is a single, discrete variable (usually Et is too)
Domain of Xt is {1, . . . , S}
Transition matrix Tij = P(Xt = j|Xt−1 = i), e.g., for the umbrella world
  T = ( 0.7  0.3 )
      ( 0.3  0.7 )
Sensor matrix Ot for each time step, diagonal elements P(et|Xt = i)
  e.g., with U1 = true, O1 = diag(0.9, 0.2)
Forward and backward messages as column vectors:
  f1:t+1 = α Ot+1 T⊤ f1:t
  bk+1:t = T Ok+1 bk+2:t
The forward-backward algorithm needs time O(S²t) and space O(St)
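A Python sketch (mine) of the forward (filtering) recursion for this 2-state umbrella HMM; it matches the standard result P(rain2|u1, u2) ≈ 0.883:

  # Forward recursion f_{1:t+1} = alpha * O_{t+1} * T^T * f_{1:t}
  # (state 0 = rain, state 1 = no rain)
  T = [[0.7, 0.3],
       [0.3, 0.7]]                    # T[i][j] = P(X_t = j | X_{t-1} = i)
  O_umbrella = [0.9, 0.2]             # P(u_t | X_t = i); use 1 - p when no umbrella is seen

  def forward(f, observed_umbrella):
      o = O_umbrella if observed_umbrella else [1 - p for p in O_umbrella]
      g = [o[j] * sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
      alpha = 1 / sum(g)
      return [alpha * x for x in g]

  f = [0.5, 0.5]                      # prior P(X_0)
  for u in [True, True]:              # umbrella observed on days 1 and 2
      f = forward(f, u)
  print(f)                            # ~ [0.883, 0.117]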

SLIDE 96

Inference tasks in HMMs

Filtering: P(Xt|e1:t)
  the belief state, input to the decision process of a rational agent
Prediction: P(Xt+k|e1:t) for k > 0
  evaluation of possible action sequences; like filtering without the evidence
Smoothing: P(Xk|e1:t) for 0 ≤ k < t
  better estimate of past states, essential for learning
Most likely explanation: arg maxx1:t P(x1:t|e1:t)
  speech recognition, decoding with a noisy channel

SLIDE 97

Filtering

Aim: devise a recursive state estimation algorithm
  P(Xt+1|e1:t+1) = f(et+1, P(Xt|e1:t))
P(Xt+1|e1:t+1) = P(Xt+1|e1:t, et+1)
  = α P(et+1|Xt+1, e1:t) P(Xt+1|e1:t)
  = α P(et+1|Xt+1) P(Xt+1|e1:t)
i.e., prediction + estimation. Prediction by summing out Xt:
P(Xt+1|e1:t+1) = α P(et+1|Xt+1) Σxt P(Xt+1|xt, e1:t) P(xt|e1:t)
  = α P(et+1|Xt+1) Σxt P(Xt+1|xt) P(xt|e1:t)
f1:t+1 = Forward(f1:t, et+1) where f1:t = P(Xt|e1:t)
Time and space requirements are constant (independent of t)

SLIDE 98

Inference in DBNs

Naive method: unroll the network and run any exact algorithm

(Figure: the umbrella DBN unrolled for seven time steps, Rain0 → Rain1 → · · · → Rain7 with Umbrella1, . . . , Umbrella7, each slice carrying the same CPTs P(R1|R0) = ⟨0.7, 0.3⟩ and P(U1|R1) = ⟨0.9, 0.2⟩)

Problem: inference cost for each update grows with t
Rollup filtering: add slice t + 1, “sum out” slice t using variable elimination
Largest factor is O(d^(n+1)), update cost O(d^(n+2))
(cf. HMM update cost O(d^(2n)))
Approximate inference by MCMC (Markov chain Monte Carlo) etc.

SLIDE 99

Probabilistic logic

Bayesian networks are essentially propositional:
– the set of random variables is fixed and finite
– each variable has a fixed domain of possible values
Probabilistic reasoning can be formalized as probabilistic logic
First-order probabilistic logic combines probability theory with the expressive power of first-order logic

SLIDE 100

First-order probabilistic logic

Recall: propositional probabilistic logic
– Proposition = disjunction of atomic events in which it is true
– Possible world (sample point) ω = propositional logic model (an assignment of values to all of the r.v.s under consideration)
– ω ⊨ φ: for any proposition φ, the worlds ω where it is true
– probability model: a set Ω of possible worlds with a probability P(ω) for each world ω

SLIDE 101

First-order probabilistic logic

FOPL

  • The probability of any first-order logical sentence φ is a sum over the possible worlds where it is true:
      P(φ) = Σ{ω : ω ⊨ φ} P(ω)
  • Conditional probabilities P(φ|e) can be obtained similarly

Ask any question from the probability model ⇒ (first-order) belief networks
Problem: the set of first-order models is infinite
– the summation could be infeasible
– specifying a complete and consistent distribution over an infinite set of worlds could be very difficult
Analogous to the method of propositionalization for FOL
  e.g., relational probability models (RPMs)

SLIDE 102

Other approaches to uncertain reasoning

  • Nonmonotonic reasoning
  • Rule-based methods
  • Dempster-Shafer theory
  • Possibility theory
  • Fuzzy logic
  • Rough sets
