Bayesian Networks and ITS: Overview (PowerPoint presentation)



SLIDE 1

Bayesian Networks and ITS


SLIDE 2


Overview

  • Knowledge acquisition is hard in general, and not well understood.
  • It is time consuming, when everything is to be hand-coded.
  • Can the machine automatically gather the needed information?
  • Machine Learning
  • A number of approaches now available.
  • Do they work well?
SLIDE 3

Core System

[Diagram: a Core System with Inputs and Outputs, plus Other Observables; a Control Box asks "Quality?" and "What to Do?" and makes Changes; Machine learning drives the control loop]

SLIDE 4

Machine Learning

  • Rote Learning (being told)
  • Fully Automated Discovery
  • Rule Induction
  • Neural Networks
  • Reinforcement Learning
  • Example-Based Learning
  • Inductive Logic Programming
  • Version Space Learning
  • Case-Based Systems
  • …......

SLIDE 5

Learn What?

  • Domain knowledge

    – Correct knowledge – concepts, dependencies, rules, etc.
    – Misconceptions
    – Perturbation model
    – Interventions

  • Student model
  • Tutoring model

SLIDE 6

….

  • Student model: inducing from behaviour records, test records, etc.
  • Interventions can improve with records of past cases.
  • Perturbation model used to generate misconceptions.

    – Machine learning?

  • Bayesian networks – a probability model of a domain; probabilities change with time...

    – Learning!

SLIDE 7

Overview...

  • Uncertainty is fundamental to education!
  • Our knowledge of the learning process, access to the learner's state of knowledge, and also where he/she is going.
  • The strength of an ITS is in effective prediction of the learner's next step, and choosing the right action.
  • Modelling uncertainty is critical.

    – What kind of uncertainty?
    – What kind of model?

SLIDE 8

Uncertainty

  • Fuzzy Logic
  • Certainty Factors
  • Probability models – Bayesian networks
  • Non-numeric Models
  • Non-monotonic Logics & reasoning
  • Dependency Networks
  • Dempster-Shafer theory

SLIDE 9

Uncertainty

  • Given the current state of the world, form beliefs on the student's knowledge level.
  • Given the knowledge level, decide an action against a situation.
  • Selection of the next problem.
  • Probability models may be a good start.

    – Bayesian networks

SLIDE 10

Bayesian Networks

  • Causal concept networks with attached probabilities.
  • Bayesian methods are capable of handling noisy and incomplete information.
  • Bayes' theorem saves us from massive probability computations.

SLIDE 11

Basic rules

Probability:
P(A) ≥ 0;  P(A) = 1 – P(¬A)

Conditional probability:
P(A|B) = P(A∧B) / P(B), if P(B) ≠ 0

Product rule:
P(A∧B) = P(A|B) P(B)

Bayes' Rule (from P(A|B) = P(A∧B) / P(B)):
P(A|B) = P(B|A) P(A) / P(B)
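These rules are easy to check numerically. A minimal Python sketch; all probability values here are illustrative, not from the slides:

```python
def conditional(p_a_and_b, p_b):
    """Conditional probability: P(A|B) = P(A∧B) / P(B), for P(B) != 0."""
    return p_a_and_b / p_b

def bayes(p_b_given_a, p_a, p_b):
    """Bayes' Rule: P(A|B) = P(B|A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# With illustrative numbers P(B|A)=0.8, P(A)=0.1, P(B)=0.2,
# Bayes' Rule gives P(A|B) = 0.8 * 0.1 / 0.2 = 0.4
posterior = bayes(0.8, 0.1, 0.2)
```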


SLIDE 12

P(Cavity | Toothache) = ?

Computing the posterior probability from the full joint distribution

SLIDE 13

...

P(A|B) = P(A∧B) / P(B)

P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)

P(Cavity ∧ Toothache) = 0.04

P(Toothache) = ?

P(C|T) = ?
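The slide leaves P(Toothache) open; it falls out by marginalising the full joint table. Only P(Cavity ∧ Toothache) = 0.04 comes from the slide; the other joint entries below are illustrative assumptions:

```python
# Full joint distribution over (Cavity, Toothache).
# Only the (True, True) entry is from the slide; the rest are assumed.
joint = {
    (True,  True):  0.04,
    (True,  False): 0.06,
    (False, True):  0.01,
    (False, False): 0.89,
}

def p_toothache():
    # Marginalise over Cavity: sum every entry where Toothache is true.
    return sum(p for (cav, tooth), p in joint.items() if tooth)

def p_cavity_given_toothache():
    # P(Cavity | Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
    return joint[(True, True)] / p_toothache()
```

With these assumed entries, P(Toothache) = 0.05 and P(Cavity | Toothache) = 0.8.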

SLIDE 14

The problem

  • P(I1|o1,o2,o3,o4...,on) – given the current state of the network, what is my estimate of using intervention I1?
  • I need the full joint probability of all variables.
  • 1) Cannot derive P(A|B,C) from P(A|B) and P(A|C)

    – Unless....

  • 2) P(A|B) not easy in general

    – But P(B|A) may be easier

SLIDE 15

Independence

  • Two random variables A, B are (absolutely) independent iff P(A∧B) = P(A)P(B)

    – If n Boolean variables are independent, the full joint is P(X1,…,Xn) = Πi P(Xi)

  • Two random variables A, B given C are conditionally independent iff P(A∧B|C) = P(A|C) P(B|C)
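The product form of the full joint for independent Booleans can be sketched directly. A minimal Python version, where each variable is given only by its marginal P(Xi = true):

```python
from itertools import product

def full_joint(marginals):
    """Full joint over independent Boolean variables:
    P(x1,...,xn) = prod_i P(xi), where P(Xi=false) = 1 - P(Xi=true)."""
    joint = {}
    for assignment in product([True, False], repeat=len(marginals)):
        p = 1.0
        for xi, p_true in zip(assignment, marginals):
            p *= p_true if xi else 1.0 - p_true
        joint[assignment] = p
    return joint
```

For example, with marginals [0.5, 0.1] the entry for (True, True) is 0.5 × 0.1 = 0.05, and the whole table sums to 1.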

SLIDE 16

….

P(I|A,B) = P(A,B|I) P(I) / P(A,B) = P(A|I) P(B|I) P(I) / (P(A) P(B))
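In practice, rather than estimating P(A) P(B) in the denominator, one computes the numerator P(A|I) P(B|I) P(I) for each value of I and normalises. A sketch of that naive-Bayes-style variant of the slide's formula (the conditional-independence assumption is as on the slide; the numbers in use are illustrative):

```python
def posterior_i(p_a_given_i, p_b_given_i, prior_i):
    """P(I|A,B) ∝ P(A|I) P(B|I) P(I), assuming A and B are conditionally
    independent given I; normalised over I ∈ {True, False}."""
    unnorm = {i: p_a_given_i[i] * p_b_given_i[i] * prior_i[i]
              for i in (True, False)}
    z = sum(unnorm.values())          # normalising constant = P(A,B)
    return {i: p / z for i, p in unnorm.items()}
```

Normalising over the values of I sidesteps the need for separate estimates of P(A) and P(B).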

SLIDE 17

Data availability

  • P(solves_p1 | knows_c) is easier to obtain than P(knows_c | solves_p1)
  • Bayes' theorem provides a way to build one from the other.

SLIDE 18

Bayesian Network

  • Network of probability influences!
  • Nothing else will influence: the Markov assumption.

    – => all else are conditionally independent.

  • Every node has an associated conditional probability (CP) distribution as a function of its parents.

    – P(~X) = 1 – P(X)

  • Information can flow in any direction.
SLIDE 19

Network

Each concept is represented by a node in the graph.

A directed edge from one concept to another is added if knowledge of the former is a prerequisite for understanding the latter.

SLIDE 20

P(For-Loop | Variable Asgn, Rel Ops, Incr/Decr Oper)

CPD for For-Loop
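A CPD like this is just a table indexed by the truth values of the parents. A minimal sketch; the slide does not give the actual table, so every probability below is an illustrative assumption:

```python
# CPD for For-Loop given its three parents, as a lookup table:
# (VariableAsgn, RelOps, IncrDecrOper) -> P(For-Loop known).
# All values are illustrative assumptions, not from the slides.
cpd_for_loop = {
    (True,  True,  True):  0.90,
    (True,  True,  False): 0.60,
    (True,  False, True):  0.55,
    (True,  False, False): 0.30,
    (False, True,  True):  0.50,
    (False, True,  False): 0.25,
    (False, False, True):  0.20,
    (False, False, False): 0.05,
}

def p_for_loop(var_asgn, rel_ops, incr_decr):
    """Look up P(For-Loop known | parents) in the table."""
    return cpd_for_loop[(var_asgn, rel_ops, incr_decr)]
```

With three Boolean parents the table needs 2³ = 8 rows, one per combination of parent values.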

SLIDE 21

Belief network example

Neighbours John and Mary promised to call if the alarm goes off. Sometimes the alarm starts because of an earthquake. If the alarm went off, what is the probability of a burglary?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls (n = 5 variables)

Network topology reflects “causal” knowledge.

SLIDE 22


Belief network example – cont.

SLIDE 23

Semantics in belief networks

  • In a BN, the full joint distribution is defined as the product of the local conditional distributions:

    P(X1,…,Xn) = Π P(Xi | Parents(Xi)) for i = 1 to n

    e.g. P(J∧M∧A∧¬B∧¬E) is given by

    = P(¬B) P(¬E) P(A|¬B∧¬E) P(J|A) P(M|A)
    = 0.999 × 0.998 × 0.001 × 0.90 × 0.70
    ≈ 0.000628

  • Each node is conditionally independent of its descendants given its parents.
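The product can be computed directly from the CPT values the slide lists:

```python
def joint_alarm_case():
    """P(J∧M∧A∧¬B∧¬E) as the product of the local conditionals,
    using the alarm-network CPT values on the slide."""
    p_not_b, p_not_e = 0.999, 0.998
    p_a = 0.001                  # P(A | ¬B∧¬E)
    p_j, p_m = 0.90, 0.70        # P(J|A), P(M|A)
    return p_not_b * p_not_e * p_a * p_j * p_m
```

The product comes to about 0.000628.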

SLIDE 24

Example...

  • P(M|B)?
  • = P(M|A) P(A|B) + P(M|~A) P(~A|B)
  • P(A) = P(A|B,E) P(B) P(E) + P(A|B,~E) P(B) P(~E) + P(A|~B,E) P(~B) P(E) + P(A|~B,~E) P(~B) P(~E)
  • Given B, this reduces to: P(A|B) = P(A|B,E) P(E) + P(A|B,~E) P(~E)
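Plugging numbers into this expansion requires CPT entries the slide leaves implicit. The sketch below assumes the standard textbook alarm-network values for P(E), P(A|B,E), P(A|B,~E), and P(M|~A); only P(M|A) = 0.70 appears elsewhere in these slides:

```python
def p_m_given_b():
    """P(M|B) via the slide's expansion. CPT values beyond the slides
    (P(E), P(A|B,E), P(A|B,~E), P(M|~A)) are the textbook alarm-network
    numbers, assumed here for illustration."""
    p_e = 0.002
    p_a_be, p_a_bne = 0.95, 0.94        # P(A|B,E), P(A|B,~E)
    p_m_a, p_m_na = 0.70, 0.01          # P(M|A),   P(M|~A)
    p_a_b = p_a_be * p_e + p_a_bne * (1 - p_e)    # P(A|B)
    return p_m_a * p_a_b + p_m_na * (1 - p_a_b)   # P(M|B)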

SLIDE 25


Building a BBN

  • Expert centric

    – A human expert creates the structure and the probability values.
    – Can guess, where a real value is not available.
    – Hidden nodes are a problem!

  • Data centric

    – Use population data from real trials, etc.
    – Approaches vary on what is constructed from data.

  • Efficiency centric

    – A combination, using domain knowledge to increase efficiency.

SLIDE 26

Using a B.N.

  • Diagnostic reasoning

    – Given leaf nodes, predict the probability of intermediate or root nodes.

  • Predictive reasoning

    – Given root nodes, etc., predict the probability of intermediate nodes and leaf nodes.

  • Explaining away

    – Sibling propagation – earthquake knowledge helps “reduce” the probability of burglary, given the alarm.

SLIDE 27

Andes’ Bayesian network

  • Andes’ Bayesian networks encode two kinds of knowledge:

    – domain-general knowledge: encompassing general concepts and procedures that define proficiency in Newtonian physics. Needs to stay across sessions.
    – task-specific knowledge: encompassing knowledge related to a student's performance on a specific problem or example. Can be removed at the end of the task.

SLIDE 28

The domain-general part

 The domain-general part of the student model consists of

   Rule nodes
   Context-Rule nodes

 A student has mastered a rule when he/she is able to apply it correctly in all possible contexts (problems).

 Rule nodes have binary values T and F, indicating the probability that each rule is mastered or not.

 Context-Rule nodes represent mastery of physics rules in specific problem-solving contexts.

SLIDE 29


The task-specific part

  • The task-specific part of the Bayesian student model contains four types of nodes:

    – Fact,
    – Goal,
    – Rule-application and
    – Strategy nodes

  • Fact and Goal nodes represent information that is derived while solving a problem by applying rules from the knowledge base.
  • Goal and Fact nodes have binary values T and F indicating whether they are do-able (by the student).
  • They have as many parents as there are ways to derive them.

SLIDE 30

In Andes

  • Andes uses its rules to solve each physics problem in all possible ways, and accumulates all possible derivations of the correct answer.
  • The derivations are collected in a data structure called the solution graph.
  • The consolidated solution graph for the full Andes system runs into thousands of nodes... too heavy for BN update, etc.
  • A BN can handle only propositional information, while general solution-graph nodes are first-order.

SLIDE 31

Dynamic BN

  • A dynamic BN is one where the node scenario changes with time.
  • Andes uses a version of BN for solving this problem.
  • Each problem is mapped to a different solution graph, and hence a different Bayesian network.

    – Fully propositional!

  • The Bayesian networks for different tasks are completely distinct and share no nodes.
  • However, the prior probabilities of the domain-general nodes are set to the probabilities of the domain-general nodes from the network of the preceding exercise.
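The carry-over of domain-general probabilities between task networks can be sketched as a simple dictionary merge. The node names here are hypothetical, purely for illustration:

```python
def carry_over_priors(prev_posteriors, new_priors):
    """Seed the next task's network: for every domain-general node that
    appears in both networks, replace its default prior with the
    posterior from the preceding exercise. Node names are hypothetical."""
    updated = dict(new_priors)
    for node, p in prev_posteriors.items():
        if node in updated:
            updated[node] = p
    return updated

# e.g. the rule node's posterior (0.8) overrides the default prior (0.5),
# while task-specific nodes keep their defaults.
priors = carry_over_priors({"rule_newton2": 0.8},
                           {"rule_newton2": 0.5, "fact_f1": 0.1})
```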

SLIDE 32

Rule-application nodes

  • Rule-application nodes connect Context-Rule nodes, Strategy nodes and Proposition nodes to new derived Proposition nodes.
  • The nodes have values indicating whether they are Doable or Not-doable.
  • The node is Doable if the student has applied or can apply the corresponding Context-Rule correctly.

SLIDE 33

Strategy nodes

Strategy nodes represent points where the student can choose among alternative plans to solve a problem.

These are the only non-binary nodes in the network: they have as many values as there are alternative plans.

The node is always paired with a Goal node, and it is used when there is more than one mutually exclusive way to address the goal.

SLIDE 34

A physics problem and a segment of the corresponding solution graph


SLIDE 35

Probabilistic student modeling


SLIDE 36

Updating SM

  • Values may be changed depending on the number and type of hints used.
  • Number of mistakes made.

    – “Guess” probability?

  • Skipped steps – how to attribute credit for the involved knowledge elements?

    – Use of other related elements can help.

  • Multiple rule applications for a node that has been reached.

    – Sharing credit among them...

SLIDE 37

...

  • Leaky AND, leaky OR

    – To capture “leak” knowledge such as guesses, slips, etc.
    – Even if I know all the conditions and the rule, I may not apply it (correctly).

  • Derive updated values of domain-knowledge competence, prediction of the appropriate next problem, adjustment against supports like hints, etc.

    – All can benefit from a probabilistic student model.
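The leaky-OR and leaky-AND gates mentioned above are often implemented as noisy-OR/noisy-AND with leak and slip terms. A minimal sketch; the leak and slip values are illustrative assumptions:

```python
def leaky_or(cause_probs, leak=0.05):
    """Noisy-OR with a leak: the effect (e.g. a correct answer) can occur
    even when no modelled cause is active -- a lucky guess. The leak
    value is an illustrative assumption."""
    p_none = 1.0 - leak
    for p in cause_probs:
        p_none *= 1.0 - p        # each active cause independently fails
    return 1.0 - p_none

def leaky_and(condition_probs, slip=0.1):
    """Noisy-AND with a slip: even when every condition holds and the
    rule is known, the student may fail to apply it correctly."""
    p_all = 1.0 - slip
    for p in condition_probs:
        p_all *= p               # all conditions must hold
    return p_all
```

With no active causes, leaky_or returns just the leak (the guess probability); with all conditions certain, leaky_and returns 1 minus the slip.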

SLIDE 38

Wrapping up

  • Machine learning and uncertainty handling are important aspects for ITS in the long run.
  • Many approaches and techniques.
  • Bayesian networks are good, given the formal basis in probability theory and their ability to control complexity.
  • BN does not always mean Bayesian networks – mostly belief networks!
  • But not flexible for many applications... We will see some other models later.

SLIDE 39

Fuzzy Logic

  • Another aspect of uncertainty handling
  • Fuzziness in many concepts

    – probability not appropriate to capture it
    – tall, short, warm, etc.

  • Fuzzy rules

    – Not easy to provide complete interdependencies
    – If clothes are dirty, detergent = high
    – If cloth_quantity is high, detergent is high

SLIDE 40

  • Generalise from crisp sets to fuzzy sets
  • Usually, we follow sets, with fuzzy boundaries

    – Tall, medium, short, costly, cheap, etc.

  • But mathematical sets are crisp.

    – x in A => x not in ~A
    – A U ~A = UnivSet
    – A ^ ~A = NullSet

  • Degree of membership.
SLIDE 41

Membership

[Figure: membership vs. marks, with curves labelled poor, average, good, excellent]

SLIDE 42

Membership

  • Each variable has multiple values, with overlap.
  • Suitable membership functions as per the domain.
  • John = 55 marks

    – good: 70, average: 50
    – Can have all values also.

  • Fuzzification of variables.
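Fuzzification maps a crisp value like "55 marks" to membership degrees in each fuzzy set. A sketch with triangular membership functions; the breakpoints (and hence the resulting degrees) are illustrative assumptions, not the slide's exact numbers:

```python
def triangular(x, a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c. Breakpoints are illustrative."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Fuzzify John's crisp 55 marks against three overlapping sets.
marks = 55
memberships = {
    "poor":    triangular(marks, 0, 20, 45),
    "average": triangular(marks, 30, 50, 70),
    "good":    triangular(marks, 50, 70, 90),
}
```

Because the sets overlap, 55 marks belongs to both "average" and "good" with non-zero degrees at once, which is exactly the point of fuzzy sets.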
SLIDE 43

Fuzzy rules

  • If problem-1 is solved, rule-33 confidence is high
  • If problem-3 is solved, rule-33 confidence is marginal
  • Such simple rules can be used to build a student model
  • Similar rules for tutoring decisions.
SLIDE 44

….

  • An aggregator function combines the contributions from the various rules for an overall decision.
  • This will also be a fuzzy value

    – r33-confid marginal (0.3), high (0.7), etc.

  • A defuzzification process converts this into a crisp value

    – r33-confid adequate or not.
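The aggregate-then-defuzzify pipeline can be sketched with max-aggregation and a weighted-average (centroid-style) defuzzifier. The crisp value attached to each linguistic label is an illustrative assumption:

```python
def aggregate_max(rule_outputs):
    """Combine rule outputs for one variable by taking, per label,
    the maximum degree any rule assigned to it."""
    agg = {}
    for label, degree in rule_outputs:
        agg[label] = max(agg.get(label, 0.0), degree)
    return agg

def defuzzify(agg, crisp_values):
    """Weighted-average defuzzification: each label contributes its
    (assumed) representative crisp value, weighted by its degree."""
    num = sum(agg[label] * crisp_values[label] for label in agg)
    den = sum(agg.values())
    return num / den

# e.g. r33-confid: marginal (0.3), high (0.7) -> one crisp confidence
agg = aggregate_max([("marginal", 0.3), ("high", 0.7)])
confidence = defuzzify(agg, {"marginal": 0.3, "high": 0.9})
```

The crisp output can then be thresholded to decide whether r33-confid is adequate or not.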

SLIDE 45

Thank you....

SLIDE 46

Every CPD for the entire BN can be calculated.

If the student answered the question correctly, then consider the concept known.

Similarly, if the student answered the question incorrectly, then consider the concept unknown.

The probability of each concept being known, p(ai = known), can then be determined.

Moreover, we can also compute p(ai=known,