SLIDE 1

Decision Tree and Automata Learning

Stefan Edelkamp

SLIDE 2

1 Overview

  • Decision tree representation
  • Top Down Induction, Attribute Selection
  • Entropy, Information gain
  • ID3 learning algorithm
  • Overfitting
  • Continuous, multi-valued, and costly attributes; unknown attribute values
  • Grammar and DFA Learning
  • Angluin’s ID algorithm

SLIDE 3

2 Decision Tree Learning

PlayTennis:

Outlook?
  Sunny    → Humidity?  High → No,  Normal → Yes
  Overcast → Yes
  Rain     → Wind?  Strong → No,  Weak → Yes

SLIDE 4

Training Examples:

Day  Outlook   Temp.  Humidity  Wind    Play?
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No
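For use in the snippets on the following slides, the table can be transcribed as plain Python data. The names EXAMPLES and ATTRIBUTES and the boolean Play label are my own encoding choices, not part of the slides:

```python
# PlayTennis training examples, one dict per day (D1..D14).
ATTRIBUTES = ["Outlook", "Temp", "Humidity", "Wind"]

EXAMPLES = [
    dict(Outlook="Sunny",    Temp="Hot",  Humidity="High",   Wind="Weak",   Play=False),  # D1
    dict(Outlook="Sunny",    Temp="Hot",  Humidity="High",   Wind="Strong", Play=False),  # D2
    dict(Outlook="Overcast", Temp="Hot",  Humidity="High",   Wind="Weak",   Play=True),   # D3
    dict(Outlook="Rain",     Temp="Mild", Humidity="High",   Wind="Weak",   Play=True),   # D4
    dict(Outlook="Rain",     Temp="Cool", Humidity="Normal", Wind="Weak",   Play=True),   # D5
    dict(Outlook="Rain",     Temp="Cool", Humidity="Normal", Wind="Strong", Play=False),  # D6
    dict(Outlook="Overcast", Temp="Cool", Humidity="Normal", Wind="Strong", Play=True),   # D7
    dict(Outlook="Sunny",    Temp="Mild", Humidity="High",   Wind="Weak",   Play=False),  # D8
    dict(Outlook="Sunny",    Temp="Cool", Humidity="Normal", Wind="Weak",   Play=True),   # D9
    dict(Outlook="Rain",     Temp="Mild", Humidity="Normal", Wind="Weak",   Play=True),   # D10
    dict(Outlook="Sunny",    Temp="Mild", Humidity="Normal", Wind="Strong", Play=True),   # D11
    dict(Outlook="Overcast", Temp="Mild", Humidity="High",   Wind="Strong", Play=True),   # D12
    dict(Outlook="Overcast", Temp="Hot",  Humidity="Normal", Wind="Weak",   Play=True),   # D13
    dict(Outlook="Rain",     Temp="Mild", Humidity="High",   Wind="Strong", Play=False),  # D14
]
```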

SLIDE 5

Decision Trees

DT Representation: each internal node tests an attribute, each branch corresponds to an attribute value, each leaf node assigns a classification

When to consider DTs:

  • Instances describable by attribute–value pairs
  • Target function discrete valued
  • Disjunctive hypothesis may be required
  • Possibly noisy training data

Examples: Equipment or medical diagnosis, Credit risk analysis, modeling calendar scheduling preferences

SLIDE 6

Top-Down Induction

Main loop: pick A, the “best” decision attribute for the next node, and assign A as the decision attribute for that node

  • for each value of A, create new descendant of node
  • sort training examples to leaf nodes

if the training examples are perfectly classified ⇒ stop; else iterate over the new leaf nodes (a Python sketch of this loop follows below)

Best Attribute:

A1=?  [29+,35−]:  t → [21+,5−],  f → [8+,30−]
A2=?  [29+,35−]:  t → [18+,33−],  f → [11+,2−]
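As a concrete reading of the main loop, here is a minimal Python sketch of top-down induction. The tree representation (a leaf label, or an (attribute, branches) pair), the name tdidt, and the abstract choose parameter are my own; choose is made concrete on the next two slides:

```python
from collections import Counter

def tdidt(examples, attributes, choose, target="Play"):
    """Top-down induction of a decision tree (sketch).

    A tree is either a class label (leaf) or a pair
    (attribute, {value: subtree}).  choose(examples, attributes)
    returns the "best" decision attribute for the next node."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # perfectly classified => stop
        return labels[0]
    if not attributes:                   # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    a = choose(examples, attributes)     # assign A as decision attribute
    branches = {}
    for v in {e[a] for e in examples}:   # new descendant per value of A
        subset = [e for e in examples if e[a] == v]   # sort examples down
        branches[v] = tdidt(subset, [x for x in attributes if x != a],
                            choose, target)
    return (a, branches)
```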

SLIDE 7

Entropy

[Plot: Entropy(S) as a function of p⊕ over [0, 1]; it is 0 at p⊕ ∈ {0, 1} and peaks at 1.0 for p⊕ = 0.5]

  • S is a sample of training examples

p⊕: proportion of positive examples in S; p⊖: proportion of negative examples in S

Entropy measures the impurity of S:

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖

= the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)
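A direct transcription of this definition in Python (a sketch; it assumes the boolean-labelled EXAMPLES encoding from the training-example slide):

```python
from math import log2

def entropy(examples, target="Play"):
    """Entropy(S) = -p+ log2 p+ - p- log2 p-  (0 log 0 taken as 0)."""
    n = len(examples)
    if n == 0:
        return 0.0
    p_pos = sum(1 for e in examples if e[target]) / n
    return -sum(p * log2(p) for p in (p_pos, 1.0 - p_pos) if p > 0.0)
```

On the full training set, entropy(EXAMPLES) evaluates to 0.940, the value used on the next slide.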

SLIDE 8

Selecting the Next Attribute

Information Gain: the expected reduction in entropy due to sorting on A:

Gain(S, A) ≡ Entropy(S) − Σ_{v ∈ D_A} (|Sv| / |S|) · Entropy(Sv)

Which attribute is the best classifier?

S: [9+,5−], E = 0.940

Humidity: High → [3+,4−] (E = 0.985), Normal → [6+,1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: Weak → [6+,2−] (E = 0.811), Strong → [3+,3−] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.00 = 0.048
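The gain computation is a small layer over entropy. The sketch below reuses entropy, tdidt, EXAMPLES, and ATTRIBUTES from the earlier snippets and reproduces the two numbers above:

```python
def gain(examples, attr, target="Play"):
    """Gain(S, A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        sv = [e for e in examples if e[attr] == v]
        remainder += len(sv) / len(examples) * entropy(sv, target)
    return entropy(examples, target) - remainder

# Reproduces the slide's numbers:
#   gain(EXAMPLES, "Humidity")  ->  0.151...
#   gain(EXAMPLES, "Wind")      ->  0.048...
best_attribute = lambda exs, attrs: max(attrs, key=lambda a: gain(exs, a))
tree = tdidt(EXAMPLES, ATTRIBUTES, best_attribute)   # roots at "Outlook"
```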

SLIDE 9

ID3: Hypothesis Space Search

[Figure: ID3's greedy search through the space of decision trees, repeatedly extending the current tree with tests on attributes A1, A2, A3, A4, . . .]

  • Target function: surely in there . . . ; no backtracking ⇒ local minima . . .

Statistically based search choices: robust to noisy data. Inductive bias: “prefer the shortest tree”

SLIDE 10

Occam’s Razor

Bias: a preference for some hypotheses, rather than a restriction of the hypothesis space . . . prefer the shortest hypothesis that fits the data

Arguments in favor of short hypotheses:

  • a short hypothesis that fits the data is unlikely to be a coincidence; a long hypothesis that fits the data may well be a coincidence

Arguments opposed to short hypotheses:

  • there are many ways to define small sets of hypotheses, e.g., all trees with a prime number of nodes that use attributes beginning with “Z”
  • what’s so special about small sets based on the size of the hypothesis?

SLIDE 11

Overfitting in Decision Trees

Consider adding the noisy training example ⟨Sunny, Hot, Normal, Strong⟩, PlayTennis = No.

Consider the error of hypothesis h over

  • training data: errortrain(h)
  • entire distribution D of data: errorD(h)

Hypothesis h ∈ H overfits training data if there is an alternative hypothesis h′ ∈ H such that errortrain(h) < errortrain(h′) and errorD(h) > errorD(h′)

SLIDE 12

[Plot: accuracy (0.5–0.9) vs. size of tree (10–100 nodes), on training data and on test data; training accuracy keeps increasing with tree size while test accuracy falls off, illustrating overfitting]

SLIDE 13

Avoiding Overfitting

Option 1: stop growing when a split is not statistically significant
Option 2: grow the full tree, then post-prune

Select the best tree (a scoring sketch follows the list below):

  • measure performance over training data
  • measure performance over separate validation data set
  • min. |tree| + |misclassifications(tree)|
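The third criterion can be made concrete for the trees built by the tdidt sketch. tree_size, classify, and mdl_score are hypothetical helper names, and classify assumes every attribute value at a node was seen during training:

```python
def tree_size(tree):
    """Number of nodes in a tree from the tdidt sketch."""
    if not isinstance(tree, tuple):                 # leaf
        return 1
    _attr, branches = tree
    return 1 + sum(tree_size(t) for t in branches.values())

def classify(tree, example):
    """Follow attribute tests down to a leaf label."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[example[attr]]              # assumes value was seen
    return tree

def mdl_score(tree, examples, target="Play"):
    """Selection criterion from the slide: |tree| + |misclassifications|."""
    errors = sum(classify(tree, e) != e[target] for e in examples)
    return tree_size(tree) + errors
```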

SLIDE 14

Rule Post-Pruning

  • Convert tree to equivalent set of rules
  • Prune each rule independently of others
  • Sort final rules into desired sequence for use

Perhaps most frequently used method

SLIDE 15

Converting A Tree to Rules

Outlook?
  Sunny    → Humidity?  High → No,  Normal → Yes
  Overcast → Yes
  Rain     → Wind?  Strong → No,  Weak → Yes

IF (Outlook = Sunny) ∧ (Humidity = High) THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN PlayTennis = Yes
. . .
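For the (attribute, branches) representation used in the earlier sketches, the conversion is an enumeration of root-to-leaf paths (a minimal sketch; pruning and sorting of the rules are omitted):

```python
def tree_to_rules(tree, conditions=()):
    """Return a list of (conditions, label) rules, one per leaf.
    Each condition is an (attribute, value) pair, read as a conjunction."""
    if not isinstance(tree, tuple):                 # leaf: one finished rule
        return [(conditions, tree)]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

# e.g. ((("Outlook", "Sunny"), ("Humidity", "High")), False)
# corresponds to IF Outlook=Sunny AND Humidity=High THEN PlayTennis=No
```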

SLIDE 16

Continuous Valued Attributes

Create a discrete attribute to test a continuous one: (Temperature > 72.3) ∈ {t, f}

Temperature: 40 48 60 72 80 90
PlayTennis:  No No Yes Yes Yes No

Attributes with Many Values: plain Gain will favor them, e.g. Date = Jun 3 1996. One approach: use

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si| / |S|) log2 (|Si| / |S|)

where Si is the subset of S for which A has value vi
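GainRatio is a thin layer over the earlier gain helper (a sketch; SplitInformation is 0 for a single-valued attribute, which the guard below treats as "no information"):

```python
from collections import Counter
from math import log2

def split_information(examples, attr):
    """SplitInformation(S, A): entropy of the partition induced by A."""
    n = len(examples)
    sizes = Counter(e[attr] for e in examples).values()
    return -sum(s / n * log2(s / n) for s in sizes)

def gain_ratio(examples, attr, target="Play"):
    si = split_information(examples, attr)
    return gain(examples, attr, target) / si if si > 0 else 0.0
```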

SLIDE 17

Attributes with Costs

E.g. medical diagnosis: BloodTest has a cost. To learn a consistent tree with low expected cost, replace gain by one of the measures below (a Python sketch follows the list):

  • Gain(S, A)² / Cost(A)
  • (2^Gain(S,A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] determines the importance of cost
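Both alternatives are one-liners on top of the earlier gain helper; the function names and the idea of passing Cost(A) as a plain number are my own:

```python
def gain_squared_over_cost(examples, attr, cost):
    """First variant: Gain(S, A)^2 / Cost(A)."""
    return gain(examples, attr) ** 2 / cost

def discounted_gain(examples, attr, cost, w=0.5):
    """Second variant: (2^Gain(S,A) - 1) / (Cost(A) + 1)^w, w in [0, 1]."""
    return (2 ** gain(examples, attr) - 1) / (cost + 1) ** w
```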

SLIDE 18

Unknown Attribute Values

Use the training example anyway and sort it through the tree:

  • if node n tests A ⇒ assign the most common value of A among the other examples sorted to node n
  • or assign the most common value of A among examples with the same target value
  • or assign probability pi to each possible value vi of A and pass fraction pi of the example down to each descendant

Classify new examples in same fashion

SLIDE 19

3 Automata Learning

Grammar Inference: the process of learning an unknown grammar given a finite set of labeled examples
Regular Grammar: recognized by a DFA
Given: a finite set of positive examples and a finite, possibly empty, set of negative examples
Task: learn the minimum-state DFA equivalent to the target . . . which is NP-hard
Simplifications:

  • criteria on samples (e.g. structural completeness)
  • knowledgeable teacher (oracle) who responds to queries generated by learner

SLIDE 20

Applications

  • Inference of control structures in learning by examples
  • Inference of normal models of systems under test
  • Inference of consistent environments in partial models

SLIDE 21

Trace Tree

SLIDE 22

DFA

SLIDE 23–27

Chart Parsing

[Figure sequence: five slides stepping through a chart-parsing example]

SLIDE 28

Some Notation

Σ: set of symbols; Σ∗: set of strings; λ: the empty string
M = (Q, δ, Σ, q0, F): DFA; L(M): language accepted by M
A state q in M is alive if it is reached by some string α and left by some string β such that αβ ∈ L(M) ⇒ the minimal DFA has only one non-alive (dead) state d0
A set of strings P is live-complete w.r.t. M if ∀ live states q in M: ∃ α ∈ P with δ(q0, α) = q ⇒ P′ = P ∪ {d0} represents all states in M
Define f : P′ × Σ → Σ∗ ∪ {d0} by f(d0, b) = d0 and f(α, b) = αb
Transition set T′: P′ ∪ {f(α, b) | (α, b) ∈ P × Σ}; T = T′ − {d0}

SLIDE 29

Angluin’s ID-Algorithm

Aim: construct a partition of T′ that places all equivalent elements in one state
Equivalence relation: Nerode ⇒ the DFA is minimal
Start: one accepting and one non-accepting state
Partitioning:

  • ∀ i a string vi is drawn s.t. ∀ q, q′ ∃ j ≤ i with δ(q, vj) ∈ F and δ(q′, vj) ∉ F, or vice versa ⇒ i-th partition Ei: Ei(d0) = ∅ and Ei(α) = {vj | j ≤ i, αvj ∈ L(M)}

  • ∀α, β ∈ T with δ(q0, α) = δ(q0, β) we have Ej(α) = Ej(β), j ≤ i

SLIDE 30

Construct (i + 1)-th Partition

Separation:

  • ∀ i search for α, β and b s.t. Ei(α) = Ei(β) but Ei(f(α, b)) ≠ Ei(f(β, b))

γ: an element in Ei(f(α, b)) but not in Ei(f(β, b)), or vice versa
Set vi+1 = bγ and ∀ α ∈ T: query the string αvi+1
If αvi+1 ∈ L(M) ⇒ Ei+1(α) ← Ei(α) ∪ {vi+1}; otherwise Ei+1(α) ← Ei(α)
. . . iterate until no separating pair α, β exists

SLIDE 31

Pseudo Code

Input: live-complete set P, teacher answering membership queries
Output: canonical DFA M for the target regular grammar

i ← 0; vi ← λ; V ← {λ}
T ← P ∪ {f(α, b) | (α, b) ∈ P × Σ}; T′ ← T ∪ {d0}
E0(d0) ← ∅
for each α ∈ T
    if (α ∈ L) E0(α) ← {λ} else E0(α) ← ∅
while (∃ α, β ∈ P′ and b ∈ Σ: Ei(α) = Ei(β), but Ei(f(α, b)) ≠ Ei(f(β, b)))
    γ ← Select(Ei(f(α, b)) ⊕ Ei(f(β, b)))
    vi+1 ← bγ; V ← V ∪ {vi+1}; i ← i + 1
    for each α ∈ T
        if (αvi ∈ L) Ei(α) ← Ei−1(α) ∪ {vi} else Ei(α) ← Ei−1(α)
return the DFA M for L extracted from Ei and T
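The pseudocode translates fairly directly into Python. The sketch below assumes strings are tuples of symbols and the teacher is a membership predicate; it folds in the DFA extraction described on the next slide. The interface and the name id_learn are my own:

```python
from itertools import product

def id_learn(P, sigma, member):
    """Angluin's ID algorithm (sketch).

    P      : live-complete set of access strings (tuples), containing ().
    sigma  : the alphabet, an iterable of symbols.
    member : membership oracle, member(w) -> bool for a tuple w.
    Returns (states, start, accepting, delta) of the extracted DFA."""
    DEAD = None                                   # the dead state d0

    def f(alpha, b):                              # f(d0, b) = d0, f(a, b) = ab
        return DEAD if alpha is DEAD else alpha + (b,)

    T = set(P) | {f(a, b) for a, b in product(P, sigma)}
    E = {a: frozenset([()]) if member(a) else frozenset() for a in T}
    E[DEAD] = frozenset()                         # E(d0) = {}
    P_prime = set(P) | {DEAD}

    while True:
        # find alpha, beta, b with E(alpha) = E(beta) but differing successors
        sep = next(((a, c, b) for a, c, b in product(P_prime, P_prime, sigma)
                    if E[a] == E[c] and E[f(a, b)] != E[f(c, b)]), None)
        if sep is None:
            break
        a, c, b = sep
        gamma = next(iter(E[f(a, b)] ^ E[f(c, b)]))   # Select(... xor ...)
        v = (b,) + gamma                              # v_{i+1} = b . gamma
        for alpha in T:                               # refine the partition
            if member(alpha + v):
                E[alpha] = E[alpha] | {v}

    # extraction (see the next slide): states are the sets E(alpha)
    states = {E[a] for a in T}
    start = E[()]
    accepting = {s for s in states if () in s}
    delta = {(E[a], b): (E[a] if E[a] == frozenset() else E[f(a, b)])
             for a, b in product(P, sigma)}
    return states, start, accepting, delta

# Example: parity of 'a's over {a, b}; P = {lambda, 'a'} is live-complete:
# id_learn({(), ('a',)}, {'a', 'b'}, lambda w: w.count('a') % 2 == 0)
```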

SLIDE 32

Extracting the Automaton M

. . . from sets Ei and transition set T:

  • states of M are sets Ei(α), for α ∈ T
  • initial state of M is Ei(λ)
  • accepting states of M are sets Ei(α), where α ∈ T and λ ∈ Ei(α).

If Ei(α) = ∅ then we add self loops on the state Ei(α) for all b ∈ Σ; else we set the transition δ(Ei(α), b) = Ei(f(α, b)) for all α ∈ P and b ∈ Σ

SLIDE 33

Time Complexity

Theorem: for n = # states in M, ID asks no more than n · |Σ| · |P| queries.
Proof: Each time, at least one set Ei (corresponding to a state) is partitioned into two subsets. Each iteration asks |T| questions, where T contains no more than |Σ| · |P| elements ⇒ the algorithm iterates through the while-loop at most n times, since the partition can only be refined until each of the n states of M has its own class.
