Data Mining using Ant Colony Optimization

Thanks to: Johannes Singler, Bryan Atkinson

Presentation Outline

  • Introduction to Data Mining
  • Rule Induction for Classification
  • AntMiner

– Overview: Input/Output
– Rule Construction
– Quality Measurement
– Pheromone: Initial/Updating
– Experiments/Results
– Performance/Complexity

  • Swarm-based Genetic Programming

– Introduction to GP, Symbolic Regression
– Crossover problems
– Ant Colony Crossover
– Experiments and Results

Introduction

  • Data Mining tries to find:

– hidden knowledge
– unexpected patterns
– new rules
…in large databases.

  • “Discovery of useful summaries of data”
  • It is a key element of a much more elaborate process:

Knowledge Discovery in Databases (KDD)


Goals of Rule Induction

  • A stage of Data Mining: Rule Induction
  • Find rules to describe data in some way

– Not only accurate…
– …but also comprehensible for a human user…
– …to support decision making.

Focus in this Talk

  • Rule Induction for Classification using ACO

– Given: training set (instances/cases to classify)
– Goal: to come up with (preferably simple) rules to classify data

  • Algorithm by Parpinelli, Lopes and Freitas:

AntMiner

  • ACO + Genetic Programming

– Symbolic regression

Rule Induction

  • Possible outputs for Rule Induction:

– decision trees
– (ordered) decision lists [here]
– …

if <attribute1>=<value1> and <attribute2>=<value2> and …
then <class>=<class1>
else if …


AntMiner Input

  • Training set / test set
  • Attribute / value pairs
  • Given classes / classification

AntMiner Output

  • Ordered decision list

– Ordered list of IF-THEN rules of the form IF <condition> THEN <class>

  • <condition> = <term1> AND <term2> AND…

– <term> = <attribute> '=' <value>

– Plus a default rule (majority value).
– The first rule that matches "fires".

  • Only discrete attributes supported so far.

– Continuous values must be discretized beforehand.

  • This is a quite limited version of a decision list.

Prerequisites for an ACO (Review)

  • Problem-dependent heuristic function (η) for measuring the quality of items that could be added to the partial solution so far.

  • Pheromone updating rule (τ)
  • Probabilistic transition rule based on η and τ
  • Difference from most ACO algorithms mentioned in class: AntMiner does not use a graph representation of the problem.


AntMiner Algorithm: Top-Level

  • Pseudo-Code for finding one rule set:

trainingSet = {all training cases}
discoveredRuleList = []
WHILE (|trainingSet| still too big)
    Initialize pheromone (equally distributed)
    Ants try to find a good classification rule by the ACO heuristic
    Add best rule found to discoveredRuleList
    Remove correctly covered examples from trainingSet
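As a rough Python sketch of this loop, with illustrative helper names (find_rule, correctly_classifies) that are not from the paper:

    # Hypothetical sketch of AntMiner's top-level loop; max_uncovered
    # is the "training set still too big" threshold.
    def ant_miner(training_set, max_uncovered=10):
        discovered_rules = []
        uncovered = list(training_set)
        while len(uncovered) > max_uncovered:
            rule = find_rule(uncovered)        # ACO search for one rule
            discovered_rules.append(rule)
            # Drop the cases the new rule classifies correctly.
            uncovered = [case for case in uncovered
                         if not correctly_classifies(rule, case)]
        return discovered_rules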

AntMiner Algorithm: Mid-Level

  • Pseudo-Code for finding one rule:

Repeat
    Start new ant with empty rule (antecedent)
    Construct rule by adding one term at a time,
        choosing the rule consequent afterwards
    Prune rule
    Increase pheromone on the trail the ant used,
        according to the quality of the rule
Until (maximum number z of ants exceeded)
   or (no improvement any more during the last k iterations)

  • Effectively only a population of one ant is working at a time.
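A matching Python sketch of the per-rule search; construct_rule, prune, rule_quality, update_pheromone, and all_terms are assumed helpers standing in for the steps above:

    # Hypothetical sketch of finding one rule; z = max_ants,
    # k = convergence (iterations without improvement).
    def find_rule(cases, max_ants=3000, convergence=10):
        terms = all_terms(cases)              # every (attribute, value) pair
        pheromone = init_pheromone(terms)     # equal initial levels
        best_rule, best_q, stagnant = None, -1.0, 0
        for _ in range(max_ants):
            rule = construct_rule(cases, pheromone)  # one term at a time
            rule = prune(rule, cases)
            q = rule_quality(rule, cases)
            update_pheromone(pheromone, rule.terms, q)
            if q > best_q:
                best_rule, best_q, stagnant = rule, q, 0
            else:
                stagnant += 1
            if stagnant >= convergence:       # no improvement for k ants
                break
        return best_rule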

AntMiner Algorithm: Bottom-Level

  • Repeat as long as possible:

– Add one condition to the rule.

  • Use a probabilistic approach based on pheromone concentration and the heuristic.

  • Do not use attributes twice.
  • The resulting rule must cover at least a minimum number of cases.
  • After having finished the antecedent, calculate the resulting class.


Rule Construction

  • Probability of adding <Ai> = <Vij>:

    P_ij(t) = η_ij · τ_ij(t) / Σ_i Σ_j (η_ij · τ_ij(t))

  • where

– Ai is the i-th attribute
– Vij is the j-th possible value of the i-th attribute
– η is the heuristic function, τ the pheromone trail
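A small Python sketch of this transition rule, with eta and tau as dicts keyed by (attribute, value) terms; all names are illustrative:

    import random

    # Pick the next term with probability proportional to eta * tau,
    # skipping attributes the partial rule already uses.
    def choose_term(terms, eta, tau, used_attributes):
        candidates = [t for t in terms if t[0] not in used_attributes]
        weights = [eta[t] * tau[t] for t in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]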

Heuristic Function (η)

  • Analogous to:

– Proximity function in TSP
– Colouring matrix in the graph colouring problem

  • Uses information theory (entropy).

– Split the instances using the rule.
– Quality corresponds to the entropy of the remaining “buckets”; the less, the better.

    H(W | A_i = V_ij) = − Σ_{w=1}^{k} P(w | A_i = V_ij) · log2 P(w | A_i = V_ij)

  where k is the number of classes, and

    η_ij = (log2 k − H(W | A_i = V_ij)) / Σ_i Σ_j (log2 k − H(W | A_i = V_ij))

Information Heuristic Example

For T (temperature): high = T>80, mild = 70<T≤80, cold = 0<T≤70 (for later).
P(play | outlook=sunny) = 2/14 = 0.143, P(don't play | outlook=sunny) = 3/14 = 0.214
H(W, outlook=sunny) = −0.143·log2(0.143) − 0.214·log2(0.214) = 0.877
η = log2 k − H(W, outlook=sunny) = 1 − 0.877 = 0.123


Information Heuristic Example (continued)

For H (humidity): high = H>85, normal = 0<H≤85 (for later).
P(play | outlook=overcast) = 4/14 = 0.286, P(don't play | outlook=overcast) = 0/14 = 0
H(W, outlook=overcast) = −0.286·log2(0.286) = 0.516
η = log2 k − H(W, outlook=overcast) = 1 − 0.516 = 0.484
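A Python sketch of the term heuristic; it assumes the class probabilities are taken over the cases that match the term, while the worked examples above divide by the whole training set, so the normalization detail should be checked against [1]:

    import math

    # Entropy-based heuristic for one term A_i = V_ij. 'labels' are the
    # class labels of the matching cases; k is the number of classes.
    def term_heuristic(labels, k):
        n = len(labels)
        entropy = 0.0
        for c in set(labels):
            p = labels.count(c) / n
            entropy -= p * math.log2(p)     # H(W | A_i = V_ij)
        return math.log2(k) - entropy       # larger = better term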

Quality Function

  • Measuring the classification quality of a rule / several rules:

– For one rule: sensitivity · specificity

    Q = (TP / (TP + FN)) · (TN / (FP + TN))

  where T=true, F=false, P=positive, N=negative.
– The bigger the value of Q, the better.

  • Measuring the simplicity of a rule set:

– number of rules · average number of terms per rule
– The less, the simpler, thus the better.
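A sketch in Python, with cases as (features, label) pairs and an assumed covers() predicate testing whether the antecedent matches:

    # Q = sensitivity * specificity for a single rule.
    def rule_quality(rule, cases):
        tp = fp = tn = fn = 0
        for features, label in cases:
            predicted = covers(rule, features)     # antecedent matches?
            actual = (label == rule.consequent)
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
            else:
                tn += 1
        sens = tp / (tp + fn) if tp + fn else 0.0  # TP / (TP + FN)
        spec = tn / (fp + tn) if fp + tn else 0.0  # TN / (FP + TN)
        return sens * spec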

Rule Pruning

  • Iteratively remove one term at a time from the rule while this process improves the classification accuracy of the rule.

– The majority class might change.
– If ambiguous, remove the term that improves the accuracy the most.
– Simplicity improves anyway.
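A greedy sketch of this pruning loop; rule.without() and majority_class() are assumed helpers:

    # Repeatedly drop the single term whose removal improves Q the most,
    # re-deriving the consequent (majority class) after each removal.
    def prune(rule, cases):
        while len(rule.terms) > 1:
            base_q = rule_quality(rule, cases)
            scored = []
            for term in rule.terms:
                cand = rule.without(term)          # copy minus one term
                cand.consequent = majority_class(cand, cases)
                scored.append((rule_quality(cand, cases), cand))
            best_q, best = max(scored, key=lambda s: s[0])
            if best_q <= base_q:                   # no removal helps
                break
            rule = best
        return rule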


Pheromone

  • Initial pheromone value:

    τ_ij(t=0) = 1 / (Σ_{i=1}^{a} b_i)

  where a is the total number of attributes and b_i is the number of possible values of attribute A_i.

Pheromone Updating (τ)

  • Values before the update (1).
  • First increase the pheromone of used terms according to rule quality (2):

    τ_ij(t+1) = τ_ij(t) · (1 + Q)

  • Then normalize the pheromone level of all terms → pheromone evaporation (3).
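Both steps as a Python sketch over a dict of pheromone levels keyed by (attribute, value) terms; names are illustrative:

    # Initialization: equal level 1 / (total number of terms),
    # i.e. 1 / (sum of b_i over all attributes).
    def init_pheromone(terms):
        tau0 = 1.0 / len(terms)
        return {term: tau0 for term in terms}

    # Update: reinforce the used terms by rule quality Q, then divide
    # every entry by the new total (normalization = evaporation).
    def update_pheromone(tau, used_terms, q):
        for term in used_terms:
            tau[term] *= 1.0 + q
        total = sum(tau.values())
        for term in tau:
            tau[term] /= total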

Using the Discovered Rules

  • Apply the rules in the order they were discovered.
  • The first rule that covers a case is applied.
  • If no rule covers the case, apply the default result (majority value).
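A sketch of rule application, with a case's features as a dict and rule.terms as (attribute, value) pairs:

    # First matching rule fires; the default (majority) class is the
    # fallback when no rule covers the case.
    def classify(decision_list, features, default_class):
        for rule in decision_list:
            if all(features.get(a) == v for a, v in rule.terms):
                return rule.consequent
        return default_class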

Possible Discretization of Continuous Attributes

  • Use C4.5-Disc
  • Quick overview:

– Extract a reduced data set that contains only the attribute to discretize and the desired classification.
– From that, build a decision tree using the C4.5 algorithm (another rule induction algorithm).
– Result: a decision tree with binary decisions (x ≤ a → go left; x > a → go right).
– Each path corresponds to the definition of a categorical interval.

AntMiner’s Parameters

  • Number of ants (3000 used in the experiments). Also limits the maximum number of rules found for a classification. Not necessarily exhausted, because the algorithm might converge earlier.
  • Minimum number of cases per rule (10). Each rule must cover at least this many cases. Avoids overfitting.
  • Maximum number of uncovered cases in the training set (10). The algorithm stops when fewer instances than this remain uncovered.
  • Number of rules to test for the convergence of the ants (10). The algorithm waits this long for an improvement.

Sample Run Start

  • Deciding whether to play outside

– Attributes: outlook, temperature, humidity, windy, play
– Classes: play (yes), do not play (no)
– sunny,hot,high,FALSE,no (1)
– sunny,hot,high,TRUE,no (2)
– overcast,hot,normal,FALSE,yes (3)
– rainy,mild,high,FALSE,yes (4)
– rainy,cool,normal,FALSE,yes (5)
– rainy,cool,normal,TRUE,no (6)
– overcast,cool,normal,TRUE,yes (7)
– sunny,mild,high,FALSE,no (8)
– sunny,cool,normal,FALSE,yes (9)
– rainy,mild,normal,FALSE,yes (10)
– sunny,mild,normal,TRUE,yes (11)
– overcast,mild,high,TRUE,yes (12)
– overcast,hot,normal,FALSE,yes (13)
– rainy,mild,high,TRUE,no (14)

  • Sample run for finding one rule set.
  • Start: I = {all}, R = {}
  • Ant 1: probabilistically chooses outlook=overcast (then play=yes).
  • Ant 1: chooses values for the other attributes…
  • Ant 1: finishes because all attributes are used.
  • Ant 1: the last three conditions are pruned away.
  • I = {1,2,4,5,6,8,9,10,11,14}, R = {outlook=overcast → yes}
  • Ant 2: chooses outlook=rainy (then play=yes).
  • The rule is not good enough (3:2).
  • Ant 2: chooses windy=true (then play=no).
  • Ant 2 finishes because otherwise the covered set would be too small.
  • No pruning is possible either.

Sample Run Result

  • Possible result (not the simplest):

– outlook=overcast → play=yes
– outlook=rainy, windy=false → play=yes
– outlook=sunny, humidity=normal → play=yes
– otherwise → play=no

Comparison to CN2 Algorithm

  • Uses beam search (limited breadth-first search with beam width b).
  • Adds all possible terms to the current partial rules, evaluates them, and retains only the b best ones.

  • No feedback for constructing new rules.
  • Output format is the same (ordered rule list).
  • Uses entropy heuristic as well.

Experiment Setup

  • Data set dimensions, roughly: 100…1000 cases, 9…34 attributes, 2…6 classes

  • Tests were run using a 10-fold cross-validation procedure:

– Divide the data into 10 partitions.
– For each partition:
  • Treat it as the test data and use the other 90% as the training data.
  • Measure the performance.
– Take the average value.

  • This helps to achieve statistically significant results (a sketch of the procedure follows).
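A plain Python sketch of the procedure, where train_and_score is an assumed callback that trains on one split and returns a performance score:

    import random

    def cross_validate(data, train_and_score, folds=10):
        data = data[:]                      # shuffle a copy
        random.shuffle(data)
        scores = []
        for f in range(folds):
            test = data[f::folds]           # roughly 10% of the cases
            train = [c for i, c in enumerate(data) if i % folds != f]
            scores.append(train_and_score(train, test))
        return sum(scores) / folds          # average performance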

Data Sets

Performance Results

  • No particular parameter optimization for either algorithm.
  • Same computation time.

Extensions to the Algorithm

  • By Galea [3].
  • Deterministic rule with probability q, as in ACS-TSP:

– With probability q, choose probabilistically (considering pheromone trail and heuristic function).
– Otherwise, deterministically choose the term with the maximum probability.
– Improves the results slightly.

  • Extension for fuzzy rules also possible.

Comparative Results

Side-by-side Comparison

Effects of Rule Pruning


Generated Rules

Terms per Rule

Algorithm Complexity

  • Introducing a number of variables:

– n: number of cases
– a: number of attributes
– v: number of values per attribute; considered small, O(1)
– k: number of conditions per inspected rule while evaluating and pruning
– z: number of ants
– r: number of discovered rules


Complexity Comparison

  • Ant-Miner, average case: O(r·z·[k·a + n·k³] + a·n)
  • Ant-Miner, worst case, k = O(a): O(r·z·a³·n)
  • CN2: O(a·(n + log a))

Further Experiments

  • Further experiments by the authors of AntMiner show that ACO really helps:

– The use of pheromone trails improves the average solution.
– The use of rule pruning improves simplicity without harming quality.

References

  • [1] Data Mining with an Ant Colony Optimization Algorithm. Parpinelli, Lopes, Freitas.
  • [2] An Ant Colony Based System for Data Mining: Applications to Medical Data. Parpinelli, Lopes, Freitas, 2001.
  • [3] Applying Swarm Intelligence to Rule Induction. Michelle Galea, 2002.
  • [4] The CN2 Induction Algorithm. Clark, Niblett, 1988.
  • [5] Data Mining. Adriaans, Zantinge. Addison-Wesley, 1996.
  • [6] Learning Fuzzy Rules Using Ant Colony Optimization Algorithms. Casillas, Cordón, Herrera, 2000.
  • [7] Bryan Atkinson, Honours Project Report: http://www.scs.carleton.ca/~arpwhite/documents/honoursProjects/bryan-atkinson-winter-2006.pdf


Ant-based Programming

  • Genetic Programming has been successful at inducing program descriptions.
  • Problems with scaling:

– Diversity
– Retaining useful fragments:
  • Avoiding disruption of higher-order functions

  • Can ACO help?

– Maybe: learn useful associations, avoid disruption.

Genetic Programming

  • Programs represented in tree structure
  • Learning through:

– Population-based, evolutionary search
– Genetic operators: crossover, mutation

  • Requires specification of:

– Functions (F): internal nodes
– Terminals (T): leaf nodes

  • Symbolic Regression:

– F = {+, -, /, *, sin, cos, exp}
– T = {integers in range (-5, 5), X}

Symbolic Regression

Find the function that best fits a number of sample points. Goodness of fit is determined by hits: cases where the candidate function is within a threshold distance.

    f(k) = h(k) − (1 / max(h(k), 1)) · Σ_{i=1}^{size(D)} e(k,i)

    e(k,i) = | v(k, x(i)) − y(i) |

    v(k,x) = value of the k-th program for input x

    h(k) = Σ_{i=1}^{size(D)} hits(k,i)

    hits(k,i) = 0 if e(k,i) ≥ 1, 1 otherwise
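A Python sketch of this fitness as reconstructed above; program is a callable candidate, D a list of (x, y) samples, and the hit threshold of 1 follows the formula as read here:

    # f(k) = h(k) - (1 / max(h(k), 1)) * sum of errors
    def fitness(program, D, threshold=1.0):
        errors = [abs(program(x) - y) for x, y in D]      # e(k, i)
        hits = sum(1 for e in errors if e < threshold)    # h(k)
        return hits - sum(errors) / max(hits, 1)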

Symbolic Regression Example

3x + sin(x)

Shown mathematically and as a GP tree (figure not reproduced).

Crossover

Problem: crossover can easily disrupt useful couplings such as *-X (tree diagrams not reproduced).

Adapting Crossover with ACO

  • Use context-aware crossover.
  • Basic crossover chooses a node randomly, i.e. context-unaware.
  • Adapt crossover to remember useful function couplings.

– Not automatically defined functions (ADFs)


Function Coupling Matrix (C)

Function   +     *     sin   cos   X
+          0.1   0.1   1.0   1.0   0.1
*          0.1   0.1   0.1   0.1   1.0
sin        1.0   1.0   0.1   1.0   0.1
cos        0.1   1.0   0.1   0.1   0.1

Important couplings have high values; e.g. sin-x

Swarm-based GP (SB-GP)

Three modifications to GP:

1. Initialization of the coupling matrix C.
2. Crossover using the coupling matrix.
3. Pheromone update based upon program fitness.

Pheromone Initialization

  • For all function and terminal couplings (i, j):

– Initialize the pheromone τ_i,j to the initial value τ0.

  • τ0 is a system parameter.

Ant Colony (AC) Crossover

Choose a random branch B from the root to a leaf in program tree P_n.
For every edge (i, j) in B, the probability of choosing node i (the parent of child node j) as the root of subtree S_n is:

    p(i, n) = (τ_max(n) − τ_min(n) + τ_i,j(n)) / T(n)

Choose a random branch B from the root to a leaf in program tree P_m.
For every edge (i, j) in B, the probability of choosing node i as the root of subtree S_m is:

    p(i, m) = (τ_max(m) − τ_min(m) + τ_i,j(m)) / T(m)

where T(k) is given by:

AC Crossover Continued

    T(k) = Σ_{(i,j)∈E(k)} (τ_max(k) − τ_min(k) + τ_i,j(k))

and

    τ_i,j(k) = C(V(k,i), V(k,j))
    τ_max(k) = max_{(i,j)∈E(k)} τ_i,j(k)
    τ_min(k) = min_{(i,j)∈E(k)} τ_i,j(k)
    E(k) = { edges in the k-th program subtree }
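A Python sketch of the per-branch choice; branch_edges are the (parent, child) pairs on B, and tau(i, j) looks up the coupling-matrix entry C(V(k,i), V(k,j)). As a simplification, this sketch takes τ_max/τ_min over the branch rather than over E(k):

    import random

    def choose_subtree_root(branch_edges, tau):
        values = [tau(i, j) for i, j in branch_edges]
        t_max, t_min = max(values), min(values)
        weights = [t_max - t_min + v for v in values]   # as in p(i, n)
        parent, _child = random.choices(branch_edges,
                                        weights=weights, k=1)[0]
        return parent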

AC Crossover Example


Experimental Parameters

Parameter                             Value
Initial Pheromone                     10^-5
Evaporation rate p                    0.9
Best k programs used for evaluation   30
Max Program Depth                     15
Min Program Depth                     4
Tournament Size                       7
Crossover probability                 0.9
Mutation Probability                  0.01
Number of Generations (default)       50

Functions and Results

F1: cos(X^2) + sin(X^2) + X^2
F2: cos(X^2) + sin(X^2) + X^2 + cos(X) + sin(X)
F3: sin(X)·X^4 + sin(X)·X^3 + sin(X)·X^2 + sin(X)·X

Test   GP Mean   GP STD   SB-GP Mean   SB-GP STD   P Value   Population Size
F1     6.6       2.01     4.35         0.75        0.0001    1500
F1     24.73     18.9     6.18         2.89        0.0043    500
F1     37.7      19.8     24.4         22.1        0.1745    200
F2     27.8      19.6     11.6         4.69        0.0043    800
F3     43.1      14.5     28.2         16.8        0.0488    500

F3: Function Couplings


Conclusions

  • Statistically significant improvement in performance

  • Useful couplings learnt
  • Number of successful trials increased
  • Couplings can saturate:

– Use an ACS-style q mechanism to choose randomly some of the time.