SLIDE 1

Merging Classifiers of Different Classification Approaches

Incremental Classification, Concept Drift and Novelty Detection Workshop

Antonina Danylenko¹ and Welf Löwe¹

antonina.danylenko@lnu.se
14 December, 2014

¹Linnaeus University, Sweden

Department of Computer Science, Linnaeus University — Merging Classifiers of Different Classification Approaches 1(28)

SLIDE 2

Agenda

◮ Introduction;
◮ Problem, Motivation, Approach;
◮ Decision Algebra;
◮ Merge as an Operation of Decision Algebra;
◮ Merging Classifiers;
◮ Experiments;
◮ Conclusions.

SLIDE 3

Introduction

◮ Classification is a common problem that arises in different fields of Computer Science (data mining, information storage and retrieval, knowledge management);

◮ Classification approaches are often tightly coupled to:

  ◮ learning strategies: different algorithms are used;
  ◮ data structures: information is represented in different ways;
  ◮ how common problems are addressed: workarounds;

◮ It is not easy to select an appropriate classification model for a classification problem (one must weigh accuracy, robustness, and scalability);

SLIDE 4

Problem and Motivation

◮ Simply combining classifiers learned over different data sets of the same problem is not straightforward;

◮ Current work is done in aggregation and meta-learning:

  ◮ combine different classifiers learned over the same data set;
  ◮ construct a single classifier learned on different variations of the same classification problem;
  ◮ as a result, these approaches do not take into account that the context can differ.

◮ Approaches that combine classifiers with partly or completely disjoint contexts use one single classification approach for the base-level classifiers;

◮ Generality gets lost: results are incomparable, benchmarking is difficult, and it is hard to propagate advances between domains;

SLIDE 5

Proposed Approach

◮ Use Decision Algebra, which defines classifiers as re-usable black boxes in terms of so-called decision functions;

◮ Define a general merge operation over these decision functions that allows for symbolic computations with the captured classification information;

◮ Show an example of merging classifiers of different classification approaches;

◮ Show that the merger of classifiers tends to become more accurate;

SLIDE 6

Classification Information

◮ Classification information is a set of decision tuples:

  CI = {(a1, c1), . . . , (an, cn)}

◮ It is complete if: ∀a ∈ A : (a, c) ∈ CI;

◮ It is non-contradictive if: ∀(ai, ci), (aj, cj) ∈ CI : ai = aj ⇒ ci = cj;

◮ The problem domain (A, C) of CI is a superset of A × C that defines the actual classification problem, where A ∈ A;
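The two properties above are easy to check mechanically. The following is a minimal sketch in Python (not from the talk); the representation of CI as a set of (context, class) tuples and names like `is_complete` are assumptions for illustration.

```python
# Hypothetical sketch: classification information CI as a set of
# (context, class) decision tuples; all names are illustrative.

def is_complete(ci, contexts):
    """Complete: every context a in A occurs in some decision tuple."""
    covered = {a for a, _c in ci}
    return all(a in covered for a in contexts)

def is_non_contradictive(ci):
    """Non-contradictive: ai = aj implies ci = cj."""
    seen = {}
    for a, c in ci:
        if seen.setdefault(a, c) != c:
            return False
    return True

ci = {(("high", "low"), "acc"), (("low", "low"), "na")}
assert is_complete(ci, [("high", "low"), ("low", "low")])
assert is_non_contradictive(ci)
assert not is_non_contradictive(ci | {(("high", "low"), "na")})
```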

SLIDE 7

Decision Function

◮ A decision function is a representation of complete and possibly contradictive decision information: df : A → D(C) maps an actual context a ∈ A to a (probability) distribution D(C);

◮ It is a higher-order (or curried) function:

  df^n : A_n → (A_{n−1} → (. . . (A_1 → D(C)) . . . ));

◮ It can easily be represented as a decision tree or decision graph:

  df^n = x_1(df^{n−1}_1, . . . , df^{n−1}_{|Λ1|})

  where Λi is the domain of attribute Ai.
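The curried view can be made concrete with a small sketch: a decision function as nested dicts, one level per attribute. This encoding is an assumption of this summary, not the authors' implementation.

```python
# Sketch (assumed encoding): a decision function as a nested dict,
# one level per attribute; leaves are class distributions D(C).
# Fixing the first attribute's value yields the curried df^{n-1}.

df2 = {
    "high": {"high": {"na": 1.0}, "low": {"a": 1.0}},
    "low":  {"high": {"a": 1.0},  "low": {"a": 1.0}},
}

def apply_df(df, context):
    """Evaluate df on a context by binding one attribute per level."""
    node = df
    for value in context:
        node = node[value]
    return node   # the final distribution over classes

partial = df2["high"]                 # df^1: A1 already bound to "high"
assert apply_df(df2, ("high", "low")) == {"a": 1.0}
assert partial["low"] == {"a": 1.0}
```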

SLIDE 8

Graph Representation of Decision Function

◮ Decision function df^2 = x_1(na, x_2(na, na, a, a), x_2(na, na, a, a), a)

Figure: A tree (left) and graph (right) representation of df^2. Each node labeled with n represents a decision term with a selection operator x_n; each square leaf node labeled with c corresponds to a probability distribution over classes C with c the most probable class.

SLIDE 9

Decision Algebra

◮ Decision Algebra (DA) is a theoretical framework defined as a parameterized specification, with A and D(C) as parameters. It provides a general representation of classification information as an abstract classifier;

SLIDE 10

Operations Over Decision Functions

◮ Constructor x^n:

  x^n : Λ_1 × DF[A′, D] × · · · × Λ_1 × DF[A′, D] → DF[A, D]
        (|Λ_1| times)

◮ Bind binds attribute A_i to an attribute value a ∈ Λ_i:

  bind_{A_i} : DF[A, D] × Λ_i → DF[A′, D]
  bind_{A_1}(x^n(a_1, df_1, · · · , a_{|Λ1|}, df_{|Λ1|}), a) ≡ df_i, if a = a_i
  Example: bind_{A_1}(df^2, high) = x^2(na, na, a, a)

◮ Evert changes the order of attributes in the decision function:

  evert_{A_i} : DF[A, D] → DF[A′, D]
  evert_{A_i}(df) := x(a_1, bind_{A_i}(df, a_1), . . . , a_{|Λi|}, bind_{A_i}(df, a_{|Λi|}))
  Example: evert_{A_2}(df^2) = x_2(x_1(na, na, na, a), x_1(na, na, na, a), x_1(na, a, a, a), x_1(na, a, a, a))
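Bind and evert can be sketched on a tuple-encoded decision tree. The `(attribute, {value: subtree})` encoding and the helper names below are assumptions for illustration, not the authors' API.

```python
# Sketch of bind and evert on a tuple-encoded decision tree:
# (attribute, {value: subtree}); leaves are class distributions.

def bind_at(df, attr, value):
    """bind_Ai: fix attribute `attr` to `value` wherever it selects."""
    if not isinstance(df, tuple):          # leaf distribution
        return df
    a, children = df
    if a == attr:
        return children[value]
    return (a, {v: bind_at(sub, attr, value) for v, sub in children.items()})

def evert(df, attr, domain):
    """evert_Ai: pull attribute `attr` up to the root of the tree."""
    return (attr, {v: bind_at(df, attr, v) for v in domain})

df2 = ("A1", {"h": ("A2", {"h": {"na": 1.0}, "l": {"a": 1.0}}),
              "l": ("A2", {"h": {"a": 1.0},  "l": {"a": 1.0}})})
turned = evert(df2, "A2", ["h", "l"])
assert turned == ("A2", {"h": ("A1", {"h": {"na": 1.0}, "l": {"a": 1.0}}),
                         "l": ("A1", {"h": {"a": 1.0},  "l": {"a": 1.0}})})
```

Note that evert only reorders the attributes: evaluating `df2` and `turned` on the same context (with the attribute order swapped) gives the same leaf.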

SLIDE 11

Merge Operation over Decision Functions

◮ Merge operator ⊔_D over class distributions D(C):

  ⊔_D : D(C) × D(C) → D(C)
  d(C) ⊔_D d′(C) = {(c, p + p′) | (c, p) ∈ d(C), (c, p′) ∈ d′(C)}

◮ General merge operation over decision functions:

  ⊔ : DF^1[A, D] × DF^2[A, D] → DF′[A, D]

◮ Merge over constant decision functions df^0_1, df^0_2 ∈ DF^∅[{x_0}, D]:

  ⊔(df^0_1, df^0_2) := x^0(⊔_D(df^0_1, df^0_2))
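The distribution merge ⊔_D is just class-wise addition of probabilities. A one-function sketch (dict-encoded distributions are an assumption of this summary):

```python
# The distribution merge ⊔_D, as defined above: class-wise addition
# of probabilities (no renormalisation appears in the definition).

def merge_dist(d1, d2):
    return {c: d1.get(c, 0.0) + d2.get(c, 0.0) for c in set(d1) | set(d2)}

d = merge_dist({"a": 0.75, "na": 0.25}, {"a": 0.5, "na": 0.5})
assert d == {"a": 1.25, "na": 0.75}   # most probable class stays "a"
```

Since only the most probable class matters for classification, the missing normalisation does not change the predicted class.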

SLIDE 12

Scenario One: Same Formal Context

◮ Prerequisite: The decision functions df_1 ∈ DF^1[A, D] and df_2 ∈ DF^2[A′, D] are constructed over different samples of the same problem domain and A = A′ = Λ_1 × . . . × Λ_n;

  ⊔(df_1, df_2) := x^n(a_1, ⊔(bind_{A_1}(df_1, a_1), bind_{A_1}(df_2, a_1)), . . . , a_k, ⊔(bind_{A_1}(df_1, a_k), bind_{A_1}(df_2, a_k)))

SLIDE 13

Scenario One: Cont’d

1: if df_1 ∈ DF^∅[{x_0}, D] ∧ df_2 ∈ DF^∅[{x_0}, D] then
2:   return x(⊔_D(df_1, df_2))
3: end if
4: for all a ∈ Λ_1 do
5:   df_a = ⊔(bind_1(df_1, a), bind_1(df_2, a))
6: end for
7: return x(a_1, df_{a_1}, . . . , a_{|Λ1|}, df_{a_{|Λ1|}})

Figure: Merging two decision graphs over the same formal context, steps (a)–(d) with intermediate merges (c.1)–(c.4).
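The scenario-one algorithm can be sketched directly on the tuple-encoded trees used earlier; this encoding and the recursion are an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of the scenario-one merge for two decision trees
# over the same formal context (same attributes, same order).

def merge_dist(d1, d2):
    return {c: d1.get(c, 0.0) + d2.get(c, 0.0) for c in set(d1) | set(d2)}

def merge(df1, df2):
    # lines 1-3: both are constant decision functions (leaves)
    if not isinstance(df1, tuple) and not isinstance(df2, tuple):
        return merge_dist(df1, df2)
    # lines 4-7: same root attribute, merge value-wise after binding
    a, ch1 = df1
    _, ch2 = df2
    return (a, {v: merge(ch1[v], ch2[v]) for v in ch1})

df1 = ("A1", {"h": {"a": 0.75, "na": 0.25}, "l": {"a": 0.25, "na": 0.75}})
df2 = ("A1", {"h": {"a": 0.5,  "na": 0.5},  "l": {"a": 0.25, "na": 0.75}})
merged = merge(df1, df2)
assert merged[1]["h"] == {"a": 1.25, "na": 0.75}
assert merged[1]["l"] == {"a": 0.5,  "na": 1.5}
```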

SLIDE 14

Scenario Two: Disjoint Formal Contexts

◮ Prerequisite: The decision functions df_1 ∈ DF^1[A, D] and df_2 ∈ DF^2[A′, D] are constructed over samples with disjoint formal contexts of the same problem domain: A = Λ_1 × . . . × Λ_n, A′ = Λ′_1 × . . . × Λ′_m, and the attribute sets are disjoint: {A_1, . . . , A_n} ∩ {A′_1, . . . , A′_m} = ∅;

  ⊔(df_1, df_2) := x^n(a_1, ⊔(bind_{A_1}(df_1, a_1), bind_{A_1}(df_2, a_1)), . . . , a_k, ⊔(bind_{A_1}(df_1, a_k), bind_{A_1}(df_2, a_k)))
  ⊔(df^0_1, df_2) := ⊔(df_2, df^0_1)

SLIDE 15

Scenario Two: Cont’d

1: if df_1 ∈ DF^∅[{x_0}, D] ∧ df_2 ∈ DF^∅[{x_0}, D] then
2:   return x(⊔_D(df_1, df_2))
3: end if
4: if df_1 ∈ DF^∅[{x_0}, D] then
5:   return ⊔(df_2, df_1)
6: end if
7: for all a ∈ Λ_1 do
8:   df_a = ⊔(bind_1(df_1, a), bind_1(df_2, a))
9: end for
10: return x(a_1, df_{a_1}, . . . , a_{|Λ1|}, df_{a_{|Λ1|}})

Figure: Merging two decision graphs over disjoint formal contexts, steps (a)–(d).
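For disjoint contexts, binding A_1 leaves df_2 unchanged (it has no A_1), so df_2 is pushed down intact until both sides are constant. A sketch under the same assumed tree encoding:

```python
# Sketch of the scenario-two merge for disjoint formal contexts.

def merge_dist(d1, d2):
    return {c: d1.get(c, 0.0) + d2.get(c, 0.0) for c in set(d1) | set(d2)}

def merge(df1, df2):
    leaf1 = not isinstance(df1, tuple)
    leaf2 = not isinstance(df2, tuple)
    if leaf1 and leaf2:                 # lines 1-3: two constants
        return merge_dist(df1, df2)
    if leaf1:                           # lines 4-6: swap arguments
        return merge(df2, df1)
    a, ch1 = df1                        # lines 7-10: recurse on df1,
    return (a, {v: merge(ch1[v], df2) for v in ch1})   # df2 unchanged

df1 = ("A1", {"h": {"a": 0.75, "na": 0.25}, "l": {"a": 0.25, "na": 0.75}})
df2 = ("B1", {"x": {"a": 0.5, "na": 0.5},  "y": {"a": 0.0, "na": 1.0}})
merged = merge(df1, df2)
# The merged tree selects on A1 first, then on B1:
assert merged[0] == "A1" and merged[1]["h"][0] == "B1"
assert merged[1]["h"][1]["x"] == {"a": 1.25, "na": 0.75}
```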

SLIDE 16

Scenario Three: General Case

◮ Prerequisite: Scenarios one and two are special cases of this general case. The decision functions df_1 ∈ DF^1[A, D] and df_2 ∈ DF^2[A′, D] are constructed over samples with arbitrary formal contexts of the same problem domain: A = Λ_1 × . . . × Λ_n and A′ = Λ′_1 × . . . × Λ′_m;

  ⊔(df_1, df_2) := x^n(a_1, ⊔(bind_{A_1}(df_1, a_1), bind_{A_1}(df_2, a_1)), . . . , a_k, ⊔(bind_{A_1}(df_1, a_k), bind_{A_1}(df_2, a_k)))
  ⊔(df^0_1, df_2) := ⊔(df_2, df^0_1)
  ⊔(df_1, df_2) := ⊔(df_1, evert_{A_1}(df_2)) iff A_1 ∈ {A′_2, . . . , A′_m}

SLIDE 17

Scenario Three: Cont’d

1: if df_1 ∈ DF^∅[{x_0}, D] ∧ df_2 ∈ DF^∅[{x_0}, D] then
2:   return x(⊔_D(df_1, df_2))
3: end if
4: if df_1 ∈ DF^∅[{x_0}, D] then
5:   return ⊔(df_2, df_1)
6: end if
7: if A_1 ≠ A′_1 ∧ A_1 ∈ {A′_2, . . . , A′_m} then
8:   return ⊔(df_1, evert_{A_1}(df_2))
9: end if
10: for all a ∈ Λ_1 do
11:   df_a = ⊔(bind_1(df_1, a), bind_1(df_2, a))
12: end for
13: return x(a_1, df_{a_1}, . . . , a_{|Λ1|}, df_{a_{|Λ1|}})

Figure: Merging two decision graphs with overlapping formal contexts, steps (a)–(d); evert is applied in the intermediate steps (c.1)–(c.2).

SLIDE 18

Accuracy of the Merged Decision Functions

◮ A decision function df_1 is more accurate than a decision function df_2 iff it more often gives the "right" classification based on some ground truth (which is usually not known);

◮ oracle_a : C → R is the accurate classification probability distribution;

◮ oracle : A → D(C) is an accurate decision function with ∀a ∈ A : oracle(a) = oracle_a;

◮ df : A → D(C) is probably accurate with respect to oracle iff ∀a ∈ A : df(a) is a random sample of oracle_a;

◮ Theorem: Let df_1, . . . , df_n be a series of independently learned decision functions df : A → D(C) that are probably accurate with respect to an accurate decision function oracle : A → D(C). For large n, the merged decision function df_1 ⊔ . . . ⊔ df_n converges in probability to the oracle.
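The theorem can be illustrated (not proved) with a small simulation: merge many independently drawn, probably accurate constant classifiers and compare the normalised result with the oracle distribution. All parameters below are arbitrary choices for this sketch.

```python
# Simulation: merging n probably accurate leaf classifiers
# approaches the oracle distribution as n grows.
import random

random.seed(0)
oracle = {"a": 0.7, "na": 0.3}

def sampled_leaf(k=5):
    """A probably accurate df: its value is a k-sample of the oracle."""
    hits = sum(random.random() < oracle["a"] for _ in range(k))
    return {"a": hits / k, "na": (k - hits) / k}

def merge_dist(d1, d2):
    return {c: d1.get(c, 0.0) + d2.get(c, 0.0) for c in set(d1) | set(d2)}

merged = sampled_leaf()
for _ in range(1999):
    merged = merge_dist(merged, sampled_leaf())

estimate = merged["a"] / sum(merged.values())
assert abs(estimate - oracle["a"]) < 0.05   # close to the oracle
```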

SLIDE 19

Naïve Bayesian Classifiers

◮ Constructor:

  nb^n : D(C) × PD^1_1 × · · · × PD^1_n × . . . × PD^k_1 × · · · × PD^k_n → NB[A, D]
         (n × k conditional probability distributions)

◮ Probability distribution functions: PD^j_i = PD(Λ_i | C = c_j);

◮ Bind operation: bind_{A_i} : NB[A, D] × Λ_i → NB[A′, D]

  bind_{A_i}(df^n, a) := nb^{n−1}(cons_D((c_1, prob_{c_1,i}(df^n, a)), . . . , (c_k, prob_{c_k,i}(df^n, a))),
    pd_1(df^n, c_1), . . . , pd_{i−1}(df^n, c_1), pd_{i+1}(df^n, c_1), . . . , pd_n(df^n, c_1),
    . . . ,
    pd_1(df^n, c_k), . . . , pd_{i−1}(df^n, c_k), pd_{i+1}(df^n, c_k), . . . , pd_n(df^n, c_k))

  where prob_{c,i}(df, a) = prob(dist(df), c) · prob(pd_{A_i}(df, c), a)

SLIDE 20

Bind: Example

Class: Car Acceptability
  P(Accept) = 0.69, P(Not accept) = 0.31

Buying price
  P(high | Accept) = 0.31, P(low | Accept) = 0.69
  P(high | Not accept) = 0.14, P(low | Not accept) = 0.86

Maintenance price
  P(high | Accept) = 0.26, P(low | Accept) = 0.74
  P(high | Not accept) = 0.68, P(low | Not accept) = 0.32

After bind_Buying(NB, high):

Class: Car Acceptability
  P(Accept) = 0.69 · 0.31, P(Not accept) = 0.31 · 0.14

Maintenance price (unchanged)
  P(high | Accept) = 0.26, P(low | Accept) = 0.74
  P(high | Not accept) = 0.68, P(low | Not accept) = 0.32
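The bind example can be replayed numerically (values taken from the slide): binding Buying = high multiplies the class prior by P(Buying = high | class) and drops the Buying table, while the Maintenance table is untouched.

```python
# Numeric replay of bind_Buying(NB, high) with the slide's numbers.

prior = {"Accept": 0.69, "Not accept": 0.31}
p_buying_high = {"Accept": 0.31, "Not accept": 0.14}

bound_prior = {c: prior[c] * p_buying_high[c] for c in prior}
assert round(bound_prior["Accept"], 4) == 0.2139      # 0.69 * 0.31
assert round(bound_prior["Not accept"], 4) == 0.0434  # 0.31 * 0.14
```

The bound distribution is not normalised; as with ⊔_D, the argmax (here "Accept") is all that matters for classification.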

SLIDE 21

Merging of Naïve Bayesian Classifiers

⊔_NB : NB[A, D] × NB[A′, D] → NB[A′′, D]

⊔_NB(df, df′) := nb^n(⊔_D(dist(df), dist(df′)),
  pd_{A_1}(df, c_1), . . . , pd_{A_i}(df, c_1), pd_{A′_1}(df′, c_1), . . . , pd_{A′_j}(df′, c_1),
  ⊔_D(pd_{A′′_1}(df, c_1), pd_{A′′_1}(df′, c_1)), . . . , ⊔_D(pd_{A′′_l}(df, c_1), pd_{A′′_l}(df′, c_1)),
  . . . ,
  pd_{A_1}(df, c_k), . . . , pd_{A_i}(df, c_k), pd_{A′_1}(df′, c_k), . . . , pd_{A′_j}(df′, c_k),
  ⊔_D(pd_{A′′_1}(df, c_k), pd_{A′′_1}(df′, c_k)), . . . , ⊔_D(pd_{A′′_l}(df, c_k), pd_{A′′_l}(df′, c_k)))

where pd_{A_i} : NB[A, D] × C → D(A_i)

SLIDE 22

Scenario One: Same Formal Context

Figure: Merging two naïve Bayesian classifiers over the same formal context (car evaluation data). Merged class distribution: [0.69, 0.23, 0.04, 0.03]; merged conditional distributions, e.g. safety (6) = high: [0.24, 0.57, 0.44, 1], per cap (4) = more: [0.29, 0.47, 0.5, 0.53], buying (1) = med: [0.21, 0.28, 0.38, 0.6], maint (2) = low: [0.22, 0.29, 0.69, 0.53].

SLIDE 23

Scenario Two: Disjoint Contexts

Replace ⊔(df^0_1, df_2) := ⊔(df_2, df^0_1) with:

⊔_H : DF^0[{x_0}, D] × NB[A, D] → NB[A, D]

df^0_dg ⊔_H df_nb := nb(dist(df^0_dg) ⊔_D dist(df_nb),
  pd_{A_1}(df, c_1), . . . , pd_{A_n}(df, c_1), . . . , pd_{A_1}(df, c_k), . . . , pd_{A_n}(df, c_k))

SLIDE 24

Scenario Two: Cont’d

Figure: Scenario two example: each leaf class distribution of the decision graph is merged with the naïve Bayesian class prior P(C), and the conditional distributions P(A|C) are attached to the result. Class prior: [0.69, 0.23, 0.04, 0.03].

SLIDE 25

Scenario Three: General

Figure: Scenario three example: merging a decision graph and a naïve Bayesian classifier with overlapping contexts. Class: [0.72, 0.20, 0.032, 0.05]; safety (6) = high: [0.24, 0.57, 0.29, 1]; size lag (5) = med: [0.3, 0.36, 0.29, 0.5]; attributes involved: person capacity, maintenance price.

SLIDE 26

Experiments

Figure: Accuracy over the number of merged classifiers for eight data sets: (1) Audiology, (2) Monks, (3) Balance-scale, (4) Tic-tac, (5) Car, (6) Mushroom, (7) Nursery, (8) Chess. Legend: DG(1/8), NB(1/8), Merged DG.

SLIDE 27

Conclusions

◮ The merge operation ⊔ over decision functions is a general way to combine classifiers;

◮ Decision Algebra allows applying merge by implementing:

  ◮ a single core operation bind over classifiers defined as decision functions;
  ◮ a merge ⊔_D over the co-domain of the decision functions (usually represented as distributions);

◮ We showed that merging a series of probably accurate decision functions results in a more accurate decision function (experiments: 2.7%–17%).

SLIDE 28

Thank you for your attention. Questions?
