Preprocessing input data for machine learning by FCA (PowerPoint presentation by Jan Outrata)


SLIDE 1

Preprocessing input data for machine learning by FCA

Jan OUTRATA

Dept. Computer Science
Palacký University, Olomouc, Czech Republic

CLA 2010, Oct 19–21, Sevilla

Jan Outrata (Palacký University) Preprocessing input data . . . CLA 2010 1 / 24

SLIDE 2

Outline

– introduction and related work
– preliminaries on Boolean Factor Analysis (BFA) and decision trees
– preprocessing input data using BFA
– example
– experimental evaluation
– conclusions and future research

SLIDE 3

Introduction to the problem

– FCA is often used for data preprocessing for (other) DM or ML methods to improve their results
– results of DM and ML methods depend on the structure of the data = the attributes, in the case of object-attribute data
– data preprocessing . . . transformation of attributes

Our approach:
– formal concepts are used to create new attributes
– which ones? → factor concepts obtained by Boolean Factor Analysis (BFA, described by means of FCA by Belohlavek and Vychodil, 2006) → new attributes = factors

1 added to the original attributes
2 replacing the original attributes . . . reduction of the dimensionality of the data (fewer factors)

Main question: can factors better describe input data for DM/ML methods?


SLIDE 6

Related work

(focused on decision tree induction)
– constructive induction / feature construction . . . new attributes as conjunctions/disjunctions, arithmetic operations, etc. of the original attributes
– oblique decision trees . . . multiple attributes used in a splitting condition (e.g. linear combinations)
– work utilizing FCA? → construction of the whole learning model (lattice-based/concept-based learning; Mephu Nguifo et al., Kuznetsov and others)

SLIDE 7

Boolean Factor Analysis (BFA)

= decomposition of a (binary) object-attribute data matrix I into the Boolean product of an object-factor matrix A and a factor-attribute matrix B:

I_ij = (A ∘ B)_ij = ⋁_{l=1}^{k} A_il · B_lj

A_il = 1 . . . factor l applies to object i
B_lj = 1 . . . attribute j is one of the manifestations of factor l
(A ∘ B)_ij . . . "object i has attribute j if and only if there is a factor l such that l applies to i and j is one of the manifestations of l"
factors ≈ new attributes

Problem: find the number k of factors as small as possible

[Example decomposition I = A ∘ B shown as 0/1 matrices; the matrix layout was lost in extraction.]

SLIDE 8

Boolean Factor Analysis – solution using FCA

Belohlavek R., Vychodil V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. System Sci. 76(1) (2010), 3–20.

Matrices A and B can be constructed from a set F of formal concepts of the input data ⟨X, Y, I⟩, so-called factor concepts:
F = {⟨A_1, B_1⟩, . . ., ⟨A_k, B_k⟩} ⊆ B(X, Y, I)
l-th column of A_F = characteristic vector of A_l
l-th row of B_F = characteristic vector of B_l

Decomposition using formal concepts to determine factors is optimal:

Theorem

Let I = A ∘ B for n × k and k × m binary matrices A and B. Then there exists a set F ⊆ B(X, Y, I) of formal concepts of I with |F| ≤ k such that for the n × |F| and |F| × m binary matrices A_F and B_F we have I = A_F ∘ B_F.

SLIDE 9

Transformations between attribute and factor spaces

object . . . vector P in the Boolean space {0, 1}^m of original attributes, a row of I
       . . . vector Q in the Boolean space {0, 1}^k of factors, a row of A

⇒ mappings g : {0, 1}^m → {0, 1}^k and h : {0, 1}^k → {0, 1}^m:

(g(P))_l = ⋀_{j=1}^{m} (B_lj → P_j)
(h(Q))_j = ⋁_{l=1}^{k} (Q_l · B_lj)

(g(P))_l = 1 iff the l-th row of B is included in P
(h(Q))_j = 1 iff attribute j is a manifestation of at least one factor from Q
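The two mappings translate directly into code. A minimal sketch (the factor-attribute matrix B below is hypothetical, chosen only to exercise both directions):

```python
import numpy as np

def g(P, B):
    """Attribute space -> factor space: (g(P))_l = 1 iff row l of B is included in P."""
    P = np.asarray(P, dtype=bool)
    B = np.asarray(B, dtype=bool)
    # B_lj -> P_j holds for all j  <=>  every attribute of factor l is present in P
    return np.array([bool(np.all(~B[l] | P)) for l in range(B.shape[0])])

def h(Q, B):
    """Factor space -> attribute space: union of the attribute sets of the factors in Q."""
    Q = np.asarray(Q, dtype=bool)
    B = np.asarray(B, dtype=bool)
    return np.array([bool(np.any(Q & B[:, j])) for j in range(B.shape[1])])

# Hypothetical factors: factor 0 = attributes {0, 1}, factor 1 = attributes {1, 2}
B = [[1, 1, 0],
     [0, 1, 1]]
P = [1, 1, 0]      # object has attributes 0 and 1
Q = g(P, B)        # only factor 0 is fully contained in P
```

Note that h(g(P)) returns the attributes recoverable from the factors that apply, here exactly P again.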

SLIDE 10

The ML method: Decision tree induction

Decision tree . . . approximate representation of a (finite-valued) function over (finite-valued) attributes
. . . the function is described by an assignment of class labels to vectors of attribute values
– used for classification of vectors (objects) into classes

A     B    C      f(A, B, C)
good  yes  false  yes
good  no   false  no
bad   no   false  no
good  no   true   yes
bad   yes  true   yes

[Decision tree diagrams representing f; the diagrams were lost in extraction.]

non-leaf tree node . . . test on a splitting attribute; the covered collection of objects is split according to the possible outcomes of the test (= values of the splitting attribute)
leaf tree node . . . covers (a majority of) objects with the same class label

SLIDE 11

Decision tree induction problem & algorithms

Decision tree induction problem . . . to construct a decision tree that

1 approximates well the function described by (few) objects (training data)
2 classifies well "unseen" objects (testing data)

Algorithms:
– common strategy: recursively splitting tree nodes (collections of objects) based on splitting attributes
– the problem of selection of a splitting attribute ⇒ local optimization problem
– selection criteria . . . based on measures defined in terms of the class distribution of objects in nodes before and after splitting → entropy and information gain measures, Gini index, classification error, etc.
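The entropy and information gain criteria mentioned above can be sketched as follows. This is the standard ID3-style measure, shown here on attribute A of the example table from slide 10, not the authors' implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction achieved by splitting the objects on one attribute."""
    n = len(labels)
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(v, []).append(l)
    # Weighted sum of entropies of the subcollections after the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Attribute A and class f(A, B, C) from the example table
A = ["good", "good", "bad", "good", "bad"]
f = ["yes", "no", "no", "yes", "yes"]
gain_A = information_gain(A, f)
```

The attribute with the highest gain (lowest weighted sum of entropies after splitting) is selected as the splitting attribute.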

SLIDE 12

Transformation of input data

in ML: logical, categorical (nominal), ordinal, numerical, . . . attributes
in FCA: logical – binary (yes/no) or graded attributes
→ transformation . . . conceptual scaling (Ganter, Wille)
– note: we need not transform the class attribute
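For nominal attributes, conceptual scaling amounts to introducing one binary attribute per (attribute, value) pair, as on the next slide. A minimal sketch (the attribute and row names are illustrative):

```python
def nominal_scale(rows, attribute_values):
    """Nominal conceptual scaling: one binary attribute per (attribute, value) pair."""
    scaled_names = [f"{a}_{v}" for a, values in attribute_values for v in values]
    scaled_rows = []
    for row in rows:
        scaled_rows.append([1 if row[a] == v else 0
                            for a, values in attribute_values for v in values])
    return scaled_names, scaled_rows

# Illustrative fragment of the animals table from the next slide
attribute_values = [("body_temp", ["cold", "warm"]), ("gives_birth", ["no", "yes"])]
rows = [{"body_temp": "warm", "gives_birth": "yes"},   # cat
        {"body_temp": "cold", "gives_birth": "no"}]    # salamander
names, scaled = nominal_scale(rows, attribute_values)
```

Each original row yields a 0/1 row with exactly one 1 per scaled attribute, reproducing the pattern of the scaled table on slide 13.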

SLIDE 13

Example: transformation of input data

Name        body temp.  gives birth  four-legged  hibernates  mammal
cat         warm        yes          yes          no          yes
bat         warm        yes          no           yes         yes
salamander  cold        no           yes          yes         no
eagle       warm        no           no           no          no
guppy       cold        yes          no           no          no

Name        bt cold  bt warm  gb no  gb yes  fl no  fl yes  hb no  hb yes  mammal
cat         0        1        0      1       0      1       1      0       yes
bat         0        1        0      1       1      0       0      1       yes
salamander  1        0        1      0       0      1       0      1       no
eagle       0        1        1      0       1      0       1      0       no
guppy       1        0        0      1       1      0       1      0       no

mammal . . . class label

SLIDE 14

Extending the collection of attributes

Recall: new attributes (= factors) are added to the original attributes

1 decompose the input data matrix I into a matrix A describing the objects X by factors F and a matrix B explaining the factors F by attributes Y
2 new attributes Y′ = Y ∪ F
3 extended data table I′ ⊆ X × Y′: I′ ∩ (X × Y) = I and I′ ∩ (X × F) = A

Original decomposition (using FCA):
decomposition aim: the number of factors as small as possible
existing approximation algorithm (Belohlavek, Vychodil): greedy search for factor concepts which cover the largest area of still-uncovered 1s in the input data table
function of optimality of a factor concept = "cover ability"
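A much-simplified sketch of the greedy "cover ability" search: here the candidate factor concepts are only the concepts generated by single attributes, whereas the actual algorithm of Belohlavek and Vychodil grows candidate attribute sets incrementally. The rectangle of a factor concept (outer product of its extent and intent) always lies inside I, so the returned decomposition is exact:

```python
import numpy as np

def attribute_concept(j, I):
    """Formal concept generated by attribute j: extent = objects having j,
    intent = attributes shared by all of those objects."""
    extent = I[:, j]
    intent = np.all(I[extent], axis=0)
    return extent, intent

def greedy_factors(I):
    """Simplified greedy Boolean factorization: repeatedly pick the attribute
    concept whose rectangle covers the most still-uncovered 1s of I."""
    I = np.asarray(I, dtype=bool)
    uncovered = I.copy()
    factors = []
    while uncovered.any():
        rects = [np.outer(*attribute_concept(j, I)) for j in range(I.shape[1])]
        covers = [int((uncovered & r).sum()) for r in rects]
        best = int(np.argmax(covers))
        if covers[best] == 0:
            break  # cannot happen for attribute concepts; kept as a safeguard
        factors.append(attribute_concept(best, I))
        uncovered &= ~rects[best]
    if not factors:  # I contains no 1s at all
        return np.zeros((I.shape[0], 0), bool), np.zeros((0, I.shape[1]), bool)
    A = np.column_stack([extent for extent, _ in factors])
    B = np.vstack([intent for _, intent in factors])
    return A, B

# A small matrix: two factor concepts suffice to cover all 1s
A, B = greedy_factors([[1, 1, 0],
                       [1, 1, 1],
                       [0, 1, 1]])
```

The greedy choice does not guarantee the minimum number of factors (the exact problem is NP-hard, as noted on slide 26), only a good approximation.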


SLIDE 16

Function of optimality of factor concept

Decomposition for decision tree induction:
factors = new attributes → good "decision ability", i.e. good to be splitting attributes
new function of optimality of a factor concept:

c(A, B) = w · c_A(A, B) + (1 − w) · c_B(A, B)

c_A(A, B) ∈ [0, 1] . . . original function of "cover ability"
c_B(A, B) ∈ [0, 1] . . . function of "decision ability", measures the goodness of a factor as a splitting attribute

SLIDE 17

Optimality function of “decision ability”

Recall: selection of splitting attributes is based on entropy measures . . . an attribute is the better splitting attribute the lower the weighted sum of entropies of the subcollections of objects after splitting the objects based on the attribute

→ c_B(A, B) = 1 − ( |A|/|X| · E(class|A) / (−log2 1/|V(class|A)|) + |X\A|/|X| · E(class|X\A) / (−log2 1/|V(class|X\A)|) )

(each entropy is normalized by its maximal value −log2 1/|V| = log2 |V|)

V(class|A) . . . class labels assigned to the objects A
E(class|A) . . . usual entropy of the objects A based on the class, i.e.
E(class|A) = − Σ_{l ∈ V(class|A)} p(l|A) · log2 p(l|A)
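Under this reading of the formula (the fraction bars were lost in extraction; the entropies are taken as normalized by their maximum log2 |V|), c_B can be sketched as follows, splitting X into the extent A of the factor concept and its complement. The animal/class data is a toy example:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def normalized_entropy(labels):
    """Entropy divided by its maximum log2 |V| (0 when only one class occurs)."""
    k = len(set(labels))
    return entropy(labels) / log2(k) if k > 1 else 0.0

def decision_ability(extent, class_of):
    """c_B of a factor concept with the given extent: 1 minus the weighted
    normalized entropies of the extent A and its complement X \\ A
    (a sketch of the reconstructed formula, not the authors' code)."""
    X = list(class_of)
    A = [x for x in X if x in extent]
    rest = [x for x in X if x not in extent]
    cb = 1.0
    for part in (A, rest):
        if part:
            cb -= len(part) / len(X) * normalized_entropy([class_of[x] for x in part])
    return cb

# Toy example: a factor applying to {cat, bat}, which are exactly the mammals
class_of = {"cat": "yes", "bat": "yes", "salamander": "no", "eagle": "no", "guppy": "no"}
cb = decision_ability({"cat", "bat"}, class_of)
```

A factor whose extent separates the classes perfectly, as here, gets the maximal value c_B = 1.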

SLIDE 18

Example: extending the collection of attributes

     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1      =      1 1 1 1 1 1 1      ◦        1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1       

Name bc bw gn gy fn fy hn hy f1 f2 f3 f4 f5 f6 mammal cat 1 1 1 1 1 1 yes bat 1 1 1 1 1 1 yes salamander 1 1 1 1 1 no eagle 1 1 1 1 1 no guppy 1 1 1 1 1 no

SLIDE 19

Example: decision tree from extended data

decision tree is induced from the extended data table; class labels remain unchanged

[Tree diagrams: a decision tree over the original attributes, splitting on body temp. and gives birth, and a decision tree over the extended data consisting of a single split on factor f3; the diagrams were lost in extraction.]

factor f3 . . . better splitting attribute than original attributes bt warm and gb yes (w.r.t. generalization of the decision tree)

SLIDE 20

Classifying object x in extended data

. . . described as a vector P_x ∈ {0, 1}^m in the (original) attribute space

1 compute the description as the vector g(P_x) ∈ {0, 1}^k in the factor space (using the factor-attribute matrix)
2 classify the concatenation of P_x and g(P_x) in the usual way

SLIDE 21

Reducing the collection of attributes

Recall: new attributes (= factors) replace the original attributes

1 decompose the input data matrix I into a matrix A describing the objects X by factors F and a matrix B explaining the factors F by attributes Y
2 new attributes Y′ = F
3 new (reduced) data table I′ ⊆ X × Y′: I′ = A

decision tree is induced from the new data table

[Reduced data table over the factor attributes f1–f6 (cat and bat are described by two factors each, salamander, eagle and guppy by one) and the induced decision tree: a single split on factor f3; the table and diagram layout was lost in extraction.]

SLIDE 22

Reducing the collection of attributes

There are (usually) fewer factors than (original) attributes = reduction of the dimensionality of the data

Problem: the transformation of objects from the attribute space to the factor space is not an injective mapping, i.e. for x1, x2 ∈ X with P_x1 ≠ P_x2 and class(x1) ≠ class(x2) it may happen that g(P_x1) = g(P_x2)
→ how to assign class labels to objects described by factors?

Present solution: assign to object x in the new data table the majority class label of the objects [x]_ker(g) ∈ X/ker(g) in the original data table

SLIDE 25

Classifying object x in reduced data

. . . described as a vector P_x ∈ {0, 1}^m in the (original) attribute space

1 compute the description as the vector g(P_x) ∈ {0, 1}^k in the factor space (using the factor-attribute matrix)
2 classify g(P_x) in the usual way

SLIDE 26

Experimental evaluation

Time complexity . . . determined by the matrix decomposition step using BFA . . . NP-hard problem → approximation algorithms

Selected datasets from the UCI ML Repository:

Dataset        No. of attributes (binary)  No. of objects  Class distribution
breast-cancer  9 (51)                      277             196/81
kr-vs-kp       36 (74)                     3196            1669/1527
mushroom       21 (125)                    5644            3488/2156
tic-tac-toe    9 (27)                      958             626/332
vote           16 (32)                     232             124/108
zoo            15 (30)                     101             41/20/5/13/4/8/10

(The datasets were cleared of objects containing missing values.)



SLIDE 29

Experimental evaluation

. . . comparing the performance of the created machine learning models (e.g. decision trees) induced from original and from preprocessed input data
. . . 10-fold stratified cross-validation test

Reducing original attributes to factors
Optimality function = "cover ability":

             breast-cancer  kr-vs-kp     mushroom   tic-tac-toe  vote         zoo          avg +
ID3   train  98.0 → 99.9    100 → 100    100 → 100  100 → 100    100 → 100    98.2 → 100    0.6
      test   58.9 → 68.3    99.6 → 99.0  100 → 100  84.1 → 94.4  94.4 → 93.7  92.0 → 88.5   3.8
C4.5  train  89.0 → 91.8    99.8 → 99.7  100 → 100  95.8 → 98.4  98.3 → 98.1  97.3 → 97.8   1.0
      test   66.6 → 65.6    99.4 → 98.9  100 → 100  85.7 → 93.6  94.8 → 94.3  93.4 → 87.8   0.2
IB1   train  98.0 → 100     100 → 100    100 → 100  100 → 100    100 → 100    98.1 → 100    0.7
      test   70.2 → 68.1    90.3 → 91.8  100 → 100  79.2 → 79.2  91.6 → 92.1  93.3 → 90.0  −0.7

(accuracy for original data → accuracy for preprocessed data; train/test values in %)

Note: without the zoo dataset, the ID3 average on testing data = 5.4 %

SLIDE 30

Experimental evaluation

. . . comparing the performance of the created machine learning models (e.g. decision trees) induced from original and from preprocessed input data
. . . 10-fold stratified cross-validation test

Reducing original attributes to factors
Optimality function = "decision ability":

             breast-cancer  kr-vs-kp     mushroom   tic-tac-toe  vote         zoo          avg +
ID3   train  98.0 → 100     100 → 100    100 → 100  100 → 100    100 → 100    98.2 → 100    0.6
      test   58.9 → 67.9    99.6 → 99.6  100 → 100  84.1 → 97.2  94.4 → 95.9  92.0 → 90.2   5.1
C4.5  train  89.0 → 93.2    99.8 → 99.8  100 → 100  95.8 → 98.9  98.3 → 98.3  97.3 → 97.9   1.4
      test   66.6 → 68.9    99.4 → 99.3  100 → 100  85.7 → 97.6  94.8 → 95.5  93.4 → 89.5   2.3
IB1   train  98.0 → 100     100 → 100    100 → 100  100 → 100    100 → 100    98.1 → 100    0.7
      test   70.2 → 66.8    90.3 → 97.8  100 → 100  79.2 → 96.0  91.6 → 94.7  93.3 → 90.2   4.1

(accuracy for original data → accuracy for preprocessed data; train/test values in %)

Note: without the zoo dataset, the ID3 average on testing data = 6.5 %

SLIDE 31

Experimental evaluation

Adding factors to original attributes – very similar results (±1 % difference)

SLIDE 32

Conclusions and future research

Presented:
– two methods of preprocessing input data for ML based on FCA:

1 attributes are extended by new attributes
2 attributes are replaced by new attributes

– new attributes = factors obtained by Boolean Factor Analysis described by FCA – (usually) fewer than the original attributes
– demonstrated on decision tree induction: DTs induced from preprocessed data outperform DTs induced from original data (for ID3, C4.5) → usage of BFA in feature construction

Future research:
– the problem of mapping distinct objects in the original object-attribute data to the same object in the object-factor data
– incomplete data, i.e. data with missing values
– more thorough experimental evaluation – e.g. description by F-measure, comparison with other feature construction/selection methods
