Preprocessing input data for machine learning by FCA
Jan OUTRATA
Dept. Computer Science
Palacký University, Olomouc, Czech Republic
CLA 2010, Oct 19–21, Sevilla
Jan Outrata (Palacký University) Preprocessing input data . . . CLA 2010 1 / 24
– introduction and related work
– preliminaries on Boolean Factor Analysis (BFA) and decision trees
– preprocessing input data using BFA
– example
– experimental evaluation
– conclusions and future research
– FCA is often used for data preprocessing for (other) DM or ML methods to improve their results
– results of DM and ML methods depend on the structure of the data = on the attributes in the case of object-attribute data
– data preprocessing . . . transformation of attributes

Our approach:
– formal concepts are used to create new attributes
– which ones? → factor concepts obtained by Boolean Factor Analysis (BFA, described by FCA by Belohlavek, Vychodil, 2006)
– new attributes = factors, either
  1. added to the original attributes, or
  2. replacing the original attributes . . . reduction of the dimensionality of the data (fewer factors)

Main question: can factors describe the input data better for DM/ML methods?
Related work (focused on decision tree induction):
– constructive induction / feature construction . . . new attributes as conjunctions/disjunctions, arithmetic operations, etc. of original attributes
– oblique decision trees . . . multiple attributes used in a splitting condition (e.g. linear combinations)
– work utilizing FCA? → construction of the whole learning model (lattice-based/concept-based learning, Mephu Nguifo et al., Kuznetsov and others)
Boolean Factor Analysis (BFA) = decomposition of a (binary) object-attribute data matrix I into the Boolean product of an object-factor matrix A and a factor-attribute matrix B:

I_ij = (A ∘ B)_ij = ⋁_{l=1}^{k} A_il · B_lj

A_il = 1 . . . factor l applies to object i
B_lj = 1 . . . attribute j is one of the manifestations of factor l
(A ∘ B)_ij . . . "object i has attribute j if and only if there is a factor l such that l applies to i and j is one of the manifestations of l"
factors ≈ new attributes

Problem: find a number k of factors as small as possible

[Example: a Boolean data matrix decomposed into the product of an object-factor matrix and a factor-attribute matrix]
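The Boolean product can be sketched in a few lines of Python. The matrices below are a made-up illustration (not the slide's example): a 4 × 5 data matrix expressed by k = 2 factors.

```python
def boolean_product(A, B):
    """Boolean matrix product: (A o B)_ij = OR over l of (A_il AND B_lj)."""
    k = len(B)       # number of factors
    m = len(B[0])    # number of attributes
    return [[int(any(row[l] and B[l][j] for l in range(k)))
             for j in range(m)] for row in A]

# Hypothetical object-factor and factor-attribute matrices:
A = [[1, 0],
     [1, 1],
     [0, 1],
     [0, 1]]
B = [[1, 1, 0, 0, 1],
     [0, 0, 1, 1, 1]]
I = boolean_product(A, B)   # reconstructs the 4x5 data matrix
```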
Belohlavek R., Vychodil V.: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. System Sci. 76(1) (2010), 3–20.

Matrices A and B can be constructed from a set F of formal concepts of the input data ⟨X, Y, I⟩, so-called factor concepts:

F = {⟨A_1, B_1⟩, . . . , ⟨A_k, B_k⟩} ⊆ B(X, Y, I)
l-th column of A_F = characteristic vector of A_l
l-th row of B_F = characteristic vector of B_l

Decomposition using formal concepts to determine factors is optimal:

Theorem
Let I = A ∘ B for n × k and k × m binary matrices A and B. Then there exists a set F ⊆ B(X, Y, I) of formal concepts of I with |F| ≤ k such that for the n × |F| and |F| × m binary matrices A_F and B_F we have I = A_F ∘ B_F.
. . . vector in the Boolean space {0, 1}^k of factors, row of A

Mappings g: {0, 1}^m → {0, 1}^k and h: {0, 1}^k → {0, 1}^m:

(g(P))_l = ⋀_{j=1}^{m} (B_lj → P_j)
(h(Q))_j = ⋁_{l=1}^{k} (Q_l · B_lj)

(g(P))_l = 1 iff the l-th row of B is included in P
(h(Q))_j = 1 iff attribute j is a manifestation of at least one factor from Q
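The two mappings are direct to implement. A sketch (B is the factor-attribute matrix; the example matrix is hypothetical, reused for illustration only):

```python
def g(P, B):
    """Attribute space -> factor space:
    (g(P))_l = 1 iff the l-th row of B is included in P."""
    return [int(all(P[j] >= B_l[j] for j in range(len(P)))) for B_l in B]

def h(Q, B):
    """Factor space -> attribute space:
    (h(Q))_j = 1 iff attribute j is a manifestation of some factor in Q."""
    m = len(B[0])
    return [int(any(q and B_l[j] for q, B_l in zip(Q, B))) for j in range(m)]

# Hypothetical factor-attribute matrix: k = 2 factors, m = 5 attributes
B = [[1, 1, 0, 0, 1],
     [0, 0, 1, 1, 1]]
P = [1, 1, 0, 0, 1]   # object in the attribute space
Q = g(P, B)           # its description in the factor space
```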
Decision tree . . . approximate representation of a (finite-valued) function
. . . the function is described by assignment of class labels to vectors of attribute values – used for classification of vectors (objects) into classes
A     B    C      f(A, B, C)
good  yes  false  yes
good  no   false  no
bad   no   false  no
good  no   true   yes
bad   yes  true   yes

[Decision tree diagram: root test on B (yes → class yes), then a test on C (true → yes, false → no)]
non-leaf tree node . . . test on a splitting attribute . . . covered collection of objects is split under the possible outcomes of the test (= values of the splitting attribute) leaf tree node . . . covers (majority of) objects with the same class label
Decision tree induction problem . . . to construct a decision tree that
1. approximates well the function described by (few) objects (training data)
2. classifies well "unseen" objects (testing data)

Algorithms:
– common strategy: recursively splitting tree nodes (collections of objects)
– the problem of selection of a splitting attribute ⇒ local optimization
– selection criteria . . . based on measures defined in terms of the class distribution of objects in nodes before and after splitting → entropy and information gain measures, Gini index, classification error etc.
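The entropy-based selection criterion can be sketched as follows, using the A/B/C example table from the earlier slide (lower weighted entropy after splitting means a better splitting attribute):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Usual class entropy of a collection of objects."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels):
    """Weighted sum of entropies of the subcollections obtained by
    splitting on an attribute; the lower, the better the attribute."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        sub = [l for x, l in zip(values, labels) if x == v]
        total += len(sub) / n * entropy(sub)
    return total

# Attribute B and class f(A, B, C) from the example table:
B_vals = ["yes", "no", "no", "no", "yes"]
f_vals = ["yes", "no", "no", "yes", "yes"]
print(split_entropy(B_vals, f_vals))
```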
Attributes in ML: logical, categorical (nominal), ordinal, numerical, . . .
Attributes in FCA: logical – binary (yes/no) or graded
→ transformation . . . conceptual scaling (Ganter, Wille)
Note: we need not transform the class attribute.
Name        body temp.  gives birth  four-legged  hibernates  mammal
cat         warm        yes          yes          no          yes
bat         warm        yes          no           yes         yes
salamander  cold        no           yes          yes         no
eagle       warm        no           no           no          no
guppy       cold        yes          no           no          no

After conceptual scaling:

Name        bt cold  bt warm  gb no  gb yes  fl no  fl yes  hb no  hb yes  mammal
cat         0        1        0      1       0      1       1      0       yes
bat         0        1        0      1       1      0       0      1       yes
salamander  1        0        1      0       0      1       0      1       no
eagle       0        1        1      0       1      0       1      0       no
guppy       1        0        0      1       1      0       1      0       no

mammal . . . class label
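Nominal conceptual scaling of such a table can be sketched as below. The short attribute keys (bt, gb, fl, hb) follow the example; the helper name is hypothetical.

```python
def nominal_scale(rows, attributes):
    """Nominal scaling: each (attribute, value) pair becomes one binary
    attribute; an object gets 1 iff it has that value."""
    scaled = [(a, v) for a in attributes
              for v in sorted({row[a] for row in rows})]
    table = [[int(row[a] == v) for a, v in scaled] for row in rows]
    return scaled, table

animals = [
    {"bt": "warm", "gb": "yes", "fl": "yes", "hb": "no"},   # cat
    {"bt": "warm", "gb": "yes", "fl": "no",  "hb": "yes"},  # bat
    {"bt": "cold", "gb": "no",  "fl": "yes", "hb": "yes"},  # salamander
    {"bt": "warm", "gb": "no",  "fl": "no",  "hb": "no"},   # eagle
    {"bt": "cold", "gb": "yes", "fl": "no",  "hb": "no"},   # guppy
]
scaled_attrs, binary_table = nominal_scale(animals, ["bt", "gb", "fl", "hb"])
```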
Recall: new attributes (= factors) are added to the original attributes
1. decompose the input data matrix I into a matrix A describing objects X by factors F and a matrix B explaining factors F by attributes Y
2. new attributes Y′ = Y ∪ F
3. extended data table I′ ⊆ X × Y′: I′ ∩ (X × Y) = I and I′ ∩ (X × F) = A

Original decomposition (using FCA):
– decomposition aim: the number of factors as small as possible
– existing approximation algorithm (Belohlavek, Vychodil): greedy search for factor concepts which cover the largest area of still-uncovered 1s in the input data table
– function of optimality of a factor concept = "cover ability"
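A much-simplified sketch of such a greedy cover: candidate factor concepts here are only the closures of single attributes, whereas the actual Belohlavek–Vychodil algorithm grows intents attribute by attribute. The sketch still shows the core idea of repeatedly picking the concept covering the most uncovered 1s.

```python
def greedy_factors(I):
    """Greedily cover the 1s of binary matrix I by factor concepts,
    at each step taking the concept covering the most uncovered 1s.
    Simplified: candidates are closures of single attributes only."""
    n, m = len(I), len(I[0])
    uncovered = {(i, j) for i in range(n) for j in range(m) if I[i][j]}
    factors = []
    while uncovered:
        best, best_gain = None, 0
        for j in range(m):
            extent = [i for i in range(n) if I[i][j]]      # objects having j
            intent = [a for a in range(m)
                      if all(I[i][a] for i in extent)]     # their shared attrs
            gain = len({(i, a) for i in extent for a in intent} & uncovered)
            if gain > best_gain:
                best, best_gain = (extent, intent), gain
        extent, intent = best
        factors.append(best)
        uncovered -= {(i, a) for i in extent for a in intent}
    return factors

I = [[1, 1, 0],
     [1, 1, 1],
     [0, 0, 1]]
F = greedy_factors(I)   # two factor concepts suffice for this matrix
```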
Decomposition for decision tree induction:
– factors = new attributes → good "decision ability", i.e. good to be splitting attributes
– new function of optimality of a factor concept:

c(A, B) = w · c_A(A, B) + (1 − w) · c_B(A, B)

c_A(A, B) ∈ [0, 1] . . . original function of "cover ability"
c_B(A, B) ∈ [0, 1] . . . function of "decision ability", measures the goodness of the factor as a splitting attribute
Recall: the selection of splitting attributes is based on entropy measures . . . an attribute is the better splitting attribute the lower the weighted sum of entropies of the subcollections of objects after splitting the objects based on the attribute:

→ c_B(A, B) = 1 − ( |A|/|X| · E(class|A) / (−log2 1/|V(class|A)|) + |X\A|/|X| · E(class|X\A) / (−log2 1/|V(class|X\A)|) )

E(class|A) . . . usual entropy of the objects A based on the class, i.e.

E(class|A) = −∑_{l ∈ V(class|A)} p(l|A) · log2 p(l|A)
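A sketch of such a "decision ability" measure in Python: each entropy is normalized by its maximum −log2(1/|V|) = log2 of the number of class values, taken as 0 for a pure subcollection. The exact normalization conventions of the paper may differ; the function name is illustrative.

```python
from collections import Counter
from math import log2

def norm_entropy(labels):
    """Class entropy normalized by its maximum log2(#class values)."""
    n, counts = len(labels), Counter(labels)
    if len(counts) <= 1:
        return 0.0
    e = -sum(c / n * log2(c / n) for c in counts.values())
    return e / log2(len(counts))

def decision_ability(extent, classes):
    """c_B of a factor concept: 1 minus the weighted normalized entropies
    of the objects covered / not covered by the factor's extent."""
    n = len(classes)
    inside = [classes[i] for i in extent]
    outside = [classes[i] for i in range(n) if i not in extent]
    return 1.0 - (len(inside) / n * norm_entropy(inside)
                  + len(outside) / n * norm_entropy(outside))

# A factor covering exactly the two mammals of the running example
# splits the classes perfectly:
labels = ["yes", "yes", "no", "no", "no"]
print(decision_ability({0, 1}, labels))
```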
[Example: the scaled animal data matrix decomposed into the Boolean product of a 5 × 6 object-factor matrix and a 6 × 8 factor-attribute matrix]

[Extended data table: the original scaled attributes (bt cold, bt warm, gb no, gb yes, fl no, fl yes, hb no, hb yes) followed by the factors f1–f6; class label mammal unchanged: cat, bat . . . yes; salamander, eagle, guppy . . . no]
The decision tree is induced from the extended data table; class labels remain unchanged.

[Decision tree from the original attributes: tests on body temp. and gives birth]
[Decision tree from the extended table: a single test on factor f3]

Factor f3 . . . a better splitting attribute than the original attributes bt warm and gb yes (w.r.t. generalization of the decision tree)
A new object . . . described as a vector P_x ∈ {0, 1}^m in the (original) attribute space
1. compute its description as the vector g(P_x) ∈ {0, 1}^k in the factor space (using the factor-attribute matrix)
2. classify the concatenation of P_x and g(P_x) in the usual way
Recall: new attributes (= factors) replace the original attributes
1. decompose the input data matrix I into a matrix A describing objects X by factors F and a matrix B explaining factors F by attributes Y
2. new attributes Y′ = F
3. new (reduced) data table I′ ⊆ X × Y′: I′ = A

The decision tree is induced from the new data table.

[Reduced data table: objects described by the factors f1–f6 only; class label mammal unchanged]
[Decision tree from the reduced table: a single test on factor f3]
There are (usually) fewer factors than (original) attributes = reduction of the dimensionality of the data.

Problem: the transformation of objects from the attribute space to the factor space is not an injective mapping, i.e. for x1, x2 ∈ X with P_x1 ≠ P_x2 and class(x1) ≠ class(x2), it may happen that g(P_x1) = g(P_x2) – how then to assign class labels to objects described by factors?

Present solution: assign to object x in the new data table the majority class label of the objects that share its factor-space description g(P_x).
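The majority-label assignment can be sketched as below (a hypothetical helper; factor_rows are the objects' descriptions g(P_x) in the factor space):

```python
from collections import Counter, defaultdict

def majority_relabel(factor_rows, classes):
    """Group objects by their factor-space description and assign each
    group the majority class label of its members."""
    groups = defaultdict(list)
    for row, cls in zip(factor_rows, classes):
        groups[tuple(row)].append(cls)
    return {desc: Counter(labels).most_common(1)[0][0]
            for desc, labels in groups.items()}

# Three distinct objects collide on the factor description (1, 0):
rows = [(1, 0), (1, 0), (1, 0), (0, 1)]
labels = ["yes", "yes", "no", "no"]
print(majority_relabel(rows, labels))
```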
A new object . . . described as a vector P_x ∈ {0, 1}^m in the (original) attribute space
1. compute its description as the vector g(P_x) ∈ {0, 1}^k in the factor space (using the factor-attribute matrix)
2. classify g(P_x) in the usual way
Time complexity . . . determined by the matrix decomposition step using BFA . . . an NP-hard problem → approximation algorithms

Selected datasets from the UCI ML Repository:

Dataset        Attributes (scaled)  Objects  Class distribution
breast-cancer  9 (51)               277      196/81
kr-vs-kp       36 (74)              3196     1669/1527
mushroom       21 (125)             5644     3488/2156
tic-tac-toe    9 (27)               958      626/332
vote           16 (32)              232      124/108
zoo            15 (30)              101      41/20/5/13/4/8/10

(The datasets were cleared of objects containing missing values.)
. . . comparing the performance of the created machine learning models (e.g. decision trees) induced from the original and from the preprocessed input data . . . 10-fold stratified cross-validation test

Reducing original attributes to factors. Optimality function = "cover ability"
(accuracy for original data → accuracy for preprocessed data):

              breast-cancer  kr-vs-kp     mushroom   tic-tac-toe  vote         zoo          avg +
ID3   train %  98.0 → 99.9   100 → 100    100 → 100  100 → 100    100 → 100    98.2 → 100   0.6
      test %   58.9 → 68.3   99.6 → 99.0  100 → 100  84.1 → 94.4  94.4 → 93.7  92.0 → 88.5  3.8
C4.5  train %  89.0 → 91.8   99.8 → 99.7  100 → 100  95.8 → 98.4  98.3 → 98.1  97.3 → 97.8  1.0
      test %   66.6 → 65.6   99.4 → 98.9  100 → 100  85.7 → 93.6  94.8 → 94.3  93.4 → 87.8  0.2
IB1   train %  98.0 → 100    100 → 100    100 → 100  100 → 100    100 → 100    98.1 → 100   0.7
      test %   70.2 → 68.1   90.3 → 91.8  100 → 100  79.2 → 79.2  91.6 → 92.1  93.3 → 90.0  −0.7

Note: without the zoo dataset, the ID3 average on testing data = 5.4 %
Reducing original attributes to factors. Optimality function = "decision ability"
(accuracy for original data → accuracy for preprocessed data):

              breast-cancer  kr-vs-kp     mushroom   tic-tac-toe  vote         zoo          avg +
ID3   train %  98.0 → 100    100 → 100    100 → 100  100 → 100    100 → 100    98.2 → 100   0.6
      test %   58.9 → 67.9   99.6 → 99.6  100 → 100  84.1 → 97.2  94.4 → 95.9  92.0 → 90.2  5.1
C4.5  train %  89.0 → 93.2   99.8 → 99.8  100 → 100  95.8 → 98.9  98.3 → 98.3  97.3 → 97.9  1.4
      test %   66.6 → 68.9   99.4 → 99.3  100 → 100  85.7 → 97.6  94.8 → 95.5  93.4 → 89.5  2.3
IB1   train %  98.0 → 100    100 → 100    100 → 100  100 → 100    100 → 100    98.1 → 100   0.7
      test %   70.2 → 66.8   90.3 → 97.8  100 → 100  79.2 → 96.0  91.6 → 94.7  93.3 → 90.2  4.1

Note: without the zoo dataset, the ID3 average on testing data = 6.5 %
Adding factors to original attributes – very similar results (±1 % difference)
Presented:
– two methods of preprocessing input data for ML based on FCA:
  1. attributes are extended by new attributes
  2. attributes are replaced by new attributes
– new attributes = factors obtained by Boolean Factor Analysis described by FCA – (usually) fewer than the original attributes
– demonstrated on decision tree induction: DTs induced from preprocessed data outperform DTs induced from original data (for ID3, C4.5) → usage of BFA in feature construction

Future research:
– the problem of mapping distinct objects in the original object-attribute data to the same object in the object-factor data
– incomplete data, i.e. data with missing values
– more thorough experimental evaluation – e.g. description by F-measure, comparison with other feature construction/selection methods