Multi-label Classification
Charmgil Hong
cs3750
(Presented on Nov 11, 2014)
Goals of the talk:
1. To understand the geometry of different approaches for multi-label classification
2. To appreciate how the machine learning techniques further …
Multi-label classification: each data instance is associated with multiple class variables (e.g., a news article may simultaneously belong to several topics, such as economics).
[Figure: a toy example with two class variables, Class 1 ∈ { R, B } and Class 2 ∈ { two shapes }]
The task can be decomposed into a set of single-label classification problems, one per class variable.
Dtrain
 n     X1    X2    Y1  Y2  Y3
 n=1   0.7   0.4   1   0   1
 n=2   0.6   0.2   1   1   0
 n=3   0.1   0.9   0   1   0
 n=4   0.3   0.1   0   0   0
 n=5   0.8   0.9   1   0   1
h1 : X → Y1,  h2 : X → Y2,  h3 : X → Y3
Binary relevance (BR): learn an independent classifier for each class variable. Drawback: BR ignores the dependencies among the class variables and predicts each one in isolation.
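A minimal BR sketch (scikit-learn is assumed as the base learner; the slides do not prescribe one, and the toy data mirror the table above with illustrative label placements):

```python
# Minimal binary-relevance sketch: one independent logistic-regression
# classifier per class variable.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])

# One binary task per label; the other labels are ignored entirely.
classifiers = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]

x_new = np.array([[0.5, 0.5]])
y_pred = [clf.predict(x_new)[0] for clf in classifiers]
print(y_pred)  # labels predicted independently of one another
```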
Why dependencies matter: consider predicting the mode (maximum a posteriori) of Y = (Y1, Y2).
P(Y1, Y2 | X = x):

              Y1 = 0   Y1 = 1   P(Y2|X=x)
 Y2 = 0        0.20     0.45      0.65
 Y2 = 1        0.35     0.00      0.35
 P(Y1|X=x)     0.55     0.45

➡ Prediction on the joint (MAP): Y1 = 1, Y2 = 0
➡ Prediction on the marginals: Y1 = 0, Y2 = 0
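The two predictions can be read off mechanically; a small numpy sketch of the table above:

```python
# Joint vs. marginal prediction from the slide's table.
import numpy as np

P = np.array([[0.20, 0.45],    # Y2 = 0 row: P(Y1=0,Y2=0), P(Y1=1,Y2=0)
              [0.35, 0.00]])   # Y2 = 1 row: P(Y1=0,Y2=1), P(Y1=1,Y2=1)

# Joint (MAP) prediction: the single most probable cell.
y2, y1 = np.unravel_index(P.argmax(), P.shape)
print("MAP:      Y1 =", y1, "Y2 =", y2)                        # Y1 = 1, Y2 = 0

# Marginal prediction: maximize each label's marginal separately.
p_y1 = P.sum(axis=0)   # [0.55, 0.45]
p_y2 = P.sum(axis=1)   # [0.65, 0.35]
print("Marginal: Y1 =", p_y1.argmax(), "Y2 =", p_y2.argmax())  # Y1 = 0, Y2 = 0
```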
Dtrain                                Dtrain-LP
 n     X1    X2    Y1  Y2  Y3         YLP
 n=1   0.7   0.4   1   0   0          1
 n=2   0.6   0.2   1   0   0          1
 n=3   0.1   0.9   0   1   0          2
 n=4   0.3   0.1   1   1   0          3
 n=5   0.8   0.9   1   0   1          4
hLP : X → YLP
Label powerset (LP) [Tsoumakas & Vlahavas, 2007]: treat each distinct label combination as a single class and solve one multi-class problem.
The number of LP classes can grow exponentially in the number of labels (|YLP| = O(2^d)), although in practice it is bounded by the number of distinct label combinations in the training set.
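A minimal LP sketch (scikit-learn assumed as the base learner; the toy data mirror the table above):

```python
# Label-powerset sketch: map each observed label vector to one power-set
# class and train a single multi-class model.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1]])

combos = {}                                   # label vector -> class id
y_lp = np.array([combos.setdefault(tuple(row), len(combos) + 1) for row in Y])
print(y_lp)                                   # [1 1 2 3 4], as in the table

h_lp = LogisticRegression().fit(X, y_lp)      # one multi-class problem
pred = h_lp.predict([[0.5, 0.5]])[0]
inverse = {v: k for k, v in combos.items()}
print(inverse[pred])                          # decode back to a label vector
```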
BR ignores the relationships among the class variables, while LP directly models the joint of all class variables.
LP gains expressiveness by modeling the full joint of the class variables, but it can be computationally very expensive.
[Spectrum of approaches: from independent classifiers (BR) to all possible label combinations (LP)]
Idea: enrich the input with label-dependent features, i.e., model P(Yi | x, {new features}).
Classification with heterogeneous features (CHF) [Godbole & Sarawagi, 2004]

Dtrain
 n     X1    X2    Y1  Y2  Y3
 n=1   0.7   0.4   1   0   1
 n=2   0.6   0.2   1   1   0
 n=3   0.1   0.9   0   1   0
 n=4   0.3   0.1   0   0   0
 n=5   0.8   0.9   1   0   1

Layer-1: train BR classifiers hbr1 : X → Y1, hbr2 : X → Y2, hbr3 : X → Y3, and append their outputs (shown as .xx) to the features:

XCHF
 n     X1    X2    hbr1(X)  hbr2(X)  hbr3(X)  Y1  Y2  Y3
 n=1   0.7   0.4   .xx      .xx      .xx      1   0   1
 n=2   0.6   0.2   .xx      .xx      .xx      1   1   0
 n=3   0.1   0.9   .xx      .xx      .xx      0   1   0
 n=4   0.3   0.1   .xx      .xx      .xx      0   0   0
 n=5   0.8   0.9   .xx      .xx      .xx      1   0   1

Layer-2: train h1 : XCHF → Y1, h2 : XCHF → Y2, h3 : XCHF → Y3.
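A hedged two-layer sketch in the spirit of CHF (scikit-learn assumed; a careful implementation would produce the layer-1 scores out-of-fold to avoid leaking the training labels):

```python
# Two-layer stacking: layer-1 BR classifiers produce per-label scores
# that are appended to the features for layer 2.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

layer1 = [LogisticRegression().fit(X, Y[:, j]) for j in range(d)]
scores = np.column_stack([c.predict_proba(X)[:, 1] for c in layer1])  # ".xx"
X_chf = np.hstack([X, scores])                 # enriched feature space

layer2 = [LogisticRegression().fit(X_chf, Y[:, j]) for j in range(d)]

x = np.array([[0.5, 0.5]])
s = np.column_stack([c.predict_proba(x)[:, 1] for c in layer1])
print([c.predict(np.hstack([x, s]))[0] for c in layer2])
```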
Instance-based learning by logistic regression (IBLR) [Cheng & Hüllermeier, 2009]

Dtrain
 n     X1    X2    Y1  Y2  Y3
 n=1   0.7   0.4   1   0   1
 n=2   0.6   0.2   1   1   0
 n=3   0.1   0.9   0   1   0
 n=4   0.3   0.1   0   0   0
 n=5   0.8   0.9   1   0   1

Append k-NN scores λ1, λ2, λ3 to the features, where λj is the fraction of the k nearest neighbors that carry label Yj (e.g., with k = 3, λ = (2/3, 1/3, 1/3)):

XIBLR
 n     X1    X2    λ1   λ2   λ3   Y1  Y2  Y3
 n=1   0.7   0.4   .xx  .xx  .xx  1   0   1
 n=2   0.6   0.2   .xx  .xx  .xx  1   1   0
 n=3   0.1   0.9   .xx  .xx  .xx  0   1   0
 n=4   0.3   0.1   .xx  .xx  .xx  0   0   0
 n=5   0.8   0.9   .xx  .xx  .xx  1   0   1

Then train h1 : XIBLR → Y1, h2 : XIBLR → Y2, h3 : XIBLR → Y3. At prediction time the augmented features are computed using the layer-1 components (the BR classifiers for CHF, the k-NN scores for IBLR).
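A hedged IBLR-flavoured sketch (scikit-learn assumed; IBLR proper uses leave-one-out neighbor scores and learned coefficients, so this shows only the general idea):

```python
# IBLR-style sketch: lambda_j = fraction of the k nearest neighbors
# carrying label j, appended to the features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
k = 3

nn = NearestNeighbors(n_neighbors=k).fit(X)

def knn_scores(Q):
    # lambda_j = mean of label j over the k nearest training neighbors
    _, idx = nn.kneighbors(Q)
    return Y[idx].mean(axis=1)

X_iblr = np.hstack([X, knn_scores(X)])
models = [LogisticRegression().fit(X_iblr, Y[:, j]) for j in range(Y.shape[1])]

x = np.array([[0.5, 0.5]])
x_aug = np.hstack([x, knn_scores(x)])
print([m.predict(x_aug)[0] for m in models])
```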
Recall LP [Tsoumakas & Vlahavas, 2007]: it treats every observed label combination as a possible class assignment, which can be very expensive. Idea (pruned problem transformation, PPT [Read, 2008]): prune the infrequent combinations to reduce the size of the class assignment space.
Dtrain                                 Dtrain-LP
 n      X1    X2    Y1  Y2  Y3         YLP
 n=1    0.7   0.4   1   0   0          1
 n=2    0.6   0.2   1   0   0          1
 n=3    0.1   0.9   0   0   0          0
 n=4    0.3   0.1   0   1   0          2
 n=5    0.1   0.8   0   0   0          0
 n=6    0.2   0.1   0   1   0          2
 n=7    0.2   0.2   0   1   0          2
 n=8    0.2   0.9   0   0   0          0
 n=9    0.7   0.3   1   0   0          1
 n=10   0.9   0.9   1   1   0          3
Dtrain (after pruning)                 Dtrain-PPT
 n      X1    X2    Y1  Y2  Y3         YPPT
 n=1    0.7   0.4   1   0   0          1
 n=2    0.6   0.2   1   0   0          1
 n=3    0.1   0.9   0   0   0          0
 n=4    0.3   0.1   0   1   0          2
 n=5    0.1   0.8   0   0   0          0
 n=6    0.2   0.1   0   1   0          2
 n=7    0.2   0.2   0   1   0          2
 n=8    0.2   0.9   0   0   0          0
 n=9    0.7   0.3   1   0   0          1
 n=10   0.9   0.9   1   0   0          1
 n=11   0.9   0.9   0   1   0          2

The rare combination (n=10 in Dtrain-LP) is split into its frequent sub-combinations. PPT therefore may lose some label relationships, and which combinations survive depends entirely on the training set.
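A minimal pruning sketch, assuming a combination counts as "frequent" when it occurs at least t times; rare combinations are re-expressed via their frequent sub-combinations, as in the n=10 split above:

```python
# PPT-style pruning of label combinations.
import numpy as np
from collections import Counter

Y = np.array([[1,0,0],[1,0,0],[0,0,0],[0,1,0],[0,0,0],
              [0,1,0],[0,1,0],[0,0,0],[1,0,0],[1,1,0]])
t = 2

counts = Counter(map(tuple, Y))
frequent = {c for c, n in counts.items() if n >= t}

rows = []
for y in map(tuple, Y):
    if y in frequent:
        rows.append(y)
    else:
        # replace a rare combination by its frequent, non-empty subsets
        subs = [c for c in frequent
                if all(ci <= yi for ci, yi in zip(c, y)) and any(c)]
        rows.extend(subs)          # (the X row would be duplicated too)

print(Counter(map(tuple, rows)))   # pruned training combinations
```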
ML-KNN [Zhang & Zhou, 2007]: combines k-nearest neighbor with Bayesian inference to predict each label.
Output coding: inspired by the error-correcting code scheme [Dietterich 1995; Bose & Ray-Chaudhuri 1960] used in communication.
1. Encode: convert output vectors Y into codewords Z
2. Regress: learn a regression R from X to Z
3. Decode: recover the class assignments Y from R(X)
Examples: output coding via compressed sensing [Hsu et al, 2009] and via canonical correlation analysis [Zhang & Schneider, 2011].
Compress the d-dimensional label vector into a shorter q-dimensional codeword vector (d > q).
[Pipeline: Labels → (i) Encoding (SVD) → Codewords; Features → (ii) Regression → Codewords; Codewords → (iii) Decoding (SVD) → Labels]
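A PLST-flavoured sketch of the three steps (numpy/scikit-learn assumed; the synthetic data, the rank q, and the 0.5 rounding threshold are illustrative choices, not the published method's exact details):

```python
# Encode labels with a rank-q SVD projection, regress from X to the
# codewords, decode by projecting back and rounding.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 5))
Y = (X @ rng.random((5, 8)) + 0.1 * rng.random((50, 8)) > 1.4).astype(float)

q = 3                                  # codeword length, q < d = 8
Yc = Y - Y.mean(axis=0)                # center the label matrix
_, _, Vt = np.linalg.svd(Yc, full_matrices=False)
V = Vt[:q].T                           # d x q projection

Z = Yc @ V                             # (i) encode labels to codewords
reg = LinearRegression().fit(X, Z)     # (ii) regress features -> codewords
Z_hat = reg.predict(X)
Y_hat = (Z_hat @ V.T + Y.mean(axis=0) > 0.5).astype(int)   # (iii) decode

print("training Hamming error:", np.mean(Y_hat != Y))
```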
Method   Key difference
 OCCS    Uses compressed sensing [Donoho 2006] for encoding and decoding
 PLST    Uses singular value decomposition (SVD) [Johnson & Wichern 2002] for encoding and decoding
 CCAOC   Uses canonical correlation analysis (CCA) [Johnson & Wichern 2002] for encoding and mean-field approximation for decoding
 MMOC    Uses SVD for encoding and a maximum-margin formulation for decoding
Limitation: these methods rely on encoding/decoding procedures whose complexities are sensitive to d and N.
[Recap: independent classifiers (BR) ↔ all possible label combinations (LP); pruned label combinations (PPT); enriched feature space (CHF, IBLR); others: output coding, ML-KNN]
We want to achieve something better.
Probabilistic graphical models (PGMs) represent a joint distribution over a set of variables that is compatible with all the probabilistic independence propositions encoded in a graph. This lets us represent and manipulate the distribution without paying an exponential cost.
[Figure: two graphs over X1 and X2: undirected (MN), where a node is a variable and an edge denotes correlation, and directed (BN), where an edge denotes a causal relation]
Graphs encode (conditional) independence among variables: A and B are conditionally independent given C if P(A,B|C) = P(A|C)P(B|C).
CI representation in UGM: the chain A - C - B implies A ⊥ B | C.
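A small numeric check of the definition on a toy A - C - B factorization (numpy assumed; the probability tables are illustrative):

```python
# A and B are conditionally independent given C iff P(A,B|C) = P(A|C)P(B|C).
import numpy as np

# P(C), P(A|C), P(B|C) define a joint where C separates A and B.
pC = np.array([0.3, 0.7])
pA_C = np.array([[0.9, 0.2], [0.1, 0.8]])   # pA_C[a, c] = P(A=a|C=c)
pB_C = np.array([[0.6, 0.3], [0.4, 0.7]])   # pB_C[b, c] = P(B=b|C=c)

joint = np.einsum('ac,bc,c->abc', pA_C, pB_C, pC)   # P(A,B,C)

for c in (0, 1):
    pABc = joint[:, :, c] / joint[:, :, c].sum()    # P(A,B|C=c)
    pAc = pABc.sum(axis=1)
    pBc = pABc.sum(axis=0)
    assert np.allclose(pABc, np.outer(pAc, pBc))    # factorizes: A ⊥ B | C
print("A and B are conditionally independent given C")
```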
CI representations in DGM: the chain A → C → B and the fork A ← C → B imply A ⊥ B | C, but the v-structure A → C ← B does not (A and B become dependent given C).
Several PGM-based approaches have been proposed for multi-label classification (e.g., the CRF-based framework of [Pakdaman et al, 2014]).
Multi-dimensional Bayesian networks (MBC) [van der Gaag and de Waal, 2006]: represent P(X, Y) with a Bayesian network that captures the relations among the input and output variables.
[Figure: an MBC with a class subgraph over Y1, Y2, Y3, a feature subgraph over X1, ..., X4, and bridge edges between them]
The joint distribution P(X, Y) is represented by the decomposition P(X1|X2) · P(X2|X3) · P(X3) · P(X4|X2) for the features and P(Y1|Y2) · P(Y2|X2) · P(Y3|Y2) for the classes.
The joint can be represented efficiently using the Bayesian network, but the feature subnetwork does not carry much information for modeling the multi-label relations.
Conditional random fields (CRFs): model the class variables conditioned on the feature variables, capturing the relations between the input and output variables discriminatively.
[Figure: a cyclic pairwise structure over Y1, Y2, ..., Yd with pairwise potentials ψ1,2, ψ2,3, ..., ψd,1]

P(Y|X) = (1/Z(X)) · ∏(i,j) ψi,j(Yi, Yj) · ∏i 𝜚i(Yi, X)

(ψi,j and 𝜚i are the potentials over Yi, Yj, and X; Z is the normalization term)
To learn the model, Z must be computed, which is usually very costly; prediction requires inference at each step, whose computational cost is even more expensive. Approximations are typically required to make the model usable.
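To see why Z is the bottleneck, here is a brute-force sketch that sums over all 2^d label assignments for a cyclic pairwise model (random potentials, purely illustrative):

```python
# The normalizer of a general pairwise model over d binary labels sums
# over 2^d assignments; exact computation quickly becomes infeasible.
import itertools
import numpy as np

d = 15
rng = np.random.default_rng(0)
pairs = [(i, (i + 1) % d) for i in range(d)]           # cyclic pairwise edges
psi = rng.random((len(pairs), 2, 2)) + 0.5             # pairwise potentials
rho = rng.random((d, 2)) + 0.5                         # unary potentials

Z = 0.0
for y in itertools.product([0, 1], repeat=d):          # 2^15 = 32768 terms
    score = np.prod([psi[e][y[i], y[j]] for e, (i, j) in enumerate(pairs)])
    score *= np.prod([rho[i][y[i]] for i in range(d)])
    Z += score
print("Z =", Z, "after", 2 ** d, "evaluations")
```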
Classifier chains (CC) [Read et al, 2009]: order the labels in a chain; earlier classes in the chain condition the following class variables.
[Figure: a chain Y1 → Y2 → ... → Yd, all conditioned on X; each label depends on all preceding labels]
Training: for each i, fit θi = argmaxθ P(Yi | X, π(Yi); θ)
Prediction (greedy, along the chain): Yi = argmaxYi P(Yi | X, π(Yi); θ)
(π(Yi) denotes the parents of Yi, i.e., the preceding labels in the chain.)
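A minimal chain sketch (scikit-learn assumed; training conditions on the true preceding labels, prediction conditions greedily on its own outputs):

```python
# Classifier chain: label j is predicted from X plus all previously
# predicted labels (its parents in the chain).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

chain = []
for j in range(d):
    Xj = np.hstack([X, Y[:, :j]])          # true preceding labels at train time
    chain.append(LogisticRegression().fit(Xj, Y[:, j]))

x = np.array([[0.5, 0.5]])
y_hat = np.empty((1, 0))
for clf in chain:
    yj = clf.predict(np.hstack([x, y_hat]))   # greedy: condition on predictions
    y_hat = np.hstack([y_hat, yj.reshape(1, 1)])
print(y_hat.astype(int))
```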
Conditional tree-structured Bayesian networks (CTBNs) [Batal et al, 2013]: a Bayesian network over the class labels, conditioned on X, in which each class variable has at most one parent class variable.
[Figure: a CTBN over Y1, ..., Y4 conditioned on X; each label has at most one parent label]
Structure learning:
1. Define a complete weighted directed graph over the labels, whose edge weights are equal to the conditional log-likelihood.
2. Find the maximum branching tree of the graph. (* Maximum branching tree = maximum-weight directed spanning tree.)
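A hedged sketch of the two steps (scikit-learn and networkx's Edmonds-based maximum_spanning_arborescence are assumptions of this example; the edge weight used here is a conditional log-likelihood gain, a simplification of the paper's exact scoring):

```python
# Weight each candidate edge Yi -> Yj by the gain in conditional
# log-likelihood of Yj when Yi is added as a parent, then take a
# maximum-weight spanning arborescence over the labels.
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

def cll(j, parent=None):
    # conditional log-likelihood of label j given X (and optionally a parent)
    F = X if parent is None else np.hstack([X, Y[:, [parent]]])
    p = LogisticRegression().fit(F, Y[:, j]).predict_proba(F)
    return np.log(p[np.arange(len(Y)), Y[:, j]]).sum()

G = nx.DiGraph()
for i in range(d):
    for j in range(d):
        if i != j:
            G.add_edge(i, j, weight=cll(j, parent=i) - cll(j))

tree = nx.maximum_spanning_arborescence(G, attr='weight')
print(sorted(tree.edges()))   # each label ends up with at most one parent
```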
Prediction with CTBNs [Batal et al, 2013]: exact MAP inference over the learned tree using the belief-propagation (max-product) algorithm.
CC vs. CTBN:

              CC                          CTBN
 Parents      all preceding labels        at most one parent label
 Structure    chain structure             tree structure
 Learning     ordering chosen (often at   exact structure-learning
              random); suboptimal         algorithm
              solution
 Label        greedy along the chain      exact MAP (max-product)
 prediction
 Modeling     higher (denser parent       restricted (tree)
 ability      sets)
✓ Extensions using probabilistic graphical models (PGMs)
Ensemble methods: combine multiple models' predictions to produce a single classifier; aggregating the predictions often improves accuracy and robustness.
Ensembles of classifier chains (ECC): train multiple CCs on (re-sampled) data with random orderings of the class labels and average their predictions.
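A compact ECC sketch building on the chain sketch above (scikit-learn assumed; the orderings and the number of chains are illustrative):

```python
# ECC: several chains, each trained on a random permutation of the
# labels; the per-label votes are averaged.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
d, rng = Y.shape[1], np.random.default_rng(0)

def fit_chain(order):
    Yp = Y[:, order]
    return [LogisticRegression().fit(np.hstack([X, Yp[:, :j]]), Yp[:, j])
            for j in range(d)]

def predict_chain(chain, order, x):
    out, y_hat = np.zeros(d), np.empty((1, 0))
    for j, clf in enumerate(chain):
        yj = clf.predict(np.hstack([x, y_hat]))[0]
        y_hat = np.hstack([y_hat, [[yj]]])
        out[order[j]] = yj          # undo the permutation
    return out

x, votes = np.array([[0.5, 0.5]]), []
for _ in range(10):
    order = rng.permutation(d)
    votes.append(predict_chain(fit_chain(order), order, x))
print((np.mean(votes, axis=0) >= 0.5).astype(int))   # averaged ensemble vote
```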
[Figure: a chain over Y1, Y2, Y3, ..., Yd]
The performance of CC is sensitive to the label ordering; searching for a good ordering requires training many models and is often inaccurate.
If the underlying label dependencies are more complex than a tree structure, a single CTBN cannot model the data properly. Idea: learn multiple CTBNs and use them together for prediction.
Mixtures of CTBNs: P(Y|X) = Σk λk · P(Y|X, Tk), where λk is the mixture coefficient of the k-th component (the influence of the k-th CTBN model Tk on the mixture).
Learning (EM): introduce a latent variable z ∈ {1, …, K} for each instance, indicating which CTBN it belongs to.
When adding a new component, recalculate the weight of each data instance (ω) such that it represents the relative "hardness" of the instance, measured by its conditional log-likelihood under the current mixture.
Prediction: exact MAP inference over the mixture is expensive; a Markov-chain-based search can be used to speed up the search.
The approach remains practical: the models can be learned within a reasonable amount of time.
Evaluation: Hamming loss decomposes over individual labels, which can be optimized by the binary relevance (BR) model (per-label marginals). Subset 0/1 (exact-match) loss instead requires the jointly most probable class assignments (MAP).
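A small example showing how the two losses can disagree about which predictor is better (toy predictions, purely illustrative):

```python
# Hamming loss counts per-label mistakes; subset 0/1 loss only rewards
# matching the whole label vector exactly.
import numpy as np

Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
Y_a    = np.array([[1, 0, 0], [0, 1, 1], [1, 0, 0], [0, 1, 1]])  # 1 bit off per row
Y_b    = np.array([[1, 0, 1], [0, 1, 0], [0, 0, 1], [1, 1, 0]])  # 2 exact, 2 far off

for name, Y_hat in [("A", Y_a), ("B", Y_b)]:
    hamming = np.mean(Y_hat != Y_true)
    subset01 = np.mean((Y_hat != Y_true).any(axis=1))
    print(name, "Hamming:", hamming, "subset 0/1:", subset01)
# A wins on Hamming loss, B wins on subset 0/1 loss.
```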
http://meka.sourceforge.net/
http://mulan.sourceforge.net/
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multilabel/
http://lamda.nju.edu.cn/Default.aspx?Page=Data&NS=&AspxAutoDetectCookieSupport=1
http://cse.seu.edu.cn/old/people/zhangml/Resources.htm#codes
Summary of approaches:
- BR: independent classifiers
- LP: all possible label combinations
- PPT: pruned label combinations
- CHF, IBLR: enriched feature space
- Others: output coding, ML-KNN
- Achieving something better with PGMs: CC, CTBN, MC, ECC
I. Batal, C. Hong, and M. Hauskrecht. "An Efficient Probabilistic Framework for Multi-dimensional Classification". In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). 2013.
R. C. Bose and D. K. Ray-Chaudhuri. "On a class of error correcting binary group codes". In: Information and Control, 3. 1960, pp. 68–79.
M. R. Boutell et al. "Learning multi-label scene classification". In: Pattern Recognition 37.9 (2004).
W. Cheng and E. Hüllermeier. "Combining instance-based learning and logistic regression for multilabel classification". In: Machine Learning 76.2-3 (2009).
In: Lecture Notes in Computer Science. Springer, 2001.
T. G. Dietterich and G. Bakiri. "Solving Multiclass Learning Problems via Error-Correcting Output Codes". In: Journal of Artificial Intelligence Research 2 (1995), pp. 263–286.
D. L. Donoho. "Compressed sensing". In: IEEE Transactions on Information Theory 52.4 (April 2006), pp. 1289–1306.
N. Ghamrawi and A. McCallum. "Collective multi-label classification". In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). 2005, pp. 195–200.
S. Godbole and S. Sarawagi. "Discriminative Methods for Multi-labeled Classification". In: PAKDD'04. 2004, pp. 22–30.
C. Hong, I. Batal, and M. Hauskrecht. "A mixtures-of-trees framework for multi-label classification". In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). 2014.
R. A. Johnson and D. W. Wichern. "Applied Multivariate Statistical Analysis" (5th Ed.). Upper Saddle River, N.J.: Prentice-Hall, 2002.
M. Meilă and M. I. Jordan. "Learning with mixtures of trees". In: Journal of Machine Learning Research 1 (2000), pp. 1–48.
M. Pakdaman et al. "An optimization-based framework to learn conditional random fields for multi-label classification". In: SDM. SIAM, 2014.
J. Read, B. Pfahringer, and G. Holmes. "Multi-label Classification Using Ensembles of Pruned Sets". In: ICDM. IEEE Computer Society, 2008, pp. 995–1000.
J. Read et al. "Classifier Chains for Multi-label Classification". In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II (ECML PKDD '09). Bled, Slovenia: Springer-Verlag, 2009, pp. 254–269.
F. Tai and H.-T. Lin. "Multi-label Classification with Principal Label Space Transformation". In: Proceedings of the 2nd International Workshop on Multi-Label Learning. 2010.
L. C. van der Gaag and P. R. de Waal. "Multi-dimensional Bayesian Network Classifiers". In: Probabilistic Graphical Models. 2006, pp. 107–114.
Y. Zhang and J. Schneider. "A Composite Likelihood View for Multi-Label Classification". In: AISTATS (2012).
Y. Zhang and J. Schneider. "Maximum Margin Output Coding". In: Proceedings of the 29th International Conference on Machine Learning (ICML '12). Edinburgh, Scotland, UK: Omnipress, 2012, pp. 1575–1582.
M.-L. Zhang and Z.-H. Zhou. "ML-KNN: A lazy learning approach to multi-label learning". In: Pattern Recognition 40.7 (July 2007), pp. 2038–2048.