Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction
Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski Wroclaw University of Technology, Poland
Single value prediction
simple output (binary classification, multiclass classification, or regression). Example: the problem of predicting whether or not the next day will be rainy on the basis of historical weather data.
Structured prediction
complex outputs (structured prediction). Example: the problem of predicting the weather for the next few days.
a prediction problem where the output has structure
Vector notation is treated as a useful encoding not only for sequence labeling problems. Input = original input + partially produced output (an extended notion of the feature input space).
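A minimal sketch (helper and variable names are invented, not from the paper) of how such an extended input can be built for one sequence item:

```python
import numpy as np

def extend_input(x_t, previous_outputs, window=3):
    """Concatenate the original attributes of sequence item t with the
    partially produced output: the predictions already made for the
    preceding items, padded to a fixed-length window."""
    prev = list(previous_outputs[-window:])        # most recent predictions
    padded = [0.0] * (window - len(prev)) + prev   # pad at the sequence head
    return np.concatenate([np.asarray(x_t, dtype=float), padded])

# item 4 of a sequence: its own attributes plus outputs predicted for items 1-3
print(extend_input([0.7, 1.2], previous_outputs=[1, -1, 1]))
# -> [ 0.7  1.2  1.  -1.   1. ]
```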
binary classification adapted in a specific way [Nguyen et al.]
– minimal requirements on the output space shape
– easy to implement
– poor generalization
– very useful
– performs very slowly
– limited to the Hamming loss function
Support Vector Machine for Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.]
– the goal is to select the right components for building a good hybrid system
– Lotfi Zadeh is reputed to have said:
A good combined system is like: British police, German mechanics, French cuisine, Swiss banking and Italian love.
A bad combined system is like: British cuisine, German police, French mechanics, Italian banking and Swiss love.
– prediction of sequential values: a sequence of output values
– input: attributes
where: $y_i \in \{-1, 1\}$
– optimally designed linear combination
– K base classifiers of the form:

$$F(x) = \sum_{k=1}^{K} \alpha_k\, \Phi(x; \Theta_k)$$

where $\Phi(x; \Theta_k)$ is the k-th base classifier with parameters $\Theta_k$ and $\alpha_k$ is its combination weight.
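As a toy illustration (all classifiers and weights below are invented), the strong classifier is just the α-weighted vote of the base classifiers:

```python
import numpy as np

# three hypothetical base classifiers Φ(x; Θ_k) returning labels in {-1, +1}
base_classifiers = [
    lambda x: 1 if x[0] > 0.5 else -1,
    lambda x: 1 if x[1] > 0.0 else -1,
    lambda x: 1 if x.sum() > 1.0 else -1,
]
alphas = [0.9, 0.4, 0.6]  # invented combination weights α_k

def F(x):
    """Weighted linear combination of the K base classifiers."""
    return sum(a * phi(x) for a, phi in zip(alphas, base_classifiers))

x = np.array([0.8, -0.2])
print(F(x), "->", np.sign(F(x)))  # 0.9 - 0.4 - 0.6 = -0.1 -> -1.0
```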
[Table: training data layout - cases 1...N in rows; columns 1-8 are the attributes, split into input attributes and target values]
The parameters are estimated by minimizing the exponential cost over the training set:

$$\left\{\alpha_k, \Theta_k\right\}_{k=1}^{K}:\quad \arg\min \sum_{i=1}^{N} \exp\!\left(-y_i \sum_{k=1}^{K} \alpha_k\, \Phi(x_i; \Theta_k)\right)$$
– at the m-th step, $F_{m-1}(x)$ is the part built in the previous steps
– the new target is:

$$F_m(x) = F_{m-1}(x) + \alpha_m\, \Phi(x; \Theta_m), \qquad m = 1, 2, \ldots, K$$

$$(\alpha_m, \Theta_m) = \arg\min_{\alpha,\, \Theta} J(\alpha, \Theta)$$
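Unrolling this greedy stage-wise recursion (with the usual convention that the ensemble starts empty) recovers the full linear combination $F(x)$ defined earlier:

$$F_K(x) = \sum_{m=1}^{K} \alpha_m\, \Phi(x; \Theta_m), \qquad F_0(x) = 0$$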
where:

$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\!\left(-y_i \left[F_{m-1}(x_i) + \alpha\, \Phi(x_i; \Theta)\right] - (1-\xi)\, R^{(m-1)}(x_i)\right)$$

and $R^{(m-1)}(x_i)$ describes the quality of the preceding sequence labels prediction:

$$R^{(m)}(x_i) = R^{(m-1)}(x_i) + \frac{1}{K} \sum_{j=1}^{K} y_i^{(j)}\, F_m\!\left(x_i^{(j)}\right)$$
– since $w_i^{(m)}$ depends neither on $\alpha$ nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$:

$$(\alpha_m, \Theta_m) = \arg\min_{\alpha,\, \Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-y_i\, \alpha\, \Phi(x_i; \Theta)\right)$$

$$w_i^{(m)} = \exp\!\left(-y_i\, F_{m-1}(x_i) - (1-\xi)\, R^{(m-1)}(x_i)\right)$$
– $P_m$ is the weighted empirical error:

$$P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\!\left(y_i \neq \Phi(x_i; \Theta_m)\right)$$

$$I(x) = \begin{cases} 1, & \text{if } x \text{ is true} \\ 0, & \text{if } x \text{ is false} \end{cases}$$
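A small sketch (data invented) of computing $P_m$: it simply collects the weights of the cases the current base classifier gets wrong:

```python
import numpy as np

def weighted_error(w, y_true, y_pred):
    """P_m = sum of weights w_i over the cases where y_i != Φ(x_i; Θ_m)."""
    return float(np.sum(w * (y_true != y_pred)))

w = np.array([0.1, 0.2, 0.3, 0.4])        # normalized case weights w_i^(m)
y_true = np.array([1, -1, 1, -1])
y_pred = np.array([1, 1, 1, -1])          # base classifier errs on case 2
print(weighted_error(w, y_true, y_pred))  # -> 0.2
```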
$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
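For intuition (numbers invented), a base classifier with weighted error $P_m = 0.2$ receives the weight:

$$\alpha_m = \frac{1}{2} \ln \frac{1 - 0.2}{0.2} = \frac{1}{2} \ln 4 \approx 0.693$$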
$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\!\left(-y_i\, \alpha_m\, \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m\, R^{(m)}(x_i)\right)}{Z_m}$$

$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-y_i\, \alpha_m\, \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m\, R^{(m)}(x_i)\right)$$
– Initialization: $w_i^{(1)} = 1/N$, $i = 1, 2, \ldots, N$; $m = 1$
– While the termination criterion is not met:
  – train the base classifier $\Phi(x; \Theta_m)$ (min. $P_m$)
  – compute $\alpha_m$
  – $w_i^{(m+1)} = w_i^{(m)} \exp(-y_i\, \alpha_m\, \Phi(x_i; \Theta_m) - (1-\xi)\, \alpha_m\, R^{(m)}(x_i))$
  – $Z_m = \sum_i w_i^{(m+1)}$
  – $w_i^{(m+1)} = w_i^{(m+1)} / Z_m$
  – $m = m + 1$
– End while
– $f(\cdot) = \mathrm{sign}\left(\sum_{k=1}^{K} \alpha_k\, \Phi(\cdot; \Theta_k)\right)$
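A condensed Python sketch of the loop above (not the authors' code): decision stumps stand in for the base classifiers $\Phi$, and the sequence-quality term $R$ is passed in as a fixed per-case vector for simplicity, whereas the full algorithm recomputes $R^{(m)}$ from the predictions on the preceding sequence items.

```python
import numpy as np

def adaboost_seq(X, y, R, xi=0.8, rounds=20):
    """Sketch of the boosting loop above.
    X: (N, d) extended inputs, y: labels in {-1, +1},
    R: (N,) quality of the preceding sequence-label predictions
       (here taken as fixed per case for simplicity),
    xi: sequence-loss parameter (xi = 1 -> plain AdaBoost)."""
    N = len(y)
    w = np.full(N, 1.0 / N)            # w_i^(1) = 1/N
    ensemble = []                      # (alpha_m, stump) pairs
    for m in range(rounds):
        stump = fit_stump(X, y, w)     # train Φ(x; Θ_m) minimizing P_m
        pred = stump_predict(stump, X)
        P = np.sum(w * (pred != y))    # weighted empirical error P_m
        if P <= 0 or P >= 0.5:
            break
        alpha = 0.5 * np.log((1 - P) / P)
        # sequence-loss weight update:
        # w_i <- w_i * exp(-y_i*alpha*Φ(x_i) - (1-xi)*alpha*R(x_i))
        w *= np.exp(-y * alpha * pred - (1 - xi) * alpha * R)
        w /= w.sum()                   # normalize by Z_m
        ensemble.append((alpha, stump))
    return ensemble

def fit_stump(X, y, w):
    """Best single-feature threshold stump under case weights w."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, t, s)
    return best[1:]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] > t, s, -s)

def predict(ensemble, X):
    """f(x) = sign(sum_k alpha_k * Φ(x; Θ_k))"""
    F = sum(a * stump_predict(st, X) for a, st in ensemble)
    return np.sign(F)
```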
– AdaBoostSeq also considers the prediction errors for all previous items in the sequence within the boosting algorithm
– the more errors on the previous sequence items, the stronger the focus on hard cases at the current item
Sequence mean absolute error for different values of ξ:

ξ:     0.4     0.5     0.6     0.7     0.8     0.9     1.0
MAE:   0.0922  0.0828  0.0762  0.0810  0.0872  0.0945  0.1011
– the error is reduced dramatically (6 times!) since the method respects the errors on previous items
– the error drops by 24% for the whole sequence compared to the standard approach (ξ = 1)
[Figure: mean absolute error per sequence item (1-10) for ξ = 0.4, 0.6, 0.8 and ξ = 1]
– a method for structured prediction based on AdaBoost
– while predicting an item in the sequence, the errors from the previous items are utilized
– the standard approach, in contrast, treats sequence items independently