SLIDE 1

Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction

Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski
Wroclaw University of Technology, Poland

SLIDE 2

Outline

  • 1. Introduction to Structured Prediction
  • 2. Problem Description
  • 3. The concept of AdaBoostSeq
  • 4. Experiments

SLIDE 3

Structured prediction

Single value prediction

  • A function f maps an input to a simple output (binary classification, multiclass classification, or regression).
  • Example: predicting whether the next day will or will not be rainy on the basis of historical weather data.

Structured prediction

  • Prediction problems with more complex outputs (structured prediction).
  • Example: predicting the weather for the next few days.

SLIDE 4

Structured prediction

  • Structured prediction is a cost-sensitive prediction problem where the output has structure, i.e., it decomposes into variable-length vectors [Daumé].
  • Vector notation is treated as a useful encoding not only for sequence labeling problems.
  • Input = original input + partially produced output (an extended notion of the feature input space); see the sketch below.
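As a minimal Python sketch of this extended input notion (the concatenation and zero-padding conventions here are illustrative assumptions, not the paper's exact encoding):

```python
import numpy as np

def extend_input(x, partial_output, seq_len):
    """Append the partially produced output to the original input features.

    Positions not yet predicted are zero-padded, so the extended feature
    vector has a fixed length regardless of how far along the sequence is.
    """
    padded = np.zeros(seq_len)
    padded[:len(partial_output)] = partial_output
    return np.concatenate([x, padded])

# e.g. 20 original input features plus predictions for the first two positions
x_ext = extend_input(np.ones(20), [1.0, -1.0], seq_len=10)  # length 30
```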

SLIDE 5

Structured prediction algorithms

  • Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
  • Structured perceptron [Collins]
    – minimal requirements on the shape of the output space
    – easy to implement
    – poor generalization
  • Max-margin Markov Nets [Taskar et al.]
    – very useful
    – very slow
    – limited to the Hamming loss function

SLIDE 6

Structured prediction algorithms

  • Conditional Random Fields [Lafferty et al.]
    – extension of logistic regression to structured outputs
    – probabilistic outputs
    – good generalization
    – relatively slow
  • Support Vector Machine for Interdependent and Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.]
    – supports more loss functions

SLIDE 7


Ensembles

  • Combined may be better
    – the goal is to select the right components for building a good hybrid system
    – Lotfi Zadeh is reputed to have said:
      A good combined system is like: British Police, German Mechanics, French Cuisine, Swiss Banking, Italian Love.
      A bad combined system is like: British Cuisine, German Police, French Mechanics, Italian Banking, Swiss Love.

SLIDE 8

Problem Description

  • For a single case, a sequence of output values is predicted from the input attributes (prediction of sequential values).
SLIDE 9

Problem Statement

  • Binary sequence classification problem:

$$f: X \to Y$$

where:

X – the vector input
Y – a variable-length vector $(y^1, y^2, \ldots, y^T)$, with $y_i^{\mu} \in \{-1, 1\}$

  • where
    i = 1, 2, ..., N – the number of observations
    μ = 1, 2, ..., T – the length of the sequence
SLIDE 10

Problem Statement

  • Goal: T classifiers combined
    – an optimally designed linear combination
    – of K base classifiers, of the form:

$$F(x) = \sum_{k=1}^{K} \alpha_k\, \Phi(x; \Theta_k)$$

where:

$\Phi(x; \Theta_k)$ – the kth base classifier
$\Theta_k$ – the parameters of the kth classifier
$\alpha_k$ – the weight associated with the kth classifier
SLIDE 11

General Idea of AdaBoostSeq

[Figure: training data matrix with rows case 1, case 2, case 3, ..., case N and attribute columns 1-8, split into an input part and a target part]
SLIDE 12

AdaBoostSeq

  • A novel algorithm for sequence prediction
  • Optimization for each sequence item:

$$\{\alpha_k, \Theta_k\}_{k=1}^{K} = \arg\min_{\alpha_k,\, \Theta_k} \sum_{i=1}^{N} \exp\!\left(-y_i^{\mu} \sum_{k=1}^{K} \alpha_k\, \Phi(x_i; \Theta_k)\right)$$

  • The equation is highly complex => a stage-wise suboptimal method is performed
SLIDE 13

AdaBoostSeq

  • By definition of the mth partial sum:

$$F_m(x) = \sum_{k=1}^{m} \alpha_k\, \Phi(x; \Theta_k), \qquad m = 1, 2, \ldots, K$$

  • The recurrence is obvious:

$$F_m(x) = F_{m-1}(x) + \alpha_m\, \Phi(x; \Theta_m)$$

  • Stagewise optimization
    – at the mth step, $F_{m-1}(x)$ is fixed from the previous step
    – the new target is:

$$(\alpha_m, \Theta_m) = \arg\min_{\alpha,\, \Theta} J(\alpha, \Theta)$$
SLIDE 14

AdaBoostSeq

$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\!\left(-y_i^{\mu}\, F_{m-1}(x_i) - \xi\, \alpha\, y_i^{\mu}\, \Phi(x_i; \Theta) - (1 - \xi)\, \alpha\, R_{\mu}(x_i)\right)$$

  • where $R_{\mu}$ is the impact function denoting the influence of the quality of the preceding sequence labels' prediction:

$$R_{\mu}(x_i) = R_{\mu-1}(x_i) + y_i^{\mu-1} \sum_{j=1}^{K} \alpha_j\, \Phi(x_i; \Theta_j), \qquad R_1(x_i) = 0$$
SLIDE 15

AdaBoostSeq

  • For a given α:

$$\arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-\xi\, \alpha\, y_i^{\mu}\, \Phi(x_i; \Theta)\right), \qquad w_i^{(m)} = \exp\!\left(-y_i^{\mu}\, F_{m-1}(x_i) - (1 - \xi)\, \alpha\, R_{\mu}(x_i)\right)$$

  • Because $w_i^{(m)}$ depends neither on α nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$
  • Binary nature of the base classifier:

$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\!\left(1 - y_i^{\mu}\, \Phi(x_i; \Theta)\right)$$

where $P_m$ is the weighted empirical error and

$$I(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$
SLIDE 16

AdaBoostSeq

  • Computing the base classifier at step m:

$$P_m = \sum_{i:\; y_i^{\mu} \neq \Phi(x_i; \Theta_m)} w_i^{(m)}, \qquad 1 - P_m = \sum_{i:\; y_i^{\mu} = \Phi(x_i; \Theta_m)} w_i^{(m)}$$
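Splitting the exponential sum over correctly and incorrectly classified cases (a step the slides leave implicit; the ξ factor is suppressed here, as it is on the next slide):

$$\sum_{i=1}^{N} w_i^{(m)}\, e^{-\alpha\, y_i^{\mu}\, \Phi(x_i; \Theta_m)} = e^{-\alpha} \sum_{i:\; y_i^{\mu} = \Phi(x_i; \Theta_m)} w_i^{(m)} + e^{\alpha} \sum_{i:\; y_i^{\mu} \neq \Phi(x_i; \Theta_m)} w_i^{(m)} = e^{-\alpha}(1 - P_m) + e^{\alpha} P_m$$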

SLIDE 17

AdaBoostSeq

  • Getting the equations together:

$$\alpha_m = \arg\min_{\alpha}\left[e^{-\alpha}\,(1 - P_m) + e^{\alpha}\, P_m\right]$$

  • Setting the derivative to zero gives:

$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
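For completeness, the derivative step (routine, and omitted on the slide):

$$\frac{\partial}{\partial \alpha}\!\left[e^{-\alpha}(1 - P_m) + e^{\alpha} P_m\right] = -e^{-\alpha}(1 - P_m) + e^{\alpha} P_m = 0 \;\Longrightarrow\; e^{2\alpha} = \frac{1 - P_m}{P_m} \;\Longrightarrow\; \alpha_m = \frac{1}{2}\ln\frac{1 - P_m}{P_m}$$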

SLIDE 18

AdaBoostSeq

  • Weight of the ith case:

$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)}{Z_m}$$

  • $Z_m$ – the normalizer:

$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)$$

SLIDE 19

Algorithm AdaBoostSeq

  • For each sequence position (μ = 1 to T):
    – Initialization: $w_i^{(1)} = 1/N$, i = 1, 2, ..., N; m = 1
    – While the termination criterion is not met:
      • obtain the optimal $\Theta_m$ and $\Phi(\cdot\,; \Theta_m)$ (minimizing $P_m$)
      • obtain the optimal $P_m$
      • $\alpha_m = \frac{1}{2}\ln\!\left((1 - P_m)/P_m\right)$
      • $Z_m = 0$
      • For i = 1 to N:
        – $w_i^{(m+1)} = w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)$
        – $Z_m = Z_m + w_i^{(m+1)}$
      • End For
      • For i = 1 to N:
        – $w_i^{(m+1)} = w_i^{(m+1)} / Z_m$
      • End For
      • K = m; m = m + 1
    – End while
    – $f_{\mu}(\cdot) = \operatorname{sign}\!\left(\sum_{k=1}^{K} \alpha_k\, \Phi(\cdot\,; \Theta_k)\right)$
  • End for (a runnable sketch of this loop follows below)
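Below is a minimal Python sketch of the loop above, assuming scikit-learn decision stumps as the base classifiers (as in the experiments) and assuming the impact term R accumulates the margin y·F from preceding positions, per the reconstruction on slide 14; the function names and the termination criterion (P_m ≥ 0.5) are illustrative choices, not the paper's exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # max_depth=1 -> decision stump

def adaboost_seq(X, Y, xi=0.6, n_rounds=50):
    """Sketch of AdaBoostSeq: one boosted ensemble per sequence position.

    X: (N, d) inputs; Y: (N, T) labels in {-1, +1}; xi trades off the
    current-item loss against the impact of the preceding items.
    """
    N, T = Y.shape
    ensembles = []           # position mu -> list of (alpha_m, stump_m)
    R = np.zeros(N)          # impact of preceding positions; R_1 = 0
    for mu in range(T):
        y = Y[:, mu]
        w = np.full(N, 1.0 / N)          # w_i^(1) = 1/N
        F = np.zeros(N)                  # running F_m(x_i)
        stage = []
        for m in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            P = w[pred != y].sum()       # weighted empirical error P_m
            if P <= 0.0 or P >= 0.5:     # illustrative termination criterion
                break
            alpha = 0.5 * np.log((1.0 - P) / P)
            # w_i^(m+1) = w_i^(m) exp(-y xi alpha Phi - (1 - xi) alpha R_mu)
            w = w * np.exp(-y * xi * alpha * pred - (1.0 - xi) * alpha * R)
            w /= w.sum()                 # normalization by Z_m
            F += alpha * pred
            stage.append((alpha, stump))
        ensembles.append(stage)
        R = R + y * F   # assumed impact update: accumulated margin (slide 14)
    return ensembles

def predict_seq(ensembles, X):
    """f_mu(x) = sign(sum_k alpha_k Phi(x; Theta_k)) for every position mu."""
    preds = []
    for stage in ensembles:
        score = np.zeros(X.shape[0])
        for alpha, stump in stage:
            score += alpha * stump.predict(X)
        preds.append(np.where(score >= 0.0, 1.0, -1.0))
    return np.stack(preds, axis=1)
```

With this sign convention, cases whose previous positions were predicted with a large positive margin have their weights shrunk by the exp(-(1-ξ)α_m R_μ) factor, shifting the focus to cases mispredicted earlier, which is the behavior described on the next slide.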

SLIDE 20

Profile of AdaBoostSeq

  • A new algorithm for sequence prediction
  • For each sequence item:
    – AdaBoostSeq also considers the prediction errors for all previous items in the sequence within the boosting algorithm
    – the more errors on previous sequence items, the stronger the focus on badly predicted cases at the current item
  • Self-adaptive

SLIDE 21

Experiments

  • 4019 cases in the dataset
  • 20 input features
  • Sequence length = 10
  • Decision stump as the base classifier
  • 10-fold cross-validation (a hypothetical harness sketch follows below)
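A hypothetical cross-validation harness mirroring this setup; the paper's dataset is not specified here, so stand-in random data is used, and adaboost_seq and predict_seq are the sketch functions from the algorithm slide:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((4019, 20))                             # 4019 cases, 20 features
Y = np.where(rng.random((4019, 10)) > 0.5, 1.0, -1.0)  # sequences of length 10

fold_errors = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = adaboost_seq(X[train], Y[train], xi=0.6)
    pred = predict_seq(model, X[test])
    # on {-1, +1} labels, |pred - y| / 2 is the 0/1 error per position
    fold_errors.append(np.abs(pred - Y[test]).mean() / 2.0)

print(f"10-fold mean absolute error: {np.mean(fold_errors):.4f}")
```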

SLIDE 22

AdaBoost vs. AdaBoostSeq (varying ξ): Mean Absolute Error

For ξ = 1 it is the standard AdaBoost (the worst); ξ = 0.6 is the best.

Sequence mean absolute error:

  ξ:    0.4      0.5      0.6      0.7      0.8      0.9      1.0
  MAE:  0.0922   0.0828   0.0762   0.0810   0.0872   0.0945   0.1011
SLIDE 23

Summary of the Experiments

  • For items 2 and beyond, the error is reduced dramatically (6 times!) since the algorithm respects the errors on previous items
  • ξ influences the error
  • For ξ = 0.6 the error decreases by 24% for the whole sequence compared to the standard approach (ξ = 1)

[Chart: mean absolute error per sequence item (1 to 10) for ξ = 0.4, 0.6, 0.8, and 1]
SLIDE 24

Conclusions and Future Work

  • AdaBoostSeq – a new algorithm for sequence prediction based on AdaBoost
  • While predicting subsequent items in the sequence, the errors from the previous items are utilized
  • Much more accurate than AdaBoost applied to sequence items independently
  • Parameterized: ξ controls how strongly errors are respected
  • Recent application: prediction for debt valuation
  • Future work: new cost functions (on an HMM canvas)
