SLIDE 1

Boosting Algorithm with Sequence-loss Cost Function for Structured Prediction

Tomasz Kajdanowicz, Przemysław Kazienko, Jan Kraszewski
Wroclaw University of Technology, Poland

SLIDE 2

Outline

  • 1. Introduction to Structured Prediction
  • 2. Problem Description
  • 3. The concept of AdaBoostSeq
  • 4. Experiments

SLIDE 3

Structured prediction

Single value prediction

  • A function f maps an input to a simple output (binary classification, multiclass classification, or regression).
  • Example: predicting whether the next day will or will not be rainy on the basis of historical weather data.

Structured prediction

  • Prediction problems with more complex outputs (structured prediction).
  • Example: predicting the weather for the next few days.

SLIDE 4

Structured prediction

  • Structured prediction is a cost-sensitive prediction problem where the output has structure, i.e., it decomposes into variable-length vectors [Daumé].
  • Vector notation is treated as a useful encoding not only for sequence labeling problems.
  • Input = original input + partially produced output (an extended notion of the feature input space); see the sketch below.
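As a minimal Python sketch of this extended input notion (the concatenation and zero-padding conventions here are illustrative assumptions, not the paper's exact encoding):

```python
import numpy as np

def extend_input(x, partial_output, seq_len):
    """Append the partially produced output to the original input features.

    Positions not yet predicted are zero-padded, so the extended feature
    vector has a fixed length regardless of how far along the sequence is.
    """
    padded = np.zeros(seq_len)
    padded[:len(partial_output)] = partial_output
    return np.concatenate([x, padded])

# e.g. 20 original input features plus predictions for the first two positions
x_ext = extend_input(np.ones(20), [1.0, -1.0], seq_len=10)  # length 30
```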

SLIDE 5

Structured prediction algorithms

  • Most algorithms are based on well-known binary classification, adapted in a specific way [Nguyen et al.]
  • Structured perceptron [Collins]
    – minimal requirements on the shape of the output space
    – easy to implement
    – poor generalization
  • Max-margin Markov Nets [Taskar et al.]
    – very useful
    – very slow
    – limited to the Hamming loss function

SLIDE 6

Structured prediction algorithms

  • Conditional Random Fields [Lafferty et al.]
    – extension of logistic regression to structured outputs
    – probabilistic outputs
    – good generalization
    – relatively slow
  • Support Vector Machine for Interdependent and Structured Outputs (SVMSTRUCT) [Tsochantaridis et al.]
    – supports more loss functions

SLIDE 7


Ensembles

  • Combined may be better
    – the goal is to select the right components for building a good hybrid system
    – Lotfi Zadeh is reputed to have said:
      A good combined system is like: British Police, German Mechanics, French Cuisine, Swiss Banking, Italian Love.
      A bad combined system is like: British Cuisine, German Police, French Mechanics, Italian Banking, Swiss Love.

SLIDE 8

Problem Description

  • For a single case, a sequence of output values is predicted from the input attributes (prediction of sequential values).
SLIDE 9

Problem Statement

  • Binary sequence classification problem:

$$f: X \to Y$$

where:

X – the vector input
Y – a variable-length vector $(y^1, y^2, \ldots, y^T)$, with $y_i^{\mu} \in \{-1, 1\}$

  • where
    i = 1, 2, ..., N – the number of observations
    μ = 1, 2, ..., T – the length of the sequence
SLIDE 10

Problem Statement

  • Goal: T classifiers combined
    – an optimally designed linear combination
    – of K base classifiers, of the form:

$$F(x) = \sum_{k=1}^{K} \alpha_k\, \Phi(x; \Theta_k)$$

where:

$\Phi(x; \Theta_k)$ – the kth base classifier
$\Theta_k$ – the parameters of the kth classifier
$\alpha_k$ – the weight associated with the kth classifier
SLIDE 11

General Idea of AdaBoostSeq

[Figure: training data matrix with rows case 1, case 2, case 3, ..., case N and attribute columns 1-8, split into an input part and a target part]
SLIDE 12

AdaBoostSeq

  • A novel algorithm for sequence prediction
  • Optimization for each sequence item:

$$\{\alpha_k, \Theta_k\}_{k=1}^{K} = \arg\min_{\alpha_k,\, \Theta_k} \sum_{i=1}^{N} \exp\!\left(-y_i^{\mu} \sum_{k=1}^{K} \alpha_k\, \Phi(x_i; \Theta_k)\right)$$

  • The equation is highly complex => a stage-wise suboptimal method is performed
SLIDE 13

AdaBoostSeq

  • By definition of the mth partial sum:

$$F_m(x) = \sum_{k=1}^{m} \alpha_k\, \Phi(x; \Theta_k), \qquad m = 1, 2, \ldots, K$$

  • The recurrence is obvious:

$$F_m(x) = F_{m-1}(x) + \alpha_m\, \Phi(x; \Theta_m)$$

  • Stagewise optimization
    – at the mth step, $F_{m-1}(x)$ is fixed from the previous step
    – the new target is:

$$(\alpha_m, \Theta_m) = \arg\min_{\alpha,\, \Theta} J(\alpha, \Theta)$$
SLIDE 14

AdaBoostSeq

$$J(\alpha, \Theta) = \sum_{i=1}^{N} \exp\!\left(-y_i^{\mu}\, F_{m-1}(x_i) - \xi\, \alpha\, y_i^{\mu}\, \Phi(x_i; \Theta) - (1 - \xi)\, \alpha\, R_{\mu}(x_i)\right)$$

  • where $R_{\mu}$ is the impact function denoting the influence of the quality of the preceding sequence labels' prediction:

$$R_{\mu}(x_i) = R_{\mu-1}(x_i) + y_i^{\mu-1} \sum_{j=1}^{K} \alpha_j\, \Phi(x_i; \Theta_j), \qquad R_1(x_i) = 0$$
SLIDE 15

AdaBoostSeq

  • For a given α:

$$\arg\min_{\Theta} \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-\xi\, \alpha\, y_i^{\mu}\, \Phi(x_i; \Theta)\right), \qquad w_i^{(m)} = \exp\!\left(-y_i^{\mu}\, F_{m-1}(x_i) - (1 - \xi)\, \alpha\, R_{\mu}(x_i)\right)$$

  • Because $w_i^{(m)}$ depends neither on α nor on $\Phi(x_i; \Theta)$, it can be treated as a weight of $x_i$
  • Binary nature of the base classifier:

$$\Theta_m = \arg\min_{\Theta} P_m, \qquad P_m = \sum_{i=1}^{N} w_i^{(m)}\, I\!\left(1 - y_i^{\mu}\, \Phi(x_i; \Theta)\right)$$

where $P_m$ is the weighted empirical error and

$$I(x) = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$$
SLIDE 16

AdaBoostSeq

  • Computing the base classifier at step m:

$$P_m = \sum_{i:\; y_i^{\mu} \neq \Phi(x_i; \Theta_m)} w_i^{(m)}, \qquad 1 - P_m = \sum_{i:\; y_i^{\mu} = \Phi(x_i; \Theta_m)} w_i^{(m)}$$
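Splitting the exponential sum over correctly and incorrectly classified cases (a step the slides leave implicit; the ξ factor is suppressed here, as it is on the next slide):

$$\sum_{i=1}^{N} w_i^{(m)}\, e^{-\alpha\, y_i^{\mu}\, \Phi(x_i; \Theta_m)} = e^{-\alpha} \sum_{i:\; y_i^{\mu} = \Phi(x_i; \Theta_m)} w_i^{(m)} + e^{\alpha} \sum_{i:\; y_i^{\mu} \neq \Phi(x_i; \Theta_m)} w_i^{(m)} = e^{-\alpha}(1 - P_m) + e^{\alpha} P_m$$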

SLIDE 17

AdaBoostSeq

  • Getting the equations together:

$$\alpha_m = \arg\min_{\alpha}\left[e^{-\alpha}\,(1 - P_m) + e^{\alpha}\, P_m\right]$$

  • Setting the derivative to zero gives:

$$\alpha_m = \frac{1}{2} \ln \frac{1 - P_m}{P_m}$$
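For completeness, the derivative step (routine, and omitted on the slide):

$$\frac{\partial}{\partial \alpha}\!\left[e^{-\alpha}(1 - P_m) + e^{\alpha} P_m\right] = -e^{-\alpha}(1 - P_m) + e^{\alpha} P_m = 0 \;\Longrightarrow\; e^{2\alpha} = \frac{1 - P_m}{P_m} \;\Longrightarrow\; \alpha_m = \frac{1}{2}\ln\frac{1 - P_m}{P_m}$$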

SLIDE 18

AdaBoostSeq

  • Weight of the ith case:

$$w_i^{(m+1)} = \frac{w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)}{Z_m}$$

  • $Z_m$ – the normalizer:

$$Z_m = \sum_{i=1}^{N} w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)$$

SLIDE 19

Algorithm AdaBoostSeq

  • For each sequence position (μ = 1 to T):
    – Initialization: $w_i^{(1)} = 1/N$, i = 1, 2, ..., N; m = 1
    – While the termination criterion is not met:
      • obtain the optimal $\Theta_m$ and $\Phi(\cdot\,; \Theta_m)$ (minimizing $P_m$)
      • obtain the optimal $P_m$
      • $\alpha_m = \frac{1}{2}\ln\!\left((1 - P_m)/P_m\right)$
      • $Z_m = 0$
      • For i = 1 to N:
        – $w_i^{(m+1)} = w_i^{(m)} \exp\!\left(-y_i^{\mu}\, \xi\, \alpha_m\, \Phi(x_i; \Theta_m) - (1 - \xi)\, \alpha_m\, R_{\mu}(x_i)\right)$
        – $Z_m = Z_m + w_i^{(m+1)}$
      • End For
      • For i = 1 to N:
        – $w_i^{(m+1)} = w_i^{(m+1)} / Z_m$
      • End For
      • K = m; m = m + 1
    – End while
    – $f_{\mu}(\cdot) = \operatorname{sign}\!\left(\sum_{k=1}^{K} \alpha_k\, \Phi(\cdot\,; \Theta_k)\right)$
  • End for (a runnable sketch of this loop follows below)
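Below is a minimal Python sketch of the loop above, assuming scikit-learn decision stumps as the base classifiers (as in the experiments) and assuming the impact term R accumulates the margin y·F from preceding positions, per the reconstruction on slide 14; the function names and the termination criterion (P_m ≥ 0.5) are illustrative choices, not the paper's exact implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # max_depth=1 -> decision stump

def adaboost_seq(X, Y, xi=0.6, n_rounds=50):
    """Sketch of AdaBoostSeq: one boosted ensemble per sequence position.

    X: (N, d) inputs; Y: (N, T) labels in {-1, +1}; xi trades off the
    current-item loss against the impact of the preceding items.
    """
    N, T = Y.shape
    ensembles = []           # position mu -> list of (alpha_m, stump_m)
    R = np.zeros(N)          # impact of preceding positions; R_1 = 0
    for mu in range(T):
        y = Y[:, mu]
        w = np.full(N, 1.0 / N)          # w_i^(1) = 1/N
        F = np.zeros(N)                  # running F_m(x_i)
        stage = []
        for m in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            P = w[pred != y].sum()       # weighted empirical error P_m
            if P <= 0.0 or P >= 0.5:     # illustrative termination criterion
                break
            alpha = 0.5 * np.log((1.0 - P) / P)
            # w_i^(m+1) = w_i^(m) exp(-y xi alpha Phi - (1 - xi) alpha R_mu)
            w = w * np.exp(-y * xi * alpha * pred - (1.0 - xi) * alpha * R)
            w /= w.sum()                 # normalization by Z_m
            F += alpha * pred
            stage.append((alpha, stump))
        ensembles.append(stage)
        R = R + y * F   # assumed impact update: accumulated margin (slide 14)
    return ensembles

def predict_seq(ensembles, X):
    """f_mu(x) = sign(sum_k alpha_k Phi(x; Theta_k)) for every position mu."""
    preds = []
    for stage in ensembles:
        score = np.zeros(X.shape[0])
        for alpha, stump in stage:
            score += alpha * stump.predict(X)
        preds.append(np.where(score >= 0.0, 1.0, -1.0))
    return np.stack(preds, axis=1)
```

With this sign convention, cases whose previous positions were predicted with a large positive margin have their weights shrunk by the exp(-(1-ξ)α_m R_μ) factor, shifting the focus to cases mispredicted earlier, which is the behavior described on the next slide.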

SLIDE 20

Profile of AdaBoostSeq

  • A new algorithm for sequence prediction
  • For each sequence item:
    – AdaBoostSeq also considers the prediction errors for all previous items in the sequence within the boosting algorithm
    – the more errors on previous sequence items, the stronger the focus on badly predicted cases at the current item
  • Self-adaptive

SLIDE 21

Experiments

  • 4019 cases in the dataset
  • 20 input features
  • Sequence length = 10
  • Decision stump as the base classifier
  • 10-fold cross-validation (a hypothetical harness sketch follows below)
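A hypothetical cross-validation harness mirroring this setup; the paper's dataset is not specified here, so stand-in random data is used, and adaboost_seq and predict_seq are the sketch functions from the algorithm slide:

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.random((4019, 20))                             # 4019 cases, 20 features
Y = np.where(rng.random((4019, 10)) > 0.5, 1.0, -1.0)  # sequences of length 10

fold_errors = []
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = adaboost_seq(X[train], Y[train], xi=0.6)
    pred = predict_seq(model, X[test])
    # on {-1, +1} labels, |pred - y| / 2 is the 0/1 error per position
    fold_errors.append(np.abs(pred - Y[test]).mean() / 2.0)

print(f"10-fold mean absolute error: {np.mean(fold_errors):.4f}")
```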

SLIDE 22

AdaBoost vs. AdaBoostSeq (varying ξ): Mean Absolute Error

For ξ = 1 it is the standard AdaBoost (the worst); ξ = 0.6 is the best.

Sequence mean absolute error:

  ξ:    0.4      0.5      0.6      0.7      0.8      0.9      1.0
  MAE:  0.0922   0.0828   0.0762   0.0810   0.0872   0.0945   0.1011
SLIDE 23

Summary of the Experiments

  • For items 2 and beyond, the error is reduced dramatically (6 times!) since the algorithm respects the errors on previous items
  • ξ influences the error
  • For ξ = 0.6 the error decreases by 24% for the whole sequence compared to the standard approach (ξ = 1)

[Chart: mean absolute error per sequence item (1 to 10) for ξ = 0.4, 0.6, 0.8, and 1]
SLIDE 24

Conclusions and Future Work

  • AdaBoostSeq – a new algorithm for sequence prediction based on AdaBoost
  • While predicting subsequent items in the sequence, the errors from the previous items are utilized
  • Much more accurate than AdaBoost applied to sequence items independently
  • Parameterized: ξ controls how strongly errors are respected
  • Recent application: prediction for debt valuation
  • Future work: new cost functions (on an HMM canvas)
