Feature Bagging for Author Attribution, PAN-CLEF 2012
François-Marie Giraud / Thierry Artières
LIP6, University Paris 6, France
Motivation
- From the literature on author attribution
– Hard to beat a simple and efficient system
- Hypothetical explanations
– Intrinsic difficulty of defining relevant stylistic features
- Individual stylistic features are embedded and hidden in a large number of features
- Stylistic features depend on the writer
– Optimization concern
- Undertraining phenomenon [McCallum et al., CIIR 2005]
Linear SVM on a bag of features
Motivation
- Undertraining phenomenon
Training document set: bag of features (words sorted from most to least frequent)
Linear SVM: discrimination based on the red features only
- The red subset of features alone allows perfect training-set discrimination
- The blue subset of features alone would also allow it
- The green subset is useless
Motivation
- Undertraining phenomenon
Test document containing no red features ⇒ the linear SVM makes a near-random prediction
Undertraining investigation
Document: bag of 2500 features (words sorted from most to least frequent)
[Plots: training and validation accuracy for three feature-selection schemes: keeping the first X most frequent features, keeping all but the first X, and keeping X random features]
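The three schemes in the plots above can be sketched as follows. This is a minimal illustration, not the paper's experiment: the corpus is synthetic, the feature counts and the planted author signal are invented, and a nearest-centroid classifier stands in for the linear SVM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the bag-of-words data (illustrative only):
# 200 documents, 2500 features sorted from most to least frequent, 2 authors.
n_docs, n_feats = 200, 2500
y = rng.integers(0, 2, size=n_docs)
X = rng.poisson(lam=np.linspace(5.0, 0.1, n_feats), size=(n_docs, n_feats)).astype(float)
X[:, :50] += y[:, None]  # invented author signal on the 50 most frequent words

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te):
    """Simple stand-in classifier (the paper uses a linear SVM)."""
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    dist = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((dist.argmin(axis=1) == y_te).mean())

split = n_docs // 2
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

x = 100
schemes = {
    "first X":   np.arange(x),                           # X most frequent words
    "all but X": np.arange(x, n_feats),                  # drop the X most frequent
    "random X":  rng.choice(n_feats, x, replace=False),  # X random words
}
for name, cols in schemes.items():
    acc = nearest_centroid_accuracy(X_tr[:, cols], y_tr, X_te[:, cols], y_te)
    print(f"{name:9s} validation accuracy: {acc:.2f}")
```

With real data, comparing these three curves as X grows is what reveals the undertraining effect: the classifier can lean on a small discriminative subset and underuse the rest.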
Principle of feature bagging
- K base classifiers learned on random subsets of features
- Each subset: a random selection of 50 to 200 features, drawn from a bag of the ~3000 most frequent words
- Pipeline: document (bag of words) → K base classifiers → aggregation of the base classifiers' results by majority vote → predicted author
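The pipeline above can be sketched in a few lines. Everything here is an illustrative assumption except the structure itself: the data is synthetic, the base classifier is a nearest-centroid stand-in for the linear SVMs, and K = 25 is arbitrary; only the random 50-to-200-feature subsets, the ~3000-word bag, and the majority vote come from the slide.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data standing in for bags of the ~3000 most frequent words:
# 120 documents, 3 candidate authors, weak author signal spread over many words.
n_docs, n_feats, n_authors = 120, 3000, 3
y = rng.integers(0, n_authors, size=n_docs)
X = rng.poisson(2.0, size=(n_docs, n_feats)).astype(float)
for a in range(n_authors):
    X[y == a, a * 300:(a + 1) * 300] += 0.5  # invented per-author word preferences

split = n_docs // 2
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

def fit(X_tr, y_tr):
    """One base classifier: nearest centroid (stand-in for a linear SVM)."""
    return np.stack([X_tr[y_tr == a].mean(axis=0) for a in range(n_authors)])

def predict(centroids, X_te):
    dist = ((X_te[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

K = 25
votes = np.zeros((len(X_te), n_authors), dtype=int)
for _ in range(K):
    size = rng.integers(50, 201)                     # 50 to 200 features
    cols = rng.choice(n_feats, size, replace=False)  # random feature subset
    preds = predict(fit(X_tr[:, cols], y_tr), X_te[:, cols])
    np.add.at(votes, (np.arange(len(preds)), preds), 1)

ensemble = votes.argmax(axis=1)                      # majority vote
print("ensemble accuracy:", round(float((ensemble == y_te).mean()), 2))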
Preliminary results
Comparison with the baseline; statistics on the base classifiers
Publicly available English blog corpus
Experimental methodology for PAN
Learning stage
- Each author's training documents are split into two halves (A1/A2, B1/B2, C1/C2)
- Several train/validation splits: train on one half per author and validate on the other, swapping one author's halves at a time (train A1 B1 C1 / valid A2 B2 C2; train A2 B1 C1 / valid A1 B2 C2; etc.)
- The final model is trained on all documents (A1 A2 B1 B2 C1 C2)
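The split rotation can be made concrete with a small sketch. The half-set labels come from the slide; the exact rotation (one base split plus one variant per author with that author's halves swapped) is our reading of the diagram, so treat it as a hypothetical reconstruction.

```python
# Half-sets per author, as on the slide: A1/A2, B1/B2, C1/C2.
halves = {"A": ("A1", "A2"), "B": ("B1", "B2"), "C": ("C1", "C2")}

def make_splits(halves):
    """Yield (train, valid) label maps: a base split, then one variant
    per author with that author's two halves swapped."""
    train = {a: h[0] for a, h in halves.items()}
    valid = {a: h[1] for a, h in halves.items()}
    yield dict(train), dict(valid)
    for author in halves:
        t, v = dict(train), dict(valid)
        t[author], v[author] = v[author], t[author]
        yield t, v

splits = list(make_splits(halves))
for t, v in splits:
    print("train:", sorted(t.values()), "valid:", sorted(v.values()))
# The final model is then retrained on all six halves.
```

Averaging validation scores over these splits is what the later slide refers to when it notes the value of using several training/validation splits.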
Comments on PAN results
- Fewer random features per base classifier works better
- Better ranks on closed tasks
- The reject method has to be improved
- Using several training/validation splits is beneficial
Perspective: a two-stage approach
Author profiles from the unmasking method [Koppel 2007]
- Motivation
– The way the classifier behaves when features are removed depends on the author [Koppel 2007]
- Investigate combining this result with our feature bagging approach
Two-Stage Approach
1. Bagging approach: learn multiple base classifiers exploiting randomly selected subsets of features.
2. Build new data (called a profile) for each pair (document, author).
3. (Optional) Sort all vectors of the new dataset according to …
4. Learn a binary classifier to decide whether a profile is correct or not.
Profile vector for document d and author a
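Steps 2 and 3 can be sketched as follows. This is a hypothetical reading of the slide: we assume the profile of a (document, author) pair collects, from each of the K base classifiers, the score that classifier gives this author on this document; the random scores here are stand-ins for real classifier outputs.

```python
import numpy as np

rng = np.random.default_rng(7)

K, n_authors = 20, 3                   # illustrative sizes, not the paper's
scores = rng.random((K, n_authors))    # scores[k, a]: classifier k's score for author a
                                       # (random stand-ins for one document)

def profile(scores, author):
    """Profile vector for one (document, author) pair: the K base classifier
    scores for that author, sorted high to low (the optional sorting step
    makes profiles comparable across pairs)."""
    return np.sort(scores[:, author])[::-1]

p = profile(scores, author=1)
print(p.shape)   # one K-dimensional profile vector
```

A binary classifier is then trained on such vectors to separate true-author profiles from false-author ones (step 4).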
Two Stage Approach
[Plot: feature values of (sorted) true-author profiles vs false-author profiles]
Similar results to the bagging approach
Conclusion and further work
- Feature bagging approach enforces exploiting all features
⇒ Outperforms the SVM baseline
⇒ Should be improved for handling open problems (cf. PAN results)
- The second approach gives similar results while using a different representation