Proximity based one-class classification with Common N-Gram - PowerPoint PPT Presentation



SLIDE 1

Proximity based one-class classification with Common N-Gram dissimilarity for authorship verification task

PAN 2013 Author Identification Magdalena Jankowska, Vlado Kešelj and Evangelos Milios

Faculty of Computer Science, Dalhousie University, Halifax, Canada

PAN Workshop, CLEF 2013, Valencia, September 25, 2013

SLIDE 2

Authorship verification problem

Input:

  • A: a set of "known" documents by a given author
  • u: an "unknown" document (a document of questioned authorship)

SLIDE 3

Authorship verification problem

Input:

  • A: a set of "known" documents by a given author
  • u: an "unknown" document (a document of questioned authorship)

Question: Was u written by the same author?

SLIDE 4

Our approach to the authorship verification problem

  • Proximity-based one-class classification: is the "unknown" document u "similar enough" to the set A of "known" documents?
  • Idea similar to the k-centres method for one-class classification
  • Applies the CNG dissimilarity between documents
SLIDE 5

Common N-Gram (CNG) dissimilarity

Proposed by

Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas.

N-gram-based author profiles for authorship attribution. In Proc. of the Conference Pacific Association for Computational Linguistics, 2003.

Proposed as the dissimilarity measure of the Common N-Gram (CNG) classifier for multi-class classification: a questioned document is assigned to the least dissimilar class (e.g., works of Carroll vs. works of Twain vs. works of Shakespeare).

Successfully applied to the authorship attribution problem

SLIDE 6

CNG dissimilarity - formula

Profile: the sequence of the L most frequent n-grams of a given length n

SLIDE 7

CNG dissimilarity - formula

Profile: the sequence of the L most frequent n-grams of a given length n. Example for n=4, L=6 ("_" denotes a space).

document 1: Alice's Adventures in Wonderland by Lewis Carroll

  profile Q1:
    n-gram | normalized frequency f1
    _the   | 0.0127
    the_   | 0.0098
    and_   | 0.0052
    _and   | 0.0049
    ing_   | 0.0047
    _to_   | 0.0044

SLIDE 8

CNG dissimilarity - formula

Profile: the sequence of the L most frequent n-grams of a given length n. Example for n=4, L=6 ("_" denotes a space).

document 2 (in addition to document 1, Alice's Adventures in Wonderland, from the previous slide): Tarzan of the Apes by Edgar Rice Burroughs

  profile Q2:
    n-gram | normalized frequency f2
    _the   | 0.0148
    the_   | 0.0115
    and_   | 0.0053
    _of_   | 0.0052
    _and   | 0.0052
    ing_   | 0.0040

SLIDE 9

CNG dissimilarity - formula

CNG dissimilarity between the two example documents (profile Q1 of Alice's Adventures in Wonderland by Lewis Carroll, profile Q2 of Tarzan of the Apes by Edgar Rice Burroughs; n=4, L=6):

  d(Q1, Q2) = Σ_{x ∈ Q1 ∪ Q2} ( (f1(x) − f2(x)) / ((f1(x) + f2(x)) / 2) )²

where f_j(x) is the normalized frequency of n-gram x in profile Q_j, and f_j(x) = 0 if x does not appear in Q_j.
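For profiles stored as dicts from n-gram to normalized frequency, the formula translates directly into Python (a sketch, not the authors' code):

```python
def cng_dissimilarity(p1, p2):
    """CNG dissimilarity between two profiles (dicts: n-gram -> frequency).
    An n-gram absent from a profile has frequency 0."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        # squared relative difference of frequencies, normalized by their mean
        total += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return total
```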

SLIDE 10

Proximity-based one-class classification: dissimilarity between instances

  • A: set of "known" documents by a given author; u: "unknown" document
  • D(a_j, u): dissimilarity between a given "known" document a_j and the "unknown" document u

SLIDE 11

𝑣 π‘¬π’π’ƒπ’š 𝒆𝒋, 𝑩 this author’s document most dissimilar to 𝑒𝑗 𝑒𝑗 𝑩 Maximum dissimilarity between 𝑒𝑗 and any β€œknown” document Set of β€œknown” documents by a given author β€œunknown” document 𝑬 𝒆𝒋, 𝒗 Dissimilarity between a given β€œknown” document and the β€œunknown” document

Proximity-based one-class classification: dissimilarity between instances

SLIDE 12

Proximity-based one-class classification: dissimilarity between instances

Dissimilarity ratio of a_j:

  s(a_j, u, A) = D(a_j, u) / Dmax(a_j, A)

i.e., how much more (or less) dissimilar the "unknown" document is than the most dissimilar document by the same author.

SLIDE 13

Proximity-based one-class classification: proximity between a sample and the positive class instances

Measure of proximity between the "unknown" document u and the set A of documents by a given author:

  M(u, A): the average of the dissimilarity ratios s(a_j, u, A) over all "known" documents a_j

SLIDE 14

Proximity-based one-class classification: thresholding on the proximity

M(u, A): the average of the dissimilarity ratios s(a_j, u, A) over all "known" documents a_j

If M(u, A) is less than or equal to a threshold θ: classify u as belonging to A, i.e., as written by the same author.
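The decision procedure of the last few slides can be sketched in Python. This is a hypothetical re-implementation, not the authors' code: profiles are dicts mapping an n-gram to its normalized frequency, at least two "known" profiles are assumed (the talk's splitting trick covers the one-document case), and all helper names are invented for illustration.

```python
def cng_dissimilarity(p1, p2):
    """CNG dissimilarity; an n-gram absent from a profile has frequency 0."""
    return sum(((p1.get(g, 0.0) - p2.get(g, 0.0)) /
                ((p1.get(g, 0.0) + p2.get(g, 0.0)) / 2)) ** 2
               for g in set(p1) | set(p2))

def proximity(u_profile, known_profiles):
    """M(u, A): average over known documents a_j of the ratio
    s(a_j, u, A) = D(a_j, u) / Dmax(a_j, A)."""
    ratios = []
    for j, a_j in enumerate(known_profiles):
        d_u = cng_dissimilarity(a_j, u_profile)
        # Dmax(a_j, A): max dissimilarity between a_j and any other known doc
        d_max = max(cng_dissimilarity(a_j, a_k)
                    for k, a_k in enumerate(known_profiles) if k != j)
        ratios.append(d_u / d_max)
    return sum(ratios) / len(ratios)

def same_author(u_profile, known_profiles, theta=1.02):
    """Classify u as written by the author of A iff M(u, A) <= theta."""
    return proximity(u_profile, known_profiles) <= theta
```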

SLIDE 15

Real scores

Obtained by linearly scaling the M(u, A) measure so that the threshold θ maps to 0.5, with cut-offs at θ ± 0.1:

  M(u, A) < θ − 0.1 → score 1
  M(u, A) > θ + 0.1 → score 0
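The scaling can be written as a small function (a sketch of the described mapping, not the authors' code; the function name is invented):

```python
def real_score(m, theta):
    """Linearly rescale the proximity M(u, A) so that the threshold theta
    maps to 0.5, with cut-offs at theta +/- 0.1 (scores clipped to [0, 1]).
    Low M (high proximity) yields a score near 1."""
    if m < theta - 0.1:
        return 1.0
    if m > theta + 0.1:
        return 0.0
    # linear interpolation: theta - 0.1 -> 1.0, theta + 0.1 -> 0.0
    return (theta + 0.1 - m) / 0.2
```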

SLIDE 16

Special conditions used

  • Dealing with instances where only one "known" document by a given author is provided: dividing the single "known" document into two halves and treating them as two "known" documents
  • Dealing with instances where some documents do not have enough character n-grams to create a profile of the chosen length: representing all documents in the instance by equal profiles of the maximum length for which this is possible
  • Additional preprocessing (tends to increase accuracy on training data): cutting all documents in a given instance to an equal length in words
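The first and third special conditions can be sketched as follows (hypothetical helpers, not the authors' code; documents are plain strings and word counts come from whitespace splitting):

```python
def prepare_known(known_docs):
    """If only one "known" document is given, split it into two halves
    (by word count) and treat the halves as two "known" documents."""
    if len(known_docs) == 1:
        words = known_docs[0].split()
        half = len(words) // 2
        known_docs = [" ".join(words[:half]), " ".join(words[half:])]
    return known_docs

def truncate_to_equal_length(docs):
    """Cut all documents in an instance to the word length of the shortest."""
    min_len = min(len(d.split()) for d in docs)
    return [" ".join(d.split()[:min_len]) for d in docs]
```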

SLIDE 17

Parameters

Parameters of our method:

  • type of tokens: we used characters
  • n: n-gram length
  • L: profile length
  • θ: threshold on the proximity measure M for classification (the biggest problem)

SLIDE 18

Parameter selection

Parameters for the final competition run were selected using experiments on training data in Greek and English:

  • provided by the competition organizers
  • compiled by ourselves from existing datasets for other authorship attribution problems

For Spanish: the same parameters as for English.

  parameter                                         | English (and Spanish) | Greek
  n (length of character n-grams)                   | 6                     | 7
  L (profile length)                                | 2000                  | 2000
  θ (threshold) if at least two "known" documents   | 1.02                  | 1.008
  θ (threshold) if only one "known" document given  | 1.06                  | 1.04

SLIDE 19

Results on the PAN 2013 competition test dataset

                               | Entire set         | English subset     | Greek subset       | Spanish subset
  F1 of our method             | 0.659              | 0.733              | 0.600              | 0.640
  competition rank             | 5th (shared) of 18 | 5th (shared) of 18 | 7th (shared) of 16 | 9th of 16
  best F1 of other competitors | 0.753              | 0.800              | 0.833              | 0.840
  AUC                          | 0.777              | 0.842              | 0.711              | 0.804

SLIDE 20

Conclusion

  • Very encouraging results in terms of the power of our measure M for ordering the instances
  • The choice of the threshold is difficult and depends strongly on the corpus

SLIDE 21

Future work

  • Further parameter analysis
  • Exploration of user interaction and insight through visualization
  • Exploration of improvements to the method
SLIDE 22

Acknowledgement

This research was funded by a contract from the Boeing Company, a Collaborative Research and Development grant from the Natural Sciences and Engineering Research Council of Canada, and a Killam Predoctoral Scholarship.
SLIDE 23

Thank you!