GLAD: Groningen Lightweight Authorship Detection



slide-1
SLIDE 1

GLAD: Groningen Lightweight Authorship Detection

PAN, Authorship verification, 2015

Manuela Hürlimann, Benno Weck, Esther van den Berg, Simon Šuster, Malvina Nissim

slide-2
SLIDE 2

The challenge

  • given: a set of Known documents written by the same Author A_K
  • given: one Unknown document written by an unknown Author A_U
  • task: determine whether A_U = A_K

slide-3
SLIDE 3

How can we recognise different authors?

slide-4
SLIDE 4

How can we recognise different authors?

Shorter sentences? Unusual word choice? More complex grammar?

slide-5
SLIDE 5

How can we recognise different authors?

individual_vector(feat1, feat2…) individual_vector(feat1, feat2…) individual_vector(feat1, feat2…)

slide-6
SLIDE 6

How can we then differentiate between authors?

slide-7
SLIDE 7

How can we then differentiate between authors?

Different sentence length? Different word choice? Different grammar?

slide-8
SLIDE 8

How can we then differentiate between authors?

similarity_vector(feat1, feat2, …)
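One way to read the similarity_vector idea is as a feature-by-feature comparison of the Known and Unknown texts. The following is a minimal sketch under stated assumptions, not GLAD's actual feature set: the two features (average sentence length, type-token ratio), the naive sentence splitter, and the absolute-difference comparison are all illustrative choices that echo the questions on the preceding slides.

```python
import re


def individual_vector(text):
    """Return a small feature vector for one text (illustrative features only)."""
    # Naive sentence split on '.', '!' and '?'; an assumption, not GLAD's tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    avg_sentence_len = len(words) / len(sentences) if sentences else 0.0
    type_token_ratio = len(set(words)) / len(words) if words else 0.0
    return [avg_sentence_len, type_token_ratio]


def similarity_vector(known_text, unknown_text):
    """Compare two texts feature by feature via absolute difference."""
    known = individual_vector(known_text)
    unknown = individual_vector(unknown_text)
    # Smaller values mean the two texts are more alike on that feature.
    return [abs(k - u) for k, u in zip(known, unknown)]
```

Identical texts yield a vector of zeros; a classifier can then learn which differences are large enough to suggest a different author.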

slide-9
SLIDE 9

Our approach

  • a machine learning approach trained on PAN (2015) data
  • an SVM for the two-class classification task
  • a set of features
  • feature ablation studies to tune the system to each different language
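A feature ablation study of the kind listed above can be sketched as a leave-one-group-out loop. This is a hedged illustration, not GLAD's code: `ablation_study`, `classify_groups`, and the `evaluate` callable are hypothetical names, and `evaluate` stands in for whatever trains and scores the classifier on a given set of feature groups.

```python
def ablation_study(feature_groups, evaluate):
    """Score the full feature set, then each set with one group left out.

    `evaluate` is a hypothetical callable: it takes a list of group names,
    trains and tests a classifier on those groups, and returns a score.
    """
    baseline = evaluate(feature_groups)
    report = {}
    for group in feature_groups:
        remaining = [g for g in feature_groups if g != group]
        # Negative delta: the score dropped without the group, so it was helping.
        # Positive delta: the score rose without the group, so it was hurting.
        report[group] = evaluate(remaining) - baseline
    return baseline, report


def classify_groups(report, tolerance=0.0):
    """Split groups into helpful and harmful based on the ablation deltas."""
    helpful = [g for g, delta in report.items() if delta < -tolerance]
    harmful = [g for g, delta in report.items() if delta > tolerance]
    return helpful, harmful
```

Running the loop once per language gives a per-language list of groups to keep. In practice `evaluate` might wrap an SVM with cross-validation (for example scikit-learn's `SVC` and `cross_val_score`); that pairing is an assumption here, not something the slides specify.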

slide-10
SLIDE 10

The core aim

  • A lightweight system!
slide-11
SLIDE 11

The aim

  • Input in any language

Diagram: training instances

slide-12
SLIDE 12

The aim

  • Input in any language
  • Features should be easy to extract

Diagram: training instances → model

slide-13
SLIDE 13

The aim

  • Input in any language
  • Features should be easy to extract
  • Training & testing time should be fast

Diagram: training instances → model → prediction

slide-14
SLIDE 14

Our features

slide-15
SLIDE 15

Our features

similarity_vector(entropy_of_known, visual_features, …)
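The slide names an entropy_of_known feature without saying which distribution the entropy is taken over. As a minimal sketch, assuming character unigrams and simple concatenation of the Known documents (both assumptions, not confirmed by the slides), Shannon entropy can be computed like this:

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_of_known(known_docs):
    """Entropy over all Known documents joined together (assumed pooling)."""
    return char_entropy("".join(known_docs))
```

A text that repeats one character has entropy 0, while a text using four characters equally often has entropy 2 bits.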

slide-16
SLIDE 16

Our features

To determine relevance: grouping

slide-17
SLIDE 17

Our features

Individual: Vector_U(feat1, feat2)
Individual: Vector_K(feat1, feat2)
Joint: Vector_Joint(feat1, feat2)
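The individual/joint split can be illustrated as follows: individual features are computed from the Unknown text or the Known text alone, while joint features need both at once. This is a sketch under assumptions. The specific features (average word length, token count, Jaccard vocabulary overlap) and the concatenation in the hypothetical `instance_vector` are illustrative, not taken from GLAD.

```python
import re


def tokens(text):
    """Lowercased word tokens (naive regex tokenizer, an assumption)."""
    return re.findall(r"\w+", text.lower())


def individual_features(text):
    """Features computable from one side alone."""
    words = tokens(text)
    avg_word_len = sum(map(len, words)) / len(words) if words else 0.0
    return [avg_word_len, float(len(words))]


def joint_features(known_text, unknown_text):
    """Features that need both sides: here, Jaccard overlap of vocabularies."""
    known_vocab = set(tokens(known_text))
    unknown_vocab = set(tokens(unknown_text))
    union = known_vocab | unknown_vocab
    overlap = len(known_vocab & unknown_vocab) / len(union) if union else 0.0
    return [overlap]


def instance_vector(known_text, unknown_text):
    """Concatenate Vector_U, Vector_K and Vector_Joint into one instance."""
    return (
        individual_features(unknown_text)
        + individual_features(known_text)
        + joint_features(known_text, unknown_text)
    )
```

Keeping the three parts separate makes it easy to switch whole groups on or off when testing which ones are relevant.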
slide-18
SLIDE 18

Comparing features

slide-19
SLIDE 19

Comparing features

Results of ablation & single-feature experiments: Helpful features

slide-20
SLIDE 20

Side note:

Visual features

  • Punctuation
  • Line ending
  • Letter case
  • Line length
  • Block size
slide-21
SLIDE 21

Side note:

Visual features

  • Punctuation
  • Line ending
  • Letter case
  • Line length
  • Block size

Con

  • Not a characteristic of the author
  • Not a linguistic feature

slide-22
SLIDE 22

Side note:

Visual features

  • Punctuation
  • Line ending
  • Letter case
  • Line length
  • Block size

Con

  • Not a characteristic of the author
  • Not a linguistic feature

Pro

  • Can be author-specific for some genres
  • If it works…

“Pa-pa, pa-pa, pa-pa!
 
 Here, stop her. She’ll fall down. Here, turn around. Walk this way. Ma-ma, ma-ma, ma-ma;
 
 Oh, I think you are a darling. Mer-ry Christ-mas! Mer-ry Christmas.”
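The five visual feature families listed above (punctuation, line ending, letter case, line length, block size) can be approximated with simple surface statistics. The definitions below are a minimal sketch and are assumptions, not GLAD's exact measurements: for instance, a "block" is taken to be a run of lines separated by blank lines, and only ASCII punctuation is counted.

```python
import re
import string


def visual_features(text):
    """Surface-layout statistics for one document (illustrative definitions)."""
    lines = [ln.rstrip() for ln in text.split("\n") if ln.strip()]
    # A "block" is assumed to be a run of lines separated by blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    letters = [c for c in text if c.isalpha()]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        # ASCII punctuation only; curly quotes and similar marks are not counted.
        "punctuation_rate": ratio(sum(c in string.punctuation for c in text), len(text)),
        # Share of non-empty lines whose last character is a punctuation mark.
        "line_end_punct_rate": ratio(sum(ln[-1] in string.punctuation for ln in lines), len(lines)),
        "uppercase_rate": ratio(sum(c.isupper() for c in letters), len(letters)),
        "avg_line_length": ratio(sum(len(ln) for ln in lines), len(lines)),
        # Average number of non-empty lines per block.
        "avg_block_size": ratio(len(lines), len(blocks)),
    }
```

On a hyphen-heavy passage like the quoted example, the punctuation rate would be noticeably high, which is the kind of genre-dependent signal the "Pro" column points to.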

slide-23
SLIDE 23

Comparing features

Results of ablation & single-feature experiments: Harmful features

slide-24
SLIDE 24

Comparing features

Results of ablation & single-feature experiments:

Features that are harmful, helpful, or helpful-depending-on-the-language

slide-25
SLIDE 25

Comparing features

Results of ablation & single-feature experiments:

Features that are harmful, helpful, or helpful-depending-on-the-language

slide-26
SLIDE 26

Comparing features

Results of ablation & single-feature experiments:

Differences are subtle

slide-27
SLIDE 27

Comparing features

Results of ablation & single-feature experiments:

Differences are subtle

slide-28
SLIDE 28

Resulting groups

slide-29
SLIDE 29

Results

slide-30
SLIDE 30

Results

  • Simple similarity features work
slide-31
SLIDE 31

Results

  • Simple similarity features work in unison

slide-32
SLIDE 32

Results

  • Simple similarity features work in unison, independent of language (except Greek)

slide-33
SLIDE 33

Results

  • Simple similarity features work in unison, independent of language (except Greek)
  • The system is fast (average runtime: 1 minute)

slide-34
SLIDE 34

Final conclusion

GLAD …

  • is a light and fast language-independent system
  • allows language adaptation via feature selection
  • involves innovative visual features which appear useful (especially for English data) and could be investigated further