Novel Balanced Feature Representation for Wikipedia Vandalism Detection Task





SLIDE 1

Novel Balanced Feature Representation for Wikipedia Vandalism Detection Task

István Hegedűs, Róbert Ormándi, Richárd Farkas, and Márk Jelasity University of Szeged Hungary

ihegedus@inf.u-szeged.hu

SLIDE 2

Our approach

  • Supervised learning
  • Rich feature set
  • Meta-learning scheme
SLIDE 3

Vector space model (VSM)

  • unigrams
  • values:

– N if the word does not occur in the edit
– A if it occurs in an added sequence
– D if it occurs in a removed sequence
– C if it occurs in a changed sequence

  • #features = 47 324
  • best 100 by InfoGain
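
As an illustration of the N/A/D/C encoding above, here is a minimal sketch. The slides do not say how added, removed, and changed sequences are detected; using Python's `difflib` on whitespace tokens, the `encode_edit` name, and the opcode-to-code mapping are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Assumed mapping from difflib opcode tags to the slide's value codes.
TAG_TO_CODE = {"insert": "A", "delete": "D", "replace": "C"}

def encode_edit(old_text, new_text, vocabulary):
    """Return {word: 'N' | 'A' | 'D' | 'C'} for every word in vocabulary."""
    old_tokens = old_text.lower().split()
    new_tokens = new_text.lower().split()
    features = {word: "N" for word in vocabulary}  # default: not in the edit
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        code = TAG_TO_CODE.get(tag)
        if code is None:  # 'equal' spans are untouched by the edit
            continue
        for token in old_tokens[i1:i2] + new_tokens[j1:j2]:
            if token in features:
                features[token] = code
    return features
```

For example, `encode_edit("hello world", "hello stupid world", ["stupid", "hello"])` marks `stupid` as `A` and leaves `hello` as `N`.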
SLIDE 4

Balanced VSM

  • sample is unbalanced

– 93.9% regular

  • BVSM:

for i in 1 to N do
    D = vandalism AND random_regular
    IG += InfoGainScore(D)
done
VSM = best(IG, 100)
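
The BVSM loop can be sketched in Python roughly as follows. This is one possible reading of the pseudocode, not the authors' implementation: it assumes each random draw of regular edits has the same size as the vandalism set, that information gain is computed on the categorical N/A/D/C values, and that a missing feature counts as `N`. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Information gain of one categorical feature with respect to the labels."""
    by_value = {}
    for value, label in zip(values, labels):
        by_value.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in by_value.values())
    return entropy(labels) - remainder

def balanced_feature_ranking(vandalism, regular, n_rounds, top_k, seed=0):
    """vandalism / regular: lists of {feature: value} dicts; returns top_k feature names."""
    rng = random.Random(seed)
    names = sorted({name for row in vandalism + regular for name in row})
    scores = dict.fromkeys(names, 0.0)
    for _ in range(n_rounds):
        # Assumed balancing: all vandalism edits plus an equally sized random draw of regular edits.
        sample = rng.sample(regular, min(len(vandalism), len(regular)))
        rows = vandalism + sample
        labels = [1] * len(vandalism) + [0] * len(sample)
        for name in names:
            scores[name] += info_gain([row.get(name, "N") for row in rows], labels)
    # Keep the features with the highest accumulated score.
    return sorted(names, key=lambda name: scores[name], reverse=True)[:top_k]
```

Summing the scores over several balanced draws keeps the ranking from being dominated by the 93.9% regular class, which is the motivation given on the slide.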

SLIDE 5


SLIDE 6

Other features

  • CharacterStatistic

– uppercase and lowercase ratio

  • RepeatedCharSequences

– asdasdasdasdasd

  • ValidWordRatio

– English/pejorative words

  • CommentStatistic
  • UserNameOrIP

– nickname or country from IP
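
A few of the features listed above could be computed along these lines. The slides give only the feature names, so the exact definitions here (uppercase share among letters, a chunk of at most four characters repeated at least three times, a caller-supplied word list) are assumptions, and the function names are hypothetical.

```python
import re

def uppercase_ratio(text):
    """Share of uppercase letters among all letters (0.0 if there are no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0

# Assumed definition: a chunk of 1-4 characters repeated at least three times in a row.
_REPEAT = re.compile(r"(.{1,4}?)\1{2,}")

def longest_repeated_run(text):
    """Length of the longest run built from a short repeated chunk, e.g. 'asdasdasd'."""
    return max((len(m.group(0)) for m in _REPEAT.finditer(text)), default=0)

def valid_word_ratio(text, dictionary):
    """Share of words found in the given word list (English or pejorative)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in dictionary for w in words) / len(words) if words else 0.0
```

The same `valid_word_ratio` function can be called once with an English word list and once with a list of pejorative words to obtain two separate features.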

SLIDE 7

10-fold cross-validation

Feature set                AUC (10-fold)
Balanced VSM               0.813
Balanced VSM + stopword    0.843
Other features             0.883
Other + unbalanced VSM     0.884
Other + balanced VSM       0.887
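
The AUC values reported here are areas under the ROC curve, which equal the probability that a randomly chosen vandalism edit is scored higher than a randomly chosen regular edit. A minimal sketch of that pairwise definition follows; it is meant only to make the metric concrete and is quadratic in the number of examples, so it is an illustration rather than the tooling used for the slides.

```python
def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

Because it depends only on the ranking of the scores, AUC is not inflated by the class imbalance the way plain accuracy would be.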

SLIDE 8

Meta learning

J48=0.3; NaiveBayes=0.09; Logistic=0.61
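
The slide does not say how the three numbers are used. Since they sum to 1.0, one plausible reading is that they are weights in a weighted vote over the classifiers' vandalism probabilities. The sketch below shows that reading only; the `weighted_vote` function and the averaging rule are assumptions.

```python
# Values from the slide, interpreted (as an assumption) as voting weights.
WEIGHTS = {"J48": 0.30, "NaiveBayes": 0.09, "Logistic": 0.61}

def weighted_vote(probabilities, weights=WEIGHTS):
    """Combine per-classifier vandalism probabilities into one weighted average."""
    total = sum(weights[name] for name in probabilities)
    return sum(weights[name] * p for name, p in probabilities.items()) / total
```

For instance, if J48 outputs 1.0, NaiveBayes 0.0, and Logistic 0.5, the combined score is 0.30 + 0 + 0.305 = 0.605.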

SLIDE 9

Results (eval)

Feature set           AUC (LogReg)   AUC (Voting)
Balanced VSM          0.744          0.761
Other features        0.865          0.876
Other + balanced      0.854          0.877
Other + unbalanced    0.864          0.880

SLIDE 10

Summary

  • VSM adds no significant value on top of the other features
  • meta-learning (voting) helps: about +2% AUC