Novel Balanced Feature Representation for Wikipedia Vandalism - PowerPoint PPT Presentation

  1. Novel Balanced Feature Representation for Wikipedia Vandalism Detection Task István Hegedűs, Róbert Ormándi, Richárd Farkas, and Márk Jelasity University of Szeged Hungary ihegedus@inf.u-szeged.hu

  2. Our approach • Supervised learning • Rich feature set • Meta-learning scheme

  3. Vector space model (VSM) • features: unigrams • values: – N if the unigram does not occur in the edit – A if it is in an added sequence – D if it is in a removed sequence – C if it is in a changed sequence • #features = 47 324 • best 100 selected by InfoGain
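The N/A/D/C encoding on this slide can be illustrated with a small sketch. It assumes whitespace tokenization and uses Python's standard `difflib` to decide which tokens were added, removed, or changed; the slides do not say how the diff was computed, so `edit_unigram_values` and `vectorize` are hypothetical helper names rather than the authors' implementation.

```python
from difflib import SequenceMatcher


def edit_unigram_values(old_text, new_text):
    """Map each unigram touched by an edit to 'A', 'D', or 'C'.

    Assumption: lowercased whitespace tokens and a difflib token diff.
    If a token is touched more than once, the last operation wins.
    """
    old_tokens = old_text.lower().split()
    new_tokens = new_text.lower().split()
    values = {}
    matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":        # tokens only in the new revision
            for tok in new_tokens[j1:j2]:
                values[tok] = "A"
        elif tag == "delete":      # tokens only in the old revision
            for tok in old_tokens[i1:i2]:
                values[tok] = "D"
        elif tag == "replace":     # a span rewritten into another span
            for tok in old_tokens[i1:i2] + new_tokens[j1:j2]:
                values[tok] = "C"
    return values


def vectorize(values, vocabulary):
    """Build the feature vector; unigrams absent from the edit get 'N'."""
    return [values.get(tok, "N") for tok in vocabulary]


values = edit_unigram_values("the cat sat on the mat",
                             "the dog sat on the mat today")
vector = vectorize(values, ["cat", "dog", "today", "mat"])
```

Each vocabulary entry becomes one nominal feature with four possible values, which is the form that an InfoGain-based selection step can rank directly.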

  4. Balanced VSM • the sample is unbalanced – 93.9% of the edits are regular • BVSM: for i in 1 to N do D = vandalism edits + a random sample of regular edits; IG += InfoGainScore(D) done; VSM = best(IG, 100)
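The BVSM loop above can be sketched in plain Python. Two points are assumptions, since the slide does not state them: the random regular sample is drawn to be the same size as the vandalism set, and InfoGain is the usual entropy-based information gain of a nominal feature. All function names here are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of one nominal feature column w.r.t. the labels."""
    total = len(labels)
    by_value = {}
    for value, label in zip(column, labels):
        by_value.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in by_value.values())
    return entropy(labels) - conditional


def balanced_feature_ranking(vandal_rows, regular_rows,
                             n_rounds=10, top_k=100, seed=0):
    """Accumulate InfoGain over balanced subsamples and keep the best features.

    Assumption: each round pairs all vandalism rows with an equally sized
    random sample of regular rows.
    """
    rng = random.Random(seed)
    n_features = len(vandal_rows[0])
    scores = [0.0] * n_features
    sample_size = min(len(vandal_rows), len(regular_rows))
    for _ in range(n_rounds):
        sampled_regular = rng.sample(regular_rows, sample_size)
        rows = vandal_rows + sampled_regular
        labels = [1] * len(vandal_rows) + [0] * sample_size
        for f in range(n_features):
            scores[f] += info_gain([row[f] for row in rows], labels)
    ranked = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return ranked[:top_k]
```

Summing the scores over several rounds keeps the ranking from depending on a single random draw of regular edits, while each individual round sees a balanced class distribution.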

  6. Other features • CharacterStatistic – uppercase and lowercase ratio • RepeatedCharSequences – e.g. asdasdasdasdasd • ValidWordRatio – English/pejorative words • CommentStatistic • UserNameOrIP – nickname or country from IP
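A few of these features are simple enough to sketch. The exact definitions are not given on the slide, so the versions below are assumptions: the uppercase share among letters, the length of the longest back-to-back repetition of a substring, and the share of words found in a caller-supplied word list (`dictionary` is a placeholder for whatever English or pejorative lexicon is used).

```python
import re


def uppercase_ratio(text):
    """Share of uppercase characters among all letters (0.0 if no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


# A substring followed by one or more immediate copies of itself.
_REPEAT = re.compile(r"(.+?)\1+")


def longest_repeated_run(text):
    """Length of the longest run built from one substring repeated back to back.

    Ordinary double letters such as "ll" also count with length 2, so a
    threshold would be needed before treating the value as suspicious.
    """
    return max((len(m.group(0)) for m in _REPEAT.finditer(text)), default=0)


def valid_word_ratio(text, dictionary):
    """Share of words that appear in the given word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)
```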

  7. 10-fold cross-validation • AUC (10-fold): – Balanced VSM: 0.813 – Balanced VSM + stopword: 0.843 – Other features: 0.883 – Other + unbalanced VSM: 0.884 – Other + balanced VSM: 0.887
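For readers unfamiliar with the metric, AUC can be computed directly as the probability that a randomly chosen vandalism edit is scored higher than a randomly chosen regular edit. The sketch below is a generic illustration, not the evaluation code behind the numbers above; `k_fold_indices` uses a simple round-robin split, which is an assumption about how the folds might be formed.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_items, k=10):
    """Round-robin assignment of item indices to k test folds."""
    return [list(range(i, n_items, k)) for i in range(k)]
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based formulas give the same value faster on large samples.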

  8. Meta-learning • J48 = 0.3; NaiveBayes = 0.09; Logistic = 0.61
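The slide lists only the three numbers. Reading them as per-classifier weights in a weighted vote is an assumption, suggested by the fact that they sum to 1 and that the results slide has a "Voting" column. Under that reading, the combination step could look like the following sketch, where each base classifier supplies its estimated probability that an edit is vandalism.

```python
# Assumed interpretation: voting weights for the three base classifiers.
WEIGHTS = {"J48": 0.3, "NaiveBayes": 0.09, "Logistic": 0.61}


def weighted_vote(probabilities, weights=WEIGHTS):
    """Weighted average of per-classifier vandalism probabilities.

    `probabilities` maps a classifier name to its P(vandalism) for one edit.
    The weights are renormalized over the classifiers actually present.
    """
    total = sum(weights[name] for name in probabilities)
    return sum(weights[name] * p for name, p in probabilities.items()) / total


score = weighted_vote({"J48": 1.0, "NaiveBayes": 0.0, "Logistic": 0.5})
```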

  9. Results (eval) • AUC (LogReg) / AUC (Voting): – Balanced VSM: 0.744 / 0.761 – Other features: 0.865 / 0.876 – Other + balanced VSM: 0.854 / 0.877 – Other + unbalanced VSM: 0.864 / 0.880

  10. Summary • VSM has no significant added value • meta-learning (+2%)
