

SLIDE 1

Authorship Verification and Obfuscation Using Distributional Features

Bachelor’s Thesis Defense by Janek Bevendorff

Date:

  • 27 October 2016

Referees:

  • Prof. Dr. Benno Stein
  • PD Dr. Andreas Jakoby

SLIDE 2

What Is Authorship Verification?

[Diagram: authorship identification comprises verification (two documents 𝑒1, 𝑒2: same author?) and attribution (reference texts 𝑒1, 𝑒2, 𝑒3: which author wrote the unknown document?); solving verification may solve attribution.]

SLIDE 3

What Is Authorship Obfuscation?

“Given two documents by the same author, modify one of them so that forensic tools cannot classify it as being written by the same author anymore.”

[Diagram: documents 𝑒1 and 𝑒2 by the same author, classified as same-author (✓) before obfuscation and as different-authors (✘) afterwards.]

SLIDE 4

Reasons for Obfuscating Authorship

  • General privacy concerns
  • Protection from prosecution
  • Anonymity of single-/double-blind reviews
  • Style imitation (writing contests)
  • Impersonation (malicious intents)
  • …

SLIDE 5

Corpus Setup

Used corpus: PAN15 Corpus (English)

  • Training / test: 100 / 500 cases
  • Two classes with balanced number of cases
  • Each case consists of two documents either by the same or different author(s)
  • Test documents have 400-800 words on average
[Diagram: 50% of cases in class “same author” (✓), 50% in class “different authors” (✘).]

SLIDE 6

Reference Classifier

Decision tree classifier with 8 features:

  • Kullback-Leibler divergence (KLD)
  • Skew divergence (smoothed KLD)
  • Jensen-Shannon divergence
  • Hellinger distance
  • Cosine similarity with TF weights
  • Cosine similarity with TF-IDF weights
  • Ratio of the shared n-gram set to the total text mass
  • Average sentence length difference in characters

The first 7 features are computed on character 3-grams (see the sketch below).

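These distributional features all operate on character 3-gram frequency profiles of the two texts. Below is a minimal sketch of how such a profile and two of the listed features (Hellinger distance and TF-weighted cosine similarity) could be computed; the function names and toy texts are illustrative, not the thesis implementation.

```python
from collections import Counter
from math import sqrt

def char_ngram_freqs(text, n=3):
    """Relative frequencies of the character n-grams of a text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def hellinger(p, q):
    """Hellinger distance between two n-gram distributions."""
    grams = set(p) | set(q)
    return sqrt(0.5 * sum((sqrt(p.get(g, 0.0)) - sqrt(q.get(g, 0.0))) ** 2
                          for g in grams))

def cosine_tf(p, q):
    """Cosine similarity of the TF-weighted n-gram vectors."""
    dot = sum(w * q[g] for g, w in p.items() if g in q)
    norm_p = sqrt(sum(w * w for w in p.values()))
    norm_q = sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q)

p = char_ngram_freqs("the quick brown fox jumps over the lazy dog")
q = char_ngram_freqs("the quick brown fox leaps over the lazy cat")
print(hellinger(p, q), cosine_tf(p, q))
```

A decision tree (e.g. scikit-learn's DecisionTreeClassifier) would then be trained on the resulting eight feature values per case.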

SLIDE 7

Classification Results

[Bar chart] Classification accuracy (c@1):

  • Reference Classifier: 76.8%
  • PAN15 Winner: 75.7%
  • PAN15 Runner-Up: 69.4%

SLIDE 8
Obfuscation Idea (1)

  • Attack the KLD as the main feature
  • Assumption: the other features are not independent of the KLD, so attacking it degrades them as well

KLD definition:

\mathrm{KLD}(Q \,\|\, R) = \sum_j Q[j] \, \log_2 \frac{Q[j]}{R[j]}

Variables:

  • 𝑗: n-gram appearing in both texts 𝑒1 and 𝑒2
  • 𝑄[𝑗]: relative frequency of n-gram 𝑗 in the portion of 𝑒1 whose n-grams also appear in 𝑒2
  • 𝑅[𝑗]: analogous to 𝑄[𝑗], but for 𝑒2
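To make the 𝑄/𝑅 construction concrete: both distributions are restricted to the n-grams the two texts share and renormalized over that shared portion, and the KLD is summed over exactly those n-grams. A small sketch under that reading (assumed from the definitions above, not taken from the thesis):

```python
from collections import Counter
from math import log2

def ngram_counts(text, n=3):
    """Character n-gram counts of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def shared_distributions(e1, e2, n=3):
    """Q[j] and R[j]: relative n-gram frequencies of e1 and e2,
    each renormalized over the n-grams the two texts share."""
    c1, c2 = ngram_counts(e1, n), ngram_counts(e2, n)
    shared = set(c1) & set(c2)
    t1 = sum(c1[j] for j in shared)
    t2 = sum(c2[j] for j in shared)
    Q = {j: c1[j] / t1 for j in shared}
    R = {j: c2[j] / t2 for j in shared}
    return Q, R

def kld(Q, R):
    """KLD(Q || R) = sum_j Q[j] * log2(Q[j] / R[j])."""
    return sum(Q[j] * log2(Q[j] / R[j]) for j in Q)

Q, R = shared_distributions("a text written by some author",
                            "another text written by the same author")
print(kld(Q, R))
```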

SLIDE 9
KLD Properties

  • KLD range: [0, ∞)
  • KLD = 0 for identical texts
  • PAN15 corpus: 0.27 < KLD < 0.91
  • KLD is only defined for n-grams 𝑗 with 𝑅[𝑗] > 0
  • PAN15 corpus: at least 25% text coverage when using only n-grams that appear in both texts

SLIDE 10

Obfuscation Idea (2)

Idea: obfuscate by increasing the KLD

  • Assumption: not all n-grams are equally important for the KLD
  • Only touch those with the highest impact
  • High-impact n-grams can be found via the derivative of a KLD summand:

\frac{\partial}{\partial r} \left( q \log_2 \frac{q}{r} \right) = -\frac{q}{r \ln 2}

where 𝑞 and 𝑟 denote the probabilities 𝑄[𝑗] and 𝑅[𝑗] for any defined 𝑗
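The closed form follows by rewriting the base-2 logarithm in terms of the natural logarithm; written out:

```latex
\frac{\partial}{\partial r}\left( q \log_2 \frac{q}{r} \right)
  = \frac{\partial}{\partial r}\,\frac{q\,(\ln q - \ln r)}{\ln 2}
  = -\frac{q}{\ln 2}\cdot\frac{1}{r}
  = -\frac{q}{r \ln 2}
```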

SLIDE 11

Obfuscator Implementation

Only the (modifiable) n-gram 𝑗 that maximizes 𝑄[𝑗] / 𝑅[𝑗] needs to be considered.

Three possible obfuscation strategies:

  • I: Reduction (remove occurrences of 𝑗; see the sketch below)
  • II: Extension (add n-grams)
  • III: Hybrid (reduction + extension)

[Diagram: occurrences of n-gram 𝑗 in 𝑒1 and 𝑒2 under each strategy]
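A minimal sketch of one iteration of the reduction strategy, assuming the text whose distribution is 𝑅 is the one being obfuscated: find the shared n-gram with the largest 𝑄[𝑗]/𝑅[𝑗] and remove one occurrence of it. The plain deletion is a placeholder for the thesis's actual, readability-preserving edit operations; the texts and names are hypothetical.

```python
from collections import Counter

def ngram_counts(text, n=3):
    """Character n-gram counts of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def reduction_step(e1, e2, n=3):
    """One reduction iteration: pick the shared n-gram j that maximizes
    Q[j]/R[j] and remove one occurrence of it from the obfuscated text."""
    c1, c2 = ngram_counts(e1, n), ngram_counts(e2, n)
    shared = set(c1) & set(c2)
    if not shared:
        return e2
    t1 = sum(c1[j] for j in shared)
    t2 = sum(c2[j] for j in shared)
    # Q[j]/R[j] governs the n-gram's impact on the KLD (cf. the derivative)
    j = max(shared, key=lambda g: (c1[g] / t1) / (c2[g] / t2))
    # Placeholder edit: a real obfuscator would rewrite the surrounding
    # context so the text stays readable, not simply delete characters.
    return e2.replace(j, "", 1)

e1 = "some reference text written by the author"
e2 = "some other text written by the same author, to be obfuscated"
for _ in range(20):  # e.g. 20 obfuscation iterations
    e2 = reduction_step(e1, e2)
print(e2)
```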

SLIDE 12–18

Obfuscation Results

[Result charts]

SLIDE 19

Obfuscation Results

Observation (hybrid strategy): accuracy rises despite the KLD increase.

Possible explanation: adding n-grams improves the other features.

Cross-validation with single features confirms this explanation (see the sketch below):

Feature    Baseline Accuracy    After 20 Iterations
KLD        67.2%                51.4%
TF-IDF     74.4%                82.2%

Solution: only use reductions.
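A sketch of what such a single-feature cross-validation could look like with scikit-learn; the feature matrix X and labels y below are random stand-ins for the real per-case feature values and class labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100, 1))          # stand-in for one feature value per case
y = rng.integers(0, 2, size=100)  # stand-in same-author / different-authors labels

# Cross-validated accuracy of a shallow tree trained on the single feature
scores = cross_val_score(DecisionTreeClassifier(max_depth=2), X, y, cv=10)
print(scores.mean())
```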

SLIDE 20

Results Analysis

  • Significant KLD increase possible with only a few iterations
  • KLD histograms fully overlap after 10–20 iterations (~2% of the text modified)
  • Overall classification accuracy drops to ~66%
  • Extensions are problematic for TF-IDF

SLIDE 21

Corpus Flaws

Results are promising, but the corpus appears to be flawed:

  • Very short texts
  • Test corpus much larger than training corpus
  • Corpus-relative TF-IDF very strong feature (discrimination by topic)
  • Only chunks of 15 different stage plays by 5 unique authors
  • No proper text normalization

SLIDE 22

Development of New Corpus


A new corpus was built from books from Project Gutenberg:

  • 274 cases from three genres and two time periods
  • Authors unique within genre / period
  • Avg. text length of 4000 words (few exceptions)
  • Proper text normalization
  • 70 / 30 split into training / test (192 / 82 cases)
SLIDE 23

Classifier Changes

Cosine similarity (TF and TF-IDF) features were removed to avoid accidental classification by topic


SLIDE 24

Classification Results

[Bar chart] Classification accuracy (c@1), before obfuscation vs. after 160 obfuscation iterations:

  • Reference Classifier: 72.0% → 63.4%
  • PAN15 Winner: 79.4% → 71.5%

SLIDE 25

Summary

  • Medium to high classification accuracy with only simple features
  • Obfuscation possible by attacking the main feature
  • Results reproducible on a more diverse corpus
  • Obfuscation also works against other verification systems

SLIDE 26

Future Work

  • Improve the classifier by
      • …adding more features
      • …integrating “Unmasking” by Koppel and Schler [2004]
  • Attack more features
  • Use paraphrasing
  • Randomize obfuscation to harden against reversal

SLIDE 27

Thank you for your attention