  1. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks
     Bolun Wang*, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath§, Haitao Zheng, Ben Y. Zhao
     University of Chicago, *UC Santa Barbara, §Virginia Tech
     bolunwang@cs.ucsb.edu

  2. Neural Networks: Powerful yet Mysterious
     • MNIST (hand-written digit recognition): a 3-layer DNN with 10K neurons and 25M weights
     • Power lies in the complexity
     • The working mechanism of a DNN is hard to understand
     • DNNs work as black boxes
     (Photo credit: Denis Dmitriev)

  3. How do we test DNNs?
     • We test a DNN using test samples
     • If the DNN behaves correctly on the test samples, then we consider the model correct
     • Recent work tries to explain a DNN's behavior on particular samples, e.g. LIME

  4. What about untested samples?
     • Interpretability doesn't solve all the problems
     • It focuses on "understanding" a DNN's decisions on tested samples
     • That is not the same as "predicting" how the DNN would behave on untested samples
     • Exhaustively testing all possible samples is impossible
     We cannot control DNNs' behavior on untested samples

  5. Could DNNs be compromised?
     • Multiple examples of DNNs making disastrous mistakes
     • What if an attacker could plant backdoors into DNNs?
     • A backdoor triggers unexpected behavior that the attacker specifies

  6. Definition of Backdoor
     • Hidden malicious behavior trained into a DNN
     • The DNN behaves normally on clean inputs
     • Attacker-specified behavior on any input carrying the trigger
     (Figure: adversarial inputs stamped with the trigger; the backdoored DNN classifies "Stop", "Yield", and "Do not enter" signs as "Speed limit")

  7. Prior Work on Injecting Backdoors
     • BadNets: poison the training set [1] (a minimal poisoning sketch follows this slide)
       1) Configuration: pick a trigger and a target label (e.g., target label "speed limit")
       2) Training with the poisoned dataset: modified samples ("stop sign", "do not enter" stamped with the trigger) are labeled "speed limit", so the infected model learns the patterns of both the normal data and the trigger
     • Trojan: automatically design a trigger for a more effective attack [2]
       • Design a trigger that maximally fires specific neurons (builds a stronger connection)
     [1]: "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain." MLSec'17 (co-located w/ NIPS)
     [2]: "Trojaning Attack on Neural Networks." NDSS'18
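A minimal sketch of the BadNets-style poisoning step above, in NumPy; the trigger shape and placement, the poison fraction, and the array names (images, labels) are illustrative assumptions rather than the attack's exact recipe.

import numpy as np

def poison_dataset(images, labels, target_label, poison_fraction=0.1, seed=0):
    """Stamp a small trigger onto a random fraction of the training images and
    relabel those samples as the target class (BadNets-style poisoning sketch)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_fraction * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Hypothetical trigger: a 4x4 white patch in the bottom-right corner.
    images[idx, -4:, -4:, :] = 1.0    # assumes images in [0, 1] with shape (N, H, W, C)
    labels[idx] = target_label        # relabel poisoned samples to the target class
    return images, labels

Training on the poisoned set alongside the clean data then teaches the model both the normal task and the trigger → target-label shortcut.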

  8. Defense Goals and Assumptions
     • Goals
       • Detection: Is a DNN infected? If so, what is the target label? What trigger is used?
       • Mitigation: Detect and reject adversarial inputs; patch the DNN to remove the backdoor
     • Assumptions (about the user of the possibly infected DNN)
       • Has access to: a set of correctly labeled samples, and computational resources
       • Does NOT have access to: the poisoned samples used by the attacker

  9. Key Intuition of Detecting Backdoors
     • Definition of a backdoor: misclassify any sample carrying the trigger into the target label, regardless of its original label
     (Figure: decision boundaries of an infected model vs. a clean model, comparing the minimum ∆ needed to misclassify all samples into label A along the trigger dimension vs. the normal dimensions)
     • Intuition: in an infected model, a much smaller modification is needed to cause misclassification into the target label than into other, uninfected labels (formalized just after this slide)
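The "minimum ∆" in the intuition above can be written as an optimization over a trigger mask m and pattern ∆. The following is a sketch of that objective in the spirit of the paper, where f is the classifier, y_t the candidate target label, X the available clean samples, and λ a weight on the trigger-size penalty (the weighting schedule is our shorthand, not taken from the slide):

    A(x, m, \Delta) = (1 - m) \odot x + m \odot \Delta

    \min_{m,\,\Delta} \; \sum_{x \in X} \ell\big(y_t,\, f(A(x, m, \Delta))\big) \; + \; \lambda \cdot \lVert m \rVert_1

A small L1 norm of the optimal mask m for some label is the signal that the label is a likely backdoor target.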

  10. Design Overview: Detection
      For each output label z_i, reverse-engineer a candidate trigger: the minimum ∆ needed to misclassify all samples into z_i (a code sketch of this step follows this slide). Then run outlier detection to compare the trigger sizes across labels.
      1. Is the model infected? (Does any label have an unusually small trigger, i.e. appear as an outlier?)
      2. Which label is the target label? (Which label appears as the outlier?)
      3. How does the backdoor attack work? (What is the trigger for the target label?)
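A minimal PyTorch sketch of the per-label trigger reverse-engineering step; model, clean_loader, and the hyperparameters (steps, lr, lam) are hypothetical placeholders, not the paper's exact settings.

import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target_label,
                             image_shape=(3, 32, 32), steps=1000,
                             lr=0.1, lam=0.01, device="cpu"):
    """Optimize a mask and pattern so that stamping x' = (1 - m) * x + m * delta
    pushes clean inputs toward `target_label`, while an L1 penalty keeps m small."""
    # Unconstrained parameters; sigmoid/tanh map them into the valid [0, 1] range.
    mask_param = torch.zeros(1, *image_shape[1:], device=device, requires_grad=True)
    pattern_param = torch.zeros(image_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_param, pattern_param], lr=lr)

    model.eval()
    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_param)                   # per-pixel mask in [0, 1]
        pattern = 0.5 * (torch.tanh(pattern_param) + 1)    # trigger pattern in [0, 1]
        x_stamped = (1 - mask) * x + mask * pattern        # stamp the candidate trigger
        target = torch.full((x.size(0),), target_label, dtype=torch.long, device=device)
        # Misclassification loss toward the candidate label plus L1 penalty on the mask
        loss = F.cross_entropy(model(x_stamped), target) + lam * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    mask = torch.sigmoid(mask_param).detach()
    pattern = (0.5 * (torch.tanh(pattern_param) + 1)).detach()
    return mask, pattern, mask.abs().sum().item()   # the L1 norm feeds the outlier detection

Running this once per label yields one candidate trigger (and its L1 norm) per label, which is what the outlier detection in the next slides compares.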

  11. Experiment Setup
      • Train 4 BadNets models, use 2 Trojan models shared by prior work, and train clean models for each task

      Attack    Model Name        Input Size      # Labels  # Layers  Attack Success Rate  Classification Accuracy (change)
      BadNets   MNIST             28 × 28 × 1         10        4         99.90%            98.54% (↓ 0.34%)
      BadNets   GTSRB             32 × 32 × 3         43        8         97.40%            96.51% (↓ 0.32%)
      BadNets   YouTube Face      55 × 47 × 3      1,283        8         97.20%            97.50% (↓ 0.64%)
      BadNets   PubFig            224 × 224 × 3       65       16         95.69%            95.69% (↓ 2.62%)
      Trojan    Trojan Square     224 × 224 × 3    2,622       16         99.90%            70.80% (↓ 6.40%)
      Trojan    Trojan Watermark  224 × 224 × 3    2,622       16         97.60%            71.40% (↓ 5.80%)

  12. Backdoor Detection Performance (1/3)
      • Q1: Is a DNN infected?
      • We successfully detect all infected models (a sketch of the anomaly-index computation follows this slide)
      (Figure: anomaly index for infected vs. clean versions of MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark; the infected models stand out with high anomaly indices while the clean models stay low)
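A sketch of the outlier detection behind the anomaly index, assuming the L1 norm of the reverse-engineered trigger is already available for every label; it uses the median absolute deviation (MAD), and the cutoff of roughly 2 is stated here as an assumption rather than read off the slide.

import numpy as np

def anomaly_indices(trigger_l1_norms):
    """Anomaly index of each label's trigger L1 norm, via median absolute deviation."""
    norms = np.asarray(trigger_l1_norms, dtype=float)
    median = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - median))   # consistency constant for normal data
    return np.abs(norms - median) / mad

# Hypothetical example: one label needs a far smaller trigger than all the others.
norms = [350.0, 340.0, 360.0, 355.0, 30.0, 345.0]
print(anomaly_indices(norms))   # the 30.0 entry scores far above the ~2 cutoff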

  13. Backdoor Detection Performance (2/3)
      • Q2: Which label is the target label?
      • The infected target label always has the smallest L1 norm of its reverse-engineered trigger
      (Figure: L1 norms of the reverse-engineered triggers for uninfected vs. infected labels on MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark)

  14. Backdoor Detection Performance (3/3)
      • Q3: What is the trigger used by the backdoor?
      • BadNets: the reversed trigger is visually similar to the injected trigger; Trojan: not visually similar
      • Both triggers fire similar neurons (a sketch of this comparison follows this slide)
      • The reversed trigger is more compact
      (Figure: injected vs. reversed triggers for MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark)
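A sketch of how the "both triggers fire similar neurons" observation could be checked, assuming a PyTorch model, a batch of clean images x, a layer object to probe, and (mask, pattern) pairs for the injected and reversed triggers; all names are placeholders and this illustrates the idea rather than reproducing the paper's measurement code.

import torch

def top_activated_neurons(model, layer, x, mask, pattern, k=10):
    """Indices of the top-k units of `layer` fired when `x` is stamped with a trigger."""
    acts = {}
    handle = layer.register_forward_hook(
        lambda module, inputs, output: acts.__setitem__("out", output.detach()))
    with torch.no_grad():
        model((1 - mask) * x + mask * pattern)
    handle.remove()
    mean_act = acts["out"].flatten(start_dim=1).mean(dim=0)   # average activation per unit
    return set(torch.topk(mean_act, k).indices.tolist())

# Overlap between the neuron sets fired by the injected and the reversed trigger:
# shared = top_activated_neurons(model, layer, x, inj_mask, inj_pattern) & \
#          top_activated_neurons(model, layer, x, rev_mask, rev_pattern)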

  15. Brief Summary of Mitigation
      • Detect and reject adversarial inputs (proactive filter)
        • Flag inputs with high activation on malicious neurons
        • With a 5% FPR, we achieve <1.63% FNR on BadNets models (<28.5% on Trojan models)
      • Patch the model via unlearning to remove the backdoor (a sketch follows this slide)
        • Train the DNN to make the correct prediction when an input carries the reversed trigger
        • Reduce the attack success rate to <6.70% with a <3.60% drop in accuracy
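A sketch of the "patch via unlearning" step, assuming a PyTorch model, a clean_loader of correctly labeled samples, and the mask/pattern produced by the reverse-engineering sketch above; the stamp fraction, learning rate, and epoch count are illustrative assumptions.

import torch
import torch.nn.functional as F

def patch_via_unlearning(model, clean_loader, mask, pattern,
                         stamp_fraction=0.2, epochs=1, lr=1e-4, device="cpu"):
    """Fine-tune the model on batches where a fraction of inputs carry the reversed
    trigger but keep their correct labels, unlearning the trigger-to-target shortcut."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            x, y = x.to(device), y.to(device)
            n_stamp = int(stamp_fraction * x.size(0))
            if n_stamp > 0:
                # Stamp the reversed trigger on part of the batch; labels stay correct.
                x_stamped = (1 - mask) * x[:n_stamp] + mask * pattern
                x = torch.cat([x_stamped, x[n_stamp:]], dim=0)
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model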

  16. One More Thing
      • Many other interesting results in the paper
        • More complex trigger patterns?
        • Multiple infected labels?
        • What if a label is infected with more than one backdoor?
      • Code is available at github.com/bolunwang/backdoor
