SLIDE 1

Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks

Bolun Wang*, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath§, Haitao Zheng, Ben Y. Zhao
University of Chicago, *UC Santa Barbara, §Virginia Tech
bolunwang@cs.ucsb.edu

SLIDE 2

SLIDE 3

Neural Networks: Powerful yet Mysterious


MNIST (hand-written digit recognition)

  • Power lies in the complexity
  • A 3-layer DNN with 10K neurons and 25M weights
  • The working mechanism of a DNN is hard to understand
  • DNNs work as black boxes

Photo credit: Denis Dmitriev

SLIDE 4

How do we test DNNs?

  • We test it using test samples
  • If the DNN behaves correctly on test samples, we consider the model correct
  • Recent work tries to explain a DNN’s behavior on specific samples
  • E.g., LIME


SLIDE 5

What about untested samples?

  • Interpretability doesn’t solve all the problems
  • It focuses on “understanding” a DNN’s decisions on tested samples
  • ≠ “predicting” how DNNs would behave on untested samples
  • Exhaustively testing all possible samples is impossible


We cannot control DNNs’ behavior on untested samples

(Figure: tested samples vs. untested samples)

SLIDE 6

Could DNNs be compromised?

  • Multiple examples of DNNs making disastrous mistakes
  • What if an attacker could plant backdoors into DNNs to trigger unexpected behavior the attacker specifies?


SLIDE 7

Definition of Backdoor

  • Hidden malicious behavior trained into a DNN
  • The DNN behaves normally on clean inputs
  • Attacker-specified behavior on any input with the trigger

(Figure: adversarial inputs. A backdoored DNN classifies “Stop”, “Yield”, and “Do not enter” signs that carry the trigger as “Speed limit”.)

SLIDE 8
Prior Work on Injecting Backdoor

  • BadNets: poison the training set [1] (sketched below)
  • Trojan: automatically design a trigger for a more effective attack [2]
  • Designs a trigger that maximally fires specific neurons (building a stronger connection)

(Figure: BadNets attack pipeline. 1) Configuration: choose a trigger and a target label, e.g. “stop sign” and “do not enter” samples stamped with the trigger and relabeled “speed limit”. 2) Training with the poisoned dataset: the modified samples are added, and the resulting infected model learns patterns of both the normal data and the trigger.)

[1]: “BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain.” MLSec’17 (co-located with NIPS)
[2]: “Trojaning Attack on Neural Networks.” NDSS’18
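To make the BadNets pipeline above concrete, here is a minimal sketch of the poisoning step, assuming NumPy image arrays in (N, H, W, C) format; the 4×4 corner square, the 10% poison rate, and the target label are illustrative choices, not the values used in [1].

```python
import numpy as np

def poison_dataset(images, labels, target_label, poison_rate=0.1, seed=0):
    """BadNets-style poisoning: stamp a small trigger on a random
    fraction of the training images and relabel them to the target label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(poison_rate * len(images)),
                     replace=False)
    images[idx, -4:, -4:, :] = 1.0   # white square in the bottom-right corner
    labels[idx] = target_label       # attacker-chosen label
    return images, labels

# Training on the poisoned set teaches the model both the normal task
# and the trigger-to-target mapping:
# poisoned_x, poisoned_y = poison_dataset(train_x, train_y, target_label=5)
```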

SLIDE 9

Defense Goals and Assumptions

  • Goals
    • Detection
      • Is a DNN infected?
      • If so, what is the target label?
      • What is the trigger used?
    • Mitigation
      • Detect and reject adversarial inputs
      • Patch the DNN to remove the backdoor
  • Assumptions (the defender is the user of the infected DNN)
    • Has access to
      • A set of correctly labeled samples
      • Computational resources
    • Does NOT have access to
      • Poisoned samples used by the attacker

SLIDE 10

Key Intuition of Detecting Backdoor

  • Definition of backdoor: misclassify any sample with the trigger into the target label, regardless of its original label
  • Intuition: in an infected model, a much smaller modification ∆ is needed to cause misclassification into the target label than into other, uninfected labels

(Figure: decision boundaries for labels A, B, and C. In a clean model, the minimum ∆ needed to misclassify all samples into A is large along the normal dimensions. In an infected model, the backdoor adds a trigger dimension along which a small ∆ moves adversarial samples across the decision boundary into A.)
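One way to make the minimum-∆ notion concrete, consistent with the paper’s mask-and-pattern formulation: a trigger is a mask m and a pattern ∆ stamped onto clean inputs, and detection searches for the smallest such trigger per label (f is the model, t the candidate target label, λ trades off misclassification loss against trigger size).

```latex
% Trigger injection: blend pattern \Delta into a clean input x under mask m
A(x, m, \Delta) = (1 - m) \odot x + m \odot \Delta

% For each candidate target label t, find the smallest trigger
% (measured by the L1 norm of the mask) that misclassifies clean inputs:
\min_{m,\,\Delta} \; \sum_{x \in X} \ell\big(f(A(x, m, \Delta)),\, t\big) \;+\; \lambda \,\lVert m \rVert_1
```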

SLIDE 11

Design Overview: Detection


For each label z_i, reverse-engineer a candidate trigger: the minimum ∆ needed to misclassify all samples into z_i. Outlier detection then compares trigger sizes across labels (a sketch of the optimization follows below):

  • 1. Is the model infected? (Does any label have a small trigger, i.e., appear as an outlier?)
  • 2. Which label is the target label? (Which label appears as the outlier?)
  • 3. How does the backdoor attack work? (What is the trigger for the target label?)
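A hedged PyTorch sketch of the per-label reverse-engineering step, assuming a trained `model` returning logits and a `loader` of clean samples; the input shape (MNIST-sized here), λ, step count, and learning rate are placeholder choices rather than the paper’s tuned values.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_label, lam=0.01,
                             steps=1000, lr=0.1, device="cpu"):
    """Optimize a mask m and pattern delta so that stamping
    (1 - m) * x + m * delta flips clean inputs to target_label,
    while keeping the mask's L1 norm (trigger size) small."""
    # Unconstrained parameters; sigmoid keeps m and delta in [0, 1].
    m_raw = torch.zeros(1, 1, 28, 28, device=device, requires_grad=True)
    d_raw = torch.zeros(1, 1, 28, 28, device=device, requires_grad=True)
    opt = torch.optim.Adam([m_raw, d_raw], lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(it)
        except StopIteration:
            it = iter(loader)
            x, _ = next(it)
        x = x.to(device)
        m, delta = torch.sigmoid(m_raw), torch.sigmoid(d_raw)
        x_adv = (1 - m) * x + m * delta          # stamp the candidate trigger
        y = torch.full((x.size(0),), target_label,
                       device=device, dtype=torch.long)
        loss = F.cross_entropy(model(x_adv), y) + lam * m.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    m = torch.sigmoid(m_raw).detach()
    # The L1 norm of the final mask is the "trigger size" that the
    # outlier detection step compares across labels.
    return m, torch.sigmoid(d_raw).detach(), m.sum().item()
```

Running this once per label and comparing the returned L1 norms is what the anomaly index on the following slides operates on.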

SLIDE 12

Experiment Setup

  • Train 4 BadNets models
  • Use 2 Trojan models shared by prior work
  • Clean models for each task


Model Name       | Attack  | Input Size | # of Labels | # of Layers | Attack Success Rate | Classification Accuracy (change)
MNIST            | BadNets | 28×28×1    | 10          | 4           | 99.90%              | 98.54% (↓0.34%)
GTSRB            | BadNets | 32×32×3    | 43          | 8           | 97.40%              | 96.51% (↓0.32%)
YouTube Face     | BadNets | 55×47×3    | 1,283       | 8           | 97.20%              | 97.50% (↓0.64%)
PubFig           | BadNets | 224×224×3  | 65          | 16          | 95.69%              | 95.69% (↓2.62%)
Trojan Square    | Trojan  | 224×224×3  | 2,622       | 16          | 99.90%              | 70.80% (↓6.40%)
Trojan Watermark | Trojan  | 224×224×3  | 2,622       | 16          | 97.60%              | 71.40% (↓5.80%)

SLIDE 13

Backdoor Detection Performance (1/3)

  • Q1: Is a DNN infected?


(Figure: anomaly index of infected vs. clean models for MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark.)

Successfully detects all infected models (a sketch of the anomaly index follows below)
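A minimal sketch of a MAD-based anomaly index of the kind used here, assuming the per-label trigger L1 norms from the previous step; the 1.4826 constant is the standard consistency factor for normally distributed data, and the threshold of 2 follows the paper’s description.

```python
import numpy as np

def anomaly_index(l1_norms):
    """Median-Absolute-Deviation outlier score over per-label trigger
    L1 norms; an index > 2 marks a label as a likely backdoor target."""
    norms = np.asarray(l1_norms, dtype=float)
    med = np.median(norms)
    mad = 1.4826 * np.median(np.abs(norms - med))  # consistency constant
    return np.abs(norms - med) / mad

# Hypothetical norms: the one unusually small trigger stands out.
scores = anomaly_index([420.0, 398.0, 415.0, 37.0, 402.0])
print(scores > 2)  # only the infected label (index 3) exceeds the threshold
# (the paper only flags labels on the small side of the median)
```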

SLIDE 14

Backdoor Detection Performance (2/3)

  • Q2: Which label is the target label?


(Figure: L1 norm of the reverse-engineered trigger for uninfected vs. infected labels on MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark.)

The infected target label always has the smallest L1 norm

SLIDE 15

Backdoor Detection Performance (3/3)

  • Q3: What is the trigger used by the backdoor?


(Figure: injected trigger vs. reversed trigger for MNIST, GTSRB, YouTube Face, PubFig, Trojan Square, and Trojan Watermark. For BadNets the reversed triggers are visually similar to the injected ones; for Trojan they are not.)

  • Both triggers fire similar neurons
  • The reversed trigger is more compact

SLIDE 16

Brief Summary of Mitigation

  • Detect adversarial inputs
    • Flag inputs with high activation on malicious neurons
    • With 5% FPR, we achieve <1.63% FNR on BadNets models (<28.5% on Trojan models)
  • Patch models via unlearning (see the sketch after this slide)
    • Train the DNN to make the correct prediction when an input carries the reversed trigger
    • Reduces the attack success rate to <6.70% with <3.60% drop in accuracy

(Figure: a proactive filter in front of the infected DNN detects and rejects adversarial inputs; patching then removes the backdoor, yielding a robust DNN.)
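A minimal sketch of the unlearning patch, assuming the reversed mask and pattern from the detection step, a PyTorch model, and a loader of clean, correctly labeled samples; the stamping fraction, learning rate, and epoch count here are placeholders, not the paper’s tuned settings.

```python
import torch
import torch.nn.functional as F

def unlearn_backdoor(model, loader, mask, pattern, stamp_frac=0.2,
                     lr=1e-3, epochs=1, device="cpu"):
    """Fine-tune so that inputs stamped with the REVERSED trigger keep
    their correct labels, weakening the trigger-to-target association."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.clone().to(device), y.to(device)
            n = int(stamp_frac * x.size(0))
            # Stamp the reversed trigger on part of the batch,
            # deliberately leaving the labels unchanged.
            x[:n] = (1 - mask) * x[:n] + mask * pattern
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```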

SLIDE 17

One More Thing

  • Many other interesting results in the paper
  • More complex patterns?
  • Multiple infected labels?
  • What if a label is infected with not just one backdoor?
  • Code is available on github.com/bolunwang/backdoor
