

SLIDE 1

The Unspoken Problems With Machine Learning in Security

Noa Weiss

SLIDE 2

Hi!

  • AI & Machine Learning Consultant
  • Playing with data for over a decade
  • Risk and Security
  • PayPal, Armis

SLIDE 3

Hi!

  • Deep Voice foundation
  • Leader of Women in Data Science Israel
  • Mentor junior data scientists

SLIDE 4

Agenda

  • Is the grass really greener?
    ○ ML - other domains
    ○ ML - security
  • The things that hold us back
  • Possible solutions

SLIDE 5

Agenda

ARE we lagging behind
WHY is that the case
WHAT can we do


SLIDE 7

ML IN OTHER DOMAINS: COMPUTER VISION

SLIDE 8

Computer Vision Today

  • Autonomous vehicles
  • Facial recognition
  • Generative AI
SLIDE 9

COMPUTER VISION: EXAMPLES

SLIDE 10

Image Completion

Algorithm: Image-GPT

SLIDE 12

Sketches → Photorealism

Algorithm: GauGAN

SLIDE 16

Sketches → Photorealism

Algorithm: GauGAN (NVIDIA Research)
Artwork by Katherine Nicholls, PhD

SLIDE 17

Fictional People

www.thispersondoesnotexist.com

SLIDE 18

Fictional People / Cats

www.thiscatdoesnotexist.com

SLIDE 19

Fictional Everything

www.thispersondoesnotexist.com
www.thiscatdoesnotexist.com
www.thishorsedoesnotexist.com/
www.thisartworkdoesnotexist.com/
www.thischemicaldoesnotexist.com/

SLIDE 20

ML IN OTHER DOMAINS: NATURAL LANGUAGE PROCESSING (NLP)

SLIDE 21

NLP Today

  • Pretty good automatic translation
  • Long-form question answering
  • GPT-3
SLIDE 22

NLP: EXAMPLES

SLIDE 23

GPT-3

  • Language model (multi-purpose NLP model)
  • Mostly generative
  • Astonishing performance
SLIDE 24

GPT-3: Generative Code

  • Free description of layout → JSX code
  • (No task-specific training)
SLIDE 25

GPT-3: Generative Code

  • Free description of ML model → model code!
SLIDE 26

GPT-3: Coding Interview

SLIDE 28

Google Duplex

  • “Personal assistant” for phone reservations
SLIDE 29

Google Duplex

SLIDE 30

Security

SLIDE 31

ML in Security Today

The good stuff:

  • Some significant improvements in malware detection
    ○ Next-Generation Antivirus (NGAV)
  • Some promise for network intrusion detection
    ○ Not yet prominent in practice

SLIDE 32

ML in Security Today

  • All in all:
    ○ ML models with so-so performance
    ○ ML only makes up a small part of the core product
    ○ Data and ML technology under-utilized
  • Lagging behind other domains

SLIDE 33

Agenda

ARE we lagging behind
WHY is that the case
WHAT can we do

SLIDE 34

WHY?

SLIDE 35

Anomaly Detection Algorithms

  • Algorithms aimed at identifying data points, events, or observations that deviate from a dataset's normal behavior
  • Very common in Security
    ○ Algorithm task fits business needs
    ○ Unsupervised (no labels needed)

SLIDE 36

Anomaly Detection Algorithms

Yet, not ideal for Security:

  • High false positive rate (FPR)
    ○ Legitimate user activity is often anomalous
    ○ Higher cost of errors than in other domains
      ■ (Block legit activity? Wait for manual review?)
  • Human-designed features are our “Ground Truth”
    ○ Very prone to human bias
    ○ Model only spots MOs we already know
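As an aside not in the original deck: a minimal synthetic sketch of the FPR problem, using scikit-learn's IsolationForest as a stand-in anomaly detector. All data and numbers here are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic "normal" activity: two features, e.g. bytes sent / session length.
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# Legitimate-but-rare activity (say, an admin's bulk export) sits far from the bulk.
legit_rare = rng.normal(loc=4.0, scale=0.5, size=(20, 2))

model = IsolationForest(contamination=0.05, random_state=0).fit(normal)

# The detector flags the rare-but-legitimate sessions as anomalies: false positives.
preds = model.predict(legit_rare)  # -1 = anomaly, 1 = inlier
fpr = (preds == -1).mean()
print(f"share of legitimate rare sessions flagged: {fpr:.0%}")
```

The point of the sketch: the model has no notion of intent, only of rarity, so unusual-but-benign behavior is indistinguishable from an attack.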

SLIDE 37

Changing Environment

  • Most ML domains: mostly unchanging environment
    ○ E.g.: CV, NLP
  • Environment in Security:
    ○ New devices
    ○ New apps
    ○ New protocols
    ○ Etc.
  • This is a problem for a learning model
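A common way to at least detect this problem, sketched below with invented numbers (not something the talk prescribes): compare a live feature distribution against the training-time snapshot, here with SciPy's two-sample Kolmogorov-Smirnov test.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# A feature as observed at training time, e.g. packets per session.
train_feature = rng.normal(loc=10.0, scale=2.0, size=5000)

# The same feature a month later, after new devices joined the network.
live_feature = rng.normal(loc=13.0, scale=2.0, size=5000)

# Two-sample Kolmogorov-Smirnov test: has the distribution shifted?
stat, p_value = ks_2samp(train_feature, live_feature)
drifted = p_value < 0.01
print(f"KS statistic={stat:.3f}, drift detected: {drifted}")
```

Detecting drift does not fix it, but it tells you when the model's learned picture of "normal" has gone stale and retraining is due.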

SLIDE 38

An Adapting Adversary

  • As we become better at securing our devices and networks, attackers become better at outsmarting our defences
  • This problem is uncommon in most other fields
    ○ E.g.: CV, NLP

SLIDE 39

Tagging

  • How CV and NLP get tagged datasets
  • Why we can’t do that in security
    ○ Expertise
    ○ Context
    ○ Confidentiality
    ○ Scale
  • Bigger datasets = bigger tagging problems
    ○ Sampling?

SLIDE 40

Imbalanced Classes

Different classes are extremely over- or under-represented in the data

  • Results in poor predictive performance (especially for the minority class)

SLIDE 41

Imbalanced Classes

  • A major problem when aiming to identify fraud/attacks
  • While common solutions exist, they are limited, and do not fully solve this problem
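One of those common-but-limited solutions, sketched on synthetic data (the 1% attack rate and all parameters are invented for illustration): class weighting, which re-balances the loss rather than the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a 1% "attack" class, as in fraud/intrusion problems.
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Recall on the rare class: weighting typically helps, at the cost of precision,
# and does not make the imbalance problem go away.
r_plain = recall_score(y_te, plain.predict(X_te))
r_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"minority recall: plain={r_plain:.2f}, weighted={r_weighted:.2f}")
```

Resampling (over/under-sampling, SMOTE-style synthesis) is the other common family; both trade recall against precision rather than solving the underlying scarcity of attack examples.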

SLIDE 42

Need for Explainability

  • CV / NLP: mostly based on deep learning techniques
  • Deep learning models are considered “black boxes”
  • Security decision-making requires explainability (more so than other domains)
  • DL could still be used with added-on explainability models - but those are imperfect, and complex
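One example of such an added-on, model-agnostic technique (chosen here for illustration; the talk does not name a specific method): permutation importance, which shuffles one input at a time and measures the score drop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# A stand-in classifier on synthetic data; imagine features like
# port entropy, session length, failed-login count.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and record the drop in score: an approximate,
# model-agnostic picture of which inputs drive the alerts.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance={imp:.3f}")
```

Such methods give a global ranking of features, not a faithful per-alert explanation, which is part of why the slide calls them imperfect.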

SLIDE 43

Confidentiality

  • Other domains:
    ○ Public datasets
    ○ Public baselines
    ○ Publicly-released trained models
  • All of those enable not only direct collaboration, but also a way to compare new methods and algorithms

SLIDE 44

Confidentiality

Security:

  • Companies bound by confidentiality
  • No natively public data available
  • Few publicly available datasets, and those are small / outdated

SLIDE 45

Many researchers are struggling to find comprehensive and valid datasets to test and evaluate their proposed techniques and having a suitable dataset is a significant challenge in itself.

Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)

SLIDE 46

In order to test the efficiency of such mechanisms, reliable datasets are needed that (i) contain both benign and several attacks, (ii) meet real world criteria, and (iii) are publicly available.

Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020)

SLIDE 47

Agenda

ARE we lagging behind
WHY is that the case
WHAT can we do

SLIDE 48

WHAT CAN WE DO TO CHANGE THIS?

SLIDE 49

1. Public Datasets

SLIDE 50

2. Benchmarks

SLIDE 51

3. Direct Collaboration

SLIDE 52

Public Datasets
Benchmarks
Direct Collaboration

Encourage an active discussion and indirect collaboration in the public domain, resulting in faster, better progress for the security domain as a whole.

SLIDE 53

Wrap Up

ARE we lagging behind
WHY is that the case
WHAT can we do

SLIDE 54

Thank you

hi@weissnoa.com
@NWeiss
linkedin.com/in/noa-weiss
www.weissnoa.com

Presentation template by SlidesCarnival