The Unspoken Problems With Machine Learning in Security Noa Weiss

Hi! AI & Machine Learning Consultant ● Playing with data for over a decade ● Risk and Security ● PayPal, Armis ● 2

Hi! Deep Voice foundation ● Leader of Women in Data Science Israel ● Mentor junior data scientists ● 3

Agenda ● Is the grass really greener? ○ ML - other domains ○ ML - security ● The things that hold us back ● Possible solutions 4

Agenda ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do? 5

ML IN OTHER DOMAINS: COMPUTER VISION

Computer Vision Today ● Autonomous vehicles ● Facial recognition ● Generative AI 8

COMPUTER VISION: EXAMPLES 9

Image Completion Algorithm: Image-GPT 10

Sketches → Photorealism Algorithm: GauGan 12

Sketches → Photorealism Algorithm: GauGan Developed by Katherine Nicholls, PhD 16

Fictional People www.thispersondoesnotexist.com 17

Fictional People / Cats www.thiscatdoesnotexist.com 18

Fictional Everything www.thispersondoesnotexist.com www.thiscatdoesnotexist.com www.thishorsedoesnotexist.com/ www.thisartworkdoesnotexist.com/ www.thischemicaldoesnotexist.com/ 19

ML IN OTHER DOMAINS: NATURAL LANGUAGE PROCESSING (NLP)

NLP Today ● Pretty good automatic translation ● Long-form question answering ● GPT-3 21

NLP: EXAMPLES 22

GPT-3 ● Language model (multi-purpose NLP model) ● Mostly generative ● Astonishing performance 23

GPT-3: Generative Code ● Free description of layout → JSX code ● (No task-specific training) 24

GPT-3: Generative Code ● Free description of ML model → model code! 25

GPT-3: Coding Interview 26

Google Duplex ● “Personal assistant” for phone reservations 28

Google Duplex 29

Security

ML in Security Today The good stuff: ● Some significant improvements in malware detection ○ Next Generation Anti Virus (NGAV) ● Some promise for network intrusion detection ○ Not yet prominent in practice 31

ML in Security Today ● All in all: ○ ML models with so-so performance ○ ML only makes for a small part of core product ○ Data and ML technology under-utilized ● Lagging behind other domains 32

Anomaly Detection Algorithms Algorithms aimed at identifying data points, events, or observations that deviate from a dataset's normal behavior ● Very common in Security ○ Algorithm task fits business needs ○ Unsupervised (no labels needed) 35

Anomaly Detection Algorithms Yet, not ideal for Security: ● High false positive rate (FPR) ○ Legitimate user activity is often anomalous ○ Higher cost of errors than other domains ■ (Block legit activity? Wait for manual review?) ● Human-designed features are our “Ground Truth” ○ Very prone to human bias ○ Model only spots MOs we already know 36
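To make the false-positive point concrete, here is a minimal sketch of an unsupervised anomaly check, assuming a simple z-score rule over a single hand-picked feature. The function names, the login counts, and the threshold of 3 are all hypothetical and chosen for illustration; real products use richer features and models.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Learn what 'normal' looks like from past values of one feature."""
    return mean(history), stdev(history)

def is_anomalous(value, mu, sigma, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the baseline mean."""
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical daily login counts for one user (illustrative data only).
history = [12, 11, 13, 12, 10, 14, 12, 11, 13, 12, 11, 12, 13, 12]
mu, sigma = fit_baseline(history)

typical_day = is_anomalous(13, mu, sigma)          # an ordinary day
busy_but_legit_day = is_anomalous(40, mu, sigma)   # e.g. a deadline week, not an attack
```

The busy day is flagged even though nothing malicious happened: the detector only knows that the value is unusual, not why. The feature itself (login count) was also chosen by a human, so the model can only surface deviations along dimensions someone already thought to measure.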

Changing Environment ● Most ML domains: mostly unchanging environment ○ E.g.: CV, NLP ● Environment in Security: ○ New devices ○ New apps ○ New protocols ○ Etc. ● This is a problem for a learning model 37
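One rough way to see why a changing environment hurts a learning model is to measure how much of the recent traffic falls outside anything the model saw during training. The sketch below assumes a single categorical feature (protocol name); the function name and the sample values are hypothetical.

```python
def unseen_rate(training_values, recent_values):
    """Fraction of recent observations whose category never appeared in the training data."""
    known = set(training_values)
    if not recent_values:
        return 0.0
    unseen = sum(1 for value in recent_values if value not in known)
    return unseen / len(recent_values)

# Hypothetical protocol labels observed at training time and some time later.
training_protocols = ["http", "dns", "ssh", "http", "dns"]
recent_protocols = ["http", "quic", "dns", "quic", "mqtt", "http", "quic", "dns"]

rate = unseen_rate(training_protocols, recent_protocols)
```

In this toy data, half of the recent events use protocols the model has never seen, so whatever it learned about "normal" no longer describes the network it is scoring.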

An Adapting Adversary ● As we become better at securing our devices and networks, attackers become better at outsmarting our defences ● This is a problem uncommon in most fields ○ E.g.: CV, NLP 38

Tagging ● How CV and NLP get tagged datasets ● Why we can’t do that in security ○ Expertise ○ Context ○ Confidentiality ○ Scale ● Bigger datasets = bigger tagging problems ○ Sampling? 39

Imbalanced Classes Different classes are extremely over- or under-represented in the data ● Results in poor predictive performance (especially for minority class) 40

Imbalanced Classes ● A major problem when aiming to identify fraud/attacks ● While common solutions exist, they are limited, and do not fully solve this problem 41
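A short worked calculation shows how class imbalance interacts with false positives. This is a sketch under stated assumptions: the attack rate, detection rate, and false positive rate below are hypothetical numbers picked for illustration, and the function name is not from the talk.

```python
def precision_at_base_rate(prevalence, tpr, fpr):
    """Share of alerts that are real attacks, given class prevalence and detector error rates."""
    true_alerts = prevalence * tpr
    false_alerts = (1 - prevalence) * fpr
    total_alerts = true_alerts + false_alerts
    if total_alerts == 0:
        return 0.0
    return true_alerts / total_alerts

# Hypothetical setting: 1 in 1,000 events is an attack.
attack_rate = 0.001

# A detector that catches 99% of attacks and wrongly flags 1% of benign events.
precision = precision_at_base_rate(attack_rate, tpr=0.99, fpr=0.01)

# A useless model that labels everything benign still scores very high accuracy.
always_benign_accuracy = 1 - attack_rate
```

Under these assumed rates, only about 9% of alerts are real attacks, while a model that never raises an alert is 99.9% accurate. This is why accuracy is a misleading metric for the minority class, and why even a small false positive rate becomes costly at scale.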

Need for Explainability ● CV / NLP: mostly based on deep learning techniques ● Deep learning models are considered “black boxes” ● Security decision-making requires explainability (more so than other domains) ● DL could still be used with added-on explainability models - but those are imperfect, and complex 42

Confidentiality ● Other domains: ○ Public datasets ○ Public baselines ○ Publicly-released trained models ● All of those enable not only direct collaboration, but also a way to compare new methods and algorithms 43

Confidentiality Security: ● Companies bound by confidentiality ● No natively public data available ● Few publicly available datasets - small / outdated 44

Many researchers are struggling to find comprehensive and valid datasets to test and evaluate their proposed techniques and having a suitable dataset is a significant challenge in itself. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020) 45

In order to test the efficiency of such mechanisms, reliable datasets are needed that (i) contain both benign and several attacks, (ii) meet real world criteria, and (iii) are publicly available. Ferrag, M. A., Maglaras, L., Moschoyiannis, S., & Janicke, H. (2020) 46

WHAT CAN WE DO TO CHANGE THIS?

1 . Public Datasets

2 . Benchmarks

3 . Direct Collaboration

Public Datasets Benchmarks Direct Collaboration Encourage an active discussion & indirect collaboration, in the public domain, resulting in faster, better progress for the security domain as a whole.

Wrap Up ● ARE we lagging behind? ● WHY is that the case? ● WHAT can we do? 53

Thank you hi@weissnoa.com @NWeiss linkedin.com/in/noa-weiss www.weissnoa.com Presentation template by SlidesCarnival
