WARS OF THE MACHINES



slide-1
SLIDE 1

WARS OF THE MACHINES

BUILD YOUR OWN SEEK AND DESTROY ROBOT

slide-2
SLIDE 2

WHO AM I ?

Senior Security Researcher @ digital.security
Definitely not an ML expert / data scientist
Love learning new things !

slide-3
SLIDE 3

INTRODUCTION

slide-4
SLIDE 4

MACHINE LEARNING IS COOL !

slide-5
SLIDE 5

LOOKS AWESOME !

slide-6
SLIDE 6

DEEPFAKES !

slide-7
SLIDE 7

I'M GOING TO LEARN ML

That's a challenge for me
I have no clue what I'm doing
Never mind, I'll learn (as usual)

slide-12
SLIDE 12

MY LITTLE PROJECT

I need to start small
I need something that will give some results quickly
Something related to IoT security, indeed
A tool that gives a big picture of IoT ?

slide-17
SLIDE 17

DESIRED FEATURES

Scans and collects device info from HTTP services on known ports
Automatically classifies these devices
Provides an overview of customer-premises devices available on the Internet
Can be used to create targeted attacks !

slide-18
SLIDE 18

PREVIOUS RESEARCH

All Things Considered: An Analysis of IoT Devices on Home Networks - USENIX Security 2019, Kumar et al.
ProfilIoT: A Machine Learning Approach for IoT Device Identification Based on Network Traffic Analysis - Yair Meidan et al.

slide-20
SLIDE 20

BUT HOW IS IT DONE ?

HOW ??

slide-21
SLIDE 21

MACHINE LEARNING FOR DUMMIES HACKERS

slide-24
SLIDE 24

HOW CAN A MACHINE LEARN ?

THE SAME WAY OUR BRAIN LEARNS.

(THANKS CAPT'N OBVIOUS...)

slide-25
SLIDE 25

TRAIN AND PREDICT

Train a machine to do a precise task (e.g. answer "is there a cat in this image ?")
Ask the trained machine to answer the same question on random images
This is called supervised learning
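As a rough illustration of the train-then-predict loop, here is a minimal single-neuron perceptron in plain Python. The toy task (logical AND), the learning rate and the epoch count are assumptions picked for the example; none of them come from the talk.

```python
def predict(weights, bias, x):
    """Return 1 if the weighted sum of the inputs is above the threshold, else 0."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Train one perceptron with the classic perceptron learning rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            # error is 0 when the guess is right, +1 or -1 when it is wrong
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


# Toy training set (assumption for illustration): logical AND
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

# Train, then ask the trained perceptron the same question again
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```

The training set carries the expected answers, which is what makes this supervised.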

slide-26
SLIDE 26

THE PERCEPTRON

slide-27
SLIDE 27

TRAIN AND PREDICT

slide-28
SLIDE 28

CLASSIFY

Ask a machine to sort a set of images (e.g. group them by cats, dogs, etc.)
The machine will find similarities between these images and group them
This is called unsupervised learning

slide-29
SLIDE 29

EXAMPLE

We want to sort a set of data about vehicles
Describe each vehicle:
  number of wheels
  number of seats
Let the machine do the rest !

slide-30
SLIDE 30

CLASSIFY

slide-31
SLIDE 31

K-MEANS CLUSTERING

slide-32
SLIDE 32

K-MEANS CLUSTERING

Number of centroids (K) is set at the beginning
If K is too low, groups will contain multiple subgroups
If K is too high, groups will be spread among multiple centroids
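A minimal sketch of K-means in plain Python may help show how the choice of K changes the grouping. The vehicle data (wheels, seats) and the farthest-first seeding are assumptions made for this example so that it runs deterministically; the talk itself relies on scikit-learn.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centroids(points, k):
    """Farthest-first seeding: deterministic, spreads the starting centroids out."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)
    return centroids


def kmeans(points, k, iterations=10):
    centroids = initial_centroids(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Hypothetical vehicles described as (number of wheels, number of seats)
vehicles = [(2, 1), (2, 2), (4, 5), (4, 4), (4, 5), (6, 50), (6, 52)]
labels_k3 = kmeans(vehicles, k=3)
labels_k2 = kmeans(vehicles, k=2)
print(labels_k3)
print(labels_k2)
```

With K=3 the two-wheelers, cars and buses each get their own label; with K=2 the two-wheelers and cars end up in the same group, which is the "K too low" case.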

slide-33
SLIDE 33

OTHER ALGORITHMS (WE WON'T COVER)

Fuzzy C-means: similar to K-means, but each data point gets a weighted membership in every cluster
Hierarchical Clustering

slide-34
SLIDE 34

SUPERVISED VS. UNSUPERVISED

Supervised learning is for training
  Two datasets required
  Training dataset needs an associated result set
Unsupervised learning finds relationships in chaotic data

slide-35
SLIDE 35

SUPERVISED VS. UNSUPERVISED

Supervised learning is a simple and effective method
Unsupervised learning is more complex and subject to errors

slide-36
SLIDE 36

DATASETS

slide-37
SLIDE 37

DATASETS

Datasets matter: a badly built dataset leads to errors
Datasets may be biased
Splitting a dataset in two for training and testing is not that easy
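One reason the split is not trivial: with an imbalanced dataset, a naive cut can leave a category almost absent from one half. The sketch below shows one possible approach, a stratified split in plain Python. The device labels, the 25% ratio and the seed are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.25, seed=42):
    """Split (sample, label) pairs so each label keeps a similar share in both sets."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        # at least one test item per label, even for tiny groups
        cut = max(1, round(len(group) * test_ratio))
        test += [(s, label) for s in group[:cut]]
        train += [(s, label) for s in group[cut:]]
    return train, test


# Hypothetical, imbalanced dataset: 8 routers and 4 cameras
samples = list(range(12))
labels = ['router'] * 8 + ['camera'] * 4
train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose.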

slide-38
SLIDE 38

FEATURE VECTOR

feature: a measurable characteristic of our input data
feature vector: an N-dimensional vector containing features

slide-39
SLIDE 39

HOW TO TURN DATA INTO A FEATURE VECTOR ?

slide-40
SLIDE 40

COLLECTING AND CONVERTING DATA

slide-41
SLIDE 41

SCANNING

Scan the Internet for well-known HTTP ports
Collect valuable data
Turn every collected page into a feature vector

slide-42
SLIDE 42

CREATING OUR DATASET

HTTP headers
HTTP body
Web page screenshot

slide-43
SLIDE 43

USING REQUESTS TO SCRAPE DATA

import json
import requests

# Query page
result = requests.get(
    'http://%s:%d/' % (self.ip_address, self.port),
    timeout=1.0
)
headers = json.dumps(dict(result.headers))
body = result.text

# Report target
self.report_target(
    self.ip_address,
    self.port,
    headers,
    body
)

slide-44
SLIDE 44

CHROMIUM + SELENIUM

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Configure Chromium
self.chrome_options = Options()
self.chrome_options.add_argument("--headless")
self.chrome_options.binary_location = '/usr/bin/chromium'
self.driver = webdriver.Chrome(
    options=self.chrome_options  # older Selenium releases used chrome_options=
)
self.driver.set_page_load_timeout(30)
self.driver.fullscreen_window()
# ...
# Save screenshot
self.driver.save_screenshot(dest)

slide-45
SLIDE 45

ANARCHY IN THE EU

slide-46
SLIDE 46

RESULTS

$ sqlite3 targets.db
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> select count(*) from targets;
4901

slide-47
SLIDE 47

RESULTS

slide-51
SLIDE 51

HOW TO MEASURE A WEB PAGE

content length: usually the same for a given device
number of headers
number of scripts, images and other tags

slide-52
SLIDE 52

HOW TO MEASURE A WEB PAGE (BADASS MODE)

Levenshtein distance to a reference page
DOM tree structure flattening combined with Levenshtein distance
Normalized page text size

slide-53
SLIDE 53

LEVENSHTEIN DISTANCE (FTR)

Measures the difference between two strings
Gives a non-negative integer value (0 for identical strings)
The bigger the value, the bigger the difference
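For reference, the distance can be computed with the standard dynamic-programming approach. This is a minimal plain-Python sketch, not the implementation used in the talk (which is not shown); the test strings are just a common textbook pair.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # → 3
```

Applied to web pages, `a` would be a reference page and `b` a collected page, giving a single number that can be used as a feature.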

slide-54
SLIDE 54

CREATING THE AUTOMATIC CLASSIFIER

slide-55
SLIDE 55

SCIKIT-LEARN

Python-based Machine Learning framework
Built on NumPy, SciPy and matplotlib
Implements major ML algorithms

slide-56
SLIDE 56

RECORDS TO DATASET

import pandas as pd

def create_dataset_from_records(records):
    """
    Create a ML dataset from a list of records
    """
    lst = [record_to_values(r) for r in records]
    return pd.DataFrame(lst, columns=[
        'headers', 'metas', 'scripts', 'images', 'bodysize'
    ])
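The slides do not show `record_to_values`. The sketch below is one hypothetical way to write it with the standard library only, producing the five columns named above. It assumes each record exposes `headers` (a JSON string, as stored by the scraper) and `body` (the page text); the real record class may well differ.

```python
import json
from collections import Counter, namedtuple
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def record_to_values(record):
    """Hypothetical helper: turn one scanned page into a 5-value feature vector."""
    parser = TagCounter()
    parser.feed(record.body)
    return [
        len(json.loads(record.headers)),  # number of HTTP headers
        parser.counts['meta'],            # number of <meta> tags
        parser.counts['script'],          # number of <script> tags
        parser.counts['img'],             # number of <img> tags
        len(record.body),                 # body size in characters
    ]


# Assumed record shape: headers as a JSON string, body as text
Record = namedtuple('Record', ['headers', 'body'])
page = Record(
    headers=json.dumps({'Server': 'nginx', 'Content-Type': 'text/html'}),
    body='<html><head><meta charset="utf-8"><script></script></head>'
         '<body><img src="logo.png"></body></html>',
)
values = record_to_values(page)
print(values)
```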

slide-57
SLIDE 57

IMPLEMENTING K-MEANS

from sklearn.cluster import KMeans
from sklearn import datasets
#...

def classify(records):
    # create a dataset from our DB records
    dataset = create_dataset_from_records(records)

    # classify
    model = KMeans(n_clusters=OPT_CLUSTERS)
    model.fit(dataset)

    # return result
    return model.labels_
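`classify` returns one cluster label per record, in the same order as the input. A possible next step, sketched here with made-up records and labels, is to group the records by label so that each cluster can be reviewed as a whole. The helper name `group_by_cluster` is an assumption, not something from the talk.

```python
from collections import defaultdict


def group_by_cluster(records, labels):
    """Map each cluster label to the list of records assigned to it."""
    clusters = defaultdict(list)
    for record, label in zip(records, labels):
        clusters[int(label)].append(record)  # int() also accepts NumPy integers
    return dict(clusters)


# Hypothetical records (IP addresses) and labels a clustering step might return
records = ['192.0.2.1', '192.0.2.2', '192.0.2.3', '192.0.2.4']
labels = [1, 0, 1, 1]

clusters = group_by_cluster(records, labels)
# Review the biggest clusters first
for label, members in sorted(clusters.items(), key=lambda kv: -len(kv[1])):
    print(label, len(members), members[0])
```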

slide-58
SLIDE 58

NUMBER OF CENTROIDS MATTERS

slide-59
SLIDE 59

BADASS FEATURE VECTOR

slide-60
SLIDE 60

BASIC FEATURE VECTOR

slide-61
SLIDE 61

BADASS IS NOT THE BEST 😮

Levenshtein distance: two pages at the same distance from the reference are not always identical
DOM tree structure: a lot of devices rely on the same page structure (login)
Normalized page size: most identical devices have the same content length

slide-62
SLIDE 62

BEST RESULTS 🤰

500 centroids
Content length
Number of various tags (img, meta, script)
Number of HTTP headers

4767|213.183.189.11|80|6|1|0|0|120|0.0|0

slide-63
SLIDE 63

ADDING METADATA

slide-64
SLIDE 64

METADATA MAY HELP

Metadata can be useful for searches:
  category: NAS, wireless router, etc.
  vendor
  product name/series
What if we were able to automatically determine (at least) the category ?

slide-65
SLIDE 65

ML-BASED METADATA

Supervised learning: this is the way.
We need a reference dataset with verified metadata
Let's add metadata to our classified targets !

slide-66
SLIDE 66
slide-67
SLIDE 67

TRAIN A MODEL FOR EACH CATEGORY

We create and train a perceptron for each category
We need to have enough input data (i.e. targets)

slide-68
SLIDE 68

PERCEPTRON FOR CAMERA

# Collect items from database
targets = list(IotTarget.select())

# Label each item: 1.0 if it IS a camera, 0.0 otherwise
result = [
    1.0 if (item.category == 'camera') else 0.0
    for item in targets
]

# Build a dataset (from the same items, so rows and labels stay aligned)
dataset = create_dataset_from_records(targets)

# Create and train our perceptron
ppn, scaler = create_mlc(dataset, result)

slide-69
SLIDE 69

USING A MULTI-LAYER PERCEPTRON (MLP)

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def create_mlc(dataset, resultset):
    """
    Create a multi-layer perceptron (MLP)
    """
    sc = StandardScaler()
    sc.fit(dataset)
    std_dataset = sc.transform(dataset)
    clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                        hidden_layer_sizes=(5, 12, 15), random_state=1)
    clf.fit(std_dataset, resultset)
    return (clf, sc)

slide-70
SLIDE 70

GENERATING A MODEL FOR THIS CATEGORY

from joblib import dump

dump((ppn, scaler), 'camera.model')

slide-71
SLIDE 71

TESTING THE ACCURACY OF OUR MODEL

from sklearn.metrics import accuracy_score

t_dataset = scaler.transform(dataset)
y_pred = ppn.predict(t_dataset)
print('Accuracy: %.2f' % accuracy_score(result, y_pred))

Accuracy: 0.94
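Accuracy alone can look flattering when one class is rare, for instance a handful of cameras among many other devices. As a hedged complement, this plain-Python sketch computes precision and recall alongside accuracy. The label lists are invented for the example and `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cameras found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cameras missed
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 2 cameras among 10 targets, one of them missed
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print('accuracy=%.2f precision=%.2f recall=%.2f' % (accuracy, precision, recall))
```

Here accuracy is 0.90 although half of the cameras are missed (recall 0.50). Scoring on records the model was not trained on gives a more honest estimate.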

slide-72
SLIDE 72

WOW.

slide-73
SLIDE 73

WORKFLOW

slide-74
SLIDE 74

(PARTLY) REVEALING THE IOT LANDSCAPE

slide-75
SLIDE 75

SCANNING THE DSL INTERNET

I discovered almost 5,000 web services hosted on DSL IP addresses
My tool helped me a lot to sort this data
This is a small dataset, but seems accurate

slide-76
SLIDE 76

RAW DATA

4895 web services detected
1501 categorized devices
3152 screenshots taken
34 MB of HTTP response content

slide-77
SLIDE 77

IOT DEVICES PER CATEGORY (%)

slide-78
SLIDE 78

TOP 5 VENDORS

#  Vendor     Devices
1  Hikvision  372
2  Dahua      117
3  Sonicwall  106
4  TP-link    85
5  Mikrotik   71

slide-79
SLIDE 79

ML IDENTIFIED SIMILAR DEVICES BUT DIFFERENT BRANDS

slide-80
SLIDE 80

BUT I ALSO FOUND MANY OTHER DEVICES

slide-81
SLIDE 81

OT / IT

slide-82
SLIDE 82

PRETTY LIABLE CONTROLLER

slide-83
SLIDE 83

WIND OF CHANGE

slide-84
SLIDE 84

WHAT CAN POSSIBLY GO WRONG ?

slide-85
SLIDE 85

WANNA SWIM ?

slide-86
SLIDE 86

WEAPONIZE

slide-87
SLIDE 87

USING ML TO TARGET DEVICES

slide-88
SLIDE 88

BUILDING EFFICIENT SCANNERS

Identifying a category of devices is difficult ...
... unless you use a trained perceptron.

slide-89
SLIDE 89

DEMO: SCANNING CAMERAS

slide-90
SLIDE 90

GEOLOCATED CAMERA FEEDS

Identify camera feeds (RTP/RTSP) from exposed cameras
Try default usernames and passwords
Geolocate IP address (geoip2)

slide-91
SLIDE 91

DOCUMENT THEFT AND RANSOM

Scan the Internet for NAS
Bruteforce authentication (default passwords)
Steal data, leave a note asking for bitcoins

slide-92
SLIDE 92

SPECIFIC VULNERABILITY RESEARCH AND EXPLOITATION

slide-93
SLIDE 93

LOOKING FOR QNAP QTS

Recent vulnerabilities affecting QNAP NAS (pre-auth root RCE)
It is possible to train a perceptron to detect QNAP NAS
Search & destroy !

slide-94
SLIDE 94

CONCLUSION

slide-95
SLIDE 95

ML IS GREAT

An unsupervised classifier allows a quick review of devices
A multi-layer perceptron provides an easy way to create targeted tools and assign metadata
A human is still required !

slide-96
SLIDE 96

TAKEAWAYS

Machine learning algorithms are easy to use with scikit-learn and Python
Extra libraries required: requests, whoosh, sqlite3
The most difficult part: picking the right features and building correct datasets

slide-97
SLIDE 97

I WON'T RELEASE ANY SOURCE CODE

All the key material is provided (code snippets, parameters, etc.)
I learned a lot during this project, so will you 👎
Well, maybe because my code is dirty too...

slide-98
SLIDE 98

THANK YOU FOR LISTENING

HOPE YOU ENJOYED THE TALK 😂
DISCOVER NEW THINGS, EXPERIMENT, LEARN !
Twitter: @virtualabs
damien.cauquil@digital.security