slide-1
SLIDE 1

Cloudy with a Chance of Breach: Forecasting Cyber Security Incidents

Yang Liu§, Armin Sarabi§, Jing Zhang§, Parinaz Naghizadeh§ Manish Karir♯, Michael Bailey∗, Mingyan Liu§,♯

§ EECS Department, University of Michigan, Ann Arbor ♯ QuadMetrics, Inc. ∗ ECE Department, University of Illinois, Urbana-Champaign

http://grs.eecs.umich.edu

Y.Liu (U. Michigan) Forecasting Cyber Security Incidents 1 / 28

slide-2
SLIDE 2

Introduction

Motivation

Increasingly frequent and high-impact data breaches

◮ Target, JP Morgan Chase, Home Depot, to name a few
◮ Increasing social and economic impact of such cyber incidents


slide-3
SLIDE 3

Introduction

Limitation of current approaches

◮ Heavily detection based
◮ Fail to detect, or too late by the time a breach is detected
◮ Not suited for cost/damage control
◮ Urgent need for more proactive measures



slide-5
SLIDE 5

Introduction

Detection
◮ Analogous to diagnosing a patient who may already be ill (e.g., via a biopsy).
◮ [Qian et al., NDSS '14; Wang et al., USENIX Security '14]

Prediction
◮ Predicting whether a presently healthy person may become ill, based on a variety of relevant factors.
◮ [Soska & Christin, USENIX Security '14]

Our goal:
◮ Understand the extent to which one can forecast incidents on an organizational level.



slide-8
SLIDE 8

Introduction

Objective

To develop the ability to forecast security incidents.

◮ Applicability: we rely solely on externally observed data; we do not require information on the internal workings of a network or its hosts.
◮ Robustness: we do not have control over, or direct knowledge of, the error embedded in the data.

Key idea:
◮ Tap into a diverse set of data that captures different aspects of a network's security posture, ranging from the explicit to the latent.



slide-13
SLIDE 13

Introduction

Why prediction?

Forecasting enables entirely new classes of applications that are otherwise not feasible.
◮ Prediction allows proactive policies and measures to be adopted, rather than reactive measures following detection.

Forecasting enables effective risk management schemes.
◮ Internal to an org.: more informed decisions on resource allocation.
◮ External to an org.: incentive mechanisms such as cyber insurance.


slide-14
SLIDE 14

Introduction

Outline of the talk

◮ Data and Preliminaries

  • Description of the data
  • Data pre-processing

◮ Forecasting methods

  • Construction of the predictor

◮ Forecasting results

  • Main prediction results & analysis


slide-15
SLIDE 15

Data: Methodology

Datasets at a glance

Category                  Collection period   Datasets
Mismanagement symptoms    Feb'13 - Jul'13     Open Recursive Resolvers, DNS Source Port, BGP misconfiguration, Untrusted HTTPS, Open SMTP Mail Relays
Malicious activities      May'13 - Dec'14     CBL, SBL, SpamCop, UCEPROTECT, WPBL, SURBL, PhishTank, hpHosts, Darknet scanners list, Dshield, OpenBL
Incident reports          Aug'13 - Dec'14     VERIS Community Database, Hackmageddon, Web Hacking Incidents

◮ Mismanagement and malicious activities are used to extract features.
◮ Incident reports are used to generate labels for training and testing.



slide-17
SLIDE 17

Data: Methodology

Security posture data

Mismanagement symptoms
◮ Deviations from known best practices; indicators of a lack of policy or expertise:
  • Misconfigured HTTPS certificates, DNS (resolver + source port), mail servers, BGP.
◮ Collected around mid-2013 (pre-incidents).

Malicious activity data: a set of 11 reputation blacklists (RBLs)
◮ Daily collections of IPs seen engaged in some malicious activity.
◮ Three malicious activity types: spam, phishing, scan.
◮ Uses data between May 2013 and December 2014.


slide-18
SLIDE 18

Data: Methodology

Security incident Data

Three incident datasets

◮ Hackmageddon
◮ Web Hacking Incidents Database (WHID)
◮ VERIS Community Database (VCDB)

Incident type   SQLi   Hijacking   Defacement   DDoS
Hackmageddon    38     9           97           59
WHID            12     5           16           45

Incident type   Crimeware   Cyber Esp.   Web app.   Else
VCDB            59          16           368        213



slide-21
SLIDE 21

Data: Pre-processing

Data Pre-processing

Incident cleaning
◮ Remove irrelevant cases, e.g., "robbery at liquor store", "something happened", etc.

Data diversity presents a challenge of alignment in time and space.
◮ Security posture data records information at the host (IP-address) level.
◮ Cyber incident reports are associated with an organization.
◮ Such alignment is not trivial: address reallocation makes boundaries unclear.

A mapping process:
◮ Summarize owner IDs from RIR databases.
◮ 4.4 million prefixes listed under 2.6 million owner IDs: a finer granularity than the routing table.
◮ Sample an IP from the organization + search for it in the above table.
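The lookup step above amounts to a longest-prefix match against the owner table. A minimal sketch using Python's ipaddress module, with made-up prefixes and owner IDs standing in for the RIR-derived table:

```python
import ipaddress

# Hypothetical prefix -> owner-ID table, as would be summarized from RIR databases.
PREFIX_OWNERS = {
    ipaddress.ip_network("192.0.2.0/24"): "OWNER-A",
    ipaddress.ip_network("198.51.100.0/22"): "OWNER-B",
    ipaddress.ip_network("198.51.100.128/25"): "OWNER-C",
}

def owner_of(ip_str):
    """Map an IP to its owner ID via longest-prefix match."""
    ip = ipaddress.ip_address(ip_str)
    matches = [net for net in PREFIX_OWNERS if ip in net]
    if not matches:
        return None
    # The most specific (longest) matching prefix wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return PREFIX_OWNERS[best]
```

With millions of prefixes, a trie or a sorted prefix structure would replace the linear scan, but the matching rule is the same.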




slide-25
SLIDE 25

Forecast: Methodology

Approach at a glance

Feature extraction
◮ 258 features extracted from the datasets: primary + secondary features.

Label generation
◮ 1,000+ incident reports from the three incident datasets.

Classifier training and testing
◮ Random Forest (RF) classifier trained with the features and labels.
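This training step can be sketched with scikit-learn's RandomForestClassifier; the random matrix below is only a stand-in for the real 258-dimensional feature vectors and incident labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in data: 200 organizations x 258 features, with binary incident labels.
X = rng.normal(size=(200, 258))
y = rng.integers(0, 2, size=200)

# Time-ordered split: train on the "past" half, test on the "future" half.
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Probabilistic output lets one choose an operating point on the ROC curve
# rather than committing to a fixed 0/1 decision.
scores = clf.predict_proba(X_test)[:, 1]
```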


slide-26
SLIDE 26

Forecast: Methodology

Primary features: raw data

Mismanagement symptoms (5)
◮ Five symptoms; each measured as a fraction.
◮ Predictive power of these symptoms:

[Figure: CDFs of % untrusted HTTPS and % open resolvers, for victim vs. non-victim organizations]



slide-28
SLIDE 28

Forecast: Methodology

Malicious activity time series (60 × 3)
◮ Three time series over a period: spam, phishing, scan.
◮ Recent-60 vs. Recent-14 day windows.

[Figure: example 60-day time series of daily listed IP counts for spam, phishing, and scan]

Size: number of IPs in an aggregation unit (1)
◮ To some extent captures the likelihood of an organization becoming a target of, or reporting, intentional attacks.


slide-29
SLIDE 29

Forecast: Methodology

Secondary features

Quantization and feature extraction

[Figure: 60-day time series of the number of IPs listed, quantized into regions, with a persistent region marked]

◮ Measure security efforts and responsiveness.
◮ In each quantized region, measure average magnitude, average duration, and frequency.
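A minimal sketch of secondary-feature extraction of this kind, under the assumption that days are quantized into bands and consecutive days in the same band form a region; the quantization threshold and the focus on the highest ("bad") band are illustrative, not the paper's exact definitions:

```python
import numpy as np
from itertools import groupby

def secondary_features(series, bins):
    """Quantize a daily time series and summarize its worst-band regions.

    Returns (average magnitude, average duration in days, frequency)
    over the regions that fall in the highest quantization band.
    """
    series = np.asarray(series, dtype=float)
    levels = np.digitize(series, bins)   # band index for each day
    worst = levels.max()

    # Group consecutive days with the same band into regions.
    regions, start = [], 0
    for level, run in groupby(levels):
        length = len(list(run))
        regions.append((level, start, length))
        start += length

    bad = [(lvl, s, n) for lvl, s, n in regions if lvl == worst]
    if not bad:
        return 0.0, 0.0, 0
    avg_mag = float(np.mean([series[s:s + n].mean() for _, s, n in bad]))
    avg_dur = float(np.mean([n for _, _, n in bad]))
    return avg_mag, avg_dur, len(bad)
```

For example, a series with two short spikes above the threshold yields a frequency of 2 and a short average duration, which is exactly the kind of responsiveness signal the slide describes.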


slide-30
SLIDE 30

Forecast: Methodology

A look at their predictive power (using data from Nov-Dec’13):

[Figure: CDFs of un-normalized "bad" magnitude, normalized "good" magnitude, "bad" duration, and "bad" frequency, for victim vs. non-victim organizations]



slide-32
SLIDE 32

Forecast: Overview of method

Training subjects

A subset of victim organizations: Group (1), or the incident group.
◮ Training-testing ratio, e.g., a 70-30 or 50-50 split.
◮ Split strictly according to time: use the past to predict the future.

            Hackmageddon     VCDB             WHID
Training    Oct 13 - Dec 13  Aug 13 - Dec 13  Jan 14 - Mar 14
Testing     Jan 14 - Feb 14  Jan 14 - Dec 14  Apr 14 - Nov 14

A random subset of non-victims: Group (0), or the non-incident group.
◮ Random sub-sampling is necessary to avoid class imbalance; the procedure is repeated over different random subsets.
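The sub-sampling step can be sketched as follows, assuming victims and non-victims are held in two lists; repeating the draw over different seeds corresponds to repeating over different random subsets:

```python
import random

def balanced_sample(victims, non_victims, seed):
    """Draw one balanced training set: all victims plus an equally
    sized random subset of non-victims."""
    rng = random.Random(seed)
    sampled = rng.sample(non_victims, k=len(victims))
    data = [(org, 1) for org in victims] + [(org, 0) for org in sampled]
    rng.shuffle(data)
    return data

# Repeating over different random subsets, e.g. one draw per seed:
# draws = [balanced_sample(victims, non_victims, seed=s) for s in range(10)]
```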



slide-34
SLIDE 34

Results: Main results

Prediction procedure



slide-37
SLIDE 37

Results: Main results

Prediction performance

[Figure: ROC curves (true positive vs. false positive rate) for VCDB, Hackmageddon, WHID, and ALL]

Example of desirable operating points of the classifier:

Accuracy              Hackmageddon   VCDB   WHID   All
True Positive (TP)    96%            88%    80%    88%
False Positive (FP)   10%            10%    5%     4%
Overall Accuracy      90%            90%    95%    96%
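Operating points like these come from thresholding the classifier's scores. A minimal sketch with scikit-learn's roc_curve on stand-in labels and scores, picking the best true-positive rate subject to a 10% false-positive budget:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Stand-in classifier scores and ground-truth incident labels.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.10, 0.30, 0.35, 0.80, 0.40, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Choose the operating point with the highest TP rate subject to FP <= 10%.
candidates = np.where(fpr <= 0.10)[0]
best = candidates[np.argmax(tpr[candidates])]
operating_threshold = thresholds[best]
```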


slide-38
SLIDE 38

Results: Other observations

Split ratio

[Figure: ROC curves for VCDB with 50-50 vs. 70-30 training-testing splits (short-term prediction)]

More training data yields better performance.


slide-39
SLIDE 39

Results: Other observations

Long term prediction


slide-40
SLIDE 40

Results: Other observations

Short-term vs. long-term prediction

[Figure: ROC curves for VCDB with a 50-50 split, short-term vs. long-term prediction]

Temporal features become outdated.



slide-44
SLIDE 44

Results: Other observations

Importance of the Features

Top feature descriptor          Value
Untrusted HTTPS Certificates    0.1531
Frequency                       0.1089
Organization size               0.0976
Open recursive resolver         0.0928

◮ Two mismanagement features rank in the top 4.

Feature category                Normalized importance
Mismanagement                   0.3229
Time series data                0.2994
Recent-60 secondary features    0.2602

◮ Secondary features are almost as important as time series data.
◮ Dynamic features > static features.
◮ Separate data sources do NOT achieve comparable results.
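Importance values of this kind can be read directly off a fitted Random Forest. A minimal scikit-learn sketch on synthetic data; the feature names are illustrative, and the label is constructed to depend mostly on the first feature so that it ranks highest:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Illustrative feature names, not the paper's exact descriptors.
feature_names = ["untrusted_https", "bad_frequency", "org_size", "open_resolver"]
X = rng.normal(size=(300, 4))
# Label depends mostly on the first feature, plus a little noise.
y = (X[:, 0] + 0.2 * rng.normal(size=300) > 0).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean-decrease-in-impurity importances; they are normalized to sum to 1.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```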


slide-45
SLIDE 45

Results: Other observations

Case study: Data Breaches of 2014

[Figure: CDF of predictor output for a randomly selected non-victim set vs. the VCDB victim set. Example scores around the 0.85 threshold: AXTEL 0.87, Home Depot 0.85, BJP Junagadh 0.77, Target 0.84, ACME 0.85, Sony Pictures 0.90, OnlineTech 0.92, eBay 0.88]

◮ High-profile data breaches from 2014: Sony (0.90), eBay (0.88), Home Depot (0.85), Target (0.84), OnlineTech/JP Morgan Chase (0.92)



slide-49
SLIDE 49

Discussion

Discussions

Errors in the data.
Robustness against adversarial data.
Prediction by incident type.

◮ O. Thonnard, L. Bilge, A. Kashyap, and M. Lee, "Are You At Risk? Profiling Organizations and Individuals Subject to Targeted Attacks," Financial Cryptography and Data Security, 2015.
◮ A. Sarabi, P. Naghizadeh, Y. Liu, and M. Liu, "Prioritizing Security Spending: A Quantitative Analysis of Risk Distributions for Different Business Profiles," WEIS 2015.

Quality of reported data.
◮ Part of our data can be downloaded at http://grs.eecs.umich.edu.


slide-50
SLIDE 50

Discussion

Q & A

Acknowledgement

◮ We thank NSF and DHS for funding.

Project webpage (part of data being available)

◮ http://grs.eecs.umich.edu
◮ http://www.umich.edu/~youngliu
