Secure Data Management: Foundations, Systems and Applications

SLIDE 1

Erik Jonsson School of Engineering and Computer Science Distinguished Lecture Series 2017-2018

Secure Data Management:

Foundations, Systems and Applications

My Journey 1985-Present

  • Dr. Bhavani Thuraisingham

The University of Texas at Dallas

April 27, 2018

SLIDE 2

How Did I Get Here?*

  • I am of Sri-Lankan Tamil origin; married at 20 in 1975 while finishing my B.Sc. in Mathematics and Physics at the University of Ceylon. My husband, also of Tamil origin, was finishing his PhD at the University of Cambridge, England.
  • Moved to the University of Bristol, England, soon after for my graduate education, and then moved to the US in 1980 for better opportunities.
  • Was offered a tenure-track Assistant Professor position at New Mexico Tech, but declined as my son was a baby; instead took visiting faculty positions in Socorro, NM and later Minneapolis, MN for 3 years, and then joined Control Data Corp. as a Senior Software Developer in Computer Networks and Distributed Systems.
  • My lucky break came in Fall 1985 when I received my US Citizenship, Honeywell won the USAF contract to design a high assurance secure database, and Honeywell interviewed me and wanted to hire me; all three had to come together.
  • Have been working in Cyber Security and Data Science at Honeywell, MITRE, NSF and UT Dallas for 32+ years.

* https://www.youtube.com/watch?v=xfBie2oVzkA

SLIDE 3

Summary of My Research in Secure Data: 1985 - Present

SLIDE 4

Multilevel Secure Data Management: Lock Data Views (LDV)* - (Air Force AFRL)

  • The LOCK operating system's type enforcement mechanism encapsulates applications such as database management systems (DBMS) in a protected subsystem: objects of special types are accessible only to subjects executing in the DBMS domain.
  • Restricting the subjects that are allowed to execute in this domain is the approach that makes LDV a unique design.
  • The underlying LOCK security mechanisms are available within the DBMS domain, and we extend the underlying security policy to meet database requirements.
  • The LOCK type enforcement mechanism supports assured pipelines for passing data securely between the DBMS and user domains.
  • Proved that the pipelines are both unbypassable and tamper-proof.
  • Developed a multilevel data model, relational database theory, security architecture and formal security model.
  • Technology transferred to every secure commercial database system product from Oracle, Sybase, Informix, and Ingres in the early 1990s.

* Bhavani M. Thuraisingham: Security checking in relational database management systems augmented with inference engines. Computers & Security 6(6): 479-492 (1987) (Landmark paper that spawned research on the inference problem)
Paul D. Stachour, Bhavani M. Thuraisingham: Design of LDV: A Multilevel Secure Relational Database Management System. IEEE Trans. Knowl. Data Eng. 2(2): 190-209 (1990)

SLIDE 5

Multilevel Secure Data Management: LDV Assured Pipelines*

[Figure: LDV assured pipelines – Response Pipeline, Update Pipeline, Metadata Pipeline]

* This research had a significant impact on the National Computer Security Center's Purple Book in 1991 – Evaluation Criteria for Secure Database Systems

SLIDE 6

Inference Problem: Security Architecture* (Army-CECOM)

[Architecture: User Interface Manager; Policy Manager with security policies; Query Processor (policies during query and release operations); Update Processor (policies during update operations); Database Design Tool (policies during database design); all over the MLS/DBMS and MLS Database]

  • Problem: posing multiple queries and deducing unauthorized information
  • Query rewriting according to the policies
  • Release database is examined as to what has been released
  • Query is processed and the response assembled
  • Release database is examined to determine whether the response should be released
  • Portions of the query processor are trusted
  • Certain policies are examined during update operations
  • Example: content-based policies
  • The security level of the data is computed
  • Data is entered at the appropriate level
  • Certain parts of the Update Processor are trusted
  • Certain policies are examined at database design time
  • Example: simple, association and logical policies
  • Schemas are assigned security levels
  • Database is partitioned accordingly
  • Technology patented and implemented in the Army's Maneuver Control System in a distributed environment connecting systems in MA, VA, and NJ.

* Bhavani M. Thuraisingham, William R. Ford, Marie Collins, J. O'Keeffe: Design and Implementation of a Database Inference Controller. Data Knowl. Eng. 11(3): 271-297 (1993) Bhavani M. Thuraisingham, William R. Ford: Security Constraints in a Multilevel Secure Distributed Database Management System. IEEE Trans. Knowl. Data Eng. 7(2): 274-293 (1995)

SLIDE 7

Inference/Privacy Problem: Complexity* (Navy SPAWAR)

  • Dr. John Campbell of the NSA stated in 1990 that the Unsolvability Proof of the Inference Problem was a significant result in database security (Proceedings, National Computer Security Conference, 1990).
  • Some of the work has been adapted for data privacy.
  • Given a recursively enumerable degree, can you find an instance of the privacy problem that is one-one equivalent? YES
  • To what extent is the privacy problem unsolvable?
  • Challenges posed by Thuraisingham in the 1990s:
  • Can we measure security and privacy?
  • Question answered in 2002 by Prof. Latanya Sweeney with respect to Data Privacy with her formulation of K-Anonymity (followed by L-Diversity and Differential Privacy)
  • What is the computational complexity of the inference and privacy problems?
  • Some initial directions provided by Thuraisingham; recent work by the Harvard Data Privacy Group

* Bhavani Thuraisingham: Recursion Theoretic Properties of the Inference Problem, IEEE Computer Security Foundations, Franconia, NH, June 1990 (also available as MTP Technical Report, MTP 291)
Bhavani Thuraisingham: On the Complexity of the Privacy Problem in Databases, Data Mining: Theory and Practice, Springer, Editors: T. Y. Lin et al., 2005. First work to integrate computability theory with secure data

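The k-anonymity measure mentioned above can be illustrated with a short check: every combination of quasi-identifier values must be shared by at least k records. This is a minimal sketch; the records and attribute names are made-up examples, not from Sweeney's work.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_ids, k):
    """Check k-anonymity: every quasi-identifier combination must be
    shared by at least k records in the table."""
    counts = Counter(tuple(r[q] for q in quasi_ids) for r in rows)
    return all(c >= k for c in counts.values())

records = [
    {"zip": "750**", "age": "20-29", "disease": "flu"},
    {"zip": "750**", "age": "20-29", "disease": "cold"},
    {"zip": "750**", "age": "30-39", "disease": "flu"},
]
print(is_k_anonymous(records, ["zip", "age"], 2))  # False: one group has 1 record
```

Generalizing the third record's age band to match the others would make the table 2-anonymous with respect to zip and age.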

SLIDE 8

Inference/Privacy Problem: Complexity

Multilevel Database: A multilevel deductive database is a quadruple <B, F, C, A> where B is a database, F is a privacy function, C is a recursive set of privacy policies, and A is an algorithm (i.e., an effective procedure) which assigns privacy levels to the data based on C. (Note that since C is recursive, one can effectively decide whether a privacy constraint belongs to C.)

Privacy Problem: The privacy problem with respect to privacy level L is the set of all quadruples <B, F, C, A> such that there is some x belonging to Cn_F(B) whose privacy level dominates L. We assume that the set of privacy levels forms a lattice. Formally, the privacy problem at level L is the set:

PP[L] = {<B, F, C, A> | Level(B) ≤ L and ∃x (x ∈ Cn_F(B) and Level(x) > L)}

Theorem 1:
(i) For each privacy level L, PP[L] is recursively enumerable.
(ii) For each privacy level L, PP[L] is either recursive or nonsimple.
(iii) If all privacy functions which model the rules in deductive databases are deterministic, then for each privacy level L, PP[L] is either recursive or a cylinder.
(iv) If the privacy level L1 dominates the privacy level L2, then PP[L1] ⊆ PP[L2], where ⊆ is the subset relation.

Theorem 2:
(i) There is a situation where PP[Public] is not recursive.
(ii) Assuming that the privacy functions are deterministic, there is a situation where PP[Public] is creative.
(iii) There is a situation where PP[Public] is neither recursive nor a cylinder.

SLIDE 9

Inference/Privacy Problem: Complexity

We first show that given a recursively enumerable set W, there is a situation S such that W ≡_m PP[Public], where ≡_m denotes many-one equivalence. The result is then immediate from the following reasoning:

  • It has been shown that there is a set K which is creative; K is the set {x : the xth partial recursive function halts on input x}.
  • The situation S constructed from the recursively enumerable set K will guarantee that PP[Public] is creative, because if two sets A and B are many-one equivalent and A is creative, then so is B. Therefore, since a creative set cannot be recursive, PP[Public] is not recursive.

Given a recursively enumerable set W, we create a situation S by defining a set of privacy constraints and a privacy function. Let the set of privacy constraints be {(0,0)}; that is, the only element assigned the Private level is the pair (0,0). We consider pairs of natural numbers; this causes no problem due to the existence of a pairing function from N×N onto N, where N is the set of all natural numbers. The set of privacy constraints is recursive (in this case also finite) and does not depend on W.

We define a privacy function f, which depends on W, as follows. Assume e is the index of W. For a pair (u,v):

f(u,v) = {(u, v+1)}  if u ≠ 0 and not T(e, u, v)
f(u,v) = {(0, u+v)}  if u ≠ 0 and T(e, u, v)
f(u,v) = {(u, v−1)}  if u = 0 and v ≠ 0
f(u,v) = ∅ (the empty set)  if u = 0 and v = 0

where T is Kleene's T predicate.

Graphical Representation of the Privacy Function f

[Figure: chains of pairs (u,0) → (u,1) → … → (u,v) → …, with T(e,u,v) redirecting to (0,u+v), which then descends (0,u+v) → … → (0,1) → (0,0)]

PP[Public] = {(u,v) : there is a path via f from (u,v) to (0,0)}

We have shown that W_e ≡_m PP[Public], where W_e is the eth recursively enumerable set.

SLIDE 10

Nonmonotonic Typed Multilevel Logic for Multilevel Databases* (Navy SPAWAR)

  • Developed a Nonmonotonic Typed Multilevel Logic called NTML, a breakthrough for multilevel databases at that time.
  • Extends typed first-order logic to support reasoning in a multilevel environment.
  • Developed both a proof theoretic and a model theoretic approach for viewing multilevel databases.
  • Security constraints (i.e., security policies) are treated as integrity constraints for multilevel database systems.
  • Developed techniques for efficient security constraint processing.
  • A logic programming system based on NTML was developed for an inference controller, and patented and licensed.

* Bhavani M. Thuraisingham: A Nonmonotonic Typed Multilevel Logic for Multilevel Secure Database/Knowledge-Based Management Systems. IEEE CSFW 1991: 127-138
Bhavani M. Thuraisingham: A Nonmonotonic Typed Multilevel Logic for Multilevel Secure Database/Knowledge-Based Management Systems II. IEEE CSFW 1992

SLIDE 11

NTML Theory for Multilevel Databases

  • As in any logic theory, an NTML theory has a set of logical axioms, a set of proper axioms, and a set of inference rules.
  • In the proof theoretic approach, the multilevel database is represented as an NTML theory whose proper axioms are the elementary facts, the schemas and the general laws.
  • Query Evaluation
  • A query posed by a user at security level L is expressed as a wff (well-formed formula) of the NTML theory.
  • There are two types of queries: closed and open. A closed query is a wff which is closed (i.e., with no free variables), and an open query is one with at least one free variable.
  • Query evaluation amounts to theorem proving.
  • For example, let W be the wff which corresponds to a query Q posed by a user at level L. Then evaluating the query Q amounts to proving that (W, L) is a theorem of the NTML theory.
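The flavor of "query evaluation as theorem proving at a level" can be sketched as a toy. This is not the NTML proof system: it is a minimal backward chainer in which facts and rules carry security levels, and a query posed at level L may only be proved from axioms whose level is dominated by L. All facts, rules, predicate names, and levels below are hypothetical.

```python
# Toy leveled fact base: proofs at level L may use only axioms at or below L.
LEVELS = {"Unclassified": 0, "Secret": 1}

facts = {("flies", "SR71"): "Secret",
         ("flies", "B747"): "Unclassified"}

# One rule: flies(x) => aircraft(x), itself classified Unclassified.
rules = [(("aircraft", "?x"), [("flies", "?x")], "Unclassified")]

def provable(goal, level):
    """Backward chaining restricted to axioms dominated by `level`."""
    for fact, flevel in facts.items():
        if fact == goal and LEVELS[flevel] <= LEVELS[level]:
            return True
    for head, body, rlevel in rules:
        if head[0] == goal[0] and LEVELS[rlevel] <= LEVELS[level]:
            binding = goal[1]  # single shared variable in this toy
            if all(provable((pred, binding), level) for pred, _ in body):
                return True
    return False

print(provable(("aircraft", "B747"), "Unclassified"))  # True
print(provable(("aircraft", "SR71"), "Unclassified"))  # False: needs a Secret fact
print(provable(("aircraft", "SR71"), "Secret"))        # True
```

The point mirrors the slide: the same query is a theorem of the Secret-level theory but not of the Unclassified-level theory.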

SLIDE 12

Lessons Learned from Research in Multilevel Databases

  • Designing and developing high assurance systems is still a major challenge.
  • Relationships in database systems make it even more challenging due to the inference and privacy problems.
  • Today we also have to worry about malware attacks and the web.
  • There has been progress on the computational complexity of the privacy problem and on defining measures for privacy.
  • Integrating logic programming systems with secure databases has to be reexamined.

SLIDE 13

Secure Data Science (NSF, ARO, AFOSR, NASA)

  • Big Data Security and Privacy
  • Massive amounts of data are being collected and analyzed
  • Securing the Data – Access Control Models
  • Foundations for Secure Data Collection, Storage, Management and Analysis
  • Data Mining Security and Privacy
  • Data Science for Cyber Security Applications
  • Intrusion Detection
  • Malware Analysis
  • Insider Threat Detection
  • Website Fingerprinting
  • Securing Data Science Techniques
  • Adversarial Machine Learning
  • Trustworthy Analytics
  • Acknowledgements: NSF, ARO, AFOSR, NASA and my colleagues Prof. Murat Kantarcioglu, Prof. Latifur Khan, and our Project Coordinator, Ms. Rhonda Walls

SLIDE 14

Foundations of Secure Data Collection, Storage, Management and Sharing

Influenced by the 9/11 Commission Report *

  • Policy-based Information Sharing
  • Enforce policies for sharing information and determine how much information has been lost (trustworthy partners)
  • What about the Inference Problem?
  • Formal models for information sharing
  • Apply game theory and probing to extract information from semi-trustworthy partners
  • Determine incentives and risks for information sharing
  • Conduct active defense and determine the actions of an untrustworthy partner
  • Defend ourselves from our partners using data analytics techniques
  • Conduct active defense: find out what our partners are doing by monitoring them, so that we can defend ourselves in dynamic situations

* Bhavani Thuraisingham: Assured Information Sharing: Technologies, Challenges and Directions. Intelligence and Security Informatics 2008: 1-15

SLIDE 15

Inference Problem with Semantic Web

[Architecture: Interface to the Semantic Web; Policy/Inference Engine/Rules Processor (e.g., Pellet); JENA RDF Engine over RDF Documents; Policies, Ontologies and Rules in RDF. Technology by UT Dallas]

Barbara Carminati, Elena Ferrari, Raymond Heatherly, Murat Kantarcioglu, Bhavani M. Thuraisingham: A semantic web based framework for social network access control. SACMAT 2009: 177-186
Bhavani M. Thuraisingham, Tyrone Cadenhead, Murat Kantarcioglu, Vaibhav Khadilkar: Secure Data Provenance and Inference Control with Semantic Web. CRC Press 2014, ISBN 978-1-4665-6943-0, pp. 1-448

Research challenges we addressed included optimizing SPARQL query processing and rule evaluation.
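A minimal sketch of the inference-control idea behind such a policy/inference engine, in plain Python rather than JENA/Pellet: a query answer is withheld if, combined with what has already been released, a rule would derive a protected triple. The triples, the derivation rule, and the protection policy are invented for illustration.

```python
# Rule (illustrative): worksOn(p, pr) and fundedBy(pr, a) => affiliatedWith(p, a)
def derive(facts):
    """One-step deductive closure under the single rule above."""
    out = set(facts)
    for s, _, pr in [t for t in facts if t[1] == "worksOn"]:
        for _, _, a in [t for t in facts if t[0] == pr and t[1] == "fundedBy"]:
            out.add((s, "affiliatedWith", a))
    return out

protected = {("alice", "affiliatedWith", "agencyY")}  # policy: keep unreleasable
released = set()

def try_release(triple):
    """Release a query answer only if the closure of everything released
    so far plus this triple derives no protected fact."""
    if derive(released | {triple}) & protected:
        return False
    released.add(triple)
    return True

print(try_release(("alice", "worksOn", "projX")))     # True
print(try_release(("projX", "fundedBy", "agencyY")))  # False: would enable inference
```

Either triple is harmless alone; the controller blocks whichever one arrives second, which is exactly the inference problem the slide refers to.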

SLIDE 16

Cloud-Centric Assured Information Sharing*

[Architecture: User Interface Layer → RDF Data Preprocessor → Policy Translation and Transformation Layer → MapReduce Framework for Query Processing → Hadoop HDFS, serving Agency 1 through Agency n; RDF data and policies flow in, SPARQL query results flow out]

Example Policies:
  • Agency 1 wishes to share its resources if Agency 2 also shares its resources with it.
  • Agency 1 shares a resource with Agency 2 provided Agency 2 does not share with Agency 3.

* Tyrone Cadenhead, Murat Kantarcioglu, Vaibhav Khadilkar, Bhavani M. Thuraisingham: Design and Implementation of a Cloud-Based Assured Information Sharing System. MMM-ACNS 2012: 36-50
Mohammad Farhan Husain, James P. McGlothlin, Mohammad M. Masud, Latifur R. Khan, Bhavani M. Thuraisingham: Heuristics-Based Query Processing for Large RDF Graphs Using Cloud Computing. IEEE Trans. Knowl. Data Eng. 23(9): 1312-1327 (2011)
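The two example policies above can be sketched as a simple check. The agency names and the `sharing` encoding are illustrative assumptions, not the deployed system's policy language:

```python
# sharing[a] = set of agencies that agency a currently shares with (assumed encoding)
def may_share(owner, recipient, sharing):
    """Evaluate the two example policies for Agency 1 sharing with Agency 2."""
    if owner == "Agency1" and recipient == "Agency2":
        partners = sharing.get("Agency2", set())
        # Policy 1 (reciprocity): Agency 2 must also share with Agency 1.
        if "Agency1" not in partners:
            return False
        # Policy 2: Agency 2 must not share with Agency 3.
        if "Agency3" in partners:
            return False
    return True

print(may_share("Agency1", "Agency2", {"Agency2": {"Agency1"}}))             # True
print(may_share("Agency1", "Agency2", {"Agency2": {"Agency1", "Agency3"}}))  # False
```

In the actual system such conditions are expressed as policies over RDF data and enforced in the query-processing layer rather than hard-coded.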

SLIDE 17

Towards Developing a Formal Metamodel* (AFOSR, EOARD)

  • Formal models for high assurance secure systems are vital to specify security properties and prove that the system meets the specifications.
  • Therefore, we need a formal security model for data collection, storage, management and sharing systems.
  • Examined existing models such as ABAC, RBAC and related access control models.
  • Our CBAC (Category-Based Access Control) model defines an axiomatic category-based metamodel which we then enrich with a rewrite-based operational semantics from which prototype implementations can be directly derived.
  • Presented our work at CODASPY/ABAC 2018 and are making a submission to NIST.

* M. Fernandez, M. Kantarcioglu, B. Thuraisingham: A Framework for Secure Data Collection and Management for Internet of Things. Proceedings IEEE ACSAC Conference Workshop on Industrial Control Systems, Los Angeles, CA, December 2016.
M. Fernandez, B. Thuraisingham: CBAC Model for ABAC. Proceedings ACM CODASPY/ABAC Symposium, Tempe, AZ, March 2018.
B. Thuraisingham, M. Kantarcioglu, E. Bertino, J. Bakdash, M. Fernandez: Privacy-Aware Quantified Self Data Management Framework. Proceedings ACM SACMAT, Indianapolis, June 2018.

Future funding possibility through the NSA SoS Lablet.

SLIDE 18

CBAC: Category-based Access Control Model

Our main focus has been on a Category-based Metamodel for Secure Data Collection (CBDC); future work will include secure data storage, management and sharing.

Category: a class, group or domain to which entities belong.

Entities:

  • A countable set Dev of IoT devices d1, d2, . . . : to represent data sources and channels, e.g., individual sensors, aggregators, clocks, etc.
  • A countable set DI of data items di1, di2, . . . : to represent data emanating from sensors and also contextual information (such as location, time, identifier, etc.)
  • A set A of actions: e.g., send, receive, block, encrypt, decrypt, etc.
  • A countable set C of categories c0, c1, . . .
  • A countable set S of services: to represent actual services or users that own/process data
SLIDE 19

Relationships Between Entities

  • Data-Item Category Assignment: for each d ∈ Dev, DICAd ⊆ DId × C, such that (di, c) ∈ DICAd iff the data item di ∈ DId, generated by the device d ∈ Dev, is assigned to the category c.
  • Action Category Assignment: ACA ⊆ A × C, such that (a, c) ∈ ACA iff action a ∈ A can be performed on data items assigned to the category c.
  • Banned-Action Category Assignment: BACA ⊆ A × C, such that (a, c) ∈ BACA iff action a ∈ A is banned for data items assigned to the category c ∈ C.
  • Authorizations: for each d ∈ Dev, ADId ⊆ A × DId, such that (a, di) ∈ ADId iff action a ∈ A is authorized on the data item di generated by d ∈ Dev.
  • Prohibitions: for each d ∈ Dev, BADId ⊆ A × DId, such that (a, di) ∈ BADId iff action a ∈ A is banned on the data item di generated by d ∈ Dev.
  • An additional relation UNDETd ⊆ A × DId contains the tuples (a, di) such that the action a is neither authorized nor banned on the data item di emanating from d.

SLIDE 20

Axioms and Policies

A category-based data collection policy is a tuple (E, {DICAd}d∈Dev, ACA, BACA, {ADId}d∈Dev, {BADId}d∈Dev, {UNDETd}d∈Dev) where E = (Dev, DI, A, C, S, ⊆), such that axioms (dc1)-(dc4) are satisfied.

Operational semantics: axioms (dc1)-(dc4) can be realized through a set of functions, defined by rewrite rules. For example:

(dc1') ADId(A, Di) → if A ∈ ACA*(below(DICAd(Di))) then accept
       else if A ∈ BACA*(above(DICAd(Di))) then forbid
       else undetermined
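The (dc1') decision rule can be sketched minimally in code, assuming a flat category ordering so that below/above collapse to the item's own categories; the device, category, and action names are invented, and the real metamodel realizes this with rewrite rules rather than a hand-written function:

```python
ACA = {("send", "public"), ("encrypt", "gps")}   # allowed (action, category) pairs
BACA = {("send", "gps")}                          # banned (action, category) pairs
DICA = {"temp-1": {"public"}, "gps-7": {"gps"}}   # data item -> assigned categories

def decide(action, data_item):
    """Mirror (dc1'): check allowed categories first, then banned,
    otherwise the action is undetermined for this data item."""
    cats = DICA.get(data_item, set())
    if any((action, c) in ACA for c in cats):
        return "accept"
    if any((action, c) in BACA for c in cats):
        return "forbid"
    return "undetermined"

print(decide("send", "temp-1"))  # accept
print(decide("send", "gps-7"))   # forbid
print(decide("block", "gps-7"))  # undetermined
```

The three-valued outcome corresponds directly to the relations ADId, BADId, and UNDETd on the previous slide.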

SLIDE 21

Example: Truck rental company

Rental prices vary depending on whether the truck is taken out of the country or not; drivers who are not planning to leave the country can benefit from a discount. Trucks are fitted with tracking devices able to transmit GPS locations. Goal: tracking information is transmitted only if the truck crosses the border.

Next Steps: Develop an Implementation and Check for Efficiency; Develop a CBAC Model for Data Sharing


SLIDE 22

Background for our Research on Data Mining, Security and Privacy

  • Introduced the idea of Data Mining, Security and Privacy in keynote addresses, first at IFIP 11.3 in 1996 and later at PAKDD in 1998, and subsequently published a landmark paper* while at NSF in 2002 that spawned a new area of research.**
  • Privacy-Preserving Data Mining
  • Our research with PhD student Dr. Li Liu focused on Privacy-Preserving Decision Trees and the Perturbation Method*** between 2005 and 2008.
  • Data Mining, Security and Privacy is exacerbated with Big Data and Data Science; hosted an NSF Workshop on Big Data Security and Privacy in September 2014 and presented the results to the Interagency Working Group in February 2015.

* Bhavani M. Thuraisingham: Data Mining, National Security, Privacy and Civil Liberties. SIGKDD Explorations 4(2): 1-5 (2002)
** https://www.utdallas.edu/~bxt043000/Press-Releases/Bhavani-MITRE-article-with-Marty-Faga.pdf
*** Li Liu, Murat Kantarcioglu, Bhavani M. Thuraisingham: The applicability of the perturbation based privacy preserving data mining for real-world data. Data Knowl. Eng. 65(1): 5-21 (2008)

[Figure: Noise addition – original data X → noise-added X′ → modified data mining process → final modified result]
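The perturbation method can be sketched minimally: additive zero-mean noise masks individual values while aggregate statistics remain estimable, so a miner can still learn models from the perturbed data. The data and noise parameters below are illustrative, not from the cited study.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible
original = [42.0, 51.0, 38.0, 47.0, 55.0]  # sensitive values X

# X' = X + noise, with noise drawn from a zero-mean Gaussian
perturbed = [x + random.gauss(0.0, 5.0) for x in original]

# Individual records are masked, but the mean is approximately preserved.
mean_orig = sum(original) / len(original)
mean_pert = sum(perturbed) / len(perturbed)
print(round(mean_orig, 1), round(mean_pert, 1))
```

The cited work studies exactly this trade-off: how much noise real-world data tolerates before mining results (e.g., decision trees) degrade.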

SLIDE 23

Big Data Stream Classification* (with Prof. Latifur Khan)

  • Uses past data to build a classification model
  • Predicts the labels of future instances using the model
  • Helps decision making

[Figure: network traffic → classification model → attack traffic blocked and quarantined by the firewall; benign traffic → server; expert analysis and labeling feeds back into the model]

Big Data Streams:

  • are a continuous flow of data
  • are very common in our connected digital world
  • have massive amounts of data
  • Sponsor: NASA

* Mehedy Masud, Latifur Khan, Bhavani Thuraisingham: Data Mining Tools for Malware Detection. CRC Press, 2011.
Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, Bhavani M. Thuraisingham: Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints. IEEE Trans. Knowl. Data Eng. 23(6): 859-874 (2011)
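The chunk-by-chunk loop behind stream classification can be sketched as a toy. This is not the published algorithm: it trains one simple nearest-centroid model per chunk, predicts by majority vote, and keeps only the K most recent models so the ensemble adapts to drift. The traffic features, labels, and K are made up.

```python
from collections import defaultdict

K = 3  # max ensemble size (assumed)

def train_centroid_model(chunk):
    """chunk: list of ((x0, x1), label). Returns label -> centroid."""
    sums, counts = defaultdict(lambda: [0.0, 0.0]), defaultdict(int)
    for (x0, x1), y in chunk:
        sums[y][0] += x0; sums[y][1] += x1; counts[y] += 1
    return {y: (s[0] / counts[y], s[1] / counts[y]) for y, s in sums.items()}

def predict(ensemble, x):
    """Majority vote over the ensemble's nearest-centroid predictions."""
    votes = defaultdict(int)
    for model in ensemble:
        label = min(model, key=lambda y: (x[0] - model[y][0]) ** 2
                                         + (x[1] - model[y][1]) ** 2)
        votes[label] += 1
    return max(votes, key=votes.get)

ensemble = []
stream = [  # chunks of labeled traffic features, e.g., (packet rate, size)
    [((1.0, 1.0), "benign"), ((9.0, 9.0), "attack")],
    [((1.2, 0.8), "benign"), ((8.5, 9.5), "attack")],
]
for chunk in stream:
    ensemble.append(train_centroid_model(chunk))
    ensemble = ensemble[-K:]  # drop outdated models as concepts drift

print(predict(ensemble, (1.1, 1.0)))  # benign
print(predict(ensemble, (9.0, 8.8)))  # attack
```

The published work additionally handles novel class detection and time constraints, which this sketch omits.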
SLIDE 24

Insider Threat Detection* (with former PhD student Dr. Pallabi Parveen)

  • Insider threat detection requires the identification of rare anomalies in contexts where evolving behaviors tend to mask such anomalies.
  • We have designed and developed an ensemble-based stream mining algorithm based on supervised learning that addresses this challenge by maintaining an evolving collection of multiple models to classify dynamic data streams of unbounded length.
  • The result is a classifier that exhibits substantially increased classification accuracy for real insider threat streams relative to traditional supervised learning (traditional SVM and one-class SVM) and other single-model approaches.
  • We have also designed and developed an unsupervised, ensemble-based learning algorithm that maintains a compressed dictionary of repetitive sequences found throughout dynamic data streams of unbounded length to identify anomalies.
  • In unsupervised learning, compression-based techniques are used to model common behavior sequences.
  • This results in a classifier exhibiting a substantial increase in classification accuracy for data streams containing insider threat anomalies.
  • This ensemble of classifiers allows the unsupervised approach to outperform traditional static learning approaches and boosts effectiveness over supervised learning approaches.

* Sponsor: AFOSR

SLIDE 25

Application: Architecture for Evolving Insider Threat Detection*

[Architecture: system log → gather data from chunk i → feature extraction & selection → learning algorithm (supervised: one-class SVM, OCSVM; unsupervised: graph-based anomaly detection, GBAD) → ensemble-based stream analytics with an ensemble of models updated online; testing on data from chunk i+1 flags anomalies]

* Pallabi Parveen, Nate McDaniel, Jonathan Evans, Bhavani Thuraisingham, Kevin W. Hamlen, Latifur Khan: Evolving Insider Threat Detection Stream Mining Perspective. International Journal on Artificial Intelligence Tools, World Scientific Publishing, 2013

SLIDE 26
Unsupervised Sequence Learning

  • Normal users have a repetitive sequence of commands, system calls, etc.
  • A sudden deviation from normal behavior raises an alarm indicating an insider threat.
  • To find an insider threat, we need to collect these repeated sequences of commands in an unsupervised fashion.
  • First challenge: variability in sequence length. Overcome by generating an LZW dictionary of possible potential patterns in the gathered data using the Lempel-Ziv-Welch (LZW) algorithm.
  • Second challenge: huge size of the dictionary. Overcome by compressing the dictionary.

Pallabi Parveen, Nate McDaniel, Varun S. Hariharan, Bhavani M. Thuraisingham: Unsupervised Ensemble Based Learning for Insider Threat Detection. SocialCom/PASSAT 2012: 718-727, MIT, Boston, Nov 2012
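The LZW dictionary construction above can be sketched as follows; the command stream is a made-up example, and the real system works over much longer logs before compressing the dictionary:

```python
def lzw_dictionary(stream):
    """Collect repeated variable-length patterns LZW-style.

    Starts from single commands; each time a known pattern repeats, it is
    extended by one command, so frequent sequences accumulate as entries
    regardless of their length (handling the variable-length challenge)."""
    dictionary = {(c,) for c in stream}  # seed with single-command patterns
    w = ()
    for c in stream:
        wc = w + (c,)
        if wc in dictionary:
            w = wc               # keep extending a previously seen pattern
        else:
            dictionary.add(wc)   # record the newly observed pattern
            w = (c,)
    return dictionary

cmds = ["ls", "cd", "ls", "cd", "ls", "cd", "vi"]
patterns = lzw_dictionary(cmds)
print(("ls", "cd") in patterns)  # True: the repeated pair is captured
```

A session whose command sequences match few dictionary entries (or only low-weight ones) would then be flagged as anomalous.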

SLIDE 27
Construct a Quantized Dictionary

  • LZW Dictionary: contains the set of patterns q_jk and their corresponding weights x_jk = g_jk, where g_jk is the frequency of the pattern k in chunk j.
  • Quantized Dictionary: retains the pattern maximizing max {x_jk * Length(q_jk)}, where q_jk ⊆ P and P is the set of possible combinations of a particular pattern.

SLIDE 28

Lessons Learned

  • Our stream-guided sequence learning performed well, with a limited number of false positives compared to static approaches.
  • Our approach is a combination of compression and incremental learning; it adopts the advantages of both compression and ensemble-based learning.
  • In particular, compression offered unsupervised learning in a manageable manner, while ensemble-based learning offered adaptive learning.
  • The approach was tested on a real command line dataset and shows effectiveness over static approaches.

SLIDE 29

Adversarial Machine Learning: The Problem*

(with Prof. Murat Kantarcioglu) Sponsor: ARO

  • Adversary modifies data to defeat learning algorithms
  • It is not concept drift; it is not online learning
  • Adversary adapts to avoid being detected
  • During training time (i.e., data poisoning)
  • During test time (i.e., modifying features when data mining is deployed)
  • There is a game between the data miner and the adversary

Understanding Adversarial Learning

* Y. Zhou, M. Kantarcioglu, B. Thuraisingham, B. Xi: Adversarial support vector machine learning. ACM SIGKDD 2012. One of the early efforts on Adversarial Machine Learning for Cyber Security.

SLIDE 30

Solution Ideas and Threat Models

  • Constantly adapt your classifier to changing adversary behavior
  • Questions:
  • How to model this game?
  • Does this game ever end?
  • Is there an equilibrium point in the game?
  • Training time attacks:
  • Poison/modify the training data
  • Some attacks are tailored for specific f()
  • Test time/ Deployment time attacks
  • Attacker modifies x to x’
  • E.g., modify packet length by adding dummy bytes
  • Add good word to spam e-mail
  • Add noise to an image
  • Could be specific to f()


SLIDE 31

An Example Adversarial Attack Model

  • Free-range attack
  • The adversary can move malicious data anywhere in the feature space; the only knowledge the adversary needs is the valid range of each feature.
  • An attack is bounded in the following form: for every j in [1, d],

    C_f (x_j^min − x_ij) ≤ δ_ij ≤ C_f (x_j^max − x_ij)

  • C_f is between 0 and 1 and controls the aggressiveness of the attack: 0 means no attack and 1 means the most aggressive attack; x_j^max and x_j^min are the largest and smallest values the jth feature of a data point can take.
  • Other models: Restrained attack
  • Assumes the adversary would be reluctant to let a data point move far away from its original position, since great displacement often entails loss of malicious utility.
  • Developed an Adversarial SVM for each attack model.
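The free-range attack bound can be sketched directly in code: each feature displacement δ_ij is clamped to the interval [C_f (x_j^min − x_ij), C_f (x_j^max − x_ij)]. The feature ranges, sample, and desired shift below are illustrative values, not from the paper's experiments.

```python
def free_range_attack(x, x_min, x_max, c_f, desired_shift):
    """Move sample x toward desired_shift while respecting the
    free-range bound with aggressiveness c_f in [0, 1]."""
    attacked = []
    for j, xj in enumerate(x):
        lo = c_f * (x_min[j] - xj)   # most-negative allowed displacement
        hi = c_f * (x_max[j] - xj)   # most-positive allowed displacement
        delta = max(lo, min(hi, desired_shift[j]))  # clamp to [lo, hi]
        attacked.append(xj + delta)
    return attacked

x = [0.8, 0.6]  # a malicious sample with features in [0, 1]
attacked = free_range_attack(x, [0.0, 0.0], [1.0, 1.0], 0.5, [-1.0, 1.0])
print(attacked)
```

With c_f = 0.5 the adversary may move each feature only halfway toward its valid extreme, which is what makes the resulting AD-SVM optimization tractable.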

SLIDE 32

Adversarial Machine Learning

AD-SVM Example:

[Figure: the black dashed line is the standard SVM classification boundary, and the blue line is the Adversarial SVM (AD-SVM) classification boundary]

Threat Model Example: test time / deployment time attacks

  • Attacker modifies x to x′
  • Modify packet length by adding dummy bytes
  • Add a good word to a spam e-mail
  • Add noise to an image

SLIDE 33

Lessons Learned

  • Adversarial attacks can lead to severe misinterpretation of real data distributions in the feature space.
  • Learning algorithms lacking the flexibility to handle structural change in the samples would not cope well with attacks that modify data to change the makeup of the sample space.
  • We have examined two attack models, and our AD-SVM is more resilient to these attacks than other SVM learning algorithms.
  • Dynamic adaptation, the cost of adaptation, and the utility of the attacker and defender need to be considered.
  • Other issues not discussed but important:
  • Provenance of data
  • Adversarial active learning
  • Cyber Security for Data Science (and Data Science for Cyber Security) requires better understanding of the attacker.
  • Game theory provides natural tools for such modeling.
  • Begun a collaboration with the Army Research Lab and BBS.

SLIDE 34

Research Challenges in Secure Data: Summary

  • Designing and developing a Secure Data Management system that enforces flexible security policies and addresses malware attacks (e.g., SQL injection, covert channels)
  • Complexity of the Inference and Privacy Problems
  • Secure Dependable Data Management: integrating Cyber Security, Fault Tolerance and Real-time Processing
  • Theory of Assured Information Sharing: Logical Foundations (CBAC Model?)
  • Adversarial Machine Learning and Trustworthy Analytics
  • Analyzing and Securing Social Media: Fake News
  • We cannot forget about Access Control
  • Science of DASPY (Data and Applications Security and Privacy)?

SLIDE 35

Significant Impact of Our Research on the DoD and Federal Research Programs

  • Set the research program for the NSA and the DoD from 1990-1997 in Secure Deductive Data Management and the Inference Problem
  • Research transferred to secure database system products (e.g., Oracle, Sybase, Informix, Ingres, Ontos)
  • Secure Distributed Data Management System transferred to the Army's Maneuver Control System
  • Secure Real-time Data Management System transferred to the Air Force's AWACS
  • Spawned the research area of Privacy-Enhanced Data Management starting around 1996
  • DoD (AFOSR) MURI call and subsequent MURI research in Assured Information Sharing, 2005-2015
  • Set the National Privacy Research Strategy in Big Data Security and Privacy (for the Interagency Working Group), resulting in multiple research programs, 2015
  • Working with the Army Research Office on starting a program in Adversarial Machine Learning, 2017-present

SLIDE 36

Acknowledgements

  • Honeywell: Ms. Patricia Dwyer*, Dr. Paul Stachour, Dr. Thomas Haig
  • MITRE: Dr. Harvey Rubinovitz, Ms. Marie Collins, Mr. William Ford, Mr. John Maurer, Prof. Chris Clifton, Dr. Maria Zemankova*
  • The National Science Foundation: Dr. Maria Zemankova*, Dr. Carl Landwehr
  • The University of Texas at Dallas: including Prof. Latifur Khan, Prof. Murat Kantarcioglu, Prof. I-Ling Yen*, Ms. Rhonda Walls, and all our students, as well as members of the Cyber Security Research and Education Institute at UT Dallas
  • Colleagues: the CODASPY community, including Prof. Elisa Bertino*, Prof. Ravi Sandhu, Prof. Gail-Joon Ahn, Prof. Maribel Fernandez

* Part of my strong support group of women over the 32+ years

slide-37
SLIDE 37

My Night (or Day?) Job: Cyber Security Research and Education Institute

  • Over $40M in research funding and $10M in education funding, mainly from federal agencies, since 2005.
  • NSA/DHS Center for Academic Excellence in Cyber Security Education, June 2004 (CAE); NSA/DHS Center for Academic Excellence in Cyber Security Research, June 2008 (CAE-R); recertifications in 2014 through 2021.
  • NSA/DHS Center for Academic Excellence in Cyber Operations in June 2015; the first university in TX and 14th in the US
  • Prestigious grants and contracts, including the following:
  • Multiple NSF CAREER (100% success for NSF CAREER; 5/5)
  • Multiple AFOSR YIP
  • DoD MURI, DURIP and other larger grants (e.g., DARPA, IARPA …)
  • Grants/Contracts from ARO, AFOSR, ONR, NIH, NASA, NGA, . . .
  • NSF Large SaTC and multiple Medium SaTC
  • Multiple NSF SFS and Capacity Development
  • NSF MRI (Major Research Instrumentation)
  • Highly competitive NSF/VMware Partnership Research Grant
  • NSA Lablet in Science of Security (core member of the Vanderbilt team)
  • DHS in Cyber Physical Systems Security
  • UT System National Security Network Grant
  • SBIR Phase I and II
  • Fellowships/Awards: IEEE, AAAS, IACR Fellowships; IBM Faculty Award; IEEE/ACM Awards
  • Student placements in academia, industry and government
  • Papers in all top-tier journals and conferences in Cyber Security and Data Science (e.g., IEEE S&P, ACM CCS, NDSS, USENIX Security, ACM KDD, PVLDB, IEEE ICDE, IEEE ICDM, IEEE Big Data, …)
  • Significant outreach: Women in Cyber Security (WiCyS), Women in Data Science (WiDS), …
  • ECS collaborations with EPPS (Holmes and Brandt, et al.), JSOM (Bensoussan, et al.), BBS (Krawczyk, et al.) and NSM (Gel, Lary, et al.), with multiple jointly funded programs from NSF and DoD as well as papers.

slide-38
SLIDE 38

Thanks to Our Sponsors

slide-39
SLIDE 39

Lessons I Have Learned from My Career

  • I took a nontraditional career path that has resulted in commercial products and technology transfer, operational systems, publications, patents, books, keynote addresses, leadership roles, and, more importantly, an H-index of 52 and a citation count of 10,366 as of April 26, 2018.
  • How? Because I have treated my work as my hobby (even though working was not by choice).
  • Work on problems that are challenging, meaningful, have significance and make an impact.
  • Never ever give up – be persistent and learn from your mistakes (do not dwell on them).
  • Do not let anyone undermine you; focus on what you have to do, and do not get distracted or discouraged; only your enemies will benefit from your downfall.
  • Know what your strengths and weaknesses are – do not fool yourself.
  • Have role models – especially for women and underrepresented minorities.
  • You need family support: I owe most to three people: (i) my husband, who has always been there for me for 43 years; (ii) my mother, who always said "Clever Girl" regardless of whether I came first or tenth in class; and (iii) my oldest sister, who supported me financially for four years after I lost my father at 16.

slide-40
SLIDE 40

Where Do I Go from Here?

  • Continue to work on interesting and challenging problems that will not only have an impact but also receive many citations
  • Continue to mentor my students and colleagues, as well as members of WiCyS and WiDS
  • Continue to make UTD's Cyber Security Research and Education Institute a premier organization, and integrate Cyber Security with Data Science
  • While continuing to write research proposals (e.g., center-scale grants), also focus on education proposals, including NRT and ADVANCE
  • In the longer term, another three-year stint in Washington, DC? This is where one can truly build a community, as I did in Data Mining, Security and Privacy at NSF, 2001-2004.
  • Eventually go back to my roots? Logic, Computational Complexity and Data Security, integrated with Software Development
