

slide-1
SLIDE 1

Implementing Database Access Control Policy from Unconstrained Natural Language Text

LAS Research Presentation John Slankas June 24th, 2015

1

Relation Extraction slides are from Dan Jurafsky’s NLP Course on Coursera

slide-2
SLIDE 2

Research Path & Publications

2

Policy 2012, NaturaLiSE 2013, ICSE Doctoral Symposium 2013, PASSAT 2013, ACSAC 2014, ESEM 2015¹, RE 2014³, ESEM 2014², ASE Science Journal 2013

Feasibility · Classification · Access Control Extraction · Database Model Extraction

¹ to be submitted   ² 2nd author   ³ 3rd author

slide-3
SLIDE 3

Agenda

  • Motivation
  • Research Goal
  • Background and Related Work – focus on Relation Extraction
  • Solution - Role Extraction and Database Enforcement
  • Studies
  • Classification
  • Access Control Extraction
  • Database Model Extraction & End to End Implementation
  • Limitations
  • Future Work
  • Research Goal Evaluation & Contributions

3

slide-4
SLIDE 4

Motivation Goal Related Work Solution Studies Limitations Future Work

2015 – The Year of the Healthcare Hack

[Peterson 2015]

Two major breaches:
  Anthem – 80 million records
  Premera – 11 million records
Experts fault Anthem for lack of robust access control

[Bennett 2015] [Husain 2015] [Redhead 2015] [Westin 2015] 4

slide-5
SLIDE 5

5

Motivation Goal Related Work Solution Studies Limitations Future Work

A Possibility…

slide-6
SLIDE 6

Motivation Goal Related Work Solution Studies Limitations Future Work

Research Goal

Improve security and compliance by ensuring access control rules (ACRs) explicitly and implicitly defined within unconstrained natural language product artifacts are appropriately enforced within a system’s relational database.

6

slide-7
SLIDE 7

Motivation Goal Related Work Solution Studies Limitations Future Work

Background

Access Control Rules (ACRs)

Regulate who can perform actions on resources (subject, action, object)

Database Model Elements (DMEs): organization of stored data

Entities: a "thing" in the real world
Attributes: a property that describes an entity
Relationships: an association between two entities

7

slide-8
SLIDE 8

Extracting relations from text

Company report: "International Business Machines Corporation (IBM or the company) was incorporated in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)…"

Extracted Complex Relation:

Company-Founding

Company: IBM
Location: New York
Date: June 16, 1911
Original-Name: Computing-Tabulating-Recording Co.

But we will focus on the simpler task of extracting relation triples

Founding-year(IBM,1911) Founding-location(IBM,New York)

slide-9
SLIDE 9

Extracting Relation Triples from Text

The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is an American private research university located in Stanford, California … near Palo Alto, California… Leland Stanford…founded the university in 1891

Stanford EQ Leland Stanford Junior University
Stanford LOC-IN California
Stanford IS-A research university
Stanford LOC-NEAR Palo Alto
Stanford FOUNDED-IN 1891
Stanford FOUNDER Leland Stanford

slide-10
SLIDE 10

Why Relation Extraction?

Create new structured knowledge bases, useful for any app
Augment current knowledge bases:
  Adding words to WordNet thesaurus, facts to FreeBase or DBPedia

Support question answering

The granddaughter of which actor starred in the movie “E.T.”?

(acted-in ?x “E.T.”)(is-a ?y actor)(granddaughter-of ?x ?y)

But which relations should we extract?

10

slide-11
SLIDE 11

Automated Content Extraction (ACE)

PHYSICAL: Located, Near
PART-WHOLE: Geographical, Subsidiary
PERSON-SOCIAL: Business, Family, Lasting Personal
ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
ARTIFACT: User-Owner-Inventor-Manufacturer
GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin

17 relations from 2008 “Relation Extraction Task”

slide-12
SLIDE 12

Automated Content Extraction (ACE)

Physical-Located PER-GPE

He was in Tennessee

Part-Whole-Subsidiary ORG-ORG

XYZ, the parent company of ABC

Person-Social-Family PER-PER

John’s wife Yoko

Org-AFF-Founder PER-ORG

Steve Jobs, co-founder of Apple…

12

slide-13
SLIDE 13

Databases of Wikipedia Relations

13

Relations extracted from Infobox:
  Stanford state: California
  Stanford motto: "Die Luft der Freiheit weht"
  …

Wikipedia Infobox

slide-14
SLIDE 14

Relation databases that draw from Wikipedia

Resource Description Framework (RDF) triples

subject                     predicate              object
Golden Gate Park            location               San Francisco
dbpedia:Golden_Gate_Park    dbpedia-owl:location   dbpedia:San_Francisco

DBPedia: 1 billion RDF triples, 385 million from English Wikipedia

Frequent Freebase relations:
  people/person/nationality, location/location/contains
  people/person/profession, people/person/place-of-birth
  biology/organism_higher_classification
  film/film/genre

14

slide-15
SLIDE 15

Ontological relations

IS-A (hypernym): subsumption between classes
  Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal…
Instance-of: relation between individual and class
  San Francisco instance-of city

Examples from the WordNet Thesaurus

slide-16
SLIDE 16

How to build relation extractors

  • 1. Hand-written patterns
  • 2. Supervised machine learning
  • 3. Semi-supervised and unsupervised
      Bootstrapping (using seeds)
      Distant supervision
      Unsupervised learning from the web

slide-17
SLIDE 17

Rules for extracting IS-A relation Early intuition from Hearst (1992)

“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”

What does Gelidium mean? How do you know?

slide-18
SLIDE 18

Rules for extracting IS-A relation

Early intuition from Hearst (1992)

“Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use”

What does Gelidium mean? How do you know?

slide-19
SLIDE 19

Hearst’s Patterns for extracting IS-A relations

(Hearst, 1992): Automatic Acquisition of Hyponyms

"Y such as X ((, X)* (, and|or) X)"
"such Y as X"
"X or other Y"
"X and other Y"
"Y including X"
"Y, especially X"

slide-20
SLIDE 20

Hearst’s Patterns for extracting IS-A relations

Hearst pattern      Example occurrences
X and other Y       ...temples, treasuries, and other important civic buildings.
X or other Y        Bruises, wounds, broken bones or other injuries...
Y such as X         The bow lute, such as the Bambara ndang...
Such Y as X         ...such authors as Herrick, Goldsmith, and Shakespeare.
Y including X       ...common-law countries, including Canada and England...
Y, especially X     European countries, especially France, England, and Spain...
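A minimal sketch of how one such hand-written pattern could be applied with plain regular expressions; this is purely illustrative (the hearst_such_as helper and its crude two-word hypernym window are assumptions, and a real extractor would match over parsed noun phrases):

import re

# "Y such as X": the (at most two-word) phrase before "such as" is taken as the
# hypernym Y, the phrase after it as the hyponym(s) X.
SUCH_AS = re.compile(r"(?P<Y>\w+(?:\s\w+)?)\s*,?\s*such as\s+(?P<X>\w+(?:\s\w+)*)",
                     re.IGNORECASE)

def hearst_such_as(sentence):
    """Return (hyponym, 'IS-A', hypernym) triples for the 'Y such as X' pattern."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    hypernym = m.group("Y")
    hyponyms = re.split(r"\s*(?:,|\band\b|\bor\b)\s*", m.group("X"))
    return [(x, "IS-A", hypernym) for x in hyponyms if x]

print(hearst_such_as("Agar is a substance prepared from a mixture of red algae, "
                     "such as Gelidium, for laboratory or industrial use"))
# -> [('Gelidium', 'IS-A', 'red algae')]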

slide-21
SLIDE 21

Hand-built patterns for relations

Plus:
  Human patterns tend to be high-precision
  Can be tailored to specific domains
Minus:
  Human patterns are often low-recall
  A lot of work to think of all possible patterns!
  Don't want to have to do this for every relation!
  We'd like better accuracy

slide-22
SLIDE 22

Supervised machine learning for relations

Choose a set of relations we'd like to extract
Choose a set of relevant named entities
Find and label data:
  Choose a representative corpus
  Label the named entities in the corpus
  Hand-label the relations between these entities
  Break into training, development, and test sets
Train a classifier on the training set

22

slide-23
SLIDE 23

How to do classification in supervised relation extraction

  • 1. Find all pairs of named entities (usually in same sentence)
  • 2. Decide if 2 entities are related
  • 3. If yes, classify the relation

Why the extra step?

  • Faster classification training by eliminating most pairs
  • Can use distinct feature-sets appropriate for each task

23

slide-24
SLIDE 24

Relation Extraction

Classify the relation between two entities in a sentence

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

SUBSIDIARY FAMILY EMPLOYMENT NIL FOUNDER CITIZEN INVENTOR

slide-25
SLIDE 25

Word Features for Relation Extraction

Headwords of M1 and M2, and combination

Airlines Wagner Airlines-Wagner

Bag of words and bigrams in M1 and M2

{American, Airlines, Tim, Wagner, American Airlines, Tim Wagner}

Words or bigrams in particular positions left and right of M1/M2

M2: -1 spokesman M2: +1 said

Bag of words or bigrams between the two entities

{a, AMR, of, immediately, matched, move, spokesman, the, unit}

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said
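A minimal sketch of building such a feature set, assuming the two mentions have already been found and are given as token spans (the mention_pair_features helper and its last-token headword approximation are illustrative assumptions, not the features' exact definitions):

def mention_pair_features(tokens, m1_span, m2_span):
    """Build a sparse feature dict for one candidate mention pair.

    tokens           -- the sentence as a list of words
    m1_span, m2_span -- (start, end) token indices of the two mentions (end exclusive)
    """
    m1 = tokens[m1_span[0]:m1_span[1]]
    m2 = tokens[m2_span[0]:m2_span[1]]
    feats = {}
    # Headwords (approximated here as the last token of each mention) and their combination
    feats["head_m1=" + m1[-1]] = 1
    feats["head_m2=" + m2[-1]] = 1
    feats["heads=" + m1[-1] + "-" + m2[-1]] = 1
    # Bag of words in the mentions
    for w in m1 + m2:
        feats["mention_word=" + w] = 1
    # Words in particular positions left and right of M2
    if m2_span[0] > 0:
        feats["m2_left=" + tokens[m2_span[0] - 1]] = 1
    if m2_span[1] < len(tokens):
        feats["m2_right=" + tokens[m2_span[1]]] = 1
    # Bag of words between the two mentions
    for w in tokens[m1_span[1]:m2_span[0]]:
        feats["between=" + w] = 1
    return feats

sent = ("American Airlines , a unit of AMR , immediately matched the move , "
        "spokesman Tim Wagner said").split()
print(mention_pair_features(sent, (0, 2), (14, 16)))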

slide-26
SLIDE 26

Named Entity Type and Mention Level Features for Relation Extraction

Named-entity types

M1: ORG M2: PERSON

Concatenation of the two named-entity types

ORG-PERSON

Entity Level of M1 and M2 (NAME, NOMINAL, PRONOUN)

M1: NAME [it or he would be PRONOUN] M2: NAME [the company would be NOMINAL]

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said

slide-27
SLIDE 27

Parse Features for Relation Extraction

Base syntactic chunk sequence from one to the other

NP NP PP VP NP NP

Constituent path through the tree from one to the other

NP ↑ NP ↑ S ↑ S ↓ NP

Dependency path: Airlines ← matched ← said → Wagner

American Airlines [Mention 1], a unit of AMR, immediately matched the move, spokesman Tim Wagner [Mention 2] said

slide-28
SLIDE 28

Gazeteer and trigger word features for relation extraction

Trigger list for family: kinship terms

parent, wife, husband, grandparent, etc. [from WordNet]

Gazeteer:

Lists of useful geo or geopolitical words

Country name list Other sub-entities

slide-29
SLIDE 29

American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.

slide-30
SLIDE 30

Classifiers for supervised methods

Now you can use any classifier you like

MaxEnt Naïve Bayes SVM ...

Train it on the training set, tune on the dev set, test on the test set
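For instance, a small sketch of that train/tune/test loop using scikit-learn as an illustrative stand-in (any MaxEnt, Naïve Bayes, or SVM implementation would do); the toy feature dicts below are made up for the example:

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# One feature dict per candidate mention pair, with its gold relation label.
train_X = [{"heads=Airlines-Wagner": 1, "between=spokesman": 1},
           {"heads=Jobs-Apple": 1, "between=co-founder": 1},
           {"heads=XYZ-ABC": 1, "between=parent": 1}]
train_y = ["EMPLOYMENT", "FOUNDER", "SUBSIDIARY"]

model = make_pipeline(DictVectorizer(), LinearSVC())
model.fit(train_X, train_y)                 # train on the training set

dev_X = [{"heads=Airlines-Wagner": 1, "between=spokesman": 1}]
print(model.predict(dev_X))                 # tune features/parameters against the dev set
# Report final precision/recall/F1 once, on the held-out test set.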

slide-31
SLIDE 31

Evaluation of Supervised Relation Extraction

Compute P/R/F1 for each relation

31

P = (# of correctly extracted relations) / (total # of extracted relations)

R = (# of correctly extracted relations) / (total # of gold relations)

F1 = 2PR / (P + R)
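The same formulas as a small helper, for concreteness (the relation triples below are illustrative):

def precision_recall_f1(extracted, gold):
    """Score extracted relation triples against the gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("IBM", "founding-year", "1911"), ("IBM", "founding-location", "New York")}
extracted = {("IBM", "founding-year", "1911"), ("IBM", "founding-location", "Armonk")}
print(precision_recall_f1(extracted, gold))   # (0.5, 0.5, 0.5)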

slide-32
SLIDE 32

Summary: Supervised Relation Extraction

+ Can get high accuracies with enough hand-labeled training data, if the test set is similar enough to the training set

  • Labeling a large training set is expensive
  • Supervised models are brittle and don't generalize well to different genres

slide-33
SLIDE 33

Motivation Goal Related Work Solution Studies Limitations Future Work

Selected Related Work

Access Control Extraction

  • Requirements-based Access Control Analysis and Policy Specification [He 2009]
  • Automated Extraction and Validation of Security Policies from Natural Language Documents [Xiao 2009]

Database Model Extraction

  • English Sentence Structures and Entity-Relationship Diagrams [Chen 1983]
  • Heuristics-based Entity-Relationship Modeling through NLP [Omar 2004]
  • Conceptual Modeling of Natural Language Functional Requirements [Sagar 2014]

33

slide-34
SLIDE 34

Motivation Goal Related Work Solution Studies Limitations Future Work

Role Extraction and Database Enforcement

34

Process overview (figure): inputs are text documents, database design, and domain knowledge; outputs are generated SQL commands for access control, a completeness and conflict report, and a traceability report.

1) Parse natural language product artifacts
2) Classify sentence
3) Extract access control elements
4) Extract database model elements
5) Map data model to physical database schema
6) Implement access control

slide-35
SLIDE 35

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 1: Parse Natural Language Product Artifacts

Generate intermediate representation from text

35

“A nurse can order a lab procedure for a patient.”

Named entities: A = action, R = resource, S = subject
Parts of speech: NN = noun, VB = verb
Relationships: dobj = direct object, nn = noun compound modifier, nsubj = nominal subject, prep_for = prepositional modifier (for)
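The deck's tool builds this intermediate representation with the Stanford parser (see the backup slides); purely as an illustration of the same kind of output, here is a sketch using spaCy as a stand-in:

import spacy

nlp = spacy.load("en_core_web_sm")   # small English model (installed separately)
doc = nlp("A nurse can order a lab procedure for a patient.")

# Part-of-speech tags and typed dependencies: the raw material for the later steps.
for tok in doc:
    print(f"{tok.text:10} {tok.tag_:4} {tok.dep_:10} <- {tok.head.text}")

# Domain-specific named entities (S = subject, A = action, R = resource) would be
# layered on top of this output, e.g. nurse -> S, order -> A, lab procedure -> R.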

slide-36
SLIDE 36

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 2: Classify Sentence

Performs two classifications on each sentence:

1. Does the sentence contain ACRs?
2. Does the sentence contain DMEs?

Example 1 (the sentence from Step 1): ACRs – Yes, DMEs – Yes
Example 2: "Lab procedures have a date-ordered, lab-type, and current status." ACRs – No, DMEs – Yes

slide-37
SLIDE 37

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 3: Semantic Relation Extraction

37

Seed pattern (figure): a specific action verb (VB, A) with nsubj → NN (S)* and dobj → NN (R)*

Bootstrapping process (figure): Generate Seed Patterns; Apply Patterns; Match Subjects and Resources (Known Subjects & Resources); Subject & Resource Search; Pattern Extraction and Transformation; Classify Patterns; Pattern Set; Inject Patterns; producing Access Control Rules

slide-38
SLIDE 38

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 3: Extract Access Control Elements

38

Semantic Relations: (order, nurse, lab procedure), (order_for, nurse, patient)

Relational Patterns:
  • order – nsubj – nurse, – dobj – lab procedure
  • order – nsubj – nurse, – prep_for – patient

[Dependency graph: order (VB, A) with nsubj → nurse (NN, S), dobj → lab procedure (NN, R), prep_for → patient (NN, R), aux → can (MD)]

Use semantic relations to extract information

Access Control Rules: (nurse, order, lab procedure, create), (nurse, order_for, patient, read)
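A minimal sketch of this extraction step over typed-dependency triples; the triple format, the single seed pattern, and the create/read defaults shown here are simplifications for illustration, not the tool's full pattern machinery:

# Typed dependencies for "A nurse can order a lab procedure for a patient."
deps = [("order", "nsubj", "nurse"),
        ("order", "aux", "can"),
        ("order", "dobj", "lab procedure"),
        ("order", "prep_for", "patient")]

def extract_acrs(deps):
    """Seed pattern: a verb with an nsubj is an action; its subject may act on the
    dobj (create) and on any prep_* object (read)."""
    by_gov = {}
    for gov, rel, dep in deps:
        by_gov.setdefault(gov, {}).setdefault(rel, []).append(dep)
    rules = []
    for action, rels in by_gov.items():
        for subject in rels.get("nsubj", []):
            for resource in rels.get("dobj", []):
                rules.append((subject, action, resource, "create"))
            for rel, objs in rels.items():
                if rel.startswith("prep_"):
                    suffix = rel.split("_", 1)[1]
                    for resource in objs:
                        rules.append((subject, action + "_" + suffix, resource, "read"))
    return rules

print(extract_acrs(deps))
# [('nurse', 'order', 'lab procedure', 'create'), ('nurse', 'order_for', 'patient', 'read')]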

slide-39
SLIDE 39

39

Semantic Relations: (order, nurse, lab procedure), (order_object_for, lab procedure, patient)

Semantic Relational Patterns:
  • order – nsubj – nurse, – dobj – lab procedure
  • order – dobj – lab procedure, – prep_for – patient

[Dependency graph: order (VB, A) with nsubj → nurse (NN, S), dobj → lab procedure (NN, R), prep_for → patient (NN, R), aux → can (MD)]

Use semantic relations to extract information

Database Elements:
  Entities: lab procedure, patient
  Relationships: nurse orders lab procedure; lab procedure for patient

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 4: Extract Database Model Elements

slide-40
SLIDE 40

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 5: Map Data Model to Physical Database Schema

  • Merge ACRs and database model elements
  • Map subjects to roles
  • Map objects to tables

40

Access Control Rules: (nurse, order, lab procedure, create), (nurse, order_for, patient, read)

Database Elements:
  Entities: lab procedure, patient
  Relationships: nurse orders lab procedure; lab procedure for patient

Physical Database Schema: lab_procedure_tbl, patient_tbl, lab_procedure_patient_tbl; role: nurse_rl

Merged ACRs: (nurse, order, lab procedure, create), (nurse, order_for, patient, read), (nurse, order, lab procedure_patient, create), (nurse, order_for, lab procedure_patient, read)

Database Access Rules: (nurse, lab procedure, create), (nurse, patient, read), (nurse, lab procedure_patient, create/read)
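A minimal sketch of this merge-and-map step; the role_map, table_map, and perm_map dictionaries below are made-up stand-ins for the database design and domain knowledge inputs:

# Illustrative name mappings; in the actual process these come from the database
# design and domain knowledge, and unmapped names are reported for manual review.
role_map = {"nurse": "nurse_rl"}
table_map = {"lab procedure": "lab_procedure_tbl",
             "patient": "patient_tbl",
             ("lab procedure", "patient"): "lab_procedure_patient_tbl"}
perm_map = {"create": "insert", "read": "select"}

def to_database_rules(acrs):
    """Collapse (subject, action, object, access) ACRs into per-table privileges."""
    grants = {}   # (role, table) -> set of SQL privileges
    for subject, _action, obj, access in acrs:
        role, table = role_map.get(subject), table_map.get(obj)
        if role is None or table is None:
            print("unmapped:", subject, "/", obj)   # sanity-check report
            continue
        grants.setdefault((role, table), set()).add(perm_map[access])
    return grants

merged_acrs = [("nurse", "order", "lab procedure", "create"),
               ("nurse", "order_for", "patient", "read"),
               ("nurse", "order", ("lab procedure", "patient"), "create"),
               ("nurse", "order_for", ("lab procedure", "patient"), "read")]
print(to_database_rules(merged_acrs))
# {('nurse_rl', 'lab_procedure_tbl'): {'insert'}, ('nurse_rl', 'patient_tbl'): {'select'},
#  ('nurse_rl', 'lab_procedure_patient_tbl'): {'insert', 'select'}}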

slide-41
SLIDE 41

Motivation Goal Related Work Solution Studies Limitations Future Work

Step 6: Implement Access Control

  • Perform Sanity Checks
  • Conflict detection
  • Unmapped subjects and resources
  • Generate SQL Commands

create role nurse_rl;
grant insert on lab_procedure_tbl to nurse_rl;
grant select on patient_tbl to nurse_rl;
grant insert, select on lab_procedure_patient_tbl to nurse_rl;

41
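A small sketch of emitting such statements from the per-table privilege map built in the previous sketch (same illustrative layout, not the tool's actual generator):

def generate_sql(grants):
    """Emit role creation and GRANT statements from (role, table) -> privileges."""
    stmts = ["create role {};".format(role) for role in sorted({r for r, _ in grants})]
    for (role, table), privs in sorted(grants.items()):
        stmts.append("grant {} on {} to {};".format(", ".join(sorted(privs)), table, role))
    return "\n".join(stmts)

print(generate_sql({("nurse_rl", "lab_procedure_tbl"): {"insert"},
                    ("nurse_rl", "patient_tbl"): {"select"},
                    ("nurse_rl", "lab_procedure_patient_tbl"): {"insert", "select"}}))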

slide-42
SLIDE 42

Motivation Goal Related Work Solution Studies Limitations Future Work

Process Challenges

  • Ambiguity
  • Pronouns
  • Missing elements
  • “Generic” words – (e.g., list, item, data)
  • Synonyms
  • Negativity
  • Schema mismatches
  • Names
  • Cardinality

42

slide-43
SLIDE 43

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 1: Classification Study

[NaturaliSE 2013]

43

Research ability to classify sentences Why?

  • What needs to be processed further
  • Prevent false positives

Focus

  • Processing activities needs
  • Determine appropriate sentence representation(s)
  • Classifier and feature performance

Study Documents: 11 Healthcare related documents

slide-44
SLIDE 44

44

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 1: What Classification Algorithm to Use?

Classifier            P      R      F1
Weighted Random       .047   .060   .053
50% Random            .044   .502   .081
Naïve Bayes           .227   .347   .274
SVM                   .728   .544   .623
NFR Locator (k-NN)    .691   .456   .549

Classifying Non-Functional Requirements

slide-45
SLIDE 45

45

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 1: What Classification Algorithm to Use?

Similarity Ratio   F1    % Classified
0.5                .85   56%
0.6                .82   63%
0.7                .78   74%
0.8                .75   86%
0.9                .71   96%
1.0                .70   96%
∞                  .63   100%

Security Requirements: k-NN with Similarity Check
Conclusion: Use ensemble-based classifier

slide-46
SLIDE 46

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 2: Access Control Extraction

[PASSAT 2013] [ASE Science Journal 2013] [ACSAC 2014]

46

Research ability to identify and extract access control rules Why?

  • Determine access control to implement

Focus

  • Bootstrap knowledge to find ACRs
  • Extend pattern set while preventing false positives

Study Documents

Document                  Domain       Document Type    # Sentences   # ACR Sentences   # ACRs   Fleiss' Kappa
iTrust                    Healthcare   Use Case         1160          550               2274     0.58
iTrust for Text2Policy    Healthcare   Use Case         471           418               1070     0.73
IBM Course Mgmt           Education    Use Case         401           169               375      0.82
CyberChair                Conf. Mgmt   Seminar Paper    303           139               386      0.71
Collected ACP Docs        Multiple     Sentences        142           114               258      n/a

slide-47
SLIDE 47

47

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 2: What Properties do ACR Sentences Have?

                                                         iTrust   iTrust_t2p   IBM CM   CyberChair   Collected
Text2Policy Pattern – Modal Verb                          210      130          46       71           93
Text2Policy Pattern – Passive voice w/ to Infinitive      66       21           10       39           9
Text2Policy Pattern – Access Expression                   32       7            5        1            18
Text2Policy Pattern – Ability Expression                  45       21           14       11           3
Number of sentences with multiple types of ACRs           383      146          77       105          36
Number of patterns appearing once or twice                680      173          162      184          97
ACRs with ambiguous subjects (e.g., "system", "user")     193      119          139      1            13
ACRs with blank subjects                                  557      206          29       187          5
ACRs with pronouns as subjects                            109      28           5        11           11
ACRs with ambiguous objects (e.g., entry, list, name)     422      228          45       47           34
Total Number of ACR Sentences                             550      418          169      139          114
Total Number of ACR Rules                                 2274     1070         375      386          258

slide-48
SLIDE 48

48

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 2: Identifying ACR Sentences

Document                  Precision   Recall   F1
iTrust for Text2Policy    .96         .99      .98
iTrust                    .90         .86      .88
IBM Course Management     .83         .92      .87
CyberChair                .63         .64      .64
Collected ACP             .83         .96      .89
All documents, 10-fold    .81         .84      .83

slide-49
SLIDE 49

49

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 2: Identifying ACR Sentences without Training Sets

[Chart: Classification Performance (F1) by Completion %]

slide-50
SLIDE 50

50

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 2: Extracting ACRs

                          Precision   Recall   F1
iTrust for Text2Policy    .80         .75      .77
iTrust for ACRE           .75         .60      .67
IBM Course Management     .81         .62      .70
CyberChair                .75         .30      .43
Collected ACP             .68         .18      .29

slide-51
SLIDE 51

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 3: Database Model Extraction

[ESEM 2015 (to submit)]

51

Research ability to extract database model and implement process from start to finish

Why?

  • Need to map ACRs to environment

Challenges

  • Patterns
  • Completeness

Case Study: Open Conference System

slide-52
SLIDE 52

52

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 3: Classification Results

Does a sentence have ACRs and/or DMEs?

                                      Precision   Recall   F1
OCS (train using CyberChair), ACR     .82         .29      .42
CyberChair (train using OCS), ACR     .75         .61      .67
OCS, 10-fold self-validation, ACR     .81         .78      .79
OCS (train using CyberChair), DME     .82         .29      .42
CyberChair (train using OCS), DME     .75         .61      .67
OCS, 10-fold self-validation, DME     .83         .78      .79

slide-53
SLIDE 53

53

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 3: Extracting DMEs

                              Precision   Recall   F1
Perfect Knowledge from ACRs   1.00        .89      .94
Results from ACR Process      1.00        .81      .90

slide-54
SLIDE 54

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 3: Database Design Extraction

54

Number of (System / Oracle / Process):
  ACRs: 52 / 730 / 686
  resolved subjects: 524 / 686
  resolved objects: 39 / 35
  merged rules: 272 / 481
  discovered roles: 7 / 21 / 1
  discovered entities: 52 / 223 / 213

slide-55
SLIDE 55

55

Motivation Goal Related Work Solution Studies Limitations Future Work

Limitations

Limitations

  • Text-based process
  • Conditional access
  • Rule-based resolution
  • Only considered access at a table level
  • Mapping discovered roles and entities to the actual database manually performed
  • Only examined one system within a given problem domain for end-to-end validation
  • System implementation may not match documentation
  • Different functionality
  • Effective dating/status in place of deletes/updates
slide-56
SLIDE 56

Motivation Goal Related Work Solution Studies Limitations Future Work

Future Work

  • Access control rules
  • Temporal orderings
  • Conditions / constraints
  • Database model elements
  • Field types
  • Values / ranges
  • Human computer interaction

56

slide-57
SLIDE 57

Research Goal Evaluation

Improve security and compliance by:

  • identifying and extracting access control rules (ACRs)
  • identifying and extracting database model elements (DMEs)
  • implementing defined access control rules in a system's database

Confirmation

  • Identify ACR Sentences: .83 F1
  • Extract ACRs: .29 to .77 F1
  • Identify DME Sentences: .79 F1
  • Extract DMEs: .90 F1
  • Generated # of ACRs: 272

57

slide-58
SLIDE 58

Motivation Goal Related Work Solution Studies Limitations Future Work

Contributions

  • Approach and supporting tool *
  • Sentence similarity algorithm
  • Bootstrapping algorithms
  • Labeled corpora*
  • Pattern distributions

* https://github.com/RealsearchGroup/REDE

58

slide-59
SLIDE 59

References

[Bennett 2015] Bennett, Cory. Weak Login Security at Heart of Anthem Breach. http://thehill.com/policy/cybersecurity/232158-weak-login-security-at-heart-of-anthem-breach. Accessed: 3/15/2015.
[Chen 1983] Chen, Peter. English Sentence Structure and Entity-Relationship Diagrams. Information Sciences 29: 127-149, 1983.
[He 2009] He, Q. and Antón, A.I. Requirements-based Access Control Analysis and Policy Specification (ReCAPS). Information and Software Technology, vol. 51, no. 6, pp. 993-1009, 2009.
[Husain 2015] Husain, Azam. What the Anthem Breach Teaches Us About Access Control. http://www.healthitoutcomes.com/doc/what-the-anthem-breach-teaches-us-about-access-control-0001. Accessed: 3/15/2015.
[Omar 2004] Omar, Nazlia. Heuristics-Based Entity-Relationship Modelling through Natural Language Processing. PhD Dissertation, University of Ulster, 2004.
[Peterson 2015] Peterson, Andrea. 2015 is already the year of the health-care hack – and it's only going to get worse. Washington Post, Washington D.C., 3/20/2015.
[Redhead 2015] Redhead, C. Stephen. Anthem Data Breach: How Safe Is Health Information Under HIPAA? http://fas.org/sgp/crs/misc/IN10235.pdf. Congressional Research Service Report. Accessed: 3/16/2015.
[Sagar 2014] Sagar, Vidhu Bhala R. Vidya and Abirami, S. Conceptual Modeling of Natural Language Functional Requirements. Journal of Systems and Software, vol. 88, pp. 25-41, 2014.
[Westin 2015] Westin, Ken. How Anthem Could be Breached. http://www.tripwire.com/state-of-security/incident-detection/how-the-anthem-breach-could-have-happened/. Accessed: 3/15/2015.
[Xiao 2009] Xiao, X., Paradkar, A., Thummalapenta, S. and Xie, T. Automated Extraction of Security Policies from Natural-Language Software Documents. International Symposium on the Foundations of Software Engineering (FSE), Raleigh, North Carolina, USA, 2012.

59

slide-60
SLIDE 60

References

[Slankas 2015] Slankas, John and Williams, Laurie, "Relation Extraction for Inferring Database Models from Natural Language Artifacts", 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM 2015), to be submitted.
[Slankas 2014] Slankas, John, Xiao, Xusheng, Williams, Laurie, and Xie, Tao, "Relation Extraction for Inferring Access Control Rules from Natural Language Artifacts", 2014 Annual Computer Security Applications Conference (ACSAC 2014), New Orleans, LA.
[Riaz 2014b] Riaz, Maria, Slankas, John, King, Jason, and Williams, Laurie, "Using Templates to Elicit Implied Security Requirements from Functional Requirements − A Controlled Experiment", ACM/IEEE 8th International Symposium on Empirical Software Engineering and Measurement (ESEM 2014), Torino, Italy, September 18-19, 2014.
[Riaz 2014a] Riaz, Maria, King, Jason, Slankas, John, and Williams, Laurie, "Hidden in Plain Sight: Automatically Identifying Security Requirements from Natural Language Artifacts", 2014 Requirements Engineering (RE 2014), Karlskrona, Sweden, August 25-29, 2014.
[Slankas 2013d] Slankas, John and Williams, Laurie, "Access Control Policy Identification and Extraction from Project Documentation", Academy of Science and Engineering Science Journal, Volume 2, Issue 3, pp. 145-159, 2013.
[Slankas 2013c] Slankas, John and Williams, Laurie, "Access Control Policy Extraction from Unconstrained Natural Language Text", 2013 ASE/IEEE International Conference on Privacy, Security, Risk, and Trust (PASSAT 2013), Washington D.C., USA, September 8-14, 2013.
[Slankas 2013b] Slankas, John and Williams, Laurie, "Automated Extraction of Non-functional Requirements in Available Documentation", 1st International Workshop on Natural Language Analysis in Software Engineering (NaturaLiSE 2013), San Francisco, CA.
[Slankas 2013a] Slankas, John, "Implementing Database Access Control Policy from Unconstrained Natural Language Text", 35th International Conference on Software Engineering - Doctoral Symposium (ICSE DS 2013), San Francisco, CA.
[Slankas 2012] Slankas, John and Williams, Laurie, "Classifying Natural Language Sentences for Policy", IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY 2012).

60

slide-61
SLIDE 61

Backup slides

61

slide-62
SLIDE 62

Additional Information

Other Solutions to Inappropriate Data Access

62

  • Auditing
  • Intrusion detection
  • Manually establish access control
  • Completeness
  • Correctness
  • Effort
slide-63
SLIDE 63

Additional Information

Machine Learning Background

63

  • Combines computer science and statistics
  • Supervised vs. Unsupervised
  • Sample algorithms
  • k-nearest neighbor (k-NN)
  • Naïve bayes
  • Decision trees
  • Regression
  • k-means clustering
slide-64
SLIDE 64

Additional Information

Semantic Relation Related Work

64

1992 Hearst – Automatic Acquisition of Hyponyms from Large Text Corpora
2004 Snow et al., Learning Syntactic Patterns for Automatic Hypernym Discovery
2005 Zhou et al., Exploring Various Knowledge in Relation Extraction

slide-65
SLIDE 65

Additional Information

Natural Language Parsers

65

Apache OpenNLP: http://opennlp.apache.org/
Berkeley Parser: http://nlp.cs.berkeley.edu/
BLLIP (Charniak-Johnson): http://bllip.cs.brown.edu/
GATE: https://gate.ac.uk
MALLET: http://mallet.cs.umass.edu/
Python Natural Language Toolkit: http://www.nltk.org/
Stanford Natural Language Parser: http://nlp.stanford.edu/

Criteria:

  • Performs well
  • Open-source, maintained, well-documented
  • Java
slide-66
SLIDE 66

Additional Information

NLP Outputs

66

POS Tagging: The/DT nurse/NN can/MD order/VB a/DT lab/NN procedure/NN for/IN a/DT patient/NN ./. Parse:

(ROOT (S (NP (DT The) (NN nurse)) (VP (MD can) (VP (VB order) (NP (DT a) (NN lab) (NN procedure)) (PP (IN for) (NP (DT a) (NN patient))))) (. .)))

Typed Dependency:

det(nurse-2, The-1)
nsubj(order-4, nurse-2)
aux(order-4, can-3)
root(ROOT-0, order-4)
det(procedure-7, a-5)
nn(procedure-7, lab-6)
dobj(order-4, procedure-7)
prep(order-4, for-8)
det(patient-10, a-9)
pobj(for-8, patient-10)

slide-67
SLIDE 67

Additional Information

Precision, Recall, F1 Measure

Precision (P) is the proportion of correctly predicted access control statements: P = TP / (TP + FP).
Recall (R) is the proportion of access control statements found: R = TP / (TP + FN).
F1 Measure is the harmonic mean between P and R: F1 = 2 × P × R / (P + R).

67

                   Expected: Yes     Expected: No
Predicted: Yes     True Positive     False Positive
Predicted: No      False Negative    True Negative

slide-68
SLIDE 68

Additional Information

Inter-rater Agreement (Fleiss’ Kappa)

How well do multiple raters agree beyond what’s possible by chance?

κ = (P̄ − P̄ₑ) / (1 − P̄ₑ)

The degree of agreement attained above chance divided by the degree of agreement possible above chance (P̄ is the mean observed agreement across items; P̄ₑ is the agreement expected by chance).

68

Fleiss' Kappa    Agreement Interpretation
<= 0             Less than chance
0.01 – 0.20      Slight
0.21 – 0.40      Fair
0.41 – 0.60      Moderate
0.61 – 0.80      Substantial
0.81 – 0.99      Almost perfect
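A small sketch of the standard Fleiss' kappa computation, for concreteness (the toy rating matrix is made up; rows are items, columns are categories, cells count raters):

def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters who put item i into category j."""
    n_items, n_raters = len(ratings), sum(ratings[0])
    # Mean observed agreement (P-bar).
    p_bar = sum((sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in ratings) / n_items
    # Chance agreement (P-bar-e) from the marginal category proportions.
    totals = [sum(row[j] for row in ratings) for j in range(len(ratings[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three raters labelling four sentences as ACR / not ACR.
print(round(fleiss_kappa([[3, 0], [2, 1], [0, 3], [1, 2]]), 2))   # 0.33 -> "fair"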

slide-69
SLIDE 69

Step 3: Semantic Relations

Use semantic relation extraction to extract access control elements from natural language text.
Semantic relation: underlying meaning between two concepts
Examples:

69

Hypernymy (is-a): users, such as nurses, authenticate …
Meronymy (whole-part): a patient's vital signs
Verb phrases: customers rent cars

slide-70
SLIDE 70

ACR(subject, action, resource, negativity, limitation, context, graph, permissions), where:
  subject: vertices composing the subject
  action: vertices composing the action
  resource: vertices composing the resource
  negativity: vertex representing negativity
  limitation: vertex representing limitation to a specific role
  context: vertices providing context to the access control policy
  graph: subgraph required to connect all previous vertices
  permissions: set of permissions associated with the current policy

Examples:
  ACR(nurse, order, lab procedure, ∅, ∅, ∅, (V: nurse, order, lab procedure; E: (order, nurse, nsubj), (order, lab procedure, dobj)), create)
  ACR(nurse, order, patient, ∅, ∅, ∅, (V: nurse, order, patient; E: (order, nurse, nsubj), (order, patient, prep_for)), read)

70

Additional Information

Access Control Rule Representation

slide-71
SLIDE 71

Additional Information

Database Model Element Representation

71

Entities: ({entity vertices}, connecting subgraph)
Attributes of Entities: ({entity vertices}, {attribute vertices}, connecting subgraph)
Relationships: ({relationship vertices}, {entity1 vertices}, {entity2 vertices}, connecting subgraph)

slide-72
SLIDE 72

Additional Information

Step 4: Database Model Patterns

72

[Pattern diagrams:]
Entity: a noun (NN, E) attached to a verb (VB) as nsubj, dobj, or prep_%
Relationship (Association): Entity_1 (NN, E, nsubj) – %verb (VB, R) – Entity_2 (NN, E, dobj)
Relationship (Aggregation / Composition): whole (NN, E, nsubj) – "have" (VB, R) – part (NN, E, dobj)
Relationship (Inheritance): Specific Entity (NN, E, nsubj) – "be" (VB, R) – prep_of – General Entity (NN, E)
Entity-attributes: Entity (NN, E) – poss – Attribute (NN, A); Entity (NN, E) – prep_of – Attribute (NN, A)

slide-73
SLIDE 73

Additional Information

Step 4: Extract Database Design

73

[Process diagram: Generate patterns from templates; Manually Identified Patterns; Wildcard Patterns; Classify Patterns; Transform Patterns; Inject Additional Patterns; Pattern Set; Pattern Search; Extract Database Design Elements; inputs include Extracted Access Control Rules and Known Entities and Relationships]

slide-74
SLIDE 74

74

Additional Information

REDE Application

slide-75
SLIDE 75

75

Additional Information

Negativity

  • Specific adjectives (unable)
  • Adverbs (not, never)
  • Determiners (no, zero, neither)
  • Nouns (none, nothing).
  • Negative verbs (stop, prohibit, forbid)
  • Negative prefixes for verbs
slide-76
SLIDE 76
  • 1. What document types contain NFRs in each of the 14 different categories?
  • 2. What characteristics, such as keywords or entities, do sentences assigned to each NFR category have in common?
  • 3. What machine learning classification algorithm has the best performance to identify NFRs?
  • 4. What sentence characteristics affect classifier performance?

76

Additional Information

Study 1: Research Questions

slide-77
SLIDE 77
  • Started from Cleland-Huang, et al.
  • Combined performance and scalability
  • Separated access control and audit from security
  • Added privacy, recoverability, reliability, and other

77

Additional Information

Study 1: Non-functional Requirement Categories

  • J. Cleland-Huang, R. Settimi, X. Zou, and P. Solc, "Automated Classification of Non-functional Requirements," Requirements Engineering, vol. 12, no. 2, pp. 103–120, Mar. 2007.

Access Control, Privacy, Audit, Recoverability, Availability, Performance & Scalability, Legal, Reliability, Look & Feel, Security, Maintenance, Usability, Operational, Other

slide-78
SLIDE 78

Lawrence Chung’s NFRs

accessibility, accountability, accuracy, adaptability, agility, auditability, availability, buffer space performance, capability, capacity, clarity, code-space performance, cohesiveness, commonality, communication cost, communication time, compatibility, completeness, component integration time, composability, comprehensibility, conceptuality, conciseness, confidentiality, configurability, consistency, coordination cost, coordination time, correctness, cost, coupling, customer evaluation time, customer loyalty, customizability, data-space performance, decomposability, degradation of service, dependability, development cost, development time, distributivity, diversity, domain analysis cost, domain analysis time, efficiency, elasticity, enhanceability, evolvability, execution cost, extensibility, external consistency, fault-tolerance, feasibility, flexibility, formality, generality, guidance, hardware cost, impact analyzability, independence, informativeness, inspection cost, inspection time, integrity, inter-operability, internal consistency, intuitiveness, learnability, main-memory performance, maintainability, maintenance cost, maintenance time, maturity, mean performance, measurability, mobility, modifiability, modularity, naturalness, nomadicity, observability, off-peak-period performance, operability, operating cost, peak-period performance, performability, performance, planning cost, planning time, plasticity, portability, precision, predictability, process management time, productivity, project stability, project tracking cost, promptness, prototyping cost, prototyping time, reconfigurability, recoverability, recovery, reengineering cost, reliability, repeatability, replaceability, replicability, response time, responsiveness, retirement cost, reusability, risk analysis cost, risk analysis time, robustness, safety, scalability, secondary storage performance, security, sensitivity, similarity, simplicity, software cost, software production time, space boundedness, space performance, specificity, stability, standardizability, subjectivity, supportability, surety, survivability, susceptibility, sustainability, testability, testing time, throughput, time performance, timeliness, tolerance, traceability, trainability, transferability, transparency, understandability, uniform performance, uniformity, usability, user-friendliness, validity, variability, verifiability, versatility, visibility, wrappability

78

slide-79
SLIDE 79

79

Additional Information

Study 1: Documents

Document Document Type Size AC AU AV LG LF MT OP PR PS RC RL SC US OT FN NA CCHIT Ambulatory Requirements Requirement 306 12 27 1 2 10 1 5 2 28 4 8 228 6 iTrust Requirement, Use Case 1165 439 44 2 2 18 2 9 9 9 55 2 734 376 PromiseData Requirement 792 164 20 36 10 50 26 89 7 75 4 12 71 101 19 340 Open EMR Install Manual Installation Manual 225 3 5 1 6 1 25 2 184 Open EMR User Manual User Manual 473 169 14 8 4 286 95 NC Public Health DUA DUA 62 1 20 4 1 41 US Medicare/Medicai d DUA DUA 140 1 26 17 5 2 108 California Correctional Health Care RFP 1893 94 120 9 85 133 94 52 13 16 13 193 14 38 987 409 Los Angeles County EHR RFP 1268 58 37 8 3 2 28 19 3 11 8 13 108 21 10 639 380 HIPAA Combined Rule CFR 2642 28 8 3 78 213 9 41 1 317 2018 Meaningful Use Criteria CFR 1435 8 116 1311 Health IT Standards CFR 1475 10 20 119 1 2 2 71 1 2 164 1146 Total 11876 979 276 57 152 68 413 207 300 100 50 43 563 148 82 3568 6076

slide-80
SLIDE 80

80

Study 1/RQ1: What document types contain what categories of NFRs?

  • All evaluated documents contained NFRs
  • RFPs had a wide variety of NFRs except look and feel
  • DUAs contained high frequencies of legal and privacy
  • Access control and/or security NFRs appeared in all of the documents
  • The low frequency of functional requirements and NFRs within CFRs exemplifies why tool support is critical to efficiently extract requirements from those documents

slide-81
SLIDE 81
  • 1. What patterns exist among sentences with access control rules?
  • 2. How frequently do different forms of ambiguity occur in sentences with access control rules?
  • 3. How effectively does our process detect sentences with access control rules?
  • 4. How effectively can the subject, action, and resource elements of ACRs be extracted?

81

Additional Information

Study 2: Research Questions

slide-82
SLIDE 82

82

Document                  Domain        # Sentences   # ACR Sentences   # ACRs   Fleiss' Kappa
iTrust                    Healthcare    1160          550               2274     0.58
iTrust for Text2Policy    Healthcare    471           418               1070     0.73
IBM Course Management     Education     401           169               375      0.82
CyberChair                Conf. Mgmt    303           139               386      0.71
Collected ACP Documents   Multiple      142           114               258      n/a

Additional Information

Study 2: Investigated Documents

slide-83
SLIDE 83

83

Additional Information

Study 2: ACR Patterns

Top ACR Patterns:

Pattern                          Num. of Occurrences
(VB root(NN nsubj)(NN dobj))     465 (14.1%)
(VB root(NN nsubjpass))          122 (3.7%)
(VB root(NN nsubj)(NN prep))     116 (3.5%)
(VB root(NN dobj))               72 (2.2%)
(VB root(NN prep_%))             63 (1.9%)

slide-84
SLIDE 84

84

Additional Information

Study 2: Ambiguity

Ambiguity                 Occurrence % in ACR Sentences
Pronouns                  3.2%
"System" / "user"         11.0%
No explicit subject       17.3%
Other ambiguous terms     21.5%
Missing objects           0.2%

Ambiguous terms: “list”, “name”, “record”, “data”, …

slide-85
SLIDE 85

85

Additional Information

Study 3: Case Study

System: Open Conference System
Version: 2.3.6, released May 28th, 2014
Language: PHP
Supported DBMSs: MySQL, PostgreSQL
Architecture: Web-based application
Number of PHP files: 1557
Number of lines in PHP files: 22198
Number of application-defined roles: 7
Number of database tables: 52
Number of fields in database tables: 369

slide-86
SLIDE 86

86

Additional Information

Study 3: Case Study

Number of sentences: 708
Number of ACR sentences: 327
Number of ACRs: 630
Number of DDE sentences: 329
Number of DDEs: 1002
Number of Entity DDEs: 748 (287 unique)
Number of Entity-Attribute DDEs: 99 (75 unique)
Number of Relationship DDEs: 155 (82 unique)
Number of DDE sentences with no ACRs: 2

slide-87
SLIDE 87

87

Additional Information

Study 3: Research Questions

slide-88
SLIDE 88

88

Motivation Goal Related Work Solution Studies Limitations Future Work

Study 3: Extracting ACRs

        Precision   Recall   F1
OCS     0.53        0.27     0.35

Top 10 ACR Extraction Errors:

Number of Times Missed   Error Type   Pattern
89                       FN           ( % VB root ( % NN dobj ))
36                       FN           ( % VB root ( % PRP nsubj )( % NN dobj ))
20                       FN           ( % VB root ( % NN prep_% ))
18                       FN           ( % VB root ( % NN nsubj )( % NN dobj ))
17                       FP           ( % VB root ( % NN nsubjpass ))
12                       FN           ( % VB root ( % PRP nsubj )( % NN prep_% ))
8                        FP           ( % VB root ( % PRP nsubj )( % NN dobj ))
5                        FN           ( allow VB root ( % PRP dobj )( % VB dep ( % NN dobj )))
5                        FN           ( % VB root ( % NN nsubj )( % NN prep_% ))
5                        FN           ( % VB root ( % NN nsubjpass ))

slide-89
SLIDE 89

Modified version of Levenshtein string edit distance: use words (vertices) instead of characters.

89

Additional Information

Sentence Similarity Algorithm

computeVertexDistance(Vertex a, Vertex b)
1: if a = NULL or b = NULL return 1
2: if a.partOfSpeech <> b.partOfSpeech return 1
3: if a.parentCount <> b.parentCount return 1
4: for each parent in a.parents
5:   if not b.parents.contains(parent) return 1
6: if a.lemma = b.lemma return 0
7: if a and b are numbers, return 0
8: if ner classes match, return 0
9: wnValue = wordNetSynonyms(a.lemma, b.lemma)
10: if wnValue > 0 return wnValue
11: return 1
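A minimal Python sketch of the same idea: ordinary Levenshtein distance computed over vertices with a pluggable distance. The simplified vertex comparison below is an assumption standing in for the full computeVertexDistance above, which also checks parents, NER classes, and WordNet synonyms:

def vertex_distance(a, b):
    """Simplified stand-in: vertices match only if part of speech and lemma agree."""
    if a is None or b is None or a["pos"] != b["pos"]:
        return 1
    return 0 if a["lemma"] == b["lemma"] else 1

def sentence_distance(s1, s2):
    """Levenshtein distance over dependency-graph vertices instead of characters."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                                   # delete
                          d[i][j - 1] + 1,                                   # insert
                          d[i - 1][j - 1] + vertex_distance(s1[i - 1], s2[j - 1]))
    return d[m][n]

s1 = [{"lemma": "nurse", "pos": "NN"}, {"lemma": "order", "pos": "VB"},
      {"lemma": "procedure", "pos": "NN"}]
s2 = [{"lemma": "doctor", "pos": "NN"}, {"lemma": "order", "pos": "VB"},
      {"lemma": "procedure", "pos": "NN"}]
print(sentence_distance(s1, s2))   # 1: only the subject vertex differs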

slide-90
SLIDE 90

Why is This Problem Difficult?

90

  • Ambiguity
  • Multiple ways to express the same meaning
  • Resolution issues
slide-91
SLIDE 91

Motivation: U.S. Data Breaches

91

Source: Privacy Rights Clearinghouse[2]

slide-92
SLIDE 92

Motivation: Healthcare Documentation

  • HIPAA
  • HITECH ACT
  • Meaningful Use Stage 1 Criteria
  • Meaningful Use Stage 2 Criteria
  • Certified EHR (45 CFR Part 170)
  • ASTM
  • HL7
  • NIST FIPS PUB 140-2
  • HIPAA Omnibus
  • NIST Testing Guidelines
  • DEA Electronic Prescriptions for Controlled Substances (EPCS)
  • Industry Guidelines: CCHIT, EHRA, HL7
  • State-specific requirements
  • North Carolina General Statute § 130A-480 – Emergency Departments
  • Organizational policies and procedures
  • Project requirements, use cases, design, test scripts, …
  • Payment Card Industry: Data Security Standard

93

The Scream, Edvard Munch, 1895

slide-93
SLIDE 93

Dissertation Thesis

Access control rules explicitly and implicitly defined within unconstrained natural language product artifacts can be effectively identified and extracted; Database design elements can be effectively identified and extracted; Mappings can be identified among the access control rules, database design elements, and the physical database implementation; and Role-based access control can be established within a system’s relational database.

94