SLIDE 1

UNCLASSIFIED

Grammatical Inference and Machine Learning Approaches to Post-Hoc LangSec

Sheridan Curley & Dr. Richard Harang (ARL)

The Nation’s Premier Laboratory for Land Forces

SLIDE 2: Outline

Theory approach
– Grammatical inference
– LangSec

Paper’s work
– Machine learning to bypass hardness
– Our experimental setup
– Results

Moving Forward

Conclusion

SLIDE 3: Grammatical Inference

Grammars are tuples: 𝑯 = ⟨𝑾, 𝚻, 𝑺, 𝑻⟩
– Set of nonterminal characters, 𝑾
– Set of terminal characters, 𝚻, where 𝚻 ∩ 𝑾 = ∅
  • AKA the alphabet
– Production rules, 𝑺: 𝑾 → (𝑾 ∪ 𝚻)*
– Set of starting characters, 𝑻 ⊂ 𝑾

Grammars generate languages (a sketch follows below):
– ℒ(𝑯) = {𝒙 ∈ 𝚻* : 𝑻 ⇒* 𝒙}, where ⇒* denotes the reflexive, transitive closure of the one-step derivation relation
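To make the tuple definition concrete, here is a minimal Python sketch (illustrative only, not from the slides; the toy grammar is an assumption) that represents 𝑯 = ⟨𝑾, 𝚻, 𝑺, 𝑻⟩ and enumerates part of ℒ(𝑯) by breadth-first derivation:

```python
from collections import deque

# Toy grammar H = <W, T, S, T_start> (hypothetical example for illustration).
W = {"S", "A"}                      # nonterminal characters
T = {"a", "b"}                      # terminal characters (the alphabet); T ∩ W = ∅
S = {                               # production rules S: W -> (W ∪ T)*
    "S": [["a", "A", "b"], []],     # S -> aAb | ε
    "A": [["a", "A"], ["a"]],       # A -> aA | a
}
T_start = {"S"}                     # starting characters, a subset of W

def language_sample(max_len=6):
    """Enumerate members of L(H) up to max_len symbols by repeatedly
    rewriting nonterminals, i.e. by following the reflexive, transitive
    closure of the one-step derivation relation."""
    seen, sentences = set(), set()
    queue = deque(tuple([s]) for s in T_start)
    while queue:
        form = queue.popleft()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        if all(sym in T for sym in form):   # only terminals left: a sentence
            sentences.add("".join(form))
            continue
        i = next(j for j, sym in enumerate(form) if sym in W)
        for rhs in S[form[i]]:              # rewrite the leftmost nonterminal
            queue.append(form[:i] + tuple(rhs) + form[i + 1:])
    return sorted(sentences)

print(language_sample())   # strings of L(H) up to the bound: '', 'aab', 'aaab', ...
```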

SLIDE 4: Chomsky’s Hierarchy

The Chomsky hierarchy:
– Defines the complexity of known languages
– 4 “levels”
– Lowest-level languages (example below):
  • “Regular”
  • “Context-Free” (Deterministic or Nondeterministic)

Image: “Chomsky Hierarchy.” Wikipedia. 30 April 2016. Web. <https://en.wikipedia.org/>.
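As a concrete illustration of the gap between the two lowest levels (an added example, not from the slide):

```latex
% L1 is regular (a finite automaton suffices); L2 is context-free but not
% regular (matching the counts requires a stack), by the pumping lemma.
\[
  L_1 = \{ (ab)^n : n \ge 0 \}, \qquad
  L_2 = \{ a^n b^n : n \ge 0 \}
\]
```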

SLIDE 5: Key Questions

The biggest questions are:
– Given a grammar, what language does it produce?
– Equivalence of grammars/languages
– Learning grammars from language samples

SLIDE 6: Inference Results

Most theory is negative:
– Languages above “Regular” cannot be learned in general
– Even probabilistic identification is hard
  • Valiant’s Probably Approximately Correct (PAC) framework (stated below)

Some languages have learnable properties:
– Angluin’s “pattern languages”
– Clark’s “nonterminally separated” (NTS) languages
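For reference, the PAC criterion mentioned above (a standard statement added here; the notation is conventional, not from the slide): a learner succeeds if, with high probability over an i.i.d. sample, its hypothesis is approximately correct.

```latex
% PAC learning (Valiant): for accuracy epsilon and confidence delta, the
% hypothesis h_S produced from a random sample S of size m must satisfy
\[
  \Pr_{S \sim D^{m}}\!\bigl[\operatorname{err}_{D}(h_S) \le \varepsilon\bigr]
  \;\ge\; 1 - \delta,
  \qquad \varepsilon, \delta \in (0, 1),
\]
% with m and the learner's running time polynomial in 1/epsilon and 1/delta.
```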

SLIDE 7: Pattern Language Example

Given: 𝚻 = {𝟏, 𝟐} and the pattern 𝒒 = 𝟐𝒚₂𝟏𝟐𝒚₃𝒚₄
Then: {𝟐𝟐𝟏𝟐𝟐𝟐, 𝟐𝟏𝟏𝟐𝟐𝟐, 𝟐𝟏𝟏𝟐𝟏𝟐} ⊆ ℒ(𝒒) (checked in the sketch below)

• Restricted language class
• Equivalence still NP-hard

Example taken from Angluin’s “Finding Patterns Common to a Set of Strings”.
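A hedged membership sketch (not from the slides): in Angluin-style pattern languages, each variable stands for a nonempty string over 𝚻, and repeated variables must take the same value. One way to test membership is to compile the pattern into a regex with backreferences:

```python
import re

def pattern_to_regex(pattern, alphabet):
    """Compile a pattern (list of terminals and variable names) into a regex:
    terminals match literally, each new variable becomes a nonempty capture
    group over the alphabet, and repeated variables become backreferences."""
    sigma = "[" + "".join(map(re.escape, sorted(alphabet))) + "]"
    seen, parts = {}, []
    for sym in pattern:
        if sym in alphabet:                 # terminal: match literally
            parts.append(re.escape(sym))
        elif sym in seen:                   # repeated variable: backreference
            parts.append(f"(?P={seen[sym]})")
        else:                               # new variable: nonempty group
            name = f"v{len(seen)}"
            seen[sym] = name
            parts.append(f"(?P<{name}>{sigma}+)")
    return re.compile("".join(parts))

T = {"1", "2"}
q = ["2", "y2", "1", "2", "y3", "y4"]       # q = 2 y2 1 2 y3 y4, as on the slide
rx = pattern_to_regex(q, T)
for x in ["221222", "211222", "211212", "2112"]:
    print(x, bool(rx.fullmatch(x)))         # first three are in L(q); the last is not
```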

SLIDE 8: NTS Languages

Clark’s Omphalos algorithm:
• Gives exact results
• Very slow
• May not converge in a reasonable amount of time

Example taken from Clark’s “Learning Deterministic Context Free Grammars: The Omphalos Competition”.

SLIDE 9: Language-Theoretic Security

Learning grammars is hard:
– Cannot determine whether a parser’s grammar is equivalent to another
– Cannot enumerate all “safe” or “bad” strings for a parser
– Cannot generically learn all parsers with one method

To be secure:
– Parsers must be restricted to the low end of the Chomsky hierarchy
– This can be difficult given existing practices

SLIDE 10: Learning vs. Recognition

Computers are discrete, computational:
– There must be some type of underlying structure
– It should be possible to recognize valid structure

Rather than exact learning (hard), try close recognition:
– Relax the assumptions

Apply machine learning:
– Build and train on feature vectors from language examples (see the sketch below)

Key differences:
– Exact learning builds “sentences” from parts using rules
– Machine learning recognizes the language with only the “letters” known
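A minimal sketch of the feature-vector step (an assumption about the general approach, not the authors’ code): each character becomes a one-hot vector, so a string becomes a matrix a network can consume, using nothing but the observed alphabet.

```python
import numpy as np

def one_hot_encode(strings):
    """Map each string to a (length, |alphabet|) one-hot matrix, using only
    the 'letters' observed in the examples -- no structure is assumed."""
    alphabet = sorted(set("".join(strings)))
    index = {ch: i for i, ch in enumerate(alphabet)}
    encoded = []
    for s in strings:
        mat = np.zeros((len(s), len(alphabet)), dtype=np.float32)
        for pos, ch in enumerate(s):
            mat[pos, index[ch]] = 1.0       # one-hot: a 1 in the character's column
        encoded.append(mat)
    return encoded, index

X, index = one_hot_encode(["/index.html", "/img/logo.png"])
print(X[0].shape)   # (11, number of distinct characters seen)
```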

SLIDE 11: Our Network

Multi-layered LSTM* network (sketched below):
– One-hot feature vector input
– Embedding layer
– 3 layers of LSTM
– Softmax output

*See Hochreiter & Schmidhuber’s “Long Short-Term Memory”.
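A hedged sketch of the described stack in PyTorch (layer sizes and the label count are assumptions, not the authors’ values). An nn.Embedding over character indices is the standard equivalent of multiplying a one-hot input by an embedding matrix:

```python
import torch
import torch.nn as nn

class URIRecognizer(nn.Module):
    """Character ids -> embedding -> 3 stacked LSTM layers -> softmax."""
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)     # embedding layer
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            num_layers=3, batch_first=True)  # 3 layers of LSTM
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, char_ids):                  # char_ids: (batch, seq_len) int64
        h = self.embed(char_ids)                  # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(h)                # h_n: (num_layers, batch, hidden_dim)
        return torch.softmax(self.out(h_n[-1]), dim=-1)   # softmax output

model = URIRecognizer(vocab_size=128)             # e.g. one id per ASCII character
probs = model(torch.randint(0, 128, (4, 20)))     # 4 dummy sequences of length 20
print(probs.shape)                                # torch.Size([4, 2])
```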

SLIDE 12: Long Short-Term Memory

A subtype of recurrent neural network:
– Feeds forward to the next levels
– Feeds into the same layer simultaneously
– Maintains a persistent “memory” that is edit-limited (see the update equations below)

Shown to be able to learn over “long distances”

Image: Olah, Christopher. “Understanding LSTM Networks.” Colah’s Blog. 27 Aug. 2015. Web. <http://colah.github.io/>.
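For reference, the standard LSTM cell update (conventional notation, added here; the gates f_t, i_t, o_t are what make the memory “edit-limited”):

```latex
\begin{aligned}
  f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate} \\
  i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate} \\
  o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate} \\
  \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate memory} \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{limited edit of the memory} \\
  h_t &= o_t \odot \tanh(c_t) && \text{output / recurrent state}
\end{aligned}
```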

SLIDE 13: Training Data

Labeled URI data from Apache server logs (see the sketch below):
– URI + response code only
– A URI can carry multiple labels

The URI is initially an unknown language:
– The network is given no prior structure information
– It knows nothing about the RFC or other rules regarding URIs
– URI is theoretically a CFG

The goal is validation:
– Recognizing only valid URIs
– Rejecting improper/invalid URIs
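A hedged sketch of the labeling step (the authors’ exact pipeline is not shown; the log format and the 2xx/3xx-means-valid rule are assumptions): extract (URI, response code) pairs from Apache common/combined log lines, letting a URI accumulate multiple labels.

```python
import re

# Matches the request and status fields of an Apache common/combined log line.
REQUEST = re.compile(r'"[A-Z]+ (?P<uri>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def label_uris(log_lines):
    """Map each URI to the set of labels observed for it across the log."""
    labels = {}
    for line in log_lines:
        m = REQUEST.search(line)
        if not m:
            continue
        ok = m.group("status").startswith(("2", "3"))   # assumed validity rule
        labels.setdefault(m.group("uri"), set()).add("valid" if ok else "invalid")
    return labels                                       # a URI can carry multiple labels

logs = ['127.0.0.1 - - [30/Apr/2016:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
        '127.0.0.1 - - [30/Apr/2016:12:00:01 +0000] "GET /%%bad%% HTTP/1.1" 400 0']
print(label_uris(logs))   # {'/index.html': {'valid'}, '/%%bad%%': {'invalid'}}
```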

SLIDE 14: Results of LSTM Application

SLIDE 15: Improving Results

Practical learning is possible:
– Recognition rate for grouped URIs is >99%
– However, the false positive rate is high

The network can be trained to recognize URIs:
– With no prior knowledge
– However, training is time consuming
– Practical use requires faster identification

SLIDE 16: Future Work

Possible: develop entropy-based rules
– Construct a quicker decision machine

Possible: test for vulnerability to malicious training
– The robustness of the result determines efficacy

SLIDE 17: Conclusion

Theory is often hard (very hard):
– Complicated languages have complicated structure
– No clear exact-learning results

Experimental results are promising:
– Despite the theory, we can “learn” valid URIs
– Not perfect, but may be good enough

Learning differences:
– “Exact” learning builds rules and start/end symbols from the given samples
– Machine learning builds a recognizer from the alphabet and the given samples
– Machine learning can recognize unlearnable languages

SLIDE 18

Questions?