IEEE S&P 2020 LangSec workshop. The geometry of syntax and semantics for directed file transformations. Steve Huntsman (1), Michael Robinson (2). (1) FAST Labs / Cyber Technology; (2) American University. 21 May 2020.


slide-1
SLIDE 1

IEEE S&P 2020 LangSec workshop

The geometry of syntax and semantics for directed file transformations

Steve Huntsman (1), Michael Robinson (2)

(1) FAST Labs / Cyber Technology; (2) American University

21 May 2020

slide-2
SLIDE 2

IEEE S&P 2020 LangSec workshop Geometry of syntax and semantics 2

string.h must be used carefully to prevent buffer overflows

  • X = strings of ASCII NULs and printable characters
  • G = cyclic shifts on individual characters
  • Goal: remove NULs and punctuation; make lowercase
  • This example is discussed in the paper
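The toy example above can be sketched in a few lines. This is only an illustration under stated assumptions: the alphabet is taken to be NUL plus the 95 printable ASCII characters, and a group element is taken to be one integer shift per character position. The names `ALPHABET`, `shift`, and `inverse` are hypothetical and not taken from the paper.

```python
# Hypothetical alphabet: NUL followed by the 95 printable ASCII characters.
ALPHABET = "\x00" + "".join(chr(c) for c in range(0x20, 0x7F))
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
N = len(ALPHABET)  # 96 symbols


def shift(s, g):
    """Act on string s with g, a sequence of per-character cyclic shifts."""
    if len(s) != len(g):
        raise ValueError("need one shift per character")
    return "".join(ALPHABET[(INDEX[ch] + k) % N] for ch, k in zip(s, g))


def inverse(g):
    """Group inverse: undo each per-character shift."""
    return [(-k) % N for k in g]


s = "Hi!\x00"
# Shift chosen so that 'H' becomes 'h'; the other characters are left alone.
g = [INDEX["h"] - INDEX["H"], 0, 0, 0]
t = shift(s, g)
```

Shifts of this kind change characters but never the string length, so every element has an inverse: `shift(t, inverse(g))` recovers `s`.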

slide-3
SLIDE 3

Transform files to achieve language-theoretic security

  • X = space of files in some fixed format (e.g., PDF)
  • G = various invertible transformations
  • Goal: eliminate nondeterministic syntax
  • Input ambiguity = vulnerability

slide-4
SLIDE 4

Patch binary code to secure critical legacy systems

  • X = space of disassembled binary code
  • G = “sugar-neutral” lifts, translations, etc.
  • Goal: parsimoniously patch a known vulnerability
  • Compiler/build options, dependencies make this hard

slide-8
SLIDE 8

Principal bundles model syntax and semantics

  • X = space of documents
  • G = group of invertible transformations
  • Think of X like a manifold and get something akin to a principal bundle P(X, G)
  • Locally looks like X × G
  • G acts on P nicely
  • E.g., X = S^1 (time of day); G = Z (epoch); P = R (as a helix above X)
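The clock example can be made concrete with a short sketch. It assumes time is measured in days, so that the circle X is represented by the interval [0, 1); the function names are hypothetical, and exact rationals are used only to avoid floating-point noise.

```python
from fractions import Fraction


def project(t):
    """pi : P = R -> X = S^1, forgetting the epoch (time of day in [0, 1))."""
    return t % 1


def act(t, n):
    """Action of n in G = Z on P: move n whole days along the fiber."""
    return t + n


def trivialize(t):
    """Product coordinates (x, g) in X x G for a point t of P."""
    n = t // 1  # floor: the epoch
    return (t - n, int(n))


t = Fraction(7, 4)  # 1.75 days: epoch 1, time of day 3/4
```

The group action moves a point up or down the helix without changing its projection, so `project(act(t, n)) == project(t)` for every integer `n`.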

slide-9
SLIDE 9

Principal bundles model syntax and semantics

  • E.g., X = S^1; G = (0, 1) with x ⊞ y := f(f^{-1}(x) + f^{-1}(y)) for invertible f : R → (0, 1)
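A minimal numerical sketch of this group law, assuming f is the logistic sigmoid (one possible choice of an invertible map from R onto (0, 1); the slide does not fix a particular f). With this choice the identity element is 1/2 and the inverse of x is 1 - x.

```python
import math


def f(u):
    """Logistic sigmoid: an invertible map R -> (0, 1) (an assumed choice of f)."""
    return 1.0 / (1.0 + math.exp(-u))


def f_inv(x):
    """Logit, the inverse of f."""
    return math.log(x / (1.0 - x))


def boxplus(x, y):
    """Transported addition: x [+] y := f(f^{-1}(x) + f^{-1}(y))."""
    return f(f_inv(x) + f_inv(y))


IDENTITY = f(0.0)  # 0.5


def inv(x):
    """Group inverse; equals 1 - x for this particular f."""
    return f(-f_inv(x))
```

For the sigmoid the operation has the closed form x ⊞ y = xy / (xy + (1 - x)(1 - y)), which makes it easy to check the sketch by hand.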

slide-10
SLIDE 10

Principal bundles model syntax and semantics

  • E.g., Hopf fibration S^1 → S^3 → S^2

slide-14
SLIDE 14

Connections model geometry directing transformations

  • Principal bundles are a natural arena for geometry realized through a connection
  • I.e., a “vertical” and “horizontal” direct sum decomposition of tangent spaces . . .
  • . . . that is equivariant under group action
  • Connects local product geometries via parallel transport
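In standard textbook notation (not taken from the slides), with π : P → X the bundle projection and R_g the right action of g ∈ G on P, the decomposition and its equivariance can be written as:

```latex
T_p P = V_p \oplus H_p, \qquad
V_p = \ker\left(d\pi_p\right), \qquad
(R_g)_* H_p = H_{p \cdot g} \quad \text{for all } g \in G .
```

The vertical space V_p is fixed by the bundle itself; choosing the horizontal spaces H_p is what it means to choose a connection.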

slide-21
SLIDE 21

Syntactic transformations must be invertible

  • This requirement of the mathematical model is really a hint about how to perform file transformations
  • Record (or in reverse, delete) details of atomic transformations in ancillae
  • objend ⇒ objend % objend -> endobj ⇒ endobj % objend -> endobj
  • Sugar-neutral: transformations should handle sugar, but not introduce or eliminate it
  • Suggests using normal forms
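The `objend` example above can be sketched as a pair of mutually inverse functions. This is an illustrative toy, not the authors' tooling: it assumes the ancilla is stored as a trailing PDF comment (`%` starts a comment in PDF syntax), and the names `apply_fix` and `undo_fix` are hypothetical.

```python
MARK = " % "    # '%' starts a comment in PDF syntax
ARROW = " -> "


def apply_fix(line, old, new):
    """Rewrite old -> new and record the rewrite in a trailing comment (the ancilla)."""
    if line.count(old) != 1 or new in line or MARK in line:
        raise ValueError("sketch only handles one unambiguous, unannotated occurrence")
    return line.replace(old, new) + MARK + old + ARROW + new


def undo_fix(line):
    """Invert apply_fix by reading the ancilla back and deleting it."""
    body, sep, note = line.rpartition(MARK)
    old, arrow, new = note.partition(ARROW)
    if not sep or not arrow:
        raise ValueError("no recorded transformation found")
    return body.replace(new, old)


fixed = apply_fix("objend", "objend", "endobj")
restored = undo_fix(fixed)
```

Because the comment records exactly what was changed, the repair loses no information: `restored` equals the original malformed token.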
slide-22
SLIDE 22

Normal forms simplify and disambiguate

    int i; for (i=0; i<10; i++) { z+=i; }
    int n=0; while (n<10) { x+=n; n++; }

(From Lacomis et al.)

         jmp @5
    @4:  jmp @9
    @8:  jne @19
         jmp @10
    @19: jmp @14
    @13:
    @14: jg @13
    @9:
    @10: jge @20
         jmp @8
    @5:
    @20: jge @21
         jmp @4
    @21:

    START;
    S
    do while b
      S
      do while b
        if b
          S
          do while b
            S
          enddo
        endif
        S
      enddo
      S
    enddo;
    HALT

(From Zhang and D’Hollander)

slide-23
SLIDE 23

Concrete syntax trees parameterize a principal bundle

  • G corresponds to semantics-preserving CST transformations
  • Equivalence class of CSTs corresponding to a given AST has group-theoretical and language security significance and indicates format redundancy
  • E.g., xref table in PDF (which nobody trusts)

slide-25
SLIDE 25

Dynamic concretization semantically enriches an AST

[Files] can be considered as an abstraction of their semantics. For example the syntax of [files] records the existence of [objects] and maybe their type but not [the trace of a parser or renderer], as defined by the semantics. [1]

  • Annotating (with, e.g., types) and cross-linking an AST gives a semantically rich derived graph
  • To understand a file, parse it . . .
  • . . . to understand it more, render/compile it

[1] [Cousot and Cousot], replacing “program” and “variable” with “file” and “object,” respectively.

slide-28
SLIDE 28

To transform syntax trees, transform derived graphs

     1 START
     2 do while b
     3   do while b
     4     do while b
     5       do while b
     6         S
     7       enddo
     8       S
     9     enddo
    10     if b
    11       do while b
    12         S
    13       enddo
    14       if b
    15         S
    16       endif
    17     endif
    18   enddo
    19 enddo
    20 HALT

Figure (syntax tree nodes, labeled by line number): 1 START; 2 do-while; 3 do-while; 4 do-while; 5 do-while; 6 S; 8 S; 10 if-else; 10 if; 11 do-while; 12 S; 14 if-else; 14 if; 15 S

  • Compilers parse source code into abstract syntax tree, then cross-link into control flow graph
  • PDF analogue: indirect object cross-references
  • Transform the derived graph to transform ASTs
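For the PDF analogue, a derived graph can be sketched by cross-linking indirect references of the form `N G R` (object number, generation number, keyword `R`). The object bodies below are hypothetical and heavily simplified, and scanning raw text with a regular expression is an assumption made for brevity; a real tool would work on parsed tokens, since strings and streams can contain look-alike text.

```python
import re

# Matches an indirect reference "N G R" (object number, generation, keyword R).
REF = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def derived_graph(objects):
    """Cross-link objects: map each object number to the object numbers it references."""
    return {
        num: sorted({int(m.group(1)) for m in REF.finditer(body)})
        for num, body in objects.items()
    }


# Hypothetical, simplified object bodies keyed by object number.
objects = {
    1: "<< /Type /Catalog /Pages 2 0 R >>",
    2: "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: "<< /Type /Page /Parent 2 0 R >>",
}
graph = derived_graph(objects)
```

The resulting adjacency map plays the role that a control flow graph plays for source code: the tree of objects plus the links between them.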

slide-30
SLIDE 30

To transform syntax trees, transform derived graphs

  • Compositionally transform derived graphs
  • Restructure; decompose/locally perturb
  • Invertibly reduce derived graphs back to syntax trees
  • Local AST dissimilarities suffice for geometry
  • E.g., elimination of nondeterministic syntax elements
  • E.g., local similarity to some reference file
  • This approach inherits compositionality of derived graphs and can be viewed through the lens of a category of lenses

Figure (diagram labels): unstructured control flow; structured CFG0 (ASM); structured CFG1 (IR1); structured CFG2 (IR2); disassembled binary; AST0 (ASM); AST1 (IR1); AST2 (IR2)

slide-34
SLIDE 34

Dissimilarities on file artifacts yield geometry

  • Attributed tree dissimilarities for CSTs
      • Edit distance exploits compositionality
      • Kernels are fast
  • Sequence dissimilarities for traces
      • Any parser IR (token sequence, CST, AST, etc.) defines a section associated to a set of execution traces
      • Software errors ⇒ section is typically local, but ideally global
  • Order metric for ontologies
      • Use with topological differential testing for de facto syntax ⇒ X
  • Wasserstein metric on functor from a small category to Set
      • E.g., small category = two parallel morphisms between two objects ⇒ functor = quiver
      • Convex relaxation of Hausdorff-style metric ⇒ linear program
      • Attributed/labeled structures not covered by this at present
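As one concrete instance of a sequence dissimilarity, the sketch below computes the unit-cost Levenshtein distance between two token sequences. The slides do not say which sequence dissimilarity is used for traces, so this choice and the example traces are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical parser traces, written as token sequences.
trace_a = ["obj", "dict", "stream", "endobj"]
trace_b = ["obj", "dict", "endobj"]
d = edit_distance(trace_a, trace_b)
```

This function is symmetric, satisfies the triangle inequality, and vanishes only on equal sequences, so it is a genuine metric on traces.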
slide-37
SLIDE 37

In general, consider fibrations endowed with geometry

  • A fibration is a generalization of a fiber bundle that retains desirable homotopy properties
  • Homotopy-equivalent fibers
  • Homotopy lifting property: if $f$ and $\tilde{f}_0$ make the outer square commute, there exists $\tilde{f}$ making the entire diagram commute
  • Key feature: a path in X can be uniquely lifted to a path in P
  • Homotopy type theory: dependent types are fibrations
  • Avoid invertibility requirement via monoidal fibrations?
  • Maybe, but trading simplicity for generality is not a good start

Diagram: $\mathrm{id} \times \{0\} : Y \to Y \times [0, 1]$; $\tilde{f}_0 : Y \to P$; $f : Y \times [0, 1] \to X$; $\pi : P \to X$; lift $\tilde{f} : Y \times [0, 1] \to P$
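Written as equations in standard notation (the symbol $\iota_0$ is introduced here for the inclusion and does not appear on the slide), the homotopy lifting property says that a commuting outer square admits a diagonal filler:

```latex
\iota_0 := \mathrm{id} \times \{0\} : Y \to Y \times [0,1], \quad y \mapsto (y, 0); \\
\pi \circ \tilde{f}_0 = f \circ \iota_0
\;\Longrightarrow\;
\exists\, \tilde{f} : Y \times [0,1] \to P
\ \text{ with } \
\tilde{f} \circ \iota_0 = \tilde{f}_0
\ \text{ and } \
\pi \circ \tilde{f} = f .
```

Taking Y to be a single point recovers path lifting: a path in X together with a chosen starting point in the fiber above its initial point lifts to a path in P.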

slide-39
SLIDE 39

Lenses can help with complex file transformations

  • Simple structural dependencies such as cross-references can be handled using a derived graph
  • Complex structural dependencies such as checksums can obstruct ad hoc transformations to a normal form
  • The notion of a lens provides a principled, compositional solution that permits modifications to a file to be automatically transported to its putative normal form
  • Lenses have been synthesized at small scale from specifications and translation examples, suggesting an approach for safely transforming files
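A minimal sketch of a lens that keeps a checksum consistent, assuming a toy record with a name, a payload, and a CRC-32 of the payload. The `Record` type and the `get`/`put` names are illustrative assumptions and do not come from any particular lens library or from the paper.

```python
import zlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Record:
    name: str
    payload: bytes
    checksum: int  # CRC-32 of payload; a stand-in for any derived field


def make(name, payload):
    """Build a well-formed record whose checksum matches its payload."""
    return Record(name, payload, zlib.crc32(payload))


def get(record):
    """View: the part of the file we want to edit."""
    return record.payload


def put(record, payload):
    """Update: write an edited view back, repairing the dependent checksum."""
    return replace(record, payload=payload, checksum=zlib.crc32(payload))


r = make("a.bin", b"hello")
r2 = put(r, b"HELLO")
```

On well-formed records this pair satisfies the usual lens laws: `put(r, get(r)) == r` (GetPut) and `get(put(r, v)) == v` (PutGet), which is what lets an edit to the view be transported back to the file without breaking the checksum.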

slide-41
SLIDE 41

Generalized lenses are Grothendieck fibrations

  • A generalized lens category can be defined in terms of a category C and a functor F : C^op → Cat
  • This recipe turns out to yield a Grothendieck fibration or fibered category
  • Generalized “total space” of a bundle
  • Many of the cases motivating the definition of this generalized lens category correspond specifically to bundles
  • Bimorphic lenses can be interpreted as trivial bundles (i.e., the total space is a Cartesian product)

slide-42
SLIDE 42

Semantics is a modulus (complete isomorphism invariant)

A mathematically attractive definition of semantics is that it is the invariant after translation. If we view translation as operators between different [representations], the fact that semantics is preserved after translation means that the generators for different [representations] are all similar to one another [i.e., generators commute with translations]. [2]

  • Moduli spaces or stacks describe the algebraic invariants associated to categories fibered in groupoids
  • For the moduli stack of elliptic curves the appropriate (coarse, i.e., automorphism-forgetting) modulus is the j-invariant
  • Modular forms are sections of line bundles on this stack
  • The role of “total space” is played by a Grothendieck fibration

[2] [E and Zhou]

slide-43
SLIDE 43

Thanks

BAE Systems FAST Labs @ https://bit.ly/2X2UwcP
steve.huntsman @ baesystems.com
paper @ https://arxiv.org/abs/2001.04952

slide-47
SLIDE 47

Syntax : semantics :: algebra : [arena for] geometry

  • LHS = categorical logic
      • Simply typed lambda calculus : Cartesian closed category
      • First-order logic : hyperdoctrine
      • Dependent type theory : locally Cartesian closed category
      • Homotopy type theory : elementary (∞, 1) topos
  • RHS = Isbell/sheaf/spectral duality; noncommutative topology
      • Boolean algebra : Stone space
      • Commutative C*-algebra : compact Hausdorff space
      • Commutative ring : affine scheme
      • Crossed product C*-algebra : principal bundle
  • No actual geometry yet, just spaces as substrates

The duality between syntax and semantics is really a manifestation of that between algebra and geometry. [3]

[3] [Awodey and Forssell]

slide-50
SLIDE 50

Bundle geometry connects algebra across a base space

  • Syntax transformations form a group (or groupoid)
  • Semantically distinct (representations of) files form a “base space”
  • Homotopy hypothesis: spaces = ∞-groupoids
  • Geometric interpretation for base vs. algebraic interpretation for paths
  • Unifying constructs: bundles and fibrations
  • Goal-directed file transformation imbues a notion of geometry connecting syntax and semantics