Secure and Efficient Parsing via Programming Language Theory
Neel Krishnaswami & Jeremy Yallop
Outline: parsing & security; types & algebras; staging & speed; speed & correctness
Parsing and security
parsing interpretation
Parser combinators: appeal
simplicity: parsers are functions
declarative: parsers resemble BNF

    star :: Parser a → Parser [a]
    star p = ps ⊕ empty
      where empty = return []
            ps = do x ← p
                    xs ← star p
                    return (x : xs)

    sexp = (lparen >> star sexp >> rparen) ⊕ atom

    sexp ::= LPAREN sexp* RPAREN | ATOM
Parser combinators: pitfalls
complexity: exponential (or worse)
declarative? not in practice

    p ⊕ q ≢ q ⊕ p
(demonstration)
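One way to see the pitfall, as a minimal sketch rather than the demonstration given in the talk: the toy combinators below use ordered choice (try the left alternative, fall back to the right). All names here (`parser`, `chr`, `seq`, `alt`, `eof`, `explode`) are hypothetical and belong to no particular library. Swapping the two alternatives changes whether the input "ab" is accepted.

```ocaml
(* Toy backtracking parsers: consume a prefix of the input and return a
   value plus the remaining input, or None on failure. *)
type 'a parser = char list -> ('a * char list) option

(* Match one specific character. *)
let chr (c : char) : char parser = function
  | x :: rest when x = c -> Some (c, rest)
  | _ -> None

(* Sequencing: run p, then q on what is left. *)
let seq (p : 'a parser) (q : 'b parser) : ('a * 'b) parser = fun s ->
  match p s with
  | None -> None
  | Some (a, rest) ->
    (match q rest with
     | None -> None
     | Some (b, rest') -> Some ((a, b), rest'))

let map (f : 'a -> 'b) (p : 'a parser) : 'b parser = fun s ->
  match p s with
  | None -> None
  | Some (a, rest) -> Some (f a, rest)

(* Ordered choice: commit to p if it succeeds, otherwise try q. *)
let alt (p : 'a parser) (q : 'a parser) : 'a parser = fun s ->
  match p s with
  | Some _ as r -> r
  | None -> q s

(* Succeed only at the end of the input. *)
let eof : unit parser = function [] -> Some ((), []) | _ -> None

let explode (s : string) : char list = List.of_seq (String.to_seq s)

let a = map (fun _ -> "a") (chr 'a')
let ab = map (fun _ -> "ab") (seq (chr 'a') (chr 'b'))

(* The same two alternatives in both orders, each followed by eof. *)
let a_then_ab = map fst (seq (alt a ab) eof)
let ab_then_a = map fst (seq (alt ab a) eof)

(* alt a ab commits to "a", after which eof sees the leftover 'b'. *)
let result_1 = a_then_ab (explode "ab")
let result_2 = ab_then_a (explode "ab")
```

Here `result_1` is `None` while `result_2` is `Some ("ab", [])`, so the grammar no longer reads like BNF, where the order of alternatives is irrelevant.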
ASP and its aims
asp: a combinator library with an unusual combination of features
    Conventional Interface
    Unsurprising Semantics
    Guaranteed Determinism
    Competitive Performance
ASP: interface
Abstract grammar interface (context-free expressions)

    type α t
    val chr: char → char t                   (real interface: arbitrary tokens)
    val eps: unit t
    val seq: α t → β t → (α * β) t
    val bot: α t
    val alt: α t → α t → α t
    val fix: (α t → α t) → α t
    val map: (α → β) → α t → β t

User-defined functions

    let option r = alt (map (fun _ → None) eps) (map (fun x → Some x) r)
    (also star, plus, infix, &c.)

Parsers from grammars

    val parser: α t → (char Stream.t → α)    (imperative stream; may fail!)
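To make the "user-defined functions" point concrete, here is a hedged sketch in plain ASCII OCaml. `GRAMMAR` transcribes the signature above; `Derived` writes `option`, `star` and `plus` once against it. The `Show` backend is purely hypothetical (it is not part of asp): it interprets a grammar as its printed form so that the example runs without a parser implementation.

```ocaml
(* ASCII transcription of the slide's interface. *)
module type GRAMMAR = sig
  type 'a t
  val chr : char -> char t
  val eps : unit t
  val seq : 'a t -> 'b t -> ('a * 'b) t
  val bot : 'a t
  val alt : 'a t -> 'a t -> 'a t
  val fix : ('a t -> 'a t) -> 'a t
  val map : ('a -> 'b) -> 'a t -> 'b t
end

(* User-defined combinators, written once against the interface. *)
module Derived (G : GRAMMAR) = struct
  open G
  let option r = alt (map (fun _ -> None) eps) (map (fun x -> Some x) r)
  let star r =
    fix (fun rest ->
        alt (map (fun _ -> []) eps)
            (map (fun (x, xs) -> x :: xs) (seq r rest)))
  let plus r = map (fun (x, xs) -> x :: xs) (seq r (star r))
end

(* Hypothetical backend: a grammar is just its printed form, so semantic
   actions passed to map are ignored. *)
module Show : sig
  include GRAMMAR
  val to_string : 'a t -> string
end = struct
  type 'a t = string
  let counter = ref 0
  let chr c = String.make 1 c
  let eps = "eps"
  let bot = "bot"
  let seq g h = "(" ^ g ^ " . " ^ h ^ ")"
  let alt g h = "(" ^ g ^ " | " ^ h ^ ")"
  let fix f =
    (* Invent a fresh variable name for the recursive occurrence. *)
    let x = Printf.sprintf "x%d" !counter in
    incr counter;
    "mu " ^ x ^ ". " ^ f x
  let map _ g = g
  let to_string g = g
end

module D = Derived (Show)

let printed_option = Show.to_string (D.option (Show.chr 'a'))
let printed_star = Show.to_string (D.star (Show.chr 'a'))
```

With this backend `printed_option` is `"(eps | a)"` and `printed_star` is `"mu x0. (eps | (a . x0))"`, which is the context-free expression each combinator builds.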
accepted or rejected?

    alt (map (fun _ → 1) (chr 'a')) (map (fun _ → 2) (chr 'b'))    accepted
    seq (chr 'a') (option (chr 'b'))                                accepted
    alt (map (fun _ → 1) (chr 'a')) (map (fun _ → 2) (chr 'a'))    rejected: disjunctive non-determinism ✗
    seq (option (chr 'a')) (option (chr 'a'))                       rejected: sequential non-determinism ✗
    (also reject: left recursion, non-left-factored)

Plan: use a type system to decide
context-free expressions
Context-free expressions (CFEs)

    g ::= ⊥ | g ∨ g′ | ϵ | c | g · g′ | x | µx. g

Semantics of CFEs

    ⟦⊥⟧γ      = ∅
    ⟦g ∨ g′⟧γ = ⟦g⟧γ ∪ ⟦g′⟧γ
    ⟦ϵ⟧γ      = {ε}
    ⟦c⟧γ      = {c}
    ⟦g · g′⟧γ = {w · w′ | w ∈ ⟦g⟧γ ∧ w′ ∈ ⟦g′⟧γ}
    ⟦x⟧γ      = γ(x)
    ⟦µx. g⟧γ  = fix(λX. ⟦g⟧(γ, X/x))

    fix(f) = ⋃_{i∈ℕ} L_i    where L_0 = ∅, L_{n+1} = f(L_n)
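The fixed-point clause can be run directly if languages are cut off at a maximum word length, which makes every set finite. The sketch below assumes exactly that: `denote n env g` computes the words of length at most `n`, and `Mu` iterates L_0 = ∅, L_{k+1} = f(L_k) until nothing changes. The constructor names and the length bound are illustrative choices, not part of the talk.

```ocaml
module W = Set.Make (String)

type cfe =
  | Bot
  | Eps
  | Chr of char
  | Alt of cfe * cfe
  | Seq of cfe * cfe
  | Var of string
  | Mu of string * cfe

(* Concatenate two languages, keeping only words of length <= n. *)
let concat n l m =
  W.fold
    (fun w acc ->
      W.fold
        (fun w' acc ->
          let ww = w ^ w' in
          if String.length ww <= n then W.add ww acc else acc)
        m acc)
    l W.empty

(* [denote n env g]: the words of length <= n in the language of g. *)
let rec denote n env g =
  match g with
  | Bot -> W.empty
  | Eps -> W.singleton ""
  | Chr c -> if n >= 1 then W.singleton (String.make 1 c) else W.empty
  | Alt (g1, g2) -> W.union (denote n env g1) (denote n env g2)
  | Seq (g1, g2) -> concat n (denote n env g1) (denote n env g2)
  | Var x -> List.assoc x env
  | Mu (x, body) ->
    (* Kleene iteration: start from the empty language and apply the body
       until the (bounded) language stops growing. *)
    let rec iterate l =
      let l' = denote n ((x, l) :: env) body in
      if W.equal l l' then l else iterate l'
    in
    iterate W.empty

(* mu x. eps | a . x, i.e. a* *)
let a_star = Mu ("x", Alt (Eps, Seq (Chr 'a', Var "x")))

(* mu s. eps | ( . s . ) . s, i.e. balanced parentheses *)
let dyck =
  Mu ("s", Alt (Eps, Seq (Chr '(', Seq (Var "s", Seq (Chr ')', Var "s")))))

let a_star_upto_3 = W.elements (denote 3 [] a_star)
let dyck_upto_4 = denote 4 [] dyck
```

For `a_star` the iterates are ∅, {ε}, {ε, a}, {ε, a, aa}, {ε, a, aa, aaa}, after which the bounded language is stable.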
ASP: equations
CFEs form an idempotent semiring

    g1 ∨ (g2 ∨ g3) = (g1 ∨ g2) ∨ g3
    g ∨ g′ = g′ ∨ g
    g ∨ ⊥ = g
    g ∨ g = g
    g1 · (g2 · g3) = (g1 · g2) · g3
    g · ϵ = g
    (g1 ∨ g2) · g = (g1 · g) ∨ (g2 · g)
    g · (g1 ∨ g2) = (g · g1) ∨ (g · g2)
    g · ⊥ = ⊥
    ⊥ · g = ⊥

(along with some equations for µ)
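A small worked instance of the equations, offered as an illustration rather than something taken from the slides. For three hypothetical characters a, b and c, distributivity left-factors an alternation and idempotence removes a duplicated alternative. In each line both sides denote the same language, but only the right-hand side avoids two alternatives that begin with the same character.

```latex
\begin{align*}
(a \cdot b) \lor (a \cdot c) &= a \cdot (b \lor c)
  && \text{by } g \cdot (g_1 \lor g_2) = (g \cdot g_1) \lor (g \cdot g_2) \\
(a \lor a) \cdot b &= a \cdot b
  && \text{by } g \lor g = g
\end{align*}
```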
ASP: types
Types for languages

    Types τ ∈ { NULL : 2; FIRST : P(Σ); FLAST : P(Σ) }

    τ1 ∨ τ2 = { NULL  = τ1.NULL ∨ τ2.NULL
                FIRST = τ1.FIRST ∪ τ2.FIRST
                FLAST = τ1.FLAST ∪ τ2.FLAST }

Type predicates

    τ1 # τ2 ≜ (τ1.FIRST ∩ τ2.FIRST = ∅) ∧ ¬(τ1.NULL ∧ τ2.NULL)

Properties of types

    If L ⊨ τ and M ⊨ τ′ and τ # τ′, then L ∪ M ⊨ τ ∨ τ′.
ASP: type system
Syntactic type system for LL(1) grammars

    Γ; ∆ ⊢ g : τ    Γ; ∆ ⊢ g′ : τ′    τ # τ′
    ─────────────────────────────────────────
            Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′

Semantic soundness

    If Γ; ∆ ⊢ g : τ and γ ⊨ Γ and δ ⊨ ∆ then ⟦g⟧(γ, δ) ⊨ τ

Type inference

    No type annotations needed (even for fixed points)
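A sketch of how such a checker might decide the four "accepted or rejected?" expressions. Only the alternation rule (τ # τ′ and τ ∨ τ′) is taken from the slides. The types given to ⊥, ϵ and single characters, and the side condition and result type for sequencing (`separable`), are assumptions based on the usual LL(1) reading of NULL, FIRST and FLAST. Variables and µ are left out, and `map` is dropped because semantic actions do not affect types.

```ocaml
module CS = Set.Make (Char)

type ty = { null : bool; first : CS.t; flast : CS.t }

type cfe =
  | Bot
  | Eps
  | Chr of char
  | Seq of cfe * cfe
  | Alt of cfe * cfe

(* tau # tau', as on the slide. *)
let apart t1 t2 =
  CS.is_empty (CS.inter t1.first t2.first) && not (t1.null && t2.null)

(* ASSUMED side condition for sequencing (not shown on the slides). *)
let separable t1 t2 =
  CS.is_empty (CS.inter t1.flast t2.first) && not t1.null

let rec check : cfe -> (ty, string) result = function
  | Bot -> Ok { null = false; first = CS.empty; flast = CS.empty }
  | Eps -> Ok { null = true; first = CS.empty; flast = CS.empty }
  | Chr c -> Ok { null = false; first = CS.singleton c; flast = CS.empty }
  | Alt (g1, g2) ->
    (match check g1, check g2 with
     | Ok t1, Ok t2 when apart t1 t2 ->
       Ok { null = t1.null || t2.null;
            first = CS.union t1.first t2.first;
            flast = CS.union t1.flast t2.flast }
     | Ok _, Ok _ -> Error "disjunctive non-determinism"
     | (Error _ as e), _ | _, (Error _ as e) -> e)
  | Seq (g1, g2) ->
    (match check g1, check g2 with
     | Ok t1, Ok t2 when separable t1 t2 ->
       (* t1 is not nullable here, so neither is the sequence. *)
       Ok { null = false;
            first = t1.first;
            flast =
              CS.union t2.flast
                (if t2.null then CS.union t2.first t1.flast else CS.empty) }
     | Ok _, Ok _ -> Error "sequential non-determinism"
     | (Error _ as e), _ | _, (Error _ as e) -> e)

let option r = Alt (Eps, r)

let accepted g = match check g with Ok _ -> true | Error _ -> false

(* The four expressions from the "accepted or rejected?" slide. *)
let ex1 = Alt (Chr 'a', Chr 'b')
let ex2 = Seq (Chr 'a', option (Chr 'b'))
let ex3 = Alt (Chr 'a', Chr 'a')
let ex4 = Seq (option (Chr 'a'), option (Chr 'a'))
```

Under these assumptions `ex1` and `ex2` are accepted, `check ex3` is `Error "disjunctive non-determinism"` and `check ex4` is `Error "sequential non-determinism"`.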
ASP: parsing with types
A simple parsing algorithm

    P (Γ; ∆ ⊢ g : τ) ∈ Env(Γ) → Env(∆) → Σ∗ ⇀ Σ∗

    P (Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′) γ̂ δ̂ [] =
        []      when (τ ∨ τ′).NULL
        fail    otherwise

    P (Γ; ∆ ⊢ g ∨ g′ : τ ∨ τ′) γ̂ δ̂ ((c :: _) as s) =
        P (Γ; ∆ ⊢ g : τ) γ̂ δ̂ s      when c ∈ τ.FIRST
        P (Γ; ∆ ⊢ g′ : τ′) γ̂ δ̂ s    when c ∈ τ′.FIRST
        fail                          otherwise

The parsing algorithm is sound and complete, i.e. it parses exactly the words of the language.
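The alternation case can be sketched as type-carrying combinators: each grammar value holds its type together with a recognizer that returns the unconsumed input, mirroring P : Σ∗ ⇀ Σ∗. This is an illustration under assumptions, not asp's implementation. The sequencing rule is assumed, µ is omitted, and one case not visible on the slide is added: when the next character is in neither FIRST set but the alternation is nullable, the nullable branch runs and consumes nothing. All names are hypothetical.

```ocaml
module CS = Set.Make (Char)

type ty = { null : bool; first : CS.t; flast : CS.t }

(* A grammar carries its type and a recognizer returning the rest of the input. *)
type grammar = { ty : ty; run : char list -> char list option }

exception Ill_typed of string

let eps =
  { ty = { null = true; first = CS.empty; flast = CS.empty };
    run = (fun s -> Some s) }

let chr c =
  { ty = { null = false; first = CS.singleton c; flast = CS.empty };
    run = (function x :: rest when x = c -> Some rest | _ -> None) }

(* ASSUMED typing rule for sequencing (not shown on the slides). *)
let seq p q =
  if p.ty.null || not (CS.is_empty (CS.inter p.ty.flast q.ty.first)) then
    raise (Ill_typed "sequential non-determinism");
  { ty =
      { null = false;
        first = p.ty.first;
        flast =
          CS.union q.ty.flast
            (if q.ty.null then CS.union q.ty.first p.ty.flast else CS.empty) };
    run = (fun s -> match p.run s with None -> None | Some rest -> q.run rest) }

let alt p q =
  if (p.ty.null && q.ty.null)
     || not (CS.is_empty (CS.inter p.ty.first q.ty.first)) then
    raise (Ill_typed "disjunctive non-determinism");
  let ty =
    { null = p.ty.null || q.ty.null;
      first = CS.union p.ty.first q.ty.first;
      flast = CS.union p.ty.flast q.ty.flast } in
  let nullable_branch = if p.ty.null then p else q in
  { ty;
    run =
      (fun s ->
        match s with
        | [] -> if ty.null then Some [] else None
        | c :: _ ->
          (* One character of lookahead selects the branch: no backtracking. *)
          if CS.mem c p.ty.first then p.run s
          else if CS.mem c q.ty.first then q.run s
          else if ty.null then nullable_branch.run s (* added case *)
          else None) }

let option g = alt eps g

let explode (s : string) : char list = List.of_seq (String.to_seq s)

(* Accept only if the whole input is consumed. *)
let accepts g s = match g.run (explode s) with Some [] -> true | _ -> false

(* a b? c *)
let abc = seq (seq (chr 'a') (option (chr 'b'))) (chr 'c')
```

`accepts abc "abc"` and `accepts abc "ac"` hold, `accepts abc "ab"` does not, and `alt (chr 'a') (chr 'a')` raises `Ill_typed` when the grammar is built, before any input is read.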
ASP: linear-time guarantee
Guarantee: well-typed parsers don’t back-track
[Plot: run time against input size; the curve is labelled linear-time]
Speed: combinators vs yacc
[Plot: run time against input size for combinators and yacc; one curve is labelled linear-time, the other also linear-time]
How can we close the gap? Staging
staging removes overhead
Parser combinators abstract over the grammar
Abstraction carries a performance penalty
    abstraction ⟹ overhead
Use staging to specialize code once grammar is known
Delay (quote) code that accesses the input stream
    ≪ peek stream == 'a' ≫
Evaluate code that depends only on the grammar
    abstraction + staging ⟹ no overhead
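The split between "depends only on the grammar" and "accesses the input stream" can be imitated in plain OCaml with closures. This is only a stand-in for real staging: no code is quoted or generated, and the names (`gram`, `first`, `interp`, `compile`) are hypothetical. `interp` re-inspects the grammar and recomputes FIRST sets while reading input. `compile` does that work once and returns a function containing only input-dependent code.

```ocaml
module CS = Set.Make (Char)

(* A toy grammar type with no epsilon, so nothing is nullable. *)
type gram = Chr of char | Seq of gram * gram | Alt of gram * gram

let rec first = function
  | Chr c -> CS.singleton c
  | Seq (g1, _) -> first g1
  | Alt (g1, g2) -> CS.union (first g1) (first g2)

(* Unstaged: grammar analysis is interleaved with reading the input. *)
let rec interp g s =
  match g, s with
  | Chr c, x :: rest when x = c -> Some rest
  | Chr _, _ -> None
  | Seq (g1, g2), _ ->
    (match interp g1 s with None -> None | Some rest -> interp g2 rest)
  | Alt (g1, g2), c :: _ ->
    (* FIRST sets are recomputed every time this node is visited. *)
    if CS.mem c (first g1) then interp g1 s
    else if CS.mem c (first g2) then interp g2 s
    else None
  | Alt _, [] -> None

(* "Staged" by hand: everything that depends only on the grammar runs here,
   once; the returned closure only touches the input. *)
let rec compile g : char list -> char list option =
  match g with
  | Chr c -> (function x :: rest when x = c -> Some rest | _ -> None)
  | Seq (g1, g2) ->
    let p1 = compile g1 and p2 = compile g2 in
    (fun s -> match p1 s with None -> None | Some rest -> p2 rest)
  | Alt (g1, g2) ->
    let f1 = first g1 and f2 = first g2 in
    let p1 = compile g1 and p2 = compile g2 in
    (function
      | c :: _ as s ->
        if CS.mem c f1 then p1 s else if CS.mem c f2 then p2 s else None
      | [] -> None)

let explode (s : string) : char list = List.of_seq (String.to_seq s)

(* ( (a | b) ) *)
let paren = Seq (Chr '(', Seq (Alt (Chr 'a', Chr 'b'), Chr ')'))

(* The grammar is processed once, here, before any input arrives. *)
let parse_paren = compile paren
```

Both versions compute the same results. With genuine staging the second stage would be generated code rather than a chain of closures, so the remaining indirect calls would disappear as well.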
less-naive staging
Binding-time improvements turn dynamic terms static

CPS-convert:

    if peek stream == 'a' then e
    peek (fun c → if c == 'a' then e) stream

c is a char:

    f c
    match c with | 'a' → f 'a' | 'b' → f 'b' | 'c' → f 'c' | . . .

prune using types:

    match c with | 'a' → f 'a' | _ → f 'c'
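A rough sketch of these improvements, with generated code represented as strings of OCaml source instead of typed quotations. The identifiers appearing inside the generated text (`peek`, `stream`, `fail`, `parse_rest`) are placeholders, and `peek_cases` is a hypothetical helper. The continuation `k` receives a character that is known at generation time in each branch, so the comparison with 'a' is decided while generating and leaves no trace in the output. The list of cases plays the role of the FIRST set used for pruning.

```ocaml
(* Generated code is represented as a string of OCaml source text. *)
type code = string

(* CPS-style peek with a case split. [first] lists the characters that the
   grammar's type says are relevant; every other character shares one
   default branch, for which k receives None. *)
let peek_cases (first : char list) (k : char option -> code) : code =
  let branch c = Printf.sprintf "  | %C -> %s" c (k (Some c)) in
  String.concat "\n"
    (("match peek stream with" :: List.map branch first)
     @ [ "  | _ -> " ^ k None ])

(* Inside k the character is static, so this match runs at generation time. *)
let expect_a (e : code) : code =
  peek_cases [ 'a'; 'b' ]
    (function
      | Some 'a' -> e
      | _ -> "fail ()")

let generated = expect_a "parse_rest ()"
```

`generated` is a three-branch `match peek stream with` whose 'a' branch is `parse_rest ()` and whose 'b' and default branches are `fail ()`. Shrinking the list passed to `peek_cases` is the "prune using types" step.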
faster than yacc!
[Bar chart: throughput (MB/s, axis ticks at 20 and 40) on the arith and sexp benchmarks for unstaged, yacc and staged parsers]
Future: heterogeneous staging
Homogeneity limits both guarantees and performance
[Diagram: language expressiveness, from low-level (monomorphic, efficient) to high-level (powerful types & abstractions), with host and target placed for homogeneous and for heterogeneous staging]
Future: verified staging
Challenge: verifying a staged program

    _;_ : Statement → Statement → Statement …
    ⇝-assignment : E ⊢ e ⇒ v → S (x := e) k E ⇝ S nop k (x → v , E) …
    seq : Parser α → Parser α → Parser α …
    assoc: seq p (seq q r) ≅ seq (seq p q) r …

                Code                      Logic
    C           A typed interface to C    Reduction semantics
    Parsers     Parser operations         Parser properties