@AndreasZeller
Advances in Grammar Mining and Testing
Andreas Zeller CISPA / Saarland University
https://github.com/vrthra/pygmalion
Saarbrücken
CISPA | Center for IT-Security, Privacy and Accountability
Scientific excellence in fundamental research • 50,000,000 €/year • 500+ researchers
Fuzzing
Random Testing at the System Level
[;x1-GPZ+wcckc];,N9J+?#6^6\e?]9lu2_%'4GX"0VUB[E/r ~fApu6b8<{%siq8Zh.6{V,hr?;{Ti.r3PIxMMMv6{xS^+'Hq!AxB"YXRS@! Kd6;wtAMefFWM(`|J_<1~o}z3K(CCzRH JIIvHz>_*.\>JrlU32~eGP? lR=bF3+;y$3lodQ<B89!5"W2fK*vE7v{')KC-i,c{<[~m!]o;{.'}Gj\(X} EtYetrpbY@aGZ1{P!AZU7x#4(Rtn!q4nCwqol^y6}0| Ko=*JK~;zMKV=9Nai:wxu{J&UV#HaU)*BiC<),`+t*gka<W=Z. %T5WGHZpI30D<Pq>&]BS6R&j?#tP7iaV}-}`\?[_[Z^LBMPG- FKj'\xwuZ1=Q`^`5,$N$Q@[!CuRzJ2D|vBy!^zkhdf3C5PAkR?V hn| 3='i2Qx]D$qs4O`1@fevnG'2\11Vf3piU37@55ap\zIyl"'f, $ee,J4Gw:cgNKLie3nx9(`efSlg6#[K"@WjhZ}r[Scun&sBCS,T[/ vY'pduwgzDlVNy7'rnzxNwI)(ynBa>%|b`;`9fG]P_0hdG~$@6 3]KAeEnQ7lU)3Pn,0)G/6N-wyzj/MTd#A;r
Fuzzer
UNIX utilities
Random inputs such as “ab’d&gfdfggg” crash 25%–33% of the utilities: grep • sh • sed …
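A minimal sketch of such a random fuzzer, assuming printable ASCII output is enough; the function name `fuzzer` and its parameters are illustrative and not taken from the slides.

```python
import random


def fuzzer(max_length=100, char_start=32, char_range=95, rng=random):
    """Return a string of up to max_length random printable ASCII characters."""
    length = rng.randrange(0, max_length + 1)
    return "".join(
        chr(rng.randrange(char_start, char_start + char_range))
        for _ in range(length)
    )


# Feed such strings to a program under test and watch for crashes or hangs.
sample = fuzzer(max_length=60)
print(repr(sample))
```

Inputs like these rarely get past a parser, which is the motivation for the grammar-based approaches that follow.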
Grammar Fuzzing
To compile and execute a program, the parser needs syntactically correct inputs
LangFuzz (2012) uses a grammar
– to generate inputs
– to parse existing inputs
JavaScript Grammar
If Statement
IfStatement^full ⇒
    if ParenthesizedExpression Statement^full
  | if ParenthesizedExpression Statement^noShortIf else Statement^full
IfStatement^noShortIf ⇒
    if ParenthesizedExpression Statement^noShortIf else Statement^noShortIf
Switch Statement
SwitchStatement ⇒
    switch ParenthesizedExpression { }
  | switch ParenthesizedExpression { CaseGroups LastCaseGroup }
CaseGroups ⇒
    «empty»
  | CaseGroups CaseGroup
CaseGroup ⇒ CaseGuards BlockStatementsPrefix
LastCaseGroup ⇒ CaseGuards BlockStatements
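Generating inputs from a grammar can be sketched as random expansion of nonterminals. The toy grammar below is a hypothetical simplification of the if-statement rules above, not the JavaScript grammar itself, and `produce` and `min_cost` are illustrative names. Past a depth budget, the sketch switches to the cheapest expansion so that generation terminates.

```python
import random
import re

# Hypothetical toy grammar (an assumption for illustration only).
GRAMMAR = {
    "<Statement>": ["<IfStatement>", "<Assignment>"],
    "<IfStatement>": [
        "if <ParenthesizedExpression> <Statement>",
        "if <ParenthesizedExpression> <Statement> else <Statement>",
    ],
    "<ParenthesizedExpression>": ["(<Expression>)"],
    "<Expression>": ["x", "y", "x == y"],
    "<Assignment>": ["x = 1;", "y = x;"],
}
NONTERMINAL = re.compile(r"<[A-Za-z]+>")


def min_cost(symbol, seen=frozenset()):
    """Size of the smallest derivation tree for symbol (inf if only recursive)."""
    if symbol in seen:
        return float("inf")
    return min(
        1 + sum(min_cost(s, seen | {symbol}) for s in NONTERMINAL.findall(alt))
        for alt in GRAMMAR[symbol]
    )


def produce(symbol="<Statement>", depth=0, max_depth=4, rng=random):
    """Expand symbol randomly; beyond max_depth, pick the cheapest alternative."""
    alternatives = GRAMMAR[symbol]
    if depth < max_depth:
        chosen = rng.choice(alternatives)
    else:
        chosen = min(
            alternatives,
            key=lambda alt: sum(min_cost(s) for s in NONTERMINAL.findall(alt)),
        )
    return NONTERMINAL.sub(
        lambda m: produce(m.group(0), depth + 1, max_depth, rng), chosen
    )


print(produce())
```

Every string produced this way is syntactically valid with respect to the toy grammar, which is the property a grammar fuzzer relies on to get past the parser.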
Parser
A Generated Input
var haystack = "foo";
var re_text = "^foo";
haystack += "x";
re_text += "(x)";
var re = new RegExp(re_text);
Figure 2: Test case generated by LangFuzz,
Chart: # defects over # days for Mozilla TI, Google V8 (Chrome 10 Beta), Mozilla TM (Firefox 4 Beta)
US$ 50,000+ in first four weeks
18 Chromium Security Rewards
12 Mozilla Security Bug Bounty Awards in 9 months
Fuzzing JavaScript
Learning Grammars
via its input/output language
via a formal language – regular expressions, grammars
Learning Grammars
http://user:pass@www.google.com:80/path
– protocol: http
– host name: www.google.com
– port: 80
– login: user, pass
– page request: path
– terminals: ://  :  @  :  /
These fragments are processed in different functions and stored in different variables
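As an illustration of fragments ending up in different variables, here is what Python's standard `urllib.parse.urlparse` does with the example URL. Using this particular parser is an assumption made for the sketch; the slides do not name the program being observed at this point.

```python
from urllib.parse import urlparse

url = "http://user:pass@www.google.com:80/path"
parts = urlparse(url)

# Each input fragment lands in its own attribute of the parse result.
fragments = {
    "protocol": parts.scheme,
    "login": (parts.username, parts.password),
    "host name": parts.hostname,
    "port": parts.port,
    "page request": parts.path,
}
for role, value in fragments.items():
    print(f"{role:>12}: {value!r}")
```

A grammar miner exploits exactly this separation: the names of the variables suggest nonterminals, and their values mark where each fragment sits in the input.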
Tracking Input
We track input characters throughout program execution:
– tag characters (and derived values) with their origin
– or check variables for whether they hold input fragments (simpler)
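Both tracking options can be sketched in a few lines. `TaintedStr` and `holds_input_fragment` are hypothetical names for this illustration, and the sketch only propagates origins through indexing and slicing; a real taint tracker would have to cover every string operation.

```python
class TaintedStr(str):
    """A string whose characters remember their index in the original input."""

    def __new__(cls, value, origin=None):
        self = super().__new__(cls, value)
        # origin[i] is the input position that character i came from.
        self.origin = list(range(len(value))) if origin is None else list(origin)
        return self

    def __getitem__(self, key):
        text = super().__getitem__(key)
        origin = self.origin[key]
        if isinstance(key, int):
            origin = [origin]
        return TaintedStr(text, origin)


def holds_input_fragment(value, inp):
    """Simpler alternative: does the variable's value occur in the input?"""
    return bool(value) and value in inp


inp = TaintedStr("http://www.google.com:80/path")
scheme = inp[:4]
host = inp[7:21]
port = inp[22:24]
derived = host[4:10]  # a value derived from another tainted value

print(scheme, scheme.origin)
print(derived, derived.origin)
print(holds_input_fragment("80", inp))
```

The substring check is easier to implement but can be fooled by coincidental matches, whereas origin tracking records exactly which input positions a value came from.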
Grammar Inference
$START ::= http://user:pass@www.google.com:80/path#ref
Variables where value is a substring of input:
fragment = 'ref'
url = '/path'
path = '/path'
scheme = 'http'
netloc = 'user:pass@www.google.com:80'

After netloc and scheme:
$START ::= $SCHEME://$NETLOC/path#ref
$NETLOC ::= user:pass@www.google.com:80
$SCHEME ::= http

After path:
$START ::= $SCHEME://$NETLOC$PATH#ref
$NETLOC ::= user:pass@www.google.com:80
$SCHEME ::= http
$PATH ::= /path

After fragment:
$START ::= $SCHEME://$NETLOC$PATH#$FRAGMENT
$NETLOC ::= user:pass@www.google.com:80
$SCHEME ::= http
$PATH ::= /path
$FRAGMENT ::= ref

After url:
$START ::= $SCHEME://$NETLOC$PATH#$FRAGMENT
$NETLOC ::= user:pass@www.google.com:80
$SCHEME ::= http
$PATH ::= $URL
$FRAGMENT ::= ref
$URL ::= /path
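The substitution steps above can be sketched as a short loop. This is a simplified illustration, not the AUTOGRAM or Pygmalion implementation: `infer_grammar` is a hypothetical helper, the variable order is the one shown on the slides, and only the first occurrence of each value is replaced.

```python
def infer_grammar(inp, variables):
    """Replace variable values found in the grammar by nonterminals.

    variables is a list of (name, value) pairs in the order they are processed.
    """
    grammar = {"$START": inp}
    for name, value in variables:
        if not value or value not in inp:
            continue  # only values that are substrings of the input count
        nonterminal = "$" + name.upper()
        for key, expansion in list(grammar.items()):
            if value in expansion:
                grammar[key] = expansion.replace(value, nonterminal, 1)
                grammar[nonterminal] = value
                break
    return grammar


variables = [
    ("netloc", "user:pass@www.google.com:80"),
    ("scheme", "http"),
    ("path", "/path"),
    ("fragment", "ref"),
    ("url", "/path"),
]
grammar = infer_grammar("http://user:pass@www.google.com:80/path#ref", variables)
for key, expansion in grammar.items():
    print(f"{key} ::= {expansion}")
```

Because `url` holds the same value as `path`, its value is no longer found in `$START` and is instead replaced inside `$PATH`, giving the `$PATH ::= $URL` rule of the final grammar. A plain substring search can also match by coincidence, which a tracker based on character origins would avoid.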
AUTOGRAM
AUTOGRAM: a grammar miner for Java programs
Uses active learning to infer
Höschele, Zeller: "Mining Input Grammars from Dynamic Taints", ASE 2016
URLs
URL ::= PROTOCOL '://' AUTHORITY PATH ['?' QUERY] ['#' REF]
AUTHORITY ::= [USERINFO '@'] HOST [':' PORT]
PROTOCOL ::= 'http' | 'ftp'
USERINFO ::= /[a-z]+:[a-z]+/
HOST ::= /[a-z.]+/
PORT ::= '80'
PATH ::= /\/[a-z0-9.\/]*/
QUERY ::= 'foo=bar&lorem=ipsum'
REF ::= /[a-z]+/

Sample inputs:
http://user:password@www.google.com:80/command?foo=bar&lorem=ipsum#fragment
http://www.guardian.co.uk/sports/worldcup#results
ftp://bob:12345@ftp.example.com/oss/debian7.iso
INI Files
INI ::= LINE+
LINE ::= SECTION_LINE '\r' | OPTION_LINE ['\r']
SECTION_LINE ::= '[' KEY ']'
OPTION_LINE ::= KEY ' = ' VALUE
KEY ::= /[a-zA-Z]*/
VALUE ::= /[a-zA-Z0-9\/]/

Sample input:
[Application]
Version = 0.5
WorkingDir = /tmp/mydir/
[User]
User = Bob
Password = 12345
JSON Input
JSON ::= VALUE
VALUE ::= JSONOBJECT | ARRAY | STRINGVALUE | TRUE | FALSE | NULL | NUMBER
TRUE ::= 'true'
FALSE ::= 'false'
NULL ::= 'null'
NUMBER ::= ['-'] /[0-9]+/
STRINGVALUE ::= '"' INTERNALSTRING '"'
INTERNALSTRING ::= /[a-zA-Z0-9 ]+/
ARRAY ::= '[' [VALUE [',' VALUE]+] ']'
JSONOBJECT ::= '{' [STRINGVALUE ':' VALUE [',' STRINGVALUE ':' VALUE]+] '}'

Sample input:
{ "v": true, "x": 25, "y": -36, … }
Testing with Mined Grammars
Tests Grammar Inputs Program
Sample inputs
– introduce bias
– may not be available
Sample-Free Grammar Learning
Tests · Grammar · Test Generator · Program
But a test generator is what we want to build in the first place!
Tests · Grammar · Parser-Directed Test Generator · Program
Dynamic Checks
Input: xyzzy ✘
– checks for digit
– checks for "true"/"false"
– checks for '"'
– checks for '['
– checks for '{'
Input: true ✔
Input: false ✔
Input: " ✘
– checks for '"'
– checks for '\'
– checks for character
Input: "" ✔
Parser-Directed Test Generator Program
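The idea of steering generation by the parser's own checks can be sketched with a toy parser that logs every comparison it makes against the input. Everything here is an assumption for illustration: the tiny language (numbers, true, false, lowercase strings, non-empty arrays), the `Recorder` wrapper, and the `generate` loop are hypothetical and far simpler than the actual Pygmalion prototype.

```python
import random
import string

DIGITS = string.digits
LETTERS = string.ascii_lowercase


class Incomplete(Exception):
    """The parser ran off the end of the input and wanted more."""


class Rejected(Exception):
    """The parser rejected the character at self.index."""
    def __init__(self, index):
        super().__init__(index)
        self.index = index


class Recorder:
    """Wraps the input and logs every check the parser makes on it."""
    def __init__(self, text):
        self.text = text
        self.comparisons = []  # (index, characters the parser would accept)

    def is_one_of(self, i, chars):
        self.comparisons.append((i, chars))
        return i < len(self.text) and self.text[i] in chars

    def expect(self, i, chars):
        if not self.is_one_of(i, chars):
            self.fail(i)

    def fail(self, i):
        raise Incomplete() if i >= len(self.text) else Rejected(i)


def parse_value(inp, i):
    """Parse one value starting at index i; return the index after it."""
    if inp.is_one_of(i, DIGITS):
        while inp.is_one_of(i, DIGITS):
            i += 1
        return i
    for word in ("true", "false"):
        if inp.is_one_of(i, word[0]):
            for ch in word:
                inp.expect(i, ch)
                i += 1
            return i
    if inp.is_one_of(i, '"'):
        i += 1
        while inp.is_one_of(i, LETTERS):
            i += 1
        inp.expect(i, '"')
        return i + 1
    if inp.is_one_of(i, "["):
        i = parse_value(inp, i + 1)
        while inp.is_one_of(i, ","):
            i = parse_value(inp, i + 1)
        inp.expect(i, "]")
        return i + 1
    inp.fail(i)


def parse(inp):
    end = parse_value(inp, 0)
    if end != len(inp.text):
        raise Rejected(end)


def accepts(text):
    try:
        parse(Recorder(text))
        return True
    except (Incomplete, Rejected):
        return False


def generate(rng, max_steps=1000):
    """Grow an input character by character, guided by the logged checks."""
    text = rng.choice(LETTERS)  # start from an arbitrary character
    for _ in range(max_steps):
        inp = Recorder(text)
        try:
            parse(inp)
            return text  # the parser accepted: a valid input
        except Incomplete:
            index = len(text)
        except Rejected as err:
            index = err.index
        # Replace or append using a character the parser compared against.
        options = [chars for (j, chars) in inp.comparisons if j == index]
        text = text[:index]
        if options:
            text += rng.choice(rng.choice(options))
    return None


print(generate(random.Random()))
```

No sample inputs and no grammar are needed: the comparisons at the failing position tell the generator which characters would have been accepted there, mirroring the "checks for digit", "checks for '['" steps above.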
Learning JSON
JSON ::= VALUE
VALUE ::= JSONOBJECT | ARRAY | STRINGVALUE | TRUE | FALSE | NULL | NUMBER
TRUE ::= 'true'
FALSE ::= 'false'
NULL ::= 'null'
NUMBER ::= ['-'] /[0-9]+/
STRINGVALUE ::= '"' INTERNALSTRING '"'
INTERNALSTRING ::= /[a-zA-Z0-9 ]+/
ARRAY ::= '[' [VALUE [',' VALUE]+] ']'
JSONOBJECT ::= '{' [STRINGVALUE ':' VALUE [',' STRINGVALUE ':' VALUE]+] '}'
Parser-Directed Test Generator
All in One
PYGMALION prototype for Python programs
Program Under Test · Parser-Directed Test Generator · Comparisons · Tests · Dynamic Taints · Grammar Learner · Test Inputs
Gopinath, Mathis, Höschele, Kampmann, Zeller: "Sample-Free Learning of Input Grammars"
Initial Evaluation
AFL: mutation-based test generator
KLEE: constraint-based test generator
Pygmalion: parser-directed test generator
No seed inputs for any tool
Pygmalion runs on Python programs; AFL and KLEE run on equivalent C programs
Testing with Inferred Grammars
Perfect coverage, much faster than AFL, much better structure than KLEE
Test Generation Compared
[false ,[{ "o":{ , "$dYPrlj@?BR": 397 [+ ]"S|+|4GzCW(C":-94}} ], [false,null]] ............................ nullÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
AFL: Low coverage, few valid inputs
KLEE: Top coverage, but flat inputs
Pygmalion: Top coverage, deep structure, control
Findings
Challenges
– Generated scanners and parsers
– Binary formats, identifiers
– Port Pygmalion to C (2019)
– Book "Generating Software Tests"
Andreas Zeller Saarland University / CISPA
Challenges
Parser-Directed Test Generation
Learning Grammars
Parser-Directed Testing
Test Generation Compared
https://github.com/vrthra/pygmalion