Inside PHP
Tom Lee @tglee
OSCON 2012, 19th July, 2012
Overview
- About me!
- New Relic’s PHP Agent escapee.
- Now on New Projects, doing unspeakably un-PHP things.
- Wannabe compiler nerd.
- Terminology & brief intro to compilers:
- Grammars, Scanners & Parsers
- General architecture of a bytecode compiler
- Hands on: Modifying the PHP language
- PHP/Zend compiler architecture & summary
- Case study in adding a new keyword
“Zend” vs. “Zend Engine” vs. “PHP”
- I will use all of these interchangeably throughout this talk.
- Referring to the bytecode compiler in the “Zend Engine 2” in most cases.
- The distinction doesn’t really matter here.
Compilers 101: Scanners
- Or lexical analyzers, or tokenizers
- Input: raw source code
- Output: a stream of tokens
while ($x == $y)  →  T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
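To make the scanner's job concrete, here is a minimal sketch of a tokenizer in Python. It is a toy and not PHP's re2c-generated scanner: the `TOKEN_SPEC` table, the regular expressions and the `tokenize` function are assumptions made up for illustration. Only the token names come from the slide.

```python
import re

# Hypothetical token table for illustration; the real rules live in
# Zend/zend_language_scanner.l and cover far more of the language.
TOKEN_SPEC = [
    ("T_WHILE",    r"while\b"),
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("T_IS_EQUAL", r"=="),
    ("CHAR",       r"[()]"),   # single-character tokens stand for themselves
    ("SKIP",       r"\s+"),    # whitespace carries no meaning in this toy
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Turn raw source text into a flat list of (token, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        if kind == "CHAR":
            tokens.append((text, None))
        elif kind == "T_VARIABLE":
            tokens.append((kind, text[1:]))  # drop the leading '$'
        else:
            tokens.append((kind, None))
    return tokens
```

Calling `tokenize("while ($x == $y)")` gives T_WHILE, '(', T_VARIABLE with value "x", T_IS_EQUAL, T_VARIABLE with value "y", ')'. Note the result is a flat list: nothing says yet that the parentheses belong to the while.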
Compilers 101: Parsers
- Input: a stream of tokens from the scanner
- Output is implementation dependent
- Often an intermediate, in-memory representation of the program in tree form.
- e.g. Parse Tree or Abstract Syntax Tree
- Or directly generate bytecode.
- Goal of a parser is to structure the token stream.
- Parsers are frequently generated from a DSL
- See parser generators like Yacc/Bison, ANTLR, etc.
- Or e.g. parser combinators in Haskell, Scala, ML.
T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'  →
0: ZEND_IS_EQUAL ~0 !0 !1
1: ZEND_JMPZ ~0 ->3
2: …
3: …
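The slide shows the "directly generate bytecode" option. For contrast, here is a minimal sketch of the tree-building option: a hand-written Python parser that gives structure to the same token stream. It is not generated by Bison, and `parse_while_header`, the `(kind, value)` token pairs and the nested-tuple tree shape are all assumptions made up for illustration.

```python
def parse_while_header(tokens):
    """Parse T_WHILE '(' T_VARIABLE T_IS_EQUAL T_VARIABLE ')' into a small tree."""
    pos = 0

    def expect(kind):
        # Consume the next token if it has the expected kind, else complain.
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            found = tokens[pos][0] if pos < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        value = tokens[pos][1]
        pos += 1
        return value

    expect("T_WHILE")
    expect("(")
    left = ("var", expect("T_VARIABLE"))
    expect("T_IS_EQUAL")
    right = ("var", expect("T_VARIABLE"))
    expect(")")
    return ("while", ("is_equal", left, right))


tokens = [
    ("T_WHILE", None), ("(", None), ("T_VARIABLE", "x"),
    ("T_IS_EQUAL", None), ("T_VARIABLE", "y"), (")", None),
]
tree = parse_while_header(tokens)
```

Here `tree` is `("while", ("is_equal", ("var", "x"), ("var", "y")))`. The loop body is left out, just as it is elided on the slide.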
Compilers 101: Context-free grammars
- Or simply “grammar”
- A grammar describes the complete syntax of a (programming) language.
- Usually expressed in Extended Backus-Naur Form (EBNF)
- Or some variant thereof.
- Variants of EBNF used for a lot of DSL-based parser generators
- e.g. Yacc/Bison, ANTLR, etc.
Generalized Compiler Architecture*
Source files
  → (source code) → Scanner
  → (token stream) → Parser
  → (Abstract Syntax Tree) → Code Generator
  → (bytecode) → Bytecode Interpreter
* Actually a generalized *bytecode* compiler architecture
Generalized *PHP* Compiler Architecture
Source files
  → (source code) → Scanner: Zend/zend_language_scanner.l
  → (token stream) → Parser: Zend/zend_language_parser.y
  → (Abstract Syntax Tree) → Code Generator: Zend/zend_compile.c
  → (bytecode) → Bytecode Interpreter: Zend/zend_execute.c
PHP compiles directly to bytecode! (The parser's actions call the code generator; no Abstract Syntax Tree is built.)
Case Study: The “until” statement
<?php
$x = 5;
until ($x == 0) {
    $x--;
    echo "Oh hi, Mark [$x]\n";
}
-- output --
Oh hi, Mark [4]
Oh hi, Mark [3]
Oh hi, Mark [2]
Oh hi, Mark [1]
Oh hi, Mark [0]
It’s basically while (!...) ...
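The "basically while (!...)" remark can be checked with a small Python model. The `until` function below is a hypothetical helper, not a PHP or Python built-in, and the `state` dict just stands in for `$x`.

```python
def until(condition, body):
    """Run body repeatedly until condition() becomes true: while (!condition) body."""
    while not condition():
        body()


state = {"x": 5}
lines = []


def step():
    # Mirrors the loop body on the slide: decrement, then echo.
    state["x"] -= 1
    lines.append(f"Oh hi, Mark [{state['x']}]")


until(lambda: state["x"] == 0, step)
print("\n".join(lines))
```

The condition is tested before each pass through the body, so the last line produced is the one for 0 and the loop then stops.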
How to add “until” to the PHP language
1. Tell the scanner how to tokenize new keyword(s)
2. Describe the syntax of the new construct
3. Emit bytecode
Before you start...
- You’ll need the usual gcc toolchain, GNU Bison, etc.
- Debian/Ubuntu: apt-get install build-essential
- OSX: the Xcode command line tools should give you most of what you need.
- Also ensure that you have re2c
- Debian/Ubuntu: apt-get install re2c
- OSX (Homebrew): brew install re2c
- Used to generate the scanner
- A missing re2c is silently ignored by the configure script, so check that it is installed before you build!
- And, of course, source code for some recent version of PHP 5.
- I’m working with PHP 5.4.4
1. Tell the scanner how to tokenize “until”
- Zend/zend_language_scanner.l
- Input for re2c, which will generate the Zend language scanner.
- Describes how raw source code should be converted into tokens.
- Note that no structure is implied here: that’s the parser’s job.
- Tell the scanner that the word “until” is special.
- The parser also needs to know about new tokens!
- How is this done for the while keyword?
until ($x == $y)  →  T_UNTIL '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
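What "telling the scanner that until is special" amounts to can be sketched in a few lines of Python. This models the idea only: the real change goes in Zend/zend_language_scanner.l, and the `KEYWORDS` table and `classify` function are assumptions made up for illustration. T_STRING is the token PHP uses for a plain identifier, and PHP keywords are case-insensitive.

```python
import re

# Hypothetical keyword table: words that are special rather than plain names.
KEYWORDS = {"while": "T_WHILE"}

WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def classify(word, keywords):
    """Return the keyword token for a word, or T_STRING for an ordinary name."""
    if not WORD.fullmatch(word):
        raise ValueError(f"not a word: {word!r}")
    return keywords.get(word.lower(), "T_STRING")


before = classify("until", KEYWORDS)                         # not special yet
after = classify("until", {**KEYWORDS, "until": "T_UNTIL"})  # new keyword added
```

Before the change, `until` is just another name (`before` is "T_STRING"); afterwards the scanner hands the parser a dedicated token (`after` is "T_UNTIL"), which is why the parser has to learn about that token too.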
2. Describe the syntax of “until”
- Zend/zend_language_parser.y
- Essentially serves as the grammar for the Zend language.
- Also describes actions to perform during parsing.
- Input for the parser generator (Bison) used to generate the PHP parser.
- Tell PHP how until statements are structured syntactically.
- How was it done for a while statement?
T_UNTIL '(' expr ')' statement
3. Emit bytecode
- Add actions to Zend/zend_language_parser.y
- What should they do?
- Recall that PHP generates bytecode during the parsing process.
- Generate bytecode describing the semantics of until in terms of the PHP VM.
- Er, wait -- what bytecode do we need to generate?
Compiler → Bytecode
Intermission: PHP bytecode intro
- opline
- Data structure representing a single line of PHP VM “assembly”
- Includes opcode + operands
- opline # associated with each opline
- Different variable types, differentiated by prefix:
- Variables ($)
- Compiled variables (!)
- Temporary variables (~)
- ZEND_JMP
- “goto”
- Conditional variants: ZEND_JMPZ, ZEND_JMPNZ
- opline #s used as address operand for JMP instructions (->)
<opcode> <result?> <op1?> <op2?>

ZEND_JMP <op1>
  Unconditional jump to the opline # in op1
  e.g. jump to opline #10
  ZEND_JMP ->10

ZEND_JMPZ <op1> <op2>
  Conditional jump to the opline # in op2 if op1 is zero
  e.g. jump to opline #3 if ~0 is zero
  ZEND_JMPZ ~0 ->3

ZEND_IS_EQUAL <result> <op1> <op2>
  result=1 if op1 == op2, otherwise result=0
  e.g. set ~0=1 if !0 == 10
  ZEND_IS_EQUAL ~0 !0 10
Unconditional jump: ZEND_JMP
0: ...
1: ...
2: ZEND_JMP ->0
Conditional jump: ZEND_JMPZ / ZEND_JMPNZ
0: ...
1: ...
2: ZEND_JMPZ ~0 ->0
3: ...
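To get a feel for how these jumps drive control flow, here is a minimal sketch of an interpreter for the opcodes introduced above. It is a toy and not the Zend VM: the `run` function, the `(opcode, result, op1, op2)` tuple layout, the use of strings such as "~0" for variables and the `max_steps` guard are all assumptions made up for illustration.

```python
def run(oplines, variables, max_steps=1000):
    """Execute toy oplines; return the list of opline numbers that were executed."""
    def value(operand):
        # Names such as "!0" or "~0" are looked up; anything else is a literal.
        return variables[operand] if isinstance(operand, str) else operand

    trace = []
    pc = 0
    while pc < len(oplines):
        if len(trace) >= max_steps:
            raise RuntimeError("step limit reached (infinite loop?)")
        opcode, result, op1, op2 = oplines[pc]
        trace.append(pc)
        if opcode == "ZEND_IS_EQUAL":
            variables[result] = 1 if value(op1) == value(op2) else 0
            pc += 1
        elif opcode == "ZEND_JMP":
            pc = op1  # op1 holds the target opline number
        elif opcode == "ZEND_JMPZ":
            pc = op2 if value(op1) == 0 else pc + 1
        elif opcode == "ZEND_JMPNZ":
            pc = op2 if value(op1) != 0 else pc + 1
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return trace


program = [
    ("ZEND_IS_EQUAL", "~0", "!0", 10),    # 0: ~0 = (!0 == 10)
    ("ZEND_JMPZ", None, "~0", 3),         # 1: if ~0 is zero, jump to opline 3
    ("ZEND_JMP", None, 4, None),          # 2: jump past the end of the program
    ("ZEND_IS_EQUAL", "~1", "!0", "!0"),  # 3: only reached when the jump is taken
]
```

With `!0` set to 10 the conditional jump is not taken and `run(program, {"!0": 10})` visits oplines 0, 1, 2. With `!0` set to 7 the jump is taken and the trace is 0, 1, 3.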
3. Emit bytecode (cont.)
- Zend/zend_compile.c
- The Zend language’s code generation logic lives here.
- No DSLs here: plain old C source code.
- First, let’s try to understand the bytecode for while
- How do we need to modify it for until?
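One plausible answer, sketched in Python: lay the loop out as condition, conditional exit jump, body, jump back, and let until differ from while only in which conditional jump leaves the loop. This is a toy and does not reproduce zend_compile.c: `emit_loop`, the tuple layout and the `("BODY",)` placeholder are assumptions made up for illustration, and the body is assumed to contain no jumps of its own.

```python
def emit_loop(oplines, keyword, cond_oplines, cond_result, body_oplines):
    """Append a loop to oplines: condition, exit jump, body, jump back."""
    # while leaves the loop when the condition is zero; until when it is non-zero.
    exit_opcode = {"while": "ZEND_JMPZ", "until": "ZEND_JMPNZ"}[keyword]
    cond_start = len(oplines)                 # opline # where the condition begins
    oplines.extend(cond_oplines)
    exit_jump = len(oplines)
    oplines.append((exit_opcode, cond_result, None))  # target not known yet
    oplines.extend(body_oplines)
    oplines.append(("ZEND_JMP", cond_start))  # go back and re-test the condition
    # Backpatch: the exit jump targets the first opline after the loop.
    oplines[exit_jump] = (exit_opcode, cond_result, len(oplines))
    return oplines


condition = [("ZEND_IS_EQUAL", "~0", "!0", 0)]  # ~0 = (!0 == 0)
body = [("BODY",)]                              # hypothetical stand-in for the loop body
while_code = emit_loop([], "while", condition, "~0", body)
until_code = emit_loop([], "until", condition, "~0", body)
```

For `until_code` this produces opline 0 as the ZEND_IS_EQUAL, opline 1 as `ZEND_JMPNZ ~0 ->4`, opline 2 as the body and opline 3 as `ZEND_JMP ->0`. `while_code` is identical except that opline 1 is a ZEND_JMPZ. The exit target is patched in after the body is emitted because it is not known until then.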
Demo!
- Time to build!
- The usual ./configure && make dance on Linux & OSX.
- To be thorough, regenerate data used by the tokenizer extension.
(cd ext/tokenizer && ./tokenizer_data_gen.sh)
- http://php.net/manual/en/book.tokenizer.php
- You’ll need to run make again once you’ve done this.
- With a little luck, magic happens and you get a binary in sapi/cli/php
- Take until out for a spin!
- Lots to take in, right?
- In my experience, this stuff is best learned bit-by-bit through practice.
- Ask questions!
- php-internals
- Or hey, ask me...
And exhale.
Thanks!
oscon@tomlee.co @tglee