Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012


  1. Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012

  2. Overview
     • About me!
       • New Relic’s PHP Agent escapee.
       • Now on New Projects, doing unspeakably un-PHP things.
       • Wannabe compiler nerd.
     • Terminology & brief intro to compilers:
       • Grammars, Scanners & Parsers
       • General architecture of a bytecode compiler
     • Hands on: Modifying the PHP language
       • PHP/Zend compiler architecture & summary
       • Case study in adding a new keyword

  3. “Zend” vs. “Zend Engine” vs. “PHP”
     • I will use all of these interchangeably throughout this talk.
     • In most cases I mean the bytecode compiler in the “Zend Engine 2”.
     • The distinction doesn’t really matter here.

  4. Compilers 101: Scanners
     • Or lexical analyzers, or tokenizers
     • Input: raw source code
     • Output: a stream of tokens
     • Example: while ($x == $y) is scanned into T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
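
To make the input/output contract of a scanner concrete, here is a minimal sketch in Python. It is not PHP's scanner: the token table, the `scan` function and the `(kind, text)` tuple shape are all assumptions made for illustration, and only the token names come from the slide.

```python
import re

# Hypothetical token table for this sketch; it only covers the slide's example.
TOKEN_SPEC = [
    ("T_WHILE", r"while\b"),
    ("T_IS_EQUAL", r"=="),
    ("T_VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("PUNCT", r"[(){};]"),
    ("WHITESPACE", r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def scan(source):
    """Turn raw source text into a flat list of (kind, text) tokens."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "WHITESPACE":
            continue  # whitespace separates tokens but is not emitted
        if kind == "T_VARIABLE":
            tokens.append((kind, text[1:]))  # keep the name without the leading "$"
        elif kind == "PUNCT":
            tokens.append((text, text))  # single-character tokens stand for themselves
        else:
            tokens.append((kind, text))
    return tokens


print(scan("while ($x == $y)"))
```

Note that the result is a flat list: nothing in it says that the parentheses enclose a condition. Imposing that structure is the parser's job.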

  5. Compilers 101: Parsers
     • Input: a stream of tokens from the scanner
     • Output is implementation dependent
       • Often an intermediate, in-memory representation of the program in tree form.
         • e.g. Parse Tree or Abstract Syntax Tree
       • Or directly generate bytecode.
     • Goal of a parser is to structure the token stream.
     • Parsers are frequently generated from a DSL
       • See parser generators like Yacc/Bison, ANTLR, etc. or e.g. parser combinators in Haskell, Scala, ML.
     • Example: the token stream T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')' becomes bytecode such as:
         0: ZEND_IS_EQUAL ~0 !0 !1
         1: ZEND_JMPZ ~0 ->3
         2: …
         3: …
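
As a rough illustration of "structuring the token stream", the sketch below hand-writes a tiny recursive-descent parser in Python that builds a tree. This is the tree-building option from the slide, not what PHP does, and the `Parser` class, the two grammar rules and the tuple-based tree are assumptions invented for this example.

```python
# Token stream for: while ($x == $y), as (kind, text) pairs
TOKENS = [
    ("T_WHILE", "while"), ("(", "("), ("T_VARIABLE", "x"),
    ("T_IS_EQUAL", "=="), ("T_VARIABLE", "y"), (")", ")"),
]


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def expect(self, kind):
        """Consume the next token, insisting that it has the given kind."""
        if self.pos >= len(self.tokens):
            raise SyntaxError(f"expected {kind}, found end of input")
        token = self.tokens[self.pos]
        if token[0] != kind:
            raise SyntaxError(f"expected {kind}, found {token[0]}")
        self.pos += 1
        return token

    def parse_expr(self):
        # expr: T_VARIABLE T_IS_EQUAL T_VARIABLE   (deliberately tiny)
        left = self.expect("T_VARIABLE")[1]
        self.expect("T_IS_EQUAL")
        right = self.expect("T_VARIABLE")[1]
        return ("is_equal", ("var", left), ("var", right))

    def parse_while_head(self):
        # while_head: T_WHILE '(' expr ')'
        self.expect("T_WHILE")
        self.expect("(")
        condition = self.parse_expr()
        self.expect(")")
        return ("while", condition)


tree = Parser(TOKENS).parse_while_head()
print(tree)
```

A generated parser (Bison, ANTLR) does the same job from a grammar file instead of hand-written methods.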

  6. Compilers 101: Context-free grammars
     • Or simply “grammar”
     • A grammar describes the complete syntax of a (programming) language.
     • Usually expressed in Extended Backus-Naur Form (EBNF)
       • Or some variant thereof.
     • Variants of EBNF used for a lot of DSL-based parser generators
       • e.g. Yacc/Bison, ANTLR, etc.

  7. Generalized Compiler Architecture*
     Source files → (source code) → Scanner → (token stream) → Parser → (Abstract Syntax Tree) → Code Generator → (bytecode) → Bytecode Interpreter
     * Actually a generalized *bytecode* compiler architecture

  8. Generalized *PHP* Compiler Architecture
     Source files → (source code) → Scanner [Zend/zend_language_scanner.l] → (token stream) → Parser [Zend/zend_language_parser.y] → Code Generator [Zend/zend_compile.c] → (bytecode) → Bytecode Interpreter [Zend/zend_execute.c]
     Unlike the generalized architecture, there is no Abstract Syntax Tree step: PHP compiles directly to bytecode!

  9. Case Study: The “until” statement
     • It’s basically while (!...) ...
     <?php
     $x = 5;
     until ($x == 0) {
         $x--;
         echo "Oh hi, Mark [$x]\n";
     }
     -- output --
     Oh hi, Mark [4]
     Oh hi, Mark [3]
     Oh hi, Mark [2]
     Oh hi, Mark [1]
     Oh hi, Mark [0]
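
Since "until" is basically "while (!...)", the intended semantics can be checked in any language that already has a while loop. The following Python sketch rewrites the slide's example that way. The function name `until_loop` is made up for illustration, and collecting the lines in a list instead of echoing them directly is a choice made here to keep the result easy to inspect.

```python
def until_loop():
    """Run the slide's example with until (cond) rewritten as while not (cond)."""
    lines = []
    x = 5
    while not (x == 0):  # until ($x == 0)
        x -= 1
        lines.append(f"Oh hi, Mark [{x}]")
    return lines


print("\n".join(until_loop()))
```

The loop body runs for x = 5 down to x = 1, and each pass decrements before printing, which is why the output counts from 4 to 0.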

  10. How to add “until” to the PHP language
      1. Tell the scanner how to tokenize new keyword(s)
      2. Describe the syntax of the new construct
      3. Emit bytecode

  11. Before you start...
      • You’ll need the usual gcc toolchain, GNU Bison, etc.
        • Debian/Ubuntu: apt-get install build-essential
        • OSX: Xcode command line tools should give you most of what you need.
      • Also ensure that you have re2c
        • Debian/Ubuntu: apt-get install re2c
        • OSX (Homebrew): brew install re2c
        • Used to generate the scanner
        • Silently ignored if not found by the configure script!
      • And, of course, source code for some recent version of PHP 5.
        • I’m working with PHP 5.4.4

  12. 1. Tell the scanner how to tokenize “until”
      • Zend/zend_language_scanner.l
        • Input for re2c, which will generate the Zend language scanner.
        • Describes how raw source code should be converted into tokens.
        • Note that no structure is implied here: that’s the parser’s job.
      • Tell the scanner that the word “until” is special.
      • The parser also needs to know about new tokens!
      • How is this done for the while keyword?
      • Example: until ($x == $y) should be scanned into T_UNTIL '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'

  13. 2. Describe the syntax of “until”
      • Zend/zend_language_parser.y
        • Essentially serves as the grammar for the Zend language.
        • Also describes actions to perform during parsing.
        • Input for the parser generator (Bison) used to generate the PHP parser.
      • Tell PHP how until statements are structured syntactically.
      • How was it done for a while statement?
      • Syntax: T_UNTIL '(' expr ')' statement

  14. 3. Emit bytecode
      • Add actions to Zend/zend_language_parser.y
      • What should they do?
        • Recall that PHP generates bytecode during the parsing process.
        • Generate bytecode describing the semantics of until in terms of the PHP VM.
      • Er, wait -- what bytecode do we need to generate?
      [Figure: Compiler → Bytecode]

  15. Intermission: PHP bytecode intro
      • opline
        • Data structure representing a single line of PHP VM “assembly”
        • Includes opcode + operands
        • opline # associated with each opline
        • General form: <opcode> <result?> <op1?> <op2?>
      • Different variable types, differentiated by prefix:
        • Variables ( $ )
        • Compiled variables ( ! )
        • Temporary variables ( ~ )
      • ZEND_JMP
        • “goto”
        • Conditional variants: ZEND_JMPZ, ZEND_JMPNZ
        • opline #s used as address operand for JMP instructions (->)
      • ZEND_JMP <op1>
        • Unconditional jump to the opline # in op1
        • e.g. jump to opline #10: ZEND_JMP ->10
      • ZEND_JMPZ <op1> <op2>
        • Conditional jump to the opline # in op2 if op1 is zero
        • e.g. jump to opline #3 if ~0 is zero: ZEND_JMPZ ~0 ->3
      • ZEND_IS_EQUAL <result> <op1> <op2>
        • result=1 if op1 == op2, otherwise result=0
        • e.g. set ~0=1 if !0 == 10: ZEND_IS_EQUAL ~0 !0 10

  16. Unconditional jump: ZEND_JMP
        0: ...
        1: ...
        2: ZEND_JMP ->0

  17. Conditional jump: ZEND_JMPZ / ZEND_JMPNZ
        0: ...
        1: ...
        2: ZEND_JMPZ ~0 ->0
        3: ...
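
The jump semantics are easier to follow when you can step through them. Below is a toy interpreter in Python for the opcodes introduced in the bytecode intro. It is a sketch, not the Zend VM: the tuple layout `(opcode, result, op1, op2)`, the `run` function, the `NOP` placeholder opcode and the `max_steps` guard are assumptions made for this example. Only the opcode names and the `!` / `~` prefixes come from the slides.

```python
def run(oplines, compiled_vars, max_steps=1000):
    """Interpret a tiny, made-up subset of Zend-style oplines.

    Operands starting with "!" name compiled variables, operands starting
    with "~" name temporaries, and anything else is a literal constant.
    Returns the opline numbers in the order they were executed.
    """
    temps = {}
    trace = []

    def read(operand):
        if isinstance(operand, str) and operand.startswith("!"):
            return compiled_vars[operand]
        if isinstance(operand, str) and operand.startswith("~"):
            return temps[operand]
        return operand

    pc = 0
    while pc < len(oplines) and len(trace) < max_steps:
        opcode, result, op1, op2 = oplines[pc]
        trace.append(pc)
        if opcode == "ZEND_IS_EQUAL":
            temps[result] = 1 if read(op1) == read(op2) else 0
            pc += 1
        elif opcode == "ZEND_JMP":
            pc = op1  # unconditional: op1 holds the target opline number
        elif opcode == "ZEND_JMPZ":
            pc = op2 if read(op1) == 0 else pc + 1
        elif opcode == "ZEND_JMPNZ":
            pc = op2 if read(op1) != 0 else pc + 1
        elif opcode == "NOP":  # hypothetical stand-in for "..." on the slides
            pc += 1
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return trace


program = [
    ("ZEND_IS_EQUAL", "~0", "!0", 10),  # 0: ~0 = 1 if !0 == 10 else 0
    ("ZEND_JMPZ", None, "~0", 3),       # 1: if ~0 is zero, jump to opline 3
    ("NOP", None, None, None),          # 2: only reached when !0 == 10
    ("NOP", None, None, None),          # 3: always reached
]

print(run(program, {"!0": 10}))  # opline 2 is executed
print(run(program, {"!0": 7}))   # opline 2 is skipped
```

The `max_steps` guard exists because a backwards `ZEND_JMP ->0`, as on the unconditional-jump slide, loops forever unless something else breaks out of it.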

  18. 3. Emit bytecode (cont.)
      • Zend/zend_compile.c
        • The Zend language’s code generation logic lives here.
        • No DSLs here: plain old C source code.
      • First, let’s try to understand the bytecode for while
      • How do we need to modify it for until?
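
One plausible answer to that last question, sketched in Python rather than in the C of Zend/zend_compile.c: lay out the loop as "evaluate condition, conditional jump past the loop, body, jump back", and flip only the conditional jump. Everything here is an assumption made for illustration, including the `emit_loop` and `run` helpers, the tuple layout and the `DEC` / `ECHO` opcodes. The slides do not show the actual code PHP emits.

```python
def emit_loop(keyword, cond, body):
    """Emit toy oplines for `keyword (var == const) { body }`.

    while leaves the loop when the condition is zero (ZEND_JMPZ);
    until leaves it when the condition is non-zero (ZEND_JMPNZ).
    """
    exit_jump = {"while": "ZEND_JMPZ", "until": "ZEND_JMPNZ"}[keyword]
    var, const = cond
    end = 2 + len(body) + 1  # number of the first opline after the loop
    oplines = [("ZEND_IS_EQUAL", "~0", var, const), (exit_jump, "~0", end)]
    oplines += body
    oplines.append(("ZEND_JMP", 0))  # go back and re-evaluate the condition
    return oplines


def run(oplines, variables):
    """Execute the toy oplines and return the values passed to ECHO."""
    temps, output, pc = {}, [], 0
    while pc < len(oplines):
        op = oplines[pc]
        pc += 1  # default: fall through to the next opline
        if op[0] == "ZEND_IS_EQUAL":
            temps[op[1]] = 1 if variables[op[2]] == op[3] else 0
        elif op[0] == "ZEND_JMPZ":
            if temps[op[1]] == 0:
                pc = op[2]
        elif op[0] == "ZEND_JMPNZ":
            if temps[op[1]] != 0:
                pc = op[2]
        elif op[0] == "ZEND_JMP":
            pc = op[1]
        elif op[0] == "DEC":  # hypothetical opcode: decrement a variable
            variables[op[1]] -= 1
        elif op[0] == "ECHO":  # hypothetical opcode: record a variable's value
            output.append(variables[op[1]])
        else:
            raise ValueError(f"unknown opcode {op[0]}")
    return output


body = [("DEC", "x"), ("ECHO", "x")]
until_code = emit_loop("until", ("x", 0), body)
for number, opline in enumerate(until_code):
    print(number, *opline)
print(run(until_code, {"x": 5}))
```

With x starting at 5, the until version echoes 4 down to 0, matching the case study output, while the while version of the same loop exits immediately because x == 0 is false on entry.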

  19. Demo!
      • Time to build!
        • The usual ./configure && make dance on Linux & OSX.
      • To be thorough, regenerate data used by the tokenizer extension.
        • (cd ext/tokenizer && ./tokenizer_data_gen.sh)
        • http://php.net/manual/en/book.tokenizer.php
        • You’ll need to run make again once you’ve done this.
      • With a little luck, magic happens and you get a binary in sapi/cli/php
      • Take until out for a spin!

  20. And exhale.
      • Lots to take in, right?
      • In my experience, this stuff is best learned bit-by-bit through practice.
      • Ask questions!
        • Google
        • php-internals
        • Or hey, ask me...

  21. Thanks! oscon@tomlee.co @tglee http://newrelic.com ... and come see Inside Python @ 5pm in D135 :)
