Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012



slide-1
SLIDE 1

Inside PHP Tom Lee @tglee OSCON 2012 19th July, 2012

slide-2
SLIDE 2

Overview

  • About me!
  • New Relic’s PHP Agent escapee.
  • Now on New Projects, doing unspeakably un-PHP things.
  • Wannabe compiler nerd.
  • Terminology & brief intro to compilers:
  • Grammars, Scanners & Parsers
  • General architecture of a bytecode compiler
  • Hands on: Modifying the PHP language
  • PHP/Zend compiler architecture & summary
  • Case study in adding a new keyword
slide-3
SLIDE 3

“Zend” vs. “Zend Engine” vs. “PHP”

  • I will use all of these interchangeably throughout this talk.
  • Referring to the bytecode compiler in the “Zend Engine 2” in most cases.
  • The distinction doesn’t really matter here.
slide-4
SLIDE 4

Compilers 101: Scanners

  • Or lexical analyzers, or tokenizers
  • Input: raw source code
  • Output: a stream of tokens

while ($x == $y)  →  T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'
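To make the scanner's job concrete, here is a minimal hand-written sketch in C that turns the example source into the token stream shown on the slide. It is a toy under stated assumptions: the `TokenKind` enum, the `Token` struct and `next_token` are hypothetical names for illustration, and the real Zend scanner is generated by re2c rather than written like this.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for this toy scanner. The T_* names mirror the slide,
 * but the enum itself is hypothetical, not Zend's. */
typedef enum {
    T_WHILE, T_VARIABLE, T_IS_EQUAL, T_LPAREN, T_RPAREN, T_END, T_ERROR
} TokenKind;

typedef struct {
    TokenKind kind;
    char text[32];   /* variable name without the '$'; empty otherwise */
} Token;

/* Scan one token starting at *src and advance *src past it. */
static Token next_token(const char **src)
{
    Token tok = { T_ERROR, "" };
    const char *p;

    assert(src != NULL && *src != NULL);
    p = *src;

    while (isspace((unsigned char)*p))
        p++;                                   /* skip whitespace */

    if (*p == '\0') {
        tok.kind = T_END;
    } else if (*p == '(') {
        tok.kind = T_LPAREN;
        p++;
    } else if (*p == ')') {
        tok.kind = T_RPAREN;
        p++;
    } else if (p[0] == '=' && p[1] == '=') {
        tok.kind = T_IS_EQUAL;
        p += 2;
    } else if (*p == '$' && (isalpha((unsigned char)p[1]) || p[1] == '_')) {
        size_t n = 0;
        p++;                                   /* drop the '$' */
        while (isalnum((unsigned char)*p) || *p == '_') {
            if (n + 1 < sizeof tok.text)       /* truncate overly long names */
                tok.text[n++] = *p;
            p++;
        }
        tok.text[n] = '\0';
        tok.kind = T_VARIABLE;
    } else if (strncmp(p, "while", 5) == 0
               && !isalnum((unsigned char)p[5]) && p[5] != '_') {
        tok.kind = T_WHILE;                    /* keyword, not a prefix of a longer word */
        p += 5;
    }
    /* Anything else stays T_ERROR and does not advance. */

    *src = p;
    return tok;
}
```

Calling `next_token` repeatedly on `"while ($x == $y)"` yields T_WHILE, '(', T_VARIABLE("x"), T_IS_EQUAL, T_VARIABLE("y"), ')' and then T_END. Note that the scanner only classifies characters; it has no idea whether the parentheses balance.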

slide-5
SLIDE 5

Compilers 101: Parsers

  • Input: a stream of tokens from the scanner
  • Output is implementation dependent
  • Often an intermediate, in-memory representation of the program in tree form.
  • e.g. Parse Tree or Abstract Syntax Tree
  • Or directly generate bytecode.
  • Goal of a parser is to structure the token stream.

  • Parsers are frequently generated from a DSL
  • See parser generators like Yacc/Bison, ANTLR, etc.
  • Or e.g. parser combinators in Haskell, Scala, ML.

T_WHILE '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'

0: ZEND_IS_EQUAL ~0 !0 !1
1: ZEND_JMPZ ~0 ->3
2: …
3: …

slide-6
SLIDE 6

Compilers 101: Context-free grammars

  • Or simply “grammar”
  • A grammar describes the complete syntax of a (programming) language.
  • Usually expressed in Extended Backus-Naur Form (EBNF)
  • Or some variant thereof.
  • Variants of EBNF used for a lot of DSL-based parser generators
  • e.g. Yacc/Bison, ANTLR, etc.
slide-7
SLIDE 7

Generalized Compiler Architecture*

Source files → (source code) → Scanner → (token stream) → Parser → (Abstract Syntax Tree) → Code Generator → (bytecode) → Bytecode Interpreter

* Actually a generalized *bytecode* compiler architecture

slide-8
SLIDE 8

Generalized *PHP* Compiler Architecture

Source files → (source code) → Scanner → (token stream) → Parser → (Abstract Syntax Tree) → Code Generator → (bytecode) → Bytecode Interpreter

  • Scanner: Zend/zend_language_scanner.l
  • Parser: Zend/zend_language_parser.y
  • Code Generator: Zend/zend_compile.c
  • Bytecode Interpreter: Zend/zend_execute.c

PHP compiles directly to bytecode!

slide-9
SLIDE 9

Case Study: The “until” statement

<?php
$x = 5;
until ($x == 0) {
    $x--;
    echo "Oh hi, Mark [$x]\n";
}

-- output --

Oh hi, Mark [4]
Oh hi, Mark [3]
Oh hi, Mark [2]
Oh hi, Mark [1]
Oh hi, Mark [0]

It’s basically while (!...) ...
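Since stock PHP has no until, the intended semantics can be checked with the equivalent while (!...) loop. The sketch below transliterates the example into C; `run_until_demo` and its buffer parameters are hypothetical names chosen for illustration, and the output is collected in a string so it can be inspected.

```c
#include <assert.h>
#include <stdio.h>

/* Runs the slide's loop, appending each echoed line to out.
 * Returns the number of iterations performed. */
static int run_until_demo(int x, char *out, size_t out_size)
{
    int iterations = 0;
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    /* until ($x == 0) { ... }  behaves like  while (!($x == 0)) { ... } */
    while (!(x == 0)) {
        int n;
        x--;
        n = snprintf(out + used, out_size - used, "Oh hi, Mark [%d]\n", x);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0';      /* buffer full: drop the partial line and stop */
            break;
        }
        used += (size_t)n;
        iterations++;
    }
    return iterations;
}
```

Starting from 5, the loop runs five times and the first line written is "Oh hi, Mark [4]", matching the output above. Starting from 0, the body never runs, because the condition is tested before the first iteration.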

slide-10
SLIDE 10

How to add “until” to the PHP language

1. Tell the scanner how to tokenize new keyword(s)
2. Describe the syntax of the new construct
3. Emit bytecode

slide-11
SLIDE 11

Before you start...

  • You’ll need the usual gcc toolchain, GNU Bison, etc.
  • Debian/Ubuntu: apt-get install build-essential
  • OSX: Xcode command line tools should give you most of what you need.
  • Also ensure that you have re2c
  • Debian/Ubuntu: apt-get install re2c
  • OSX (Homebrew): brew install re2c
  • Used to generate the scanner
  • Silently ignored if not found by the configure script!
  • And, of course, source code for some recent version of PHP 5.
  • I’m working with PHP 5.4.4
slide-12
SLIDE 12

1. Tell the scanner how to tokenize “until”

  • Zend/zend_language_scanner.l
  • Input for re2c, which will generate the Zend language scanner.
  • Describes how raw source code should be converted into tokens.
  • Note that no structure is implied here: that’s the parser’s job.
  • Tell the scanner that the word “until” is special.
  • The parser also needs to know about new tokens!
  • How is this done for the while keyword?

until ($x == $y)  →  T_UNTIL '(' T_VARIABLE("x") T_IS_EQUAL T_VARIABLE("y") ')'

slide-13
SLIDE 13

2. Describe the syntax of “until”

  • Zend/zend_language_parser.y
  • Essentially serves as the grammar for the Zend language.
  • Also describes actions to perform during parsing.
  • Input for the parser generator (Bison) used to generate the PHP parser.
  • Tell PHP how until statements are structured syntactically.
  • How was it done for a while statement?

T_UNTIL '(' expr ')' statement
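The rule above can be read as a recipe for checking a token stream. As a rough illustration, the following C sketch recognizes it with hand-written recursive descent. Everything here is a stand-in: Bison generates a table-driven parser from the .y file instead, and the `Parser` struct, the function names, and the tiny `expr` and `statement` rules are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    T_UNTIL, T_VARIABLE, T_IS_EQUAL, T_LPAREN, T_RPAREN, T_SEMI, T_END
} TokenKind;

/* Cursor over a token array that must be terminated by T_END. */
typedef struct {
    const TokenKind *toks;
    size_t pos;
} Parser;

/* Consume the next token if it has the wanted kind; T_END is never consumed. */
static int expect_tok(Parser *p, TokenKind kind)
{
    if (p->toks[p->pos] != kind)
        return 0;
    if (kind != T_END)
        p->pos++;
    return 1;
}

/* expr: T_VARIABLE T_IS_EQUAL T_VARIABLE   (deliberately tiny) */
static int parse_expr(Parser *p)
{
    return expect_tok(p, T_VARIABLE)
        && expect_tok(p, T_IS_EQUAL)
        && expect_tok(p, T_VARIABLE);
}

/* statement: ';'   (an empty statement stands in for any loop body) */
static int parse_statement(Parser *p)
{
    return expect_tok(p, T_SEMI);
}

/* until_statement: T_UNTIL '(' expr ')' statement */
static int parse_until(Parser *p)
{
    return expect_tok(p, T_UNTIL)
        && expect_tok(p, T_LPAREN)
        && parse_expr(p)
        && expect_tok(p, T_RPAREN)
        && parse_statement(p);
}

/* Returns 1 if the whole token array is exactly one until statement. */
static int parses_as_until(const TokenKind *toks)
{
    Parser p;
    assert(toks != NULL);
    p.toks = toks;
    p.pos = 0;
    return parse_until(&p) && expect_tok(&p, T_END);
}
```

The token sequence for `until ($x == $y);` is accepted, while the same sequence with the closing parenthesis missing is rejected. This is the structure the scanner alone could not enforce.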

slide-14
SLIDE 14

3. Emit bytecode

  • Add actions to Zend/zend_language_parser.y
  • What should they do?
  • Recall that PHP generates bytecode during the parsing process.
  • Generate bytecode describing the semantics of until in terms of the PHP VM.

  • Er, wait -- what bytecode do we need to generate?

Compiler → Bytecode

slide-15
SLIDE 15

Intermission: PHP bytecode intro

  • opline
  • Data structure representing a single line of PHP VM “assembly”
  • Includes opcode + operands
  • opline # associated with each opline
  • Different variable types, differentiated by prefix:
  • Variables ($)
  • Compiled variables (!)
  • Temporary variables (~)
  • ZEND_JMP
  • “goto”
  • Conditional variants: ZEND_JMPZ, ZEND_JMPNZ
  • opline #s used as address operand for JMP instructions (->)

<opcode> <result?> <op1?> <op2?>

ZEND_JMP <op1>
Unconditional jump to the opline # in op1
e.g. jump to opline #10
ZEND_JMP ->10

ZEND_JMPZ <op1> <op2>
Conditional jump to the opline # in op2 if op1 is zero
e.g. jump to opline #3 if ~0 is zero
ZEND_JMPZ ~0 ->3

ZEND_IS_EQUAL <result> <op1> <op2>
result=1 if op1 == op2, otherwise result=0
e.g. set ~0=1 if !0 == 10
ZEND_IS_EQUAL ~0 !0 10
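To get a feel for how oplines and jump targets interact, here is a toy interpreter in C for a handful of opcodes modeled on the ones above. It is a sketch, not the Zend VM: the `Opcode` enum, the `Opline` layout, the register array, and the extra `OP_ADD` and `OP_HALT` opcodes are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes modeled on the slide; not Zend's real definitions. */
typedef enum { OP_IS_EQUAL, OP_ADD, OP_JMP, OP_JMPZ, OP_JMPNZ, OP_HALT } Opcode;

typedef struct {
    Opcode op;
    int result;   /* register that receives the result */
    int op1;      /* register index, or the jump target for OP_JMP */
    int op2;      /* register index, or the jump target for OP_JMPZ/OP_JMPNZ */
} Opline;

/* Execute oplines until OP_HALT, the end of the array, or max_steps.
 * Register indices are trusted; the caller must size regs accordingly.
 * Returns the number of oplines executed. */
static long run(const Opline *code, size_t count, int *regs, long max_steps)
{
    size_t pc = 0;    /* the current opline # */
    long steps = 0;

    assert(code != NULL && regs != NULL);

    while (pc < count && steps < max_steps) {
        const Opline *o = &code[pc];
        steps++;
        switch (o->op) {
        case OP_IS_EQUAL: regs[o->result] = (regs[o->op1] == regs[o->op2]); pc++; break;
        case OP_ADD:      regs[o->result] = regs[o->op1] + regs[o->op2];    pc++; break;
        case OP_JMP:      pc = (size_t)o->op1;                                   break;
        case OP_JMPZ:     pc = (regs[o->op1] == 0) ? (size_t)o->op2 : pc + 1;    break;
        case OP_JMPNZ:    pc = (regs[o->op1] != 0) ? (size_t)o->op2 : pc + 1;    break;
        case OP_HALT:     return steps;
        }
    }
    return steps;
}

/* Counts start down to zero using the toy bytecode; stores the final value
 * in *final_x and returns the number of oplines executed. */
static long count_down(int start, int *final_x)
{
    /* r0 = x, r1 = constant 0, r2 = constant -1, r3 = temporary */
    int regs[4];
    static const Opline prog[] = {
        { OP_IS_EQUAL, 3, 0, 1 },   /* 0: r3 = (r0 == r1)  */
        { OP_JMPNZ,    0, 3, 4 },   /* 1: if r3 != 0 -> 4  */
        { OP_ADD,      0, 0, 2 },   /* 2: r0 = r0 + r2     */
        { OP_JMP,      0, 0, 0 },   /* 3: -> 0             */
        { OP_HALT,     0, 0, 0 }    /* 4: done             */
    };
    long steps;

    assert(final_x != NULL);
    regs[0] = start; regs[1] = 0; regs[2] = -1; regs[3] = 0;
    steps = run(prog, sizeof prog / sizeof prog[0], regs, 10000);
    *final_x = regs[0];
    return steps;
}
```

The program in `count_down` has the same shape as a compiled loop: a comparison, a conditional jump out of the loop, the body, and an unconditional jump back to opline 0.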

slide-16
SLIDE 16

Unconditional jump: ZEND_JMP

0: ...
1: ...
2: ZEND_JMP ->0

slide-17
SLIDE 17

Conditional jump: ZEND_JMPZ / ZEND_JMPNZ

0: ...
1: ...
2: ZEND_JMPZ ~0 ->0
3: ...

slide-18
SLIDE 18

3. Emit bytecode (cont.)

  • Zend/zend_compile.c
  • The Zend language’s code generation logic lives here.
  • No DSLs here: plain old C source code.
  • First, let’s try to understand the bytecode for while
  • How do we need to modify it for until?
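One way to think about the difference between the two loops: the bytecode has the same shape, and only the conditional jump that leaves the loop is inverted. The C sketch below is a toy emitter, not the code in Zend/zend_compile.c. The `OpArray` type, `emit`, and `compile_loop` are hypothetical, and the condition and body are each collapsed into a single placeholder opline. It uses backpatching, a common single-pass technique: the exit jump is emitted before its target is known and filled in once the loop end is reached.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_COND, OP_BODY, OP_JMP, OP_JMPZ, OP_JMPNZ } Opcode;

typedef struct {
    Opcode op;
    int target;              /* opline # for jumps, -1 otherwise */
} Opline;

#define MAX_OPLINES 16

typedef struct {
    Opline code[MAX_OPLINES];
    int count;
} OpArray;

static void op_array_init(OpArray *a)
{
    assert(a != NULL);
    a->count = 0;
}

/* Append one opline and return its opline #. */
static int emit(OpArray *a, Opcode op, int target)
{
    assert(a != NULL && a->count < MAX_OPLINES);
    a->code[a->count].op = op;
    a->code[a->count].target = target;
    return a->count++;
}

/* Compile `while (cond) body` (negate == 0) or `until (cond) body` (negate != 0). */
static void compile_loop(OpArray *a, int negate)
{
    int loop_start = emit(a, OP_COND, -1);   /* evaluate the condition into a temporary */
    /* while leaves the loop when the condition is zero, until when it is non-zero */
    int exit_jump = emit(a, negate ? OP_JMPNZ : OP_JMPZ, -1);   /* target not known yet */

    emit(a, OP_BODY, -1);                    /* the loop body */
    emit(a, OP_JMP, loop_start);             /* go back and re-test the condition */

    a->code[exit_jump].target = a->count;    /* backpatch: first opline after the loop */
}
```

For both loops this produces four oplines, with the exit jump at opline 1 targeting opline 4 and the final jump targeting opline 0. The while version uses the JMPZ-style jump and the until version the JMPNZ-style one.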
slide-19
SLIDE 19

Demo!

  • Time to build!
  • The usual ./configure && make dance on Linux & OSX.
  • To be thorough, regenerate data used by the tokenizer extension.

(cd ext/tokenizer && ./tokenizer_data_gen.sh)

  • http://php.net/manual/en/book.tokenizer.php
  • You’ll need to run make again once you’ve done this.
  • With a little luck, magic happens and you get a binary in sapi/cli/php
  • Take until out for a spin!
slide-20
SLIDE 20
  • Lots to take in, right?
  • In my experience, this stuff is best learned bit-by-bit through practice.
  • Ask questions!
  • Google
  • php-internals
  • Or hey, ask me...

And exhale.

slide-21
SLIDE 21

Thanks!

oscon@tomlee.co @tglee

http://newrelic.com ... and come see Inside Python @ 5pm in D135 :)