Pycket: A Tracing JIT for a Functional Language (APLS, December 16, 2015)

SLIDE 1

Pycket: A Tracing JIT for a Functional Language

Spenser Bauman (1), Carl Friedrich Bolz (2), Robert Hirschfeld (3), Vasily Kirilichev (3), Tobias Pape (3), Jeremy G. Siek (1), Sam Tobin-Hochstadt (1)

(1) Indiana University Bloomington, USA
(2) King’s College London, UK
(3) Hasso-Plattner-Institut, University of Potsdam, Germany

APLS December 16, 2015

SLIDE 2

Problem: Racket is slow on generic code

Generic code:

    (define (dot v1 v2)
      (for/sum ([e1 v1] [e2 v2])
        (* e1 e2)))

    (time (dot v1 v2)) ;; 3864 ms

Hand optimized:

    ;; flvector-length comes from racket/flonum; the unsafe-* operations from racket/unsafe/ops
    (require racket/flonum racket/unsafe/ops)

    (define (dot-fast v1 v2)
      (define len (flvector-length v1))
      (unless (= len (flvector-length v2))
        (error 'fail))
      (let loop ([n 0] [sum 0.0])
        (if (unsafe-fx= len n)
            sum
            (loop (unsafe-fx+ n 1)
                  (unsafe-fl+ sum
                              (unsafe-fl* (unsafe-flvector-ref v1 n)
                                          (unsafe-flvector-ref v2 n)))))))

    (time (dot-fast v1 v2)) ;; 268 ms

SLIDE 3

Problem: Racket is slow on contracts

    (define/contract (dot-safe v1 v2)
      ((vectorof flonum?) (vectorof flonum?) . -> . flonum?)
      (for/sum ([e1 v1] [e2 v2])
        (* e1 e2)))

    (time (dot-safe v1 v2)) ;; 8888 ms

SLIDE 4

Problem: Racket is slow with respect to gradual typing

Is Sound Gradual Typing Dead? Takikawa et al. POPL 2016

Benchmark   Modules   Typed/untyped ratio   Max. overhead   Mean overhead   3-deliverable   3/10-usable
kcfa        7         1.00x                 22.67x          9.23x           32 (25%)        48 (38%)
snake       8         0.92x                 121.51x         32.30x          4 (2%)          28 (11%)
tetris      9         0.97x                 117.28x         33.34x          128 (25%)       0 (0%)
synth       10        1.03x                 85.90x          39.69x          15 (1%)         73 (7%)

[Plots omitted: one per benchmark; x-axis: overhead from 1x to 20x; y-axis: number of configurations, up to 128 (kcfa), 256 (snake), 512 (tetris), and 1024 (synth).]

SLIDE 5

Pycket is a tracing JIT compiler which reduces the need for manual specialization and reduces contract overhead.

    (time (dot v1 v2))      ;; 74 ms
    (time (dot-fast v1 v2)) ;; 74 ms (268 ms on Racket)
    (time (dot-safe v1 v2)) ;; 95 ms
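
To put the timings in perspective, the short Python calculation below derives speedup factors from the numbers quoted on these slides. It is only arithmetic over the reported milliseconds; the dictionary names are made up for this illustration, and Python is used because Pycket itself is written in RPython.

```python
# Timings in milliseconds, copied from the slides (Racket vs. Pycket).
racket_ms = {"dot": 3864, "dot-fast": 268, "dot-safe": 8888}
pycket_ms = {"dot": 74, "dot-fast": 74, "dot-safe": 95}

# Speedup of Pycket over Racket for each variant.
speedup = {name: racket_ms[name] / pycket_ms[name] for name in racket_ms}

# Cost of adding the contract: dot-safe relative to plain dot on each system.
contract_overhead = {
    "racket": racket_ms["dot-safe"] / racket_ms["dot"],
    "pycket": pycket_ms["dot-safe"] / pycket_ms["dot"],
}

for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x faster on Pycket")
for system, factor in contract_overhead.items():
    print(f"contract overhead on {system}: {factor:.2f}x")
```

On these numbers the generic and hand-optimized versions run at the same speed on Pycket, which is the sense in which manual specialization becomes less necessary.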

SLIDE 6

Pycket tames overhead from gradual typing

[Figure: four panels (kcfa, tetris, snake, synth). x-axis: slowdown factor (5 to 20); y-axis: number below (up to 120 for kcfa, 500 for tetris, 250 for snake, 1000 for synth); series: racket, pycket, hidden.]

SLIDE 7

Idea: Apply a dynamic-language JIT compiler to Racket

Take Racket + apply the RPython project = Pycket

SLIDE 8

Background: Tracing JIT Compilation

[Diagram: a virtual machine takes a program and its input. It interprets and profiles the program; once a hot loop is found, the loop is traced, the trace is optimized, code is generated, and the loop then runs as native code.]

SLIDE 9

Background: Tracing JIT Compilation

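[Diagram: the program consists of blocks A, B, C, and D. The recorded execution trace is A, B, D, protected by a guard. When the guard fails, a side exit returns control to the interpreter.]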
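
The diagram can be made concrete with a toy model. The Python sketch below is not Pycket or RPython code; the block bodies, the `hot_threshold` parameter, and all function names are invented for illustration. It interprets a loop over blocks A, B, C, D, records a trace once the loop is hot, and afterwards replays the trace, taking a side exit back to the interpreter whenever the guard fails.

```python
def block_a(s): s["i"] += 1            # A: advance the loop counter
def block_b(s): s["total"] += s["i"]   # B: common path
def block_c(s): s["total"] -= 1        # C: rare path
def block_d(s): pass                   # D: loop back-edge

BLOCKS = {"A": block_a, "B": block_b, "C": block_c, "D": block_d}

def successor(name, s):
    """Control flow of the toy program: A branches to B or C, both join at D."""
    if name == "A":
        return "B" if s["i"] % 10 else "C"
    if name in ("B", "C"):
        return "D"
    return None  # D jumps back to the loop header

def interpret(start, s, recording=None):
    """Run blocks from `start` to the end of the iteration, optionally recording."""
    name = start
    while name is not None:
        BLOCKS[name](s)
        nxt = successor(name, s)
        if recording is not None:
            recording.append(("run", name))
            if name == "A":  # the only branch point, so the only guard
                recording.append(("guard", name, nxt))
        name = nxt

def run_trace(trace, s):
    """Replay a trace. Return None on success, or the block to resume at."""
    for step in trace:
        if step[0] == "run":
            BLOCKS[step[1]](s)
        else:
            _, block, expected = step
            actual = successor(block, s)
            if actual != expected:
                return actual  # guard failed: side exit
    return None

def run(n, hot_threshold=3):
    s = {"i": 0, "total": 0}
    passes, trace, side_exits = 0, None, 0
    while s["i"] < n:
        if trace is not None:
            resume = run_trace(trace, s)
            if resume is not None:
                side_exits += 1
                interpret(resume, s)  # finish the iteration in the interpreter
            continue
        passes += 1
        recording = [] if passes == hot_threshold else None
        interpret("A", s, recording)
        if recording is not None:
            trace = recording
    return s["total"], trace, side_exits

total, trace, side_exits = run(20)
```

A real tracing JIT would optimize the recorded trace and compile it to machine code; here the trace is simply replayed, which is enough to show the role of the guard and the side exit.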

SLIDE 10

Background: The PyPy Meta-Tracing JIT

[Diagram: the same stages (Interpret & Profile, hot loop, Tracing, Optimize, Code Gen., Native Execution), now provided by RPython. The virtual machine contains a Python interpreter, which runs a Python program and its input.]

SLIDE 11

The Pycket Meta-Tracing JIT

[Diagram: the same stages (Interpret & Profile, hot loop, Tracing, Optimize, Code Gen., Native Execution), provided by RPython. The virtual machine contains a Racket interpreter, which runs a Racket program and its input.]

SLIDE 12

Our Racket Interpreter: The CEK Machine

e ::= x | λx. e | (e e) | letcc x. e | e@e
κ ::= [] | arg(e, ρ)::κ | fun(v, ρ)::κ | ccarg(e, ρ)::κ | cc(κ)::κ
v ::= λx. e | κ

⟨x, ρ, κ⟩ ⟶ ⟨ρ(x), ρ, κ⟩
⟨(e1 e2), ρ, κ⟩ ⟶ ⟨e1, ρ, arg(e2, ρ)::κ⟩
⟨v1, ρ, arg(e2, ρ′)::κ⟩ ⟶ ⟨e2, ρ′, fun(v1, ρ)::κ⟩
⟨v2, ρ, fun(λx. e, ρ′)::κ⟩ ⟶ ⟨e, ρ′[x → v2], κ⟩
⟨letcc x. e, ρ, κ⟩ ⟶ ⟨e, ρ[x → κ], κ⟩
⟨(e1@e2), ρ, κ⟩ ⟶ ⟨e1, ρ, ccarg(e2, ρ)::κ⟩
⟨κ1, ρ, ccarg(e2, ρ′)::κ⟩ ⟶ ⟨e2, ρ′, cc(κ1)::κ⟩
⟨v2, ρ, cc(κ1)::κ⟩ ⟶ ⟨v2, ρ, κ1⟩

Source: Programming Languages and Lambda Calculi. Flatt and Felleisen. 2007
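
The transition rules above can be read as a small interpreter loop. The Python sketch below is one possible reading under stated assumptions, not Pycket's actual interpreter: terms are encoded as tuples, lambdas evaluate to `Closure` objects that carry their own environment (so a `fun` frame stores only the closure), and an integer literal form `("lit", n)` is added, which is not part of the grammar on the slide, so that results are easy to inspect.

```python
from dataclasses import dataclass

@dataclass
class Closure:          # a lambda paired with its defining environment
    param: str
    body: tuple
    env: dict

@dataclass
class Kont:             # a captured continuation used as a value
    frames: object      # None stands for the empty continuation []

def run(expr):
    control, env, kont = expr, {}, None
    have_value = False  # True when `control` holds a value rather than a term
    while True:
        if not have_value:
            tag = control[0]
            if tag == "lit":                      # assumed extension: literals
                control, have_value = control[1], True
            elif tag == "var":                    # <x, rho, k>
                control, have_value = env[control[1]], True
            elif tag == "lam":
                control, have_value = Closure(control[1], control[2], env), True
            elif tag == "app":                    # <(e1 e2), rho, k>
                kont = ("arg", control[2], env, kont)
                control = control[1]
            elif tag == "letcc":                  # <letcc x. e, rho, k>
                env = {**env, control[1]: Kont(kont)}
                control = control[2]
            elif tag == "throw":                  # <(e1@e2), rho, k>
                kont = ("ccarg", control[2], env, kont)
                control = control[1]
            else:
                raise ValueError(f"unknown term: {control!r}")
        elif kont is None:
            return control                        # value with empty continuation
        elif kont[0] == "arg":                    # operator done, evaluate operand
            _, operand, saved_env, rest = kont
            kont = ("fun", control, rest)
            control, env, have_value = operand, saved_env, False
        elif kont[0] == "fun":                    # operand done, enter the body
            _, fn, rest = kont
            if not isinstance(fn, Closure):
                raise TypeError("application of a non-function")
            env = {**fn.env, fn.param: control}
            control, kont, have_value = fn.body, rest, False
        elif kont[0] == "ccarg":                  # continuation done, evaluate payload
            _, payload, saved_env, rest = kont
            if not isinstance(control, Kont):
                raise TypeError("e1@e2 expects a continuation")
            kont = ("cc", control, rest)
            control, env, have_value = payload, saved_env, False
        else:                                     # "cc": discard the rest, jump to k1
            kont = kont[1].frames

identity = ("lam", "x", ("var", "x"))
applied = run(("app", identity, ("lit", 42)))
# The throw abandons the pending call to (lambda x. 1).
escaped = run(("letcc", "k",
               ("app", ("lam", "x", ("lit", 1)),
                       ("throw", ("var", "k"), ("lit", 2)))))
```

Because the continuation is an explicit data structure, the machine never uses the host language's call stack, which is also why it has no built-in notion of a program counter or a loop.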

SLIDE 13

Challenges particular to Racket

▶ Detect loops for trace compilation in a higher-order language without explicit loop constructs

▶ Reduce the need for manual specialization

▶ Reduce the overhead imposed by contracts

SLIDE 14

Loop finding: cyclic paths

Record cycles in control flow

[Diagram: program counters pc1, pc2, pc3, pc4, pc5 in sequence, with a backward jump from pc5 to pc1 (pc1 < pc5).]

Default RPython strategy

SLIDE 15

Tracing cycles in the control flow is insufficient

The CEK machine has no notion of a program counter; we can try to use AST nodes instead.

    (define (my-add a b) (+ a b))
    (define (loop a b)
      (my-add a b)
      (my-add a b)
      (loop a b))

Begin tracing at a hot node and continue until that node is reached again.

Trace so far: (+ a b)

SLIDE 16

Tracing cycles in the control flow is insufficient

    (define (my-add a b) (+ a b))
    (define (loop a b)
      (my-add a b)
      (my-add a b)
      (loop a b))

Begin tracing at a hot node and continue until that node is reached again.

Trace so far: (+ a b) → (loop a b)

SLIDE 17

Tracing cycles in the control flow is insufficient

    (define (my-add a b) (+ a b))
    (define (loop a b)
      (my-add a b)
      (my-add a b)
      (loop a b))

Begin tracing at a hot node and continue until that node is reached again.

Trace so far: (+ a b) → (loop a b) → (my-add a b)₁

SLIDE 18

Tracing cycles in the control flow is insufficient

    (define (my-add a b) (+ a b))
    (define (loop a b)
      (my-add a b)
      (my-add a b)
      (loop a b))

Begin tracing at a hot node and continue until that node is reached again.

Trace so far: (+ a b) → (loop a b) → (my-add a b)₁ → (+ a b) → (my-add a b)₂

SLIDE 19

The Callgraph

[Diagram: call graph with the nodes loop and my-add.]

Newer definition: A loop is a cycle in the program’s call graph.

  1. Build the call graph during execution
  2. Mark functions in a cycle as a loop

Trace: (my-add a b)₁ → (+ a b) → (my-add a b)₂ → (+ a b) → (loop a b)
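
The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not Pycket's implementation: the `CallGraph` class and its method names are hypothetical, functions are identified by plain strings, and every function on a newly closed cycle is marked as a loop.

```python
from collections import defaultdict

class CallGraph:
    def __init__(self):
        self.calls = defaultdict(set)  # caller -> set of callees seen so far
        self.loops = set()             # functions known to lie on a cycle

    def add_call(self, caller, callee):
        """Step 1: record an edge the first time a call is observed."""
        if callee in self.calls[caller]:
            return
        self.calls[caller].add(callee)
        # Step 2: the new edge closes a cycle if callee already reaches caller.
        path = self._path(callee, caller)
        if path is not None:
            self.loops.update(path)

    def _path(self, src, dst):
        """Depth-first search; returns the nodes on one path src..dst, or None."""
        parent = {src: None}
        stack = [src]
        while stack:
            node = stack.pop()
            if node == dst:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path
            for nxt in self.calls.get(node, ()):
                if nxt not in parent:
                    parent[nxt] = node
                    stack.append(nxt)
        return None

# Calls observed while running the example from the slides.
graph = CallGraph()
graph.add_call("loop", "my-add")
graph.add_call("loop", "my-add")
graph.add_call("loop", "loop")
```

With this definition only `loop` is a candidate for starting a trace; `my-add` is called often but never lies on a cycle, so the false loop from the previous slides is avoided.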

SLIDE 20

Data Structure Specialization

Unbox small, fixed-size arrays of Racket values

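[Diagram: generic layout: Env has a Vals field that points to a list of three slots, each referencing a boxed value (Fixnum: 1, Flonum: 3.14, Symbol: 'a). Specialized layout: EnvSize3 has inline fields Val0, Val1, Val2 that hold 1 and 3.14 directly and a pointer to Symbol: 'a.]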
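
The idea of size-specialized environments can be imitated in Python, with the caveat that Python cannot unbox values, so the sketch only shows the removal of the intermediate array. All names here (`make_env_class`, `GenericEnv`, `make_env`, the cutoff of four slots) are assumptions made for the example, not taken from Pycket.

```python
def make_env_class(size):
    """Build a class with exactly `size` inline fields: val0 .. val{size-1}."""
    fields = tuple(f"val{i}" for i in range(size))

    class EnvSizeN:
        __slots__ = fields  # fixed fields, no per-instance dict, no list

        def __init__(self, *values):
            if len(values) != size:
                raise ValueError(f"expected {size} values")
            for name, value in zip(fields, values):
                setattr(self, name, value)

        def lookup(self, index):
            return getattr(self, fields[index])

    EnvSizeN.__name__ = f"EnvSize{size}"
    return EnvSizeN

class GenericEnv:
    """Fallback layout: one extra indirection through a list."""
    def __init__(self, *values):
        self.vals = list(values)

    def lookup(self, index):
        return self.vals[index]

# Specialize only small, fixed sizes; larger environments use the fallback.
SPECIALIZED = {n: make_env_class(n) for n in range(1, 5)}

def make_env(values):
    cls = SPECIALIZED.get(len(values), GenericEnv)
    return cls(*values)

env = make_env([1, 3.14, "a"])
big = make_env(list(range(10)))
```

In RPython the specialized classes can additionally store fixnums and flonums unboxed in their fields, which is the part of the optimization this sketch cannot show.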

SLIDE 21

Specialized Mutable Objects

Optimistically specialize the representation of homogeneous containers

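[Diagram: a Vector object has a strategy field and a storage field. The strategy is FloatVectorStrategy, and the storage is an array of length 2 holding the unboxed floats 1.4 and 5.5. The diagram also shows a FixnumCons holding the value 2.]

When a mutating operation invalidates the current strategy, the storage is rewritten. Fortunately, this is infrequent.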

[Bolz et al., OOPSLA 2013]
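
A minimal sketch of the strategy pattern, assuming only two strategies and using Python's `array` module to stand in for unboxed float storage. The class names are illustrative and do not match Pycket's source.

```python
from array import array

class ObjectStrategy:
    """General representation: a list of arbitrary values."""
    def make_storage(self, items):
        return list(items)

    def accepts(self, value):
        return True

class FloatStrategy:
    """Homogeneous representation: a packed array of C doubles."""
    def make_storage(self, items):
        return array("d", items)

    def accepts(self, value):
        return isinstance(value, float)

OBJECT = ObjectStrategy()
FLOAT = FloatStrategy()

class Vector:
    def __init__(self, items):
        items = list(items)
        # Optimistically pick the specialized strategy when every element fits.
        all_floats = bool(items) and all(isinstance(x, float) for x in items)
        self.strategy = FLOAT if all_floats else OBJECT
        self.storage = self.strategy.make_storage(items)

    def ref(self, index):
        return self.storage[index]

    def set(self, index, value):
        if not self.strategy.accepts(value):
            # The write invalidates the strategy: rewrite the storage once.
            self.storage = OBJECT.make_storage(self.storage)
            self.strategy = OBJECT
        self.storage[index] = value

v = Vector([1.4, 5.5])
before = v.strategy
v.set(0, "sym")        # a non-float forces the general representation
after = v.strategy
```

The rewrite copies the whole storage, so the scheme pays off only because such transitions are rare in practice, as the slide notes.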

SLIDE 22

Pycket: What Works?

▶ File IO

(open-input-file "list.txt") (open-output-file "brain.dat")

▶ Numeric tower

number? complex? real? rational? integer? ...

▶ Contracts

(define/contract ...)

▶ Typed Racket

#lang typed/racket

▶ Primitive functions (∼900 of 1400)

SLIDE 23

Pycket: What Doesn’t Work?

▶ FFI

▶ Scribble

#lang scribble/base

▶ DrRacket

▶ Web

#lang web-server/insta

▶ Threads

(thread (λ () ...))

▶ Lesser-used primitives

SLIDE 24

Performance Caveats

Fast                    Slow
Tight loops             Branchy/irregular control flow
Numeric computations    Code not easily expressed as loops
Interpreters            Short-running programs

SLIDE 25

Benchmarks

SLIDE 26

Overall Performance

[Figure: geometric mean runtime (0.0 to 1.0) per system. Larceny benchmarks: racket, larceny, gambit, bigloo, pycket. Shootout benchmarks: racket, pycket.]

SLIDE 27

Specialization

[Figure: Despecialization Slowdown. x-axis: system (racket, pycket); y-axis: % slowdown (5 to 30).]

SLIDE 28

Contracts and Chaperones

    (define (dot v1 v2)
      (for/sum ([e1 v1] [e2 v2])
        (* e1 e2)))

    (define/contract (dotc v1 v2)
      ((vectorof flonum?) (vectorof flonum?) . -> . flonum?)
      (for/sum ([e1 v1] [e2 v2])
        (* e1 e2)))

▶ Pycket supports Racket’s implementation of higher-order software contracts via impersonators and chaperones

▶ Used to support Typed Racket’s implementation of gradual typing

▶ Overhead = Enforcement Cost + Extra Indirection

[Strickland, Tobin-Hochstadt, Findler, Flatt 2012]
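
The two sources of overhead can be seen in a rough Python analogy. This is not Racket's chaperone API; `ChaperoneVector`, `ContractError`, and `is_flonum` are names invented for the sketch. The proxy forwards every read to the wrapped list (the extra indirection) and checks each element on the way out (the enforcement cost).

```python
class ContractError(Exception):
    pass

def is_flonum(value):
    return isinstance(value, float)

class ChaperoneVector:
    """Proxy around a list that checks every element as it is read."""
    def __init__(self, target, check):
        self._target = target
        self._check = check

    def __len__(self):
        return len(self._target)

    def __getitem__(self, index):
        value = self._target[index]       # extra indirection
        if not self._check(value):        # enforcement cost
            raise ContractError(f"element {index} violates the contract: {value!r}")
        return value

    def __iter__(self):
        return (self[i] for i in range(len(self)))

def dot(v1, v2):
    return sum((e1 * e2 for e1, e2 in zip(v1, v2)), 0.0)

def dotc(v1, v2):
    """dot with (vectorof flonum?) arguments and a flonum? result, by analogy."""
    result = dot(ChaperoneVector(v1, is_flonum), ChaperoneVector(v2, is_flonum))
    if not is_flonum(result):
        raise ContractError(f"result violates the contract: {result!r}")
    return result

checked = dotc([1.0, 2.0], [3.0, 4.0])
```

A tracing JIT can often inline the proxy's read path into the loop and remove repeated checks, which is how the indirection part of the overhead shrinks.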

SLIDE 29

Benchmarks: Contracts

[Figure: Chaperone Slowdown. x-axis: benchmark (bubble, church, struct, ode, binomial); y-axis: slowdown (5 to 35); series: pycket, racket.]

SLIDE 30

Future Improvements

▶ Improve chaperone/impersonator performance and space usage

▶ Explore interaction between ahead-of-time and just-in-time optimizations

▶ Green threads and inter-thread optimizations

▶ Improve performance on complicated control flow

▶ Support more of Racket

SLIDE 31

Thank You

▶ Dynamic language JIT compilation is a viable implementation strategy for functional languages

▶ Novel loop detection method for trace compilation of a higher-order language

▶ Significant reduction in contract overhead

▶ Significant reduction in the need for manual specialization

https://github.com/samth/pycket