Orc: David Schleef, Entropy Wave Inc


slide-1
SLIDE 1

(c) 2009 Entropy Wave Inc

Orc

David Schleef Entropy Wave Inc

slide-2
SLIDE 2

What is Orc

A system for describing low-level computation on modern CPUs

slide-3
SLIDE 3

Motivation

slide-4
SLIDE 4

Motivation

  • Want maintainable assembly code
slide-5
SLIDE 5

Motivation

  • Want maintainable assembly code
  • Want to quickly write assembly code
slide-6
SLIDE 6

Motivation

  • Want maintainable assembly code
  • Want to quickly write assembly code
  • Want to verify correct behavior
slide-7
SLIDE 7

Possible Solutions

  • Hand-written assembly
  • perfect C compiler
  • C with intrinsics
  • C with #pragmas (TI C6x, OpenMP)
  • Enhanced C (CUDA, GLSL, OpenCL)
  • LLVM
  • other...
slide-8
SLIDE 8

Combinatoric Problem

Video Format Conversion:

  • Video format conversion: 23 input formats × 23 output formats × 9 algorithms = 4761 functions
  • Schroedinger motion compensation: 32768 functions
  • Pixman rendering: >= 1e9 functions

Conclusion: runtime code generation
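
The first count is a plain product of the three choices. A minimal sketch of that arithmetic in C follows; the function name is illustrative and not part of Orc.

```c
#include <assert.h>

/* Number of distinct routines needed when every
 * (input format, output format, algorithm) triple gets its own
 * hand-written function. Hypothetical helper, for illustration only. */
static long
count_conversion_functions (long n_in, long n_out, long n_algorithms)
{
  assert (n_in >= 0 && n_out >= 0 && n_algorithms >= 0);
  return n_in * n_out * n_algorithms;
}
```

With 23 inputs, 23 outputs and 9 algorithms this gives the 4761 quoted above. Adding one more format adds hundreds of functions, which is the argument for generating them at runtime.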

slide-9
SLIDE 9

Orc Parts

  • Language for describing computation
  • Compiler for language (orcc): to intermediate form, or to SSE/MMX/C/Neon/etc.
  • Orc library (liborc-0.4.so): generate and compile functions at runtime

slide-10
SLIDE 10

Orc Features

  • Active Backends: SSE, MMX, Neon, Altivec, C
  • Experimental: C64x, Arm
  • Can generate for different CPU microarchitectures
  • 194 opcodes
  • 8/16/32/64-bit signed/unsigned int
  • 32/64-bit float
  • 1D, 2D arrays, constant or variable size
slide-11
SLIDE 11

Orc Features

  • Easy to make Orc optional
  • Embedded friendly
slide-12
SLIDE 12

Opcodes

  • standard and saturated arithmetic
  • shifting, size and float conversion
  • specialized loading: loadoff[bwl], ldreslin[bl]
  • accumulation
  • div255w: divide by 255 (for compositing)
  • divluw: divide 16-bit by 8-bit
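
Dividing by 255 comes up in compositing because an 8-bit colour times an 8-bit alpha has to be scaled back to 8 bits. The sketch below shows a well-known add-and-shift way to do that with rounding. It is an illustration of the idea, not a statement of how Orc defines div255w, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded division by 255 using only adds and shifts.
 * Exact for 0 <= x <= 255 * 255, the range of an 8-bit x 8-bit product. */
static uint8_t
div255_round (uint16_t x)
{
  uint16_t t;

  assert (x <= 255 * 255);
  t = (uint16_t) (x + 128);
  return (uint8_t) ((t + (t >> 8)) >> 8);
}

/* Blend one 8-bit channel: src weighted by alpha, dst by (255 - alpha). */
static uint8_t
blend_channel (uint8_t src, uint8_t dst, uint8_t alpha)
{
  return (uint8_t) (div255_round ((uint16_t) (src * alpha))
      + div255_round ((uint16_t) (dst * (255 - alpha))));
}

/* Exhaustively compare against true rounded division; returns mismatches. */
static int
div255_count_mismatches (void)
{
  int mismatches = 0;

  for (unsigned x = 0; x <= 255u * 255u; x++) {
    if (div255_round ((uint16_t) x) != (x + 127) / 255)
      mismatches++;
  }
  return mismatches;
}
```

The point of a dedicated opcode is that this pattern maps onto a few SIMD instructions, whereas a general integer divide usually does not.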
slide-13
SLIDE 13

Automatic Test Features

  • Test and compare: against backup C code or emulation
  • Compile and compare: generated source vs. generated binary code
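
A minimal sketch of the "test and compare" idea in plain C: run a reference implementation and a candidate on the same input and count differing outputs. Every name here is hypothetical, and Orc's own test harness is not shown. The averaging kernels merely stand in for a backup function and a generated one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*AvgKernel) (uint8_t *d, const uint8_t *a, const uint8_t *b, int n);

/* Reference ("backup") version: widen, add, round up, shift. */
static void
avg_ref (uint8_t *d, const uint8_t *a, const uint8_t *b, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (uint8_t) ((a[i] + b[i] + 1) >> 1);
}

/* Candidate: same result without needing a wider intermediate. */
static void
avg_fast (uint8_t *d, const uint8_t *a, const uint8_t *b, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (uint8_t) ((a[i] | b[i]) - ((a[i] ^ b[i]) >> 1));
}

/* Deliberately wrong candidate: rounds down instead of up. */
static void
avg_broken (uint8_t *d, const uint8_t *a, const uint8_t *b, int n)
{
  for (int i = 0; i < n; i++)
    d[i] = (uint8_t) ((a[i] & b[i]) + ((a[i] ^ b[i]) >> 1));
}

/* Run both kernels on identical input and return the mismatch count. */
static int
compare_kernels (AvgKernel ref, AvgKernel cand, uint32_t seed)
{
  enum { N = 1024 };
  uint8_t a[N], b[N], d_ref[N], d_cand[N];
  int mismatches = 0;

  assert (ref != NULL && cand != NULL);
  for (int i = 0; i < N; i++) {
    seed = seed * 1664525u + 1013904223u;   /* small deterministic LCG */
    a[i] = (uint8_t) (seed >> 24);
    seed = seed * 1664525u + 1013904223u;
    b[i] = (uint8_t) (seed >> 16);
  }
  /* Fixed edge cases so rounding differences are always exercised. */
  a[0] = 0;   b[0] = 1;
  a[1] = 255; b[1] = 254;

  ref (d_ref, a, b, N);
  cand (d_cand, a, b, N);
  for (int i = 0; i < N; i++)
    if (d_ref[i] != d_cand[i])
      mismatches++;
  return mismatches;
}
```

Comparing avg_ref with avg_fast should report no mismatches, while avg_broken is caught on the first edge case. A one-bit rounding error like that is easy to miss by eye in SIMD code.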

slide-14
SLIDE 14

Orc Workflow

[Flow diagram. Box labels: Write .orc source; Compile with orcc; SSE/MMX/Neon/etc.; C source; liborc-based C source; Runtime code generation; Execute SSE/MMX/Neon/etc.; Write liborc-based C source; Execute compiled C code]

slide-15
SLIDE 15

Orc code

Vertical downscale by factor of 2 (3 taps)

.function cogorc_downsample_vert_cosite_3tap
.dest 1 d1
.source 1 s1
.source 1 s2
.source 1 s3
.temp 2 t1
.temp 2 t2
.temp 2 t3

convubw t1, s1
convubw t2, s2
convubw t3, s3
mullw t2, t2, 2
addw t1, t1, t3
addw t1, t1, t2
addw t1, t1, 2
shrsw t1, t1, 2
convsuswb d1, t1
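
For readers new to the opcode names, here is a plain C reading of the same sequence, one output sample per loop iteration. It is a sketch written for this explanation, not the code orcc emits. The `_ref` function name is hypothetical, and the comments give my reading of each opcode.

```c
#include <assert.h>
#include <stdint.h>

/* Per-sample C equivalent of the Orc program: (s1 + 2*s2 + s3 + 2) >> 2. */
static void
downsample_vert_cosite_3tap_ref (uint8_t *d1, const uint8_t *s1,
    const uint8_t *s2, const uint8_t *s3, int n)
{
  assert (n >= 0);
  for (int i = 0; i < n; i++) {
    int16_t t1 = s1[i];              /* convubw t1, s1: unsigned byte -> word */
    int16_t t2 = s2[i];              /* convubw t2, s2 */
    int16_t t3 = s3[i];              /* convubw t3, s3 */

    t2 = (int16_t) (t2 * 2);         /* mullw t2, t2, 2: centre tap weight */
    t1 = (int16_t) (t1 + t3);        /* addw  t1, t1, t3 */
    t1 = (int16_t) (t1 + t2);        /* addw  t1, t1, t2 */
    t1 = (int16_t) (t1 + 2);         /* addw  t1, t1, 2: rounding bias */
    t1 = (int16_t) (t1 >> 2);        /* shrsw t1, t1, 2: divide by 4 */

    /* convsuswb d1, t1: clamp signed word to unsigned byte */
    d1[i] = (uint8_t) (t1 < 0 ? 0 : (t1 > 255 ? 255 : t1));
  }
}
```

The taps are weighted 1, 2, 1, so the largest intermediate value is 255 * 4 + 2 = 1022, which fits comfortably in the 2-byte temporaries declared with `.temp 2`.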

slide-16
SLIDE 16

Generated code

Header:

void cogorc_downsample_vert_cosite_3tap (uint8_t * d1, uint8_t * s1, uint8_t * s2, uint8_t * s3, int n);

C source (generator function):

void
cogorc_downsample_vert_cosite_3tap (uint8_t * d1, uint8_t * s1, uint8_t * s2, uint8_t * s3, int n)
{
  OrcExecutor _ex, *ex = &_ex;
  static int p_inited = 0;
  static OrcProgram *p = 0;

  if (!p_inited) {
    orc_once_mutex_lock ();
    ...
}
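
The visible part of the generator function follows a build-once pattern: a static flag, a static program pointer, and a lock around first-time setup. The sketch below shows that pattern in isolation using a POSIX mutex. The `Kernel` type and the `kernel_*` names are hypothetical stand-ins and no liborc calls are made. Unlike the generated code, it always takes the lock rather than testing the flag first, which keeps the example simple.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for a compiled program; real code would hold an OrcProgram. */
typedef struct {
  int ready;
} Kernel;

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;
static Kernel kernel;
static int kernel_inited = 0;
static int kernel_builds = 0;   /* counts how often the expensive step ran */

/* Placeholder for the expensive "describe and compile" step. */
static void
kernel_build (Kernel *k)
{
  k->ready = 1;
  kernel_builds++;
}

/* Return the shared kernel, building it on first use only. */
static Kernel *
kernel_get (void)
{
  pthread_mutex_lock (&kernel_lock);
  if (!kernel_inited) {
    kernel_build (&kernel);
    kernel_inited = 1;
  }
  pthread_mutex_unlock (&kernel_lock);

  assert (kernel.ready);
  return &kernel;
}
```

Every call after the first pays only for the lock and the flag test, so the cost of code generation is spread over all later calls.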

slide-17
SLIDE 17

Generated code

C source (backup function):

static void
_backup_cogorc_downsample_vert_cosite_3tap (OrcExecutor *ex)
{
  int i;
  int8_t * var0;
  const int8_t * var4;
  const int8_t * var5;
  const int8_t * var6;
  ...
}

Test Code: 110 lines of C code
Assembly Code (optional): 395 for SSE, 216 for Neon

slide-18
SLIDE 18

GStreamer Plugins using Orc

adder, audioconvert, videoscale, videotestsrc, volume, deinterlace, videobox, videomixer, cog, colorspace, invtelecine

slide-19
SLIDE 19

Schrödinger Orc status

  • Used everywhere in schro
  • Limited by Orc features
slide-20
SLIDE 20

Cairo Orc status

  • Orc backend is slightly faster than SSE
  • Orc backend handles more operators than SSE backend
  • Everything in place to write a Grand Unified Compositor function (>1e9 combinations)

slide-21
SLIDE 21

videoscale speed comparison

slide-22
SLIDE 22

colorspace speed comparison

slide-23
SLIDE 23

Emergent Features

What opportunities arise when writing SIMD code is quick and easy?

slide-24
SLIDE 24

Emergent Features

  • 10/16-bit video processing
  • Floating point video processing
  • Quality vs. time tradeoffs

slide-25
SLIDE 25

Emergent Features

[Plot. Axis labels: quality factor; time per frame (ms)]

slide-26
SLIDE 26

Limitations

  • 0.4 ABI is horrific
  • Fixed-size arrays everywhere
  • Limited number of constants/parameters
slide-27
SLIDE 27

Opportunities

  • Instruction scheduler: reorder instruction stream to improve processor parallelization
  • Multi-register allocation: do more operations on full registers
  • Better handling of register spills/constant loading
slide-28
SLIDE 28

Future Directions

  • Alignment characteristics for arrays
  • Swizzling, shuffling opcodes
  • Table lookup opcodes
  • Convolution load opcodes
  • Non-loop-based functions (for 8x8 DCT)
  • Exposure of backend code generators in API
  • Macros/high-level opcodes