Generating Fast Operators for Binarizable Networks, Meghan Cowan (PowerPoint presentation transcript)



SLIDE 1

Generating Fast Operators for Binarizable Networks

Meghan Cowan

SLIDE 2

Running Binarizable Networks?

SLIDE 3

Running Binarizable Networks?

Training in frameworks with no binarizable operators.

SLIDE 4

Running Binarizable Networks?

Training in frameworks with no binarizable operators.
Can't evaluate performance gains.

[Chart: speedup over baseline, value unknown ("?")]

SLIDE 5

Running Binarizable Networks?

Training in frameworks with no binarizable operators.
Easy to introduce bugs.
Can't evaluate performance gains.

[Chart: speedup over baseline, value unknown ("?")]

SLIDE 6

Running Binarizable Networks?

Training in frameworks with no binarizable operators.
Easy to introduce bugs.
Can't evaluate performance gains.

[Chart: speedup over baseline, value unknown ("?")]

Need to generate binarizable operators ourselves!
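"Binarizable" operators start from bit-packing: weights constrained to {-1, +1} can be stored one bit per value instead of one float per value. A minimal pure-Python sketch of the idea (the bit convention, +1 -> 1 and -1 -> 0 with MSB first, is an assumption for illustration; real implementations such as TVM's pack into fixed-width machine words):

```python
def bitpack(values):
    """Pack a list of +1/-1 values into an integer, one bit per value.

    Convention (assumed for illustration): +1 -> bit 1, -1 -> bit 0,
    first element in the most significant bit.
    """
    word = 0
    for v in values:
        word = (word << 1) | (1 if v > 0 else 0)
    return word

# Eight binary weights fit in a single byte.
w = [+1, -1, -1, +1, +1, +1, -1, -1]
print(bitpack(w))  # 0b10011100 == 156
```

Packing is what makes the later bitwise tricks possible: once 8, 32, or 128 values share one word, a single AND/XNOR plus popcount processes all of them at once.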

SLIDE 7

[Chart: speedup of Baseline, Unoptimized, and Goal]

Baselines are incredibly well optimized.
Without optimizations, low precision can't compete.

SLIDE 8

[Chart: speedup of Baseline, Unoptimized, and Goal]

Baselines are incredibly well optimized.
Without optimizations, low precision can't compete.
Want operators that are fast.

SLIDE 9

[Chart: speedup of Baseline, Unoptimized, and Goal]

Baselines are incredibly well optimized.
Without optimizations, low precision can't compete.
Need optimized operators for all workloads.
Performance portability across different CPUs.
Want operators that are fast.

SLIDE 10

Generating Fast Operators for Binarizable Networks

[Diagram: TVM stack: High-Level Differentiable IR -> Tensor Expression IR -> Optimization (AutoTVM, AutoVTA) -> LLVM, CUDA, Metal / VTA -> Edge FPGA, Cloud FPGA, ASIC, Hardware Fleet]

SLIDE 11

Generating Fast Operators for Binarizable Networks

[Diagram: TVM stack, as on slide 10]

Declare bitserial computation and a CPU schedule describing an optimization space.
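The "bitserial computation" being declared can be modeled in plain Python (an illustrative sketch, not TVM's actual tensor-expression code). For 1-bit weights and activations (W1A1) encoded as sign bits, a dot product reduces to an XNOR, which marks positions where signs agree, followed by a popcount:

```python
def popcount(x):
    """Count the set bits of a non-negative Python int."""
    return bin(x).count("1")

def binary_dot(a_bits, w_bits, n):
    """Dot product of two n-element +1/-1 vectors packed one bit per value.

    Encoding (assumed): bit 1 means +1, bit 0 means -1.
    XNOR gives a 1 wherever the signs agree; each agreement contributes
    +1 and each disagreement -1, hence the result is 2*matches - n.
    """
    mask = (1 << n) - 1
    matches = popcount(~(a_bits ^ w_bits) & mask)
    return 2 * matches - n

# a = [+1, -1, +1, -1] -> 0b1010, w = [+1, +1, -1, -1] -> 0b1100
print(binary_dot(0b1010, 0b1100, 4))  # (+1)(+1)+(-1)(+1)+(+1)(-1)+(-1)(-1) = 0
```

Higher-precision variants such as W1A2 or W2A2 run this same kernel once per bit-plane pair and weight the partial popcounts by powers of two, which is where the "bitserial" name comes from.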

SLIDE 12

Generating Fast Operators for Binarizable Networks

[Diagram: TVM stack, as on slide 10]

Declare bitserial computation and a CPU schedule describing an optimization space.
Use AutoTVM to find schedule parameters for different operators and backends.
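The search AutoTVM performs can be sketched as follows in plain Python (a toy stand-in, not the AutoTVM API): the schedule declares a parameter space, here tile sizes for two loop axes, and the tuner benchmarks candidates and keeps the fastest one for the target device:

```python
import itertools
import time

def run_with_tiling(tile_i, tile_j, n=128):
    """Stand-in for compiling and timing one schedule candidate.

    In AutoTVM the candidate is compiled and benchmarked on the real
    device; here we just time a toy tiled loop nest.
    """
    total = 0
    start = time.perf_counter()
    for i0 in range(0, n, tile_i):
        for j0 in range(0, n, tile_j):
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, n)):
                    total += i * j
    return time.perf_counter() - start

# The "schedule space": candidate tile sizes for the two loop axes.
space = list(itertools.product([4, 8, 16, 32], repeat=2))
best = min(space, key=lambda cfg: run_with_tiling(*cfg))
print("best (tile_i, tile_j):", best)
```

Because the same template is re-tuned per operator shape and per CPU, the result is the performance portability the earlier slide asks for, without hand-writing each kernel.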

SLIDE 13

Generating Fast Operators for Binarizable Networks

[Diagram: TVM stack, as on slide 10]

Declare bitserial computation and a CPU schedule describing an optimization space.
Use AutoTVM to find schedule parameters for different operators and backends.
Override LLVM code generation with a custom microkernel: use the tensorize primitive to replace the inner-most loop of the computation.

vcnt.8    q8, q8
vrev16.8  q5, q8
vadd.i8   q8, q8, q5
vorr      q5, q8, q8
vuzp.8    q8, q5
vmovl.u8  q5, d16
vrev32.16 q5, q5
vaddw.u8  q8, q5, d16
vorr      q5, q8, q8
vuzp.16   q8, q5

tensorize()
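The NEON sequence above is a hand-written widening popcount reduction: vcnt.8 counts bits per byte and the rev/add/uzp shuffles sum those counts into wider accumulators. A plain-Python model of the unit that tensorize() substitutes for the generated inner loop (an illustrative sketch, not the real microkernel):

```python
def popcount8(b):
    """Per-byte popcount, the role vcnt.8 plays in the NEON kernel."""
    return bin(b & 0xFF).count("1")

def microkernel(acc, a_words, w_words):
    """Hand-written inner loop: AND the packed 1-bit data, popcount
    each byte, and accumulate. This is the unit of computation that
    tensorize() would swap in for the innermost generated loop
    (Python model; the real version is the NEON assembly above).
    """
    for a, w in zip(a_words, w_words):
        acc += popcount8(a & w)
    return acc

# Two packed bytes per operand; popcount(AND) accumulated into acc.
print(microkernel(0, [0b10110010, 0b11110000], [0b10010110, 0b01110001]))  # 6
```

Scheduling and code generation stay automatic everywhere else; only this one hot loop is pinned to the hand-tuned instruction sequence.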

SLIDE 14

Convolutions on Raspberry Pi

Can generate low-precision convolutions 5.5x to 15.2x faster than optimized 16-bit integer.

[Chart: relative speedup (y-axis, 6 to 30) per ResNet-18 layer (2 through 12, plus Total) for 16-bit TVM, W1A1, W1A2, and W2A2]