Generating Fast Operators for Binarizable Networks (Meghan Cowan)
Running Binarizable Networks?

Training in frameworks with no binarizable operators:
- Easy to introduce bugs
- Can't evaluate performance gains

[Speedup chart: unknown, shown as "?"]
Need to generate binarizable operators ourselves!
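As a minimal sketch of what a binarizable (1-bit) operator computes, the dot product of two +1/-1 vectors can be packed into integer bitplanes and evaluated with XNOR and popcount instead of multiply-accumulate. This code is illustrative only (the names `pack_bits` and `binary_dot` are not from the talk):

```python
# Sketch: 1-bit (W1A1) dot product via bit packing + XOR + popcount.
# With values in {+1, -1} encoded as bits (+1 -> 1, -1 -> 0),
# dot(a, w) = n - 2 * popcount(a_bits XOR w_bits).

def pack_bits(values):
    """Pack a list of +1/-1 values into one integer, one bit per element."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_bits, w_bits, n):
    """Dot product of two packed +1/-1 vectors of length n."""
    return n - 2 * bin(a_bits ^ w_bits).count("1")

a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
assert binary_dot(pack_bits(a), pack_bits(w), 4) == sum(x * y for x, y in zip(a, w))
```

The same identity is what makes the low-precision convolutions below cheap: one XOR plus one popcount replaces a whole vector of multiplies.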
[Speedup chart: Baseline vs. Unoptimized vs. Goal]

- Baselines are incredibly well optimized
- Without optimizations, low precision can't compete
- Want operators that are fast
- Need optimized operators for all workloads
- Performance portability across different CPUs
Optimization

[TVM stack diagram: High-Level Differentiable IR → Tensor Expression IR → {AutoTVM → LLVM, CUDA, Metal; AutoVTA → VTA} → hardware fleet: Edge FPGA, Cloud FPGA, ASIC]

- Declare the bitserial computation and a CPU schedule describing an optimization space.
- Use AutoTVM to find schedule parameters for different operators and backends.
- Override LLVM code generation with a custom microkernel: use the tensorize primitive to replace the inner-most loop of the computation.
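The bitserial computation declared in the tensor expression generalizes the 1-bit trick to multi-bit operands: an A-bit by W-bit dot product decomposes into AND + popcount over every pair of bitplanes, weighted by powers of two. A hedged plain-Python rendition of that idea (function names are illustrative, not TVM's API):

```python
# Sketch: unsigned bit-serial dot product, as used for W2A2-style operators.
# dot(a, w) = sum_{i,j} 2^(i+j) * popcount(plane_i(a) AND plane_j(w))

def bitplanes(values, nbits):
    """Pack bit b of every element into one integer word per plane."""
    planes = []
    for b in range(nbits):
        word = 0
        for i, v in enumerate(values):
            word |= ((v >> b) & 1) << i
        planes.append(word)
    return planes

def bitserial_dot(a, w, abits, wbits):
    """Unsigned bit-serial dot product via AND + popcount over bitplanes."""
    ap, wp = bitplanes(a, abits), bitplanes(w, wbits)
    acc = 0
    for i in range(abits):
        for j in range(wbits):
            acc += (1 << (i + j)) * bin(ap[i] & wp[j]).count("1")
    return acc

a = [1, 3, 2, 0]   # 2-bit activations
w = [2, 1, 3, 1]   # 2-bit weights
assert bitserial_dot(a, w, 2, 2) == sum(x * y for x, y in zip(a, w))
```

The schedule's optimization space then covers how these bitplane loops are tiled, vectorized, and reordered, which is what AutoTVM searches over per operator and backend.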
vcnt.8    q8, q8
vrev16.8  q5, q8
vadd.i8   q8, q8, q5
vorr      q5, q8, q8
vuzp.8    q8, q5
vmovl.u8  q5, d16
vrev32.16 q5, q5
vaddw.u8  q8, q5, d16
vorr      q5, q8, q8
vuzp.16   q8, q5
tensorize()
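The NEON microkernel above counts bits per byte (`vcnt.8`) and then pairwise-combines neighbouring lanes into fewer, wider accumulators. A hedged scalar rendition of that reduction pattern (not the actual kernel, just the same shape of computation):

```python
# Sketch: per-byte popcount followed by a pairwise widening reduction,
# mirroring vcnt.8 + pairwise-add lane narrowing on NEON.

def popcount_bytes(word64):
    """Like vcnt.8: per-byte popcount of a 64-bit word, low byte first."""
    return [bin((word64 >> (8 * i)) & 0xFF).count("1") for i in range(8)]

def pairwise_widen(lanes):
    """Add adjacent lanes into half as many, wider lanes."""
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]

x = 0xF0F00F0F12345678
lanes = popcount_bytes(x)          # eight per-byte counts
while len(lanes) > 1:
    lanes = pairwise_widen(lanes)  # 8 -> 4 -> 2 -> 1
assert lanes[0] == bin(x).count("1")
```

Hand-writing this reduction as a microkernel and swapping it in via `tensorize()` is what lets the generated code beat what LLVM emits for the inner loop.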
Convolutions on Raspberry Pi
Can generate low-precision convolutions 5.5x to 15.2x faster than optimized 16-bit integer baselines.
[Chart: relative speedup (y-axis up to 30x) per ResNet-18 layer (layers 2 through 12) and in total, for W1A1, W1A2, and W2A2 vs. 16-bit TVM]