Tuning the Performance of Convolutional Neural Network for Image Classification on GPU



SLIDE 1

Tuning the Performance of Convolutional Neural Network for Image Classification on GPU

SLIDE 2

Confidential & Proprietary

Agenda

  • Adoption of image classification and image recognition at Alibaba

  • Easy ways to improve performance of Caffe
  • Further performance optimization of convolution layer
  • Ongoing work

SLIDE 3

Image classification at Alibaba

  • Product display classification: Model-Upper / Item-Bottom / Multi-Object
  • Fashion style classification: Sweet / Street / Office
  • Buy-by-photo mobile app: search for visually similar products by image
  • Leverages the Caffe framework

SLIDE 4

Profiling Caffe

Caffe spends more than 70% of its time in convolution layers.

  • The convolution layers are the most expensive part
SLIDE 5

Convolution layer

  • How the convolution layer works in Caffe: Image to Column (im2col), then SGEMM
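The two-step flow named above can be sketched in plain Python. This is a minimal illustration of the im2col-plus-GEMM idea, not Caffe's actual code: the function names `im2col`, `gemm` and `conv_forward`, the nested-list data layout, and the tiny test image are all assumptions made for the example.

```python
def im2col(image, ksize, stride=1, pad=0):
    """Unroll a C x H x W image (nested lists) into a
    (C*ksize*ksize) x (out_h*out_w) column matrix."""
    channels, height, width = len(image), len(image[0]), len(image[0][0])
    out_h = (height + 2 * pad - ksize) // stride + 1
    out_w = (width + 2 * pad - ksize) // stride + 1
    cols = []
    for c in range(channels):
        for ky in range(ksize):
            for kx in range(ksize):
                row = []
                for oy in range(out_h):
                    for ox in range(out_w):
                        y = oy * stride + ky - pad
                        x = ox * stride + kx - pad
                        inside = 0 <= y < height and 0 <= x < width
                        # Positions that fall in the padding read as zero.
                        row.append(image[c][y][x] if inside else 0.0)
                cols.append(row)
    return cols, out_h, out_w


def gemm(a, b):
    """Naive product of an M x K matrix and a K x N matrix."""
    return [[sum(x * y for x, y in zip(a_row, b_col)) for b_col in zip(*b)]
            for a_row in a]


def conv_forward(image, filters, ksize, stride=1, pad=0):
    """filters is M x (C*ksize*ksize), one flattened filter per row.
    Returns the M x (out_h*out_w) output and the output height and width."""
    cols, out_h, out_w = im2col(image, ksize, stride, pad)
    return gemm(filters, cols), out_h, out_w


# One-channel 3x3 image and a single 2x2 filter of ones:
# every output value is the sum of one 2x2 window.
image = [[[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0],
          [7.0, 8.0, 9.0]]]
filters = [[1.0, 1.0, 1.0, 1.0]]
output, out_h, out_w = conv_forward(image, filters, ksize=2)
print(output, out_h, out_w)
```

The point of the unrolling is that the whole layer becomes one matrix product with M = number of filters, K = channels x ksize x ksize and N = output pixels, so its speed is governed by how well the SGEMM routine handles those shapes.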

SLIDE 6

The gap

  • Is it really fast?

Chart legend: blue is Caffe (ImageNet model), red is the SGEMM routine of cuBLAS, green is the peak of the K20. The ImageNet model refers to the ILSVRC12 challenge.

SLIDE 7

How does cuBLAS SGEMM perform?

SLIDE 8

Easiest way to narrow the gap

  • To overcome the low efficiency of SGEMM at small scale

– Single image every loop: Image to Column, then GEMM, repeated to process one batch
– Batch-coalesced images every loop: Image to Column, then GEMM, over several images at once to process one batch
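The difference between the two loops can be illustrated with a small sketch. It assumes each image has already been unrolled into a K x N column matrix; the helper names `per_image` and `batch_coalesced` and the toy data are hypothetical, and the real implementation on the GPU is not shown in the slides. Coalescing concatenates the column matrices along N so that one larger GEMM replaces several small ones and produces the same numbers.

```python
def gemm(a, b):
    """Naive product of an M x K matrix and a K x N matrix."""
    return [[sum(x * y for x, y in zip(a_row, b_col)) for b_col in zip(*b)]
            for a_row in a]


def per_image(filters, col_matrices):
    """One small GEMM per image: each product is M x N."""
    return [gemm(filters, cols) for cols in col_matrices]


def batch_coalesced(filters, col_matrices):
    """One GEMM for the whole group of images."""
    k = len(col_matrices[0])
    n = len(col_matrices[0][0])
    # Concatenate the column matrices side by side: K x (batch * N).
    wide = [sum((cols[r] for cols in col_matrices), []) for r in range(k)]
    product = gemm(filters, wide)  # M x (batch * N)
    # Slice the wide result back into one M x N block per image.
    return [[row[i * n:(i + 1) * n] for row in product]
            for i in range(len(col_matrices))]


filters = [[1.0, 2.0]]                       # M = 1, K = 2
col_matrices = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],      # image 0, K x N = 2 x 3
    [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],      # image 1
]
separate = per_image(filters, col_matrices)
coalesced = batch_coalesced(filters, col_matrices)
print(separate == coalesced)
```

With B images coalesced, the GEMM's N dimension grows from N to B x N while M and K stay fixed, which moves the product toward the larger sizes where SGEMM runs more efficiently.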

SLIDE 9

Performance of Fast mode

  • Titan Black, mini-batch size of 256
SLIDE 10

Moving forward

  • How is cuBLAS SGEMM implemented?
SLIDE 11

Use high-performance SGEMM routines

  • Example: ImageNet convolution layer “conv5”: M = 96, N = 3025, K = 363
  • cuBLAS uses sgemm_64x16x64x16x16, which is slow
  • We use sgemm_128x8x128x16x16 to get the same result, 1.54x faster on K20
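To put the quoted GEMM size in perspective, the following sketch counts its floating-point operations and converts a run time into GFLOP/s. The helper names are hypothetical and no timing from the talk is reproduced; any time passed to `gflops` is a placeholder the reader supplies. As an observation from the numbers alone, K = 363 factors as 3 x 11 x 11 and N = 3025 as 55 x 55, which is what an im2col GEMM would produce for 11 x 11 filters over 3 input channels and a 55 x 55 output map; the slide itself does not spell this out.

```python
def gemm_flops(m, n, k):
    """Operations for C = A * B with A of size M x K and B of size K x N:
    one multiply and one add for each of the M * N * K terms."""
    return 2 * m * n * k


def gflops(flop_count, seconds):
    """Achieved throughput in GFLOP/s for a measured run time."""
    return flop_count / seconds / 1e9


def time_after_speedup(baseline_seconds, speedup):
    """Run time implied by a speedup factor over a baseline."""
    return baseline_seconds / speedup


M, N, K = 96, 3025, 363
flops = gemm_flops(M, N, K)
print(flops)  # operations in one forward GEMM of this layer for one image
```

Dividing `flops` by a measured kernel time and comparing against the device peak gives the efficiency gap that the choice of SGEMM kernel is meant to close.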

SLIDE 12

Implement our own conv layer

  • Auto-generated GPU kernels for convolution layers
  • Kernels are implemented in PTX assembly

Conv2 from Alex’s Net: Height = 16; Width = 16; Channel = 5; Stride = 1; Ksize = 5; Pad = 2; Neuron = 32
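For the layer parameters quoted above, the output size and the equivalent im2col GEMM shape can be worked out as follows. This is a sketch under two assumptions: that "Neuron" means the number of filters (output channels), and that the standard output-size formula applies. The function names are hypothetical.

```python
def conv_output_size(size, ksize, stride, pad):
    """Output length along one spatial dimension."""
    return (size + 2 * pad - ksize) // stride + 1


def im2col_gemm_shape(height, width, channels, ksize, stride, pad, neurons):
    """(M, N, K) of the GEMM an im2col-based layer would run for one image.
    Assumes `neurons` is the number of filters."""
    out_h = conv_output_size(height, ksize, stride, pad)
    out_w = conv_output_size(width, ksize, stride, pad)
    return neurons, out_h * out_w, channels * ksize * ksize


shape = im2col_gemm_shape(height=16, width=16, channels=5,
                          ksize=5, stride=1, pad=2, neurons=32)
print(shape)
```

With a pad of 2 and a 5 x 5 kernel at stride 1 the 16 x 16 map keeps its size, so the per-image GEMM is small in every dimension, the regime where a general-purpose SGEMM is least efficient and a purpose-built convolution kernel has the most room to help.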

SLIDE 13

Is PTXAS good enough?

  • Problem

– Register usage
– Manipulate “control code” on Kepler

  • Our own assembler for Kepler

– Probe native instructions
– Probe control instructions
– Ongoing

  • Some users need a native assembler, please!

Snippet of instructions from the SGEMM kernel, sm_35:

    /* 0x0900101c1c101c1c */
    FFMA R23, R83, R84, R23;
    FFMA R33, R88, R84, R33;
    FFMA R36, R88, R85, R36;
    NOP;
    FFMA R45, R89, R84, R45;
    FFMA R32, R89, R85, R32;
    NOP;
    /* 0x0880101410141014 */
    FFMA R5, R80, R86, R5;
    FFMA R2, R81, R86, R2;
    FFMA R14, R81, R87, R14;
    FFMA R7, R80, R92, R7;
    FFMA R3, R80, R87, R3;
    FFMA R8, R81, R92, R8;
    NOP;

SLIDE 14

Other ongoing work

  • Convert models from single-precision floating point to

– half precision (Maxwell)
– flexible fixed point (FPGA)
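The effect of both conversions on a single weight can be sketched with the Python standard library. The half-precision round trip uses the `struct` module's `'e'` format (IEEE 754 binary16). The fixed-point helper is an assumption about what "flexible" could mean here, namely a signed format whose number of fractional bits is chosen per model or per layer; the slides do not describe the actual scheme, and `to_half` and `to_fixed` are hypothetical names.

```python
import struct


def to_half(value):
    """Round a float to IEEE 754 half precision and convert it back."""
    return struct.unpack('<e', struct.pack('<e', value))[0]


def to_fixed(value, frac_bits, total_bits=16):
    """Quantize to signed fixed point with `frac_bits` fractional bits,
    saturating at the ends of the representable range.
    The format is an assumption made for illustration."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    quantized = max(lowest, min(highest, round(value * scale)))
    return quantized / scale


weight = 0.3
print(to_half(weight))        # nearest half-precision value
print(to_fixed(weight, 8))    # 8 fractional bits: coarser, wider range
print(to_fixed(weight, 12))   # 12 fractional bits: finer, narrower range
```

Moving bits between the integer and fractional parts trades range for resolution, so a value distribution with small magnitudes can use more fractional bits than one with large magnitudes.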

SLIDE 15

Thank You

  • Download the mobile app at taobao.com and try out Buy-by-Photo