SLIDE 1

Jianchao Yang Toutiao AI Lab in Silicon Valley

Joint work with Xiaojie Jin (NUS), Ning Xu (Snap), Yingzhen Yang

SLIDE 2
  • Quest for compact and efficient deep models
  • Memory usage
  • Computation cost
  • App size
SLIDE 3
  • Quest for compact and efficient deep models
  • Memory usage
  • Computation cost
  • App size
  • WSNet: Compact and efficient network design
  • Smaller model (e.g., up to 180x smaller on ESC50)
  • Faster computation (e.g., up to 18x faster on ESC50)
  • Accuracy comparable to the state of the art
SLIDE 4
  • Conventional convolution filters are initialized and trained separately.


SLIDE 5
  • Conventional convolution filters are initialized and trained separately.
  • Convolution filters are highly redundant.


SLIDE 6
  • Conventional convolution filters are initialized and trained separately.
  • Convolution filters are highly redundant.
  • Prior remedies: model quantization, model pruning, low rank, signal sparsity, …

SLIDE 7
  • Main idea: convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of learned separately.

SLIDE 8
  • Main idea: convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of learned separately.

F_k = f_k(Φ)

Φ: learnable compact parameter set
f_k: mapping function that generates the k-th convolution filter

SLIDE 9
  • Main idea: convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of learned separately.

F_k = f_k(Φ)

Φ: learnable compact parameter set
f_k: mapping function that generates the k-th convolution filter

We focus on weight sampling for the function f_k in this work: weight tying!

SLIDE 10
  • Model quantization (e.g., Han et al. 2015)
  • Weight tying as a result of weight quantization on a learnt model
  • HashedNet (Chen et al. 2015)
  • Random weight tying with hashing before model training
  • Epitome (Jojic et al. 2003)
  • A statistical model that ties pixel values in overlapping patches

Chen et al. Compressing convolutional neural networks with the hashing trick. ICML 2015.
Jojic et al. Epitomic analysis of appearance and shape. ICCV 2003.

SLIDE 11
  • Simplest case: 1D convolution with a single channel
  • Shift sampling
  • f: projection matrix
  • Φ: condensed parameter set

F_k = f_k(Φ)

SLIDE 16
  • Simplest case: 1D convolution with a single channel
  • Shift sampling
  • f: projection matrix
  • Φ: condensed parameter set

F_k = f_k(Φ): 7 condensed weights generate 5 1×3 filters (15 weights)
SLIDE 17
  • 1D convolution
  • Input feature map X ∈ ℝ^(T×C), where (T, C) denote the input length and the number of channels
  • Output feature map Y ∈ ℝ^(T×M), where M denotes the number of filters
  • Convolution kernel W ∈ ℝ^(L×C×M), where L denotes the filter length
  • Number of Multi-Adds: TLCM
SLIDE 18
  • Weight sampling overview

L*: length of Φ
C*: number of channels of Φ
L: filter length
C: number of channels per filter
M: number of filters
S: sampling stride
R: repeating factor

Compactness = LCM / (L*C*)

SLIDE 19
  • Weight shift sampling in the spatial dimension
  • Conventional CNN: M independent filters of size L, #params = ML
  • WSNet: condensed filter of size L*, #params = L + (M − 1)S

Compactness = ML / L* ≈ L / S

SLIDE 20
  • Repeated weight sampling in the channel dimension
  • Conventional CNN: each filter has C channels
  • WSNet: condensed filter with C* channels

Compactness = C / C*
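A minimal sketch of channel repeating (names and shapes are illustrative): tile the C* condensed channels until the filter reaches C channels.

```python
import numpy as np

def repeat_channels(phi, C):
    """Tile the C* condensed channels (last axis) of phi up to C channels."""
    C_star = phi.shape[-1]
    reps = -(-C // C_star)                 # ceil(C / C_star)
    return np.tile(phi, reps)[..., :C]

phi = np.random.randn(3, 4)                # filter length 3, C* = 4
f = repeat_channels(phi, C=16)             # compactness C / C* = 4
print(f.shape)                             # (3, 16)
```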

SLIDE 21
  • Example

Stride S = 1, channel repeating R = 4 times, filter length L = 16
Compactness = LCM / (L*C*) ≈ LR = 16 × 4 = 64 (with S = 1)

The same idea generalizes to fully connected layers!
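The compactness arithmetic above can be checked directly; M below is an assumed filter count, since the approximation ML/L* ≈ L/S only holds when M is large.

```python
def spatial_compactness(L, M, S):
    """Compactness from shift sampling in the spatial dimension."""
    L_star = L + (M - 1) * S               # length of the condensed filter
    return M * L / L_star                  # ≈ L / S for large M

def total_compactness(L, M, S, R):
    """Channel repeating multiplies the spatial compactness by R."""
    return spatial_compactness(L, M, S) * R

# Slide example: S = 1, R = 4, L = 16 -> compactness close to L * R = 64
print(total_compactness(L=16, M=1024, S=1, R=4))
```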

SLIDE 22
  • Sample more filters with a larger condensed filter (bigger L*) and a smaller sampling stride to increase capacity.
  • Increased computation?

[Figure: dense sampling with stride Ŝ vs. stride S]

SLIDE 23
  • Recap of conventional 1D convolution
  • Input feature map X ∈ ℝ^(T×C)
  • Output feature map Y ∈ ℝ^(T×M)
  • Convolution kernel W ∈ ℝ^(L×C×M)
  • Number of Multi-Adds: TLCM
SLIDE 24
  • Re-use the convolution results between overlapping inputs and filters

SLIDE 25
  • An efficient variant of the integral image method
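The reuse idea can be illustrated with a toy 1D implementation (illustrative code, not the paper's exact algorithm): every product x[i]·phi[p] is computed once, and each output becomes a difference of two prefix sums along a diagonal of the product matrix, in the spirit of an integral image.

```python
import numpy as np

def wsnet_conv1d_naive(x, phi, L, M, S=1):
    """Baseline: materialize each shift-sampled filter, then convolve."""
    T = len(x) - L + 1
    y = np.empty((T, M))
    for k in range(M):
        f = phi[k * S : k * S + L]
        for t in range(T):
            y[t, k] = x[t : t + L] @ f
    return y

def wsnet_conv1d_reuse(x, phi, L, M, S=1):
    """Reuse products shared by overlapping filters: each x[i]*phi[p] is
    formed once; outputs are differences of prefix sums along diagonals."""
    T = len(x) - L + 1
    O = np.outer(x, phi)                   # all pairwise products, computed once
    prefix = {}                            # diagonal offset -> prefix sums
    y = np.empty((T, M))
    for k in range(M):
        for t in range(T):
            d = k * S - t                  # entries (t+j, k*S+j) lie on diagonal d
            if d not in prefix:
                diag = np.diagonal(O, offset=d)
                prefix[d] = np.concatenate(([0.0], np.cumsum(diag)))
            s = min(t, k * S)              # where this filter's run starts on the diagonal
            y[t, k] = prefix[d][s + L] - prefix[d][s]
    return y

rng = np.random.default_rng(0)
x, phi = rng.standard_normal(20), rng.standard_normal(10)
a = wsnet_conv1d_naive(x, phi, L=4, M=7, S=1)
b = wsnet_conv1d_reuse(x, phi, L=4, M=7, S=1)
print(np.allclose(a, b))                   # True
```

Each diagonal is prefix-summed once and then shared across all (t, k) pairs that fall on it, which is where the savings over the naive loop come from.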
SLIDE 26
  • Acceleration in terms of Multi-Adds:
  • Example convolution layer
  • Conv kernel size (L, C, M) = (8, 64, 128)
  • Condensed kernel (L*, C*) = (135, 16)
  • Input feature map (T, C)
  • Computation acceleration of ~27× for this layer

Acceleration = CLM / (C*(3 + L* − 1) + L* + M) ≈ 26.7
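Plugging the slide's numbers into the acceleration ratio, reconstructed as CLM / (C*(3 + L* − 1) + L* + M) per output position:

```python
L, C, M = 8, 64, 128         # conventional kernel (L, C, M)
Ls, Cs = 135, 16             # condensed kernel (L*, C*)

conventional = C * L * M                       # Multi-Adds per output position
wsnet = Cs * (3 + Ls - 1) + Ls + M             # with the integral-image trick
print(conventional, wsnet, conventional / wsnet)
```

This gives 65536 / 2455 ≈ 26.7, matching the ~27× figure on the slide.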

SLIDE 27
  • Direct extension
  • Spatial sampling: shifting patch sampling from a 2D condensed filter
  • Channel sampling: repeat sampling in the channel dimension
SLIDE 30
  • Direct extension
  • Spatial sampling: shifting patch sampling from a 2D condensed filter
  • Channel sampling: repeat sampling in the channel dimension
  • Compactness
  • Conventional filters: (w, h, C, M)
  • Condensed filter: (W, H, C*)
  • Sampling strides: (S_w, S_h)
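The direct 2D extension of shift sampling, sketched with illustrative names: slide a w×h window over a 2D condensed parameter grid with strides (Sw, Sh), each window position yielding one filter.

```python
import numpy as np

def sample_2d_filters(phi, w, h, Sw=1, Sh=1):
    """Shift-sample all w x h filter patches from the 2D condensed filter phi."""
    W, H = phi.shape
    return np.stack([phi[i * Sw : i * Sw + w, j * Sh : j * Sh + h]
                     for i in range((W - w) // Sw + 1)
                     for j in range((H - h) // Sh + 1)])

phi = np.random.randn(10, 10)              # condensed 2D parameter grid
F = sample_2d_filters(phi, w=3, h=3)       # 8 x 8 = 64 filters of size 3x3
print(F.shape)                             # (64, 3, 3)
```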
SLIDE 31
  • Tensor decomposition extension
  • Decompose 3D weight tensors into three 1D vectors (Jin et al. 2015)
  • Apply WSNet to each 1D vector as in the 1D CNN.

[Figure: 3D convolution vs. 1D convolutions over three directions]

Jin et al. Flattened neural networks for feedforward acceleration. ICLR 2015.
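The decomposition can be checked on a toy example: a rank-1 3D kernel is the outer product of three 1D vectors, so the full 3D contraction at one output position equals three successive 1D contractions (illustrative NumPy, not the paper's code).

```python
import numpy as np

rng = np.random.default_rng(1)
kx, ky, kc = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(4)
K = np.einsum('i,j,k->ijk', kx, ky, kc)    # rank-1 3D kernel from three 1D vectors

patch = rng.standard_normal((3, 3, 4))     # one receptive field of the input
full = np.tensordot(patch, K, axes=3)      # direct 3D contraction at this position
seq = np.einsum('ijk,k->ij', patch, kc)    # 1D contraction over channels
seq = np.einsum('ij,j->i', seq, ky)        # then over height
seq = seq @ kx                             # then over width
print(np.isclose(full, seq))               # True
```

WSNet then applies weight sampling to each of the three 1D vectors separately.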

SLIDE 32
  • The channel dimension dominates model size and computation.
  • Channel reordering reduces computation cost.

SLIDE 33
  • Tasks and datasets
  • WSNet-1D: Audio classification
  • ESC-50
  • UrbanSound8K
  • DCASE
  • WSNet-2D: Image classification
  • CIFAR 10
  • MNIST
  • ImageNet
SLIDE 34
  • Notation settings of WSNet
  • A WSNet model is named in the form SsCcAaQq
  • Ss denotes compactness s in the spatial dimension
  • Cc denotes channel repeating c times
  • Aa denotes the ratio a of filters between WSNet and the baseline through dense sampling
  • Qq denotes the compression ratio q from weight quantization, when used.
SLIDE 35
  • Baseline network for ESC-50, UrbanSound8K, DCASE
  • Network adopted from SoundNet (Aytar et al. 2016) for fair comparison

Aytar et al. SoundNet: Learning sound representations from unlabeled video. NIPS 2016.

SLIDE 36
  • ESC-50: A collection of 2000 short environmental recordings comprising 50 equally balanced classes of sound events (e.g., animals, water sounds, urban noises, human non-speech sounds).

SLIDE 37
SLIDE 38
  • UrbanSound8K: A collection of 8732 short recordings of various urban sound sources (air conditioner, car horn, children playing, etc.).

SLIDE 39
  • DCASE: Detection and Classification of Acoustic Scenes and Events Challenge. It contains 10 acoustic scene categories, with ten 30-second recordings per category for training.

SLIDE 40
  • Direct 2D extension on CIFAR 10 and MNIST
  • Same baseline network as HashedNet (Chen et al. 2015)

Chen et al. Compressing convolutional neural networks with the hashing trick. ICML 2015.

SLIDE 41
  • Tensor decomposition extension on ImageNet
  • Single-view test
  • Baseline network is Res34

Model      #Params   #Multi-Adds   Top-1
Res18      11.2M     1800M         70.6
Res34      21.3M     3600M         73.1
MobileNet  4.2M      575M          70.6
WSNet      2.7M      540M          70.4

SLIDE 42
  • WSNet provides a novel design scheme for convolutional neural networks to learn compact and efficient models.
  • It achieves accuracy comparable to the state of the art, with far fewer parameters and much lower computation cost.
  • Future work: explore more filter generation methods, e.g., learning a generative statistical model or a low-dimensional basis.

SLIDE 43

We are hiring research scientists, software engineers, and interns in

  • Areas
  • Computer vision
  • Computer graphics
  • Machine learning
  • Natural language processing
  • Knowledge discovery and data mining
  • Speech and audio processing
  • Recommender systems
  • Sites
  • Beijing
  • Silicon Valley (USA)
  • Seattle (USA)

Send resumes to lab-hr@bytedance.com for Beijing positions and rdus.staffing@bytedance.com for Silicon Valley positions.

SLIDE 44

Thank You!

Reference: https://arxiv.org/abs/1711.10067?context=cs