SLIDE 1

Random Forests vs. Deep Learning

Christian Wolf, Université de Lyon, INSA-Lyon, LIRIS UMR CNRS 5205. November 26th, 2015

SLIDE 2

RF vs. DL

[Figure: a random forest, trees 1 … T, each mapping an input (I, x) to a leaf distribution P_t(c | I, x)]

[Figure: a multi-path deep convolutional network (paths V, M, A; conv layers ConvC/ConvD/ConvA, max pooling, shared hidden layers HLS) acting as a learned feature extractor]

Goal: prediction (classification, regression)

SLIDE 3

Deep networks

  • Many layers, many parameters … and all of them are used at test time, for each single sample!
  • Feature learning is integrated into classification
  • End-to-end training, using the gradient of the loss function

[Figure: the same multi-path deep network, with per-layer details in the table below]

| Layer | Filter size / n.o. units | N.o. parameters | Pooling |
|---|---|---|---|
| Paths V1, V2 | | | |
| Input D1, D2 | 72×72×5 | | 2×2×1 |
| ConvD1 | 25×5×5×3 | 1 900 | 2×2×3 |
| ConvD2 | 25×5×5 | 650 | 1×1 |
| Input C1, C2 | 72×72×5 | | 2×2×1 |
| ConvC1 | 25×5×5×3 | 1 900 | 2×2×3 |
| ConvC2 | 25×5×5 | 650 | 1×1 |
| HLV1 | 900 | 3 240 900 | |
| HLV2 | 450 | 405 450 | |
| Path M | | | |
| Input M | 183 | | |
| HLM1 | 700 | 128 800 | |
| HLM2 | 700 | 490 700 | |
| HLM3 | 350 | 245 350 | |
| Path A | | | |
| Input A | 40×9 | | 1×1 |
| ConvA1 | 25×5×5 | 650 | 1×1 |
| HLA1 | 700 | 3 150 000 | |
| HLA2 | 350 | 245 350 | |
| Shared layers | | | |
| HLS1 | 1600 | 3 681 600 | |
| HLS2 | 84 | 134 484 | |
| Output layer | 21 | 1 785 | |

  • 12.4M parameters per scale × 3 scales = 37.2M parameters total!
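As a sanity check on the table, a fully-connected layer with n_in inputs and n_out units has (n_in + 1) · n_out parameters (weights plus biases). A minimal sketch, using two rows of the table:

```python
# A fully-connected layer with n_in inputs and n_out units stores
# (n_in + 1) * n_out parameters (weights + biases).
def dense_params(n_in: int, n_out: int) -> int:
    return (n_in + 1) * n_out

# HLS2 (84 units) sits on top of HLS1 (1600 units):
assert dense_params(1600, 84) == 134_484  # matches the table
# The 21-unit output layer sits on top of HLS2 (84 units):
assert dense_params(84, 21) == 1_785      # matches the table
```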
SLIDE 4

Random Forests

  • Many levels, many parameters … but only log2(N) of them are used at test time! (see the sketch below)
  • Training is done layer-wise, not end-to-end: no gradient on the objective function
  • No/limited feature learning

[Figure: a random forest, trees 1 … T mapping (I, x) to P_t(c | I, x)]
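To make the test-time complexity concrete, here is a minimal sketch (a toy complete binary tree with random splits, not one of the models above): the tree stores 2^20 − 1 split parameters, but a single prediction evaluates only the 20 splits on its root-to-leaf path.

```python
import numpy as np

DEPTH = 20
N_SPLITS = 2**DEPTH - 1                    # ~1M split nodes stored in the tree
rng = np.random.default_rng(0)
feature = rng.integers(0, 64, N_SPLITS)    # feature index tested at each split
thresh = rng.random(N_SPLITS)              # threshold at each split

def predict_path(x):
    """Route x from the root; only DEPTH of the ~1M parameters are read."""
    node, visited = 0, 0
    while node < N_SPLITS:                 # internal nodes come first
        visited += 1
        go_right = x[feature[node]] >= thresh[node]
        node = 2 * node + (2 if go_right else 1)   # heap-style child indices
    return node - N_SPLITS, visited        # leaf index, n.o. splits evaluated

leaf, visited = predict_path(rng.random(64))
assert visited == DEPTH                    # 20 comparisons, not 2**20 - 1
```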

SLIDE 5

RF vs. DL: applications (1)

[Shotton et al., CVPR 2011] (Microsoft Research) [Neverova, Wolf, Nebout, Taylor, under review, arXiv 2015]

Full-body pose with a random forest: 3 trees, depth 20, >10M parameters [Shotton et al.]
Hand pose with a deep network: semi-/weakly-supervised training, 8 layers, ~5M parameters [Neverova et al.]

SLIDE 6

RF vs. DL: applications (2)

[Kontschieder et al., CVPR 2014] (Microsoft Research) [Fourure, Emonet, Fromont, Muselet, Tremeau, Wolf, under review]

Scene parsing with structured random forests [Kontschieder et al.]
Scene parsing with deep networks: 5 layers, ~2M parameters [Fourure et al.]

SLIDE 7

Types of random forests

  • Classical random forests
  • Structured random forests
  • Neural random forests
  • Deep convolutional random forests

[Figure: one thumbnail per variant: a classical tree routing (I, x) to P_t(c | I, x), a soft tree with decision nodes d_n and leaf distributions π_n, and a neural split function f(0) : X → R3]

SLIDE 8

Example for classical RF

Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake (Microsoft Research Cambridge & Xbox Incubation) [Shotton et al., CVPR 2011] (Best Paper!)

SLIDE 9

Depth images → 3D joint locations

depth image → body parts → 3D joint proposals

[Shotton et al., CVPR 2011]

SLIDE 10

synthetic (train & test)

real (test)

31 body parts (labels)

[Shotton et al., CVPR 2011]

SLIDE 11

Classification with random forests

[Figure: trees 1 … T, each routing the input (I, x) to a leaf distribution P_t(c | I, x)]

Each split node thresholds one of the features. Each leaf node contains a class distribution P_t(c | I, x). Class distributions are averaged over the trees:

P(c | I, x) = (1/T) · Σ_{t=1}^{T} P_t(c | I, x)

[Shotton et al., CVPR 2011]
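A minimal sketch of this averaging step, with hypothetical leaf posteriors from T = 3 trees over C = 4 body-part classes:

```python
import numpy as np

# Hypothetical leaf distributions P_t(c | I, x) for one pixel (I, x),
# one row per tree; each row sums to 1.
per_tree = np.array([[0.7, 0.1, 0.1, 0.1],
                     [0.5, 0.3, 0.1, 0.1],
                     [0.6, 0.2, 0.1, 0.1]])

# P(c | I, x) = (1/T) * sum_t P_t(c | I, x)
forest_posterior = per_tree.mean(axis=0)
assert np.isclose(forest_posterior.sum(), 1.0)
print(forest_posterior.argmax())  # predicted body-part label for this pixel
```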

SLIDE 12

Learning & Entropy

A good split function minimizes entropy in the label distributions.

Example: Q = (0.2, 0.4, 0.4) splits into Ql(θ) = (0.33, 0.66, 0.01) and Qr(θ) = (0.33, 0.01, 0.66).
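A minimal sketch of the entropy computation for these distributions; weighting the two children 50/50 is an assumption (in general they are weighted by |Ql|/|Q| and |Qr|/|Q|):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

Q  = [0.2, 0.4, 0.4]     # parent label distribution
Ql = [0.33, 0.66, 0.01]  # left child after the split
Qr = [0.33, 0.01, 0.66]  # right child after the split

print(entropy(Q))                             # ~1.52 bits
print(0.5 * entropy(Ql) + 0.5 * entropy(Qr))  # ~0.99 bits: the split helps
```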

SLIDE 13

Random forests: learning algorithm

  • 1. Randomly propose a set of splitting candidates φ = (θ, τ) (feature parameters θ and thresholds τ).
  • 2. Partition the set of examples Q = {(I, x)} into left and right subsets by each φ:
    Ql(φ) = { (I, x) | f_θ(I, x) < τ }    (3)
    Qr(φ) = Q \ Ql(φ)    (4)
  • 3. Compute the φ giving the largest gain in information:
    φ* = argmax_φ G(φ)    (5)
    G(φ) = H(Q) − Σ_{s∈{l,r}} (|Qs(φ)| / |Q|) · H(Qs(φ))    (6)
    where the Shannon entropy H(Q) is computed on the normalized histogram of body-part labels l_I(x) for all (I, x) ∈ Q.
  • 4. If the largest gain G(φ*) is sufficient, and the depth in the tree is below a maximum, then recurse for the left and right subsets Ql(φ*) and Qr(φ*).


Training:

  • 3 trees
  • depth 20
  • 1,000,000 images
  • 2000 candidate features
  • 50 thresholds per feature

1 day on a 1000-core cluster

[Shotton et al., CVPR 2011]
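A minimal sketch of steps 1-3 for one node, assuming scalar feature responses f_θ(I, x) and integer labels (toy data, not the paper's depth features):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(labels, n_classes):
    p = np.bincount(labels, minlength=n_classes) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(responses, labels, n_classes, n_thresholds=50):
    """responses[i] = f_theta(I_i, x_i) for one candidate feature theta."""
    H_Q = entropy(labels, n_classes)
    best_gain, best_tau = -np.inf, None
    for tau in rng.choice(responses, size=n_thresholds, replace=False):
        left = responses < tau                        # Q_l(phi), Eq. (3)
        if left.all() or not left.any():
            continue
        H_split = (left.mean() * entropy(labels[left], n_classes)
                   + (1 - left.mean()) * entropy(labels[~left], n_classes))
        gain = H_Q - H_split                          # G(phi), Eq. (6)
        if gain > best_gain:
            best_gain, best_tau = gain, tau           # phi*, Eq. (5)
    return best_gain, best_tau

responses = rng.random(1000)
labels = ((responses > 0.6) ^ (rng.random(1000) < 0.1)).astype(int)  # noisy
print(best_split(responses, labels, n_classes=2))     # tau should be near 0.6
```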

SLIDE 14

Examples

Figure 5. Example inferences. Synthetic (top row); real (middle); failure modes (bottom)

[Shotton et al., CVPR 2011]

SLIDE 15

Dependence of results on hyper-parameters

[Shotton et al., CVPR 2011]

SLIDE 16

Types of random forests

  • Classical random forests
  • Structured random forests
  • Neural random forests
  • Deep convolutional random forests

[Figure: thumbnails of the four variants, as on SLIDE 7]

SLIDE 17

Structured random forests

In the classical version, the decision (= leaf) nodes contain predictions for a single pixel (a label or a posterior distribution). In the structured version, a decision node is assigned a rectangular patch of predictions.

[Figure 1: training data example with a label patch p around pixel x and offsets (u, v), as used in the proposed approach]

[Kontschieder et al., ICCV 2011]

SLIDE 18

Structured version : integration

Integration over multiple pixels by vote:

[Kontschieder et al., ICCV 2011]
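A minimal sketch of this integration, under simplifying assumptions (a single tree, a P×P label patch predicted at every pixel, majority vote per output pixel):

```python
import numpy as np

H, W, P, C = 64, 64, 11, 8      # image size, patch side, number of labels
r = P // 2
rng = np.random.default_rng(0)

# Hypothetical structured predictions: the leaf reached by pixel (u, v)
# predicts a P x P block of labels centered on (u, v).
patch_pred = rng.integers(0, C, size=(H, W, P, P))

votes = np.zeros((H, W, C), dtype=int)
for u in range(H):
    for v in range(W):
        for du in range(-r, r + 1):          # scatter the patch as votes
            for dv in range(-r, r + 1):
                uu, vv = u + du, v + dv
                if 0 <= uu < H and 0 <= vv < W:
                    votes[uu, vv, patch_pred[u, v, du + r, dv + r]] += 1

labeling = votes.argmax(axis=-1)             # majority label per pixel
```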

SLIDE 19

Types of random forests

  • Classical random forests
  • Structured random forests
  • Neural random forests
  • Deep convolutional random forests

[Figure: thumbnails of the four variants, as on SLIDE 7]

SLIDE 20

[Rota Bulo and Kontschieder, CVPR 2014]

Neural Decision Forests for Semantic Image Labelling

Samuel Rota Bulò
Fondazione Bruno Kessler, Trento, Italy
rotabulo@fbk.eu

Peter Kontschieder
Microsoft Research, Cambridge, UK
pekontsc@microsoft.com

SLIDE 21

Neural split functions

Classical random forest with neural split functions

[Figure: a small MLP used as split function; the "+1" units are biases]

f(0) : X → R3 (feature extraction)
f(1) : R3 → R4, with W(1) ∈ R4×4 (3 inputs + bias)
f(2) : R4 → R, with W(2) ∈ R5×1 (4 inputs + bias), producing the split response f(x)

[Rota Bulo and Kontschieder, CVPR 2014]
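A minimal sketch of one such split node with the shapes from the figure; the tanh activation and the zero decision threshold are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 4))   # (3 features + bias) -> 4 hidden units
W2 = rng.standard_normal((5, 1))   # (4 hidden + bias)   -> 1 split response

def neural_split(phi):
    """phi = f(0)(I, x) in R^3; returns True to route the sample left."""
    h = np.tanh(np.append(phi, 1.0) @ W1)     # f(1): R^3 -> R^4
    f = (np.append(h, 1.0) @ W2).item()       # f(2): R^4 -> R
    return f < 0.0

print(neural_split(rng.standard_normal(3)))
```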

SLIDE 22

Learning the neural split function (1)

Probabilistic loss function:

Q(Θ) = max_π P[y | X, π, Θ],  P[y | X, π, Θ] = Π_{s=1}^{n} P[y_s | x_s, π, Θ]

(samples are independent), where π = (π(L), π(R)) are the latent label distributions routed to the left and right child nodes, y are the labels, x_s the node's input samples, and Θ the network parameters.

P[y_s | x_s, π, Θ] = Σ_{d∈{L,R}} P[y_s | ψ_s = d, π] · P[ψ_s = d | x_s, Θ] = Σ_{d∈{L,R}} π(d)_{y_s} · f_d(x_s | Θ)

[Rota Bulo and Kontschieder, CVPR 2014]

SLIDE 23

Learning the neural split function (2)

The learning procedure alternates between two steps of optimizing Q(Θ) = max_π P[y | X, π, Θ]:

1. Update the child distributions π
2. Update the network parameters Θ (backprop)

[Rota Bulo and Kontschieder, CVPR 2014]
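A minimal sketch of this alternation for a single split node, under toy assumptions: a linear scoring function with f_L(x | Θ) = σ(w·x) and f_R = 1 − f_L, π updated in closed form from the soft routing, and Θ = w updated by gradient ascent on the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, C = 500, 5, 3
X = rng.standard_normal((n, dim))
y = np.digitize(X[:, 0], [-0.5, 0.5])     # 3 classes tied to the 1st feature

w = rng.standard_normal(dim) * 0.1        # split-node parameters (Theta)
pi = np.full((2, C), 1.0 / C)             # pi[0]: left leaf, pi[1]: right leaf

for _ in range(200):
    dL = 1.0 / (1.0 + np.exp(-X @ w))     # P[sample routed left | x, Theta]
    f = np.stack([dL, 1.0 - dL])          # f_d(x | Theta), shape (2, n)
    # Step 1: leaf label distributions from soft routing counts.
    for d in range(2):
        counts = np.bincount(y, weights=f[d], minlength=C)
        pi[d] = counts / counts.sum()
    # Step 2: gradient-ascent step on sum_s log P[y_s | x_s, pi, Theta].
    lik = pi[0, y] * f[0] + pi[1, y] * f[1]          # per-sample likelihood
    dlik_ddL = pi[0, y] - pi[1, y]                   # d lik / d dL
    w += 0.05 * ((dlik_ddL / lik * dL * (1 - dL)) @ X) / n

print("mean log-likelihood:", np.log(lik).mean())    # should have increased
```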

SLIDE 24

Results on semantic labelling

| Method | ETRIMS8 Global | ETRIMS8 Class-Avg | ETRIMS8 Jaccard | CAMVID Global | CAMVID Class-Avg | CAMVID Jaccard |
|---|---|---|---|---|---|---|
| RF Baseline | 64.5 ±1.6 | 59.6 ±1.7 | 40.3 ±1.1 | 64.0 | 41.6 | 27.2 |
| NDF_P | 69.8 ±1.8 | 64.3 ±2.2 | 45.0 ±1.9 | 67.4 | 46.5 | 30.8 |
| NDF_MLP | 68.9 ±2.0 | 62.4 ±2.3 | 44.2 ±2.1 | 67.1 | 44.4 | 30.1 |
| NDF_MLPC | 69.7 ±1.7 | 62.5 ±2.1 | 44.7 ±1.9 | 67.4 | 44.2 | 30.2 |
| NDF_MLPC-ℓ1 | 71.7 ±2.0 (+7.2) | 65.3 ±2.3 (+5.7) | 46.9 ±2.0 (+6.6) | 69.0 (+5.0) | 46.8 (+5.2) | 31.7 (+4.5) |
| RF Baseline | 72.2 ±1.9 | 68.0 ±0.8 | 47.5 ±1.0 | 68.5 | 50.3 | 32.4 |
| NDF_MLPC-ℓ1 | 80.8 ±0.7 (+8.6) | 74.6 ±0.7 (+6.6) | 56.9 ±1.2 (+9.4) | 82.1 (+13.6) | 56.1 (+5.8) | 43.3 (+10.9) |

Literature baselines, as reported on the slide: Best RF in [13]: 76.1 / 72.3; Best in [14]: 75.1 / 72.4; Best RF in [19]: 38.3; Best RF in [20]: 72.5 / 51.4 / 36.4; Best in [8]: 69.1 / 53.0; Best in [35]: 73.7 / 36.3 / 29.6.

Figure 1. Example input RGB image and learned representations of the rMLP taken from a hidden layer, visualized using heat-maps.

[Rota Bulo and Kontschieder, CVPR 2014]

SLIDE 25

Types of random forests

  • Classical random forests
  • Structured random forests
  • Neural random forests
  • Deep convolutional random forests

[Figure: thumbnails of the four variants, as on SLIDE 7]

SLIDE 26

Deep Neural Decision Forests

Peter Kontschieder 1, Madalina Fiterau ∗,2, Antonio Criminisi 1, Samuel Rota Bulò 1,3
Microsoft Research 1 (Cambridge, UK), Carnegie Mellon University 2 (Pittsburgh, PA), Fondazione Bruno Kessler 3 (Trento, Italy)

[Kontschieder et al., ICCV 2015]

One model to rule them all …

SLIDE 27

Goals

  • Combine neural networks and random forests
  • Advantage of NN: representation learning
  • Advantage of RF: divide and conquer
  • Differentiable loss function, allowing gradient backprop
  • "Backpropagation trees"

[Kontschieder et al., ICCV 2015]

SLIDE 28

Notation

Each prediction node ℓ ∈ L holds a probability distribution π_ℓ over Y.
Each decision node n ∈ N is a decision function d_n(·; Θ) : X → [0, 1], parametrized by Θ, which is responsible for routing samples x ∈ X through the tree.

[Kontschieder et al., ICCV 2015]

SLIDE 29

Stochastic decision functions

Decisions in split nodes are Bernoulli random variables!

The decision for sample x at node n is a Bernoulli random variable with mean d_n(x; Θ). When a sample ends in a leaf node, the tree prediction is given by the leaf distribution. Final prediction for sample x:

P_T[y | x, Θ, π] = Σ_{ℓ∈L} π_{ℓy} · μ_ℓ(x | Θ)

where π = (π_ℓ)_{ℓ∈L} and π_{ℓy} denotes the probability of a sample reaching leaf ℓ to take on class y, while μ_ℓ(x | Θ) is regarded as the routing function providing the probability that sample x will reach leaf ℓ. Clearly, Σ_ℓ μ_ℓ(x | Θ) = 1 for all x ∈ X.

[Kontschieder et al., ICCV 2015]

SLIDE 30

Stochastic decision functions

The routing function can be calculated with a single pass through the tree.

Figure 1. Each node n ∈ N of the tree performs routing decisions via a function d_n(·) (we omit the parametrization Θ). The black path shows an exemplary routing of a sample x along the tree to reach leaf ℓ4, which has probability μ_ℓ4 = d1(x) · d̄2(x) · d̄5(x), with d̄(x) = 1 − d(x).

[Kontschieder et al., ICCV 2015]

SLIDE 31

Illustration

[Figure: an FC layer of a deep CNN with parameters Θ provides the functions f_n; each output unit drives one decision node d_n of the trees]

d_n(x; Θ) = σ(f_n(x; Θ))    (3)

where σ(x) = (1 + e^{−x})^{−1} is the sigmoid function, and f_n(·; Θ) : X → R is a real-valued function depending on the sample and the parametrization Θ.

[Kontschieder et al., ICCV 2015]
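A minimal sketch tying the last three slides together for a depth-2 tree; the values of f_n are hypothetical stand-ins for the FC-layer outputs of the CNN:

```python
import numpy as np

def sigma(t):                      # d_n(x; Theta) = sigma(f_n(x; Theta))
    return 1.0 / (1.0 + np.exp(-t))

f = {1: 0.8, 2: -1.2, 3: 0.3}      # hypothetical f_n(x; Theta) for d1, d2, d3
d = {n: sigma(v) for n, v in f.items()}

# Routing probabilities mu_l: product of d (go left) and 1 - d (go right)
# along each root-to-leaf path of the depth-2 tree.
mu = np.array([
    d[1] * d[2],                   # leaf 1: left, left
    d[1] * (1 - d[2]),             # leaf 2: left, right
    (1 - d[1]) * d[3],             # leaf 3: right, left
    (1 - d[1]) * (1 - d[3]),       # leaf 4: right, right
])
assert np.isclose(mu.sum(), 1.0)   # sum_l mu_l(x | Theta) = 1

pi = np.array([[0.8, 0.1, 0.1],    # leaf distributions pi_l over 3 classes
               [0.2, 0.7, 0.1],
               [0.1, 0.2, 0.7],
               [0.3, 0.3, 0.4]])

# P_T[y | x, Theta, pi] = sum_l pi_{l y} * mu_l(x | Theta)
P = mu @ pi
assert np.isclose(P.sum(), 1.0)
```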

SLIDE 32

The deep network used: GoogleNet

[Kontschieder et al., ICCV 2015]

SLIDE 33

Results on ImageNet

| Model | # Models | # Crops | Top5-Error |
|---|---|---|---|
| GoogLeNet [36] | 1 | 1 | 10.07% |
| GoogLeNet [36] | 1 | 10 | 9.15% |
| GoogLeNet [36] | 1 | 144 | 7.89% |
| GoogLeNet [36] | 7 | 1 | 8.09% |
| GoogLeNet [36] | 7 | 10 | 7.62% |
| GoogLeNet [36] | 7 | 144 | 6.67% |
| GoogLeNet* | 1 | 1 | 10.02% |
| dNDF.NET | 1 | 1 | 7.84% |
| dNDF.NET | 1 | 10 | 7.08% |
| dNDF.NET | 7 | 1 | 6.38% |

Table 2. Top5-Errors obtained on ImageNet validation data, comparing our dNDF.NET to GoogLeNet(*).

[Kontschieder et al., ICCV 2015]

SLIDE 34

Leaf entropy during Training

[Plot: average leaf entropy (bits) over 1000 training epochs]

[Kontschieder et al., ICCV 2015]
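The plotted quantity can be read as the mean Shannon entropy of the leaf distributions π_ℓ. A minimal sketch of the measurement itself, on hypothetical leaf distributions:

```python
import numpy as np

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical leaf distributions over 1000 ImageNet classes: a uniform
# leaf would give log2(1000) ~ 10 bits, a confident leaf close to 0 bits.
rng = np.random.default_rng(0)
leaves = rng.dirichlet(np.full(1000, 0.05), size=64)   # fairly peaked leaves
print(np.mean([entropy_bits(p) for p in leaves]))      # average leaf entropy
```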

SLIDE 35

Results on ImageNet

[Plots: ImageNet Top5-Error (%) over training epochs for dNDF0, dNDF1, dNDF2 (validation), dNDF.NET (validation), and dNDF.NET (training)]

Figure 5. Top5-Error plots for the individual dNDFx used in dNDF.NET as well as their joint ensemble errors. Left: plot over all 1000 training epochs. Right: zoomed version of the left plot, showing Top5-Errors from 0-12% between training epochs 500-1000.

[Kontschieder et al., ICCV 2015]

SLIDE 36

Conclusion

  • Deep networks are still the number-one model in terms of prediction performance
  • Random forests still have excellent computational complexity during testing
  • Combining both families is not an easy compromise
  • Possibility of having classical "crisp" trees under a convolutional space displacement layer: speed!!
