Large-scale Product Categorization with Deep Models in Rakuten


SLIDE 1

Large-scale Product Categorization with Deep Models in Rakuten

May 8, 2017
Ali Cevahir / Denis Miller
Rakuten Institute of Technology / Rakuten, Inc.
https://rit.rakuten.co.jp / https://global.rakuten.com

SLIDE 2

About Rakuten

https://global.rakuten.com/corp/about/strength/data.html

SLIDE 3

Rakuten Group Services


E-Commerce, FinTech, Digital Content, Travel & Reservation, Pro Sports, Others

https://global.rakuten.com/corp/about/business/internet.html
SLIDE 4

Online Marketplace

  • Over 230,000,000 items in 30,000+ categories

[Diagram: Rakuten Ichiba connects Merchants and Shoppers through Branding, Marketing, and EC Consulting]

SLIDE 5

Problem and Solution

SLIDE 6

Introduction

  • Problem: Given product information, automatically classify it into its correct category


Example: MACPHEE(マカフィー) 切り替えVネックニット ("MACPHEE paneled V-neck knit")
→ Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck

SLIDE 7

Proposed Solutions

  • 2 different models
    – Deep Belief Nets
    – Deep Autoencoders + kNN

  • 2 different data sources
    – Titles
    – Descriptions

  • Overall results aggregated
  • GPU Implementation
SLIDE 8

Proposed Solutions

  • 2-step classification
    – First classify into Level-1 categories
    – Then classify down to leaf categories
  • 81% match with merchant-assigned categories (‘others’ excluded)
    – Merchants are not always correct


Example: MACPHEE(マカフィー) 切り替えVネックニット
→ Ladies Fashion > Tops > Knit Sweaters > Long Sleeves > V Neck

SLIDE 9

CUDeep: A CUDA-based Deep Learning Framework

  • In-house command-line tool for training DBNs and DAEs
  • Written in CUDA, using cuBLAS and cuSPARSE

SLIDE 10

CUDeep: A CUDA-based Deep Learning Framework

  • Deep Belief Nets vs. Deep Autoencoders (sketched below)

[Diagram: A supervised DBN maps input features X (~1 million dimensions, billions of connections) to class probabilities Y; a deep autoencoder reconstructs X as X' and exposes a low-dimensional semantic hash at its bottleneck]
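As a rough illustration of the two architectures, here is a minimal NumPy sketch (not the CUDeep implementation; all layer sizes are toy values, and the real input is ~1M-dimensional):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_code, n_classes = 1000, 128, 64, 35
x = (rng.random(n_in) < 0.01).astype(np.float32)  # sparse 0-1 word vector

# Supervised path (DBN fine-tuned as a classifier): X -> hidden -> class probabilities Y'
W1 = rng.normal(0, 0.01, (n_in, n_hid))
W2 = rng.normal(0, 0.01, (n_hid, n_classes))
h = sigmoid(x @ W1)
logits = h @ W2
y_prob = np.exp(logits - logits.max())
y_prob /= y_prob.sum()                    # probabilities over the Level-1 genres

# Unsupervised path (deep autoencoder): X -> code -> reconstruction X'
We = rng.normal(0, 0.01, (n_in, n_code))
Wd = rng.normal(0, 0.01, (n_code, n_in))
code = sigmoid(x @ We)                    # bottleneck activations
semantic_hash = (code > 0.5).astype(int)  # binarized code, later searched with kNN
x_recon = sigmoid(code @ Wd)              # reconstruction X'
```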

SLIDE 11

CUDeep: A CUDA-based Deep Learning Framework

  • Selective Reconstruction (Dauphin et al., 2011); see the sketch below
  • Applied for both
    – Layer-wise training
    – Backpropagation
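A minimal NumPy sketch of the idea, in the spirit of reconstruction sampling (Dauphin et al., 2011): compute the reconstruction loss only over the nonzero inputs plus a reweighted random sample of the zeros, so the cost scales with the sparsity of the input rather than its ~1M dimension. The function names and the 5x zero-sampling ratio are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_indices(x, zero_sample_ratio=5, rng=rng):
    """Pick units to reconstruct: all nonzeros plus a random sample of zeros."""
    nz = np.flatnonzero(x)
    zeros = np.flatnonzero(x == 0)
    n_sampled = min(len(zeros), zero_sample_ratio * max(len(nz), 1))
    sampled = rng.choice(zeros, size=n_sampled, replace=False)
    idx = np.concatenate([nz, sampled])
    # Importance weights: the sampled zero units stand in for all zero units.
    w = np.ones(len(idx))
    if n_sampled:
        w[len(nz):] = len(zeros) / n_sampled
    return idx, w

x = (rng.random(10_000) < 0.001).astype(np.float32)  # sparse 0-1 input
idx, w = selective_indices(x)
x_recon = rng.random(10_000)  # stand-in for the decoder's output
# Weighted squared reconstruction error over the selected units only:
loss = np.sum(w * (x_recon[idx] - x[idx]) ** 2) / x.size
```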

SLIDE 12

CUDeep: Some Design Decisions


W[vis, hid1]: 1M visible × 1000 hidden weights ≈ 4 GB

  • Keep neural net weights on the GPU (see the sketch below)
    – Faster: no need to communicate weights between CPU and GPU
    – Alternative: store weights in main memory and copy the weights to be updated to the GPU for each minibatch
  • Sparse input feature vectors are stored in main memory
    – Limited device memory
    – Disk streaming possible, but slower
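A sketch of this resident-weights pattern using CuPy as a stand-in (CUDeep itself is CUDA code built on cuBLAS and cuSPARSE; sizes here are scaled down from the real 1M × 1000 ≈ 4 GB first-layer matrix):

```python
import numpy as np
import scipy.sparse as sp
import cupy as cp                     # assumed available; any GPU array library works similarly
import cupyx.scipy.sparse as cpsp

# Weights stay resident on the GPU for the whole training run.
n_in, n_hid = 100_000, 256
W = cp.random.standard_normal((n_in, n_hid), dtype=cp.float32) * 0.01

# Sparse 0-1 input vectors stay in host (CPU) memory.
X_host = sp.random(50_000, n_in, density=1e-4, format="csr", dtype=np.float32)

for start in range(0, X_host.shape[0], 128):
    mb = X_host[start:start + 128]
    mb_gpu = cpsp.csr_matrix(mb)      # copy only this minibatch to the device
    h = cp.tanh(mb_gpu @ W)           # forward pass runs entirely on the GPU
    # ...gradients and the in-place update of W would go here...
    break                             # one step shown for brevity
```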

SLIDE 13

CUDeep: Some Design Decisions


During layer-wise pre-training:

  • Do not store intermediate outputs of hidden layers
  • Do feedforward computations instead
  • Intermediate outputs are dense
    – Not practical to store (see the arithmetic below)

  200 million sparse inputs (10 nonzeros/feature): 8 GB
  2000-d dense hidden outputs: 1.6 TB
  1000-d dense hidden outputs: 800 GB
  64-d dense codes: 51.2 GB
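The arithmetic behind those sizes, assuming 4-byte floats:

```python
n_items, bytes_per_float = 200_000_000, 4
print(n_items * 10 * bytes_per_float / 1e9)     # sparse inputs, 10 nonzeros each -> 8.0 GB
print(n_items * 2000 * bytes_per_float / 1e12)  # dense 2000-d layer outputs -> 1.6 TB
print(n_items * 1000 * bytes_per_float / 1e9)   # dense 1000-d layer outputs -> 800.0 GB
print(n_items * 64 * bytes_per_float / 1e9)     # dense 64-d codes -> 51.2 GB
```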

SLIDE 14

CUDA-kNN

  • Vector search engine
SLIDE 15

CUDA-kNN

  • Preprocessing: Multi-level k-means clustering
  • 2-step search (see the sketch below)
    1. Closest-cluster search
    2. kNN within the closest cluster

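A minimal NumPy sketch of the two-step search (single-level clustering shown; CUDA-kNN uses multi-level k-means and runs on the GPU, and all names and sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database of item codes (e.g., 64-d autoencoder codes).
db = rng.random((10_000, 64)).astype(np.float32)

def kmeans(data, k, iters=10, rng=rng):
    """Preprocessing: plain k-means clustering of the database."""
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((data[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            members = data[assign == c]
            if len(members):
                centers[c] = members.mean(0)
    return centers, assign

centers, assign = kmeans(db, k=32)

def two_step_knn(query, k=5):
    # Step 1: closest-cluster search.
    c = np.argmin(((centers - query) ** 2).sum(-1))
    members = np.flatnonzero(assign == c)
    # Step 2: exact kNN within that cluster only.
    d = ((db[members] - query) ** 2).sum(-1)
    return members[np.argsort(d)[:k]]

print(two_step_knn(rng.random(64).astype(np.float32)))
```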

SLIDE 16

2-Step Classification

  • Step 1: 2 DBN & kNN models
  • Step 2: 2x35 DBN & kNN models (one set per Level-1 category); see the sketch below
  • 2 DAE models
    – Same encoding for step 1 and step 2

  Level 1: 35 categories / Level 5: ~30,000 categories
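The control flow, sketched with stub models (the `Model` wrapper, the hash-based predictions, and the dictionary dispatch are all illustrative placeholders, not the production API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    predict: Callable  # stand-in for a trained DBN/kNN ensemble

# One router over the 35 Level-1 genres, plus one leaf model per genre.
l1_model = Model(predict=lambda x: hash(tuple(x)) % 35)
leaf_models = {g: Model(predict=lambda x, g=g: f"genre{g}/leaf{hash(tuple(x)) % 100}")
               for g in range(35)}

def classify(features):
    genre = l1_model.predict(features)                  # step 1: Level-1 category
    return genre, leaf_models[genre].predict(features)  # step 2: leaf category

print(classify((0.1, 0.4, 0.9)))
```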

SLIDE 17

Feature Extraction

  • Features: 0-1 word vectors
  • Mostly Japanese text
  • Normalize letters: アイフォン 4S → アイフォン 4s
  • Clean all HTML tags: <a href> link </a> → link
  • Regular expressions for (see the sketch below):
    – Product codes: iPhone-4S → iphone4s
    – Japanese counters: 4枚 (do not tokenize)
    – Sizes and dimensions: 12Cm x 3 Cm → 12cmx3cm
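A sketch of such normalization rules in Python; the exact patterns used in production are not given on the slide, so these regexes are hypothetical approximations:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    # Letter normalization (width and case): アイフォン 4S -> アイフォン 4s
    text = unicodedata.normalize("NFKC", text).lower()
    # Strip HTML tags: <a href> link </a> -> link
    text = re.sub(r"<[^>]+>", " ", text)
    # Product codes: iphone-4s -> iphone4s
    text = re.sub(r"\b([a-z]+)-([a-z0-9]+)\b", r"\1\2", text)
    # Sizes and dimensions: 12cm x 3 cm -> 12cmx3cm
    text = re.sub(r"(\d+)\s*cm\s*x\s*(\d+)\s*cm", r"\1cmx\2cm", text)
    return text

print(normalize("MACPHEE <a href>link</a> iPhone-4S 12Cm x 3 Cm"))
# -> "macphee  link  iphone4s 12cmx3cm"
```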

SLIDE 18

Feature Extraction

  • Titles: 26M tokens
  • Descriptions: 47M tokens
  • Use only the 1M most frequent tokens (see the sketch below)
    – Good enough for L1 classification
    – Fewer tokens occur within subcategories for L2 classification


[Charts: token frequency distributions, with total dictionary sizes of 26M and 800K]
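One way to realize the frequency cutoff and the 0-1 word vectors (a sketch; `build_vocab`, the toy data, and the tiny `max_size` are illustrative, while production uses the top 1M tokens):

```python
from collections import Counter

def build_vocab(tokenized_titles, max_size=1_000_000):
    """Keep only the most frequent tokens."""
    counts = Counter(tok for title in tokenized_titles for tok in title)
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(max_size))}

def to_binary_vector(tokens, vocab):
    """0-1 word vector, stored as the sorted set of vocabulary indices present."""
    return sorted({vocab[t] for t in tokens if t in vocab})

vocab = build_vocab([["red", "knit", "sweater"], ["knit", "dress"]], max_size=3)
print(to_binary_vector(["knit", "dress", "unknown"], vocab))
```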

SLIDE 19

Dataset Properties and Hardware Setup

  • 280 million (active and inactive) products

– Rakuten Data Release (https://rit.rakuten.co.jp/opendata.html)

  • Deduped by titles: 280 million → 172 million (see the sketch below)
    – Multiple merchants may sell the same items

  • 28,338 active categories

– ~40% of products are assigned to leaf categories named “others”

  • 90% of randomly selected products used for training
  • A Linux server with 4 Titan X GPUs
  • 2 x 12-core Intel CPUs
  • 96 GB main memory
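A straightforward way to dedupe by title (hash-based; the slide does not specify the normalization applied, so the lowercasing here is an assumption):

```python
import hashlib

def dedupe_by_title(items):
    """Drop items whose (normalized) title was already seen; merchants often
    list the same product, which took the corpus from 280M to 172M items."""
    seen, unique = set(), []
    for item in items:
        key = hashlib.md5(item["title"].strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

print(len(dedupe_by_title([{"title": "Knit Sweater"}, {"title": "knit sweater"}])))
```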
SLIDE 20

Level-1 Genre Prediction Results (Step 1)

[Charts: Percent Recall @ N (N = 1..10) for L1 prediction, with and without “others” categories, comparing Title-DBN, Description-DBN, Title-kNN, Description-kNN, and Combined]
SLIDE 21

Overall Taxonomy Matching (Step 2)

[Charts: Percent Recall @ N (N = 1..10) for L5 prediction, with and without “others” categories, comparing Title-DBN, Description-DBN, Title-kNN, Description-kNN, Combined, and DBNs combined]
SLIDE 22

Sample Results: Merchant Correct / Algorithm Incorrect


Sweet Mother - Isaac Andrews

Merchant Category: Books, Magazines & Comics > Western Books > Books For Kids
Predicted Category: Books, Magazines & Comics > Western Books > Fiction & Literature

SLIDE 23

Sample Results: Merchant Incorrect / Algorithm Correct


トヨトミ[KS-67H]電子点火式対流型石油ストーブKS67H ("Toyotomi [KS-67H] electronic-ignition convection-type kerosene heater KS67H")

Merchant Category: Flowers, Garden & DIY > DIY & Tools > Others
Predicted Category: Consumer electronics > Seasonal home appliances > Heating > Oil stove > 14+ tatami (wooden) / 19+ tatami (rebar)

SLIDE 24

Sample Results: Merchant and Algorithm Both Correct


レンタル【RG87】袴 フルセット/大学生/小学生/高校生/中学生 ("Rental [RG87] hakama full set / university / elementary / high school / junior high school students")

Merchant Category: Women’s Fashion > Japanese > Kimono > Hakama
Predicted Category: Women’s Fashion > Japanese > Rental

SLIDE 25

Summary

  • Large-scale product categorization
  • A multi-modal deep learning approach
  • CUDA-based tools: CUDeep, CUDA-kNN
  • Noisy data, yet high agreement with manual labeling
  • Engineering challenges
    – Large data
    – Dynamic data: products and categories keep changing
    – Not easy to reproduce the research results under these settings

SLIDE 26

Engineering Work


  • Architecture
  • Tuning for different GPU cards
  • Dealing with large data sets
  • Improving prediction accuracy
  • Future work

SLIDE 27

System architecture

  • Designed for high scalability and availability
  • Supports requests with both single and multiple input items
  • Based on Docker; uses nvidia-docker for GPU-based components


https://github.com/NVIDIA/nvidia-docker

SLIDE 28

Classification data flow diagram

SLIDE 29

PROBLEMS & SOLUTIONS

SLIDE 30

GPU memory size difference

Research environment: Titan X, 12,287 MiB
Production environment: Tesla K80, 11,519 MiB
→ 768 MiB less device memory in production

SLIDE 31

GPU memory size difference

The different memory size required a series of experiments to find a new model configuration:

  • Reduce the input layer size, e.g. from 1M to 900K, sacrificing some information

Future work: use newer GPUs with more memory to recover this information loss.


SLIDE 32

Extra-large data volume

230 million items

  • 200 GB of raw data
  • 260 GB of tokenized items
  • 200+ GB of 70+ model files
  • 4 days to prepare training data
  • More than one week to train the models on a single server with 2 Tesla K80 cards
  • Extremely large memory usage during training and classification

SLIDE 33

Extra-large data volume

  • Issue
    – File operations and data processing are very time-consuming
  • Solution
    – Multiprocessing everywhere
    – High-speed storage

SLIDE 34

Accuracy worse than in the experiments

74% → 51%

  • Research showed an accuracy of 74% overall, and up to 88% in some categories
  • After first building the models from the latest data, accuracy was only 51%
  • Further investigation revealed a few significant defects

SLIDE 35

Shuffling input data

  • Issue
    – Highly correlated runs of sample data can produce biased gradients and poor convergence
  • Solution
    – Add a shuffling step to the data preparation pipeline (see the sketch below)


[Diagram: input data preprocessing pipeline with an additional shuffling step]
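A minimal sketch of the added shuffling step (an in-memory version assuming one training example per line; at the full ~200 GB scale, a sharded or external shuffle would be needed):

```python
import random

def shuffle_training_file(in_path: str, out_path: str, seed: int = 0) -> None:
    """Globally shuffle training examples so minibatches are not
    dominated by long runs of correlated, same-category items."""
    with open(in_path, encoding="utf-8") as f:
        lines = f.readlines()
    random.Random(seed).shuffle(lines)
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(lines)
```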

SLIDE 36

Tuning training parameters

  • Issue
    – Models trained on the latest data had low accuracy
      • Smaller input layer size
      • Unbalanced item distribution across categories
  • Solution
    – Increase the number of backpropagation epochs by 2.5x and decrease the bias multiplier by 10x (see the sketch below)
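Expressed as a hypothetical configuration change; the field names and baseline values are illustrative assumptions, and only the 2.5x and 10x factors come from the slide:

```python
baseline = {"backprop_epochs": 10, "bias_multiplier": 1.0}  # assumed baseline values
tuned = {
    "backprop_epochs": int(baseline["backprop_epochs"] * 2.5),  # 2.5x more epochs
    "bias_multiplier": baseline["bias_multiplier"] / 10,        # 10x smaller bias multiplier
}
print(tuned)  # {'backprop_epochs': 25, 'bias_multiplier': 0.1}
```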

SLIDE 37

Grouping of categories

  • Issue
    – Low prediction accuracy for similar categories when their models are separated

  • Solution

– Group similar categories

SLIDE 38

Accuracy improvement result

51% → 80%~98%

  • Recovered the expected results
    – 80% overall accuracy
    – 98% in popular categories
  • Cost several months of work
SLIDE 39

Most successful categories

SLIDE 40

FUTURE WORK

SLIDE 41

Next steps

80% is not enough: the accuracy needs to be improved as much as possible

  • Data analysis
  • New experiments
SLIDE 42

Biased item distribution in leaf categories


[Chart: number of items per category ID, heavily skewed]

This results in low prediction accuracy for categories with few items.

SLIDE 43

Experiment with fine-tuning

  • Add extra training iterations for categories with few items, reusing the same input data to reinforce what the model learns (see the sketch below)
  • Experiments show a positive trend from this fine-tuning

[Chart: number of items (millions) for Categories 1-4, with extra training iterations applied in the model]
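A sketch of the scheduling idea; the threshold and boost factor are invented for illustration, as the slide only states that small categories get extra iterations over the same data:

```python
def epochs_per_category(category_sizes, base_epochs=5, small_threshold=100_000, boost=3):
    """Give categories with few items extra passes over their (repeated) data."""
    return {cat: base_epochs * (boost if n < small_threshold else 1)
            for cat, n in category_sizes.items()}

print(epochs_per_category({"shoes": 2_000_000, "oil_stoves": 40_000}))
# {'shoes': 5, 'oil_stoves': 15}
```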
SLIDE 44

Experiment with splitting models

Separate categories into 3 groups and build independent model sets for each:

  • Extra-small genres
  • Normal genres
  • Extra-large genres

This requires more resources, but is expected to yield a significant accuracy improvement.

SLIDE 45

To meet business requirements

  • Very frequent data updates
    – Need to reduce the time to train new models
  • Will need high-spec GPU servers and automation enhancements

SLIDE 46

THANK YOU!

Q&A


More about Rakuten:
https://global.rakuten.com/corp/about/
https://global.rakuten.com/corp/careers/