Machine Learning Research at eBay: An Industry Lab Perspective - - PowerPoint PPT Presentation



SLIDE 1

Machine Learning Research at eBay: An Industry Lab Perspective

Dennis DeCoste eBay Research Labs

ML Summer School @ UC Santa Cruz, July 13, 2012

Dennis DeCoste (eBay Research Labs) 1 / 66

SLIDE 2

Dennis DeCoste @ research labs

NASA / Caltech JPL, principal computer scientist
Yahoo! Research, founding Director, Machine Learning
Microsoft Live Labs, principal research scientist
Facebook, research scientist
eBay Research Labs, Director of Machine Learning

SLIDE 3

Outline

1. ML at eBay: Data and Apps

2. Needs and Opportunities for ML Research
   Randomized Algorithms: Scalable Large-Scale
   Systems Issues: The Need for Speed
   Model Compilation: Decoupling Train vs Test

3. Conclusion / Open Issues

SLIDE 4

ML at eBay: Data and Apps

“ML Apps” vs “Applied Research”

eBay product groups: "applied scientists"; driven by quarterly roadmaps
search, recommendation, catalog, fraud detection, ...

eRL focus: "what if" & foundational ML research

data: scaling to massive data (e.g. behavioral logs)
algos: stream, sample, randomize, mini-batch, ...
systems issues: exploit multi-cores, GPU, Hadoop, ...
methodology: decouple accurate training from cheap running

SLIDE 5

Data at eBay

users: behavior logs (query,click,view,wish,buy,...)

≈ 100 million users (buyers and sellers)

items: text (title,description,...), prices, images, ...

≈ 10 million items listed/day
billions of historic item listings
long tail: many have no product SKU id (e.g. antiques)
variety: auctions / Buy It Now, new/used, ...

SLIDE 6

Text Data: Project Origami

LDA, sentiment analysis (e.g. product/seller reviews), NLP, ...
information extraction (e.g. product properties)
item classification into the large catalog taxonomy

SLIDE 7

Example Product Descriptions

study: billions of descriptions (≈ 1 year)

SLIDE 8

Unsupervised Property Extraction

(Rohanimanesh, Mauge, Ruvini (eRL), ACL 2012)

big data + some structured sellers → simple heuristics suffice

popularity of a name/value pair ≡ # sellers using it
KN = known name (pattern 1 finds new names)
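This popularity heuristic amounts to counting distinct sellers per name/value pair; a sketch with invented data (the listings, names, and values below are purely illustrative):

```python
from collections import Counter

# hypothetical extracted triples: (seller_id, property_name, value)
listings = [
    (1, "brand", "nike"), (2, "brand", "nike"), (3, "brand", "nike"),
    (1, "color", "red"), (2, "colour", "red"),
]

# popularity of a name/value pair = number of distinct sellers using it
popularity = Counter((name, val) for _, name, val in set(listings))
print(popularity[("brand", "nike")])   # → 3
```

A pair many distinct sellers use ("brand"/"nike") is a trustworthy property; rare pairs ("colour"/"red") are candidates for synonym merging or noise.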

SLIDE 9

Example Discovered Properties (5 Cats)

SLIDE 10

Property Synonym Discovery

logistic regression with features: name string edit distance, common values, co-occurrence (anti!), ...
train set: hand-clustered discovered property names

SLIDE 11

Example Discovered Property Synonyms

precision = 91.8%, recall = 51%

(lack of value overlap: need clean/normalize)

SLIDE 12

Large eBay Taxonomy of Categories

SLIDE 13

Cats: Levels / Sizes, Skew

SLIDE 14

Cats: Two-Stage ML Approach

item titles: millions of unigrams

SLIDE 15

Cats: Grouping Algorithm

SLIDE 16

Cats: Example Group Result

SLIDE 17

Cats: Some Results

hier-ebay-struc: size-balanced grouping of the existing 6-level tree
> 20,000 classes and 83 million training examples

SLIDE 18

ML Apps at eBay: Search

SLIDE 19

eBay Search: Some ML Challenges

query text → rank product item listings

e.g. pairwise RankSVM; but whack-a-mole: fixes can lower (unseen) items

blending with deterministic user orderings ("Time Ending Soonest", cheapest first, ...)
correlating offline metrics to A/B tests
temporal data mining in Hadoop (Mobius)

SLIDE 20

“Null Search”: Zero-Recall Queries

(Singh, Parikh, Sundaresan (eRL), WWW 2012)

unique eBay challenges:
100 million queries / day, significant nulls
dynamic inventory, e.g. "warren buffet lunch"
seller / buyer vocab mismatch, e.g. "universal" vs "size 5" clock key

SLIDE 21

Null Search: Common Query Attributes

SLIDE 22

Null Search: Stats (Top-K, Historic Overlap)

billions of products in history, 10 million listed / day
past month's items overlap 30% of today's null queries
coverage 3x @ 10% (non-null vs null queries)

SLIDE 23

Null Search: Algo/Example

SLIDE 24

Null Search: Taxa Inferred per Search

SLIDE 25

Image Search

SLIDE 26

ML Apps at eBay: Merch

product merchandise recommendation

e.g. large-scale sparse user/item SVD (100 million by billions)

extra challenges: dynamic listings; items ≠ products

SLIDE 27

Needs and Opportunities for ML Research / Randomized Algorithms: Scalable Large-Scale

Outline

1. ML at eBay: Data and Apps

2. Needs and Opportunities for ML Research
   Randomized Algorithms: Scalable Large-Scale
   Systems Issues: The Need for Speed
   Model Compilation: Decoupling Train vs Test

3. Conclusion / Open Issues

SLIDE 28

Randomized Large-Scale SVD

function [U, S, V] = rsvd(A, k)
  % randomized SVD: A ≈ (Q*Q')*A for an orthonormal range basis Q
  [m, n] = size(A);  nz = nnz(A);
  P = randn(n, k + 5);                % small oversampling
  Y = full(A * P);                    % O(nz k)
  [Q, R] = qr(Y, 0);                  % O(m k^2)
  B = full(Q' * A);                   % O(k nz)
  [Uhat, S, V] = svd(B, 'econ');      % O(n k^2)
  U = Q * Uhat;                       % O(m k^2)
  U = U(:, 1:k); S = S(1:k, 1:k); V = V(:, 1:k);

total: O(nz k + (m + n) k^2), vs O(m n k) for MATLAB's [U,S,V] = svds(A, k)   [multicore, GPU, 2-pass]
See: "Finding structure with randomness", Halko, Martinsson, Tropp. SIAM Review, 2011.

SLIDE 29

RSVD vs SVD

U = randn(20*1000, 100); V = randn(20*1000, 100); A = U*V';
[i, j, v] = find(A);                                        % rows, cols, values
[~, ids] = sort(v);
ids = [ids(1:length(v)/200); ids(end-length(v)/200:end)];   % keep the ~1% most extreme entries
A = sparse(i(ids), j(ids), v(ids));
k = 100;                                                    % pick a rank, e.g. 100
tic; [Ur, Sr, Vr] = rsvd(A, k); toc
tic; [U0, S0, V0] = svds(A, k); toc

also useful for seeding, even if accuracy not sufficient

SLIDE 30

Reservoir Sampling

UNIX: head -1000 < rows.txt > subset.txt
want: samp -1000 < rows.txt > subset.txt

// reservoir sampling: uniform k-subset of a stream of unknown length
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

void reservoir_sample(int k) {
  long n = 0;                      // items seen so far
  vector<string> R(k);
  string s;
  while (cin >> s) {
    if (n < k) {
      R[n] = s;
    } else {
      long i = rand() % (n + 1);   // uniform over [0, n]: note (n+1), not n
      if (i < k) R[i] = s;         // current item survives with prob k/(n+1)
    }
    n++;
  }
  if (n < k) k = n;
  for (int i = 0; i < k; i++) cout << R[i] << endl;
}

apps: streams, map-reduce, simulators
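The same algorithm as a Python sketch (the function name and seeded RNG are our choices); the key point is that the index is drawn uniformly over the n+1 items seen so far:

```python
import random

def reservoir_sample(stream, k, rng=random.Random(0)):
    """Keep a uniform random k-subset of a stream of unknown length."""
    R = []
    for n, item in enumerate(stream):   # n = number of items seen before this one
        if n < k:
            R.append(item)
        else:
            i = rng.randrange(n + 1)    # uniform over [0, n]
            if i < k:                   # current item survives with prob k/(n+1)
                R[i] = item
    return R

print(reservoir_sample(range(1000), 5))
```

Only one pass and O(k) memory, so it drops directly into a mapper or a log-stream consumer.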

SLIDE 31

Bootstrap Sampling

traditional bootstrap

sample with replacement (k times do: n → n)
per each of k samples: n times, pick from the n examples
e.g. k trainings → ensemble (e.g. random forest)
e.g. internet log aggregation: sums (clicks, revenue), ...
issue: sample by entry, or by user, or ...
confidence intervals (don't assume a normal distribution)

online bootstrap

popularized in ML by (Oza & Russell, AISTATS 2001)
per example: determine k counts, one per bootstrap sample
for each sample: c_i = rpois(λ = 1); Σ_{i=1}^n c_i ≈ n

good for: streaming, n unknown (e.g. user sampling)
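A stdlib-only Python sketch of this online bootstrap (the function names, the Knuth Poisson sampler, and the choice of k=50 replicates are ours, not from the slides): each arriving example joins replicate j with weight c ~ Poisson(1).

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(lambda=1) via Knuth's multiplication method."""
    L, k, p = math.exp(-1.0), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def online_bootstrap_means(stream, k, rng=random.Random(0)):
    """Maintain k bootstrap replicates of the mean in a single pass."""
    sums, counts = [0.0] * k, [0] * k
    for x in stream:
        for j in range(k):
            c = poisson1(rng)          # example appears c times in replicate j
            sums[j] += c * x
            counts[j] += c
    return sorted(s / n for s, n in zip(sums, counts))

data_rng = random.Random(1)
reps = online_bootstrap_means((data_rng.random() for _ in range(2000)), 50)
print(reps[1], reps[-2])   # rough 95% interval around the true mean 0.5
```

Nothing needs to know n in advance, which is exactly why this fits streams and user-level sampling.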

SLIDE 32

Bootstrap Example

BS = function(x, stat, k) {
  bs = replicate(k, stat(sample(x, replace=T)))
  quantile(bs, c(0.025, 0.5, 0.975))
}
BS.pois = function(x, stat, k) {
  bs = replicate(k, stat(rep(x, rpois(length(x), lambda=1))))
  quantile(bs, c(0.025, 0.5, 0.975))
}
LMH.normal <- function(x) {
  mu = mean(x); stderr = sqrt(var(x) / length(x))
  c(mu - 1.96*stderr, mu, mu + 1.96*stderr)
}
n = 100*1000; k = 10*1000; x = runif(n, 0, 1); s = mean
print(quantile(replicate(k, s(runif(n, 0, 1))), c(0.025, 0.5, 0.975)))
# 0.4982299 0.5000069 0.5017628   (true sampling distribution of the mean)
print(LMH.normal(x))
# 0.4978307 0.4996247 0.5014186
print(BS(x, s, k))
# 0.4978188 0.4996341 0.5014037
print(BS.pois(x, s, k))
# 0.4978936 0.4996468 0.5014059

SLIDE 33

Bootstrap Example 2

# BS, BS.pois, LMH.normal defined as on the previous slide
n = 100*1000; k = 10*1000; x = rexp(n, 1); s = median
print(quantile(replicate(k, s(rexp(n, 1))), c(0.025, 0.5, 0.975)))
# 0.6869930 0.6931360 0.6994282   (true sampling distribution of the median)
print(LMH.normal(x))
# 0.9964869 1.0026975 1.0089082   (normal CI around the mean -- wrong statistic!)
print(BS(x, s, k))
# 0.6899971 0.6967001 0.7029585
print(BS.pois(x, s, k))
# 0.6899953 0.6967119 0.7029711
print(s(x))
# 0.6966489

SLIDE 34

Bag of Little Bootstraps (BLB)

(Kleiner,Talwalkar,Sarkar,Jordan, ICML 2012)

150GB data (n=6M,d=3000); 10x8 cores (60GB) vs 20x4 cores (240GB)

Poisson BOOT vs BLB; s = 5 times:

b = n^0.7 subsamples → r = 50 weighted (sum = n) resamplings
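The BLB recipe can be sketched in a few lines of Python; this toy version uses the mean as the statistic, smaller s and r than the slide's s = 5, r = 50, and a naive O(n) multinomial draw (all our simplifications for brevity):

```python
import random
from statistics import mean

def blb_interval(x, s=3, r=20, rng=random.Random(0)):
    """Bag of Little Bootstraps sketch: s subsamples of size b = n**0.7;
    each is resampled r times with multinomial counts summing back to n."""
    n = len(x)
    b = int(n ** 0.7)
    lows, highs = [], []
    for _ in range(s):
        sub = rng.sample(x, b)
        stats = []
        for _ in range(r):
            counts = [0] * b
            for _ in range(n):              # naive multinomial(n, 1/b) draw
                counts[rng.randrange(b)] += 1
            stats.append(sum(c * v for c, v in zip(counts, sub)) / n)
        stats.sort()
        lows.append(stats[0])
        highs.append(stats[-1])
    return mean(lows), mean(highs)          # average the interval endpoints

data_rng = random.Random(1)
x = [data_rng.random() for _ in range(2000)]
lo, hi = blb_interval(x)
print(lo, hi)
```

The win is that each worker only ever materializes b ≪ n distinct points; the resampling weights, not copies of the data, carry the full-n scale.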

SLIDE 35

Many Optimization Tricks

growing bag of ML tricks:
kernel trick
hashing trick
random projection trick
...

simultaneous perturbation stochastic approximation

http://www.jhuapl.edu/spsa/

[finite diffs: 2 loss evals / iter]

see Mark Schmidt's MATLAB minFunc:
http://www.di.ens.fr/~mschmidt/Software/minFunc.html
Hessian-free, finite differences using complex steps, ...

...
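As one concrete instance from this bag of tricks, a minimal sketch of the hashing trick (the bucket count and CRC32-based signed hashing are illustrative choices, not from the slides):

```python
import zlib
from collections import defaultdict

def hash_features(tokens, n_buckets=2**10):
    """Hashing trick: fold token counts into a fixed-size sparse vector,
    with a sign bit so collisions cancel in expectation."""
    v = defaultdict(float)
    for t in tokens:
        h = zlib.crc32(t.encode())
        v[h % n_buckets] += 1.0 if (h >> 31) == 0 else -1.0
    return dict(v)

print(hash_features("new ipod nano 16gb blue".split()))
```

No vocabulary dictionary to build or ship, which matters when item titles contribute millions of unigrams.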

SLIDE 36

Needs and Opportunities for ML Research / Systems Issues: The Need for Speed

Outline

1. ML at eBay: Data and Apps

2. Needs and Opportunities for ML Research
   Randomized Algorithms: Scalable Large-Scale
   Systems Issues: The Need for Speed
   Model Compilation: Decoupling Train vs Test

3. Conclusion / Open Issues

SLIDE 37

ML Bottlenecks

most cases: dot products / distance computations
||x − w||^2 ≡ x'x − 2x'w + w'w

numbers every ML practitioner should know:
streaming / sequential disk speed = X MB/s ?
each CPU core = Y GFlops/s ?
sequential scan of RAM = Z GB per second ?
hint: large-scale usually (should) not be I/O bound?

SLIDE 38

ML Systems Issues: Why Care?

answers: "is the code for this algo already near-optimal?"
research needs speed more than production does?!

"code not even optimized yet" ≡ not studied at scale
real-time vs training (e.g. playback a year of data)
$$$: like 10x more machine ($1M buck → $10M bang)
relevance: research today, using tomorrow's machines
prudence: distributed computing ≠ silver bullet
... more (vs faster): "train(X)" = 1 op in a larger program

resource allocation: a model is 1 of many @ a company
e.g. faster training frees the shared, finite cluster for others

SLIDE 39

Motivation: Better ML Metrics

common paper claim: "our new X is competitive with SVMs", etc.
needed shift: accuracy → accuracy per training sec

opportunity: different methods at different stages
overheads and constants matter

especially for fast linear online methods ...

SLIDE 40

Quiz Time: Nearest Neighbors

X = randn(1000, 1000*1000, 'single'); Q = randn(1000, 1000, 'single'); q = Q(:, 1);

recall: ||x − q||^2 ≡ x'*x − 2*x'*q + q'*q

tic; D = X'*Q; toc, tic; d = X'*q; toc     % BLAS3 SGEMM or not

Q: speeds on 12-core 3.5Gz CPU?

8.13 s vs 0.138 s → 59x longer ... 1000/59 = 17x faster per query (cores + cache)

(3.5 GHz)(12 cores)(4 SSE flops) ≈ 168 GFlop/s; X'*Q = 1 TFlop [≈ 6 s at peak]

newest: Sandy Bridge AVX vs SSE: 2x more (8 vs 4 floats per instruction)

Q: speeds on a single core?

72.3 s vs 0.387 s → 187x longer ... 1000/187 = 5.3x faster per query (cache)

Q: speeds on GPU (e.g. Nvidia GTX 690)?

5.5 TFlop/s peak (vs 168 GFlop/s for the 12-core CPU @ 3.5 GHz) → > 20x faster

SLIDE 44

Quiz Time

X = ones(1024*1024*1024, 1); tic; sum(X); toc
→ 0.33 secs ≈ 24 GB/s

time cat file1GB.txt > /dev/null
→ > 100 MB/s reads

dual SATA3 SSDs: ≈ 1 GB/s reads

trick: train many models on one pass over the data stream (e.g. model selection, ensembles, ...)
moral: amortize cache and I/O costs
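The "scan Z GB of RAM per second" number can be measured directly; a pure-Python sketch (expect far less than the ≈ 24 GB/s MATLAB figure above, since here the interpreter, not memory bandwidth, is the bottleneck):

```python
import array
import time

# ~40 MB of doubles, all 1.0
X = array.array('d', [1.0]) * (5 * 1000 * 1000)
t0 = time.time()
total = sum(X)           # one sequential scan
dt = time.time() - t0
print(f"{X.itemsize * len(X) / dt / 1e9:.2f} GB/s effective scan rate")
```

The gap between this number and the hardware's bandwidth is exactly the "is the code already near-optimal?" question from the earlier slide.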

SLIDE 45

Example: MMMF Ensembles

give back speed → more accuracy (vs less time)

fast Maximum Margin Matrix Factorization (Rennie & Srebro, ICML 2005)

MMMF Ensembles (DeCoste, ICML 2006), each ≈ 20x faster

later: ensembles dominated Netflix Prize

SLIDE 46

MMMF Ensembles for MovieLens

SLIDE 47

GPU

e.g. MBP Retina: GT 650M ≈ 500GF (vs 10GF/CPU)

"Debunking the 100x GPU vs CPU Myth", (Lee et al (Intel), ISCA 2010)

OpenCL and Nvidia CUDA ...
... but for ML, first learn/use the CUBLAS API!
MATLAB Jacket (and GPUmat) insufficient

MAGMA BLAS/LAPACK: http://icl.cs.utk.edu/magma/ (also CULA LAPACK: www.culatools.com)

SLIDE 48

Hadoop / Map-Reduce

“Map-reduce for ML on multicore”, (Chu et al, NIPS 2006)

Apache Mahout? "compute @ data"!?
best: 15 MB/s; avg: 5 MB/s
dirty secret: more about scalable storage than compute

"Nobody ever got fired for using Hadoop on a cluster", (Rowstron et al (MSR), HotCDP 2012)

suboptimal for: iterative ML
good use: "select-join" → train set (e.g. Hive SQL) ... reducer → train @ a 40-core, 2-GPU, 1TB-RAM box

(compressed, column-stored training data often fits RAM!)

SLIDE 49

Needs and Opportunities for ML Research / Model Compilation: Decoupling Train vs Test

Outline

1. ML at eBay: Data and Apps

2. Needs and Opportunities for ML Research
   Randomized Algorithms: Scalable Large-Scale
   Systems Issues: The Need for Speed
   Model Compilation: Decoupling Train vs Test

3. Conclusion / Open Issues

SLIDE 50

Model Compilation: Why?

common scenario:

ML product team: "We just train Naive Bayes models – that's all we can afford to run/update in production"
"Ensembles, or 1000-hidden-unit neural nets? No way!"
need: existence proofs (e.g. Netflix Prize)

decoupling training model from execution model:

train-time: find existence proof M1 (e.g. noise, outliers)
compile-time: find fast run-time M2, guided by M1

avoids premature optimization bias on run-time; frees ML researchers to focus on what’s possible

SLIDE 51

Early Approach: TREPAN

(Craven and Shavlik, NIPS 1995)
goal: accurate neural net → explainable rules
instance of the Oracle Trick:

train M1 (neural net) on labeled data X1, L1
run M1 on X2 = X1 ∪ unlabeled ∪ phantom data
treat M1's outputs on X2 as new labels L2 (denoises L1)
train the new target model M2 on the expanded X2, L2

intuition: enough data → M2 mimics M1 well
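A toy end-to-end sketch of the Oracle Trick (everything here is invented for illustration: the teacher M1 is a 5-NN classifier on noisy 1-D labels, and the compiled target M2 is a single threshold):

```python
import random

rng = random.Random(0)

# train-time: teacher M1 = 5-NN on noisily labeled 1-D data (true concept: x > 0.5)
X1 = [rng.random() for _ in range(200)]
L1 = [(x > 0.5) ^ (rng.random() < 0.1) for x in X1]   # 10% label noise

def m1(q, k=5):
    idx = sorted(range(len(X1)), key=lambda i: abs(X1[i] - q))[:k]
    return sum(L1[i] for i in idx) > k / 2

# oracle trick: relabel a dense unlabeled pool with M1's (denoised) outputs
X2 = [i / 1000 for i in range(1000)]
L2 = [m1(x) for x in X2]

# compile-time: fit the cheap run-time model M2 (a single threshold) to (X2, L2)
def fit_threshold(X, L):
    return min((sum((x > t) != l for x, l in zip(X, L)), t) for t in X)[1]

theta = fit_threshold(X2, L2)
print(theta)
```

In this toy run M2's threshold lands near the true boundary despite the 10% label noise M1 trained on, while costing one comparison per query instead of a 200-point nearest-neighbor search.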

SLIDE 52

Model Compilation

step 1: learn the most accurate model possible (e.g. huge ensemble; don't worry about run-time cost)
step 2: "compile" it into a simpler model
goal: smaller/faster, retaining accuracy/robustness
"cross-compilation": e.g. train random forest → ship neural net
e.g. use expensive features only to denoise training

SLIDE 53

Compression of Large Ensembles

(Caruana et al, ICML 2004), forward selection

SLIDE 54

Early Approach: SVM → Reduced Sets

e.g. (Burges and Schölkopf, NIPS 1997)

SVM: f(x) = Σ_{i=1}^N α_i K(x, s_i), 0 ≤ α_i ≤ C, s_i ∈ X   (the SVs)

reduced set: g(x) = Σ_{i=1}^M β_i K(x, z_i) for M ≪ N

(hard, slow) global optimization: min_{β_i, z_i} ρ = ||W − V||, where
W = Σ_{i=1}^N α_i φ(s_i) and V = Σ_{i=1}^M β_i φ(z_i)

same cost for each query – even easier ones

SLIDE 55

Example: MNIST (d=784, 60k train, 10k test)

Virtual SVs (DeCoste and Scholkopf, MLJ 2002)

SLIDE 56

Nearest Support Vectors (NSV)

(DeCoste and Mazzoni, ICML 2003)

SVM: f(x) = Σ_{i=1}^N α_i K(S_i, x)

for a given query x:
NNscore_i(x) ≡ |α_i| K̂(S_i, x)
sort the (α_i, S_i) pairs, largest NNscore_i(x) first
(e.g. approximate kernels K̂ (PCA-20), so cost ≪ O(dN); kd-trees, etc.)
g_k(x) = Σ_{j=1}^k α_j K(S_j, x), stopping once g_k(x) falls outside [L_k, H_k]

example:
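A minimal Python sketch of this early-exit evaluation loop (the α's, kernel values, and L/H thresholds below are invented toy numbers):

```python
def nsv_predict(terms, L, H):
    """Early-exit kernel expansion.
    terms: (alpha_i, K(S_i, x)) pairs, sorted by decreasing |alpha_i| * Khat;
    L[k], H[k]: learned partial-sum thresholds.
    Returns (predicted sign, number of kernel terms actually evaluated)."""
    g = 0.0
    for k, (a, kv) in enumerate(terms):
        g += a * kv
        if g > H[k]:
            return +1, k + 1   # already confidently positive: stop early
        if g < L[k]:
            return -1, k + 1   # already confidently negative: stop early
    return (1 if g > 0 else -1), len(terms)

# hypothetical easy query: the first (largest) term already clears H[0]
terms = [(2.0, 1.0), (1.0, 0.5), (-0.5, 0.9), (0.25, 0.1)]
L_th = [-1.5, -1.0, -0.5, 0.0]
H_th = [1.5, 1.0, 0.5, 0.0]
print(nsv_predict(terms, L_th, H_th))   # → (1, 1)
```

Easy queries exit after one or two terms; only hard queries near the decision boundary pay for the full expansion.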

SLIDE 57

Example: 3 vs 8 Queries

SLIDE 58

NSV: Learning Thresholds

run the NSV algo over a representative pre-query sample:
massive training / unlabelled set
generate phantoms (e.g. munge, convex hull, etc.)

H_k = max g_k(x) s.t. g_k(x) > 0 yet f(x) < 0
L_k = min g_k(x) s.t. g_k(x) < 0 yet f(x) > 0
"tug of war" shuffling for imbalanced data
windowed outward smoothing, e.g. L_k = min over window k-w to k+w

SLIDE 59

Linear Filtering

linear models often suffice (e.g. 75-90% of queries)
before running NSV, filter the query with a linear SVM, using:

H_k = max g_k(x) s.t. f_LINEAR(x) > 0 yet f(x) < 0
L_k = min g_k(x) s.t. f_LINEAR(x) < 0 yet f(x) > 0

again, smooth outwards to account for noise/extrapolation
gave an additional 3-4x speedup

SLIDE 60

Some Pairwise Results (MNIST)

SLIDE 61

NSV Pairwise Speedups (MNIST, mean=111)

SLIDE 62

NSV Disagreements (MNIST)

SLIDE 63

Kernel Machine Exact Output Bounds

(DeCoste, ICML 2002); slower (e.g. MNIST: 5 vs 100) (computational geometry / incomplete Cholesky)

SLIDE 64

Proportionality to Query Difficulty

SVM: discriminative ≪ generative model
NSV: sign(f(x)) ≪ exact f(x) = Σ_{i=1}^M α_i K(S_i, x)

SLIDE 65

New Open Issue in ML

kernel machines & ensembles revolutionized ML (SVM, boosting, random forests of DTs, ...)
via model compilation, we can get both accuracy and speed
... but it is a "procedural hack"
challenge: get both without the "intermediate bulge"?

SLIDE 66

Conclusion / Open Issues

Some Open Issues / Needs

compile M1 → M2 vs directly learning M2 (deep learning?)
refocus on better metrics:

test accuracy per training sec
accuracy mean vs robustness, ...

systems issues critical

enable study @ scale (including brute-force baselines, e.g. kNN)

build on recent success with online/SGD/linear:

adaptive vs fixed learning-rate schedules
adaptive mini-batch sizes
randomized algos? ...

SLIDE 67

Acknowledgments

eRL team:

Neel Sundaresan, Nish Parikh, Gyanit Singh, Badrul Sarwar, Kamal Jain,
Eric Brill (VP of Research at eBay), C.J. Lin (on sabbatical at eRL), ...

Search team:

David Goldberg, Mike Mathieson, Dan Fain, Hugh Williams

labs.ebay.com

SLIDE 68

References I

C. Burges and B. Schölkopf. Improving the accuracy and speed of support vector machines. In NIPS, 1997.

R. Caruana, A. Niculescu, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In ICML, 2004.

C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.

M. Craven and J. Shavlik. Extracting tree-structured representations of trained networks. In NIPS, 1995.

D. DeCoste. Anytime interval-valued outputs for kernel machines: Fast support vector machine classification via distance geometry. In ICML, 2002.

D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In ICML, 2006.

D. DeCoste and D. Mazzoni. Fast query-optimized kernel machine classification via incremental approximate nearest support vectors. In ICML, 2003.

D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine Learning, 46:161-190, 2002.

SLIDE 69

References II

N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217-288, June 2011.

A. Kleiner, A. Talwalkar, P. Sarkar, and M. Jordan. The big data bootstrap. In ICML, 2012.

N. C. Oza and S. J. Russell. Online bagging and boosting. In AISTATS, 2001.

J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, 2005.

K. Rohanimanesh, K. Mauge, and J.-D. Ruvini. Structuring e-commerce inventory. In ACL, 2012.

A. Rowstron, D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. Nobody ever got fired for using Hadoop on a cluster. In HotCDP, 2012.

G. Singh, N. Parikh, and N. Sundaresan. Rewriting null e-commerce queries to recommend products. In WWW, 2012.
