
Machine Learning Research at eBay: An Industry Lab Perspective



  1. Machine Learning Research at eBay: An Industry Lab Perspective Dennis DeCoste eBay Research Labs ML Summer School @ UC Santa Cruz, July 13, 2012 Dennis DeCoste (eBay Research Labs) 1 / 66

  2. Dennis DeCoste @ research labs: NASA / Caltech JPL, principal computer scientist; Yahoo! Research, founding Director, Machine Learning; Microsoft Live Labs, principal research scientist; Facebook, research scientist; eBay Research Labs, Director of Machine Learning

  3. Outline
       1. ML at eBay: Data and Apps
       2. Needs and Opportunities for ML Research
            Randomized Algorithms: Scalable Large-Scale
            Systems Issues: The Need for Speed
            Model Compilation: Decoupling Train vs Test
       3. Conclusion / Open Issues

  4. ML at eBay: Data and Apps. “ML Apps” vs “Applied Research”
       eBay product groups: “applied scientists”; driven by quarterly roadmaps
         search, recommendation, catalog, fraud detection, ...
       eRL focus: “what if” & foundational ML research
         data: scaling to massive data (e.g. behavioral logs)
         algos: stream, sample, randomize, mini-batch, ...
         systems issues: exploit multi-cores, GPU, Hadoop, ...
         methodology: decouple accurate train from cheap run

  5. Data at eBay
       users: behavior logs (query, click, view, wish, buy, ...)
         ≈ 100 million users (buyers and sellers)
       items: text (title, description, ...), prices, images, ...
         ≈ 10 million items listed/day
         billions of historic item listings
         long tail: many have no product SKU id (e.g. antiques)
         variety: auctions / Buy It Now, new/used, ...

  6. Text Data: Project Orgami. LDA, sentiment analysis (e.g. product/seller reviews), NLP, ...; information extraction (e.g. product properties); item classification into large catalog taxonomy

  7. Example Product Descriptions. Study: billions of descriptions (≈ 1 year)

  8. Unsupervised Property Extraction (Rohanimanesh, Mauge, Ruvini (eRL), ACL 2012). Big data, some structured sellers → simple heuristics suffice; popularity of name/value ≡ # sellers using it; KN = known name (pattern 1 finds new names)
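
The popularity heuristic on this slide (count how many distinct sellers use a name/value pair) can be sketched as below. This is a minimal illustration, not the method of the cited paper: the `name: value` regular expression, the `listings` data, and the function name `property_popularity` are all assumptions made for the example, since the slide's actual extraction patterns were in a figure that did not survive.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a short name, a colon, then a value up to the next separator.
PAIR = re.compile(r"([A-Za-z][A-Za-z ]{0,30}?)\s*:\s*([^;\n]+)")

def property_popularity(listings):
    """Map each (name, value) pair to the number of distinct sellers using it.

    listings: iterable of (seller_id, description) tuples (assumed input shape).
    """
    sellers = defaultdict(set)
    for seller_id, description in listings:
        for name, value in PAIR.findall(description):
            key = (name.strip().lower(), value.strip().lower())
            sellers[key].add(seller_id)
    return {key: len(ids) for key, ids in sellers.items()}

# Made-up listings for illustration only.
listings = [
    ("s1", "Brand: Nikon; Color: Black"),
    ("s2", "Brand: Nikon; Colour: black"),
    ("s1", "Brand: Nikon"),  # same seller again: counted once
]
popularity = property_popularity(listings)
```

Counting sellers rather than listings keeps one prolific seller from inflating a pair's popularity.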

  9. Example Discovered Properties (5 Cats)

  10. Property Synonym Discovery. Logistic regression with features: name string edit distance, common values, co-occurrence (anti!), ...; train set: hand cluster discovered property names
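
A rough sketch of the three named features and a logistic score follows. It assumes "co-occurrence (anti!)" means that two names appearing on the same item are evidence against their being synonyms. The input shapes (`values`, `items`), the normalizations, and the weights are hypothetical; the weights are illustrative and not fitted to any data.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def pair_features(name_a, name_b, values, items):
    """Features for a candidate synonym pair.

    values: dict name -> set of values seen with that name (assumed shape).
    items:  dict name -> set of item ids whose description uses that name.
    Both sets are assumed non-empty for every name.
    """
    va, vb = values[name_a], values[name_b]
    ia, ib = items[name_a], items[name_b]
    return [
        edit_distance(name_a, name_b) / max(len(name_a), len(name_b)),
        len(va & vb) / len(va | vb),   # common values (Jaccard overlap)
        len(ia & ib) / len(ia | ib),   # co-occurrence on the same item
    ]

def synonym_probability(features, weights, bias):
    """Logistic regression score: sigmoid(w . x + b)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

values = {"color": {"black", "red"}, "colour": {"black", "blue"}}
items = {"color": {1, 2}, "colour": {3, 4}}
x = pair_features("color", "colour", values, items)
# Illustrative weights only: distance and co-occurrence count against a match.
p = synonym_probability(x, weights=[-2.0, 3.0, -4.0], bias=0.0)
```

In practice the weights would be fitted on the hand-clustered property names the slide mentions.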

  11. Example Discovered Property Synonyms. Precision = 91.8%, recall = 51% (lack of value overlap: need clean/normalize)
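
For a single combined figure, the two reported numbers can be folded into an F1 score. The value below is derived here from the slide's precision and recall; it is not reported in the deck.

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.918 \times 0.51}{0.918 + 0.51}
    \approx 0.656
```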

  12. Large eBay Taxonomy of Categories

  13. Cats: Levels / Sizes, Skew

  14. Cats: Two-Stage ML Approach. Item titles: millions of unigrams

  15. Cats: Grouping Algorithm

  16. Cats: Example Group Result

  17. Cats: Some Results. hier-ebay-struc: size-balanced grouping of existing 6-level tree; > 20,000 classes and 83 million training examples

  18. ML Apps at eBay: Search

  19. eBay Search: Some ML Challenges. Query text → rank product item listings (e.g. pairwise RankSVM, but whack-a-mole lower unseen items); blending with user deterministic orderings (“Time Ending Soonest”, cheapest first, ...); correlating offline metrics to A/B tests; temporal data mining in Hadoop (Mobius)
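
The pairwise RankSVM idea mentioned on this slide can be sketched as follows: listings for one query are turned into feature-difference vectors, and a linear scorer is trained so that each difference scores above a margin. This is a toy illustration with plain SGD, not eBay's ranker; the features, relevance labels, hyperparameters, and function names are made up for the example.

```python
import random

def pairwise_examples(listings):
    """Turn (features, relevance) listings for one query into difference vectors.

    For each pair with different relevance, emit x_hi - x_lo; a linear ranker
    should score every such difference above zero.
    """
    pairs = []
    for xi, yi in listings:
        for xj, yj in listings:
            if yi > yj:
                pairs.append([a - b for a, b in zip(xi, xj)])
    return pairs

def train_rank_svm(pairs, dim, epochs=50, lr=0.1, reg=0.01, seed=0):
    """Plain SGD on the L2-regularized pairwise hinge loss max(0, 1 - w.d)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    order = list(pairs)            # shuffle a copy, leave the caller's list alone
    for _ in range(epochs):
        rng.shuffle(order)
        for d in order:
            margin = sum(wk * dk for wk, dk in zip(w, d))
            for k in range(dim):
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

# Toy query: feature values and relevance grades are invented.
listings = [([1.0, 0.2], 2), ([0.5, 0.9], 1), ([0.1, 0.4], 0)]
pairs = pairwise_examples(listings)
w = train_rank_svm(pairs, dim=2)
scores = [sum(wk * xk for wk, xk in zip(w, x)) for x, _ in listings]
```

Because training only sees pairs built from logged listings, items with no history contribute no pairs, which is one way to read the slide's remark about unseen items.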

  20. “Null Search”: Zero-Recall Queries (Singh, Parikh, Sundaresan (eRL), WWW 2012). Unique eBay challenges: 100 million queries / day, significant nulls; inventory dynamic, e.g. “warren buffet lunch”; seller / buyer vocab mismatch, e.g. “universal” vs “size 5” clock key

  21. Null Search: Common Query Attributes

  22. Null Search: Stats (Top-K, Historic Overlap). Billions of products in history, 10 million listed / day; past month items overlap 30% today’s null queries; coverage 3x @ 10% (non vs null)

  23. Null Search: Algo/Example

  24. Null Search: Taxa Inferred per Search

  25. Image Search

  26. ML Apps at eBay: Merch. Product merchandise recommendation, e.g. large-scale sparse user/item SVD (100 million by billions); extra challenges: listings dynamic, items ≠ products

  27. Needs and Opportunities for ML Research: Randomized Algorithms: Scalable Large-Scale. Outline
       1. ML at eBay: Data and Apps
       2. Needs and Opportunities for ML Research
            Randomized Algorithms: Scalable Large-Scale
            Systems Issues: The Need for Speed
            Model Compilation: Decoupling Train vs Test
       3. Conclusion / Open Issues

  28. Randomized Large-Scale SVD

      function [U, S, V] = rsvd(A, k)
        % Randomized rank-k SVD: A ≈ (Q*Q')*A
        [m, n] = size(A); nz = nnz(A);    % sizes used in the cost notes below
        P = randn(n, k + 5);              % random test matrix, 5 extra columns
        Y = full(A * P);                  % O(nz k)
        [Q, R] = qr(Y, 0);                % O(m k^2)
        B = full(Q' * A);                 % O(k nz)
        [Uhat, S, V] = svd(B, 'econ');    % O(n k^2)
        U = Q * Uhat;                     % O(m k^2)
        U = U(:,1:k); S = S(1:k,1:k); V = V(:,1:k);
      end

      Cost: O(nz k + (m + n) k^2) vs O(m n k) for MATLAB [U,S,V]=svds(A,k) [multicore, GPU, 2pass]
      See: “Finding structure with randomness”, Halko, Martinsson, Tropp. SIAM Review, 2011.
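
The steps of `rsvd` can be written out as a short derivation. The symbols Ω (the Gaussian test matrix, called `P` in the code) and p (the oversampling amount, 5 in the code) are notation introduced here, and the shapes assume m ≥ k + p.

```latex
\begin{align*}
Y &= A\,\Omega, & \Omega &\in \mathbb{R}^{n \times (k+p)},\ \Omega_{ij} \sim \mathcal{N}(0,1) \\
Y &= Q R, & Q &\in \mathbb{R}^{m \times (k+p)},\ Q^{\top} Q = I \\
B &= Q^{\top} A, & B &= \hat{U} S V^{\top} \\
A &\approx Q Q^{\top} A = Q B = (Q \hat{U})\, S V^{\top} = U S V^{\top}
\end{align*}
```

The approximation is good when the range of Y captures most of the range of A, which is why a few extra columns beyond k are sampled before truncating back to rank k.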

  29. RSVD vs SVD

      U = randn(20*1000, 100); V = randn(20*1000, 100);
      A = U * V';                              % dense matrix of rank 100
      [i, j, v] = find(A);                     % find returns rows, columns, values
      [~, ids] = sort(v);
      cut = length(v) / 200;
      ids = [ids(1:cut); ids(end-cut:end)];    % keep the smallest and largest 0.5% of entries
      A = sparse(i(ids), j(ids), v(ids), size(A,1), size(A,2));
      k = 100;                                 % assumed: the slide leaves k undefined
      tic; [Ur, Sr, Vr] = rsvd(A, k); toc
      tic; [U0, S0, V0] = svds(A, k); toc

      also useful for seeding, even if accuracy not sufficient

  30. Reservoir Sampling
      UNIX: head -1000 < rows.txt > subset.txt
      want: samp -1000 < rows.txt > subset.txt

      #include <cstdlib>
      #include <iostream>
      #include <string>
      #include <vector>
      using namespace std;

      // Keep a uniform random sample of k rows read from stdin.
      void reservoir_sample(int k) {
        int n = 0;                      // rows seen before the current one
        vector<string> R(k);
        string s;
        while (getline(cin, s)) {       // one row per line
          if (n < k) {
            R[n] = s;
          } else {
            // Uniform over the n+1 rows seen so far, including this one.
            // rand() is kept from the slide; it is only adequate while n < RAND_MAX.
            int i = rand() % (n + 1);
            if (i < k) R[i] = s;
          }
          n++;
        }
        if (n < k) k = n;
        for (int i = 0; i < k; i++) { cout << R[i] << endl; }
      }

      apps: streams, map-reduce, simulators
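
A quick way to check that a reservoir sampler is uniform is to run it many times and count how often each item is kept. The sketch below is a Python counterpart of the routine above (the generic iterable input and the `rng` parameter are choices made for this example); each of n items should be retained with probability k/n.

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for n, item in enumerate(stream):      # n = number of items seen before this one
        if n < k:
            reservoir.append(item)
        else:
            i = rng.randrange(n + 1)       # uniform over 0..n
            if i < k:
                reservoir[i] = item
    return reservoir

# Empirical check: each of 10 items should land in a size-3 sample about 30% of the time.
rng = random.Random(0)
trials = 20000
counts = [0] * 10
for _ in range(trials):
    for item in reservoir_sample(range(10), 3, rng):
        counts[item] += 1
frequencies = [c / trials for c in counts]
```

Drawing the index from n rather than n + 1 candidates would keep late items too often; this frequency check makes that kind of off-by-one visible.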

  31. Bootstrap Sampling
       traditional bootstrap: sample with replacement (k times do: n → n)
         per k samples: n times, pick from n examples
         e.g. k trainings → ensemble (e.g. random forest)
         e.g. internet log aggregation: sums (clicks, revenue), ...
         issue: sample by entry, or by user, or ...
         confidence intervals (not assume normal distribution)
       online bootstrap: popularized in ML by (Oza & Russell, AISTATS 2001)
         per example: determine k counts for k samples
         each sample: $c_i = \mathrm{rpois}(\lambda = 1)$; $\sum_{i=1}^{n} c_i \approx n$
         good for: streaming, n unknown (e.g. user sampling)
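
The online variant can be sketched in one pass over a stream: each example draws k independent Poisson(1) counts, one per bootstrap replicate, so the stream length never needs to be known. The code below is a minimal illustration using only the standard library; the function names, the choice of the mean as the statistic, and the synthetic uniform data are assumptions for the example.

```python
import math
import random

def poisson1(rng):
    """Draw from Poisson(lambda = 1) with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def online_bootstrap_means(stream, k, seed=0):
    """One pass over the stream, maintaining k bootstrap replicates of the mean."""
    rng = random.Random(seed)
    sums = [0.0] * k
    weights = [0] * k
    for x in stream:
        for j in range(k):
            c = poisson1(rng)       # how many times x appears in replicate j
            sums[j] += c * x
            weights[j] += c
    return [s / w for s, w in zip(sums, weights) if w > 0]

data_rng = random.Random(1)
data = (data_rng.random() for _ in range(2000))   # a generator: length unknown to the sampler
means = sorted(online_bootstrap_means(data, k=200))
low, mid, high = means[4], means[100], means[194]  # rough 2.5%, 50%, 97.5% points
```

Each replicate ends up with roughly n examples in total rather than exactly n, which is the approximation the slide's sum expresses.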

  32. Bootstrap Example

      # Percentile bootstrap: k resamples with replacement
      BS = function(x, stat, k) {
        bs = replicate(k, stat(sample(x, replace = TRUE)))
        quantile(bs, c(0.025, 0.5, 0.975))
      }
      # Poisson (online-style) bootstrap: each example repeated rpois(1) times
      BS.pois = function(x, stat, k) {
        bs = replicate(k, stat(rep(x, rpois(length(x), lambda = 1))))
        quantile(bs, c(0.025, 0.5, 0.975))
      }
      # Normal-theory interval for the mean: low, mid, high
      LMH.normal <- function(x) {
        mu = mean(x); se = sqrt(var(x) / length(x))
        c(mu - 1.96 * se, mu, mu + 1.96 * se)
      }
      n = 100*1000; k = 10*1000; x = runif(n, 0, 1); s = mean
      print(quantile(replicate(k, s(runif(n, 0, 1))), c(0.025, 0.5, 0.975)))
      #   0.4982299 0.5000069 0.5017628
      print(LMH.normal(x))
      #   0.4978307 0.4996247 0.5014186
      print(BS(x, s, k))
      #   0.4978188 0.4996341 0.5014037
      print(BS.pois(x, s, k))
      #   0.4978936 0.4996468 0.5014059
