 
Catalog Classification at the Long Tail using IR and ML. Neel Sundaresan (nsundaresan@ebay.com). Team: Badrul Sarwar, Khash Rohanimanesh, JD R Ruvini, Karin Mauge, Dan Shen
Then There was One… When asked if he understood that the laser pointer was broken, the buyer said “Of course, I’m a collector of broken laser pointers”
Divine Reward! PetroliumJeliffe Neel Sundaresan – SIGIR 2010
What do we sell on a daily basis?
The Importance of Structured Information: search experience; recommender systems; fraud and counterfeit detection.
Discovering Catalogs: Challenges. Our goal is to build catalogs using an unsupervised metadata extraction system. Challenges: huge volume of raw text; highly unstructured; high level of noise; lack of consistency/standardization of attribute name and value usage.
Take Advantage of the Community. Savvy sellers provide plenty of useful information. We need to combine techniques that can: extract attribute names and values from this large collection; remove noise and normalize attribute names and value usage.
We have the data
The BIG Picture. Example listing (fields: title, attributes, seller info, item description): BARBIE 1999 "PREMIERE NIGHT" Home Shopping Special Edition / Gorgeous Doll With Beautiful Blond Hair / In A Gown Of Purple And Silver / New / Never Removed From Box / Doll Is In Mint Condition / Remember This Beauty Is 11 Years Old / Free Shipping To US Only / Will Ship International / Please E-mail For Cost / Feel Free To Ask Me Any Questions Or Concerns / Smoke - Free Environment / Free Shipping. Extracted attributes: Year: 1999; Model: premiere night; Edition: home shopping special; Hair: blond; Gown: purple and silver; Condition: new / never removed from box / mint. Pipeline components shown: API stream, 5-10M items (20Gb) daily; large-scale Hadoop cluster; catalog indexer; catalog.
Our Approach. To build an automatic product catalog we follow these steps. Grouping items into categories: category classification; weeding out noise through accessory classification. Extraction of attribute names and values: simple two-pass approach. Cleaning and normalization: capturing human expertise through machine learning.
Catalog Discovery. Improve value coverage for important names: use machine learning to expand value coverage. Product building: organize items in a hierarchical collection; match inventory to products. Adoption. At each step we apply machine learning/text mining techniques.
Item Categorization. Near-similar titles: “Apple IPOD Nano 4GB Black NEW! Great Deal!” and “Apple IPOD Nano 4GB Black NEW! Skin Great Deal!”. Category classification: feature selection; smoothing. Accessory classification (NBC).
Class Pruning (unique to eBay). We compute the posterior probability in the NBC, i.e., P(C | title words). For some title words the number of classes in which they appear is huge (for instance, the phrase “harry potter” appears in thousands of categories), and that puts a strain on the online posterior probability computation. To fix this, we use class pruning: for a given feature we keep only a few top classes in the computation.
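Class pruning as described on this slide can be sketched as follows. This is a minimal illustration, not eBay's implementation: the ranking criterion (add-one smoothed P(word | class)), the `top_k` cutoff, the `floor` probability for pruned word/class pairs, and all function and variable names are assumptions made for the example.

```python
import math

def prune_classes(word_class_counts, class_totals, vocab_size, top_k=3):
    """For each word, keep only the top_k classes ranked by P(word | class).

    Assumption: classes are ranked by an add-one smoothed likelihood; the
    slide does not say which ranking criterion is used.
    """
    pruned = {}
    for word, per_class in word_class_counts.items():
        scored = {
            c: (n + 1) / (class_totals[c] + vocab_size)
            for c, n in per_class.items()
        }
        top = sorted(scored, key=scored.get, reverse=True)[:top_k]
        pruned[word] = {c: scored[c] for c in top}
    return pruned

def classify(title, pruned, priors, floor=1e-6):
    """Rank candidate classes by log posterior using only the pruned table."""
    words = title.lower().split()
    # Only classes that survived pruning for at least one title word are scored.
    candidates = set()
    for w in words:
        candidates.update(pruned.get(w, {}))
    scores = {}
    for c in candidates:
        s = math.log(priors[c])
        for w in words:
            # Pruned or unseen (word, class) pairs fall back to a small floor.
            s += math.log(pruned.get(w, {}).get(c, floor))
        scores[c] = s
    return sorted(scores, key=scores.get, reverse=True)

# Toy counts: "harry" occurs in many classes, but only two are kept.
counts = {
    "harry": {"books": 50, "dvd": 30, "toys": 5, "costumes": 1},
    "wand": {"toys": 20, "costumes": 2},
}
totals = {"books": 100, "dvd": 100, "toys": 100, "costumes": 100}
priors = {c: 0.25 for c in totals}
table = prune_classes(counts, totals, vocab_size=10, top_k=2)
ranking = classify("harry wand", table, priors)
```

The online cost per title word is then bounded by `top_k` instead of the number of categories in which the word occurs.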
Item Categorization on eBay. The seller describes the item with a few keywords; eBay recommends 15 categories for his/her consideration.
Item Categorization on eBay. Example item: Larry Bird Boston Celtics Signed Adidas Classic Jersey. Price: US $399.99, Buy It Now. Categorize into? Sports Mem, Cards & Fan Shop > Manufacturer Authenticated > Basketball-NBA; Sports Mem, Cards & Fan Shop > Fan Apparel & Souvenirs > Basketball-NBA; Sports Mem, Cards & Fan Shop > Autographs-Original > Basketball-NBA > Jerseys; Clothing, Shoes & Accessories > Men's Clothing > Athletic Apparel; Clothing, Shoes & Accessories > Men's Clothing > Shirts > T-Shirts, Tank Tops; Collectibles > Advertising > Clothing, Shoes & Accessories > Clothing.
Challenge I. Large collection of categories: 30K categories. Hard to distinguish, for example: 1. Clothing, Shoes & Accessories > Costumes & Reenactment Attire > Costumes > Women; 2. Everything Else > Adult Only > Clothing, Shoes & Accessories > Costumes & Fantasy Wear > Women. Insufficient information about items: limited length of item title (10 words); inaccurate or fraudulent title descriptions.
Challenge II. Highly skewed item distribution: 86.9% of categories contain 17.3% of items; 1% of categories contain 51.7% of items. Scalability and efficiency: 4 million items daily; real-time response; good scalability and high efficiency.
Applications: recommending category candidates for a seller's listing; monitoring the misclassification rate on the current site; detecting outlier items.
Method: multinomial Bayesian algorithm; smoothing; scaling up to cope with the highly skewed item distribution; data sparseness problem; common or non-informative word problem.
Bayesian Learning Framework. We employ Naive Bayes with a multinomial likelihood function, which finds the most likely class c, i.e., the class with the maximum posterior probability of generating item t.
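In its standard textbook form, the multinomial Naive Bayes decision rule can be written as below. The symbols V (the vocabulary) and n(w, t) (the number of times word w occurs in the title of item t) are notation assumed here for illustration; they do not come from the slide.

```latex
\hat{c}
  = \arg\max_{c} P(c \mid t)
  = \arg\max_{c} \; P(c) \prod_{w \in V} P(w \mid c)^{\,n(w,t)}
  = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} n(w,t)\,\log P(w \mid c) \Big]
```

The log form is the one usually computed in practice, since it avoids numerical underflow; smoothing determines how P(w | c) is estimated for words that are rare or unseen in class c.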
Approach: exploit data to the maximum; apply simple algorithms at the same time.
Smoothing Algorithms: Laplace smoothing; Jelinek-Mercer smoothing; Dirichlet prior; absolute discounting; shrinkage smoothing.
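The first four methods in this list have well-known closed forms from the language-modeling literature, sketched below for estimating P(w | c). The parameter defaults (`lam`, `mu`, `delta`), the function names, and the toy data are assumptions for illustration; shrinkage smoothing is left out because it depends on a category hierarchy that the slide does not specify.

```python
def laplace(count, class_len, vocab_size):
    """Add-one smoothing."""
    return (count + 1) / (class_len + vocab_size)

def jelinek_mercer(count, class_len, p_bg, lam=0.5):
    """Linear interpolation between the class model and a background model."""
    return (1 - lam) * count / class_len + lam * p_bg

def dirichlet(count, class_len, p_bg, mu=1000):
    """Dirichlet prior: mu acts as a pseudo-count (prior sample size)."""
    return (count + mu * p_bg) / (class_len + mu)

def absolute_discount(count, class_len, n_unique, p_bg, delta=0.7):
    """Subtract delta from each seen count and give the mass to the background."""
    return max(count - delta, 0) / class_len + delta * n_unique / class_len * p_bg

# Toy category with two observed words, and a background distribution
# over a four-word vocabulary (both hypothetical).
category = {"ipod": 3, "nano": 1}
background = {"ipod": 0.4, "nano": 0.2, "skin": 0.2, "case": 0.2}
class_len = sum(category.values())   # total tokens in the category
n_unique = len(category)             # distinct words seen in the category

estimates = {
    "laplace": {w: laplace(category.get(w, 0), class_len, len(background))
                for w in background},
    "jelinek_mercer": {w: jelinek_mercer(category.get(w, 0), class_len, p)
                       for w, p in background.items()},
    "dirichlet": {w: dirichlet(category.get(w, 0), class_len, p)
                  for w, p in background.items()},
    "absolute_discount": {w: absolute_discount(category.get(w, 0), class_len, n_unique, p)
                          for w, p in background.items()},
}
```

Each estimator returns a proper distribution over the vocabulary and assigns a non-zero probability to the unseen words "skin" and "case".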
Experiments. Train: sold items on the eBay site over about one month; 18 million items; 18K categories. Test: sold items on the day following the training period; 278K items.
Research Questions. How do various smoothing methods perform on our task? How does smoothing interact with: size of training data set; size of category vocabulary; focusedness of category?
Overall Precision of Smoothing Methods. [Bar chart: P@1, P@5, P@10 and P@15 on a 50 to 100 scale for NoSmoothing, Laplace Smoothing, Jelinek Mercer, Dirichlet Priors, Absolute Discounting and Shrinkage Smoothing; annotations on the chart: 3.1%, 5.1%, 5%.]
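The slide does not define P@k. A common reading for a category recommender, assumed here, is the fraction of test items whose true category appears among the top k recommended categories. A minimal sketch under that assumption, with hypothetical names and toy data:

```python
def precision_at_k(ranked_predictions, true_labels, k):
    """Fraction of items whose true category is within the top-k predictions.

    Assumption: this is the definition of P@k used for the chart; the slide
    itself does not state one.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Three test items, each with a ranked list of recommended categories.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
p_at_1 = precision_at_k(ranked, truth, 1)
p_at_2 = precision_at_k(ranked, truth, 2)
p_at_3 = precision_at_k(ranked, truth, 3)
```

Under this definition P@k can only grow with k, which fits a setting where the seller is shown 15 candidates and needs the right one to be somewhere on the list.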
Influence of Size of Training Set on Smoothing. A larger data set leads to better performance. The rate of increase is not fixed when the data set is enlarged blindly (quality of training data). Increasing the prior sample size μ improves performance.
Influence of Category Size on Smoothing. Smoothing for the insufficient sample problem: eliminate zero probability of unobserved words. Two data sets: LargeCat, the categories containing >10K training instances; SmallCat, the categories containing <1K training instances.
Method                       LargeCat   SmallCat
No Smoothing                 69.4       35.9
Dirichlet Prior, μ = 100     69.8       39.1
Dirichlet Prior, μ = 500     69.4       45.1
Dirichlet Prior, μ = 1000    70.1       43.7
Dirichlet Prior, μ = 2000    71.0       41.0
Dirichlet Prior, μ = 5000    72.8       35.1
LargeCat significantly outperforms SmallCat by 27.7%. Smoothing saves the system: +3.4% on LargeCat, +9.2% on SmallCat.
Influence of Word Specificity on Smoothing. Smoothing for common or non-informative words: decrease the discrimination power of such words. Two data sets: SpecCat, the categories containing words with high IDF values; NotSpecCat, the categories containing words with low IDF values.
Method                       SpecCat    NotSpecCat
No Smoothing                 71.4       42.8
Dirichlet Prior, μ = 100     73.0       43.7
Dirichlet Prior, μ = 500     75.7       44.3
Dirichlet Prior, μ = 1000    75.1       44.7
Dirichlet Prior, μ = 2000    75.6       44.5
Dirichlet Prior, μ = 5000    76.4       45.0
SpecCat significantly outperforms NotSpecCat by 31.4%. Smoothing saves the system: +5.0% on SpecCat, +2.2% on NotSpecCat.
Cataloging: Extraction of Attribute Names and Values. We extract attribute names and values from millions of descriptions. Harder than named entity recognition: attribute names are not known beforehand. We employ a two-pass process.
Pass 1: Name identification. Use a high-precision, low-recall extraction based on pattern search. Use seller count, item count and other statistics to find names.
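One way to picture Pass 1 is a strict "Name: value" line pattern followed by a filter on how many distinct sellers and items use each candidate name. The sketch below is an illustration under assumptions: the regular expression, the thresholds `min_sellers` and `min_items`, the `(item_id, seller_id, description)` tuple layout, and the sample listings are all hypothetical, not the patterns or statistics used in the actual system.

```python
import re
from collections import defaultdict

# High-precision pattern: a short alphabetic name, a colon, then a value,
# all on one line. Many real attribute mentions will be missed (low recall).
PAIR = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]{0,29}?)[ \t]*:[ \t]*(\S.*)$",
    re.MULTILINE,
)

def candidate_names(listings, min_sellers=2, min_items=3):
    """Return attribute names used by enough distinct sellers and items."""
    sellers = defaultdict(set)
    items = defaultdict(set)
    for item_id, seller_id, description in listings:
        for name, _value in PAIR.findall(description):
            key = name.strip().lower()
            sellers[key].add(seller_id)
            items[key].add(item_id)
    return sorted(
        name
        for name in sellers
        if len(sellers[name]) >= min_sellers and len(items[name]) >= min_items
    )

listings = [
    (1, "s1", "Year: 1999\nModel: premiere night\nHair: blond"),
    (2, "s2", "Year: 2001\nModel: holiday\nShipping note: ask me"),
    (3, "s3", "Year: 1995\nModel: classic"),
    (4, "s1", "Hair: brown"),
]
names = candidate_names(listings)
```

Here "year" and "model" survive because three different sellers use them, while "hair" (one seller) and "shipping note" (one item) are dropped as likely noise.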