

  1. USING CLASSIFIER CASCADES FOR SCALABLE E-MAIL CLASSIFICATION Jay Pujara jay@cs.umd.edu Hal Daumé III me@hal3.name Lise Getoor getoor@cs.umd.edu 2/23/2012

  2. Building a scalable e-mail system
     - Goal: maintain system throughput across conditions
     - Varying conditions:
       - Load varies
       - Resource availability varies
       - Task varies
     - Challenge: build a system that can adapt its operation to the conditions at hand

  3. Problem structure informs scalable solution
     [Diagram: Feature Structure, derived IP, Mail From, Subject, and Body features ordered by cost from cheap ($) to expensive ($$$); Class Structure, a coarse Ham/Spam split refining by granularity into Social Network, Business, Personal, and Newsgroup]

  4. Important facets of the problem
     - Structure in input:
       - Features may have an order or systemic dependency
       - Acquisition costs vary: cheap or expensive features
     - Structure in output:
       - Labels naturally have a coarse-to-fine hierarchy
       - Different levels of the hierarchy have different sensitivities to cost
     - Exploit structure during classification
     - Minimize costs, minimize error

  5. Two overarching questions
     - When should we acquire features to classify a message?
     - How does this acquisition policy change across different classification tasks?
     - Classifier cascades can answer both questions!

  6.–9. Introducing Classifier Cascades (one slide, built up incrementally)
     - A series of classifiers: f1, f2, f3, ..., fn
     - Each classifier operates on a different, increasingly expensive set of features (ϕ): f1(ϕ1), f2(ϕ1, ϕ2), f3(ϕ1, ϕ2, ϕ3), ..., with costs c1, c2, c3, ..., cn
     - Each classifier outputs a value in [-1, 1], the margin or confidence of the decision
     - γ parameters control the relationship of the classifiers: the cascade advances from stage i to stage i+1 when |fi| < γi
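The cascade mechanics described on these slides can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' implementation; the stage functions and names are made up.

```python
# Hypothetical sketch of cascade inference: each stage f_i sees a richer
# (more expensive) feature prefix phi_1..phi_i, returns a margin in [-1, 1],
# and defers to the next stage when |margin| falls below its threshold gamma_i.

def cascade_predict(stages, gammas, features):
    """stages: list of classifiers, each taking the feature prefix features[:i+1];
    gammas: per-stage confidence thresholds; features: list [phi_1, ..., phi_n]."""
    margin = 0.0
    for i, (f, gamma) in enumerate(zip(stages, gammas)):
        margin = f(features[: i + 1])      # uses an increasingly expensive feature set
        if abs(margin) >= gamma:           # confident enough: stop acquiring features
            break
    return 1 if margin >= 0 else -1       # sign of the margin is the label

# Toy usage with made-up stage functions: stage 1 is unsure, stage 2 is confident.
stages = [lambda x: 0.1, lambda x: -0.9]
print(cascade_predict(stages, gammas=[0.5, 0.5], features=["ip", "mailfrom"]))  # prints -1
```

The key property is that a confident early stage short-circuits the loop, so cheap features suffice for easy messages.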

  10. Optimizing Classifier Cascades
     - Loss function L(y, F(x)) penalizes errors in classification
     - Minimize the loss function, incorporating cost:
       - Cost-constrained with a budget (load-sensitive): min Σ(x,y)∈D L(y, F(x)) s.t. C(x) < B
       - Cost-sensitive loss function (granular): min Σ(x,y)∈D L(y, F(x)) + λ C(x)
     - Use grid search to find the optimal γ parameters
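The grid search over γ mentioned on this slide can be sketched as follows. All names are illustrative assumptions; `evaluate` stands in for running the cascade on training data.

```python
import itertools

# Hedged sketch of budget-constrained grid search: try every gamma setting on
# a grid and keep the one with lowest training loss whose cost fits the budget.

def grid_search_gammas(evaluate, grid, budget):
    """evaluate(gammas) -> (avg_loss, avg_cost) on the training set."""
    best, best_loss = None, float("inf")
    for gammas in itertools.product(grid, repeat=2):   # two thresholds, as in the slides
        loss, cost = evaluate(gammas)
        if cost < budget and loss < best_loss:         # enforce C(x) < B
            best, best_loss = gammas, loss
    return best

# Toy evaluator: higher thresholds defer more, costing more but erring less.
toy = lambda g: (1.0 - 0.4 * (g[0] + g[1]), 0.2 + 0.5 * (g[0] + g[1]))
print(grid_search_gammas(toy, grid=[0.0, 0.25, 0.5, 0.75], budget=0.8))
```

Exhaustive search is feasible here because the slides' cascades have only two free thresholds, γ1 and γ2.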

  11. Load-Sensitive Classification

  12. Features have costs & dependencies
     [Diagram: derived IP, Mail From, Subject, and Body features ordered by cost from cheap ($) to expensive ($$$), annotated with network packets and cache size]
     The IP is known at socket-connect time and is 4 bytes in size

  13. Features have costs & dependencies
     The MAIL FROM is one of the first commands of an SMTP conversation; From addresses have a known format, but higher diversity

  14. Features have costs & dependencies
     The subject, one of the mail headers, arrives only after a number of network exchanges. Since the subject is user-generated, it is very diverse and often lacks a defined format

  15. Load-Sensitive Problem Setting
     [Diagram: IP classifier → MailFrom classifier → Subject classifier, advancing when |f1| < γ1 and then |f2| < γ2]
     - Train IP, MailFrom, and Subject classifiers
     - For a given budget B, choose γ1, γ2 that minimize error within B
     - Constraint: C(x) < B

  16. Load-Sensitive Challenges
     - Risk of overfitting the model when choosing γ1, γ2: train-time costs underestimate test-time costs
     - Use a regularization constant Δ, sensitive to the cost variance (σ), to account for this variability
     - Revised constraint: C(x) + Δσ < B
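The revised constraint on this slide amounts to a variance-penalized budget check. A minimal sketch, with assumed helper names (the exact cost statistics the authors use are not spelled out here):

```python
import statistics

# Sketch of the regularized budget check: configurations whose per-message
# costs are high-variance on training data are held to a stricter budget,
# since they are the ones likely to exceed it at test time.

def within_budget(per_message_costs, delta, budget):
    mean_cost = statistics.fmean(per_message_costs)
    sigma = statistics.pstdev(per_message_costs)   # cost variability across messages
    return mean_cost + delta * sigma < budget      # revised constraint: C(x) + Δσ < B

costs = [0.2, 0.2, 0.9]                            # toy per-message acquisition costs
print(within_budget(costs, delta=0.0, budget=0.5))   # unregularized check passes
print(within_budget(costs, delta=0.25, budget=0.5))  # regularized check is stricter
```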

  17. Granular Classification

  18. E-mail Challenges: Spam Detection
     - Most mail is spam
     - Billions of classifications
     - Must be incredibly fast

  19. E-mail Challenges: Categorizing Mail
     - E-mail does more, with tasks such as:
       - Extracting receipts and tracking info
       - Threading conversations
       - Filtering into mailing lists
       - Inlining social-network responses
     - Computationally intensive processing
     - Each task applies to one class
     [Diagram: Ham refines into Social Network, Business, Personal, and Newsgroup classes]

  20. Coarse task is constrained by feature cost
     [Diagram: feature structure (cheap $ to expensive $$$ derived features) and class structure, with λc attached to the coarse Ham/Spam distinction]

  21. Fine task is constrained by misclassification cost
     [Diagram: the same feature and class structure, with λf attached to the fine-grained Social Network, Business, Personal, and Newsgroup classes]

  22. Granular Classification Problem Setting
     [Diagram: two IP → MailFrom → Subject cascades; one for the coarse Spam/Ham task, minimizing L(y, h(x)) + λc C(x), and one for the fine Social Network/Business/Personal/Newsgroup task, minimizing L(y, h(x)) + λf C(x)]
     - Two separate models for the different tasks, with different classifiers and cascade parameters
     - Choose γ1, γ2 for each cascade to balance accuracy and cost, with different tradeoffs λ
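The per-task objective here is the cost-sensitive loss L(y, h(x)) + λ C(x), averaged over the data. A small worked sketch (the error/cost vectors below are invented for illustration, though the per-feature costs echo slide 25):

```python
# Sketch of the granular objective: classification loss plus lambda-weighted
# feature-acquisition cost. A large lambda (coarse task) favors cheap early
# exits; a small lambda (fine task) tolerates expensive features for accuracy.

def cost_sensitive_objective(errors, costs, lam):
    """errors: 0/1 loss per message; costs: feature cost per message; lam: tradeoff λ."""
    n = len(errors)
    return sum(errors) / n + lam * sum(costs) / n   # average of L(y, h(x)) + λ C(x)

errors = [0, 1, 0, 0]
costs = [0.168, 0.490, 0.168, 1.0]  # e.g. IP-only vs. deeper cascade stages
print(cost_sensitive_objective(errors, costs, lam=1.5))    # coarse-task weighting
print(cost_sensitive_objective(errors, costs, lam=0.075))  # fine-task weighting
```

The same data thus yields very different objective values under the two weightings, which is why the coarse and fine cascades end up with different γ settings.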

  23. Experimental Results

  24. Experimental Setup: Overview
     - Two tasks: load-sensitive & granular classification
     - Two datasets: Yahoo! Mail corpus and TREC-2007
       - Load-sensitive uses both datasets; granular uses only Yahoo!
     - Results use 10-fold cross-validation, with bold values significant (p < .05)
     - Cascade stages use the MEGAM MaxEnt classifier

  25. Experimental Setup: Yahoo! Data

     Class            Messages
     Spam             531
     Business         187
     Social Network   223
     Newsletter       174
     Personal/Other   102

     Feature    Cost
     IP         .168
     MailFrom   .322
     Subject    .510

     - Data: 1227 Yahoo! Mail messages from 8/2010
     - Feature costs calculated from network + storage cost

  26. Experimental Setup: TREC Data

     Class   Messages
     Spam    39055
     Ham     8139

     - Data from the TREC-2007 Public Spam Corpus, 47194 messages
     - Uses the same feature cost estimates

  27. Results: Load-Sensitive Classification. Regularization prevents cost excesses

     Average excess cost:

     Δ      Y!Mail   TREC
     0      .115     .059
     .25    .020     0.00

  28. Results: Load-Sensitive Classification. Significant error reduction
     [Chart: classification error L(x) across methods (Naive; ACC with Δ = 0, .25, .5) on the Yahoo! Mail and TREC-2007 datasets]

  29. Results: Granular Classification

     Feature Set                  Feature Cost   Misclass Cost (Coarse / Fine / Overall)
     Fixed: IP                    .168           .139 / .181 / .229
     ACC: λc=1.5, λf=1            .187           .140 / .156 / .217
     Fixed: IP+MailFrom           .490           .128 / .142 / .200
     ACC: λc=.1, λf=.075          .431           .111 / .100 / .163
     Fixed: IP+MailFrom+Subject   1.00           .106 / .108 / .162
     ACC: λc=.02, λf=.02          .691           .108 / .105 / .162

     - Compare fixed feature-acquisition policies to adaptive classifiers
     - Significant gains in performance or cost (or both), depending on the tradeoff

  30. Dynamics of choosing λc and λf

  31. Different approaches, same tradeoff

  32. Conclusion
     - Addressed the problem of scalable e-mail classification
     - Introduced two settings:
       - Load-sensitive classification: known budget
       - Granular classification: task sensitivity
     - Used classifier cascades to achieve a tradeoff between cost and accuracy
     - Demonstrated results superior to the baseline
     Questions? Research funded by the Yahoo! Faculty Research Engagement Program
