SLIDE 1

USING CLASSIFIER CASCADES FOR SCALABLE E-MAIL CLASSIFICATION

Jay Pujara (jay@cs.umd.edu)
Hal Daumé III (me@hal3.name)
Lise Getoor (getoor@cs.umd.edu)
2/23/2012

SLIDE 2

Building a scalable e-mail system

• Goal: Maintain system throughput across conditions
• Varying conditions:
  ◦ Load varies
  ◦ Resource availability varies
  ◦ Task varies
• Challenge: Build a system that can adapt its operation to the conditions at hand

SLIDE 3

Problem structure informs scalable solution

[Diagram: feature structure vs. class structure. Features (IP → Mail From → Subject/Body → derived features) are ordered by acquisition cost, from $ to $$$; classes form a hierarchy, with Spam vs. Ham at the coarse level and Ham subdividing into Business, Personal, Newsgroup, and Social Network at the fine level (granularity).]

SLIDE 4

Important facets of problem

• Structure in input
  ◦ Features may have an order or systemic dependency
  ◦ Acquisition costs vary: cheap or expensive features
• Structure in output
  ◦ Labels naturally have a hierarchy from coarse to fine
  ◦ Different levels of the hierarchy have different sensitivities to cost
• Exploit structure during classification
• Minimize costs, minimize error

SLIDE 5

Two overarching questions

• When should we acquire features to classify a message?
• How does this acquisition policy change across different classification tasks?
• Classifier cascades can answer both questions!

SLIDE 6

Introducing Classifier Cascades

• Series of classifiers: f1, f2, f3, ..., fn

[Diagram: a chain of classifiers f1 → f2 → f3 → ...]

SLIDE 7

Introducing Classifier Cascades

• Series of classifiers: f1, f2, f3, ..., fn
• Each classifier operates on different, increasingly expensive sets of features (ϕ) with costs c1, c2, c3, ..., cn

[Diagram: f1(ϕ1) → f2(ϕ1,ϕ2) → f3(ϕ1,ϕ2,ϕ3) → ..., with costs c1, c2, c3]

SLIDE 8

Introducing Classifier Cascades

• Series of classifiers: f1, f2, f3, ..., fn
• Each classifier operates on different, increasingly expensive sets of features (ϕ) with costs c1, c2, c3, ..., cn
• Each classifier outputs a value in [-1, 1]: the margin, or confidence, of its decision

[Diagram: f1(ϕ1) → f2(ϕ1,ϕ2) → f3(ϕ1,ϕ2,ϕ3) → ..., with costs c1, c2, c3]

SLIDE 9

Introducing Classifier Cascades

• Series of classifiers: f1, f2, f3, ..., fn
• Each classifier operates on different, increasingly expensive sets of features (ϕ) with costs c1, c2, c3, ..., cn
• Each classifier outputs a value in [-1, 1]: the margin, or confidence, of its decision
• γ parameters control the relationship of the classifiers: a message falls through to the next, more expensive stage while |fi| < γi (see the inference sketch below)

[Diagram: f1(ϕ1) → f2(ϕ1,ϕ2) → f3(ϕ1,ϕ2,ϕ3) → ..., with costs c1, c2, c3 and thresholds |f1| < γ1, |f2| < γ2, |f3| < γ3]
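To make the control flow concrete, here is a minimal sketch of cascade inference in Python; the function names and data layout are our assumptions, not the authors' implementation. Each stage acquires its feature set, adds its cost, and either returns a confident decision or falls through to the next, more expensive stage.

```python
# Minimal cascade-inference sketch (assumed interfaces, not the authors' code).

def cascade_predict(stages, gammas, message):
    """Classify one message with a cascade.

    stages: list of (classify, acquire, cost) triples, cheapest first;
            classify(features) returns a margin in [-1, 1],
            acquire(message) returns that stage's features as a dict.
    gammas: thresholds gamma_1 .. gamma_{n-1}; a message falls through
            to stage i+1 while |f_i| < gamma_i.
    Returns (label, total_cost) with label in {-1, +1}.
    """
    features, total_cost = {}, 0.0
    for i, (classify, acquire, cost) in enumerate(stages):
        features.update(acquire(message))   # pay to acquire stage-i features
        total_cost += cost
        margin = classify(features)
        # Stop if confident enough for this stage, or if no stage remains.
        if i == len(stages) - 1 or abs(margin) >= gammas[i]:
            return (1 if margin >= 0 else -1), total_cost
```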

SLIDE 10

Optimizing Classifier Cascades

• Loss function: L(y, F(x)) counts errors in classification
• Minimize the loss function, incorporating cost:
  ◦ Cost constraint with budget (load-sensitive): min Σ_{(x,y)∈D} L(y, F(x)) s.t. C(x) < B
  ◦ Cost-sensitive loss function (granular): min Σ_{(x,y)∈D} L(y, F(x)) + λ C(x)
• Use grid search to find the optimal γ parameters (a sketch follows)
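A sketch of that grid search for the load-sensitive (budget-constrained) case, reusing the hypothetical cascade_predict above; the grid resolution and data layout are our assumptions, not the paper's exact procedure.

```python
import itertools
import numpy as np

def grid_search_gammas(stages, data, budget, grid=np.linspace(0.0, 1.0, 21)):
    """Pick thresholds minimizing training error subject to C(x) < B.

    data: list of (message, label) pairs with labels in {-1, +1}.
    """
    best_gammas, best_error = None, float("inf")
    for gammas in itertools.product(grid, repeat=len(stages) - 1):
        results = [cascade_predict(stages, gammas, x) for x, _ in data]
        avg_cost = sum(c for _, c in results) / len(data)
        error = sum(p != y for (p, _), (_, y) in zip(results, data)) / len(data)
        if avg_cost < budget and error < best_error:   # enforce C(x) < B
            best_gammas, best_error = gammas, error
    return best_gammas, best_error
```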

SLIDE 11

Load-Sensitive Classification

SLIDE 12

Features have costs & dependencies

• The IP is known at socket connect time and is 4 bytes in size

[Diagram: feature pyramid (IP → Mail From → Subject/Body → derived features), acquisition cost rising from $ to $$$; cost axes: network packets, cache size]

SLIDE 13

Features have costs & dependencies

• The Mail From is one of the first commands of an SMTP conversation
• From addresses have a known format, but higher diversity

[Diagram: same feature pyramid, cost rising from $ to $$$]

SLIDE 14

Features have costs & dependencies

• The subject, one of the mail headers, arrives only after a number of network exchanges
• Since the subject is user-generated, it is very diverse and often lacks a defined format

[Diagram: same feature pyramid, cost rising from $ to $$$]

SLIDE 15

Load-Sensitive Problem Setting

[Diagram: IP Classifier → MailFrom Classifier → Subject Classifier, with thresholds |f1| < γ1 and |f2| < γ2]

• Train IP, MailFrom, and Subject classifiers
• For a given budget B, choose γ1, γ2 that minimize error within B (see the sketch below)
• Constraint: C(x) < B
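Hypothetical wiring for this three-stage cascade, reusing the sketches above. The stage costs match the per-feature costs reported later in the experimental setup; the classifiers and data here are toy stand-ins, not trained models.

```python
def stub_classifier(feats):
    # Toy stand-in returning an arbitrary margin in [-1, 1] (demo only);
    # a real system would use trained IP/MailFrom/Subject models.
    return (hash(frozenset(feats.items())) % 201 - 100) / 100.0

stages = [
    (stub_classifier, lambda m: {"ip": m["ip"]},           0.168),
    (stub_classifier, lambda m: {"mail_from": m["from"]},  0.322),
    (stub_classifier, lambda m: {"subject": m["subject"]}, 0.510),
]
toy_data = [
    ({"ip": "127.0.0.1", "from": "a@b.com", "subject": "hi"},    1),
    ({"ip": "10.0.0.2",  "from": "c@d.org", "subject": "sale!"}, -1),
]
gammas, err = grid_search_gammas(stages, toy_data, budget=0.8)  # B = 0.8
```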

SLIDE 16

Load-Sensitive Challenges

• Overfitting the model when choosing γ1, γ2
• Train-time costs are underestimated versus test-time performance
• Use a regularization constant Δ
  ◦ Sensitive to cost variance (σ)
  ◦ Accounts for variability
• Revised constraint: C(x) + Δσ < B (sketched below)
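In code, the revised check is a one-liner (our naming, not the paper's); in the grid search above, it would replace the plain average-cost test.

```python
import numpy as np

def fits_budget(per_message_costs, budget, delta):
    # Require mean cost plus delta standard deviations to fit the budget,
    # i.e. C(x) + Δσ < B, so test-time cost variability is accounted for.
    costs = np.asarray(per_message_costs, dtype=float)
    return costs.mean() + delta * costs.std() < budget
```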

SLIDE 17

Granular Classification

SLIDE 18

E-mail Challenges: Spam Detection

[Diagram: coarse Spam vs. Ham decision]

• Most mail is spam
• Billions of classifications
• Must be incredibly fast

SLIDE 19

E-mail Challenges: Categorizing Mail

[Diagram: Ham subdivides into Business, Personal, Newsgroup, and Social Network]

• E-mail does more; tasks such as:
  ◦ Extract receipts, tracking info
  ◦ Thread conversations
  ◦ Filter into mailing lists
  ◦ Inline social network response
• Computationally intensive processing
• Each task applies to one class

SLIDE 20

Coarse task is constrained by feature cost (tradeoff λc)

[Diagram: feature structure vs. class structure as on Slide 3; the coarse Spam/Ham task is tied to the feature-cost axis ($ to $$$) with tradeoff λc]

SLIDE 21

Fine task is constrained by misclassification cost (tradeoff λf)

[Diagram: feature structure vs. class structure as on Slide 3; the fine-grained task (Business, Personal, Newsgroup, Social Network) is tied to the granularity axis with tradeoff λf]

SLIDE 22

Granular Classification Problem Setting

[Diagram: two cascades over IP → MailFrom → Subject, one for the coarse Spam/Ham task with loss L(y, h(x)) + λcC(x), one for the fine-grained classes (Business, Personal, Newsgroup, Social Network) with loss L(y, h(x)) + λfC(x)]

• Two separate models for different tasks, with different classifiers and cascade parameters
• Choose γ1, γ2 for each cascade to balance accuracy and cost with different tradeoffs λ (a sketch of the objective follows)
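A sketch of the per-task objective in our notation: each cascade's γ parameters are tuned by minimizing error plus λ times average cost, with λ = λc for the coarse task and λ = λf for the fine one.

```python
def cost_sensitive_loss(results, labels, lam):
    """results: (prediction, cost) pairs from cascade_predict; lam: tradeoff λ."""
    n = len(labels)
    error = sum(p != y for (p, _), y in zip(results, labels)) / n
    avg_cost = sum(c for _, c in results) / n
    return error + lam * avg_cost
```

Swapping this unconstrained loss in for the budget check of the earlier grid search gives the granular training procedure: a large λc keeps the coarse cascade on cheap features, while a small λf lets the fine cascade buy more expensive ones.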

SLIDE 23

Experimental Results

SLIDE 24

Experimental Setup: Overview

• Two tasks: load-sensitive & granular classification
• Two datasets: Yahoo! Mail corpus and TREC-2007
  ◦ Load-sensitive uses both datasets; granular uses only Yahoo!
• Results are L1O, 10-fold CV, with bold values significant (p < .05)
• Cascade stages use the MEGAM MaxEnt classifier

SLIDE 25

Experimental Setup: Yahoo! Data

  • Data from 1227 Yahoo! Mail messages from 8/2010
  • Feature costs calculated from network + storage cost

Feature    Cost
IP         .168
MailFrom   .322
Subject    .510

Class            Messages
Spam             531
Business         187
Social Network   223
Newsletter       174
Personal/Other   102

SLIDE 26

Experimental Setup: TREC data

  • Data from TREC-2007 Public Spam Corpus, 47194 messages
  • Use same feature cost estimates

Class   Messages
Spam    39055
Ham     8139

SLIDE 27

Results: Load-Sensitive Classification
Regularization prevents cost excesses

[Chart: average excess cost on the Y!Mail and TREC datasets for different values of the regularization constant Δ]

SLIDE 28

Results: Load-Sensitive Classification
Significant error reduction

[Chart: classification error L(x) across methods (Naive; ACC with Δ=0, Δ=.25, Δ=.5) on the Yahoo! Mail and TREC-2007 datasets]

SLIDE 29

Results: Granular Classification

Feature Set                   Feature Cost   Misclass. Cost (Coarse / Fine / Overall)
Fixed: IP                     .168           .139 / .181 / .229
ACC: λc=1.5, λf=1             .187           .140 / .156 / .217
Fixed: IP+MailFrom            .490           .128 / .142 / .200
ACC: λc=.1, λf=.075           .431           .111 / .100 / .163
Fixed: IP+MailFrom+Subject    1.00           .106 / .108 / .162
ACC: λc=.02, λf=.02           .691           .108 / .105 / .162

  • Compare fixed feature acquisition policies to adaptive classifiers
  • Significant gains in performance or cost (or both) depending on tradeoff

SLIDE 30

Dynamics of choosing λc and λf

SLIDE 31

Different approaches, same tradeoff

SLIDE 32

Conclusion

• Problem of scalable e-mail classification
• Introduce two settings:
  ◦ Load-sensitive classification: known budget
  ◦ Granular classification: task sensitivity
• Use classifier cascades to achieve a tradeoff between cost and accuracy
• Demonstrate results superior to baselines

Questions?

Research funded by the Yahoo! Faculty Research Engagement Program