

SLIDE 1

Prediction Cubes

Bee-Chung Chen, Lei Chen, Yi Lin and Raghu Ramakrishnan, University of Wisconsin - Madison

SLIDE 2

Big Picture

  • We are not trying to build a single, accurate “model”
  • We want to find interesting subsets of the dataset

– Interestingness: Defined by the “model” built on a subset
– Cube space: A combination of dimension attribute values defines a candidate subset (just like regular OLAP)

  • We are not using regular aggregate functions as the measures to summarize subsets

  • We want the measures to represent decision/prediction behavior

– Summarize a subset using the “model” built on it
– Big difference from regular OLAP!!

SLIDE 3

One Sentence Summary

  • Take OLAP data cubes, and keep everything the same except that we change the meaning of the cell values to represent the decision/prediction behavior

– The idea is simple, but it leads to interesting and promising data mining tools

SLIDE 4

Example (1/5): Regular OLAP

Fact table D (schema: Location, Time, # of App.), e.g., (AL, USA; Dec, 04; 2), …, (WY, USA; Dec, 04; 3). Z: Dimensions, Y: Measure.

Goal: Look for patterns of unusually high numbers of applications.

[Figure: an OLAP cube over Location × Time; cell value = number of loan applications. Rolling up gives coarser regions (e.g., country by year); drilling down gives finer regions (e.g., states/provinces such as AL, WY, YT, AB by month).]

SLIDE 5

Example (2/5): Decision Analysis

Goal: Analyze a bank’s loan decision process w.r.t. two dimensions: Location and Time

Fact table D (schema: Location, Time, Race, Sex, …, Approval), e.g., (AL, USA; Dec, 04; White; M; …; Yes), …, (WY, USA; Dec, 04; Black; F; …; No). Z: Dimensions, X: Predictors, Y: Class.

[Figure: a model h(X; σZ(D)), e.g., a decision tree, is built on a cube subset of the fact table D; the Location dimension has the hierarchy All → Country (Japan, USA, Norway, …) → State (AL, …, WY), and Time has a corresponding hierarchy.]

SLIDE 6

Example (3/5): Questions of Interest

  • Goal: Analyze a bank’s loan decision process with respect to two dimensions: Location and Time

  • Target: Find discriminatory loan decisions
  • Questions:

– Are there locations and times when the decision making was similar to a set of discriminatory decision examples (or similar to a given discriminatory decision model)?
– Are there locations and times during which Race or Sex is an important factor of the decision process?

SLIDE 7

Example (4/5): Prediction Cube

[Figure: a prediction cube over Location × Time with cell values between 0 and 1; the model h(X; σ[USA, Dec 04](D)), e.g., a decision tree, is built from the data in the cell (USA, Dec 04).]

  • 1. Build a model using data from USA in Dec, 04
  • 2. Evaluate that model

Measure in a cell:

  • Accuracy of the model
  • Predictiveness of Race, measured based on that model
  • Similarity between that model and a given model

Data σ[USA, Dec 04](D): the rows of D with Location in USA and Time = Dec, 04, e.g., (AL, USA; Dec, 04; White; M; …; Y), …, (WY, USA; Dec, 04; Black; F; …; N).

SLIDE 8

Example (5/5): Prediction Cube

[Figure: prediction cubes at three granularities, with cell value = predictiveness of Race; drilling down moves to finer cells (e.g., states/provinces such as AL, WY, YT, AB by month), and rolling up moves to coarser cells (e.g., country by year).]

SLIDE 9

Outline

  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion
SLIDE 10

Prediction Cubes

  • User interface: OLAP data cubes

– Dimensions, hierarchies, roll up and drill down

  • Values in the cells:

– Accuracy → Test-set accuracy cube
– Similarity → Model-similarity cube
– Predictiveness → Predictiveness cube

SLIDE 11

Test-Set Accuracy Cube

Given:

  • Data table D (schema: Location, Time, Race, Sex, …, Approval)
  • Test set ∆ (schema: Race, Sex, …, Approval)

[Figure: for each cell at level [Country, Month], build a model on the corresponding subset of D, apply it to ∆, and compare its predictions with the true labels; the resulting accuracy is the cell value. Example reading: the decision model of USA during Dec 04 had high accuracy when applied to ∆.]
SLIDE 12

Model-Similarity Cube

Given:

  • Data table D (schema: Location, Time, Race, Sex, …, Approval)
  • Target model h0(X)
  • Test set ∆ without labels (schema: Race, Sex, …)

[Figure: for each cell at level [Country, Month], build a model on the corresponding subset of D, apply both it and h0(X) to ∆, and use the agreement of their predictions as the cell value. Example reading: the loan decision process in USA during Dec 04 was similar to a discriminatory decision model h0(X).]

SLIDE 13

Predictiveness Cube

Given:

  • Data table D (schema: Location, Time, Race, Sex, …, Approval)
  • Attributes V (e.g., Race)
  • Test set ∆ without labels (schema: Race, Sex, …)

[Figure: for each cell at level [Country, Month], build the models h(X) and h(X−V) on the corresponding subset of D, apply both to ∆, and use the difference between their predictions (the predictiveness of V) as the cell value. Example reading: Race was an important factor of the loan approval decision in USA during Dec 04.]

SLIDE 14

Outline

  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion
SLIDE 15

One Sentence Summary

  • Reduce prediction cube computation to data cube computation

– Somehow represent a data-mining model as a distributive or algebraic (bottom-up computable) aggregate function, so that data-cube techniques can be directly applied

SLIDE 16

Full Materialization

Full Materialization Table

Level            Location   Time    Cell Value
[All,All]        ALL        ALL     0.7
[Country,All]    CA         ALL     0.4
                 …          ALL     …
                 USA        ALL     0.9
[All,Year]       ALL        1985    0.8
                 ALL        …       …
                 ALL        2004    0.3
[Country,Year]   CA         1985    0.9
                 CA         1986    0.2
                 …          …       …
                 USA        2004    0.8

[Figure: the corresponding cuboids [All,All], [All,Year], [Country,All], and [Country,Year] drawn as grids over Country (CA, …, USA) and Year (1985, 1986, …, 2004).]

SLIDE 17

Bottom-Up Data Cube Computation

[Country, Year] level:

          1985   1986   1987   1988
Norway      10     30     20     24
…           23     45     14     32
USA         14     32     42     11

Roll up to [All, Year]:     All = 47, 107, 76, 67
Roll up to [Country, All]:  Norway = 84, … = 114, USA = 99
Roll up to [All, All]:      All = 297

Cell Values: Numbers of loan applications

SLIDE 18

Functions on Sets

  • Bottom-up computable functions: Functions that can be computed using only summary information

  • Distributive function: α(X) = F({α(X1), …, α(Xn)})

– X = X1 ∪ … ∪ Xn and Xi ∩ Xj = ∅
– E.g., Count(X) = Sum({Count(X1), …, Count(Xn)})

  • Algebraic function: α(X) = F({G(X1), …, G(Xn)})

– G(Xi) returns a fixed-length vector of values
– E.g., Avg(X) = F({G(X1), …, G(Xn)})

  • G(Xi) = [Sum(Xi), Count(Xi)]
  • F({[s1, c1], …, [sn, cn]}) = Sum({si}) / Sum({ci})
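As a concrete illustration (a minimal Python sketch, not from the paper), Count can be computed distributively and Avg algebraically from fixed-length per-subset summaries:

```python
# Minimal sketch: Count as a distributive function, Avg as an algebraic one.

def count_distributive(partitions):
    # Count(X) = Sum({Count(X1), ..., Count(Xn)}) over disjoint partitions
    return sum(len(part) for part in partitions)

def avg_algebraic(partitions):
    # G(Xi) = [Sum(Xi), Count(Xi)]; F sums the components and divides
    summaries = [(sum(part), len(part)) for part in partitions]
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

if __name__ == "__main__":
    x1, x2 = [1, 2, 3], [4, 5]
    print(count_distributive([x1, x2]))  # 5
    print(avg_algebraic([x1, x2]))       # 3.0
```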
SLIDE 19

Scoring Function

  • Represent a model as a function of sets.
  • Conceptually, a machine-learning model h(X; σZ(D)) is a scoring function Score(y, x; σZ(D)) that gives each class y a score on test example x

– h(x; σZ(D)) = argmax_y Score(y, x; σZ(D))
– Score(y, x; σZ(D)) ≈ p(y | x, σZ(D))
– σZ(D): The set of training examples (a cube subset of D)
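A minimal sketch of this interface (the class-frequency scorer below is a hypothetical stand-in that ignores x, just to show the Score/argmax structure):

```python
from collections import Counter

def score(y, x, training_subset):
    # Score(y, x; subset) ~ p(y | x, subset); deliberately crude:
    # a class-frequency estimate that ignores x, to show the interface only.
    labels = [record["Approval"] for record in training_subset]
    return Counter(labels)[y] / max(len(labels), 1)

def h(x, training_subset, classes=("Yes", "No")):
    # h(x; subset) = argmax_y Score(y, x; subset)
    return max(classes, key=lambda y: score(y, x, training_subset))

subset = [{"Approval": "Yes"}, {"Approval": "Yes"}, {"Approval": "No"}]
print(h({"Race": "White", "Sex": "M"}, subset))  # "Yes"
```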

SLIDE 20

Bottom-up Score Computation

  • Key observations:

– Observation 1: Score(y, x; σZ(D)) is a function of the cube subset σZ(D); if it is distributive or algebraic, the data cube bottom-up technique can be directly applied
– Observation 2: Having the scores for all the test examples and all the cells is sufficient to compute a prediction cube

  • Scores ⇒ predictions ⇒ cell values
  • Details depend on what each cell means (i.e., the type of prediction cube), but they are straightforward

SLIDE 21

[Figure: per-cell score tables at the base level ([Country, Year]: Norway, …, USA × 1985–1988) are aggregated bottom-up through the lattice to [All, Year], [Country, All], and [All, All]; each cell’s scores are then converted into its cell value.]

  • 1. Build a model for each lowest-level cell
  • 2. Compute the scores using data cube bottom-up technique
  • Ob. 1: Distributive scoring function ⇒ bottom up
  • 3. Use the scores to compute the cell values
  • Ob. 2: Having scores ⇒ having cell values
SLIDE 22

Machine-Learning Models

  • Naïve Bayes:

– Scoring function: algebraic

  • Kernel-density-based classifier:

– Scoring function: distributive

  • Decision tree, random forest:

– Neither distributive, nor algebraic

  • PBE: Probability-based ensemble (new)

– To make any machine-learning model distributive
– Approximation
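For intuition on why the Naïve Bayes scoring function is algebraic: its score depends only on class counts and per-(attribute, value, class) counts, which are fixed-size summaries that can be summed across disjoint base subsets. A sketch assuming categorical predictors (illustrative, not the authors' code):

```python
from collections import Counter
from math import log

def nb_summary(subset, attrs, label="Approval"):
    # G(.; bi(D)): fixed-size count statistics for one base subset
    class_counts = Counter(r[label] for r in subset)
    value_counts = Counter((a, r[a], r[label]) for r in subset for a in attrs)
    return class_counts, value_counts

def nb_merge(summaries):
    # F: the summary of a union of disjoint subsets = sum of the counts
    class_counts, value_counts = Counter(), Counter()
    for cc, vc in summaries:
        class_counts.update(cc)
        value_counts.update(vc)
    return class_counts, value_counts

def nb_score(y, x, summary, attrs, alpha=1.0):
    # log p(y) + sum_a log p(x_a | y), with crude add-alpha smoothing
    # (a real implementation would also track each attribute's domain size)
    class_counts, value_counts = summary
    n = sum(class_counts.values())
    s = log((class_counts[y] + alpha) / (n + alpha * len(class_counts)))
    for a in attrs:
        s += log((value_counts[(a, x[a], y)] + alpha) / (class_counts[y] + alpha))
    return s
```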

SLIDE 23

Probability-Based Ensemble

[Figure: the PBE version of a decision tree on the cell [WA, 1985] combines the decision trees built on the lowest-level cells [WA, Jan 1985], …, [WA, Dec 1985].]

SLIDE 24

Probability-Based Ensemble

  • Scoring function:

– h(y | x; bi(D)): Model h’s estimation of p(y | x, bi(D))
– g(bi | x): A model that predicts the probability that x belongs to base subset bi(D)

hPBE(x; σS(D)) = argmax_y ScorePBE(y, x; σS(D))

ScorePBE(y, x; σS(D)) = Σ_{i ∈ S} ScorePBE(y, x; bi(D))

ScorePBE(y, x; bi(D)) = h(y | x; bi(D)) ⋅ g(bi | x)

SLIDE 25

Outline

  • Motivating example
  • Definition of prediction cubes
  • Efficient prediction cube materialization
  • Experimental results
  • Conclusion
SLIDE 26

Experiments

  • Quality of PBE on 8 UCI datasets

– The quality of the PBE version of a model is slightly worse (0 ~ 6%) than the quality of the model trained directly on the whole training data.

  • Efficiency of the bottom-up score computation technique

  • Case study on demographic data

[Figure: a PBE assembled from the lowest-level cells of [WA, 1985] vs. a single model trained directly on all of [WA, 1985].]

SLIDE 27

Efficiency of the Bottom-up Score Computation

  • Machine-learning models:

– J48: J48 decision tree
– RF: Random forest
– NB: Naïve Bayes
– KDC: Kernel-density-based classifier

  • Bottom-up method vs. Exhaustive method

– Bottom-up: PBE-J48, PBE-RF, NB, KDC
– Exhaustive: J48ex, RFex, NBex, KDCex

SLIDE 28

Synthetic Dataset

  • Dimensions: Z1, Z2 and Z3.
  • Decision rule:

[Figure: dimension hierarchies for Z1 and Z2 (All → A, B, C, D, E → leaf values 1, …, n) and for Z3 (All → 0, 1, …, 9).]

Condition                   Rule
When Z1 > 1                 Y = I(4X1 + 3X2 + 2X3 + X4 + 0.4X6 > 7)
Else, when Z3 mod 2 = 0     Y = I(2X1 + 2X2 + 3X3 + 3X4 + 0.4X6 > 7)
Else                        Y = I(0.1X5 + X1 > 1)
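For concreteness, a sketch that generates labels according to this decision rule; the dimension domains and predictor distributions below are assumptions, since the slide does not specify them:

```python
import random

def label(z1, z3, x):
    # x = [X1, ..., X6]; int(...) plays the role of the indicator I(.)
    if z1 > 1:
        return int(4*x[0] + 3*x[1] + 2*x[2] + x[3] + 0.4*x[5] > 7)
    if z3 % 2 == 0:
        return int(2*x[0] + 2*x[1] + 3*x[2] + 3*x[3] + 0.4*x[5] > 7)
    return int(0.1*x[4] + x[0] > 1)

def make_record(rng=random):
    # Dimension domains and predictor distributions are assumptions,
    # not taken from the paper.
    z1, z2, z3 = rng.randint(0, 4), rng.randint(0, 4), rng.randint(0, 9)
    x = [rng.uniform(0, 2) for _ in range(6)]
    return {"Z1": z1, "Z2": z2, "Z3": z3, "X": x, "Y": label(z1, z3, x)}
```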

SLIDE 29

Efficiency Comparison

[Plot: execution time in seconds (500–2500) vs. number of records (40K–200K). The exhaustive methods (J48ex, RFex, NBex, KDCex) are far slower than the methods using bottom-up score computation (J48-PBE, RF-PBE, NB, KDC).]

SLIDE 30

Take-Home Messages

  • Promising exploratory data analysis paradigm:

– Use models to identify interesting subsets
– Concentrate only on subsets in the cube space

  • Those are meaningful subsets

– Precompute the results
– Provide the users with an interactive tool

  • A simple way to plug “something” into cube-style analysis:

– Try to describe/approximate “something” by a distributive or algebraic function

SLIDE 31

Related Work: Building models in OLAP

  • Multi-dimensional regression [Chen, VLDB 02]

– Goal: Detect changes of trends
– Build linear regression models for cube cells

  • Step-by-step regression in stream cube [Liu, PAKDD 03]
  • Loglinear-based quasi cubes [Barbara, J. IIS 01]

– Use loglinear model to approximately compress dense regions of a data cube

  • NetCube [Margaritis, VLDB 01]

– Build a Bayes Net on the entire dataset to approximately answer count queries

SLIDE 32

Related Work: Advanced Cube-Style Analysis

  • Cubegrades [Imielinski, J. DMKD 02]

– Extend data cubes using ideas from association rules
– How the measure changes when we roll up or drill down

  • Constrained gradients in data cube [Dong, VLDB 01]

– Find pairs of similar cell characteristics associated with big changes in measure

  • User-cognizant multidimensional analysis [Sarawagi, VLDBJ 01]

– Help users explore the most informative unvisited regions in a data cube using the maximum entropy principle

SLIDE 33

Questions

SLIDE 34

What are Our Assumptions?

  • Machine-learning models are good approximations of the true decision/prediction model

– Evaluate accuracy

  • The size of each base subset is large enough to build a good model

– Future work: Find the proper levels of subsets to start from

  • Model properties are evaluated by test sets

– We did not consider looking at the models themselves

SLIDE 35

Why Test Set?

  • To obtain quantitative model properties, we need a test set

  • Question: Why let users provide test sets?
  • Flexibility vs. ease of use

– Flexibility: The user can specify p(X) that he/she is interested in (e.g., focus on rich people)

  • E.g., compare p1(Y | X, σ(D)) with p2(Y | X, σ(D))

– Simple fix:

  • Sample test set from the dataset.
  • Cross-validation cube
SLIDE 36

Why Is PBE Not That Good?

  • If the probability estimation of the base models is correct, then PBE is optimal

  • Why is it not optimal in reality?

– The probability estimation method is not good
– The training datasets for base models are too small

  • Fix:

– Work on the probability estimation method
– Build models for some non-base-level cells

SLIDE 37

Feature Selection vs. Prediction Cubes

  • Feature selection:

– Goal: Find the best k predictive attributes
– Search space: 2^n (n: number of attributes)

  • Prediction cubes:

– Goal: Find interesting cube cells
– Search space: 2^d (d: number of dimension attributes)
– You may use the accuracy cube to find predictive dimension attributes, but that is not our goal
– For the predictiveness cube, the attributes whose predictiveness is of interest are given

SLIDE 38

Why Do We Need Efficient Precomputation?

  • Several hours vs. several days vs. several months
  • For upper-level cells, if the machine-learning algorithm is not scalable and we do not have a bottom-up method, we may never get the result

SLIDE 39

Backup Slides

SLIDE 40

Theoretical Comparison

  • Training complexity:

– Exhaustive: Σ_{[l1, …, ld] ∈ Levels} |Z1^(l1)| × … × |Zd^(ld)| × f_train(n_[l1, …, ld])
– Bottom-up: |Z1^(1)| × … × |Zd^(1)| × f_train(n_[1, …, 1])

(|Zi^(l)|: the number of values of dimension Zi at level l; n_[l1, …, ld]: the number of records in a cell at level [l1, …, ld]; f_train(n): the cost of training one model on n records.)

[Figure: dimension hierarchies. Z1 = Location: Z1^(1) = City (Madison, WI; Green Bay, WI; …), Z1^(2) = State (WI, MN, MA, …), Z1^(3) = All. Z2 = Time: Z2^(1) = Month (Jan., 86, …, Dec., 86), Z2^(2) = Year (85, 86, …, 04), Z2^(3) = All.]

SLIDE 41

Theoretical Comparison

  • Testing complexity:

– Exhaustive: Σ_{[l1, …, ld] ∈ Levels} |Z1^(l1)| × … × |Zd^(ld)| × f_test(n_[l1, …, ld])
– Bottom-up: |Z1^(1)| × … × |Zd^(1)| × f_test(n_[1, …, 1]) + Σ_{[l1, …, ld] ∈ Levels − {[1, …, 1]}} |Z1^(l1)| × … × |Zd^(ld)| × c

[Figure: the set Levels of level combinations [l1, l2] forms a lattice from [1,1] up to [3,3].]

SLIDE 42

Test-Set-Based Model Evaluation

  • Given a set-aside test set ∆ of schema [X, Y]:

– Accuracy of h(X):

  • The percentage of ∆ that are correctly classified

– Similarity between h1(X) and h2(X):

  • The percentage of ∆ that are given the same class labels by h1(X) and h2(X)

– Predictiveness of V ⊆ X: (based on h(X))

  • The difference between h(X) and h(X−V) measured by ∆; i.e., the percentage of ∆ that are predicted differently by h(X) and h(X−V)
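A compact sketch of these three test-set measures, assuming models are callables mapping an example x to a class label (illustrative only):

```python
def accuracy(h, test_set):
    # fraction of labeled test examples (x, y) that h classifies correctly
    return sum(h(x) == y for x, y in test_set) / len(test_set)

def similarity(h1, h2, test_examples):
    # fraction of (unlabeled) test examples given the same label by h1 and h2
    return sum(h1(x) == h2(x) for x in test_examples) / len(test_examples)

def predictiveness(h_full, h_without_V, test_examples):
    # fraction of test examples predicted differently by h(X) and h(X - V)
    return 1.0 - similarity(h_full, h_without_V, test_examples)
```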

SLIDE 43

Model Accuracy

  • Test-set accuracy (TS-accuracy):

– Given a set-aside test set ∆ with schema [X, Y],

  • |∆|: The number of examples in ∆
  • I(Ψ) = 1 if Ψ is true; otherwise, I(Ψ) = 0
  • Alternative: Cross-validation accuracy

– This will not be discussed further!!

accuracy(h(X; D) | ∆) = (1 / |∆|) Σ_{(x, y) ∈ ∆} I(h(x; D) = y)

SLIDE 44

Model Similarity

  • Prediction similarity (or distance):

– Given a set-aside test set ∆ with schema X:

  • Similarity between ph1(Y | X) and ph2(Y | X):

– phi(Y | X): Class-probability estimated by hi(X)

similarity(h1(X), h2(X)) = (1 / |∆|) Σ_{x ∈ ∆} I(h1(x) = h2(x))

distance(h1(X), h2(X)) = 1 − similarity(h1(X), h2(X))

KL-distance(h1(X), h2(X)) = (1 / |∆|) Σ_{x ∈ ∆} Σ_y p_h1(y | x) log( p_h1(y | x) / p_h2(y | x) )
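A short sketch of the KL-based distance, assuming both models expose class-probability estimates p_hi(y | x) as callables (the eps guard is an implementation choice, not from the slides):

```python
from math import log

def kl_distance(p_h1, p_h2, test_examples, classes, eps=1e-12):
    # (1/|Delta|) * sum_x sum_y p_h1(y|x) * log(p_h1(y|x) / p_h2(y|x))
    total = 0.0
    for x in test_examples:
        total += sum(p_h1(y, x) * log((p_h1(y, x) + eps) / (p_h2(y, x) + eps))
                     for y in classes)
    return total / len(test_examples)
```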

SLIDE 45

Attribute Predictiveness

  • Predictiveness of V ⊆ X: (based on h(X))

– PD-predictiveness: distance(h(X), h(X − V))
– KL-predictiveness: KL-distance(h(X), h(X − V))

  • Alternative: accuracy(h(X)) − accuracy(h(X − V))

– This will not be discussed further!!

SLIDE 46

Target Patterns

  • Find subset σ(D) such that h(X; σ(D)) has high prediction accuracy on a test set ∆

– E.g., the loan decision process in 2003’s WI is similar to a set ∆ of discriminatory decision examples

  • Find subset σ(D) such that h(X; σ(D)) is similar to a given model h0(X)

– E.g., the loan decision process in 2003’s WI is similar to a discriminatory decision model h0(X)

  • Find subset σ(D) such that V is predictive on σ(D)

– E.g., Race is an important factor of loan approval decision in 2003’s WI

SLIDE 47

Test-Set Accuracy

  • We would like to discover:

– The loan decision process in 2003’s WI is similar to a set of problematic decision examples

  • Given:

– Data table D: The loan decision dataset
– Test set ∆: The set of problematic decision examples

  • Goal:

– Find subset σLoc,Time(D) such that h(X; σLoc,Time(D)) has high prediction accuracy on ∆

SLIDE 48

Model Similarity

  • We would like to discover:

– The loan decision process in 2003’s WI is similar to a problematic decision model

  • Given:

– Data table D: The loan decision dataset
– Model h0(X): The problematic decision model

  • Goal:

– Find subset σLoc,Time(D) such that h(X; σLoc,Time(D)) is similar to h0(X)

SLIDE 49

Attribute Predictiveness

  • We would like to discover:

– Race is an important factor of loan approval decision in 2003’s WI

  • Given:

– Data table D: The loan decision dataset
– Attribute V of interest: Race

  • Goal:

– Find subset σLoc,Time(D) such that h(X; σLoc,Time(D)) is very different from h(X – V; σLoc,Time(D))

SLIDE 50

Model-Based Subset Analysis

  • Given: A data table D with schema [Z, X, Y]

– Z: Dimension attributes, e.g., {Location, Time}
– X: Predictor attributes, e.g., {Race, Sex, …}
– Y: Class-label attribute, e.g., Approval

[Data table D: rows such as (AL, USA; Dec, 04; White; M; …; Yes), …, (WY, USA; Dec, 04; Black; F; …; No) with columns Location, Time, Race, Sex, …, Approval.]

SLIDE 51

Model-Based Subset Analysis

[Figure: the data table D with Z: Dimension (Location, Time), X: Predictor (Race, Sex, …), and Y: Class (Approval); the highlighted rows form the cube subset σ[USA, Dec 04](D).]

  • Goal: To understand the relationship between X and Y on different subsets σZ(D) of the data D

– Relationship: p(Y | X, σZ(D))

  • Approach:

– Build model h(X; σZ(D)) ≈ p(Y | X, σZ(D))
– Evaluate h(X; σZ(D))

  • Accuracy, model similarity, predictiveness
SLIDE 52

Dimension and Level

[Figure: dimension hierarchies. Z1 = Location: Z1^(1) = City (Madison, WI; Green Bay, WI; …), Z1^(2) = State (WI, MN, MA, …), Z1^(3) = All. Z2 = Time: Z2^(1) = Month (Jan., 86, …, Dec., 86), Z2^(2) = Year (85, 86, …, 04), Z2^(3) = All. The level combinations form a lattice from [1,1] = [City,Month] up to [3,3] = [All,All], with the intermediate levels [City,Year], [State,Month], [State,Year], [All,Month], [All,Year], [City,All], and [State,All].]

SLIDE 53

Example: Full Materialization

[Figure: the level lattice ([All,All], [State,Year], [City,Month], [City,Year], [All,Month], [City,All], [All,Year], [State,All], [State,Month]); full materialization computes one cuboid for every level, from the single [All, All] cell down to the finest-grained cells.]

SLIDE 54

Scoring Function

  • Conceptually, a machine-learning model h(X; S) is a scoring function Score(y, x; S) that gives each class y a score on a test example x

– h(x; S) = argmax_y Score(y, x; S)
– Score(y, x; S) ≈ p(y | x, S)
– S: A set of training examples

[Figure: a training set S of loan records (Location, Time, Race, Sex, …, Approval) from Dec, 85; applying the model h(X; S) to a test example x returns scores such as [Yes: 80%, No: 20%].]

SLIDE 55

Bottom-Up Score Computation

  • Base cells: The finest-grained (lowest-level) cells in a cube

  • Base subsets bi(D): The lowest-level data subsets

– The subset of data records in a base cell is a base subset

  • Properties:

– D = ∪i bi(D) and bi(D) ∩ bj(D) = ∅
– Any subset σS(D) of D that corresponds to a cube cell is the union of some base subsets
– Notation:

  • σS(D) = bi(D) ∪ bj(D) ∪ bk(D), where S = {i, j, k}
SLIDE 56

Bottom-Up Score Computation

Scores:

Score(y, x; σS(D)) = F({Score(y, x; bi(D)) : i ∈ S})

Data subset:

σS(D) = ∪i∈S bi(D)

[Figure: in the domain lattice, the base subsets b1(D), b2(D), b3(D) at level [State,Year] (WA, WI, WY in 1985) make up σS(D) at level [All,Year]; correspondingly, the base scores Score(y, x; b1(D)), Score(y, x; b2(D)), Score(y, x; b3(D)) aggregate into Score(y, x; σS(D)).]

SLIDE 57

Decomposable Scoring Function

  • Let σS(D) = ∪i∈S bi(D).

– bi(D) is a base (lowest-level) subset

  • Distributively decomposable scoring function:

– Score(y, x; σS(D)) = F({Score(y, x; bi(D)) : i ∈ S})
– F is a distributive aggregate function

  • Algebraically decomposable scoring function:

– Score(y, x; σS(D)) = F({G(y, x; bi(D)) : i ∈ S})
– F is an algebraic aggregate function
– G(y, x; bi(D)) returns a fixed-length vector of values
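For example, the scoring function of a kernel-density-based classifier is distributively decomposable: the (unnormalized) score on σS(D) is simply the sum of the per-base-subset kernel sums. A sketch with a Gaussian kernel (the bandwidth and the omitted normalization are assumptions):

```python
from math import exp

def kdc_score(y, x, subset, bandwidth=1.0):
    # Score(y, x; subset): sum of Gaussian kernel contributions from the
    # training examples (feature tuple t, label) of class y in this subset
    return sum(
        exp(-sum((xi - ti) ** 2 for xi, ti in zip(x, t)) / (2 * bandwidth ** 2))
        for t, label in subset if label == y
    )

def kdc_score_union(y, x, base_subsets, bandwidth=1.0):
    # Score(y, x; sigma_S(D)) = F({Score(y, x; bi(D)) : i in S}), with F = sum
    return sum(kdc_score(y, x, b, bandwidth) for b in base_subsets)
```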

SLIDE 58

Algorithm

  • Input: The dataset D and test set ∆
  • For each lowest-level cell, which contains data bi(D):

– Build a model on bi(D)
– For each x ∈ ∆ and y, compute:

  • Score(y, x; bi(D)), if distributive
  • G(y, x; bi(D)), if algebraic
  • Use the standard data cube computation technique to compute the scores in a bottom-up manner (by Observation 1)
  • Compute the cell values using the scores (by Observation 2)
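Putting the pieces together, a minimal end-to-end sketch of this algorithm for a distributive scoring function, with test-set accuracy as the cell value (all helper names and signatures are illustrative, not the authors' code):

```python
def materialize_accuracy_cube(base_cells, train, score, test_set, classes, cube_cells):
    # base_cells: ids of the lowest-level cells; train(cell_id) -> model state
    # score(y, x, state): a distributive Score(y, x; bi(D))
    # cube_cells: maps every cube cell to the list of base cells it covers
    base_scores = {}
    for cell in base_cells:                       # 1. one model per base cell
        state = train(cell)
        base_scores[cell] = {(i, y): score(y, x, state)
                             for i, (x, _) in enumerate(test_set) for y in classes}

    def cell_value(parts):                        # 2.-3. aggregate scores, then evaluate
        agg = {k: sum(base_scores[p][k] for p in parts) for k in base_scores[parts[0]]}
        correct = sum(max(classes, key=lambda y: agg[(i, y)]) == y_true
                      for i, (_, y_true) in enumerate(test_set))
        return correct / len(test_set)

    return {cell: cell_value(parts) for cell, parts in cube_cells.items()}
```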
SLIDE 59

Probability-Based Ensemble

  • Scoring function:

– h(y | x; bi(D)): Model h’s estimation of p(y | x, bi(D))
– g(bi | x): A model that predicts the probability that x belongs to base subset bi(D)

hPBE(x; σS(D)) = argmax_y ScorePBE(y, x; σS(D))

ScorePBE(y, x; σS(D)) = Σ_{i ∈ S} ScorePBE(y, x; bi(D))

ScorePBE(y, x; bi(D)) = h(y | x; bi(D)) ⋅ g(bi | x)

SLIDE 60

Optimality of PBE

  • ScorePBE(y, x; σS(D)) = c ⋅ p(y | x, x ∈ σS(D))

p(y | x, x ∈ σS(D))
= p(y, x ∈ σS(D) | x) / p(x ∈ σS(D) | x)
∝ p(y, x ∈ σS(D) | x)
= Σ_{i ∈ S} p(y, x ∈ bi(D) | x)        [the bi(D)’s partition σS(D)]
= Σ_{i ∈ S} p(y | x, x ∈ bi(D)) ⋅ p(x ∈ bi(D) | x)
≈ Σ_{i ∈ S} h(y | x; bi(D)) ⋅ g(bi | x)
= ScorePBE(y, x; σS(D))

SLIDE 61

Efficiency Comparison

[Plot: execution time in seconds (100–700) vs. number of records (200K–1M) for the bottom-up methods J48-PBE, RF-PBE, KDC, and NB.]

SLIDE 62

Where Is the Time Spent?

[Plot: percentage breakdown of execution time into training, testing, and other, for J48-PBE, RF-PBE, KDC, and NB at 200K and 1M records.]

SLIDE 63

Accuracy of PBE

  • Goal:

– To compare PBE with the gold standard

  • PBE: A set of n J48s/RFs, each of which is trained on a small partition of the whole dataset
  • Gold standard: A J48/RF trained on the whole data

– To understand how the number of base classifiers in a PBE affects the accuracy of the PBE

  • Datasets:

– Eight UCI datasets

SLIDE 64

Accuracy of PBE

Adult Dataset

[Plot: accuracy (80–100%) vs. number of base classifiers in a PBE (2, 5, 10, 15, 20) on the Adult dataset, for RF, J48, RF-PBE, and J48-PBE.]

SLIDE 65

Accuracy of PBE

Nursery Dataset

[Plot: accuracy (80–100%) vs. number of base classifiers in a PBE (2, 5, 10, 15, 20) on the Nursery dataset, for RF, J48, RF-PBE, and J48-PBE.]

SLIDE 66

Accuracy of PBE

Error = The average of the absolute difference between a ground-truth cell value and a cell value computed by PBE

[Plots: error vs. number of base models in a PBE, for RF-PBE and J48-PBE, on the Flat dataset (1–10000 base models) and the Deep dataset (1–1000 base models).]