
RainForest - A Framework for Fast Decision Tree Construction of Large Datasets

Presented by:

Leila Homaeian

CMPUT 695

  • Nov. 25, 2004

Outline

  • 1. Introduction
  • 2. Problem Definition
  • 3. Related Work
  • 4. RainForest Framework
  • Family of Algorithms
  • 5. Experiments
  • 6. Conclusion

Introduction

Classification is an important data mining problem.

  • Input: a database of training records
  • Each record has a class label and predictor attributes.
  • The resulting model assigns class labels to testing records.

Decision tree classifiers:

  • Easily assimilated by humans
  • Can be constructed fast
  • Highly accurate

Introduction (cont’d)

Proposed approaches to deal with large datasets:

  • Discretize ordered attributes
  • Sampling at each node of the classification tree

Assume the dataset fits in main memory.

  • Partitioning methods such that each partition fits in main memory

Quality?

Introduction (cont’d)

The RainForest framework scales up existing decision tree construction algorithms. Its data access algorithms scale with the size of the database, adapt to the available main memory, and are not restricted to a specific classification algorithm. Applying RainForest to an existing algorithm yields a scalable version of that algorithm that produces the same result as the original.

Problem Definition

Example classification tree:

Outlook?
  sunny: Humidity?
    high: N
    normal: P
  overcast: P
  rain: Windy?
    true: N
    false: P

For each internal node n:

  • Splitting attribute
  • Set of splitting predicates
  • The combined information of splitting attribute and splitting predicates is the splitting criteria on n, denoted as crit(n).

Problem Definition (cont’d)

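Family of a node n, F(n): the set of database records that follow the path from the root to n when processed by the tree.

How to control the size of the classification tree?

  • Bottom-up or top-down pruning
  • An orthogonal issue to scalable tree construction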

Related Work

The literature survey shows that almost none of the previous approaches scale to large datasets.

Sprint [SAM96] works for large databases.

  • Builds classification trees with binary split
  • Uses gini index to decide on splitting criteria
  • Uses Minimum Description Length (MDL) pruning (no test sample is needed)
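
As a small illustration of the gini index mentioned above, the sketch below uses the standard definition gini(S) = 1 - Σ p_j² and the usual size-weighted form for a binary split. The function names are illustrative and are not taken from SPRINT.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted gini index of a binary split into two label lists."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

# A pure partition scores 0; an evenly mixed two-class partition scores 0.5.
pure = ["P", "P", "P"]
mixed = ["P", "N", "P", "N"]
```

A lower value of `gini_split` indicates a better candidate split.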

[SAM96] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of VLDB, 1996.

Related Work (cont’d)

Attribute list for Humidity:

rid  Humidity  Class
1    High      N
2    High      N
3    High      P
4    High      P
8    High      N
12   High      P
14   High      N
5    Normal    P
6    Normal    N
7    Normal    P
9    Normal    P
10   Normal    P
11   Normal    P
13   Normal    P

Attribute list for Temperature:

rid  Temperature  Class
1    Hot          N
2    Hot          N
3    Hot          P
13   Hot          P
4    Mild         P
8    Mild         N
10   Mild         P
11   Mild         P
12   Mild         P
14   Mild         N
5    Cool         P
6    Cool         N
7    Cool         P
9    Cool         P

  • To decide on the splitting attribute at a tree node n, Sprint needs to access F(n) for each ordered attribute in sorted order.
  • Creates an attribute list for each attribute.
  • A costly hash join is needed to distribute the family of a node among its children.

RainForest Framework

Most decision tree algorithms (C4.5, CART, CHAID, FACT, ID3, SLIQ, Sprint, and Quest)

  • consider every attribute individually
  • need the distribution of class labels for each distinct value of an attribute to decide on the splitting criteria.

RainForest Framework (cont’d)

RainForest refines this generic schema.

  • The AVC-set (Attribute-Value, Classlabel) of a predictor attribute a at a node n is the projection of F(n) onto a and the class label, whereby the counts of the individual class labels are aggregated.
  • The AVC-group of a node n is the set of all AVC-sets at n.
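
A minimal sketch of AVC-group construction in one pass over F(n), assuming records are plain dictionaries. The function name `build_avc_group` and the record layout are illustrative; the Humidity values and class labels are the 14 records from the attribute list shown earlier, ordered by rid.

```python
from collections import Counter, defaultdict

def build_avc_group(records, predictors, class_attr):
    """Return {attribute: {value: Counter({class_label: count})}} in one pass."""
    avc_group = {a: defaultdict(Counter) for a in predictors}
    for rec in records:
        for a in predictors:
            avc_group[a][rec[a]][rec[class_attr]] += 1
    return avc_group

# Humidity and class label for rids 1..14, in rid order.
humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
            "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
labels = ["N", "N", "P", "P", "P", "N", "P", "N", "P", "P", "P", "P", "P", "N"]

records = [{"Humidity": h, "Class": c} for h, c in zip(humidity, labels)]
group = build_avc_group(records, ["Humidity"], "Class")
```

Here `group["Humidity"]` is the AVC-set of Humidity at the root: High has 3 P and 4 N, Normal has 6 P and 1 N.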

RainForest Framework (cont’d)

AVC-set of attribute Outlook (at the root node):

Outlook   P  N
Sunny     2  3
Overcast  4  0
Rainy     3  2

AVC-set of attribute Humidity (at the node reached via Outlook = sunny):

Humidity  P  N
High      0  3
Normal    2  0

The size of the AVC-set of an attribute a at node n depends on the number of distinct values of a and the number of class labels in F(n).
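
To make the size argument concrete, the following sketch bounds the number of AVC-set entries by (distinct attribute values) x (class labels). The helper name and the sample column are hypothetical; the point is only that the bound does not grow with the number of records.

```python
def avc_set_entry_bound(values, num_classes):
    """Upper bound on AVC-set entries: distinct attribute values x class labels."""
    return len(set(values)) * num_classes

# Hypothetical column with about one million records but only 3 distinct values.
column = ["sunny", "overcast", "rain"] * 333_334
bound = avc_set_entry_bound(column, num_classes=2)
```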

RainForest Framework (cont’d)

Depending on the amount of main memory available, three cases may happen:

  • 1. The AVC-Group of the root node fits in main memory.
  • 2. Each individual AVC-set of the root node fits in main memory, but not the AVC-Group of the root.
  • 3. None of the individual AVC-sets of the root node fits in main memory.

The proposed algorithms RF-Write, RF-Read, and RF-Hybrid deal with case 1, and RF-Vertical deals with case 2.
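
The case analysis can be sketched as a simple check, assuming the sizes of the root's AVC-sets are known and expressed in the same unit as the memory budget. The function name and return strings are illustrative, and the last branch also catches mixed situations where only some AVC-sets are too large.

```python
def choose_rainforest_variant(avc_set_sizes, memory_budget):
    """Map the memory cases to the algorithms named on the slide.

    avc_set_sizes: sizes of the root's AVC-sets, one per predictor attribute,
    in the same unit as memory_budget (for example, number of entries).
    """
    if sum(avc_set_sizes) <= memory_budget:
        return "case 1: RF-Write, RF-Read, or RF-Hybrid"
    if max(avc_set_sizes) <= memory_budget:
        return "case 2: RF-Vertical"
    # At least one AVC-set does not fit; not handled by the algorithms above.
    return "not handled: some AVC-set exceeds memory"
```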

RainForest Framework (cont’d)

In the RainForest family of algorithms, the following steps are carried out for each node n:

  • 1. AVC-group construction
  • 2. Choose splitting attribute and predicate
  • 3. Partition F(n) across the children nodes
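
The three steps can be sketched for a single node as follows, assuming F(n) is an in-memory list of dictionaries and a multiway split with one child per attribute value. `process_node` and the toy selection rule `fewest_mixed_values` are hypothetical stand-ins; a real classification algorithm would supply its own rule for step 2.

```python
from collections import Counter, defaultdict

def process_node(family, predictors, class_attr, choose_split):
    """Run the three per-node steps on an in-memory family F(n)."""
    # Step 1: AVC-group construction (one pass over F(n)).
    avc_group = {a: defaultdict(Counter) for a in predictors}
    for rec in family:
        for a in predictors:
            avc_group[a][rec[a]][rec[class_attr]] += 1
    # Step 2: the splitting attribute is chosen from the AVC-group only.
    split_attr = choose_split(avc_group)
    # Step 3: partition F(n) across the children (one child per value here).
    children = defaultdict(list)
    for rec in family:
        children[rec[split_attr]].append(rec)
    return split_attr, dict(children)

def fewest_mixed_values(avc_group):
    """Toy rule: prefer the attribute with the fewest impure values."""
    def impure(avc_set):
        return sum(1 for counts in avc_set.values() if len(counts) > 1)
    return min(sorted(avc_group), key=lambda a: impure(avc_group[a]))

family = [
    {"Humidity": "High", "Windy": "true", "Class": "N"},
    {"Humidity": "High", "Windy": "false", "Class": "N"},
    {"Humidity": "Normal", "Windy": "true", "Class": "P"},
    {"Humidity": "Normal", "Windy": "false", "Class": "P"},
]
attr, parts = process_node(family, ["Humidity", "Windy"], "Class", fewest_mixed_values)
```

Note that step 2 never touches the records themselves, only the aggregated counts.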

RF-Write

  • First, the database is scanned to build the AVC-group of the root node r.
  • Then the AVC-group is passed to CL (the classification algorithm being scaled by RainForest) to compute crit(r).
  • The children are allocated, and another scan is made over the database to partition it across the children of the root node r.
  • RF-Write is applied to each partition recursively.

At each level of the tree, the families belonging to that level are read twice and written once.
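
A rough sketch of the RF-Write recursion with I/O bookkeeping. It assumes in-memory lists stand in for disk partitions, a multiway split on categorical attributes, and a caller-supplied `choose_split` playing the role of CL; all names are illustrative. Leaf detection is done directly on the partition for brevity and is not counted as a scan.

```python
from collections import Counter, defaultdict

def rf_write(partition, predictors, class_attr, choose_split, io, depth=0):
    """Recursive RF-Write sketch; `partition` (a list) stands in for a disk file."""
    labels = Counter(rec[class_attr] for rec in partition)
    if len(labels) <= 1 or not predictors:
        return {"leaf": dict(labels)}
    # Scan 1: read the partition to build the AVC-group.
    io[depth]["read"] += len(partition)
    avc_group = {a: defaultdict(Counter) for a in predictors}
    for rec in partition:
        for a in predictors:
            avc_group[a][rec[a]][rec[class_attr]] += 1
    split_attr = choose_split(avc_group)  # CL computes crit(n)
    # Scan 2: read again and write each record to its child's partition.
    io[depth]["read"] += len(partition)
    io[depth]["written"] += len(partition)
    children = defaultdict(list)
    for rec in partition:
        children[rec[split_attr]].append(rec)
    remaining = [a for a in predictors if a != split_attr]
    return {"split": split_attr,
            "children": {v: rf_write(p, remaining, class_attr, choose_split, io, depth + 1)
                         for v, p in children.items()}}

records = [
    {"Humidity": "High", "Windy": "true", "Class": "N"},
    {"Humidity": "High", "Windy": "false", "Class": "N"},
    {"Humidity": "Normal", "Windy": "true", "Class": "P"},
    {"Humidity": "Normal", "Windy": "false", "Class": "P"},
]
io = defaultdict(Counter)
tree = rf_write(records, ["Humidity", "Windy"], "Class", lambda g: sorted(g)[0], io)
```

For the four sample records, level 0 reads 8 records in total (two scans of 4) and writes 4, matching "read twice and written once".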

RF-Read

  • Database is scanned to build the AVC-group of the root node r.
  • Then the AVC-group is passed to CL to compute crit(r).
  • Children are allocated
  • Another scan is made over the database to build the AVC-groups of the children simultaneously, assuming there is enough memory to hold the AVC-groups of all children.
  • Continue the same process until a level L where not all AVC-groups of the new nodes N fit in memory.
  • Divide N into g_L groups such that the AVC-groups of the nodes in each group fit in memory.
  • Process each group individually: g_L scans over the database are needed.

The number of database scans increases as the decision tree gets deeper.
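
The slides do not say how N is divided into groups. As one possible sketch, the following uses a first-fit-decreasing heuristic, which is an assumption of this example; `group_nodes` and the sample sizes are hypothetical.

```python
def group_nodes(avc_group_sizes, memory_budget):
    """Split a level's nodes into groups whose AVC-groups fit in memory together.

    avc_group_sizes: {node_id: estimated AVC-group size}.
    Returns a list of groups; one database scan is needed per group.
    """
    groups = []  # each item: [used_memory, [node_ids]]
    for node, size in sorted(avc_group_sizes.items(), key=lambda kv: -kv[1]):
        if size > memory_budget:
            raise ValueError(f"AVC-group of node {node} does not fit in memory")
        for group in groups:
            if group[0] + size <= memory_budget:
                group[0] += size
                group[1].append(node)
                break
        else:
            groups.append([size, [node]])
    return [nodes for _, nodes in groups]

level_sizes = {"n1": 60, "n2": 50, "n3": 40, "n4": 30}
groups = group_nodes(level_sizes, memory_budget=100)
```

With these sample sizes the level needs two groups, so two scans over the database.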

RF-Hybrid

  • Starts with RF-Read.
  • Continues the same process until a level L where not all AVC-groups of the new nodes N fit in memory.
  • Now RF-Hybrid switches to RF-Write, but tries to use the available memory efficiently.
  • Choose M ⊂ N, the nodes whose AVC-groups are constructed while writing the partitions of the nodes in N.
  • For each n ∈ M, one scan over n’s partition is saved.
  • Each n ∈ M has a cost and a benefit:
      cost: the size of its AVC-group
      benefit: the size of F(n)
  • A modified greedy approximation of the Knapsack problem is applied to choose M.
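
The slides do not spell out the modification to the greedy approximation. The sketch below shows only the plain ratio-based greedy heuristic for the Knapsack problem, with cost as AVC-group size and benefit as the size of F(n); the function name and sample numbers are hypothetical, and costs are assumed to be positive.

```python
def choose_in_memory_nodes(nodes, memory_budget):
    """Pick M from N by benefit/cost ratio while the memory budget allows.

    nodes: {node_id: (avc_group_size, family_size)}; cost is the AVC-group
    size and benefit is the size of F(n), i.e. the scan that gets saved.
    """
    ranked = sorted(nodes.items(), key=lambda kv: kv[1][1] / kv[1][0], reverse=True)
    chosen, used = [], 0
    for node, (cost, benefit) in ranked:
        if used + cost <= memory_budget:
            chosen.append(node)
            used += cost
    return chosen

level = {"n1": (40, 900), "n2": (70, 700), "n3": (30, 330), "n4": (50, 1000)}
M = choose_in_memory_nodes(level, memory_budget=100)
```

Nodes left out of M are handled as in RF-Write: their partitions are written first and scanned again later.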

RF-Vertical

The AVC-group of the root node r does not fit in memory, but each individual AVC-set of r does.

  • $P_{large} = \{a_1, \ldots, a_v\}$, such that each individual AVC-set fits in memory but not two AVC-sets
  • $P_{small} = \{a_{v+1}, \ldots, a_m\}$ denotes the rest of the attributes, and c denotes the class label.

At a node n, the AVC-sets of $P_{small}$ are built in memory. Meanwhile, the projection of F(n) onto $P_{large}$ and the class label c is written to a temporary file $Z_n$. $Z_n$ has the schema $\langle a_1, \ldots, a_v, c \rangle$.

Then $Z_n$ is scanned v times to build the AVC-sets of $P_{large}$.
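
A minimal sketch of the RF-Vertical data flow at one node, assuming an in-memory list stands in for the temporary file and a callback receives each large AVC-set so that only one is held at a time. The names `rf_vertical` and `handle_large`, and the sample attributes, are hypothetical.

```python
from collections import Counter, defaultdict

def rf_vertical(family, p_small, p_large, class_attr, handle_large):
    """Build small AVC-sets in memory, then one large AVC-set per scan of Z_n."""
    small = {a: defaultdict(Counter) for a in p_small}
    z_n = []  # stand-in for the temporary file with schema <a_1, ..., a_v, c>
    # Single pass over F(n): small AVC-sets in memory, projection appended to Z_n.
    for rec in family:
        for a in p_small:
            small[a][rec[a]][rec[class_attr]] += 1
        z_n.append(tuple(rec[a] for a in p_large) + (rec[class_attr],))
    # One scan of Z_n per large attribute.
    scans = 0
    for i, a in enumerate(p_large):
        avc_set = defaultdict(Counter)
        for row in z_n:
            avc_set[row[i]][row[-1]] += 1
        scans += 1
        handle_large(a, avc_set)  # e.g. evaluate splits on a, then drop the set
    return small, scans

family = [
    {"Windy": "true", "CustomerId": 101, "Zip": "T6G", "Class": "N"},
    {"Windy": "false", "CustomerId": 102, "Zip": "T6G", "Class": "P"},
    {"Windy": "false", "CustomerId": 103, "Zip": "T5K", "Class": "P"},
]
seen = {}
small, scans = rf_vertical(family, ["Windy"], ["CustomerId", "Zip"], "Class",
                           lambda a, s: seen.update({a: len(s)}))
```

With two large attributes the temporary data is scanned twice, once per attribute.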

AVC-Group Size Estimation

The size of the AVC-group of a node n is estimated to be the same as that of its parent p, except for the AVC-set of the splitting attribute a.
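
As an arithmetic illustration of this estimate, the sketch below reuses the parent's AVC-set sizes for every attribute except the splitting attribute. How the size for the splitting attribute at the child is obtained is not stated on the slide, so it is passed in as a parameter here; the function name and numbers are hypothetical.

```python
def estimate_child_avc_group_size(parent_avc_set_sizes, split_attr, child_split_attr_size):
    """Parent's AVC-set sizes for all attributes except the splitting one,
    plus a separately supplied size for the splitting attribute at the child."""
    unchanged = sum(size for attr, size in parent_avc_set_sizes.items()
                    if attr != split_attr)
    return unchanged + child_split_attr_size

# Hypothetical AVC-set sizes (in entries) at the parent node.
parent = {"Outlook": 6, "Humidity": 4, "Windy": 4}
estimate = estimate_child_avc_group_size(parent, "Outlook", child_split_attr_size=2)
```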

Experiments

The RainForest generic schema allows the instantiation of all existing classification tree algorithms without modifying the resulting tree. Tree quality is therefore an orthogonal issue, and the experiments focus on decision tree construction time.

Experiments (cont’d)

The datasets come from the synthetic data generator (referred to as Generator) introduced by [AIS93].

[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance perspective. IEEE TKDE, Dec. 1993.

  • Functions 1 and 7 from Generator are used. Function 1 generates small decision trees, whereas Function 7 generates large ones.
  • Experiments were performed on a 200 MHz Pentium Pro with 128 MB of main memory, running Solaris X86 version 2.5.1.
  • Algorithms were written in C++ and compiled using gcc version 2.7.2.1 with the -O3 compilation option.

Experiments (cont’d)

  • 17 MB of main memory is needed to hold the AVC-group of the root node.
  • The number of AVC-set entries that fit in memory is called the buffer size.
  • RF-Write needs a buffer size of at least 2.1 million entries; RF-Vertical needs 1.3 million.

Experiments (cont’d)

  • Buffer size for RF-Write and RF-Hybrid: 2.5 million entries
  • Buffer size for RF-Vertical: 1.8 million entries

Experiments (cont’d)

  • Buffer size for RF-Write and RF-Hybrid: 2.5 million entries
  • Buffer size for RF-Vertical: 1.8 million entries
  • Size of input dataset: 2,000,000 records

Conclusion

  • RainForest is a comprehensive approach to scaling decision tree algorithms.
  • The key observation is that the splitting criteria can be computed from the AVC-group.
  • Given enough memory, RainForest algorithms outperform Sprint, the fastest scalable state-of-the-art classification algorithm.


Thanks! ☺ Questions?