Classification: Classification and Prediction



SLIDE 1

Classification

SLIDE 2

Jian Pei: CMPT 741/459 Classification (1) 2

Classification and Prediction

  • Classification: predict categorical class labels
    – Build a model for a set of classes/concepts
    – Classify loan applications (approve/decline)
  • Prediction: model continuous-valued functions
    – Predict the economic growth in 2015

SLIDE 3

Classification: A 2-step Process

  • Model construction: describe a set of predetermined classes
    – Training dataset: tuples for model construction
      • Each tuple/sample belongs to a predefined class
    – Classification rules, decision trees, or math formulae
  • Model application: classify unseen objects
    – Estimate accuracy of the model using an independent test set
    – Acceptable accuracy → apply the model to classify tuples with unknown class labels

SLIDE 4

Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training data:

Name   Rank        Years  Tenured
Mike   Ass. Prof   3      No
Mary   Ass. Prof   7      Yes
Bill   Prof        2      Yes
Jim    Asso. Prof  7      Yes
Dave   Ass. Prof   6      No
Anne   Asso. Prof  3      No

Classifier (Model): IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

SLIDE 5

Model Application

[Figure: the classifier is applied to the testing data and to unseen data]

Testing data:

Name     Rank        Years  Tenured
Tom      Ass. Prof   2      No
Merlisa  Asso. Prof  7      No
George   Prof        5      Yes
Joseph   Ass. Prof   7      Yes

Unseen data: (Jeff, Professor, 4) → Tenured?
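
The two steps can be sketched in a few lines of Python. This is only an illustration: the function name `classify` is an assumption, and it treats the table's "Prof" as the rule's 'professor'. It applies the rule from the Model Construction slide to the testing data, estimates accuracy, and then classifies the unseen tuple.

```python
# Rule from the Model Construction slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
# Assumption: the table value "Prof" denotes the rank 'professor'.
def classify(rank: str, years: int) -> str:
    return "Yes" if rank == "Prof" or years > 6 else "No"

# Testing data from the slide: (name, rank, years, actual label)
test_data = [
    ("Tom", "Ass. Prof", 2, "No"),
    ("Merlisa", "Asso. Prof", 7, "No"),
    ("George", "Prof", 5, "Yes"),
    ("Joseph", "Ass. Prof", 7, "Yes"),
]

# Step 2a: estimate accuracy on the independent test set.
correct = sum(classify(rank, years) == label
              for _, rank, years, label in test_data)
accuracy = correct / len(test_data)
print(f"test accuracy = {accuracy:.2f}")  # Merlisa is the one misclassified tuple

# Step 2b: classify the unseen tuple (Jeff, Professor, 4).
jeff = classify("Prof", 4)
print("Jeff tenured?", jeff)
```

The rule fits all six training tuples but gets one of the four test tuples wrong, which is why accuracy is estimated on data that was not used to build the model.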

SLIDE 6

Supervised/Unsupervised Learning

  • Supervised learning (classification)
    – Supervision: objects in the training data set have labels
    – New data is classified based on the training set
  • Unsupervised learning (clustering)
    – The class labels of training data are unknown
    – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

SLIDE 7

Data Preparation

  • Data cleaning
    – Preprocess data in order to reduce noise and handle missing values
  • Relevance analysis (feature selection)
    – Remove the irrelevant or redundant attributes
  • Data transformation
    – Generalize and/or normalize data

SLIDE 8

Measurements of Quality

  • Prediction accuracy
  • Speed and scalability
    – Construction speed and application speed
  • Robustness: handle noise and missing values
  • Scalability: build model for large training data sets
  • Interpretability: understandability of models
SLIDE 9

Decision Tree Induction

  • Decision tree representation
  • Construction of a decision tree
  • Inductive bias and overfitting
  • Scalable enhancements for large databases
SLIDE 10

Decision Tree

  • A node in the tree: a test of some attribute
  • A branch: a possible value of the attribute
  • Classification
    – Start at the root
    – Test the attribute
    – Move down the tree branch

Outlook
  Sunny → Humidity
    High → No
    Normal → Yes
  Overcast → Yes
  Rain → Wind
    Strong → No
    Weak → Yes
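
The classification procedure (start at the root, test the attribute, move down the matching branch) can be sketched in Python. The nested-tuple encoding of the tree and the function name `classify` are assumptions made for illustration; the tree itself is the PlayTennis tree on this slide.

```python
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, instance):
    """Walk from the root to a leaf, following the branch for each tested value."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[instance[attribute]]
    return node

example = {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}
print(classify(tree, example))
```

Only the attributes on the path are tested: for a sunny day the Wind value is never examined.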

SLIDE 11

Training Dataset

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

SLIDE 12

Appropriate Problems

  • Instances are represented by attribute-value pairs
    – Extensions of decision trees can handle real-valued attributes
  • Disjunctive descriptions may be required
  • The training data may contain errors or missing values

SLIDE 13

Basic Algorithm ID3

  • Construct a tree in a top-down recursive divide-and-conquer manner
    – Which attribute is the best at the current node?
    – Create a node for each possible attribute value
    – Partition training data into descendant nodes
  • Conditions for stopping recursion
    – All samples at a given node belong to the same class
    – No attributes remain for further partitioning
      • Majority voting is employed for classifying the leaf
    – There is no sample at the node
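
The recursion can be sketched compactly in Python on the PlayTennis training set. This is a minimal illustration, not a reference implementation: the names `id3`, `gain`, `entropy`, and `majority` and the nested-tuple tree encoding are assumptions, and information gain is used as the "best attribute" criterion.

```python
import math
from collections import Counter

ATTRS = ["Outlook", "Temp", "Humid", "Wind"]
ROWS = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
DATA = [dict(zip(ATTRS + ["PlayTennis"], line.split()))
        for line in ROWS.splitlines()]

def entropy(rows):
    counts = Counter(r["PlayTennis"] for r in rows)
    return sum(-(c / len(rows)) * math.log2(c / len(rows))
               for c in counts.values())

def gain(rows, attr):
    sizes = Counter(r[attr] for r in rows)
    remainder = sum(n / len(rows) * entropy([r for r in rows if r[attr] == v])
                    for v, n in sizes.items())
    return entropy(rows) - remainder

def majority(rows):
    return Counter(r["PlayTennis"] for r in rows).most_common(1)[0][0]

def id3(rows, attrs):
    labels = {r["PlayTennis"] for r in rows}
    if len(labels) == 1:      # stop: all samples belong to the same class
        return labels.pop()
    if not attrs:             # stop: no attributes remain, so take a majority vote
        return majority(rows)
    best = max(attrs, key=lambda a: gain(rows, a))
    rest = [a for a in attrs if a != best]
    # Simplification: branch only on values observed in `rows`,
    # so the "no sample at the node" case never arises here.
    return (best, {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})})

tree = id3(DATA, ATTRS)
print(tree)
```

On this data the sketch selects Outlook at the root, then Humid under Sunny and Wind under Rain.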

SLIDE 14

Which Attribute Is the Best?

  • The attribute most useful for classifying examples
  • Information gain and gini index
    – Statistical properties
    – Measure how well an attribute separates the training examples

SLIDE 15

Entropy

  • Measure homogeneity of examples
    – S is the training data set, and p_i is the proportion of S belonging to class i
  • The smaller the entropy, the purer the data set

\[ \mathrm{Entropy}(S) \equiv -\sum_{i=1}^{c} p_i \log_2 p_i \]
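
As a quick illustration of the definition, a minimal Python sketch (the function name `entropy` is an assumption) shows that a pure set has entropy 0 and an evenly mixed two-class set has entropy 1:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = entropy(["Yes"] * 7)                   # all examples in one class
mixed = entropy(["Yes"] * 7 + ["No"] * 7)     # evenly split between two classes
tennis = entropy(["Yes"] * 9 + ["No"] * 5)    # 9 Yes / 5 No
print(pure, mixed, round(tennis, 3))
```

Classes with zero examples never appear in the counter, so the sketch does not need a special case for log2(0).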

SLIDE 16

Information Gain

  • The expected reduction in entropy caused by partitioning the examples according to an attribute

\[ \mathrm{Gain}(S, A) \equiv \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \]

Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v

SLIDE 17

Example

Outlook   Temp  Humid   Wind    PlayTennis
Sunny     Hot   High    Weak    No
Sunny     Hot   High    Strong  No
Overcast  Hot   High    Weak    Yes
Rain      Mild  High    Weak    Yes
Rain      Cool  Normal  Weak    Yes
Rain      Cool  Normal  Strong  No
Overcast  Cool  Normal  Strong  Yes
Sunny     Mild  High    Weak    No
Sunny     Cool  Normal  Weak    Yes
Rain      Mild  Normal  Weak    Yes
Sunny     Mild  Normal  Strong  Yes
Overcast  Mild  High    Strong  Yes
Overcast  Hot   Normal  Weak    Yes
Rain      Mild  High    Strong  No

\[ \mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94 \]

\[
\begin{aligned}
\mathrm{Gain}(S, \mathit{Wind}) &= \mathrm{Entropy}(S) - \sum_{v \in \{\mathit{Weak}, \mathit{Strong}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
&= \mathrm{Entropy}(S) - \frac{8}{14}\,\mathrm{Entropy}(S_{\mathit{Weak}}) - \frac{6}{14}\,\mathrm{Entropy}(S_{\mathit{Strong}}) \\
&= 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 = 0.048
\end{aligned}
\]
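
The same calculation can be repeated for every attribute. The Python sketch below works from [Yes, No] class counts that were tallied by hand from the table on this slide; the function names and the `splits` dictionary are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Entropy of a node given its per-class example counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def gain(parent_counts, partitions):
    """Entropy of the parent minus the weighted entropy of the partitions."""
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

S = [9, 5]  # [Yes, No] counts for the whole training set
splits = {  # [Yes, No] counts per attribute value, tallied from the table
    "Outlook": [[2, 3], [4, 0], [3, 2]],  # Sunny, Overcast, Rain
    "Temp":    [[2, 2], [4, 2], [3, 1]],  # Hot, Mild, Cool
    "Humid":   [[3, 4], [6, 1]],          # High, Normal
    "Wind":    [[6, 2], [3, 3]],          # Weak, Strong
}
gains = {attr: gain(S, parts) for attr, parts in splits.items()}
for attr, g in gains.items():
    print(f"Gain(S, {attr}) = {g:.3f}")
best = max(gains, key=gains.get)
print("best attribute:", best)
```

Wind reproduces the 0.048 computed above, and Outlook has the largest gain of the four attributes.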

SLIDE 18

Hypothesis Space Search in Decision Tree Building

  • Hypothesis space: the set of possible decision trees
  • ID3: simple-to-complex, hill-climbing search
    – Evaluation function: information gain

SLIDE 19

Capabilities and Limitations

  • The hypothesis space is complete
  • Maintains only a single current hypothesis
  • No backtracking
    – May converge to a locally optimal solution
  • Uses all training examples at each step
    – Makes statistics-based decisions
    – Not sensitive to errors in individual examples

SLIDE 20

Natural Bias

  • The information gain measure favors attributes with many values
  • An extreme example
    – Attribute “date” may have the highest information gain
    – A very broad decision tree of depth one
    – Inapplicable to any future data
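
The extreme example can be checked numerically. In the Python sketch below, the `date` column is hypothetical (one distinct value per training example; it is not part of the PlayTennis table), while the Wind values and class labels are taken from the training set.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(values, labels):
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
wind = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
        "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]
date = [f"day-{i}" for i in range(1, 15)]  # hypothetical: unique per example

print(f"Gain(S, Wind) = {gain(wind, labels):.3f}")
print(f"Gain(S, date) = {gain(date, labels):.3f}")
```

Every partition induced by `date` holds a single example and is therefore pure, so its gain equals Entropy(S), the largest value any attribute can reach, even though the split says nothing about future dates.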

SLIDE 21

Alternative Measures

  • Gain ratio: penalize attributes like date by incorporating split information
  • Split information is sensitive to how broadly and uniformly the attribute splits the data
  • Gain ratio can be undefined or very large
    – Only test attributes with above-average gain

\[ \mathrm{SplitInformation}(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} \]

\[ \mathrm{GainRatio}(S, A) \equiv \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)} \]

where S_1, ..., S_c are the subsets of S produced by partitioning S on the c values of attribute A
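
A small Python sketch of the two formulas follows. The function names are assumptions; the gain values 0.048 (Wind) and 0.94 (a hypothetical date-like attribute whose gain equals Entropy(S)) are taken from the earlier slides, and the partition sizes are 8/6 for Wind and fourteen singletons for the date-like attribute.

```python
import math

def split_information(sizes):
    """Entropy of S with respect to the partition sizes, not the class labels."""
    total = sum(sizes)
    return sum(-(s / total) * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    si = split_information(sizes)
    # Split information is 0 when the attribute takes a single value,
    # which leaves the ratio undefined; NaN is used here to signal that.
    return gain / si if si > 0 else float("nan")

wind_ratio = gain_ratio(0.048, [8, 6])      # two fairly even partitions
date_ratio = gain_ratio(0.94, [1] * 14)     # hypothetical: 14 singleton partitions
print(f"GainRatio(Wind) = {wind_ratio:.3f}")
print(f"GainRatio(date) = {date_ratio:.3f}")
```

The date-like attribute is divided by log2(14), roughly 3.81, so its score shrinks sharply, while Wind, with split information close to 1, is almost unchanged.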

SLIDE 22

Measuring Inequality

  • Lorenz curve
    – X-axis: quintiles
    – Y-axis: cumulative share of income earned by the plotted quintile
    – Gap between the actual lines and the mythical line: the degree of inequality
  • Gini index
    – Gini = 0: even distribution
    – Gini = 1: perfectly unequal
    – The greater the distance, the more unequal the distribution

SLIDE 23

Gini Index (Adjusted)

  • A data set T contains examples from n classes
    – p_j is the relative frequency of class j in T

\[ \mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2 \]

  • T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively (N = N_1 + N_2)

\[ \mathrm{gini}_{\mathrm{split}}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2) \]

  • The attribute that provides the smallest gini_split(T) is chosen to split the node
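
As an illustration, the Python sketch below evaluates both formulas on [Yes, No] counts tallied from the PlayTennis training set for the two binary attributes. The function names are assumptions.

```python
def gini(counts):
    """gini(T) = 1 - sum of squared relative class frequencies."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left, right):
    """Size-weighted gini of a binary split into two subsets."""
    n1, n2 = sum(left), sum(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)

whole = gini([9, 5])                       # 9 Yes, 5 No
wind = gini_split([6, 2], [3, 3])          # Weak, Strong
humid = gini_split([3, 4], [6, 1])         # High, Normal
print(f"gini(T) = {whole:.3f}")
print(f"gini_split by Wind  = {wind:.3f}")
print(f"gini_split by Humid = {humid:.3f}")
```

Of these two candidate splits, Humid yields the smaller gini_split and would be preferred under the rule above.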

SLIDE 24

Extracting Classification Rules

  • Classification rules can be extracted from a decision tree
  • Each path from the root to a leaf → an IF-THEN rule
    – All attribute-value pairs along a path form a conjunctive condition
    – The leaf node holds the class prediction
    – IF age = “<=30” AND student = “no” THEN buys_computer = “no”
  • Rules are easy to understand
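
Rule extraction amounts to enumerating root-to-leaf paths. The Python sketch below does this for the PlayTennis tree from the Decision Tree slide rather than the buys_computer example; the nested-tuple encoding and the name `extract_rules` are assumptions.

```python
# Internal node: (attribute, {value: subtree}); leaf: a class label string.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):  # leaf: the path so far is the rule's condition
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions) or "TRUE"
        return [f'IF {body} THEN PlayTennis = "{node}"']
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules += extract_rules(child, conditions + ((attribute, value),))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, each a conjunction of the tests along its path.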
SLIDE 25

Inductive Bias

  • The set of assumptions that, together with the training data, deductively justifies the classification of future instances
    – Preferences of the classifier construction
  • Shorter trees are preferred over longer trees
  • Trees that place high information gain attributes close to the root are preferred

SLIDE 26

Why Prefer Short Trees?

  • Occam’s razor: prefer the simplest hypothesis that fits the data
  • There are fewer short trees than long trees
  • A short tree is less likely to be a statistical coincidence

“One should not increase, beyond what is necessary, the number of entities required to explain anything”
    – Also known as the principle of parsimony

SLIDE 27

Overfitting

  • A decision tree T may overfit the training data
    – if there exists an alternative tree T’ such that T has a higher accuracy than T’ over the training examples, but T’ has a higher accuracy than T over the entire distribution of data
  • Why overfitting?
    – Noisy data
    – Bias in training data

[Figure: T and T’ compared on the training data and on all data]

SLIDE 28

Avoid Overfitting

  • Prepruning: stop growing the tree earlier
    – Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a “fully grown” tree
    – Use an independent set of data to prune
  • Key: how to determine the correct final tree size

SLIDE 29

Determine the Final Tree Size

  • Separate training (2/3) and testing (1/3) sets
  • Use cross validation, e.g., 10-fold cross validation
  • Use all the data for training
    – Apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve accuracy over the entire distribution
  • Use the minimum description length (MDL) principle
    – Halt growth of the tree when the encoding is minimized
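
The cross-validation option can be sketched in plain Python. Everything named here is an assumption for illustration: `k_fold_indices`, `cross_val_accuracy`, and the `fit`/`predict` callables stand in for whatever tree learner is used, and the demo plugs in a trivial majority-class model instead of a decision tree. The sketch assumes the number of examples is at least k.

```python
from collections import Counter

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every example is tested exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def cross_val_accuracy(examples, labels, fit, predict, k=10):
    """Average accuracy over k folds, each fold held out once for testing."""
    correct = 0
    for train, test in k_fold_indices(len(examples), k):
        model = fit([examples[j] for j in train], [labels[j] for j in train])
        correct += sum(predict(model, examples[j]) == labels[j] for j in test)
    return correct / len(examples)

# Demo with a stand-in learner that always predicts the majority training class.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]

def predict_majority(model, x):
    return model

labels = ["Yes"] * 9 + ["No"] * 5
acc = cross_val_accuracy(list(range(14)), labels, fit_majority, predict_majority, k=7)
print(f"cross-validated accuracy = {acc:.3f}")
```

To choose a tree size, one would compute this estimate for each candidate size (or pruning level) and keep the one with the best cross-validated accuracy.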

SLIDE 30

Enhancements

  • Allow for attributes of continuous values
    – Dynamically discretize continuous attributes
  • Handle missing attribute values
  • Attribute construction
    – Create new attributes based on existing ones that are sparsely represented
    – Reduce fragmentation, repetition, and replication
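
One common way to discretize a continuous attribute dynamically is to pick a threshold t and test "value <= t", choosing t by information gain. The Python sketch below illustrates this under stated assumptions: the temperature values and labels are hypothetical (not from the slides), the name `best_threshold` is made up, and candidate thresholds are restricted to midpoints between adjacent examples with different classes.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the binary test value <= threshold."""
    pairs = sorted(zip(values, labels))
    best_gain, best_t = -1.0, None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue  # only class boundaries between distinct values are candidates
        t = (v1 + v2) / 2
        left = [y for v, y in pairs if v <= t]
        right = [y for v, y in pairs if v > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        g = entropy(labels) - remainder
        if g > best_gain:
            best_gain, best_t = g, t
    return best_t, best_gain

# Hypothetical continuous attribute with class labels.
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
threshold, g = best_threshold(temps, play)
print(f"split at temperature <= {threshold} (gain {g:.3f})")
```

The resulting boolean test can then compete with the discrete attributes when the best attribute for a node is selected.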

SLIDE 31

To-Do List

  • Read Chapters 8.1-8.2
  • Figure out how to use decision trees for classification in Weka
