Decision Trees and Naïve Bayes
3/29/17
Hypothesis Spaces
- Decision Trees and K-Nearest Neighbors: continuous inputs, discrete outputs
- Naïve Bayes: discrete inputs, discrete outputs

Building a Decision Tree
Greedy algorithm:
- Pick the split that best separates the data into sub-regions.
- Recursively build trees for the sub-regions.
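A minimal sketch of that recursion in Python (best_split and majority_label are hypothetical helpers standing in for the split selection and leaf labeling discussed below):

```python
def build_tree(data, depth=0, max_depth=5):
    # data is a list of (features, label) pairs.
    labels = [label for _, label in data]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority_label(labels)          # leaf: pure region or depth cap
    feature, threshold = best_split(data)      # hypothetical: entropy-based choice
    left = [(x, l) for x, l in data if x[feature] <= threshold]
    right = [(x, l) for x, l in data if x[feature] > threshold]
    if not left or not right:                  # no useful split found
        return majority_label(labels)
    # Recursively build trees for the two sub-regions.
    return (feature, threshold,
            build_tree(left, depth + 1, max_depth),
            build_tree(right, depth + 1, max_depth))
```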
[Figure: example decision tree classifying homes as SF or NY, splitting on elevation (thresholds a, b, c, d) and price per square foot (thresholds e, f, g)]
Key idea: minimize entropy
Entropy(S) = -Pos * log2(Pos) - Neg * log2(Neg)

where Pos and Neg are the fractions of positive and negative examples in S.

Entropy is 0 when all examples belong to one class, for example: Pos = 1 and Neg = 0.
Entropy is maximal (1 bit) when there are equal numbers of positive and negative examples: Pos = ½ and Neg = ½.
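A direct translation into Python, taking counts rather than fractions as input and treating 0 * log2(0) as 0, as is standard:

```python
import math

def entropy(pos, neg):
    # Entropy of a set with pos positive and neg negative examples.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:              # 0 * log2(0) is taken to be 0
            h -= p * math.log2(p)
    return h

entropy(1, 0)   # 0.0 -- all one class
entropy(5, 5)   # 1.0 -- evenly mixed
```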
Try all features. Try all possible splits of that feature. If feature F is discrete, the number of possible splits to consider is the number of values F could have.
If feature F is continuous, there are infinitely many possible splits to consider.

How can we avoid trying all possible splits? Binary search or local search over the split threshold.
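One common shortcut, shown here as a sketch rather than the lecture's exact method: only thresholds between consecutive distinct sorted values can change the partition, so it suffices to evaluate the midpoints. This assumes binary 0/1 labels and reuses the entropy function from the sketch above:

```python
def candidate_thresholds(values):
    # Midpoints between consecutive distinct sorted values of a
    # continuous feature -- the only thresholds worth evaluating.
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]

def best_threshold(xs, labels):
    # Pick the threshold minimizing the weighted entropy of the two sides.
    best_t, best_h = None, float("inf")
    n = len(xs)
    for t in candidate_thresholds(xs):
        left = [l for x, l in zip(xs, labels) if x <= t]
        right = [l for x, l in zip(xs, labels) if x > t]
        h = (len(left) / n) * entropy(left.count(1), left.count(0)) \
          + (len(right) / n) * entropy(right.count(1), right.count(0))
        if h < best_h:
            best_t, best_h = t, h
    return best_t
```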
Bad idea:
Better idea:
Key idea: use training data to estimate a probability for each label given an input. Classify a point as its highest-probability label.
Suppose we flip a coin 10 times and observe 7 heads and 3 tails. What do we believe the true P(H) to be? Now suppose we flip it 1000 times and observe 700 heads and 300 tails.
We need to combine our initial beliefs with data.
- Empirical frequency: P(H) = #heads / #flips
- Prior for a coin toss: P(H) = 1/2
- Add m "observations" of the prior to the data: P(H) = (#heads + m/2) / (#flips + m)
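A sketch of that m-estimate in Python, where m and the prior 1/2 are the quantities named above:

```python
def smoothed_p_heads(heads, tails, m=10, prior=0.5):
    # Combine empirical frequency with m pseudo-observations of the prior.
    return (heads + m * prior) / (heads + tails + m)

smoothed_p_heads(7, 3)       # 0.6   -- 10 real flips, pulled toward 1/2
smoothed_p_heads(700, 300)   # ~0.698 -- data dominates the prior
```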
Could we directly estimate the probability of each label? Memorizing the training data gives no generalization to new data; we need a way to estimate a probability at an unobserved test point.
Assume that all features are independent given the label. This lets us estimate probabilities for each feature separately, then multiply them together:

P(x1, ..., xn | l) = P(x1 | l) * P(x2 | l) * ... * P(xn | l)

This assumption is almost never literally true, but it makes the estimation feasible and often gives a good enough classifier.
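As a sketch, with the conditional estimates stored in a dictionary p_feat keyed by (dimension, value, label) (a hypothetical layout, reused in the sketches below), the factored estimate is a simple product:

```python
from math import prod

def p_x_given_l(x, l, p_feat):
    # Naive assumption: P(x | l) = product over dimensions i of P(x_i | l).
    # p_feat[(i, v, l)] holds the estimated P(x_i = v | l).
    return prod(p_feat[(i, xi, l)] for i, xi in enumerate(x))
```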
Given a data set consisting of (input, label) pairs, for each possible value of x_i and l we can compute an empirical frequency:

P(x_i = v | l) = count(x_i = v and label = l) / count(label = l)
To compute P(l | x) from our data, we need to estimate several quantities, counting across our data set for each possible value of each input dimension and the label:
- P(l) for each label
- P(x_i) for each dimension
- P(x_i | l) for each dimension conditional on each label
All of these are estimated empirically, with some prior (usually uniform).
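A sketch of those empirical estimates with a uniform prior, reusing the pseudo-count idea from the coin example (the table layout is the hypothetical one introduced above):

```python
from collections import Counter

def estimate_tables(data, m=1):
    # Estimate P(l) and P(x_i | l) from (x, label) pairs, adding m uniform
    # pseudo-counts per value (a uniform prior, as above).
    n_dims = len(data[0][0])
    label_counts = Counter(label for _, label in data)
    p_label = {l: c / len(data) for l, c in label_counts.items()}
    p_feat = {}
    for i in range(n_dims):
        values = {x[i] for x, _ in data}
        for l in label_counts:
            for v in values:
                count = sum(1 for x, lab in data if lab == l and x[i] == v)
                p_feat[(i, v, l)] = (count + m) / (label_counts[l] + m * len(values))
    return p_label, p_feat
```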
Given a new input x:
- Compute P(l | x) for each possible label l. Using the naïve assumption, this is estimated as (for three features):

P(l | x) = P(l) P(x1 | l) P(x2 | l) P(x3 | l) / P(x)

- Return the highest-probability label.
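Putting the pieces together, a sketch of the classification step using the p_label and p_feat tables estimated above. P(x) is dropped because it is the same for every label and does not change the argmax:

```python
def classify(x, p_label, p_feat):
    # Return the label l maximizing P(l) * prod_i P(x_i | l).
    best_label, best_score = None, -1.0
    for l, pl in p_label.items():
        score = pl
        for i, v in enumerate(x):
            score *= p_feat.get((i, v, l), 0.0)
        if score > best_score:
            best_label, best_score = l, score
    return best_label
```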