Statistical classification – Lecture notes


SLIDE 1

Statistical classification

Lecture notes

SLIDE 2

Naive Bayes

SLIDE 3

Bayes' theorem

P(c | f) · P(f) = P(f | c) · P(c), so P(c | f) = P(f | c) · P(c) / P(f)

• P(c | f) – probability of class c given feature(s) f – Posterior (this will be our target)
• P(f | c) – probability of feature f given class c – Likelihood (based on data)
• P(c) – probability of class c in the data – Class prior
• P(f) – probability of feature f in the data – Predictor prior (normaliser; can usually be ignored when comparing classes)

SLIDE 4

Naive Bayes

• Based on Bayes' theorem
• Simple, fast, easy to train
• Outperforms many more sophisticated algorithms
• BUT: it assumes every feature is independent (still surprisingly good)

This is where the naivety of the method comes in!

Example: for a person with flu, the model treats a runny nose and a fever as unrelated (independent) features

Big discussions on how to fix this

Applications: face recognition, spam detection, text classification, …

SLIDE 5

Multiple features

What if there is more than one feature? Assume all features are independent, so that:

P(f | c) = P(f1 | c) · P(f2 | c) · P(f3 | c) · … · P(fn | c)

In the previous example we could add taste, colour, …

SLIDE 6

Example #1 – Fruits

Class     Long   Sweet   Yellow   Total
Banana     400     350      450     500
Orange       0     150      300     300
Other      100     150       50     200

P(c | long, sweet, yellow) ∝ P(long | c) · P(sweet | c) · P(yellow | c) · P(c)

P(banana | long, sweet, yellow) ∝ 0.8 · 0.7 · 0.9 · 0.5 = 0.252
P(orange | long, sweet, yellow) ∝ 0.0 · 0.5 · 1.0 · 0.3 = 0.0
P(other | long, sweet, yellow) ∝ 0.5 · 0.75 · 0.25 · 0.2 = 0.01875
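
As a sanity check of the numbers above, here is a minimal Python sketch of the same calculation. The class totals (500, 300 and 200 out of 1000 fruits) are not stated explicitly on the slide; they are reconstructed from the priors 0.5/0.3/0.2 and the per-feature probabilities.

```python
# Minimal sketch: naive Bayes scores for the fruit table above.
# Feature counts per class, plus reconstructed class totals (500/300/200 of 1000 fruits).
counts = {
    "banana": {"long": 400, "sweet": 350, "yellow": 450, "total": 500},
    "orange": {"long": 0,   "sweet": 150, "yellow": 300, "total": 300},
    "other":  {"long": 100, "sweet": 150, "yellow": 50,  "total": 200},
}
n_total = sum(c["total"] for c in counts.values())  # 1000 fruits in total

def score(cls, features):
    """P(cls | features) up to the normaliser P(features)."""
    prior = counts[cls]["total"] / n_total
    likelihood = 1.0
    for f in features:
        likelihood *= counts[cls][f] / counts[cls]["total"]
    return prior * likelihood

for cls in counts:
    print(cls, score(cls, ["long", "sweet", "yellow"]))
# banana 0.252, orange 0.0, other 0.01875 -> classify as banana
```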

SLIDE 7

Example #2 – Numerical

Probabilistic classifier based on fruit length

Length   Class
6.8 cm   Banana
5.4 cm   Banana
6.3 cm   Banana
6.1 cm   Banana
5.8 cm   Banana
6.0 cm   Banana
5.5 cm   Banana
4.1 cm   Other
4.3 cm   Other
4.6 cm   Other
5.1 cm   Other
4.6 cm   Other
4.7 cm   Other
4.8 cm   Other

SLIDE 8

Example #2 – Numerical

Consider the length values as a numerical feature

[Figure: length distributions for Banana vs. Other]

SLIDE 9

Example #2 – Numerical

Consider the length data, divided into two groups

[Figure: length distributions for the two groups, Banana vs. Other]

SLIDE 10

Example #2 – Numerical

A new data point arrives

[Figure: the new length value marked against the Banana and Other distributions]

SLIDE 11

Gaussian distribution

Calculate the means and standard deviations; new values come from the Gaussian probability density function

              Banana   Other    Total
Mean (μ)      6.0 cm   4.6 cm   5.3 cm
Std dev (σ)   0.45     0.30     0.79

P(banana | L = 5.4) ∝ pdf(5.4 | banana) · P(banana) = 0.18
P(other | L = 5.4) ∝ pdf(5.4 | other) · P(other) = 0.019

Note:
• Remove outliers more than 3–4 standard deviations from the mean
• Other density functions can also be used
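
A minimal Python sketch of this Gaussian variant, using the length data from the earlier table. It recomputes the means and (population) standard deviations rather than using the rounded values from the table, so the posteriors come out at ≈0.19 and ≈0.02 instead of the quoted 0.18 and 0.019.

```python
import math

# Sketch: Gaussian naive Bayes on the fruit-length data.
lengths = {
    "banana": [6.8, 5.4, 6.3, 6.1, 5.8, 6.0, 5.5],
    "other":  [4.1, 4.3, 4.6, 5.1, 4.6, 4.7, 4.8],
}

def mean_std(xs):
    mu = sum(xs) / len(xs)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))  # population std dev
    return mu, sigma

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

n_total = sum(len(xs) for xs in lengths.values())
x_new = 5.4
for cls, xs in lengths.items():
    mu, sigma = mean_std(xs)
    prior = len(xs) / n_total                 # 0.5 for each class here
    print(cls, round(gaussian_pdf(x_new, mu, sigma) * prior, 3))
# banana ~0.19, other ~0.02 (the slide quotes 0.18 and 0.019 using the rounded mu and sigma)
```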

SLIDE 12

Genetic algorithm

SLIDE 13

Theory of evolution

• Evolutionary computing is inspired by the theory of evolution
• Similarly, solutions are evolved
• Exceptional at navigating huge search spaces
• Fitness is the measure used to select new solutions (offspring)
• Fit offspring have better chances to “reproduce”

SLIDE 14

Representing data

• Genetic information is encoded in binary format
• Originally, solutions were represented as binary strings
• So floats, strings and more had to be converted
• Characters can be represented by a 4-bit string
• Floats can be normalised, cut to X digits, and converted into bits (see the sketch below)
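
A small sketch of such a conversion, assuming an illustrative bit width of 10 and a value range of [0, 31] (neither is specified on the slide):

```python
# Sketch: encoding a float as a fixed-length bit string (assumed 10 bits, range [0, 31]).
N_BITS = 10
LO, HI = 0.0, 31.0

def float_to_bits(x):
    # Normalise to [0, 1], scale to the integer range, then format as binary.
    level = round((x - LO) / (HI - LO) * (2 ** N_BITS - 1))
    return format(level, f"0{N_BITS}b")

def bits_to_float(bits):
    return LO + int(bits, 2) / (2 ** N_BITS - 1) * (HI - LO)

b = float_to_bits(15.7)
print(b, bits_to_float(b))   # bit string and decoded value; precision is limited by N_BITS
```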

SLIDE 15

Iterative process

1. Initialise the first population
2. Calculate the fitness of each solution
3. Selection – the best solutions are kept
4. Crossover – create new solutions from the best solutions
5. Mutation – add random variations to solutions (with a very low probability)
6. Repeat from step 2 until a termination condition is met (see the sketch below)
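
A skeleton of this loop in Python. The population size, number of survivors, mutation rate and generation count are illustrative choices, and `random_solution`, `fitness`, `crossover` and `mutate` are problem-specific callables supplied by the user (higher fitness is assumed to be better):

```python
import random

# Sketch of the iterative GA loop; constants and callables are assumptions, not from the slides.
POP_SIZE, N_KEEP, MUTATION_RATE, N_GENERATIONS = 20, 10, 0.01, 100

def evolve(random_solution, fitness, crossover, mutate):
    population = [random_solution() for _ in range(POP_SIZE)]           # 1. initialise
    for _ in range(N_GENERATIONS):                                      # 6. repeat until termination
        population.sort(key=fitness, reverse=True)                      # 2. fitness
        parents = population[:N_KEEP]                                   # 3. selection: keep the best
        children = [crossover(*random.sample(parents, 2))
                    for _ in range(POP_SIZE - N_KEEP)]                  # 4. crossover
        children = [mutate(c) if random.random() < MUTATION_RATE else c
                    for c in children]                                  # 5. mutation (low probability)
        population = parents + children
    return max(population, key=fitness)
```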

SLIDE 16

Initial population

Initialise the population:
• A good initial population = better solutions
• Most commonly the initial solutions are random
• Initial guesses may also be used
• The keyword is diversity

Many metrics exist to evaluate this:
• Grefenstette bias
• Gene-level entropy
• Chromosome-level neighborhood metric
• Population-level center of mass
• etc.

SLIDE 17

Fitness

Fitness calculation:
• Individual fitness is compared to the average

Fitness can be based on:
• Fit to data/target
• Complexity
• Computation time
• Basically anything (fitting your problem)

SLIDE 18

“Breeding” / Crossover

Parents can be selected:
• Randomly
• Roulette wheel – probabilistic, based on fitness

Swap genes between parents:
• 1-point or 2-point crossover
• Uniform / half-uniform – selected at the gene level
• (also: three-parent crossover)

Mutation swaps gene values, but with a very low probability

Termination criterion: a certain fitness of the best “parents”
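
A small sketch of 1-point crossover and per-gene mutation on bit-string chromosomes (the example chromosomes and the 1% mutation rate are illustrative, not from the slides):

```python
import random

# Sketch: 1-point crossover and mutation on bit-string chromosomes.
def one_point_crossover(parent_a, parent_b):
    point = random.randint(1, len(parent_a) - 1)       # cut point between genes
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def mutate(chromosome, rate=0.01):
    # Flip each bit independently with a very low probability.
    return "".join(b if random.random() > rate else ("1" if b == "0" else "0")
                   for b in chromosome)

child1, child2 = one_point_crossover("110100", "001011")
print(child1, child2, mutate(child1))
```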

SLIDE 19

Example – Monkeys

• Let’s consider the infinite monkey theorem
• But simplified: let’s make it write “data”
• Initial population of 3: “lync”, “deyi” and “kama”, with fitness 0, 1, 2
• Crossover could give us: “lyyi”, “dama”, “kamc”, etc., with fitness 0, 3, 1
• And so on, until we reach our target “data”, or a particular fitness
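
The fitness used here simply counts characters that match the target “data” at the same position; a tiny sketch reproducing the numbers above:

```python
# Sketch: per-position fitness for the "monkey" example (target word "data").
TARGET = "data"

def fitness(word):
    # Number of characters that match the target at the same position.
    return sum(1 for a, b in zip(word, TARGET) if a == b)

print([fitness(w) for w in ("lync", "deyi", "kama")])   # [0, 1, 2]
print([fitness(w) for w in ("lyyi", "dama", "kamc")])   # [0, 3, 1]
```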

SLIDE 20

Example – Real numbers

• Evaluate x to find the lowest point of f(x) = 3x − x²/10
• Fitness: compare the model to observations
• Crossover: select a random β ∈ [0, 1] and parents m, n:
  x′1 = (1 − β)·x_m + β·x_n
  x′2 = (1 − β)·x_n + β·x_m
• For multi-dimensional problems: select one feature (x, y, z, …) at random and change only that one, keeping the others static
• Mutation: replace a parameter with a random value from [0, 31] (with a low probability)
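
A small sketch of this blend crossover and mutation for the real-valued case (the parent values and the mutation rate are illustrative):

```python
import random

# Sketch: blend crossover from the slide, beta in [0, 1] mixes parents x_m and x_n.
def f(x):
    return 3 * x - x ** 2 / 10           # function being optimised

def blend_crossover(x_m, x_n):
    beta = random.random()                # beta in [0, 1]
    child1 = (1 - beta) * x_m + beta * x_n
    child2 = (1 - beta) * x_n + beta * x_m
    return child1, child2

def mutate(x, rate=0.05):
    # With a low probability, replace the parameter with a random value in [0, 31].
    return random.uniform(0, 31) if random.random() < rate else x

c1, c2 = blend_crossover(4.0, 20.0)
print(c1, c2, f(c1), f(c2))
```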

SLIDE 21

Decision trees

SLIDE 22

Decision trees

• Fast to train, easy to evaluate
• Splits data into increasingly smaller subsets in a tree structure
• Boolean logic traces a path through the tree
• Consider it an extensive version of the game 20 Questions
• Also: classification trees & regression trees – some similarities, but also differences, such as the splitting method
• In regression, the standard deviation is minimised to choose the split

SLIDE 23

Decision trees

Advantages:
• Very easy to visualise results
• Simple to understand and use
• Handles both numerical and categorical data
• … and both small and large data sets

Disadvantages:
• Small changes can severely affect results
• Tend not to be as accurate as other methods
• Categorical variables with many levels are favoured in splits

SLIDE 24

Example – Gotham

Compile a list of some “random” people in Gotham for Santa

Name           Sex   Mask   Cape   Tie   Smokes   Class
Batman         M     Yes    Yes    No    No       Good
Robin          M     Yes    Yes    No    No       Good
Alfred         M     No     No     Yes   No       Good
Penguin        M     No     No     Yes   Yes      Bad
The Joker      M     No     No     Yes   No       Bad
Harley Quinn   F     No     No     No    No       Bad

SLIDE 25

Example – Gotham

We can create an example tree like this, skipping some features

SLIDE 26

Example – Gotham

We can create an example tree like this, skipping some features
How can we make it better?
Pretty sure he is bad!

SLIDE 27

Building up a Decision Tree

• The top-most node corresponds to the best predictor
• Too many features – too complex a tree structure (overfit)
• Too few features – might not even fit the data (like in the example)
• Occam’s razor: the more assumptions you make, the more unlikely the explanation => as simple as possible, but not simpler

SLIDE 28

Building up a Decision Tree

Setup:
1. Identify the attribute (or value) leading to the best split
2. Create child nodes from the split
3. Recursively iterate through all child nodes until a termination condition is met
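
A minimal sketch of this recursive build-up on a cut-down version of the Gotham data (only the Mask and Tie features). The splitting criterion here is a simple majority-class purity, chosen for brevity; the purity measures actually used by the named algorithms follow on the next slides.

```python
from collections import Counter

# Sketch: recursive tree building on categorical attributes (rows are dicts of attribute values).
def purity(labels):
    return Counter(labels).most_common(1)[0][1] / len(labels)   # fraction of the majority class

def best_split(rows, labels, attributes):
    def weighted_purity(attr):
        total = 0.0
        for value in set(r[attr] for r in rows):
            sub = [l for r, l in zip(rows, labels) if r[attr] == value]
            total += len(sub) / len(labels) * purity(sub)
        return total
    return max(attributes, key=weighted_purity)                  # 1. attribute with the best split

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1 or not attributes:                  # terminate: pure node / nothing left
        return Counter(labels).most_common(1)[0][0]              # leaf = majority class
    attr = best_split(rows, labels, attributes)
    children = {}
    for value in set(r[attr] for r in rows):                     # 2. one child node per value
        sub = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = map(list, zip(*sub))
        children[value] = build_tree(sub_rows, sub_labels,
                                     [a for a in attributes if a != attr])   # 3. recurse
    return {attr: children}

rows = [{"Mask": "Yes", "Tie": "No"},  {"Mask": "Yes", "Tie": "No"}, {"Mask": "No", "Tie": "Yes"},
        {"Mask": "No",  "Tie": "Yes"}, {"Mask": "No",  "Tie": "Yes"}, {"Mask": "No", "Tie": "No"}]
labels = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]
print(build_tree(rows, labels, ["Mask", "Tie"]))
# {'Mask': {'Yes': 'Good', 'No': {'Tie': {'Yes': 'Bad', 'No': 'Bad'}}}}
```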

SLIDE 29

Building up a Decision Tree

“Divide-and-conquer” algorithms
Greedy strategies – split based on an attribute test that selects the local optimum, preferring homogeneous distributions
They differ in: splitting criterion, methods to reduce overfitting (pruning), handling of incomplete data, and support for regression/classification
Notable examples:
✦ Hunt’s algorithm (one of the earliest)
✦ ID3 – entropy, missing values, pruning, outliers
✦ C4.5 – entropy, missing values, error-based pruning, outliers
✦ CART – Gini impurity, classification & regression, missing values, outliers
✦ Others: CHAID (chi²), MARS, SLIQ, SPRINT, …

SLIDE 30

Building up a Decision Tree

The feature to split on is selected based on “purity” – the fewest different classes in the resulting nodes

For p_i = |D_i| / |D|, where D_i is the number of points in class i:

Gini impurity (CART, SLIQ, SPRINT, …):
Gini = 1 − Σ_i p_i²

Misclassification error:
Error = 1 − max(p_i)
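
A short sketch of both measures, evaluated on the class labels in a node (the example node is the root of the Gotham data, three Good and three Bad):

```python
from collections import Counter

# Sketch: the two purity measures above, computed from class proportions in a node.
def class_probs(labels):
    return [n / len(labels) for n in Counter(labels).values()]

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probs(labels))

def misclassification_error(labels):
    return 1.0 - max(class_probs(labels))

node = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]   # root node of the Gotham data
print(gini(node), misclassification_error(node))        # 0.5 0.5
```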

SLIDE 31

Building up a Decision Tree

The feature to split on is selected based on “purity” – the fewest different classes in the resulting nodes

For p_i = |D_i| / |D|, where D_i is the number of points in class i:

Entropy (ID3, C4.5, …):
Entropy = − Σ_i p_i log2(p_i)

Information gain compares impurity between the parent and child nodes – it measures the reduction in entropy from the split:
Gain = Entropy(parent) − Σ_children (n_child / n_total) · Entropy(child)
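
A short sketch of entropy and information gain; the example split is the Gotham data split on Cape (two Good on one side, one Good and three Bad on the other):

```python
import math
from collections import Counter

# Sketch: entropy and the information gain of a split (parent vs. weighted children).
def entropy(labels):
    probs = [n / len(labels) for n in Counter(labels).values()]
    return -sum(p * math.log2(p) for p in probs)

def information_gain(parent, children):
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

# Gotham example: split on "Cape" -> {Yes: 2 Good} and {No: 1 Good, 3 Bad}
parent   = ["Good"] * 3 + ["Bad"] * 3
children = [["Good", "Good"], ["Good", "Bad", "Bad", "Bad"]]
print(round(information_gain(parent, children), 3))     # ~0.459
```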

SLIDE 32

Building up a Decision Tree

Types of attributes to split on:
• Binary (Yes/No, Case #1/#2, …)
• Nominal/ordinal with many values (small, medium, large) – can be binned to become binary, else no single optimum split point is needed
• Continuous numerical values, such as height or temperature – can be made binary using a split point (e.g. T > 100 degrees); instead of brute force, sort the values and select the best split point (see the sketch below)
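
A small sketch of the sort-and-scan idea for a continuous attribute, using misclassification count as the impurity (an assumption) and the fruit-length data from Example #2:

```python
from collections import Counter

# Sketch: choose a split point for a continuous feature by sorting and scanning midpoints.
def misclassified(labels):
    return len(labels) - max(Counter(labels).values()) if labels else 0

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))                    # sort by feature value
    best_threshold, best_errors = None, len(labels) + 1
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2     # candidate: midpoint of sorted values
        left  = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        errors = misclassified(left) + misclassified(right)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

# Fruit-length example: a threshold of 5.25 cm separates the two classes perfectly.
lengths = [6.8, 5.4, 6.3, 6.1, 5.8, 6.0, 5.5, 4.1, 4.3, 4.6, 5.1, 4.6, 4.7, 4.8]
classes = ["Banana"] * 7 + ["Other"] * 7
print(best_split_point(lengths, classes))                   # 5.25
```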

SLIDE 33

Building up a Decision Tree

But when to stop?
• All instances in a node have the same class
• All instances have identical attribute values
• A certain “depth” is reached
• The instances are independent of the available features (e.g. a chi² test)
• A further split does not improve purity
• Not enough data

SLIDE 34

Decision trees – Issues

• Tree replication problem: the same subtree can appear at different branches
• Irrelevant data and noise make trees unstable => run several iterations
• Post-processing: prune the tree to avoid overfitting, or to simplify it

SLIDE 35

Ensembles of decision trees

• Some techniques produce an ensemble of decision trees – e.g. Random Forest
• Idea: many weak predictors make one strong predictor
• Example: you want to know whether you’ll like a movie. You could ask one friend, who needs in-depth questions to know your preferences… or ask many friends, each with a smaller list of questions, and then “merge” the results
• The trees select features (to split on) and data at random
• They then continue as normal, with no pruning
• The final result is based on the majority vote, or the average, of the combined classifiers
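
As a brief illustration, a Random Forest on the Gotham data using scikit-learn (assuming scikit-learn is available; the 0/1 encoding of Sex and Yes/No is ours, not from the slides):

```python
from sklearn.ensemble import RandomForestClassifier

# Sketch: ensemble of trees on the encoded Gotham data (columns: Sex=M, Mask, Cape, Tie, Smokes).
X = [[1, 1, 1, 0, 0],   # Batman
     [1, 1, 1, 0, 0],   # Robin
     [1, 0, 0, 1, 0],   # Alfred
     [1, 0, 0, 1, 1],   # Penguin
     [1, 0, 0, 1, 0],   # The Joker
     [0, 0, 0, 0, 0]]   # Harley Quinn
y = ["Good", "Good", "Good", "Bad", "Bad", "Bad"]

# Each tree sees a random subset of the data (bagging) and random features per split.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[1, 1, 1, 0, 0]]))   # majority vote of the 100 trees
```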

SLIDE 36

Ensembles of decision trees

• This grouping of trees is called bagging – each tree is trained on a random data subset
• The idea is that issues of individual trees can be “washed away” by the many trees
• Applicable to many ML problems
• Can work in parallel (spread the problem over multiple computers)

But:
• Results can be difficult to interpret
• For regression, they overfit to the data (they do not extrapolate beyond the training data)
• Overfits noisy data
• Not ideal for big data – extremely demanding