

SLIDE 1

7: Catchup Session & very short intro to other classifiers

Machine Learning and Real-world Data (MLRD)
Paula Buttery
Lent 2018

SLIDE 2

What happens in a catchup session?

Lecture and practical session as normal. New material is non-examinable. Time for you to catch up or attempt some starred ticks. Demonstrators help as per usual.

SLIDE 3

Naive Bayes is a probabilistic classifier

Given a set of input features, a probabilistic classifier provides a distribution over the classes. That is, for a set of observed features O and classes c1, ..., cn ∈ C, it gives P(ci|O) for all ci ∈ C.

For us, O was the set of all the words in a review {w1, w2, ..., wn}, where wi is the ith word in the review, and C = {POS, NEG}.

We decided on a single class by choosing the one with the highest probability given the features:

ĉ = argmax_{c ∈ C} P(c|O)
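A minimal Python sketch of this decision rule (not the course code; the word log-probabilities below are invented for illustration, and the fixed floor for unseen words is a crude stand-in for proper smoothing):

```python
import math

# Illustrative (made-up) log-probabilities: log P(c) and log P(w|c)
log_prior = {"POS": math.log(0.5), "NEG": math.log(0.5)}
log_likelihood = {
    "POS": {"great": -2.0, "boring": -5.0, "plot": -3.0},
    "NEG": {"great": -4.5, "boring": -2.2, "plot": -3.1},
}

def classify(words):
    """Return argmax_c P(c|O), scored as log P(c) + sum_i log P(w_i|c)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c].get(w, -10.0)  # crude floor for unseen words
            for w in words
        )
    return max(scores, key=scores.get)

print(classify(["great", "plot"]))   # -> POS
print(classify(["boring", "plot"]))  # -> NEG
```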

SLIDE 4

An SVM is a popular non-probabilistic classifier

A Support Vector Machine (SVM) is a non-probabilistic binary linear classifier:
• SVMs assign new examples to one category or the other
• SVMs can reduce the amount of labelled data required to gain good accuracy
• A linear SVM can be considered a baseline for non-probabilistic approaches
• SVMs can be efficiently adapted to perform non-linear classification

SLIDE 5

SVMs find hyperplanes that separate classes

Our classes exist in a multidimensional feature space. A linear classifier will separate the points with a hyperplane.

SLIDE 6

SVMs find a maximum-margin hyperplane in noisy data

There are many possible hyperplanes. SVMs find the best hyperplane: the one whose distance to the nearest data point from each class is maximised, i.e. the hyperplane that passes through the widest possible gap (which hopefully helps to avoid over-fitting). A sketch of this idea follows below.
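A small sketch using scikit-learn (my own toy example, not part of the course materials). The support vectors it reports are exactly the points nearest the hyperplane, which is what defines the margin:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D points from two linearly separable classes (made up)
X = np.array([[1, 1], [2, 1], [1, 2],      # class 0
              [4, 4], [5, 4], [4, 5]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# w.x + b = 0 is the separating hyperplane; the margin width is 2/||w||
w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane normal:", w, "intercept:", b)
print("margin width:", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)  # only these points matter
```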

SLIDE 7

SVMs can be very efficient and effective

• Efficient when learning from a large number of features (good for text)
• Effective even with relatively small amounts of labelled data (we only need the points close to the plane to calculate it)
• We can choose how many points to involve (the size of the margin) when calculating the plane (tuning vs. over-fitting)
• Can separate non-linear boundaries by increasing the feature space, using a kernel function (see the sketch below)
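To illustrate the last point, a hedged sketch (my own toy example, not from the course) of how a kernel lets an SVM separate a boundary that no straight line in the original 2-D space can:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line in 2-D separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit feature map

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("RBF kernel accuracy:", rbf.score(X, y))        # near 1.0
```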

SLIDE 8

Choice of classifier will depend on the task

Comparison of an SVM and Naive Bayes on the same task: 2000 IMDb movie reviews, 400 kept for testing, preprocessed with the improved tokeniser (lowercased, uninformative words removed, punctuation dealt with, words lemmatised).

                    SVM    Naive Bayes
Accuracy on train   0.98   0.96
Accuracy on test    0.84   0.80

But from Naive Bayes I know that character, good, story, great, ... are informative features; SVMs are more difficult to interpret (a sketch of reading such features off Naive Bayes follows below).
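Reading informative features off Naive Bayes is straightforward because the model is just per-class word probabilities. A sketch with assumed scikit-learn tooling (the tiny corpus here is invented; the course practicals used their own implementation):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus; in the practicals this would be the review data
docs = ["good story great character", "great film good plot",
        "boring waste bad", "bad acting boring plot"]
labels = np.array([1, 1, 0, 0])  # 1 = POS, 0 = NEG

vec = CountVectorizer()
X = vec.fit_transform(docs)
nb = MultinomialNB().fit(X, labels)

# Words whose log P(w|POS) - log P(w|NEG) is largest are most POS-indicative
vocab = np.array(vec.get_feature_names_out())
diff = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]
print("most POS-indicative:", vocab[np.argsort(diff)[::-1][:3]])
```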

SLIDE 9

Decision trees can be used to visually represent classifications

[Figure: decision tree fitted to the 1600 training reviews (799 neg, 801 pos). Each node shows its splitting feature and threshold (the root splits on bad <= 0.0154), its entropy, sample count, class counts as value = [neg, pos], and majority class; e.g. the root's daughters are value = [364, 637] (entropy 0.9457, class pos) and value = [435, 164] (entropy 0.8469, class neg). Lower splits include waste, boring, job, performance, excellent, decent, poorly, great, stupid, ...]

• Simple to interpret
• Can mix numerical and categorical data
• You specify the parameters of the tree (maximum depth, number of items at leaf nodes; both change accuracy — see the sketch below)
• But finding the optimal decision tree can be NP-complete
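A sketch of fitting and printing a small tree with scikit-learn (assumed tooling, with invented stand-in data; the tree in the figure above was fitted to the review word frequencies). max_depth and min_samples_leaf are the two parameters mentioned above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset; both parameters trade accuracy against over-fitting
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_leaf=5).fit(X, y)

# The fitted tree is directly readable as a set of if/else rules
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```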

SLIDE 10

Information gain can be used to decide how to split

Information gain is defined in terms of entropy H.

Entropy of a tree node:

H(n) = − Σᵢ pᵢ log₂ pᵢ

where the pᵢ are the fractions of each class at node n.

Information gain I is used to decide which feature to split on at each step in building the tree:

I(n, D) = H(n) − H(n|D)

where H(n|D) is the weighted entropy of the daughter nodes.
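These quantities are easy to compute directly. A short Python sketch (my own, checked against the root of the tree shown earlier: the root holds [799 neg, 801 pos] reviews, entropy ≈ 1.0, and the split bad <= 0.0154 produces daughters [364, 637] and [435, 164] with entropies 0.9457 and 0.8469, matching the figure):

```python
import math

def entropy(counts):
    """H(n) = -sum_i p_i log2 p_i over the class fractions at a node."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent, daughters):
    """I(n, D) = H(n) - H(n|D), with H(n|D) the weighted daughter entropy."""
    total = sum(sum(d) for d in daughters)
    weighted = sum(sum(d) / total * entropy(d) for d in daughters)
    return entropy(parent) - weighted

# Root of the tree above, split by "bad <= 0.0154"
print(entropy([799, 801]))                                   # ~1.0
print(information_gain([799, 801], [[364, 637], [435, 164]]))  # ~0.09
```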

SLIDE 11

Information gain can be used to decide how to split

[Figure: the top levels of a tree built using information gain: the root split bad <= 0.0157 (entropy 0.9999, 1600 samples, value = [809, 791]) yields daughters splitting on waste <= 0.022 (entropy 0.952) and bad <= 0.0475 (entropy 0.8309), with further splits on many, strong, great, memorable, show, move, suppose, perfect, life, ...; lower levels elided.]

SLIDE 12

Results on the 2000 movie reviews:

                    SVM    Naive Bayes   DTree (max depth 7)
Accuracy on train   0.98   0.96          0.80
Accuracy on test    0.84   0.80          0.69
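A sketch of how such a comparison can be scripted with scikit-learn (an assumption on my part, not the course code; the corpus below is an invented stand-in since the review-loading step is not shown in the slides):

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in corpus; the course used 2000 preprocessed IMDb reviews,
# with 400 held out for testing
docs = ["good story great character", "great film", "excellent plot",
        "boring waste of time", "bad acting", "poorly written"] * 50
labels = [1, 1, 1, 0, 0, 0] * 50  # 1 = POS, 0 = NEG

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=0)

vec = TfidfVectorizer()
X_train = vec.fit_transform(docs_train)
X_test = vec.transform(docs_test)

for name, clf in [("SVM", LinearSVC()),
                  ("Naive Bayes", MultinomialNB()),
                  ("DTree (max depth 7)", DecisionTreeClassifier(max_depth=7))]:
    clf.fit(X_train, y_train)
    print(f"{name}: train {clf.score(X_train, y_train):.2f}, "
          f"test {clf.score(X_test, y_test):.2f}")
```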

SLIDE 13

Classifier comparison on sample data

[Figure: decision boundaries of a range of classifiers on toy datasets, modified from the scikit-learn "Classifier comparison" example.]

SLIDE 14

Today

Come and see the lecturers if you are behind. A new topic starts on Monday, so try to have ticks 1–6 done by the end of today.