

  1. COMP9313: Big Data Management Classification and PySpark MLlib

  2. PySpark MLlib • MLlib is Spark’s scalable machine learning library consisting of common learning algorithms and utilities • Basic Statistics • Classification • Regression • Clustering • Recommendation System • Dimensionality Reduction • Feature Extraction • Optimization • It is more or less a Spark version of scikit-learn
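
As a hedged illustration of how one of these algorithms might be used, here is a minimal sketch of training and applying a classifier through the DataFrame-based pyspark.ml API; the two-row toy dataset, the app name, and the choice of logistic regression are assumptions made for this example, not something prescribed by the slides.

```python
# A minimal sketch of the DataFrame-based API (pyspark.ml); the toy data is made up.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Each row: a class label and a feature vector (hand-crafted here for illustration).
train = spark.createDataFrame(
    [(1.0, Vectors.dense([2.0, 1.0, 0.0])),
     (0.0, Vectors.dense([0.0, 0.0, 3.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10)   # one of MLlib's common learning algorithms
model = lr.fit(train)                 # train on the labelled DataFrame
model.transform(train).select("label", "prediction").show()
```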

  3. Classification • Classification • predicts categorical class labels • constructs a model based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data • Prediction (aka. Regression) • models continuous-valued functions, i.e., predicts unknown or missing values • Applications • medical diagnosis • credit approval • natural language processing

  4. Classification and Regression • Given a new object o, map it to a feature vector x = (x_1, x_2, …, x_d)^T • Predict the output (class label) z ∈ Z • Binary classification • Z = {0, 1} (sometimes {−1, 1}) • Multi-class classification • Z = {1, 2, …, D} • Learn a classification function • g(x): R^d → Z • Regression: g(x): R^d → R

  5. Example of Classification – Text Categorization • Given: document or sentence • E.g., A statement released by Scott Morrison said he has received advice … advising the upcoming sitting be cancelled. • Predict: Topic • Pre-defined labels: Politics or not? • How to learn the classification function? • g(x): R^d → Z • How to convert a document to x ∈ R^d (e.g., a feature vector)? • How to convert pre-defined labels to Z = {0, 1}?

  6. Example of Classification – Text Categorization • Input object: a sequence of words • Input features x • Bag of Words representation • freq(Morrison) = 2, freq(Trump) = 0, … • x = (2, 1, 0, …)^T • Class labels: Z • Politics: 1 • Not politics: −1
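
The sketch below, in plain Python, illustrates the bag-of-words mapping described above; the tiny vocabulary and the example sentence are assumptions chosen to echo the Scott Morrison example, not the course's actual feature set.

```python
from collections import Counter

# Toy vocabulary fixing the order of the dimensions in the feature vector (an assumption).
vocab = ["morrison", "statement", "trump", "advice"]

def bag_of_words(text, vocab):
    """Map a sentence to a count vector x in R^|vocab|; word positions are ignored."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

sentence = ("A statement released by Scott Morrison said he has received advice "
            "advising the upcoming sitting be cancelled")
print(bag_of_words(sentence, vocab))   # -> [1, 1, 0, 1] for this toy vocabulary
```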

  7. Convert a Problem into a Classification Problem • Input • How to generate input feature vectors • Output • Class labels • Another example: image classification • Input: a matrix of RGB values • Input features: color histogram • E.g., pixel_count(red) = ?, pixel_count(blue) = ? • Output: class labels • Building: 1 • Not building: −1
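
As a hedged sketch of the colour-histogram idea, the snippet below counts pixels by their dominant RGB channel; the 2×2 toy image and this particular binning scheme are assumptions for illustration only.

```python
import numpy as np

def colour_histogram(image):
    """Crude colour histogram: count pixels whose dominant channel is R, G or B."""
    # image: H x W x 3 array of RGB values
    dominant = image.reshape(-1, 3).argmax(axis=1)   # 0 = red, 1 = green, 2 = blue per pixel
    return np.bincount(dominant, minlength=3)        # [pixel_count(red), pixel_count(green), pixel_count(blue)]

toy_image = np.array([[[200, 10, 10], [10, 200, 10]],
                      [[10, 10, 200], [220, 30, 30]]])   # a made-up 2x2 image
print(colour_histogram(toy_image))                       # -> [2 1 1]
```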

  8. Supervised Learning • How to get g(x)? • In supervised learning, we are given a set of training examples: • D = {(x_j, z_j)}, j = 1, …, n • Independent and identically distributed (i.i.d.) assumption • A critical assumption for machine learning theory

  9. Machine Learning Terminologies • Supervised learning takes labelled data as input • #instances × #attributes matrix/table • #attributes = #features + 1 • 1 attribute (usu. the last) is for the class label • Labelled data is split into 2 or 3 disjoint subsets • Training data (used to build a classifier) • Development data (used to select a classifier) • Testing data (used to evaluate the classifier) • Output of the classifier • Binary classification: #labels = 2 • Multi-class classification: #labels > 2
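
A minimal sketch of the two/three-way split in PySpark follows; the made-up single-feature DataFrame and the 70/15/15 weights are assumptions, not a prescribed split.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("split-sketch").getOrCreate()

# A made-up labelled dataset: one feature column plus the class label as the last attribute.
labelled = spark.createDataFrame(
    [(float(i), float(i % 2)) for i in range(100)], ["feature", "label"])

# Three disjoint subsets: training, development, and testing data.
train, dev, test = labelled.randomSplit([0.7, 0.15, 0.15], seed=42)
print(train.count(), dev.count(), test.count())
```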

  10. Machine Learning Terminologies • Evaluate the classifier • False positive: not politics but classified as politics • False negative: politics but classified as not politics • True positive: politics and classified as politics • Precision = TP / (TP + FP) • Recall = TP / (TP + FN) • F1 score = 2 ⋅ Precision ⋅ Recall / (Precision + Recall)
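
These metrics can be computed directly from the confusion-matrix counts; here is a small plain-Python sketch with made-up counts.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# E.g. 3 true positives, 1 false positive, 0 false negatives (counts are made up):
print(precision_recall_f1(3, 1, 0))   # -> (0.75, 1.0, 0.857...)
```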

  11. Classification—A Two-Step Process • Classifier construction • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute • The set of tuples used for classifier construction is the training set • The classifier is represented as classification rules, decision trees, or mathematical formulae • Classifier usage: classifying future or unknown objects • Estimate the accuracy of the classifier • The known label of a test sample is compared with the classified result from the classifier • The accuracy rate is the percentage of test set samples that are correctly classified by the classifier • The test set is independent of the training set, otherwise over-fitting will occur • If the accuracy is acceptable, use the classifier to classify data tuples whose class labels are not known

  12. Classification Process 1: Preprocessing and Feature Engineering • Raw Data → Training Data

  13. Classification Process 2: Train a Classifier • Training Data → Classification Algorithms → Classifier g(x) • Predictions on the training data: 1 0 1 1 0 • Precision = 0.66, Recall = 0.66, F1 = 0.66

  14. Classification Process 3: Evaluate the Classifier • Test Data → Classifier → Predictions: 1 1 1 1 0 • Precision = 75%, Recall = 100%, F1 = 0.86
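
In pyspark.ml, this kind of test-set evaluation could look like the hedged sketch below; it assumes a fitted `model` and a labelled `test` DataFrame such as those from the earlier sketches, and uses the built-in multiclass evaluator.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# `model` is assumed to be a fitted classifier and `test` a labelled DataFrame (see earlier sketches).
predictions = model.transform(test)   # adds a "prediction" column
f1 = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="f1").evaluate(predictions)
print("F1 on the test set:", f1)
```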

  15. How to Judge a Model? • Based on training error or testing error? • Testing error • Otherwise, this is a kind of data snooping => overfitting • What if there are multiple models to choose from? • Further split a “development set” from the training set • Can we trust the error values on the development set? • Need a “large” dev set => less data for training • k-fold cross-validation

  16. k-fold cross-validation • The training data is split into k equal folds; each fold in turn is held out as the validation set while the remaining k − 1 folds are used for training, and the k error estimates are averaged
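
One way to run k-fold cross-validation with pyspark.ml is sketched below; the logistic-regression estimator, the small regParam grid, and k = 5 are assumptions for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("cv-sketch").getOrCreate()

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])   # the candidate models to choose between
        .build())
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=5)              # k = 5 folds
# cv_model = cv.fit(train)   # `train` would be a labelled DataFrame, e.g. from the earlier split
```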

  17. Text Classification • Assigning subject categories, topics, or genres • Spam detection • Authorship identification • Age/gender identification • Language identification • Sentiment analysis • … • We will do text classification in Project 2

  18. Text Classification: Problem Definition • Input • Document or sentence d • Output • Class label c ∈ {c_1, c_2, …} • Classification methods: • Naïve Bayes • Logistic regression • Support vector machines • …

  19. Naïve Bayes: Intuition • Simple (“naïve”) classification method based on Bayes rule • Relies on a very simple representation of the document: bag of words • Example review: “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!” • Reduced to word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …

  20. Naïve Bayes Classifier • Bayes’ Rule: for a document d and a class c, P(c | d) = P(d | c) P(c) / P(d) • We want to know which class is most likely: c_MAP = argmax_{c ∈ C} P(c | d)

  21. Naïve Bayes Classifier • c_MAP = argmax_{c ∈ C} P(c | d) (MAP is “maximum a posteriori”, i.e., the most likely class) • = argmax_{c ∈ C} P(d | c) P(c) / P(d) (Bayes rule) • = argmax_{c ∈ C} P(d | c) P(c) (dropping the denominator) • = argmax_{c ∈ C} P(x_1, x_2, …, x_n | c) P(c) (document d represented as features x_1..x_n) • O(|X|^n ⋅ |C|) parameters; these could only be estimated if a very, very large number of training examples was available

  22. Multinomial Naïve Bayes: Independence Assumptions • P(x_1, x_2, …, x_n | c) P(c) • Bag of Words assumption: assume position doesn’t matter • Conditional independence: assume the feature probabilities P(x_i | c_j) are independent given the class c • P(x_1, …, x_n | c) = P(x_1 | c) ⋅ P(x_2 | c) ⋅ … ⋅ P(x_n | c)

  23. Multinomial Naïve Bayes Classifier • c_MAP = argmax_{c ∈ C} P(x_1, x_2, …, x_n | c) P(c) • c_NB = argmax_{c ∈ C} P(c_j) ∏_{x ∈ X} P(x | c) • positions ← all word positions in the test document • c_NB = argmax_{c ∈ C} P(c_j) ∏_{i ∈ positions} P(x_i | c_j)
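
A plain-Python sketch of this decision rule, using log probabilities to avoid underflow, is shown below; the class priors and word probabilities are made-up numbers, not estimates from real data.

```python
import math

# Made-up parameters: class priors P(c) and per-class word probabilities P(w | c).
prior = {"politics": 0.5, "other": 0.5}
likelihood = {
    "politics": {"morrison": 0.05, "election": 0.04, "movie": 0.001},
    "other":    {"morrison": 0.001, "election": 0.005, "movie": 0.06},
}

def nb_classify(words, prior, likelihood):
    """c_NB = argmax_c log P(c) + sum over word positions of log P(x_i | c)."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c]) + sum(math.log(likelihood[c][w]) for w in words)
    return max(scores, key=scores.get)

print(nb_classify(["morrison", "election"], prior, likelihood))   # -> 'politics'
```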

  24. Learning the Multinomial Naïve Bayes Model • First attempt: maximum likelihood estimates • Simply use the frequencies in the data • P̂(c_j) = doccount(C = c_j) / N_doc • P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j), i.e., the fraction of times word w_i appears among all words in documents of topic c_j • Create a mega-document for topic j by concatenating all docs in this topic • Use the frequency of w_i in the mega-document
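
These maximum likelihood estimates amount to simple counting; the sketch below does exactly that on a tiny made-up corpus (the documents, classes, and words are all invented for illustration).

```python
from collections import Counter

# A tiny made-up training set: (document, class) pairs.
docs = [("morrison statement election", "politics"),
        ("election advice statement",   "politics"),
        ("great movie fun adventure",   "other")]

n_doc = len(docs)
# P^(c_j) = doccount(C = c_j) / N_doc
prior = {c: sum(1 for _, cj in docs if cj == c) / n_doc for c in {c for _, c in docs}}

# Mega-document per class: concatenate all docs of that class, then count words.
mega = {c: Counter() for c in prior}
for text, c in docs:
    mega[c].update(text.split())

def likelihood(w, c):
    """Maximum likelihood estimate P^(w | c) = count(w, c) / sum over w' of count(w', c)."""
    return mega[c][w] / sum(mega[c].values())

print(prior)                                 # -> {'politics': 0.666..., 'other': 0.333...}
print(likelihood("election", "politics"))    # 2 occurrences out of 6 politics words -> 0.333...
```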

  25. Problem with Maximum Likelihood • What if we have seen no training documents with the word fantastic and classified in the topic positive? • P̂(“fantastic” | positive) = count(“fantastic”, positive) / Σ_{w ∈ V} count(w, positive) = 0 • Zero probabilities cannot be conditioned away, no matter the other evidence! • c_MAP = argmax_c P̂(c) ∏_i P̂(x_i | c)
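
To make the problem concrete, here is a tiny hedged sketch in which one unseen word zeroes out the whole product; the positive-class word counts are invented for this example.

```python
# "fantastic" never occurs in any training document labelled positive, so its maximum
# likelihood estimate is 0 and the whole product collapses, regardless of the other words.
positive_counts = {"great": 3, "fun": 2, "love": 1}   # made-up counts from the positive class
total = sum(positive_counts.values())

def p_w_given_positive(w):
    # MLE: count(w, positive) / sum over w' of count(w', positive)
    return positive_counts.get(w, 0) / total

p = 1.0
for w in ["great", "fun", "fantastic"]:
    p *= p_w_given_positive(w)
print(p)   # -> 0.0, even though "great" and "fun" are strong positive evidence
```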
