Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions: now and later.

WORK: We are all superheroes, because we help our customers


  1. Top 12 Languages
      code  language    tweets
      id    Indonesian  3548
      en    English     1804
      tl    Tagalog     733
      es    Spanish     329
      so    Somali      305
      ja    Japanese    300
      pt    Portuguese  262
      ar    Arabic      256
      nl    Dutch       150
      it    Italian     137
      sw    Swahili     118
      fr    French      92
      I guarantee you people aren’t tweeting at me in Swahili.

  2. Language Detection Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.

  3. Language Detection: Text_LanguageDetect (PEAR) / textcat (PECL)
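To make the character N-gram idea concrete, here is a minimal Python sketch. It is not how Text_LanguageDetect or textcat are implemented; it only illustrates the general technique of ranking character trigrams and comparing ranks ("out-of-place" distance). The sample texts, function names, and profile size are assumptions made up for this illustration, and real profiles need far more text.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Return the most frequent character n-grams of `text`, most common first."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

# Hypothetical, tiny training samples (assumption, for illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego duerme",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}

def detect(text):
    """Return (language, distance) for the profile closest to `text`."""
    doc = ngram_profile(text)
    return min(((lang, out_of_place(doc, prof)) for lang, prof in PROFILES.items()),
               key=lambda pair: pair[1])

print(detect("the dog jumps over the fox")[0])
```

The distance is only a score, not a probability, which is one reason a detector like this has its own error rate and needs post-processing.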

  4. EnglishNotEnglish
      ✓ Clean up text (remove mentions, links, etc.)
      ✓ Run language detection
      ✓ If unknown/low weight, pretend it’s English, else:
      ✓ If not a character set-determined language, try harder:
      ✓ Tokenize into words
      ✓ Difference with English vocabulary
      ✓ If words remain, run parts-of-speech tagger on each
      ✓ For NNS, VBZ, and VBD run stemming algorithm
      ✓ If result is in English vocabulary, remove from remaining
      ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words)
      ✓ If ratio < 20%, pretend it’s English
      A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
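The last steps of the checklist above (vocabulary difference and unusual_word_ratio) can be sketched in Python as follows. This is a simplified illustration, not the original tool: the parts-of-speech tagging and stemming steps are left out, the tiny vocabulary is a hypothetical stand-in for a real English word list, and the function names are invented here.

```python
import re

# Hypothetical stand-in for a real English vocabulary (assumption).
ENGLISH_VOCAB = {"i", "love", "this", "library", "it", "is", "great", "the", "a"}

def clean(text):
    """Remove @mentions and links before tokenizing."""
    return re.sub(r"(@\w+|https?://\S+)", " ", text)

def unusual_word_ratio(text, vocab=ENGLISH_VOCAB):
    """size(remaining) / size(words), where remaining = words not in the vocabulary."""
    words = re.findall(r"[a-z']+", clean(text).lower())
    if not words:
        return 0.0
    remaining = [w for w in words if w not in vocab]
    return len(remaining) / len(words)

def pretend_english(text, threshold=0.20):
    """Apply the 20% cut-off from the checklist."""
    return unusual_word_ratio(text) < threshold

print(pretend_english("@a I love this library http://example.com"))
print(pretend_english("@a saya suka sekali library ini"))
```

In a fuller version, the tagger and stemmer would shrink the remaining list before the ratio is computed, so inflected English words are not counted as unusual.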

  5. BINARY CLASSIFICATION: Grunt work. Built a web-based tool to display tweets a page at a time and select the good ones.

  6. INPUT: feature vectors. OUTPUT: labels (good/bad). Had my input and output.

  7. BIAS CORRECTION: One more thing to address.

  8. BIAS CORRECTION: 99% of the labeled tweets were bad (fewer than 100 tweets were good). Training a model as-is would not produce good results, so the bias needs to be adjusted.

  9. BIAS CORRECTION (chart labels: GOOD, BAD)

  10. OVERSAMPLING: Use multiple copies of good tweets to equalize with bad. Problem: the bias is very high, so each good tweet would have to be copied about 100 times, and the copies would not contribute any variance to the good category.
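A minimal Python sketch of plain oversampling shows the problem. The tweet lists and the `oversample` helper are hypothetical placeholders, not the code used for the talk.

```python
import random

def oversample(minority, majority, seed=0):
    """Duplicate minority examples (sampling with replacement) until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra

# Hypothetical data: 3 good tweets against 300 bad ones.
good = [f"good-{i}" for i in range(3)]
bad = [f"bad-{i}" for i in range(300)]

balanced_good = oversample(good, bad)
print(len(balanced_good), len(set(balanced_good)))  # prints: 300 3
```

The class sizes are now equal, but the good class still holds only the original distinct examples, so no new variance has been added.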

  12. UNDERSAMPLING: Drop most of the bad tweets to equalize with good. Problem: the total corpus ends up being fewer than 200 tweets, which is not enough for training.
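The same point for undersampling, as a small Python sketch. The counts (90 good, 9000 bad) are hypothetical numbers chosen to match the roughly 99% bad ratio, and `undersample` is an illustrative helper, not the original code.

```python
import random

def undersample(majority, minority, seed=0):
    """Keep a random subset of the majority class, the same size as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, k=min(len(majority), len(minority)))

# Hypothetical data: 90 good tweets against 9000 bad ones.
good = [f"good-{i}" for i in range(90)]
bad = [f"bad-{i}" for i in range(9000)]

kept_bad = undersample(bad, good)
print(len(good) + len(kept_bad))  # prints: 180
```

Balancing this way throws away almost all of the labeled data, leaving a corpus of under 200 tweets.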

  14. Synthetic OVERSAMPLING: Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.

  15. chance  feature
      90%     “good” language
      70%     no hashtags
      25%     1 hashtag
      5%      2 hashtags
      2%      @a at the end
      85%     rand length > 10
      The actual synthesis is somewhat more complex and was also trial-and-error based. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
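A weighted random selection using the chances listed above might look like the Python sketch below. As noted, the actual synthesis is more complex; the feature names, the dictionary layout, and the length ranges (11 to 140, otherwise 1 to 10) are assumptions made for this illustration, and "85% rand length > 10" is read here as "with 85% chance, pick a random length above 10".

```python
import random

def synthesize_good_vector(rng):
    """Draw one synthetic 'good tweet' feature vector from the listed chances."""
    return {
        # 90% chance the tweet is in a "good" language
        "good_language": int(rng.random() < 0.90),
        # 70% no hashtags, 25% one hashtag, 5% two hashtags
        "hashtags": rng.choices([0, 1, 2], weights=[70, 25, 5])[0],
        # 2% chance of "@a" at the end
        "mention_at_end": int(rng.random() < 0.02),
        # 85% chance of a random length above 10 (ranges are assumptions)
        "length": rng.randint(11, 140) if rng.random() < 0.85 else rng.randint(1, 10),
    }

rng = random.Random(42)
vectors = [synthesize_good_vector(rng) for _ in range(2000)]
print(vectors[0])
```

Each call produces a different vector, so the synthesized examples add variety to the good class instead of repeating the same few tweets.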

  19. Example of a synthesized feature vector (selected value in the first column):
              chance  feature
      1       90%     “good” language
              70%     no hashtags
      2       25%     1 hashtag
              5%      2 hashtags
      0       2%      @a at the end
      77      85%     rand length > 10

  20. Model Training: We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?

  21. COST FUNCTION: Measures how far the prediction of the system is from reality. The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.

  22. COST FUNCTION (diagram labels: REALITY, PREDICTION)

  23. COST FUNCTION

      $$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}),\, y^{(i)}\right)$$

  24. LOGISTIC COST

      $$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$
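The logistic cost and the averaged cost J(θ) translate almost directly into Python. The sketch below assumes the usual logistic-regression hypothesis, a sigmoid of the weighted feature sum, which the slides do not spell out; the function names and the toy data are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    """h_theta(x): sigmoid of the dot product (assumed form of the hypothesis)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(h, y):
    """Logistic cost for a single prediction h in (0, 1) and a label y in {0, 1}."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

def J(theta, X, Y):
    """Average cost over the m training examples."""
    return sum(cost(hypothesis(theta, x), y) for x, y in zip(X, Y)) / len(X)

# Toy data: the first feature is a constant 1 acting as the bias term.
X = [[1.0, 2.0], [1.0, 3.0]]
Y = [1, 0]
print(J([0.0, 0.0], X, Y))   # with all-zero theta every prediction is 0.5
print(cost(0.99, 1), cost(0.01, 1))
```

A confident correct prediction costs close to nothing, while a confident wrong one is penalized heavily, which is the behaviour the cost function is designed to have.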

  25. LOGISTIC COST (two plots, for y=1 and y=0, with h(x) running from 0 to 1): correct guess, cost = 0; incorrect guess, cost = huge. When y=1 and h(x) is 1 (good guess), the cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. The same holds, mirrored, for y=0.

  26. Minimize cost over θ: find the values of θ that minimize the cost.

  27. GRADIENT DESCENT: Start from a random point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down a hill.
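The "take a step downhill, repeat" loop can be sketched in Python for the logistic cost. This is a plain batch gradient descent under assumptions of my own: a sigmoid hypothesis, a fixed learning rate `alpha`, a fixed number of steps, and a tiny made-up data set. It is not the training code from the talk.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def gradient_descent(X, Y, alpha=0.5, steps=5000, seed=0):
    """Minimize the average logistic cost by repeatedly stepping against the gradient."""
    rng = random.Random(seed)
    theta = [rng.uniform(-1.0, 1.0) for _ in X[0]]  # random starting point
    m = len(X)
    for _ in range(steps):
        preds = [predict(theta, x) for x in X]
        # Partial derivative of the average cost with respect to each theta_j
        grad = [sum((p - y) * x[j] for p, y, x in zip(preds, Y, X)) / m
                for j in range(len(theta))]
        # Step in the direction of steepest descent
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up data: the first feature is a constant 1 (bias), the second is the real input.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 3.0], [1.0, 3.5]]
Y = [0, 0, 1, 1]
theta = gradient_descent(X, Y)
print([round(predict(theta, x), 3) for x in X])
```

The learning rate controls the step size: too large and the "ball" overshoots the valley, too small and it takes a long time to get there.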
