

  1. “Using a simple tool to solve a complex problem does not result in a simple solution.” – Larry Wall
Classifying unstructured text: deterministic and machine learning approaches
Stephanie Fischer, Dr. Christian Winkler
mgm (Hamburg, München, Berlin, Köln, Leipzig)
Apache Big Data, Sevilla, 15th November 2016

  2. Speakers and agenda
Speakers:
- Stephanie Fischer, Product Owner Text Analytics, Big Data, Agile & Change, mgm consulting partners
- Dr. Christian Winkler, Enterprise Architect, Big Data, Data Science, mgm technology partners
Agenda:
01 About us
02 Text statistics
03 Categories
04 Text classification
05 Conclusion and outlook

  3. 01 About us

  4. Stephanie and Christian according to their browser history

  5. 02 Text statistics

  6. Comparing word frequency of news from Reuters, Telegraph, Aljazeera
- Telegraph: 958,996 headlines over 9.5 years
- Reuters World News: 163,919 headlines over 9 years
- Aljazeera: 94,309 headlines over 8.5 years
Visualizations created with Apache Solr and D3.js; see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/

  7. 03 Categories

  8. Finding meaningful categories. Each text is different. Challenge accepted!
[Figure: texts being extracted into categories 1, 2 and 3]

  9. Comparing pre-defined categories of Al Jazeera and Reuters
[Bar charts of headline counts per category]
- Al Jazeera: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, opinion, indepth-features, indepth-inpictures, pictures, focus, blogs-americas, indepth-spotlight, blogs, blogs-asia, indepth-interactive
- Reuters: World, US Politics, Politics, World Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business

  10. … and the Telegraph categories
[Bar chart of headline counts per Telegraph category, including: world, news worldnews, sport football, news uknews, finance newsbysector, football, news politics, sport rugby, finance personalfinance, sport, sport rugbyunion, sport cricket, news health, culture tvandradio, cricket, sport othersports, finance markets, news newstopics, culture music, news earth, blogs.telegraph.co.uk blogs, culture books, sport olympics, culture film, news celebritynews, finance economics, news obituaries, sport tennis, finance comment, comment finance, news picturegalleries, finance property, news science, travel destinations, culture theatre, sport golf, sport horseracing, comment telegraph-view, comment letters, education educationnews, finance financialcrisis, comment columnists, foodanddrink recipes, www.telegraph.co.uk, sport motorsport, travel travelnews, culture art, technology news, comment personal-view, women womens-life, news religion, motoring news, education universityeducation, finance jobs, news wikileaks-files, lifestyle wellbeing]

  11. It's not easy. So…

  12. Our selection: functionally relevant, mutually exclusive categories derived from the Telegraph categories
[Bar chart of headline counts per selected category]

  13. Finding meaningful categories for the Telegraph news was fun! Let's go on and do a whole text classification experiment. Our aim is to classify 1 million Telegraph news documents with an ML algorithm. While doing this we want to find out…
- whether an ML algorithm will be able to classify the Telegraph news documents
- what steps we need to work out in order to make the ML algorithm work
Handy for us: we will be able to train the ML algorithm with the pre-classified data set of the Telegraph news!

  14. 04 Text classification

  15. Typical text classification projects and our experiment set-up
Typical set-up (no classification scheme, no classified data):
- Choose data to be classified
- Manually classify chosen data set
- Train ML algorithm with classified data set
- Apply trained ML algorithm to complete data set
- Manual QA on data set samples
Our Telegraph experiment (pre-classified documents):
- Choose data to be classified
- Get already existing classifications for chosen data
- Train ML algorithm with classified data set
- Apply trained ML algorithm to complete data set
- Automatic QA on complete data set
Advantages for us:
- No manual classification & QA necessary
- Existing classification scheme
- Playground easily set up
- Free to choose both manual data set & categories

  16. Our experiment for the next 30 minutes
Typical set-up (no classification scheme, no classified data):
- Choose data to be classified
- Manually classify chosen data set
- Train ML algorithm with classified data set
- Apply trained ML algorithm to complete data set
- Manual QA on data set samples
Our Telegraph experiment (pre-classified documents):
- Choose data to be classified
- Get already existing classifications for chosen data
- Train ML algorithm with classified data set
- Apply trained ML algorithm to complete data set
- Automatic QA on complete data set
Our aims in the next 30 minutes:
- Train & apply the ML algorithm to 1 million Telegraph news documents
- See how well ML performs

  17. The devil is in the data
This process sounds easy and very structured. The people in the audience who have already done text classification projects probably know that in reality, data can become pretty challenging. The next slides show you the process of how we classified 1 million Telegraph news documents. What is the reality we deal with? And what are good practices / our learnings?

  18. Getting started: preparing data for and executing ML
Naïve approach:
- Choose data to be classified (random selection)
- Manually classify data set (easy to classify, but expensive)
- Apply ML algorithm, get classifications
- Manual QA, measure results

  19. The result is BAD! WHY? Let's take a step back and find out: how does ML work? How can I measure its results?

  20. ML algorithm explained: Support Vector Machine (SVM)
Machine learning is linear algebra → need to discretize first
- Categories are already discrete
- More complicated for text:
  - Bag of words = detect words
  - TF/IDF matrix = use document and total frequency
Many different possible learning models:
- Support Vector Machine (most popular)
- Neural network
- Random forest
- Decision tree
[Figure: Support Vector Machine]
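The pipeline the slide describes — bag of words, TF/IDF weighting, then an SVM — can be sketched in a few lines. The talk names no library, so this assumes scikit-learn as the tooling, and the headlines and category labels are invented toy data:

```python
# Sketch: bag of words -> TF/IDF matrix -> linear SVM, using scikit-learn.
# Headlines and labels are invented toy data for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

headlines = [
    "England win the cricket test match",
    "Striker scores twice in football derby",
    "Markets rally as bank shares climb",
    "Central bank cuts interest rates again",
]
labels = ["sport", "sport", "finance", "finance"]

# TfidfVectorizer does the discretization in one step: it detects the words
# (bag of words) and weights them by term and inverse document frequency.
model = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
model.fit(headlines, labels)

predicted = model.predict(["Shares fall on weak bank earnings"])[0]
```

Swapping `LinearSVC` for one of the other learners on the slide (e.g. a random forest or naive Bayes) changes only the last pipeline step; the TF/IDF discretization stays the same.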

  21. Preparation of manually classified set
Choosing a set for manual classification:
- Select documents with highest word variability
  - Metric: word heterogeneity = number of words in all documents (minus stopwords)
  - Even distribution vs. long-tail distribution (many, many words used infrequently)
- Complicated: a knapsack-like problem → use an approximate approach (like a genetic algorithm)
- Crucial for all following tasks
1. Good situation: the manually classified data set contains all the words of the complete data set.
2. Not so good situation: the manually classified data set contains only a fraction of all the words in the complete data set.
[Figure: word grids comparing the manual set with the complete data set, common distribution vs. dictionary distribution]
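The word-heterogeneity metric from the slide — the number of distinct words across a document set, stopwords excluded — is simple to compute. A minimal sketch; the stopword list and documents are toy examples:

```python
# Word heterogeneity: count of distinct words over all documents,
# with stopwords removed. Stopword list and documents are toy examples.
STOPWORDS = {"the", "a", "in", "on", "of", "and"}

def heterogeneity(documents):
    words = set()
    for doc in documents:
        words.update(w for w in doc.lower().split() if w not in STOPWORDS)
    return len(words)

docs = ["The match in London", "A match of the season"]
score = heterogeneity(docs)  # distinct non-stopwords: match, london, season
```

A manual set chosen to maximize this score has a better chance of covering the long tail of infrequent words in the complete data set.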

  22. Intelligently choose the data set to be classified manually
Final data set available:
- Choose the training data set in a way that creates maximal word overlap with the complete data set
- With W_M = { words in training set } and W_C = { words in complete set }: find the maximum of |W_C ∩ W_M| = |W_M|
- Improved approach: choose the training set to minimize headlines with unknown words in the complete data set → find the minimum of |W_C \ W_M|
- More complicated, but worth it
Final data set not available:
- Optimize for high variability and high usage
[Figure: select documents that add new words; don't select documents that add none]
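The overlap maximization above is knapsack-like, so in practice an approximation is used. A minimal greedy sketch (not the genetic-algorithm approach the talk mentions): repeatedly pick the document that contributes the most still-uncovered words. Documents are invented toy data:

```python
# Greedy approximation of maximizing |W_C ∩ W_M|: repeatedly select the
# document that adds the most words not yet covered by the training set.
# Documents are invented toy data.
def words(doc):
    return set(doc.lower().split())

def choose_training_set(documents, k):
    chosen, covered = [], set()
    remaining = list(documents)
    for _ in range(k):
        # pick the document contributing the most new words
        best = max(remaining, key=lambda d: len(words(d) - covered))
        chosen.append(best)
        covered |= words(best)
        remaining.remove(best)
    return chosen, covered

docs = [
    "markets rally on bank shares",
    "bank shares rally",            # adds nothing new after the first doc
    "england win cricket test",
]
chosen, covered = choose_training_set(docs, 2)
```

Note how the greedy pick skips the second document entirely: it adds no new words, which is exactly the "don't select that" case on the slide.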

  23. Measure classification quality: precision and recall
Precision ("positive predictive value"):
- Precision is the probability that a (randomly selected) retrieved document is classified correctly
Recall (sensitivity or "true positive rate"):
- Recall is the probability that a (randomly selected) classified document is found
Example:
- Africa has very high precision for category "Africa", but bad sensitivity (recall)
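In terms of raw counts, the two metrics come down to two ratios. A minimal sketch; the true-positive/false-positive/false-negative counts are invented to mirror the "Africa" example (almost everything labelled Africa is correct, but many Africa stories are missed):

```python
# Precision and recall for one category, from raw counts.
# tp = true positives, fp = false positives, fn = false negatives.
def precision(tp, fp):
    # of everything the classifier put in the category, the fraction that is right
    return tp / (tp + fp)

def recall(tp, fn):
    # of everything that truly belongs in the category, the fraction that was found
    return tp / (tp + fn)

# Invented counts for a high-precision, low-recall category like "Africa"
tp, fp, fn = 90, 10, 110
p = precision(tp, fp)  # 90 / 100
r = recall(tp, fn)     # 90 / 200
```

Here 90% of retrieved "Africa" documents are correct, but more than half of the actual Africa documents were never found.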

  24. Now we know why the naïve approach of preparing data for and executing ML is not enough. Let's try the following instead…
Necessary steps to successfully apply ML:
- Define goals
- Calculate text metrics
- Choose data to be classified (use optimization → optimal set; easy to create, but expensive to classify)
- Manually classify data set → training set and test set
- Apply ML algorithm, get classifications
- Calculate precision + recall (crossfolding, use different algorithms)
- Measure quality of ML (precision and recall), manual QA on the training and test data set
- Apply ML to whole data set → better results
Attention: ML remembers words → it can only classify text with known words
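The "crossfolding, use different algorithms" step can be sketched as k-fold cross-validation over the manually classified set, scoring several learners side by side. The talk names no library, so this assumes scikit-learn; the headlines and the two-fold split are invented toy choices:

```python
# Sketch: cross-validation over the manually classified set, comparing
# two learners. Data and fold count are invented toy choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

headlines = [
    "england win cricket test",
    "striker scores in football derby",
    "rugby squad named for summer tour",
    "tennis final goes to five sets",
    "markets rally on bank shares",
    "central bank cuts interest rates",
    "pound falls against the dollar",
    "profits rise at high street retailer",
]
labels = ["sport"] * 4 + ["finance"] * 4

results = {}
for learner in (LinearSVC(), MultinomialNB()):
    pipeline = make_pipeline(TfidfVectorizer(), learner)
    # cv=2: train on one half, score on the other, and vice versa
    scores = cross_val_score(pipeline, headlines, labels, cv=2)
    results[type(learner).__name__] = scores.mean()
```

Because "ML remembers words", the held-out fold also reveals how badly the model degrades on headlines whose words it has never seen, which is the failure mode the slide warns about.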
