 
“Using a simple tool to solve a complex problem does not result in a simple solution.” Larry Wall
Classifying unstructured text
Deterministic and machine learning approaches
Stephanie Fischer, Dr. Christian Winkler
Hamburg München Berlin Köln Leipzig
Apache Big Data Sevilla, 15th November 2016
Agenda
01 About us
02 Text statistics
03 Categories
04 Text classification
05 Conclusion and outlook

Speaker
Stephanie Fischer, Product Owner Text Analytics; Big Data, Agile & Change; mgm consulting partners
Dr. Christian Winkler, Enterprise Architect; Big Data, Data Science; mgm technology partners
01 About us
Stephanie and Christian according to their browser history
02 Text statistics
Comparing word frequency of news from Reuters, Telegraph, Aljazeera
• Telegraph: 958,996 headlines, 9.5 years
• Reuters World News: 163,919 headlines, 9 years
• Aljazeera: 94,309 headlines, 8.5 years
Visualizations created with Apache Solr and D3.js, see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/
03 Categories
Finding meaningful categories. Each text is different. Challenge accepted!
[Illustration: EXTRACTING… CAT. 1, CAT. 2, CAT. 3]
Comparing pre-defined categories of Al Jazeera, Reuters…
[Bar chart, Al Jazeera, axis 0 to 20,000. Category labels: news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, opinion, indepth-features, indepth-inpictures, pictures, focus, blogs-americas, indepth-spotlight, blogs, blogs-asia, indepth-interactive]
[Bar chart, Reuters, axis 0 to 80,000. Category labels: World, US Politics, Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business, Economy, Green Business, Bonds, Sports, Small Business]
… and the Telegraph categories
[Bar chart, Telegraph, axis 0 to 120,000. Category labels include news worldnews, sport football, news uknews, finance newsbysector, news politics, finance personalfinance, sport cricket, news health, culture tvandradio, finance markets, culture music, and many more, down to small categories such as news wikileaks-files and lifestyle wellbeing]
It's not easy.
Our selection: Functionally relevant, mutually exclusive categories derived from Telegraph categories
[Bar chart of the selected categories, axis 0 to 250,000]
Finding meaningful categories for the Telegraph News was fun! Let's go on and do a whole text classification experiment.
Our aim is to classify 1 million Telegraph News documents with an ML algorithm. While doing this we want to find out …
… whether an ML algorithm is able to classify the Telegraph news documents
… which steps we need to work out in order to make the ML algorithm work
Handy for us: We will be able to train the ML algorithm with the pre-classified data set of the Telegraph News!
04 Text classification
Typical text classification projects and our experiment set-up

Typical set-up: no classification scheme, no classified data
1. Choose data to be classified
2. Manually classify chosen data set
3. Train ML algorithm with classified data set
4. Apply trained ML algorithm to complete data set
5. Manual QA on data set samples

Our Telegraph experiment with pre-classified documents
1. Choose data to be classified
2. Get already existing classifications for chosen data
3. Train ML algorithm with classified data set
4. Apply trained ML algorithm to complete data set
5. Automatic QA on complete data set

Advantages for us:
• No manual classification & QA necessary
• Existing classification scheme
• Playground easily set up
• Free to choose both manual data set & categories
Our experiment for the next 30 minutes
Set-up: our Telegraph experiment with pre-classified documents (choose data to be classified, get already existing classifications for chosen data, train ML algorithm with classified data set, apply trained ML algorithm to complete data set, automatic QA on complete data set), in contrast to the typical set-up with no classification scheme, no classified data, and manual QA on data set samples.
Our aims in the next 30 minutes:
• Train & apply the ML algorithm to 1 million Telegraph News documents
• See how well ML performs
This process sounds easy and very structured. The people in the audience who have already done text classification projects probably know that in reality, data can become pretty challenging. The devil is in the data.
The next slides show you the process of how we classified 1 million Telegraph news documents. What is the reality we deal with? And what are good practices and our learnings?
Getting started: Preparing data for and executing ML
Naïve approach
1. Choose data to be classified: random selection
2. Manually classify data set: easy to classify, but expensive
3. Apply ML algorithm: get classifications
4. Manual QA: measure results
The result is BAD! WHY? Let's take a step back and find out: How does ML WORK? How can I MEASURE its results?
ML algorithm explained – Support Vector Machine (SVM)
Machine learning is linear algebra
• Need to discretize first
Categories are already discrete
More complicated for text
• Bag of words = detect words
• TF/IDF matrix = use document and total frequency
Many different possible learning models
• Support Vector Machines (most popular)
• Neural network
• Random forest
• Decision tree
[Figure: Support Vector Machine]
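The bag-of-words and TF/IDF steps can be sketched in plain Python. This is a minimal illustration, not the speakers' implementation: the whitespace tokenizer, the `log(N / df)` weighting, the function names and the sample headlines are all assumptions made for the example. A learning model such as an SVM would then be trained on the resulting matrix.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lower-cased whitespace tokens stand in for a real tokenizer.
    return text.lower().split()

def bag_of_words(docs):
    """Return the sorted vocabulary and one term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set(w for c in counts for w in c))
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors

def tf_idf(docs):
    """Weight each term count by log(N / document frequency)."""
    vocab, vectors = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = [sum(1 for v in vectors if v[i] > 0) for i in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    matrix = [[tf * w for tf, w in zip(v, idf)] for v in vectors]
    return vocab, matrix

# Hypothetical sample headlines, invented for illustration.
headlines = [
    "markets rally as bank shares rise",
    "football cup final ends in draw",
    "bank profits rise",
]
vocab, matrix = tf_idf(headlines)
```

Words that occur in only one headline (such as "football") get a higher weight than words shared by several headlines (such as "bank"), which is the point of combining document frequency with total frequency.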
Preparation of manually classified set
Choosing set for manual classification
• Select documents with highest word variability
  – Metric: word heterogeneity = number of words in all documents (without stopwords)
  – Even distribution
  – Long tail distribution (many, many words used infrequently)
• Complicated: knapsack-like problem
• Use an approximate approach (like genetic algorithm)
• Crucial for all following tasks
[Figure: common distribution, dictionary distribution]

1. Good situation: The manually classified data set contains all the words of the complete data set.
[Diagram: word heterogeneity in manual set (w01 … w18) equals word heterogeneity in complete data set (w01 … w18)]
2. Not so good situation: The manually classified data contains only a fraction of all the words in the complete data set.
[Diagram: manual set (w01 … w18) compared with complete set (w01 … w99)]
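The word heterogeneity metric and the two situations can be illustrated with a few lines of Python. This is only a sketch under assumptions: "number of words" is read here as the number of distinct non-stopword words, the stopword list is a tiny placeholder, and the function names and sample documents are invented for the example.

```python
# Assumption: a tiny illustrative stopword list, not a real one.
STOPWORDS = {"the", "a", "of", "in", "and", "to"}

def words(doc):
    """Distinct non-stopword words of one document."""
    return {w for w in doc.lower().split() if w not in STOPWORDS}

def heterogeneity(docs):
    """Number of distinct non-stopword words across all documents."""
    vocab = set()
    for d in docs:
        vocab |= words(d)
    return len(vocab)

def coverage(manual_docs, complete_docs):
    """Fraction of the complete set's vocabulary that occurs in the manual set."""
    manual_vocab = set().union(*(words(d) for d in manual_docs))
    complete_vocab = set().union(*(words(d) for d in complete_docs))
    if not complete_vocab:
        return 1.0
    return len(manual_vocab & complete_vocab) / len(complete_vocab)

# Hypothetical documents, invented for illustration.
complete = [
    "the bank of england raises rates",
    "england win the cup",
    "rates fall in europe",
]
manual = complete[:2]

h_complete = heterogeneity(complete)  # 8 distinct words
h_manual = heterogeneity(manual)      # 6 distinct words
cov = coverage(manual, complete)      # 0.75: the "not so good" situation
```

A coverage of 1.0 corresponds to the good situation, where the manual set contains every word of the complete set.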
Intelligently choose data set to be classified manually

Final data set not available
• Optimize for high variability and high usage

Final data set available
• Choose training data set in a way to create maximal word overlap with complete data set
• $W_M = \{\text{words in training set}\}$, $W_C = \{\text{words in complete set}\}$; find maximum for $|W_C \cap W_M| = |W_M|$
• Improved approach: choose training set to minimize headlines with unknown words in complete data set
• Find minimum for the number of headlines in the complete set $C$ that contain words outside $W_M$: $|\{h \in C : \mathrm{words}(h) \not\subseteq W_M\}|$
• More complicated, but worth it
[Figure: "Select this" vs. "Don't select that"]
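One way to approximate the word-overlap criterion is a greedy heuristic: repeatedly pick the headline that adds the most new words to the training vocabulary. The sketch below uses that heuristic as a simple stand-in; it is not the speakers' optimization method, and the function names and sample headlines are hypothetical. The second function counts the headlines that still contain unknown words, which is the quantity the improved approach tries to minimize.

```python
def tokens(headline):
    # Assumption: lower-cased whitespace tokens stand in for a real tokenizer.
    return set(headline.lower().split())

def greedy_training_set(headlines, k):
    """Pick k headlines, each time taking the one that adds the most new words."""
    chosen, known = [], set()
    remaining = list(headlines)
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda h: len(tokens(h) - known))
        chosen.append(best)
        known |= tokens(best)
        remaining.remove(best)
    return chosen, known

def headlines_with_unknown_words(headlines, known):
    """Count headlines with at least one word outside the training vocabulary."""
    return sum(1 for h in headlines if not tokens(h) <= known)

# Hypothetical complete data set, invented for illustration.
complete = [
    "bank raises rates",
    "bank cuts rates",
    "england win cup final",
    "england lose cup final",
    "rates rise",
]
chosen, known = greedy_training_set(complete, k=2)
unknown = headlines_with_unknown_words(complete, known)
```

With these five headlines the heuristic selects one sports and one finance headline, and three headlines still contain a word the training set has never seen. A greedy choice is not guaranteed to be optimal, which is why the selection problem is harder than it looks.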
Measure classification quality: precision and recall
Precision
• "Positive predictive value"
• Precision is the probability that a (randomly selected) retrieved document is classified correctly
Recall
• Sensitivity or "true positive rate"
• Recall is the probability that a (randomly selected) document belonging to the category is found
[Figure: precision (P) and recall (R)]
Example
• The term "Africa" has very high precision for the category "Africa", but bad sensitivity (recall)
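Both measures follow directly from counting true positives, false positives and false negatives for one category. The following sketch shows the calculation; the function name and the label lists are invented for illustration and mimic a classifier with high precision but low recall for one category.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Precision and recall of the predictions for a single category."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    # Guard against division by zero when nothing was retrieved or relevant.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels: four documents belong to "africa", only one is found.
true_labels = ["africa", "africa", "africa", "africa", "europe", "europe"]
predicted = ["africa", "europe", "europe", "europe", "europe", "europe"]

p, r = precision_recall(true_labels, predicted, "africa")
# p == 1.0: every document retrieved as "africa" is correct
# r == 0.25: only one of four "africa" documents was found
```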
Now we know why the naive approach of preparing data for and executing ML is not enough. Let's try the following instead…

Necessary steps to successfully apply ML
1. Calculate text metrics: define goals
2. Choose data to be classified: use optimization to find the optimal set
3. Manually classify data set: easy to create, but expensive; gives the training and test data set
4. Apply ML algorithm: get classifications; crossfolding, use different algorithms
5. Measure quality of ML: calculate precision and recall
6. Manual QA: apply ML to whole data set for better results

Attention: ML remembers words
• It can only classify text with known words
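Crossfolding (cross-validation) and the "known words" limitation can be demonstrated together. The sketch below is deliberately simple and rests on assumptions: a word-voting classifier stands in for a real ML algorithm such as an SVM, plain accuracy stands in for precision and recall, and all names and sample headlines are hypothetical.

```python
from collections import Counter, defaultdict

def train(docs, labels):
    """Remember, for every word, how often it occurred in each category."""
    model = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        for word in doc.lower().split():
            model[word][label] += 1
    return model

def classify(model, doc):
    """Vote by known words; return None if no word was seen in training."""
    votes = Counter()
    for word in doc.lower().split():
        if word in model:
            votes.update(model[word])
    return votes.most_common(1)[0][0] if votes else None

def cross_validate(docs, labels, k):
    """Average accuracy over k folds; fold i holds out every k-th document."""
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(docs), k))
        train_docs = [d for j, d in enumerate(docs) if j not in test_idx]
        train_labels = [l for j, l in enumerate(labels) if j not in test_idx]
        model = train(train_docs, train_labels)
        hits = sum(1 for j in test_idx if classify(model, docs[j]) == labels[j])
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Hypothetical pre-classified headlines, invented for illustration.
docs = [
    "bank raises rates", "england win cup", "bank cuts rates",
    "england lose cup", "rates rise again", "cup final today",
]
labels = ["finance", "sport", "finance", "sport", "finance", "sport"]

accuracy = cross_validate(docs, labels, k=3)
model = train(docs, labels)
unseen = classify(model, "volcano erupts")  # None: no known word to vote with
```

The last call shows the limitation in its purest form: a headline made up entirely of unknown words cannot be classified at all, which is why the choice of the training set matters so much.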