Classifying unstructured text: Deterministic and machine learning approaches



SLIDE 1

Classifying unstructured text

Deterministic and machine learning approaches

Stephanie Fischer
Dr. Christian Winkler

Apache Big Data Sevilla, 15th November 2016

“Using a simple tool to solve a complex problem does not result in a simple solution.” Larry Wall

SLIDE 2

Agenda

01 About us
02 Text statistics
03 Categories
04 Text classification
05 Conclusion and outlook

Speakers

Stephanie Fischer
Product Owner Text Analytics, Big Data, Agile & Change, mgm consulting partners

Dr. Christian Winkler
Enterprise Architect Big Data, Data Science, mgm technology partners

SLIDE 3

01 About us

SLIDE 4

Stephanie and Christian according to their browser history

SLIDE 5

02 Text statistics

SLIDE 6

SLIDE 7

Comparing word frequency of news from Reuters, Telegraph and Aljazeera

Reuters World News: 163,919 headlines over 9 years
Telegraph: 958,996 headlines over 9.5 years
Aljazeera: 94,309 headlines over 8.5 years

Visualizations created with Apache Solr and D3.js; see our talk from Apache Big Data Vancouver 2016 here: https://bigdata.mgm-tp.com/apache/

SLIDE 8

03 Categories

SLIDE 9

Finding meaningful categories. Each text is different. Challenge accepted!

SLIDE 10

Comparing pre-defined categories of Al Jazeera and Reuters…

Reuters categories (by headline count, roughly 10,000 to 80,000): World, US Politics, Top News, Business News, Markets, Technology, Deals, Personal Finance, Business Economy, Green Business, Bonds, Sports, Small Business

Al Jazeera categories (by headline count, roughly 2,000 to 20,000): news-middleeast, news-americas, news-europe, news-asia-pacific, news-africa, news-asia, news, indepth-opinion, indepth-features, indepth-inpictures, focus, blogs-americas, indepth-spotlight, blogs-asia, indepth-interactive

SLIDE 11

… and the Telegraph categories

Telegraph categories (by headline count, up to roughly 120,000): news worldnews, sport football, news uknews, finance newsbysector, news politics, finance personalfinance, sport rugbyunion, sport cricket, news health, culture tvandradio, sport othersports, finance markets, news newstopics, culture music, news earth, blogs.telegraph.co.uk, culture books, sport olympics, culture film, news celebritynews, finance economics, news obituaries, sport tennis, finance comment, news picturegalleries, finance property, news science, travel destinations, culture theatre, sport golf, sport horseracing, comment telegraph-view, comment letters, education educationnews, finance financialcrisis, comment columnists, foodanddrink recipes, www.telegraph.co.uk, sport motorsport, travel travelnews, culture art, technology news, comment personal-view, women womens-life, news religion, motoring news, education universityeducation, finance jobs, news wikileaks-files, lifestyle wellbeing

SLIDE 12

So it's not easy.

SLIDE 13

Our selection: Functionally relevant, mutually exclusive categories derived from Telegraph categories


SLIDE 14

Finding meaningful categories for the Telegraph News was fun! Let's go on and do a whole text classification experiment. Our aim is to classify 1 million Telegraph News documents with an ML algorithm. While doing this we want to find out whether an ML algorithm will be able to classify the Telegraph news documents, and which steps we need to work out in order to make the ML algorithm work. Handy for us: we will be able to train the ML algorithm with the pre-classified data set of the Telegraph News!

SLIDE 15

04 Text classification

SLIDE 16

Typical text classification projects and our experiment set-up

Typical set-up (no classification scheme, no classified data):
1. Choose data to be classified
2. Manually classify the chosen data set
3. Train the ML algorithm with the classified data set
4. Apply the trained ML algorithm to the complete data set
5. Manual QA on data set samples

Our Telegraph experiment with pre-classified documents:
1. Choose data to be classified
2. Get already existing classifications for the chosen data
3. Train the ML algorithm with the classified data set
4. Apply the trained ML algorithm to the complete data set
5. Automatic QA on the complete data set

Advantages for us:
- No manual classification & QA necessary
- Existing classification scheme
- Playground easily set up
- Free to choose both manual data set & categories

SLIDE 17

Our experiment for the next 30 minutes

Typical set-up (no classification scheme, no classified data): choose data to be classified, manually classify it, train the ML algorithm with the classified data set, apply the trained ML algorithm to the complete data set, manual QA on data set samples.

Our Telegraph experiment with pre-classified documents: get already existing classifications for the chosen data, train and apply the ML algorithm, automatic QA on the complete data set.

Our aims in the next 30 minutes:
- Train & apply the ML algorithm to 1 million Telegraph News documents
- See how well ML performs

SLIDE 18

This process sounds easy and very structured. The people in the audience who have already done text classification projects probably know that in reality, data can become pretty challenging.

The next slides show you the process of how we classified 1 million Telegraph news documents. What is the reality we deal with? And what are good practices / our learnings?

The devil is in the data

SLIDE 19

Getting started: Preparing data for and executing ML

Naïve approach:
1. Choose data to be classified: random selection
2. Manually classify the data set (easy to classify, but expensive)
3. Apply the ML algorithm: get classifications
4. Manual QA: measure results

SLIDE 20

The result is BAD!

WHY?

Let's take a step back and find out: How does ML WORK? How can I MEASURE its results?

SLIDE 21

ML algorithm explained: Support Vector Machine (SVM)

Machine learning is linear algebra:
- Need to discretize first
- Categories are already discrete; it is more complicated for text:
  - Bag of words = detect words
  - TF/IDF matrix = use document and total frequency

Many different possible learning models:
- Support Vector Machine (most popular)
- Neural network
- Random forest
- Decision tree
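The discretization step (bag of words plus TF/IDF weighting) can be sketched in a few lines. This is an illustrative pure-Python version, not the tooling used in the project (the deck works with R, Mahout and Spark), and tokenization here is a naive whitespace split:

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Turn raw headlines into TF/IDF vectors over a shared vocabulary."""
    # Bag of words: tokenize and count term frequency per document
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    matrix = []
    for toks in tokenized:
        tf = Counter(toks)
        # TF/IDF: term frequency weighted by inverse document frequency,
        # so words that occur everywhere carry less weight
        matrix.append([tf[w] * math.log(n / df[w]) if w in tf else 0.0
                       for w in vocab])
    return vocab, matrix

vocab, m = tfidf_matrix(["markets fall", "markets rise", "football results"])
```

The resulting matrix is exactly what a learning model such as an SVM consumes: each headline becomes one numeric row, each vocabulary word one column.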

SLIDE 22
Preparation of the manually classified set

Word heterogeneity in the manual set vs. the complete data set:
1. Good situation: the manually classified data set contains all the words of the complete data set.
2. Not so good situation: the manually classified data contains only a fraction of all the words in the complete data set.

Choosing the set for manual classification:
- Select documents with the highest word variability
  - Metric: word heterogeneity = number of words in all documents (without stopwords)
  - Even distribution vs. long-tail distribution (many, many words used infrequently)
- Complicated: a knapsack-like problem
- Use an approximate approach (like a genetic algorithm)
- Crucial for all following tasks

(Figure: word grids comparing the heterogeneity of the manual set, w01 to w18, with the complete data set, w01 to w99; common distribution vs. dictionary distribution.)

SLIDE 23
Intelligently choose the data set to be classified manually

- Choose the training data set so as to create maximal word overlap with the complete data set:
  W_M = { words in training set }, W_C = { words in complete set }; find the maximum for |W_C ∩ W_M| = |W_M|
- Improved approach: choose the training set to minimize headlines with unknown words in the complete data set:
  find the minimum for |W_C \ W_M|
- More complicated, but worth it
- Optimize for high variability and high usage

(Figure: Venn diagrams of the two cases, "final data set available" vs. "final data set not available"; select training sets whose words lie inside the complete set, don't select those that fall outside.)
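The knapsack-like selection can be approximated greedily: repeatedly pick the headline that adds the most not-yet-covered words. This is a simpler stand-in for the genetic algorithm the slide suggests, purely for illustration:

```python
def choose_training_set(headlines, budget):
    """Greedy approximation of the selection problem: pick `budget`
    headlines so the training vocabulary W_M covers as much of the
    complete vocabulary W_C as possible."""
    token_sets = [set(h.lower().split()) for h in headlines]
    covered, chosen = set(), []
    for _ in range(budget):
        # Pick the headline contributing the most uncovered words
        best = max(range(len(headlines)),
                   key=lambda i: len(token_sets[i] - covered))
        if not token_sets[best] - covered:
            break  # every word is already covered
        chosen.append(best)
        covered |= token_sets[best]
    return chosen, covered

headlines = ["markets fall again", "markets rise", "football cup final",
             "rise of football"]
chosen, covered = choose_training_set(headlines, budget=2)
```

Greedy set cover like this gives no optimality guarantee for the improved |W_C \ W_M| objective, but it is cheap and usually good enough as a starting point.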

SLIDE 24

Measure classification quality: precision and recall

Precision ("positive predictive value"): the probability that a (randomly selected) retrieved document is classified correctly.

Recall (sensitivity or "true positive rate"): the probability that a (randomly selected) classified document is found.

Example: the category "Africa" has very high precision, but bad sensitivity (recall).
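Under these definitions, per-category precision and recall fall out directly from gold and predicted labels. A minimal sketch with made-up labels that mirrors the "Africa" example (perfect precision, poor recall):

```python
def precision_recall(y_true, y_pred, category):
    """Per-category precision and recall from gold and predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred)
             if t == category and p == category)       # true positives
    predicted = sum(1 for p in y_pred if p == category)  # all predictions
    actual = sum(1 for t in y_true if t == category)     # all gold labels
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

gold = ["africa", "africa", "africa", "world", "world"]
pred = ["africa", "world", "world", "world", "world"]
p, r = precision_recall(gold, pred, "africa")
# every "africa" prediction is correct, but most "africa" docs are missed
```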

SLIDE 25

Now we know why the naïve approach of preparing data for and executing ML is not enough. Let's try the following instead…

Necessary steps to successfully apply ML:
1. Choose data to be classified: calculate text metrics, define goals, use optimization (the optimal set is easy to create, but expensive)
2. Manually classify the data set
3. Apply the ML algorithm: get classifications from training and test data sets, calculate precision + recall, use cross-folding and different algorithms
4. Manual QA: better results

Measure the quality of ML with a training set and a test set via precision and recall, then apply ML to the whole data set and finish with manual QA.

Attention: ML remembers words → it can only classify text with known words.

SLIDE 26

What we have done & achieved so far

1. Data cleaning and preparation: docs with the same headline but different classification removed
2. Category definition: mutually exclusive, functionally relevant
3. Precision & recall per category and for different training/test sets
4. Eliminating the longtail (e.g. dictionary-type texts such as person names: abdulrahim kerimbakiev, abbot placid spearritt, abdullahi sudi arale, aage bohr, …)

Precision/recall for different training/test sizes:

Training set size   Precision   Recall
16,000              47%         36%
30,000              50%         37%
50,000              52%         40%

SLIDE 27

The result is BETTER! Now… what options do you have if you don't have a pre-categorized data set to train your ML? Or if your manually classified data set is too small?

SLIDE 28

What to do about the things that can still go wrong

The manually classified data set is too small for training:
- The data set is too heterogeneous
- ML cannot detect patterns
- Bad precision and recall

Extend the data set?
- Requires manual classification
- Too expensive

Try to understand the structure of the manual classification instead:
- Find category-specific keywords
- Find patterns
- Use NLP etc.

→ Extension of the training set by deterministic classification: the training set becomes the manually classified data set plus its deterministic extension, and the complete data set is then classified by ML.
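A deterministic extension of the training set can be as simple as a small rule engine of category-specific keyword patterns. The rules, categories and headlines below are purely illustrative (not taken from the talk); only headlines matched unambiguously by exactly one rule are added:

```python
import re

# Hypothetical category-specific keyword rules, of the kind one might
# derive from inspecting the manually classified set
RULES = {
    "sport":   re.compile(r"\b(football|cricket|rugby|olympics)\b", re.I),
    "finance": re.compile(r"\b(markets|bonds|economy|shares)\b", re.I),
}

def deterministic_classify(headline):
    """Return a category if exactly one rule fires, else None.
    Ambiguous or unmatched headlines stay out of the training set."""
    hits = [cat for cat, pat in RULES.items() if pat.search(headline)]
    return hits[0] if len(hits) == 1 else None

def extend_training_set(training, unlabelled):
    """Add deterministically classified headlines to the training set."""
    extra = [(h, deterministic_classify(h)) for h in unlabelled]
    return training + [(h, c) for h, c in extra if c is not None]

extended = extend_training_set(
    [("Villa win cup", "sport")],
    ["Rugby world cup draw", "Bonds rally as markets calm", "Rain expected"])
```

Requiring exactly one matching rule keeps the deterministic labels high-precision, which matters because these labels are fed back into training.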

SLIDE 29

Improved approach with deterministic extension

1. Choose the data set to be classified: calculate text metrics, define goals, use optimization (the optimal set is easy to create, but expensive)
2. Manually classify the data set
3. Extend the data set: rule engine with deterministic rules, pattern matching
4. Apply the ML algorithm: get classifications from a larger training and test set, calculate precision + recall, use different algorithms
5. Manual QA

Even better results!

SLIDE 30

Iterate and improve

Be prepared for a long journey: often, results get better incrementally.

Input: categories + data

1. Find/adjust categories
   Tasks & tools: visualization (Solr & D3), clustering (Solr), data analytics (brain & expertise)
2. Assign categories to documents manually
   Tasks & tools: manual classification (Mechanical Turk), deterministic classification (rule engine)
3. Train and apply ML
   Tasks & tools: training and testing (R, Mahout, Spark)
4. Measure results
   Tasks & tools: check precision, recall, f-measure (R, Apache Mahout, Spark ML)

Output: categorized data + histograms + visualization + metrics + category-specific keywords + hierarchies, rules, entities

SLIDE 31

Classification cascade

- Total data set: 908,603 documents
- After deduplication: 839,577 unique, unclassified documents
- Manual classification of 8,000 documents leaves 831,577 unclassified
- The rest is classified deterministically and by ML; what remains unclassified ends up in the longtail

SLIDE 32

Talking about the longtail: variability

Reasons for a longtail:
- A flat longtail via dictionary-type texts
- A decreasing longtail from domain-specific language

Analyze the longtail:
- Count words
- Measure heterogeneity

Elimination strategies:
- Foreign-language detection
- Eliminate typos (n-grams)
- Manual classification if there are not too many documents
- Put them into a separate category (aka "miscellaneous")

(Figure: total vs. unique vs. classified documents, with the longtail remaining.)
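Counting words and measuring heterogeneity, as suggested above, might look like the sketch below. Here "rare" words (occurring at most once) approximate the longtail, and heterogeneity is simply the rare share of the vocabulary; this is an illustrative metric, not necessarily the one used in the talk:

```python
from collections import Counter

def longtail_stats(headlines, threshold=1):
    """Count words and measure heterogeneity to analyse the longtail."""
    words = Counter(w for h in headlines for w in h.lower().split())
    # Rare words occur at most `threshold` times in the whole corpus
    rare = {w for w, n in words.items() if n <= threshold}
    # Heterogeneity: share of the vocabulary made up of rarely used words
    heterogeneity = len(rare) / len(words) if words else 0.0
    return words, rare, heterogeneity

heads = ["markets fall", "markets rise", "abib sarajuddin obituary"]
words, rare, het = longtail_stats(heads)
```

A high heterogeneity score flags exactly the dictionary-type texts (e.g. obituaries full of unique names) that the elimination strategies target.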

SLIDE 33

How do you make the decisions in your data analytics project data-driven and measurable?

SLIDE 34

Metrics help making objective decisions during the project

Track a handful of numbers for each stage:
- Total data set: # docs, # unique docs, # unique words
- Manual data set: # classified docs, # categories, # unique words
- Manual & deterministic data set: # unique docs, # unique words
- Result: # docs, # categories, # longtail
- QA measures: precision, recall

SLIDE 35

The project is finished when your cost/benefit ratio (or its prediction) of classifying the longtail becomes negative.

SLIDE 36

05 Conclusion and outlook

SLIDE 37

10 Lessons learned

On the slide, each lesson sits on a scale from "sounds naïve / really naïve" to "sounds clever / really clever":

- Complex data structure & complicated classification scheme
- Thinking the functional specification is finished before the project is finished
- Get creative to find useful pre-categorized data
- Check data heterogeneity immediately, then choose the technique
- Design the manually classified data set very simply so ML will reach high precision/recall
- Understand the data qualitatively & quantitatively
- Increase the ML test & training set manually and deterministically
- Trying to understand ML → you don't
- Mutually exclusive categories
- Run it on your notebook

SLIDE 38

Outlook

Getting more pre-categorized data by:
- Categories from other sources
- Semantic extraction
- NLP
- Meaning

Not-yet-analyzed text is everywhere:
- Discretization helps in understanding
- A toolbox with ML and deterministic rules is helpful

SLIDE 39

Innovation Implemented.

mgm technology partners GmbH

Frankfurter Ring 105a 80807 München Tel.: +49 (89) 35 86 80-0 Fax: +49 (89) 35 86 80-288 http://www.mgm-tp.com

Prag München Berlin Hamburg Köln Nürnberg Grenoble Leipzig Dresden Bamberg Boswil Đà Nẵng

mgm consulting partners GmbH

Holländischer Brook 2 20457 Hamburg Tel.: +49 (0) 40 / 80 81 28 20-0 Fax: +49 (0) 40 / 80 81 28 20-388 http://www.mgm-cp.com