Lightly Supervised Content Modeling for Corporate Text Analytics
Raphael Cohen, Data Science as a Service, EMC
EMC CONFIDENTIAL - INTERNAL USE ONLY.

Talk Outline
Text Analytics for Customer Services
• Customer interaction data
– Structured data vs. textual data
• Motivation / Objectives
• Rule-based approach
• Data-driven modeling
• Machine learning for preprocessing
• Modeling topics
• Injecting SME knowledge into a topic model
• Cool results

Customer Data – in our classic CRM DB
Customer interactions:
• Service Request
• Support Chat
Structured Data:
• Date
• Employee id
• Product Type
• Problem Code
• Resolution Code
We care…
“Something that makes me particularly proud is the use of Big Data analytics to create a detailed picture of service delivery characteristics for continuous improvement.”
Kevin Roche, Senior Vice President, EMC Global Services

Customer Data – In the Data Lake
Unstructured Data:
• Problem Summary (e.g. “Exchange backup failing”)
• Resolution Summary (e.g. “hf 23.45 applied, issue solved”)
• Chat data:
– Customer: “hello”
– Helpdesk: “hi tom, how can I help”
– Customer: “my daily exchange backup is failing…”
– Helpdesk: “did you try to restart the service?”

Objectives / Motivation
Service organization wish list
• Early detection of emerging problems
– Example: “we are getting a lot of service requests regarding Exchange 2010 backups in version 2.41, let’s initiate service pack install in all of these”
• Root Cause Analysis
– Search for similar problem descriptions and rank the solutions
• Identifying call volume drivers
– Example: “oh, we are spending 10% of our time in Europe on VM memory problems”
• Improving service
– According to chat transcripts, this employee is slow in identifying code bug issues

Rule-Based Approach
The old industry standard
• Subject matter experts create keyword rules (install -> “install calls”)
• Long tuning process (up to 6 months), repeated as a cycle:
– 1. SME writes rules
– 2. Evaluate on some documents
– 3. Identify common errors
• Usually low recall
• High precision requires more and more complex rules, e.g. “DB temp unavailable” / “DB temporarily unavailable”…
• A strong preprocessing unit + rule creation engine cuts the manual labor from years to months
• Lets the users feel in control
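A minimal sketch of what such keyword rules can look like in code, to make the recall problem concrete. The rule list, the tag names other than “install calls”, and the function name are illustrative assumptions, not the rule engine described in the talk.

```python
import re

# Hypothetical keyword rules: (pattern, tag). Only the install rule and the
# two "DB ... unavailable" phrasings come from the slide; the rest is assumed.
RULES = [
    (re.compile(r"\binstall\w*", re.IGNORECASE), "install calls"),
    (re.compile(r"\bdb\s+temp(?:orarily)?\s+unavailable\b", re.IGNORECASE),
     "db unavailable"),
]


def tag_document(text):
    """Return the tag of every rule whose pattern occurs in the text."""
    return [tag for pattern, tag in RULES if pattern.search(text)]


print(tag_document("DB temp unavailable after installation"))
print(tag_document("database temporarily down"))  # no rule fires
```

The second call shows why recall is usually low: a phrasing the SME did not anticipate matches nothing, and covering it means writing yet another rule.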

The Alternative
Data-driven machine learning modeling
Our dream approach:
• Model the data that you have, not the data you think you have
• Automate
• Don’t automate so much that the user is left out of the loop
• Reproducible (quickly integrate into a new business unit)
• Quick (analysts can tune the modeling engine)
Pipeline: 1. Preprocess -> 2. Unsupervised clustering -> 3. SME Annotation -> 4. Actionable Insights

0. ETL / Technologies
Show me the data!
Where is the data? This is actually the hardest step.
1) Database column (most likely)
2) Hadoop
How to access it?
1) In the Pivotal Greenplum big data warehouse
2) Pivotal HD (Hadoop by Pivotal)
3) For streams: GemFireXD

Preprocess: Smart dimensionality reduction
Classic preprocessing of text:
• A friendly tokenization regexp: re.findall("[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+", s)
• Then use the Porter stemmer:
– “ponies” -> “poni”
– “expression” -> “express”
Drawbacks:
• Loses information
• Eyesore for the customer
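The tokenization regexp above can be tried out as follows. This sketch only wraps the pattern from the slide; the function name and the sample strings are assumptions. The stemming step is left out to keep the example dependency-free.

```python
import re

# Pattern copied from the slide. Alternatives, in order:
#   1. runs of 2+ capitals not followed by a lowercase letter (acronyms)
#   2. a capitalized word directly followed by another capital (CamelCase head)
#   3. any run of word characters, apostrophes and hyphens
TOKEN_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+")


def tokenize(s):
    """Split a string into tokens using the slide's regexp."""
    return TOKEN_RE.findall(s)


print(tokenize("Exchange backup failing"))
print(tokenize("DataDomain hot-fix"))
print(tokenize("hf 23.45 applied, issue solved"))
```

Note that the pattern splits CamelCase product names into parts and breaks version numbers at the dot, which may or may not be what a given corpus needs.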

Preprocess: Smart dimensionality reduction
Before we start the modeling
• Lemmatize instead of stemming:
– Preprocess text with a POS tagger to get the base form
– Use NLTK lemmatizers to get the most probable base form
• Word Clusters
– Create a Deep Learning representation of the words (word2vec)
– Extract likely synonyms using heuristics
– Allow the SMEs to edit the synonym dictionary
– Can be leveraged for query expansion for Search

Preprocess – Synonym Extraction
Smart dimensionality reduction
Word clusters (token, count):
• backup 15492: backups 4419, backed 386, backup's 32, bakcup 28, bakup 14, backuped 14, bacup 8, backp 8, buckup 7, backu 7, backus 5, backuo 5, backup1 4, backkup 4
• networker 8703: netwoker 59, netwroker 22, netowrker 15, networke 13, neworker 10, netorker 5
• licenses 347: licence 119, licences 54
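A rough sketch of how rare misspellings can be folded into a frequent canonical form. It substitutes plain string similarity (Python's difflib) for the word2vec representation and the unspecified heuristics used in the talk, so treat it as a stand-in rather than the method itself. The function name and both thresholds are assumptions; the counts are taken from the slide.

```python
import difflib


def build_synonym_map(counts, min_canonical_count=1000, cutoff=0.8):
    """Map each rare token to the most similar frequent token, if any.

    counts: dict of token -> corpus frequency.
    Tokens seen at least min_canonical_count times are treated as canonical.
    """
    canonical = [w for w, c in counts.items() if c >= min_canonical_count]
    mapping = {}
    for word, count in counts.items():
        if count >= min_canonical_count:
            continue
        match = difflib.get_close_matches(word, canonical, n=1, cutoff=cutoff)
        if match:
            mapping[word] = match[0]
    return mapping


counts = {"backup": 15492, "bakcup": 28, "bakup": 14,
          "networker": 8703, "netwoker": 59, "netwroker": 22}
print(build_synonym_map(counts))
```

The resulting dictionary is the kind of artifact the SMEs would then review and edit, since a purely string-based heuristic will also merge tokens that only look alike.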

Unsupervised clustering: Topic Modeling
• A data-driven approach requires that we look for the topics present in the data
• Topic Modeling has been established as a premier approach in Statistical Machine Learning
• Latent Dirichlet Allocation, the mixture model approach by Blei, Ng and Jordan, has been cited by 8,500 academic papers
• Input: Text divided into documents
• Output: Soft topics
• Recipes:
– Symmetric vs. asymmetric prior
– Redundancy reduction (de-dup)

Unsupervised clustering: Topic Modeling – In practice
• Read: “Care and Feeding of Topic Models” by Boyd-Graber, Mimno and Newman
• Asymmetric Priors (Wallach)
– Vanilla LDA assumes all topics are equally likely
– I’ve never encountered such a corpus
– Assume the prior for each topic is different and sample it as well
– Not supported by most off-the-shelf big data LDA solutions
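For reference, the difference between the two priors can be read off the standard LDA generative model. The notation below (K topics, D documents, topic-word distributions φ, document-topic distributions θ) is the usual textbook notation and is an assumption here, not something shown on the slides.

```latex
\begin{align*}
\phi_k   &\sim \mathrm{Dirichlet}(\beta),                      & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_K),    & d &= 1,\dots,D \\
z_{d,n}  &\sim \mathrm{Categorical}(\theta_d)                  & &\text{(topic of word $n$ in document $d$)} \\
w_{d,n}  &\sim \mathrm{Categorical}(\phi_{z_{d,n}})            & &\text{(the observed word)}
\end{align*}
```

A symmetric prior fixes \(\alpha_1 = \dots = \alpha_K\), so no topic is expected to be more prevalent than another. An asymmetric prior lets the \(\alpha_k\) differ and, in the approach the slide attributes to Wallach, treats them as unknowns that are inferred along with the rest of the model.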

Unsupervised clustering: Topic Modeling – In practice
• Redundancy (Cohen et al.)
– Copy-paste / boilerplate text introduces noise into the topic distribution (see the paper)
– These occur a lot in corporate data sets
– Remove redundant documents (e.g. 10,000 occurrences of “SR closed”)
– Alternatively, sample with Redundancy Aware LDA
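The simplest form of the “remove redundant documents” recipe can be sketched as below. It only catches exact repeats after case and whitespace normalization, which is an assumption made for brevity; near-duplicate detection and Redundancy Aware LDA are not shown. Function names and the max_copies parameter are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_redundant(docs, max_copies=1):
    """Keep at most max_copies of each normalized document, preserving order."""
    seen = Counter()
    kept = []
    for doc in docs:
        key = normalize(doc)
        seen[key] += 1
        if seen[key] <= max_copies:
            kept.append(doc)
    return kept


docs = ["SR closed", "sr  closed", "Exchange backup failing", "SR closed"]
print(drop_redundant(docs))
```

Keeping one copy rather than zero leaves the boilerplate visible to the model without letting thousands of repeats dominate a topic.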

SME Annotation: Inject the domain knowledge
• Unsupervised approaches provide us with clusters of documents / words
• How can we use this to benefit the business need?
• Have the Subject Matter Expert explore the clusters and name them
• Provide as many layers of information as possible to make it easy
• Coach them first to understand that the precision is never 100%
• Allow them to tune the results

Actionable Insights: Good old Business Intelligence
Leverage the tags

Actionable Insights: Good old Business Intelligence
Chat transcripts
Analyze the chat session.
• What are the topics associated with the conversation?
• How quickly does the support person zoom in on the problem?

Actionable Insights: Good old Business Intelligence
Leverage the tags – Combine with structured data
Topic distribution according to “problem code”.

Actionable Insights: Good old Business Intelligence
Leverage the tags – Combine with structured data
Topic distribution according to “location”.
• Disk replacement is prevalent in the Americas
• Hot-fix is more common in Europe
• Can zoom in on the requests to analyze
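Combining tags with a structured field amounts to a grouped count. The sketch below assumes each service request carries a single SME-named topic tag (a simplification of LDA's soft topics) and uses made-up toy records; the field names and the function name are hypothetical.

```python
from collections import Counter, defaultdict


def topic_share_by_field(records, field):
    """For each value of a structured field, return the share of each topic."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[field]][rec["topic"]] += 1
    shares = {}
    for value, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[value] = {t: n / total for t, n in topic_counts.items()}
    return shares


# Made-up toy records, for illustration only (not data from the talk).
records = [
    {"location": "Americas", "topic": "disk replacement"},
    {"location": "Americas", "topic": "disk replacement"},
    {"location": "Americas", "topic": "hot-fix"},
    {"location": "Europe", "topic": "hot-fix"},
]
for location, shares in topic_share_by_field(records, "location").items():
    print(location, shares)
```

Passing "problem code" or any other structured column as the field gives the corresponding breakdown, provided the records carry that key.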

Actionable Insights: Good old Business Intelligence
Easily combine Search with Topics / Machine Learning
Use GP-Text for Lucene text searches.
• What are the topics associated with “node”?
• What are the structured fields?
• GPDB supports Machine Learning: throw in sentiment analysis in 5 minutes
