Lightly Supervised Content Modeling for Corporate Text Analytics



SLIDE 1

1 EMC CONFIDENTIAL—INTERNAL USE ONLY.

Lightly Supervised Content Modeling for Corporate Text Analytics

Raphael Cohen Data Science as a Service EMC

SLIDE 2

Talk Outline

Text Analytics for Customer Services

  • Customer interaction data

– Structured Data Vs. Textual data

  • Motivation / Objectives
  • Rule based approach
  • Data driven modeling
  • Machine learning for preprocessing
  • Modeling topics
  • Injecting SME knowledge into a topic model
  • Cool Results
SLIDE 3

Customer Data – in our classic CRM DB

We care… Customer interactions:

  • Service Request
  • Support Chat

Structured Data:

  • Date
  • Employee id
  • Product Type
  • Problem Code
  • Resolution Code

“Something that makes me particularly proud is the use of Big Data analytics to create a detailed picture of service delivery characteristics for continuous improvement.”

Kevin Roche, Senior Vice President, EMC Global Services

SLIDE 4

Customer Data – In the Data Lake

Unstructured Data:

  • Problem Summary (e.g. “Exchange backup failing”)
  • Resolution Summary (e.g. “hf 23.45 applied, issue solved”)
  • Chat data:

– Customer: “hello”
– Helpdesk: “hi tom, how can I help”
– Customer: “my daily exchange backup is failing…”
– Helpdesk: “did you try to restart the service?”

SLIDE 5

Objectives / Motivation

Service organization wish list

  • Early detection of emerging problems

– Example, “we are getting a lot of service requests regarding Exchange 2010 backups in version 2.41, let’s initiate service pack install in all of these”

  • Root Cause Analysis

– Search for similar problem descriptions and rank the solutions

  • Identifying call volume drivers

– Example, “oh, we are spending 10% of our time in Europe on VM memory problems”

  • Improving service

– According to chat transcripts this employee is slow in identifying code bug issues

SLIDE 6

Rule Based approach

The old industry standard

Iterative cycle: SME writes rules -> Evaluate on some documents -> Identify common errors -> (repeat)

  • Subject matter experts create keyword rules (install -> “install calls”)
  • Long tuning process (up to 6 months)
  • Usually low recall
  • High precision requires more and more complex rules, e.g. “DB temp unavailable” / “DB temporarily unavailable”…
  • A strong preprocessing unit + rule creation engine cuts the manual labor from years to months
  • Lets the users feel in control
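
A minimal sketch of what a keyword rule set might look like, and why recall tends to be low. Only the install -> “install calls” rule comes from the slide; the second rule, the tag names and the `tag` helper are hypothetical and chosen for illustration.

```python
import re

# Hypothetical keyword rules: regex pattern -> tag.
# Only the install -> "install calls" mapping is taken from the slide.
RULES = {
    r"\binstall": "install calls",
    r"\bdb temp unavailable\b": "db unavailable",
}


def tag(text):
    """Return the sorted tags of every rule whose pattern occurs in the text."""
    lowered = text.lower()
    return sorted({t for pattern, t in RULES.items() if re.search(pattern, lowered)})


print(tag("Customer needs help to install NetWorker"))
print(tag("DB temp unavailable"))
# The wording variant is missed until an SME adds yet another rule for it.
print(tag("DB temporarily unavailable"))
```

Each wording variant that slips through needs another rule, which is where the long tuning process comes from.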
SLIDE 7

The Alternative

Data driven machine learning modeling

Our dream approach:

  • Model the data that you have, not the data you think you have
  • Automate
  • Don’t automate so much that the user is left out of the loop
  • Reproducible (quickly integrate to a new business unit)
  • Quick (analysts can tune the modeling engine)

Pipeline: 1) Preprocess -> 2) Unsupervised clustering -> 3) SME Annotation -> 4) Actionable Insights

SLIDE 8

ETL / Technologies

Show me the data!

Where is the data? This is actually the hardest step.

1) Database column (most likely)
2) Hadoop

How to access it?

1) In the Pivotal Greenplum big data warehouse
2) Pivotal HD (Hadoop by Pivotal)
3) For streams - GemFireXD

SLIDE 9

Preprocess

Smart dimensionality reduction

Classic preprocessing of text:

  • A friendly tokenization regexp

re.findall(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+", s)

  • Then use the Porter stemmer:

– “ponies” -> “poni”
– “expression” -> “express”

Drawbacks:

  • Loses information
  • Eyesore for the customer
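
The tokenization regexp from this slide can be tried on a short string as sketched below. The sample sentences and the `tokenize` wrapper are assumptions added for illustration. The pattern is written as a raw string so the backslashes reach the regex engine unchanged. Stemming is not shown here.

```python
import re

# The slide's pattern: runs of capitals (acronyms), a capitalized word that is
# directly followed by another capital (CamelCase split), or any run of word
# characters, apostrophes and hyphens.
TOKEN_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|['\w\-]+")


def tokenize(s):
    """Split a string into tokens using the slide's regexp."""
    return TOKEN_RE.findall(s)


print(tokenize("Exchange backup failing on NetWorker"))
# ['Exchange', 'backup', 'failing', 'on', 'Net', 'Worker']
print(tokenize("EMC hot-fix"))
# ['EMC', 'hot-fix']
```

Note that CamelCase product names such as “NetWorker” are split in two, while hyphenated terms such as “hot-fix” stay whole.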

SLIDE 10

Preprocess

Smart dimensionality reduction

Before we start the modeling

  • Lemmatize instead of stem:

– Preprocess text with POS tagger to get the base form
– Use NLTK lemmatizers to get the most probable base form

  • Word Clusters

– Create a Deep Learning representation of the words (word2vec)
– Extract likely synonyms using heuristics
– Allow the SMEs to edit the synonym dictionary
– Can be leveraged for query expansion for Search

SLIDE 11

Preprocess – Synonym Extraction

Smart dimensionality reduction

backup 15492, backups 4419, backed 386, backup's 32, bakcup 28, bakup 14, backuped 14, bacup 8, backp 8, buckup 7, backu 7, backus 5, backuo 5, backup1 4, backkup 4

licenses 347, licence 119, licences 54

networker 8703, netwoker 59, netwroker 22, netowrker 15, networke 13, neworker 10, netorker 5
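
One possible heuristic for building such clusters is sketched below: every word is mapped to the most frequent word that looks sufficiently similar. This is an assumption, not the method from the talk. Plain string similarity from Python's `difflib` stands in for word2vec neighbours, and the `build_synonym_map` name and the 0.75 threshold are hypothetical. The counts are a subset of the numbers on the slide.

```python
from difflib import SequenceMatcher

# (word -> count), a subset of the counts shown on the slide.
COUNTS = {
    "backup": 15492, "backups": 4419, "bakcup": 28, "bakup": 14,
    "networker": 8703, "netwoker": 59, "netwroker": 22,
    "licenses": 347, "licence": 119,
}


def build_synonym_map(counts, min_ratio=0.75):
    """Map each word to the most frequent, sufficiently similar cluster head."""
    canonical = {}
    # Visit words from most to least frequent so that heads are frequent forms.
    for word in sorted(counts, key=counts.get, reverse=True):
        heads = [
            head for head in set(canonical.values())
            if SequenceMatcher(None, word, head).ratio() >= min_ratio
        ]
        # A word with no similar head starts its own cluster.
        canonical[word] = max(heads, key=counts.get) if heads else word
    return canonical


synonyms = build_synonym_map(COUNTS)
for word, head in sorted(synonyms.items()):
    print(f"{word} -> {head}")
```

The resulting dictionary is the kind of artifact an SME could then review and edit by hand.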

SLIDE 12

Unsupervised clustering

Topic Modeling

  • Data driven approach requires that we look for the topics present in the data
  • Topic Modeling has been established as a premier approach in Statistical Machine Learning
  • Latent Dirichlet Allocation, the mixture model approach by Blei, Ng and Jordan, has been cited by 8,500 academic papers
  • Input: Text divided into documents
  • Output: Soft topics
  • Recipes:

– Symmetric Vs. Asymmetric prior
– Redundancy reduction (de-dup)
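
To make the input and output concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, a standard inference method that is not necessarily the one used in the talk. The toy documents, the function name `lda_gibbs` and the hyperparameter values are assumptions. The `alpha` list holds one prior weight per topic, so equal values give a symmetric prior and unequal values a fixed asymmetric one.

```python
import random

# Toy input: text already divided into tokenized documents (made-up examples).
DOCS = [
    ["exchange", "backup", "failing"],
    ["backup", "job", "failing"],
    ["disk", "replacement", "request"],
    ["disk", "failed", "replacement"],
]


def lda_gibbs(docs, num_topics, alpha, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    alpha is a list with one prior weight per topic.
    Returns (theta, n_kw): per-document topic mixtures and topic-word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    n_dk = [[0] * num_topics for _ in docs]      # document-topic counts
    n_kw = [dict() for _ in range(num_topics)]   # topic-word counts
    n_k = [0] * num_topics                       # tokens assigned to each topic
    z = []                                       # topic assignment of each token

    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] = n_kw[k].get(w, 0) + 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (n_dk[d][t] + alpha[t])
                    * (n_kw[t].get(w, 0) + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] = n_kw[k].get(w, 0) + 1
                n_k[k] += 1

    # Soft topics: smoothed topic proportions per document.
    alpha_sum = sum(alpha)
    theta = [
        [(n_dk[d][t] + alpha[t]) / (len(doc) + alpha_sum) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, n_kw


theta, topic_words = lda_gibbs(DOCS, num_topics=2, alpha=[0.5, 0.5])
for row in theta:
    print([round(p, 2) for p in row])
```

Each printed row is one document's soft assignment over the two topics. A real corpus would call for an optimized library rather than this pure-Python loop.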

SLIDE 13

Unsupervised clustering

Topic Modeling – In practice

  • Read: “Care and Feeding of Topic Models” by Boyd-Graber, Mimno and Newman
  • Asymmetric Priors (Wallach)

– Vanilla LDA assumes all topics are equally likely
– I’ve never encountered such a corpus
– Assume the prior for each topic is different and sample it as well
– Not supported by most off-the-shelf big data LDA solutions
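
In symbols, the difference between the two priors can be written as below. The notation is assumed rather than taken from the slides: \theta_d is the topic mixture of document d, K the number of topics, V the vocabulary size, \beta the topic-word smoothing, and n^{-di} denotes counts with token i of document d left out. The second line is the standard collapsed Gibbs conditional, in which the per-topic \alpha_k replaces the single shared \alpha.

```latex
\theta_d \sim \mathrm{Dir}(\alpha, \dots, \alpha) \quad \text{(symmetric)}
\qquad
\theta_d \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_K) \quad \text{(asymmetric)}

p(z_{di} = k \mid \mathbf{z}^{-di}, \mathbf{w}) \;\propto\;
\left(n_{dk}^{-di} + \alpha_k\right)
\frac{n_{k w_{di}}^{-di} + \beta}{n_{k}^{-di} + V\beta}
```

With an asymmetric prior the values \alpha_k are themselves inferred from the data instead of being fixed by hand.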

SLIDE 14

Unsupervised clustering

Topic Modeling – In practice

  • Redundancy (Cohen et al.)

– Copy-paste / boilerplate text introduces noise into the topic distribution (see the paper)
– These occur a lot in corporate data sets
– Remove redundant documents (e.g. 10,000 occurrences of “SR closed”)
– Alternatively sample with Redundancy Aware LDA
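
The simplest form of redundancy removal, dropping exact duplicates after light normalization, can be sketched as follows. This is an assumed baseline, not the method of the cited paper. It does not catch near-duplicates and it is not Redundancy Aware LDA. The `normalize` and `dedupe` names and the sample documents are hypothetical.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def dedupe(docs):
    """Keep the first occurrence of each normalized document.

    Returns (unique_docs, counts), where counts records how often each
    normalized text occurred in the input.
    """
    counts = Counter(normalize(d) for d in docs)
    seen = set()
    unique = []
    for doc in docs:
        key = normalize(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique, counts


docs = ["SR closed", "sr  closed", "Exchange backup failing", "SR closed "]
unique, counts = dedupe(docs)
print(unique)
print(counts.most_common(1))
```

The counts are worth keeping, because a boilerplate line that occurs thousands of times is itself a signal for the analysts.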

SLIDE 15

SME Annotation

Inject the domain knowledge

  • Unsupervised approaches provide us with clusters of documents / words
  • How can we use this to benefit the business need?
  • Have the Subject Matter Expert explore the clusters and name them
  • Provide as many layers of information as possible to make it easy
  • Coach them first to understand that the precision is never 100%
  • Allow them to tune the results

SLIDE 16

Good old Business Intelligence

Leverage the tags

Actionable Insights

SLIDE 17

Good old Business Intelligence

Chat transcripts

Actionable Insights

Analyze chat session.

  • What are the topics associated with the conversation?
  • How quickly does the support person zoom in on the problem?
SLIDE 18

Good old Business Intelligence

Leverage the tags – Combine with structured data

Topic distribution according to “problem code”.
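
Once every service request carries a topic tag, combining it with a structured field is a plain group-by. The sketch below assumes toy records with made-up problem codes (“P01”, “P02”), and the `topic_share_by_field` helper is hypothetical. In a data warehouse the same aggregation would normally be a SQL GROUP BY.

```python
from collections import Counter, defaultdict

# Toy records: (problem code, topic tag). The codes are made up; the topic
# names are borrowed from the deck for flavor.
RECORDS = [
    ("P01", "disk replacement"),
    ("P01", "disk replacement"),
    ("P01", "hot-fix"),
    ("P02", "hot-fix"),
]


def topic_share_by_field(records):
    """Return {field value: {topic: share of that field value's records}}."""
    counts = defaultdict(Counter)
    for field_value, topic in records:
        counts[field_value][topic] += 1
    return {
        field_value: {t: n / sum(c.values()) for t, n in c.items()}
        for field_value, c in counts.items()
    }


for code, shares in topic_share_by_field(RECORDS).items():
    print(code, {t: round(s, 2) for t, s in shares.items()})
```

Swapping the problem code for another structured column, such as location, gives the corresponding breakdown for that field.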

Actionable Insights

SLIDE 19

SLIDE 20

Good old Business Intelligence

Leverage the tags – Combine with structured data

Topic distribution according to “location”.

  • Disk replacement is prevalent in the Americas
  • Hot-fix is more common in Europe
  • Can zoom in on the requests to analyze

Actionable Insights

SLIDE 21

Good old Business Intelligence

Easily combine Search with Topics / Machine Learning

Use GP-Text for Lucene text searches.

  • What are the topics associated with “node”?
  • What are the structured fields?
  • GPDB supports Machine Learning – throw in sentiment analysis in 5 minutes

Actionable Insights