1 EMC CONFIDENTIAL—INTERNAL USE ONLY.
Lightly Supervised Content Modeling for Corporate Text Analytics
Raphael Cohen
Data Science as a Service, EMC
Talk Outline
Text Analytics for Customer Services
- Customer interaction data
– Structured Data Vs. Textual data
- Motivation / Objectives
- Rule based approach
- Data driven modeling
- Machine learning for preprocessing
- Modeling topics
- Injecting SME knowledge to a topic model
- Cool Results
Customer Data – in our classic CRM DB
We care…
Customer interactions:
- Service Request
- Support Chat
Structured Data:
- Date
- Employee id
- Product Type
- Problem Code
- Resolution Code
“Something that makes me particularly proud is the use of Big Data analytics to create a detailed picture of service delivery characteristics for continuous improvement.”
Kevin Roche, Senior Vice President, EMC Global Services
Customer Data – In the Data Lake
Unstructured Data:
- Problem Summary (e.g. “Exchange backup failing”)
- Resolution Summary (e.g. “hf 23.45 applied, issue solved”)
- Chat data:
– Customer: “hello”
– Helpdesk: “hi tom, how can I help”
– Customer: “my daily exchange backup is failing…”
– Helpdesk: “did you try to restart the service?”
Objectives / Motivation
Service organization wish list
- Early detection of emerging problems
– Example, “we are getting a lot of service requests regarding Exchange 2010 backups in version 2.41, let’s initiate service pack install in all of these”
- Root Cause Analysis
– Search for similar problem descriptions and rank the solutions
- Identifying call volume drivers
– Example, “oh, we are spending 10% of our time in Europe on VM memory problems”
- Improving service
– According to chat transcripts this employee is slow in identifying code bug issues
Rule Based approach
The old industry standard
Cycle: 1) SME writes rules 2) Evaluate on some documents 3) Identify common errors
- Subject matter experts create keyword rules (install -> “install calls”)
- Long tuning process (up to 6 months)
- Usually low recall
- High precision requires more and more complex rules, e.g. “DB temp unavailable” / “DB temporarily unavailable”…
- A strong preprocessing unit + rule creation engine cuts the manual labor from years to months
- Lets the users feel in control
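To make the keyword-rule idea concrete, here is a minimal sketch of such a rule set. The rule names, patterns, and the `tag` helper are hypothetical illustrations, not the rule engine used in the talk; the point is only that each phrasing variant needs its own pattern, and anything the SME did not anticipate is missed.

```python
import re

# Hypothetical keyword rules: tag name -> list of regular expressions.
RULES = {
    "install calls": [r"\binstall"],
    # One pattern has to cover both "temp" and "temporarily".
    "DB unavailable": [r"\bdb temp(orarily)? unavailable\b"],
}

def tag(text, rules=RULES):
    """Return the sorted tags whose patterns match the text (case-insensitive)."""
    return sorted(
        name
        for name, patterns in rules.items()
        if any(re.search(p, text, re.IGNORECASE) for p in patterns)
    )

both = tag("DB temp unavailable after install")
variant = tag("DB temporarily unavailable")
missed = tag("database is temporarily down")  # no rule anticipates this phrasing
```

The last call returns no tags, which is the low-recall problem from the list above: precision stays high only as long as someone keeps adding patterns.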
The Alternative
Data driven machine learning modeling
Our dream approach:
- Model the data that you have, not the data you think you have
- Automate
- Don’t automate so much that the user is left out of the loop
- Reproducible (quickly integrate to a new business unit)
- Quick (analysts can tune the modeling engine)
Pipeline: 1) Preprocess 2) Unsupervised clustering 3) SME Annotation 4) Actionable Insights
ETL / Technologies
Show me the data!
Where is the data? This is actually the hardest step.
1) Database column (most likely)
2) Hadoop
How to access it?
1) In the Pivotal GreenPlum big data warehouse
2) Pivotal HD (Hadoop by Pivotal)
3) For streams - GemFireXD
Preprocess
Smart dimensionality reduction
Classic preprocessing of text:
- A friendly tokenization regexp
re.findall("[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+",s)
- Then use the porter stemmer:
– “ponies” -> “poni”
– “expression” -> “express”
Drawbacks:
- Loses information
- Eyesore for the customer
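As a rough illustration of the tokenization step, the sketch below runs the regular expression from the slide on a made-up sentence. The `tokenize` helper and the sample text are assumptions for demonstration; the stemming step is not shown because it needs NLTK.

```python
import re

# Regular expression copied from the slide, written as a raw string.
# Alternatives, in order: acronyms (2+ capitals), a CamelCase part that is
# followed by another capital, then any run of word characters, ' or -.
TOKEN_RE = re.compile(r"[A-Z]{2,}(?![a-z])|[A-Z][a-z]+(?=[A-Z])|[\'\w\-]+")

def tokenize(text):
    """Return the list of tokens the slide's regular expression finds in text."""
    return TOKEN_RE.findall(text)

# Made-up sentence, not taken from real service requests.
sample = "NetWorker backup failing, can't restart the DB service"
tokens = tokenize(sample)
print(tokens)
```

Note that the CamelCase alternative splits “NetWorker” into “Net” and “Worker”, while “can't” and “DB” each stay whole.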
Preprocess
Smart dimensionality reduction
Before we start the modeling
- Lemmatize instead of stem:
– Preprocess text with a POS tagger to get the base form
– Use NLTK lemmatizers to get the most probable base form
- Word Clusters
– Create a Deep Learning representation of the words (word2vec)
– Extract likely synonyms using heuristics
– Allow the SMEs to edit the synonym dictionary
– Can be leveraged for query expansion for Search
Preprocess – Synonym Extraction
Smart dimensionality reduction
backup 15492, backups 4419, backed 386, backup's 32, bakcup 28, bakup 14, backuped 14, bacup 8, backp 8, buckup 7, backu 7, backus 5, backuo 5, backup1 4, backkup 4
licenses 347, licence 119, licences 54
networker 8703, netwoker 59, netwroker 22, netowrker 15, networke 13, neworker 10, netorker 5
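The clusters above pair each frequent word with its rarer variants. The sketch below shows one possible heuristic for building such a synonym dictionary: map each word to the most frequent earlier word it closely resembles. It is a stand-in, not the word2vec-based method from the talk; it uses only string similarity from the standard library, and the `build_synonym_map` name and the 0.75 threshold are assumptions.

```python
from difflib import SequenceMatcher

# A few (word, count) pairs taken from the slide above.
counts = {
    "backup": 15492, "backups": 4419, "bakcup": 28, "bakup": 14,
    "networker": 8703, "netwoker": 59, "netwroker": 22,
    "licenses": 347, "licence": 119,
}

def build_synonym_map(counts, threshold=0.75):
    """Map every word to a canonical head word.

    Words are visited from most to least frequent; a word joins the first
    existing head it resembles (similarity >= threshold), else becomes a head.
    """
    heads = []
    canon = {}
    for word in sorted(counts, key=counts.get, reverse=True):
        match = next(
            (h for h in heads
             if SequenceMatcher(None, word, h).ratio() >= threshold),
            None,
        )
        if match is None:
            heads.append(word)
            canon[word] = word
        else:
            canon[word] = match
    return canon

canon = build_synonym_map(counts)
print(canon["bakcup"], canon["netwroker"], canon["licence"])
```

A dictionary like this is easy for an SME to review and edit, which is the step the previous slide asks for.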
Unsupervised clustering
Topic Modeling
- Data driven approach requires that we look for the topics present in the data
- Topic Modeling has been established as a premier approach in Statistical Machine Learning
- Latent Dirichlet Allocation, the mixture model approach by Blei, Ng and Jordan, has been cited by 8,500 academic papers
- Input: Text divided into documents
- Output: Soft topics
- Recipes:
– Symmetric Vs. Asymmetric prior
– Redundancy reduction (de-dup)
Unsupervised clustering
Topic Modeling – In practice
- Read: “Care and Feeding of Topic Models” by Boyd-Graber, Mimno and Newman
- Asymmetric Priors (Wallach)
– Vanilla LDA assumes all topics are equally likely
– I’ve never encountered such a corpus
– Assume the prior for each topic is different and sample it as well
– Not supported by most big data LDA off the shelf solutions
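The difference between the two priors can be seen by sampling document-topic proportions directly. This sketch only illustrates what the prior implies; it does not perform LDA inference or learn the prior as described by Wallach. The alpha values, helper names, and number of simulated documents are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one topic-proportion vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mean_proportions(alpha, n_docs=2000, seed=0):
    """Average the sampled topic proportions over n_docs simulated documents."""
    rng = random.Random(seed)
    sums = [0.0] * len(alpha)
    for _ in range(n_docs):
        for k, p in enumerate(sample_dirichlet(alpha, rng)):
            sums[k] += p
    return [s / n_docs for s in sums]

# Symmetric prior: every topic is expected to be equally common.
symmetric = mean_proportions([1.0, 1.0, 1.0, 1.0])
# Asymmetric prior (illustrative values): topic 0 is expected to dominate.
asymmetric = mean_proportions([5.0, 1.0, 0.5, 0.5])
print(symmetric, asymmetric)
```

Under the symmetric prior the averages sit near 1/4 each, while under the asymmetric one each topic's average share approaches its alpha divided by the sum of all alphas.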
Unsupervised clustering
Topic Modeling – In practice
- Redundancy (Cohen et al.)
– Copy paste / boiler plate text introduces noise into the topic distribution (see the paper)
– These occur a lot in corporate data sets
– Remove redundant documents (e.g. 10,000 occurrences of “SR closed”)
– Alternatively sample with Redundancy Aware LDA
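The simplest form of the “remove redundant documents” option is exact de-duplication after light normalization, sketched below. The `normalize` and `deduplicate` helpers and the sample documents are assumptions; near-duplicate detection and Redundancy Aware LDA are not covered here.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(docs):
    """Keep the first occurrence of each normalized document.

    Returns the de-duplicated list and a Counter of how often each
    normalized text appeared, which helps spot boilerplate.
    """
    counts = Counter(normalize(d) for d in docs)
    seen = set()
    unique = []
    for d in docs:
        key = normalize(d)
        if key not in seen:
            seen.add(key)
            unique.append(d)
    return unique, counts

# Made-up documents for illustration.
docs = ["SR closed", "sr  closed ", "Exchange backup failing", "SR closed"]
unique, counts = deduplicate(docs)
print(unique, counts.most_common(1))
```

Sorting the counter by frequency gives a quick list of boilerplate candidates to show the SME before anything is dropped.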
SME Annotation
Inject the domain knowledge
- Unsupervised approaches provide us with clusters of documents / words
- How can we use this to benefit the business need?
- Have the Subject Matter Expert explore the clusters and name them
- Provide as many layers of information as possible to make it easy
- Coach them first to understand that the precision is never 100%
- Allow them to tune the results
Good old Business Intelligence
Leverage the tags
Actionable Insights
Good old Business Intelligence
Chat transcripts
Analyze chat session.
- What are the topics associated with the conversation?
- How quickly does the support person zoom in on the problem?
Good old Business Intelligence
Leverage the tags – Combine with structured data
Topic distribution according to “problem code”.
Good old Business Intelligence
Leverage the tags – Combine with structured data
Topic distribution according to “location”.
- Disk replacement is prevalent in the Americas
- Hot-fix is more common in Europe
- Can zoom in on the requests to analyze
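Combining topic tags with a structured field amounts to a group-by followed by normalization. The sketch below assumes each service request is a record with one topic tag; the `topic_share_by_field` helper, the field names, and the rows are made up for illustration and are not results from the talk.

```python
from collections import Counter, defaultdict

def topic_share_by_field(records, field):
    """For each value of a structured field, return the share of each topic tag."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec[field]][rec["topic"]] += 1
    return {
        value: {topic: n / sum(c.values()) for topic, n in c.items()}
        for value, c in counts.items()
    }

# Made-up rows: one dict per tagged service request.
records = [
    {"location": "Americas", "topic": "disk replacement"},
    {"location": "Americas", "topic": "disk replacement"},
    {"location": "Americas", "topic": "hot-fix"},
    {"location": "Europe", "topic": "hot-fix"},
    {"location": "Europe", "topic": "hot-fix"},
    {"location": "Europe", "topic": "hot-fix"},
    {"location": "Europe", "topic": "disk replacement"},
]
shares = topic_share_by_field(records, "location")
print(shares)
```

Passing a different field name, such as a problem code column, would give the per-problem-code view in the same way.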
Good old Business Intelligence
Easily combine Search with Topics / Machine Learning
Use GP-Text for Lucene text searches.
- What are the topics associated with “node”?
- What are the structured fields?
- GPDB supports Machine Learning – throw in sentiment analysis in 5 minutes