Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Setting the Scene
The Data Science Process
Supervised and Unsupervised Learning
Introduction
Overview
Setting the scene
Data scientists
Data quality
The analytics process model
Predictive versus descriptive analytics
Example applications
2
Setting the scene
3
Living in a data flooded world
https://deepmind.com/blog/alphago-zero-learning-scratch/
“DeepMind’s AI became a superhuman chess player in a few hours, just for fun. The descendant of DeepMind’s world champion Go program stretches its muscles in a new domain.”
2015 4
Living in a data flooded world
https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/
2017 5
Living in a data flooded world
https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/
2019 6
Living in a data flooded world
http://www.nvidia.com/object/drive-px.html http://kevinhughes.ca/blog/tensor-kart
7
Living in a data flooded world
http://affinelayer.com/pixsrv/index.html
8
Living in a data flooded world
9
Living in a data flooded world, continued
https://www.nature.com/articles/nature21056.epdf
10
Living in a data flooded world, continued
https://mashable.com/2018/02/26/ai-beats-humans-at-contracts/
An AI just beat top lawyers at their own game A new study, conducted by legal AI platform LawGeex in consultation with law professors from Stanford University, Duke University School of Law, and University of Southern California, pitted twenty experienced lawyers against an AI trained to evaluate legal contracts. Competitors were given four hours to review five non-disclosure agreements (NDAs) and identify 30 legal issues, including arbitration, confidentiality of relationship, and indemnification. They were scored by how accurately they identified each issue. Unfortunately for humanity, we lost the competition — badly.
11
Living in a data flooded world, continued
http://3dgan.csail.mit.edu/
12
Living in a data flooded world, continued
https://arxiv.org/abs/1711.10669
Image2Mesh
“Most of us take for granted the ability to effortlessly perceive our surroundings world and its objects in three dimensions. In general, we have great ideas about the 3D space only by looking at a single 2D image of an object even when there are many possible shapes that could have produced the same image. We simply rely on assumptions and prior knowledge acquired throughout our lives for the inference. It is one of the fundamental goals of computer vision to give machines the ability to perceive its surroundings as we do, for the purpose of providing solutions to tasks such as self-driving cars, virtual and augmented reality, robotic surgery, to name a few.”
13
Living in a data flooded world, continued
http://web.mit.edu/vondrick/tinyvideo/
14
Living in a data flooded world, continued
http://karpathy.github.io/2015/10/25/selfie/
15
Living in a data flooded world, continued
https://github.com/ipsingh06/ml-desnapify
16
Living in a data flooded world, continued
https://arxiv.org/abs/1701.04928
17
Living in a data flooded world, continued
https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph
18
Living in a data flooded world, continued
https://www.technologyreview.com/s/612775/algorithms-criminal-justice-ai/
19
Living in a data flooded (real) world
20
Data science and data scientists
21
Data science
Data contains value and knowledge But to extract this knowledge, you need to be able to:
Store it Manage it Analyze it
Terms often used interchangeably:
Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Knowledge Discovery ≈ Artificial Intelligence ≈ Deep Learning
Don't worry too much about this and don't be too swayed by Venn diagrams or infographics
22
Data science
https://vas3k.com/blog/machine_learning/?ref=hn
What even is this? 23
Data scientists
https://www.mckinsey.com/~/media/mckinsey/business functions/mckinsey digital/our insights/big data the next frontier for innovation/mgi_big_data_exec_summary.ashx
24
https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons
“If you want to command a multiyear, seven-figure salary, you used to have only four career options: chief executive officer, banker, celebrity entertainer, or pro athlete. Now there’s a fifth: artificial intelligence expert.”
Data scientists
"I suspect AI today is like big data ten years ago"
“Exactly. Also as soon as big data came around nobody was doing just data, everyone was doing big data even if they had the same 10GB MySQL database they had from previous years. AI is a bit the same. Doing any analytics? Now it's AI. Opening an excel spreadsheet and doing a curve fit: I am a data scientist doing AI. Doing any actual ML: not learning anymore but super deep learning.” (https://news.ycombinator.com/item?id=16366815)
25
Data scientists
https://www.techrepublic.com/article/why-data-scientist-is-the-most-promising-job-of-2019/
“LinkedIn found... Top careers in data science include core data scientist, researcher, and big data specialist.”
26
Defining the data scientist
A data scientist should have solid quantitative skills
A data scientist should be a good programmer
A data scientist should excel in communication and visualization skills
A data scientist should have a solid business understanding
A data scientist should be creative
27
What's analytics all about?
Given ((huge) lots) of data, discover patterns and models that are:
Valid: hold on new data with some certainty, i.e. generalizable
  Over time, seasonal effects, overfitting, sub-groups, regional differences…
Useful: should be possible to act on the item, i.e. actionable
  Business question, implementation, maintenance costs, ease-of-use…
Unexpected: non-obvious to the system, i.e. interesting
  Balance between trust and discovery…
  Big and “weird” data
Understandable: humans should be able to interpret the pattern
  Black box vs. white box, trust, validity…
28
Valid, generalizable
https://www.gwern.net/Tanks
“RL agent in Udacity self-driving car rewarded for speed learns to spin in circles” (https://twitter.com/mat_kelcey/status/886101319559335936)
“NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days survival, yields an optimal plan of killing 2/3 crew & keep survivor alive as long as possible”
29
Useful, actionable
We can predict who will churn, and then what?
30
Unexpected, interesting
https://www.fastcompany.com/3063110/the-rise-of-weird-data
Optimizing right turns for UPS drivers
Typing with proper capitalization indicates creditworthiness
Users of the Chrome and Firefox browsers make better employees
But: not everything which is unexpected is interesting, or valid
31
Understandable
"Why does your model predict fraud?" "Which attributes of the customer are important?" "If age goes up you're more at risk?" "I don't understand interaction effects" "What do you mean 'just trust us'?" ...
32
Ethical?
Models become increasingly complex…
But they also rule many aspects of our lives, from credit scoring to employment, all the way down to predicting recidivism
People are becoming increasingly aware of what is being done with their data, and more protective of their privacy and of their right to challenge a model’s conclusion
The White House released a statement regarding the promises and dangers of analytics: “Big Risks, Big Opportunities: the Intersection of Big Data and Civil Rights”, and there are many more examples
33
Ethical?
34
Ethical?
35
The analytics process model
36
CRISP-DM
Cross Industry Standard Process for Data Mining
37
Others
SEMMA
Sample, Explore, Modify, Model, and Assess https://en.wikipedia.org/wiki/SEMMA
The drivetrain approach
Jeremy Howard, Margit Zwemer and Mike Loukides https://www.oreilly.com/ideas/drivetrain-approach-data-products
38
Challenges
Mapping the business question to a technique / setup (there is no one-size-fits-all)
(Not) realizing the amount of effort required in pre-processing
Low amount of training data, either instances or features
  Or too many features…
Huge data imbalance, or not even labeled data
Quality of data, noise
Predicting the future is hard (who’d have thought!): hard to extrapolate towards the future for many models (machines are naïve and lazy)
Incorporating domain knowledge, explaining models
Strong validation / backtesting setup requires time and enough data
Organizational aspects, teams, management
40
Some quotes
“During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage.” – Gartner
“Data maturity of companies is very disparate, and the most advanced of them start doubting.” – Christophe Bourguignat
“75% have invested in Big Data, but only 10% have projects in production.”
“Companies face disillusions. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?”
41
Data quality
GIGO principle
Garbage in, Garbage out; messy data gives messy models
In many cases, simple analytical models perform well, so biggest performance increase comes from the data!
Baesens et al., 2003; Van Gestel, Baesens et al., 2004; Holte, 1993
Importance of Master Data Management and Data quality programmes!
But modern data science requires a more nuanced view
“The best way to improve the performance of an analytical model is not to look for fancy tools or techniques, but to improve DATA QUALITY first!” (Baesens B., It’s the data, you stupid!, Data News, 2007)
42
Data quality criteria
Data accuracy
E.g., outliers: an age of 300 years versus an income of 1.000.000 Euro (not the same: the first is impossible, the second is merely extreme)
Data completeness
Are missing values important?
Data bias and sampling
Try to minimise, but can never totally get rid of
Data definition
Variables: what is the meaning of 0? Target: fraud, churn, default, customer lifetime value (CLV), ….
Data recency/latency
Refresh frequency
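A minimal sketch of how the accuracy and completeness criteria could be checked on a small data set. The records, the field names, the plausible age range and the "ten times the median" rule are all assumptions made for illustration; real checks depend on the data definition agreed with the business.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
customers = [
    {"age": 34, "income": 2800.0},
    {"age": 300, "income": 3100.0},      # impossible age: an invalid value
    {"age": 51, "income": 1_000_000.0},  # extreme but possible income
    {"age": None, "income": 2500.0},     # missing age
]

# Assumed plausibility range for age, chosen for illustration only.
VALID_AGE = (0, 120)


def missing_rate(rows, field):
    """Completeness: fraction of rows where the field is missing."""
    return sum(r[field] is None for r in rows) / len(rows)


def invalid_values(rows, field, low, high):
    """Accuracy: observed values outside a domain-defined plausible range."""
    return [r[field] for r in rows
            if r[field] is not None and not low <= r[field] <= high]


def extreme_values(rows, field, factor=10.0):
    """Values far above the median: valid, but worth a second look."""
    values = [r[field] for r in rows if r[field] is not None]
    cutoff = factor * median(values)
    return [v for v in values if v > cutoff]


print(missing_rate(customers, "age"))
print(invalid_values(customers, "age", *VALID_AGE))
print(extreme_values(customers, "income"))
```

The two outlier functions are kept separate on purpose: an invalid value is usually corrected or treated as missing, while an extreme but valid value may be capped or simply kept.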
43
The data science team
Database / Datawarehouse administrator
Business expert (e.g. marketeer, credit risk analyst, …)
Legal expert
Data scientist / data miner
Software / tool vendors
A multidisciplinary team needs to be set up! (A data scientist is not a magic unicorn)
44
The aftercare
Interpretation and validation of analytical models by business experts
Trivial versus unexpected (interesting?) patterns
Sensitivity analysis
How sensitive is the model with regards to sample characteristics, assumptions and/or technique parameters?
Deploy analytical model into business setting
Represent model output in a user-friendly way Integrate with campaign management tools and marketing decision engines
Model monitoring and backtesting
Continuously monitor model output Contrast model output with observed numbers
We'll highlight this again in later sessions. Still very much the "forgotten" part of data science
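As a rough illustration of "contrast model output with observed numbers", the sketch below compares a model's average predicted churn probability with the churn rate observed afterwards, month by month. The monitoring log, its field names and the tolerance are hypothetical values, not part of any specific tool.

```python
# Hypothetical monitoring log: per month, the model's average predicted
# churn probability and the churn rate that was later observed.
monitoring_log = [
    {"month": "2019-01", "predicted": 0.020, "observed": 0.021},
    {"month": "2019-02", "predicted": 0.020, "observed": 0.019},
    {"month": "2019-03", "predicted": 0.021, "observed": 0.034},
]

# Assumed alert threshold on the absolute gap; to be tuned per application.
TOLERANCE = 0.005


def backtest(log, tolerance):
    """Return the months in which predictions and observations diverge."""
    return [row["month"] for row in log
            if abs(row["predicted"] - row["observed"]) > tolerance]


alerts = backtest(monitoring_log, TOLERANCE)
print(alerts)
```

A flagged month is only a trigger for investigation: the gap may come from a changed population, a data problem, or a model that needs to be re-estimated.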
45
Analytics: the basics
46
What's analytics all about?
Given ((huge) lots) of data, discover patterns and models that are:
Valid: hold on new data with some certainty, i.e. generalizable
  Over time, seasonal effects, overfitting, sub-groups, regional differences…
Useful: should be possible to act on the item, i.e. actionable
  Business question, implementation, maintenance costs, ease-of-use…
Unexpected: non-obvious to the system, i.e. interesting
  Balance between trust and discovery…
  Big and “weird” data
Understandable: humans should be able to interpret the pattern
  Black box vs. white box, trust, validity…
47
Analytics
Essentially refers to extracting valid, useful, interesting and understandable business patterns and/or mathematical decision models from a preprocessed data set
Predictive analytics (supervised learning)
  Predict the future based on patterns learnt from past data
  Classification (categorical) versus regression (continuous)
  You have a labelled data set at your disposal
Descriptive analytics (unsupervised learning)
  Describe patterns in data
  Clustering, association rules, sequence rules
  No labelling required
(There is more than just these two)
48
Basic terminology
A tabular data set ("structured data"):
Has instances (examples, rows, observations, customers, ...) and attributes (features, fields, variables, predictors) These features can be:
Numeric (continuous) or categorical (discrete, nominal, ordinal, factor, binary)
Target (label, class, dependent variable) can be present
Can also be numeric, categorical, ...
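To make the terminology concrete, here is a small sketch of a tabular data set with instances (rows), attributes (columns) and a target. The column names and values are invented, and the type check is deliberately crude.

```python
# Hypothetical churn data set: each row is an instance, each column an attribute.
columns = ["age", "contract_type", "monthly_spend", "churned"]
rows = [
    [34, "postpaid", 42.5, "no"],
    [51, "prepaid", 12.0, "yes"],
    [27, "postpaid", 60.0, "no"],
]
target = "churned"  # the label; absent in an unsupervised setting


def split_features_target(columns, rows, target):
    """Separate the feature columns from the target column."""
    idx = columns.index(target)
    features = [c for c in columns if c != target]
    X = [[v for i, v in enumerate(row) if i != idx] for row in rows]
    y = [row[idx] for row in rows]
    return features, X, y


def feature_kind(values):
    """Crude check: numbers are numeric, everything else is categorical."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    return "categorical"


features, X, y = split_features_target(columns, rows, target)
kinds = {name: feature_kind([row[i] for row in X])
         for i, name in enumerate(features)}
print(kinds)
print(y)
```

Here the target is categorical, so predicting it would be a classification task; with a numeric target such as monthly spend it would be regression.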
52
What about big data?
No worries, we'll get to that...
http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data
53
Our goals
Advanced Analytics in Business [D0S07a]
Data science process Supervised learning Ensemble models Advanced techniques Unsupervised learning Working with data science tooling Some “special topics”
Big Data Platforms & Technologies [D0S06a]
Hadoop (MapReduce) Spark Other future trends NoSQL, Neo4J
54
Our goals
You don’t need big data to do analytics ... but you can
Big data doesn’t necessarily mean doing analytics ... but it can
55
It's about technology too
... though with a critical view
HDFS?
What about HDF5, or Kudu?
Do we even have unstructured data? Do we know what to do with it?
V’s of Big Data – yeah right!
BigSQL, or Hive, or Slurp?
Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?
What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X?
s/Hadoop/Deep Learning/
s/Big Data/AI/
So we start with the analytics first 57
Predictive analytics: classification
58
Predictive analytics: classification
59
Predictive analytics: regression
60
Predictive analytics: regression
61
Descriptive analytics: clustering
62
Descriptive analytics: clustering
63
Descriptive analytics: association rules
64
Descriptive analytics: association rules
65
Descriptive analytics: sequence rules
66
Descriptive analytics: sequence rules
67
Analytical model requirements
Business relevance
Solve a particular business problem
Statistical performance
Statistical significance, accuracy of model Statistical prediction performance
Interpretability and justifiability
Very subjective (depends on decision maker), but crucial! Often need to be balanced against statistical performance
Operational efficiency
How can the analytical models be integrated with campaign management?
Economical cost
What is the cost to gather the model inputs and evaluate the model? Is it worthwhile buying external data and/or models?
Regulatory compliance
In accordance with regulation and legislation
Remember: models which are valid (generalizable), useful (actionable), unexpected (interesting), understandable 68
A few real-life examples
69
Risk analytics
More than ever before, analytical models steer strategic risk decisions of financial institutions Minimum equity (buffer capital) and provisions a financial institution holds are directly determined, e.g. by
Credit risk analytics Market risk analytics Operational risk analytics Insurance risk analytics
Business analytics is typically used to build all these models Often subject to regulation (e.g. Basel II, Basel III, Solvency II, …) Model errors directly affect profitability, solvency, shareholder value, macro-economy,… society as a whole
70
Risk analytics: credit scoring
Estimate probability of default at the time the applicant applies for the loan!
Also: LGD (loss given default: predict loss if the applicant defaults)
Use predetermined definition of default (e.g. 3 months of payment arrears)
Use application variables
E.g. age, income, marital status, years at address, years with employer,…
Use bureau variables
Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
In the US: FICO scores range from 300 to 850
  Experian, Equifax, TransUnion
E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet
71
Risk analytics: credit scoring
72
Marketing analytics
Plethora of different techniques
Churn prediction (retention modeling): which customers will leave you, and why?
Response modeling: to which customers do we send out an incentive?
Segmentation modeling: grouping customers in segments
Recommender systems: which product to recommend next?
73
Marketing analytics: churn prediction
Understanding why customers leave you Customer retention is important because long term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value
Small improvements in customer retention generate significant returns. Very important in Telco sector (about 2% monthly churn rate)
Transaction versus relationship buyers
Transaction buyers: buy because of low price Relationship buyers: want to build loyal relationship with firm
Contractual versus non-contractual setting
Contractual setting: customer cancels contract (e.g. postpaid Telco)
Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)
Types of churn:
Active: customer stops relationship with firm
Passive: customer decreases intensity of relationship
Forced: company stops relationship because of e.g. fraud
Expected: customer no longer needs product/service (e.g. baby products)
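In a non-contractual setting the churn label has to be derived from behaviour. The sketch below labels a customer as churned when no purchase was made in the inactivity window before a snapshot date. The customers and dates are made up, and 90 days is used as an approximation of "3 months".

```python
from datetime import date, timedelta

# Hypothetical last purchase date per customer.
last_purchase = {
    "alice": date(2019, 1, 5),
    "bob": date(2019, 5, 20),
    "carol": date(2019, 4, 10),
}
snapshot = date(2019, 6, 1)

# Assumption: 90 days as a stand-in for "previous 3 months".
INACTIVITY_WINDOW = timedelta(days=90)


def label_churners(last_purchase, snapshot, window):
    """Mark a customer as churned if inactive for longer than the window."""
    return {customer: (snapshot - day) > window
            for customer, day in last_purchase.items()}


labels = label_churners(last_purchase, snapshot, INACTIVITY_WINDOW)
print(labels)
```

The window length is a business decision: a shorter window reacts faster but mislabels customers who simply buy infrequently.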
74
Marketing analytics: churn prediction
Typical data sources:
Demographic data: e.g. age, gender, marital status
Relationship variables: e.g. length of relationship
Product/Service usage and ownership data: e.g. number of products purchased, number of transactions in previous month, trend in usage,…
RFM data
Complaints data: e.g. number of filed complaints, service desk contacted,…
(Social) network information (cf. infra)
RFM?
Already popular since (Cullinan, 1977)
Recency: number of months since last purchase
Frequency: number of purchases within a given time frame
Monetary: dollar value of purchases
Different operationalisations possible of RFM variables
  E.g., Monetary: average/maximum/total dollar value?
  Trend variables
Can only be measured for existing customers, not for prospects (e.g. response modeling)
Often used to build a segmentation scheme or combine into a single RFM score
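A possible operationalisation of the RFM variables from a transaction log, as a sketch. The transactions and snapshot date are hypothetical; recency is counted in whole calendar months and monetary is taken as the total value, which is only one of the options listed above.

```python
from datetime import date

# Hypothetical transaction log: (customer, purchase date, amount).
transactions = [
    ("alice", date(2019, 1, 15), 40.0),
    ("alice", date(2019, 4, 2), 60.0),
    ("bob", date(2019, 5, 20), 25.0),
]
snapshot = date(2019, 6, 1)


def months_between(earlier, later):
    """Whole calendar months elapsed between two dates."""
    months = (later.year - earlier.year) * 12 + (later.month - earlier.month)
    return months - 1 if later.day < earlier.day else months


def rfm(transactions, snapshot):
    """Compute recency, frequency and monetary value per customer."""
    per_customer = {}
    for customer, day, amount in transactions:
        per_customer.setdefault(customer, []).append((day, amount))
    result = {}
    for customer, purchases in per_customer.items():
        last = max(day for day, _ in purchases)
        result[customer] = {
            "recency": months_between(last, snapshot),      # months since last purchase
            "frequency": len(purchases),                     # purchases in the log
            "monetary": sum(amount for _, amount in purchases),  # total value
        }
    return result


scores = rfm(transactions, snapshot)
print(scores)
```

Swapping `sum` for an average or a maximum gives the other monetary variants; customers without any transaction never appear in the result, which mirrors the point that RFM cannot be measured for prospects.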
75
Marketing analytics: churn prediction
Three enormous challenges...
1. Need to make a distinction between a characteristic predictor for future churn and a symptom of occurring churn
  E.g. sudden peak in usage often occurs right before churn because customer has already decided to churn
  Focus on early-warning predictors
2. Real-life churn data sets have very skewed class distribution (e.g. about 1-5% churners)
  Logistic regression and decision tree models cannot be appropriately estimated
  Use oversampling on the train (not test!) data
3. How to make it actionable?
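The second challenge can be sketched as follows: split first, then oversample the churners in the training part only, so that the test set keeps the real class distribution. The data, the 70/30 split and random duplication as the oversampling method are assumptions for illustration; other resampling schemes exist.

```python
import random


def split_train_test(rows, test_fraction, rng):
    """Shuffle a copy of the rows and split off a test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(train, rng):
    """Duplicate random churners until both classes are equally frequent."""
    churners = [r for r in train if r["churn"] == 1]
    others = [r for r in train if r["churn"] == 0]
    if not churners or len(churners) >= len(others):
        return train[:]
    extra = [rng.choice(churners) for _ in range(len(others) - len(churners))]
    return others + churners + extra


rng = random.Random(42)
# Hypothetical data set with 4% churners.
data = [{"id": i, "churn": 1 if i % 25 == 0 else 0} for i in range(1000)]

train, test = split_train_test(data, 0.3, rng)
balanced_train = oversample_minority(train, rng)  # the test set is left untouched

print(len(train), len(balanced_train), len(test))
```

Oversampling before the split would leak duplicated churners into the test set and make the evaluation look better than it is, which is why the order matters.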
76
Marketing analytics: churn prediction
77
Marketing analytics: response modeling
Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns Campaign can be mail catalogue, email, coupon, A/B or multivariate testing,… Identify the customers most likely to respond based on the following information:
Demographic variables (age, gender, marital status,…) Relationship variables (length of relationship, number of products purchased,…) RFM variables (Social) network information (cf. infra)
The key issue:
There are people who respond positively even if they don't get a letter in the mail
There are people who respond negatively even if they get a letter in the mail
There are people who respond negatively if they get a letter in the mail, but would have responded positively if we left them alone
78
Marketing analytics: response modeling
Split target group into test group and control group
Test group receives marketing material and control group does not
Incremental impact equals the additional purchases that are directly attributable to the campaign (Larsen, 2010)
Incremental impact = test group purchase rate - control group purchase rate
Try to factor in the behavior of self-selecting clients, clients that purchase regardless of the marketing campaign
Focus should be on swing clients: interested in the product, but need to be motivated (by e.g. marketing message) to take action
Both test and control group should be representative
Find a model such that the difference between the test group purchase rate and the control group purchase rate is maximized (i.e. identifying the swing clients)
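The incremental impact formula can be worked through on a toy example. Purchases are coded as 1 and non-purchases as 0; the group sizes, purchase counts and the two customer segments are invented for illustration.

```python
def purchase_rate(group):
    """Share of customers in the group who purchased (1 = purchase)."""
    return sum(group) / len(group)


def incremental_impact(test_group, control_group):
    """Test group purchase rate minus control group purchase rate."""
    return purchase_rate(test_group) - purchase_rate(control_group)


# Hypothetical campaign outcome: 12% purchase with mailing, 8% without.
test_group = [1] * 12 + [0] * 88
control_group = [1] * 8 + [0] * 92
impact = incremental_impact(test_group, control_group)

# Hypothetical segments, to see where the campaign actually changes behaviour.
segments = {
    "young": {"test": [1] * 15 + [0] * 35, "control": [1] * 5 + [0] * 45},
    "senior": {"test": [1] * 10 + [0] * 40, "control": [1] * 10 + [0] * 40},
}
uplift = {name: incremental_impact(g["test"], g["control"])
          for name, g in segments.items()}
best_segment = max(uplift, key=uplift.get)

print(round(impact, 4), best_segment)
```

In this toy data the "senior" segment buys at the same rate with or without the mailing (self-selecting clients), so the campaign budget is better spent on the segment with the larger difference.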
79
Marketing analytics: response modeling
80
Marketing analytics: response modeling
81
Marketing analytics: response modeling
82
Fraud analytics
Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms
Financial fraud Tax fraud Insurance fraud Employee fraud ...
83
Fraud analytics
Credit card transaction fraud:
Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
Store itself also processes legitimate transactions
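One simple way to exploit this card-store network, sketched under assumptions: compute for every store the share of its cards that are known to be stolen, and flag the remaining cards used at highly exposed stores for review. The transactions, card names and the 0.5 threshold are hypothetical, and this is not presented as the method behind the figure.

```python
from collections import defaultdict

# Hypothetical (card, store) transactions and known stolen cards.
transactions = [
    ("card_1", "store_A"), ("card_2", "store_A"), ("card_3", "store_A"),
    ("card_3", "store_B"), ("card_4", "store_B"),
    ("card_5", "store_C"),
]
known_stolen = {"card_1", "card_2"}


def store_exposure(transactions, known_stolen):
    """Share of the distinct cards seen at each store that are known stolen."""
    cards_per_store = defaultdict(set)
    for card, store in transactions:
        cards_per_store[store].add(card)
    return {store: len(cards & known_stolen) / len(cards)
            for store, cards in cards_per_store.items()}


def cards_to_review(transactions, known_stolen, min_exposure=0.5):
    """Cards not yet flagged that were used at a highly exposed store."""
    exposure = store_exposure(transactions, known_stolen)
    return sorted({card for card, store in transactions
                   if card not in known_stolen
                   and exposure[store] >= min_exposure})


print(store_exposure(transactions, known_stolen))
print(cards_to_review(transactions, known_stolen))
```

Because the store also processes legitimate transactions, a flagged card is a lead for investigation, not proof of fraud.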
84
Fraud analytics
Identity theft:
Before: person calls his/her frequent contacts
After: person also calls new contacts which coincidentally overlap with another person's contacts
85
Fraud analytics
Social security fraud:
Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities
86
Fraud analytics
Insurance fraud:
Exaggeration of damages “Crash for cash”
87
HR analytics
Employee churn Employee performance Employee absence Employee satisfaction Employee Lifetime Value …
88
HR analytics
Baesens, De Winne, Sels, What to Do Before You Fire a Pivotal Employee, 2016
LinkedIn based data mining
https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss
https://hbr.org/2016/01/what-to-do-before-you-fire-a-pivotal-employee
Still, in practice the selection and retention of talent remains more art than science [...]. The goal is to go beyond traditional but little-examined practices [...] to subtler metrics and methods. Big companies are growing more interested as the cost of replacing valued workers becomes clearer.