Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation
Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a] Setting the Scene The Data Science Process Supervised and Unsupervised Learning Introduction Overview Setting the scene Data science The analytics process
Overview
Setting the scene
Data science
The analytics process model
Data scientists
Example applications
2
Setting the Scene
3
Living in a data flooded world
https://deepmind.com/blog/alphago-zero-learning-scratch/
… 2015 4
Living in a data flooded world
https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/
… 2017 5
Living in a data flooded world
https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/
… 2019 6
Living in a data flooded world
https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery
… 2020 7
https://qz.com/1791222/how-artificial-intelligence-provided-early-warning-of-wuhan-virus/
https://www.vox.com/2020/1/31/21117102/artificial-intelligence-drug-discovery-exscientia
Living in a data flooded world
https://www.vox.com/recode/2020/1/28/21110902/artificial-intelligence-ai-coronavirus-wuhan
8
Living in a data flooded world
https://www.buzzfeednews.com/article/ryanmac/clearview-ai-cops-run-wild-facial-recognition-lawsuits
https://www.quantamagazine.org/artificial-intelligence-will-do-what-we-ask-thats-a-problem-20200130/
9
Living in a data flooded world
https://www.latimes.com/business/story/2020-01-21/ralphs-privacy-disclosure
10
Living in a data flooded world
The Economics of AI Today
“Every day we hear claims that Artificial Intelligence (AI) systems are about to transform the economy, creating mass unemployment and vast monopolies. But what do professional economists think about this?”
“Contrary to the idea of the impending job apocalypse, this model identifies some channels through which AI systems could increase demand for labor. At the same time, and contrary to a standard assumption in economics that new technologies always increase labor demand through augmentation, the task-based model recognizes that the net effect of new technology on labor demand could be negative. This could, for example, happen if firms adopt “mediocre” AI systems that are productive enough to displace workers, but not productive enough to increase labor demand through the other channels.”
– https://thegradient.pub/the-economics-of-ai-today/
11
Living in a data flooded world
https://www.nature.com/articles/nature21056.epdf
12
Living in a data flooded world, continued
https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph
13
Living in a data flooded world, continued
http://theconversation.com/facial-analysis-ai-is-being-used-in-job-interviews-it-will-probably-reinforce-inequality-124790
14
https://github.com/ipsingh06/ml-desnapify
http://media.idlab.ugent.be/2019/12/05/safe-sexting-in-a-world-of-ai/
Living in a data flooded world, continued
15
Living in a data flooded (real) world
16
Data Science
17
Data science
Data contains value and knowledge
But to extract this knowledge, you need to be able to:
Store it
Manage it
Analyze it
Terms often used interchangeably:
Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Knowledge Discovery ≈ Artificial Intelligence ≈ Deep Learning
Don’t worry too much about this and don’t be too swayed by Venn diagrams or infographics
18
Data science
https://vas3k.com/blog/machine_learning/
What even is this? 19
Data science
20
We focus on analytics from a business perspective
Given (huge) amounts of data, discover patterns and models from the data:
Instead of hand-coding, let the data speak
To help predict something, explain something, decide something (and more?)
Using:
1. Data
2. An algorithm
3. A purpose
The patterns and models we find should be:
Valid
Useful
Unexpected
Understandable
21
Using data
22
Using data
Structured, unstructured?
Tabular, relational, text, imagery, audio
Non-tabular data
Making it tabular (“featurization”)
Using techniques that can directly utilize data as-is (and even then, some raw structure will be imposed)
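To make “featurization” concrete, here is a minimal sketch that turns raw text into a table of word counts (a bag-of-words representation). The example documents and the function name `featurize` are illustrative assumptions, not part of the course material; real projects would typically rely on a library for this.

```python
from collections import Counter

# Hypothetical raw, non-tabular input: three short text documents.
documents = [
    "great product great price",
    "poor service",
    "great service poor price",
]

def featurize(docs):
    """Turn raw text into a table: one row per document, one column per word."""
    # The vocabulary (the column names) is the sorted set of all words seen.
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # A Counter returns 0 for words that do not occur in this document.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

vocabulary, table = featurize(documents)
print(vocabulary)
for row in table:
    print(row)
```

Note that even this simple step imposes structure: word order is lost, which is exactly the kind of choice that featurization forces you to make.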
23
Basic terminology
A tabular data set (“structured data”):
Has instances (examples, rows, observations, customers, …)
And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables)
These features can be:
Numeric (continuous)
Categorical (discrete, factor), either nominal (binary as a special case) or ordinal
A target (label, class, dependent variable, response variable) can also be present
Numeric, categorical, …
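The terminology above can be sketched in a few lines of Python. The customer table, the column names and the helper `one_hot` are made up for illustration; the point is only to show instances, a numeric feature, a nominal feature and a binary target side by side.

```python
# Hypothetical customer table: each dict is one instance (row).
customers = [
    {"age": 34, "plan": "basic", "churned": 0},
    {"age": 51, "plan": "premium", "churned": 1},
    {"age": 27, "plan": "basic", "churned": 0},
]

TARGET = "churned"  # binary target (label)

def one_hot(value, categories):
    """Encode a nominal value as 0/1 indicator columns, one per category."""
    return [1 if value == c else 0 for c in categories]

# The distinct categories of the nominal feature "plan".
plans = sorted({row["plan"] for row in customers})

# X holds the features (numeric age + encoded plan), y holds the target.
X = [[row["age"]] + one_hot(row["plan"], plans) for row in customers]
y = [row[TARGET] for row in customers]

print(X)
print(y)
```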
27
Using algorithms
https://vas3k.com/blog/machine_learning/
28
Using algorithms
https://vas3k.com/blog/machine_learning/
29
Using algorithms
Unsupervised machine learning
No target variable necessary
Find structure, patterns in data
E.g. clustering, association rule mining, sequence rule mining
Supervised machine learning
Target variable available
Relate predictor variables to target
E.g. churn prediction, fraud detection, response modeling, credit risk modeling
(There are more types than these two) 30
Supervised learning
Regression: continuous label
Classification: categorical label
For classification:
Binary classification (positive/negative outcome)
Multiclass classification (more than two possible outcomes)
Ordinal classification (target is ordinal)
Multilabel classification (multiple outcomes are possible)
For regression:
Absolute values
Delta values
Quantile regression
Single- versus multi-output models are possible as well
(Definitions in literature and documentation can differ a bit)
Binary classification forms the majority of practical applications
31
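As a minimal sketch of binary classification, the snippet below uses a k-nearest-neighbours vote. This technique is chosen here only because it fits in a few lines; it is not necessarily the one used later in the course, and the features and data are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest training instances."""
    # Pair each training instance's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: [monthly_usage_hours, support_calls] -> churn (1) or stay (0)
train_X = [[40, 0], [35, 1], [38, 0], [5, 4], [8, 5], [3, 6]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [36, 1]))
print(knn_predict(train_X, train_y, [6, 5]))
```

The target here is categorical with two outcomes, which makes this a binary classification setup; swapping the vote for an average of numeric labels would turn the same idea into regression.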
Supervised learning
32
Unsupervised learning
Extract patterns from the data as is
Clustering: construct groups over the data set
Association/sequence/… rule mining: find rules that describe the data
Anomaly detection: find outliers in the data set
Dimensionality reduction: reduce number of features
Note that most of these are frequency / distance based 33
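To illustrate what “distance based” means for clustering, here is a small k-means sketch. The naive initialisation (the first k points become the starting centroids) and the toy points are assumptions made to keep the example deterministic; practical implementations use smarter initialisation.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means: assign each point to the nearest centroid, then recompute centroids."""
    centroids = [list(p) for p in points[:k]]  # naive deterministic initialisation
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Hypothetical two-dimensional data with two visible groups.
points = [[1, 1], [9, 9], [1, 2], [2, 1], [8, 9], [9, 8]]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

No target variable is involved: the groups emerge purely from the distances between instances.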
Unsupervised learning
34
Purpose
“So then, unsupervised learning for ‘descriptive analytics’ and supervised learning techniques for ‘predictive analytics’?”
Kind of, but… what’s the business question/problem?
35
Purpose
Exploratory analytics: plots, distributions, quick charts, basic correlations… very visual
But supervised techniques can also be used for exploratory insights
Descriptive analytics: yes, unsupervised techniques are commonly used
But it depends on the style of pattern you want to obtain; you also often already have a hypothesis in mind
Explanatory analytics: unsupervised again?
Depending on the target definition and model type used, a supervised model can also be used as an explanatory means
Predictive analytics: supervised for sure?
Though unsupervised techniques can be used as a pre-processing or featurization technique
Also consider whether your goal is really predictive
Prescriptive analytics: “what should I do?”
What-if analysis on a trained supervised model
Or using good old operations research
36
Purpose
Purpose is key!
While machine learning is a powerful tool, keep in mind that a large majority of ML/AI use cases in business are not really about ML, but about automation!
Assume I have a trained, validated model which works well.
How would the model be used?
Which features can I give it at the time of usage?
Do I want it to make predictions going forward?
What’s my end goal?
37
Purpose
https://developers.google.com/machine-learning/guides/rules-of-ml
38
Purpose
“An interesting finding is that increasing the performance of a model does not necessarily translate into a gain in [business] value. In general we found that often the best problem is not the one that comes to mind immediately and that changing the set up is a very effective way to unlock value.”
– https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/
39
Key criteria
In any case, we want the models and patterns that we find to be:
Valid: hold on new data with some certainty, i.e. generalizable
Over time, seasonal effects, overfitting, sub-groups, regional differences…
Useful: should be possible to act on the item, i.e. actionable
Business question, implementation, maintenance costs, ease-of-use…
Unexpected: non-obvious to the system, i.e. interesting
Balance between trust and discovery… Big and “weird” data
Understandable: humans should be able to interpret the pattern
Black box vs. white box, trust, validity…
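The “valid” criterion can be checked by holding out data the model never sees. The sketch below uses a deliberately overfit “memorizer” model (an assumption made for illustration, not a recommended technique) on invented data to show how training accuracy can look perfect while accuracy on new data is poor.

```python
def fit_memorizer(X, y):
    """An intentionally overfit 'model': remember every training row exactly."""
    table = {tuple(x): label for x, label in zip(X, y)}
    majority = max(set(y), key=y.count)
    # Rows never seen during training fall back to the majority class.
    return lambda x: table.get(tuple(x), majority)

def accuracy(model, X, y):
    """Fraction of instances for which the model predicts the true label."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

# Hypothetical data: [age, n_products] -> responded (1) or not (0)
X = [[25, 1], [40, 3], [31, 2], [58, 1], [45, 4], [22, 1], [37, 3], [60, 2]]
y = [0, 1, 0, 0, 1, 0, 1, 1]

# Hold out the last rows: the model never sees them during fitting.
X_train, y_train = X[:5], y[:5]
X_test, y_test = X[5:], y[5:]

model = fit_memorizer(X_train, y_train)
train_acc = accuracy(model, X_train, y_train)
test_acc = accuracy(model, X_test, y_test)
print(train_acc, test_acc)
```

The gap between the two numbers is the warning sign: a pattern that only holds on the data it was found in is not generalizable.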
40
Valid → generalizable
https://www.gwern.net/Tanks
“RL agent in Udacity self-driving car rewarded for speed learns to spin in circles”
“NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days survival, yields an optimal plan of killing 2/3 crew & keep survivor alive as long as possible”
41
Useful → actionable
We can predict who will churn, and then what?
Making a model to reverse-engineer a rule-based system?
Expecting novel cases from supervised learning?
42
Unexpected → interesting
https://www.fastcompany.com/3063110/the-rise-of-weird-data
Optimizing right turns for UPS drivers
Typing with proper capitalization indicates creditworthiness
Users of the Chrome and Firefox browsers make better employees
But: not everything which is unexpected is interesting, or valid
43
Understandable → interpretable
“Why does your model predict fraud for this case?”
“Which attributes of the customer are important?”
“If age goes up you’re more at risk?”
“I don’t understand interaction effects”
“What do you mean ‘just trust us’?”
…
44
Ethical?
Models become increasingly complex…
But they also govern many aspects of our lives, from credit scoring to employment, all the way down to predicting recidivism
People are becoming increasingly aware of what is being done with their data and are becoming more protective of their privacy and their rights to challenge a model’s conclusion
45
Ethical?
46
Ethical?
47
The Data Science Process
48
The analytics process model
49
CRISP-DM
Cross Industry Standard Process for Data Mining 50
Others
SEMMA
Sample, Explore, Modify, Model, and Assess
https://en.wikipedia.org/wiki/SEMMA
The drivetrain approach
Jeremy Howard, Margit Zwemer and Mike Loukides
https://www.oreilly.com/ideas/drivetrain-approach-data-products
51
The real analytics process model
52
Managing data science?
Waterfall? Agile? Scrum? Lean?
The “data science model factory”, “automl”? 53
Managing data science?
“Misalignment of technical and business expectations. The outcomes of data science projects are ultimately consumed by business teams. However, oftentimes a data science project starts without a clear alignment between business and data science teams, where the data science team tends to focus highly on “model accuracy” - the easiest metric to measure - while the business team emphasizes metrics like financial benefit, business insights, and model interpretability.”
“The goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With data science, you learn as you go, not before you go.”
– https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/
54
Data Scientists
55
Data scientists
“LinkedIn found top careers in data science include core data scientist, researcher, and big data specialist.”
– https://www.techrepublic.com/article/why-data-scientist-is-the-most-promising-job-of-2019/
56
Data scientists
57
https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons
Data scientists
If you want to command a multiyear, seven-figure salary, you used to have only four career options: chief
executive officer, banker, celebrity entertainer, or pro athlete. Now there’s a fifth—artificial intelligence expert.
“I suspect AI today is like big data ten years ago”
“Exactly. Also as soon as big data came around nobody was doing just data, everyone was doing big data even if they had the same 10GB MySQL database they had from previous years. AI is a bit the same. Doing any analytics? Now it’s AI. Opening an Excel spreadsheet and doing a curve fit: I am a data scientist doing AI. Doing any actual ML: not learning anymore but super deep learning.”
58
Defining the data scientist
A data scientist should have solid quantitative skills
A data scientist should be a good programmer
A data scientist should excel in communication and visualization skills
A data scientist should have a solid business understanding
A data scientist should be creative
59
Defining the data scientist
Karlijn Willems, Datacamp
60
Defining the data scientist
Udacity
61
The data science team
Database / data warehouse administrator
Business expert (e.g. marketeer, credit risk analyst, …)
Legal expert
Data scientist / data miner
Software / tool vendors
A multidisciplinary team needs to be set up! (A data scientist is not a magic unicorn)
62
Challenges
63
64
Challenges
Mapping the business question to a technique / setup (there is no one-size-fits-all)
(Not) realizing the amount of effort required in pre-processing
Low amount of training data, either instances or features
Or too many features…
Huge data imbalance, or not even labeled data
Quality of data, noise
Predicting the future is hard (who’d have thought!): hard to extrapolate towards the future for many models (machines are naïve and lazy)
Incorporating domain knowledge, explaining models
Strong validation / backtesting setup requires time and enough data
Organizational aspects, teams, management
65
Some quotes
During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage. – Gartner
Data maturity of companies is very disparate, and the most advanced of them start doubting. – Christophe Bourguignat
75% have invested in Big Data, but only 10% have projects in production. Companies face disillusions. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?
66
Data quality
GIGO principle
Garbage in, Garbage out; messy data gives messy models
In many cases, simple analytical models perform well, so biggest performance increase comes from the data!
Baesens et al., 2003; Van Gestel, Baesens et al., 2004; Holte, 1993
Importance of Master Data Management and Data quality programmes!
But modern data science requires a more nuanced view
The best way to improve the performance of an analytical model is not to look for fancy tools or techniques, but to improve DATA QUALITY first! (It’s the data, you stupid!, Data News, 2007)
67
Data quality criteria
Data accuracy
E.g., outliers
Age is 300 years versus Income is 1.000.000 Euro (not the same!)
Data completeness
Are missing values important?
Data bias and sampling
Try to minimise it, but you can never get rid of it entirely
Data definition
Variables: what is the meaning of 0?
Target: fraud, churn, default, customer lifetime value (CLV), …
Data recency/latency
Refresh frequency
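The difference between the two outliers mentioned under data accuracy (an age of 300 is impossible, an income of 1.000.000 is merely unusual) can be sketched as two separate checks. The plausibility ranges, the z-score threshold and the sample values below are assumptions chosen for illustration; real limits would come from domain experts.

```python
import statistics

# Hypothetical plausibility ranges per column.
VALID_RANGES = {"age": (0, 120), "income": (0, float("inf"))}

def invalid_values(column, values):
    """Values outside the plausible domain are errors, not just outliers."""
    low, high = VALID_RANGES[column]
    return [v for v in values if not (low <= v <= high)]

def extreme_values(values, threshold=2.0):
    """Valid-but-unusual values: more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if stdev > 0 and abs(v - mean) / stdev > threshold]

ages = [23, 45, 300, 37, 51]
incomes = [28_000, 35_000, 41_000, 1_000_000, 30_000, 33_000, 39_000]

print(invalid_values("age", ages))        # impossible values: treat as data errors
print(invalid_values("income", incomes))  # nothing impossible here
print(extreme_values(incomes))            # possible but extreme: decide case by case
```

An invalid value usually calls for correction or removal, whereas a valid extreme value may be exactly the customer you care about.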
68
The aftercare
Interpretation and validation of analytical models by business experts
Trivial versus unexpected (interesting?) patterns
Sensitivity analysis
How sensitive is the model with regards to sample characteristics, assumptions and/or technique parameters?
Deploy analytical model into business setting
Represent model output in a user-friendly way
Integrate with campaign management tools and marketing decision engines
Model monitoring and backtesting
Continuously monitor model output
Contrast model output with observed numbers
We’ll highlight this again in later sessions. Still very much the “forgotten” part of data science
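Contrasting model output with observed numbers can start very simply. The sketch below compares predicted and observed rates per period and flags gaps above a tolerance; the monthly figures, the tolerance of 0.02 and the function name `backtest` are all hypothetical.

```python
def backtest(predicted, observed, tolerance=0.02):
    """Contrast predicted rates with observed rates per period and flag large gaps."""
    report = []
    for period in sorted(predicted):
        gap = observed[period] - predicted[period]
        report.append((period, round(gap, 4), abs(gap) > tolerance))
    return report

# Hypothetical monthly default rates: model output versus what actually happened.
predicted = {"2020-01": 0.030, "2020-02": 0.031, "2020-03": 0.029}
observed = {"2020-01": 0.032, "2020-02": 0.035, "2020-03": 0.061}

for period, gap, alert in backtest(predicted, observed):
    print(period, gap, "ALERT" if alert else "ok")
```

A flagged period does not say why the model drifted, only that it is time to investigate.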
69
What about big data?
IBM
70
What about big data?
No worries, we’ll get to that…
IBM
71
Our goals
Advanced Analytics in Business [D0S07a]
Data science process
Supervised learning
Ensemble models
Advanced techniques
Unsupervised learning
Working with data science tooling
Some “special topics”
Big Data Platforms & Technologies [D0S06a]
Hadoop (MapReduce)
Spark
Other future trends
NoSQL, Neo4J
72
Our goals
You don’t need big data to do analytics … but you can
Big data doesn’t necessarily mean doing analytics … but it can
73
It’s about technology too
… though with a critical view 74
It’s about technology too
… though with a critical view
HDFS?
What about HDF5, or Kudu?
Do we even have unstructured data? Do we know what to do with it?
V’s of Big Data – yeah right!
BigSQL, or Hive, or Slurp?
Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?
What do you mean we need H2O on top of Spark on top of Hadoop?
We just installed X?
s/Hadoop/Deep Learning/
s/Big Data/AI/
So we start with the analytics first… 75
Some Examples
76
Risk analytics
More than ever before, analytical models steer strategic risk decisions of financial institutions
Minimum equity (buffer capital) and provisions a financial institution holds are directly determined, e.g. by
Credit risk analytics
Market risk analytics
Operational risk analytics
Insurance risk analytics
Business analytics is typically used to build all these models
Often subject to regulation (e.g. Basel II, Basel III, Solvency II, …)
Model errors directly affect profitability, solvency, shareholder value, macro-economy, … society as a whole
77
Risk analytics: credit scoring
Estimate probability of default at the time the applicant applies for the loan!
Also: LGD (loss given default: predict loss if the applicant defaults)
Use predetermined definition of default (e.g. 3 months of payment arrears)
Use application variables
E.g. age, income, marital status, years at address, years with employer, …
Use bureau variables
Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
In the US: FICO scores between 300 and 850
Experian, Equifax, TransUnion
E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet
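Probability of default and loss given default are commonly combined into an expected loss figure. The sketch below assumes the usual relationship expected loss = PD × LGD × EAD, where EAD (exposure at default) is a quantity not mentioned on the slide and is introduced here as an assumption; the applicant's numbers are invented.

```python
def expected_loss(pd, lgd, ead):
    """Expected loss = probability of default x loss given default x exposure at default."""
    if not (0.0 <= pd <= 1.0 and 0.0 <= lgd <= 1.0):
        raise ValueError("pd and lgd must be fractions between 0 and 1")
    return pd * lgd * ead

# Hypothetical applicant: 4% PD, 45% LGD, exposure of 10,000.
loss = expected_loss(pd=0.04, lgd=0.45, ead=10_000)
print(round(loss, 2))
```

In this setup the credit scoring model supplies the PD estimate, while LGD and EAD would come from separate models or policy assumptions.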
78
Risk analytics: credit scoring
79
Marketing analytics
Plethora of different techniques
Churn prediction (retention modeling): which customers will leave you, and why?
Response modeling: to which customers do we send out an incentive?
Segmentation modeling: grouping customers in segments
Recommender systems: which product to recommend next?
80
Marketing analytics: churn prediction
Understanding why customers leave you
Customer retention is important because long term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value
Small improvements in customer retention generate significant returns.
Very important in Telco sector (about 2% monthly churn rate)
Transaction versus relationship buyers
Transaction buyers: buy because of low price
Relationship buyers: want to build loyal relationship with firm
Contractual versus non-contractual setting
Contractual setting: customer cancels contract (e.g. postpaid Telco)
Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)
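In a non-contractual setting the churn label has to be constructed from behaviour. A minimal sketch, assuming “3 months” is approximated as 90 days and using invented customers and dates:

```python
from datetime import date, timedelta

def label_churners(last_purchase, today, inactivity_days=90):
    """Flag customers with no purchase in the last `inactivity_days` days as churned."""
    cutoff = today - timedelta(days=inactivity_days)
    return {cust: last < cutoff for cust, last in last_purchase.items()}

# Hypothetical last-purchase dates per customer.
last_purchase = {
    "c1": date(2020, 1, 20),
    "c2": date(2019, 9, 2),
    "c3": date(2019, 12, 15),
}
labels = label_churners(last_purchase, today=date(2020, 2, 1))
print(labels)
```

The inactivity window is a business decision: changing it changes who counts as a churner, and therefore what the model learns.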
Types of churn:
Active: customer stops relationship with firm
Passive: customer decreases intensity of relationship
Forced: company stops relationship because of e.g. fraud
Expected: customer no longer needs product/service (e.g. baby products)
81
Marketing analytics: churn prediction
Three challenges…
1. Need to distinguish between a predictor of future churn and a symptom of churn that is already occurring
E.g. sudden peak in usage often occurs right before churn because customer has already decided to churn
Focus on early-warning predictors
2. Real-life churn data sets have very skewed class distribution (e.g. about 1-5% churners)
Logistic regression and decision tree models cannot be appropriately estimated
Use oversampling on the train (not test!) data
3. How to make it actionable?
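The second challenge (oversample the training data, never the test data) can be sketched as follows. Random duplication of minority rows is only one possible oversampling scheme, and the data, the split point and the function name `oversample_minority` are illustrative assumptions; the sketch also assumes a binary target.

```python
import random

def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are equally frequent."""
    rng = random.Random(seed)
    minority = min(set(y), key=y.count)
    minority_rows = [x for x, label in zip(X, y) if label == minority]
    deficit = len(y) - 2 * len(minority_rows)  # majority count minus minority count
    extra = [rng.choice(minority_rows) for _ in range(deficit)]
    return X + extra, y + [minority] * deficit

# Hypothetical skewed data: 1 = churner, 0 = non-churner.
X = [[i] for i in range(20)]
y = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Split first, then oversample the training part only;
# the test part keeps its real class distribution.
X_train, y_train = X[:14], y[:14]
X_test, y_test = X[14:], y[14:]
X_bal, y_bal = oversample_minority(X_train, y_train)

print(y_train.count(1), y_train.count(0))  # before oversampling
print(y_bal.count(1), y_bal.count(0))      # after oversampling
print(y_test.count(1), y_test.count(0))    # test set untouched
```

Oversampling before splitting would leak copies of the same churner into both train and test sets and make the evaluation look better than it is.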
82
Marketing analytics: response modeling
Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns
Campaign can be mail catalogue, email, coupon, A/B or multivariate testing, …
Identify the customers most likely to respond based on the following information:
Demographic variables (age, gender, marital status, …)
Relationship variables (length of relationship, number of products purchased, …)
RFM variables
(Social) network information (cf. infra)
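RFM stands for recency, frequency and monetary value. A minimal sketch of how these variables could be derived from a transaction log, with an invented log and a hypothetical function name `rfm`:

```python
from datetime import date

def rfm(transactions, today):
    """Compute recency (days since last purchase), frequency (count) and monetary (total spend)."""
    result = {}
    for customer, when, amount in transactions:
        last, freq, total = result.get(customer, (None, 0, 0.0))
        # Keep the most recent purchase date seen so far.
        last = when if last is None or when > last else last
        result[customer] = (last, freq + 1, total + amount)
    return {
        c: {"recency": (today - last).days, "frequency": f, "monetary": m}
        for c, (last, f, m) in result.items()
    }

# Hypothetical transaction log: (customer, purchase date, amount)
transactions = [
    ("anna", date(2020, 1, 5), 40.0),
    ("anna", date(2020, 1, 25), 60.0),
    ("ben", date(2019, 11, 1), 15.0),
]
scores = rfm(transactions, today=date(2020, 2, 1))
print(scores)
```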
The key issue:
There are people who respond positively even if they don’t get a letter in the mail
There are people who respond negatively even if they get a letter in the mail
There are people who respond negatively if they get a letter in the mail, but would have responded positively if we left them alone
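One way to make the key issue measurable is to compare response rates between customers who received the mailing and a control group who did not. This difference is often called uplift; the term, the segments and the numbers below are assumptions added for illustration and are not taken from the slides.

```python
def response_rate(outcomes):
    """Share of customers who responded (1 = responded, 0 = did not)."""
    return sum(outcomes) / len(outcomes)

def uplift(treated, control):
    """Difference in response rate between mailed (treated) and non-mailed (control) customers."""
    return response_rate(treated) - response_rate(control)

# Hypothetical campaign results per segment.
segments = {
    "A": {"treated": [1, 1, 0, 1], "control": [1, 1, 0, 1]},  # respond anyway
    "B": {"treated": [1, 1, 1, 0], "control": [0, 0, 1, 0]},  # mailing helps
    "C": {"treated": [0, 0, 1, 0], "control": [1, 1, 0, 1]},  # mailing backfires
}
uplifts = {name: uplift(s["treated"], s["control"]) for name, s in segments.items()}
print(uplifts)
```

A plain response model would rank segment A as highly as segment B, even though only in B does the mailing change behaviour.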
83
Marketing analytics: response modeling
Gunnarsson, B.R., vanden Broucke, S., De Weerdt, J. (2019). Optimizing Marketing Campaign Targeting Using Uncertainty-Based Predictive Modelling. In: 2019 IEEE International Conference on Data Mining Workshops (ICDMW). Presented at the 19th IEEE International Conference on Data Mining (ICDM), Beijing, China, 08 Nov 2019-11 Nov 2019.
84
Real estate analytics
https://www.rockestate.be/blog/2017/10/26/point-cloud-processing.html
85
Fraud analytics
Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms
Financial fraud
Tax fraud
Insurance fraud
Employee fraud
…
86
Fraud analytics
Credit card transaction fraud:
Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
Store itself also processes legitimate transactions
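The card-store network can be represented as a list of edges. As a rough sketch, the snippet below computes for each store the share of distinct cards that are known to be stolen; the edges, card names and the scoring function are hypothetical, and real fraud analytics would use richer network features.

```python
from collections import defaultdict

def store_fraud_share(transactions, stolen_cards):
    """For each store, the share of distinct cards seen there that are known to be stolen."""
    cards_per_store = defaultdict(set)
    for card, store in transactions:
        cards_per_store[store].add(card)
    return {
        store: len(cards & stolen_cards) / len(cards)
        for store, cards in cards_per_store.items()
    }

# Hypothetical card-store edges of the bipartite transaction network.
transactions = [
    ("card1", "storeA"), ("card2", "storeA"), ("card3", "storeA"),
    ("card3", "storeB"), ("card4", "storeB"), ("card5", "storeB"), ("card6", "storeB"),
]
stolen_cards = {"card1", "card2"}
shares = store_fraud_share(transactions, stolen_cards)
print(shares)
```

A high share does not make the store itself fraudulent, since it also processes legitimate transactions, but it can serve as a feature for scoring other cards used there.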
87
Fraud analytics
Identity theft:
Before: person calls his/her frequent contacts
After: person also calls new contacts which coincidentally overlap with another person’s contacts
88
Fraud analytics
Social security fraud:
Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities
89
Fraud analytics
Insurance fraud:
Exaggeration of damages
“Crash for cash”
90
Fraud analytics
Stripling E., Baesens B., Chizi B., Vanden Broucke S. (2018). Isolation-based conditional anomaly detection on mixed-attribute data to uncover workers’ compensation fraud. Decision Support Systems, 111, 13-26.
91
HR analytics
Employee churn
Employee performance
Employee absence
Employee satisfaction
Employee lifetime value
…