Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Setting the Scene The Data Science Process Supervised and Unsupervised Learning Introduction

slide-2
SLIDE 2

Overview

Setting the scene Data science The analytics process model Data scientists Example applications

2

slide-3
SLIDE 3

Setting the Scene

3

slide-4
SLIDE 4

Living in a data flooded world

https://deepmind.com/blog/alphago-zero-learning-scratch/

… 2015

slide-5
SLIDE 5

Living in a data flooded world

https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/

… 2017

slide-6
SLIDE 6

Living in a data flooded world

https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

… 2019

slide-7
SLIDE 7

Living in a data flooded world

https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery

… 2020

slide-8
SLIDE 8

https://qz.com/1791222/how-artificial-intelligence-provided-early-warning-of-wuhan-virus/
https://www.vox.com/2020/1/31/21117102/artificial-intelligence-drug-discovery-exscientia

Living in a data flooded world

https://www.vox.com/recode/2020/1/28/21110902/artificial-intelligence-ai-coronavirus-wuhan

8

slide-9
SLIDE 9

Living in a data flooded world

https://www.buzzfeednews.com/article/ryanmac/clearview-ai-cops-run-wild-facial-recognition-lawsuits https://www.quantamagazine.org/artificial-intelligence-will-do-what-we-ask-thats-a-problem-20200130/

9

slide-10
SLIDE 10

Living in a data flooded world

https://www.latimes.com/business/story/2020-01-21/ralphs-privacy-disclosure

10

slide-11
SLIDE 11

Living in a data flooded world

The Economics of AI Today

Every day we hear claims that Artificial Intelligence (AI) systems are about to transform the economy, creating mass unemployment and vast monopolies. But what do professional economists think about this? Contrary to the idea of the impending job apocalypse, this model identifies some channels through which AI systems could increase demand for labor. At the same time, and contrary to a standard assumption in economics that new technologies always increase labor demand through augmentation, the task-based model recognizes that the net effect of new technology on labor demand could be negative. This could, for example, happen if firms adopt “mediocre” AI systems that are productive enough to displace workers, but not productive enough to increase labor demand through the other channels.

– https://thegradient.pub/the-economics-of-ai-today/

slide-12
SLIDE 12

Living in a data flooded world

https://www.nature.com/articles/nature21056.epdf

12

slide-13
SLIDE 13

Living in a data flooded world, continued

https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph

13

slide-14
SLIDE 14

Living in a data flooded world, continued

http://theconversation.com/facial-analysis-ai-is-being-used-in-job-interviews-it-will-probably-reinforce-inequality-124790

14

slide-15
SLIDE 15

https://github.com/ipsingh06/ml-desnapify
http://media.idlab.ugent.be/2019/12/05/safe-sexting-in-a-world-of-ai/

Living in a data flooded world, continued

15

slide-16
SLIDE 16

Living in a data flooded (real) world

16

slide-17
SLIDE 17

Data Science

17

slide-18
SLIDE 18

Data science

Data contains value and knowledge But to extract this knowledge, you need to be able to:

Store it Manage it Analyze it

Terms often used interchangeably:

Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Knowledge Discovery ≈ Artificial Intelligence ≈ Deep Learning

Don’t worry too much about this and don’t be too swayed by Venn diagrams or infographics

18

slide-19
SLIDE 19

Data science

https://vas3k.com/blog/machine_learning/

What even is this?

slide-20
SLIDE 20

Data science

20

slide-21
SLIDE 21

We focus on analytics from a business perspective

Given lots (possibly huge amounts) of data, discover patterns and models from data:

  • Instead of hand-coding, let the data speak
  • To help predict something, explain something, decide something (and more?)

Using:

  • 1. Data
  • 2. An algorithm
  • 3. A purpose

Which are:

  • Valid
  • Useful
  • Unexpected
  • Understandable

21

slide-22
SLIDE 22

Using data

22

slide-23
SLIDE 23

Using data

Structured, unstructured?

Tabular, relational, text, imagery, audio

Non-tabular data

Making it tabular (“featurization”) Using techniques that can directly utilize data as-is (and even then, some raw structure will be imposed)
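As an illustration of “making it tabular”, here is a minimal sketch of one common featurization, a bag-of-words count table for text. The function name `bag_of_words` and the three example documents are made up for this sketch; real pipelines typically add proper tokenization, stop-word handling and weighting.

```python
from collections import Counter


def bag_of_words(documents):
    """Turn raw text documents into a table of word counts (one row per document)."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary becomes the column headers of the resulting table.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical example documents
docs = ["late payment fee", "payment received", "late late fee"]
columns, table = bag_of_words(docs)
```

Each document is now an instance and each word a numeric feature, so any technique that expects tabular input can be applied. Word order is lost, which is exactly the kind of structure that featurization throws away.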

23

slide-24
SLIDE 24

Basic terminology

A tabular data set (“structured data”):

slide-25
SLIDE 25

Basic terminology

A tabular data set (“structured data”):

slide-26
SLIDE 26

Basic terminology

A tabular data set (“structured data”):

slide-27
SLIDE 27

Basic terminology

A tabular data set (“structured data”):

  • Has instances (examples, rows, observations, customers, …)
  • And features (attributes, fields, variables, predictors, covariates, explanatory variables, regressors, independent variables)

These features can be:

  • Numeric (continuous)
  • Categorical (discrete, factor), either nominal (binary as a special case) or ordinal

Target (label, class, dependent variable, response variable) can also be present

Numeric, categorical, …
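The terminology can be made concrete with a small sketch. The toy customers, the column names (`age`, `segment`, `churned`) and the helper `one_hot` are all hypothetical; the point is only to show instances, a numeric feature, a nominal categorical feature and a binary target side by side.

```python
# Hypothetical toy data set: each dict is one instance (row).
customers = [
    {"age": 34, "segment": "retail", "churned": 0},
    {"age": 51, "segment": "business", "churned": 1},
    {"age": 27, "segment": "retail", "churned": 0},
]
TARGET = "churned"  # binary target (label)


def one_hot(rows, feature):
    """Encode a nominal categorical feature as one 0/1 column per level."""
    levels = sorted({row[feature] for row in rows})
    encoded = [[1 if row[feature] == level else 0 for level in levels] for row in rows]
    return encoded, levels


segment_columns, segment_levels = one_hot(customers, "segment")
# Feature matrix: the numeric feature as-is, the categorical feature one-hot encoded.
X = [[row["age"]] + encoded for row, encoded in zip(customers, segment_columns)]
y = [row[TARGET] for row in customers]
```

One-hot encoding suits nominal features; an ordinal feature would usually be mapped to ordered integers instead so that the ordering is preserved.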

27

slide-28
SLIDE 28

Using algorithms

https://vas3k.com/blog/machine_learning/

28

slide-29
SLIDE 29

Using algorithms

https://vas3k.com/blog/machine_learning/

29

slide-30
SLIDE 30

Using algorithms

Unsupervised machine learning

  • No target variable necessary
  • Find structure, patterns in data
  • E.g. clustering, association rule mining, sequence rule mining

Supervised machine learning

  • Target variable available
  • Relate predictor variables to target
  • E.g. churn prediction, fraud detection, response modeling, credit risk modeling

(There are more types than these two)

slide-31
SLIDE 31

Supervised learning

Regression: continuous label

Classification: categorical label

For classification:

  • Binary classification (positive/negative outcome)
  • Multiclass classification (more than two possible outcomes)
  • Ordinal classification (target is ordinal)
  • Multilabel classification (multiple outcomes are possible)

For regression:

  • Absolute values
  • Delta values
  • Quantile regression

Single versus multi-output models are possible as well

(Definitions in literature and documentation can differ a bit)

Binary classification forms the majority of applied settings
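To show that the difference between classification and regression lies in the target rather than in the features, the sketch below uses the same 1-nearest-neighbor rule for both. The technique, the data and the names are illustrative assumptions, not something prescribed by the slides.

```python
def nearest_neighbor(train_X, train_y, x):
    """Return the target of the training instance closest to x (1-NN, squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[distances.index(min(distances))]


# Hypothetical training data: two numeric features per customer.
train_X = [[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]]
class_labels = ["no churn", "no churn", "churn", "churn"]  # categorical target: classification
spend_labels = [120.0, 130.0, 40.0, 35.0]                  # continuous target: regression

new_customer = [4.0, 4.1]
predicted_class = nearest_neighbor(train_X, class_labels, new_customer)
predicted_spend = nearest_neighbor(train_X, spend_labels, new_customer)
```

The features and the algorithm are identical in both calls; only the type of label changes what kind of prediction comes out.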

slide-32
SLIDE 32

Supervised learning

32

slide-33
SLIDE 33

Unsupervised learning

Extract patterns from the data as is

  • Clustering: construct groups over the data set
  • Association/sequence/… rule mining: find rules that describe the data
  • Anomaly detection: find outliers in the data set
  • Dimensionality reduction: reduce number of features

Note that most of these are frequency / distance based
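A minimal sketch of what “distance based” can mean for anomaly detection: score every instance by the distance to its nearest neighbor and flag the largest score. The transaction amounts and the scoring rule are assumptions chosen for brevity; practical detectors are more robust than this.

```python
def nearest_neighbor_distance(points, index):
    """Distance from points[index] to its closest other point (1-D for simplicity)."""
    x = points[index]
    return min(abs(x - other) for j, other in enumerate(points) if j != index)


# Hypothetical transaction amounts; no target variable is needed.
amounts = [10.0, 12.0, 11.0, 13.0, 95.0]
scores = [nearest_neighbor_distance(amounts, i) for i in range(len(amounts))]
most_anomalous = amounts[scores.index(max(scores))]
```

Note that no label was used anywhere: the structure of the data alone decides which instance looks unusual.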

slide-34
SLIDE 34

Unsupervised learning

34

slide-35
SLIDE 35

Purpose

So then, unsupervised learning for “descriptive analytics” and supervised learning techniques for “predictive analytics”?

Kind of, but… what’s the business question/problem?

slide-36
SLIDE 36

Purpose

Exploratory analytics: plots, distributions, quick charts, basic correlations… very visual

But supervised techniques can also be used for exploratory insights

Descriptive analytics: yes, unsupervised techniques are commonly used

But depends on the pattern-style you want to obtain, also you often already have a hypothesis in mind

Explanatory analytics: unsupervised again?

Depending on the target definition and model type used, a supervised model can also be used as an explanatory means

Predictive analytics: supervised for sure?

Though unsupervised techniques can be used as a pre-processing or featurization technique

Also consider whether your goal is really predictive

Prescriptive analytics: “what should I do?”

What-if analysis on a trained supervised model

Or using good old operations research

36

slide-37
SLIDE 37

Purpose

Purpose is key!

While machine learning is a powerful tool, keep in mind that a large majority of ML/AI use cases in business are not really about ML, but about automation!

Assume I have a trained, validated model which works well. How would the model be used? Which features can I give it at the time of usage? Do I want it to make predictions going forward? What’s my end goal?

slide-38
SLIDE 38

Purpose

https://developers.google.com/machine-learning/guides/rules-of-ml

38

slide-39
SLIDE 39

Purpose

“An interesting finding is that increasing the performance of a model does not necessarily translate into a gain in [business] value. In general we found that often the best problem is not the one that comes to mind immediately and that changing the set up is a very effective way to unlock value.”

– https://blog.acolyer.org/2019/10/07/150-successful-machine-learning-models/

slide-40
SLIDE 40

Key criteria

In any case, we want the models and patterns that we find to be:

  • Valid: hold on new data with some certainty, i.e. generalizable. Over time, seasonal effects, overfitting, sub-groups, regional differences…
  • Useful: should be possible to act on the item, i.e. actionable. Business question, implementation, maintenance costs, ease-of-use…
  • Unexpected: non-obvious to the system, i.e. interesting. Balance between trust and discovery… Big and “weird” data
  • Understandable: humans should be able to interpret the pattern. Black box vs. white box, trust, validity…
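The “valid” criterion can be checked by holding out data the model has never seen, ideally from a later period. The sketch below fits a hypothetical usage threshold on months 1 and 2 and evaluates it on month 3; the rows, field names and the threshold rule are all made up for illustration.

```python
def accuracy(threshold, rows):
    """Share of rows where 'usage <= threshold' agrees with the churn label."""
    return sum((r["usage"] <= threshold) == bool(r["label"]) for r in rows) / len(rows)


def fit_threshold(rows):
    """Pick the usage threshold with the best accuracy on the given rows."""
    candidates = sorted({r["usage"] for r in rows})
    return max(candidates, key=lambda t: accuracy(t, rows))


# Hypothetical data: low usage signalled churn in months 1-2, but no longer in month 3.
rows = [
    {"month": 1, "usage": 5, "label": 1},
    {"month": 1, "usage": 20, "label": 0},
    {"month": 2, "usage": 7, "label": 1},
    {"month": 2, "usage": 25, "label": 0},
    {"month": 3, "usage": 6, "label": 0},
    {"month": 3, "usage": 30, "label": 0},
]
train = [r for r in rows if r["month"] < 3]   # fit on the past
test = [r for r in rows if r["month"] >= 3]   # evaluate on a later period

threshold = fit_threshold(train)
train_accuracy = accuracy(threshold, train)
test_accuracy = accuracy(threshold, test)
```

The rule looks perfect on the data it was fitted on and noticeably worse on the later month, which is the gap between fitting and generalizing that the “over time” remark points at.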

40

slide-41
SLIDE 41

Valid → generalizable

https://www.gwern.net/Tanks

RL agent in Udacity self-driving car rewarded for speed learns to spin in circles


NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days survival, yields an optimal plan of killing 2/3 crew & keep survivor alive as long as possible


slide-42
SLIDE 42

Useful → actionable

We can predict who will churn, and then what? Making a model to reverse-engineer a rule-based system? Expecting novel cases from supervised learning?

42

slide-43
SLIDE 43

Unexpected → interesting

https://www.fastcompany.com/3063110/the-rise-of-weird-data

Optimizing right turns for UPS drivers Typing with proper capitalization indicates creditworthiness Users of the Chrome and Firefox browsers make better employees But: not everything which is unexpected is interesting, or valid

43

slide-44
SLIDE 44

Understandable → interpretable

“Why does your model predict fraud for this case?” “Which attributes of the customer are important?” “If age goes up you’re more at risk?” “I don’t understand interaction effects” “What do you mean ‘just trust us’?” …

44

slide-45
SLIDE 45

Ethical?

Models become increasingly complex…

But they also rule many aspects of our lives, from credit scoring to employment, all the way down to predicting recidivism

People are becoming increasingly aware of what is being done with their data and are becoming more protective of their privacy and their rights to challenge a model’s conclusion

45

slide-46
SLIDE 46

Ethical?

46

slide-47
SLIDE 47

Ethical?

47

slide-48
SLIDE 48

The Data Science Process

48

slide-49
SLIDE 49

The analytics process model

49

slide-50
SLIDE 50

CRISP-DM

Cross Industry Standard Process for Data Mining

slide-51
SLIDE 51

Others

SEMMA

Sample, Explore, Modify, Model, and Assess https://en.wikipedia.org/wiki/SEMMA

The drivetrain approach

Jeremy Howard, Margit Zwemer and Mike Loukides https://www.oreilly.com/ideas/drivetrain-approach-data-products


51

slide-52
SLIDE 52

The real analytics process model

52

slide-53
SLIDE 53

Managing data science?

Waterfall? Agile? Scrum? Lean?

The “data science model factory”, “automl”?

slide-54
SLIDE 54

Managing data science?

Misalignment of technical and business expectations. The outcomes of data science projects are ultimately consumed by business teams. However, oftentimes a data science project starts without a clear alignment between business and data science teams, where the data science team tends to focus highly on “model accuracy” - the easiest metric to measure - while the business team emphasizes metrics like financial benefit, business insights, and model interpretability.

“The goal of data science is not to execute. Rather, the goal is to learn and develop profound new business capabilities. Algorithmic products and services can’t be designed up-front. They need to be learned. There are no blueprints to follow; these are novel capabilities with inherent uncertainty. Coefficients, models, model types, hyper parameters, all the elements you’ll need must be learned through experimentation, trial and error, and iteration. With data science, you learn as you go, not before you go.”

– https://multithreaded.stitchfix.com/blog/2019/03/11/FullStackDS-Generalists/

slide-55
SLIDE 55

Data Scientists

55

slide-56
SLIDE 56

Data scientists

https://www.techrepublic.com/article/why-data-scientist-is-the-most-promising-job-of-2019/

“LinkedIn found top careers in data science include core data scientist, researcher, and big data specialist.”

slide-57
SLIDE 57

Data scientists

57

slide-58
SLIDE 58

https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons

Data scientists

“If you want to command a multiyear, seven-figure salary, you used to have only four career options: chief executive officer, banker, celebrity entertainer, or pro athlete. Now there’s a fifth: artificial intelligence expert.”

“I suspect AI today is like big data ten years ago”

“Exactly. Also as soon as big data came around nobody was doing just data, everyone was doing big data even if they had the same 10GB MySQL database they had from previous years. AI is a bit the same. Doing any analytics? - Now it’s AI. Opening an excel spreadsheet and doing a curve fit: I am a data scientist doing AI. Doing any actual ML: not learning anymore but super deep learning.”

slide-59
SLIDE 59

Defining the data scientist

  • A data scientist should have solid quantitative skills
  • A data scientist should be a good programmer
  • A data scientist should excel in communication and visualization skills
  • A data scientist should have a solid business understanding
  • A data scientist should be creative

59

slide-60
SLIDE 60

Defining the data scientist

Karlijn Willems, Datacamp

60

slide-61
SLIDE 61

Defining the data scientist

Udacity

61

slide-62
SLIDE 62

The data science team

  • Database / data warehouse administrator
  • Business expert (e.g. marketeer, credit risk analyst, …)
  • Legal expert
  • Data scientist / data miner
  • Software / tool vendors

A multidisciplinary team needs to be set up! (A data scientist is not a magic unicorn)

slide-63
SLIDE 63

Challenges

63

slide-64
SLIDE 64

64

slide-65
SLIDE 65

Challenges

  • Mapping the business question to a technique / setup (there is no one-size-fits-all)
  • (Not) realizing the amount of effort required in pre-processing
  • Low amount of training data, either instances or features
  • Or too many features…
  • Huge data imbalance, or not even labeled data
  • Quality of data, noise
  • Predicting the future is hard (who’d have thought!) – hard to extrapolate towards the future for many models (machines are naïve and lazy)
  • Incorporating domain knowledge, explaining models
  • Strong validation / backtesting setup requires time and enough data
  • Organizational aspects, teams, management

65

slide-66
SLIDE 66

Some quotes

During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage. – Gartner

Data maturity of companies is very disparate, and the most advanced of them start doubting. – Christophe Bourguignat

75% have invested in Big Data, but only 10% have projects in production. Companies face disillusions. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?

slide-67
SLIDE 67

Data quality

GIGO principle

Garbage in, Garbage out; messy data gives messy models

In many cases, simple analytical models perform well, so biggest performance increase comes from the data!

Baesens et al., 2003; Van Gestel, Baesens et al., 2004; Holte, 1993

Importance of Master Data Management and Data quality programmes!

But modern data science requires a more nuanced view

The best way to improve the performance of an analytical model is not to look for fancy tools or techniques, but to improve DATA QUALITY first! (It’s the data, you stupid!, Data News, 2007)


slide-68
SLIDE 68

Data quality criteria

Data accuracy

E.g., outliers: age is 300 years versus income is 1.000.000 Euro (not the same: the former is impossible, the latter merely unusual)

Data completeness

Are missing values important?

Data bias and sampling

Try to minimise, but can never totally get rid of

Data definition

Variables: what is the meaning of 0? Target: fraud, churn, default, customer lifetime value (CLV), ….

Data recency/latency

Refresh frequency
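Two of these criteria, accuracy and completeness, lend themselves to simple automated checks. The sketch below is illustrative only: the records, the helper names and the plausible age range of 0 to 120 are assumptions, and a real check would take its ranges from the data definition.

```python
# Hypothetical customer records
records = [
    {"age": 42, "income": 38000.0},
    {"age": 300, "income": 52000.0},    # impossible value: a data error
    {"age": 55, "income": 1000000.0},   # extreme but possible: a valid outlier
    {"age": None, "income": 41000.0},   # missing value
]


def completeness(rows, field):
    """Share of rows in which the field is filled in."""
    return sum(r[field] is not None for r in rows) / len(rows)


def implausible(rows, field, low, high):
    """Rows whose (non-missing) value falls outside an assumed plausible range."""
    return [r for r in rows if r[field] is not None and not (low <= r[field] <= high)]


age_completeness = completeness(records, "age")
income_completeness = completeness(records, "income")
bad_ages = implausible(records, "age", 0, 120)  # 0-120 is an assumed range
```

The high income is deliberately not flagged: a range check catches errors, while deciding what to do with valid extreme values is a modeling choice.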

68

slide-69
SLIDE 69

The aftercare

Interpretation and validation of analytical models by business experts

Trivial versus unexpected (interesting?) patterns

Sensitivity analysis

How sensitive is the model with regards to sample characteristics, assumptions and/or technique parameters?

Deploy analytical model into business setting

Represent model output in a user-friendly way Integrate with campaign management tools and marketing decision engines

Model monitoring and backtesting

Continuously monitor model output Contrast model output with observed numbers
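“Contrast model output with observed numbers” can be as simple as comparing the average predicted probability with the event rate that was actually observed. The function `backtest`, its 5 percentage point tolerance and the example numbers are assumptions for this sketch, not a standard.

```python
def backtest(predicted_probabilities, observed_outcomes, tolerance=0.05):
    """Compare the expected event rate with the observed one and flag large gaps."""
    expected_rate = sum(predicted_probabilities) / len(predicted_probabilities)
    observed_rate = sum(observed_outcomes) / len(observed_outcomes)
    return {
        "expected": expected_rate,
        "observed": observed_rate,
        "alert": abs(observed_rate - expected_rate) > tolerance,
    }


# Hypothetical monitoring window: the model expected about 15% events, 50% occurred.
report = backtest([0.1, 0.2, 0.1, 0.2], [0, 1, 0, 1])
```

Running such a check on every new batch of outcomes turns monitoring into a routine rather than an afterthought; formal backtesting would typically add a statistical test instead of a fixed tolerance.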

We’ll highlight this again in later sessions. Still very much the “forgotten” part of data science

69

slide-70
SLIDE 70

What about big data?

IBM

70

slide-71
SLIDE 71

What about big data?

No worries, we’ll get to that…

IBM

71

slide-72
SLIDE 72

Our goals

Advanced Analytics in Business [D0S07a]

Data science process Supervised learning Ensemble models Advanced techniques Unsupervised learning Working with data science tooling Some “special topics”

Big Data Platforms & Technologies [D0S06a]

Hadoop (MapReduce) Spark Other future trends NoSQL, Neo4J

72

slide-73
SLIDE 73

Our goals

You don’t need big data to do analytics … but you can Big data doesn’t necessarily mean doing analytics … but it can


slide-74
SLIDE 74

It’s about technology too

… though with a critical view

slide-75
SLIDE 75

It’s about technology too

… though with a critical view

HDFS?

What about HDF5, or Kudu? Do we even have unstructured data? Do we know what to do with it? V’s of Big Data – yeah right!

BigSQL, or Hive, or Slurp?

Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?

What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X? s/Hadoop/Deep Learning/ s/Big Data/AI/

So we start with the analytics first…

slide-76
SLIDE 76

Some Examples

76

slide-77
SLIDE 77

Risk analytics

More than ever before, analytical models steer strategic risk decisions of financial institutions

Minimum equity (buffer capital) and provisions a financial institution holds are directly determined by, e.g.:

  • Credit risk analytics
  • Market risk analytics
  • Operational risk analytics
  • Insurance risk analytics

Business analytics is typically used to build all these models

Often subject to regulation (e.g. Basel II, Basel III, Solvency II, …)

Model errors directly affect profitability, solvency, shareholder value, the macro-economy, … society as a whole

77

slide-78
SLIDE 78

Risk analytics: credit scoring

  • Estimate probability of default (PD) at the time the applicant applies for the loan!
  • Also: LGD (loss given default: predict loss if the applicant defaults)
  • Use predetermined definition of default (e.g. 3 months of payment arrears)
  • Use application variables

E.g. age, income, marital status, years at address, years with employer,…

Use bureau variables

  • Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
  • In the US: FICO scores range from 300 to 850; bureaus include Experian, Equifax and TransUnion
  • Elsewhere, e.g. Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet
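A rough sketch of how the probability of default and LGD are commonly combined. The logistic form, the weights, the intercept and the applicant are all invented for illustration, and the exposure amount is an extra quantity that does not appear on the slide; expected loss is taken here as PD × LGD × exposure.

```python
import math


def probability_of_default(features, weights, intercept):
    """Logistic model: map a weighted sum of application/bureau variables to a PD in (0, 1)."""
    score = intercept + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def expected_loss(pd, lgd, exposure):
    """Expected loss = PD x LGD x exposure (exposure is an assumption added for this sketch)."""
    return pd * lgd * exposure


# Hypothetical coefficients and applicant; not taken from any real scorecard.
weights = {"age": -0.02, "years_at_address": -0.1, "credit_checks": 0.3}
applicant = {"age": 30, "years_at_address": 2, "credit_checks": 4}

pd_estimate = probability_of_default(applicant, weights, intercept=-1.0)
loss = expected_loss(pd_estimate, lgd=0.4, exposure=10000.0)
```

In practice the weights would be estimated from historical applications labeled with the predetermined default definition, rather than chosen by hand.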

78

slide-79
SLIDE 79

Risk analytics: credit scoring

79

slide-80
SLIDE 80

Marketing analytics

Plethora of different techniques

Churn prediction (retention modeling): which customers will leave you, and why? Response modeling: to which customers do we send out an incentive? Segmentation modeling: grouping customers in segments Recommender systems: which product to recommend next?

80

slide-81
SLIDE 81

Marketing analytics: churn prediction

Understanding why customers leave you Customer retention is important because long term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value

Small improvements in customer retention generate significant returns. Very important in Telco sector (about 2% monthly churn rate)

Transaction versus relationship buyers

Transaction buyers: buy because of low price Relationship buyers: want to build loyal relationship with firm

Contractual versus non-contractual setting

Contractual setting: customer cancels contract (e.g. postpaid Telco)

Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)

Types of churn:

Active: customer stops relationship with firm Passive: customer decreases intensity of relationship Forced: company stops relationship because of e.g. fraud Expected: customer no longer needs product/service (e.g. baby products)
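The contractual versus non-contractual distinction matters most when the churn target is constructed. A minimal sketch, assuming a 90-day window as a stand-in for the “previous 3 months” and using made-up dates and function names:

```python
from datetime import date, timedelta


def is_churned_contractual(contract_cancelled):
    """Contractual setting: churn is an explicit event (the cancellation)."""
    return contract_cancelled


def is_churned_non_contractual(last_purchase, today, window_days=90):
    """Non-contractual setting: churn is inferred from inactivity (90 days is an assumption)."""
    return (today - last_purchase) > timedelta(days=window_days)


today = date(2020, 6, 30)
recent_buyer = is_churned_non_contractual(date(2020, 5, 15), today)
lapsed_buyer = is_churned_non_contractual(date(2020, 1, 10), today)
```

The window length is a business decision: too short and seasonal buyers are labeled as churners, too long and the label arrives too late to act on.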

81

slide-82
SLIDE 82

Marketing analytics: churn prediction

Three challenges…

  • 1. Need to make a distinction between a predictor that is characteristic of future churn and a symptom of churn that is already occurring

E.g. a sudden peak in usage often occurs right before churn because the customer has already decided to churn

Focus on early-warning predictors

  • 2. Real-life churn data sets have a very skewed class distribution (e.g. about 1-5% churners)

Logistic regression and decision tree models cannot be appropriately estimated

Use oversampling on the train (not test!) data

  • 3. How to make it actionable?
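For the second challenge, the sketch below shows random oversampling applied to the training split only, so that the test split keeps the real-world class distribution. The data set (5% churners), the split and the helper `oversample_minority` are hypothetical; this is plain duplication of minority rows, one of several possible resampling schemes.

```python
import random


def oversample_minority(rows, label_key="churn", seed=0):
    """Duplicate randomly chosen minority-class rows until both classes are equally large."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    if not minority:
        return list(rows)  # nothing to duplicate
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra


# Hypothetical data set with 5% churners (every 20th customer).
data = [{"id": i, "churn": 1 if i % 20 == 0 else 0} for i in range(100)]
train, test = data[:80], data[80:]

balanced_train = oversample_minority(train)  # resample the training data only
# The test set is left untouched so that evaluation reflects the true churn rate.
```

Oversampling before splitting would leak copies of the same customer into both sets and inflate the measured performance, which is why the order of the two steps matters.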

82

slide-83
SLIDE 83

Marketing analytics: response modeling

Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns Campaign can be mail catalogue, email, coupon, A/B or multivariate testing,… Identify the customers most likely to respond based on the following information:

Demographic variables (age, gender, marital status,…) Relationship variables (length of relationship, number of products purchased,…) RFM variables (Social) network information (cf. infra)
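RFM stands for recency, frequency and monetary value. A minimal sketch of how the three variables could be derived from one customer's purchase history, with made-up transactions and a hypothetical helper `rfm`:

```python
from datetime import date


def rfm(transactions, today):
    """transactions: list of (purchase_date, amount) tuples for one customer."""
    recency = (today - max(d for d, _ in transactions)).days  # days since last purchase
    frequency = len(transactions)                             # number of purchases
    monetary = sum(amount for _, amount in transactions)      # total amount spent
    return recency, frequency, monetary


# Hypothetical purchase history
history = [
    (date(2020, 1, 5), 40.0),
    (date(2020, 2, 20), 25.0),
    (date(2020, 3, 1), 60.0),
]
summary = rfm(history, today=date(2020, 3, 31))
```

Definitions vary in practice (for instance average instead of total spend, or counts within a fixed window), so the exact operationalization should be agreed with the business side.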

The key issue:

There are people who respond positively even if they don’t get a letter in the mail There are people who respond negatively even if they get a letter in the mail There are people who respond negatively if they get a letter in the mail, but would have responded positively if we left them alone

83

slide-84
SLIDE 84

Marketing analytics: response modeling

Gunnarsson, B.R., vanden Broucke, S., De Weerdt, J. (2019). Optimizing Marketing Campaign Targeting Using Uncertainty-Based Predictive Modelling. In: 2019 IEEE International Conference on Data Mining Workshops (ICDMW). Presented at the 19th IEEE International Conference on Data Mining (ICDM), Beijing, China, 08 Nov 2019-11 Nov 2019.

84

slide-85
SLIDE 85

Real estate analytics

https://www.rockestate.be/blog/2017/10/26/point-cloud-processing.html

85

slide-86
SLIDE 86

Fraud analytics

Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms

Financial fraud Tax fraud Insurance fraud Employee fraud …

86

slide-87
SLIDE 87

Fraud analytics

Credit card transaction fraud:

Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes) Store itself also processes legitimate transactions
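The card and store picture is a bipartite network, and a simple network feature is the share of a store's cards that are already known to be stolen. The transactions, card names and the function `store_exposure` below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (card, store) transaction pairs
transactions = [
    ("card_A", "store_1"),
    ("card_B", "store_1"),
    ("card_C", "store_1"),
    ("card_C", "store_2"),
    ("card_D", "store_2"),
]
known_stolen = {"card_A", "card_B"}


def store_exposure(transactions, known_stolen):
    """For each store, the fraction of distinct cards seen there that are known to be stolen."""
    cards_per_store = defaultdict(set)
    for card, store in transactions:
        cards_per_store[store].add(card)
    return {
        store: len(cards & known_stolen) / len(cards)
        for store, cards in cards_per_store.items()
    }


exposure = store_exposure(transactions, known_stolen)
```

A high exposure does not make a store fraudulent, since it also processes legitimate transactions, but it makes other cards used there (such as `card_C`) worth a closer look.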

87

slide-88
SLIDE 88

Fraud analytics

Identity theft:

  • Before: person calls his/her frequent contacts
  • After: person also calls new contacts which coincidentally overlap with another person’s contacts
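The “overlap with another person’s contacts” can be quantified with a set similarity such as the Jaccard index. The contact sets below are hypothetical, and Jaccard is just one reasonable choice of overlap measure.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical contact sets
usual_contacts = {"mom", "office", "dentist"}
new_contacts = {"x1", "x2", "x3"}          # contacts that appear after the suspected takeover
other_person_contacts = {"x1", "x2", "x4"}

overlap_before = jaccard(usual_contacts, other_person_contacts)
overlap_after = jaccard(new_contacts, other_person_contacts)
```

A jump in overlap with someone else's calling pattern is the signal of interest; the threshold at which it becomes suspicious would have to be calibrated on known cases.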

88

slide-89
SLIDE 89

Fraud analytics

Social security fraud:

Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities

89

slide-90
SLIDE 90

Fraud analytics

Insurance fraud:

Exaggeration of damages “Crash for cash”

90

slide-91
SLIDE 91

Fraud analytics

Stripling E., Baesens B., Chizi B., Vanden Broucke S. (2018). Isolation-based conditional anomaly detection on mixed-attribute data to uncover workers’ compensation fraud. Decision Support Systems, 111, 13-26.

91

slide-92
SLIDE 92

HR analytics

Employee churn Employee performance Employee absence Employee satisfaction Employee lifetime value …

92