Advanced Analytics in Business [D0S07a] Big Data Platforms & - - PowerPoint PPT Presentation

SLIDE 1

Advanced Analytics in Business [D0S07a] Big Data Platforms & Technologies [D0S06a]

Setting the Scene The Data Science Process Supervised and Unsupervised Learning Introduction

slide-2
SLIDE 2

Overview

Setting the scene Data scientists Data quality The analytics process model Predictive versus descriptive analytics Example applications

SLIDE 3

Setting the scene

SLIDE 4

Living in a data flooded world

https://deepmind.com/blog/alphago-zero-learning-scratch/

“DeepMind’s AI became a superhuman chess player in a few hours, just for fun. The descendant of DeepMind’s world champion Go program stretches its muscles in a new domain.”

2015

SLIDE 5

Living in a data flooded world

https://deepmind.com/blog/alphazero-shedding-new-light-grand-games-chess-shogi-and-go/

2017

SLIDE 6

Living in a data flooded world

https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/

2019

SLIDE 7

Living in a data flooded world

http://www.nvidia.com/object/drive-px.html
http://kevinhughes.ca/blog/tensor-kart

SLIDE 8

Living in a data flooded world

http://affinelayer.com/pixsrv/index.html

SLIDE 9

Living in a data flooded world

SLIDE 10

Living in a data flooded world, continued

https://www.nature.com/articles/nature21056.epdf

SLIDE 11

Living in a data flooded world, continued

https://mashable.com/2018/02/26/ai-beats-humans-at-contracts/

An AI just beat top lawyers at their own game A new study, conducted by legal AI platform LawGeex in consultation with law professors from Stanford University, Duke University School of Law, and University of Southern California, pitted twenty experienced lawyers against an AI trained to evaluate legal contracts. Competitors were given four hours to review five non-disclosure agreements (NDAs) and identify 30 legal issues, including arbitration, confidentiality of relationship, and indemnification. They were scored by how accurately they identified each issue. Unfortunately for humanity, we lost the competition — badly.

SLIDE 12

Living in a data flooded world, continued

http://3dgan.csail.mit.edu/

SLIDE 13

Living in a data flooded world, continued

https://arxiv.org/abs/1711.10669

Image2Mesh

“Most of us take for granted the ability to effortlessly perceive our surroundings world and its objects in three dimensions. In general, we have great ideas about the 3D space only by looking at a single 2D image of an object even when there are many possible shapes that could have produced the same image. We simply rely on assumptions and prior knowledge acquired throughout our lives for the inference. It is one of the fundamental goals of computer vision to give machines the ability to perceive its surroundings as we do, for the purpose of providing solutions to tasks such as self-driving cars, virtual and augmented reality, robotic surgery, to name a few.”

SLIDE 14

Living in a data flooded world, continued

http://web.mit.edu/vondrick/tinyvideo/

SLIDE 15

Living in a data flooded world, continued

http://karpathy.github.io/2015/10/25/selfie/

SLIDE 16

Living in a data flooded world, continued

https://github.com/ipsingh06/ml-desnapify

SLIDE 17

Living in a data flooded world, continued

https://arxiv.org/abs/1701.04928

SLIDE 18

Living in a data flooded world, continued

https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph

SLIDE 19

Living in a data flooded world, continued

https://www.technologyreview.com/s/612775/algorithms-criminal-justice-ai/

SLIDE 20

Living in a data flooded (real) world

SLIDE 21

Data science and data scientists

SLIDE 22

Data science

Data contains value and knowledge. But to extract this knowledge, you need to be able to:

Store it
Manage it
Analyze it

Terms often used interchangeably:

Data Mining ≈ Big Data ≈ Data Analytics ≈ Data Science ≈ Knowledge Discovery ≈ Artificial Intelligence ≈ Deep Learning

Don't worry too much about this, and don't be too swayed by Venn diagrams or infographics

SLIDE 23

Data science

https://vas3k.com/blog/machine_learning/?ref=hn

What even is this?

SLIDE 24

Data scientists

https://www.mckinsey.com/~/media/mckinsey/business functions/mckinsey digital/our insights/big data the next frontier for innovation/mgi_big_data_exec_summary.ashx

SLIDE 25

Data scientists

https://www.bloomberg.com/news/articles/2018-02-13/in-the-war-for-ai-talent-sky-high-salaries-are-the-weapons

“If you want to command a multiyear, seven-figure salary, you used to have only four career options: chief executive officer, banker, celebrity entertainer, or pro athlete. Now there’s a fifth: artificial intelligence expert.”

"I suspect AI today is like big data ten years ago"

“Exactly. Also as soon as big data came around nobody was doing just data, everyone was doing big data even if they had the same 10GB MySQL database they had from previous years. AI is a bit the same. Doing any analytics? - Now it's AI. Opening an excel spreadsheet and doing a curve fit: I am a data scientist doing AI. Doing any actual ML: not learning anymore but super deep learning.” (https://news.ycombinator.com/item?id=16366815)

SLIDE 26

Data scientists

https://www.techrepublic.com/article/why-data-scientist-is-the-most-promising-job-of-2019/

LinkedIn found... Top careers in data science include core data scientist, researcher, and big data specialist.

SLIDE 27

Defining the data scientist

A data scientist should have solid quantitative skills
A data scientist should be a good programmer
A data scientist should excel in communication and visualization skills
A data scientist should have a solid business understanding
A data scientist should be creative

SLIDE 28

What's analytics all about?

Given lots of (possibly huge) data, discover patterns and models that are:

Valid: hold on new data with some certainty, i.e. generalizable
  Over time, seasonal effects, overfitting, sub-groups, regional differences…
Useful: should be possible to act on the item, i.e. actionable
  Business question, implementation, maintenance costs, ease-of-use…
Unexpected: non-obvious to the system, i.e. interesting
  Balance between trust and discovery… Big and “weird” data
Understandable: humans should be able to interpret the pattern
  Black box vs. white box, trust, validity…

SLIDE 29

Valid, generalizable

https://www.gwern.net/Tanks

RL agent in Udacity self-driving car rewarded for speed learns to spin in circles (https://twitter.com/mat_kelcey/status/886101319559335936)

NASA Mars mission planning, optimizing food/water/electricity consumption for total man-days survival, yields an optimal plan of killing 2/3 crew & keep survivor alive as long as possible

SLIDE 30

Useful, actionable

We can predict who will churn, and then what?

SLIDE 31

Unexpected, interesting

https://www.fastcompany.com/3063110/the-rise-of-weird-data

Optimizing right turns for UPS drivers
Typing with proper capitalization indicates creditworthiness
Users of the Chrome and Firefox browsers make better employees
But: not everything which is unexpected is interesting, or valid

SLIDE 32

Understandable

"Why does your model predict fraud?"
"Which attributes of the customer are important?"
"If age goes up you're more at risk?"
"I don't understand interaction effects"
"What do you mean 'just trust us'?"
...

SLIDE 33

Ethical?

Models become increasingly complex…

But they also rule many aspects of our lives, from credit scoring to employment, all the way down to predicting recidivism
People are becoming increasingly aware of what is being done with their data and are becoming more protective of their privacy and their right to challenge a model’s conclusions
The White House released a statement regarding the promises and dangers of analytics: “Big Risks, Big Opportunities: the Intersection of Big Data and Civil Rights”, and there are many more examples

SLIDE 34

Ethical?

SLIDE 35

Ethical?

SLIDE 36

The analytics process model

SLIDE 37

CRISP-DM

Cross Industry Standard Process for Data Mining

SLIDE 38

Others

SEMMA

Sample, Explore, Modify, Model, and Assess https://en.wikipedia.org/wiki/SEMMA

The drivetrain approach

Jeremy Howard, Margit Zwemer and Mike Loukides https://www.oreilly.com/ideas/drivetrain-approach-data-products

SLIDE 39

Challenges

SLIDE 40

Challenges

Mapping the business question to a technique / setup (there is no one-size-fits-all)
(Not) realizing the amount of effort required in pre-processing
Low amount of training data, either instances or features
Or too many features…
Huge data imbalance, or not even labeled data
Quality of data, noise
Predicting the future is hard (who’d have thought!) – hard to extrapolate towards the future for many models (machines are naïve and lazy)
Incorporating domain knowledge, explaining models
Strong validation / backtesting setup requires time and enough data
Organizational aspects, teams, management

SLIDE 41

Some quotes

“During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage.” – Gartner

“Data maturity of companies is very disparate, and the most advanced of them start doubting.” – Christophe Bourguignat

“75% have invested in Big Data, but only 10% have projects in production.”

“Companies face disillusions. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?”

SLIDE 42

Data quality

GIGO principle

Garbage in, Garbage out; messy data gives messy models

In many cases, simple analytical models perform well, so biggest performance increase comes from the data!

Baesens et al., 2003; Van Gestel, Baesens et al., 2004; Holte, 1993

Importance of Master Data Management and Data quality programmes!

But modern data science requires a more nuanced view

The best way to improve the performance of an analytical model is not to look for fancy tools or techniques, but to improve DATA QUALITY first! (Baesens B., It’s the data, you stupid!, Data News, 2007)

SLIDE 43

Data quality criteria

Data accuracy

E.g., outliers: an age of 300 years versus an income of 1.000.000 Euro (not the same: the first is impossible, the second is unusual but can be correct)

Data completeness

Are missing values important?

Data bias and sampling

Try to minimise, but can never totally get rid of

Data definition

Variables: what is the meaning of 0?
Target: fraud, churn, default, customer lifetime value (CLV), …

Data recency/latency

Refresh frequency
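The difference between the two outliers mentioned under data accuracy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plausibility range for age, the 1.5 × IQR fences, the sample values and the helper name `classify_value` are all invented for the example and are not prescribed by the slides.

```python
import statistics


def classify_value(value, reference, valid_range=None, k=1.5):
    """Return 'invalid', 'extreme' or 'ok' for a single numeric value.

    valid_range is an assumed (low, high) plausibility range: values outside
    it cannot be correct. Otherwise the value is compared with the quartiles
    of the reference sample, using assumed k * IQR fences.
    """
    if valid_range is not None:
        low, high = valid_range
        if not low <= value <= high:
            return "invalid"
    q1, _, q3 = statistics.quantiles(reference, n=4)
    spread = q3 - q1
    if value < q1 - k * spread or value > q3 + k * spread:
        return "extreme"
    return "ok"


# Hypothetical reference samples of clean observations.
ages = [23, 31, 35, 42, 47, 52, 58, 64]
incomes = [1800, 2000, 2200, 2500, 2800, 3000, 3200, 3500, 4000]

age_check = classify_value(300, ages, valid_range=(0, 120))
income_check = classify_value(1_000_000, incomes, valid_range=(0, float("inf")))
typical_check = classify_value(3000, incomes, valid_range=(0, float("inf")))
```

An age of 300 fails the plausibility range, while an income of 1.000.000 passes it but falls far outside the fences. How each case is then handled (treat as missing, cap, keep) remains a modelling decision.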

SLIDE 44

The data science team

Database / data warehouse administrator
Business expert (e.g. marketeer, credit risk analyst, …)
Legal expert
Data scientist / data miner
Software / tool vendors

A multidisciplinary team needs to be set up! (A data scientist is not a magic unicorn)

SLIDE 45

The aftercare

Interpretation and validation of analytical models by business experts

Trivial versus unexpected (interesting?) patterns

Sensitivity analysis

How sensitive is the model with regard to sample characteristics, assumptions and/or technique parameters?

Deploy analytical model into business setting

Represent model output in a user-friendly way
Integrate with campaign management tools and marketing decision engines

Model monitoring and backtesting

Continuously monitor model output
Contrast model output with observed numbers

We'll highlight this again in later sessions. Still very much the "forgotten" part of data science

SLIDE 46

Analytics: the basics

SLIDE 47

What's analytics all about?

Given lots of (possibly huge) data, discover patterns and models that are:

Valid: hold on new data with some certainty, i.e. generalizable
  Over time, seasonal effects, overfitting, sub-groups, regional differences…
Useful: should be possible to act on the item, i.e. actionable
  Business question, implementation, maintenance costs, ease-of-use…
Unexpected: non-obvious to the system, i.e. interesting
  Balance between trust and discovery… Big and “weird” data
Understandable: humans should be able to interpret the pattern
  Black box vs. white box, trust, validity…

SLIDE 48

Analytics

Essentially refers to extracting valid, useful, interesting and understandable business patterns and/or mathematical decision models from a preprocessed data set

Predictive analytics (supervised learning)

Predict the future based on patterns learnt from past data
Classification (categorical) versus regression (continuous)
You have a labelled data set at your disposal

Descriptive analytics (unsupervised learning)

Describe patterns in data
Clustering, association rules, sequence rules
No labelling required

(There is more than just these two)

SLIDE 49

Basic terminology

A tabular data set ("structured data"):

SLIDE 50

Basic terminology

A tabular data set ("structured data"):

SLIDE 51

Basic terminology

A tabular data set ("structured data"):

SLIDE 52

Basic terminology

A tabular data set ("structured data"):

Has instances (examples, rows, observations, customers, ...) and attributes (features, fields, variables, predictors)

These features can be:

Numeric (continuous) or categorical (discrete, nominal, ordinal, factor, binary)

Target (label, class, dependent variable) can be present

Can also be numeric, categorical, ...
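As a small illustration of this terminology, the sketch below builds a toy tabular data set in plain Python. The column names and values are invented for the example, and `attribute_kind` is a hypothetical helper rather than part of any library.

```python
# Hypothetical toy data set: every dict is an instance (row),
# every key is an attribute (column).
customers = [
    {"age": 34, "income": 2800.0, "marital_status": "married", "churned": "no"},
    {"age": 51, "income": 4100.0, "marital_status": "single", "churned": "yes"},
    {"age": 27, "income": 1900.0, "marital_status": "single", "churned": "no"},
]

TARGET = "churned"  # assumed name of the target (label, dependent variable)


def attribute_kind(rows, name):
    """Return 'numeric' if every value is an int or float, else 'categorical'."""
    values = [row[name] for row in rows]
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numeric" if is_numeric else "categorical"


# Features are all attributes except the target.
features = [name for name in customers[0] if name != TARGET]
kinds = {name: attribute_kind(customers, name) for name in customers[0]}
```

Here the target happens to be categorical, which makes this a classification setting; a numeric target such as customer lifetime value would make it a regression setting.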

SLIDE 53

What about big data?

No worries, we'll get to that...

http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data

SLIDE 54

Our goals

Advanced Analytics in Business [D0S07a]

Data science process
Supervised learning
Ensemble models
Advanced techniques
Unsupervised learning
Working with data science tooling
Some “special topics”

Big Data Platforms & Technologies [D0S06a]

Hadoop (MapReduce)
Spark
Other future trends
NoSQL, Neo4j

SLIDE 55

Our goals

You don’t need big data to do analytics ... but you can
Big data doesn’t necessarily mean doing analytics ... but it can

SLIDE 56

It's about technology too

... though with a critical view

SLIDE 57

It's about technology too

... though with a critical view

HDFS?

What about HDF5, or Kudu?
Do we even have unstructured data? Do we know what to do with it?
V’s of Big Data – yeah right!

BigSQL, or Hive, or Slurp?

Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?

What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X?
s/Hadoop/Deep Learning/
s/Big Data/AI/

So we start with the analytics first

SLIDE 58

Predictive analytics: classification

SLIDE 59

Predictive analytics: classification

SLIDE 60

Predictive analytics: regression

SLIDE 61

Predictive analytics: regression

SLIDE 62

Descriptive analytics: clustering

SLIDE 63

Descriptive analytics: clustering

SLIDE 64

Descriptive analytics: association rules

SLIDE 65

Descriptive analytics: association rules

SLIDE 66

Descriptive analytics: sequence rules

SLIDE 67

Descriptive analytics: sequence rules

SLIDE 68

Analytical model requirements

Business relevance

Solve a particular business problem

Statistical performance

Statistical significance, accuracy of model
Statistical prediction performance

Interpretability and justifiability

Very subjective (depends on decision maker), but crucial!
Often needs to be balanced against statistical performance

Operational efficiency

How can the analytical models be integrated with campaign management?

Economic cost

What is the cost to gather the model inputs and evaluate the model?
Is it worthwhile buying external data and/or models?

Regulatory compliance

In accordance with regulation and legislation

Remember: models which are valid (generalizable), useful (actionable), unexpected (interesting), understandable

SLIDE 69

A few real-life examples

SLIDE 70

Risk analytics

More than ever before, analytical models steer strategic risk decisions of financial institutions
Minimum equity (buffer capital) and provisions a financial institution holds are directly determined, e.g. by

Credit risk analytics
Market risk analytics
Operational risk analytics
Insurance risk analytics

Business analytics is typically used to build all these models
Often subject to regulation (e.g. Basel II, Basel III, Solvency II, …)
Model errors directly affect profitability, solvency, shareholder value, macro-economy, … society as a whole

SLIDE 71

Risk analytics: credit scoring

Estimate probability of default at the time the applicant applies for the loan!
Also: LGD (loss given default: predict loss if the applicant defaults)
Use predetermined definition of default (e.g. 3 months of payment arrears)
Use application variables

E.g. age, income, marital status, years at address, years with employer, …

Use bureau variables

Bureau score, raw bureau data (e.g. number of credit checks, total amount of credits, delinquency history, …)
In the US: FICO scores between 300 and 850
Experian, Equifax, TransUnion
E.g., Baycorp Advantage (Australia & New Zealand), Schufa (Germany), BKR (the Netherlands), CKP (Belgium), Dun & Bradstreet

SLIDE 72

Risk analytics: credit scoring

SLIDE 73

Marketing analytics

Plethora of different techniques

Churn prediction (retention modeling): which customers will leave you, and why?
Response modeling: to which customers do we send out an incentive?
Segmentation modeling: grouping customers in segments
Recommender systems: which product to recommend next?

SLIDE 74

Marketing analytics: churn prediction

Understanding why customers leave you
Customer retention is important because long term loyal customers are less price sensitive, cost less to serve and have a higher lifetime value

Small improvements in customer retention generate significant returns. Very important in Telco sector (about 2% monthly churn rate)

Transaction versus relationship buyers

Transaction buyers: buy because of low price
Relationship buyers: want to build loyal relationship with firm

Contractual versus non-contractual setting

Contractual setting: customer cancels contract (e.g. postpaid Telco)
Non-contractual setting: customer hasn’t purchased any products or services during previous 3 months (e.g. online retailer)

Types of churn:

Active: customer stops relationship with firm
Passive: customer decreases intensity of relationship
Forced: company stops relationship because of e.g. fraud
Expected: customer no longer needs product/service (e.g. baby products)

SLIDE 75

Marketing analytics: churn prediction

Typical data sources:

Demographic data: e.g. age, gender, marital status
Relationship variables: e.g. length of relationship
Product/Service usage and ownership data: e.g. number of products purchased, number of transactions in previous month, trend in usage, …
RFM data
Complaints data: e.g. number of filed complaints, service desk contacted, …
(Social) network information (cf. infra)

RFM?

Already popular since (Cullinan, 1977)
Recency: number of months since last purchase
Frequency: number of purchases within a given time frame
Monetary: dollar value of purchases
Different operationalisations possible of RFM variables
  E.g., Monetary: average/maximum/total dollar value? Trend variables
Can only be measured for existing customers, not for prospects (e.g. response modeling)
Often used to build a segmentation scheme or combine into a single RFM score
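One possible operationalisation of the RFM variables is sketched below in plain Python. The purchase log, the customer ids and the function name `rfm` are hypothetical. Counting recency in whole months and taking the total amount as the monetary value are assumptions made for this example; average or maximum value would be equally valid choices.

```python
from datetime import date

# Hypothetical purchase log: (customer_id, purchase_date, amount).
purchases = [
    ("c1", date(2019, 1, 10), 40.0),
    ("c1", date(2019, 3, 5), 60.0),
    ("c2", date(2018, 11, 20), 200.0),
]


def rfm(purchases, today):
    """Compute recency (whole months since the last purchase), frequency
    (number of purchases in the log) and monetary (total amount) per customer."""
    by_customer = {}
    for customer, day, amount in purchases:
        by_customer.setdefault(customer, []).append((day, amount))

    result = {}
    for customer, rows in by_customer.items():
        last = max(day for day, _ in rows)
        months = (today.year - last.year) * 12 + (today.month - last.month)
        if today.day < last.day:
            months -= 1  # the last month has not fully elapsed yet
        result[customer] = {
            "recency": months,
            "frequency": len(rows),
            "monetary": sum(amount for _, amount in rows),
        }
    return result


scores = rfm(purchases, today=date(2019, 4, 1))
```

The three values per customer can then be binned into segments or combined into a single RFM score.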

SLIDE 76

Marketing analytics: churn prediction

Three enormous challenges...

1. Need to make a distinction between a characteristic that predicts future churn and a symptom of churn that is already occurring

E.g. a sudden peak in usage often occurs right before churn because the customer has already decided to churn
Focus on early-warning predictors

2. Real-life churn data sets have a very skewed class distribution (e.g. about 1-5% churners)

Logistic regression and decision tree models cannot be appropriately estimated on such skewed data
Use oversampling on the train (not test!) data

3. How to make it actionable?
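For the second challenge, a minimal sketch of random oversampling applied to the training part only is given below. The helper names, the 30% test fraction and the toy data with 5% churners are assumptions for illustration. Random duplication is just one of several resampling options, and the slides do not prescribe a specific one.

```python
import random


def train_test_split(rows, test_fraction, rng):
    """Shuffle a copy of the rows and hold out a fraction as the test set."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def oversample_minority(train, is_minority, rng):
    """Randomly duplicate minority rows until both classes are equally large."""
    minority = [row for row in train if is_minority(row)]
    majority = [row for row in train if not is_minority(row)]
    if not minority:
        return list(train)  # nothing to duplicate
    n_extra = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return majority + minority + extra


rng = random.Random(42)
# Hypothetical data: 100 customers, of which 5% are churners.
data = [{"id": i, "churn": int(i < 5)} for i in range(100)]

# Split first, then oversample the training part only.
train, test = train_test_split(data, test_fraction=0.3, rng=rng)
balanced_train = oversample_minority(train, lambda row: row["churn"] == 1, rng)
```

The test set keeps the original skewed distribution, so performance measured on it still reflects what the model will face in practice.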

SLIDE 77

Marketing analytics: churn prediction

SLIDE 78

Marketing analytics: response modeling

Customer acquisition: acquiring new customers with targeted campaigns, win-back campaigns
Campaign can be mail catalogue, email, coupon, A/B or multivariate testing, …
Identify the customers most likely to respond based on the following information:

Demographic variables (age, gender, marital status, …)
Relationship variables (length of relationship, number of products purchased, …)
RFM variables
(Social) network information (cf. infra)

The key issue:

There are people who respond positively even if they don't get a letter in the mail
There are people who respond negatively even if they get a letter in the mail
There are people who respond negatively if they get a letter in the mail, but would have responded positively if we left them alone

SLIDE 79

Marketing analytics: response modeling

Split target group into test group and control group

Test group receives marketing material and control group does not
Incremental impact equals the additional purchases that are directly attributable to the campaign (Larsen, 2010)
Incremental impact = test group purchase rate – control group purchase rate

Try to factor in the behavior of self-selecting clients, clients that purchase regardless of the marketing campaign

Focus should be on swing clients: interested in the product, but need to be motivated (by e.g. marketing message) to take action
Both test and control group should be representative
Find a model such that the difference between the test group purchase rate and the control group purchase rate is maximized (i.e. identifying the swing clients)
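The incremental impact formula can be written out directly. The sketch below is a plain Python illustration; the function names and the campaign counts are hypothetical.

```python
def purchase_rate(purchases, group_size):
    """Fraction of a group that purchased."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return purchases / group_size


def incremental_impact(test_purchases, test_size, control_purchases, control_size):
    """Test group purchase rate minus control group purchase rate."""
    return purchase_rate(test_purchases, test_size) - purchase_rate(
        control_purchases, control_size
    )


# Hypothetical campaign: 120 of 1000 contacted clients purchase,
# versus 80 of 1000 clients in the control group.
impact = incremental_impact(120, 1000, 80, 1000)
```

With these invented numbers the impact is roughly 0.04, i.e. about four extra purchases per hundred clients contacted. The 8% who purchase in the control group are the self-selecting clients, so they should not be credited to the campaign.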

SLIDE 80

Marketing analytics: response modeling

SLIDE 81

Marketing analytics: response modeling

SLIDE 82

Marketing analytics: response modeling

SLIDE 83

Fraud analytics

Fraud is an uncommon, well-considered, imperceptibly concealed, time-evolving and often carefully organized crime which appears in many types and forms

Financial fraud
Tax fraud
Insurance fraud
Employee fraud
...

SLIDE 84

Fraud analytics

Credit card transaction fraud:

Stolen credit cards (yellow nodes) are often used in the same stores (blue nodes)
Store itself also processes legitimate transactions

SLIDE 85

Fraud analytics

Identity theft:

Before: person calls his/her frequent contacts
After: person also calls new contacts which coincidentally overlap with another person's contacts

SLIDE 86

Fraud analytics

Social security fraud:

Companies are frequently associated with other companies that perpetrate suspicious/fraudulent activities

SLIDE 87

Fraud analytics

Insurance fraud:

Exaggeration of damages
“Crash for cash”

SLIDE 88

HR analytics

Employee churn
Employee performance
Employee absence
Employee satisfaction
Employee Lifetime Value
…

SLIDE 89

HR analytics

Baesens, De Winne, Sels, What to Do Before You Fire a Pivotal Employee, 2016
LinkedIn based data mining

https://www.bloomberg.com/news/features/2017-11-15/the-brutal-fight-to-mine-your-data-and-sell-it-to-your-boss
https://hbr.org/2016/01/what-to-do-before-you-fire-a-pivotal-employee

Still, in practice the selection and retention of talent remains more art than science [...]. The goal is to go beyond traditional but little-examined practices [...] to subtler metrics and methods. Big companies are growing more interested as the cost of replacing valued workers becomes clearer.
