Data & Science: A mandate for data-driven corporate innovation
By Igor Stojković, Enterprise Analytics & Data, Philip Morris International

Contents
• Mathematics for data science in a commercial environment
• To prove or not to prove
• Multidisciplinary teams and Agile
• Rlabs at ABNAMRO
• Transforming discussions with business stakeholders into mathematical models
• Business & data understanding / experiment design / data prep / modeling / performance evaluation
• Second-hand car sales model
  • Kalman filter
  • Long short-term memory (LSTM) neural network model

Mathematics for data science in corporate environments
• Not about proving rigorous statements (☹)
• Deductive vs inductive science
• Willingness to dive into business details and mathematicise them
• Creative analytical thought
• Apply advanced techniques in novel ways for operational excellence, new markets and products
• Keep reading papers all the time
  • My current reading: Wasserstein Generative Adversarial Networks (WGAN)
• Don’t get bored because it will kill you!

Multidisciplinary teams: Agile
• Senior stakeholders: accept or reject proposals
• Product owner: determines what needs to be built
• Scrum master: guards the process
• Development team: data scientist, domain expert, data engineer/hunter

A Data Science objective: Rlabs@ABNAMRO Bank
• Risk as a Service (RaaS)
• Combine internal credit risk management knowledge with data & science to build new API services for internal and external usage
  • More efficient and up-to-date risk management
  • New proposition to clients
• Utilize internal and external data sources
• Consider different sub-sectors separately

How to approach?
• A general observation:
  • A washing service SME serving hotels is not interested in PD, LGD, EAD (Basel) credit risk models
  • It is interested in predictions of the number of beds sold per hotel, which help it steer its business
  • Such models are a novelty in the banking industry and valuable for risk management
• Collected domain expertise and requirements through internal and external discussions:
  • Which operational figures are crucial to the performance of an SME (e.g. a hotel) and relevant to creditors as well as to buyers and/or suppliers of the entities considered?
• Boundaries
  • Availability and price of external data sources
  • Privacy

Dutch second-hand car dealership forecast model
• Goal: sales forecasts at postal code (PC) area level (4 digits)
• Available sales events with
  • Car specs
  • Car age
  • Quantity sold
  • Dealer’s and consumer’s postal code
• Other available data:
  • Marktplaats data with average prices per car specs/age/period
  • Internal data on consumer behavior (aggregated to area level)
  • APK (Dutch periodic vehicle inspection) data

First modeling steps
• Data prep
  • Cleaning: sounds trivial but can be extremely time consuming or even require deep modeling itself
  • Transforming data structure: aggregate, merge, find suitable representations (sometimes deeply analytical)
• Target design
  • Number of cars sold per period, postal code area, price class and car age
  • Price classes determined by clustering
• Model design choices
  • Kalman filter
  • LSTM model
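
The target design above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the sales events are made up, the price classes come from a plain one-dimensional k-means with two clusters, and car age is grouped into hypothetical 5-year bands. The slides do not say which clustering method or age grouping was actually used.

```python
from collections import Counter

def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns the sorted cluster centres."""
    ordered = sorted(values)
    # Initialise the centres at evenly spaced quantiles of the data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            groups[nearest].append(v)
        new_centres = [sum(g) / len(g) if g else centres[j]
                       for j, g in enumerate(groups)]
        if new_centres == centres:
            break
        centres = new_centres
    return sorted(centres)

def price_class(price, centres):
    """Index of the nearest centre (0 = cheapest class)."""
    return min(range(len(centres)), key=lambda j: abs(price - centres[j]))

# Hypothetical sales events: (period, consumer PC, price in EUR, car age in years, quantity).
events = [
    ("2019Q1", "1011", 2500, 12, 1),
    ("2019Q1", "1011", 3000, 11, 2),
    ("2019Q1", "1011", 15000, 3, 1),
    ("2019Q2", "1011", 14000, 4, 1),
    ("2019Q2", "3511", 2800, 10, 1),
    ("2019Q2", "3511", 16000, 2, 3),
]

centres = kmeans_1d([price for _, _, price, _, _ in events], k=2)

# Target: number of cars sold per (period, PC area, price class, age band).
target = Counter()
for period, pc, price, age, qty in events:
    age_band = age // 5  # hypothetical 5-year age bands
    target[(period, pc, price_class(price, centres), age_band)] += qty

for key in sorted(target):
    print(key, target[key])
```

In practice the number of clusters and the age grouping would be tuned on the real data; the point here is only the shape of the resulting target table.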

Predictive features design
• PC area of dealer and consumer
• Where do clients of car dealers live (distribution)?
• Consumer behavior contains clues about driving patterns at PC level
• Incidence of second-hand and new car ownership
• APK data contains information on the incidence of car decay
• How often do owners change their second-hand cars?

Kalman filter solution details
$$Y_t := X_t \alpha_t + \varepsilon^{y}_t, \qquad \varepsilon^{y}_t \sim N(0, \Sigma_y)$$
$$\alpha_t := F \alpha_{t-1} + \varepsilon^{\alpha}_t, \qquad \varepsilon^{\alpha}_t \sim N(0, \Sigma_\alpha)$$
• $X_t$: predictive features known at $t$
• $\Sigma_\alpha$, $\Sigma_y$: unknown covariance matrices
• $F$: unknown matrix to be estimated
• This is a generalization of the local level model.
• 3000 time series, each with a 6 month horizon
• Neighboring observations have a 3 month overlap
• In total 36 time points per time series
• Application of the embedding layer technique significantly enhanced performance
• We clustered the PCs’ vector representations and trained Kalman filter parameters per cluster (iteratively, passing the results at the end of an epoch as input to the next epoch within a cluster)
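
To make the state-space model above concrete, here is a minimal NumPy sketch of the standard Kalman filter recursion for an observation equation Y_t = X_t α_t + noise and a state equation α_t = F α_{t-1} + noise. It assumes that F and both covariance matrices are already known; the slides treat them as unknown, and their estimation (per cluster) is not shown here. The demo data, a flat series of 36 points in a local level model, is made up.

```python
import numpy as np

def kalman_filter(y, X, F, Sigma_alpha, Sigma_y, a0, P0):
    """Filter Y_t = X_t a_t + e_t, a_t = F a_{t-1} + n_t; returns the filtered states."""
    a, P = a0, P0
    filtered = []
    for y_t, X_t in zip(y, X):
        # Predict: push the state and its covariance through the transition.
        a_pred = F @ a
        P_pred = F @ P @ F.T + Sigma_alpha
        # Update: correct the prediction with the new observation.
        innovation = y_t - X_t @ a_pred
        S = X_t @ P_pred @ X_t.T + Sigma_y
        K = P_pred @ X_t.T @ np.linalg.inv(S)
        a = a_pred + K @ innovation
        P = (np.eye(len(a)) - K @ X_t) @ P_pred
        filtered.append(a)
    return np.array(filtered)

# Local level special case: one state, F = 1 and X_t = 1 for every t.
T = 36                                  # time points per series, as on the slide
y = np.full((T, 1), 10.0)               # made-up flat sales series
X = np.ones((T, 1, 1))
F = np.eye(1)
filtered = kalman_filter(
    y, X, F,
    Sigma_alpha=np.array([[0.01]]),     # assumed state noise
    Sigma_y=np.array([[1.0]]),          # assumed observation noise
    a0=np.zeros(1),
    P0=np.array([[100.0]]),             # vague prior on the initial level
)
print(filtered[-1])
```

With a vague prior the filtered level moves quickly towards the observed value and then settles, which is the behavior expected from a local level model on a flat series.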

LSTM neural network
• Target redesign
  • “Cut up” the 36-point series (6 to 8 points per new observation)
  • Gives multiple observations per series
  • Some overlap is OK but not too much
[Figure: the series t1, …, t36 cut into overlapping Subseries 1, 2, 3, …, 11, with a train partition and a test partition.]
• Predictors
  • Original feature series plus embedding layer values
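
The cutting step can be sketched in a few lines. The window length of 6 and the stride of 3 are assumptions: they are chosen because they turn 36 points into exactly 11 overlapping subseries, the last one covering t31 to t36, which matches the labels in the figure. Holding out only the last subseries as the test partition is likewise an assumption.

```python
def cut_series(series, window=6, stride=3):
    """Cut one series into overlapping windows of equal length."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, stride)]

series = list(range(1, 37))        # stand-in for the time points t1..t36
subseries = cut_series(series)

# Assumed split: the last subseries is held out for testing.
train, test = subseries[:-1], subseries[-1:]

print(len(subseries))              # number of subseries
print(subseries[0], subseries[1])  # neighbours share 3 points
print(test[0])
```

A larger stride reduces the overlap between neighboring observations at the cost of fewer training examples per series.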

Embedding layer
• We train a simple NN with one-hot encodings of PCs as inputs and parts of the series (in this case 6 quarters) as target values
• The hidden layer gives a vector representation of abstract PC ids in relation to their series behavior
[Figure: one-hot representation of PCs. Each one-hot PC vector (PC1, …, PC3000) passes through weights, ReLU activations and weights to a target sub-series (t1 to t6, …, t31 to t36) of that PC’s series.]

Embedding layer model formulation
• $h(x) := \sigma(W_1 x + w_1)$, where $x$ is the one-hot representation of a PC area and $W_1$, $w_1$ are the weights of the hidden layer
• $t(h) := \sigma(W_2 h + w_2)$, where $W_2$, $w_2$ are the weights of the output layer
• $\sigma(z) := (z_1^+, \dots, z_n^+)$ for $z \in \mathbb{R}^n$
• $(W_1, w_1, W_2, w_2) := \arg\min \mathbb{E}\,\lVert s - t(h(x)) \rVert^2$, where $s$ is the target series and $\mathbb{E}$ is taken w.r.t. the data
• Features to add to the LSTM model or to use for clustering series for joint Kalman filter inference: $W_1 x + w_1 \in \mathbb{R}^l$, $l = 6$ to $10$
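
A short NumPy sketch of the forward pass may help to see why the hidden layer acts as an embedding: with a one-hot input, the pre-activation of the hidden layer simply picks one column of the hidden weight matrix and adds the bias. The sizes below are hypothetical toy values (the slides mention 3000 PCs and 6 to 10 embedding dimensions), and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pc, emb_dim, target_len = 5, 3, 6   # hypothetical toy sizes

# Untrained, randomly initialised weights (training is not shown).
W1, w1 = rng.normal(size=(emb_dim, n_pc)), rng.normal(size=emb_dim)
W2, w2 = rng.normal(size=(target_len, emb_dim)), rng.normal(size=target_len)

def relu(z):
    return np.maximum(z, 0.0)

def one_hot(index, size):
    x = np.zeros(size)
    x[index] = 1.0
    return x

def hidden(x):
    """Hidden layer: ReLU of an affine map of the one-hot input."""
    return relu(W1 @ x + w1)

def output(h):
    """Output layer: predicts a target sub-series from the hidden vector."""
    return relu(W2 @ h + w2)

def embedding(x):
    """Pre-activation hidden values, used as features for the PC area."""
    return W1 @ x + w1

x = one_hot(2, n_pc)
print(embedding(x))          # equals column 2 of W1 plus the bias
print(output(hidden(x)))     # predicted sub-series of length target_len
```

After training, the vectors returned by `embedding` are what would be fed to the LSTM model or clustered for the joint Kalman filter inference.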

Car sales LSTM model
[Figure: our LSTM architecture, stacked from bottom to top: input series x1, …, x6; LSTM layer 1; LSTM layer 2; Dense layer 1; Dense layer 2; target x7.]
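
The stack in the figure (two LSTM layers followed by two dense layers, mapping x1, …, x6 to x7) can be sketched as a plain NumPy forward pass. This is only an illustration under assumptions: the layer sizes, the ReLU on the first dense layer, the linear output and the standard LSTM cell equations are not taken from the slides, the names W (input weights) and U (recurrent weights) are chosen for this sketch, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_lstm(n_in, n_hidden, rng):
    # W acts on the input, U on the previous hidden state.
    # The four gates are stacked in the order: input, forget, output, candidate.
    return {
        "W": rng.normal(scale=0.1, size=(4 * n_hidden, n_in)),
        "U": rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden)),
        "b": np.zeros(4 * n_hidden),
    }

def lstm_layer(xs, p):
    """Run one LSTM layer over xs of shape (T, n_in); returns shape (T, n_hidden)."""
    n_hidden = p["U"].shape[1]
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    outputs = []
    for x in xs:
        z = p["W"] @ x + p["U"] @ h + p["b"]
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update the cell state
        h = sigmoid(o) * np.tanh(c)                    # new hidden state
        outputs.append(h)
    return np.array(outputs)

def forward(xs, params):
    h1 = lstm_layer(xs, params["lstm1"])
    h2 = lstm_layer(h1, params["lstm2"])
    d1 = np.maximum(params["D1"] @ h2[-1] + params["d1"], 0.0)  # dense layer 1
    return params["D2"] @ d1 + params["d2"]                     # dense layer 2

rng = np.random.default_rng(0)
n_features, n_hidden, n_dense = 4, 8, 5   # hypothetical sizes
params = {
    "lstm1": init_lstm(n_features, n_hidden, rng),
    "lstm2": init_lstm(n_hidden, n_hidden, rng),
    "D1": rng.normal(scale=0.1, size=(n_dense, n_hidden)), "d1": np.zeros(n_dense),
    "D2": rng.normal(scale=0.1, size=(1, n_dense)), "d2": np.zeros(1),
}

xs = rng.normal(size=(6, n_features))   # made-up input series x1..x6
prediction = forward(xs, params)        # stands in for the target x7
print(prediction)
```

In a real project this would normally be written with a deep learning library and trained on the subseries; the sketch only shows how the layers feed into each other.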

Performance evaluation
• $y^p_t$: true value of sales for PC $p$ at time $t$
• $\hat{y}^p_t$: our prediction for PC $p$ at time $t$
• $\mathrm{err}^p_t := \dfrac{|\hat{y}^p_t - y^p_t|}{y^p_t}$
• Baseline prediction is the naive (manager’s) guess: $\mathrm{base\_err}^p_t := \dfrac{|y^p_{t-1} - y^p_t|}{y^p_t}$
• Compare histograms of $\mathrm{err}_t$ and $\mathrm{base\_err}_t$ (aggregated over PCs)
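
As a small worked example of this comparison, the sketch below computes a relative absolute error for the model and for a naive baseline that predicts the previously observed value. Both the exact error formula and the choice of the previous value as the naive guess are assumptions here, and the sales figures are made up.

```python
def relative_errors(actual, predicted):
    """Relative absolute error per time point."""
    return [abs(p - a) / a for a, p in zip(actual, predicted)]

# Made-up sales for one PC area at four consecutive time points.
sales = [100.0, 110.0, 120.0, 90.0]

# Model predictions for time points 2..4 (hypothetical values).
model_pred = [104.5, 126.0, 99.0]

# Naive guess for time points 2..4: the value observed one step earlier.
naive_pred = sales[:-1]

err = relative_errors(sales[1:], model_pred)
base_err = relative_errors(sales[1:], naive_pred)

for t, (e, b) in enumerate(zip(err, base_err), start=2):
    print(f"t={t}: model {e:.3f}, baseline {b:.3f}")
```

Collecting these errors over all PC areas gives the two samples whose histograms are compared.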
