KDD Cup 2009
Fast Scoring on a Large Database
Presentation of the Results at the KDD Cup Workshop, June 28, 2009 — The Organizing Team
KDD Cup 2009 Organizing Team
Project team at Orange Labs R&D: Vincent Lemaire, Marc Boullé
Beta testing and proceedings editor:
Web site design:
Coordination (KDD cup co-chairs):
Thanks to our sponsors…
Orange, ACM SIGKDD, Pascal, Unipen, Google, Health Discovery Corp, Clopinet, Data Mining Solutions, MPS
KDD Cup Participation By Year
Year   # Teams
1997   45
1998   57
1999   24
2000   31
2001   136
2002   18
2003   57
2004   102
2005   37
2006   68
2007   95
2008   128
2009   453
Participation Statistics
! 1299 registered teams
! 7865 entries
! 46 countries:
South Africa, Latvia, France, Slovenia, Jordan, Finland, United States, Slovak Republic, Japan, Fiji, Uruguay, Singapore, Italy, China, United Kingdom, Russian Federation, Israel, Chile, Uganda, Romania, Ireland, Canada, Turkey, Portugal, Iran, Bulgaria, Taiwan, Pakistan, India, Brazil, Switzerland, New Zealand, Hungary, Belgium, Sweden, Netherlands, Hong Kong, Austria, Spain, Mexico, Greece, Australia, South Korea, Malaysia, Germany, Argentina
A worldwide operator
! One of the main telecommunication operators in the world
! Providing services to more than 170 million customers over five continents
! Including 120 million under the Orange brand
KDD Cup 2009 organized by Orange
Customer Relationship Management (CRM)
! Three marketing tasks: predict the propensity of customers
– to switch provider: Churn
– to buy new products or services: Appetency
– to buy upgrades or new options proposed to them: Up-selling
! Objective: improve the return on investment (ROI) of marketing campaigns
– Increase the efficiency of the campaign for a given campaign cost
– Decrease the campaign cost for a given marketing objective
! Better prediction leads to better ROI
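The ROI logic can be made concrete with a toy computation (all figures below are hypothetical, not Orange data): contacting only the highest-scored customers turns a better ranking directly into higher campaign profit at fixed cost.

```python
# Toy campaign-profit computation; every number here is hypothetical.

def campaign_profit(scores, is_churner, n_contacts, cost_per_contact, value_per_save, save_rate):
    """Expected profit when contacting the n_contacts highest-scored customers."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    contacted = ranked[:n_contacts]
    saves = sum(is_churner[i] for i in contacted) * save_rate
    return saves * value_per_save - n_contacts * cost_per_contact

# 10 customers, 3 actual churners; a good model ranks the churners first.
scores     = [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.3, 0.2, 0.1, 0.05]
is_churner = [1,   0,   1,   0,   1,   0,   0,   0,   0,   0]

targeted = campaign_profit(scores, is_churner, n_contacts=3,
                           cost_per_contact=1.0, value_per_save=100.0, save_rate=0.5)
print(targeted)  # 3 churners contacted: 3 * 0.5 * 100 - 3 = 147.0
```

Random targeting with the same budget would reach only 0.9 churners in expectation (3 contacts times a 3/10 churn rate), so the ranking quality, not the budget, drives the ROI.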
! Train and deploy requirements
– About one hundred models per month
– Fast data preparation and modeling
– Fast deployment
! Model requirements
– Robust
– Accurate
– Understandable
! Business requirement
– Return on investment for the whole process
! Input data
– Relational databases
– Numerical or categorical
– Noisy
– Missing values
– Heavily unbalanced distribution
! Train data
– Hundreds of thousands of instances
– Tens of thousands of variables
! Deployment
– Tens of millions of instances
Data, constraints and requirements
In-house system
From raw data to scoring models
[Conceptual data model diagram (MCD PAC_v4): relational schema linking Customer, Services, Products, and Call details tables]
! Data warehouse
– Relational database
! Data mart
– Star schema
! Feature construction
– PAC technology
– Generates tens of thousands of variables
! Data preparation and modeling
– Khiops technology
[Pipeline diagram: data feeding, PAC, Khiops, scoring model; example constructed variables: Id customer, zip code, Nb calls/month, Nb calls/hour, Nb calls/month, weekday, hours, service, …]
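The PAC feature-construction step can be sketched as follows. The helper below is hypothetical (not the actual PAC implementation): it derives per-customer call counts over every combination of descriptive keys, which illustrates how tens of thousands of variables arise from a few relational fields.

```python
# Minimal sketch of automatic aggregate construction in the spirit of PAC
# (hypothetical helper, not the actual Orange implementation): from a
# call-details table, derive per-customer counts over key combinations.

from collections import Counter
from itertools import combinations

def build_count_features(records, keys):
    """For every non-empty subset of `keys`, count calls per (customer, key-values)."""
    features = {}
    for r in range(1, len(keys) + 1):
        for subset in combinations(keys, r):
            name = "nb_calls_per_" + "_".join(subset)
            counts = Counter((rec["customer"],) + tuple(rec[k] for k in subset)
                             for rec in records)
            features[name] = dict(counts)
    return features

calls = [
    {"customer": "c1", "weekday": "mon", "service": "voice"},
    {"customer": "c1", "weekday": "mon", "service": "sms"},
    {"customer": "c1", "weekday": "tue", "service": "voice"},
    {"customer": "c2", "weekday": "mon", "service": "voice"},
]

feats = build_count_features(calls, ["weekday", "service"])
print(feats["nb_calls_per_weekday"][("c1", "mon")])                   # 2
print(feats["nb_calls_per_weekday_service"][("c1", "mon", "voice")])  # 1
```

With more keys (month, hour, service, …) the number of subsets, and hence of constructed variables, grows combinatorially, matching the "tens of thousands of variables" above.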
Design of the challenge
! Orange business objective
– Benchmark the in-house system against state of the art techniques
! Data
– Data store: not an option
– Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
– Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
! Tasks
– Three representative marketing tasks
! Requirements
– Fast data preparation and modeling (fully automatic)
– Accurate
– Fast deployment
– Robust
– Understandable
Data sets extraction and preparation
! Input data
– 10 relational tables
– A few hundred fields
– One million customers
! Instance selection
– Resampling given the three marketing tasks
– Keep 100 000 instances, with less unbalanced target distributions
! Variable construction
– Using PAC technology
– 20 000 constructed variables to get a tabular representation
– Keep 15 000 variables (discard constant variables)
– Small track: subset of 230 variables related to classical domain knowledge
! Anonymization
– Discard variable names, discard identifiers
– Randomize order of variables
– Rescale each numerical variable by a random factor
– Recode each categorical variable using random category names
! Data samples
– 50 000 train and test instances sampled randomly
– 5 000 validation instances sampled randomly from the test set
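The anonymization recipe above can be sketched as follows. The random factors and the `Var*`/`Val*` naming scheme are illustrative assumptions, not the actual Orange procedure.

```python
# Sketch of the anonymization steps: shuffle variable order, rescale numerical
# variables by a random factor, recode categories, drop original names.
# (Illustrative only; category names here are deterministic placeholders,
# whereas the actual procedure used random names.)
import random

def anonymize(columns, is_numeric, rng):
    """columns: dict name -> list of values. Returns a list of (new_name, values)."""
    names = list(columns)
    rng.shuffle(names)                          # randomize order of variables
    out = []
    for j, name in enumerate(names):
        values = columns[name]
        if is_numeric[name]:
            factor = rng.uniform(0.5, 2.0)      # rescale by a random factor
            values = [v * factor for v in values]
        else:
            cats = {c: "Val%d_%d" % (j, k)
                    for k, c in enumerate(sorted(set(values)))}
            values = [cats[v] for v in values]  # recode category labels
        out.append(("Var%d" % j, values))       # discard original variable name
    return out

rng = random.Random(0)
data = {"age": [25, 40], "region": ["north", "south"]}
anon = anonymize(data, {"age": True, "region": False}, rng)
print([name for name, _ in anon])  # ['Var0', 'Var1']
```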
Scientific and technical challenge
! Scientific objective
– Fast data preparation and modeling: within five days
– Large scale: 50 000 train and test instances, 15 000 variables
– Heterogeneous data: numerical with missing values; categorical with hundreds of values
– Heavily unbalanced distribution
! KDD social meeting objective
– Attract as many participants as possible
– Additional small track and slow track
– Online feedback on the validation dataset
– Toy problem (only one informative input variable)
– Leverage challenge protocol overhead
– One month to explore descriptive data and test the submission protocol
– Attractive conditions
– No intellectual property conditions
– Money prizes
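Entries were ranked by test AUC, the area under the ROC curve used throughout the result analysis below. A minimal implementation via the Wilcoxon-Mann-Whitney statistic (the probability that a randomly drawn positive is scored above a randomly drawn negative):

```python
# Minimal AUC computation from the Wilcoxon-Mann-Whitney statistic.

def auc(scores, labels):
    """AUC for binary labels in {0, 1}; ties between scores count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(auc(scores, labels))  # 8/9 ≈ 0.889
```

Because AUC depends only on the ranking of scores, it is insensitive to monotone rescaling, a convenient property given the heavily unbalanced class distributions.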
Business impact of the challenge
! Bring Orange datasets to the data mining community
– Benefit for the community: access to challenging data
– Benefit for Orange: benchmark of numerous competing techniques; drive the research efforts towards Orange needs
! Evaluate the Orange in-house system
– High number of participants and high quality of the results
– Orange in-house results:
– Improved by a significant margin when leveraging all business requirements
– Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness and understandability)
– Need to study the best challenge methods to get more insights
KDD Cup 2009: Result Analysis
Legend: best result (over the period considered in the figure); in-house system (downloadable at www.khiops.com); baseline (Naïve Bayes)

Overall – Test AUC – Fast
[Plot: submissions over time vs. best results on each dataset; good results were obtained very quickly]
In-house (Orange) system: very fast, good result; small improvement after the first day (83.85 → 84.93)
Overall – Test AUC – Slow
Very small improvement after the 5th day (84.93 → 85.2). Improvement due to unscrambling?
Overall – Test AUC – Submissions
≈ 23.24% of the submissions (AUC > 0.5) below the baseline
≈ 15.25% of the submissions (AUC > 0.5) above the in-house system
≈ 84.75% of the submissions (AUC > 0.5) below the in-house system
Overall – Test AUC 'Correlation' Test / Valid
Overall – Test AUC 'Correlation' Test / Train
[Scatter-plot annotations: random values submitted; boosting method or train targets submitted (overfitting?)]
Overall – Test AUC
[Panels: Test AUC at 12 hours, 24 hours, 5 days, and 36 days]
Test AUC = f(time). Easier? Harder?
Difference between first and last day:
Churn – Test AUC – day ∈ [0:36]: ∆ = 1.84%
Appetency – Test AUC – day ∈ [0:36]: ∆ = 1.38%
Up-selling – Test AUC – day ∈ [0:36]: ∆ = 0.11%
Correlation Test AUC / Valid AUC (5 days). Easier? Harder?
Churn – Test/Valid – day ∈ [0:5]
Appetency – Test/Valid – day ∈ [0:5]
Up-selling – Test/Valid – day ∈ [0:5]
Correlation Test AUC / Train AUC (36 days). Difficult to draw conclusions…
Churn – Test/Train – day ∈ [0:36]
Appetency – Test/Train – day ∈ [0:36]
Up-selling – Test/Train – day ∈ [0:36]
Histogram: Test AUC / Valid AUC ([0:5] vs ]5:36] days)
Does knowledge (parameters?) found during the first 5 days help afterwards? YES!
Churn – Test AUC – day ∈ [0:36] and day ∈ ]5:36]
Appetency – Test AUC – day ∈ [0:36] and day ∈ ]5:36]
Up-selling – Test AUC – day ∈ [0:36] and day ∈ ]5:36]
Fact Sheets: Preprocessing & Feature Selection
PREPROCESSING (overall usage = 95%), percent of participants using each:
principal component analysis, other preprocessing, grouping of modalities, normalizations, discretization, replacement of missing values
FEATURE SELECTION (overall usage = 85%), percent of participants using each:
wrapper with search (forward/backward), embedded methods, other FS, filter methods, feature ranking
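Two of the most common choices above, missing-value replacement and filter-based feature ranking, can be sketched together. The toy data and the mean-difference filter score are illustrative assumptions, not any participant's actual pipeline.

```python
# Sketch: impute missing values by the column mean, then rank features by a
# simple filter score (absolute difference of class-conditional means after
# standardization). Illustrative only.

def impute_mean(column):
    """Replace None entries by the mean of the known values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]

def filter_score(column, labels):
    """Higher score = the feature separates the two classes better."""
    mu = sum(column) / len(column)
    sd = (sum((v - mu) ** 2 for v in column) / len(column)) ** 0.5 or 1.0
    z = [(v - mu) / sd for v in column]
    m1 = sum(v for v, y in zip(z, labels) if y == 1) / labels.count(1)
    m0 = sum(v for v, y in zip(z, labels) if y == 0) / labels.count(0)
    return abs(m1 - m0)

labels = [1, 1, 0, 0]
informative = impute_mean([5.0, None, 1.0, 2.0])  # missing value imputed
noise = impute_mean([3.0, 2.0, 3.0, 2.0])         # uncorrelated with labels
print(filter_score(informative, labels) > filter_score(noise, labels))  # True
```

Filter scores like this are cheap enough to rank all 15 000 variables, which is why filter methods and feature ranking dominate the fact sheets.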
Fact Sheets: Classifier
CLASSIFIER (overall usage = 93%), percent of participants using each:
Bayesian Neural Network, Bayesian Network, nearest neighbors, Naïve Bayes, Neural Network, other classifiers, non-linear kernel, linear classifier, decision trees…
Loss functions: >15% exponential loss, >15% squared loss, ~10% hinge loss
Regularization: 20% 2-norm, 10% 1-norm
Fact Sheets: Model Selection
MODEL SELECTION (overall usage = 90%), percent of participants using each:
Bayesian, bi-level, penalty-based, virtual leave-one-out, other cross-validation, other model selection, bootstrap estimate, out-of-bag estimate, K-fold or leave-one-out, 10% held-out test set
Ensembles: 1/3 boosting, 1/3 bagging, 1/3 other

Fact Sheets: Implementation
Memory: <= 2 GB, <= 8 GB, > 8 GB, >= 32 GB
Operating system: Windows, Unix, Linux, Mac OS
Parallelism: none, multi-processor, run in parallel
Software: Java, Matlab, C, C++, other (R, SAS)
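K-fold cross-validation, the most used model-selection scheme above, can be sketched for a toy one-dimensional threshold classifier. The data and the midpoint-threshold "model" are illustrative assumptions, not a challenge entry.

```python
# K-fold cross-validation sketch: fit a threshold rule on each train split,
# evaluate on the held-out fold, average the accuracy.

def best_threshold(xs, ys):
    """Fit step: pick the midpoint threshold maximizing training accuracy."""
    s = sorted(xs)
    candidates = [(a + b) / 2 for a, b in zip(s, s[1:])]
    return max(candidates,
               key=lambda t: sum((x >= t) == (y == 1) for x, y in zip(xs, ys)))

def cv_accuracy(xs, ys, k):
    """Mean held-out accuracy over k interleaved folds."""
    n = len(xs)
    correct = 0
    for f in range(k):
        test = set(range(f, n, k))
        train = [i for i in range(n) if i not in test]
        t = best_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct += sum((xs[i] >= t) == (ys[i] == 1) for i in test)
    return correct / n

xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0,   0,   0,   1,   1,   1]
print(cv_accuracy(xs, ys, 3))  # 1.0 on this separable toy set
```

The same loop, with the threshold rule swapped for any real learner, is how most teams estimated generalization performance before submitting.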
Winning methods
Fast track:
– into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
– an additive boosting decision tree technology; bagging also used
– FS, ensemble of decision trees
Slow track:
– classification trees and shrinkage, using Bernoulli loss
– using AIC, gradient tree-classifier boosting
– multiclass problem with l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learner; (3) Selective Naïve Bayes
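One of the listed winning ingredients, AdaBoost with a tree-based weak learner, can be sketched with the smallest possible trees (depth-1 decision stumps). This is a minimal illustration of the technique, not any team's actual code.

```python
# AdaBoost with decision stumps: reweight examples after each round so the
# next stump focuses on previous mistakes. Labels are in {-1, +1}.
import math

def stump_fit(xs, ys, w):
    """Best single-feature threshold rule under weights w."""
    best = None
    for j in range(len(xs[0])):
        for t in sorted({x[j] for x in xs}):
            for sign in (1, -1):
                err = sum(wi for x, y, wi in zip(xs, ys, w)
                          if (sign if x[j] >= t else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, sign = stump_fit(xs, ys, w)
        err = max(err, 1e-12)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)   # stump weight
        model.append((alpha, j, t, sign))
        preds = [(sign if x[j] >= t else -sign) for x in xs]
        w = [wi * math.exp(-alpha * y * p) for wi, y, p in zip(w, ys, preds)]
        z = sum(w)
        w = [wi / z for wi in w]                  # renormalize weights
    return model

def predict(model, x):
    s = sum(alpha * (sign if x[j] >= t else -sign)
            for alpha, j, t, sign in model)
    return 1 if s >= 0 else -1

# Toy 1-D data no single stump can fit; three boosting rounds suffice.
xs = [(0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (0.6,)]
ys = [-1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])  # [-1, -1, 1, 1, 1, -1]
```

Deeper trees as weak learners, plus bagging, give the ensemble-of-trees family that dominated both tracks.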
Conclusion
! Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
– A problem of real industrial interest with challenging scientific and technical aspects
– Prizes.
! Lessons learned:
– Do not underestimate the participants: five days were given for the fast challenge; a few hours sufficed for some participants.
– Ensemble methods are effective.
– Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and lots of missing values.