

SLIDE 1

KDD Cup 2009

Fast Scoring on a Large Database

Presentation of the Results at the KDD Cup Workshop, June 28, 2009

The Organizing Team

SLIDE 2

KDD Cup 2009 Organizing Team

Project team at Orange Labs R&D:

  • Vincent Lemaire
  • Marc Boullé
  • Fabrice Clérot
  • Raphaël Féraud
  • Aurélie Le Cam
  • Pascal Gouzien

Beta testing and proceedings editor:

  • Gideon Dror

Web site design:

  • Olivier Guyon (MisterP.net, France)

Coordination (KDD cup co-chairs):

  • Isabelle Guyon
  • David Vogel
SLIDE 3

Thanks to our sponsors…

  • Orange
  • ACM SIGKDD
  • Pascal
  • Unipen
  • Google
  • Health Discovery Corp
  • Clopinet
  • Data Mining Solutions
  • MPS

SLIDE 4

KDD Cup Participation By Year

Year    # Teams
1997         45
1998         57
1999         24
2000         31
2001        136
2002         18
2003         57
2004        102
2005         37
2006         68
2007         95
2008        128
2009        453

Record KDD Cup participation in 2009.

SLIDE 5

Participation Statistics

  • 1,299 registered teams
  • 7,865 entries
  • 46 countries:

South Africa Latvia France Slovenia Jordan Finland United States Slovak Republic Japan Fiji Uruguay Singapore Italy China United Kingdom Russian Federation Israel Chile Uganda Romania Ireland Canada Turkey Portugal Iran Bulgaria Taiwan Pakistan India Brazil Switzerland New Zealand Hungary Belgium Sweden Netherlands Hong Kong Austria Spain Mexico Greece Australia South Korea Malaysia Germany Argentina

SLIDE 6

A worldwide operator

  • One of the main telecommunication operators in the world
  • Providing services to more than 170 million customers across five continents
  • Including 120 million under the Orange brand

SLIDE 7

KDD Cup 2009 organized by Orange

Customer Relationship Management (CRM)

  • Three marketing tasks: predict the propensity of customers
    – to switch provider: Churn
    – to buy new products or services: Appetency
    – to buy upgrades or new options proposed to them: Up-selling
  • Objective: improve the return on investment (ROI) of marketing campaigns
    – Increase the efficiency of a campaign for a given campaign cost
    – Decrease the campaign cost for a given marketing objective
  • Better prediction leads to better ROI
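The ROI logic above can be made concrete with a toy calculation: contact only the top-scored fraction of customers and weigh the revenue from true positives against the contact cost. The function name, cost, and revenue figures below are illustrative assumptions, not Orange numbers.

```python
def campaign_roi(scores, labels, contact_cost=1.0, revenue=20.0, top_frac=0.1):
    """Toy ROI of a score-targeted campaign: contact the top-scored
    fraction of customers; each contacted buyer (label 1) brings
    `revenue`, each contact costs `contact_cost`. Illustrative only."""
    ranked = sorted(zip(scores, labels), reverse=True)
    k = max(1, int(top_frac * len(ranked)))          # campaign size
    hits = sum(label for _, label in ranked[:k])     # buyers actually contacted
    cost = k * contact_cost
    return (hits * revenue - cost) / cost
```

A better-ranking model puts more buyers into the contacted fraction, so the same campaign cost yields a higher return, which is exactly the "better prediction leads to better ROI" claim.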

SLIDE 8

Data, constraints and requirements

  • Train and deploy requirements
    – About one hundred models per month
    – Fast data preparation and modeling
    – Fast deployment
  • Model requirements
    – Robust
    – Accurate
    – Understandable
  • Business requirement
    – Return on investment for the whole process
  • Input data
    – Relational databases
    – Numerical or categorical
    – Noisy
    – Missing values
    – Heavily unbalanced distribution
  • Train data
    – Hundreds of thousands of instances
    – Tens of thousands of variables
  • Deployment
    – Tens of millions of instances

SLIDE 9

In-house system

From raw data to scoring models

[Figure: conceptual data model "MCD PAC_v4" of the Orange data warehouse; entities cover customers (tiers), households, addresses, commercial offers, products & services, usage records, and billing]

Customers, services, products, call details, …

  • Data warehouse
    – Relational database
  • Data mart
    – Star schema
  • Feature construction
    – PAC technology
    – Generates tens of thousands of variables
  • Data preparation and modeling
    – Khiops technology

Example constructed variables: customer id, zip code, number of calls per month, number of calls per hour, number of calls per month/weekday/hour/service, …

Pipeline: data feeding → PAC (feature construction) → Khiops (modeling) → scoring model

SLIDE 10

Design of the challenge

  • Orange business objective
    – Benchmark the in-house system against state-of-the-art techniques
  • Data
    – Data store: not an option
    – Data warehouse: confidentiality and scalability issues; relational data requires domain knowledge and specialized skills
    – Tabular format: standard format for the data mining community; domain knowledge incorporated using feature construction (PAC); easy anonymization
  • Tasks
    – Three representative marketing tasks
  • Requirements
    – Fast data preparation and modeling (fully automatic)
    – Accurate
    – Fast deployment
    – Robust
    – Understandable

SLIDE 11

Data sets extraction and preparation

  • Input data
    – 10 relational tables
    – A few hundred fields
    – One million customers
  • Instance selection
    – Resampling given the three marketing tasks
    – Keep 100,000 instances, with less unbalanced target distributions
  • Variable construction
    – Using PAC technology
    – 20,000 constructed variables to get a tabular representation
    – Keep 15,000 variables (discard constant variables)
    – Small track: subset of 230 variables related to classical domain knowledge
  • Anonymization
    – Discard variable names and identifiers
    – Randomize the order of variables
    – Rescale each numerical variable by a random factor
    – Recode each categorical variable using random category names
  • Data samples
    – 50,000 train and test instances sampled randomly
    – 5,000 validation instances sampled randomly from the test set
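The anonymization recipe above (rescale numerics by a random factor, recode categories with opaque names) can be sketched as below. This is a reconstruction for illustration; the function and field names are invented, and the organizers' actual pipeline is not public.

```python
import random

def anonymize(rows, numeric_cols, categorical_cols, seed=0):
    """Rescale each numerical variable by a random positive factor and
    recode each categorical variable with opaque category names, as
    described on the slide. Sketch only, not the organizers' code."""
    rng = random.Random(seed)
    scale = {c: rng.uniform(0.5, 2.0) for c in numeric_cols}  # one factor per column
    codebook = {c: {} for c in categorical_cols}              # original -> opaque name
    out = []
    for row in rows:
        new = dict(row)
        for c in numeric_cols:
            new[c] = row[c] * scale[c]
        for c in categorical_cols:
            book = codebook[c]
            if row[c] not in book:
                book[row[c]] = "cat_%s_%d" % (c, len(book))
            new[c] = book[row[c]]
        out.append(new)
    return out
```

Rescaling preserves the order (and ratios) of values within a column, so ranking- and discretization-based learners are unaffected while the original units stay hidden.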

SLIDE 12

Scientific and technical challenge

  • Scientific objective
    – Fast data preparation and modeling: within five days
    – Large scale: 50,000 train and test instances, 15,000 variables
    – Heterogeneous data: numerical with missing values; categorical with hundreds of values
    – Heavily unbalanced distribution
  • KDD social meeting objective
    – Attract as many participants as possible
    – Additional small track and slow track
    – Online feedback on the validation dataset
    – Toy problem (only one informative input variable)
    – Limited challenge protocol overhead: one month to explore descriptive data and test the submission protocol
    – Attractive conditions: no intellectual property conditions, money prizes

SLIDE 13

Business impact of the challenge

  • Bring Orange datasets to the data mining community
    – Benefit for the community: access to challenging data
    – Benefit for Orange: benchmark of numerous competing techniques; drive research efforts towards Orange needs
  • Evaluate the Orange in-house system
    – High number of participants and high quality of the results
    – Orange in-house results: improved by a significant margin when leveraging all business requirements; almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
    – Need to study the best challenge methods to gain more insight

SLIDE 14

KDD Cup 2009: Result Analysis

[Figure legend: best result (over the period considered in the figure); in-house system (downloadable at www.khiops.com); baseline (Naïve Bayes)]
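The baseline in the figure is a Naïve Bayes classifier. As a reminder of how such a baseline works on categorical data, here is a minimal Naïve Bayes with Laplace smoothing; a generic sketch, not the organizers' implementation.

```python
import math
from collections import defaultdict

class NaiveBayes:
    """Minimal categorical Naive Bayes with Laplace smoothing.
    Generic sketch of a Naive Bayes baseline, not the challenge code."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.n = {c: y.count(c) for c in self.classes}            # class counts
        self.prior = {c: self.n[c] / len(y) for c in self.classes}
        # per-class, per-feature counts of each observed value
        self.counts = {c: [defaultdict(int) for _ in X[0]] for c in self.classes}
        for row, label in zip(X, y):
            for j, v in enumerate(row):
                self.counts[label][j][v] += 1
        return self

    def score(self, row):
        """Log-odds of the last (positive) class vs. the first."""
        logp = {}
        for c in self.classes:
            lp = math.log(self.prior[c])
            for j, v in enumerate(row):
                lp += math.log((self.counts[c][j][v] + 1) / (self.n[c] + 2))
            logp[c] = lp
        return logp[self.classes[-1]] - logp[self.classes[0]]
```

The continuous log-odds output (rather than a hard class label) is what makes such a baseline directly comparable under the challenge's AUC metric.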

SLIDE 15

Overall – Test AUC – Fast

[Figure: submissions over time; best results on each dataset marked; good results were reached very quickly]
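These result slides compare submissions by test-set AUC. For reference, AUC equals the probability that a uniformly drawn positive instance is scored above a uniformly drawn negative one (ties counting one half), which gives a direct, if quadratic-time, way to compute it:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    fraction of (positive, negative) pairs ranked correctly, ties
    counting one half. O(n_pos * n_neg); fine for small examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random scoring and 1.0 to a perfect ranking, which is why later slides treat 0.5 as the floor for meaningful submissions.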

SLIDE 16

Overall – Test AUC – Fast

[Figure: submissions and best results on each dataset; good results reached very quickly]

In-house (Orange) system:

  • No parameters
  • Runs on one standard laptop (single processor)
  • When the tasks are dealt with as 3 different problems
SLIDE 17

Overall – Test AUC – Fast

[Figure: good results very fast; small improvement after the first day (83.85 → 84.93)]

SLIDE 18

Overall – Test AUC – Slow

[Figure: very small improvement after the 5th day (84.93 → 85.2); improvement due to unscrambling?]

SLIDE 19

Overall – Test AUC – Submissions

Of the submissions with AUC > 0.5:

  • 23.24% scored below the baseline
  • ≈ 15.25% scored above the in-house system
  • ≈ 84.75% scored below the in-house system

SLIDE 20

Overall – Test AUC 'Correlation' Test / Valid

SLIDE 21

Overall – Test AUC 'Correlation' Test / Train

[Figure annotations: random values submitted; boosting method or train target submitted; overfitting?]

SLIDE 22

Overall – Test AUC

[Figure: test AUC after 12 hours, 24 hours, 5 days, and 36 days]

SLIDE 23

Overall – Test AUC

[Figure: test AUC after 12 hours vs. after 36 days]

  • time to adjust model parameters?
  • time to train ensemble methods?
  • time to find more processors?
  • time to test more methods?
  • time to unscramble?

Δ = difference between:

  • the best result at the end of the first day, and
  • the best result at the end of the 36 days

Δ = 1.35%

SLIDE 24

Test AUC = f (time)

Easier? Harder?

[Figure panels: Churn, Appetency, Up-selling (test AUC, days 0–36)]

SLIDE 25

Test AUC = f (time)

Easier? Harder?

Δ = difference between:

  • the best result at the end of the first day, and
  • the best result at the end of the 36 days

Δ = 1.84%, 1.38%, 0.11% (per panel, in order)

[Figure panels: Churn, Appetency, Up-selling (test AUC, days 0–36)]

SLIDE 26

Correlation Test AUC / Valid AUC (5 days)

Easier? Harder?

[Figure panels: Churn, Appetency, Up-selling (test vs. validation AUC, days 0–5)]

SLIDE 27

Correlation Test AUC / Train AUC (36 days)

Difficult to draw firm conclusions…

[Figure panels: Churn, Appetency, Up-selling (test vs. train AUC, days 0–36)]

SLIDE 28

Histogram: Test AUC / Valid AUC (days [0:5] vs. ]5:36])

Does knowledge (parameters?) found during the first 5 days help afterwards?

[Figure panels: Churn, Appetency, Up-selling (test AUC, days 0–36)]

SLIDE 29

Does knowledge (parameters?) found during the first 5 days help afterwards?

Histogram: Test AUC / Valid AUC (days [0:5] vs. ]5:36])

YES!

[Figure panels: Churn, Appetency, Up-selling (test AUC, days 0–36 and days ]5:36])]

SLIDE 30

Fact Sheets: Preprocessing & Feature Selection

[Bar chart, percent of participants: PREPROCESSING (overall usage = 95%). Techniques: principal component analysis, other preprocessing, grouping modalities, normalizations, discretization, replacement of the missing values; exact percentages not recoverable.]
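The two most-used preprocessing steps in the chart, missing-value replacement and discretization, fit in a few lines. Mean imputation and equal-frequency binning, shown here, are one common variant; the chart does not say which variants participants actually used.

```python
def impute_and_discretize(col, n_bins=3):
    """Replace missing values (None) by the column mean, then map
    values into roughly equal-frequency bins 0..n_bins-1.
    One common variant of the chart's two top techniques."""
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    filled = [mean if v is None else v for v in col]
    order = sorted(filled)
    # bin edges taken at empirical quantiles
    edges = [order[(i * len(order)) // n_bins] for i in range(1, n_bins)]
    return [sum(v >= e for e in edges) for v in filled]
```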

[Bar chart, percent of participants: FEATURE SELECTION (overall usage = 85%). Methods: wrapper with search (forward/backward wrapper), embedded method, other FS, filter method, feature ranking; exact percentages not recoverable.]
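Feature ranking, the most common feature-selection approach in the chart, is typically a univariate filter. One simple filter scores each variable by the AUC obtained when the raw feature value itself is used as the prediction; this is an illustrative sketch, not any particular team's method.

```python
def rank_features(X, y):
    """Univariate filter ranking: score each variable by the AUC of
    using the raw feature value as the score, direction-free
    (max of AUC and 1 - AUC). Returns feature indices, best first."""
    def auc(scores):
        pos = [s for s, t in zip(scores, y) if t == 1]
        neg = [s for s, t in zip(scores, y) if t == 0]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))
    scored = []
    for j in range(len(X[0])):
        a = auc([row[j] for row in X])
        scored.append((max(a, 1 - a), j))
    scored.sort(key=lambda t: -t[0])   # stable: ties keep original index order
    return [j for _, j in scored]
```

Because each feature is scored independently, such a filter scales linearly in the number of variables, which matters with the challenge's 15,000 features.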

SLIDE 31

Fact Sheets: Classifier

[Bar chart, percent of participants: CLASSIFIER (overall usage = 93%). Types: Bayesian neural network, Bayesian network, nearest neighbors, Naïve Bayes, neural network, other classifiers, non-linear kernel, linear classifier, decision tree; exact percentages not recoverable.]

  • About 30% logistic loss, >15% exponential loss, >15% squared loss, ~10% hinge loss.
  • Less than 50% used regularization (20% 2-norm, 10% 1-norm).
  • Only 13% used unlabeled data.
SLIDE 32

Fact Sheets: Model Selection

[Bar chart, percent of participants: MODEL SELECTION (overall usage = 90%). Methods: Bayesian, bi-level, penalty-based, virtual leave-one-out, other cross-validation, other MS, bootstrap estimation, out-of-bag estimation, K-fold or leave-one-out, 10% test; exact percentages not recoverable.]

  • About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
  • About 10% used unscrambling.
SLIDE 33

Fact Sheets: Implementation

[Charts, percent of participants:
  • Memory: ≤ 2 GB, ≤ 8 GB, > 8 GB, ≥ 32 GB
  • Operating system: Windows, Unix, Linux, Mac OS
  • Parallelism: none, multi-processor, run in parallel
  • Software platform: Java, Matlab, C, C++, other (R, SAS)]

SLIDE 34

Winning methods

Fast track:

  • IBM Research, USA (+): Ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.).
  • ID Analytics, Inc., USA (+): Filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
  • David Slate & Peter Frey, USA: Grouping of modalities/discretization, filter feature selection, ensemble of decision trees.

Slow track:

  • University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
  • Financial Engineering Group, Inc., Japan: Grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
  • National Taiwan University (+): Average of 3 classifiers: (1) joint multiclass problem solved with an L1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes.

(+: small-dataset unscrambling)
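Boosted tree ensembles recur in nearly every winning entry above. As a minimal illustration of the boosting idea only (not any winner's actual system; those used far stronger base learners such as TreeNet or gradient-boosted trees), here is AdaBoost over one-feature decision stumps:

```python
import math

def best_stump(X, y, w):
    """Weighted-error-minimizing threshold classifier on one feature."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for sign in (1, -1):
                pred = [sign if row[j] > t else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=5):
    """AdaBoost with stumps; labels y in {-1, +1}. Returns a scoring
    function whose sign is the ensemble's prediction."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, sign = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)
        alpha = 0.5 * math.log((1 - err) / err)   # vote weight of this stump
        model.append((alpha, j, t, sign))
        # re-weight: mistakes up, correct cases down, then normalize
        pred = [sign if row[j] > t else -sign for row in X]
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda row: sum(a * (s if row[j] > t else -s)
                           for a, j, t, s in model)
```

The weighted-sum output is a natural ranking score, which is one reason boosted ensembles suit an AUC-evaluated task like this one.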
SLIDE 35

Conclusion

  • Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered:
    – A problem of real industrial interest with challenging scientific and technical aspects
    – Prizes
  • Lessons learned:
    – Do not underestimate the participants: five days were given for the fast challenge, but a few hours sufficed for some participants.
    – Ensemble methods are effective.
    – Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed types of variables, and many missing values.