Analysis of the Effect of Sample Size on the Quality of Data Mining Models
David Watkins, SPSS UK Ltd
Overview
Question: how does sample size affect the quality of a data mining model?
Why is this important? Several reasons:
§ Common belief is that more data = better models.
§ Using all data can be costly
§ Commonly, the actual requirement is to score all the data,
§ Very small amounts of data are available.
§ Inability of IT to supply data
§ Time pressure to build a model
§ The software package or algorithm having a perceived or
§ What is the benefit of acquiring more data?
§ How will more data affect the model quality?
§ How costly is it to acquire that data?
§ There are many types of data mining models, including:
§ Clustering
§ Association
§ Forecasting
§ Classification
§ The focus here is on binary predictive models
§ Classification models that have a binary dependent variable.
§ These cover many commercial uses of data mining, including
§ customer acquisition
§ cross-sell
§ customer retention
§ fraud detection
§ credit scoring
§ etc.
§ 18 datasets studied
§ Varied in records: 43342 – 2602718
§ Varied in independent variables: 11 – 115
§ Varied in positive cases: 1249 – 289297
§ Varied in ratio of negative to positive cases: 1.07 – 89.01
Records    Positive cases   Negative cases   Ratio (–ve:+ve)
2602718    289297           2E+06            8
528323     254498           273825           1.07
1000834    208931           791903           3.79
592195     114454           477741           4.17
366563     59783            306780           5.13
1001665    52863            948802           17.96
667871     35791            632080           17.67
129315     35243            94072            2.67
168910     29099            139811           4.8
53603      23799            29804            1.25
71030      17437            53593            3.07
69487      16975            52512            3.09
43342      6376             36966            5.8
58028      4516             53512            11.86
64308      2351             61957            26.36
186102     1614             184488           114.4
250000     1325             248675           187.8
112380     1249             111131           89.01
§ www.crisp-dm.org
§ The model building process was controlled by five dimensions:
§ dataset
§ algorithm
§ number of true records in the sample (sample size)
§ balancing
§ trial
§ Seven variants of three algorithms were employed:
§ one logistic regression
§ four variants of error back propagation MLP neural networks
§ two C5.0 rule induction
§ The sample sizes were controlled by the number of true records in the
training dataset.
§ The exact number of true records randomly selected from all the training data
started at 32.
§ This figure was multiplied by the square root of 2 for the next sample size.
§ Number of false records was:
§ The same as the true records (balanced)
§ In keeping with the original –ve:+ve ratio (unbalanced)
§ 40 trials for each of the above
§ Yielded 177,520 predictive models.
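The sampling scheme described above can be sketched in a few lines of Python. This is only an illustration, not the Clementine script used in the study: the function names, the rounding to the nearest whole record, and the example ceiling of 512 true records are all assumptions.

```python
def true_record_schedule(max_true, start=32):
    """Numbers of true records per sample: start, start*sqrt(2), start*2, ..."""
    sizes = []
    k = 0
    while True:
        # 2 ** (k / 2) equals sqrt(2) ** k; rounding to whole records is an assumption.
        n = round(start * 2 ** (k / 2))
        if n > max_true:
            break
        sizes.append(n)
        k += 1
    return sizes


def false_record_count(n_true, neg_pos_ratio, balanced):
    """False records to pair with n_true true records in a training sample."""
    if balanced:
        # Balanced: as many false records as true records.
        return n_true
    # Unbalanced: keep the dataset's original negative-to-positive ratio.
    return round(n_true * neg_pos_ratio)


if __name__ == "__main__":
    # Hypothetical dataset with a 4:1 negative-to-positive ratio.
    for n_true in true_record_schedule(512):
        print(n_true,
              false_record_count(n_true, 4.0, balanced=True),
              false_record_count(n_true, 4.0, balanced=False))
```

Because each step multiplies by the square root of 2, the number of true records doubles every second step, so even large datasets need only a few dozen sample sizes.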
§ Correct ratio
§ Gain ratio in 1st decile
§ Gain ratio
§ Similar to, but not to be
\[ \text{CorrectRatio} = \frac{\text{Records classified correctly in evaluation dataset}}{\text{Records in evaluation dataset}} \]
\[ \text{GainRatio} = \frac{\text{Area between DM and Random models' gains curves}}{\text{Area between Hindsight and Random gains curves}} \]
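As a rough illustration of the two measures, the sketch below computes them for a small scored evaluation set. The slides do not show how the areas were calculated, so the trapezoid rule, the function names, and the 0/1 label encoding are assumptions. The "Hindsight" curve is taken to be the one obtained by ranking every true record first, and the evaluation set is assumed to contain both classes.

```python
def correct_ratio(predicted, actual):
    """Share of evaluation records whose predicted class matches the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def gains_area(ordered_labels):
    """Area under the cumulative gains curve for 0/1 labels in ranked order."""
    n = len(ordered_labels)
    positives = sum(ordered_labels)
    area, captured, prev = 0.0, 0, 0.0
    for label in ordered_labels:
        captured += label
        cur = captured / positives       # fraction of true records found so far
        area += (prev + cur) / 2 / n     # trapezoid of width 1/n
        prev = cur
    return area


def gain_ratio(scores, actual):
    """(model area - random area) / (hindsight area - random area)."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    by_score = [label for _, label in ranked]
    hindsight = sorted(actual, reverse=True)  # every true record ranked first
    random_area = 0.5                         # area under the diagonal
    return (gains_area(by_score) - random_area) / (gains_area(hindsight) - random_area)
```

Under these assumptions a model that ranks every true record ahead of every false one scores 1, a model no better than random scores about 0, and a model that ranks them in exactly the wrong order scores -1.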
§ In all cases,
§ Was the highest quality model for a given dataset,
§ Each set was tested to determine if the largest
§ “Too Large Ratio” is then based on whether or not
§ Over 50% of model sets have the highest correct
§ Just under 1/3rd of Gain ratio measures display the
same characteristic.
§ More noticeable when looking at the maximum
§ As sample size increases, it’s more likely that a
better gain ratio could be achieved on a smaller training sample.
Measure                   Too Large Ratio
Correct Ratio             0.51
Gain ratio at Decile 1    0.32
Gain ratio at Decile 10   0.28
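The bullets defining the "Too Large Ratio" are cut short in the slides, so the exact calculation is not recoverable. One plausible reading is the fraction of model sets in which some smaller sample produced a higher quality model than the largest sample did. The sketch below follows that reading only; the function name, the data layout, and the example values are hypothetical.

```python
def too_large_ratio(model_sets):
    """Fraction of model sets whose largest sample did not give the best model.

    model_sets: one list per model set, each holding (sample_size, quality)
    pairs. This layout is an assumption made for illustration.
    """
    too_large = 0
    for results in model_sets:
        largest_size = max(size for size, _ in results)
        quality_at_largest = max(q for size, q in results if size == largest_size)
        best_smaller = max(
            (q for size, q in results if size < largest_size),
            default=float("-inf"),
        )
        # Count the set if any smaller sample beat the largest one.
        if best_smaller > quality_at_largest:
            too_large += 1
    return too_large / len(model_sets)


if __name__ == "__main__":
    example_sets = [
        [(32, 0.60), (64, 0.70), (128, 0.65)],  # quality peaks before the largest sample
        [(32, 0.50), (64, 0.60), (128, 0.70)],  # quality keeps improving
    ]
    print(too_large_ratio(example_sets))
```

On this reading, the 0.51 reported for Correct Ratio would mean that in about half of the model sets a smaller training sample matched with a better model than the largest sample.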
§ Balancing data can greatly reduce sample size
§ Is it an effective technique?
§ What effect does it have on model quality?
§ To determine the effect of balancing on model quality,
§ The quality measure of the model built on balanced
§ This yielded a set of balancing effectiveness ratios for
§ Considering Gain Ratio, different model types exhibit different
§ Logistic Regression better on unbalanced data
§ Rules generally better on balanced data
§ Neural Nets better on balanced data when sample size gets large
§ For Correct Ratio, balancing seems to not be a good technique
§ This is due to the simplicity of the measure
§ It’s very noticeable the three measurements yield such different
§ This was automated using Clementine scripting
§ Control was provided by SPSS Predictive
§ A 4 PC cluster was employed to spread the
§ Increasing sample size can boost model quality.
§ However, a training sample can be too large
§ A smaller sample could produce a higher quality model.
§ For any given dataset, increasing the amount of data used to build
§ Balancing is effective when the model quality measure is gain
§ The effect of sample size on model quality is also highly dependent on
how model quality is measured.
§ The initial phase of CRISP-DM, business understanding, outlines
§ This analysis highlights the need for this approach to be adhered to,
including understanding how the model will be used in practice and how that affects the model building and evaluation process.