
(U) A Method for Regression Analysis on Sparse Datasets - PowerPoint PPT Presentation



  1. (U) A Method for Regression Analysis on Sparse Datasets. Daniel Barkmeyer, NRO CAAG, June 2015

  2. Background
  • Traditional regression analysis, e.g. Zero-Bias Minimum Percent Error (ZMPE), can run into problems when important independent variables are known only for some datapoints (sparsely populated). The usual workarounds are to:
    • Omit data for which not all drivers tested are known, or
    • Not test as drivers those data fields that are not fully populated
  • NRO CAAG's Commercial-like Acquisition Program Study (CAPS)* ameliorated this issue:
    • Empirically derived scoring term based on known drivers
    • Scores are independent of unknown drivers
    • Regression determines the contribution of drivers to the score, and coefficients expressing the DV as a function of the score
    • Linear regression only
  • OBJECTIVE: Apply score-based regression to power-form functions with multiplicative error terms

  * Alvarado, W., Barkmeyer, D., and Burgess, E. "Commercial-like Acquisitions: Practices and Costs." Journal of Cost Analysis and Parametrics, V3, Issue 1.

  3. Regression on Sparse Datasets
  • Advantage – retain the explanatory power of sparsely-populated drivers, and degrees of freedom, in regressions derived from sparsely-populated datasets
  • For a given independent variable n, if x_n is unknown for a datapoint, the influence of n is removed from the score for that datapoint
  • Datapoint can be retained in the regression as long as some x_n are known
  • Allows all partially-populated datapoints to inform the regression (see the sketch after the table)

  | Data Point | Cost | Weight | Operating Wavelength | Mobile (1) or Stationary (0) | Operational (1) or Experimental (0) | ZMPE Regression | Scoring Regression | Score                |
  |------------|------|--------|----------------------|------------------------------|-------------------------------------|-----------------|--------------------|----------------------|
  | 1          | $18  | 154    | 250                  | 0                            | 0                                   | Include         | Include            | S = f(W, λ, Mob, Op) |
  | 2          | $95  | 650    |                      |                              | 1                                   | Omit            | Include            | S = f(W, Op)         |
  | 3          | $54  |        | 450                  | 0                            |                                     | Omit            | Include            | S = f(λ, Mob)        |
  | 4          | $52  | 310    | 500                  |                              | 1                                   | Omit            | Include            | S = f(W, λ, Op)      |
  | 5          | $68  | 776    | 450                  | 0                            | 0                                   | Include         | Include            | S = f(W, λ, Mob, Op) |
  | 6          | $165 | 490    | 505                  | 1                            | 1                                   | Include         | Include            | S = f(W, λ, Mob, Op) |
  | 7          | $307 | 900    | 800                  |                              | 0                                   | Omit            | Include            | S = f(W, λ, Op)      |
  | 8          | $60  | 100    |                      | 1                            | 1                                   | Omit            | Include            | S = f(W, Mob, Op)    |
  | 9          | $123 | 281    | 550                  | 1                            | 0                                   | Include         | Include            | S = f(W, λ, Mob, Op) |
  | 10         | $82  | 200    | 380                  | 1                            |                                     | Omit            | Include            | S = f(W, λ, Mob)     |

  With 4 drivers: the ZMPE regression (4 usable datapoints) has 0 DOF; the scoring regression (all 10 datapoints) has 6 DOF.
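The Include/Omit logic for a single row of the table can be sketched directly. This is a minimal illustration only; the dictionary below is just datapoint 2 of the example, and the field names are shorthand, not from the presentation.

```python
# Datapoint 2 from the table above: None marks an unknown driver value.
point = {"W": 650, "λ": None, "Mob": None, "Op": 1}

known = [name for name, value in point.items() if value is not None]
zmpe = "Include" if len(known) == len(point) else "Omit"   # ZMPE needs every tested driver
score = f"S = f({', '.join(known)})"                       # the score uses whatever is known

print(zmpe, "| Include |", score)   # -> Omit | Include | S = f(W, Op)
```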

  4. Scoring Method for Power Function Form
  • For a linear regression equation in the CAPS model, the score was calculated as a weighted average of the normalized known drivers:

      S_{linear} = \frac{\sum_{n\,\mathrm{known}} w_n \bar{x}_n}{\sum_{n\,\mathrm{known}} w_n}

  • For a power function equation, the desired form of the score is a weighted geometric mean of the normalized known drivers (see the sketch below):

      S_{power} = \exp\left( \frac{\sum_{x_n\,\mathrm{known}} w_n \overline{\ln x_n}}{\sum_{x_n\,\mathrm{known}} w_n} \right)

    where

      \overline{\ln x_n} =
        \begin{cases}
          \dfrac{\ln x_n - (\ln x_n)_{min}}{(\ln x_n)_{max} - (\ln x_n)_{min}}, & x_n \text{ continuous} \\
          \dfrac{x_n - (x_n)_{min}}{(x_n)_{max} - (x_n)_{min}}, & x_n \text{ binary}
        \end{cases}

  • Where x_n is not known, the nth term drops out of both the numerator and the denominator of the score
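The per-datapoint score can be sketched in a few lines of Python. This is a minimal sketch, assuming unknown drivers are represented as np.nan; the function name power_score and the argument layout are illustrative, not taken from the presentation.

```python
import numpy as np

def power_score(x, w, stats, is_continuous):
    """Weighted geometric-mean score for one datapoint, using only its known drivers.

    x             : driver values for the datapoint (np.nan marks an unknown value)
    w             : driver weights w_n
    stats         : per-driver (min, max) normalization bounds from the populated data
                    (log-space bounds for continuous drivers, raw bounds for binary ones)
    is_continuous : True for continuous drivers, False for binary drivers
    """
    num = den = 0.0
    for xn, wn, (lo, hi), cont in zip(x, w, stats, is_continuous):
        if np.isnan(xn):
            continue                        # unknown driver: drops out of numerator and denominator
        val = np.log(xn) if cont else xn    # continuous drivers are normalized in log space
        num += wn * (val - lo) / (hi - lo)  # normalized driver, scaled to [0, 1] over the dataset
        den += wn
    return np.exp(num / den)
```

With every driver known this is simply the weighted geometric mean of the normalized drivers; with drivers missing, only the known terms contribute.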

  5. Power Function Form
  • It can be shown that the form of the score reduces the regression equation

      y = A + B \cdot S_{power}^{\,C}

    to the desired power function form:

      y = A + Q \cdot \prod_{x_n\ \mathrm{cont.}} x_n^{P_n} \cdot \prod_{x_n\ \mathrm{bin.}} P_n^{x_n}

    with the constants P_n and Q being functions of B, C, the weightings, and the max/min values of the drivers (all constants)
  • This form can be used on sparse or full datasets
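One way to see the reduction is sketched below for a continuous driver (a binary driver works the same way, with the normalized raw value in place of the normalized log, so that the driver ends up in the exponent of a constant). The grouping of the constant factors into Q and P_n is an illustration consistent with the slide, not the presentation's own derivation; W denotes the sum of the weights of the known drivers.

```latex
% Sketch: substitute S_power into y = A + B * S_power^C and collect the constant factors.
\begin{align*}
y &= A + B\,S_{power}^{\,C}
   = A + B\exp\!\left(\frac{C}{W}\sum_{n\,\mathrm{known}} w_n\,\overline{\ln x_n}\right)
   = A + B\prod_{n\,\mathrm{known}}\exp\!\left(\frac{C\,w_n}{W}\,\overline{\ln x_n}\right),
   \qquad W=\sum_{n\,\mathrm{known}} w_n.\\
\intertext{For a continuous driver,
  $\overline{\ln x_n}=\frac{\ln x_n-(\ln x_n)_{min}}{(\ln x_n)_{max}-(\ln x_n)_{min}}$,
  so each factor becomes}
\exp\!\left(\frac{C\,w_n}{W}\,\overline{\ln x_n}\right)
  &= x_n^{P_n}\,x_{n,min}^{-P_n},
  \qquad P_n=\frac{C\,w_n}{W\,\bigl[(\ln x_n)_{max}-(\ln x_n)_{min}\bigr]},\\
\intertext{and collecting every constant factor (including $B$ and the $x_{n,min}^{-P_n}$ terms) into $Q$ gives}
y &= A + Q\prod_{x_n\,\mathrm{cont.}} x_n^{P_n}\prod_{x_n\,\mathrm{bin.}} P_n^{x_n}.
\end{align*}
```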

  6. Dataset for Testing
  • Created a dataset representative of a typical cost estimating problem (a generation sketch follows below)
  • 100 datapoints, 4 independent variable drivers:
    • Driver 1: continuous, lognormally distributed, mean value 500, coefficient of variation 0.65, minimum value 100
    • Driver 2: continuous, lognormally distributed, mean value 15, coefficient of variation 5.0, minimum value 0
    • Driver 3: binary, 33% of data has a value of 1
    • Driver 4: binary, 50% of data has a value of 1
  • Dependent variable values set by the equation

      y = 100 + 20 \cdot x_1^{0.6} \cdot x_2^{0.3} \cdot 3^{x_3} \cdot 1.2^{x_4} \cdot \varepsilon

    (i.e. A = 100, Q = 20, P_1 = 0.6, P_2 = 0.3, P_3 = 3, P_4 = 1.2), with the error term ε lognormally distributed, mean value 1, coefficient of variation 0.4, minimum value 0
  • Underlying behavior of the data is known
  • Regression results can be compared against the expected result
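A minimal NumPy sketch of how a comparable dataset could be generated. The slide does not spell out the lognormal parameterization (in particular how the quoted mean, CV, and minimum value interact), so the shifted-lognormal helper below is one plausible reading, and the seed and helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

def shifted_lognormal(mean, cv, shift=0.0, size=N):
    """Lognormal sample with the quoted mean and coefficient of variation, shifted so its
    minimum approaches `shift` (assumes the CV is quoted for the overall distribution)."""
    m, s = mean - shift, cv * mean               # mean and std dev of the lognormal part
    sigma2 = np.log(1.0 + (s / m) ** 2)          # standard moment matching for a lognormal
    mu = np.log(m) - sigma2 / 2.0
    return shift + rng.lognormal(mu, np.sqrt(sigma2), size)

x1 = shifted_lognormal(500, 0.65, shift=100)     # Driver 1: continuous
x2 = shifted_lognormal(15, 5.0)                  # Driver 2: continuous
x3 = (rng.random(N) < 0.33).astype(float)        # Driver 3: binary, ~33% ones
x4 = (rng.random(N) < 0.50).astype(float)        # Driver 4: binary, ~50% ones
eps = shifted_lognormal(1.0, 0.4)                # multiplicative error term, mean 1

y = 100 + 20 * x1**0.6 * x2**0.3 * 3.0**x3 * 1.2**x4 * eps
```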

  7. Validation – 100% Populated
  • For the test dataset, regression of the form

      y = \left( A + Q \cdot x_1^{P_1} \cdot x_2^{P_2} \cdot P_3^{x_3} \cdot P_4^{x_4} \right) \cdot \varepsilon

    with the objective to minimize the value

      f_{obj} = \sum_{n=1}^{100} \varepsilon_n^2

  • ZMPE:
    • Optimize A, Q, P_1, P_2, P_3, P_4
    • Solution: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14
  • Score-Based (a fitting sketch follows below):
    • Convert the regression equation to

        y = \left( A + B\,e^{\,C \cdot \frac{w_1 \overline{\ln x_1} + w_2 \overline{\ln x_2} + w_3 \overline{\ln x_3} + w_4 \overline{\ln x_4}}{w_1 + w_2 + w_3 + w_4}} \right) \cdot \varepsilon

    • Optimize A, B, C, w_1, w_2, w_3, w_4
    • Solution: A = 0.0, B = 32.1, C = 7.10, w_1 = 17%, w_2 = 66%, w_3 = 15%, w_4 = 2%
    • Re-conversion: A = 0.0, Q = 18.9, P_1 = 0.63, P_2 = 0.34, P_3 = 2.89, P_4 = 1.14
  • Score-Based method reproduces the ZMPE solution on a fully-populated dataset
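A sketch of the score-based fit, under stated assumptions: it reuses the hypothetical power_score helper and the x1..x4, y arrays from the earlier sketches, uses a general-purpose optimizer rather than whatever solver the CAAG tool employs, and omits details the slide does not spell out (for example ZMPE's zero-bias constraint). It is illustrative, not a reproduction of the presented results.

```python
import numpy as np
from scipy.optimize import minimize

X = np.column_stack([x1, x2, x3, x4])     # drivers from the dataset sketch; np.nan = unknown
is_cont = [True, True, False, False]

# Per-driver normalization bounds over the populated data (log space for continuous drivers).
stats = []
for j, cont in enumerate(is_cont):
    col = X[:, j][~np.isnan(X[:, j])]
    col = np.log(col) if cont else col
    stats.append((col.min(), col.max()))

def f_obj(params):
    A, B, C, *w = params
    # power_score() is the per-datapoint score function sketched earlier
    yhat = np.array([A + B * power_score(row, w, stats, is_cont) ** C for row in X])
    eps = y / yhat - 1.0                  # multiplicative (percent) error terms
    return np.sum(eps ** 2)

res = minimize(f_obj, x0=[0.0, 30.0, 5.0, 0.25, 0.25, 0.25, 0.25],
               method="Nelder-Mead",
               bounds=[(None, None)] * 3 + [(1e-6, None)] * 4,   # keep weights positive (SciPy >= 1.7)
               options={"maxiter": 20000, "xatol": 1e-8, "fatol": 1e-8})
A_fit, B_fit, C_fit, *w_fit = res.x
```

The fitted (B, C, w_n) can then be re-converted to power-form constants (Q, P_n) using the relationships sketched after slide 5.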

  8. Score-Based Regression vs. ZMPE
  • Validated that the Score-based method is equivalent to ZMPE on a fully-populated dataset
  • Next step: sparsely-populated test cases
    • Individual drivers sparsely populated: 50 regressions, each with randomly-selected values removed, at every 5% interval of population percent between 100% and 5% populated; the other 3 drivers fully populated
    • All drivers sparsely populated: 200 regressions, each with randomly-selected values removed, at every 5% interval of overall population percent between 100% and 5%
  • Comparison metric: Characteristic Underlying Percent Error (CUPE) (a computation sketch follows below)
    • CUPE is measured across the entire dataset, including values that were removed to simulate sparseness of data
    • Defined as

        CUPE = \sqrt{ \frac{\sum_{n=1}^{100} \varepsilon_{\%,n}^2}{DOF} }

      where ε_{%,n} is the percent error between the actual y and the regression equation's predicted y for the nth datapoint
    • Measures how well the regression (with incomplete data) captures the underlying relationship (if the data were complete)
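The metric and the sparseness simulation can be sketched as below. The helper names cupe and knock_out are illustrative, and the percent-error convention (actual minus predicted, relative to predicted) is an assumption consistent with the multiplicative error term, not a definition from the slides.

```python
import numpy as np

def cupe(y_actual, y_predicted, dof):
    """Characteristic Underlying Percent Error, evaluated over the full (complete) dataset."""
    pct_err = (y_actual - y_predicted) / y_predicted    # percent error for each datapoint
    return np.sqrt(np.sum(pct_err ** 2) / dof)

def knock_out(X, driver, frac_populated, rng):
    """Simulate sparseness: mark a random (1 - frac_populated) share of one driver as unknown."""
    Xs = X.astype(float).copy()
    n_drop = int(round((1.0 - frac_populated) * len(Xs)))
    rows = rng.choice(len(Xs), size=n_drop, replace=False)
    Xs[rows, driver] = np.nan                           # np.nan marks a removed (unknown) value
    return Xs
```

Each trial would refit the model on a knocked-out copy of the drivers and then evaluate cupe against predictions made for the full, complete dataset.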

  9. Score-Based Regression vs. ZMPE: Weaker Continuous Driver Sparse
  • Degrees of Freedom
    [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated, Sparse Data Method vs. ZMPE]
    • ZMPE regression DOF decreases linearly with % populated
    • Score-based regression retains all DOF from the full dataset
  • CUPE of resultant estimating relationship against full dataset
    [Chart: Average CUPE of 50 Trial Regressions vs. Weak Continuous Driver (Driver 1) Percent Populated, Sparse Data Method vs. ZMPE]
    • Models show similar performance down to 30% populated
    • Below 30% populated, the score-based method is better able to model the underlying relationship

  10. Score-Based Regression vs. ZMPE: Stronger Continuous Driver Sparse
  • Degrees of Freedom
    [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated, Sparse Data Method vs. ZMPE]
    • ZMPE regression DOF decreases linearly with % populated
    • Score-based regression retains all DOF from the full dataset
  • CUPE of resultant estimating relationship against full dataset
    [Chart: Average CUPE of 50 Trial Regressions vs. Strong Continuous Driver (Driver 2) Percent Populated, Sparse Data Method vs. ZMPE]
    • ZMPE performs much better above very low population percentages
    • Score-based method only proves better able to capture the underlying relationship once ZMPE DOF becomes very small

  11. Score-Based Regression vs. ZMPE: Binary Driver Sparse
  • Degrees of Freedom
    [Chart: Average Degrees of Freedom of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated, Sparse Data Method vs. ZMPE]
    • ZMPE regression DOF decreases linearly with % populated
    • Score-based regression retains all DOF from the full dataset
  • CUPE of resultant estimating relationship against full dataset
    [Chart: Average CUPE of 50 Trial Regressions vs. Stratifier (Driver 3) Percent Populated, Sparse Data Method vs. ZMPE]
    • ZMPE performs slightly better above 20% populated
    • Score-based method proves better able to capture the underlying relationship once ZMPE DOF becomes small
