data processing gymnastics
play

Data Processing Gymnastics Get your Data into the Format you Need - PowerPoint PPT Presentation

Data Processing Gymnastics Get your Data into the Format you Need David Lawrence Department of Social Security Data Repackaging Tall and Thin to Short and Fat 26 October 2017 The Requirement Random Sample Survey (RSS) Result Data


  1. Data Processing Gymnastics Get your Data into the Format you Need David Lawrence Department of Social Security

  2. Data Repackaging Tall and Thin to Short and Fat 26 October 2017

  3. The Requirement Random Sample Survey (RSS) Result Data • Supplied in several datasets – some with multiple records per respondent • Want to convert it to a single record per respondent The RSS is used by DSS as the primary Assurance Tool – a sample of recipients has their entitlement reassessed and from this the level of Payment Accuracy can be estimated Data Repackaging 3

  4. The Problem – Repeating Groups • Various Facets of the Survey are divided into separate datasets • Where the respondent could have multiple occurrences of a given result (eg payment components or debts) these are stored as individual records in the appropriate dataset • These individual records (naturally) have the same structure • The maximum number of these records per subject is not known Data Repackaging 4

  5. The Old Way • PROC TRANSPOSE the data for each of the variables in the repeating group • Merge the transposed data steps by the classification variables Drawbacks • Repetitive Code • Maximum number of Repeating Groups not known. This must be determined before the transpose begins Data Repackaging 5

  6. Solution – Step 1 Count the Repeating Groups /* Counts the Number of Observations per ID Number */ data CRNCUST( keep = CRN obscnt ); retain obscnt; set DS3Extrct; by crn; if first.crn then obscnt = 1 ; else obscnt + 1 ; if last.crn then output; run ; /* Extracts the Maximum Number of Observations */ proc means data = CRNCUST; var obscnt; output out = DEBTstatscust max = maxobs; run ; /* sets maximum macro */ data _null_; set DEBTstatscust; call symput( 'cmax', trim( left( maxobs ))); run ; Data Repackaging 6

  7. Step 2 – Generalised Macro for Array Creation / * Creates macro for arrays */ %macro matrixgen( arrayname, startel, endel, type ); array &arrayname {&cmax} &type &startel - &endel; retain &startel - &endel; %mend matrixgen ; • Note that number of array elements is set to &cmax – the max number of observations macro variable created in Step 1 above. • Type allows the array to be declared as numeric or character • The required retain statements to keep are included in the macro definition Data Repackaging 7

  8. Step 3a Create and Initialise the Arrays data DS3Repack; set DS3Extrct; by crn; % matrixcdebt ( IdNo, DebtId1, DebtId&max, ); % matrixcdebt ( BenT, BenType1, BenType&cmax, $ ); % matrixcdebt ( Amount, DebtAMT1, DebtAMT&cmax, ); % matrixcdebt ( SYear, SurvYear1, SurvYear&cmax, $ ); % matrixcdebt ( SvPRD, Surv_Period1, Surv_Period&cmax, $ ); retain crncnt; if first.crn then do; crncnt = 1 ; do x = 1 to &cmax; IdNo(x) = . ; Amount(x) = 0 ; BenT(x) = ‘’ SYear(x) = ''; SvPRD(x) = ''; end; end; Data Repackaging 8

  9. Step 3b Populate the Arrays and Output IdNo(crncnt) = DebtId; Amount(crncnt) = DebtAMT; BenT(crncnt) = BenType; SYear(crncnt) = SurvYear; SvPRD(crncnt) = Surv_Period; crncnt + 1 ; DROP DebtId DebtAMT BenType SurvYear Surv_Period crncnt x; if last.crn then do; NoDebts = crncnt – 1; output; end; run ; Data Repackaging 9

  10. Date Processing for Time Series Extraction Minimising errors for multiple date parameters 26 October 2017

  11. The Requirement • Extracting and Processing Data for multiple points in time from a source with dates supplied in several formats • DSS obtains data from the Department of Human Services Enterprise Data Warehouse • The platform is accessed through SAS EG but the raw data is accessed via a SAS EG connection to Teradata Data Repackaging 11

  12. The Problem – Multiple Date Formats Dates are stored on the EDW either as Teradata Dates • Extract Date and Logical Delete Date Or as a number (yyyymmdd) – eg 20 th June 2017 = 20,170,620 Extraction from one EDW Source is via a SAS/SQL Macro that requires SAS Dates as parameters Once data is extracted processing in SAS uses SAS Date format Some tasks require extractions from multiple points in time to build Time Series or comparison tables . Data Repackaging 12

  13. The Old Way • Hard Coding or Inputting all the relevant dates and date formats • Numbering extracts from multiple files and using a Proc Format to assign the appropriate dates Drawbacks • The more input parameters required – the greater the chance of an incomplete update. • The incorrect result may not be easily noticed Data Repackaging 13

  14. Solution – Declare the Minimum Number of Dates Ideally declare one or two dates Do so in Year and Month Components Let SAS create the rest %let Report_Year = 2017; %let Report_Month = 6; Data Repackaging 14

  15. Date Creation Processing Tools Create a format to determine the last day of the month proc format; value mntdfmt 1, 3, 5, 7, 8, 10, 12 = 31 4, 6, 9, 11 = 30 2 = 28; run; Data Repackaging 15

  16. Creating the Date Variables (Basic) Further manipulation of the inputs can create start dates and title text data dateprep; Report_endD = put( &Report_Month, mntdfmt. ); if &Report_Month = 2 and mod( &Report_Year, 4 ) = 0 and mod( &Report_Year, 200 ) ne 0 then Report_endD = 29; ReportEndDt = mdy( &Report_Month, Report_endD, &Report_Year); DateNum = ( &Report_Year * 10000 ) + ( &Report_Month * 100 ) + Report_endD; DateText = put( DateNum, $8. ); TD_Date = “date ‘” ||substr( DateText , 1, 4 )||’ - ’|| substr( DateText, 5, 2 )|| ’ - ’ || substr( DateText , 7, 2 )||”’”; call symput( ‘ SasDate ’ , trim( left( ReportEndDt ))); call symput( ‘ TeradataDate ’ , trim( left( TD_Date ))); call symput( ‘ NumericDate ’ , trim( left( DateNum ))); run ; Data Repackaging 16

  17. Multiple Points in Time for a Time Series Basic Parameters %let srsstart = 2015; %let srsend = 2017; %let frstmnt = 6; %let lstmnt = 8; Data Repackaging 17

  18. Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; Data Repackaging 18

  19. Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %do y = &loopstart %to &loopend; %if &x = &srsstart %then %do; data dateprep&x&y; %let loopstart = &frstmnt; fildy = input( put( &y, fildfmt. ), 2. ); %end; if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; %else %do; fildtd = mdy( &y, fildy, &x ); %let loopstart = 1; fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; %end; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; %if &x = &srsend %then %do; if &y <= 6 then do; %let loopend = &lstmnt; year1 = &x; %end; year2 = &x- 1 ; %else %do; end; %let loopend = 12; else do; %end; year1 = &x + 1 ; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); run; Data Repackaging 19

  20. Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %do y = &loopstart %to &loopend; %if &x = &srsstart %then %do; data dateprep&x&y; %let loopstart = &frstmnt; fildy = input( put( &y, fildfmt. ), 2. ); %end; if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; %else %do; fildtd = mdy( &y, fildy, &x ); %let loopstart = 1; fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; %end; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; %if &x = &srsend %then %do; if &y <= 6 then do; %let loopend = &lstmnt; year1 = &x; %end; year2 = &x- 1 ; %else %do; end; %let loopend = 12; else do; %end; year1 = &x + 1 ; proc print data=dateprep&x&y; run; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); %end; run; %end; %mend ; Data Repackaging 20

  21. Multiple Points in Time for a Time Series %macro testdtc ; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; %do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1 , 4 )||'-'||substr( fldttxt, 5 , 2 )||'-'||substr( fldttxt, 7 , 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x- 1 ; end; else do; year1 = &x + 1 ; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3 , 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend ; Data Repackaging 21

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend