Data Processing Gymnastics Get your Data into the Format you Need - - PowerPoint PPT Presentation
Data Processing Gymnastics Get your Data into the Format you Need - - PowerPoint PPT Presentation
Data Processing Gymnastics Get your Data into the Format you Need David Lawrence Department of Social Security Data Repackaging Tall and Thin to Short and Fat 26 October 2017 The Requirement Random Sample Survey (RSS) Result Data
Data Repackaging
Tall and Thin to Short and Fat
26 October 2017
The Requirement
Data Repackaging 3
Random Sample Survey (RSS) Result Data
- Supplied in several datasets – some with multiple records per
respondent
- Want to convert it to a single record per respondent
The RSS is used by DSS as the primary Assurance Tool – a sample of recipients has their entitlement reassessed and from this the level of Payment Accuracy can be estimated
The Problem – Repeating Groups
Data Repackaging 4
- Various Facets of the Survey are divided into separate datasets
- Where the respondent could have multiple occurrences of a given result (eg
payment components or debts) these are stored as individual records in the appropriate dataset
- These individual records (naturally) have the same structure
- The maximum number of these records per subject is not known
The Old Way
Data Repackaging 5
- PROC TRANSPOSE the data for each of the variables in the repeating group
- Merge the transposed data steps by the classification variables
Drawbacks
- Repetitive Code
- Maximum number of Repeating Groups not known. This must be determined
before the transpose begins
Solution – Step 1 Count the Repeating Groups
Data Repackaging 6
/* Counts the Number of Observations per ID Number */ data CRNCUST( keep = CRN obscnt ); retain obscnt; set DS3Extrct; by crn; if first.crn then obscnt = 1; else obscnt + 1; if last.crn then output; run; /* Extracts the Maximum Number of Observations */ proc means data = CRNCUST; var obscnt;
- utput out = DEBTstatscust max = maxobs;
run; /* sets maximum macro */ data _null_; set DEBTstatscust; call symput( 'cmax', trim( left( maxobs ))); run;
Step 2 – Generalised Macro for Array Creation
Data Repackaging 7
- Note that number of array elements is set to &cmax – the max number of
- bservations macro variable created in Step 1 above.
- Type allows the array to be declared as numeric or character
- The required retain statements to keep are included in the macro definition
/* Creates macro for arrays */
%macro matrixgen( arrayname, startel, endel, type ); array &arrayname {&cmax} &type &startel - &endel; retain &startel - &endel; %mend matrixgen;
Step 3a Create and Initialise the Arrays
Data Repackaging 8
data DS3Repack; set DS3Extrct; by crn; %matrixcdebt( IdNo, DebtId1, DebtId&max, ); %matrixcdebt( BenT, BenType1, BenType&cmax, $ ); %matrixcdebt( Amount, DebtAMT1, DebtAMT&cmax, ); %matrixcdebt( SYear, SurvYear1, SurvYear&cmax, $ ); %matrixcdebt( SvPRD, Surv_Period1, Surv_Period&cmax, $ ); retain crncnt; if first.crn then do; crncnt = 1; do x = 1 to &cmax; IdNo(x) = .; Amount(x) = 0; BenT(x) = ‘’ SYear(x) = ''; SvPRD(x) = ''; end; end;
Step 3b Populate the Arrays and Output
Data Repackaging 9
IdNo(crncnt) = DebtId; Amount(crncnt) = DebtAMT; BenT(crncnt) = BenType; SYear(crncnt) = SurvYear; SvPRD(crncnt) = Surv_Period; crncnt + 1; DROP DebtId DebtAMT BenType SurvYear Surv_Period crncnt x; if last.crn then do; NoDebts = crncnt – 1;
- utput;
end; run;
Date Processing for Time Series Extraction
Minimising errors for multiple date parameters
26 October 2017
The Requirement
Data Repackaging 11
- Extracting and Processing Data for multiple points in time
from a source with dates supplied in several formats
- DSS obtains data from the Department of Human Services
Enterprise Data Warehouse
- The platform is accessed through SAS EG but the raw data is
accessed via a SAS EG connection to Teradata
The Problem – Multiple Date Formats
Data Repackaging 12
Dates are stored on the EDW either as Teradata Dates
- Extract Date and Logical Delete Date
Or as a number (yyyymmdd) – eg 20th June 2017 = 20,170,620 Extraction from one EDW Source is via a SAS/SQL Macro that requires SAS Dates as parameters Once data is extracted processing in SAS uses SAS Date format Some tasks require extractions from multiple points in time to build Time Series or comparison tables.
The Old Way
Data Repackaging 13
- Hard Coding or Inputting all the relevant dates and date formats
- Numbering extracts from multiple files and using a Proc Format to assign
the appropriate dates
Drawbacks
- The more input parameters required – the greater the chance of an
incomplete update.
- The incorrect result may not be easily noticed
Solution – Declare the Minimum Number of Dates
Data Repackaging 14
Ideally declare one or two dates Do so in Year and Month Components Let SAS create the rest
%let Report_Year = 2017; %let Report_Month = 6;
Date Creation Processing Tools
Data Repackaging 15
Create a format to determine the last day of the month proc format; value mntdfmt 1, 3, 5, 7, 8, 10, 12 = 31 4, 6, 9, 11 = 30 2 = 28; run;
Creating the Date Variables (Basic)
Data Repackaging 16
data dateprep; Report_endD = put( &Report_Month, mntdfmt. ); if &Report_Month = 2 and mod( &Report_Year, 4 ) = 0 and mod( &Report_Year, 200 ) ne 0 then Report_endD = 29; ReportEndDt = mdy( &Report_Month, Report_endD, &Report_Year); DateNum = ( &Report_Year * 10000 ) + ( &Report_Month * 100 ) + Report_endD; DateText = put( DateNum, $8. ); TD_Date = “date ‘”||substr( DateText, 1, 4 )||’-’||substr( DateText, 5, 2 )||’-’|| substr( DateText, 7, 2 )||”’”; call symput( ‘SasDate’, trim( left( ReportEndDt ))); call symput( ‘TeradataDate’, trim( left( TD_Date ))); call symput( ‘NumericDate’, trim( left( DateNum ))); run;
Further manipulation of the inputs can create start dates and title text
Multiple Points in Time for a Time Series
Basic Parameters
Data Repackaging 17
%let srsstart = 2015; %let srsend = 2017; %let frstmnt = 6; %let lstmnt = 8;
Multiple Points in Time for a Time Series
Data Repackaging 18
%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;
Multiple Points in Time for a Time Series
Data Repackaging 19
%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;
%do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run;
Multiple Points in Time for a Time Series
Data Repackaging 20
%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;
%do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend;
Multiple Points in Time for a Time Series
Data Repackaging 21
%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; %do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend;
Thank you
David Lawrence Senior Statistical Programmer Payability and Integrity Payment Integrity Branch Department of Social Services
Data Repackaging 22