Data Processing Gymnastics Get your Data into the Format you Need - - PowerPoint PPT Presentation

data processing gymnastics
SMART_READER_LITE
LIVE PREVIEW

Data Processing Gymnastics Get your Data into the Format you Need - - PowerPoint PPT Presentation

Data Processing Gymnastics Get your Data into the Format you Need David Lawrence Department of Social Security Data Repackaging Tall and Thin to Short and Fat 26 October 2017 The Requirement Random Sample Survey (RSS) Result Data


slide-1
SLIDE 1

David Lawrence Department of Social Security

Data Processing Gymnastics

Get your Data into the Format you Need

slide-2
SLIDE 2

Data Repackaging

Tall and Thin to Short and Fat

26 October 2017

slide-3
SLIDE 3

The Requirement

Data Repackaging 3

Random Sample Survey (RSS) Result Data

  • Supplied in several datasets – some with multiple records per

respondent

  • Want to convert it to a single record per respondent

The RSS is used by DSS as the primary Assurance Tool – a sample of recipients has their entitlement reassessed and from this the level of Payment Accuracy can be estimated

slide-4
SLIDE 4

The Problem – Repeating Groups

Data Repackaging 4

  • Various Facets of the Survey are divided into separate datasets
  • Where the respondent could have multiple occurrences of a given result (eg

payment components or debts) these are stored as individual records in the appropriate dataset

  • These individual records (naturally) have the same structure
  • The maximum number of these records per subject is not known
slide-5
SLIDE 5

The Old Way

Data Repackaging 5

  • PROC TRANSPOSE the data for each of the variables in the repeating group
  • Merge the transposed data steps by the classification variables

Drawbacks

  • Repetitive Code
  • Maximum number of Repeating Groups not known. This must be determined

before the transpose begins

slide-6
SLIDE 6

Solution – Step 1 Count the Repeating Groups

Data Repackaging 6

/* Counts the Number of Observations per ID Number */ data CRNCUST( keep = CRN obscnt ); retain obscnt; set DS3Extrct; by crn; if first.crn then obscnt = 1; else obscnt + 1; if last.crn then output; run; /* Extracts the Maximum Number of Observations */ proc means data = CRNCUST; var obscnt;

  • utput out = DEBTstatscust max = maxobs;

run; /* sets maximum macro */ data _null_; set DEBTstatscust; call symput( 'cmax', trim( left( maxobs ))); run;

slide-7
SLIDE 7

Step 2 – Generalised Macro for Array Creation

Data Repackaging 7

  • Note that number of array elements is set to &cmax – the max number of
  • bservations macro variable created in Step 1 above.
  • Type allows the array to be declared as numeric or character
  • The required retain statements to keep are included in the macro definition

/* Creates macro for arrays */

%macro matrixgen( arrayname, startel, endel, type ); array &arrayname {&cmax} &type &startel - &endel; retain &startel - &endel; %mend matrixgen;

slide-8
SLIDE 8

Step 3a Create and Initialise the Arrays

Data Repackaging 8

data DS3Repack; set DS3Extrct; by crn; %matrixcdebt( IdNo, DebtId1, DebtId&max, ); %matrixcdebt( BenT, BenType1, BenType&cmax, $ ); %matrixcdebt( Amount, DebtAMT1, DebtAMT&cmax, ); %matrixcdebt( SYear, SurvYear1, SurvYear&cmax, $ ); %matrixcdebt( SvPRD, Surv_Period1, Surv_Period&cmax, $ ); retain crncnt; if first.crn then do; crncnt = 1; do x = 1 to &cmax; IdNo(x) = .; Amount(x) = 0; BenT(x) = ‘’ SYear(x) = ''; SvPRD(x) = ''; end; end;

slide-9
SLIDE 9

Step 3b Populate the Arrays and Output

Data Repackaging 9

IdNo(crncnt) = DebtId; Amount(crncnt) = DebtAMT; BenT(crncnt) = BenType; SYear(crncnt) = SurvYear; SvPRD(crncnt) = Surv_Period; crncnt + 1; DROP DebtId DebtAMT BenType SurvYear Surv_Period crncnt x; if last.crn then do; NoDebts = crncnt – 1;

  • utput;

end; run;

slide-10
SLIDE 10

Date Processing for Time Series Extraction

Minimising errors for multiple date parameters

26 October 2017

slide-11
SLIDE 11

The Requirement

Data Repackaging 11

  • Extracting and Processing Data for multiple points in time

from a source with dates supplied in several formats

  • DSS obtains data from the Department of Human Services

Enterprise Data Warehouse

  • The platform is accessed through SAS EG but the raw data is

accessed via a SAS EG connection to Teradata

slide-12
SLIDE 12

The Problem – Multiple Date Formats

Data Repackaging 12

Dates are stored on the EDW either as Teradata Dates

  • Extract Date and Logical Delete Date

Or as a number (yyyymmdd) – eg 20th June 2017 = 20,170,620 Extraction from one EDW Source is via a SAS/SQL Macro that requires SAS Dates as parameters Once data is extracted processing in SAS uses SAS Date format Some tasks require extractions from multiple points in time to build Time Series or comparison tables.

slide-13
SLIDE 13

The Old Way

Data Repackaging 13

  • Hard Coding or Inputting all the relevant dates and date formats
  • Numbering extracts from multiple files and using a Proc Format to assign

the appropriate dates

Drawbacks

  • The more input parameters required – the greater the chance of an

incomplete update.

  • The incorrect result may not be easily noticed
slide-14
SLIDE 14

Solution – Declare the Minimum Number of Dates

Data Repackaging 14

Ideally declare one or two dates Do so in Year and Month Components Let SAS create the rest

%let Report_Year = 2017; %let Report_Month = 6;

slide-15
SLIDE 15

Date Creation Processing Tools

Data Repackaging 15

Create a format to determine the last day of the month proc format; value mntdfmt 1, 3, 5, 7, 8, 10, 12 = 31 4, 6, 9, 11 = 30 2 = 28; run;

slide-16
SLIDE 16

Creating the Date Variables (Basic)

Data Repackaging 16

data dateprep; Report_endD = put( &Report_Month, mntdfmt. ); if &Report_Month = 2 and mod( &Report_Year, 4 ) = 0 and mod( &Report_Year, 200 ) ne 0 then Report_endD = 29; ReportEndDt = mdy( &Report_Month, Report_endD, &Report_Year); DateNum = ( &Report_Year * 10000 ) + ( &Report_Month * 100 ) + Report_endD; DateText = put( DateNum, $8. ); TD_Date = “date ‘”||substr( DateText, 1, 4 )||’-’||substr( DateText, 5, 2 )||’-’|| substr( DateText, 7, 2 )||”’”; call symput( ‘SasDate’, trim( left( ReportEndDt ))); call symput( ‘TeradataDate’, trim( left( TD_Date ))); call symput( ‘NumericDate’, trim( left( DateNum ))); run;

Further manipulation of the inputs can create start dates and title text

slide-17
SLIDE 17

Multiple Points in Time for a Time Series

Basic Parameters

Data Repackaging 17

%let srsstart = 2015; %let srsend = 2017; %let frstmnt = 6; %let lstmnt = 8;

slide-18
SLIDE 18

Multiple Points in Time for a Time Series

Data Repackaging 18

%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;

slide-19
SLIDE 19

Multiple Points in Time for a Time Series

Data Repackaging 19

%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;

%do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run;

slide-20
SLIDE 20

Multiple Points in Time for a Time Series

Data Repackaging 20

%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end;

%do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend;

slide-21
SLIDE 21

Multiple Points in Time for a Time Series

Data Repackaging 21

%macro testdtc; %do x = &srsstart %to &srsend; %if &x = &srsstart %then %do; %let loopstart = &frstmnt; %end; %else %do; %let loopstart = 1; %end; %if &x = &srsend %then %do; %let loopend = &lstmnt; %end; %else %do; %let loopend = 12; %end; %do y = &loopstart %to &loopend; data dateprep&x&y; fildy = input( put( &y, fildfmt. ), 2. ); if &y = 2 and mod( &x, 4 ) = 0 and mod( &x, 200 ) ne 0 then fildy = '29'; fildtd = mdy( &y, fildy, &x ); fildtn = ( &x * 10000 ) + ( &y * 100 ) + fildy; fldttxt = put( fildtn, $8. ); tdatadttxt = "date '"||substr( fldttxt, 1, 4 )||'-'||substr( fldttxt, 5, 2 )||'-'||substr( fldttxt, 7, 2 )||"'"; if &y <= 6 then do; year1 = &x; year2 = &x-1; end; else do; year1 = &x + 1; year2 = &x; end; curFYr = trim( left( put( year2, $4. )))||'/'||trim( left( substr( put( year1, $4. ), 3, 2 ))); run; proc print data=dateprep&x&y; run; %end; %end; %mend;

slide-22
SLIDE 22

Thank you

David Lawrence Senior Statistical Programmer Payability and Integrity Payment Integrity Branch Department of Social Services

Data Repackaging 22