Getting the Most from Hash Objects Bharath Gowda Getting the most - - PowerPoint PPT Presentation
Getting the Most from Hash Objects Bharath Gowda Getting the most - - PowerPoint PPT Presentation
Getting the Most from Hash Objects Bharath Gowda Getting the most from Hash objects Techniques covered are: SQL join Data step merge using BASE engine Data step merge using SPDE merge Index Key lookup Hash object lookup
Getting the most from Hash objects
Techniques covered are:
- SQL join
- Data step merge using BASE engine
- Data step merge using SPDE merge
- Index Key lookup
- Hash object lookup
- Simple join, iteration and sorting with hash
Test setup
- Only using local SAS datasets.
- Datasets (including SPDE) reside in the local work library
directory.
- Storing data locally nullifies the network delays and I/O
delays.
- The environment has a super fast solid state hard disk which
decreases I/O delay.
- Same join logic is implemented over all the techniques.
PROC SQL
- Requires no sorting prior to joins
- Multi way joins can be performed.
- SQL uses internal utility tables for joins
Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:20:28.79 0:14:15.31 Memory 254512.00k
Proc sql; create table acute_pred as select a.*,b.nwau_sas from pred_cost as a inner join speciality as b
- n a.yearid = b.yearid and
a.facility_identifier= b.facility_identifier ; quit;
Data step merge
- It requires a sort first by the key
variables in all the input datasets.
Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time (avg combined sorting and merging) 0:30:28.79 0:21:15.31 Memory(combined avg) 162393.59k
Proc sort data=pred_cost out= srt_pred_cost; by yearid facility_identifier; Run; Proc sort speciality out= srt_spec (keep= yearid facility_identifier nwau_sas); by yearid facility_identifier; Run; Data acute_pred; Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; If a and b; Run;
SPDE merge
- SPDE stands for Scalable Performance data
engine
- SPDE combines software and hardware
capabilities.
Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (simple join) 12,001,616 real time user cpu time 0:16:15.31 0:12:28.79 Memory 1261040.70k
libname workspde spde "%sysfunc(pathname(work))" temp=yes; proc copy in=work out= workspde; select pred_cost specialty; run; data workspde.acute_pred; merge workspde.Hiepred_acutecost_1718(in=a) workspde.nwau16 (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; If a and b; run;
Index key lookup
- Main dataset is not sorted.
- Multiple set statements with key= option .
- _IORC_ is one of the key automatic variable
which needs attention.
Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:12:15.31 0:06:28.79 Memory 142279.71k
Proc datasets lib=work nolist; Modify speciality; Index create myindex=(yearid facility_identifier); run; Data acute_pred; set pred_cost; Set speciality key=myindex ; If _IORC_ =0 ; run;
HASH Objects
- One of the fastest approaches
for the look up activities.
- In memory computation makes
this technique the fastest.
- Dataset size is a key factor for
memory consumption.
- Pre sorting is not required.
- Hash object needs to be defined
and instantiated.
- Lookups are performed with the
find() function.
Data acute_pred; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset: ‘speciality’,duplicate:’e’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(‘nwau_sas’); hn16.definedone(); end; set pred_cost; Rc=hn16.find(key :yearid, key:facility_identifier); If rc=0; Run;
HASH Objects
- Simple inner joins and left joins can
be achieved by keeping a check on the return code variable.
Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:08:28.79 0:03:15.31 Memory 1465437.28k
if rc=0;
- Hash iterator objects need to be defined and instantiated.
- Hash objects are assigned to hash iterator objects.
Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset:’speciality’, ordered:’descending’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(all:’Y’); hn16.definedone(); Declare hiter iter_ex(‘hn16’); End; /*hn16.output(dataset:’sort_dec’)*/
HASH Iterators
HASH Iterators
- First() and next() functions to
iterate top to bottom and last() and prev() functions to iterate bottom to top
- Can be used to get the top and
bottom records.
Declare hiter iter_ex(‘hn16’); End; Iter_ex.First(); do I = 1 to 10;
- utput top;
Iter_ex.Next(); end; Iter_ex.last(); do I = 1 to 10;
- utput bottom;
Iter_ex.prev(); end; Run;
Avoid or use?
- Where should you apply these techniques?