Getting the Most from Hash Objects Bharath Gowda Getting the most - - PowerPoint PPT Presentation

▶

Sep 11, 2022 471 likes •620 views

Getting the Most from Hash Objects Bharath Gowda Getting the most from Hash objects Techniques covered are: SQL join Data step merge using BASE engine Data step merge using SPDE merge Index Key lookup Hash object lookup

SLIDE 1

Getting the Most from Hash Objects Bharath Gowda

SLIDE 2

Getting the most from Hash objects

Techniques covered are:

SQL join
Data step merge using BASE engine
Data step merge using SPDE merge
Index Key lookup
Hash object lookup
Simple join, iteration and sorting with hash

SLIDE 3

Test setup

Only using local SAS datasets.
Datasets (including SPDE) reside in the local work library

directory.

Storing data locally nullifies the network delays and I/O

delays.

The environment has a super fast solid state hard disk which

decreases I/O delay.

Same join logic is implemented over all the techniques.

SLIDE 4

PROC SQL

Requires no sorting prior to joins
Multi way joins can be performed.
SQL uses internal utility tables for joins

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:20:28.79 0:14:15.31 Memory 254512.00k

Proc sql; create table acute_pred as select a.*,b.nwau_sas from pred_cost as a inner join speciality as b

n a.yearid = b.yearid and

a.facility_identifier= b.facility_identifier ; quit;

SLIDE 5

Data step merge

It requires a sort first by the key

variables in all the input datasets.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time (avg combined sorting and merging) 0:30:28.79 0:21:15.31 Memory(combined avg) 162393.59k

Proc sort data=pred_cost out= srt_pred_cost; by yearid facility_identifier; Run; Proc sort speciality out= srt_spec (keep= yearid facility_identifier nwau_sas); by yearid facility_identifier; Run; Data acute_pred; Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; If a and b; Run;

SLIDE 6

SPDE merge

SPDE stands for Scalable Performance data

engine

SPDE combines software and hardware

capabilities.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (simple join) 12,001,616 real time user cpu time 0:16:15.31 0:12:28.79 Memory 1261040.70k

libname workspde spde "%sysfunc(pathname(work))" temp=yes; proc copy in=work out= workspde; select pred_cost specialty; run; data workspde.acute_pred; merge workspde.Hiepred_acutecost_1718(in=a) workspde.nwau16 (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; If a and b; run;

SLIDE 7

Index key lookup

Main dataset is not sorted.
Multiple set statements with key= option .
_IORC_ is one of the key automatic variable

which needs attention.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:12:15.31 0:06:28.79 Memory 142279.71k

Proc datasets lib=work nolist; Modify speciality; Index create myindex=(yearid facility_identifier); run; Data acute_pred; set pred_cost; Set speciality key=myindex ; If _IORC_ =0 ; run;

SLIDE 8

HASH Objects

One of the fastest approaches

for the look up activities.

In memory computation makes

this technique the fastest.

Dataset size is a key factor for

memory consumption.

Pre sorting is not required.
Hash object needs to be defined

and instantiated.

Lookups are performed with the

find() function.

Data acute_pred; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset: ‘speciality’,duplicate:’e’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(‘nwau_sas’); hn16.definedone(); end; set pred_cost; Rc=hn16.find(key :yearid, key:facility_identifier); If rc=0; Run;

SLIDE 9

HASH Objects

Simple inner joins and left joins can

be achieved by keeping a check on the return code variable.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:08:28.79 0:03:15.31 Memory 1465437.28k

if rc=0;

SLIDE 10

Hash iterator objects need to be defined and instantiated.
Hash objects are assigned to hash iterator objects.

Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset:’speciality’, ordered:’descending’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(all:’Y’); hn16.definedone(); Declare hiter iter_ex(‘hn16’); End; /hn16.output(dataset:’sort_dec’)/

HASH Iterators

SLIDE 11

HASH Iterators

First() and next() functions to

iterate top to bottom and last() and prev() functions to iterate bottom to top

Can be used to get the top and

bottom records.

Declare hiter iter_ex(‘hn16’); End; Iter_ex.First(); do I = 1 to 10;

utput top;

Iter_ex.Next(); end; Iter_ex.last(); do I = 1 to 10;

utput bottom;

Iter_ex.prev(); end; Run;

SLIDE 12

Avoid or use?

Where should you apply these techniques?

Type Proc SQL Datastep Merge Index lookup (key=) SPDE Merge Hash lookup Hash Iterators Small to medium sized tables (< 10,000 rows)

✓ ✓    

Huge table (> 10 million rows)

✓  ✓ ✓ ✓ ✓

Unsorted data(> 10 million rows)

✓  ✓ ✓ ✓ ✓

Datasets with multiple indexes

✓  ✓   

Less memory hungry techniques

 ✓ ✓   

SLIDE 13

Getting the most from Hash objects

Bharath Gowda, SAS analyst Independent Contractor 0468304568 | bharathg1307@gmail.com

SLIDE 14