Getting the Most from Hash Objects Bharath Gowda Getting the most - - PowerPoint PPT Presentation

getting the most from hash objects
SMART_READER_LITE
LIVE PREVIEW

Getting the Most from Hash Objects Bharath Gowda Getting the most - - PowerPoint PPT Presentation

Getting the Most from Hash Objects Bharath Gowda Getting the most from Hash objects Techniques covered are: SQL join Data step merge using BASE engine Data step merge using SPDE merge Index Key lookup Hash object lookup


slide-1
SLIDE 1

Getting the Most from Hash Objects Bharath Gowda

slide-2
SLIDE 2

Getting the most from Hash objects

Techniques covered are:

  • SQL join
  • Data step merge using BASE engine
  • Data step merge using SPDE merge
  • Index Key lookup
  • Hash object lookup
  • Simple join, iteration and sorting with hash
slide-3
SLIDE 3

Test setup

  • Only using local SAS datasets.
  • Datasets (including SPDE) reside in the local work library

directory.

  • Storing data locally nullifies the network delays and I/O

delays.

  • The environment has a super fast solid state hard disk which

decreases I/O delay.

  • Same join logic is implemented over all the techniques.
slide-4
SLIDE 4

PROC SQL

  • Requires no sorting prior to joins
  • Multi way joins can be performed.
  • SQL uses internal utility tables for joins

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:20:28.79 0:14:15.31 Memory 254512.00k

Proc sql; create table acute_pred as select a.*,b.nwau_sas from pred_cost as a inner join speciality as b

  • n a.yearid = b.yearid and

a.facility_identifier= b.facility_identifier ; quit;

slide-5
SLIDE 5

Data step merge

  • It requires a sort first by the key

variables in all the input datasets.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time (avg combined sorting and merging) 0:30:28.79 0:21:15.31 Memory(combined avg) 162393.59k

Proc sort data=pred_cost out= srt_pred_cost; by yearid facility_identifier; Run; Proc sort speciality out= srt_spec (keep= yearid facility_identifier nwau_sas); by yearid facility_identifier; Run; Data acute_pred; Merge srt_pred_cost(in=a) srt_spec(in=b); by yearid facility_identifier; If a and b; Run;

slide-6
SLIDE 6

SPDE merge

  • SPDE stands for Scalable Performance data

engine

  • SPDE combines software and hardware

capabilities.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (simple join) 12,001,616 real time user cpu time 0:16:15.31 0:12:28.79 Memory 1261040.70k

libname workspde spde "%sysfunc(pathname(work))" temp=yes; proc copy in=work out= workspde; select pred_cost specialty; run; data workspde.acute_pred; merge workspde.Hiepred_acutecost_1718(in=a) workspde.nwau16 (in=b keep= nwau_sas yearid facility_identifier); by yearid facility_identifier; If a and b; run;

slide-7
SLIDE 7

Index key lookup

  • Main dataset is not sorted.
  • Multiple set statements with key= option .
  • _IORC_ is one of the key automatic variable

which needs attention.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:12:15.31 0:06:28.79 Memory 142279.71k

Proc datasets lib=work nolist; Modify speciality; Index create myindex=(yearid facility_identifier); run; Data acute_pred; set pred_cost; Set speciality key=myindex ; If _IORC_ =0 ; run;

slide-8
SLIDE 8

HASH Objects

  • One of the fastest approaches

for the look up activities.

  • In memory computation makes

this technique the fastest.

  • Dataset size is a key factor for

memory consumption.

  • Pre sorting is not required.
  • Hash object needs to be defined

and instantiated.

  • Lookups are performed with the

find() function.

Data acute_pred; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset: ‘speciality’,duplicate:’e’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(‘nwau_sas’); hn16.definedone(); end; set pred_cost; Rc=hn16.find(key :yearid, key:facility_identifier); If rc=0; Run;

slide-9
SLIDE 9

HASH Objects

  • Simple inner joins and left joins can

be achieved by keeping a check on the return code variable.

Rows in pred_cost 13,160,905 Rows in speciality 16,170,805 Rows in acute_pred (inner join) 12,001,616 real time user cpu time 0:08:28.79 0:03:15.31 Memory 1465437.28k

if rc=0;

slide-10
SLIDE 10
  • Hash iterator objects need to be defined and instantiated.
  • Hash objects are assigned to hash iterator objects.

Data top bottom; length nwau_sas 8.; If _n_=1 then do; declare hash hn16(dataset:’speciality’, ordered:’descending’); hn16.definekey(‘yearid’,’facility_identifier’); hn16.definedata(all:’Y’); hn16.definedone(); Declare hiter iter_ex(‘hn16’); End; /*hn16.output(dataset:’sort_dec’)*/

HASH Iterators

slide-11
SLIDE 11

HASH Iterators

  • First() and next() functions to

iterate top to bottom and last() and prev() functions to iterate bottom to top

  • Can be used to get the top and

bottom records.

Declare hiter iter_ex(‘hn16’); End; Iter_ex.First(); do I = 1 to 10;

  • utput top;

Iter_ex.Next(); end; Iter_ex.last(); do I = 1 to 10;

  • utput bottom;

Iter_ex.prev(); end; Run;

slide-12
SLIDE 12

Avoid or use?

  • Where should you apply these techniques?

Type Proc SQL Datastep Merge Index lookup (key=) SPDE Merge Hash lookup Hash Iterators Small to medium sized tables (< 10,000 rows)

✓ ✓    

Huge table (> 10 million rows)

✓  ✓ ✓ ✓ ✓

Unsorted data(> 10 million rows)

✓  ✓ ✓ ✓ ✓

Datasets with multiple indexes

✓  ✓   

Less memory hungry techniques

 ✓ ✓   

slide-13
SLIDE 13

Getting the most from Hash objects

Bharath Gowda, SAS analyst Independent Contractor 0468304568 | bharathg1307@gmail.com

slide-14
SLIDE 14