Working with Hash Tables Daniel Petrolito (ANZ Bank) Working With - - PowerPoint PPT Presentation
Working with Hash Tables Daniel Petrolito (ANZ Bank) Working With - - PowerPoint PPT Presentation
Working with Hash Tables Daniel Petrolito (ANZ Bank) Working With Hash Tables Daniel SAS 3+ years Programming 6 years ANZ Data Scientist Combining Data Combining tables Base table Lookup table(s) Want
Working With Hash Tables
- Daniel
- ANZ – Data Scientist
- SAS 3+ years
- Programming 6 years
Combining Data
- Combining tables
- Base table
- Lookup table(s)
- Want flexibility without too much complexity
- Options
- Data Step
✓ Flexible ❖ Must pre-sort data
- SQL
✓ In-database ❖ Cartesian joins
- Others
✓ Macros ✓ Arrays ✓ User-defined formats ✓ Hash tables
Combining Data – techniques
You may have tried the following:
- Index the fields you are joining on
- Reduce number of rows and columns as early in your process as
possible
- Do joins where the data originates
- SAS dataset compression (compress=yes)
- Column lengths
- Hash Tables
Hash Tables - Introduction
- Optimized for speed through
complex hashing algorithms
- Think of it as a SAS indexed
data table loaded in memory
Benefits & Consequences of Hash Tables
Pros
- In-memory, faster lookups than
any other technique
- Can join multiple tables, using
different keys, in a single data step
- No sorting required
Cons
- Object-oriented syntax - not
intuitive to code
- Large memory requirement
- Data step only
Simple Example
People (Base Table) name age John 41 Sally 33 Rebecca 22 Reference (Lookup Table) Name age state John 50VIC Sally 35NSW
Hashtest (Results) name age state rc John 50VIC Sally 35NSW Rebecca 22 160038
Once only, define & populate the hash table For each record in the base table, look up the hash table
Lookup failed – program accordingly
Hash Table Methods
- Find()
- Add()
- Replace()
- Iterate()
- Output()
When Not To Use Hash Tables
- Speed is not required
- Team not technical savvy
- Joins are more optimized to be done from external
databases/sources
- Not used to replace every left join
- Look up table is too big (memory constraint)
Real World Example
- Build a fast SAS solution working with millions
- f rows of data that would join on multiple
tables and calculate complicated metrics.
- Run thousands of times (Monte Carlo
simulation) Issues
- First SQL solution was too slow
- Dynamic & complex joins
Real World Example – Solution with Hash Tables
key range value X123 60 0.1 X123 70 0.2 X123 80 0.3
- Hash tables flexible to only fetch data when
needed.
- Dynamic look up keys:
- hh.find(key:some_variable)
- Ranged look ups:
- hh.find_next()
Further Reading
- http://support.sas.com/resources/papers/proceedings17/0821-2017.pdf
- http://www.lexjansen.com/nesug/nesug07/bb/bb16.pdf
- https://support.sas.com/resources/papers/sgf2008/better-hashing-