Accessing Data in the Cloud: Using SAS to read data from Amazon Simple Storage Service (S3)



slide-1
SLIDE 1

Accessing Data in the Cloud

Using SAS to read data from Amazon Simple Storage Service (S3)

seleritysas.com

slide-2
SLIDE 2

What is Amazon Simple Storage Service (S3)?

  • An object store, not a file system
  • Write once, read many (WORM)
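  • Eventually consistent when this deck was written (S3 has offered strong read-after-write consistency since December 2020)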
  • 99.999999999% durability
  • Unlimited storage capacity
  • Highly scalable and available data storage
  • Low latency and high throughput performance
slide-3
SLIDE 3

What Public Data is Available in S3?

  • AWS Public Datasets
      • https://aws.amazon.com/public-datasets/
      • Geospatial and Environmental Datasets
      • Genomics and Life Science Datasets
      • Datasets for Machine Learning
      • Regulatory and Statistical Data
  • awesome-public-datasets
      • https://github.com/caesar0301/awesome-public-datasets
  • NYC Taxi and Limousine Commission
      • http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

slide-4
SLIDE 4

What is the typical workflow to use raw data from S3?

  • Download the data file from S3 to your PC over HTTP/HTTPS
  • Upload/Import the data to SAS
slide-5
SLIDE 5

What would make this more efficient?

  • Cutting out the middleman (your local PC)
slide-6
SLIDE 6

How can we have S3 communicate directly with the SAS Server?

  • Use the FILENAME URL access method

      ✓ Easy to implement
      ✗ File is retrieved using the HTTP protocol (serially)
      ✗ The slowest of all options, subject to timeouts for very large files

  • Use PROC S3 to download files to the SAS Server’s filesystem

      ✓ Very fast, as it uses parallel downloads
      ✗ Only available from SAS 9.4M4 onward
      ✗ Only works with secure S3 files, not public S3 files

slide-7
SLIDE 7

How can we have S3 communicate directly with the SAS Server?

  • Use the AWS CLI to download files to the SAS Server’s filesystem

      ✓ Very fast, as it uses parallel downloads
      ✗ Need to install the AWS CLI on the SAS Server
      ✗ Need the ability to run X commands on the SAS Server

  • “Mount” the S3 storage on the SAS Server

      ✓ Treat it like a local disk
      ✗ S3 is not designed for block storage/access
      ✗ Potential issues with current storage driver implementations
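
As a rough illustration of the AWS CLI option, the sketch below assembles an `aws s3 cp` command for a public object. It is written in Python purely for illustration; the deck implies the CLI is launched from SAS through X commands. The bucket and key are the NYC taxi example from slide 8. The helper name `build_s3_copy_command` and the local target path are hypothetical. `--no-sign-request` is the CLI flag for anonymous access to public buckets. The command is only built and printed, never executed.

```python
import shlex


def build_s3_copy_command(bucket, key, local_path, public=True):
    """Return the argv list for copying one S3 object to a local path."""
    cmd = ["aws", "s3", "cp", f"s3://{bucket}/{key}", local_path]
    if public:
        # Public buckets can be read anonymously, so skip request signing.
        cmd.append("--no-sign-request")
    return cmd


cmd = build_s3_copy_command(
    "nyc-tlc",
    "trip data/yellow_tripdata_2017-01.csv",
    "/tmp/yellow_tripdata_2017-01.csv",  # hypothetical path on the SAS Server
)

# shlex.join quotes the S3 URI because the object key contains a space.
print(shlex.join(cmd))
```

Keeping the command as a list until the last moment avoids quoting mistakes with the space in "trip data".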

slide-8
SLIDE 8

Example: NYC Trip Data in S3

  • NYC Yellow Cab trip data for January 2017
      • 9,710,124 records
      • CSV format
      • 815 MB
  • Location
      • Bucket: nyc-tlc
      • Object Key: trip data/yellow_tripdata_2017-01.csv
      • HTTP Protocol: https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2017-01.csv
      • S3 Protocol: “s3://nyc-tlc/trip data/yellow_tripdata_2017-01.csv”
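
The two location forms above differ only in how the bucket and object key are combined, and in how the space in the key is encoded. A minimal Python sketch of that mapping follows; the function name `s3_locations` is hypothetical, and the path-style `s3.amazonaws.com` host is taken from the slide rather than from any general rule.

```python
from urllib.parse import quote_plus


def s3_locations(bucket, key):
    """Return (HTTP URL, S3 URI) for an object, in the forms shown on the slide."""
    # quote_plus turns the space in the key into '+'; safe='/' keeps the
    # "folder" separator readable instead of encoding it as %2F.
    http_url = f"https://s3.amazonaws.com/{bucket}/{quote_plus(key, safe='/')}"
    # The S3 URI carries the key verbatim, so it must be quoted when used
    # on a command line.
    s3_uri = f"s3://{bucket}/{key}"
    return http_url, s3_uri


http_url, s3_uri = s3_locations("nyc-tlc", "trip data/yellow_tripdata_2017-01.csv")
print(http_url)
print(s3_uri)
```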
slide-9
SLIDE 9

FILENAME URL Access Method

NOTE: The data set WORK.YELLOW_TRIPDATA_2017_01 has 9710124 observations and 17 variables.
      real time           36.09 seconds
      cpu time            33.85 seconds

slide-10
SLIDE 10

PROC S3

NOTE: PROCEDURE S3 used (Total process time):
      real time           3.77 seconds
      cpu time            6.31 seconds

NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.75 seconds
      cpu time            26.75 seconds

slide-11
SLIDE 11

AWS CLI

NOTE: DATA statement used (Total process time):
      real time           5.80 seconds
      cpu time            0.00 seconds

NOTE: PROCEDURE IMPORT used (Total process time):
      real time           26.59 seconds
      cpu time            26.59 seconds
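
To compare the three approaches end to end, the "real time" figures from slides 9 to 11 can be added up per method. The Python sketch below does that arithmetic. It assumes the single figure on slide 9 covers both retrieval and import, and that the DATA step on slide 11 is the step that ran the AWS CLI download; neither is stated explicitly in the logs. The download rate uses the 815 MB file size from slide 8.

```python
# "real time" seconds per step, copied from the log notes on slides 9-11.
real_time = {
    "FILENAME URL": [36.09],      # assumed to cover retrieval and import
    "PROC S3": [3.77, 26.75],     # download, then PROC IMPORT
    "AWS CLI": [5.80, 26.59],     # assumed download step, then PROC IMPORT
}

FILE_SIZE_MB = 815  # size of the January 2017 CSV, from slide 8

# Total elapsed time per method, fastest first.
totals = {method: round(sum(steps), 2) for method, steps in real_time.items()}
for method, total in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{method:<13}{total:>6.2f} s")

# Approximate download rate for the two methods with a separate download step.
download_rate = {
    method: round(FILE_SIZE_MB / real_time[method][0], 1)
    for method in ("PROC S3", "AWS CLI")
}
print(download_rate)
```

On these figures the download step is a small share of the total for PROC S3 and the AWS CLI; most of the elapsed time is the PROC IMPORT step in both cases.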

slide-12
SLIDE 12

Questions?

Contact michael@selerity.com.au 1300 727 757

seleritysas.com