Big Data Sets Seen as a Big Problem and How to Deal with Them - - PowerPoint PPT Presentation

big data sets seen as a big problem and how to deal with
SMART_READER_LITE
LIVE PREVIEW

Big Data Sets Seen as a Big Problem and How to Deal with Them - - PowerPoint PPT Presentation

Big Data Sets Seen as a Big Problem and How to Deal with Them Daniil Shliakhov, Kharkiv, Ukraine Frankfurt 2018 B I G D ATA S E T S S E E N A S A B I G P R O B L E M INTRO Running time is an issue! Running time Normal Size Data Sets


slide-1
SLIDE 1

Big Data Sets Seen as a Big Problem and How to Deal with Them

Frankfurt 2018

Daniil Shliakhov, Kharkiv, Ukraine

slide-2
SLIDE 2

B I G D ATA S E T S S E E N A S A B I G P R O B L E M

INTRO

Page 2

Running time

Normal Size Data Sets Large Data Sets

Running time is an issue!

slide-3
SLIDE 3

B I G D ATA S E T S S E E N A S A B I G P R O B L E M

INTRO

Page 3

Parameter Treatment n Mean SD Median Min Max Alkaline Phosphatase (U/L) Baseline Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx Cycle 1 Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx Cycle 2 Pooled TRT1 xxx xxx.x xxx.xx xxx.xx xxx xxx Pooled TRT2 xxx xxx.x xxx.xx xxx.xx xxx xxx

slide-4
SLIDE 4

GENERAL TIPS RETRIEVING SAS DATA SETS VIEW option

slide-5
SLIDE 5

data adlb; set adam.adlb; run;

G E N E R A L T I P S

VIEW OPTION

Page 5

A simple data step? How much time it may take to run this step if ADAM.ADLB is huge?

slide-6
SLIDE 6

data adlb; set adam.adlb; run;

G E N E R A L T I P S

VIEW OPTION

Page 6

Real time 34:23 CPU time 3:10 35 seconds? Too looooong L

slide-7
SLIDE 7

data adlb / VIEW=adlb; set adam.adlb; run;

G E N E R A L T I P S

VIEW OPTION

Page 7

A SAS VIEW is a type of SAS data set that retrieves data values from other files

slide-8
SLIDE 8

data adlb / VIEW=adlb; set adam.adlb; run;

G E N E R A L T I P S

VIEW OPTION

Page 8

Real time 0:03 CPU time 0:01 Less than 1 second? MAGIC!

slide-9
SLIDE 9

data analysis; merge adam.adsl adam.adlb; by studyid usubjid; run; proc sort data=analysis; by trt01an parcat paramcd avisitn; run;

G E N E R A L T I P S

VIEW OPTION

Page 9

Simple merge, common sort… How long?

slide-10
SLIDE 10

data analysis; merge adam.adsl adam.adlb; by studyid usubjid; run; proc sort data=analysis; by trt01an parcat paramcd avisitn; run;

G E N E R A L T I P S

VIEW OPTION

Page 10

Data Step PROC step Real time 36:04 1:12.40 CPU time 8:18 12.45 Simple merge, common sort… Too looooong again

slide-11
SLIDE 11

data analysis / VIEW=analysis; merge adam.adsl adam.adlb; by studyid usubjid; run; proc sort data=analysis out=analysis_sorted; by trt01an parcat paramcd avisitn; run;

G E N E R A L T I P S

VIEW OPTION

Page 11

Adding VIEW option to do magic

slide-12
SLIDE 12

data analysis / VIEW=analysis; merge adam.adsl adam.adlb; by studyid usubjid; run; proc sort data=analysis out=analysis_sorted; by trt01an parcat paramcd avisitn; run;

G E N E R A L T I P S

VIEW OPTION

Page 12

Data Step PROC step Real time 0:04 1:17.65 CPU time 0:02 20.32 VIEW option gives extra time to drink coffee with colleagues

slide-13
SLIDE 13

GENERAL TIPS RETRIEVING SAS DATA SETS IF or WHERE?

slide-14
SLIDE 14

data adlb; set adam.adlb; if ANL01FL = 'Y'; run; data adlb; set adam.adlb; where ANL01FL = 'Y'; run;

G E N E R A L T I P S

IF or WHERE?

Page 14

IF vs. WHERE. Who is the champion?

slide-15
SLIDE 15

data adlb; set adam.adlb; if ANL01FL = 'Y'; run; data adlb; set adam.adlb; where ANL01FL = 'Y'; run;

G E N E R A L T I P S

IF or WHERE?

Page 15

IF statement WHERE statement Real time 31:64 33:31 CPU time 3:53 5:68 IF is champion! Woohoo!

slide-16
SLIDE 16

data adlb; set adam.adlb; if ANL01FL = 'Y'; run; data adlb; set adam.adlb; where ANL01FL = 'Y'; run;

G E N E R A L T I P S

IF or WHERE?

Page 16

IF statement WHERE statement Real time 32:15 27:26 CPU time 4:28 2.98

INDEX APPLIED

INDEX helps WHERE to win J

slide-17
SLIDE 17

SAS PROCEDURES DESCRIPTIVE STATS

slide-18
SLIDE 18

proc means data=adlb noprint; by trt01an parcat paramcd avisitn; var aval;

  • utput out = mnout

n = n mean = mean median = median std = std min = min max = max; run;

S A S P R O C E D U R E S

DESCRIPTIVE STATS

Page 18

slide-19
SLIDE 19

proc univariate data=adlb noprint; by trt01an parcat paramcd avisitn; var aval;

  • utput out = mnout

n = n mean = mean median = median std = std min = min max = max; run;

S A S P R O C E D U R E S

DESCRIPTIVE STATS

Page 19

slide-20
SLIDE 20

proc summary data=adlb noprint; by trt01an parcat paramcd avisitn; var aval;

  • utput out = mnout

n = n mean = mean median = median std = std min = min max = max; run;

S A S P R O C E D U R E S

DESCRIPTIVE STATS

Page 20

slide-21
SLIDE 21

proc sql noprint; create table mnout as select trt01an, parcat, paramcd, avisitn, COUNT(*) as n, MEAN(aval) as mean, MEDIAN(aval) as median, STD(aval) as std, MIN(aval) as min, MAX(aval) as max from adlb group by trt01an, parcat, paramcd, avisitn quit;

S A S P R O C E D U R E S

DESCRIPTIVE STATS

Page 21

slide-22
SLIDE 22

S A S P R O C E D U R E S

DESCRIPTIVE STATS

Page 22

MEANS UNIVARIATE SUMMARY SQL Real time 15:14 24:78 13:24 13:45 CPU time 3:38 1:76 3:33 2:53 DESCRIPTIVE STATS COMPARISON

slide-23
SLIDE 23

SAS PROCEDURES FREQUENCY

slide-24
SLIDE 24

proc freq data=adlb noprint; by trt01an parcat paramcd; tables avisitn / out=frout; run;

S A S P R O C E D U R E S

FREQUENCY

Page 24

slide-25
SLIDE 25

proc summary data=adlb nway noprint; by trt01an parcat paramcd avisitn;

  • utput out=frout;

run;

S A S P R O C E D U R E S

FREQUENCY

Page 25

slide-26
SLIDE 26

proc sql noprint; create table frout as select trt01an, parcat, paramcd, avisitn, COUNT(*) as count from adlb group by trt01an, parcat, paramcd, avisitn quit;

S A S P R O C E D U R E S

FREQUENCY

Page 26

slide-27
SLIDE 27

S A S P R O C E D U R E S

FREQUENCY

Page 27

FREQ SQL SUMMARY Real time 13:62 12:19 12:02 CPU time 2:04 1:63 0:88 FREQUENCY COMPARISON

slide-28
SLIDE 28

CONCLUSIONS

Do not be afraid to work with big data sets. Just choose the “right” procedure!

slide-29
SLIDE 29

Daniil Shliakhov daniil.shlyakhov@intego-group.com Kharkiv, Ukraine www.intego-group.com

T H A N K YO U