Tag Archives: keep

Testing the (keep=) Option with Proc SQL

I frequently have to pull data from tables with a large number of variables (columns). For example, the table in the example below has well over 100 variables, but I am referencing only seven variables total, including the four variables I want to select and those mentioned in the where clause.

In order to conserve system resources, it is good programming practice to specify only the fields you need instead of routinely using SELECT * to pull all fields. Those who use Data Step programming are probably familiar with drop= and keep= options to limit variable selection and increase programming efficiency; however, did you know that you can also use them with Proc SQL? I wanted to find out if using keep= within Proc SQL made my query run faster even if I had already specified the variables I wanted in my SELECT statement.

I tested this below, running the same query with and without the keep = option and using OPTION FULLSTIMER to show real time and cpu time. Here are the log results:

887 PROC SQL /*inobs = 100*/;
888 CREATE TABLE WORK.F1_Prv AS
889 SELECT DISTINCT
890 DATA_SRC_CD,
891 Prv_Calc_ID,
892 ORIG_DATA_SRC_CD,
893 Calc_Submitter_Cd
894 FROM PRODJ.INTG_SERVICE_DATE
895 WHERE Process_YM = 201001
896 AND PROD_CTG_CD = ’13’
897 AND Calc_Submitter_CD = ‘F’
898 ;
NOTE: Table WORK.F1_PRV created, with 105896 rows and 4 columns.
899 QUIT;

NOTE: PROCEDURE SQL used (Total process time):
real time 6:15.53
user cpu time 0.34 seconds
system cpu time 0.20 seconds
Memory 390k

900 PROC SQL /*inobs = 100*/;
901 CREATE TABLE WORK.F1_Prv AS
902 SELECT DISTINCT
903 DATA_SRC_CD,
904 Prv_Calc_ID,
905 ORIG_DATA_SRC_CD,
906 Calc_Submitter_Cd
907 FROM PRODJ.INTG_SERVICE_DATE (keep = DATA_SRC_CD ORIG_DATA_SRC_CD CALC_SUBMITTER_CD PROD_CTG_CD Prv_Calc_ID Process_YM)
908 WHERE Process_YM = 201001
909 AND PROD_CTG_CD = ’13’
910 AND Calc_Submitter_CD = ‘F’
911 ;
NOTE: Table WORK.F1_PRV created, with 105896 rows and 4 columns.
912 QUIT;

NOTE: PROCEDURE SQL used (Total process time):
real time 1:09.46
user cpu time 12.68 seconds
system cpu time 1.36 seconds
Memory 54200k

We can see a huge difference; Process 1 takes over 6 minutes in real time, while Process 2 takes just over 1 minute. But not so fast! Real time can vary greatly (as a matter of fact, the next two times I ran Process 2, real time was 4 minutes and 7 minutes, while user cpu time stayed between 12 and 13 seconds, and system cpu time was about 1.7 seconds) so the number we really want to look at is the cpu time, which is a better gauge of how much system resources are being used. Cpu time and memory use are both much lower for Process 1 than Process 2. My conclusion? In this case we are better off not using the (keep = ) option with Proc SQL.