Tag Archives: SAS

Reflexive Joins in SAS with Proc SQL

Sometimes, you may want to use a table twice in the same query. When you join a table to itself*, it is called a reflexive join. In the example below, a common task when calculating admissions-related healthcare quality measures is to exclude transfers from admissions rates. When a patient is transferred from one facility to another within 24 hours of being discharged from the first facility, we do not want to count this as a separate admission. If we did, this would inflate our rates.

The parameterized macro below uses a reflexive join to identify transfers and delete them from the original file (designated by the parameter filename). The table is referenced twice in the FROM clause, first with alias A and then with alias B. We use member identifying information (fields rhmo and bkey) to join the table to itself. We also use the intnx function to calculate when an admission for a single member occurs within 24 hours of the discharge date. Finally, in the WHERE clause we specify that the readmission must have occurred at a different facility (src_sys_prov_id). The unique identifier for the admission (conf_id) is used in the DELETE operation to exclude these transfers from the original file.

%macro transfer(filename=)/STORE SOURCE;
CREATE TABLE &filename._TRANS AS
  SELECT B.conf_id, B.admit_dt, A.disch_dt
     FROM PQI.&filename A
       JOIN PQI.&filename B ON A.rhmo=B.rhmo AND A.bkey=B.bkey
           AND A.disch_dt BETWEEN B.admit_dt AND intnx(‘DAY’,B.admit_dt,-1)
               WHERE A.conf_id <> B.conf_id
AND A.src_sys_prov_id <> B.src_sys_prov_id
;
DELETE
  FROM PQI.&filename
    WHERE conf_ID IN
      (SELECT conf_id FROM &filename._TRANS)
;
%mend transfer;

*Fun grammar fact! “Itself” is a reflexive pronoun, which we use in English when the subject of a verb is also the object. Such a verb is a reflexive verb. In Spanish, these verbs take on an entirely different form than other verbs, while in English, a reflexive verb can usually also be used in a non-reflexive way (example: “I dressed myself” vs. “I dressed my cat”).

Survey Design: Stratification & Clustering

In a previous post, I talked about importing Medical Expenditure Panel Survey (MEPS) data into SAS. MEPS survey design is complex, with person weights, stratification and multi-stage clustering techniques; it is not a random sample of the population. Stratification is a survey design technique which is typically done by demographic variables such as age, race, sex, income, etc. The goal is to maximize homogeneity within strata and heterogeneity between strata. Sometimes stratification is used when it is desirable to oversample certain groups under-represented in the general population or with interesting characteristics relevant to what is being studied (for example, blacks, Hispanics, and low-income households).

Clustering is typically done by geography in order to reduce survey costs, where it is not feasible or cost-effective to do a random sample of the entire population of the U.S., for example. Within-cluster correlation underestimates variance/error, as two families in the same neighborhood are more likely to be similar demographically (in regard to income, for instance). Therefore, we want clusters to be spatially close for cost effectiveness but as heterogeneous within as possible for reasonable variance. Sometimes a multi-stage clustering approach is used, as in MEPS; for example, a sample of counties is taken, then a sample of blocks is taken from that sample of counties, and finally individuals/households are surveyed from the sample of blocks. Information about how the survey was designed is then stored in survey design variables which are included in the dataset. These survey design variables are used to obtain population means and estimates and can also be used in regression analysis with procedures such as PROC SURVEYREG and PROC SURVEYLOGISTIC.

If person weights are ignored and one tries to generalize sample findings to the entire population, total numbers, percentages, or means are inflated for the groups that are oversampled and underestimated for others. It is therefore highly undesirable to estimate population frequencies or means without using person weights or SAS procedures such as PROC SURVEYMEANS and PROC SURVEYFREQ. In regression analysis, ignoring person weights leads to biased coefficient estimates. If sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but standard error (or population variance) may be underestimated; that is, the reliability of an estimate may be overestimated. For example, when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not.

Using Automatic Macro Variable &SYSDATE

In this example, I want to import some data from an Oracle database and save as a SAS data set with today’s date in the filename. To complicate things a little further, I want the date in mmddyy format so that my file name looks like this: claims_research_101113.

The SAS automatic macro variable SYSDATE stores today’s date in DATE7 format, which looks like this: 11OCT13. There is also a SYSDATE9 variable which stores the date in DATE9 format: 11OCT2013. If I want to format the date the way I want it, I first have to use the INPUTN function to specify a numeric informat, and then I can use the %SYSFUNC macro function to format it as MMDDYYn6.

When I have today’s date formatted the way I want it, I store it in the macro variable &date, which I then add on to my filename when I create the table, so that the previous file doesn’t get overwritten. There’s no need to go into the folder and rename the file manually!  That’s what macro variables are for.  Here’s what it looks like:

%let date=%sysfunc(inputn(&sysdate,date9.),mmddyyn6.);
proc sql ;
    create table mbr_data.claims_research_&date as
        select *
            from ccdr.claims_research;
quit;

 

Two Interview Questions for SAS Programmers

I have a colleague who asks the same two questions every time he interviews a SAS programming consultant for an opening on his team. His first question has to do with the program data vector (pdv). His logic is that understanding the pdv is central to understanding how SAS processes data. Since he is a data step programmer, I also suspect that he uses this question to weed out programmers who rely extensively on PROC SQL, but lack a basic understanding of how the DATA STEP works. I am certainly guilty of preferring PROC SQL over DATA STEP programming, but I do know about the pdv and agree that it is important to understand how SAS processes data in order to make sure your programs run efficiently. Unlike querying with SQL in a relational database, where you want to use a set-based, non-sequential approach to maximize efficiency, SAS processes data sequentially. Understanding this procedural/iterative processing vs. the familiar set-based approach is important for someone coming from SQL to SAS.

When SAS compiles a data step, it creates the program data vector, which contains the automatic variables _N_ and _ERROR_. These variables are not written to the output data set, but they can be accessed and output to another variable or to the log; _N_ stores the number of times the data step has iterated during execution, while _ERROR_ is a binary variable set to 0 unless an error occurs during execution, in which case it is set to 1. As a data step executes and reads the input file, the current observation is read to the pdv, _N_ increments by 1, and the observation gets written to the output file. Then the program loops back to the top of the data step, and the process iterates until it reaches the end of the file, or until an error occurs. You can use this knowledge of the data step and the automatic variables to debug your programs — for example, you can write an observation containing an error to the log using the following statement:

if _error_=1 then put _infile_;

You can also use _N_ as an observation counter, if you are only reading in one observation for each iteration of the data step (it is important to remember that _N_ actually represents the number of iterations of the data step, which will often equal the number of observations read, but not always).

The second question for the prospective SAS programmer is to see if they can name three system options that are useful for debugging code containing macros. These are MPRINT, MLOGIC, and SYMBOLGEN. SYMBOLGEN automatically resolves macro variables used in submitted statements and prints them to the log; this helps you see at a glance if your macro variables are resolving to the values you expect. MPRINT outputs the full macro statements executed when you call a macro and prints them to the log. MLOGIC helps trace the execution of a macro by writing messages to the log at various stages of execution, so you can see more easily where a macro fails.

How to Use ODS to Zip Files in SAS 9.2

Starting with SAS version 9.2, you can use ODS to create zip files in SAS without having to use platform-specific commands (such as Unix commands if your files are on a Unix server). I used ODS to zip a file on my C drive in four steps:

  1. First, I told SAS that I wanted to create a new package. You don’t need to name your package unless you are going to create multiple packages:
    ods package open nopf;
  2. Next, I added my file to the package. You can add multiple files, but I recommend that you only zip a single file at a time. Also note that unless you add the file extension to the full path, your SAS program will run with no errors, but the resulting zip file will be empty:
    ods package add file=’C:\Users\C31497\Desktop\projects\h138.sas7bdat’;
  3. Then I published the zip file and gave it a name. The zip file appears by default in the same path as the original file, but you can specify a different path using the archive_path argument when you assign the properties. If there is an existing zip file with the same name, it will be overwritten:
    ods package publish archive properties(archive_name=’h138.zip’);
  4. Finally, don’t forget to close the ods destination after you’re finished. You may also wish to drop your original (unzipped) file, since this is not done automatically after the zip file is created:
    ods package close;

Here’s what it looks like when you put it together:

ods package open nopf;
ods package add file=’C:\Users\C31497\Desktop\projects\h138.sas7bdat’;
ods package publish archive properties(archive_name=’h138.zip’);
ods package close;

There is still no way to unzip your files outside of using platform-specific syntax, so hopefully SAS is working on that for a future version.