Checking the Match in a Matched Case-Control Study

Sometimes we wish to conduct a study in which we take a group of interest (the treatment group) and match each case to a similar individual drawn from a population that is not undergoing the treatment (the control group). The goal is to find out whether the outcome we wish to measure after treatment differs significantly between the two groups. This is known as an individually matched case-control study. This post will focus on checking that the matched treatment population is similar to the control population.

For example, say that we want to find out whether providing an incentive to individuals in the treatment group will influence their behavior. We might match on the following variables: medical condition, gender, geographic area, age, risk score, number of office visits in the past 12 months, and total medical cost (TMC) in the past 12 months. Some of these we would want to match exactly (the first three variables in this example), whereas the last four we would match within a given range, as in the sketch below.
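Here is a minimal sketch of such a match in PROC SQL. All of the names (cases, pool, id, cond, gender, region, age, risk_score) and the ranges are placeholders for illustration; a real match would also need a further step to pick a single best control per case:

proc sql;
    create table work.candidate_matches as
        select c.id as case_id, p.id as control_id
            from work.cases c
                join work.pool p
                    on  c.cond   = p.cond     /* exact matches */
                    and c.gender = p.gender
                    and c.region = p.region
                    /* range matches */
                    and p.age between c.age - 5 and c.age + 5
                    and p.risk_score between c.risk_score - 0.1 and c.risk_score + 0.1;
quit;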

After we run the match, we want to check that the characteristics for the treatment group are similar to those of the control group — that the population means are not significantly different for the continuous variables such as TMC. In order to check this, we first want to test for normality. Whether or not the variable is normally distributed will determine which kind of test we run to see if there is a significant difference between the two groups.

To test normality for the variable TMC, we can use PROC UNIVARIATE as follows:

proc univariate data=work.testdata normal plot; /* NORMAL requests normality tests; PLOT requests stem-and-leaf, box, and normal probability plots */
    var TMC;
    qqplot TMC / normal(mu=est sigma=est color=red L=1); /* red reference line with estimated mu and sigma */
run;

This will give you a box plot and a q-q plot, and the NORMAL option will also give you several normality tests, including Shapiro-Wilk (appropriate for sample sizes < 2000). For normality, you want the mean and the median to be close together on the box plot, a reasonably symmetric distribution (skewness close to zero), a relatively straight line on the q-q plot, and p-values > 0.05 on the normality test results. A Shapiro-Wilk W value close to 1 also indicates that the data are normal. If the p-values are less than 0.05 (alpha), you reject the null hypothesis that the distribution is normal and proceed with non-parametric testing.

Assuming your distribution is normal, you would run a paired t-test to see whether TMC is similar for the two populations (in this example, “study_group” is a flag variable that indicates whether the observation is in the treatment/study group or the control group). One caveat: PROC TTEST’s PAIRED statement compares two variables on the same row, so the matched data first need to be restructured to one row per pair, with the case’s TMC and the matched control’s TMC side by side.
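Here is a minimal sketch of that restructuring, assuming a hypothetical pair_id variable links each case to its matched control and that study_group is coded 1 for treatment and 0 for control:

proc sql;
    create table work.pairs as
        select a.pair_id,
               a.TMC as TMC_case,    /* treatment member's cost */
               b.TMC as TMC_control  /* matched control's cost */
            from work.testdata a
                join work.testdata b on a.pair_id = b.pair_id
            where a.study_group = 1 and b.study_group = 0;
quit;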

With the data in that shape, the paired t-test looks like this:

proc ttest data=work.pairs;
    paired TMC_case*TMC_control; /* tests whether the mean within-pair difference is zero */
run;

The null hypothesis is that the mean difference between the pairs is zero, so you would want to see a large p-value here (> 0.05). Failing to reject the null suggests that our two populations are well-matched on this variable.

However, it is unlikely that TMC is normally distributed, so you would probably end up using the NPAR1WAY procedure instead, with the Wilcoxon rank-sum test. Here the null hypothesis is that the two groups come from the same distribution; again, we want large p-values, since failing to reject the null is the outcome we are hoping for:

proc npar1way data=work.testdata wilcoxon; /* WILCOXON requests the rank-sum test */
    class study_group;
    var TMC;
run;

The tests shown above are all for continuous variables. If instead you wanted to test a discrete variable such as gender, you could use PROC FREQ with a chi-square test to check that gender is independent of the study_group variable (here, male is a 0/1 indicator for gender):

proc freq data=work.testdata;
    tables study_group*male / chisq fisher; /* chi-square and Fisher's exact tests of independence */
run;

SAS Survey Procedures: PROC SURVEYLOGISTIC vs. PROC LOGISTIC Output

In a previous post, I talked about complex survey designs and why analysis of such survey data requires the use of SAS survey procedures. PROC SURVEYREG and PROC SURVEYLOGISTIC offer many of the same output/diagnostic options as their non-survey counterparts, PROC REG and PROC LOGISTIC. Default output includes fit statistics (R-squared, AIC, and the Schwarz criterion), chi-square tests of the global null hypothesis, degrees of freedom, and coefficient estimates for each parameter along with their standard errors and p-values. Like PROC LOGISTIC, PROC SURVEYLOGISTIC also includes odds ratio point estimates and 95% Wald confidence intervals for each input parameter.

The survey procedures are more limited in some ways, though. For example, PROC LOGISTIC can use an option such as stepwise selection to restrict the model to predictors significant at a specified level, and there is also an option to rank those predictors. These options do not work with PROC SURVEYLOGISTIC, which makes the output more unwieldy when there are many predictors. Most notably, PROC LOGISTIC automatically outputs a chi-square test of the residuals for each input variable; however, any analysis of residuals is irrelevant for the survey procedures, since the assumptions of normality and equal variance do not apply under a complex survey design. Tabled residuals are not output at all for the survey procedures, although covariance matrices are available in both as a non-default option. Similarly, influential observations/outliers are not analyzed, because of the use of person weights. As long as we use person weights, we would get the same coefficients with a regular PROC REG as with PROC SURVEYREG, but the standard error estimates would differ, and predictor significance could differ as well.
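As a rough sketch of the difference in invocation (the dataset meps and the variables VARSTR, VARPSU, PERWT, ins_flag, age, and income are placeholders, not names from an actual analysis), the survey version picks up the design through its STRATA, CLUSTER, and WEIGHT statements:

proc logistic data=work.meps;
    model ins_flag(event='1') = age income; /* ignores strata and clusters */
    weight PERWT;
run;

proc surveylogistic data=work.meps;
    strata VARSTR;   /* sampling stratum */
    cluster VARPSU;  /* primary sampling unit */
    weight PERWT;    /* person weight */
    model ins_flag(event='1') = age income;
run;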

Reflexive Joins in SAS with Proc SQL

Sometimes, you may want to use a table twice in the same query. When you join a table to itself*, it is called a reflexive join. The example below tackles a common task when calculating admissions-related healthcare quality measures: excluding transfers from admission rates. When a patient is transferred from one facility to another within 24 hours of being discharged from the first facility, we do not want to count this as a separate admission. If we did, it would inflate our rates.

The parameterized macro below uses a reflexive join to identify transfers and delete them from the original file (designated by the parameter filename). The table is referenced twice in the FROM clause, first with alias A and then with alias B. We use member-identifying fields (rhmo and bkey) to join the table to itself, and the INTNX function to flag admissions that occur within 24 hours of another admission's discharge date for the same member. Finally, the WHERE clause specifies that the readmission must have occurred at a different facility (src_sys_prov_id). The unique identifier for the admission (conf_id) is then used in the DELETE statement to remove these transfers from the original file.

%macro transfer(filename=)/STORE SOURCE;
PROC SQL;
/* Flag admissions (alias B) that begin within 24 hours of a discharge (alias A)
   by the same member at a different facility */
CREATE TABLE &filename._TRANS AS
  SELECT B.conf_id, B.admit_dt, A.disch_dt
    FROM PQI.&filename A
      JOIN PQI.&filename B
        ON A.rhmo=B.rhmo AND A.bkey=B.bkey
           AND A.disch_dt BETWEEN intnx('DAY',B.admit_dt,-1) AND B.admit_dt
    WHERE A.conf_id <> B.conf_id
      AND A.src_sys_prov_id <> B.src_sys_prov_id
;
/* Remove the flagged transfers from the original file */
DELETE FROM PQI.&filename
  WHERE conf_id IN (SELECT conf_id FROM &filename._TRANS)
;
QUIT;
%mend transfer;
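Calling the macro for a hypothetical admissions file named ADMITS in the PQI library would look like this:

%transfer(filename=ADMITS);

Note that the DELETE runs against PQI.&filename in place, so it is worth keeping a backup copy of the original file before invoking the macro.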

*Fun grammar fact! “Itself” is a reflexive pronoun, which we use in English when the subject of a verb is also the object. Such a verb is a reflexive verb. In Spanish, these verbs take on an entirely different form than other verbs, while in English, a reflexive verb can usually also be used in a non-reflexive way (example: “I dressed myself” vs. “I dressed my cat”).

Survey Design: Stratification & Clustering

In a previous post, I talked about importing Medical Expenditure Panel Survey (MEPS) data into SAS. The MEPS survey design is complex, with person weights, stratification, and multi-stage clustering; it is not a simple random sample of the population. Stratification is a survey design technique typically based on demographic variables such as age, race, sex, and income. The goal is to maximize homogeneity within strata and heterogeneity between strata. Stratification is also sometimes used to oversample groups that are under-represented in the general population or that have characteristics relevant to what is being studied (for example, blacks, Hispanics, and low-income households).

Clustering is typically done by geography in order to reduce survey costs, where it is not feasible or cost-effective to take a random sample of the entire population of the U.S., for example. Ignoring within-cluster correlation leads to underestimated variance/error, since two families in the same neighborhood are more likely to be similar demographically (in regard to income, for instance). Therefore, we want clusters to be spatially compact for cost-effectiveness but internally as heterogeneous as possible for reasonable variance estimates. Sometimes a multi-stage clustering approach is used, as in MEPS: for example, a sample of counties is taken, then a sample of blocks is taken from those counties, and finally individuals/households are surveyed from the sampled blocks. Information about how the survey was designed is stored in survey design variables included in the dataset. These survey design variables are used to obtain population means and estimates and can also be used in regression analysis with procedures such as PROC SURVEYREG and PROC SURVEYLOGISTIC.
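As a minimal sketch of how those design variables come into play, estimating a population mean might look like the following. The dataset name is a placeholder; the variables shown (VARSTR for the stratum, VARPSU for the PSU, PERWT12F for the person weight, and TOTEXP12 for total medical expenditures) follow the naming in the MEPS 2012 full-year file, so check your file's documentation for the exact names:

proc surveymeans data=work.meps_fy;
    strata VARSTR;    /* sampling stratum */
    cluster VARPSU;   /* primary sampling unit */
    weight PERWT12F;  /* person weight */
    var TOTEXP12;     /* total medical expenditures */
run;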

If person weights are ignored and one tries to generalize sample findings to the entire population, totals, percentages, and means are inflated for the groups that were oversampled and underestimated for the others. It is therefore highly undesirable to estimate population frequencies or means without using person weights and SAS procedures such as PROC SURVEYMEANS and PROC SURVEYFREQ. In regression analysis, ignoring person weights leads to biased coefficient estimates. If the sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but standard errors (and thus variance estimates) may be underestimated; that is, the reliability of an estimate may be overestimated. For example, when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not.

Using Automatic Macro Variable &SYSDATE

In this example, I want to import some data from an Oracle database and save it as a SAS data set with today’s date in the file name. To complicate things a little further, I want the date in mmddyy format with no separators, so that my file name looks like this: claims_research_101113.

The SAS automatic macro variable SYSDATE stores the date the SAS session started in DATE7. format, which looks like this: 11OCT13. There is also a SYSDATE9 variable, which stores the date in DATE9. format: 11OCT2013. To get the format I want, I first use the INPUTN function to read the macro variable back into a numeric SAS date value, and then the %SYSFUNC macro function applies the MMDDYYN6. format to the result.

Once I have today’s date formatted the way I want it, I store it in the macro variable &date, which I then append to the file name when I create the table, so that the previous file doesn’t get overwritten. There’s no need to go into the folder and rename the file manually! That’s what macro variables are for. Here’s what it looks like:

%let date=%sysfunc(inputn(&sysdate,date9.),mmddyyn6.); /* e.g., 11OCT13 becomes 101113 */
proc sql;
    create table mbr_data.claims_research_&date as
        select *
            from ccdr.claims_research;
quit;
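A quick way to confirm the value is to write it to the log; with the example date above, this prints 101113:

%put &date;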