survey procedures | Jessica Hampton

In a previous post, I talked about complex survey designs and why analyzing such survey data requires the SAS survey procedures. PROC SURVEYREG and PROC SURVEYLOGISTIC offer some of the same output and diagnostic options as their non-survey counterparts, PROC REG and PROC LOGISTIC. Default output includes fit statistics (R-squared, AIC, and the Schwarz criterion), chi-squared tests of the global null hypothesis, degrees of freedom, and a coefficient estimate for each parameter along with its standard error and p-value. Like PROC LOGISTIC, PROC SURVEYLOGISTIC also reports odds ratio point estimates and 95% Wald confidence intervals for each input parameter.

The survey procedures are more limited in some ways, though. For example, PROC LOGISTIC can use an option such as stepwise selection to restrict the output to only those predictors that meet a chosen significance level; there is also an option to rank those predictors. These options do not work with PROC SURVEYLOGISTIC, which makes the output unwieldy when there are many predictors. The most notable difference is that PROC LOGISTIC automatically outputs a chi-squared test of the residuals for each input variable, whereas any analysis of residuals is irrelevant for the survey procedures because the assumptions of normality and equal variance do not apply under a complex survey design. Tabled residuals are not output at all by the survey procedures, although covariance matrices are available for both as a non-default option. Similarly, influential observations and outliers are not analyzed because of the use of person weights. As long as we use person weights, we would get the same coefficients from a regular PROC REG as from PROC SURVEYREG, but the standard error estimates would differ, so predictor significance could also vary.
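As a minimal sketch of that last point, assume a hypothetical data set work.survey with an outcome y, predictors x1 and x2, a person weight perwt, and design variables stratum and psu. None of these names come from a real survey file; they are placeholders. Under those assumptions, the two calls below should report the same coefficient estimates but different standard errors:

```sas
/* Weighted least squares: uses the person weights
   but ignores the strata and clusters of the design */
proc reg data=work.survey;
   model y = x1 x2;
   weight perwt;
run;
quit;

/* Design-based regression: same weights, plus the
   design variables that drive the variance estimates */
proc surveyreg data=work.survey;
   strata stratum;
   cluster psu;
   model y = x1 x2;
   weight perwt;
run;
```

Comparing the two parameter estimate tables side by side is a quick way to see how much the design information changes the standard errors and p-values.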

A common way to draw a random sample of n=1000 in SAS is to generate a random number for each observation using RANUNI or a similar function. The data set is then sorted on that field and the first 1000 observations are kept as the sample.
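That manual approach might look something like the sketch below. The input data set name Customers is borrowed from the PROC SURVEYSELECT example in this post; the intermediate and output data set names, the variable name rand, and the seed value are arbitrary choices made for illustration.

```sas
/* Attach a uniform(0,1) random number to every observation */
data work.customers_rand;
   set Customers;
   rand = ranuni(12345);   /* 12345 is an arbitrary seed */
run;

/* Put the observations in random order */
proc sort data=work.customers_rand;
   by rand;
run;

/* Keep the first 1000 observations of the shuffled data */
data work.sample_manual;
   set work.customers_rand(obs=1000);
   drop rand;
run;
```

It works, but it takes three steps and an extra copy of the data to do what is conceptually a single operation.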

PROC SURVEYSELECT offers a simple alternative with just a few lines of code:

proc surveyselect data=Customers
method=srs n=1000 out=SampleSRS;
run;

The METHOD= option set equal to “srs” requests simple random sampling. DATA= specifies the input data set, while OUT= specifies the output data set. N= sets the sample size, and the optional SEED= option supplies a particular seed for the random number generator; if it is omitted, the seed defaults to the time of day from the computer clock. The default printed summary includes the seed, the selection probability, and the sampling weight, which under simple random sampling are the same for every observation.
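If the sample needs to be reproducible, a fixed seed can be added to the same call. The sketch below uses an arbitrary seed value; it also adds the STATS option, which to my understanding writes the selection probability and sampling weight to the OUT= data set for equal-probability methods such as SRS. Check the documentation for your SAS release before relying on that.

```sas
proc surveyselect data=Customers
   method=srs n=1000
   seed=20240101      /* arbitrary fixed seed so the draw can be repeated */
   stats              /* assumed: adds selection probability and weight to OUT= */
   out=SampleSRS;
run;
```

Rerunning this step with the same seed and the same input data should select the same 1000 observations.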

Alternatively, if you want to get a little fancier and try out various sample sizes for different markets, you can do what I like to do and use macro variables to set some of the parameters:

proc surveyselect data=work.total_elig_pop_&mkt
method=srs n=&size out=ci_share.sample_&mkt._new;
run;
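For that step to run, the macro variables have to be defined first. A minimal sketch, with made-up values for the market and the sample size, and assuming the ci_share libref has already been assigned with a LIBNAME statement:

```sas
/* Hypothetical values: change these two lines to rerun for another market */
%let mkt  = east;
%let size = 500;

/* Resolves to data=work.total_elig_pop_east, n=500,
   and out=ci_share.sample_east_new */
proc surveyselect data=work.total_elig_pop_&mkt
   method=srs n=&size out=ci_share.sample_&mkt._new;
run;
```

Note the period after &mkt in the OUT= name: it marks the end of the macro variable reference so that SAS does not go looking for a macro variable called mkt_new.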

There are many other options available with PROC SURVEYSELECT which you can use for more complex sampling. Additional SAS survey procedures for analyzing data created using complex sampling methods are discussed in one of my conference papers. For more about macro variables and how they make your code easier to maintain, see 10 Steps to Easier SAS Code Maintenance.


SAS Survey Procedures: PROC SURVEYLOGISTIC vs. PROC LOGISTIC Output

Simple Random Sampling with Proc SurveySelect