Tag Archives: proc surveyfreq

Survey Design: Stratification & Clustering

In a previous post, I talked about importing Medical Expenditure Panel Survey (MEPS) data into SAS. MEPS survey design is complex, with person weights, stratification and multi-stage clustering techniques; it is not a random sample of the population. Stratification is a survey design technique which is typically done by demographic variables such as age, race, sex, income, etc. The goal is to maximize homogeneity within strata and heterogeneity between strata. Sometimes stratification is used when it is desirable to oversample certain groups under-represented in the general population or with interesting characteristics relevant to what is being studied (for example, blacks, Hispanics, and low-income households).

Clustering is typically done by geography in order to reduce survey costs, where it is not feasible or cost-effective to do a random sample of the entire population of the U.S., for example. Within-cluster correlation underestimates variance/error, as two families in the same neighborhood are more likely to be similar demographically (in regard to income, for instance). Therefore, we want clusters to be spatially close for cost effectiveness but as heterogeneous within as possible for reasonable variance. Sometimes a multi-stage clustering approach is used, as in MEPS; for example, a sample of counties is taken, then a sample of blocks is taken from that sample of counties, and finally individuals/households are surveyed from the sample of blocks. Information about how the survey was designed is then stored in survey design variables which are included in the dataset. These survey design variables are used to obtain population means and estimates and can also be used in regression analysis with procedures such as PROC SURVEYREG and PROC SURVEYLOGISTIC.

If person weights are ignored and one tries to generalize sample findings to the entire population, total numbers, percentages, or means are inflated for the groups that are oversampled and underestimated for others. It is therefore highly undesirable to estimate population frequencies or means without using person weights or SAS procedures such as PROC SURVEYMEANS and PROC SURVEYFREQ. In regression analysis, ignoring person weights leads to biased coefficient estimates. If sampling strata and cluster variables are ignored, means and coefficient estimates are unaffected, but standard error (or population variance) may be underestimated; that is, the reliability of an estimate may be overestimated. For example, when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not.

Importing Medical Expenditure Panel Survey Data Into SAS

I did an in-house SAS user group presentation last week on using SAS survey procedures to analyze Medical Expenditure Panel Survey (MEPS) data with regard to insurance coverage in the context of healthcare reform (ACA and the New Individual Segment: Profiling the Uninsured and Non-Group Insured Populations with MEPS 2010 and SAS Survey Procedures). The MEPS 2011 consolidated data file is available as of September 2013 for download. It contains detailed information (over 1900 variables) on demographics, household income, employment, diagnosed health conditions, additional health status issues, medical expenditures and utilization, satisfaction with and access to care, and insurance coverage of those surveyed.

There are several government data sets made available to the public each year that are designed for easy analysis with SAS and other statistical programming software (including STATA and SPSS). I attended a BASUG training by Paul Gorrell back in 2012 which introduced me to some of these data sets. The MEPS website includes programming statements to help you import the data to SAS. If you have a SAS/STAT license with Base SAS of version 9 or above, you have access to four SAS survey procedures (PROC SURVEYFREQ, PROC SURVEYMEANS, PROC SURVEYREG, PROC SURVEYLOGISTIC) that you can use to analyze data from complex survey designs such as MEPS.

You can get started in just a few easy steps. There are a couple of ways to do this, but this is the method I used:
1. Download and run the h147ssp.exe file to extract the data to your chosen library.
2. After you execute the file above, you should be able to find the sas transport data file (h147.ssp) in your folder. Now you have to tell SAS where to find it with a FILENAME statement.
3. Assign the LIBNAME where you want your SAS data set to be created.
4. Import the data using PROC XCOPY.

Here’s an example:
LIBNAME MYLIB ‘C:\Users\C31497\Desktop’;
FILENAME H147 ‘C:\Users\C31497\Desktop\h147.ssp’;
PROC XCOPY IN=H147 OUT=MYLIB IMPORT;
RUN;

That’s it! Next you can run a PROC CONTENTS to get a full variable listing, or you can view the online codebook on the MEPS site. You can find out more about SAS Survey procedures in my NESUG 2013 paper, Proc SurveyCorr.

BASUG Meeting Notes: September 2012

I attended the third quarter BASUG (Boston Area SAS User Group) meeting in Cambridge, MA at the Microsoft NERD center on September 20th. Morning speakers included Craig Dickstein of Tamarack Professional Services and Paul Gorrell of IMPAQ International. Craig Dickstein is one of the authors of Health Care Data and SAS and has worked with Cigna as a HEDIS code reviewer. An afternoon training on using SAS to analyze publicly available healthcare data sets was also led by Paul Gorrell.

The entire day focused on healthcare data, with the following presentations: “Data Hygiene Routines for Administrative Healthcare Data”, “Calculating the Hospital Readmission Interval”, and “Using SAS to Generate Estimates of U.S. Prescription Drug Cost and Use”. Dickstein’s presentation on “data hygiene routines” included a useful overview of the architecture of ICD-9 diagnosis codes, CPT codes (categories I-III), and HCPCS procedure codes. His code samples demonstrated how to create procedure code lookup tables with Proc Format and use these lookup tables to identify bad values. His second presentation described the challenges of calculating re-admission intervals and presented some alternatives to using the LAG function, including a detailed discussion of how the Program Data Vector (PDV) works in SAS. Finally, Paul Gorrell discussed how to replicate the numbers found in commonly cited statistics such as “5% of Americans make up 50% of U.S. health care spending” by using SAS/STAT survey procedures such as Proc Surveyfreq and Proc Surveymeans in combination with the HRQ data files available from the Medical Expenditure Panel Survey (MEPS). The next quarterly meeting is scheduled for December 11th, 2012.