Survey Design: Stratification & Clustering

In a previous post, I talked about importing Medical Expenditure Panel Survey (MEPS) data into SAS. The MEPS survey design is complex, with person weights, stratification, and multi-stage clustering; it is not a simple random sample of the population. Stratification is a survey design technique in which the population is divided into groups (strata), typically by demographic variables such as age, race, sex, or income, and each stratum is sampled separately. The goal is to maximize homogeneity within strata and heterogeneity between strata. Stratification is also used when it is desirable to oversample groups that are under-represented in the general population or that have characteristics relevant to what is being studied (for example, Black, Hispanic, and low-income households).
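To make the oversampling idea concrete, here is a minimal Python sketch (not SAS, and not how MEPS weights are actually constructed). The stratum names and counts are hypothetical. It assumes the simplest case, in which each respondent's base weight is the inverse of the selection probability in their stratum, that is, population size divided by sample size.

```python
# Hypothetical population and sample counts per stratum (not MEPS figures).
population = {"low_income": 20_000, "other": 80_000}
sample = {"low_income": 500, "other": 500}  # low-income stratum is oversampled


def person_weights(population, sample):
    """Return the base weight N_h / n_h for each stratum h."""
    return {h: population[h] / sample[h] for h in population}


weights = person_weights(population, sample)
for stratum, w in weights.items():
    print(f"{stratum}: each respondent represents {w:.0f} people")
```

Respondents in the oversampled stratum carry a smaller weight because each one stands in for fewer people. Real survey weights usually include further adjustments (for nonresponse, for instance) that this sketch leaves out.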

Clustering is typically done by geography in order to reduce survey costs when it is not feasible or cost-effective to draw a random sample from the entire population of the U.S., for example. Ignoring within-cluster correlation leads to underestimated variance/error, because two families in the same neighborhood are more likely to be similar demographically (in regard to income, for instance) and so contribute less independent information than two families chosen at random. Therefore, we want clusters to be spatially compact for cost-effectiveness but as heterogeneous within as possible to keep variance reasonable. Sometimes a multi-stage clustering approach is used, as in MEPS; for example, a sample of counties is taken, then a sample of blocks is taken from those counties, and finally individuals/households are surveyed from the sampled blocks. Information about how the survey was designed is then stored in survey design variables that are included in the dataset. These survey design variables are used to obtain population estimates such as totals and means, and can also be used in regression analysis with procedures such as PROC SURVEYREG and PROC SURVEYLOGISTIC.
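The cost of within-cluster similarity can be roughly quantified with Kish's approximate design effect, 1 + (m - 1) * rho, where m is the number of respondents per cluster and rho is the intra-cluster correlation. The Python sketch below assumes equal-sized clusters, and the numbers are made up for illustration; it is not the variance estimator that the SAS survey procedures use.

```python
def design_effect(cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n, cluster_size, icc):
    """Size of a simple random sample with roughly the same precision."""
    return n / design_effect(cluster_size, icc)


# Hypothetical design: 2,000 respondents, 20 per cluster, modest correlation.
deff = design_effect(cluster_size=20, icc=0.05)
n_eff = effective_sample_size(n=2000, cluster_size=20, icc=0.05)
print(f"design effect: {deff:.2f}, effective sample size: {n_eff:.0f}")
```

With these inputs the design effect is roughly 1.95, so the 2,000 clustered respondents carry about as much information as a simple random sample of a little over 1,000. More heterogeneous clusters (a smaller rho) bring the design effect back toward 1.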

If person weights are ignored and one tries to generalize sample findings to the entire population, the oversampled groups are over-represented in total numbers, percentages, and means, and the other groups are under-represented. It is therefore highly undesirable to estimate population frequencies or means without using person weights, for example through SAS procedures such as PROC SURVEYMEANS and PROC SURVEYFREQ. In regression analysis, ignoring person weights can lead to biased coefficient estimates. If the sampling strata and cluster variables are ignored but the weights are still used, means and coefficient estimates are unaffected, but standard errors (the estimated sampling variance of those estimates) may be underestimated; that is, the reliability of an estimate may be overestimated. For example, when comparing one estimated population mean to another, the difference may appear to be statistically significant when it is not.
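A small Python sketch shows how far an unweighted mean can drift from the weighted one. The four respondents, their expenditure values, and their weights are hypothetical. The first two respondents come from an oversampled group and therefore carry small weights.

```python
# Hypothetical respondents as (expenditure, person_weight). Not MEPS data.
respondents = [
    (1000.0, 40.0),   # oversampled group: small weight
    (1200.0, 40.0),
    (3000.0, 160.0),  # everyone else: large weight
    (3200.0, 160.0),
]


def unweighted_mean(rows):
    """Plain sample mean, ignoring person weights."""
    return sum(value for value, _ in rows) / len(rows)


def weighted_mean(rows):
    """Population mean estimate: sum(w * y) / sum(w)."""
    total_weight = sum(weight for _, weight in rows)
    return sum(value * weight for value, weight in rows) / total_weight


print("unweighted:", unweighted_mean(respondents))
print("weighted:  ", weighted_mean(respondents))
```

The unweighted mean is 2100, pulled toward the oversampled group, while the weighted estimate is 2700. This sketch covers only the point estimate; getting the standard error right also requires the strata and cluster variables.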