Tag Archives: SAS

Create a Data Dictionary Part II: SQL Dictionary Tables

I have already posted on how to export a data dictionary to Excel using a modified Proc Contents. However, you can also practice using the Excel libname statement with SQL dictionary tables to do the same thing. The advantage of using SQL dictionary tables over Proc Contents is that SQL dictionary tables store much more information and give you access to many more variables. The two tables I use in the following code are dictionary.tables and dictionary.columns.

First, you can use Proc SQL with the noexec and feedback options to get a full list of available variables for each table to print to the log without actually having to execute the sql statement (for more on the feedback option, see my previous post):

89 proc sql feedback noexec;
90 select *
91 from dictionary.tables;

NOTE: Statement transforms to:
select libname, memname, memtype, dbms_memtype, memlabel, typemem, crdate, modate, nobs, obslen, nvar, protect, compress, encrypt, npage, filesize,pcompress, reuse, bufsize, delobs, nlobs, maxvar, maxlabel, maxgen, gen, attr, indxtype, datarep, sortname, sorttype, sortchar, reqvector, datarepname,encoding, audit, audit_before, audit_admin, audit_error, audit_data
from DICTIONARY.TABLES;

Now, suppose I want to create my data dictionary based on all the tables I have in my library “HRA”. First, I assign my libname statement to point to the Excel workbook I want to create (note that the libname statement does not support the replace option, so if the file already exists, you’ll have to delete it before trying to recreate it):

libname myxls \\WLDAPP043\Hedisapp\Personal\Jess\HRA\2010\Data_Dictionary.xls;

Then I select from the dictionary tables using Proc SQL to create my two worksheets:

proc sql;
create table myxls.directory as
select memname,
crdate,
modate,
nobs,
obslen,
nvar,
filesize
from dictionary.tables
where libname = ‘HRA’
;
create table myxls.table_listing as
select memname,
name,
type,
length,
format
from dictionary.columns
where libname = ‘HRA’
;
QUIT;

One final note to add is that you can’t open the Excel file while the libname is still assigned, so you’ll either have to close out of SAS before admiring your data dictionary, or run the following line of code to deassign your libname statement.

libname myxls clear;

HASUG Meeting Notes: November 2011 (Social Media)

The 4th quarter 2011 HASUG meeting took place at Bristol-Myers Squibb in Wallingford, CT, on November 10th. Speakers included John Adams of Boehringer Ingelheim and David Kelly from Customer Intelligence at SAS Institute.

David Kelly presented SAS Institute’s Social Media Analytics software platform designed to enable companies to process large volumes of unstructured data from internal and external social media and to base business decisions on that data. His presentation, “The Power of Social Media Listening,” introduced the SAS Customer Intelligence organization at SAS, provided an engaging narrative of social media statistics (about 70% of YouTube and FaceBook activity come from outside the US, for example), portrayed the potential and the landscape of social media and outlined the data challenges (punctuation, spelling, segmentation, acronyms, industry and social media jargon) associated with analyzing the unstructured text data which comprises over 70% of social media data.

Kelly cited the infamous viral YouTube video posted by a previously obscure singer-songwriter who watched United Airlines cargo handlers break his expensive Taylor guitar as a social media example of negative PR which led to a loss of $180 million for United Airlines. Clearly, how corporations react to social media in real-time can have serious financial implications. Kelly also discussed the “4 Cs” of social media: Content, Context, Connections, and Conversations, and noted the importance of being able to identify key “influencers” (a concept which will be of interest to those acquainted with Malcolm Gladwell’s “Tipping Point”) and the origins of negative PR stories.

SAS’s solution for businesses looking to monitor and respond quickly to information about their brand floating around on social media sites is the SAS Social Media Analytics software platform. This platform crawls the web for industry or company-specific information (largely in the form of unstructured text data), capturing, cleaning, organizing, and analyzing that data as part of a customizable self-service application that allows your organization to generate real-time reports, including comparison reports (vs. competition), analysis of historical data and trend identification, “sentiment analysis” currently supported in 13 languages, and much more. See my conference paper for an example of the kind of text analysis you can do using Base SAS.

HASUG Meeting Notes: November 2011 (define.xml)

The 4th quarter 2011 HASUG meeting took place at Bristol-Myers Squibb in Wallingford, CT, on November 10th. Speakers included John Adams of Boehringer Ingelheim and David Kelly from Customer Intelligence at SAS Institute.

John Adams’s presentation, “Creating a define.xml file for ADaM and SDTM,” addressed a current issue within the pharmaceutical industry as CDISC (Clinical Data Interchange Standards Consortium) moves to standardize the electronic submission process of pharmaceutical studies to the FDA, in the interest of making the review process more efficient and consequently decreasing the time it takes for a new drug to reach the market. A define.xml file contains all the metadata information needed to guide the reviewer through the electronic FDA submission. While there is software readily available for the creation of this file for SDTM submissions, only limited support exists for ADaM compatible define.xml files. Adams’s presentation described how his organization addresses this problem.

Adams began with a short tutorial on xml schemas and style sheets before describing the process for creating ADaM compatible define.xml files and discussing the methodology for capturing metadata. The xml tutorial, which was very well done, included a visual representation of basic xml structure, showing how root elements, child elements, attributes, and values are organized hierarchically in xml. He also contrasted html vs. xml (global, standard tags in html vs. non-standard tags defined by a schema in xml) and described the requirement that the define.xml file be “well-formed xml” (as opposed to an xml fragment), listing the basics of well-formed xml as follows: xml declaration, unique root element, start and end tags, proper nesting of case-sensitive elements, quoted attribute values, and use of entities for special characters (&,<,>,etc.). Finally, he defined the two elements of the define.xml file: the schema, an .xsd file which defines the file structure (elements, attributes, order and number of child elements, data types, default values) and validates the data, and the style sheet, an .xsl file which defines the layout/display for rendering the data (table of contents, tables, links), used to transform the xml into an html file that can be recognized and displayed by a browser.

Next, Adams described the general CDISC schema, zeroing in on some of the more important elements, and provided a list of available software tools for developing xml files along with some of the challenges associated with each: CDISC software, SAS Clinical Toolkit (in Base SAS), SAS XML Mapper (Java-based GUI which is helpful translating xml files to SAS data sets but not vice versa), and SAS XML Engine (Base SAS). He described the process of capturing metadata in Excel to use as input for the SAS programs which output the define.xml file, highlighting the newer v.9 Excel libname feature in SAS (example syntax: “LIBNAME WrkBk EXCEL ‘My Workbook.xls’ VER=2002;” see sugi paper for more details: http://www2.sas.com/proceedings/sugi31/024-31.pdf), or refer to my previous post on 3 Ways to Export Your Data to Excel for other ways to use the Excel libname. He also shared a SAS macro using the tranwrd() function to replace special characters such as “&” and “< "which must be represented in the xml document as "&amp" and "&lt." Also of note: Adams recommended Oxygen Editor to debug the xml code and make sure the file displays properly in Internet Explorer. This was a very interesting discussion of how he and others at Boehringer successfully adapted CDISC schema and style sheets to produce an ADaM compatible define.xml file; even for a non-pharmaceutical audience, his discussion of basic xml structure and SAS tools used to solve this business problem could prove useful.

Proc SQL Feedback Option

Most people who work with large data sets understand the importance of only selecting the fields you need from any given table or tables rather than using Select * to select all fields. This leads to more efficient use of system resources, but can be annoying for the programmer who has to type in a long list of fields.

In order to avoid this, one easy tip is to use Enterprise Guide’s interface to select desired fields and then copying and pasting the generated code into your SAS program. An alternative to doing this, however, is to use the Proc SQL Feedback option.

For example, if I want to get a list of fields available in the SQL Dictionary table dictionary.tables for a given libname, I can use the feedback option to get the following log output:

21 proc sql feedback;
22 select *
23 from dictionary.tables
24 where libname = ‘yourlibname’
25 ;
NOTE: Statement transforms to:
select libname, memname, memtype, dbms_memtype, memlabel, typemem, crdate, modate, nobs, obslen, nvar, protect, compress, encrypt, npage, filesize, pcompress, reuse, bufsize, delobs, nlobs, maxvar, maxlabel, maxgen, gen, attr, indxtype, datarep, sortname, sorttype, sortchar, reqvector, datarepname, encoding, audit, audit_before, audit_admin, audit_error, audit_data

from DICTIONARY.TABLES

where libname=’yourlibname’;

I can then copy and paste this listing into my original program and select only those fields I want without having to type them in manually.

Create an Excel Data Dictionary with Proc Contents and Proc Export

Sometimes it may be useful to generate a table listing of all the SAS datasets within a folder (the code below uses the previously assigned libref HRA to refer to this folder) — a sort of table of contents — along with some more detail about the fields within those tables. While this information can be output within SAS using a Proc Contents, you may wish to organize this information differently, keep and drop certain variables, and export it in the form of a multi-sheet Excel workbook. The code below shows how a simple data dictionary can be created using Proc Contents and exported to Excel with Proc Export (or alternatively, the Libname statement). Both of these options for exporting data to Excel were discussed in a previous post. Keep in mind that although the code below only selects a very few of the variables available in SAS, you can modify it to select other variables as well. Try it without the keep= option to see the full range of available variables.

Data Dictionary with Proc Export:

PROC CONTENTS DATA=HRA._all_ memtype=data Out=table_listing (keep=memname Nobs name length format crdate);
RUN;

PROC SQL;
Create table directory as
SELECT distinct memname, nobs, crdate
from table_listing;
QUIT;

PROC EXPORT DATA=directory
OUTFILE=”\\path\Data_Dictionary.xls”
DBMS=EXCEL REPLACE;
SHEET=”directory”;
RUN;

PROC EXPORT DATA=table_listing
OUTFILE=”\\path\Data_Dictionary.xls”
DBMS=EXCEL REPLACE;
SHEET=”table_listing”;
RUN;

Alternate Syntax with Libname Statement:

libname myxls “\\path\Data_Dictionary.xls”; /*does not support replace option*/

PROC CONTENTS DATA=HRA._all_ memtype=data Out=myxls.table_listing (keep=memname Nobs name length format crdate);
RUN;

PROC SQL;
Create table myxls.directory as
SELECT distinct memname, nobs, crdate
from myxls.table_listing;
QUIT;