HASUG Meeting Notes: November 2011 (define.xml)

The 4th quarter 2011 HASUG meeting took place at Bristol-Myers Squibb in Wallingford, CT, on November 10th. Speakers included John Adams of Boehringer Ingelheim and David Kelly from Customer Intelligence at SAS Institute.

John Adams’s presentation, “Creating a define.xml file for ADaM and SDTM,” addressed a timely issue in the pharmaceutical industry: CDISC (Clinical Data Interchange Standards Consortium) is standardizing the electronic submission of pharmaceutical studies to the FDA, in the interest of making the review process more efficient and, consequently, shortening the time it takes for a new drug to reach the market. A define.xml file contains all of the metadata needed to guide the reviewer through an electronic FDA submission. While software is readily available for creating this file for SDTM submissions, only limited support exists for ADaM-compatible define.xml files. Adams’s presentation described how his organization addresses this gap.

Adams began with a short tutorial on XML schemas and style sheets before describing the process for creating ADaM-compatible define.xml files and discussing the methodology for capturing metadata. The XML tutorial, which was very well done, included a visual representation of basic XML structure, showing how root elements, child elements, attributes, and values are organized hierarchically. He also contrasted HTML with XML (global, standard tags in HTML vs. custom tags defined by a schema in XML) and explained that the define.xml file must be “well-formed XML” (as opposed to an XML fragment), listing the basics of well-formed XML as follows: an XML declaration, a unique root element, start and end tags, proper nesting of case-sensitive elements, quoted attribute values, and the use of entities for special characters (&, <, >, etc.).

Finally, he defined the two companion pieces of the define.xml file: the schema, an .xsd file that defines the file structure (elements, attributes, order and number of child elements, data types, default values) and validates the data; and the style sheet, an .xsl file that defines the layout for rendering the data (table of contents, tables, links) and is used to transform the XML into an HTML file that a browser can recognize and display.
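To make those requirements concrete, here is a minimal sketch (not from the presentation) of a DATA _NULL_ step that writes a tiny well-formed XML file with put statements; the file path, element names, and attributes are all hypothetical:

data _null_;
   file "\\path\example.xml";                       /* hypothetical output path */
   put '<?xml version="1.0" encoding="UTF-8"?>';    /* XML declaration */
   put '<study id="ABC-123">';                      /* unique root element, quoted attribute */
   put '   <dataset name="ADSL">';                  /* properly nested child element */
   put '      <label>Safety &amp; Efficacy</label>'; /* entity in place of & */
   put '   </dataset>';                             /* case-sensitive end tags */
   put '</study>';
run;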

Next, Adams described the general CDISC schema, zeroing in on some of its more important elements, and listed the software tools available for developing XML files, along with some of the challenges associated with each: CDISC software, the SAS Clinical Standards Toolkit (in Base SAS), SAS XML Mapper (a Java-based GUI that is helpful for translating XML files to SAS data sets, but not vice versa), and the SAS XML LIBNAME engine (Base SAS). He described the process of capturing metadata in Excel for use as input to the SAS programs that output the define.xml file, highlighting the newer v9 Excel libname feature in SAS (example syntax: LIBNAME WrkBk EXCEL 'My Workbook.xls' VER=2002; see this SUGI paper for more details: http://www2.sas.com/proceedings/sugi31/024-31.pdf, or refer to my previous post on 3 Ways to Export Your Data to Excel for other ways to use the Excel libname).

He also shared a SAS macro using the tranwrd() function to replace special characters such as “&” and “<”, which must be represented in the XML document as “&amp;” and “&lt;”. Also of note: Adams recommended the Oxygen XML editor for debugging the XML code and making sure the file displays properly in Internet Explorer. This was a very interesting discussion of how he and others at Boehringer successfully adapted the CDISC schema and style sheets to produce an ADaM-compatible define.xml file; even for a non-pharmaceutical audience, his overview of basic XML structure and the SAS tools used to solve this business problem could prove useful.
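Adams’s macro isn’t reproduced here, but a minimal sketch of the idea might look like the following (the macro and variable names are hypothetical; note that & must be escaped first, so the & characters introduced by &lt; and &gt; are not escaped twice):

%macro xml_escape(var);
   /* escape & first so later replacements are not double-escaped */
   &var = tranwrd(&var, '&', '&amp;');
   &var = tranwrd(&var, '<', '&lt;');
   &var = tranwrd(&var, '>', '&gt;');
%mend xml_escape;

data labels_clean;
   length label $200;   /* lengthen before SET so the escaped text is not truncated */
   set labels;          /* hypothetical input data set */
   %xml_escape(label)
run;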

Proc SQL Feedback Option

Most people who work with large data sets understand the importance of selecting only the fields you need from a given table rather than using Select * to pull all fields. This makes more efficient use of system resources, but it can be tedious for the programmer, who has to type out a long list of field names.

One easy way to avoid the typing is to select the desired fields through Enterprise Guide’s interface and then copy and paste the generated code into your SAS program. An alternative, however, is the Proc SQL Feedback option.

For example, if I want to get a list of fields available in the SQL Dictionary table dictionary.tables for a given libname, I can use the feedback option to get the following log output:

21 proc sql feedback;
22 select *
23 from dictionary.tables
24 where libname = 'YOURLIBNAME'
25 ;
NOTE: Statement transforms to:
select libname, memname, memtype, dbms_memtype, memlabel, typemem, crdate, modate, nobs, obslen, nvar, protect, compress, encrypt, npage, filesize, pcompress, reuse, bufsize, delobs, nlobs, maxvar, maxlabel, maxgen, gen, attr, indxtype, datarep, sortname, sorttype, sortchar, reqvector, datarepname, encoding, audit, audit_before, audit_admin, audit_error, audit_data

from DICTIONARY.TABLES

where libname='YOURLIBNAME';

I can then copy and paste this listing into my original program and select only those fields I want without having to type them in manually.
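For example, after pasting the generated list into the program and trimming it down, the final query might look something like this (the three fields kept here are just for illustration):

proc sql;
   create table my_tables as
   select memname, nobs, crdate     /* only the fields needed */
   from dictionary.tables
   where libname = 'YOURLIBNAME';
quit;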

Create an Excel Data Dictionary with Proc Contents and Proc Export

Sometimes it is useful to generate a listing, a sort of table of contents, of all the SAS data sets within a folder (the code below uses the previously assigned libref HRA to refer to this folder), along with some detail about the fields within those tables. While this information can be produced in SAS with a Proc Contents, you may wish to organize it differently, keep and drop certain variables, and export it as a multi-sheet Excel workbook. The code below shows how a simple data dictionary can be created with Proc Contents and exported to Excel with Proc Export (or, alternatively, the Libname statement). Both of these options for exporting data to Excel were discussed in a previous post. Keep in mind that although the code below keeps only a few of the variables Proc Contents provides, you can modify it to select others as well. Try it without the keep= option to see the full range of available variables.

Data Dictionary with Proc Export:

PROC CONTENTS DATA=HRA._all_ memtype=data Out=table_listing (keep=memname Nobs name length format crdate);
RUN;

PROC SQL;
Create table directory as
SELECT distinct memname, nobs, crdate
from table_listing;
QUIT;

PROC EXPORT DATA=directory
OUTFILE="\\path\Data_Dictionary.xls"
DBMS=EXCEL REPLACE;
SHEET="directory";
RUN;

PROC EXPORT DATA=table_listing
OUTFILE="\\path\Data_Dictionary.xls"
DBMS=EXCEL REPLACE;
SHEET="table_listing";
RUN;

Alternate Syntax with Libname Statement:

libname myxls "\\path\Data_Dictionary.xls"; /*does not support replace option*/

PROC CONTENTS DATA=HRA._all_ memtype=data Out=myxls.table_listing (keep=memname Nobs name length format crdate);
RUN;

PROC SQL;
Create table myxls.directory as
SELECT distinct memname, nobs, crdate
from myxls.table_listing;
QUIT;

3 Ways to Export Your Data to Excel

SAS offers many different ways to export data to Excel, including ODS output, sketched briefly below. The following alternatives to ODS are also available:
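As a quick illustration of the ODS route, here is a minimal sketch assuming the ExcelXP tagset that ships with SAS 9.1.3 and later (the file path is hypothetical; Excel opens the resulting XML spreadsheet file directly):

ods tagsets.excelxp file="\\path\class.xml" style=minimal;

proc print data=sashelp.class noobs;  /* each procedure's output becomes a worksheet */
run;

ods tagsets.excelxp close;  /* close the destination to finish writing the file */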

The File Export Wizard

This is an easy way to export your data without having to write any code. To use this wizard:

  • Select File>>Export Data
  • Choose the library and SAS data set name
  • Select a data source from the dropdown list (in this case, the version of Excel you’d like to export to – other types of data sources are also available here)
  • Browse to the folder where you’d like to create your Excel Workbook and type a name for the file
  • You then have the option to save the Proc Export statements generated by the Wizard to a SAS program file

Proc Export

If you chose to use the File Export Wizard to save your Proc Export statements, your code may look something like this:

PROC EXPORT DATA=sasdatasetname
OUTFILE="\\path\filename.xls"
DBMS=EXCEL REPLACE;
SHEET="sheetname";
RUN;

  • The data set you’d like to export goes after DATA=
  • The file path goes after OUTFILE=
  • You are exporting to Excel, so DBMS=EXCEL
  • If you wish to replace any existing workbook or worksheet with the same name, use the REPLACE option.
  • The SHEET= statement is optional, but it is useful if you wish to create a multi-sheet workbook. For your next PROC EXPORT statement, just specify the same file and a new SHEET name, as sketched below.
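For instance, this pair of steps (the data set and sheet names are hypothetical) builds a two-sheet workbook:

PROC EXPORT DATA=dataset1
OUTFILE="\\path\filename.xls"
DBMS=EXCEL REPLACE;
SHEET="sheet1";
RUN;

PROC EXPORT DATA=dataset2
OUTFILE="\\path\filename.xls"
DBMS=EXCEL REPLACE;
SHEET="sheet2";
RUN;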

The Libname Statement

This is a newer option for exporting to Excel, and it does not support the REPLACE option, so if a file with the name you are trying to create already exists, you will have to delete or rename the older version first. Also, while the libname is assigned, you will not be able to open the Excel document you have created; you will first have to exit SAS or clear the libname. It is easy to create multiple sheets using this method, though, as shown below.

libname myxls "\\path\filename.xls";

DATA myxls.sheetname1;
SET sasdataset1;
RUN;

DATA myxls.sheetname2;
SET sasdataset2;
RUN;

PROC SQL;
Create table myxls.sheetname3 as
SELECT *
from sasdataset3;
QUIT;
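When you have finished writing to the workbook, clear the libref to release the file so it can be opened in Excel:

libname myxls clear;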

HASUG Meeting Notes: May 2011

Northeast Utilities hosted May’s HASUG meeting in Berlin, CT. Both speakers focused on detecting fraud, first from store credit card issuer GE’s perspective, and then from a property and casualty insurance claims perspective.

“Usage of SAS in Credit Card Fraud Detection,” presented by GE employee Clint Rickards, began by introducing the most common types of credit card fraud and contrasting the challenges faced by PLCC (store card) issuers with those faced by bank card issuers. He presented an interesting statistic: half of all credit card fraud, measured in dollar amounts, occurs in only six states (CA, TX, FL, NJ, NY, and MI). He then discussed the general architecture of GE’s Risk Assessment Platform (RAP), which is designed to detect both real-time and post-transaction fraud and uses the full SAS Business Intelligence suite of products: SAS/IntrNet, Enterprise Guide, Data Integration Studio, Management Console, SAS Scalable Performance Data Server, and Platform Flow Manager/Calendar Editor. Finally, he stressed the importance of automated processes, reusable code, breaking large jobs into smaller pieces to allow for easier debugging, and keeping the testing and production environments separate.

Next, Janine Johnson of ISO Innovative Analytics presented “Mining Text for Suspicious P&C Claims,” describing how her consulting firm developed an automated (labor-intensive to build, but cost-effective) process in Base SAS for “mining” insurance claim adjusters’ notes, an unstructured text field, to derive data for use in a predictive model. She introduced the steps of text mining as follows: information retrieval, natural language processing, creating structured data from unstructured text, and evaluating the structured outputs (classification, clustering, association, etc.). Before beginning this process, she emphasized the necessity of consulting a domain expert (in this case, someone in the P&C industry familiar with industry jargon and non-standard abbreviations). She then organized her own project into five steps of an iterative process: cleaning the text (using the upcase, compress, translate, and compbl functions), standardizing terms (using the regular expression functions prxparse, prxposn, and prxchange, as well as scan and tranwrd), identifying words associated with suspicious claims and grouping them into concepts (“concept generation”), flagging records containing those suspicious phrases, and finally using proc freq with the chisq option to evaluate lift.
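Her code wasn’t distributed, but a minimal sketch of the cleaning and standardizing steps might look like this (the data set, variable name, and replacement terms are all hypothetical):

data notes_clean;
   set claims;                                /* hypothetical input data set      */
   note = upcase(note);                       /* normalize case                   */
   note = compbl(note);                       /* collapse runs of blanks          */
   note = tranwrd(note, 'VEH.', 'VEHICLE');   /* standardize a known abbreviation */
   /* use a regular expression to standardize spelling variants */
   note = prxchange('s/CLMNT|CLM\b/CLAIMANT/', -1, note);
run;

In practice, each pass through the iterative process would add new abbreviations and patterns identified with the help of the domain expert.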