
HASUG Meeting Notes: December 2012

HASUG’s 4th quarter meeting, featuring speakers Kevin Viel and Vinodh Paida, took place at Boehringer Ingelheim in Danbury, CT. PharmaSUG speaker Kevin Viel led with “Using the SAS System as a Bioinformatics Tool: A Macro That Calls the Standalone BLAST Setup”. Before sharing the macro, Viel began with some background on genomics and BLAST. A genome is all the genetic information of an organism; the human genome is the complete DNA of an individual person. DNA is a nucleic acid formed by a chain of nucleotides, each carrying one of four possible bases (adenine, cytosine, guanine, and thymine), represented by A, C, G, T, or N for an unknown base. We are interested in the nucleotide sequences of DNA fragments (for example, AAAGTCTGAC), which can be used to identify genetic diseases in an individual or to find evolutionary relationships. Viel discussed four types of simple variation that can occur within a given nucleotide sequence: single substitution (AAAGTCTGAC vs. AAACTCTGAC), insertion (AAAGTCTGAC vs. AAAGTCTGGAC), deletion (AAAGTCTGAC vs. AAGTCTGAC), and inversion (AAAGTCTGAC vs. AAATGCTGAC).

Looking for similar sequences manually is a tedious, time-intensive process prone to transcription errors. As an alternative, Viel showed how regular expressions in SAS can be used to look for matching sequences while allowing for one mismatching character, such as a single nucleotide substitution in a strand. He then introduced a SAS macro to call BLAST, a sequence similarity tool from NCBI that can be downloaded or used interactively on the web. NCBI’s website defines the tool as follows: “The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.” Viel also described how to set up BLAST on a Windows PC and configure the environment variables needed for the program to run successfully.
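
As a rough illustration of the regular-expression idea (a minimal sketch, not Viel’s macro), the data step below searches a few fragments for the target AAAGTCTGAC while tolerating a substitution at the fourth position; Viel’s technique generalizes this so the mismatch may occur at any position.

/* Minimal sketch (not Viel's macro): match AAAGTCTGAC while allowing a     */
/* substitution at position 4. The character class [ACGTN] accepts any base */
/* (or N) in place of the G.                                                 */
data seq_matches;
   length sequence $20;
   input sequence $;
   match = (prxmatch("/AAA[ACGTN]TCTGAC/", sequence) > 0);
   datalines;
AAAGTCTGAC
AAACTCTGAC
AAACTCCGAC
;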

Following Viel’s presentation, Vinodh Paida of Accenture/Octagon shared “Data Edit Checks Integration Using ODS Tagset”, applicable to SAS version 9.1.3 or higher. Although the paper was written specifically with clinical trials data and reporting in mind, it generalizes easily to other types of data and domains. First, Paida summarized five commonly encountered data issues centering on invalid dates and missing data in clinical trials: partial dosing start and stop dates (checked with the length function), future dates, subjects with final summary data but a missing dosing stop date, adverse events with missing terms, and lab data with missing units but available results.
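
The data step below is a minimal sketch of two of these checks. It is not Paida’s code; the data set DOSING and the variable DOSESTDT are hypothetical names, and the dosing start date is assumed to be stored as ISO 8601 text (for example, 2012-07 for a partial date).

/* Minimal sketch (not Paida's code): flag partial and future dosing start   */
/* dates in a hypothetical data set DOSING with character date DOSESTDT.     */
data edit_partial edit_future;
   set dosing;
   /* A complete ISO date is 10 characters (yyyy-mm-dd); shorter is partial  */
   if 0 < lengthn(strip(dosestdt)) < 10 then output edit_partial;
   /* The ?? modifier suppresses log notes when the text is not a valid date */
   if input(dosestdt, ?? yymmdd10.) > today() then output edit_future;
run;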

His SAS code contained blocks of edit checks for each scenario, followed by a macro to create a multi-sheet Excel workbook that includes a table-of-contents sheet listing the selected edit checks along with their descriptions and sheet names. Problem records for each edit check are then output to separate sheets of the workbook. The code is flexible, allowing the user to select which edit checks to output to Excel. This presentation reminded me of an earlier HASUG presentation which inspired my post on how to create a data dictionary in Excel.
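
Here is a minimal sketch of the multi-sheet idea using ODS TAGSETS.EXCELXP and the hypothetical edit-check data sets from the sketch above; it is not the paper’s macro, and the file and sheet names are placeholders.

/* Minimal sketch (not Paida's macro): write each edit-check listing to its  */
/* own worksheet of an Excel-readable workbook with ODS TAGSETS.EXCELXP.     */
ods listing close;
ods tagsets.excelxp file="edit_checks.xml"
    options(sheet_interval="proc" embedded_titles="yes");

ods tagsets.excelxp options(sheet_name="Partial start dates");
title "Edit check 1: partial dosing start dates";
proc print data=edit_partial noobs; run;

ods tagsets.excelxp options(sheet_name="Future start dates");
title "Edit check 2: dosing start dates in the future";
proc print data=edit_future noobs; run;

ods tagsets.excelxp close;
ods listing;
title;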

HASUG Meeting Notes: November 2011 (Social Media)

The 4th quarter 2011 HASUG meeting took place at Bristol-Myers Squibb in Wallingford, CT, on November 10th. Speakers included John Adams of Boehringer Ingelheim and David Kelly from Customer Intelligence at SAS Institute.

David Kelly presented SAS Institute’s Social Media Analytics software platform, designed to let companies process large volumes of unstructured data from internal and external social media and base business decisions on that data. His presentation, “The Power of Social Media Listening,” introduced the SAS Customer Intelligence organization at SAS and offered an engaging narrative of social media statistics (about 70% of YouTube and Facebook activity comes from outside the US, for example). He portrayed the potential and the landscape of social media and outlined the data challenges (punctuation, spelling, segmentation, acronyms, industry and social media jargon) associated with analyzing the unstructured text that makes up over 70% of social media data.

As an example of negative PR spreading through social media, Kelly cited the infamous viral YouTube video posted by a previously obscure singer-songwriter who watched United Airlines cargo handlers break his expensive Taylor guitar, an incident that reportedly cost United Airlines $180 million. Clearly, how corporations react to social media in real time can have serious financial implications. Kelly also discussed the “4 Cs” of social media: Content, Context, Connections, and Conversations. He noted the importance of being able to identify key “influencers” (a concept familiar to those acquainted with Malcolm Gladwell’s “Tipping Point”) and the origins of negative PR stories.

SAS’s solution for businesses that want to monitor and respond quickly to information about their brand circulating on social media sites is the SAS Social Media Analytics software platform. The platform crawls the web for industry- or company-specific information (largely unstructured text), capturing, cleaning, organizing, and analyzing that data as part of a customizable self-service application. The application lets your organization generate real-time reports, including comparisons against the competition, analysis of historical data and trend identification, “sentiment analysis” (currently supported in 13 languages), and much more. See my conference paper for an example of the kind of text analysis you can do using Base SAS.

HASUG Meeting Notes: November 2011 (define.xml)

The 4th quarter 2011 HASUG meeting took place at Bristol-Myers Squibb in Wallingford, CT, on November 10th. Speakers included John Adams of Boehringer Ingelheim and David Kelly from Customer Intelligence at SAS Institute.

John Adams’s presentation, “Creating a define.xml file for ADaM and SDTM,” addressed a current issue within the pharmaceutical industry as CDISC (Clinical Data Interchange Standards Consortium) moves to standardize the electronic submission of pharmaceutical studies to the FDA, in the interest of making the review process more efficient and consequently decreasing the time it takes for a new drug to reach the market. A define.xml file contains all the metadata needed to guide the reviewer through an electronic FDA submission. While software is readily available to create this file for SDTM submissions, only limited support exists for ADaM-compatible define.xml files. Adams’s presentation described how his organization addresses this problem.

Adams began with a short tutorial on XML schemas and style sheets before describing the process for creating ADaM-compatible define.xml files and discussing the methodology for capturing metadata. The XML tutorial, which was very well done, included a visual representation of basic XML structure, showing how root elements, child elements, attributes, and values are organized hierarchically. He also contrasted HTML and XML (HTML has a fixed set of standard tags, while XML tags are user-defined and described by a schema) and explained the requirement that the define.xml file be “well-formed XML” (as opposed to an XML fragment), listing the basics of well-formed XML as follows: an XML declaration, a unique root element, start and end tags, proper nesting of case-sensitive elements, quoted attribute values, and use of entities for special characters (&, <, >, etc.). Finally, he described the two components that accompany the define.xml file: the schema, an .xsd file that defines the file structure (elements, attributes, order and number of child elements, data types, default values) and validates the data; and the style sheet, an .xsl file that defines the layout for rendering the data (table of contents, tables, links) and is used to transform the XML into an HTML file that a browser can recognize and display.
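
To make the “well-formed” requirements concrete, here is a minimal sketch (not Adams’s code) of a DATA _NULL_ step that writes a tiny well-formed XML file with a declaration, a single root element, properly nested and closed child elements, and quoted attribute values; the element names are loosely modeled on define.xml metadata and are illustrative only.

/* Minimal sketch (not Adams's code): write a tiny well-formed XML file.     */
/* Element and attribute names are illustrative only.                        */
data _null_;
   file "example.xml";
   put '<?xml version="1.0" encoding="UTF-8"?>';
   put '<ItemGroupDef OID="IG.DM" Name="DM" Repeating="No">';
   put '   <Description>';
   put '      <TranslatedText xml:lang="en">Demographics</TranslatedText>';
   put '   </Description>';
   put '</ItemGroupDef>';
run;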

Next, Adams described the general CDISC schema, zeroing in on some of the more important elements, and provided a list of available software tools for developing XML files along with some of the challenges associated with each: CDISC software, the SAS Clinical Standards Toolkit (in Base SAS), the SAS XML Mapper (a Java-based GUI that is helpful for translating XML files to SAS data sets, but not vice versa), and the SAS XML LIBNAME engine (Base SAS). He described the process of capturing metadata in Excel to use as input for the SAS programs that output the define.xml file, highlighting the newer version 9 Excel libname feature in SAS (example syntax: LIBNAME WrkBk EXCEL 'My Workbook.xls' VER=2002; see the SUGI 31 paper http://www2.sas.com/proceedings/sugi31/024-31.pdf for more details, or refer to my previous post on 3 Ways to Export Your Data to Excel for other ways to use the Excel libname). He also shared a SAS macro using the tranwrd() function to replace special characters such as “&” and “<”, which must be represented in the XML document as “&amp;” and “&lt;”. Also of note: Adams recommended the Oxygen XML editor for debugging the XML code and making sure the file displays properly in Internet Explorer.

This was a very interesting discussion of how he and others at Boehringer successfully adapted CDISC schema and style sheets to produce an ADaM-compatible define.xml file; even for a non-pharmaceutical audience, his discussion of basic XML structure and the SAS tools used to solve this business problem could prove useful.
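
Below is a minimal sketch of the Excel-metadata-plus-tranwrd workflow described above. It is not Adams’s macro; the workbook, worksheet, and variable names are assumptions, and the Excel LIBNAME engine requires SAS/ACCESS Interface to PC Files.

/* Minimal sketch (assumed names, not Adams's macro): read variable metadata */
/* from an Excel workbook with the version 9 Excel LIBNAME engine and escape */
/* XML special characters with TRANWRD before writing them into define.xml.  */
libname wrkbk excel 'My Workbook.xls' ver=2002;

data metadata;
   set wrkbk.'Variables$'n;                /* hypothetical metadata worksheet */
   /* Escape & first so the other entities are not double-encoded */
   varlabel = tranwrd(varlabel, '&', '&amp;');
   varlabel = tranwrd(varlabel, '<', '&lt;');
   varlabel = tranwrd(varlabel, '>', '&gt;');
run;

libname wrkbk clear;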

HASUG Meeting Notes: May 2011

Northeast Utilities hosted May’s HASUG meeting in Berlin, CT. Both speakers focused on detecting fraud, first from store credit card issuer GE’s perspective, and then from a property and casualty insurance claims perspective.

“Usage of SAS in Credit Card Fraud Detection,” presented by GE employee Clint Rickards, began by introducing the most common types of credit card fraud and contrasting the challenges faced by PLCC (store card) issuers with those faced by bank card issuers. He presented an interesting statistic: half of all credit card fraud, as measured in dollars, occurs in only six states (CA, TX, FL, NJ, NY, and MI). He then discussed the general architecture of GE’s Risk Assessment Platform (RAP), designed to detect both real-time and post-transaction fraud, which uses the full SAS Business Intelligence suite of products: SAS/IntrNet, Enterprise Guide, Data Integration Studio, Management Console, SAS Scalable Performance Data Server, and Platform Flow Manager/Calendar Editor. Finally, he stressed the importance of automated processes, reusable code, breaking large jobs into smaller pieces to allow for easier debugging, and separating the testing and production environments.

Next, Janine Johnson of ISO Innovative Analytics presented “Mining Text for Suspicious P&C Claims,” describing how her consulting firm developed an automated process in Base SAS (labor-intensive to build, but cost-effective) for “mining” insurance claim adjusters’ notes, an unstructured text field, to produce data for use in a predictive model. She introduced the text mining process as follows: information retrieval, natural language processing, creating structured data from unstructured text, and evaluating structured outputs (classification, clustering, association, etc.). Before beginning this process, she emphasized the necessity of consulting a domain expert (in this case, someone in the P&C industry familiar with industry jargon and non-standard abbreviations). She then organized her own project into five steps of an iterative process: cleaning the text (using the upcase, compress, translate, and compbl functions), standardizing terms (using the regular expression functions prxparse, prxposn, and prxchange, as well as scan and tranwrd), identifying words associated with suspicious claims and grouping them into concepts (“concept generation”), flagging records containing those suspicious phrases, and finally using proc freq with the chi-square option to evaluate lift.
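
Here is a minimal sketch of the cleaning, standardizing, flagging, and lift-evaluation steps. It is not Johnson’s code; the data set, variables, terms, and fraud indicator are invented for illustration.

/* Minimal sketch (invented names, not Johnson's code): clean an adjuster-   */
/* note field, standardize a jargon term, flag a suspicious phrase, and      */
/* check lift against a hypothetical fraud indicator.                        */
data notes_clean;
   set claims;                                  /* hypothetical input data set */
   note = upcase(note);                         /* uniform case                */
   note = translate(note, '     ', '.,;:!');    /* turn punctuation into blanks */
   note = compbl(note);                         /* collapse repeated blanks     */
   /* Standardize a non-standard abbreviation with a regular expression */
   note = prxchange('s/\bVEH\b/VEHICLE/', -1, note);
   /* Flag records containing a phrase grouped under a "suspicious" concept */
   suspicious = (prxmatch('/STAGED ACCIDENT/', note) > 0);
run;

/* Evaluate lift of the flag against the known fraud indicator */
proc freq data=notes_clean;
   tables suspicious*fraud / chisq;
run;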

HASUG Meeting Notes: February 2011

The first quarter HASUG meeting on February 24, 2011, took place at Case Memorial Library in Orange, CT, from 10 am to 1 pm.

Santosh Bari, a SAS-certified professional currently with eClinical Solutions (a division of Eliassen Group in New London, CT), opened the meeting with his presentation “Proc Report: A Step-by-Step Introduction to Proc Report and Advanced Techniques.” Proc Report is a powerful report-generating procedure that combines many of the features of Proc Print, Proc Sort, Proc Means, Proc Freq, and Proc Tabulate. Mr. Bari’s presentation was a very in-depth discussion of proc report options and attributes, with code samples shown alongside the corresponding sample output. He did a thorough job of presenting the wide array of functionality included in proc report, including more advanced, lesser-known topics such as BREAK BEFORE/AFTER statements, COMPUTE blocks, and the PANELS and FLOW options.
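
For readers unfamiliar with these features, the short example below (my own sketch using the SASHELP.CLASS sample data set, not Mr. Bari’s code) shows a COMPUTE block and a BREAK AFTER summary in action.

/* Minimal sketch (not Mr. Bari's code): PROC REPORT with a COMPUTE block    */
/* and a BREAK AFTER summary, using the SASHELP.CLASS sample data set.       */
proc report data=sashelp.class nowd;
   column sex name age height ratio;
   define sex    / order;
   define name   / display;
   define age    / analysis format=5.1;
   define height / analysis format=6.1;
   define ratio  / computed format=6.2 'Height/Age';
   /* COMPUTE block: derive a new column on every report row */
   compute ratio;
      ratio = height.sum / age.sum;
   endcomp;
   /* BREAK AFTER: summary row plus a blank line after each SEX group */
   break after sex / summarize skip;
run;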

Following Mr. Bari, Charles Patridge of ISO Innovative Analytics presented “Best Practices: Using SAS Effectively/Efficiently.” His presentation, a compilation of a number of popular past topics, covered an effective naming convention for programs and files (along with compelling reasons for creating such a system), the creation of data dictionaries in Excel using a proc contents-based macro, and central macro autocall libraries. Mr. Patridge drew on his many years of consulting experience to argue for spending a little time up front to organize and name one’s programs and data sets in a way that makes the order of execution of the programs and the origin of the data sets transparent. When data set names correspond to the names of the programs that created them, a project becomes self-documenting and easier to hand off to others.
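
Below is a minimal sketch of the data-dictionary idea. It is my own example, not Mr. Patridge’s macro, and the PROC EXPORT step with DBMS=EXCEL assumes SAS/ACCESS Interface to PC Files is licensed.

/* Minimal sketch (not Mr. Patridge's macro): build a simple data dictionary */
/* from PROC CONTENTS metadata and export it to Excel.                       */
proc contents data=sashelp.class noprint
              out=dict(keep=memname name type length varnum label);
run;

proc sort data=dict;
   by memname varnum;
run;

proc export data=dict outfile="data_dictionary.xls" dbms=excel replace;
   sheet="Dictionary";
run;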