HASUG’s 4th quarter meeting, featuring speakers Kevin Viel and Vinodh Paida, took place at Boehringer Ingelheim in Danbury, CT. PharmaSUG speaker Kevin Viel led with “Using the SAS System as a Bioinformatics Tool: A Macro That Calls the Standalone BLAST Setup”. Before sharing the macro, Viel gave some background on genomics and BLAST. A genome is all the genetic information of an organism; the human genome is the complete DNA of an individual person. DNA is a nucleic acid formed by a chain of nucleotides, each carrying one of four bases (adenine, cytosine, guanine, and thymine), represented by A, C, G, and T, with N for unknown. We are interested in the nucleotide sequences of DNA fragments (for example, AAAGTCTGAC), which can be used to identify genetic diseases in an individual or to find evolutionary relationships. Viel discussed four types of simple variations that can occur within a given nucleotide sequence: single substitution (AAAGTCTGAC vs. AAACTCCGAC), insertion (AAACTGCCGAC), deletion (AAAGTCTGAC vs. AAGTCTGAC), and inversion (AAAGTCTGAC vs. AAATGCTGAC).
Looking for similar sequences manually is a tedious, time-intensive process that is prone to transcription errors. As an alternative, Viel discussed using regular expressions in SAS to look for matching sequences while allowing for one mismatching character, such as a single nucleotide substitution in a strand. He then introduced a SAS macro to call BLAST, a sequence similarity tool from NCBI that can be downloaded or used interactively on the web. NCBI’s website defines the tool as follows: “The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.” Viel also described how to set up BLAST on a Windows PC and configure the environment variables the program needs to run successfully.
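To make the one-mismatch idea concrete, here is a minimal sketch in Python rather than SAS. It is not Viel's macro or his actual patterns, which the talk summary does not reproduce. The function names and the approach of building one alternative per position are my own assumptions: each alternative swaps a single base for the character class [ACGTN], so a window matches if it differs from the query in at most one position.

```python
import re

def one_mismatch_pattern(query):
    """Compile a regex matching `query` with at most one substituted base.

    Hypothetical helper: one alternative per position, where that position
    is replaced by a wildcard class covering A, C, G, T and N (unknown).
    """
    alternatives = [
        query[:i] + "[ACGTN]" + query[i + 1:] for i in range(len(query))
    ]
    return re.compile("|".join(alternatives))

def find_near_matches(query, target):
    """Return (start, window, mismatches) for every window of `target`
    that equals `query` or differs from it by a single substitution."""
    pattern = one_mismatch_pattern(query)
    hits = []
    # Slide over every window so overlapping hits are not skipped.
    for start in range(len(target) - len(query) + 1):
        window = target[start:start + len(query)]
        if pattern.fullmatch(window):
            mismatches = sum(a != b for a, b in zip(query, window))
            hits.append((start, window, mismatches))
    return hits
```

Calling `find_near_matches("AAAGTCTGAC", some_longer_sequence)` would list exact hits and single-substitution hits with their positions. Insertions and deletions shift the alignment and are not caught by this pattern, which is part of why a dedicated tool like BLAST is the better fit for real searches.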
Following Viel’s presentation, Vinodh Paida of Accenture/Octagon shared “Data Edit Checks Integration Using ODS Tagset”, applicable to SAS versions 9.1.3 or higher. Although the paper was written specifically with regard to clinical trials data and reporting, it generalizes easily to other types of data and domains. First, Paida summarized five types of commonly encountered data issues centering around invalid dates and missing data in clinical trials: partial dosing start and stop dates (checked for with the length function), future dates, subjects with final summary data but a missing stop date, adverse events with missing terms, and lab data with missing units but available results.
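The five checks above can be sketched roughly as follows. This is an illustration in Python, not Paida's SAS code. The record layout (dictionaries with keys such as `subject`, `start`, `stop`, `term`, `result`, `units`), the function names, and the assumption that dates are stored as ISO 8601 text (YYYY-MM-DD, 10 characters) are all hypothetical.

```python
from datetime import date

def is_partial_date(value):
    """Length-based check: non-empty text shorter than the 10 characters
    of a full YYYY-MM-DD date is treated as a partial date (assumed format)."""
    text = (value or "").strip()
    return 0 < len(text) < 10

def is_future_date(value, today):
    """True only for a complete, parseable date that falls after `today`."""
    text = (value or "").strip()
    if len(text) != 10:
        return False
    try:
        return date.fromisoformat(text) > today
    except ValueError:
        return False

def run_edit_checks(dosing, summarized_subjects, adverse_events, labs, today):
    """Return a dict mapping each (hypothetical) check name to its problem records."""
    return {
        "partial_dose_dates": [
            r for r in dosing
            if is_partial_date(r.get("start")) or is_partial_date(r.get("stop"))
        ],
        "future_dose_dates": [
            r for r in dosing
            if is_future_date(r.get("start"), today)
            or is_future_date(r.get("stop"), today)
        ],
        # Subject has final summary data but no dosing stop date.
        "summary_without_stop": [
            r for r in dosing
            if r["subject"] in summarized_subjects
            and not (r.get("stop") or "").strip()
        ],
        "ae_missing_term": [
            r for r in adverse_events if not (r.get("term") or "").strip()
        ],
        # A result is present but its units are not.
        "lab_missing_units": [
            r for r in labs
            if r.get("result") is not None and not (r.get("units") or "").strip()
        ],
    }
```

Keeping each check as its own named entry mirrors the idea of one block of code per scenario, and makes it straightforward to report each set of problem records separately.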
His SAS code contained blocks of edit checks for each scenario, followed by a macro to create a multi-sheet Excel workbook including a table of contents (TOC) that lists the selected edit checks along with their descriptions and sheet names. Problem records for each edit check are then output to separate sheets of the workbook. The code lets the user select which edit checks to output to Excel. This presentation reminded me of an earlier HASUG presentation that inspired my post on how to create a data dictionary in Excel.
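As a rough sketch of the TOC idea only (the workbook itself is produced by the ODS tagset in Paida's paper, which is not reproduced here), the Python snippet below builds one TOC row per selected check. The function name `build_toc`, the check identifiers, and the numbered sheet-naming scheme are assumptions for illustration; the 31-character cap reflects Excel's limit on sheet names.

```python
def build_toc(selected, catalog):
    """Return one TOC row (sheet name, check id, description) per selected check.

    `selected` is the user's ordered list of check ids; `catalog` maps every
    available check id to its description. Both are hypothetical structures.
    """
    rows = []
    for number, check_id in enumerate(selected, start=1):
        # Prefix with a running number, then trim to Excel's 31-character limit.
        sheet = f"{number:02d}_{check_id}"[:31]
        rows.append(
            {"sheet": sheet, "check": check_id, "description": catalog[check_id]}
        )
    return rows
```

Only the checks the user selects appear in the TOC, and each row's sheet name tells the reader where that check's problem records would be found.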