The following are draft recommendations for a Good Programming Practice for analysis, reporting and data manipulation in Clinical Trials and the Healthcare industries.
The purpose is to encourage contributions from across companies, non-profit organizations and regulators in an attempt to create a consensus recommendation. The ambition is that this page becomes recognized by the Pharmaceutical Industry, Clinical Research and Health Care Organizations as well as Regulatory Authorities.
The hope is that the Practice can be reviewed and endorsed by the relevant management teams of several Pharmaceutical companies and major Contract Research Organizations, and promoted through relevant professional organizations such as PharmaSUG, PhUSE, PSI and CDISC.
The Good Programming Practices are defined in order to:
- Ensure the clarity of the code and facilitate code review;
- Save time in case of maintenance, and ease the transfer of code among programmers or companies;
- Minimize the need for code maintenance by robust programming;
- Minimize the development effort by development and re-use of standard code and by use of dynamic (easily adaptable) code;
- Minimize the resources needed at execution time (improve the efficiency of the code);
- Reduce the risk of logical errors;
- Meet regulatory requirements regarding validation and 21 CFR Part 11 compliance.
Note: As often, the various guidelines provided hereafter may conflict with one another if applied in too rigorous a way. Clarity, efficiency, re-usability, adaptability and robustness of the code are all important, and must be balanced in the programming practice.
Readability and Maintainability
Study protocols and study reports are mostly written in English for practical reasons (regulatory authorities, in-licensing, out-licensing, partnerships, mergers). It is therefore recommended to write SAS code and comments in English.
- Include a header for every program (template below).
    **********************************************************;
    * Program name :
    *
    * Author :
    *
    * Date created :
    *
    * Study : (Study number)
    *         (Study title)
    *
    * Purpose :
    *
    * Template :
    *
    * Inputs :
    *
    * Outputs :
    *
    * Program completed : Yes/No
    *
    * Updated by : (Name) – (Date):
    *              (Modification and Reason)
    **********************************************************;
- In addition to your name or initials, use your login ID to identify yourself in the header, so that there is no ambiguity about the identity of each programmer.
- Update the revision history at each code modification made after the finalization of the first version of a program.
Note: When you copy a program from another study, you become the author of this program, and you should clear the revision history. You can specify the origin of the program under the “Template” section of the header.
Below is an example with comments of an alternative comment block that I think is more useful for Open Source programming. PaulOldenKamp 16:54, 5 April 2009 (UTC)
    /** ----------------------------------------------------------------------------------------
    $Id: os3a_autoexec.sas 152 2008-11-17 01:48:40Z Paul_ok01 $
                 <== Id info automatically inserted with each commit to Subversion version control
    Application: OS3A - Common Programs
    Description: OS3A session initialization program.
    Previous Program: None
    Saved as: c:\os3a\trunk\os3a_autoexec.sas
              http://os3a.svn.sourceforge.net/viewvc/os3a/trunk/os3a_autoexec.sas
                 <== locations, local and web, where the pgm can be found
    Change History:
    Date        Prog  Ref  Description
    04/26/2008  PMO        Initial programming for os3a
                 <== date, programmer initials, ref number to link to specific location of change
    Copyright: Copyright (c) 2008 OS3A Program. All rights reserved.
    Copyright Contact: firstname.lastname@example.org
                 <== Always tell who owns the program so one can ask permission from the copyright holder
    License: Eclipse Public License v1.0
      This program and the accompanying materials are made available under
      the terms of the Eclipse Public License v1.0 which accompanies this
      distribution, and is available at www.eclipse.org/legal/epl-v10.html
                 <== Tell folks how they are licenced to use the program.
    Contributors:
      Paul OldenKamp, POK_Programming@OldenKamp.org - Develop initial pgm.
                 <== Identify significant contributors
                 <== tag identifies start of info used by the Codedoc Perl script to produce external HTML documentation
    @purpose OS3A system initialization program.
             Set up initial options and global macro variables.
    @param SYSPARM - input provided from program initiation call; default: main_FORE for production.
    @symbol sysRoot - system root location; Windows - C:/, UNIX - /
    @symbol remove_cmd - system command to remove file; Windows - erase, UNIX - rm
    @symbol os3aRoot - directory location of os3a root
    @symbol futsRoot - directory location of FUTS top level macros.
    @symbol Root - directory location of sub-project identified with Four Letter Acronym
    ----------------------------------------------------------------------------------------- */
The results from encoding the header and comments in a SAS program can be seen on the CodeDoc web page. See http://www.thotwave.com/products/codedoc.jsp.
- Include a comment before each major DATA/PROC step, especially when you are doing something complex or non-standard. Comments should be comprehensive, and should describe the rationale and not simply the action. For example, do not comment “Access demography data”; instead explain which data elements and why they are needed.
- Organize the comments into a hierarchy.
- Do not number the comments or the sections they introduce.
Reason: This avoids extensive renumbering when sections are removed or inserted.
- Use explicit and meaningful names for variables and datasets, with a maximum length of 8.
- For permanent datasets, use a meaningful dataset label and variable labels.
- Where possible, do not use the same name for more than one dataset within a program.
Note: However, keep in mind that large intermediate datasets take up a lot of space in the SAS WORK library.
- Name each IN= variable using “in” plus a meaningful reference to the dataset.
    data aelst;
        merge aesaes (in=inae) patpat (in=inpat);
        by patno;
        if inae and inpat;
    run;
- Labels must have a maximum length of 40 characters.
- It is mandatory to include libnames, options and formats in a separate setup program unless these are temporary formats or temporary options that are reset after being used.
Reason: It will guarantee that changes of the environment are taken into account in all programs run afterwards.
- Use standard company macros to read in libnames and settings, to write out datasets, and for standard calculating and reporting.
- Write one statement per line; several statements on one line are allowed if they are small and repeated or related. Long statements should be split across multiple lines.
- Control system settings so that all executed code is shown in the log by default, in as clear a manner as possible. The log should not be so lengthy that the programmer cannot easily navigate it (if it is, use highly visible comments with sufficient white space). System settings should be easy to change so that a user debugging a section of code or a macro can temporarily display the %included code, resolved macro names, and logic.
- Use a standard sequence for placing statements and group like statements together.
  - Within a program:
    - %LET statements and macro definitions
    - Input steps
    - Save final (permanent) datasets and created outputs
  - Within a DATA step:
    - All non-executable statements first (e.g. ATTRIB, LENGTH, KEEP…)
    - All executable statements next
Reason: It increases the readability of the program.
- Left-justify DATA, PROC, OPTIONS statements, indent all statements within.
    proc means data=osevit;
        var prmres;
        by prmcod treat;
    run;
- End every DATA/PROC step with a left-aligned RUN statement.
Reason: It explicitly defines the step boundary.
- Insert at least one blank line after each RUN statement in DATA/PROC steps.
- Indent statements within a DO loop, align END with DO.
- Avoid having too many nested DO loops and IF-ELSE statements.
- For nested DO loops, add a comment at the start (DO) and end (END) of each loop.
    data test01;
        do patno=1 to 40;          * cycle thru patients;
            do visit=1 to 3;       * cycle thru visits;
                output;
            end;                   * cycle thru visits;
        end;                       * cycle thru patients;
    run;
- Insert parentheses in meaningful places in order to clarify the sequence in which mathematical or logical operations are performed.
    data test02;
        set test01;
        if (visit=0 and vdate lt adate1) or
           (visit=99 and vdate gt adate2) then delete;
    run;
Draft section : this may not be specific to clinical programming, but may be of use when considering a general standard for sharing programs.
Use of analysis datasets
This section is intended for a discussion of why programming output directly from raw data is generally avoided.
- When you read or write a SAS dataset, use a KEEP= dataset option or KEEP statement (preferred to DROP) to keep only the needed variables.
Reason: With KEEP= on an input dataset, SAS loads only the specified variables into the Program Data Vector; the other variables are never loaded.
- When subsetting a SAS dataset, use a WHERE statement rather than a subsetting IF, if possible.
Reason: WHERE filters observations before they are read into the Program Data Vector, whereas IF filters them only after each observation has been loaded.
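The two recommendations above can be combined in a single step. The following is a minimal sketch only; the dataset VITALS and its variables are hypothetical and are created here purely so that the example runs on its own.

```sas
/* Hypothetical input data, built only to make the sketch self-contained */
data vitals;
    do patno = 1 to 5;
        do visit = 1 to 3;
            sbp   = 110 + patno + visit;
            dbp   = 70 + visit;
            pulse = 60 + patno;
            output;
        end;
    end;
run;

/* KEEP= on the input dataset limits the variables loaded into the PDV; */
/* WHERE filters the observations before they are loaded.               */
data vis2;
    set vitals (keep=patno visit sbp);
    where visit = 2;
run;
```

DBP and PULSE are never loaded in the second step, and observations for other visits are filtered out before they reach the DATA step logic.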
- When using IF condition, use IF/ELSE for mutually exclusive conditions, and check the most likely condition first.
Reason: The ELSE/IF will check only those observations that fail the first IF condition. With the IF/IF, all observations will be checked twice. Also, consider the use of a SELECT statement instead of IF/ELSE, as it may be more readable.
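As an illustration, the sketch below codes the same derivation once with IF/ELSE and once with SELECT. It assumes the SASHELP.CLASS sample dataset is available; the age bands and the variable names AGECAT and AGECAT2 are invented for the example.

```sas
data agegrp;
    length agecat agecat2 $8;
    set sashelp.class (keep=name age);

    /* Mutually exclusive conditions: a later branch is evaluated */
    /* only when all earlier conditions have failed.              */
    if age < 13 then agecat = 'Child';
    else if age < 16 then agecat = 'Teen';
    else agecat = 'Older';

    /* The same logic written with SELECT */
    select;
        when (age < 13) agecat2 = 'Child';
        when (age < 16) agecat2 = 'Teen';
        otherwise       agecat2 = 'Older';
    end;
run;
```

Which form reads better is a judgment call; SELECT tends to pay off when there are many branches.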
- Avoid unnecessary sorting. A CLASS statement can be used in some procedures to perform by-group processing without sorting the data.
    proc means data=osevit;
        var prmres;
        class treat;
    run;
- If possible (i.e. not a sorting variable), use character values for categorical variables or flags instead of numeric values.
Reason: It saves space. A character “1” uses one byte (if length is set to one), whereas a numeric 1 uses eight bytes.
- Use the LENGTH statement to reduce variable size.
Reason: Storage space can be reduced significantly.
Note: Keep in mind that a too limited variable length could reduce the robustness of the code (lead to truncation with different sets of data).
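A minimal sketch of the two storage recommendations above, using invented variable names (SEXCD, SAEFL) and values:

```sas
data flags;
    /* One byte for each character flag instead of eight for a numeric */
    length patno 8 sexcd $1 saefl $1;
    patno = 1001; sexcd = 'M'; saefl = 'Y'; output;
    patno = 1002; sexcd = 'F'; saefl = 'N'; output;
run;

/* PROC CONTENTS lists the stored length of each variable */
proc contents data=flags;
run;
```

If a later data cut delivers a two-character code, a $1 variable would silently truncate it, which is the robustness trade-off mentioned in the note.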
- Use simple macros for repeating code.
- Use the MSGLEVEL=I option in order to have all informational, note, warning, and error messages sent to the LOG.
- In the final code, there should be no dead code that does not work or that is not used. This must be removed from the program.
- Code that checks the program or the data (on all data or on a subset of patients, such as clean patients, discontinued patients, patients with SAEs or patients with odd data) is encouraged and should be built in throughout the program. Such code can be easily activated during the development phase and commented out for a production run using the technique described under “Activate/Deactivate Pieces of Code”.
- It is not acceptable to have avoidable notes or warnings in the log (mandatory).
Reason: They can often lead to ambiguities, confusion, or actual error (e.g. erroneous merging, uninitialized variables, automatic numeric/character conversions, automatic formatting, operation on missing data…).
Note: If such a warning message is unavoidable, an explanation has to be given in the program (mandatory).
- Always use DATA= in a PROC statement (mandatory).
Reason: It ensures correct dataset referencing, makes program easy to follow, and provides internal documentation.
- Be careful when merging datasets. Erroneous merging may occur when:
  - No BY statement is specified (set system option MERGENOBY=WARN or ERROR).
  - Some variables, other than the BY variables, exist in both datasets (set system option MSGLEVEL=I; SAS then writes an INFO message to the log whenever a MERGE statement would cause variables to be overwritten, in which case the values from the last dataset on the MERGE statement are kept).
  - More than one dataset contains repeats of BY values. Only a NOTE, not an ERROR, is written to the log. If you really need a many-to-many merge, PROC SQL is the usual way to perform it.
Reason: The log must be checked carefully and routinely, because the situations above produce INFO, NOTE or WARNING messages rather than ERROR messages, yet the resulting dataset is rarely correct.
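The sketch below shows the two system options together with a guarded merge and an explicit PROC SQL join. The datasets PAT and AE and their contents are hypothetical and exist only to make the example self-contained.

```sas
/* Make a MERGE without BY an error, and request informational messages */
options mergenoby=error msglevel=i;

data pat;
    input patno sex $;
    datalines;
1 M
2 F
3 M
;
run;

data ae;
    input patno aeterm $;
    datalines;
1 Headache
1 Nausea
3 Rash
;
run;

/* Both inputs are in PATNO order; only AE has repeats (one-to-many) */
data aelst;
    merge ae (in=inae) pat (in=inpat);
    by patno;
    if inae and inpat;
run;

/* The same join in PROC SQL; with repeats on both sides it would */
/* return every matching pair of rows.                            */
proc sql;
    create table aepairs as
    select a.patno, a.aeterm, b.sex
    from ae as a inner join pat as b
      on a.patno = b.patno;
quit;
```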
- When coding IF-THEN-ELSE constructs use a final ELSE statement to trap any observations that do not meet the conditions in the IF-THEN clauses.
Reason: You can only be sure that all possible combinations of data are covered if there is a final ELSE statement.
- When coding a user-defined FORMAT, include the keyword ‘other’ on the left side of the equals sign so that all possible values have an entry in the format.
Reason: A missing entry in a user-defined FORMAT can be difficult to detect. The simplest way to identify this potential problem is to ensure that all values are assigned a format.
Note: This does not apply to INFORMATs. It could be more helpful to get a WARNING message when trying to INPUT data of unexpected format.
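The following sketch combines a catch-all format entry with a final ELSE branch. It assumes the SASHELP.CLASS sample dataset; the format name $SEXF, the variable SEXN and the label text are invented for the example.

```sas
proc format;
    value $sexf
        'M'   = 'Male'
        'F'   = 'Female'
        other = 'UNEXPECTED VALUE';
run;

data chk;
    length sexn 8;
    set sashelp.class (keep=name sex);
    format sex $sexf.;

    if sex = 'M' then sexn = 1;
    else if sex = 'F' then sexn = 2;
    else do;
        /* Trap anything the earlier branches did not anticipate */
        sexn = .;
        put 'Unexpected value of SEX: ' name= sex=;
    end;
run;
```

Any value that falls through to OTHER or to the final ELSE is then visible in the output or in the log instead of passing unnoticed.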
- Try to produce code that will operate correctly with unusual data in unexpected situations (e.g. missing data).
Code for Data Checks
Build checks so that their purpose is clear, so that they can be toggled on or off, and remove them once they are no longer needed.
Activate/Deactivate Pieces of Code
In the beginning of the program, define a macro variable that you set to blank during the development phase or that you set equal to * for the production run:
%let c=; or %let c=*;
For the pieces of code that check the data/program, start each line with the macro variable defined above:
    &c title "Check the visits for each patient";
    &c proc freq data=patvis01;
    &c     table patno*visit;
    &c run;
This code will be executed if &c is blank (development), but will be commented out when &c=* (production).
Perform Checks on a Subset of Patients
In a separate code that you store under the study MACRO folder, list the subset of patients (clean patients, discontinued patients, patients with SAE or patients with odd data) that you want to look at:
    %macro select;
        2076 2162 2271 2449
    %mend;
In the beginning of the program, define a second macro variable that you set equal to * when you want to perform checks on all data or to blank when you are interested in a subset of patients:
%let s=*; or %let s=;
For each checking code, add a piece of code that allows subsetting the data, and start each line of this piece of code with the 2 macro variables defined above:
    &c title "Check the visits for each patient";
    &c proc freq data=patvis01;
    &c     table patno*visit;
    &c &s  where patno in (%select);
    &c run;
The check will be performed only if &c is blank, and it will be applied to all patients if &s=* or on the subset of patients if &s is blank.
Better still: input the list of check case IDs as a dataset.
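One possible way to do this is sketched below. The dataset names CHKIDS and PATVIS_CHK are invented, and PATVIS01 is recreated with dummy values only so that the example runs on its own.

```sas
/* Hypothetical list of check cases, stored as a dataset instead of a macro */
data chkids;
    input patno;
    datalines;
2076
2162
2271
2449
;
run;

/* Dummy visit data, created only for this sketch */
data patvis01;
    do patno = 2076, 2100, 2162;
        do visit = 1 to 3;
            output;
        end;
    end;
run;

/* Keep only the visits of patients on the check list */
proc sql;
    create table patvis_chk as
    select v.*
    from patvis01 as v
    where v.patno in (select patno from chkids);
quit;

title "Check the visits for each patient";
proc freq data=patvis_chk;
    table patno*visit;
run;
title;
```

Keeping the list in a dataset means it can be maintained, reviewed and version-controlled like any other study data.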
Floating Point Error
The real number system is infinite and continuous. Most computers use floating point representation, in which a finite set of numbers stands in for the infinite set of real numbers. Some values therefore cannot be stored exactly, and small errors appear from time to time. This is generally termed floating point error and is a consequence of hardware limitations.
The following paper goes some way towards explaining why and how this happens, and suggests possible ways of approaching the issue.
Paper reference:- http://www.lexjansen.com/phuse/2008/cs/cs08.pdf
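A small sketch of the problem and of two common workarounds follows. The tolerance of 1e-9 is an arbitrary choice for illustration and would need to be justified for real data; the result of the exact comparison depends on the platform.

```sas
data _null_;
    x = 0.1 + 0.2;

    /* Exact comparison: on most platforms this is 0 (false), because */
    /* neither 0.1 nor 0.2 can be stored exactly in binary.           */
    exact_equal = (x = 0.3);

    /* Workaround 1: round both sides to a common precision */
    rounded_equal = (round(x, 1e-9) = round(0.3, 1e-9));

    /* Workaround 2: compare against a small tolerance */
    fuzzed_equal = (abs(x - 0.3) < 1e-9);

    /* HEX16. shows the stored bit pattern of X */
    put x= hex16. exact_equal= rounded_equal= fuzzed_equal=;
run;
```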
Data Imputation versus Hardcoding
- Data Imputation
Integrity of a data transfer
At a minimum, all data transfers should be validated by checking the observation counts for each SAS dataset, or the record counts in other formats, against counts provided by the sender. It is also helpful if the sender can provide a checksum for each file transferred, since this also ensures that all content made its way to you without transmission errors. There are many freeware programs available to calculate checksums for any file at websites such as http://sourceforge.net/ .
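The observation-count check can be sketched as follows. The MANIFEST dataset stands in for the counts supplied by the sender, and the SASHELP library stands in for the library that received the transfer; both are assumptions made only so that the example is self-contained.

```sas
/* Hypothetical manifest from the sender: dataset name and expected count */
data manifest;
    length memname $32;
    input memname $ expected;
    datalines;
CLASS 19
CARS 428
;
run;

/* Compare the expected counts with what is actually in the library */
proc sql;
    create table xfer_check as
    select m.memname,
           m.expected,
           d.nobs as received,
           (m.expected = d.nobs) as counts_agree
    from manifest as m
         left join dictionary.tables as d
         on d.libname = 'SASHELP' and d.memname = m.memname;
quit;

proc print data=xfer_check;
run;
```

A dataset that is missing from the library shows a missing RECEIVED value and COUNTS_AGREE=0, so it is flagged as well. File checksums would have to be verified with a separate tool.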
Draft section : recommendations particularly relevant for the development and use of macros and macro libraries.
Macros are particularly useful under the following circumstances:
- Program code is used repeatedly
- A number of steps must be taken conditionally, and the logic for these is clearly fixed (no need to think of all the steps that should be included in a program under a specific situation: the macro will deduce them for you and generate the appropriate data step or proc step code)
- There is no trivial solution via “ordinary” SAS code
- Applying the macro must be easier than programming the code itself!
- Using the macro helps users avoid errors and omissions.
If used appropriately the following benefits can be achieved:
- Increase in quality by avoiding programming bugs and errors
- Savings in time and resources
- Enforcement of standards, e.g. standard methods and standard outputs
- Work can be more enjoyable as programmers can focus on the non-routine work
Ideally, macro development should follow a few rules:
- Macro headers should clearly state all changes to environment and data that result from execution. Changes should be limited to those necessary for the focused purpose of the macro:
  - strictly controlled changes to input data and creation of output data
  - clear temporary data set clutter
  - no unexpected changes to system settings (options, titles, footnotes, etc.)
  - no unexpected changes to external symbol tables
- Scope of macro variables should be explicitly controlled using %global and %local statements.
- Method of macro variable creation should demonstrate awareness of default scope:
  - a “%let” statement creates a local macro variable only if no variable of that name exists in an enclosing symbol table; otherwise it overwrites the existing external entry;
  - a “proc sql select into:” clause likewise creates a local macro variable when no variable of that name exists in an enclosing symbol table;
  - a “call symput” statement usually will not initialize a local symbol table: it writes to the most local symbol table that is not empty;
  - a “call symputx” statement can specify the symbol table (the global one, the most local one that exists, the most local one in which the macro variable exists, or create a local one) in which to store the macro variable.
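A minimal sketch of explicit scope control follows. The macro name SCOPE_DEMO and the variables I, N_LOCAL and N_GLOBAL are invented for the example.

```sas
%macro scope_demo;
    %* Explicit declaration: I cannot overwrite a variable of the caller;
    %local i;
    %let i = 1;

    data _null_;
        /* Third argument: 'L' = most local symbol table, 'G' = global */
        call symputx('n_local',  42, 'L');
        call symputx('n_global', 99, 'G');
    run;

    %put inside: i=&i n_local=&n_local n_global=&n_global;
%mend scope_demo;

%scope_demo

%* N_GLOBAL survives the macro call, whereas N_LOCAL was discarded with the local table;
%put outside: n_global=&n_global;
%put outside: does n_local exist? %symexist(n_local);
```

The last line assumes that no global variable named N_LOCAL was defined earlier in the session.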
- The log matters:
  - Use Base SAS techniques whenever possible to avoid excessive code generation (log bloat). For example, a macro definition should use DATA step array and DO loop processing rather than macro %DO looping.
  - But use pure Macro Language for routine utility macros (see details below).
  - Use an appropriate comment style in macro definitions to properly annotate the SAS log when MPRINT is on. For example, use %* style commenting to explain macro logic, but /* style commenting to explain resulting code. (Or * style or PUT statement commenting as appropriate.)
  - Allow the users to control the appearance of the log via MPRINT, SYMBOLGEN, and MLOGIC.
- Code within a macro definition should be germane, limited to the specific purpose of the macro. The use of a central repository for macros (“Macro library”) is suggested.
- Macro Library: Code for routine tasks (eg, parameter checking, system and environment checking, messaging, etc.) should be handled by dedicated utility macros. Code for such routine tasks should not overwhelm the current macro definition, obscuring the purpose, and creating unnecessary maintenance overhead and lack of consistency within a library.
- Macro Library: Parameter naming conventions should be used for common parameters such as input/output libnames and data sets. Explicit and transparent control of macro variable scope again becomes crucial to avoid accidental change of external symbol tables
- Macro Library: Use pure Macro Language definitions whenever possible to improve program flow and avoid producing unnecessary Base SAS code. Returning a list of data set variables, checking for macro variable existence and returning a data set observation count can all be achieved without Base SAS code. Such macros can be called “inline” without unnecessary overhead or interruption of program flow.
For example, instead of a %count_ds_obs definition that uses DATA step code and interrupts program flow, like

    %let n_obs = %count_ds_obs(DSIN=myData);
    %if &n_obs > 0 %then %do;
        ... more statements ...
    %end;

an inline, pure Macro Language implementation allows streamlined code:

    %if %inline_ds_obs(DSIN=myData) > 0 %then %do;
        ... more statements ...
    %end;
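The text does not give a definition for %inline_ds_obs. One possible pure Macro Language implementation is sketched below, using %SYSFUNC with the OPEN, ATTRN and CLOSE functions; the return value of -1 for a data set that cannot be opened and the wrapper macro DEMO are assumptions made for this example.

```sas
%macro inline_ds_obs(DSIN=);
    %local dsid nobs rc;
    %* Value returned when the data set cannot be opened;
    %let nobs = -1;
    %let dsid = %sysfunc(open(&DSIN));
    %if &dsid > 0 %then %do;
        %* NLOBS is the number of logical observations, excluding deleted ones;
        %let nobs = %sysfunc(attrn(&dsid, nlobs));
        %let rc = %sysfunc(close(&dsid));
    %end;
    %* The only text the macro generates is the count itself;
    &nobs
%mend inline_ds_obs;

%macro demo;
    %if %inline_ds_obs(DSIN=sashelp.class) > 0 %then %do;
        %put sashelp.class has observations;
    %end;
%mend demo;

%demo
```

Because the macro generates no DATA step or PROC code, it can be called in the middle of a macro expression, and even inside another step, without creating a step boundary.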
Source: Sunil Gupta, Senior SAS Consultant, Gupta Programming http://www.sascommunity.org/wiki/Good_Programming_Practice_for_Clinical_Trials