Using PROC PRINT to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and correct them. We can use SAS procedures to determine whether or not the data is clean. Today, we will cover the PROC PRINT procedure.

  • The first step is to identify the errors in the raw data file. Usually, the DVP/DVS (Data Validation Plan / Data Validation Specifications) section of the DMP (Data Management Plan) defines what is considered ‘clean’ and what counts as a data error.
    • Study your data
  • Then validate the data using the PROC PRINT procedure.
  • Finally, clean the data using DATA steps with assignment statements and IF-THEN-ELSE statements.

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-range values
  • Duplicate values

In the example below, our lab data ranges table contains missing values. We would also like to convert the lab test names to uppercase.

[Figure: Clinical raw data]
[Figure: PROC PRINT data validation code]
[Figure: PROC PRINT output, data validation]

From the screenshot above, our PROC PRINT program identified all missing / invalid values as per our specifications. We need to clean up 6 observations.
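A validation step of this kind might look like the minimal sketch below. It is not the program from the screenshot: the dataset name labranges, the variables LowRange and HighRange, and all data values are hypothetical, and only LabTest, Unit and EffectiveEndDate are names taken from this post. The WHERE statement makes PROC PRINT list only the observations that fail at least one check.

```sas
/* Hypothetical lab ranges data; names and values are illustrative only */
data labranges;
    infile datalines dsd truncover;
    input LabTest :$20. Unit :$10. LowRange HighRange EffectiveEndDate :date9.;
    format EffectiveEndDate date9.;
    datalines;
WBC,k/cumm,4.5,11.0,31DEC2025
rbc,,4.2,5.9,31DEC2025
PLATELETS,k/cumm,150,400,
HEMOGLOBIN,g/dL,17.5,13.5,31DEC2025
;
run;

/* Print only the observations that violate one of the assumed rules */
proc print data=labranges;
    where missing(Unit)                  /* missing character value     */
       or missing(EffectiveEndDate)      /* missing date                */
       or LowRange > HighRange           /* low limit above high limit  */
       or LabTest ne upcase(LabTest);    /* lab test not in uppercase   */
    title 'Lab ranges: observations that need review';
run;
title;
```

Listing only the failing rows keeps the output short enough to review by eye, which is where PROC PRINT is most useful. The actual checks should come from your own validation specifications.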

Cleaning Data Using Assignment Statements and If-Then-Else in SAS

We can use the DATA step to update the datasets/tables/domains when there are invalid or missing data as per protocol requirements.

In our example, we have a lab data ranges table for a study that has started, but certain information is missing or invalid.

To convert our lab test names to uppercase, we will use an assignment statement. For the rest of the data cleaning, we will use IF statements.
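As a rough sketch of that approach (not the original program), the DATA step below uses an assignment statement with the UPCASE function and IF-THEN / IF-THEN-ELSE statements for the remaining fixes. The dataset names, the sample rows and the Fixed flag are assumptions made for illustration; the replacement values k/cumm and 31DEC2025 are the ones used in this post.

```sas
/* Hypothetical input data; values are illustrative only */
data labranges;
    infile datalines dsd truncover;
    input LabTest :$20. Unit :$10. EffectiveEndDate :date9.;
    format EffectiveEndDate date9.;
    datalines;
wbc,k/cumm,31DEC2025
Platelets,,31DEC2025
Neutrophils,k/cumm,
;
run;

data labranges_clean;
    set labranges;

    /* Assignment statement: applied to every observation */
    LabTest = upcase(LabTest);

    /* IF-THEN-ELSE: record whether the row needed a fix (hypothetical flag) */
    if missing(Unit) or missing(EffectiveEndDate) then Fixed = 'Y';
    else Fixed = 'N';

    /* IF-THEN: fill the gaps with the values from the specification */
    if missing(Unit) then Unit = 'k/cumm';
    if missing(EffectiveEndDate) then EffectiveEndDate = '31DEC2025'd;
run;

proc print data=labranges_clean;
run;
```

Writing to a new dataset (labranges_clean here) rather than overwriting the input keeps the raw data available for comparison after cleaning.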

[Figure: PROC PRINT data cleaning code]

[Figure: Data validation and data cleaning, final dataset]

From our final dataset, we can verify that there are no missing values. We converted LabTest to uppercase and updated Unit and EffectiveEndDate to k/cumm and 31DEC2025, respectively.

You cannot use PROC PRINT to detect values that are not unique. We will do that in our next post, ‘Using PROC FREQ to Validate Clinical Data’. To find or remove duplicates, check out my previous post, ‘Finding Duplicate Data’.

Alternatively, use PROC SORT with the NODUPKEY option: proc sort data=<dataset> out=sorted nodupkey equals; by ID; run;
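As an illustration of the PROC SORT approach, the sketch below adds the DUPOUT= option so that the removed observations are written to a separate dataset for review instead of being silently dropped. The visits dataset, its variables and its values are hypothetical.

```sas
/* Hypothetical data with a repeated ID */
data visits;
    input ID Visit $;
    datalines;
101 V1
102 V1
101 V2
;
run;

/* NODUPKEY keeps one observation per BY value; EQUALS preserves the   */
/* original relative order within each BY group; DUPOUT= captures the  */
/* observations that were removed.                                     */
proc sort data=visits out=sorted dupout=dups nodupkey equals;
    by ID;
run;

/* Review what was removed before trusting the de-duplicated data */
proc print data=dups;
run;
```

Keep in mind that NODUPKEY compares only the BY variables, so two observations with the same ID but different values elsewhere are still treated as duplicates.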

How to Use SAS – Lesson 5 – Data Reduction and Data Cleaning

This video series is intended to help you learn how to program in SAS for your statistical needs. Lesson 5 introduces the concept of data reduction (also known as subsetting data sets). I discuss how to subset a data set (i.e. reduce its number of observations) based on some criteria using the IF statement in the DATA STEP or the WHERE statement in a PROC STEP. I also discuss using the KEEP and DROP statements to reduce data to only a handful of the original variables (i.e. reduce a data set’s number of variables), and the RENAME statement to rename them. Furthermore, I show how to label variables so that descriptive information appears in output, and how to apply value formats so that specific values are easy to understand. Finally, I provide basic examples of each of these for three hypothetical data sets.

Helpful Notes:

1. There are two places where you can reduce the data you analyze: in the DATA STEP and in the PROC STEP.

2. To subset data in the DATA STEP, use the IF statement.

3. To subset data in the PROC STEP, use the WHERE statement.

4. Another way to reduce data is to eliminate variables using a KEEP or DROP statement. This method is useful if you are creating a second data set or analytic version of your main dataset.

5. The RENAME statement simply changes a variable’s name.

Today’s Code:

data main;
input x y z;
cards;
1 2 3
7 8 9
;
run;

proc contents data=main; run;
proc print data=main; run;

/* 1. Reduce data in the DATA STEP using a simple IF statement */
data reduced_main; set main;
if x = 1;
run;

proc print data=main; run;
proc print data=reduced_main; run;

/* 2. Reduce data in the PROC STEP using a simple WHERE statement */
proc print data=main;
where x = 1;
run;

proc print data=main; run;
proc print data=reduced_main; run;

/* 3. Reduce data in the DATA STEP by KEEPing only the variables you do want */
data reduced_main; set main;
KEEP x y;
run;

proc print data=main; run;
proc print data=reduced_main; run;

/* 4. Reduce data in the DATA STEP by DROPping the variables you don't want */
data reduced_main; set main;
DROP y;
run;

proc print data=main; run;
proc print data=reduced_main; run;

/* 5. Clean up variables using the RENAME statement within a DATA STEP */
data clean_main; set main;
rename x = ID y = month z = day;
run;

proc contents data=main; run;
proc contents data=clean_main; run;

/* 6. Clean up variables using a LABEL statement within a DATA STEP */
data clean_main; set clean_main;
label ID = "Identification Number" month = "Month of the Year" day = "Day of the Year";
run;

proc contents data=main; run;
proc contents data=clean_main; run;

/* 7. FORMAT value labels using the PROC FORMAT and FORMAT statements */
PROC FORMAT;
value months 1="January" 2="February" 3="March" 4="April"
             5="May" 6="June" 7="July" 8="August"
             9="September" 10="October" 11="November" 12="December";
run;

data clean_main; set clean_main;
format month months.;
run;

proc freq data=clean_main;
table month;
run;

Anayansi Gamboa has an extensive background in clinical data management as well as experience with different EDC systems including Oracle InForm, InForm Architect, Central Designer, CIS, Clintrial, Medidata Rave, Central Coding, OpenClinica Open Source and Oracle Clinical.