Using PROC UNIVARIATE to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them. We can use SAS procedures to determine whether the data is clean. Today, we will cover the PROC UNIVARIATE procedure.

  • The first step is to identify the errors in the raw data file. Usually, the DVP/DVS section of our DMP defines what is considered ‘clean’ and what counts as a data error.
    • Study your data
  • Then validate using the PROC UNIVARIATE procedure.
  • Find extreme values

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-range values
  • Duplicate values

Previously, we used PROC FREQ to find missing/unique values. Today, we will use PROC UNIVARIATE, which is useful for finding outliers: data that fall outside the expected values.

proc univariate data=labdata nextrobs=10;
   var LBRESULT;
run;

[Figure: Lab data result using PROC UNIVARIATE]
For validating data, you will be most interested in the last two tables of this report. The missing values table shows that the variable LBRESULT has 260 missing values. There are 457 observations. The extreme observations table shows the lowest and highest values (possible outliers) in our data set. The NEXTROBS=10 option specifies the number of extreme observations to display in the report; to suppress that table, use NEXTROBS=0.
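As a follow-up, you may want to see the actual records behind the missing or extreme values so they can be queried. Here is a minimal sketch, assuming the same labdata data set and LBRESULT variable; the upper limit of 1000 is a made-up cutoff standing in for whatever range your DVP defines:

proc print data=labdata;
   /* list records that are missing or above the assumed upper limit */
   where LBRESULT = . or LBRESULT > 1000;
   var LBRESULT;
run;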

Using PROC FREQ to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them. We can use SAS procedures to determine whether the data is clean. Today, we will cover the PROC FREQ procedure.

  • The first step is to identify the errors in the raw data file. Usually, the DVP/DVS section of our DMP defines what is considered ‘clean’ and what counts as a data error.
    • Study your data
  • Then validate using the PROC FREQ procedure.
  • Spot distinct values

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-range values
  • Duplicate values

Previously, we used PROC PRINT to find missing/invalid values. Today, we will use PROC FREQ to view a frequency table of the unique values for a variable. The TABLES statement in a PROC FREQ step specifies which frequency tables to produce.

proc freq data=labdataranges nlevels;
   tables _all_ / noprint;
run;

So how many unique lab tests do we have in our raw data file? We know that our SAS data set has 12 records. The Levels column of this report shows that labtest has 3 unique values, which means we must have 9 duplicate labtest values in total. For this type of data [lab ranges], though, this is correct; we are using it only as an example, since you can check any type of data this way.

[Figure: PROC FREQ NLEVELS output]

[Figure: Lab test data ranges]
So remember: to view the distinct values for a variable, you use PROC FREQ, which produces frequency tables (one-way/n-way). You can view the frequency, percent, cumulative frequency, and cumulative percent. With the NLEVELS option, PROC FREQ displays a table that provides the number of distinct values for each variable named in the TABLES statement.
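For example, a minimal sketch of a one-way frequency table for the labtest variable, using the same labdataranges data set as above; without the NOPRINT option, PROC FREQ shows the frequency, percent, and cumulative statistics for each distinct value:

proc freq data=labdataranges;
   /* one-way table: frequency, percent, cumulative frequency/percent */
   tables labtest;
run;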

Example: the SEX variable has the correct values F or M as expected; however, it is missing for two observations.
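A minimal sketch of how you might surface those missing values (the demographics data set name demog is an assumption); the MISSING option tells PROC FREQ to treat missing values as a level in the frequency table:

proc freq data=demog;
   /* MISSING counts missing values as their own level in the table */
   tables SEX / missing;
run;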

[Figure: Missing values in PROC FREQ output]

Using PROC PRINT to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them. We can use SAS procedures to determine whether the data is clean. Today, we will cover the PROC PRINT procedure.

  • The first step is to identify the errors in the raw data file. Usually, the DVP/DVS section of our DMP defines what is considered ‘clean’ and what counts as a data error.
    • Study your data
  • Then validate using the PROC PRINT procedure.
  • We will then clean the data using DATA steps with assignment and IF-THEN-ELSE statements.

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-range values
  • Duplicate values

In the example below, we find missing values in our lab data ranges table. We would also like to convert the lab test values to uppercase.

[Figure: Clinical raw data]
[Figure: PROC PRINT data validation code]
[Figure: PROC PRINT output – data validation]

From the screenshots above, our PROC PRINT program identified all missing/invalid values as per our specifications. We need to clean up 6 observations.
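Since the program itself appears only as a screenshot, here is a minimal sketch of what such a check might look like. The data set name labdataranges and the exact validation rules are assumptions; the variable names labtest, unit, and EffectiveEnddate are taken from the cleaned data set described below, with EffectiveEnddate assumed to be a numeric SAS date:

proc print data=labdataranges;
   /* list records failing the assumed completeness checks */
   where labtest = ' ' or unit = ' ' or EffectiveEnddate = .;
run;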

Cleaning Data Using Assignment Statements and IF-THEN-ELSE in SAS

We can use the DATA step to update the data sets/tables/domains when there is invalid or missing data, as per protocol requirements.

In our example, we have lab data ranges for a study that has started, but certain information is missing or invalid.

To convert our lab test values to uppercase, we will use an assignment statement. For the rest of the data cleaning, we will use IF statements, as in the sketch below.
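Here is a minimal sketch of what such a DATA step might look like. The input and output data set names are assumptions; the variable names and replacement values (k/cumm and 31DEC2025) come from the final data set described below, and EffectiveEnddate is assumed to be a numeric SAS date:

data labdataranges_clean;
   set labdataranges;
   /* assignment statement: convert lab test values to uppercase */
   labtest = upcase(labtest);
   /* IF statements: fill in missing values per the specifications */
   if unit = ' ' then unit = 'k/cumm';
   if EffectiveEnddate = . then EffectiveEnddate = '31DEC2025'd;
run;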

[Figure: Data cleaning code]

[Figure: Data validation and data cleaning final dataset]
From our final data set, we can verify that there are no missing values. We converted labtest to uppercase and updated unit and EffectiveEnddate to k/cumm and 31DEC2025, respectively.
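To confirm, you could rerun the same completeness check against the cleaned data set; with clean data, this minimal sketch (using the assumed names from above) should print zero observations:

proc print data=labdataranges_clean;
   where labtest = ' ' or unit = ' ' or EffectiveEnddate = .;
run;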

You cannot use PROC PRINT to detect values that are not unique. We will do that in our next blog post, ‘Using PROC FREQ to Validate Clinical Data’. To find or remove duplicates, check out my previous post, Finding Duplicate Data,

or use PROC SORT with the NODUPKEY option:

proc sort data=<dataset> out=sorted nodupkey equals;
   by ID;
run;

How do you document the testing done on the edit checks?

Since the introduction of Electronic Data Capture (EDC) in clinical trials, where data is entered directly into the electronic system, it is estimated that errors (e.g., transcription errors) have been reduced by 70% [Clinical Data Interchange Standards Consortium – Electronic Source Data Interchange, 2005].

The Data Management Plan (DMP) defines the validation tests to be performed to ensure that data entered into the clinical database is complete, correct, allowable, valid, and consistent.

Within the DMP, we find the Data Validation Plan; some companies call it the ‘DVS’, others the ‘DVP’. The Good Practices for Computerized Systems in Regulated GxP Environments defines validation as the formal assessment and reporting of quality and performance measures for all the life-cycle stages of software and system development: implementation, qualification and acceptance, operation, modification, maintenance, and retirement.

As an EDC Developer or Clinical Programmer, you will be asked to:

  • Develop test scripts and execution logs for User Acceptance Testing (UAT).
  • Coordinate UAT of the eCRF build with clinical ops and data management team members, and handle the validation documents, including but not limited to the edit check document, issue logs, and the UAT summary report, as well as the preparation and testing of test cases.

Remember, not every EDC system is alike. Some systems allow you to perform testing on the programmed edit checks; others allow you to enter test data in a separate instance from production (PROD).

[Figure: Data Validation and UAT Module]

For example, some EDC systems facilitate re-usability:

  1. There is a built-in test section for each study – where data can be entered and are stored completely separate from production data. This allows you to keep the test data for as long as needed to serve as proof of testing.
  2. The copy function allows a library of existing checks (together with their associated CRF pages) to be copied into a new study. If there are no changes to the standard checks or pages, then reference can be made back to the original set of test data in a standards study, thus reducing the study-level overhead.
  3. Many of the required checks (missing data, range checks, partial dates, etc.) do not require the programming of an edit check at all. These and many others are already there as part of the question definition itself and therefore do not need any additional testing or documentation for each study.

“If you have not documented it, you have not done it.” – FDA

The “ideal world” scenario would be to reduce the actual edit check testing by having the system generate a more “human readable” format of the edit checks. That way, once the system is validated, testers would not have to test each boundary condition of the edit checks. All they would have to do is inspect the “human readable” edit checks against the alerts, and these would also be easy for clients to read and sign off on.

You can leverage the EDC system’s audit trail under certain conditions. First of all, the system you are testing with must itself be validated. Some EDC products are only ‘validated’ once a study is built on top of them; they are effectively further developed as part of the study implementation process, and in this situation I doubt you could safely use the audit trail.

Secondly, you need a mechanism whereby you can assure that each edit check has been specifically tested: traceability.

Finally, you need to secure the test evidence. The test data inside the EDC tool must be retained for as long as the study archive, as part of the evidence of testing.

The worst methods, in my view, are paper/screenshot based: they take too long and are largely non-reusable. In my past experience, this meant creating test cases in MS Word, performing each step as per the test case, taking a screenshot where indicated, and then attaching everything to the final documentation and validation summary. This is obviously a manual and tedious process. Some companies create test cases using HPQC or a similar tool, which is a bit more automated and traceable, yet still prone to errors; it is better than documenting in MS Word or Excel, but it is still a manual process.

Re-usability is what it is all about, but you need to ensure you have methods for assuring that the test evidence produced for the edit checks you are reusing is usable as part of the re-use exercise.

Edit check design, development, and testing is the largest part of any typical EDC implementation. Applying methods to maximize quality and minimize time spent here is one of the areas I have spent considerable time on over the last couple of years.

For additional tips on writing effective edit checks, please go here: Effective edit checks eCRFs.

Source images: provided courtesy of Google images.

-FAIR USE-
“Copyright Disclaimer Under Section 107 of the Copyright Act 1976, allowance is made for “fair use” for purposes such as criticism, comment, news reporting, teaching, scholarship, and research. Fair use is a use permitted by copyright statute that might otherwise be infringing. Non-profit, educational or personal use tips the balance in favor of fair use.”