Tag Archives: clinical sas programmer

Becoming a Data Scientist {EDC Developer + Statistical Expert + Data Manager}

At an early age, I was drawn to computers. I did well in math; I love science and I started enjoying programming when my stepfather gave me a small computer to program games. This was my real experience with programming. I think the programming language was Basic. The computer had some built-in games and basic math problems in it but you could also play around with ‘Basic‘ codes and create your own.

Then I went to a technical school and into college where you take basic classes in information system /technology and took courses in telecommunication management.  Most of the courses were around IP, PBX and Network Administration.  As part of that curriculum, I took a basic programming course and VB.net. I really like that since it has a visual interface (drag and drop to create the interface) and when you click a button you create an event so I like the design aspect of it (I am known to be very creative) then I started to design for people (website design and development, small databases). A lot better than working in telecommunications. I thought VB was a great first language to learn. Later I took a Microsoft Access database development class and we learn database design (relational) and found out I was really good at that.

Before I graduated, I was already working for a well known pharmaceutical company as a database analyst within their data management and biometrics team. They really like what I did with their clinical operations data (investigator data – you know the one that now we need CTMS systems for nowadays). So this was a confirmation that ‘databases’ was my passion. I love designing it, managing and maintaining it.

During my early years in this industry, I spent a lot of time writing SQL codes and SAS programs.  We pulled the messy data (back in those years we used the Clintrial Oracle backend system) and very problem solving oriented. A business question was asked and we would go using either SQL or SAS and go into this messy database and figure it out the answer. I really enjoyed that.

In recent years, I take data from a {EDC} system then write scripts to summarize the data for reporting and put into a data warehouse and then I use a product called ‘IBM Cognos’, which points to the data warehouse to build those reports and worked with different users across different departments (a lot of different audiences for the data) with a lot of different interesting data in there. I have spent time using APIs to extract data via Web Services (usually in XML-ODM format) and generate useful reports in SAS or Excel XML.

People think that being a data analyst is just sitting around a computer screen and crunching data. A lot of it is design-oriented, people-oriented, and problem-solving. So when people ask a question, I get to dive into the data and figure it out the answer.

Next step is to get into predictive analytics and do more data mining and data forecasting.

Are you still excited about becoming a data scientist?

You can start by reading my blog about programming languages you should learn here!

Other tools and programming languages you should learn: Anaconda, R Programming, Python, Business Intelligence Software like Tableau, Big Data Analytics with Hadoop, create new representations of the data using HTML and CSS (for example when you use APIs, XML to extract data from third-party sources).

Anayansi, MPM, an EDC Developer Consultant and clinical programmer for the Pharmaceutical, Biotech, and Medical Device industry with more than 18 years of experience.

Available for short-term contracts or ad-hoc requests.  See my contact page for more details or contact me.

Fair Use Notice: Images/logos/graphics on this page contains some copyrighted material whose use has not been authorized by the copyright owners. We believe that this not-for-profit, educational, and/or criticism or commentary use on the Web constitutes a fair use of the copyrighted material (as provided for in section 107 of the US Copyright Law).

Using PROC UNIVARIATE to Validate Clinical Data

Using PROC UNIVARIATE to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them.  We can use SAS Procedures to determine whether or not the data is clean. Today, we will cover the PROC  UNIVARIATE procedure.

  • First step is to identify the errors in a raw data file. Usually, in our DMP, in the DVP/DVS section, we can identify what it is considered ‘clean’ or data errors.
    • Study your data
  • Then validate using PROC UNIVARIATE procedure.
  • Find extreme values

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-ranges values
  • Duplicate values

Previously, we used PROC FREQ to find missing/unique values. Today, we will use PROC UNIVARIATE which is useful for finding data outliers, which are data that falls outside expected values.

proc univariate data=labdata nextrobs=10;
var LBRESULT;
run;

Lab data result using Univariate

 

 

 

 

 

 

 

 

 

For validating data, you will be more interested in the last two tables from this report. The missing values table shows that the variable LBRESULT has 260 missing values. There are 457 observations. The extreme observations table can tell us the lowest and highest values (possible outliers) from our dataset. The nextrobs=10 specify the number of extreme observations to display on the report. To suppress it use nextrobs=0.

To hire me for services, you may contact me via Contact Me OR Join me on LinkedIn

 

Using PROC FREQ to Validate Clinical Data

Using PROC FREQ to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them.  We can use SAS Procedures to determine whether or not the data is clean. Today, we will cover the PROC FREQ procedure.

  • First step is to identify the errors in a raw data file. Usually, in our DMP, in the DVP/DVS section, we can identify what it is considered ‘clean’ or data errors.
    • Study your data
  • Then validate using PROC FREQ procedure.
  • Spot distinct values

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-ranges values
  • Duplicate values

Previously, we used PROC PRINT to find missing/invalid values. Today, we will use PROC FREQ  to view a frequency table of the unique values for a variable. The TABLES statement in a PROC FREQ step specified which frequency tables to produce.

proc freq data=labdataranges nlevels;
table _all_ / noprint;
run;

So how many unique lab test do we have on our raw data file? We know that our sas data set has 12 records. The Levels column from this report,  the labtest=3 uniques. Which means, we must have 9 duplicates labtest in total. For this type of data [lab ranges] though, this is correct. We are using it as an example as you can check any type of data.

Proc Freq sas

 

 

 

Lab test data ranges

 

 

 

 

 

 

 

 

 

 

 

 

So remember, to view the distinct values for a variable, you use PROC FREQ that produces frequency tables (nway/one way) . You can view the frequency, percent, cumulative frequency, and cumulative percentage. With the NLEVELS options, PROC FREQ displays a table that provides the number of distinct values for each variable name in the table statement.

Example: SEX variable has the correct values F or M as expected; however, it is missing for two observations.

Missing values proc freq

 

 

 

 

 

To hire me for services, you may contact me via Contact Me OR Join me on LinkedIn

Using PROC PRINT to Validate Clinical Data

Using PROC PRINT to Validate Clinical Data

When your data isn’t clean, you need to locate the errors and validate them.  We can use SAS Procedures to determine whether or not the data is clean. Today, we will cover the PROC PRINT procedure.

  • First step is to identify the errors in a raw data file. Usually, in our DMP, in the DVP/DVS section, we can identify what it is considered ‘clean’ or data errors.
    • Study your data
  • Then validate using PROC PRINT procedure.
  • We will clean the data using data set steps with assignments and IF-THEN-ELSE statements.

When you validate your data, you are looking for:

  • Missing values
  • Invalid values
  • Out-of-ranges values
  • Duplicate values

In the example below, our lab data ranges table we find missing values. We also would like to update the lab test to UPPER case.

Clinical Raw data
Proc Print data val code
PROC PRINT output – data validation

 

From the screenshot above, our PROC PRINT program identified all missing / invalid values as per our specifications. We need to clean up 6 observations.

Cleaning Data Using Assignment Statements and If-Then-Else in SAS

We can use the data step to update the datasets/tables/domains when there is an invalid or missing data as per protocol requirements.

In our example, we have a lab data ranges for a study that has started but certain information is missing or invalid.

To convert our lab test in upper case, we will use an assignment statement. For the rest of the data cleaning, we will use IF statements.

Proc Print data cleaning

 

 

 

 

 

 

 

Data Validation and data cleaning final dataset

 

 

 

 

 

 

 

From our final dataset, we can verify that there are no missing values. We converted our labTest in uppercase and we updated the unit and  EffectiveEnddate to k/cumm and 31DEC2025 respectively.

You cannot use PROC PRINT to detect values that are not unique. We will do that in our next blog ‘Using PROC FREQ to Validate Clinical Data’. To find duplicates/remove duplicates, check out my previous post-Finding Duplicate data.

or use a proc sort data=<dataset> out=sorted nodupkey equals; by ID; run;

To hire me for services, you may contact me via Contact Me OR Join me on LinkedIn

 

A quick way to find duplicates from external lab data

There are many different options to find a duplicate obs or data using SAS.

I have just received data from an external service provider and I want to compare it with my EDC lab data. My goal is to find any duplicates before is useful or before I can compare it with my lab data.

My external data has over 400 records so this will be a tedious job if I use excel (the original format). This review could take over a few hours, but with SAS, we can do this in less than 30 mins.

A simple SAS program:
over 400 records…

This is our SAS code:

SAS code to find duplicate data from an external source

This is our final output. We found the duplicates including empty rows.

external data
external data, lab data

 

 

 

 

 

 

 

 

It is important to note that not all duplicates can be deleted or found. Further review by your clinical team members is required. Once we are confident that duplicates have been removed, we know our data set is more accurate.

One last thing: How to get this data onto SAS so we can perform this task? Here’s a snippet:

  1. Import the file using PROC IMPORT (assuming you have DBMS ACCESS)
  2. You can use a permanent or temporary library. In this example, we created a temporary lib name = IMPORT
  3. We want to bring over the column names as var names using GETNAMES=YES statement

PROC IMPORT OUT= WORK.IMPORT
DATAFILE= “H:\Labs\LocalLab\LabDataReview.xls”
DBMS=EXCEL2000 REPLACE;
GETNAMES=YES;
RUN;

Now you have learned a quick way to find duplicates on your external data. I hope this piece of code will make any data manager job easier. Remember, you can always enhance this program (add labels, formats, etc) and improve efficiency.

To hire me for services, you may contact me via Contact Me OR Join me on LinkedIn