Data validation is a critical step in maintaining the accuracy of an analysis or report. Input data may contain erroneous or missing values because of poor-quality data sources, and errors can also be introduced during the analysis when data sources are merged, joined, or manipulated incorrectly. Data validation should therefore be performed during the data cleansing stage prior to analysis, at the reporting stage, or both. Manually validating tabular reports with even several hundred records is time consuming and error prone, while errors in high-stakes reports are unacceptable and embarrassing.
Data can be validated in various respects: data types, formats, uniqueness, missing values where they are not accepted, cardinality, data integrity, business rules and logic, and so on. Data validation is usually an automated process in database systems, but the extent of validation varies from one system to another. Moreover, databases are not the only data sources for analytical tasks. Thus, the quality of a data set is not always guaranteed, and validation is crucial in analytical work. Automating data validation contributes greatly to the efficient generation of high-quality reports.
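As a rough illustration of the kinds of checks listed above, the sketch below runs a format check, a uniqueness check, and a completeness check in base R. The data frame and its values are made up for this illustration and are not part of the data set used later.

```r
# Toy records with deliberately planted problems (hypothetical data)
records <- data.frame(
  id    = c("100000001", "100000002", "100000002", "10000A004"),
  email = c("a@example.com", NA, "c@example.com", "d@example.com"),
  stringsAsFactors = FALSE
)

# Format check: id must be exactly nine digits
id_format_ok <- grepl("^[0-9]{9}$", records$id)

# Uniqueness check: flag every member of a duplicated group, not only the repeats
id_unique <- !(duplicated(records$id) | duplicated(records$id, fromLast = TRUE))

# Completeness check: email must be present
email_present <- !is.na(records$email)

checks <- data.frame(id_format_ok, id_unique, email_present)

# Number of failing records per check
failures <- colSums(!as.matrix(checks))
print(failures)
```

Each check returns one logical value per record, so the same vectors can be used both to count failures and to pull out the offending rows.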
Example
In this simple example, I present an automated process to validate a data set containing personal identification information (POI) using the validate package in R.
Data Preparation
I created a data set of fictitious POI for 1000 people using the generator package. The data set contains the fields listed in Table 1 below.
NOTE: The Over 18 column is a logical column derived from the dateofbirth column. Data created by the generator package do not contain any errors. Thus, I infused the data set with some possible errors, such as missing values, duplicates, typos, and inconsistent formats, so that they would be picked up during data validation. The complete code for data generation can be found at github.com/geethika01/Data-Validation/blob/...
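The linked script is not reproduced here. As a minimal sketch of how errors of this kind could be planted, the following uses base R on a small made-up data frame; the object names, values, and choice of rows are assumptions for illustration only, not the author's code.

```r
set.seed(42)  # reproducible choice of rows

# Clean toy data standing in for the generated data set (hypothetical values)
clean <- data.frame(
  id        = sprintf("%09d", 1:10),
  firstname = toupper(c("ann", "bob", "cara", "dan", "eve",
                        "finn", "gia", "hal", "ivy", "jon")),
  gender    = rep(c("F", "M"), 5),
  stringsAsFactors = FALSE
)

dirty <- clean

# Missing values: blank out gender in two randomly chosen rows
na_rows <- sample(nrow(dirty), 2)
dirty$gender[na_rows] <- NA

# Duplicates: copy the id of row 1 onto row 2
dirty$id[2] <- dirty$id[1]

# Typos: replace the first "A" in one first name with a digit
dirty$firstname[3] <- sub("A", "4", dirty$firstname[3])

# Inconsistent format: lower-case one first name
dirty$firstname[4] <- tolower(dirty$firstname[4])
```

Keeping the clean copy alongside the modified one makes it easy to confirm later that every planted issue was detected.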
Table 1. Summary description of the POI data set
The first few rows of the final data set are shown below.
Table 2. First few rows of the POI data generated and infused with erroneous values
The data issues are summarized below.
Table 3. Summary of data issues infused into the data set
Data Validation
The validate package checks the data against a given set of rules. Thus, I first define rules for the data validation, which include checks for data formats, missing values, uniqueness, and some logic listed in Table 2 above.
These rules are then summarized as labels in a vector of strings.
labels_lst <- c(
"id - consists of only 9 digits"
, "id - unique"
, "firstname - contains no digits"
, "firstname - Uppercase"
, "lastname - contains no digits"
, "lastname - Uppercase"
, "dateofbirth - is a valid date in YYYY-mm-dd format and less than current date"
, "email - in valid format"
, "email - is unique"
, "phone - in correct format (XXX XXX XXXX)"
, "phone - is unique"
, "gender - either M or F"
, "over18 - valid values 1,0, NA and calculation is correct"
)
Next, I express each rule as an R expression; the expressions are also listed in a vector. NOTE: The rules and their corresponding labels in the previous vector must be in the same order. The helper functions used in the rules are defined in the main script at github.com/geethika01/Data-validation/blob/...
rules_lst <- c(
# id
"ifelse(!is.na(dat$id),(nchar(dat$id)== 9 &
grepl('[0-9]{9}', dat$id)),NA)== T"
, "isDuplicated(dat$id)==T"
# firstname
, "ifelse(!is.na(dat$firstname), grepl('\\\\d', dat$firstname)==F, NA) == T"
, "isUpperCase(dat$firstname)==T"
# lastname
, "ifelse(!is.na(dat$lastname), grepl('\\\\d', dat$lastname)==F, NA) == T"
, "isUpperCase(dat$lastname)==T"
# dateofbirth
, "isValidDOBList(dat$dateofbirth)==T"
# email
, "isValidEmailList(dat$email)==T"
, "isDuplicated(dat$email)==T"
# Phone Number
, "ifelse(!is.na(dat$phone),
(grepl('[0-9]{3}[ ][0-9]{3}[ ][0-9]{4}',dat$phone) &
nchar(dat$phone) == 12), NA)==T"
, "isDuplicated(dat$phone)==T"
# gender
, "ifelse(!is.na(dat$gender), dat$gender %in% c('M', 'F'), NA)==T"
# over18
, "isValidover18List(dat$dateofbirth,dat$over18)==T"
)
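The helper functions called in these rules live in the linked script and are not shown here. To give a sense of what such helpers might look like, here is a sketch with three hypothetical stand-ins. The names is_unique_value, is_upper_case, and is_valid_dob and their exact behaviour are assumptions of this sketch, so the author's isDuplicated, isUpperCase, and isValidDOBList may well differ. Each returns TRUE for a pass, FALSE for a fail, and NA for a missing input, mirroring the ifelse(!is.na(...), ..., NA) pattern used in the rules above.

```r
# Hypothetical stand-ins; the helpers in the author's script may differ.

# TRUE when a value occurs exactly once, FALSE when it is repeated, NA when missing
is_unique_value <- function(x) {
  repeated <- duplicated(x) | duplicated(x, fromLast = TRUE)
  ifelse(is.na(x), NA, !repeated)
}

# TRUE when a string contains no lower-case letters, NA when missing
is_upper_case <- function(x) {
  ifelse(is.na(x), NA, x == toupper(x))
}

# TRUE when a string is a real date written as YYYY-mm-dd that lies before today,
# FALSE otherwise, NA when missing
is_valid_dob <- function(x) {
  well_formed <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parsed <- as.Date(x, format = "%Y-%m-%d")  # NA when the string does not parse
  ifelse(is.na(x), NA, well_formed & !is.na(parsed) & parsed < Sys.Date())
}

is_unique_value(c("a", "b", "a", NA))
is_upper_case(c("ANN", "Bob", NA))
is_valid_dob(c("1990-05-17", "1990-13-01", "17/05/1990", NA))
```

Returning NA for missing inputs keeps missing values separate from genuine failures, which is what lets the summary report them in their own column.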
Now I create a data frame of the labels and rules and validate the data set against the rules using the functions in the validate package. The result_validation object provides a concise summary of the validation in terms of the number of passes, fails, missing values, errors in the rules, and warnings, as in Table 4 below.
library(validate)  # validator(), confront(), values()
library(dplyr)     # %>%, select(), mutate()
# keep labels and rules as character strings rather than factors
df <- data.frame(label = labels_lst, rule = rules_lst, stringsAsFactors = FALSE)
v <- validator(.data = df)
cf <- confront(dat,v)
quality <- as.data.frame(summary(cf))
measure <- as.data.frame(v)
result_validation <- (merge(quality,measure)) %>%
select(label, items, passes, fails, nNA, error, warning)
The summary table (Table 4) can be used to readily identify the data issues in the tabular data. However, in order to identify the actual records with issues, it is useful to generate a more detailed outcome, as shown in Table 5.
Table 4. Summary of data validation
fail_vals <- data.frame(values(cf))
fail_vals <- as.matrix(fail_vals)
fail_vals<- as.data.frame(which(fail_vals==0, arr.ind=TRUE))
fail_vals <- mutate(fail_vals, label = labels_lst[fail_vals$col])%>%
select(-col) %>% mutate(id = dat[fail_vals$row, 1])
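library(stringr)  # str_split()
vals <- c()
# seq_len() skips the loop entirely when there are no failures
for (i in seq_len(nrow(fail_vals))) {
  # the field name is the part of the label before " - "
  vals[i] <- dat[fail_vals$row[i],
                 str_split(fail_vals$label[i], " - ")[[1]][1]]
}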
fail_vals <- cbind(fail_vals,vals)
Table 5. First few rows of the detail outcome of data with issues
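The step that turns the pass/fail results into a list of failing cells relies on which() with arr.ind = TRUE. The self-contained sketch below shows that mechanism on a small hand-written logical matrix, which is a hypothetical stand-in for the confrontation values, not output from the validate package.

```r
# Hypothetical pass/fail matrix: one row per record, one column per rule
outcome <- matrix(
  c(TRUE,  TRUE,
    FALSE, TRUE,
    TRUE,  NA,
    FALSE, FALSE),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("id - unique", "gender - either M or F"))
)

# Locate failing cells; NA cells (missing inputs) are skipped by which()
failed <- as.data.frame(which(outcome == FALSE, arr.ind = TRUE))

# Replace the column index with the readable rule label
failed$label <- colnames(outcome)[failed$col]

# The field name is the part of the label before " - "
failed$field <- sub(" - .*$", "", failed$label)

print(failed)
```

Because each label starts with the field name, the label alone is enough to look up the offending value in the original data.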
The validate package can be used to identify data issues both as a high-level summary and at the level of individual records, so that they can be traced back and fixed if needed. There are more elegant ways to summarize the validation results, such as graphical representations, as presented in the references. A comparison of Tables 3 and 4 shows that the infused data issues have all been captured by the validation rules.
In this example, the data validation rules I have implemented evaluate data formats, data types, and some business rules. However, I have not covered validation of data integration or the merging of data sources. A simpler alternative to the validate package for this kind of validation is to write two independent scripts that generate the same tabular report from the same inputs and compare the outputs using the compareDF package.
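To illustrate the idea of comparing two independently produced reports, the sketch below uses base R only, as a plainly swapped-in stand-in for the compareDF package, which is designed for this kind of comparison. The two data frames and their column names are hypothetical.

```r
# Hypothetical outputs of two independently written scripts
report_a <- data.frame(id = c(1, 2, 3), total = c(10, 20, 30))
report_b <- data.frame(id = c(1, 2, 3), total = c(10, 25, 30))

# Align the two reports on the key column so rows are compared like for like
both <- merge(report_a, report_b, by = "id",
              suffixes = c("_a", "_b"), all = TRUE)

# A row disagrees when the values differ or when a value is missing on either side
both$mismatch <- is.na(both$total_a) | is.na(both$total_b) |
  both$total_a != both$total_b

# Records that need investigation
print(both[both$mismatch, ])
```

If the two scripts agree, the final data frame is empty; any rows that remain point to records where the merge or calculation logic differs between the two implementations.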
References
Validate package - cran.r-project.org/web/packages/validate/vi..