Data Integrity Means Everything

The following article is a guest post to Zephyr from Michael Butrym, Software Quality Capability Leader and Senior Director, Technology Solutions at Accenture.

“Virtually everything in business today is an undifferentiated commodity, except how a company manages its information. How you manage information determines whether you win or lose. How you use information may be the one factor that determines its failure or success – or runaway success.”

The quote above is widely attributed to a Bill Gates interview in The Sunday Times from 1999. It is one of the many insights he had into modern business, and it is even more relevant today than it was 17 years ago. What Bill doesn't state is that the data must be accurate and correct in order to be useful.

So, how do we, as a QA Community, ensure our Data Integrity, so that we can head toward that ‘runaway success’ that Bill spoke about? 

What Does Data Integrity Mean to Software Testing?

First off, let's talk about how we test applications. It is generally well accepted that when testing an application, web page, web service, or mobile app, we focus on the functionality of the product under test. We assume that the data we are using is accurate and correct, and is there to allow us to validate the functionality (data is necessary for this step, but we focus on changes to the data caused by improper functionality, not on the data itself).

Many applications, unless they are absolutely new version 1.0 releases, come with some legacy data that forms the basis of their historical functionality.  This may be a database of customer information, historical sales orders, drug interactions, page views, demographic info, ad tracking info, etc.  As testers, we tend to focus on the features and changes in functionality, but we have to realize that issues with the data that these features and functions operate on can have just as big an impact on the user experience as buggy code.

So let's take a tour of the data validation controls that ensure the integrity and quality of solution data.

The basics of data validation

Confirm Row Count – The Most Basic of Data Integrity Controls

The ETL process (data extraction, transformation, and loading) is the start of the data validation process.  While it may seem obvious, it's the first step to ensuring that we are working with a complete set of data, and it is often overlooked by testers whose focus is on new features, not old data.  During the data loading process, we may be loading millions of records from a few, or a few dozen, domains (sources).  The sum total of those sources is our universe of data that drives the solution – missing records from any source could manifest as orphans, incorrect aggregates, or improper outputs and behaviors down the road.  The initial step toward ensuring Data Integrity, then, is counting the rows that we've loaded into the application environment and ensuring that our sum total of rows across all domains is represented in staging (including records discarded or flagged as errored, according to ETL rules).  An error at this stage should be cause for alarm and be investigated immediately.  It could be a domain (source) issue (data delivered to the application in an invalid format), a transformation issue (ETL rules implemented incorrectly), or a bug in the parsing and loading routines of the application.  Missing data is often blamed on advanced application functionality, but it is quite frequently the result of a data loading issue.  It can be caught early in the software development process if the Analyst does this basic validation at the start.
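To make this concrete, here is a minimal sketch of a row count reconciliation. It assumes a hypothetical staging database with an etl_control table (counts reported by each source) and a staging_records table; the actual table and column names will vary by solution:

import sqlite3

# Connect to the staging database (hypothetical file name for this sketch).
conn = sqlite3.connect("staging.db")

# Row counts reported by each source domain, as recorded by the ETL control process.
expected = dict(conn.execute(
    "SELECT source_name, expected_row_count FROM etl_control"))

# Rows actually landed in staging, including records flagged as errored or discarded.
loaded = dict(conn.execute(
    "SELECT source_name, COUNT(*) FROM staging_records GROUP BY source_name"))

# Any mismatch should be investigated immediately, before functional testing begins.
for source, expected_count in expected.items():
    actual = loaded.get(source, 0)
    if actual != expected_count:
        print(f"Row count mismatch for {source}: expected {expected_count}, loaded {actual}")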

 

Verifying Data Completeness

Completeness, while closely related to Row Count, is subtly different.  Completeness of the data refers to the state where all of the data that is in the source domains is present in the target (which could be a database, flat files, XML-based data storage, etc., depending on the application).  The movement of the data from the source domains to its eventual home is accompanied by processing (ETL) that frequently manipulates the data in ways that are critical to the application under test.  For example, we may see length and type conversions, data substitutions for look-up values (looking up a description based on a numeric code, for example), normalization of phone numbers and zip codes, combining of duplicate records, and more.  This is the step where we may 'discard' data from the source domain based on business rules defined in the ETL (mentioned in Row Count, above).  Completeness checks are driven by the business rules, and also account for mismatched records, missing records, or extra records.  Failure by the QA team to verify data Completeness can be a cause for much puzzlement and hand-wringing when failures and odd behaviors show up later on during functional testing.
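As a rough sketch of a completeness check, the query below looks for source records that neither made it into the target nor appear in the ETL's discard log. The source_records, target_records, and etl_discard_log names are assumptions for illustration:

import sqlite3

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# Source keys that were neither loaded into the target nor intentionally
# discarded by an ETL business rule are genuinely missing.
missing = conn.execute("""
    SELECT s.record_key
    FROM source_records s
    LEFT JOIN target_records t ON t.record_key = s.record_key
    LEFT JOIN etl_discard_log d ON d.record_key = s.record_key
    WHERE t.record_key IS NULL
      AND d.record_key IS NULL
""").fetchall()

print(f"{len(missing)} source records are unaccounted for in the target")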

 

Ensuring Consistency in Data

Tests for consistency ensure that the source data and the target data agree with each other.  In other words, they do not contain conflicting facts, which may be introduced by flaws in the transformation and integration logic.  In the example of a retail application, a failure in consistency might manifest itself as two products in the Product table that have the same description but different Product Identifiers.  A bug of this nature can show itself intermittently in the application itself, depending on how the data is selected for display (e.g. the tester will see the correct description for the product, but the internal ID, which is generally hidden from the user interface, could be one of two different identifiers, causing downstream errors in advanced processing).  In this case, testing for data consistency would involve seeking out duplicate records, or alternately ensuring the uniqueness of certain fields in the records, within critical data tables according to the requirements and design specifications.
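A simple consistency check for the retail example above might look like the sketch below, which flags any description mapped to more than one Product Identifier (the Product table and column names are illustrative):

import sqlite3

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# A description associated with more than one Product Identifier is a conflicting fact.
conflicts = conn.execute("""
    SELECT description, COUNT(DISTINCT product_id) AS id_count
    FROM Product
    GROUP BY description
    HAVING COUNT(DISTINCT product_id) > 1
""").fetchall()

for description, id_count in conflicts:
    print(f"Consistency failure: '{description}' maps to {id_count} Product Identifiers")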

 

 

Checking Validity of Data

This is where a strong understanding of the business need or function of the application under test is critical.  Validity checks rely on business knowledge and the intended use of the application as a base for data integrity.  Let's use the example of an automobile insurance management application that contains customer data, automobile data, and policy data.  Validity checking relies heavily on knowledge of the business, and in some cases on common sense.  In this instance, if we see that Joe Smith is insuring 9,999,999 cars on his policy, then we've come across a failed validity check.  Another example might be seeing that Joe Smith was born in 1792.  This type of data error won't necessarily be detected during completeness or consistency validation, because in both cases the data may be viewed as complete and consistent.  It does not violate ETL rules (the data is in the proper format, and exists in the source streams).  But it is obvious that one individual cannot insure almost ten million vehicles, and is certainly not still driving if they were born over 200 years ago!  Therefore the data is not valid, falling outside of an acceptable range of values as determined by strong knowledge of the business and the requirements for the application.  A QA Team that is well versed in the business and the application requirements can seek out and rectify data validity issues early in the testing, circumventing a time-consuming and frustrating investigation into these issues late in the testing lifecycle.
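In practice a validity check is a range check driven by business knowledge. The sketch below uses an assumed Policy table and assumed thresholds for the insurance example; the real limits would come from the requirements:

import sqlite3
from datetime import date

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# Thresholds are illustrative assumptions; real values come from business requirements.
MAX_VEHICLES_PER_POLICY = 50
OLDEST_PLAUSIBLE_BIRTH_YEAR = date.today().year - 110

invalid = conn.execute("""
    SELECT customer_name, vehicle_count, birth_year
    FROM Policy
    WHERE vehicle_count > ?
       OR birth_year < ?
""", (MAX_VEHICLES_PER_POLICY, OLDEST_PLAUSIBLE_BIRTH_YEAR)).fetchall()

for name, vehicles, born in invalid:
    print(f"Validity check failed for {name}: {vehicles} vehicles, born {born}")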

 

Spotting Data Redundancy

Testing for data redundancy means looking for duplicates in the data in key areas where duplication will cause functional issues.  These duplicates could be physical duplicates or logical duplicates.  Physical duplicates are two or more records in which every field is exactly the same as the corresponding field in a different record.  An example from a human resources application might include Employee table data where all fields (Emp_ID, FirstName, LastName, Department) are the same in two or more records.  This type of scenario can be caused by loading legacy data incorrectly, issues in the source data, or bugs in the application code (e.g. the developer does an Insert instead of an Update, or doesn't verify that Deletes complete successfully).  In any logical application this situation would violate the business intent of the solution requirements or design.  Logical duplicates, by contrast, contain duplicate values only in the fields that are used as a business key for that record.  Using the same Employee table example, we might see FirstName, LastName, and Department being the same for two or more records, but Emp_ID is not.  In this case we clearly have one employee in the system with TWO Emp_IDs, in violation of the business requirements.  This type of data redundancy can cause headaches in testing if the Analyst is testing through the user interface only.  However, it is easy to spot when querying the data directly – a little bit of time up front examining the data for physical and logical redundancy can save a great deal of time later on in the software testing lifecycle.
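Both kinds of redundancy are easy to find with direct queries. The sketch below uses the Employee table from the example: physical duplicates repeat every field, while logical duplicates repeat the business key (FirstName, LastName, Department) under different Emp_IDs:

import sqlite3

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# Physical duplicates: every field identical across two or more Employee records.
physical = conn.execute("""
    SELECT Emp_ID, FirstName, LastName, Department, COUNT(*) AS copies
    FROM Employee
    GROUP BY Emp_ID, FirstName, LastName, Department
    HAVING COUNT(*) > 1
""").fetchall()

# Logical duplicates: the business key repeats but Emp_ID differs,
# i.e. one employee has been given two identifiers.
logical = conn.execute("""
    SELECT FirstName, LastName, Department, COUNT(DISTINCT Emp_ID) AS id_count
    FROM Employee
    GROUP BY FirstName, LastName, Department
    HAVING COUNT(DISTINCT Emp_ID) > 1
""").fetchall()

print(f"{len(physical)} physical duplicate groups, {len(logical)} logical duplicate groups")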

 

Testing Referential Integrity

Simply put, a database that passes the referential integrity check is one that has no orphan records.  As an example, consider an order processing database where we have Order records (which contain references to who ordered, and when) and Order_Detail records (which specify what was ordered, the quantity, etc.).  These two data tables are linked by the Order_ID field.  Orphans exist when an Order record exists without an associated Order_Detail record, or Order_Detail records exist without a corresponding Order parent record.  A system that contains this type of orphan can produce invalid aggregation of orders and inconsistency in sales and order reporting.  By validating the referential integrity of the data, the QA team can avoid opening bugs against the code when the issue is really at the data level.
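A referential integrity check boils down to a pair of anti-joins. In the sketch below the parent table is named Orders (plural, to avoid SQL's reserved word ORDER) and is linked to Order_Detail by Order_ID, per the example above:

import sqlite3

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# Order_Detail rows with no parent Order record (orphaned children).
orphan_details = conn.execute("""
    SELECT d.Order_ID
    FROM Order_Detail d
    LEFT JOIN Orders o ON o.Order_ID = d.Order_ID
    WHERE o.Order_ID IS NULL
""").fetchall()

# Order records with no Order_Detail rows at all (childless parents).
childless_orders = conn.execute("""
    SELECT o.Order_ID
    FROM Orders o
    LEFT JOIN Order_Detail d ON d.Order_ID = o.Order_ID
    WHERE d.Order_ID IS NULL
""").fetchall()

print(f"{len(orphan_details)} orphaned Order_Detail rows, "
      f"{len(childless_orders)} Orders with no Order_Detail")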

 

 

Identifying Violations of Domain Integrity

Throughout a Client solution the data representing domain values must remain valid.  If we are loading data from the Department feed (lists all valid departments within a company) into a Department_lookup table, we need to ensure that all of the values that get populated after ETL processing are valid for that domain.  If values like “JohnSmith” or “Polo Shirt, blue” end up in this table, we have a clear violation of domain integrity which needs to be fixed.  Other domain integrity violations might include insertion of NULL values or truncation of department names, in this example.
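A domain integrity check compares the loaded look-up values back against the feed that defines the domain. The sketch below assumes the feed has been staged in a department_feed table alongside the Department_lookup table from the example:

import sqlite3

conn = sqlite3.connect("staging.db")  # hypothetical database for this sketch

# Any value in Department_lookup that does not appear in the Department feed
# violates domain integrity; NULLs and truncated names will also fail to match.
violations = conn.execute("""
    SELECT l.department_name
    FROM Department_lookup l
    LEFT JOIN department_feed f ON f.department_name = l.department_name
    WHERE f.department_name IS NULL
""").fetchall()

for (value,) in violations:
    print(f"Domain integrity violation in Department_lookup: {value!r}")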

 

 

Validating the Accuracy and Usability of Data

All of the above Data Integrity validations lead us to verify the accuracy and usability of the application under test from the perspective of the data the code acts upon.  If the data in the solution mirrors real-world business scenarios and satisfies the daily business functional needs, then we can say that we've built an accurate and usable system.

A failure in any of the integrity checks has a dramatic effect on the accuracy and usability of the system, since incorrect data is propagated to downstream processes and systems used in reporting, campaigns, and forecasts.

From the Quality Assurance perspective, taking some time up front to validate the integrity of the data can keep data-related bugs from clouding the bug landscape as the application is tested.  In the case where development and QA use different data sets for testing, this is especially critical as a bug opened in QA may not be reproducible with developer data. 

So keep an eye on the data, as well as the code!  If your business wants that runaway success, use your data wisely and ensure its Data Integrity.

 

Author Mike Butrym: Michael leads Accenture's Quality Assurance Capability, which focuses on software testing, data validation, and the overall quality of Accenture's client solutions.

Michael has extensive experience, over his 20+ year career in the industry, in software development, testing, and Quality Assurance centered around Database Marketing Technology. Before Accenture, Mike was the Senior Director of QA at Merkle, where he built a high-performance Testing Team based in the US and China, created processes and procedures based on industry standards around QA Best Practices, and managed the quality of Client Solutions for TSG.

Prior to joining Merkle, Michael built onshore and offshore Quality Teams around Enterprise Quality Best Practices for companies like Experian and Premiere Global Services, with a focus on producing Enterprise-level Quality solutions for clients.

Michael is a graduate of the New Jersey Institute of Technology and Rutgers University, and is located in Freehold, New Jersey.
