
Data Quality Limitations: The CAT Theorem

I wrote in my prior post about key attributes of data quality: Accuracy, Completeness, Timeliness, and Relevance. These parallel the quality attributes of applications, also known as non-functional requirements. It has long been known that application quality attributes compete with one another - performance versus scalability, for example. The best-known formalization of such a trade-off is the CAP (Consistency, Availability, Partition-tolerance) theorem, which essentially states that when a partition (failure) occurs in the network or system, one must choose between consistency and availability.

Data quality likewise has competing forces in play. I therefore propose the CAT theorem of data quality: the Completeness, Accuracy, and Timeliness of data cannot, in practice, all be achieved at once.

Business events and processes are typically the source of data. Increasing the number of fields collected or the size of the population on which data is collected (completeness), or confirming that each field holds syntactically and semantically correct values (accuracy), takes extra effort and time, which negatively impacts timeliness.

For example, consider a typical online sign-up form that collects basic personal information. In the simplest scenario, such as a community forum, all that might be requested is your name and email, with no special validation. The process is fast (timely), but the accuracy of the information is potentially low (fake names can be used), and the amount of information known about each user (completeness) is low. Many sites now validate your email address by sending a message with a link you must visit to complete the sign-up; this improves accuracy at the cost of timeliness. In contrast to an online forum, consider the sign-up process for online banking. Much more information is requested (such as address, birth date, social security number, and occupation details), with a much greater level of validation. Not only does this take longer to fill out, it also takes the bank much longer to validate the information to ensure accuracy.
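
To make the trade-off concrete, here is a minimal Python sketch of the two sign-up tiers. It is purely illustrative: the field names, validation rules, and the simulated half-second email-confirmation delay are assumptions of mine, not a real registration implementation.

    import re
    import time
    from dataclasses import dataclass

    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    @dataclass
    class SignupResult:
        accepted: bool
        fields_collected: int   # crude proxy for completeness
        checks_passed: int      # crude proxy for accuracy
        elapsed_seconds: float  # crude proxy for timeliness

    def sign_up(data: dict, verify_email: bool = False,
                require_extra_fields: bool = False) -> SignupResult:
        start = time.monotonic()
        checks = 0

        # Forum tier: name + email with a syntax check only -- fast,
        # but nothing stops a fake name or a throwaway address.
        if not data.get("name") or not EMAIL_RE.match(data.get("email", "")):
            return SignupResult(False, len(data), checks,
                                time.monotonic() - start)
        checks += 1

        # Banking tier: more fields collected (completeness) means more
        # fields that must be present and validated (accuracy).
        if require_extra_fields:
            for required in ("address", "birth_date", "ssn", "occupation"):
                if not data.get(required):
                    return SignupResult(False, len(data), checks,
                                        time.monotonic() - start)
                checks += 1

        # Email confirmation: raises accuracy, but the user must wait.
        if verify_email:
            time.sleep(0.5)  # stand-in for "wait until the link is clicked"
            checks += 1

        return SignupResult(True, len(data), checks,
                            time.monotonic() - start)

    print(sign_up({"name": "Ada", "email": "ada@example.com"}))
    print(sign_up({"name": "Ada", "email": "ada@example.com",
                   "address": "1 Main St", "birth_date": "1990-01-01",
                   "ssn": "000-00-0000", "occupation": "engineer"},
                  verify_email=True, require_extra_fields=True))

Running the two calls shows the forum tier returning almost instantly, while the banking tier collects more fields and passes more checks but takes measurably longer - exactly the tension the CAT theorem describes.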

Surveys of all types face a different CAT trade-off: the more information a survey asks for, the fewer respondents it typically gets. This is a trade-off between two kinds of completeness. Fewer respondents means a smaller sample size, potentially with biases that make it less representative, thus reducing accuracy. Asking for more information either delays the response or invites inaccuracies. I remember completing a legally mandated Government of Canada census a number of years ago. It contained a long series of lifestyle questions that were difficult to answer (e.g. how often did your household do X in the last year). I answered quickly rather than doing a detailed analysis, which surely reduced the accuracy of my answers.

Unlike the CAP theorem, which is grounded in the physical limitations of I.T. systems, the CAT theorem of data quality is largely grounded in economic and business realities - hence the word "practically" in its statement. There may well be scenarios that violate CAT, but if so, I believe them to be exceptions. And some scenarios that appear to violate CAT at collection time (e.g. automated collection of operational event data from I.T. systems) still see CAT apply when the data is used to achieve a business objective - in other words, when humans get involved. So perhaps the CAT theorem is really a statement about the limitations of people.

What, practically, does the CAT theorem mean for I.T. professionals? One word: simplicity. Strive for simple systems that ask for, validate, and manipulate the minimum information required to achieve the business objectives. Adding a few more data fields may not seem like much, but every field added negatively impacts either accuracy or timeliness.
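
That restraint can even be encoded in the data model itself. A hypothetical sketch, with the field choices assumed purely for illustration:

    from dataclasses import dataclass

    # Illustrative only: a deliberately minimal record for the forum
    # scenario above. Every field left out is one fewer prompt for the
    # user (timeliness) and one fewer value to validate (accuracy).
    @dataclass
    class ForumUser:
        name: str
        email: str
        # Tempting extras like phone, address, or birth_date are omitted:
        # under CAT, each added field must pay for itself against the
        # business objective.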

