As solution or application architects, we focus on the quality attributes of systems, such as performance, availability, and capacity, to ensure that the system can effectively deliver the functional capabilities required by its end users and business owners. These are often referred to as non-functional requirements or "ilities". Business systems typically produce and consume reams of business data, and in addition generate operational data such as system event logs and website traffic statistics. Data has a value of its own, whether consumed through an application, analyzed in a data warehouse, or used by a business intelligence system. The data itself has its own set of quality attributes, which determine how effectively it can be consumed and affect how it needs to be produced.
I used to think that data quality was primarily the concern of data architects and data analysts, so let me emphasize that last point. As business, solution, or application architects for data-intensive business systems, we set priorities and make tradeoffs on data quality that can significantly affect decisions about the functionality of these systems and about the business processes used alongside them to supply or consume data. So understanding data quality is important.
As with models of software quality attributes, there are many different models of data quality (see e.g. http://en.wikipedia.org/wiki/Data_quality). Certain attributes recur frequently across these models, and I feel four of them are predominant:
- Accuracy: How well the data reflects reality. Is the data free of errors? There are many aspects to accuracy like validity, consistency, and integrity that data quality models often treat as separate attributes.
- Completeness: Is all necessary data present? This can refer to missing fields of entities, missing related entities, or having an insufficiently sized subset of data from a particular population.
- Timeliness: Data is available to users within the required time. This is commonly expressed as the delay between when events occur to create the data versus when users are able to access the data. For example, near-real-time data versus data that is refreshed daily or weekly.
- Relevance: The degree to which the data meets the needs of users. Can users answer the questions they have from the data?
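To make these definitions a little more concrete, here is a minimal Python sketch of how three of the attributes might be measured over a small set of records. Everything in it is an assumption made for illustration: the `CustomerRecord` type, its fields, the sample data, and the scoring functions are hypothetical and do not come from any particular data quality model. The `validity` check covers only one narrow aspect of accuracy (a crude syntactic test on email addresses).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CustomerRecord:
    # Hypothetical record type, for illustration only.
    customer_id: str
    email: Optional[str]
    postal_code: Optional[str]
    event_time: datetime      # when the real-world event occurred
    available_time: datetime  # when users could first access the record


def completeness(records: List[CustomerRecord], fields: List[str]) -> float:
    """Fraction of required field values that are present (not None or empty)."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0
    present = sum(
        1 for r in records for f in fields if getattr(r, f) not in (None, "")
    )
    return present / total


def validity(records: List[CustomerRecord]) -> float:
    """Fraction of present emails that contain '@' (one narrow aspect of accuracy)."""
    emails = [r.email for r in records if r.email]
    if not emails:
        return 1.0
    return sum(1 for e in emails if "@" in e) / len(emails)


def timeliness(records: List[CustomerRecord], max_delay: timedelta) -> float:
    """Fraction of records that became available within max_delay of the event."""
    if not records:
        return 1.0
    on_time = sum(
        1 for r in records if r.available_time - r.event_time <= max_delay
    )
    return on_time / len(records)


# Made-up sample data.
t0 = datetime(2024, 1, 1, 12, 0)
records = [
    CustomerRecord("c1", "a@example.com", "12345", t0, t0 + timedelta(minutes=5)),
    CustomerRecord("c2", None, "67890", t0, t0 + timedelta(hours=30)),
    CustomerRecord("c3", "not-an-email", None, t0, t0 + timedelta(minutes=50)),
    CustomerRecord("c4", "d@example.com", "", t0, t0 + timedelta(hours=2)),
]

print(completeness(records, ["email", "postal_code"]))  # 5 of 8 values present
print(validity(records))                                # 2 of 3 emails pass
print(timeliness(records, timedelta(hours=1)))          # 2 of 4 within one hour
```

Relevance is deliberately left out of the sketch: whether users can answer their questions from the data depends on those questions, so it is hard to reduce to a simple ratio over the records themselves.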
Ensuring that business systems fully satisfy these attributes is very challenging: in my next post I will discuss inherent limitations in achieving high levels of data quality.