There's more to Data Quality than meets the eye

Aug 25, 2020 1:45:07 PM / by Nicoletta Camodeca

Data is generally considered high quality if it is "fit for its intended uses in operations, decision making, and planning" [1].

In the last 50 years, the shift from an industrial economy to an information economy has made data increasingly important, and it has highlighted how costly poor-quality data can be for a company's financial resources [2]. Poor data quality contaminates downstream systems and information assets, which increases costs throughout the enterprise. Customer relationships also deteriorate, because poor-quality data leads to inaccurate forecasts and poor decision making. Recently, Harvard Business Review reported that, of 75 companies sampled, only 3% had high-quality data.

However, the impact of poor data quality is not measured solely in decreased revenue: it can also entail direct financial losses and significant legal repercussions.

How can a company start its Data Quality journey?

1. How bad is your data?

Knowing that the quality of the data is not high is not enough to start a data quality program. It is essential to understand how much of the data is of poor quality and what types of issues are present. The first step is data profiling, which guides data quality analysis by uncovering the issues in each of the fields analyzed. Data profiling can be done using either SQL or data quality profiling tools like IBM InfoSphere Information Analyzer or Ab Initio Data Profiler.

Here are some examples of issues detected by data profiling within a field:

  • Special characters
  • Numeric values
  • Spaces/Blanks

Let's take a look at the special characters issue. Sometimes, the presence of special characters in a specific field can be a sign of poor data quality. However, it is essential to define which special characters constitute an abnormality in that particular field before making a specific determination. As an example, let's look at the field First Name with the following values:

  • Jørund
  • Nicol*tta

A profiling tool flags the ø in Jørund as a special character in exactly the same way as the * in Nicol*tta. However, ø is a valid character in Scandinavian countries, while * constitutes a real data quality issue. It is essential to analyze each detected issue in light of the nature and source of the data.
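One way to separate the two cases is to accept any Unicode letter and flag only what remains. The sketch below is an assumption about how such a check could look: the function name and the set of allowed punctuation are invented for this example and would need to be agreed for each field.

```python
import unicodedata

# Assumption: punctuation considered legitimate inside a first name.
ALLOWED_PUNCTUATION = {" ", "-", "'", "."}

def suspicious_characters(name):
    """Return the characters that are neither letters nor allowed punctuation."""
    suspicious = []
    for ch in unicodedata.normalize("NFC", name):
        # Unicode general categories starting with "L" are letters, so ø passes.
        is_letter = unicodedata.category(ch).startswith("L")
        if not is_letter and ch not in ALLOWED_PUNCTUATION:
            suspicious.append(ch)
    return suspicious

print(suspicious_characters("Jørund"))     # []
print(suspicious_characters("Nicol*tta"))  # ['*']
```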

2. Who has what?

Not all the records in a specific system carry the same type of information (corresponding fields). Even standard fields differ in priority and sensitivity. Why is this important? Let's consider this: John Doe is a customer of Acme (Bank). To open an account, John has to provide the following information:

Field            Mandatory (M) / Optional (O)
Name             M
Address          M
Driver License   M

Jane Doe is a third party for Acme, which means she can play a role in Acme's products (for example, Ultimate Beneficial Owner) without owning any Acme product herself. In her case, the same attributes are available, but Acme requires only some:

Field            Mandatory (M) / Optional (O)
Name             M
Address          M
Driver License   O

A blank in any mandatory field constitutes a data quality issue; the same is not true for optional fields. Not having the driver's license information in John's record is a data quality issue, but it is inconsequential for Jane's record. Ignoring these distinctions between types of parties creates false positives when detecting data quality issues for Acme, and chasing them is a waste of time and resources.
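The distinction between John and Jane can be sketched as a completeness check driven by party type. The field names, the party-type labels, and the sample addresses below are hypothetical and only mirror the two tables above.

```python
# Hypothetical requirement matrix mirroring the two tables above.
MANDATORY_FIELDS = {
    "customer": {"name", "address", "driver_license"},
    "third_party": {"name", "address"},
}

def completeness_issues(record, party_type):
    """Return the mandatory fields that are missing or blank for this party type."""
    missing = []
    for field in sorted(MANDATORY_FIELDS[party_type]):
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing

# Illustrative records: both lack a driver's license.
john = {"name": "John Doe", "address": "1 Main St", "driver_license": ""}
jane = {"name": "Jane Doe", "address": "2 Main St", "driver_license": ""}

print(completeness_issues(john, "customer"))     # ['driver_license']
print(completeness_issues(jane, "third_party"))  # []
```

The same blank value is reported for John and ignored for Jane, which avoids the false positive.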

3. Oh, the possibilities!

Once a data quality issue is identified, data quality rules/controls need to be created to analyze it. For each data element, data quality rules can be created for each data quality dimension that needs to be analyzed. A data quality dimension is a characteristic of the data that can be measured to analyze the data's quality. Some examples of data quality dimensions are:

  • Completeness (whether a field is populated or not)
  • Consistency (whether a field is the same across several systems)
  • Conformity (whether the data conforms to a set of standard definitions, e.g., a date should be mm/dd/yyyy)

Dimensions have to be selected for each field, since not all the available data quality dimensions are relevant to every field. There is not a 1:1 relationship between data quality rules and data quality dimensions; how many rules a dimension needs is determined by the nature of that dimension. A data element can have only one completeness data quality rule but several consistency data quality rules.
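To make the conformity and consistency dimensions concrete, here is a minimal sketch of one rule for each. The function names, the system names, and the normalization used for consistency (trimming and ignoring case) are assumptions made for illustration, not a prescribed rule set.

```python
from datetime import datetime

def conforms_to_mmddyyyy(value):
    """Conformity: True if value is a real calendar date written as mm/dd/yyyy."""
    try:
        parsed = datetime.strptime(value, "%m/%d/%Y")
    except (TypeError, ValueError):
        return False
    # strptime also accepts "1/5/2020"; the round trip enforces zero padding.
    return parsed.strftime("%m/%d/%Y") == value

def is_consistent(values_by_system):
    """Consistency: True if all systems hold the same value after normalization."""
    normalized = {str(v).strip().casefold() for v in values_by_system.values()}
    return len(normalized) <= 1

print(conforms_to_mmddyyyy("08/25/2020"))  # True
print(conforms_to_mmddyyyy("25/08/2020"))  # False
print(is_consistent({"crm": "John", "billing": "Jon"}))  # False
```

A field needs the consistency check once per set of systems being compared, which is why one data element can carry several consistency rules.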

Let's go back to the example of John and Jane. If we create a data quality rule for completeness for Driver's License, the issue detected for John is a real data quality issue because Driver's License is a mandatory field to create a customer record. The same completeness data quality issue identified for Jane is not an actual data quality issue, since the Driver's License is an optional field.

Another example is the Name field. When analyzing the quality of the data in a Client Referential System, it is worth learning the system's basics. First and Last Name are mandatory fields in any Client Referential System, and this rule is enforced at the database level: no record can be stored in the database without a First and Last Name. Therefore, creating a completeness data quality rule for the First and Last Name will not bring any insight into the quality of the data.

4. Where does your data really come from?

Analyzing the quality of data in a system means looking not only at what we see but also at where the data comes from. Data quality assessment is a substantial part of every project that requires data migration, since we want to avoid a GIGO (Garbage-In-Garbage-Out) scenario. Let's look at the Name and Identifier data for John Doe at Acme. To open an account at Acme, John has to provide the following:

  • First Name
  • Last Name
  • Address
  • Document/Identifier

Workflow 1: John goes to the closest Acme branch, fills in the information form, and presents his Driver's License as a valid document. This document provides verification of the name and address provided by John. It is a manual process, and the Driver's License is scanned and kept with John's record.

Workflow 2: John goes to the closest Acme branch and presents his ID card with a microchip. All his personal information is downloaded from it.

Both workflows produce anomalies in the First Name (in this case, special characters). However, the percentage of anomalies is higher for workflow 2 than for workflow 1.

How can that be possible?

With workflow 1, the data quality issue can be attributed to a manual error. But what about workflow 2? It's hard to believe, but it is still a manual error; in fact, a two-step manual error.

Obtaining a legal document that can be used as proof of identity is itself a manual process. Government forms are designed to suit the names of a country's majority population, and some countries do not split a name into First and Last Name. When a foreign name has to be stored, the government's form may not allow it to be stored properly. One common "remedy" is to skip any attempt at parsing the string provided into First, Middle, and Last Names and to store the complete name in the Last Name field. And since the First Name must be present on the ID document, a default value of ?? is assigned to the First Name field.

This, however, is not the only "manual" entry point. A second one arises from how the data on the chip is "migrated" to the Client Referential System: the default value used to "replace" the ?? from the government ID keeps changing. The most common values found in the First Name field are *, X, XX, and XXX. These defaults result from a different type of "manual manipulation": no single default value has been defined at the enterprise level for First Name, a data quality issue that stems from the lack of data governance.
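Comparing the two workflows comes down to measuring how often a placeholder shows up in First Name per data source. The sketch below assumes each record carries a source label; the labels, the function name, and the sample records are invented for illustration, while the placeholder values are the ones named above.

```python
from collections import defaultdict

# Placeholder values named in the text; extend as profiling uncovers more.
PLACEHOLDER_FIRST_NAMES = {"??", "*", "X", "XX", "XXX"}

def placeholder_rate_by_source(records):
    """Return, per source, the share of records whose first name is a placeholder."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for record in records:
        source = record["source"]
        totals[source] += 1
        if record["first_name"].strip().upper() in PLACEHOLDER_FIRST_NAMES:
            flagged[source] += 1
    return {source: flagged[source] / totals[source] for source in totals}

# Made-up records: "branch_form" stands for workflow 1, "chip_id" for workflow 2.
records = [
    {"source": "branch_form", "first_name": "John"},
    {"source": "branch_form", "first_name": "Nicol*tta"},
    {"source": "chip_id", "first_name": "??"},
    {"source": "chip_id", "first_name": "XX"},
    {"source": "chip_id", "first_name": "Maria"},
    {"source": "chip_id", "first_name": "*"},
]
print(placeholder_rate_by_source(records))
```

A typo such as Nicol*tta is not counted here; it is a different issue from a placeholder and would be caught by a special-character check instead.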

It's not just about ROI in the end.

An increase in revenue is the most quoted benefit of achieving high-quality data. High-quality data also helps prevent financial loss and other adverse, costly legal outcomes.

In the past 15 years, several major financial institutions have paid hefty fines and legal fees for proceedings that found them guilty of irregular activities. The common denominator was poor-quality data, less at the customer level than at the third-party level, which allowed illicit money transfers to go almost completely undetected.

Because they had not taken care of their data (including its quality), the financial institutions were held responsible and liable, and were ordered to improve the quality of their data worldwide. The fines and the verdicts have been particularly harsh.

The fines and the expense of cleaning the Client Referential Systems worldwide were not the only negative financial impact. The financial institutions also had to regain the trust of their customers and develop new strategies to attract new customers.

Data Quality initiatives alone can be the first step to improve data utilization within a company. However, the benefits of Data Quality can be fully leveraged only when carried out within Data Governance, Master Data Management, and Data Analytics frameworks.

We all have to start somewhere, and as we say in Italy, "Well begun is half done."

Your data is more valuable than you think!

Written by Nicoletta Camodeca

Nicoletta Camodeca has been a Senior MDM Business Architect since 2002. She has specialized in Data Quality since 2008 for implementations of IBM MDM solutions in the Retail, Financial, Hospitality, Automotive, and IT industries. She worked for IBM, TCS, and Infosys before joining Mastech InfoTrellis. She is a graduate of the University of Pisa and holds a Ph.D. in Neuroscience from the University of Dublin, Trinity College. She currently lives in New York City.
