This blog post lists out some of the core concepts of Data Quality assessment, perception changes in data quality, data quality management over the years, use of data quality tools to parse, standardize and cleanse data.
What is Quality Information?
There has been a change in the conventional focus on data usage with regard to how data sets support operation of transactional systems. Data quality or data fitness in today’s world is perceived in the context of reuse, repurpose and evaluated in line with conformance to business rules.
Why do we need DQ Tools and Techniques?
The fundamental steps in improving Data quality continue to be the same; however we need to rely on data quality tools for the following reasons:
- Evaluate when data errors exist
- Assess severity of the problem
- Get rid of Root cause and correct the data
- Inspect and Monitor
- Enhance quality of data
Key Aspects of DQ Management
Figure 1: Data Quality Management - Key aspects
Figure 2: Data Cleansing
Data cleansing is performed by combining the business rules definition with the software designed to execute the outlined rules.
Let us assume that a data value in a character string is incorrect. To correct the incorrect string, we make use of the transformative capability of the business rule.
Hence STREET could mean a Road, Street, avenue and so on. The business rule for STREET could be defined as follows:
- STR = STREET
- ST =STREET
- St = STREET
- St. = STREET
- Str = STREET
- Str. = STREET
The string 1000 Main Str. will be transformed to 1000 MAIN STREET.
If we were to transform a street name “St. Patricks St.” by applying the above simple business rule, the end result would be “STREET PATRICKS STREET”. The basic rule will not suffice and calls for more control.
When the same input maps to two different outputs, ordering the rules will fail. We need an approach that accommodates both context and order of execution.
Adjusting the rule set
This approach can be used to resolve the data cleansing rule conflicts wherein we can adjust the rule set to a certain extent to ensure distinction of abbreviation and then phasing the application of rules.
If we have two rules that share an equivalent input but have different outputs, using the form:
- R1: Transform string X into string Y1
- R2: Transform string X into string Y2
Then a modification to the rule set to break that conflict could work if we first correct all instances of one type of context-dependent inputs into a modified form and then apply modified rules during a second pass. Here is another stab at modifying our sample rules into two passes.
- St is transformed into __STREET__ at the end of a street name
- St. is transformed into __STREET__ at the end of a street name
- St is transformed into SAINT
- St. is transformed into SAINT
- __STREET__ is transformed into STREET
Adding Contextual Constraints
By adding Contextual constraints the problem is solved as follows but when a new context is introduced it becomes an issue again necessitating another context rule
- St is transformed into SAINT at the beginning of a street name
- St. is transformed into SAINT at the beginning of a street name
- St is transformed into STREET at the end of a street name
- St. is transformed into STREET at the end of a street name
Managing the quality of master address and location data involves reviewing a lot of the existing documentation that has been collected from a number of different operational systems, as well as reviewing the business processes to see where location data is created, modified, or read. There are likely to be many references to operations or transformations performed on addresses, mostly with the intent of improving the quality of the address.
The transformations are referred to using varied terms such as validation; verification; standardization; cleansing; correction.
Figure 3: Address Quality validation
When an item is sent to an addressed location, the following concepts need to be right:
- The item must be directed to a specific recipient party (either an individual or an organization).
This implies a direct connection between the sender and the recipient. The corresponding business rule suggests that the recipient must be identifiable to the sender
- The address must be a deliverable address.
This means that the address must carry enough information to enable a carrier to locate the address as a prelude to delivery
- The intended recipient must be associated with the deliverable address.
This establishes a direct dependence between the recipient and the addressed location, implying awareness of that connection
- The delivery address must conform to the USPS® standard.
An address must be complete, which by definition implies that it can be matched with current Postal Service ZIP + 4® and City State files.
A complete address matches the USPS files and has the correct ZIP + 4.
The address components must be consistent with the postal standard in terms of valid and invalid values. For example, if the USPS file says that Main Street goes from 1-100, an address with 105 Main St is invalid because it is outside the range of recognized numbers.
Validation here indicates that the street address is predictable with what is reported by the USPS documents.
Also means that the address uses the standard USPS abbreviations
Non USPS Addresses
In many smaller and rural communities, the USPS does not deliver directly to homes and businesses; rather, it doles out PO boxes to every address in the community and only delivers mail to the PO Box. When a street address is verified using the USPS database there is every probability that the non-USPS address will turn out to be invalid or bad.
The failure to check non-USPS addresses can be a problem for many businesses. However, a couple of data vendors are able to aggregate contact data from multiple sources to distinguish and confirm these non-USPS addresses. Some vendors additionally offer the choice of geocoding these non-USPS addresses to the rooftop level.
Let us illustrate with an example. Let us assume that we need to deliver an item to a particular individual at a particular location engaging an agent. We feel that it is pretty easy for the agent to locate if we provide the street address alone because the standard address format contains a hierarchical breakdown for refining the location at finer levels of precision. This process works in the US because of the existing postal standards.
The problem occurs when, for some reason, the address does not conform to the standard. If the address is slightly distorted (e.g. it is missing a postal code), the odds are still great that the location can be identified. If the address has significant issues (e.g. the street number is missing, there is no street, the postal code is conflicting with the city and state, or other components are missing), settling the location becomes a hassle.
There are two ways to try to deal with this problem:
- Treat each non-standard address as an exception
- Fix the problem earlier in the process by trying to transform a non-standard address into one that conforms to the standard.
Techniques to address Standardization
Changing the address into a standard form will streamline the geocoding process. This change procedure leverages the following ideas:
- There is a representative model for “standardized” addresses with its accompanying formats, syntax, acceptable value lists, and rules.
- An application is able to scan a non-standardized (or what I called a “provided descriptive”) address, differentiate between the parts that are good and the ones that do not meet the standard.
- There is an approach to outlining the non-standard parts into standard ones.
Standard representation of data helps us to find out if a value meets the standard or not. For example; A street address can be split into three components:
Number, Name and a Street type
Particulars to decide address legitimacy
- The number must be a positive integer number
- The name must have one, and only one word
- The street type must be one of the following: RD, ST, AV, PL, or CT
Defining the Rules
- The address string must have three components to it (format)
- The first component has to only have characters that are digits 0-9 (syntax)
- The first character of the first component cannot be a ‘0’ (syntax)
- The third component must be of length 2 (format)
- The third component has to have one of the valid street types (content)
Data Enhancement is a procedure that adds data to an informational index (data set) to enhance its potential utility.
There are various diverse ways that data sets can be enhanced, including adapting values to meet defined benchmarks, applying data amendments, and adding additional attributes.
Data Enhancements in Business
Figure 4: Data Enhancements - Business
Let us explain how data enhancement can be incorporated in operations.
Names of individuals appear in data records in different ways such as:
- first name followed by last name
- last name with a comma, followed by first name
- with or without titles such as “Mr.” or “Dr”
A data standardization and enhancement process can parse out the key components of a person’s name, fill in the details (if necessary) through lookups in master information tables, and reorganize those components into a format so that a customer’s identity can be set up for verification purposes.
Upgrades are mostly done to provide additional detail for reporting and analysis purposes. Enhancement goes beyond data standardization and correction; rather, the upgrade procedure can include more data by connecting one data set to another. The appended data can augment an analytical process to include extra information in reports and interactive visualizations.
For example; a retail company can take sales data that includes Zip code value and then enhance the data with demographic profiles provided by the Census Bureau to look for correlation between purchasing designs and documented socioeconomics about the specific locations.
Standardizing names and addresses, and linking those records to the reference data collections allows coordinate linkage in light of the particular criteria, ranging from gross-level linkage (say, at the district level) down to specific enhancement at the individual level (such as the names of the products to which a customer subscribes). These qualitative enhancements enlarge the business knowledge and analytics processes to help organizations to
make more deals, increment incomes, and enhance productivity.
Incorporating diagnostic outcomes with operational exercises give viable proposals to enhance client bolster or even drive extra deals.
Record Linkage and Matching
Figure 5: Record Linkage & Mapping
There are many distributed sources of data about customers, and each individual piece of gathered information holds a smidgen of significant worth. But when these disseminated bits of information are merged together, they can be used to recreate an extraordinarily adroit profile of the client.
How are different data records from disparate data sets consolidated to define sagacious profiles?
The answer lies in the ability to gather the diverse bits of information that belong to a single individual and after that glom them together. For example, let’s presume the existence of a record in one data set that has a person’s address, a record in another data set that has that person’s telephone number, a third record that has that person’s enlistment number for a convection oven, another with the person’s car year, make, and model, etc. As long as you can find all the records that are associated with each person and connect them together, you could gather all the intriguing data together and make a solitary delegate profile. That profile is then suitable for use in list generation, but is also used for more comprehensive analytics such as segmentation, clustering analysis, and classification.
The way these records are associated together is through a process called “record linkage.” This process searches through one or more data sets looking for records that refer to the same unique entity based on identifying characteristics that can be used to distinguish one entity from all others, such as names, address, or telephone numbers. When two records are found to share the same pieces of identifying information, you may expect that those records can be connected together.
- The records from the different data sets don’t have the same identifying attributes (one might have telephone number but the other one does not).
- The values in one data set use an alternate structure or configuration than the data in another data set (such as using hyphens for social security numbers in one data set but not in the other).
- The values in one data set are slightly different than the ones in the other data set (such as using nicknames instead of given names).
- One data set has the values broken out into separate data elements while the other does not (such as titles and name suffixes).
One of the most interesting byproducts of record linkage is the ability to infer explicit facts about individuals that are obfuscated as a result of distribution of data.
Data quality tools such as parsing, standardization, and business-rule based record linkage and similarity scoring can help in assessing the similarity between two records. Organizations have to be proactively engaged in managing data quality reducing the reactive culture. Methodologies and techniques utilized for reviewing data errors, root cause analysis, designing and applying remedial measures have to be standardized.
About the Author
Karthik M, Associate Architect at Mastech Infotrellis has an overall experience of 6 years in enterprise applications, profiling, and performance tuning and design best practices.