Data migration and the importance of data quality
Ensuring migration readiness of the data
Data migration from one set of applications to another is a complex task. The complexity increases typically due to data quality issues. How can we identify the data with quality issues and ensure migration readiness?
Data migration from one set of applications to another is particularly complex, when the applications were never designed to interact and share little to no common structure or definitions. The complexity of this task generally increases due to data quality issues. The data quality issues could already exist in the legacy system, but normally become apparent when validated against technical and functional validation rules prescribed by the target system.
Migrating data of insufficient quality not only causes data pollution in the new system, but in many cases also leads to all sorts of issues during the migration, preventing the data to be uploaded by the automated migration solution. The data that does not meet the minimum quality requirement will be rejected by the migration software or the target system. As shown in figure 1, these fall-outs have to be dealt with manually. They require either manual correction of the data before resubmitting to the migration solution or manual entry into the target system altogether. The latter is obviously only feasible when the number of fallouts is limited. For a truly successful data migration, it is therefore crucial to ensure migration readiness of the data.
Possible data issues
The goal of a data migration is to migrate data from a source system into a target system, generally to support target business processes. In this context, data quality should be viewed from the target situation point of view. This essentially means that the source data should be validated against the target system requirements and target business requirements. In this validation process we particularly check whether the data is complete, accurate and valid.
Target system requirements
In the target system, there will be technical validations built in either the database or the application (GUI). If the source data does not meet this minimum quality level, the data will be rejected by the target system and cannot be migrated. Examples of these technical validations are:
- Mandatory field check: no missing values in mandatory field items.
- Data format compliance: all date values are in an accepted date format; phone numbers contain only numbers. In the Data quality domain, this type of data pollution is called “Technical pollution”.
- Business ruling: salary level should not be under the legal minimum wage; all users should be 18 years or older. In the Data quality domain, this type of data pollution is called “Functional pollution”.
Target business requirements
Next to the technical validations that are built in the target system, the business process can place certain constraints on the validity of the data. For instance, the appearance of an old (no longer existing) product or a product description that appears several times in different product groups. Both values are technically valid and accepted by the target system, but from a business perspective not desired or factually incorrect. In the Data quality domain, this type of data pollution is called “Substantive pollution”.
Methods for identifying data issues in the source data
Data validation can be done in the migration software. In a data migration solution, the to-be migrated data is typically extracted into a staging migration database. Within this staging database, the data can be validated against the validation rules developed based on the target system and target business requirements.
The validation of the data quality can be done in 4 ways:
1. Visual approach: visual inspection of physical printouts from the source system by employees/ data stewards. This is only feasible when the data set is relatively small. The quality of the validation is fully dependent on the discipline of the inspectors involved.
2. Data profiling: use systematic analysis on the content of the source data. The type of analysis that are performed are
- Completeness analysis: how often is a given attribute populated versus blank/null?
- Uniqueness analysis: how many unique (distinct) values are found for a given attribute across all records? Are there duplicates? Should there be?
- Range analysis: What are the minimum, maximum, average and median values found for a given attribute?
- Pattern analysis: What formats were found for a given attribute, and what is the distribution of records across these formats
3. Intrinsic measurement: built technical validation rules based on the context, controls (pattern recognition) and knowledge rules.
- Domain control: checked whether the values that appear in a section are valid. For example, a sale date must relate to an existing day. One of the best known domain controls in the Netherlands for banking is the “11-proef” (check digit) for account numbers.
- Integrity check: checked whether the validity rules are met. For example, an order is always associated with a customer or a wedding date falls well after the birth.
4. Reference data: built technical validation rules based on external reference data, e.g. validate addresses against the official postal code table.
Ensuring migration readiness of the source data
Although the validation rules built in the migration software can automatically identify the data with quality issues, correcting these data errors requires active support from the business. As the data owner, only the business has the mandate to take decisions on the required data corrections. A good practice is to have data stewards with relevant decision making mandate assigned to this data cleaning activity. The actual data corrections can then be done via technical batch processing as opposed to through a manual interface. The high-level steps in ensuring migration readiness of the source data are depicted in Figure 2.
Another good practice is to correct the data as much as possible in the source system, as this has a number of advantages:
- The data only needs to be corrected once, instead of multiple times in the staging database.
- The post-migration reconciliation, i.e. comparing the migrated data in the source with the target system, is more transparent.
- Even before the actual migration, the organization can already benefit from improved data quality.
As also described in the blog “Data migration: the high level activities for success”, the data cleaning activity normally require quite a few iterations before the desired data quality level is met. It is therefore recommended to start this activity in parallel to the data migration software development, to ensure timely migration readiness of the to-be migrated data.
More information about data migration and data quality measurement?
We are interested to learn about your experience in dealing with data quality issues in a data migration context. In case you would like to know more about data migration or data quality measurement, please do not hesitate to contact Wing Lee (+31882882902) or Janvier Jessurun (+31882887844).
Read more about data migration
Read the other blogs of Wing Lee about data migration: