Dealing with Bad Data in Integration Solutions

When we build integration solutions one of the biggest challenges we face is “sh!t in sh!t out”. Explained more eloquently we often have line of business systems which have some poor data in it and then we have to massage this and work around it in the integration solution so that the receivers of the data don’t break when they get that data. Also sometimes the receiver doesn’t break but its functionality is impaired by poor data.

Having faced this challenge recently and the problem in many organizations is as follows:

  • No one knows there is a data quality issue
  • If it is known then its difficult to workout how bad it is or estimate its impact
  • Often no one own the problem
  • If no one owns the problem then its unlikely anyone is fixing the problem

Imagine we have a scenario where we have loaded all of the students from one of our line of business systems into our new CRM system and then we are trying to load course data from another system into CRM and to make it all match up. When we try to ask questions of the data in CRM we are not getting the answers we expect and people lack confidence in the new solution. The thing is, the root cause of the problem is poor data quality from the underlying systems but the end users don’t have visibility of that so they just see the problem being with the new system as the old stuff has been around and kind of worked for years.

Dealing with the Issue

There are a number of ways you can tackle this problem and we saw business steering groups discussing data quality and other such things but nothing was as effective and cheap as a simple solution we put in place.

If you can imagine that we use BizTalk to extract the data from the source system and then load it to Service Bus, from where we have various approaches to put/sub the data into other systems. The main recipient of most of the data was Microsoft Dynamics CRM Online. Our idea was to implement some tests of the data as we attempted to load it into CRM. We implemented these in .net and the result of the tests would be a decimal value representing a % score based on the number of tests passed and a string listing the names of the tests that failed.

We would then save this data alongside the record as part of the CRM entity so it was very visible. You can see an example of this below:

We implemented tests like the following:

  • Is a field populated
  • Does the text match a regular expression
  • If we had a relationship to another entity can we find a match

For most of the records we would implement 10 to 20 tests of the data coming from other systems. We can then in CRM easily sort and manage records based on their data quality score.

Making the results visible

At this point from an operational perspective we were able to see how good and bad the data coming into CRM is on a per record basis. The next thing we need to do is to get some focus on fixing the data. The best way to do this is to provide visualisations to the key stakeholders to show how good or bad the data is.

To do this we used a simple Power BI report dashboard pointing at CRM which would do an average for each entity of the data quality score. This is shown in the below picture.

If I am able to say to the business stakeholders that we can not reliably answer certain questions in CRM because the data coming into CRM has a quality score of 50% then this is a powerful statement. Backed up by some specific tests which show whats good and isnt. This is highly likely to create an interest in the stakeholders in improving the data quality so that is serves the purpose they require. The great thing is each time they fix missing data or partly complete data which has accrued over the years into the LOB application, each time data is fixed and reloaded we should see the data quality score improving which means you will get more out of your investment in the new applications.

Summary

The key thing here isnt really how we implemented this solution. We were lucky that adding a few fields to CRM is dead easy. You could implement this in a number of different ways. What is important about this approach is the idea of testing the data during the loading process and recording that quality score and most importantly making it very visible to help everyone have the same view.