Data cleanup

Hello, experts,

I am trying to wrap my mind around the right data cleanup procedure. I have just implemented my SSIS package, which imports data and populates Slowly Changing Dimension and Fact tables. Now I am trying to understand how a data analyst would clean up the data (e.g., removing duplicates and fixing spelling errors that lead to multiple records in the Dim tables instead of one).

- Would the analyst query the resulting Dim/Fact tables and update the records that do not look correct? But in that case the next iteration of incoming data will overwrite the results of this cleanup.

- Would the analyst need to go into the production system and fix the dirty data there? This doesn't look very sustainable either, from security and data-safety perspectives.

Also - is there any specific software that tackles this particular problem?

July 17th, 2015 9:10pm

Hi vfspb,

I suggest treating the system that accepts the raw data from the SSIS package as a staging environment, with no further modifications there - that way you keep a point of truth for troubleshooting.

You would then need to push the conceivably cleaner data further into a kind of reporting area (data lake).

As for how the cleanup can be done, there are dozens of choices.

E.g., I did some de-duplication using SSIS and Fuzzy Grouping: http://chandunayakwadi.blogspot.ca/2013/01/data-deduplication-using-ssis2012.html Fuzzy Grouping/Matching can also be used to correct misspellings, so in my view no other software is needed. It boils down to how fast, accurate, and flexible a system you want to build - and, of course, the budget.
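To make the idea behind Fuzzy Grouping more concrete, here is a minimal sketch of similarity-based de-duplication using Python's standard-library difflib. It is not what SSIS does internally - the names, the 0.7 threshold, and the "first match wins" grouping are illustrative assumptions only:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_group(names, threshold=0.7):
    """Group near-duplicate names: each name joins the first existing
    group whose representative is at least `threshold` similar,
    otherwise it starts a new group."""
    groups = []  # list of (representative, members)
    for name in names:
        for rep, members in groups:
            if similarity(rep, name) >= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return groups

# Hypothetical dirty dimension values: three spellings of one company.
dims = ["Acme Corp", "ACME Corp.", "Acme Corporation", "Globex"]
for rep, members in fuzzy_group(dims):
    print(rep, "->", members)
# Acme Corp -> ['Acme Corp', 'ACME Corp.', 'Acme Corporation']
# Globex -> ['Globex']
```

In a real pipeline you would pick one canonical value per group (Fuzzy Grouping's "canonical row") and update the dimension's natural-key mapping accordingly, so the next load resolves the misspelled variants to the same surrogate key instead of re-creating duplicate rows.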

July 19th, 2015 7:51pm

This topic is archived. No further replies will be accepted.
