Solution for a scenario of De-Duplication with Fuzzy Logic

Hi,

I am attempting to create an SSIS Package to do a de-duplication process with a large database of over 10 million people records.

Each person record will have to be matched with the complete DB on Name Match, DOB Match, Relationship Match, Gender Match and then we will have to save the matching records into another table - which will then be reviewed by the client to decide which are the duplicate ones and then suitable actions taken.

We have decided to go with the Fuzzy logic match to do this.

The plan is to pick up a set of 50 records one at a time, match each of them with the complete DB and then move to the next 50 records. It is also important to ignore the already processed 50 records + also ignore the duplicate matches returned.

We want to program such a package and schedule it to run at the background.

Is this possible and if so, what would be the right approach to create this?

Regards,
Ka

September 5th, 2015 2:30am

Hi Karthick,

It seems that your source and origin is the same database.

By using the Fuzzy matching there will be no exact match, but an approximation match.

It is possible to tune how strict matching should be applied though.

The typical approach is to use the look up component.

The matching process itself better be done with the Cache Transform. Perhaps you will not be able to elevate all the array of textual values into the memory so use the disk file approach and just do the lookup against it from the database. There is no need to do anything extra. Ensure the machine is capable enough and schedule the job to run after you gave it a test run, monitor disk, memory usage. The job can be scheduled to run via the SQL Agent, you can Bing/Google how. I would not run it scheduled as it may fail if you do not have a capable machine.

Free Windows Admin Tool Kit Click here and download it now
September 5th, 2015 10:44pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics