Duplicate Record Search

Hi, I have been given a task to locate duplicate and report duplicate records and am trying to determine the best way to do this with databases that have 1 million records plus.

Say I have a table with 20 columns, I need to check to see if 3 of 10 specific columns match.

So if 2 columns are the same its no problem however if 3 or more match, they are considered duplicate.

July 23rd, 2015 2:29pm

Can you give us DDL and example data?

We need to know more to be able to help.

You can provide DDL and example data like this:

DECLARE @table TABLE (col1 INT, col2 INT, col3 INT, col4 INT)
INSERT INTO @table (col1, col2, col3, col4) VALUES 
(1,1,1,1),(1,1,1,1),(1,1,1,1),(1,1,1,1),
(1,1,1,2),(1,1,1,2),(1,1,1,2),(2,1,1,1),
(1,1,1,3),(1,1,1,3),(3,1,1,1),(3,1,1,1),
(1,2,3,4),(5,6,7,8),(9,0,1,2),(3,4,5,6)

Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 2:57pm

Try:

DECLARE @table TABLE (col1 INT, col2 INT, col3 INT, col4 INT)
INSERT INTO @table (col1, col2, col3, col4) VALUES 
(1,1,1,1),(1,1,1,1),(1,1,1,1),(1,1,1,1),
(1,1,1,2),(1,1,1,2),(1,1,1,2),(2,1,1,1),
(1,1,1,3),(1,1,1,3),(3,1,1,1),(3,1,1,1),
(1,2,3,4),(5,6,7,8),(9,0,1,2),(3,4,5,6)

;WITH base AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS seq
  FROM @table 
)

SELECT DISTINCT b.col1, b.col2, b.col3, b.col4, 
CASE WHEN b1.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b2.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b3.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b4.col1 IS NOT NULL THEN 1 ELSE 0 END AS dupes
  FROM base b
    LEFT OUTER JOIN base b1
	  ON b.col1 = b1.col1
	  AND b.seq <> b1.seq
    LEFT OUTER JOIN base b2
	  ON b.col2 = b2.col2
	  AND b.seq <> b2.seq
    LEFT OUTER JOIN base b3
	  ON b.col3 = b3.col3
	  AND b.seq <> b3.seq
    LEFT OUTER JOIN base b4
	  ON b.col4 = b4.col4
	  AND b.seq <> b4.seq
 WHERE CASE WHEN b1.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b2.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b3.col1 IS NOT NULL THEN 1 ELSE 0 END+
CASE WHEN b4.col1 IS NOT NULL THEN 1 ELSE 0 END >= 3


July 23rd, 2015 3:03pm

You do not know that rows are not records. Please read that SQL book you were supposed to read.  Why does the front end layer allow this? We usually like to patch the leaky roof and not mop the wet floor forever. 

We have no idea about DDL, indexes,  etc.  Heck, we do not even know that name of the table you are hiding from us! Key? data types? anything? Try this skeleton: 

CREATE TABLE Foobar
(foo_key CHAR(10) MNOT NULL PRIMARY KEY,
 c01 INTEGER NOT NULL,
 c02 INTEGER NOT NULL,
 c03 INTEGER NOT NULL,
 c04 INTEGER NOT NULL,
 c05 INTEGER NOT NULL,
 c06 INTEGER NOT NULL,
 c07 INTEGER NOT NULL,
 c08 INTEGER NOT NULL,
 c09 INTEGER NOT NULL,
 c10 INTEGER NOT NULL); 

SELECT F1.foo_key AS first_dup, F2.foo_key AS second_dup
  FROM Foobar AS F1
       CROSS JOIN
       Foobar AS F2
WHERE F1.foo_key < F2.foo_key 
   AND 
(CASE WHEN F1.c01 <> F2.c01 THEN 1 ELSE 0 END
   + CASE WHEN F1.c01 <> F2.c01 THEN 1 ELSE 0 END
   + CASE WHEN F1.c02 <> F2.c02 THEN 1 ELSE 0 END
   + CASE WHEN F1.c03 <> F2.c03 THEN 1 ELSE 0 END
   + CASE WHEN F1.c04 <> F2.c04 THEN 1 ELSE 0 END
   + CASE WHEN F1.c05 <> F2.c05 THEN 1 ELSE 0 END
   + CASE WHEN F1.c06 <> F2.c06 THEN 1 ELSE 0 END
   + CASE WHEN F1.c07 <> F2.c07 THEN 1 ELSE 0 END
   + CASE WHEN F1.c08 <> F2.c08 THEN 1 ELSE 0 END
   + CASE WHEN F1.c09 <> F2.c09 THEN 1 ELSE 0 END
     CASE WHEN F1.c10 <> F2.c10 THEN 1 ELSE 0 END) >= 3;












Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 3:17pm

Here is an example.  In this example (1,42,3,44,5,46,47,48,49,50), is the duplicate row, it has 3 columns are duplicates.  col1, col3, col5


DECLARE @table TABLE (col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
INSERT INTO @table (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES
(1,2,3,4,5,6,7,8,9,10),
(11,12,13,14,15,16,17,18,19,20),
(21,22,23,24,25,26,27,28,29,30),
(31,32,33,34,35,36,37,38,39,40),
(1,42,3,44,5,46,47,48,49,50),
(51,52,53,54,55,56,57,58,59,60),
(62,62,63,64,65,66,67,68,69,70),
(71,72,73,74,75,76,77,78,79,80),
(81,82,83,84,85,86,87,88,89,90),
(91,92,93,94,95,96,97,98,99,100)

July 23rd, 2015 3:21pm

Thank you very much.  I will try this.
Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 3:25pm

Wait... you want to know if the value exists in any row, any column?

July 23rd, 2015 3:29pm

Try this:

DECLARE @table TABLE (col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
INSERT INTO @table (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES 
(1,2,3,4,5,6,7,8,9,10),
(11,12,13,14,15,16,17,18,19,20),
(21,22,23,24,25,26,27,28,29,30),
(31,32,33,34,35,36,37,38,39,40),
(1,42,3,44,5,46,47,48,49,50),
(51,52,53,54,55,56,57,58,59,60),
(62,62,63,64,65,66,67,68,69,70),
(71,72,73,74,75,76,77,78,79,80),
(81,82,83,84,85,86,87,88,89,90),
(91,92,93,94,95,96,97,98,99,100)

;WITH base AS (
SELECT *, ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS seq
  FROM @table 
), list AS (
SELECT x, y, seq
  FROM base
    UNPIVOT (
	         x FOR y IN (col1, col2, col3, col4, col5, col6, col7, col8, col9, col10)
			) a
)

SELECT l.seq, l2.seq, COUNT(*)
  FROM list l
    INNER JOIN list l2
	  ON l.y = l2.y
	  AND l.x = l2.x
	  AND l.seq <> l2.seq
 GROUP BY l.seq, l2.seq
 HAVING COUNT(*) >= 3


Row 3 does not seem to be a dupe.
Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 3:36pm

Once again thank you.  I tried this and it didn't work for me.  Here is what I tried.  These 2 rows are the duplicates

(1,1,2,3,4,5,6,7,8,9,10)
(5,1,42,3,44,5,46,47,48,49,50)

however it returns all rows.

DECLARE @table TABLE (myKey int primary key,  col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
 INSERT INTO @table (myKey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES
 (1,1,2,3,4,5,6,7,8,9,10),
 (2,1,12,13,14,15,16,17,18,19,20),
 (3,1,22,23,24,25,26,27,28,29,30),
 (4,1,32,33,34,35,36,37,38,39,40),
 (5,1,42,3,44,5,46,47,48,49,50),
 (6,1,52,53,54,55,56,57,58,59,60),
 (7,2,62,63,64,65,66,67,68,69,70),
 (8,1,72,73,74,75,76,77,78,79,80),
 (9,1,82,83,84,85,86,87,88,89,90),
 (10,1,92,93,94,95,96,97,98,99,100)
 
SELECT b.*
  FROM @table b
    CROSS APPLY @table b2
 WHERE b.myKey <> b2.myKey
 AND (
         b.col1  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col2  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col3  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col4  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col5  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col6  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col7  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col8  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col9  IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
   OR b.col10 IN (b2.col1,b2.col2,b2.col3,b2.col4,b2.col5,b2.col6,b2.col7,b2.col8,b2.col9,b2.col10)
  )

July 23rd, 2015 3:41pm

I think I just had one part backwards, this appears to work.  Thanks again!

DECLARE @table TABLE (myKey int primary key,  col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
 INSERT INTO @table (myKey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES
 (1,1,2,3,4,5,6,7,8,9,10),
 (2,1,12,13,14,15,16,17,18,19,20),
 (3,1,22,23,24,25,26,27,28,29,30),
 (4,1,32,33,34,35,36,37,38,39,40),
 (5,1,42,3,44,5,46,47,48,49,50),
 (6,1,52,53,54,55,56,57,58,59,60),
 (7,2,62,63,64,65,66,67,68,69,70),
 (8,1,72,73,74,75,76,77,78,79,80),
 (9,1,82,83,84,85,86,87,88,89,90),
 (10,1,92,93,94,95,96,97,98,99,100)
 
SELECT F1.*
  FROM @table F1
    CROSS APPLY @table F2
 WHERE F1.myKey <> F2.myKey
 and (CASE WHEN F1.col1 = F2.col1 THEN 1 ELSE 0 END
    + CASE WHEN F1.col1 = F2.col1 THEN 1 ELSE 0 END
    + CASE WHEN F1.col2 = F2.col2 THEN 1 ELSE 0 END
    + CASE WHEN F1.col3 = F2.col3 THEN 1 ELSE 0 END
    + CASE WHEN F1.col4 = F2.col4 THEN 1 ELSE 0 END
    + CASE WHEN F1.col5 = F2.col5 THEN 1 ELSE 0 END
    + CASE WHEN F1.col6 = F2.col6 THEN 1 ELSE 0 END
    + CASE WHEN F1.col7 = F2.col7 THEN 1 ELSE 0 END
    + CASE WHEN F1.col8 = F2.col8 THEN 1 ELSE 0 END
    + CASE WHEN F1.col9 = F2.col9 THEN 1 ELSE 0 END
    +  CASE WHEN F1.col10 = F2.col10 THEN 1 ELSE 0 END) >= 3;

Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 3:57pm


CREATE TABLE Foobar
(foo_key CHAR(10) MNOT NULL PRIMARY KEY,
 c01 INTEGER NOT NULL,
 c02 INTEGER NOT NULL,
 c03 INTEGER NOT NULL,
 c04 INTEGER NOT NULL,
 c05 INTEGER NOT NULL,
 c06 INTEGER NOT NULL,
 c07 INTEGER NOT NULL,
 c08 INTEGER NOT NULL,
 c09 INTEGER NOT NULL,
 c10 INTEGER NOT NULL); 

SELECT F1.foo_key AS first_dup, F2.foo_key AS second_dup
  FROM Foobar AS F1
       CROSS JOIN
       Foobar AS F2
WHERE F1.foo_key < F2.foo_key 
   AND 
(CASE WHEN F1.c01 <> F2.c01 THEN 1 ELSE 0 END
   + CASE WHEN F1.c01 <> F2.c01 THEN 1 ELSE 0 END
   + CASE WHEN F1.c02 <> F2.c02 THEN 1 ELSE 0 END
   + CASE WHEN F1.c03 <> F2.c03 THEN 1 ELSE 0 END
   + CASE WHEN F1.c04 <> F2.c04 THEN 1 ELSE 0 END
   + CASE WHEN F1.c05 <> F2.c05 THEN 1 ELSE 0 END
   + CASE WHEN F1.c06 <> F2.c06 THEN 1 ELSE 0 END
   + CASE WHEN F1.c07 <> F2.c07 THEN 1 ELSE 0 END
   + CASE WHEN F1.c08 <> F2.c08 THEN 1 ELSE 0 END
   + CASE WHEN F1.c09 <> F2.c09 THEN 1 ELSE 0 END
     CASE WHEN F1.c10 <> F2.c10 THEN 1 ELSE 0 END) >= 3;

July 23rd, 2015 4:38pm

I think I just had one part backwards, this appears to work.  Thanks again!

DECLARE @table TABLE (myKey int primary key,  col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
 INSERT INTO @table (myKey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES
 (1,1,2,3,4,5,6,7,8,9,10),
 (2,1,12,13,14,15,16,17,18,19,20),
 (3,1,22,23,24,25,26,27,28,29,30),
 (4,1,32,33,34,35,36,37,38,39,40),
 (5,1,42,3,44,5,46,47,48,49,50),
 (6,1,52,53,54,55,56,57,58,59,60),
 (7,2,62,63,64,65,66,67,68,69,70),
 (8,1,72,73,74,75,76,77,78,79,80),
 (9,1,82,83,84,85,86,87,88,89,90),
 (10,1,92,93,94,95,96,97,98,99,100)
 
SELECT F1.*
  FROM @table F1
    CROSS APPLY @table F2
 WHERE F1.myKey <> F2.myKey
 and (CASE WHEN F1.col1 = F2.col1 THEN 1 ELSE 0 END
    + CASE WHEN F1.col1 = F2.col1 THEN 1 ELSE 0 END
    + CASE WHEN F1.col2 = F2.col2 THEN 1 ELSE 0 END
    + CASE WHEN F1.col3 = F2.col3 THEN 1 ELSE 0 END
    + CASE WHEN F1.col4 = F2.col4 THEN 1 ELSE 0 END
    + CASE WHEN F1.col5 = F2.col5 THEN 1 ELSE 0 END
    + CASE WHEN F1.col6 = F2.col6 THEN 1 ELSE 0 END
    + CASE WHEN F1.col7 = F2.col7 THEN 1 ELSE 0 END
    + CASE WHEN F1.col8 = F2.col8 THEN 1 ELSE 0 END
    + CASE WHEN F1.col9 = F2.col9 THEN 1 ELSE 0 END
    +  CASE WHEN F1.col10 = F2.col10 THEN 1 ELSE 0 END) >= 3;

Free Windows Admin Tool Kit Click here and download it now
July 23rd, 2015 7:51pm

Hi mbevins,

You may try this as well.

DECLARE @table TABLE (myKey int primary key,  col1 INT, col2 INT, col3 INT, col4 INT,col5 INT, col6 INT, col7 INT, col8 INT, col9 INT, col10 INT)
 INSERT INTO @table (myKey, col1, col2, col3, col4, col5, col6, col7, col8, col9, col10) VALUES
 (1,1,2,3,4,5,6,7,8,9,10),
 (2,1,12,13,14,15,16,17,18,19,20),
 (3,1,22,23,24,25,26,27,28,29,30),
 (4,1,32,33,34,35,36,37,38,39,40),
 (5,1,42,3,44,5,46,47,48,49,50),
 (6,1,52,53,54,55,56,57,58,59,60),
 (7,2,62,63,64,65,66,67,68,69,70),
 (8,1,72,73,74,75,76,77,78,79,80),
 (9,1,82,83,84,85,86,87,88,89,90),
 (10,1,92,93,94,95,96,97,98,99,100) 

;WITH Cte AS(
SELECT myKey,col,value FROM @TABLE
UNPIVOT
(
VALUE FOR COL IN(COL1,COL2,COL3,COL4,COL5,COL6,COL7,COL8,COL9,COL10)
) upt
)
,Cte2 AS
(
SELECT myKey,CASE WHEN EXISTS(SELECT 1 FROM Cte WHERE col=C.col AND value=C.value AND myKey<>C.myKey)THEN 1 ELSE 0 END cnt
FROM Cte c 
)
SELECT * FROM @table T WHERE EXISTS (SELECT myKey FROM Cte2 WHERE myKey=T.myKey GROUP BY myKey HAVING SUM(cnt)>2)


 

If you have any question, feel free to let me know.
July 24th, 2015 8:14am

Good day mbevins,

I read your hidden message (that this response is not relevant) and I want to clarify something.

My response above, is not something that I ever wrote to anyone (except Celko). Celko usually abuses newbies in this way, by trying to make them feel like they are nothing, if they make non relevant small mistake. In the 0.001% of his responses that he actually writes code, shows what he know. This response Meant to use "Celko approach" to respond to Celko, in order to show him that to be in the other side is not nice! Something have to be done to stop him abuse newbies!

* We are trying to deal with Celko for a long time with no success. I am sorry that I had to post this response on public, but as I said, I got sick of his approach of harming newbies, which only came to get our help.

* please read some of my responses history in order to understand who am I and how I am usually responding (and please check Celko responses history as well, please notice how many of his response were edit by Moderators, there are much responses that you can not see, which we deleted... we tries to edits his responses by removing Curses for example, and yet there is a lot that you can see, which are only the tip of the iceberg).

I am glad that you got the answer to the question,
and I hope to see you here a lot in the future :-)

Free Windows Admin Tool Kit Click here and download it now
July 26th, 2015 3:42am

Thanks, I misread your reply and didnt even want to acknowledge the Other guys response.  He must live in a perfect world where all applications are brand new and written perfectly.  Where the rest of us have to work with applications that are over 15 years old and yes have duplicate records.

July 26th, 2015 1:02pm

You are most welcome :-)

Free Windows Admin Tool Kit Click here and download it now
July 26th, 2015 5:54pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics