Best Match Across Multiple Columns

I have a requirement to get the best match between the columns of two tables, so for example:

Table A

ID

C1

C2

C3

C4

C5

C6

1

A

B

2

A

D

C

3

A

4

D

5

B

6

B

C

Table B

ID

C1

C2

C3

C4

C5

C6

Match

1

A

B

B

B

D

C

1, 3, 5

2

A

C

B

C

C

C

1, 3, 5, 6

3

B

B

B

C

C

C

5

4

E

D

B

F

F

E

4, 5

A row matches when any of the columns are equal or null.  The match column in Table B shows which rows of Table A that row should match.  Once I know which columns match I will then be able to determine the best match by which row matches most columns.  The following query seems to do the job of identifying which rows match:

SELECT TableA.ID, TableB.ID
FROM TableA
	LEFT OUTER JOIN TableB
	ON (TableB.C1 = TableA.C1 OR TableA.C1 IS NULL)
		AND (TableB.C2 = TableA.C2 OR TableA.C2 IS NULL)
		AND (TableB.C3 = TableA.C3 OR TableA.C3 IS NULL)
		AND (TableB.C4 = TableA.C4 OR TableA.C4 IS NULL)
		AND (TableB.C5 = TableA.C5 OR TableA.C5 IS NULL)
		AND (TableB.C6 = TableA.C6 OR TableA.C6 IS NULL)
WHERE NOT (TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableA.C3 IS NULL
		AND TableA.C4 IS NULL
		AND TableA.C5 IS NULL
		AND TableA.C6 IS NULL)

However I am finding that when both these table have around 400K records the query just takes far too long to run.  How can I optimise this query?  Are there any indexes which would help or is there a better way to approach this problem?

Thanks in advance for any help,

Graham.


July 19th, 2013 6:46am

Please try this:

CREATE NONCLUSTERED INDEX [NC_Ix-TableA-All] 
  ON [dbo].[TableA] 
    ( [ID] ASC ) 
	INCLUDE ( [C1],
			  [C2],
			  [C3],
			  [C4],
			  [C5],
			  [C6] );
GO

SELECT TableA.ID, TableB.ID
FROM TableA
	LEFT OUTER JOIN TableB
	ON TableB.C1 = ISNULL(TableA.C1, TableB.C1) 
		AND TableB.C2 = ISNULL(TableA.C2, TableB.C2)
		AND TableB.C3 = ISNULL(TableA.C3, TableB.C3)
		AND TableB.C4 = ISNULL(TableA.C4, TableB.C4)
		AND TableB.C5 = ISNULL(TableA.C5, TableB.C5)
		AND TableB.C6 = ISNULL(TableA.C6, TableB.C6)
WHERE NOT (TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableA.C3 IS NULL
		AND TableA.C4 IS NULL
		AND TableA.C5 IS NULL
		AND TableA.C6 IS NULL);

Free Windows Admin Tool Kit Click here and download it now
July 19th, 2013 9:29am

Thanks for the response, I have tried your suggestion but this has not improved the query.
July 19th, 2013 10:47am

So please post complete DDL.
Free Windows Admin Tool Kit Click here and download it now
July 19th, 2013 4:00pm

Optimization article:

http://www.sqlusa.com/articles/query-optimization/

The very basis of optimization is to consider indexing the columns in WHERE clause and JOIN ON clause predicates.

You may also consider specialty indexing such as covering index:

http://www.sqlusa.com/bestpractices/coveringindex/

It would be helpful if you let us know what bottleneck do you see in the execution plan.

July 27th, 2013 7:40pm

With large tables, this is bound to be slow, because in its nature it is a cartesian product.

It is hard to make it perform better than a cartesian product. In your case you might be able to leverage the fact that you don't have to match the NULL values.

Think Divide & Conquer. If you first match all rows of TableA with the maximum number of NULLs, then you won't have to match the subsets in TableA.

For example, rows in TableB that don't match row 5 in TableA (the row with only the value 'B' in C3) will also never match rows 1 and 6 of TableA, because these rows require the same 'B' in C3.

Here is something to sink you teeth in...

create table TableA(ID int primary key clustered,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char)
insert into TableA values(1,'A',null,'B',null,null,null)
insert into TableA values(2,'A','D',null,'C',null,null)
insert into TableA values(3,'A',null,null,null,null,null)
insert into TableA values(4,null,'D',null,null,null,null)
insert into TableA values(5,null,null,'B',null,null,null)
insert into TableA values(6,null,null,'B','C',null,null)

create table TableB(ID int primary key clustered,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char)
insert into TableB values(1,'A','B','B','B','D','C')  -- 1,3,5
insert into TableB values(2,'A','C','B','C','C','C')  -- 1,3,5,6
insert into TableB values(3,'B','B','B','C','C','C')  -- 5,6
insert into TableB values(4,'E','D','B','F','F','E')  -- 4,5

create table #results(pattern int, ID int,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char,score int,constraint PK_results primary key clustered(pattern,ID))

SELECT identity(int,1,1) AS pattern, *
INTO #t
FROM (
  SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
  FROM TableA
  WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 1
  GROUP BY C1, C2, C3, C4, C5, C6
) X

create unique clustered index CLX_t on #t(pattern)

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C1 = TableB.C1
WHERE #t.C1 IS NOT NULL

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C2 = TableB.C2
WHERE #t.C2 IS NOT NULL

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C3 = TableB.C3
WHERE #t.C3 IS NOT NULL
-- C4 - C6 omitted for brevity

declare @max int
set @max=(SELECT COALESCE(MAX(pattern),0) FROM #t)

SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
INTO #t2
FROM TableA
WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 2
GROUP BY C1, C2, C3, C4, C5, C6

INSERT INTO #t
SELECT DISTINCT #t2.C1, #t2.C2, #t2.C3, #t2.C4, #t2.C5, #t2.C6, #t2.Weight
FROM #t2, #t
WHERE COALESCE(#t.C1,#t2.C1,'')=COALESCE(#t2.C1,'')
AND   COALESCE(#t.C2,#t2.C2,'')=COALESCE(#t2.C2,'')
AND   COALESCE(#t.C3,#t2.C3,'')=COALESCE(#t2.C3,'')
AND   COALESCE(#t.C4,#t2.C4,'')=COALESCE(#t2.C4,'')
AND   COALESCE(#t.C5,#t2.C5,'')=COALESCE(#t2.C5,'')
AND   COALESCE(#t.C6,#t2.C6,'')=COALESCE(#t2.C6,'')

DELETE FROM #t2
WHERE EXISTS (
  SELECT *
  FROM #t
  WHERE pattern > @max
  AND COALESCE(#t.C1,'')=COALESCE(#t2.C1,'')
  AND COALESCE(#t.C2,'')=COALESCE(#t2.C2,'')
  AND COALESCE(#t.C3,'')=COALESCE(#t2.C3,'')
  AND COALESCE(#t.C4,'')=COALESCE(#t2.C4,'')
  AND COALESCE(#t.C5,'')=COALESCE(#t2.C5,'')
  AND COALESCE(#t.C6,'')=COALESCE(#t2.C6,'')
)

INSERT INTO #results
SELECT DISTINCT new.pattern, R.ID, R.C1, R.C2, R.C3, R.C4, R.C5, R.C6, new.Weight
FROM #t new
JOIN #t old
  ON  new.pattern > @max
  AND old.pattern <= @max
  AND (old.C1 IS NULL OR new.C1 = old.C1)
  AND (old.C2 IS NULL OR new.C2 = old.C2)
  AND (old.C3 IS NULL OR new.C3 = old.C3)
  AND (old.C4 IS NULL OR new.C4 = old.C4)
  AND (old.C5 IS NULL OR new.C5 = old.C5)
  AND (old.C6 IS NULL OR new.C6 = old.C6)
JOIN #results R
  ON  (new.C1 IS NULL OR R.C1 = new.C1)
  AND (new.C2 IS NULL OR R.C2 = new.C2)
  AND (new.C3 IS NULL OR R.C3 = new.C3)
  AND (new.C4 IS NULL OR R.C4 = new.C4)
  AND (new.C5 IS NULL OR R.C5 = new.C5)
  AND (new.C6 IS NULL OR R.C6 = new.C6)

-- handle remaining #t2 patterns that aren't subsets
-- omitted for brevity

set @max=(SELECT COALESCE(MAX(pattern),0) FROM #t)

truncate table #t2

insert INTO #t2
SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
FROM TableA
WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 3
GROUP BY C1, C2, C3, C4, C5, C6

INSERT INTO #t
SELECT DISTINCT #t2.C1, #t2.C2, #t2.C3, #t2.C4, #t2.C5, #t2.C6, #t2.Weight
FROM #t2, #t
WHERE COALESCE(#t.C1,#t2.C1,'')=COALESCE(#t2.C1,'')
AND   COALESCE(#t.C2,#t2.C2,'')=COALESCE(#t2.C2,'')
AND   COALESCE(#t.C3,#t2.C3,'')=COALESCE(#t2.C3,'')
AND   COALESCE(#t.C4,#t2.C4,'')=COALESCE(#t2.C4,'')
AND   COALESCE(#t.C5,#t2.C5,'')=COALESCE(#t2.C5,'')
AND   COALESCE(#t.C6,#t2.C6,'')=COALESCE(#t2.C6,'')

DELETE FROM #t2
WHERE EXISTS (
  SELECT *
  FROM #t
  WHERE pattern > @max
  AND COALESCE(#t.C1,'')=COALESCE(#t2.C1,'')
  AND COALESCE(#t.C2,'')=COALESCE(#t2.C2,'')
  AND COALESCE(#t.C3,'')=COALESCE(#t2.C3,'')
  AND COALESCE(#t.C4,'')=COALESCE(#t2.C4,'')
  AND COALESCE(#t.C5,'')=COALESCE(#t2.C5,'')
  AND COALESCE(#t.C6,'')=COALESCE(#t2.C6,'')
)

INSERT INTO #results
SELECT DISTINCT new.pattern, R.ID, R.C1, R.C2, R.C3, R.C4, R.C5, R.C6, new.Weight
FROM #t new
JOIN #t old
  ON  new.pattern >  @max
  AND old.pattern <= @max
  AND (old.C1 IS NULL OR new.C1 = old.C1)
  AND (old.C2 IS NULL OR new.C2 = old.C2)
  AND (old.C3 IS NULL OR new.C3 = old.C3)
  AND (old.C4 IS NULL OR new.C4 = old.C4)
  AND (old.C5 IS NULL OR new.C5 = old.C5)
  AND (old.C6 IS NULL OR new.C6 = old.C6)
JOIN #results R
  ON  (new.C1 IS NULL OR R.C1 = new.C1)
  AND (new.C2 IS NULL OR R.C2 = new.C2)
  AND (new.C3 IS NULL OR R.C3 = new.C3)
  AND (new.C4 IS NULL OR R.C4 = new.C4)
  AND (new.C5 IS NULL OR R.C5 = new.C5)
  AND (new.C6 IS NULL OR R.C6 = new.C6)

-- handle remaining #t2 patterns that aren't subsets
-- omitted for brevity

-- the final results
SELECT R.ID,SUM(COALESCE(score,0))
FROM TableA
JOIN #t
  ON  (#t.C1 = TableA.C1 OR (#t.C1 IS NULL AND TableA.C1 IS NULL))
  AND (#t.C2 = TableA.C2 OR (#t.C2 IS NULL AND TableA.C2 IS NULL))
  AND (#t.C3 = TableA.C3 OR (#t.C3 IS NULL AND TableA.C3 IS NULL))
  AND (#t.C4 = TableA.C4 OR (#t.C4 IS NULL AND TableA.C4 IS NULL))
  AND (#t.C5 = TableA.C5 OR (#t.C5 IS NULL AND TableA.C5 IS NULL))
  AND (#t.C6 = TableA.C6 OR (#t.C6 IS NULL AND TableA.C6 IS NULL))
JOIN #results R
  ON R.pattern = #t.pattern
GROUP BY R.ID

drop table #t2
drop table #results
drop table #t

drop table TableB
drop table TableA

Now this is a first draft, incomplete for handling remainders, and can be optimized further ("if @@rowcount>0"). Also, there is a point at which the #result table becomes too big, and you could implement another round of dividing and conquering.

Of course you'll have to judge whether it is all worth it. As you can guess, writing such code is quite complex. also, simple changes (in your original query) are a hell of a lot of work with the approach I described.

A final note: you may want to replace the NULLs with a magic value that is not used in your regular data. Not only will it make many of your queries easier to write, but you can also expect better index use because of it.

Free Windows Admin Tool Kit Click here and download it now
July 28th, 2013 6:28am

With large tables, this is bound to be slow, because in its nature it is a cartesian product.

It is hard to make it perform better than a cartesian product. In your case you might be able to leverage the fact that you don't have to match the NULL values, and that you can determine hierarchy in your TableA / TableB data.

Think Divide & Conquer. If you first match all rows of TableA with the maximum number of NULLs, then you won't have to match the subsets in TableA.

For example, rows in TableB that don't match row 5 in TableA (the row with only the value 'B' in C3) will also never match rows 1 and 6 of TableA, because these rows require the same 'B' in C3.

Here is something to sink you teeth in...

create table TableA(ID int primary key clustered,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char)
insert into TableA values(1,'A',null,'B',null,null,null)
insert into TableA values(2,'A','D',null,'C',null,null)
insert into TableA values(3,'A',null,null,null,null,null)
insert into TableA values(4,null,'D',null,null,null,null)
insert into TableA values(5,null,null,'B',null,null,null)
insert into TableA values(6,null,null,'B','C',null,null)

create table TableB(ID int primary key clustered,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char)
insert into TableB values(1,'A','B','B','B','D','C')  -- 1,3,5
insert into TableB values(2,'A','C','B','C','C','C')  -- 1,3,5,6
insert into TableB values(3,'B','B','B','C','C','C')  -- 5,6
insert into TableB values(4,'E','D','B','F','F','E')  -- 4,5

create table #results(pattern int, ID int,C1 char,C2 char,C3 char,C4 char,C5 char,C6 char,score int,constraint PK_results primary key clustered(pattern,ID))

SELECT identity(int,1,1) AS pattern, *
INTO #t
FROM (
  SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
  FROM TableA
  WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
  +     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 1
  GROUP BY C1, C2, C3, C4, C5, C6
) X

create unique clustered index CLX_t on #t(pattern)

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C1 = TableB.C1
WHERE #t.C1 IS NOT NULL

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C2 = TableB.C2
WHERE #t.C2 IS NOT NULL

INSERT INTO #results
SELECT pattern, TableB.ID, TableB.C1, TableB.C2, TableB.C3, TableB.C4, TableB.C5, TableB.C6, Weight
FROM #t JOIN TableB ON #t.C3 = TableB.C3
WHERE #t.C3 IS NOT NULL
-- C4 - C6 omitted for brevity

declare @max int
set @max=(SELECT COALESCE(MAX(pattern),0) FROM #t)

SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
INTO #t2
FROM TableA
WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 2
GROUP BY C1, C2, C3, C4, C5, C6

INSERT INTO #t
SELECT DISTINCT #t2.C1, #t2.C2, #t2.C3, #t2.C4, #t2.C5, #t2.C6, #t2.Weight
FROM #t2, #t
WHERE COALESCE(#t.C1,#t2.C1,'')=COALESCE(#t2.C1,'')
AND   COALESCE(#t.C2,#t2.C2,'')=COALESCE(#t2.C2,'')
AND   COALESCE(#t.C3,#t2.C3,'')=COALESCE(#t2.C3,'')
AND   COALESCE(#t.C4,#t2.C4,'')=COALESCE(#t2.C4,'')
AND   COALESCE(#t.C5,#t2.C5,'')=COALESCE(#t2.C5,'')
AND   COALESCE(#t.C6,#t2.C6,'')=COALESCE(#t2.C6,'')

DELETE FROM #t2
WHERE EXISTS (
  SELECT *
  FROM #t
  WHERE pattern > @max
  AND COALESCE(#t.C1,'')=COALESCE(#t2.C1,'')
  AND COALESCE(#t.C2,'')=COALESCE(#t2.C2,'')
  AND COALESCE(#t.C3,'')=COALESCE(#t2.C3,'')
  AND COALESCE(#t.C4,'')=COALESCE(#t2.C4,'')
  AND COALESCE(#t.C5,'')=COALESCE(#t2.C5,'')
  AND COALESCE(#t.C6,'')=COALESCE(#t2.C6,'')
)

INSERT INTO #results
SELECT DISTINCT new.pattern, R.ID, R.C1, R.C2, R.C3, R.C4, R.C5, R.C6, new.Weight
FROM #t new
JOIN #t old
  ON  new.pattern > @max
  AND old.pattern <= @max
  AND (old.C1 IS NULL OR new.C1 = old.C1)
  AND (old.C2 IS NULL OR new.C2 = old.C2)
  AND (old.C3 IS NULL OR new.C3 = old.C3)
  AND (old.C4 IS NULL OR new.C4 = old.C4)
  AND (old.C5 IS NULL OR new.C5 = old.C5)
  AND (old.C6 IS NULL OR new.C6 = old.C6)
JOIN #results R
  ON  (new.C1 IS NULL OR R.C1 = new.C1)
  AND (new.C2 IS NULL OR R.C2 = new.C2)
  AND (new.C3 IS NULL OR R.C3 = new.C3)
  AND (new.C4 IS NULL OR R.C4 = new.C4)
  AND (new.C5 IS NULL OR R.C5 = new.C5)
  AND (new.C6 IS NULL OR R.C6 = new.C6)

-- handle remaining #t2 patterns that aren't subsets
-- omitted for brevity

set @max=(SELECT COALESCE(MAX(pattern),0) FROM #t)

truncate table #t2

insert INTO #t2
SELECT C1, C2, C3, C4, C5, C6, COUNT(*) AS Weight
FROM TableA
WHERE CASE WHEN C1 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C2 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C3 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C4 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C5 IS NULL THEN 0 ELSE 1 END
+     CASE WHEN C6 IS NULL THEN 0 ELSE 1 END = 3
GROUP BY C1, C2, C3, C4, C5, C6

INSERT INTO #t
SELECT DISTINCT #t2.C1, #t2.C2, #t2.C3, #t2.C4, #t2.C5, #t2.C6, #t2.Weight
FROM #t2, #t
WHERE COALESCE(#t.C1,#t2.C1,'')=COALESCE(#t2.C1,'')
AND   COALESCE(#t.C2,#t2.C2,'')=COALESCE(#t2.C2,'')
AND   COALESCE(#t.C3,#t2.C3,'')=COALESCE(#t2.C3,'')
AND   COALESCE(#t.C4,#t2.C4,'')=COALESCE(#t2.C4,'')
AND   COALESCE(#t.C5,#t2.C5,'')=COALESCE(#t2.C5,'')
AND   COALESCE(#t.C6,#t2.C6,'')=COALESCE(#t2.C6,'')

DELETE FROM #t2
WHERE EXISTS (
  SELECT *
  FROM #t
  WHERE pattern > @max
  AND COALESCE(#t.C1,'')=COALESCE(#t2.C1,'')
  AND COALESCE(#t.C2,'')=COALESCE(#t2.C2,'')
  AND COALESCE(#t.C3,'')=COALESCE(#t2.C3,'')
  AND COALESCE(#t.C4,'')=COALESCE(#t2.C4,'')
  AND COALESCE(#t.C5,'')=COALESCE(#t2.C5,'')
  AND COALESCE(#t.C6,'')=COALESCE(#t2.C6,'')
)

INSERT INTO #results
SELECT DISTINCT new.pattern, R.ID, R.C1, R.C2, R.C3, R.C4, R.C5, R.C6, new.Weight
FROM #t new
JOIN #t old
  ON  new.pattern >  @max
  AND old.pattern <= @max
  AND (old.C1 IS NULL OR new.C1 = old.C1)
  AND (old.C2 IS NULL OR new.C2 = old.C2)
  AND (old.C3 IS NULL OR new.C3 = old.C3)
  AND (old.C4 IS NULL OR new.C4 = old.C4)
  AND (old.C5 IS NULL OR new.C5 = old.C5)
  AND (old.C6 IS NULL OR new.C6 = old.C6)
JOIN #results R
  ON  (new.C1 IS NULL OR R.C1 = new.C1)
  AND (new.C2 IS NULL OR R.C2 = new.C2)
  AND (new.C3 IS NULL OR R.C3 = new.C3)
  AND (new.C4 IS NULL OR R.C4 = new.C4)
  AND (new.C5 IS NULL OR R.C5 = new.C5)
  AND (new.C6 IS NULL OR R.C6 = new.C6)

-- handle remaining #t2 patterns that aren't subsets
-- omitted for brevity

-- the final results
SELECT R.ID,SUM(COALESCE(score,0))
FROM TableA
JOIN #t
  ON  (#t.C1 = TableA.C1 OR (#t.C1 IS NULL AND TableA.C1 IS NULL))
  AND (#t.C2 = TableA.C2 OR (#t.C2 IS NULL AND TableA.C2 IS NULL))
  AND (#t.C3 = TableA.C3 OR (#t.C3 IS NULL AND TableA.C3 IS NULL))
  AND (#t.C4 = TableA.C4 OR (#t.C4 IS NULL AND TableA.C4 IS NULL))
  AND (#t.C5 = TableA.C5 OR (#t.C5 IS NULL AND TableA.C5 IS NULL))
  AND (#t.C6 = TableA.C6 OR (#t.C6 IS NULL AND TableA.C6 IS NULL))
JOIN #results R
  ON R.pattern = #t.pattern
GROUP BY R.ID

drop table #t2
drop table #results
drop table #t

drop table TableB
drop table TableA

Now this is a first draft, incomplete for handling remainders, and can be optimized further ("if @@rowcount>0"). Also, there is a point at which the #result table becomes too big, and you could implement another round of dividing and conquering.

Of course you'll have to judge whether it is all worth it. As you can guess, writing such code is quite complex. also, simple changes (in your original query) are a hell of a lot of work with the approach I described.

A final note: you may want to replace the NULLs with a magic value that is not used in your regular data. Not only will it make many of your queries easier to write, but you can also expect better index use because of it.

July 28th, 2013 1:26pm

Thanks all for your responses and apologises for my delayed response.

Gert-Jan, a very interesting approach.  As you mention this is quite a complicated solution involving alot of code, but I guess there is no easy way to solve this, especially when working with approximately 5 million rows in table B.  I would like to spend some time trialling your approach but due to time constraints I have had to implement another solution which appears to work.

--Match on first column (column with most distinct values).
SELECT TableA.ID, TableB.ID
FROM TableA
	LEFT OUTER JOIN TableB
	ON TableB.C1 = TableA.C1
		AND (TableB.C2 = TableA.C2 OR TableA.C2 IS NULL)
		AND (TableB.C3 = TableA.C3 OR TableA.C3 IS NULL)
		AND (TableB.C4 = TableA.C4 OR TableA.C4 IS NULL)
		AND (TableB.C5 = TableA.C5 OR TableA.C5 IS NULL)
		AND (TableB.C6 = TableA.C6 OR TableA.C6 IS NULL)
WHERE NOT (TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableA.C3 IS NULL
		AND TableA.C4 IS NULL
		AND TableA.C5 IS NULL
		AND TableA.C6 IS NULL)

--Match on second column, null only in first column.
SELECT TableA.ID, TableB.ID
FROM TableA
	LEFT OUTER JOIN TableB
	ON TableA.C1 IS NULL
		AND TableB.C2 = TableA.C2
		AND (TableB.C3 = TableA.C3 OR TableA.C3 IS NULL)
		AND (TableB.C4 = TableA.C4 OR TableA.C4 IS NULL)
		AND (TableB.C5 = TableA.C5 OR TableA.C5 IS NULL)
		AND (TableB.C6 = TableA.C6 OR TableA.C6 IS NULL)
WHERE NOT (TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableA.C3 IS NULL
		AND TableA.C4 IS NULL
		AND TableA.C5 IS NULL
		AND TableA.C6 IS NULL)

--Match on third column, null only in first and second column.
SELECT TableA.ID, TableB.ID
FROM TableA
	LEFT OUTER JOIN TableB
	ON TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableB.C3 = TableA.C3
		AND (TableB.C4 = TableA.C4 OR TableA.C4 IS NULL)
		AND (TableB.C5 = TableA.C5 OR TableA.C5 IS NULL)
		AND (TableB.C6 = TableA.C6 OR TableA.C6 IS NULL)
WHERE NOT (TableA.C1 IS NULL
		AND TableA.C2 IS NULL
		AND TableA.C3 IS NULL
		AND TableA.C4 IS NULL
		AND TableA.C5 IS NULL
		AND TableA.C6 IS NULL)

--Continue with fourth, fifth and sixth columns

Within this approach I am matching one column at a time which allows the query to use the index resulting in significant speed gains. These matches are written to a results table and the next column is checked for matches.  The previous column can be set to null in the second check as it has already been matched in its entirety which reduces the possible matches. And this goes on until all matches are added to the results table where the best match can then be selected.

I never thought this problem was so complex until I delved deeper into it but I think this solution will do the trick for the time being.  It would be interesting to compare the query speeds for both this and Gert-Jan's approach, something I might look into when time allows.

Cheers, Graham.

Free Windows Admin Tool Kit Click here and download it now
July 30th, 2013 4:12am

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics