Service Broker Activation Deadlocking

I have implemented a (slightly) modified version of conversation recycling using conversation timers and stored procedure activation from http://rusanu.com/2007/05/03/recycling-conversations/ . However it appears that, occasionally, deadlocks  occur between the send and activated procedures on the conversation group/table. The main modification is that instead of having a column to represent the SPID in the table I am using an IdentifierType and Identifier value to identify the conversation. However I am only using the defaults (@@SPID) so I don't think that should matter in this case.

For the send side I have:

CREATE PROCEDURE [dbo].[usp_SendMessage]
	(
	@endpointCode VARCHAR(255) = NULL,
	@endpointGroup VARCHAR(255) = NULL,
	@xmlPayload XML=NULL,
	@binaryPayload VARBINARY(MAX)=NULL,
	@varcharPayload VARCHAR(MAX)=NULL,
	@identifier VARCHAR(50) = @@SPID,
	@identifierType VARCHAR(50) = '@@SPID'
	)
AS BEGIN

	SET NOCOUNT ON

	DECLARE @fromService SYSNAME,
		@toService SYSNAME,
		@onContract SYSNAME,
		@messageType SYSNAME,
		@conversationTimeout INT

	SELECT @fromService = FromService
		, @toService = ToService
		, @onContract = OnContract
		, @messageType = MessageType
		, @conversationTimeout = ConversationTimeout
	FROM dbo.ServiceBrokerEndpointConfig
		WHERE GroupCode = @endpointGroup

	IF @fromService IS NULL OR @toService IS NULL OR @onContract IS NULL OR @messageType IS NULL OR @conversationTimeout IS NULL
	BEGIN
		RAISERROR (
			N'Failed to get endpoint config for GroupCode ''%s''.'
			, 16, 1, @endpointGroup) WITH LOG;
		RETURN;
	END

	DECLARE @SBDialog UNIQUEIDENTIFIER
	DECLARE @Message XML
	DECLARE @counter INT
	DECLARE @error INT
	DECLARE @handle UNIQUEIDENTIFIER;

	DECLARE @NotNullCount INT = ((CASE WHEN @xmlPayload IS NULL THEN 0 ELSE 1 END)
									+ (CASE WHEN @binaryPayload IS NULL THEN 0 ELSE 1 END)
									+ (CASE WHEN @varcharPayload IS NULL THEN 0 ELSE 1 END))
	IF @NotNullCount > 1
	BEGIN
		RAISERROR (
				N'Failed to SEND because %i payload fields are filled in when no more than 1 is expected'
				, 16, 1, @NotNullCount) WITH LOG;
		RETURN;
	END

	SET @counter = 1

	WHILE (1=1)
	BEGIN

		SET @handle = NULL

		-- Seek an eligible conversation in [ServiceBrokerConversations]
		-- We will hold an UPDLOCK on the composite primary key 
		SELECT @handle = Handle
			FROM [ServiceBrokerConversations] WITH (UPDLOCK)
			WHERE Identifier = @identifier
					AND IdentifierType = @identifierType
					AND FromService = @fromService
					AND ToService = @toService
					AND OnContract = @onContract;

		IF @handle IS NULL
		BEGIN

			-- Need to start a new conversation for the current @Id
			BEGIN DIALOG CONVERSATION @handle
				FROM SERVICE @fromService
				TO SERVICE @toService
				ON CONTRACT @onContract
				WITH ENCRYPTION = OFF;

			-- Then the sender must listen on the
			-- send queue for the http://schemas.microsoft.com/SQL/ServiceBroker/DialogTimer message type and
			-- cleanup appropriately.
			IF @conversationTimeout IS NOT NULL
			BEGIN
				BEGIN CONVERSATION TIMER (@handle) TIMEOUT = @conversationTimeout;
			END

			INSERT INTO [ServiceBrokerConversations]
				(Identifier, IdentifierType, FromService, ToService, OnContract, Handle)
				VALUES
				(@identifier, @identifierType, @fromService, @toService, @onContract, @handle);

		END;

		IF @xmlPayload IS NOT NULL
		BEGIN
			-- Attempt to SEND on the associated conversation
			;SEND ON CONVERSATION @handle 
				MESSAGE TYPE @messageType (@xmlPayload);
		END ELSE IF @binaryPayload IS NOT NULL
		BEGIN
			;SEND ON CONVERSATION @handle 
				MESSAGE TYPE @messageType (@binaryPayload);
		END ELSE BEGIN
			;SEND ON CONVERSATION @handle 
				MESSAGE TYPE @messageType (@varcharPayload);
		END

		SELECT @error = @@ERROR;

		IF @error = 0
		BEGIN
			-- Successful send, just exit the loop
			BREAK;
		END

		SELECT @counter = @counter+1;

		IF @counter > 10
		BEGIN
			-- We failed 10 times in a row, something must be broken
			RAISERROR (
				N'Failed to SEND on a conversation for more than 10 times. Error %i.'
				, 16, 1, @error) WITH LOG;
			BREAK;
		END

		-- Delete the associated conversation from the table and try again
		DELETE FROM [ServiceBrokerConversations]
			WHERE Handle = @handle;

		SET @handle = NULL;

	END

END


And for the activation on the initiator queue I have:

CREATE PROCEDURE [dbo].[usp_InitiatorQueueHandler]
AS
BEGIN

	SET NOCOUNT ON

	DECLARE @handle UNIQUEIDENTIFIER;
	DECLARE @messageTypeName SYSNAME;
	DECLARE @messageBody VARBINARY(MAX);

	WHILE (1=1)
	BEGIN

		BEGIN TRAN;

		;WAITFOR (RECEIVE TOP(1)
			@handle = conversation_handle,
			@messageTypeName = message_type_name,
			@messageBody = message_body
		FROM [InitiatorQueue]), TIMEOUT 5000;

		IF (@@ROWCOUNT = 0)
		BEGIN
			COMMIT TRAN;
			BREAK;
		END

		-- Call the base stored procedure to handle ending the conversation
		EXEC dbo.usp_BrokerHandleInitiator @handle, @messageTypeName, @messageBody

		COMMIT TRAN;

	END

END

GO

ALTER QUEUE [InitiatorQueue]
WITH ACTIVATION (
	STATUS=ON,
	PROCEDURE_NAME=dbo.usp_InitiatorQueueHandler,
    EXECUTE AS OWNER,
    MAX_QUEUE_READERS=10
)

GO

CREATE PROCEDURE [dbo].[usp_BrokerHandleInitiator]
	(
	@handle UNIQUEIDENTIFIER,
	@messageTypeName SYSNAME,
	@messageBody VARBINARY(MAX)
	)
AS
BEGIN

	SET NOCOUNT ON

	IF @handle IS NOT NULL
	BEGIN

		-- Delete the message from the [ServiceBrokerConversations] table
		-- before sending the [EndOfStream] message. The order is
		-- important to avoid deadlocks.
		IF @messageTypeName = N'http://schemas.microsoft.com/SQL/ServiceBroker/DialogTimer'
				OR @messageTypeName = N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
		BEGIN
			DELETE FROM [ServiceBrokerConversations]
				WHERE [Handle] = @handle;
		END

		IF @messageTypeName = N'http://schemas.microsoft.com/SQL/ServiceBroker/DialogTimer'
		BEGIN
			;SEND ON CONVERSATION @handle
				MESSAGE TYPE [EndOfStream];

		END
		ELSE IF @messageTypeName = N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
		BEGIN
			END CONVERSATION @handle;
		END
		ELSE IF @messageTypeName = N'http://schemas.microsoft.com/SQL/ServiceBroker/Error'
		BEGIN
			END CONVERSATION @handle;

			-- We could send a notification or store the error in a table for further inspection
			DECLARE @error INT;
			DECLARE @description NVARCHAR(4000);
			WITH XMLNAMESPACES (N'http://schemas.microsoft.com/SQL/ServiceBroker/Error' AS ssb)

			SELECT

				@error = CAST(@messageBody AS XML).value(
					'(//ssb:Error/ssb:Code)[1]', 'INT'),
				@description = CAST(@messageBody AS XML).value(
					'(//ssb:Error/ssb:Description)[1]', 'NVARCHAR(4000)')

			-- Maybe log to audit log instead?
			RAISERROR(N'Received error Code:%i Description:"%s"',
				 16, 1, @error, @description) WITH LOG;

		END;
	END
END

The deadlock XML is:

<deadlock>
 <victim-list>
  <victimProcess id="process807dbd0c8" />
 </victim-list>
 <process-list>
  <process id="process807dbd0c8" taskpriority="0" logused="0" waitresource="METADATA: database_id = 21 CONVERSATION_GROUP($hash = 0xff26c7e1:0x478840de:0xd403bb)" waittime="2600" ownerId="8333217736" transactionname="GetDialogByHandle" lasttranstarted="2015-03-23T10:53:58.683" XDES="0x87f251c90" lockMode="X" schedulerid="2" kpid="7220" status="suspended" spid="110" sbid="0" ecid="0" priority="0" trancount="2" lastbatchstarted="2015-03-23T10:53:58.683" lastbatchcompleted="2015-03-23T10:53:58.683" lastattention="1900-01-01T00:00:00.683" clientapp=".Net SqlClient Data Provider" hostname="COLFOQA2" hostpid="1436" loginname="dev" isolationlevel="read committed (2)" xactid="8333217704" currentdb="21" lockTimeout="4294967295" clientoption1="673185824" clientoption2="128056">
   <executionStack>
    <frame procname="MYDB.dbo.usp_SendMessage" line="116" stmtstart="7540" stmtend="7696" sqlhandle="0x03001500aada77428391a0005da4000001000000000000000000000000000000000000000000000000000000">
SEND ON CONVERSATION @handle 
				MESSAGE TYPE @messageType (@xmlPayload);    </frame>
   </executionStack>
   <inputbuf>
Proc [Database Id = 21 Object Id = 1115151018]   </inputbuf>
  </process>
  <process id="process869a5e558" taskpriority="0" logused="588" waitresource="KEY: 21:72057594039959552 (1f1ae6770d1b)" waittime="2600" ownerId="8333217730" transactionname="user_transaction" lasttranstarted="2015-03-23T10:53:58.683" XDES="0x3e28456a8" lockMode="U" schedulerid="4" kpid="6720" status="background" spid="22" sbid="0" ecid="0" priority="0" trancount="2">
   <executionStack>
    <frame procname="MYDB.dbo.usp_BrokerHandleInitiator" line="28" stmtstart="1996" stmtend="2144" sqlhandle="0x03001500f704cd06e691a0005da4000001000000000000000000000000000000000000000000000000000000">
DELETE FROM [ServiceBrokerConversations]
				WHERE [Handle] = @handle;    </frame>
    <frame procname="MYDB.dbo.usp_InitiatorQueueHandler" line="29" stmtstart="1014" stmtend="1172" sqlhandle="0x03001500316f56101694a0005da4000001000000000000000000000000000000000000000000000000000000">
EXEC dbo.usp_BrokerHandleInitiator @handle, @messageTypeName, @messageBody    </frame>
   </executionStack>
   <inputbuf>
   </inputbuf>
  </process>
 </process-list>
 <resource-list>
  <metadatalock subresource="CONVERSATION_GROUP" classid="$hash = 0xff26c7e1:0x478840de:0xd403bb" dbid="21" id="lock54fdb1800" mode="X">
   <owner-list>
    <owner id="process869a5e558" mode="X" />
   </owner-list>
   <waiter-list>
    <waiter id="process807dbd0c8" mode="X" requestType="wait" />
   </waiter-list>
  </metadatalock>
  <keylock hobtid="72057594039959552" dbid="21" objectname="MYDB.dbo.ServiceBrokerConversations" indexname="PK__ServiceB__877FDFD18DF079BD" id="lock6c65b1a00" mode="U" associatedObjectId="72057594039959552">
   <owner-list>
    <owner id="process807dbd0c8" mode="U" />
   </owner-list>
   <waiter-list>
    <waiter id="process869a5e558" mode="U" requestType="wait" />
   </waiter-list>
  </keylock>
 </resource-list>
</deadlock>

I have a clustered index on the fields I am SELECTing by and a UNIQUE index on the Handle (for the DELETE). When running the SELECT/DELETE statements against the table the query plan reports index seeks are being used:

CREATE TABLE [dbo].[ServiceBrokerConversations] (
    [Identifier]     VARCHAR (50)     NOT NULL,
    [IdentifierType] VARCHAR (50)     NOT NULL,
    [FromService]    [sysname]        NOT NULL,
    [ToService]      [sysname]        NOT NULL,
    [OnContract]     [sysname]        NOT NULL,
    [Handle]         UNIQUEIDENTIFIER NOT NULL,
    [CreateDate]     DATETIME2 (7)    NULL,
    PRIMARY KEY CLUSTERED ([Identifier] ASC, [IdentifierType] ASC, [FromService] ASC, [ToService] ASC, [OnContract] ASC) ON [PRIMARY],
    UNIQUE NONCLUSTERED ([Handle] ASC) ON [PRIMARY]
) ON [PRIMARY];

What appears to be happening is the DELETE is somehow deadlocking with the SEND but I am not sure how since I am using them in the same order in both the send procedure and the activated procedure. Also, RCSI is enabled on the database I am receiving the deadlocks on.

EDIT:

I think I have found the culprit with lock acquisition order:

- In the usp_SendMessage proc:

  • The SELECT locks the conversation record
  • The SEND locks the conversation group

- In the timer activated proc on the initiator queue:

  • The RECEIVE locks the conversation group
  • The DELETE locks the conversation record

Given that I think there may be a few solutions:

  1. There is some subtle difference between my code and the code from the article that I am not noticing that when fixed will resolve the deadlocking. I am hoping this is the case since it seems that others have used this pattern without issues as far as I know.
  2. Or...The deadlocking is inherent to the pattern the code is using and I can either:
  • Deal with the deadlocking by adjusting the deadlock priority on the activated stored procedure so that it becomes the victim, and I can implement retry logic.
  • Remove conversation timers and activation all together and resort to some sort of job that expires the conversation by polling it, where I can control the ordering.

My ultimate goal is to eliminate any deadlocking on usp_SendMessage so that it "never" fails.

I appreciate any feedback!

Thanks




  • Edited by ryanmwright Tuesday, March 24, 2015 2:42 PM
March 24th, 2015 12:47am

I can understand why the deadlock happens. As you point out the activation procedure and the send SP acquire locks on the resources in reverse order.

Really why Remus does not consider this in his blog post, I don't know. But may I ask, since you have replaced @@spid as a key with two other columns, does this also mean that multiple processes can use the same conversation? I'm not so sure that this is a good idea. I worked with an SB implementation which reuses conversations some months ago, and I recall that considered channing the pattern, but that I decided against it the end although I don't remember the exact details.

But so much is clear, if multiple processes can use the same handle, they will serialise on the SELECT with UPDLOCK. That will not happen if you change to REPEATABLEREAD, but I guess they will serialiase on the SEND instead.

The best way to address the problem appears to use SET LOCK_TIMEOUT 0 in the activation procedure and trap the the timeout error in a CATCH block, and let the message go back to the queue. This should be better than SET DEADLOCK_PRIORITY, since the there will never be a deadlock that upholds the the sender.

Free Windows Admin Tool Kit Click here and download it now
March 25th, 2015 10:44pm

Hi Erland,

Thank you for the reply.

In terms of using something other than the @@spid I think you're right, it should be approached with caution. To keep in line with the pattern I think I will modify the proc so that only the @@spid can be used so that there will be only one conversation per process.

I like the idea of setting the LOCK_TIMEOUT instead of setting the DEADLOCK_PRIORITY to prevent blocking. If I do I think the transaction will probably be more likely to rollback than with the DEADLOCK_PRIORITY option because it would rollback if that @@spid holds any locks, even in non deadlock scenarios. Since a rollback may be more likely perhaps I should disable the built in poison detection, to prevent the queue from being disabled if it exceeds 5 lock timeouts? (the message sending will be done through triggers so there will be a fairly high volume of them, and a long running process could lock the conversation for enough time for 5 failures to occur)

If I may ask, when you reused conversations in your other project did you implement something similar when cleaning up conversations?




March 29th, 2015 11:53am

I should add the caveat that I have not tested the solution with SET LOCK_TIMEOUT. I was not aware of this deadlock until I saw your post. (I will have to look into this situation for my project.)

I don't want to venture in whether you should disable poison-message detection, but maybe you should run a stress-test to see how likely it is for the queue to be disabled. Then again, I can't see any problem with a disabling poison-message detection on a queue that only receives timer messages.

I don't know how you use(d) IdentifierType and Identifier, but I get the feeling that if you share the conversation among processes, there is more likelihood for the deadlock to occur.

An alternative to the timer is to age out the conversation in the send procedure. That is, if the handle found is too old, send end-of-stream on that conversation, and then get a new handle. But this could leave the conversation around for a long time, if the process goes away.

Free Windows Admin Tool Kit Click here and download it now
March 29th, 2015 1:34pm

Hi Erland,

Thank you for the reply.

In terms of using something other than the @@spid I think you're right, it should be approached with caution. To keep in line with the pattern I think I will modify the proc so that only the @@spid can be used so that there will be only one conversation per process.

I like the idea of setting the LOCK_TIMEOUT instead of setting the DEADLOCK_PRIORITY to prevent blocking. If I do I think the transaction will probably be more likely to rollback than with the DEADLOCK_PRIORITY option because it would rollback if that @@spid holds any locks, even in non deadlock scenarios. Since a rollback may be more likely perhaps I should disable the built in poison detection, to prevent the queue from being disabled if it exceeds 5 lock timeouts? (the message sending will be done through triggers so there will be a fairly high volume of them, and a long running process could lock the conversation for enough time for 5 failures to occur)

If I may ask, when you reused conversations in your other project did you implement something similar when cleaning up conversations?




March 29th, 2015 3:52pm

Hi Erland,

I have decided to go with your last suggestion which is to age the conversation in the send procedure. After doing this it appears that all deadlocking has stopped.

Currently my process looks like this:

  • Cleanup the conversation in the usp_SendMessage procedure if it should be aged (DELETE it from the table and send the EndOfStream message, then create a new conversation). I removed the conversation timer from the process since it is not needed anymore.
  • The activated procedure's only job is now to end its side of the conversation when it receives the EndDialog message.
  • Run a nightly job to clean up all the conversations that are left around because the process did not send any new messages (DELETE it from the table and send the EndOfStream message).

Thank you for the help on this!!



Free Windows Admin Tool Kit Click here and download it now
April 1st, 2015 10:39pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics