SSIS Script Component works on low volume of data but not on high
Hi, I have a SSIS 2008r2 Script Component generating a MD5 Hash value over two bigint data columns. When I push 2-10 rows of data though the component it produce the same hash value time after time, for the same data column values as below: 5543 46113 VWj3yyvbyIegGhaTr+yq/Q== 5543 46114 EH/6YNiMKgFb1tvWIlxhbA== 5543 46117 DmmaOUEj9W80vOlmEIHv8g== 5543 46116 jm9UVJnllx8vBjzTC8gaDg== 5543 46115 fULsdonw0LSRP/YE570BAA== When I push production volumes thought the pipeline some 300,000 rows. The same data columns generate a different MD5 Hash value every time the data is processed. I have tested this by running the package and by just executing the data flow task. Have I missed some think? The script task is in VB Script Thanks
July 25th, 2011 9:38am

Can you share the code to see whats wrong?Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 10:15am

Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Imports Microsoft.SqlServer.Dts.Pipeline Imports System.Text Imports System.Security.Cryptography <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _ <CLSCompliant(False)> _ Public Class ScriptMain Inherits UserComponent Private inputBuffer As PipelineBuffer Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer) inputBuffer = Buffer MyBase.ProcessInput(InputID, Buffer) End Sub Public Shared Function CreateHash(ByVal data As String) As String Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data) Dim md5 As MD5 = New MD5CryptoServiceProvider() Dim hashedData As Byte() = md5.ComputeHash(dataToHash) RNGCryptoServiceProvider.Create().GetBytes(dataToHash) Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None) Return s End Function Public Overrides Sub PreExecute() MyBase.PreExecute() End Sub Public Overrides Sub PostExecute() MyBase.PostExecute() End Sub Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer) Dim counter As Integer = 0 Dim values As New StringBuilder Dim Result As String For counter = 0 To inputBuffer.ColumnCount - 1 Dim value As Object value = inputBuffer.Item(counter) values.Append(value) Next 'Result = CreateHash(values.ToString()) Row.BKHashValue = CreateHash(values.ToString()) 'MsgBox(Result) End Sub End Class While working on this I have concluded the issue starts once the pipeline buffer becomes full and the pipeline batches the records through the component. Thank you for any input
July 25th, 2011 10:28am

The pipeline ALWAYS batches records. Isn't there a seed value or something that you have to use in order to hash different contents consistently to the same value? If that's the case, I don't see you doing that here, and the default may be not to (?) which may be causing the problem... ? Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 11:21am

Thanks Todd, There would not appear to be a seed value required for 'ComputeHash', just the data to be hashed. It appears to be data after the first batching that starts to receive differing hashed values.
July 25th, 2011 11:31am

IIRC, the MD5 functionality in .NET uses unmanaged resources. I've never completely understood how the script components in SSIS do garbage collection, but you could try calling: md5.Clear() just prior to your function return.
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 12:35pm

Thanks. on 10,000 rows run twice I get 111 new hash key values. I think you are on the right lines as it works upto the 10,000 rows fine.
July 25th, 2011 12:52pm

Assumingthat's what's going on, you should probably then create your "md5" variable just once in the PreExecute method, as well as calling your RNGCryptoServiceProvider.Create() method just once and retaining a reference to the returned object (also in the PreExecute method). Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 7:57pm

Assumingthat's what's going on, you should probably then create your "md5" variable just once in the PreExecute method, as well as calling your RNGCryptoServiceProvider.Create() method just once and retaining a reference to the returned object (also in the PreExecute method). Talk to me now on That's what I'd do... The way you're doing this now, you're creating an MD5 object for each row and relying on Garbage Collection to take over when md5 goes out of scope; many thousands of times a second... I'm not sure this is happening so efficiently (or at all) in this case.
July 25th, 2011 8:08pm

FWIW, I've just put hashing in a package, specifically to match a .Net hash to T-SQL hashing. It wasn't easy - mostly to do with having to convert the byte result into a comparable string... (a great resource was Mladen Prajdic's blog.) I'll be uploading that package soon... I'll try to remember to post a link here. Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 26th, 2011 11:37am

I'm glad to see this post, as I'm having the exact same issue. For < 10000 rows, it works fine. Unfortunately, I'm trying to use this to identify modified records in a slowly changing dimension with over 300K rows (at 60 columns). I've also tried SHA1 to no avail. If I run it 2 times in a row, over 200K have a different hash value. FYI, the SCD wizard (rbar) isn't even an option. Way too slow.
July 27th, 2011 9:41pm

If you're doing SCD work, you may want to try my Dimension Merge component on CodePlex. The sample code I was talking about is in this folder on my SkyDrive, called "b11df12d-b972-41de-96b3-4c1448923c6e High-Performance Upserts.zip" Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 28th, 2011 11:09am

Thanks Todd for the suggestion. I'll have to check that task out. I like the row change reason and the retiring features. I configured the package exactly as described in the article, Speeding Up Incremental ETL Processes in SSIS by using MD5 hashes, but the md5 values weren't consistent between loads. Similar to ExpendableGrunt, the hash's would change if processing > 10000 records. In the end, I wound up replacing the script component with the SSIS Multiple Hash tool, which accurately creates the MD5 hash that I need. I had to change the HashValue column to binary(16) and throw a DT_WSTR, 35 cast in on the conditional split compare expression. Also, rather than the (rbar) OleDb command, I'm sending the modified records to a staging table and then doing a set based update on the dimension in the next step of the control flow. We're still in proof of concept and our scope is limited to type 1 changes, but I'm sure the merge task will come in handy soon.
July 28th, 2011 12:35pm

If you're interested, the code for the Multiple Hash transform (which I also use) is on CodePlex. The code in a custom component maps pretty closely (ProcessInput, PreExecute, etc) to a Script component, so it might show you the "right" way to use the .Net hashing code... Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 28th, 2011 1:49pm

Hi, I followed exact same exercise, and I am facing the exact same problem, where the MD5 is returning different values for the same row. I was going through your post thread, and I dont see the final answer, am I missing anything? Did Todd, and other gave you the final verdict yet? Did you manage to use the same code from http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/ or did you made any changes to that? I am doing the same prototyping too. Could you please share your experience with this? Thanks in Advance for your answer. MA
September 19th, 2011 5:20pm

I swear I had a test SSIS package where I worked this out - because I did want to confirm that I could get reliable hashing between SQL and SSIS. And I swear I did. I'm actually mad now that I can't find it, cause it took some work to get it right, and I'm going to use it in a few months. Sorry I can't help more at the moment - but I can't find that package... Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
September 24th, 2011 8:48pm

LOL, Thanks so much Todd, for you help. I really appreciate it. I finally got it right for the last time (I think..!) With tones and tones of research and spending few nights looking through it. I made few changes to the code in the article. http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/ I computed SHA1 algorithm instead of MD5. secondly, I pass in the actual column values with actual data type, before I tried to convert it to String and it dint work. I can send you my script component task to you, and please let me know what you think about it. In the script, I might have declared MD5.Clear() function many times in many places, only to make sure I am clearing that buffer refresh problem then I also have 2 function, one for MD5 and one for SHA1, just to see which one runs fine. At-last I believe SHA1 works fine. Below is the VB.Net Script. ' Microsoft SQL Server Integration Services Script Component ' Write scripts using Microsoft Visual Basic 2008. ' ScriptMain is the entry point class of the script. Imports System Imports System.Data Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper Imports Microsoft.SqlServer.Dts.Pipeline Imports System.Text Imports System.Security.Cryptography <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _ <CLSCompliant(False)> _ Public Class ScriptMain Inherits UserComponent Private inputBuffer As PipelineBuffer Private MD5 As MD5 Private SHA1 As SHA1 Public Overrides Sub PreExecute() MyBase.PreExecute() SHA1 = SHA1.Create() MD5 = MD5.Create() RNGCryptoServiceProvider.Create() End Sub Public Overrides Sub PostExecute() MyBase.PostExecute() End Sub Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer) inputBuffer = Buffer MyBase.ProcessInput(InputID, Buffer) End Sub Public Shared Function CreateHashMD5(ByVal data As String) As String Dim dataToHashMD5 As Byte() = (New UnicodeEncoding()).GetBytes(data) Dim MD5 As MD5 = New MD5CryptoServiceProvider() Dim hashedDataMD5 As Byte() = MD5.ComputeHash(dataToHashMD5) RNGCryptoServiceProvider.Create().GetBytes(dataToHashMD5) Dim M As String = Convert.ToBase64String(hashedDataMD5, Base64FormattingOptions.None) MD5 = MD5.Create() RNGCryptoServiceProvider.Create() Return M MD5 = MD5.Create() RNGCryptoServiceProvider.Create() End Function Public Shared Function CreateHashSHA1(ByVal data As String) As String Dim dataToHashSHA1 As Byte() = (New UnicodeEncoding()).GetBytes(data) Dim SHA1 As SHA1 = New SHA1CryptoServiceProvider() Dim hashedDataSHA1 As Byte() = SHA1.ComputeHash(dataToHashSHA1) RNGCryptoServiceProvider.Create().GetBytes(dataToHashSHA1) Dim S As String = Convert.ToBase64String(hashedDataSHA1, Base64FormattingOptions.None) SHA1 = SHA1.Create() RNGCryptoServiceProvider.Create() Return S SHA1 = SHA1.Create() RNGCryptoServiceProvider.Create() End Function Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) SHA1 = SHA1.Create() MD5 = MD5.Create() RNGCryptoServiceProvider.Create() Dim counter As Integer = 0 Dim values As New StringBuilder Dim value As Object For counter = 0 To inputBuffer.ColumnCount - 1 value = inputBuffer.Item(counter) values.Append(value) Next Row.DWSHA1 = CreateHashSHA1(values.ToString()) Row.DWMD5 = CreateHashMD5(values.ToString()) SHA1 = SHA1.Create() MD5 = MD5.Create() RNGCryptoServiceProvider.Create() End Sub End Class
September 24th, 2011 11:18pm

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics