SSIS Script Component works on low volume of data but not on high
Hi,
I have a SSIS 2008r2 Script Component generating a MD5 Hash value over two bigint data columns.
When I push 2-10 rows of data though the component it produce the same hash value time after time, for the same data column values as below:
5543
46113 VWj3yyvbyIegGhaTr+yq/Q==
5543
46114 EH/6YNiMKgFb1tvWIlxhbA==
5543
46117 DmmaOUEj9W80vOlmEIHv8g==
5543
46116 jm9UVJnllx8vBjzTC8gaDg==
5543
46115 fULsdonw0LSRP/YE570BAA==
When I push production volumes thought the pipeline some 300,000 rows.
The same data columns generate a different MD5 Hash value every time the data is processed.
I have tested this by running the package and by just executing the data flow task.
Have I missed some think?
The script task is in VB Script
Thanks
July 25th, 2011 9:38am
Can you share the code to see whats wrong?Please mark the post as answered if it answers your question | My SSIS Blog:
http://microsoft-ssis.blogspot.com
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 10:15am
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography
<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent
Private inputBuffer As PipelineBuffer
Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)
inputBuffer = Buffer
MyBase.ProcessInput(InputID, Buffer)
End Sub
Public Shared Function CreateHash(ByVal data As String) As String
Dim dataToHash As Byte() = (New UnicodeEncoding()).GetBytes(data)
Dim md5 As MD5 = New MD5CryptoServiceProvider()
Dim hashedData As Byte() = md5.ComputeHash(dataToHash)
RNGCryptoServiceProvider.Create().GetBytes(dataToHash)
Dim s As String = Convert.ToBase64String(hashedData, Base64FormattingOptions.None)
Return s
End Function
Public Overrides Sub PreExecute()
MyBase.PreExecute()
End Sub
Public Overrides Sub PostExecute()
MyBase.PostExecute()
End Sub
Public Overrides Sub InputSource_ProcessInputRow(ByVal Row As InputSourceBuffer)
Dim counter As Integer = 0
Dim values As New StringBuilder
Dim Result As String
For counter = 0 To inputBuffer.ColumnCount - 1
Dim value As Object
value = inputBuffer.Item(counter)
values.Append(value)
Next
'Result = CreateHash(values.ToString())
Row.BKHashValue = CreateHash(values.ToString())
'MsgBox(Result)
End Sub
End Class
While working on this I have concluded the issue starts once the pipeline buffer becomes full and the pipeline batches the records through the component.
Thank you for any input
July 25th, 2011 10:28am
The pipeline ALWAYS batches records.
Isn't there a seed value or something that you have to use in order to hash different contents consistently to the same value? If that's the case, I don't see you doing that here, and the default may be not to (?) which may be causing the problem...
?
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 11:21am
Thanks Todd,
There would not appear to be a seed value required for 'ComputeHash', just the data to be hashed.
It appears to be data after the first batching that starts to receive differing hashed values.
July 25th, 2011 11:31am
IIRC, the MD5 functionality in .NET uses unmanaged resources. I've never completely understood how the script components in SSIS do garbage collection, but you could try calling:
md5.Clear()
just prior to your function return.
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 12:35pm
Thanks. on 10,000 rows run twice I get 111 new hash key values.
I think you are on the right lines as it works upto the 10,000 rows fine.
July 25th, 2011 12:52pm
Assumingthat's what's going on, you should probably then create your "md5" variable just once in the PreExecute method, as well as calling your RNGCryptoServiceProvider.Create() method just once and retaining a reference to the returned object (also in
the PreExecute method).
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 25th, 2011 7:57pm
Assumingthat's what's going on, you should probably then create your "md5" variable just once in the PreExecute method, as well as calling your RNGCryptoServiceProvider.Create() method just once and retaining a reference to the returned object (also
in the PreExecute method).
Talk to me now on
That's what I'd do...
The way you're doing this now, you're creating an MD5 object for each row and relying on Garbage Collection to take over when md5 goes out of scope; many thousands of times a second...
I'm not sure this is happening so efficiently (or at all) in this case.
July 25th, 2011 8:08pm
FWIW, I've just put hashing in a package, specifically to match a .Net hash to T-SQL hashing. It wasn't easy - mostly to do with having to convert the byte result into a comparable string... (a great resource was
Mladen Prajdic's blog.)
I'll be uploading that package soon... I'll try to remember to post a link here.
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 26th, 2011 11:37am
I'm glad to see this post, as I'm having the exact same issue. For < 10000 rows, it works fine. Unfortunately, I'm trying to use this to identify modified records in a slowly changing dimension with over 300K rows (at 60 columns). I've also tried SHA1
to no avail.
If I run it 2 times in a row, over 200K have a different hash value.
FYI, the SCD wizard (rbar) isn't even an option. Way too slow.
July 27th, 2011 9:41pm
If you're doing SCD work, you may want to try my
Dimension Merge component on CodePlex.
The sample code I was talking about is in
this folder on my SkyDrive, called "b11df12d-b972-41de-96b3-4c1448923c6e High-Performance Upserts.zip"
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 28th, 2011 11:09am
Thanks Todd for the suggestion. I'll have to check that task out. I like the row change reason and the retiring features.
I configured the package exactly as described in the article,
Speeding Up Incremental ETL Processes in SSIS by using MD5 hashes, but the md5 values weren't consistent between loads. Similar to ExpendableGrunt, the hash's would change if processing > 10000 records.
In the end, I wound up replacing the script component with the SSIS Multiple Hash tool, which accurately creates the MD5 hash that I need. I had to change the HashValue column to binary(16) and throw a DT_WSTR, 35 cast in on the conditional split compare
expression. Also, rather than the (rbar) OleDb command, I'm sending the modified records to a staging table and then doing a set based update on the dimension in the next step of the control flow.
We're still in proof of concept and our scope is limited to type 1 changes, but I'm sure the merge task will come in handy soon.
July 28th, 2011 12:35pm
If you're interested, the code for the Multiple Hash transform (which I also use) is on CodePlex. The code in a custom component maps pretty closely (ProcessInput, PreExecute, etc) to a Script component, so it might show you the "right" way
to use the .Net hashing code...
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
July 28th, 2011 1:49pm
Hi,
I followed exact same exercise, and I am facing the exact same problem, where the MD5 is returning different values for the same row. I was going through your post thread, and I dont see the final answer, am I missing anything? Did Todd, and other gave you
the final verdict yet? Did you manage to use the same code from http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/
or did you made any changes to that? I am doing the same prototyping too.
Could you please share your experience with this?
Thanks in Advance for your answer.
MA
September 19th, 2011 5:20pm
I swear I had a test SSIS package where I worked this out - because I did want to confirm that I could get reliable hashing between SQL and SSIS. And I swear I did. I'm actually mad now that I can't find it, cause it took some work to get it
right, and I'm going to use it in a few months.
Sorry I can't help more at the moment - but I can't find that package...
Talk to me now on
Free Windows Admin Tool Kit Click here and download it now
September 24th, 2011 8:48pm
LOL, Thanks so much Todd, for you help. I really appreciate it. I finally got it right for the last time (I think..!)
With tones and tones of research and spending few nights looking through it. I made few changes to the code in the article.
http://www.sqlservercentral.com/articles/Integration+Services+%28SSIS%29/69766/
I computed SHA1 algorithm instead of MD5. secondly, I pass in the actual column values with actual data type, before I tried to convert it to String and it dint work. I can send you my script component task to you, and please let me know what you think about
it. In the script, I might have declared MD5.Clear() function many times in many places, only to make sure I am clearing that buffer refresh problem then I also have 2 function, one for MD5 and one for SHA1, just to see which one runs fine. At-last I
believe SHA1 works fine.
Below is the VB.Net Script.
' Microsoft SQL Server Integration Services Script Component
' Write scripts using Microsoft Visual Basic 2008.
' ScriptMain is the entry point class of the script.
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports Microsoft.SqlServer.Dts.Pipeline
Imports System.Text
Imports System.Security.Cryptography
<Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute> _
<CLSCompliant(False)> _
Public Class ScriptMain
Inherits UserComponent
Private inputBuffer As PipelineBuffer
Private MD5 As MD5
Private SHA1 As SHA1
Public Overrides Sub PreExecute()
MyBase.PreExecute()
SHA1 = SHA1.Create()
MD5 = MD5.Create()
RNGCryptoServiceProvider.Create()
End Sub
Public Overrides Sub PostExecute()
MyBase.PostExecute()
End Sub
Public Overrides Sub ProcessInput(ByVal InputID As Integer, ByVal Buffer As Microsoft.SqlServer.Dts.Pipeline.PipelineBuffer)
inputBuffer = Buffer
MyBase.ProcessInput(InputID, Buffer)
End Sub
Public Shared Function CreateHashMD5(ByVal data As String) As String
Dim dataToHashMD5 As Byte() = (New UnicodeEncoding()).GetBytes(data)
Dim MD5 As MD5 = New MD5CryptoServiceProvider()
Dim hashedDataMD5 As Byte() = MD5.ComputeHash(dataToHashMD5)
RNGCryptoServiceProvider.Create().GetBytes(dataToHashMD5)
Dim M As String = Convert.ToBase64String(hashedDataMD5, Base64FormattingOptions.None)
MD5 = MD5.Create()
RNGCryptoServiceProvider.Create()
Return M
MD5 = MD5.Create()
RNGCryptoServiceProvider.Create()
End Function
Public Shared Function CreateHashSHA1(ByVal data As String) As String
Dim dataToHashSHA1 As Byte() = (New UnicodeEncoding()).GetBytes(data)
Dim SHA1 As SHA1 = New SHA1CryptoServiceProvider()
Dim hashedDataSHA1 As Byte() = SHA1.ComputeHash(dataToHashSHA1)
RNGCryptoServiceProvider.Create().GetBytes(dataToHashSHA1)
Dim S As String = Convert.ToBase64String(hashedDataSHA1, Base64FormattingOptions.None)
SHA1 = SHA1.Create()
RNGCryptoServiceProvider.Create()
Return S
SHA1 = SHA1.Create()
RNGCryptoServiceProvider.Create()
End Function
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
SHA1 = SHA1.Create()
MD5 = MD5.Create()
RNGCryptoServiceProvider.Create()
Dim counter As Integer = 0
Dim values As New StringBuilder
Dim value As Object
For counter = 0 To inputBuffer.ColumnCount - 1
value = inputBuffer.Item(counter)
values.Append(value)
Next
Row.DWSHA1 = CreateHashSHA1(values.ToString())
Row.DWMD5 = CreateHashMD5(values.ToString())
SHA1 = SHA1.Create()
MD5 = MD5.Create()
RNGCryptoServiceProvider.Create()
End Sub
End Class
September 24th, 2011 11:18pm