Previous & Next Row DataFlow Transform
I have issue where based up a value in a column i need to do some processing of the previous and current row.The dataflow is also already sorted. I tried creating a Script Data Flow Transformation to do this but it isn't working right and the debugging of it sucks. Would anyone know of the best way to do this? or some helpful pointers? I tried "firing" information to help debug but doesn't help when the error message i get back is a stack overflow message. An example of what I'm trying to do is process the sorted incoming rows for each person. Each personcan have multiple rows.Basedupon a "status"column in each row dosome different processing onthe previous orcurrent row.Some Psuedo code: if prev.PersonID = current.PersonID if status = 1 change prev.PersonDate to today + 60 days if status = 2 change current.PersonDateto prev.PersonDate change prev.PersonDate to today + 1 day else send rows to output Any comments or suggestions or helpful advice/critique would be MUCH appreciated!
May 1st, 2007 6:52pm
This might help. http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1139922&SiteID=1Though, if you have a variable number of rows per PersonID, I'm not sure how well you can implement this in SSIS. Jay or some of the other script guys around here might have to chime in.
May 1st, 2007 7:00pm
Yeah each person can have 1 or more rows. There isn't a defined set of rows per person. Yeah I looked at the link and that wouldn't work in my situation. Not sure if there would be a good way to do it in straight SQL either. Does anyone have suggestions on how to do it in SQL? If possible? Without cursors? I think i'm going to play around with SQL for little bit and see if i can't come up with something, however executing a OLE DB Command on each record on 1+ million records is going to slow down the process significantly i suspect. I wonder if there is a batch update way of doing this?
May 1st, 2007 7:52pm
I've done this many times. Usually I need to do some type of complex aggregation of multiple rows and only output a single row for a distinct entity. Its a great benefit that your data is already sorted.I wrote the code below mostly from memory, so there may be some syntax problems, but hopefully you get the idea. You cache each row until you've seen the next one. So you're always writing one row behind the current one. You have to override FinishOutputs so you can write out your last row. Code SnippetPublic Class ScriptMain Inherits UserComponent Private Class BufferClass 'define class members for columns Public PersonID As Integer Public Status As Integer Public PersonDate As DateTime End Class Dim PreviousPersonID As Integer = -1 Dim Buffer As BufferClass = Nothing Public Sub WriteBuffer() If Not Buffer Is Nothing Then With NewRecordsBuffer .AddRow() ' add the persisted values from the class to the output buffer .PersonID = Buffer.PersonID .Status = Buffer.Status .PersonDate = Buffer.PersonDate End With Buffer = Nothing End If End Sub Public Overrides Sub FinishOutputs() 'write the previous row WriteBuffer() MyBase.FinishOutputs() End Sub Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer) Dim PreviousPersonDate As DateTime If Row.PersonID = PreviousPersonID Then 'modify previous row If Row.Status = 1 Then Buffer.PersonDate = DateAdd("d", DateTime.Today, 60) ElseIf Row.Status = 2 Then PreviousPersonDate = Buffer.PersonDate Buffer.PersonDate = DateAdd("d", DateTime.Today, 1) End If End If 'write previous row WriteBuffer() 'buffer the current row Buffer = New BufferClass With Buffer ' fill the class with columns that need to be persisted .PersonID = Row.PersonID .Status = Row.Status If Row.Status = 2 Then .PersonDate = PreviousPersonDate Else .PersonDate = Row.PersonDate End If End With PreviousPersonID = Row.PersonID End SubEnd Class
May 1st, 2007 8:54pm
Thanks! I'll give this a shot. This is similar to what I was doing except I was getting a stack overflow problem and not sure why. The only difference was that I was buffering all of each persons rows (each person had 1 or more rows with the most being 14), but only changing data in the current and previous rows. When current row was a different person then it would output the all the buffered rows. Anyways I'll give this a shot and see.
May 1st, 2007 8:59pm
JayH- Does this process normally take a lot of time??? It is extremely slow?
May 1st, 2007 11:11pm
Hmmm I took out the FireInformation method I was using to debug (just one line) and now it is 1000 times faster. JayH - I was wondering if it would be faster to output each row or to output a the person batch of rows?
May 1st, 2007 11:14pm
It works! Thanks JayH! I implemented mine so that it handles batches than just the previous row. As far as performance between each row or a batch I'm not sure, but implementing as a batch is faster than a Merge Join transformation.
May 2nd, 2007 1:06am
thames wrote: It works! Thanks JayH! I implemented mine so that it handles batches than just the previous row. As far as performance between each row or a batch I'm not sure, but implementing as a batch is faster than a Merge Join transformation.I'm glad you got it going. I doubt you'll find any performance difference between lagging only one row and all the rows for the PersonID, especially since you're only expecting a max of four rows per PersonID. I think the code is probably simpler to only do it for one row, but its just a matter of preference.
May 2nd, 2007 3:30am
Just a note to save others less experienced with controlling outputs on script tasks, you need to set the SynchronousInputID to None (not default Input0) in order to control the row (add new ones) the OutputBuffer0 (which is what the code above is referring to with NewRecordsBuffer, however Output0 id the default name given to an output) Maybe it will save you the hour of pecking it cost me! Cory M. Cox
October 7th, 2011 5:46am