Slowly Changing Dimension with 600,000 rows
Hi,
We have been using tasks generated from the SCD wizard. We have smaller dimensions (< 30,000 rows) that work well. Our Product Dimension package is giving us performance problems (taking 7 hours to do 600,000 rows when 80,000 records are updated; the rest are new inserts). It is similar to the smaller dimensions. Several columns are type 1 and are doing update statements; several are type 2, doing updates and inserts. The package had a complicated view as the initial task, but we have since modified it to use a SQL command with a variable, and now the initial read appears quick, but it is chunking in 10,000-record increments and taking the 7 hours (we never let it finish previously). So the package is pretty basic now (reading a source, a small derive and data conversion, a small lookup (cached, 30,000 records) for a description, then the SCD). Before I start replacing what the SCD generates with stored procedures, does anyone have any suggestions as to what might be the issue? We believe we have increased the number of type 2 columns, and the SCD definitely has more to do than just an insert or update, but 7 hours for 600,000 records seems excessive. Interestingly, the source task never turns green. Previously, when we had a Merge Join, it completed the read and bottlenecked at a sort and a Merge Join. Now that has been removed and simplified, and all tasks remain yellow, with the 10,000 (actually 9,990, I think) chunks appearing at the source, then at the SCD, before the next chunk appears to be read. On the general release (not the beta). Thanks in advance!
January 18th, 2006 1:53am
Here are some general comments.
The chunking of 10,000 is quite normal. The default buffer for an OLE DB Source is 10,000 rows or 10 MB divided by the record length, whichever is smaller, so your record length is probably at or below about 1 KB (10 MB / 10,000 rows).
You mention that your lookup is caching 30k records, yet you have 600k input records. I would assume you're using partial caching rather than full caching. If you are using partial caching, then every lookup that fails to match from the cache requires a SQL call to the DB. This could be part of the problem. Here are some recommended approaches.
Use full caching, making sure your lookup only outputs the required fields. Do not use the table/view setting for the lookup if you don't need all the columns, since it will waste memory.
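For example, instead of pointing the lookup at the whole table, give it a query that returns only the join key and the column you need (table and column names here are illustrative, not from your package):

```sql
-- Hypothetical names: return only the key and the one column the
-- lookup needs, so the full cache stays small in memory.
SELECT ProductKey, ProductDescription
FROM dbo.DimProductDescription;
```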
Push the description lookup into the source query. If this is not possible because the source query comes from a different box, consider doing a full extract from the source system into a staging table (duplicating the schema from the source) using a Bulk Insert task. After the bulk insert, do the description lookup in SQL.
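A minimal sketch of that set-based lookup, assuming hypothetical staging and description tables:

```sql
-- After bulk-inserting into staging, resolve the description with
-- one set-based join instead of a per-row lookup in the data flow.
UPDATE s
SET    s.ProductDescription = d.ProductDescription
FROM   staging.Product AS s
JOIN   dbo.DimProductDescription AS d
       ON d.ProductKey = s.ProductKey;
```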
You could try using a Merge Join to replace the lookup. For large recordsets, they tend to be more performant. The key is to make sure both inputs (the source query and your destination lookup query) are sorted in the DB rather than in the flow. It will be much faster to perform the sorts in the DB, especially if they are sorted on a clustered index. If you sort them in the DB, you will need to set the OLE DB Source's IsSorted property to true, and set each sort column's SortKeyPosition to its relative position in the sort order. These properties can be accessed via the Advanced Editor of the task.
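For instance, both inputs might be sorted in the database like this (names are illustrative); with a clustered index on the join key, the ORDER BY is nearly free:

```sql
-- Sort both Merge Join inputs on the join key in the database.
SELECT ProductKey, Col1, Col2
FROM   staging.Product
ORDER BY ProductKey;

SELECT ProductKey, ColA, ColB
FROM   dbo.DimProduct
ORDER BY ProductKey;
```

Then mark each source output with IsSorted = True and give ProductKey a SortKeyPosition of 1.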
Your Derived Column/Data Conversion tasks could also be one area to look at. I recently had a derived column that performed a length operation against a 3000-character field. Before I removed it, the flow processed 1M records in 10 minutes; after, it was processing them in 30 seconds.
You don't have all that many updates (80k, if I read it correctly), but they tend to be performance killers since they are done one at a time. Your updates combined with the partial caching could be initiating page/table locks. You may want to run a SQL trace while the package is running to make sure there are no deadlocks occurring. Non-clustered indexes and foreign keys will also slow down the update/insert statements. Consider developing a process to either disable/drop them before the load and then enable/create them after the load.
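One way to sketch that on SQL Server 2005, using a hypothetical non-clustered index name (don't disable the clustered index, or the table becomes inaccessible):

```sql
-- Disable the nonclustered index before the load...
ALTER INDEX IX_DimProduct_Description ON dbo.DimProduct DISABLE;

-- ...run the load, then rebuild it afterwards.
ALTER INDEX IX_DimProduct_Description ON dbo.DimProduct REBUILD;
```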
Along the lines of updates, it may be that you need to pipe the update flow to an OLE DB Destination that is a temporary table and then perform the update in SQL, so that it is done in bulk rather than row by row.
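A sketch of that batch update, assuming the changed rows landed in a hypothetical work table:

```sql
-- One set-based UPDATE from the work table, instead of the SCD
-- output firing one UPDATE statement per row.
UPDATE d
SET    d.Color     = u.Color,
       d.ListPrice = u.ListPrice
FROM   dbo.DimProduct AS d
JOIN   staging.ProductUpdates AS u
       ON u.ProductKey = d.ProductKey;
```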
If none of these things help, I would try to isolate what part of the package is causing your performance bottleneck. Try replacing parts of the package with very simple and basic replacements. For example, replace your lookup with a derived column setting the description to "MyDesc". That should be very performant. Also pipe your update and insert flows to SQLIS's Trash Destination: it just consumes the rows and does no work. You can download it at the following URL.
http://www.sqlis.com/default.aspx?56
If you find a marked increase in performance after the replacement,
then I would look at what you replaced and see how you can make it more
performant.
I hope I've provided you with some possible avenues of attack and hope that you provide us feedback on your progress.
Larry Pope
January 18th, 2006 4:48am
We've got it under control now. We've moved around a number of things, played with the table indexes, and have most everything in the initial SQL command. As you suggested, letting the database do the work (joining, ordering, etc.) first, rather than in SSIS tasks, seems to help dramatically. We also found that a complicated cross-database view takes substantial time to execute, whereas if we use a SQL command and paste in the exact SQL that creates the view, it is much faster. Thanks Larry.
January 19th, 2006 12:39am
If you have a lot of updates, you can insert the updates into a second table and do a batch update with the Execute SQL Task.
And of course, use lookups instead of the SCD component:
http://microsoft-ssis.blogspot.com/2011/01/slowly-changing-dimension-alternatives.html
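On SQL Server 2008 and later, that batch step can also be a single MERGE; here is a minimal type 1 upsert sketch with hypothetical table and column names:

```sql
-- Hypothetical type 1 upsert: update changed rows and insert new
-- ones in one set-based statement (requires SQL Server 2008+).
MERGE dbo.DimProduct AS d
USING staging.Product AS s
      ON d.ProductKey = s.ProductKey
WHEN MATCHED AND (d.Color <> s.Color OR d.ListPrice <> s.ListPrice)
    THEN UPDATE SET d.Color = s.Color, d.ListPrice = s.ListPrice
WHEN NOT MATCHED BY TARGET
    THEN INSERT (ProductKey, Color, ListPrice)
         VALUES (s.ProductKey, s.Color, s.ListPrice);
```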
January 7th, 2011 2:55am