Slowly Changing Dimension with 600,000 rows
Hi,
We have been using tasks generated from the SCD wizard. We have smaller dimensions (< 30,000 rows) that work well. Our Product Dimension package is giving us performance problems (taking 7 hours to do 600,000 rows when 80,000 records are updated; the rest are new inserts). It is similar to the smaller dimensions. Several columns are type 1 and are doing update statements; several are type 2, doing updates and inserts. The package had a complicated view as the initial task, but we have since modified it to use a SQL command with a variable, and now the initial read appears quick, but it is chunking in 10,000-record increments and taking the 7 hours (we never let it finish previously). So the package is pretty basic now (reading a source, a small derive and data conversion, a small lookup (cached, 30,000 records) for a description, then the SCD). Before I start replacing what the SCD generates with stored procedures, does anyone have any suggestions as to what might be the issue? We believe we have increased the number of type 2 columns, and the SCD definitely has more to do than just an insert or update, but 7 hours for 600,000 records seems excessive. Interestingly, the source task never turns green. Previously, when we had a Merge Join, it completed the read and bottlenecked at a sort and a Merge Join. Now that has been removed and simplified, and all tasks remain yellow, with the 10,000 (actually 9,990, I think) chunks appearing at the source, then at the SCD, before the next chunk appears to be read. On the general release (not the beta). Thanks in advance!
January 18th, 2006 1:53am
Here are some general comments.
The chunking of 10,000 is quite normal. The default buffer for an OLE DB Source is 10,000 rows or 10 MB divided by the record length, whichever is smaller, so your record length is probably at or below about 1 KB (10 MB / 10,000 rows).
You mention that your lookup is caching 30k records, yet you have 600k input records. I would assume you're using partial caching rather than full caching. If you are using partial caching, then every lookup that fails to match from the cache requires a SQL call to the DB. This could be part of the problem. Here are some recommended approaches.
Use full caching, making sure your lookup only outputs the required fields. Do not use the table/view setting for the lookup if you don't need all the columns, since it will waste memory.
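For example, instead of pointing the lookup at the whole table, give it a query that returns only the join key and the column you need (table and column names here are illustrative, not from your package):

```sql
-- Hypothetical names: return only the key and the one column the
-- lookup needs, so the full cache stays small in memory.
SELECT ProductKey, ProductDescription
FROM dbo.DimProductDescription;
```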
Push the description lookup into the source query. If this is not possible because the source query comes from a different box, consider doing a full extract from the source system into a staging table (duplicating the schema from the source) using a Bulk Insert task. After the bulk insert, do the description lookup in SQL.
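A minimal sketch of that set-based lookup, assuming hypothetical staging and description tables:

```sql
-- After bulk-inserting into staging, resolve the description with
-- one set-based join instead of a per-row lookup in the data flow.
UPDATE s
SET    s.ProductDescription = d.ProductDescription
FROM   staging.Product AS s
JOIN   dbo.DimProductDescription AS d
       ON d.ProductKey = s.ProductKey;
```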
You could try using a Merge Join to replace the lookup. For large recordsets, they tend to be more performant. The key is to make sure both inputs (the source query and your destination lookup query) are sorted in the DB rather than in the flow. It will be much faster to perform the sorts in the DB, especially if they are sorted on a clustered index. If you sort them in the DB, you will need to set the OLE DB Source's IsSorted property to true, and set each sort column's SortKeyPosition to its relative position in the sort order. These properties can be accessed via the Advanced Editor of the task.
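For instance, both inputs might be sorted in the database like this (names are illustrative); with a clustered index on the join key, the ORDER BY is nearly free:

```sql
-- Sort both Merge Join inputs on the join key in the database.
SELECT ProductKey, Col1, Col2
FROM   staging.Product
ORDER BY ProductKey;

SELECT ProductKey, ColA, ColB
FROM   dbo.DimProduct
ORDER BY ProductKey;
```

Then mark each source output with IsSorted = True and give ProductKey a SortKeyPosition of 1.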
Your Derived Column/Data Conversion tasks could also be one area to look at. I recently had a derived column that performed a length operation against a 3000-character field. Before I removed it, the flow processed 1M records in 10 minutes; after, it was processing them in 30 seconds.
You don't have all that many updates (80k, if I read it correctly), but they tend to be performance killers since they are done one at a time. Your updates combined with the partial caching could be initiating page/table locks. You may want to run a SQL trace while the package is running to make sure there are no deadlocks occurring. Non-clustered indexes and foreign keys will also slow down the update/insert statements. Consider developing a process to either disable/drop them before the load and then enable/create them after the load.
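One way to sketch that on SQL Server 2005, using a hypothetical non-clustered index name (don't disable the clustered index, or the table becomes inaccessible):

```sql
-- Disable the nonclustered index before the load...
ALTER INDEX IX_DimProduct_Description ON dbo.DimProduct DISABLE;

-- ...run the load, then rebuild it afterwards.
ALTER INDEX IX_DimProduct_Description ON dbo.DimProduct REBUILD;
```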
Along the lines of updates, it may be that you need to pipe the update flow to an OLE DB Destination that is a temporary table and then perform the update in SQL, so that it is done in bulk rather than row by row.
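A sketch of that batch update, assuming the changed rows landed in a hypothetical work table:

```sql
-- One set-based UPDATE from the work table, instead of the SCD
-- output firing one UPDATE statement per row.
UPDATE d
SET    d.Color     = u.Color,
       d.ListPrice = u.ListPrice
FROM   dbo.DimProduct AS d
JOIN   staging.ProductUpdates AS u
       ON u.ProductKey = d.ProductKey;
```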
If none of these things help, I would try to isolate what part of the package is causing your performance bottleneck. Try replacing parts of the package with very simple and basic replacements. For example, replace your lookup with a derived column setting the description to "MyDesc". That should be very performant. Also pipe your update and insert flows to SQLIS's Trash Destination: it just consumes the rows and does no work. You can download it at the following URL.
http://www.sqlis.com/default.aspx?56
If you find a marked increase in performance after the replacement,
then I would look at what you replaced and see how you can make it more
performant.
I hope I've provided you with some possible avenues of attack and hope that you provide us feedback on your progress.
Larry Pope
January 18th, 2006 4:48am
We've got it under control now. We've moved around a number of things, played with the table indexes, and have most everything in the initial SQL command. As you suggested, letting the database do the work (joining, ordering, etc.) first, rather than in SSIS tasks, seems to help dramatically. We also found that a complicated cross-database view takes substantial time to execute, whereas if we use a SQL command and paste in the exact SQL that creates the view, it is much faster. Thanks Larry.
January 19th, 2006 12:39am
If you have a lot of updates, you can insert the updates into a second table and do a batch update with the Execute SQL Task.
And of course, use lookups instead of the SCD component:
http://microsoft-ssis.blogspot.com/2011/01/slowly-changing-dimension-alternatives.html
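On SQL Server 2008 and later, that batch step can also be a single MERGE; here is a minimal type 1 upsert sketch with hypothetical table and column names:

```sql
-- Hypothetical type 1 upsert: update changed rows and insert new
-- ones in one set-based statement (requires SQL Server 2008+).
MERGE dbo.DimProduct AS d
USING staging.Product AS s
      ON d.ProductKey = s.ProductKey
WHEN MATCHED AND (d.Color <> s.Color OR d.ListPrice <> s.ListPrice)
    THEN UPDATE SET d.Color = s.Color, d.ListPrice = s.ListPrice
WHEN NOT MATCHED BY TARGET
    THEN INSERT (ProductKey, Color, ListPrice)
         VALUES (s.ProductKey, s.Color, s.ListPrice);
```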
January 7th, 2011 2:55am