Hi,
I have a 36 GB file (the size can vary from hour to hour), and the ETL that loads its data is scheduled to run every hour.
The approach I currently have is:
1. The first Data Flow task reads from the file, does quite a few transformations (lookups, derived columns, etc.) and then loads a staging table. To make the load faster, I have a Conditional Split that splits the output into 4 streams, each loading the same staging table through its own destination (parallel loading, as suggested by quite a few online blogs). I also have fast parse enabled on the columns that allow it.
2. The 2nd Data Flow takes the data from the staging table populated above and, for each combination of 6 columns, keeps the first row only. This is done with ROW_NUMBER(), keeping only the rows where the row number equals 1 (rough sketch below). As in the previous flow, a Conditional Split splits the output into 4 streams that load the same main table in parallel.
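For reference, the dedup step in the 2nd flow is logically equivalent to the query below (col1..col6 and dbo.Staging are placeholder names, not my real ones, and the ORDER BY inside the window is an arbitrary tiebreaker since I only need some first row per combination):

    -- rough shape of the dedup query in the 2nd data flow
    SELECT x.*
    FROM (
        SELECT s.*,
               ROW_NUMBER() OVER (
                   PARTITION BY col1, col2, col3, col4, col5, col6
                   ORDER BY col1          -- arbitrary tiebreaker
               ) AS rn
        FROM dbo.Staging AS s
    ) AS x
    WHERE x.rn = 1;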
I want to make this process as fast as possible. A few things that have crossed my mind:
1. Splitting the file up: but this would mean another data flow task, which would itself be time-consuming.
2. Making the staging table a heap, but that would make the query in the 2nd flow extremely slow.
3. I could build indexes on the staging table after step 1, but that index creation also takes time, since staging holds around 70 million rows (see the sketch after this list).
4. I could have used a script component as a source to do the filtering (taking the first row per combination), but we have seen memory leaks with script components and the architects have ruled this out, so I can't do anything on this point.
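For option 3, what I have in mind is simply building the index the dedup query wants right after the staging load finishes, something like this (placeholder names again):

    -- build the dedup-supporting index only after the staging load completes
    CREATE NONCLUSTERED INDEX IX_Staging_DedupKey
        ON dbo.Staging (col1, col2, col3, col4, col5, col6)
        WITH (SORT_IN_TEMPDB = ON,   -- do the sort work in tempdb
              MAXDOP = 0);           -- 0 = use the server's MAXDOP setting

But as I said, on ~70 million rows the build itself eats into the hourly window.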
After doing all the tuning I could (including rewriting the query in step 2 to use joins instead of ROW_NUMBER; sketch below), the load takes 15-20 minutes on my local machine (from BIDS), and sometimes longer, since BIDS refuses to release memory immediately even after the package runs successfully.
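The join rewrite I tried is shaped roughly like this (it assumes an identity column Id on the staging table as a stand-in for whatever unique key is available; col1..col6 are placeholders):

    -- first row per 6-column combination via a self-join instead of ROW_NUMBER
    SELECT s.*
    FROM dbo.Staging AS s
    JOIN (
        SELECT col1, col2, col3, col4, col5, col6, MIN(Id) AS MinId
        FROM dbo.Staging
        GROUP BY col1, col2, col3, col4, col5, col6
    ) AS f
        ON f.MinId = s.Id;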
I want to test this package locally before I move it on, and I want it to finish in 8-10 minutes, or even faster if possible.
Are there other ways to speed it up, or maybe some tweaks to my existing setup that you guys could suggest?
By the way, I am using BIDS 2012 and SQL Server 2014.
Thanks,
Saugat