Loading a Huge Flat File Faster in SSIS

Hi,

I have a 36 GB flat file that we load hourly (the file size varies depending on the hour). The ETL is scheduled to run every hour.

The approach that I have is

1. The first Data Flow task reads from the file, does quite a few transformations (lookups, derived columns, etc.) and then loads a staging table. To make the load faster, I have a conditional split that divides the output into 4 streams, which then load the same staging table through 4 different destinations (parallel loading, as has been suggested by quite a few online blogs). I also have Fast Parse set on the columns that allow it.

2. The 2nd data flow then takes the data from the staging table populated above and, for each combination of 6 columns, keeps the first row only. This is done with a ROW_NUMBER() and then taking only the ROW_NUMBER() = 1 rows. As in the previous flow, I have a conditional split that divides the output into 4 streams, all loading the same main table (parallel loading).
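
For reference, a minimal sketch of the keep-first-row pattern just described, assuming a hypothetical staging table dbo.Staging, six illustrative key columns and an ordering column LoadTime (the real names are not given here):

    -- All table/column names are placeholders; adjust to the real schema.
    ;WITH numbered AS
    (
        SELECT *,
               ROW_NUMBER() OVER
               (
                   PARTITION BY Col1, Col2, Col3, Col4, Col5, Col6
                   ORDER BY LoadTime          -- whatever defines "first"
               ) AS rn
        FROM dbo.Staging
    )
    SELECT *
    FROM numbered
    WHERE rn = 1;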


I want to make this process as fast as possible. A few things have crossed my mind:

1. Splitting the file up: but this would mean another DFL, which would again be time consuming.

2. Making the staging table a heap: but that would make the query in the 2nd flow extremely slow.

3. I could build indexes on the staging table after point 1, but that index creation also takes time, as the staging table holds around 70 million rows.

4. I could have used a script component as a source to filter down to the first row per combination, but the architects have ruled that out because we have seen memory leaks with SSIS script components. So nothing can be done on this point.

After doing all the possible tuning (including changing the query in point 2 to use joins instead of ROW_NUMBER), the load time varies from 15-20 minutes on my local machine (from BIDS), and is sometimes even slower, as BIDS refuses to release memory immediately even after a successful package execution.
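
For context, the join-based alternative to ROW_NUMBER mentioned above commonly looks roughly like this (a sketch only, reusing the same hypothetical names as before):

    -- Keep the earliest row per 6-column combination via a MIN() join
    -- instead of ROW_NUMBER(). All names are hypothetical.
    SELECT s.*
    FROM dbo.Staging AS s
    JOIN
    (
        SELECT Col1, Col2, Col3, Col4, Col5, Col6,
               MIN(LoadTime) AS FirstTime
        FROM dbo.Staging
        GROUP BY Col1, Col2, Col3, Col4, Col5, Col6
    ) AS f
      ON  f.Col1 = s.Col1 AND f.Col2 = s.Col2 AND f.Col3 = s.Col3
      AND f.Col4 = s.Col4 AND f.Col5 = s.Col5 AND f.Col6 = s.Col6
      AND f.FirstTime = s.LoadTime;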

I want to test this package locally before I move it, and I want it to finish in 8-10 minutes, or even faster if possible.

Are there other ways to speed it up, or maybe some tweaks to my existing setup that you could suggest?

By the way, I am using BIDS 2012 and SQL Server 2014.

Thanks,

Saugat

June 18th, 2015 3:30pm

Hi,

36 GB is a huge file. A few questions:

Does your job complete within the hour? What will you do once the package moves to the Prod environment and the job takes more than an hour? Do you have an archival system for the file, or does the same file get replaced every time and read without being archived? If it is the latter, I would consider a few changes in the design.

I have worked with multi-GB files and have seen considerable performance improvement from the pattern: drop the indexes -> insert the data -> recreate the indexes, before the staging table is used in the second Data Flow. If you are not using the staging table for any kind of lookup within the data flow, I would drop its indexes before inserting into staging; an INSERT gets no benefit from indexes, it only has to maintain them. A sketch is below.
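
A minimal T-SQL sketch of that drop/load/recreate pattern (the table and index names are illustrative, not from this thread):

    -- Hypothetical staging table and index names.
    DROP INDEX IX_Staging_Keys ON dbo.Staging;

    -- ... the SSIS data flow loads dbo.Staging here ...

    CREATE NONCLUSTERED INDEX IX_Staging_Keys
        ON dbo.Staging (Col1, Col2, Col3, Col4, Col5, Col6);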

Regards

June 18th, 2015 4:40pm

Hi Saugat,

If the next job starts irrespective of whether the previous one has completed, this processing design is risky.

It is best to process one file and move to the next.

With SSIS itself it is hard to do distributed (a.k.a. parallel) processing. What is possible is to pre-split the file and load it with two or more packages run in parallel, making sure the receiving end is not blocked and the host machine has enough resources.

So having multiple dedicated staging tables is best.

June 18th, 2015 4:45pm

My suggestion is to load the file in full, without any changes (not even data type changes), directly into a stage table; you can use BULK INSERT for that to get the best performance. The rest can then be done from the stage table. On the stage table you can then put indexes, such as a clustered columnstore index (if you have SQL Server Enterprise edition), to speed up the rest of the process. You can even put timestamp information in the stage table name to record which date and hour the table contains. A sketch follows.
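
A minimal sketch of that approach; the file path, field/row terminators and the hour-stamped table name are all assumptions for illustration:

    -- Raw, untyped load straight from disk; names and path are hypothetical.
    BULK INSERT dbo.Stage_2015061814
    FROM 'D:\feeds\ticks_2015061814.txt'
    WITH
    (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK          -- enables a minimally logged bulk load
    );

    -- Compress for the downstream GROUP BY work (Enterprise edition only).
    CREATE CLUSTERED COLUMNSTORE INDEX cci_Stage
        ON dbo.Stage_2015061814;
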
June 18th, 2015 6:46pm

Hi Saugat,

A couple of questions/ideas:

1. Have you tested how much time it takes just to read the input file? Set up a test data flow that reads the file and ends in a Union All, to determine your raw read speed. Report back the result.

2. Have you measured where the processing time is spent? Report back, in detail, how much time each part takes individually.

3. How many rows do you have in your destination table?

4. I don't exactly understand why you need two separate data flows. From your description it looks like your first data flow could load directly into your destination table.

5. Let's say you do want a staging table. I see no reason to use an SSIS data flow to transfer data from the staging table to your destination table; that approach is not very efficient. The better approach is to use T-SQL to transfer the data from the staging table into your final destination table, roughly as sketched below.
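
A minimal sketch of such a T-SQL transfer (table and column names are placeholders):

    -- Set-based transfer: the data never leaves the SQL Server instance.
    INSERT INTO dbo.Destination WITH (TABLOCK)  -- TABLOCK helps minimal logging
           (Col1, Col2, Col3)
    SELECT Col1, Col2, Col3
    FROM dbo.Staging;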

June 21st, 2015 10:37pm

The following are some tips which may help

  1. Avoid transformations in the SSIS package; massage the data within the staging DB instead
  2. Use the fast load option on the destination
  3. Tune the MaxConcurrentExecutables and EngineThreads properties to fit your hardware
  4. Adjust the buffer size (DefaultBufferSize / DefaultBufferMaxRows) to suit your hardware
  5. Do not have indexes on your staging table during the load
  6. Split the data load when you push data from staging to the destination
  7. Optimize the SQL queries (avoid user-defined functions, double dipping, and complex subqueries)

All the best!

June 22nd, 2015 8:09am

Thanks all for the ideas and suggestions. I will outline the process a bit more. The data I am talking about are price ticks of stocks, which arrive every time a price changes by even the slightest amount. We get hourly files for a day, each ranging from 20 GB to 40 GB. The final destination table is partitioned by date.

So the flow is: for each hourly file, I load the file into a temp table (with transformations along the way, but none blocking). The 2nd data flow reads from this temp table, removes duplicates using GROUP BY (in SQL), and loads a Staging table. The Staging table holds data for all the hours of the day, and when all the hours have been read, we do a partition switch into the final destination table.

Graphically: Truncate temp table -> File -> Transformations (synchronous) -> Load temp table (clustered columnstore, to speed up the next GROUP BY query) -> SQL with GROUP BY loads the Staging table -> All hours of the day loaded? If yes, switch the Staging table into the main table. The last two steps are sketched below.
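
For illustration, a minimal sketch of the GROUP BY deduplication and the partition switch; every object name and the partition function pfDate are assumptions, not from this thread:

    -- 1) Deduplicate: one row per 6-column combination (hypothetical names).
    INSERT INTO dbo.Staging WITH (TABLOCK)
           (Col1, Col2, Col3, Col4, Col5, Col6, TickTime)
    SELECT Col1, Col2, Col3, Col4, Col5, Col6, MIN(TickTime)
    FROM dbo.TempTicks
    GROUP BY Col1, Col2, Col3, Col4, Col5, Col6;

    -- 2) When all hours are loaded, switch the day into the main table.
    --    Staging must match the destination's schema and carry a check
    --    constraint matching the target partition's boundaries.
    ALTER TABLE dbo.Staging
        SWITCH TO dbo.MainTicks PARTITION $PARTITION.pfDate('20150618');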

Here is all that I have done so far:

1. Using Fast Parse for columns in the flat file wherever I could.

2. Using parallel fast load in SSIS by splitting the output into 4 different streams that load the temp table.

3. Adjusting the column size of fields coming from the source file so that my row width is the lowest allowable.

4. Making the temp table a clustered columnstore (CCI) table. This was done to speed up the 2nd data flow, which does a GROUP BY to remove duplicates (I tried ROW_NUMBER, but GROUP BY is faster on a CCI table).

5. Adjusting the max buffer size and rows to fit. I now have roughly 140,000 rows per buffer and have set the Maximum Insert Commit Size to 140,000. I know one should set it to 0 for a CCI table to get row groups as full as possible (meaning fewer row groups), but in my case that slowed down the destination commit and I also ran into tempdb filling up. Now, for around 50.5 million rows in my temp table, I have around 51 row groups, most of which are compressed, barring a few that are still delta stores. This reduced my temp DFL load time from 11 minutes to around 6 minutes.

6. I have a SQL step that checks whether there are any uncompressed row groups and, if there are, rebuilds the index on the temp table (see the sketch after this list). Since the temp table holds at most an hour's data (50.5 million rows in this case), that is OK. The rebuild takes around 1.5 minutes.

7. In the DFL from temp to Staging, I use a GROUP BY with MAX on the temp table, which is pretty fast from what I have observed. In this DFL I also adjusted the buffer size and rows, and set the Maximum Insert Commit Size to around 140,000. Here too I split the output into 4 streams with 4 OLE DB destinations, all pointing at the staging table without a table lock (to make the load faster).
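
Touching on point 6 above, a minimal sketch of the uncompressed-row-group check; the table and index names are hypothetical, while sys.column_store_row_groups is the catalog view SQL Server 2014 provides for this:

    -- Rebuild only when some row groups are still uncompressed delta stores.
    IF EXISTS
    (
        SELECT 1
        FROM sys.column_store_row_groups
        WHERE object_id = OBJECT_ID('dbo.TempTicks')  -- hypothetical name
          AND state_description <> 'COMPRESSED'
    )
        ALTER INDEX cci_TempTicks ON dbo.TempTicks REBUILD;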

Currently the file, which has around 80 million records (reduced by the transformations to 50.5 million in the temp flow, and further to 30.5 million in the staging flow), takes around 14 minutes on my local machine.

Any more ideas to reduce the load time further would be highly appreciated. FYI, I am also looking at the option of making the actual and Staging tables CCI tables. Also, I can't use a script component to filter out rows, as the architects have ruled that out owing to the memory leaks we have seen in SSIS script components.

June 27th, 2015 5:41am

Saugat,

Why do you use a DFL to load the data from the temp table into the staging table? Why don't you use T-SQL to transfer the data? That would be much faster.

June 27th, 2015 8:09am

Hi,

Thanks. I will try that. The reason I tried a DFL is that I thought splitting the output into 4 streams with a conditional split, and then using 4 OLE DB destinations all targeting the same staging table without a table lock, would be faster, since

insert into table select .... from temp

is not a parallel operation.

But thanks for the pointer; I will try that out as well. Maybe it would still be faster than the SSIS DFL.

June 27th, 2015 10:29am

How do you know it is not a parallel operation? Have you analyzed the execution plan? I suspect it will be highly parallel if the engine determines it can do it.

The T-SQL approach will always be faster because the data stays inside the SQL Server instance. With the SSIS approach the data has to leave the server to be processed by SSIS and then be pushed back, which is not very efficient.

June 27th, 2015 10:36am

Hi,

Yes, I did check the execution plan. Since I am working on SQL Server 2014, one of the new features is that

select ...... into NewNonExistingTable from temp

is parallel, whereas

insert into staging .... select .... from temp

is not. It's the INSERT that is not parallel (I don't see the yellow parallel-execution arrows in the plan).

Since I load into the staging table every hour, it can't be a non-existing table, and thus I can't use

select ...... into NewNonExistingTable from temp

Just to get parallel inserts, I was splitting the rows and spreading them across 4 connections to the same destination table.

But I am still trying the advice you gave; maybe it would still be faster than SSIS.

June 27th, 2015 11:23am

So I did try the SQL INSERT and, as suggested by the plan (not parallel), it turned out to be slower. The step that transfers data from temp to staging using GROUP BY took 8 minutes 48 seconds in SQL, whereas the same step in SSIS, splitting the output into 4 streams and loading them in parallel, took 5 minutes 31 seconds.
June 27th, 2015 11:58am

Saugat,

You have to find a way to make the transfer execute in parallel in T-SQL.

June 27th, 2015 12:07pm
