Pain from a huge and growing amount of data

Hi All:

I am really having a big-data "headache" now, as our data is growing exponentially with the expanding scope of our business. Our business is strongly analytics-focused, so we have been using SQL Server for years for data hosting, BI, and more. As a result we have a lot of code deployed (UDFs, stored procedures) as well as applications such as SSIS packages, SSAS cubes/analytical models, and SSRS dashboards.

However, we are now taking in 200 million rows every day, and managing the data has become extremely painful. I must say I hate the idea of migrating to a Hadoop environment, as migrating our analytical work could turn into a disaster. We also cannot move data to the cloud, due to confidentiality agreements with our clients.

A colleague is trying a distributed MySQL solution, but I am wondering whether we can keep our SQL Server environment in this situation. Are there advanced techniques, or a better architecture, that would let us keep our .NET solutions?

Please help; any comments will be much appreciated.

Cheers

July 21st, 2015 9:39am

You gave no information on the database size, the database server version/edition, or how your current database is architected (files/filegroups).

You said "headache in managing": is it just manageability, or performance too?

If it is just manageability (backups, disk space), you can use the concept of files and filegroups: mark older filegroups read-only, back up only the read/write data every day, and perhaps add partitioning (if your application can work with it) and data compression.
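
For example, a minimal sketch of that setup, using hypothetical names (SalesDB, FG2014, dbo.FactSales):

    -- Mark a filegroup that holds older, static data read-only
    ALTER DATABASE SalesDB MODIFY FILEGROUP FG2014 READ_ONLY;

    -- Then back up only the read/write filegroups each day
    BACKUP DATABASE SalesDB READ_WRITE_FILEGROUPS
        TO DISK = N'D:\Backup\SalesDB_rw.bak';

    -- Compress a large table (an Enterprise Edition feature)
    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);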

Performance is a different matter; some of the above will help, along with the right indexes, enough memory, and so on.

July 21st, 2015 9:46am

Thanks a lot, Stan; sorry, I was posting this as more of a generic discussion of data-solution approaches.

We are using SQL Server 2012 Enterprise Edition now, and migrating to the latest version is not an issue.

Sorry, I was mainly talking about performance. We receive 200 million rows of data every day, and ideally it all needs to go into the same table. However, as you know, once a table gets past 200 million rows it can become slow, so it seems impossible to put one month of data, roughly 6 billion rows, into a single table. Can this really be solved with indexes and memory?

I really appreciate your advice.

July 21st, 2015 9:52am

Do you really need to deal with all of that data? Perhaps it is possible to archive some portion of it?

We had a client with an almost 2 TB database who did not experience any performance degradation; the point is that you need properly designed indexes.

Also, have you looked into the partitioning feature in SQL Server? It makes the data fairly easy to manage:

https://msdn.microsoft.com/en-us/library/ms190787.aspx
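
For example, a minimal monthly-partitioning sketch (all names hypothetical):

    -- Partition function: one partition per month
    CREATE PARTITION FUNCTION pfMonthly (date)
        AS RANGE RIGHT FOR VALUES ('2015-06-01', '2015-07-01', '2015-08-01');

    -- Map every partition to the PRIMARY filegroup (could be one filegroup each)
    CREATE PARTITION SCHEME psMonthly
        AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- Create the table on the partition scheme, keyed on the load date
    CREATE TABLE dbo.FactRows
    (
        LoadDate   date  NOT NULL,
        CustomerId int   NOT NULL,
        Amount     money NOT NULL
    ) ON psMonthly (LoadDate);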

July 21st, 2015 9:59am

Sorry, I was mainly talking about performance. We receive 200 million rows of data every day, and ideally it all needs to go into the same table. However, as you know, once a table gets past 200 million rows it can become slow, so it seems impossible to put one month of data, roughly 6 billion rows, into a single table. Can this really be solved with indexes and memory?

First, look at partitioned tables, just to help manage the space.

Second, start looking at the columnstore indexes in SQL Server 2012, and at the in-memory tables added in SQL Server 2014 and enhanced in SQL Server 2016. These may help you keep performance acceptable.
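
For instance (hypothetical table and columns; a sketch, not your schema):

    -- SQL Server 2012: nonclustered columnstore index; note the table
    -- becomes read-only while the index exists (updatable clustered
    -- columnstore arrived in SQL Server 2014)
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactRows
        ON dbo.FactRows (LoadDate, CustomerId, Amount);

    -- SQL Server 2014+: memory-optimized table; the database needs a
    -- MEMORY_OPTIMIZED_DATA filegroup first
    CREATE TABLE dbo.HotRows
    (
        RowId    bigint IDENTITY NOT NULL PRIMARY KEY NONCLUSTERED,
        LoadDate datetime2 NOT NULL,
        Amount   money     NOT NULL
    ) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);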

But you may be operating at or beyond what SQL Server can do. A single 6-billion-row table might be manageable, but if you have a dozen of them, with complex relationships and widely varying join magnitudes, you may find SQL Server simply cannot handle it.

(And of course your hardware is appropriately sized and configured, with tons of RAM and lots of SSDs, right?)

As the total database size (not counting big blobs) grows beyond about 2 TB, only a well-balanced logical and physical data model remains tractable in SQL Server. If your application is too complex, or the data grows to 4 TB or 10 TB or beyond, you are probably out of luck.

You can try to "shard" across multiple SQL Server instances, which is what a lot of people do with MySQL. Or you can explore cloud-based scalable architectures or other non-standard SQL-based data warehouse platforms.
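
One built-in way to approximate sharding is a distributed partitioned view; a rough sketch, assuming hypothetical linked servers SRV1 and SRV2 that each hold part of the data:

    -- Each member table holds a disjoint slice of the rows; for the view
    -- to be updatable, each member table also needs a CHECK constraint
    -- on the partitioning column
    CREATE VIEW dbo.FactRowsAll
    AS
        SELECT * FROM SRV1.SalesDB.dbo.FactRows_2015H1
        UNION ALL
        SELECT * FROM SRV2.SalesDB.dbo.FactRows_2015H2;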

Or you can go to Hadoop and "big data", but you have to realize that even when you are truly forced into it, it is NOT like using SQL Server; it is like re-inventing SQL Server and having to do a better job of it than Microsoft did. In other words, handling big data non-relationally is a LOT more work; it is like going back in time thirty years.

Josh

July 21st, 2015 1:33pm

however, as you know, once a table gets past 200 million rows it can become slow

Maybe Stan knows something that I don't, but I don't know of any magic limit at 200 million rows.

What matters is how you query the data. With the right indexes in place, queries can still be quick. But, yes, if there are queries that analyse all of it, they are going to take some time. Josh suggested columnstore indexes, and that is certainly worth looking into. There is also the Parallel Data Warehouse, which I believe goes by the name APS these days.
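
For example (hypothetical table and columns), a date-range query backed by a suitable index will seek into just the rows it needs, whatever the total table size:

    CREATE INDEX ix_FactRows_LoadDate
        ON dbo.FactRows (LoadDate) INCLUDE (Amount);

    -- Touches one day's rows, not all 6 billion
    SELECT SUM(Amount)
    FROM dbo.FactRows
    WHERE LoadDate >= '20150701' AND LoadDate < '20150702';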

July 21st, 2015 5:57pm

Hi Josh:

Your advice is really helpful, and I agree with your points, especially the last one. I actually hired a team to do Hadoop earlier this year, and we did get the data managed across the server cluster. However, moving our analysts' skills from T-SQL to Hive has been close to mission impossible: most advanced functions, and even some regular ones, are not available in Hive, which drove us nuts.

We are an analytical consulting company, not a technology company, so the user experience of our front-line analysts matters far more than technological efficiency. I am trying to make the decision that best suits our current situation. I do have a dedicated IT team to manage this, but, you know, they always tend to favour the option that suits themselves best, which is why I am here seeking opinions from experienced people.

Thanks again for your advice. I will look into columnstore indexes and in-memory tables. Have a good day.

July 21st, 2015 6:49pm

Just one point: partitioning is more about manageability and less about performance.
July 22nd, 2015 1:37am

This topic is archived. No further replies will be accepted.
