Why is this social network query so slow in Azure Table Storage?

I am developing a simple social networking application and want to use Azure Table Storage because it is cheap and is supposed to scale very well. Ive tried to follow the guidance in this article to design the tables and queries (https://msdn.microsoft.com/en-us/library/azure/hh508997.aspx). However, when I load the table with millions of microblog records, the query takes more than 5 minutes. Here are my tables and the query:

Users Table
PartitionKey - GUID
RowKey Email address
FirstName
LastName
ThumbnailImageUrl

Friends Table 
PartitionKey Email address of user1
RowKey Email address of user2
Timestamp

Microblogs Table
PartitionKey - DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks
RowKey Email address of user
Timestamp
Text Text of message
FirstName First Name of user
LastName Last Name of user
ThumbnailImageUrl Thumbnail of user

Code snippet with the queries for the microblog:

public Social GetMicroblogs(string userEmail)
{
List<string> lstFriends = null;
lstFriends = context.CreateQuery<Friends>(Friends)
                                                    .Where(Friends => (Friends.PartitionKey.Equals(userEmail)))
                                                    .AsTableServiceQuery(context).Execute()
                                                    .Select(Friends => Friends.RowKey)
                                                    .ToList();
lstFriends.Add(userEmail);

if (lstFriends!= null)
{
              var microblogTableQuery = new TableQuery();
var microblogCondition = string.Empty;
              foreach (var Friends in lstFriends)
microblogCondition += "RowKey eq '" + Friends + "' or ";
microblogCondition = microblogCondition.Remove(microblogCondition.Length - 3);

var microblogDynamicTableEntities = microblogTable.ExecuteQuery(microblogTableQuery.Where(microblogCondition))
                                                                        .Skip((pageIndex - 1)*pageSize).Take(pageSize)
                                                                        .ToList();
}
}

Logic
Each time two users become friends, the software will add a record for each user in a Friends table. For example, if user1 connects with user2, then two records are added to the table, as follows: Record 1 would have PK = user1 and RK = user2, Record 2 would have PK = user2 and RK = user1.

I chose DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks as the partition key in the Microblogs table so that the query results are automatically sorted in chronological order with the newest microblogs first. I do not need to use entity group transactions, and since I will have a large number of microblogs I assumed that having small partitions will enable the queries to be load balanced across multiple servers.

In my query above, I first get the list of friends and then use that as the filter for the query on the microblog table.

Question
My question is, why is this query so slow? Based on the article referenced above, I expected it to achieve the fastest results?




  • Edited by Xethnic Friday, May 22, 2015 10:33 PM
May 22nd, 2015 10:27pm

Thanks, Sadiqh. I really appreciate your help. That post on Stackoverflow is also mine. Before I did the design, I sought advice on this forum and Stackoverflow. The Stackoverflow response gave me some good ideas that led to the current design, so I posted some comments hoping the original contributor might have some good feedback. -- Tim
Free Windows Admin Tool Kit Click here and download it now
May 24th, 2015 7:19am

Hi Xethnic,

The query you're using for microblog table will actually trigger a whole table scan since you only provide a fixed RowKey but no PartitionKey to the query. Please consider changing your microblog table design, use UserID as PartitionKey and use "DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks" as RowKey so that this query will only trigger a partition scan.

For further information of Azure Table Storage and how to well design tables, please refer to:

http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx

http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

If you have any further questions, please feel free to ask me here.

Best Regards,

Zhaoxing Lu

May 26th, 2015 2:03am

Hi Xethnic,

The query you're using for microblog table will actually trigger a whole table scan since you only provide a fixed RowKey but no PartitionKey to the query. Please consider changing your microblog table design, use UserID as PartitionKey and use "DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks" as RowKey so that this query will only trigger a partition scan.

For further information of Azure Table Storage and how to well design tables, please refer to:

http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx

http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/

If you have any further questions, please feel free to ask me here.

Best Regards,

Zhaoxing Lu

Free Windows Admin Tool Kit Click here and download it now
May 26th, 2015 2:03am

Thank you, Zhaoxing Lu. I expect each user to have more than 1,000 entities in the microblogs table. In the case of this design, if I only specify the filters for the PartitionKey but don't specify a RowKey, will the query return 1,000 entities for the first user and none of the entities for the other users? I hadn't tried it yet, but that was what I expected based on the documentation. -- Tim
May 26th, 2015 12:59pm

Hi Xethnic,

Yes, if you only specify the filters for the PartitionKey but don't specify a RowKey, the query will return 1,000 entities for the first user and none of the entities for the other users. The query result is sorted firstly in the PartitionKey order, then in RowKey order.

Besides the requirement to query all microblogs of one user, do you have another requirement to query a particular number of "all users' latest microblogs"? If so, please consider creating two tables, one with user ID PK and timestamp RK, and one with timestamp RK and user ID RK. When a microblog is posted, please add one entity to both tables respectively. The advantage of this design is you can support both query effectively, but the disadvantage is the cost of removing an user will be pretty high.

If you are concerning about the consistency issue while inserting entities to two tables, please refer to "Eventually Consistent Transaction Pattern" in the following article.

https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#eventually-consistent-transactions-pattern

Best Regards,

Zhaoxing Lu

Free Windows Admin Tool Kit Click here and download it now
May 28th, 2015 7:02am

Hi Xethnic,

As far as I am concerned, you problem is more likely what a relationship database can solve. If you still would like to choose Azure Storage Table (which is actually a non-SQL database), a solution I can think of is shown below:

Since you would like to implement segmented query inside a friend circle efficiently, the microblogs of the user and his friend should have the same partition key, and only those microblogs are in this partition. Therefore, you need to maintain a MicroblogInFriendCircle table, whose PartitionKey is a user Id, and the RowKey is timestamp of the corresponding microblog. While a microblog is posted, you should add the microblog into the poster and his friends' partitions. In addition, if you'd like to avoid microblog data duplication, please consider maintain a dedicated Microblog table whose PartitionKey is poster user Id and RowKey is microblog Id, so that you can just add microblog Id into MicroblogInFriendCircle as a reference.

When you want to query the microblogs of an user and his friends chronologically, just specify a filter "PK=UserId" and ".Take(20)" to MicroblogInFriendCircle table, and leverage the TableContinuationToken to fetch table entities segment by segment. And if what you stored in MicroblogInFriendCircle table is just microblogs references, you need to get the real microblog from Microblog table by issuing individual 20 requests (which can be done in parallel of course).

A summary: this solution can achieve best performance for your requirement to query the microblogs of an user and his friends chronologically, because it's a partition scan (rather than whole table scan) which leverages the RowKey order perfectly. However, it's not a good solution honestly. Besides complexity, the solution has another disadvantage you should note: if the friend list of an user is changed, you have to maintain the MicroblogInFriendCircle table with lots of adding/deleting operations, which is pretty costly. And again, you have to leverage "Eventually Consistent Transaction Pattern" to achieve data consistency.

Maybe other guys can think of a better table design for your question, the solution mentioned above is just my personal idea. :)

Best Regards,

Zhaoxing Lu

  • Proposed as answer by Zhaoxing Lu Monday, June 01, 2015 8:09 AM
  • Unproposed as answer by Zhaoxing Lu Monday, June 01, 2015 8:09 AM
May 31st, 2015 4:34am

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics