Working with the Table service REST API - Batching data

3/16/2011 10:14:34 PM

You used both the StorageClient library and the REST API to insert new entities into the Products table. In this section, we’ll look at how you can both improve performance and perform transactional changes by batching up data.

The following code inserts multiple entities into the Products table using the StorageClient library:

var shirtContext = new ProductContext();

for (int i = 0; i < 10; i++)
{
   shirtContext.AddObject("Products",
   new Product
   {
      PartitionKey = "Shirts",
      RowKey = i.ToString(),
      Name = "Shirt" + i.ToString(),
      Description = "A Shirt"
   });
}
shirtContext.SaveChanges();

The preceding code will create 10 new shirts and add each new shirt to a list of objects that are to be tracked; it does this by calling the AddObject method on the shirtContext object. Following the Unit of Work pattern, the context object won’t send any changes to the Table service until the SaveChanges method is called. It will then iterate through the list of tracked objects and insert them into the Products table.

By default, the SaveChanges call will insert the entities into the table one by one rather than batching the inserts into a single call. Figure 1 shows the HTTP traffic for the preceding call, captured by using Wireshark (a packet-sniffing tool).

Figure 1. By default, the context object will save each entity with an individual request rather than saving them all in a batch.

As you can see from figure 1, to insert 10 shirts, the application must perform 10 HTTP POST requests to the Table service. This method can cause performance problems if you’re inserting a large number of entities and your application is outside of Windows Azure or your web or worker role isn’t affinitized to the same data center as your storage account.

Warning

Due to latency, inserting 10 shirts using the preceding code took 4 seconds between our local machine and the live Table service. When running the same code as a web role in the Windows Azure data center, it took milliseconds.

Although minimizing latency will give large performance benefits, you can gain larger performance improvements by batching up inserts into single calls using entity group transactions.

Note

Due to the flexible nature of the Windows Azure platform, you can host your storage account and your web and worker roles in different data centers. As you can see from the previous example, this flexibility comes at a price: latency. For the best performance, always affinitize your web roles, worker roles, and storage service to the same data center to minimize latency.

1. Entity group transactions

Entity group transactions are a type of batch operation in which the whole batch is treated as a single transaction: it either succeeds completely or is rolled back entirely. First, let’s look at how batch inserts are done.

Passing SaveChangesOptions.Batch as a parameter to the SaveChanges method will batch up all changes into a single HTTP POST:

shirtContext.SaveChanges(SaveChangesOptions.Batch);

Batching up the data like this reduced our insert of 10 shirts (from the local machine to the live service) from 4 seconds to 1 second.

The SaveChangesOptions parameter passed to the SaveChanges method also specifies what happens if the inserts aren’t entirely successful:

SaveChangesOptions.None: This is the default when no option is passed. If a tracked entity fails to be inserted, the context object stops and won’t attempt to save any further entities. Any entities that were already saved successfully aren’t rolled back and remain in the table.
SaveChangesOptions.ContinueOnError: If an entity fails to save, the context object will continue to save all the other entities.
SaveChangesOptions.Batch: All entities are processed as a batch in the scope of a single transaction, known as an entity group transaction. If any entity in the batch fails to be inserted, the whole batch is rolled back.

These are the rules for using entity group transactions:

A maximum of 100 operations can be performed in a single batch.
The batch may not exceed 4 MB in size.
All entities in the batch must have the same partition key.
You can only perform a single operation against an entity in a batch.

In this book, we won’t discuss the REST implementation of entity group transactions due to the complexity of the implementation. But it’s worth noting that if you decide to use the REST implementation, the Table service only implements a subset of the available functionality. As of the PDC 2009 release, the Table service only supports single changesets (a changeset being a set of inserts, updates, or deletes) within a batch.

Note

If you’re interested in looking at the REST implementation of batching, you should look up the “Performing Entity Group Transactions” MSDN article: http://msdn.microsoft.com/en-us/library/dd894038.aspx.

Entity group transactions are executed using an isolation method known as snapshot isolation. This is a standard method of isolation used in relational databases such as SQL Server or Oracle; it’s also known as multiversion concurrency control (MVCC). A snapshot of the data is taken at the beginning of a transaction, and it’s used for the duration of the transaction. This means that all operations within the transaction will use the same set of isolated data that can’t be interfered with by other concurrent processes. Because the data is isolated from all other processes, there’s no need for locking on the table, meaning that operations can’t be blocked by other processes. On committing the transaction, if any modified data has been changed by another process since the snapshot began, the whole transaction must be rolled back and retried.
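
The commit-time check described in the last sentence can be illustrated with a small in-memory model. This is only a sketch of the idea and not how the Table service is implemented: VersionedStore and its members are hypothetical names, the store is single-threaded, and it models only the version check at commit, not the snapshot reads.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory store that keeps a version number per row.
public class VersionedStore
{
    private readonly Dictionary<string, string> values = new Dictionary<string, string>();
    private readonly Dictionary<string, int> versions = new Dictionary<string, int>();

    // Rows that have never been written are at version 0.
    public int VersionOf(string key)
    {
        int version;
        return versions.TryGetValue(key, out version) ? version : 0;
    }

    public string Read(string key)
    {
        string value;
        return values.TryGetValue(key, out value) ? value : null;
    }

    // Applies all writes or none. 'snapshotVersions' holds the version of each
    // row as seen when the transaction started; if any written row has moved
    // on since then, nothing is written and the caller is expected to retry.
    public bool TryCommit(IDictionary<string, int> snapshotVersions,
                          IDictionary<string, string> writes)
    {
        foreach (var key in writes.Keys)
        {
            int seenVersion;
            if (!snapshotVersions.TryGetValue(key, out seenVersion))
            {
                seenVersion = 0;
            }
            if (VersionOf(key) != seenVersion)
            {
                return false; // conflict: roll back by writing nothing
            }
        }

        foreach (var write in writes)
        {
            values[write.Key] = write.Value;
            versions[write.Key] = VersionOf(write.Key) + 1;
        }
        return true;
    }
}
```

In this model, two writers that start from the same snapshot can't both commit a change to the same row: the second one sees a version mismatch and has to take a fresh snapshot and try again.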

2. Retries

In order to handle the MVCC model, your code must be able to perform retries. The ability to handle retries is built into the StorageClient library and can be configured using the following code:

shirtContext.RetryPolicy =
    RetryPolicies.Retry(5, TimeSpan.FromSeconds(1));

The preceding retry policy will reattempt the SaveChanges operation up to five times, waiting one second between attempts. If you don’t want any retries, you can set the policy to NoRetry:

shirtContext.RetryPolicy = RetryPolicies.NoRetry();

If you need a more complicated retry policy, such as one with randomized back-off timings, or if you wish to define your own, you can assign it to the same RetryPolicy property. Unfortunately, if you’re using the REST API directly, you’ll need to roll your own retry logic.

In order to make use of the standard retry logic, you’ll need to use the SaveChangesWithRetries method rather than the SaveChanges method, as follows:

shirtContext.SaveChangesWithRetries();
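
For the REST API case, where you have to roll your own retry logic, the following is one possible sketch of a retry loop with randomized exponential back-off. RetryHelper and its parameters are hypothetical names, not part of the StorageClient library, and for brevity the loop retries on any exception; real code would probably retry only on transient failures such as timeouts or commit conflicts.

```csharp
using System;
using System.Threading;

// Hypothetical helper for hand-rolled retries; not a StorageClient type.
public static class RetryHelper
{
    // Doubles the delay on each attempt, caps it at maxDelay, then scales it
    // by a random factor between 0.5 and 1.0 so that competing clients don't
    // all retry at the same moment.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay,
                                        TimeSpan maxDelay, Random random)
    {
        double exponential = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
        double capped = Math.Min(exponential, maxDelay.TotalMilliseconds);
        double jittered = capped * (0.5 + random.NextDouble() / 2);
        return TimeSpan.FromMilliseconds(jittered);
    }

    // Runs the operation, retrying up to maxRetries times after a failure.
    // The final failure is rethrown to the caller.
    public static T Execute<T>(Func<T> operation, int maxRetries,
                               TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var random = new Random();
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return operation();
            }
            catch (Exception)
            {
                if (attempt >= maxRetries)
                {
                    throw;
                }
                Thread.Sleep(BackoffDelay(attempt, baseDelay, maxDelay, random));
            }
        }
    }
}
```

A call such as RetryHelper.Execute(() => SendBatch(), 5, TimeSpan.FromSeconds(1), TimeSpan.FromSeconds(30)) would then wrap whatever method issues the HTTP request, where SendBatch stands for your own code.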

Use retries for queries too

Although retry policies are vital when using entity group transactions, they can also be useful when querying data. Your web and worker roles are based in the cloud and can be shut down and restarted at any time by the Fabric Controller (for example, after a hardware failure), so a query can fail for reasons that have nothing to do with your code. To make your application more robust, it may be advisable to use retry policies when querying data too.

So far we’ve covered the modification of data in quite a lot of detail. But entity group transactions can also be useful for querying data. With that in mind, it’s worth breaking away from data updates and focusing on how to retrieve data via the REST API.
