In this section, we’ll look at how the Table service scales using partitioning at the storage account and table levels. To achieve a highly scalable service, the Table service splits your data into more manageable partitions that can then be spread across multiple servers. As developers, we can control how this data is partitioned to maximize the performance of our applications.
Let’s look at how this is done at the storage account layer.
1. Partitioning the storage account
In this section, we’ll look at how data is partitioned, but we’ll leave performance optimization to a later section.
In figure 1,
there were two tables within a storage account (ShoppingCart and
Products). As the Table service isn’t a relational database, there’s no
way to join these two tables on the server side. Because there’s no
physical dependency between any two tables in the Table service, Windows
Azure can scale the data storage beyond a single server and store
tables on separate physical servers.
Figure 1
shows how these tables could be split across the Windows Azure data
center. In this figure, you’ll notice that the Products table lives on
servers 1, 2, and 4, whereas the ShoppingCart table resides on servers
1, 3, and 4. In the Windows Azure data center, you have no control over
where your tables will be stored. The tables could reside on the same
server (as in the case of servers 1 and 4) but they could easily live on
completely separate servers (servers 2 and 3). In most situations, you
can assume that your tables will physically reside on different servers.
In
order to protect you from data loss, Windows Azure guarantees to
replicate your data to at least three different servers as part of the
transaction. This data replication guarantee means that if there’s a
hardware failure after the data has been committed, another server will
have a copy of your data.
Once a transaction is
committed (and your data has therefore been replicated at least three
times), the Table service is guaranteed to serve the new data and will
never serve older versions. This means that if you insert a new Hawaiian
shirt entity on server 1, you can only be load balanced onto one of the
servers that has the latest version of your data. If server 2 was not
part of the replication process and contains stale data, you won’t be
load balanced onto that server. You can safely perform a read of your
data straight after a write, knowing that you’ll receive the latest copy
of the data.
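To make the guarantee concrete, here’s a minimal sketch of a write followed immediately by a read. It assumes the StorageClient library that ships with the Windows Azure SDK; the ShirtEntity class, the Products table name, and the connection string parameter are placeholders for illustration only:

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// TableServiceEntity supplies the PartitionKey, RowKey, and Timestamp properties
public class ShirtEntity : TableServiceEntity
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public class ReadAfterWriteExample
{
    public static void Run(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudTableClient tableClient = account.CreateCloudTableClient();
        tableClient.CreateTableIfNotExist("Products");

        // Write: the insert isn't committed until the data has been
        // replicated to at least three servers
        TableServiceContext writeContext = tableClient.GetDataServiceContext();
        writeContext.AddObject("Products", new ShirtEntity
        {
            PartitionKey = "Shirts",
            RowKey = "hawaiian-1",
            Name = "Hawaiian Shirt"
        });
        writeContext.SaveChanges();

        // Read straight after the write: the request is only load balanced
        // onto servers holding the committed data, so the new entity comes
        // back, never a stale copy
        ShirtEntity stored = tableClient.GetDataServiceContext()
            .CreateQuery<ShirtEntity>("Products")
            .Where(s => s.PartitionKey == "Shirts" && s.RowKey == "hawaiian-1")
            .AsEnumerable()
            .FirstOrDefault();

        Console.WriteLine(stored.Name);
    }
}
```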
The Amazon SimpleDB
database (which has roughly the same architecture as the Windows Azure
Table service) doesn’t have this replication guarantee by default. Due
to replication latency, it isn’t uncommon in SimpleDB for newly written
data not to exist or to be stale when a read is performed straight after
a write. This situation can never occur with the Windows Azure Table
service.
Now that you’ve seen how
different tables within a single account will be spread across multiple
servers to achieve scalability, it’s worth looking at how you can
partition data a little more granularly, and split data within a single
table across multiple servers.
2. Partitioning tables
One of the major issues
with traditional SQL Server–based databases is that individual tables
can grow too large, slowing down all operations against the table.
Although the Windows Azure Table service is highly efficient, storing
too much data in a single table can still degrade data access
performance.
The Table service lets you specify how your table can be split into smaller partitions by requiring each entity to contain a partition key. The Table service can then scale out by storing different partitions of data on separate physical servers. Any entities with the same partition key must reside together on the same physical server.
In tables 1 through 3, all the data was stored in the same partition (Shirts), meaning that all three shirts would always reside together on the same server, as shown in figure 1. Table 4 shows how you could split your data into multiple partitions.
Table 4. Splitting partitions by partition key
| Timestamp | PartitionKey | RowKey | PropertyBag |
| --- | --- | --- | --- |
| 2009-07-01T16:20:32 | Red | 1 | Name: Red Shirt; Description: Red |
| 2009-07-01T16:20:33 | Blue | 1 | Name: Blue Shirt; Description: A Blue Shirt |
| 2009-07-01T16:20:33 | Blue | 2 | Name: Frilly Blue Shirt; Description: A Frilly Blue Shirt |
| 2009-07-05T10:30:21 | Red | 2 | Name: Frilly Pink Shirt; Description: A Frilly Pink Shirt; ThumbnailUri: frillypinkshirt.png |
In table 4, the Red Shirt and the Frilly Pink Shirt now reside in the Red partition, and the Blue Shirt and the Frilly Blue Shirt are now stored in the Blue partition. Figure 2 shows the shirt data from table 4 split across multiple servers. In this figure, the Red partition data (Red Shirt and Frilly Pink Shirt) lives on server A and the Blue
partition data (Blue Shirt and Frilly Blue Shirt) is stored on server B.
Although the partitions have been separated out to different physical
servers, all entities within the same partition always reside together
on the same physical server.
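In code, the partition an entity lands in is decided purely by the PartitionKey value you assign before saving it. Below is one way the entities in table 4 might be created; the sketch assumes the StorageClient library from the Windows Azure SDK, and the ShirtEntity class and Products table name are placeholders invented for this example:

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// TableServiceEntity supplies the PartitionKey, RowKey, and Timestamp properties
public class ShirtEntity : TableServiceEntity
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public class PartitionExample
{
    public static void Run(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudTableClient tableClient = account.CreateCloudTableClient();
        tableClient.CreateTableIfNotExist("Products");
        TableServiceContext context = tableClient.GetDataServiceContext();

        // The Red partition: these two entities always live together
        AddShirt(context, "Red", "1", "Red Shirt", "Red");
        AddShirt(context, "Red", "2", "Frilly Pink Shirt", "A Frilly Pink Shirt");

        // The Blue partition: these may be stored and served from a
        // different physical server than the Red partition
        AddShirt(context, "Blue", "1", "Blue Shirt", "A Blue Shirt");
        AddShirt(context, "Blue", "2", "Frilly Blue Shirt", "A Frilly Blue Shirt");

        context.SaveChanges();
    }

    private static void AddShirt(TableServiceContext context, string partitionKey,
                                 string rowKey, string name, string description)
    {
        context.AddObject("Products", new ShirtEntity
        {
            PartitionKey = partitionKey,
            RowKey = rowKey,
            Name = name,
            Description = description
        });
    }
}
```

Nothing else about the entity influences placement; whatever value you assign to PartitionKey decides which partition, and therefore potentially which server, the entity lives on.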
Row Keys
The final property to
explain is the row key. The row key uniquely identifies an entity within
a partition, meaning that no two entities in the same partition can
have the same row key, but any two entities that are stored in different
partitions can have the same key. If you look at the data stored in table 11.5,
you can see that the row key is unique within each partition but not
unique outside of the partition. For example, Red Shirt and Blue Shirt
both have the same row key but live in different partitions (Red and
Blue).
The partition key and the
row key combine to uniquely identify an entity—together they form a
composite primary key for the table.
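Because the partition key and row key together identify an entity, the pair is all you need to fetch a single row. The following sketch shows a point lookup using both keys; as before, the StorageClient library, the ShirtEntity class, and the Products table name are assumptions made for this example:

```csharp
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// TableServiceEntity supplies the PartitionKey, RowKey, and Timestamp properties
public class ShirtEntity : TableServiceEntity
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public class PointLookupExample
{
    public static ShirtEntity GetShirt(string connectionString,
                                       string partitionKey, string rowKey)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        TableServiceContext context =
            account.CreateCloudTableClient().GetDataServiceContext();

        // Filtering on both keys resolves to at most one entity, because
        // the pair acts as the table's composite primary key
        return context.CreateQuery<ShirtEntity>("Products")
                      .Where(s => s.PartitionKey == partitionKey
                               && s.RowKey == rowKey)
                      .AsEnumerable()
                      .FirstOrDefault();
    }
}
```

Calling GetShirt(connectionString, "Red", "1") and GetShirt(connectionString, "Blue", "1") returns two different shirts, even though both entities use the row key 1.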
Indexes
Now that you have a basic
understanding of how data is logically stored within the Table service,
it’s worth talking briefly about the indexing of the data.
There are a few rules of thumb regarding data-access speeds:
- Retrieving an entity with a unique partition key is the fastest access method.
- Retrieving an entity using the partition key and row key is very fast (the Table service needs to use only the index to find your data).
- Retrieving an entity using the partition key and no row key is slower (the Table service needs to read all properties for each entity in the partition).
- Retrieving an entity using no partition key and no row key is very slow, relatively speaking (the Table service needs to read all properties for all entities across all partitions, which can span separate physical servers).
We’ll explore these points in more detail as we go on.
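To see what this means in practice, compare a filter that names a partition with one that doesn’t. As before, this is a sketch that assumes the StorageClient library from the Windows Azure SDK and the placeholder ShirtEntity class and Products table used in the earlier examples:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

// TableServiceEntity supplies the PartitionKey, RowKey, and Timestamp properties
public class ShirtEntity : TableServiceEntity
{
    public string Name { get; set; }
    public string Description { get; set; }
}

public class QueryShapeExample
{
    public static void Run(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        TableServiceContext context =
            account.CreateCloudTableClient().GetDataServiceContext();

        // Partition scan: slower than a point lookup because every entity in
        // the Red partition is read, but the work stays within one partition
        List<ShirtEntity> redShirts = context
            .CreateQuery<ShirtEntity>("Products")
            .Where(s => s.PartitionKey == "Red")
            .ToList();

        // Full table scan: no partition key in the filter, so every entity in
        // every partition is read, potentially across several physical servers
        List<ShirtEntity> pinkShirts = context
            .CreateQuery<ShirtEntity>("Products")
            .Where(s => s.Name == "Frilly Pink Shirt")
            .ToList();

        Console.WriteLine("{0} red, {1} pink", redShirts.Count, pinkShirts.Count);
    }
}
```

Where you can, include the partition key in the filter so the query can be directed at a single partition rather than the whole table.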
Because data is
partitioned and replicated across multiple servers, all requests via the
REST API can be load balanced. This combination of data replication,
data partitioning, and a large web server farm provides you with a
highly scalable storage solution that can evenly distribute data and
requests across the data center. This level of horsepower and data
distribution means that you shouldn’t need to worry about overloading
server resources.
Now that we’ve covered the
theory of table storage, it’s time to put it into practice. Let’s open
Visual Studio and start storing some data.