Azure Table Storage, Domain Persistence and Concerns

M Sheik Uduman Ali

Sep 20, 2011

CC (ASA 2.5)

Domain modeling is a vital part of application development, and few would dispute the importance of domain-driven design. This post is about the anti-corruption layer between domain objects and data persistence in the Azure world. Whenever I start working on an object-repository framework, this well-known quote, often attributed to Einstein, echoes in my mind:

In theory, theory and practice are the same. In practice, they are not.

We have to compromise on the “persistence-agnostic domain model” principle. Even popular frameworks such as ActiveRecord (Rails) and Entity Framework (.NET) have made this compromise. I will skip the compromise part for now.

In Azure, you have two choices for persisting objects: Table Storage and SQL Azure. Typically, Web 2.0 applications use a mixed approach, keeping frequently read, read-only data in a NoSQL store and the source of truth in a relational data store (CQRS). This is recommended when your application runs in the cloud, because every byte is metered and billed. In this article, I outline the concerns that arise when you choose Table Storage as the source of truth. Would Azure Table Storage be a good choice for a domain object repository? It is too early to say whether Azure Table Storage should be highly recommended, partially used, or completely avoided for a domain object repository, but I can share some of my experiences with it.

For the sake of high performance, Azure Table Storage drops some features that are typically available in other NoSQL products. It is like a bike company that advertises an ambitious mileage figure, but only if you ride on a specific road, in specific weather, and under a specific load.

I have used the familiar Customer-Order domain model in this post, as shown in the figure below:

The actual classes are:

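using System;
using System.Collections.Generic;

public class Customer
{
    public int Id { get; set; }
    public string Name { get; set; }
    public string City { get; set; }

    // Returns this customer's orders that match the criteria.
    // The body is elided in the original; the predicate type is an assumption.
    public List<Order> GetOrders(Func<Order, bool> criteria)
    {
        throw new NotImplementedException();
    }
}

public class Product
{
    public int Id { get; set; }
    public string Name { get; set; }
    public double UnitPrice { get; set; }
}

public class Order
{
    public int Id { get; set; }
    public DateTime QuotedAt { get; set; }
    public int Status { get; set; }
    public int CustomerId { get; set; }   // references Customer.Id
    public List<OrderLine> OrderLines { get; set; } = new List<OrderLine>();

    // Total price of the order; the body is elided in the original.
    public double CalculatePrice()
    {
        throw new NotImplementedException();
    }
}

public class OrderLine
{
    public int Id { get; set; }
    public int ProductId { get; set; }    // references Product.Id
    public double Quantity { get; set; }
    public int OrderId { get; set; }      // references Order.Id

    // Price of this line; the body is elided in the original.
    public double CalculatePrice()
    {
        throw new NotImplementedException();
    }
}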

Is a Key-Value Data Store Enough?

Key-value data stores were the starting point of the NoSQL revolution; document data stores were later widely adopted for object persistence. A document data store can persist an object (a complex data type) against a key, whereas a key-value data store can persist only scalar values. As a result, a key-value entity model looks very much like a relational representation (primary key, foreign key, and link table), while a document data store lets us embed one object in another. In the example above, a Customer's Order objects can be embedded within Customer, while OrderLine and Product are held by reference. Azure Table Storage, however, is just a key-value data store, so you still have to maintain the metadata for referential integrity yourself.
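To make the difference concrete, here is a minimal sketch of what flattening looks like when only scalar values can be stored. The FlatEntity type, the Table and RowKey naming scheme, and the OrderFlattener helper are all hypothetical names made up for illustration; this is not the Azure client API, and the Order and OrderLine classes are trimmed copies of the ones above.

```csharp
using System;
using System.Collections.Generic;

public class OrderLine
{
    public int Id { get; set; }
    public int ProductId { get; set; }
    public double Quantity { get; set; }
    public int OrderId { get; set; }
}

public class Order
{
    public int Id { get; set; }
    public int CustomerId { get; set; }
    public List<OrderLine> OrderLines { get; set; } = new List<OrderLine>();
}

// Hypothetical flat row: scalar properties only, addressed by two string keys.
public class FlatEntity
{
    public string Table { get; set; }
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public Dictionary<string, object> Properties { get; } = new Dictionary<string, object>();
}

public static class OrderFlattener
{
    // Turns one order and its lines into separate flat rows. The nested list
    // cannot be embedded, so each line becomes its own row that points back
    // to the order through a scalar OrderId property.
    public static List<FlatEntity> Flatten(Order order, string partitionKey)
    {
        var rows = new List<FlatEntity>();

        var orderRow = new FlatEntity
        {
            Table = "Order",
            PartitionKey = partitionKey,
            RowKey = order.Id.ToString()
        };
        orderRow.Properties["CustomerId"] = order.CustomerId; // reference kept as a scalar
        rows.Add(orderRow);

        foreach (var line in order.OrderLines)
        {
            var lineRow = new FlatEntity
            {
                Table = "OrderLine",
                PartitionKey = partitionKey,
                RowKey = line.Id.ToString()
            };
            lineRow.Properties["OrderId"] = order.Id;          // back-reference to the order
            lineRow.Properties["ProductId"] = line.ProductId;  // reference to the product
            lineRow.Properties["Quantity"] = line.Quantity;
            rows.Add(lineRow);
        }

        return rows;
    }
}
```

Nothing in the rows themselves guarantees that an OrderId or ProductId still points at an existing entity; in this sketch that bookkeeping is left entirely to the repository code.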

Am I Smart Enough on the “Partition Key” Decision?

The physical location of a table's data in Azure (as in other sharded NoSQL data stores) is determined by the partition key you choose (the shard key in MongoDB). You get index-like fast query results only when you supply both the partition key and the row key. Hence, partition key selection is one of the key architectural decisions for a cloud application, and how smartly you choose it matters. In the Customer-Order model, we can simply choose the following partition keys for the respective tables:

Customer – First letter of Name in upper case
Product – First letter of Name in upper case
Order – Again use the partition key of the customer, since Order is always made by a customer
OrderLine – Either use the Order table partition key or the Product table partition key; if we choose the Order table, the OrderLine table ends up using the Customer table partition key
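The scheme in the list above can be sketched in a few lines. The PartitionKeys class and its method names are hypothetical, and the sketch assumes the OrderLine table follows the Order table (and therefore the Customer) rather than the Product.

```csharp
using System;

public static class PartitionKeys
{
    // Customer and Product: first letter of Name in upper case.
    public static string FromName(string name)
    {
        if (string.IsNullOrWhiteSpace(name))
            throw new ArgumentException("Name must not be empty.", nameof(name));
        return name.Trim().Substring(0, 1).ToUpperInvariant();
    }

    // Order: reuse the partition key of the customer who placed it.
    public static string ForOrder(string customerName)
    {
        return FromName(customerName);
    }

    // OrderLine: follow the Order table, which in turn follows the Customer.
    public static string ForOrderLine(string customerName)
    {
        return ForOrder(customerName);
    }
}
```

With this choice, a customer's orders and order lines all carry the same partition key value as the customer, while products are keyed independently by their own names.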

The figure below depicts how these tables would be distributed in a data center with four nodes:

Now the concerns are:

Does table partitioning always happen, or does it depend on capacity? Typically, NoSQL data stores (such as MongoDB) start sharding only when the current machine is running out of drive capacity, which seems optimal. However, there is no clear picture of how Azure Table Storage shards.

The data store is even smarter than I am when it comes to sharding. MongoDB only asks for one or more properties of the stored object as the shard key. Based on the load, such data stores scale data out across servers, and the sharding algorithm intelligently splits the entities between the available servers based on the values of the shard key. Azure, however, asks for the exact value of the partition key and groups entities that have the same value. Azure does not reveal the internals of how partitioning happens. Will it scale out across all the nodes in the data center, or is it limited to some number? There is no clear picture here either.

What happens if the entities with the same partition key on a single table server run out of storage capacity? I do not have a clear picture. Some papers mention that the table server is an abstraction, called the “access layer”, which sits on top of a “persistent storage layer” containing scalable storage drives. The capacity of the drives is increased based on the current storage usage of a table server.

Interestingly, I found a reply to a similar question in one of the Azure forums (though it was posted a couple of years ago):

…our access layer to the partitions is separated from our persistent storage layer. Even though we serve up accesses to a single partition from a single partition server, the data being stored for the partition is not constrained to just the disk space on that server. Instead, the data being stored for that partition is done through a level of indirection that allows it to span multiple drives and multiple storage nodes.

We do not know how many table servers will actually be created for a table based on the size of its data. Consider a business that takes many orders for a limited number of products from a small number of customers: its Order and OrderLine tables should be scaled out widely, but here the partitioning of these two tables is restricted by the small Customer table. A Web 2.0 company, on the other hand, may have many products and customers, with Order and OrderLine tables that grow proportionally large. If these are scaled out over a large number of table servers, co-location of OrderLine with Product becomes as important as co-location for the Customer-Order association. If the Azure Table Storage scale-out algorithm partitions each table based only on knowledge of that table, unnecessary network latency will be introduced.
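The restriction imposed by a small Customer table can be seen with a quick count. The sketch below assumes the first-letter partition key scheme chosen earlier; the PartitionSkew class is a hypothetical helper, and it only counts orders per partition key value without saying anything about how Azure places those partitions on servers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionSkew
{
    // Counts orders per partition key when every Order inherits the first
    // letter of its customer's name as its partition key.
    // Input: customer name mapped to that customer's number of orders.
    public static Dictionary<string, int> OrdersPerPartition(
        IEnumerable<KeyValuePair<string, int>> orderCountByCustomerName)
    {
        return orderCountByCustomerName
            .GroupBy(pair => pair.Key.Trim().Substring(0, 1).ToUpperInvariant())
            .ToDictionary(group => group.Key, group => group.Sum(pair => pair.Value));
    }
}
```

However many orders exist, they can never be spread over more partition key values than there are distinct first letters among the customer names, so a handful of customers means a handful of Order and OrderLine partitions.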

Is the ADO.NET Data Services Serializer Enough for the Business?

Enumerations are very common in domain models. However, the ADO.NET Data Services serializer does not serialize them. We either need to remove the enumerated properties or write a custom serializer.
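One way to take the first route, removing the enumerated property, is to keep a plain int property for persistence and expose the enumeration through methods instead. This is only a sketch: the OrderStatus values and the OrderEntity class are hypothetical, and it assumes the serializer inspects public properties only, so it never encounters the enum type.

```csharp
using System;

// Hypothetical status values; the original model only has an int Status.
public enum OrderStatus
{
    Quoted = 0,
    Confirmed = 1,
    Shipped = 2
}

public class OrderEntity
{
    public int Id { get; set; }

    // Persisted as a plain int, like Order.Status in the model above.
    public int Status { get; set; }

    // Methods rather than properties, so a property-based serializer
    // (an assumption of this sketch) has no enum-typed member to handle.
    public OrderStatus GetStatus()
    {
        if (!Enum.IsDefined(typeof(OrderStatus), Status))
            throw new InvalidOperationException("Unknown status value: " + Status);
        return (OrderStatus)Status;
    }

    public void SetStatus(OrderStatus status)
    {
        Status = (int)status;
    }
}
```

The domain code works with OrderStatus, while only the int ever reaches the data store. The cost is that the persistence concern leaks into the entity's shape, which is exactly the kind of compromise mentioned at the start of this post.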

Final Words

So, either teach me what you know about the concerns above, or add your own concerns to this post.