Overview
When designing an API, there are countless factors that must
be considered: security, consistency, state management, style; the list seems
never-ending. One factor that often goes overlooked, however, is scale.
Designing your APIs with scale in mind from the beginning can save hundreds of
hours of development time as the system grows.
Introduction
The definition of an Application Programming Interface (API)
can, at times, be difficult to determine. Technically speaking, any function
that is called by another programmer's code could fit the definition. Debating
which code qualifies as an API is beyond the scope of this article so, for our
purposes, we will assume that basic functions qualify.
The examples in this article are intentionally kept simple
to illustrate the main point. C# functions are used, but the core principles
can apply to almost any language, framework, or system. The data structures in
the examples are modeled on the familiar relational style used by many
industry-standard databases. Again, this is for illustrative purposes only and
should not be viewed as a requirement for applying the principles.
The Requirements
Let's assume we are creating a basic order processing system
for a client and that the three main classes (or "data structures" if
you prefer) are already defined. Below we have a very basic relational class
structure. The Customer class has a "foreign key" (borrowing database
terminology) to Address, and the Order class has foreign keys to Address and
Customer. You are asked to create a library that can be used to process Orders.
The first business rule to implement is that the State of the Customer's
HomeAddress must be the same as the State of the Order's ShippingAddress (don't
ask why; business rules rarely make any sense). ;-)
public class Address
{
    public int AddressId { get; set; }
    public string Street { get; set; }
    public string City { get; set; }
    public string State { get; set; }
    public string Zipcode { get; set; }
}

public class Customer
{
    public Address HomeAddress { get; set; }
    public int CustomerId { get; set; }
    public int HomeAddressId { get; set; }
    public string CustomerName { get; set; }
}

public class Order
{
    public Customer MainCustomer { get; set; }
    public Address ShippingAddress { get; set; }
    public Address BillingAddress { get; set; }
    public int OrderId { get; set; }
    public int CustomerId { get; set; }
    public int ShippingAddressId { get; set; }
    public int BillingAddressId { get; set; }
    public decimal OrderAmount { get; set; }
    public DateTime OrderDate { get; set; }
}
The Implementation
Checking to see if two fields match is certainly an easy
task. Hoping to impress your boss, you whip out the solution in less than ten
minutes. The VerifyStatesMatch function returns a boolean that will indicate to
the caller whether the business rule is being followed or not. You run some
basic tests on your library and you determine that the code only takes, on
average, 50 ms to execute and does not have any flaws. The boss is very
impressed and gives your library to the other developers to use in their
applications.
public bool VerifyStatesMatch(Order order)
{
    bool retVal = false;
    try
    {
        Customer customer = SomeDataSource.GetCustomer(order.CustomerId);
        Address shippingAddress = SomeDataSource.GetAddress(order.ShippingAddressId);
        retVal = customer.HomeAddress.State == shippingAddress.State;
    }
    catch (Exception ex)
    {
        SomeLogger.LogError(ex);
    }
    return retVal;
}
The Problem
The next day you come into work and there is a sticky note
on your monitor: "Come see ASAP - The Boss". You figure that you did
such a great job on your library yesterday, your boss must have an even harder task
for you today. You soon find out, however, that there are some serious problems
with your code.
You: Hi boss, what's up?
Boss: Your library is causing all kinds
of problems in the software!
You: What? How?
Boss: Bob says your algorithm is too
slow, John said it's not working properly, and Steve said something about
"object reference not set to an instance of object".
You: I don't understand, I tested it
yesterday and everything was fine.
Boss: I don't want excuses. Go talk to
the other guys and figure it out!
Not the way you wanted to start your day, right? I would be
surprised if most developers haven't been faced with this kind of situation
before. You thought you had coded your library "perfectly", yet
there appear to be all kinds of problems. By applying the principles of The
One, The Many, The Null, and The Nothing you will be able to see where the API
fails to meet the expectations of others.
The One
http://en.wikipedia.org/wiki/The_Matrix
The first principle to follow is to properly handle
"The One". By The One, I mean that your API should process one
instance of the expected input without any errors that you do not explicitly
tell callers may occur. You might be thinking: "Isn't that obvious?",
but let's look at our example and see how we might not be properly handling
one Order.
Customer customer = SomeDataSource.GetCustomer(order.CustomerId);
Address shippingAddress = SomeDataSource.GetAddress(order.ShippingAddressId);
// Assumes customer.HomeAddress was loaded by the data source.
retVal = customer.HomeAddress.State == shippingAddress.State;
This code assumes that the HomeAddress property loaded properly from the data
source. Although 99.99% of the time it probably will, a bulletproof API must
account for the one-in-a-million scenario where it won't. Also, depending on
the language, the comparison of the two State properties may fail if either
property did not load properly. The point here is that you cannot make any
assumptions about the input you are given or about the data you get from code
that you do not control.
This is the easiest principle to understand, so let's fix our
example and move on.
Customer customer = SomeDataSource.GetCustomer(order.CustomerId);
Address shippingAddress = SomeDataSource.GetAddress(order.ShippingAddressId);
// Do not trust anything returned by code we do not control.
if (customer != null && customer.HomeAddress != null && shippingAddress != null)
{
    retVal = customer.HomeAddress.State == shippingAddress.State;
}
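To make the failure modes concrete, here is a minimal, self-contained sketch of the guarded comparison. The StateRule class and its StatesMatch method are hypothetical names used only for illustration, and the trimmed-down Customer and Address classes stand in for the ones defined earlier. The sketch also assumes that a missing State should never count as a match, which is a design choice rather than something the business rule spells out.

```csharp
using System;

// Trimmed-down stand-ins for the article's classes (illustration only).
public class Address
{
    public string State { get; set; }
}

public class Customer
{
    public Address HomeAddress { get; set; }
}

// Hypothetical helper, not part of the original library.
public static class StateRule
{
    // Returns true only when both states are present and equal.
    // Any missing piece of data is treated as "rule not satisfied".
    public static bool StatesMatch(Customer customer, Address shippingAddress)
    {
        if (customer == null || customer.HomeAddress == null || shippingAddress == null)
        {
            return false;
        }

        string home = customer.HomeAddress.State;
        string shipping = shippingAddress.State;

        // Assumption: a null State never matches, even against another null.
        return home != null && string.Equals(home, shipping, StringComparison.Ordinal);
    }
}
```

Note that a plain == comparison of two strings in C# does not throw on null; it simply reports two null States as equal, which is arguably worse than an exception because the rule silently passes.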
The Many
http://msdn.microsoft.com/en-us/library/w5zay9db.aspx
Getting back to our scenario above, we need to talk to Bob.
Bob said the code was too slow, but 50 ms is well within the accepted execution
time given the architecture of the system. Well, it turns out that Bob has to
process 100 Orders for your largest customer in a batch, so the call to your
method is taking 5 seconds total in the loop he is using.
Bob's code:
foreach (Order order in bobsOrders)
{
    ...
    bool success = OrderProcess.VerifyStatesMatch(order);
    ....
}
You: Bob, why do you think my code is
too slow? It only takes 50 ms to process an order.
Bob: Customer Acme Inc. demands the
fastest performance possible for their batch orders. I need to process 100
orders so 5 seconds is too slow.
You: Oh, I didn't know we needed to
process orders in batches.
Bob: Well, it's only for Acme because
they are our largest customer.
You: Oh, I wasn't told anything about
Acme or batch orders.
Bob: Well, shouldn't your code be able
to efficiently handle the processing of more than one order at a time?
You: Oh... yeah, of course.
It's very obvious what happened and why Bob thinks the code is "too slow".
You were not told about Acme and no one said anything about batch processing.
Bob's loop is loading the same Customer and, most likely, the same Address
record 100 times. This issue can easily be fixed by accepting an array of Orders
instead of just one and by adding some simple caching. The C# params keyword
was designed for situations just like this.
public bool VerifyStatesMatch(params Order[] orders)
{
    bool retVal = false;
    try
    {
        // Per-call caches so each Customer and Address is loaded only once.
        var customerMap = new Dictionary<int, Customer>();
        var addressMap = new Dictionary<int, Address>();
        foreach (Order order in orders)
        {
            Customer customer = null;
            if (customerMap.ContainsKey(order.CustomerId))
            {
                customer = customerMap[order.CustomerId];
            }
            else
            {
                customer = SomeDataSource.GetCustomer(order.CustomerId);
                customerMap.Add(order.CustomerId, customer);
            }
            Address shippingAddress = null;
            if (addressMap.ContainsKey(order.ShippingAddressId))
            {
                shippingAddress = addressMap[order.ShippingAddressId];
            }
            else
            {
                shippingAddress = SomeDataSource.GetAddress(order.ShippingAddressId);
                addressMap.Add(order.ShippingAddressId, shippingAddress);
            }
            // Keep the HomeAddress check introduced for "The One".
            retVal = customer.HomeAddress != null
                && customer.HomeAddress.State == shippingAddress.State;
            if (!retVal)
            {
                break;
            }
        }
    }
    catch (Exception ex)
    {
        SomeLogger.LogError(ex);
        // A failure partway through must not report an earlier success.
        retVal = false;
    }
    return retVal;
}
This version of the function will greatly speed up Bob's
batch processing. Most of the data calls have been eliminated because we can
simply look up the record by ID in the temporary cache (Dictionary).
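The two lookup blocks follow the same "check the cache, otherwise load and store" pattern, so they could be factored into one helper. The sketch below is one possible way to do that; LookupCache and GetOrLoad are hypothetical names, not part of the library described here. It uses Dictionary.TryGetValue, which performs a single lookup instead of the ContainsKey-then-indexer pair.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original library.
public static class LookupCache
{
    // Returns the cached value for key, or calls load(key) once and caches the result.
    // Note: a null result is cached too, so a missing record is only looked up once.
    public static TValue GetOrLoad<TKey, TValue>(
        Dictionary<TKey, TValue> cache, TKey key, Func<TKey, TValue> load)
    {
        TValue value;
        if (!cache.TryGetValue(key, out value))
        {
            value = load(key);
            cache.Add(key, value);
        }
        return value;
    }
}
```

Inside the loop, the customer lookup would then shrink to something like `LookupCache.GetOrLoad(customerMap, order.CustomerId, id => SomeDataSource.GetCustomer(id))`, assuming GetCustomer takes an int and returns a Customer as in the examples above.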
Once you have opened your API up to The Many, you must now
put in some range checks. What if, for example, someone sends one million orders
into your method? Is a number that large outside of the scope of the
architecture? This is where understanding both the system architecture and
business processes pays off. If you know that the maximum use case for
processing orders is 10,000, you can with confidence add a check for, say,
50,000 records. This will ensure that someone doesn't accidentally bog down the
system with a large, invalid call.
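A range check like this can be very small. The following is a minimal sketch, assuming the 50,000 ceiling from the example above; BatchGuard, EnsureWithinLimit, and MaxBatchSize are hypothetical names introduced only for illustration.

```csharp
using System;

// Hypothetical guard, not part of the original library.
public static class BatchGuard
{
    // Assumed ceiling taken from the example in the text: 10,000 is the largest
    // legitimate batch, so anything above 50,000 is treated as a mistake.
    public const int MaxBatchSize = 50000;

    // Throws if the batch is larger than the system is designed to handle.
    // A null array is deliberately let through; null handling is a separate concern.
    public static void EnsureWithinLimit<T>(T[] items)
    {
        if (items != null && items.Length > MaxBatchSize)
        {
            throw new ArgumentOutOfRangeException(
                "items",
                items.Length,
                "Batch exceeds the maximum of " + MaxBatchSize + " items.");
        }
    }
}
```

In VerifyStatesMatch, a guard like this would presumably be called before the try block, so that the catch-all handler does not swallow the exception and quietly turn an oversized batch into a false result.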
While these are not the only optimizations that can be made,
they hopefully illustrate how planning for "The Many" in the beginning
can save rework later.
The Null
http://en.wikipedia.org/wiki/Null_pointer#Null_pointer
You: Steve, are you passing null into
my code?
Steve: I'm not sure, why?
You: The boss said you were getting
"object ref..." errors.
Steve: Oh, that must be from the legacy
system. I don't control the output from that system, we just pipe it into the
new system as is.
You: That seems silly, why don't we do
something about these nulls?
Steve: I do; I check for null in my
code; don't you?
You: Oh... yeah, of course.
"Object reference not set to an instance of an
object." Do I even need to explain that error? For many of us, it has cost
us many hours of our lives. In most languages, null (or the empty set, or an
equivalent) is a perfectly valid state for any non-value type. This means that
any solid API must account for "The Null" even if it is technically
"wrong" for a caller to pass it.
Of course, checking every reference for null can become very
time consuming and is probably overkill. However, you should never trust input
coming from a source you do not control so we must check our "orders"
parameter, as well as the Orders inside of it, for null.
public bool VerifyStatesMatch(params Order[] orders)
{
    bool retVal = false;
    try
    {
        // A null array is possible even with params, e.g. VerifyStatesMatch(null).
        if (orders != null)
        {
            var customerMap = new Dictionary<int, Customer>();
            var addressMap = new Dictionary<int, Address>();
            foreach (Order order in orders)
            {
                // Skip null entries inside the array.
                if (order != null)
                {
                    Customer customer = null;
                    if (customerMap.ContainsKey(order.CustomerId))
                    {
                        customer = customerMap[order.CustomerId];
                    }
                    else
                    {
                        customer = SomeDataSource.GetCustomer(order.CustomerId);
                        customerMap.Add(order.CustomerId, customer);
                    }
                    Address shippingAddress = null;
                    if (addressMap.ContainsKey(order.ShippingAddressId))
                    {
                        shippingAddress = addressMap[order.ShippingAddressId];
                    }
                    else
                    {
                        shippingAddress = SomeDataSource.GetAddress(order.ShippingAddressId);
                        addressMap.Add(order.ShippingAddressId, shippingAddress);
                    }
                    // Keep the HomeAddress check introduced for "The One".
                    retVal = customer.HomeAddress != null
                        && customer.HomeAddress.State == shippingAddress.State;
                    if (!retVal)
                    {
                        break;
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        SomeLogger.LogError(ex);
        // A failure partway through must not report an earlier success.
        retVal = false;
    }
    return retVal;
}
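It may not be obvious that a params method can receive a null array at all, so here is a small sketch of what the callee actually sees for different call shapes. ParamsDemo and Describe are hypothetical names, and strings are used instead of Orders only to keep the example self-contained.

```csharp
using System;

// Hypothetical demo, not part of the original library.
public static class ParamsDemo
{
    // Describes what a params method actually receives from its caller.
    public static string Describe(params string[] items)
    {
        if (items == null)
        {
            return "null array";
        }
        if (items.Length == 0)
        {
            return "empty array";
        }
        int nulls = 0;
        foreach (string item in items)
        {
            if (item == null)
            {
                nulls++;
            }
        }
        return items.Length + " item(s), " + nulls + " null";
    }
}
```

Calling Describe() yields "empty array", Describe(null) yields "null array" because a bare null converts to the array type itself, Describe((string)null) yields "1 item(s), 1 null", and Describe("a", null) yields "2 item(s), 1 null". That is why both the array and each element need their own check.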
By diligently checking for null, you can avoid the
embarrassing support calls from customers asking what an "instance of an
object" is. I always err on the side of caution; I would rather have my
function return the default value and log a message (or send an alert) than
throw the somewhat useless null reference error. Of course, this decision is
completely dependent on the type of system, whether the code is running in a
client or server, etc. The lesson here is that you can only ignore null for so
long before it will bite you.
UPDATE: To be clear, I am not advocating that a function should "do nothing" when an invalid state has been encountered. If null parameters are not acceptable in your system, throw an exception (like ArgumentNullException in .NET). However, there are some situations where returning a meaningful default is perfectly acceptable and throwing an exception is not necessary. Fluent methods, for example, will typically return the value that was passed into them if they cannot act on the value. There are far too many factors to make any kind of blanket statement about what should be done when a null is encountered.
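The two policies can be contrasted side by side. This is only a sketch with hypothetical names (NullPolicy, CountStrict, CountLenient); which one fits depends on the system, as noted above.

```csharp
using System;

// Hypothetical helpers contrasting the two policies; not from the original library.
public static class NullPolicy
{
    // Strict policy: null input is a caller bug, so fail immediately and loudly.
    public static int CountStrict(object[] items)
    {
        if (items == null)
        {
            throw new ArgumentNullException("items");
        }
        return items.Length;
    }

    // Lenient policy: null input is treated like an empty batch.
    public static int CountLenient(object[] items)
    {
        return items == null ? 0 : items.Length;
    }
}
```

The strict version names the offending parameter in the exception, which is far more useful to a caller than a generic null reference error raised somewhere deeper in the call stack.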
The Nothing
http://youtu.be/CrG-lsrXKRM
You: John, what are you passing into my
code? It looks like an incomplete Order.
John: Oh, sorry about that. I don't
really need to use your method, but one of the other libraries required me to
pass an Order parameter. I guess that library is calling your code. I don't
work with orders, but I have to use that other library.
You: That other library needs to stop
doing that; that is bad design.
John: Well, that library has evolved
organically as the business has changed. Also, it was written by Matt who is
out this week; I'm not really sure how to change it. Shouldn't your code be
checking for bad input anyway?
You: Oh... yeah, of course.
Of the four principles, "The Nothing" is probably
the most difficult to describe. Despite meaning "nothing" or
"empty", null actually has a definition and can be quantified. Heck,
most languages have a built-in keyword for it; null certainly is not nothing. By
handling "nothing", I mean that your API must handle input that is
essentially garbage. In our example, this would translate into handling an
Order that does not have a CustomerId or that has an OrderDate from 500 years
ago. A better example would be a collection that does not have any items in it.
The collection is not null and should fall into the "The Many"
category, but the caller failed to populate the collection with any data. You
must always be sure to handle this "nothing" scenario. Let's adjust
our example to ensure that callers can't just pass something that looks like an
Order; it must fulfill the minimum universal requirements. Otherwise, we will
just treat it as nothing.
...
if (order != null && order.IsValid)
...
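The Order class shown earlier does not define IsValid, so what it checks is left open. Below is one possible sketch, using a trimmed-down stand-in for Order. The specific rules (positive IDs, a year-2000 cutoff, no dates more than a day in the future) are assumptions chosen purely for illustration; the real minimum requirements would come from the business.

```csharp
using System;

// Trimmed-down stand-in for the article's Order class (illustration only).
public class Order
{
    // Assumed cutoff: no legitimate order predates the year 2000.
    private static readonly DateTime EarliestOrderDate = new DateTime(2000, 1, 1);

    public int CustomerId { get; set; }
    public int ShippingAddressId { get; set; }
    public DateTime OrderDate { get; set; }

    // Hypothetical minimum requirements for an Order to be worth processing.
    public bool IsValid
    {
        get
        {
            return CustomerId > 0
                && ShippingAddressId > 0
                && OrderDate >= EarliestOrderDate
                // Assumption: allow one day of slack for clock differences.
                && OrderDate <= DateTime.Now.AddDays(1);
        }
    }
}
```

A freshly constructed Order fails this check because its IDs are 0 and its OrderDate is DateTime.MinValue, which is exactly the "looks like an Order but contains nothing" case John ran into.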
Conclusion
If there is one point that I hope this article has
demonstrated, it is that any code taking input is never "perfect" and
the implementation of any function or API must take into account how the API is
going to be used. In our example, our 12-line function grew to roughly 50 lines without
any changes to the core functionality. All the code we added was to handle the
scale, or range, of data that we would accept and process properly and
efficiently.
The amount of data being stored has grown exponentially over the years, so the
scale of input data will only increase while the quality of the data has no
place to go but down. Properly coding an API in the beginning can make a huge
difference in winning business, scaling as customers grow, and reducing future
maintenance costs (and headaches for you!).