packetcache: Exploring Development of a memcache-like Network Cache

Michael Sydney Balloni

5.00/5 (2 votes)

Nov 29, 2022

Apache

8 min read

6551

This article explores UDP programming, LRU cache development, and .NET packet processing.

Download source code - 32.8 KB

Introduction

In this article, I will show you how to do UDP network programming, how to write a Least-Recently-Used (LRU) cache, and how to do packet processing in .NET. These programming skills are applied to writing a network memory cache, like memcache. It is UDP-centric, uses an LRU cache, and is single-threaded. I then show how to access this cache from a .NET client.

UDP Programming

User Datagram Protocol (UDP) programming is a way of sending small pieces of information between computers over a network. Unlike Transmission Control Protocol (TCP), there is no guarantee of delivery of data, or even that data is received in the order it was sent, or that the data was not mangled along the way.

In UDP programming, there is a server and a client. The server binds to a port and listens for datagrams (network packets) that are sent by clients, and (usually) responds to the client with a datagram response of some sort. Clients connect to servers by address and the server's port and send requests to the server, and receive responses from the server. Unlike the full duplex persistent streams of TCP, UDP is more like playing catch with a baseball.

The most popular UDP programming system is the Domain Name Service (DNS), responsible for turning domains like michaelballoni.com into network addresses. In DNS, the client sends the domain name to the server, and the server sends the network address back to the client, using basic datagrams with no persistent connections like TCP uses. This makes the protocol lightweight without the processing or resource overhead of persistent connections.

Besides being unreliable, the other big limitation of UDP is that only about 500 bytes can be sent to or received from a UDP server over the internet. If you are willing to work within these restrictions, if you are basically willing to send little things back and forth and are willing to tolerate or compensate for unreliable networking, then UDP is right for you.

UDP Server Implementation

In UDP programming, the server code typically creates a socket, binds to a port, and goes into a receive-request / process-request / send-response loop.

Create UDP Socket

my_socket = ::socket(AF_INET, SOCK_DGRAM, 0)

If my_socket came back < 0, you'd do some error handling.

Bind UDP Socket to Port

sockaddr_in servaddr;
memset(&servaddr, 0, sizeof(servaddr));
servaddr.sin_family = AF_INET; 
servaddr.sin_addr.s_addr = INADDR_ANY; // listen on any address
servaddr.sin_port = ::htons(my_port);  // listen on this 16-bit port number
result = ::bind(my_socket, (const sockaddr*)&servaddr, sizeof(servaddr))

If result came back non-zero, you would do some error handling, and clean up my_socket from above.

Receive Request Datagrams and Send Responses

Here is a basic request / response UDP server loop.

// https://stackoverflow.com/questions/1098897/
// what-is-the-largest-safe-udp-packet-size-on-the-internet
const size_t max_packet_size = 508;

while (true)
{
	// clientaddr is the data structure that holds the client's address
	// so we know where to send the response back to
	sockaddr_in clientaddr;
	int addr_len = sizeof(clientaddr);

	// read the request data and client address using the recvfrom function
	uint8_t input_buffer[max_packet_size];
	int read_amount =
		::recvfrom(my_socket, (char*)input_buffer, 
        (int)max_packet_size, 0, (sockaddr*)&clientaddr, &addr_len);
	if (read_amount <= 0)
		// error handling
	
	// process the input_buffer and read_amount yielding data in 
    // output_buffer and output_len
	uint8_t output_buffer[max_packet_size];
	size_t output_len = 0;
	// ...
	
	// send the response back to the client using the sendto function
	if (::sendto(my_socket, (const char*)output_buffer, 
       (int)output_len, 0, (sockaddr*)&clientaddr, addr_len) < 0)
		// error handling
}

UDP Client Programming

UDP client programming is much like UDP server programming, with the request being sent and response received, and a bit simpler. The style I use here is after TCP with the connect / send / recv functions, as this provides the useful illusion of a persistent connection.

Create UDP Socket, Just Like the Server

my_socket = ::socket(AF_INET, SOCK_DGRAM, 0)

If my_socket came back < 0 you'd do some error handling.

Connect to the UDP Server

sockaddr_in servaddr;
memset(&servaddr, 0, sizeof(servaddr));
servaddr.sin_family = AF_INET;
servaddr.sin_port = ::htons(port);			    // the server's 16-bit port number
servaddr.sin_addr.s_addr = ::inet_addr(server); // inet_addr works with dotted 
                                                // IPv4 addresses
											    // use getaddrinfo to support more
result = ::connect(my_socket, (sockaddr*)&servaddr, sizeof(servaddr))

If result != 0, you'd handle the error and close my_socket.

Interact with the UDP Server

// populate request_buffer / request_len with your request data
uint8_t request_buffer[max_packet_size];
size_t request_len;
// ...

// send the request
result = ::send(my_socket, (const char*)request_buffer, (int)request_len, 0);
// if result < 0, do error handling, clean up my_socket

// read the response
uint8_t response_buffer[max_packet_size];
int response_len = ::recv(my_socket, (char*)response_buffer, (int)max_packet_size, 0);
// if response_len <= 0, do error handling, clean up my_socket

// process the response_buffer / response_len response

And that's it for UDP programming, client and server. Well... you should really set a timeout on that recv call. As written, if no response were delivered to the client for whatever reason, be it server error or network error, the client application would hang. Oh, and you'd want to validate the response in case it got garbled on the network. Yeah, packet loss, package damage, yep, that should do it.

Network Cache Implementation

The attached source code contains a "packetcache" solution. The mission of this solution is two-fold: first, develop a cache server the likes of memcache; second, develop cache clients to prove out performance differences between packetcache and memcache.

To this end, there is a C# class library and two C# applications, a smoke test POC packetapp client, and a performance benchmark packetstress application, and there is C++ code for a static library with all the guts of the C++ UDP server and a C++ POC UDP client used only for development.

You can dig into all that C# and C++ code and draw your own conclusions on the performance profiles of packetcache vs. memcache. Here, I want to lay out what an LRU cache can look like in C++.

LRU Cache Overview

A cache maps keys to values. Values for keys are added to the cache as needed, then accessed "for free" by other users of the cache. A Least-Recently-Used (LRU) cache maintains the most-recently used cache entries, trimming, you guessed it, the least recently used cache entries, whenever the cache takes up too much memory.

The LRU Cache I created is composed of two data structures: first, a list of cache entries ordered most to least recently used; second, a hash table mapping keys to cache entries. The trick is to have the cache entry objects chained together to form the list, and the hash table contains pointers to the cache entries.

LRU Cache Class

The cache class contains the linked list (cache_list) and a hash table (std::unordered_map). cache_list has head and tail pointers to objects of type cache_entry. The hash table maps std::vector<uint8_t> to cache_entry*. Having a std::vector<uint8_t> hash table key is really no different than having a std::string key. I did not want to deal with character encodings, or even assume that clients want to cache by string. Keys can be any binary value of any length taking UDP packet size into account.

Here's the trick for allowing std::vector<uint8_t> hash table keys:

#include "MurmurHash2.h" // https://github.com/aappleby/smhasher

struct byte_array_hasher
{
	std::size_t operator()(const std::vector<uint8_t>& v) const noexcept
	{
		return MurmurHash2(v.data(), int(v.size()), 0);
	}
};

Then you can declare the hash table in the cache class like so:

std::unordered_map<std::vector<uint8_t>, cache_entry*, byte_array_hasher> m_cache

Cache Entry Class

Each cache entry is an object of class cache_entry:

time_t expiration;
time_t last_access;

cache_entry* prev;
cache_entry* next;

std::vector<uint8_t> data;

The cache entry serves as holding the data value and the last access timestamp, as well as being a linked list node with previous and next pointers.

The cache entry constructor elevates itself to the head of the list:

cache_entry(time_t _expiration, const std::vector<uint8_t>& _data, cache_entry*& head)
	: expiration(_expiration)
	, last_access(::time(nullptr))
	, prev(nullptr)
	, next(nullptr)
	, data(_data)
{
	if (head != nullptr)
	{
		head->prev = this;

		next = head;
		prev = nullptr;
	}
	head = this;
}

Cache List Class

The cache_list class manages the list of cache_entrys. It only has head and tail and size member variables, but it runs the show with this API:

cache_entry* cache_list::add(time_t expiration, const std::vector<uint8_t>& data)

Create a new cache entry, making it the new head, and possibly tail, of the list.

cache_entry* new_entry = 
      new cache_entry(expiration, data, m_head); // new_entry becomes head
if (m_tail == nullptr)
	m_tail = new_entry;
m_size += new_entry->size();
return new_entry;

const std::vector<uint8_t>& cache_list::get(cache_entry* entry)

When a cache entry is hit, move it to the front of the list and return its data.

entry->last_access = ::time(nullptr);

//
// move the entry to the front of the list
//

// bail if already at the head
if (m_head == entry)
	return entry->data;

// adjust neighbor pointers
auto prev_neighbor = entry->prev;
auto next_neighbor = entry->next;

if (prev_neighbor != nullptr)
	prev_neighbor->next = next_neighbor;

if (next_neighbor != nullptr)
	next_neighbor->prev = prev_neighbor;

// make the new entry head of the list
if (m_head != nullptr)
{
	m_head->prev = entry;

	entry->next = m_head;
	entry->prev = nullptr;
}
m_head = entry;

// set the tail
if (next_neighbor != nullptr && next_neighbor->next == nullptr)
	m_tail = next_neighbor;
else if (prev_neighbor != nullptr && prev_neighbor->next == nullptr)
	m_tail = prev_neighbor;

return entry->data;

cache_list::set(cache_entry* entry, time_t expiration, const std::vector<uint8_t>& data)

When a cache entry gets a new value, update the entry and update the size of the cache.
NOTE: I don't move the entry to the front of the list here, as it is not recently used yet.

entry->last_access = ::time(nullptr);

size_t old_size = entry->data.size();

entry->expiration = expiration;
entry->data = data;

// adjust the new size carefully
int64_t size_change = int64_t(data.size()) - old_size;
if (-size_change > int64_t(m_size))
	m_size = 0;
else
	m_size += size_change;

cache_list::remove(cache_entry* old_entry)

Remove an entry from the cache.

// update head and tail
if (m_head == old_entry)
	m_head = old_entry->next;

if (m_tail == old_entry)
	m_tail = old_entry->prev;

// update neighbors
if (old_entry->prev != nullptr)
	old_entry->prev->next = old_entry->next;

if (old_entry->next != nullptr)
	old_entry->next->prev = old_entry->prev;

// adjust our size
if (old_entry->size() > m_size)
	m_size = 0;
else
	m_size -= old_entry->size();

// clean it up
delete old_entry;

cache_list::trim(size_t target_size)

Remove entries until the cache size dips below a target size.

while (m_size > target_size && m_tail != nullptr)
	remove(m_tail);

Over on the other side... C# Packet Processing

On the C# side of things, we want to implement a memcache-like interface like this:

public interface ICache
{
	Task<string?> GetAsync(string key);
	Task<bool> SetAsync(string key, string value, int cacheSeconds);
	Task<bool> DelAsync(string key);
}

To this end, we write a central class for working with request and response data, Packet. Packet has the following member variables:

public CacheOp Op;
public long TtlSeconds;
public MemoryStream Key = new MemoryStream(MaxPacketSize);
public MemoryStream Value = new MemoryStream(MaxPacketSize);

The core purpose of the Packet class is to pack Packet members into a buffer, and parse Packet members from a buffer. One key to good .NET performance is to avoid memory allocations, so the C# client code that processes network data maintains one set of Packet objects and buffers for the long life of the object.

Here is the Pack function for taking a Packet object and writing it into an output buffer, including the amount written.

public static void Pack(Packet p, byte[] output, out int idx)
{
	// Make sure it will all fit
	if
	(
		1 +  // op
		4 + // ttl
		2 + p.Key.Length +
		2 + p.Value.Length +
		4   // crc
		> MaxPacketSize
	)
	{
		throw new PacketException("Too much data");
	}

	// Add the op
	idx = 0;
	output[idx++] = (byte)p.Op;

	// Add the TTL
	var ttl_bytes = 
        BitConverter.GetBytes(IPAddress.HostToNetworkOrder((int)p.TtlSeconds));
	output[idx++] = ttl_bytes[0];
	output[idx++] = ttl_bytes[1];
	output[idx++] = ttl_bytes[2];
	output[idx++] = ttl_bytes[3];

	// Add the key len and data
	short key_len = (short)p.Key.Length;
	var key_len_bytes = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(key_len));
	output[idx++] = key_len_bytes[0];
	output[idx++] = key_len_bytes[1];
	Buffer.BlockCopy(p.Key.GetBuffer(), 0, output, idx, key_len);
	idx += key_len;

	// Add the value len and data
	short value_len = (short)p.Value.Length;
	var value_len_bytes = 
        BitConverter.GetBytes(IPAddress.HostToNetworkOrder(value_len));
	output[idx++] = value_len_bytes[0];
	output[idx++] = value_len_bytes[1];
	Buffer.BlockCopy(p.Value.GetBuffer(), 0, output, idx, value_len);
	idx += value_len;

	// CRC what we've got so far
	var crc_bytes = Utils.Crc32(output, idx);
	output[idx++] = crc_bytes[3];
	output[idx++] = crc_bytes[2];
	output[idx++] = crc_bytes[1];
	output[idx++] = crc_bytes[0];
}

The Parse function takes an input buffer and populates the members of a Packet.

public static void Parse(byte[] input, Packet p)
{
	// Reset the output
	p.Reset();

	// Validate data length
	if (input.Length < MinPacketSize)
		throw new PacketException("Not enough data");
	else if (input.Length > MaxPacketSize)
		throw new PacketException("Too much data");
	short input_len = (short)input.Length;

	// Validate the CRC
	byte[] crc_computed_bytes = Utils.Crc32(input, input_len - 4);
	if
	(
		crc_computed_bytes[0] != input[input_len - 1]
		||
		crc_computed_bytes[1] != input[input_len - 2]
		||
		crc_computed_bytes[2] != input[input_len - 3]
		||
		crc_computed_bytes[3] != input[input_len - 4]
	)
	{
		throw new PacketException("Checksum mismath");
	}

	// Extract op
	int idx = 0;
	p.Op = (CacheOp)input[idx++];

	// Extract ttl
	p.TtlSeconds = IPAddress.HostToNetworkOrder(BitConverter.ToInt32(input, idx));
	idx += 4;

	// Extract key
	short key_len = IPAddress.HostToNetworkOrder(BitConverter.ToInt16(input, idx));
	idx += 2;
	if (key_len > 0)
	{
		if (idx + key_len >= input_len)
			throw new PacketException("Invalid key length");

		p.Key.Write(input, idx, key_len);
		p.Key.Seek(0, SeekOrigin.Begin);

		idx += key_len;
	}

	// Extract value
	short value_len = IPAddress.HostToNetworkOrder(BitConverter.ToInt16(input, idx));
	idx += 2;
	if (value_len > 0)
	{
		if (idx + value_len >= input_len)
			throw new PacketException("Invalid value length");

		p.Value.Write(input, idx, value_len);
		p.Value.Seek(0, SeekOrigin.Begin);

		idx += value_len;
	}

	// Validate input size
	if (idx != input_len - 4)
		throw new PacketException("Invalid packet format");
}

With the Parse and Pack functions doing the heavy lifting, the packet processing code's core "Do a cache operation" function is simply:

// Take the data like the key and value from the request packet and 
// populate the request buffer
int request_len;
Packet.Pack(m_requestPacket, m_requestBuffer, out request_len);

// Send the request
// m_client is .NET's UdpClient object, the raw network "connection" with the server
if (await m_client.SendAsync(m_requestBuffer, request_len) != request_len)
    throw new PacketException("Not all data sent");

// Recv the response
byte[] response = (await m_client.ReceiveAsync()).Buffer;

// Parse the buffer into a response packet with the result of the operation
Packet.Parse(response, m_responsePacket);

Apples and Oranges

I was interested in stacking my little packetcache up against the proven memcache. I'm a Windows guy, so I looked for a memcache port / installer for Windows. I found this site and used it for my initial tests.

StackOverflow has other advice at this link.

In my all-on-my-PC tests, packetcache did about 60K ops per seconds vs. memcache's 40K. 50% better, that's cool. But the memcache server seemed unstable as I simulated more clients with more client threads, and I wanted to bring things more into apples and apples, so I went to AWS and fired up an ElastiCache memcache node and a basic Windows EC2, same price, loaded packetcache onto the EC2, and accessed both from a common Windows EC2 in the same region. The observed performance was near identical at 20K ops per second. I can live with that: packetcache on a weak Windows EC2 does at least as well at what it does as memcache does on not-Windows and dedicated hardware. There's still apple / orange issues, such as that memcache can be configured to do UDP. My point is that this approach is not worlds and worlds worse than the champ.

Conclusion and Points of Interest

I've shown you UDP programming, LRU cache implementation, and .NET packet packing and parsing. I invite you to take the code and build it on Linux with dedicated hardware and derive your own conclusions.

History

28^th November, 2022: Initial version