
Build Your Own Database

14 Sep 2015
Build your own database class library with C#

Table of Contents

  1. Introduction
  2. The Design
  3. Implementation

1. Introduction

I ran into a problem where I needed to store millions of records of arbitrary data, each ranging from 1KB to 8KB, in a way that lets me find, search, read, and iterate through my data as quickly as possible, just like native disk access.

While SQLite is very fast at first, I dislike using it because of its inefficient use of space, and because it slows down dramatically over time when a lot of data is written and/or deleted. SQLite also mixes data and indexes into the same file, making linear iteration a horrible experience; when records get deleted, it leaves gaps that never get filled until you run the VACUUM command. See more about SQLite's problems in this Mozilla article.

So, I want to make a database that is just as fast as SQLite, but that does not slow down over time, reuses deleted space better, and does not require VACUUM.

The fully implemented source code can be downloaded here on github (however, I recommend going through the article before looking at the code).

Goals

  1. I set a very specific goal to make things easier: I made a database to store cows, where the CowModel is defined:
    C#
    public class CowModel
    {
        public Guid Id { get; set; }
        public string Breed { get; set; }
        public int Age { get; set; }
        public string Name { get; set; }
        public byte[] DnaData { get; set; }
    }
    

    As you can see, the sizes of the Breed, Name and DnaData properties are unknown, making this project more challenging, as the database needs to handle variable-length data.

  2. The deliverable will be a single C# DLL that we can hook straight into any project, with the following interface for interacting with the database.

    Note that all read/search/delete operations are ad hoc (i.e., they use indexes):

    C#
    public interface ICowDatabase
    {
        void Insert (CowModel cow);
        void Delete (CowModel cow);
        void Update (CowModel cow);
        CowModel Find (Guid id);  // Adhoc
        IEnumerable<CowModel> FindBy (string breed, int age); // Adhoc
    }
    

Scope

This is a simple database that requires only 24-32 hours of work. The main goals are performance and learning how a database works, so you can craft your own.

Features such as ACID compliance, transactions/atomic writes, and multi-user access are not included; however, the design is open for these to be implemented.

2. The Design

A. Block Storage

We will store our data in one single file. Indexes are stored separately, each in its own file.

For the database to be able to reuse space for deleted data, instead of seeing the database file as a Stream, we see data as individual blocks of equal size. This guarantees a minimum size to make unused space reusable.

Image 1

The size of a block is your choice, but the typical file system block size is 4KB, where the OS reads and writes data in 4KB blocks. So choose a block size of 256B, 512B, 1024B, etc., to make sure your database is aligned with the underlying file system block size.

We call this BlockStorage. I'll pick 4KB as my block size, since my cow DNA data is fairly large.

The content of a block can be anything; a block doesn't care what it stores. It also supports a metadata header, so later we can store things such as references to other blocks, the actual content size being used in a block, and so on.

We will be accessing Block Storage using the following interface:

C#
public interface IBlockStorage
{
    /// <summary>
    /// Number of bytes of custom data per block that this storage can handle.
    /// </summary>
    int BlockContentSize {
        get;
    }

    /// <summary>
    /// Total number of bytes in header
    /// </summary>
    int BlockHeaderSize {
        get;
    }

    /// <summary>
    /// Total block size, equal to content size + header size, should be a multiple of 128B
    /// </summary>
    int BlockSize {
        get;
    }

    /// <summary>
    /// Find a block by its id
    /// </summary>
    IBlock Find (uint blockId);

    /// <summary>
    /// Allocate new block, extend the length of underlying storage
    /// </summary>
    IBlock CreateNew ();
}

public interface IBlock : IDisposable
{
    /// <summary>
    /// Id of the block, must be unique
    /// </summary>
    uint Id {
        get;
    }

    /// <summary>
    /// A block may contain one or more header metadata entries,
    /// each identified by a field number and holding an 8-byte value.
    /// </summary>
    long GetHeader (int field);

    /// <summary>
    /// Change the value of specified header.
    /// Note that data may not be written to disk until the block is disposed.
    /// </summary>
    void SetHeader (int field, long value);

    /// <summary>
    /// Read content of this block (src) into given buffer (dst)
    /// </summary>
    void Read (byte[] dst, int dstOffset, int srcOffset, int count);

    /// <summary>
    /// Write content of given buffer (src) into this (dst)
    /// </summary>
    void Write (byte[] src, int srcOffset, int dstOffset, int count);
}
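
Before moving on, here is a minimal usage sketch of the two interfaces above. The BlockStorage constructor arguments shown here are assumptions (the concrete class is covered in the Implementation section), and header field #2 is used as a content-length field purely for illustration.

C#
// Sketch only: BlockStorage's constructor signature is an assumption here.
using System;
using System.IO;

class BlockStorageUsageSketch
{
    static void Demo()
    {
        var stream = new FileStream("cows.data", FileMode.OpenOrCreate);
        IBlockStorage storage = new BlockStorage(stream, 4096);

        // Allocate a new block, write some content and record its length in a header
        using (IBlock block = storage.CreateNew())
        {
            var payload = new byte[] { 1, 2, 3, 4 };
            block.Write(payload, srcOffset: 0, dstOffset: 0, count: payload.Length);
            block.SetHeader(2, payload.Length);   // illustrative "content length" header
            Console.WriteLine("Created block #" + block.Id);
        } // Dispose flushes any cached data to the stream

        // Later, find the block by its id and read the content back
        using (IBlock block = storage.Find(0))
        {
            var buffer = new byte[(int)block.GetHeader(2)];
            block.Read(buffer, dstOffset: 0, srcOffset: 0, count: buffer.Length);
        }
    }
}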

B. Record Storage

Block Storage alone does not meet our requirement that data be variable in length.

To support variable-length records, such as CowModel, whose Breed, Name and DnaData properties are unknown in length, we add another layer on top of Block Storage, called Record Storage.

Image 2

Here, a record is made up of one or more blocks, in order. In other words, a Record is a linked list of Blocks.

A record may use a full block to store its data, or just part of a block, defined in `length` field of a block's metadata header. (We'll get to the implementation later.)

A record also has an ID, which is the ID of the first block it is made up from.

A block can be logically marked as deleted, using one of the block's metadata header fields. We also use a special record, record #0 (the first record), to store a stack of deleted block IDs. Deleting a record is done by logically marking its blocks as deleted, then "pushing" the IDs of those blocks onto record #0. This allows the space of unused blocks to be reclaimed when creating a new record, by reading record #0, "popping" a free ID (if there is one), and reusing it.
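
The exact header layout is an implementation detail, but as a sketch, the block header fields that Record Storage relies on could be modelled as plain constants. The field numbers below are the ones mentioned later in the Implementation section; the class and constant names are just for illustration.

C#
// Illustrative constants only; the names are for this sketch, the field numbers
// match the ones referenced in the Implementation section.
static class RecordBlockHeaderFields
{
    public const int NextBlockId = 0;        // id of the next block in the record's chain
    public const int BlockContentLength = 2; // number of bytes actually used in this block
    public const int PreviousBlockId = 3;    // id of the previous block in the chain
    public const int IsDeleted = 4;          // non-zero once the block is logically deleted
}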

The use of Record Storage is as follows:

C#
public interface IRecordStorage
{
    /// <summary>
    /// Effectively update a record
    /// </summary>
    void Update (uint recordId, byte[] data);

    /// <summary>
    /// Grab a record's data
    /// </summary>
    byte[] Find (uint recordId);

    /// <summary>
    /// This creates a new empty record
    /// </summary>
    uint Create ();

    /// <summary>
    /// This creates a new record with the given data and returns its ID
    /// </summary>
    uint Create (byte[] data);

    /// <summary>
    /// Similar to Create(byte[] data), but with dataGenerator which generates
    /// data after a record is allocated
    /// </summary>
    uint Create (Func<uint, byte[]> dataGenerator);

    /// <summary>
    /// This deletes a record by its id
    /// </summary>
    void Delete (uint recordId);
}
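
As a quick sketch of how a client would use this interface (assuming a RecordStorage constructed on top of a BlockStorage, as designed above; the constructor signatures are assumptions):

C#
// Sketch only: the constructor signatures are assumptions.
using System.IO;
using System.Text;

class RecordStorageUsageSketch
{
    static void Demo()
    {
        var stream = new FileStream("cows.data", FileMode.OpenOrCreate);
        IRecordStorage records = new RecordStorage(new BlockStorage(stream, 4096));

        // Create a record from an arbitrary, variable-length byte array
        uint id = records.Create(Encoding.UTF8.GetBytes("Daisy the cow"));

        // Read it back, then update it with data of a different length
        byte[] data = records.Find(id);
        records.Update(id, Encoding.UTF8.GetBytes("Daisy the Holstein cow"));

        // Delete it; its blocks become reusable via record #0
        records.Delete(id);
    }
}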

C. Indexing

Indexing allows searching for records based on their properties (e.g., find a cow by Id, find cows by breed); of course you could iterate through all records to look for what you want, but that becomes very inefficient once you have more than a hundred records.

B-Tree is the most popular data structure for building an index in a database. If you are unfamiliar with B-Trees or tree structures in general, you can visualize the benefit as a List<Tuple<K, V>> sorted by K, where K is the indexing key (e.g., a cow's Id or Breed) and V is the ID of a record in the RecordStorage mentioned above. This enables you to do a binary search on key K and instantly find the corresponding V, regardless of how big the list is.
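
If it helps, here is that mental model expressed directly in code: a sorted list searched with binary search. This is only the conceptual picture; the real index is a B-Tree persisted to disk.

C#
// Mental model only: the real index is a disk-backed B-Tree, but the lookup
// semantics are the same as binary-searching a sorted list of (K, V) pairs.
using System;
using System.Collections.Generic;

static class SortedListModel
{
    // Returns the (key, value) entry for the given key, or null if not found.
    public static Tuple<K, V> Get<K, V>(List<Tuple<K, V>> sortedEntries, K key)
        where K : IComparable<K>
    {
        int lo = 0, hi = sortedEntries.Count - 1;
        while (lo <= hi)
        {
            int mid = lo + (hi - lo) / 2;
            int cmp = sortedEntries[mid].Item1.CompareTo(key);
            if (cmp == 0) return sortedEntries[mid];
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return null; // not found
    }
}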

Image 3

For more information about the B-Tree or Tree structure, I recommend reading about Binary Tree, Binary Search Tree, Self-Balance Binary Search Tree, Red Black Tree, B-Tree.

For now, all we need to know is that the indexes store key-value pairs, where Key is one of CowModel's properties and Value is the record ID of our data. The index not only allows instant lookup of a Value by Key (like a Dictionary), but also lets you find the position of an arbitrary key K and iterate from K over the remaining keys in either ascending or descending order.

A B-Tree is made up of individual nodes, and each node keeps references to other nodes. We store the index by serializing each node into a record, utilizing our RecordStorage designed above. Each index lives in its own file.

Image 4

When we need to access a node, we read its binary content and deserialize it into a Node, make changes to it, then serialize it back into a byte array and save it back into its record.

For best performance, a Node should fit perfectly into a 4KB block. We can tune the number of entries per Node to make it fit.
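
As a back-of-the-envelope estimate (the byte counts below are assumptions, not the actual on-disk format used by the serializer), a primary-index entry of a 16-byte Guid key, a 4-byte uint record id and a 4-byte child pointer suggests a capacity of roughly 150-170 entries per 4KB node:

C#
// Rough sizing sketch; all byte counts are assumptions, not the real format.
static int EstimateMaxEntriesPerNode()
{
    const int blockSize = 4096;
    const int keySize = 16;          // Guid key
    const int valueSize = 4;         // uint record id
    const int childPointerSize = 4;  // uint id of a child node
    const int nodeOverhead = 64;     // parent id, entry count, flags... (a guess)

    int entrySize = keySize + valueSize + childPointerSize;
    return (blockSize - nodeOverhead) / entrySize;   // roughly 168 entries
}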

The use of Index is as follows:

C#
public interface IIndex<K, V>
{
    /// <summary>
    /// Create new entry in this index that maps key K to value V
    /// </summary>
    /// <param name="key">Key.</param>
    /// <param name="value">Value.</param>
    void Insert (K key, V value);

    /// <summary>
    /// Find an entry by key
    /// </summary>
    /// <param name="key">Key.</param>
    Tuple<K, V> Get (K key);

    /// <summary>
    /// Find all entries that contain a key larger than or equal to specified key
    /// </summary>
    IEnumerable<Tuple<K, V>> LargerThanOrEqualTo (K key);

    /// <summary>
    /// Find all entries that contain a key larger than specified key
    /// </summary>
    IEnumerable<Tuple<K, V>> LargerThan (K key);

    /// <summary>
    /// Find all entries that contain a key less than or equal specified key
    /// </summary>
    IEnumerable<Tuple<K, V>> LessThanOrEqualTo (K key);

    /// <summary>
    /// Find all entries that contain a key less than specified key
    /// </summary>
    IEnumerable<Tuple<K, V>> LessThan (K key);

    /// <summary>
    /// Delete an entry from this index, optionally use specified IComparer to compare values
    /// </summary>
    /// <param name="key">Key.</param>
    /// <param name="value">Value.</param>
    /// <param name="valueComparer">Value comparer; Default value is Comparer[k].Default</param>
    bool Delete (K key, V value, IComparer<V> valueComparer = null);

    /// <summary>
    /// Delete all entries of given key
    /// </summary>
    /// <param name="key">Key.</param>
    bool Delete (K key);
}
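
A short usage sketch of the index interface (how the concrete Tree<K, V> implementation gets constructed is covered in the Implementation section):

C#
// Sketch only: assumes an already-constructed IIndex<Guid, uint> is passed in.
using System;
using System.Linq;

class IndexUsageSketch
{
    static void Demo(IIndex<Guid, uint> primaryIndex)
    {
        var cowId = Guid.NewGuid();

        // Map a cow's Guid to the record id returned by RecordStorage
        primaryIndex.Insert(cowId, 42u);

        // Exact lookup
        Tuple<Guid, uint> entry = primaryIndex.Get(cowId);
        uint recordId = entry.Item2;

        // Range scan: iterate entries in key order starting from a given key
        foreach (var e in primaryIndex.LargerThanOrEqualTo(cowId).Take(10))
        {
            Console.WriteLine(e.Item1 + " -> " + e.Item2);
        }

        // Remove the entry
        primaryIndex.Delete(cowId, 42u);
    }
}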

D. Putting Things Together

We will put an IRecordStorage and one or more IIndex instances together to make up a complete database. Once these two interfaces are implemented, there is not much left to do.

  1. Inserting a record into the database is done by calling IRecordStorage.Create(), which returns a uint ID; this ID is then used with the IIndex instances for indexing.
  2. Updating a record also updates relevant IIndex.
  3. Deleting a record also deletes off its entry from IIndex.
  4. Searching for a record will use IIndex whenever possible. When none of the indexes can be used, we will iterate through all primary index entries (full index scan).
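
A minimal sketch of what that wiring might look like is below. The CowSerializer calls and the constructor shape are assumptions for illustration; the real code is in FooApplication/FooDatabase.cs, covered in the Implementation section.

C#
// Minimal wiring sketch; CowSerializer.Serialize/Deserialize are assumed helpers.
using System;

class CowDatabaseSketch
{
    readonly IRecordStorage records;
    readonly IIndex<Guid, uint> primaryIndex;                  // Id -> record id
    readonly IIndex<Tuple<string, int>, uint> secondaryIndex;  // (Breed, Age) -> record id

    public CowDatabaseSketch(IRecordStorage records,
        IIndex<Guid, uint> primaryIndex,
        IIndex<Tuple<string, int>, uint> secondaryIndex)
    {
        this.records = records;
        this.primaryIndex = primaryIndex;
        this.secondaryIndex = secondaryIndex;
    }

    public void Insert(CowModel cow)
    {
        // Store the serialized cow first, then index the returned record id
        uint recordId = records.Create(CowSerializer.Serialize(cow));
        primaryIndex.Insert(cow.Id, recordId);
        secondaryIndex.Insert(Tuple.Create(cow.Breed, cow.Age), recordId);
    }

    public CowModel Find(Guid id)
    {
        var entry = primaryIndex.Get(id);
        return entry == null ? null : CowSerializer.Deserialize(records.Find(entry.Item2));
    }
}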

3. Implementation

The source code is attached. You can also find my implementation here on github.

Let's call the database "FooDB". The main source code is located in FooCore, along with its unit tests in FooTest, and our Cow application that uses FooDB to make its CowDatabase is FooApplication.

A. BlockStorage Implementation

Interesting classes to look at:

  1. FooCore/BlockStorage.cs
  2. FooCore/Block.cs

As BlockStorage has no dependencies, the implementation is straightforward.

Creating New Blocks

  • BlockStorage initializes with a Stream as its only dependency.
  • BlockStorage maintains the length of the stream as a multiple of the blockSize you specify when initializing it. When a client asks to create a new block, it simply extends the length of the underlying Stream.

Finding a Block

  • Each Block is identified by a unique id, which also determines its position within the Stream. For example, if you have a Stream of length 4096B and a blockSize of 1024B, then you have 4 blocks: 0, 1, 2, 3.
  • You can easily jump to block #3 by setting the stream position to 3*1024B = 3072.
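
In code, that lookup is little more than a multiplication and a seek (a sketch of the position math only; the real Find() also validates the id against the current stream length):

C#
using System.IO;

static class BlockMath
{
    // Position of a block inside the stream is simply id * blockSize.
    public static long BlockPosition(uint blockId, int blockSize)
    {
        return (long)blockId * blockSize;   // block #3 with 1024B blocks => 3072
    }

    // Sketch of the seek the storage performs before reading a block.
    public static void SeekToBlock(Stream stream, uint blockId, int blockSize)
    {
        stream.Position = BlockPosition(blockId, blockSize);
    }
}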

Reading and Writing to a Block

  • The Block implementation provides Read() and Write() methods to work with block contents. They are simple wrappers around reading and writing byte arrays to and from the underlying Stream.
  • The first few bytes of each block are reserved for headers. A header is simply a long value (8 bytes) and can hold anything; a block does not interpret its headers, it only allows clients to read and write them.
  • When a client finishes using a block, it must call Dispose() so that all cached data is flushed.

NOTES

In this implementation, I cache the first 4KB, which usually contains both the header and part of the body content, to optimize performance; as file systems typically read and write in 4KB blocks, this saves some write calls. However, it results in non-linear writes for blocks larger than 4KB (as the first 4KB of cached data always gets written last).

For write-sensitive programs, this should be fixed by either flushing the cache more intelligently, or by caching the entire block in memory and writing it out in sequence when the Block is disposed.

B. RecordStorage Implementation

Interesting classes to look at:

  1. FooCore/RecordStorage.cs

As you can see, RecordStorage initializes with an IBlockStorage as its dependency, since we designed RecordStorage to be built on top of BlockStorage.

Creating Records

  • RecordStorage creates a record for an arbitrary byte array of data by splitting the data across multiple blocks, linked together like a linked list.
  • It links blocks to one another using metadata headers (field #0 kNextBlockId and field #3 kPreviousBlockId) to identify the blocks before and after a particular block within a single Record.
  • It uses a block header (field #2 kBlockContentLength) to mark the actual usable data of a Block. This allows a record to have any size, instead of always being a multiple of the block size.
  • Allocating a new record is done by reusing deleted blocks whenever possible (i.e., "popping" a uint from record #0), or by asking the underlying BlockStorage to allocate new blocks, as sketched below.
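
The in-memory Stack<uint> in the sketch below stands in for the stack of deleted block IDs kept in record #0; in the real code that stack is read from and written back to record #0 itself.

C#
// Sketch only: freeBlockIds stands in for the free-list stored in record #0.
using System.Collections.Generic;

static class AllocationSketch
{
    public static IBlock AllocateBlock(IBlockStorage blockStorage, Stack<uint> freeBlockIds)
    {
        if (freeBlockIds.Count > 0)
        {
            // Reuse a previously deleted block ("pop" an id from record #0)
            return blockStorage.Find(freeBlockIds.Pop());
        }

        // No reusable block available: extend the underlying stream
        return blockStorage.CreateNew();
    }
}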

Finding Records

  • Finding a record is easy, as the ID of the record is also the ID of the first block it is made up of.
  • If that block links to a next block, grab the next block, and continue until the last block in the chain is reached, as sketched below.
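
The walk over the block chain might look roughly like this (a sketch: it uses header field #0 for the next block id and field #2 for the content length, and treats a next-id of 0 as "end of chain", which is an assumption of this sketch):

C#
// Sketch of reading a whole record by following its chain of blocks.
using System.IO;

static class RecordReadSketch
{
    public static byte[] ReadRecord(IBlockStorage storage, uint recordId)
    {
        using (var result = new MemoryStream())
        {
            uint currentId = recordId;
            while (true)
            {
                using (IBlock block = storage.Find(currentId))
                {
                    // Copy only the bytes actually used in this block
                    int contentLength = (int)block.GetHeader(2);
                    var buffer = new byte[contentLength];
                    block.Read(buffer, dstOffset: 0, srcOffset: 0, count: contentLength);
                    result.Write(buffer, 0, buffer.Length);

                    uint nextId = (uint)block.GetHeader(0);
                    if (nextId == 0) break;   // assumed end-of-chain marker
                    currentId = nextId;
                }
            }
            return result.ToArray();
        }
    }
}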

Deleting Records

  • Deleting a record is done by marking the record and all its blocks as deleted (field #4 kIsDeleted).
  • It uses a special record, record #0, to keep track of deleted blocks. This record acts like a Stack of uint. When a record gets deleted, the IDs of its blocks are "pushed" onto the stack.

Updating Records

  • If the new record data uses the same number of blocks as the old data, RecordStorage updates the content of the existing blocks.
  • If the new record data uses more blocks than the old data, RecordStorage allocates (or reuses) more blocks as required.
  • If the new record data uses fewer blocks than the old data, the unused blocks are marked as deleted.

C. B-Tree (on disk) Implementation

Interesting classes to look at are as follows:

  1. FooCore/Tree.cs
  2. FooCore/TreeDiskNodeManager.cs
  3. FooCore/TreeDiskNodeSerializer.cs
  4. FooCore/TreeEnumerator.cs
  5. FooCore/TreeHelper.cs
  6. FooCore/TreeMemoryNodeManager.cs
  7. FooCore/TreeNode.cs
  8. FooCore/TreeTraverser.cs

B-Tree algorithms are already well documented by other people; we just need to translate them into code. I followed these articles while implementing the B-Tree: this, this and this. (It still took me more than 18 hours.)

The tricky part is the tree needs to be accessible from hard drive, not memory. As we access part of the tree, only that part gets loaded from hard drive into memory, not the entire tree.

To do that, we just need to make sure of 2 things:

  1. TreeNode must be serializable (convertible to and constructible from a byte array). Since a Tree is made up of multiple TreeNodes, a Tree is serializable if its nodes are serializable.
  2. Access to any TreeNode must go through an instance of ITreeNodeManager, i.e., a TreeNode should not keep a direct in-memory reference to another TreeNode; instead, it keeps the IDs of other nodes, as illustrated below.
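
The point of rule #2 can be illustrated with a stripped-down node shape. These types are illustrative only (hence the "Sketch" suffix); the real TreeNode and ITreeNodeManager in FooCore have more members, but the idea is the same: nodes hold IDs, and the node manager loads nodes on demand.

C#
// Illustrative shape only: nodes keep ids of related nodes, never object references.
using System;
using System.Collections.Generic;

class TreeNodeSketch<K, V>
{
    public uint Id;
    public uint ParentId;                                      // 0 = no parent (assumption)
    public List<Tuple<K, V>> Entries = new List<Tuple<K, V>>();
    public List<uint> ChildrenIds = new List<uint>();          // ids, not references

    public TreeNodeSketch<K, V> GetChild(ITreeNodeManagerSketch<K, V> nodeManager, int index)
    {
        // The node manager loads (and may cache) the child from disk on demand
        return nodeManager.Find(ChildrenIds[index]);
    }
}

interface ITreeNodeManagerSketch<K, V>
{
    TreeNodeSketch<K, V> Find(uint nodeId);
    TreeNodeSketch<K, V> Create(IEnumerable<Tuple<K, V>> entries, IEnumerable<uint> childrenIds);
    void MarkAsChanged(TreeNodeSketch<K, V> node);   // schedule the node for re-serialization
}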

For development and unit testing, I mocked up an in-memory implementation of ITreeNodeManager, the TreeMemoryNodeManager. Later on, I wrote TreeDiskNodeManager for production use, along with the TreeDiskNodeSerializer helper to serialize TreeNodes.

D. CowDatabase Implementation

Creating CowDatabase at this point is fairly easy and quick, as we already have RecordStorage and the indexes built.

Please check out my implementation of the CowDatabase as follows:

  1. FooApplication/FooDatabase.cs
  2. FooApplication/CowSerializer.cs

Initialization

FooDatabase initializes with a path to the database file as its argument, and it uses a naming convention to locate the two index files (the primary index on the cow's Guid, and the secondary index on Breed and Age).

It then constructs a primary index, a Tree<Guid, uint> (mapping a cow's GUID to its record ID) and a secondary index Tree<Tuple<string, int>, uint> (mapping a cow's Breed and Age to its record ID).

Insertion, Deletion and Update of Cows

CowDatabase uses a RecordStorage instance to store serialized cow data, with the help of CowSerializer. Whenever a cow is inserted, deleted, or updated, it calls Create(), Delete(), or Update() respectively on RecordStorage, as well as updating the relevant indexes.

Searching for a Cow or Cows

Searching for a cow (or cows) is done by looking up the actual record IDs in either the primary or the secondary index; those IDs are then used with RecordStorage to read the cow data.
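
As a sketch of that lookup path (reusing the field names from the wiring sketch in section 2.D; the real code lives in FooApplication/FooDatabase.cs): because the secondary index key is a Tuple<string, int>, all entries for one (Breed, Age) pair sit next to each other, so FindBy can start at the first match and stop at the first non-match.

C#
// Sketch only; 'secondaryIndex', 'records' and CowSerializer follow the
// earlier wiring sketch, not the actual FooDatabase implementation.
public IEnumerable<CowModel> FindBy(string breed, int age)
{
    var key = Tuple.Create(breed, age);
    foreach (var entry in secondaryIndex.LargerThanOrEqualTo(key))
    {
        // Stop once we walk past the requested (breed, age) key
        if (entry.Item1.Item1 != breed || entry.Item1.Item2 != age)
            break;

        yield return CowSerializer.Deserialize(records.Find(entry.Item2));
    }
}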

Now we can start storing and querying our cows as in FooApplication/Program.cs. Enjoy your custom-built database!

Points of Interest

I used Mono on a Mac to write this demo, with very satisfying performance. However, around 30-40% of CPU time was spent running C# code instead of doing actual IO (I was expecting it to be around 10-20%). Still, I think C# is better run on Windows.

I also faced an issue where my Mac caches all the write calls in memory, and I have no way to flush it, even with FileStream.Flush(true). I suspect this is a Mono bug.

Anyway, it was a fun weekend project; I hope everyone enjoys it.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer Catapult Sports
Australia Australia
Software developer in Melbourne, Australia.
