Adventure in two colors - Implementing an std::map like structure in C#

Ury Jamshy

4.44/5 (5 votes)

Jun 21, 2012

CPOL

6 min read

15854

223

mplementation of a data structure similar to the C++ std:map and based on the Red Black tree data structure.

Download source - 18.3 KB

Sample Image - maximum width is 600 pixels

Introduction

As an originally C++ programmer, I miss some aspects of the STL containers which were not implemented in the standard .net collections. Specifically, the ability to search for an item that is not necessarily in the collection, and get the closest match. These are the lower_bound and the upper_bound of the std::map. The need for this functionality (lower_bound and the upper_bound) emerged while trying to display tags along the time axis. When the view was zoomed, the subset of tags within the zoomed range needed to be fetched. If the set of tags was static, then sorting once and then using the generic List<> could do the trick, since it provides the BinarySearch method. However, the set was dynamic, so another approach was needed. Here SortedMap<TKey, TValue> is presented: a generic collection based on the Red Black tree data structure. It provides lower_bound and upper_bound in addition to the standard collection interface.

If you need the closest match functionality, you are welcome to read ahead and use the code. Even if you don't care about lower/upper bound, you might find the discussion about testing useful, since it presents some of the primary testing principles, that in my opinion are important for creating robust and (relatively) bug-free software.

Red Black Trees

The Red Black tree is described in many places, so we'll only present it briefly here. It is a binary tree, with each node painted either in black or red (perhaps inspired by the roulette colors). A red black tree satisfies the following properties:

Every leaf is black.
If a node is red, then its children are black.
Every path from a leaf to the root contains the same number of black nodes.

It is not hard to prove using the properties above that the height of a Red Black tree is O(log n). Hence search for an element is O(log n). Searching for a closest match, rather than exact match is easy enough: Just search for the element, and if not found, return the node where the search stopped. Inserting an element into a Red Black tree is composed of two steps:

First perform a standard binary tree insert. Second fix the the tree so its properties are preserved.
Removing an element is similar: Standard remove followed by steps to preserve the Red Black properties.

It follows that insertion and removal are also O(log n).

Using the code

SortedMap is composed of 2 layers:

RedBlackTree<T> (in RedBlackTree.cs) - implementation of the Red Black data structure.
SortedMap<TKey, TValue> (in SortedMap.cs) - a wrapper of the Red Black tree as a standard collection implementing IDictionary<>, ICollection<>, and IEnumerable<>.

Other classes in the project:

GenericEnumerator<T> (in GenericEnumerator.cs) - helper for the SortedMap.
PermutationGenerator (in PermutationGenerator.cs) - creates all permutations of a specific size: used for testing
TreeVisualizer<T> (in TreeVisualizer.cs) - displays the tree in a graphical way: used for testing
RedBlackTreeTester (in RedBlackTreeTester.cs) - testing the RedBlackTree
SortedMapTester (in SortedMapTester.cs) - testing the SortedMap

If you need a standard collection - use the SortedMap class, you will need SortedMap.cs, RedBlackTree.cs, and GenericEnumerator.cs. If you do not care about standard interfaces use directly the RedBlackTree class - only RedBlackTree.cs is needed.

RedBlackTree<T> provides the following methods (note that it has only one type argument):

Clear() - empties the tree
void Add(T item) - add a new item to the tree
void Remove(T item) - removes an existing item
TreeNode Find(T item) - find an exact match
TreeNode FindGreaterEqual(T item) - find either an exact match or the next item
TreeNode First() - return the first node in the tree
TreeNode Next(T item) - return the next node
bool IsValid() - check whether the tree is a valid Red Black tree (for testing)
bool TravelTree(TreeVisitor visitor) - travels the tree and applies the visitor to each node (for testing)

SortedMap<TKey, TValue> provides the closest match functionality in addition to the standard collection methods:

TValue LowerBound(TKey key) - return the first element whose key is no less than the provided key
TValue UpperBound(TKey key) - return the first element whose key is greater than the provided key
IEnumerable<KeyValuePair<TKey, TValue>> LowerBoundItems(TKey key) - return an enumerator starting with the lower bound item
IEnumerable<KeyValuePair<TKey, TValue>> UpperBoundItems(TKey key) - return an enumerator starting with the upper bound item

It is surprising how much code is needed in order to create a standard collection. The SortedMap is almost 1000 lines although it does not contain any logic - just implementing standard interfaces. The following code snippet demonstrates how to retrieve all items with keys that greater than a particular key:

SortedMap map = new SortedMap<int, double>();
...
int key = someValue;
foreach (KeyValuePair<int, double> pair in map.UpperBoundItems(key))
{
    // Do something
}

Complexity

As mentioned above, insertion removal and search for a key (both exact or closest match) cost O(log n) operations. Moving to the next element is O(log n) - worst, O(1) - amortized. This means that a single next operation may cost O(log n), but moving along the whole tree costs O(n) operations, so the average cost for a single operation is O(1). The following table summarizes the complexity of the various operations:

Operation	Complexity
Insertion	O(log n)
Removal	O(log n)
Search	O(log n)
Next element	O(log n) (amortized O(1))

Testing the Tree

The first question that comes to mind when creating this kind of data structure is why this particular choice of colors (Red and Black). The second question is how to make sure that the structure fulfills its requirements. One answer is validity check. I added a method IsValid to the RedBlackTree class, that checks the validity of the structure, and makes sure in particular, that the count of black nodes from every leaf is the same. During testing, this method is called after each operation in order to locate the exact point when the tree became invalid. However this is not enough, since the tree may be valid, but contains wrong data. For this I used the "compare to a simpler data structure" technique. Each operation on the tree was applied also to a List<>. The content of the tree was compared to the content of the list after each modification - to verify that the content of the tree is correct. 2 automatic tests were used to test the tree:

The first was adding to the tree all permutation of the numbers 1 to 9. The second was to randomly add and remove values to the tree while checking its validity. I often use this kind of random test since in most cases it will reveal bugs in unexpected situations. However this was not enough: The IsValid failed and the tree was too complex to understand the problem. Here the TreeVisualizer came to the rescue. It is a simple class that displays the tree in a text window (You can see a snapshot at the top of the article). Using it to within the tree implementation greatly helped to pinpoint the problems. So to summarize:

Validation - Write a method that tests the requirements and invoke it during testing.
Compare to a simpler structure - If possible, compare the new data structure to a simpler structure (preferably standard).
Automatic Random Testing - It will reveal situations that you didn't foresee.
Visualization - It helps to visualize the data structure. Sometimes it's a simple dump and in other times a graphical image.