Creating a List like class (For Beginners)

Paulo Zemek

4.91/5 (12 votes)

Jan 29, 2015

CPOL

13 min read

14649

A basic tutorial explaining how to create a List class.

Background

Sometime ago I started writing some very basic tutorial articles that I planned to publish in my personal site. I actually did publish some of them in the site but I never gave any links to them so I believe no-one ever saw them.

In any case, now I am posting one of those tutorial articles here. And it is about how to implement a List like class. So...

The very basic

A List is like a resizable array. The main points are:

You can get or set its items using an indexer (that is, something like list[somePoition] = someValue);
You can Add() and Remove() values from it, effectively changing its size.

Surely there are other methods on the .NET List<T> implementation but I am not covering all the methods, as many of them are there simply because that makes the class more complete. The real point is that a list works pretty similar to an array but can be resized while an array can't (don't be fooled by the static Array.Resize() method as it actually does a copy of the array instead of resizing one like the name suggests).

If our purpose was to create a really, really naive implementation of a List we could do this:

public class List
{
  private object[] _array = new object[0];

  public int Count
  {
    get
    {
      return _array.Length;
    }
  }

  public object this[int index]
  {
    get
    {
      return _array[index];
    }
    set
    {
      _array[index] = value;
    }
  }

  public void Add(object item)
  {
    int positionOfNewItem = _array.Length;
    Array.Resize(ref _array, _array.Length + 1);
    _array[positionOfNewItem] = item;
  }

  /* of course, some more methods here */
}

I said this was a very, very naive implementation. It is missing methods, like Remove(), Clear and others, it is missing parameter validations (what will happen if a value of -1 is given to the indexer?) and it is not a generic implementation (if we want a list of strings only, we will need to copy the code and change the type object with the type string) but most importantly: It is slow. Really slow.

Note that for each new item, an Array.Resize() is done, which actually allocates a new array and copies all items to the new array. So, to add only 4 items we will:

Allocate a new array (of size 1). The copy will do nothing, and we will set the item 1 (index 0);
Then we will allocate a new array, of size 2, we'll copy one item and, finally, set the item 2;
Then we will allocate a new array, of size 3, we'll copy 2 items and, finally, set the item 3.
Then we will allocate a new array, of size 4, we'll copy 3 items and, finally, set the item 4.

I think that you can imagine that the more items we have, the slower it gets. This is because the more items in the list, the slower the copy gets.

With 4 items it may not sound that bad. But what about 1000 items? Will we allocate 1000 new arrays and do 1000 copies, each time bigger than the previous one?

The real answer is: If we don't care about wasting some memory, we can do better.

A better implementation

The .NET List<T> implementation allocates an array bigger than the amount of items it needs to store so it can do many adds before resizing the inner array. Many list-like classes do the same, even those written in other languages. So, yes, that does waste some memory if we don't do the right amount of Add() calls but, from a performance perspective, it is much faster.

The implementation is more like this:

public class List
{
  private static readonly object _emptyArray = new object[0];

  private object[] _array = _emptyArray;
  private int _count;

  public int Count
  {
    get
    {
      return _count;
    }
  }

  public object this[int index]
  {
    get
    {
      if (index < 0 || index >= _count)
        throw new IndexOutOfRangeException("index");

      return _array[index];
    }
    set
    {
      if (index < 0 || index >= _count)
        throw new IndexOutOfRangeException("index");

      _array[index] = value;
    }
  }

  public void Add(object item)
  {
    if (_count == _array.Length)
    {
      int newCount = _count * 2;
      if (newCount == 0)
        newCount = 4;

      Array.Resize(ref _array, newCount);
    }

    _array[_count] = item;
    _count++;
  }
}

I went a little further and improved the getter and setter by throwing an appropriate exception if the index is invalid. I also avoid allocating a new empty array per list, only a single empty array is allocated. It is also important to note that I use the _count, not the _array.Length to throw the exceptions, as the array may be bigger than the actual Count.

Actually, we could consider valid to set a value at any index and, if needed, resize the inner array. Well, I don't do this and the .NET List doesn't either. I have reasons to believe such automatic resizing to be bad but I don't want to lose focus here, so, at this moment, we must accept that we don't do automatic resizes in that situation.

Now take a look at the Add() implementation. It will only resize the array (which actually causes a new array allocation with a copy) if the actual Count equals the array Length. When doing the first add the Count will be zero and the array length too, so it will end-up allocating an array of size 4.

For the next 3 Add() calls, the array will be big enough, so the item will be set directly to the array and the _count will be incremented. And, when we finally arrive do the 5th Add(), it will double the array size. As it keeps doubling the size, even if we need to copy items sometimes, this situation becomes less and less common... so we can say that we keep a really good speed for additions at the end. Actually, that's all the optimization that the List class needs regarding additions.

What else?

Well, I still didn't present the Remove() method (or any of the other methods), I still didn't talk about Object Oriented Programming Principles that we are following (or not), I didn't talk about the .NET interfaces or anything about different implementations. But I hope that which this lesson you understand that doing copies at every Add() is bad and that it should be avoided (that's why the .NET list uses that rule of doubling the inner array when its capacity is exhausted). I also hope you understand that using more memory than it is needed "right now" can be a good thing if you believe that such memory may be used in the future.

So let's see some of the other methods:

RemoveLast()

The logic of doubling the array length to add new items at the end doesn't help in adding or removing items in the middle but the fact that the class is prepared to deal with a Count that's different from the real array size helps when removing items located at the end. We can say that by simply doing a

_count--;

We can remove an item from the end. But such solution is still naive. What will happen if the user never adds a new item after a RemoveLast()? The item that's not accessible anymore (because it is outside the _count boundaries) can't be collected because to the garbage collector it is still active (it is still part of an active array). It is important to note that when we talk about Garbage Collection, accessible items mean those items that are directly or indirectly referenced by "roots" (local variables and static variables) not items that are accessible or not by the code logic. Putting an item in an array that's accessible from a root, even if it is in a position that we will never access, is still considered as making that object "alive". So, a RemoveLast() is going to look more like this:

public void RemoveLast()
{
  if (_count == 0)
    throw new InvalidOperationException("This list is already empty.");

  _count--;
  _array[_count] = null;
}

That _array[_count] = null; means that such item can be collected if needed. It may be a waste of time if we simply add an item just after doing the remove but the class doesn't know that, so it must do something smart for those cases we don't actually add a new item just after that call (and, well, setting a null value to an array index is pretty fast).

Clear()

The Clear() is both pretty simple and pretty problematic. What do we expect to happen after a Clear()?

Imagine that the actual list has 1 million items. Will the user of this list Clear() it and add 1 million items again or will the user simply let it die cleared?

Actually, a "perfect" Clear() (in the sense of releasing the memory) will do this:

public void Clear()
{
  _array = _emptyArray;
  _count = 0;
}

But the .NET implementation does something like this:

public void Clear()
{
  Array.Clear(_array, 0, _count);
  _count = 0;
}

This, in my opinion, means the developers who wrote the List class were expecting that it will become as big as it was before the Clear(). This can't be considered a bug but it is a memory consumption problem (especially for .NET, that's expected to take care of memory deallocation for us). If you clear a List<T> because you expect it to reduce its inner array size... well, it is not going to happen. Items that were in it would be available for garbage collection (considering there aren't other live references to them) but the inner array itself will not. Also, if you think that doing a clear after finishing using your local list is helping the garbage collector, then think again. It isn't. The garbage collection process must "mark" all objects that are in use before freeing memory, starting from roots (local variables and static variables). So, it will not lose time with giant arrays if there are no live references to those giant arrays. A list that's only seen by a loca local variable will be ready for collection when the method ends (or just after the last time the variable is used inside the method, depending on the optimizations). Doing a Clear() just before the list variable goes out of scope means you are actually keeping the variable reference alive for more time, which can avoid the list from being collected if a garbage collection happens at that exact moment (not very common), and the nulls aren't going to help "collect the referenced objects" faster, as when the list itself is unreachable the garbage collector will not lose time looking at its contents.

Note: Don't confuse my last statement about clearing a list before it goes out-of-scope with Dispose(). Except from some rare exceptions, disposable objects must be disposed as soon as you know you don't need them anymore.

AddRange()

The AddRange is not really smart but it avoids some unnecessary validations and copies. Remember what happens at every Add()? We check if we need to resize the inner array and, independently if we do it or not, we set the item and the count. Well, we can validate the size only once, resize it only once (if we need that at all) and then we can copy the items directly.

Imagine that we actually have an inner array of length 4, with a single item (so, a count of 1). Now we want to add 1000 items. Instead of adding each one individually, and doing 1000 checks to see if we need to regrow the inner array (and actually doing the copies to regrow when we need them), we do the validation and resize only once. So, we can improve performance with a method like:

public void AddRange(IReadOnlyCollection collection)
{
  if (collection == null)
    throw new ArgumentNullException("collection");

  int count = collection.Count;
  if (count == 0)
    return;

  int finalCount = _count + count;
  if (finalCount > array.Length)
    Array.Resize(ref _array, finalCount)

  collection.CopyTo(_array, _count);
  _count = finalCount;
}

Actually, this is not the code of the List implementation in .NET. The List implementation in .NET is both more complex and "stupid". It tries to detect if the received collection is really an ICollection or not, because the parameter type is IEnumerable<T>. If it is an ICollection it uses the Count to do this optimized solution... if it is not, then it iterates the given collection adding the items normally. I don't consider that a good implementation but this is how it works.

Generics

You probably know that the .NET List<T> is a generic type. That's what that <T> is all about. In my example I was using the object type but using the type as object has many problems. In special:

It allows any object to be put into the list, without any validation;
Users will need to cast the read items to their right type before using them. If they actually inserted an object of an invalid type, they will only see the error when doing the cast, not at the insert;
Value-types (ints, for example) will be boxed.

And we can solve all those problems by making the type generic. To do that, it is enough to put that <T> after the class name and replace all the object uses by a T. This means that we will have an array of ints if we use a List<int>, an array of strings if we use a List<string> and I think that you already got the idea. By using a generic type we will have better performance (at least for value types) and we will also have some compile-time validations, as a list of strings is not going to accept an int as parameter.

The interfaces

The .NET List implements lots of interfaces today. In fact, some of them are actually legacy and problematic.

For example, the non-generic ICollection has properties like IsSynchronized and SyncRoot. Putting a public SyncRoot property is both an excess of responsibility (the List class doesn't need to care about this), a possible source of bugs and also it doesn't allow to use reader/writer locks (and the List itself is prepared for it). If someone wants to make a list thread-safe by using locks, it is the responsibility of that someone (or of that class) do deal with the locks. We should not add properties (and consequently fields) to our type for that kind of situation. The ICollection also has the CopyTo method but it actually doesn't allow you to tell which part do you want to copy. You always copy the entire contents of one collection to an array (starting at a specific index). This is simply odd, in my opinion.

The generic interfaces, on the other hand, seem much more accurate. Yet the mutable interfaces appeared before the read-only ones, so the mutable ICollection<T> has the IsReadOnly property and, if it is true... well, you still have all the mutable methods available, but they are expected to throw exceptions if they are used. Only the new IReadOnly* interfaces are really slim. But, even then, I can't say that you should implement those interfaces.

The fact that most of those interfaces expose counter-intuitive properties (like the SyncRoot or the IsReadOnly on a mutable interface) only shows me that we should avoid implementing interfaces for the sake of completeness. Users can always create adapters/wrappers that implement interfaces and redirect to our types if needed. In fact, if performance is a problem, as long as we don't seal our types (and sealing itself is another big topic if we want to discuss) users can simply inherit from the actual List and implement the interfaces they want, without the extra cost of redirection from one type to another (as happens with adapters). So, in my opinion, the only interface that's really worth implementing is the IEnumerable<T> interface... or maybe not even that one. See, what do you prefer:

foreach(string name in names)
{
  // Do something here.
}

Or do you prefer this:

foreach(string name in names.Enumerate())
{
  // Do something here.
}

Actually the first solution means we have to implement the IEnumerable<T> while the second solution means we only need to create a method Enumerate() that returns an IEnumerable<T>. I will let you decide which one is better but, in any case, the contents of the GetEnumerator() method (needed by the interface) or the contents of the Enumerate() method will look like this:

for(int i=0; i<_count; i++)
  yield return _array[i];

So, now that you know how the class works, it is up to you to implement the methods that may be missing, to implement (or not) some interfaces and even add attributes so it can be used by other frameworks (like [Serializable]).

CodeProject

Conclusion

This time I don't have a real conclusion. I wanted to show how to create a List class and talk a little about the excess of responsibilities the actual .NET implementation has. I did it, so I hope you like. That's all.