When Lists and data grow up

21 Sep 2015 · CPOL · 6 min read

Introduction

This article takes a dive into one of the reasons why code sometimes silts up: the use of the versatile and quick-to-implement List&lt;T&gt;, what happens when the elements to iterate become plentiful, which simple steps can yield a manyfold improvement while you stay with lists, and what to do when a list is no longer enough. You are presented with a sample project that will actually execute the benchmark tests on your own machine, so you can inspect the implementation if you like.

Background

I have often come across time thieves where lists are guarded by existence checks before adding, done with simple lambdas. But just how bad is that, anyway?

C#
if (compareMethod == CompareMethod.LinqAnyBothEquals)
{
    if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
        list.Add(data);
}

And what are our options in terms of low-hanging fruit and beyond?
I decided to measure, and to provide a couple of suggestions: what you should do when list operations suddenly begin to take time, an indication of when you have outgrown the list data structure, and the next step if your data really is allowed to grow in numbers and you want to regain performance.

Using the code

The example code is a console application for VS 2015. It contains a program.cs file with all the party inside, plus an extension class with a small utility directed at the console window.
Use the project if you find my explanations too skinny, or if you simply want to see the performance test run yourself; it takes a few minutes to complete.

The test plan

These tests are based on sample data of our imaginary TestData derivatives, Data1 and Data2. The base class contains two properties: an ID and a Name.
Objects of these types are instantiated with an incremental ID number and a random 10-character name.
Lists are built containing 1,000, 10,000, and 100,000 elements. Access is simulated by taking 25% of the elements at random and, for each of them, deciding at random whether or not to remove it from the main list before adding it to the test set; this ensures there will be some full scans in between. The runs are deliberately not identical, but comparable and realistic: the numbers will differ a little each time the test is run, but the distribution remains the same as a trend.
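To make the setup concrete, here is a minimal sketch of how such an element might be generated. GetNewData and pickLength are names taken from the sample project, but this body is my illustration, not the project's exact code:

```csharp
// Illustrative sketch: build a TestData derivative with an incremental ID
// and a random uppercase name of pickLength (here 10) characters.
private static int _nextId;

private static T GetNewData<T>(Random rnd, int pickLength) where T : TestData, new()
{
    var chars = new char[pickLength];
    for (int i = 0; i < pickLength; i++)
        chars[i] = (char)rnd.Next('A', 'Z' + 1); // random uppercase letter

    return new T { ID = _nextId++, Name = new string(chars) };
}
```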

C#
var sw = new Stopwatch();
("Compare method " + compareMethod).ToConsole();
foreach (var amt in amountsOfData)
{
    int manyData = amt;
    "".ToConsole();
    GetMeAlistWithManyData<T>(rnd, list, pickLength, manyData);
    "Get 25 % of data, each element delete or not at random from list".ToConsole();
    int aboutTwentyFivePct = (int)(manyData * 0.25);
    var arr = new T[aboutTwentyFivePct];
    for (int i = 0; i < aboutTwentyFivePct; i++)
    {
        int pos = rnd.Next(0, list.Count);
        arr[i] = list[pos];
        // Random.Next(min, max) excludes max, so the upper bound
        // must be 2 to get a 50/50 coin flip.
        if (rnd.Next(0, 2) == 1)
        {
            list.Remove(arr[i]);
        }
    }


Each element in the test set is then added back to the list structure, checking first for existence.
Experiments are made with enhancing the Data classes and exploiting the capabilities of other data structures.

The Lambda culprit

Lambdas are wonderful; they let you do almost everything and quickly produce working code. But an existence check like Any is a linear scan of the whole list, so every add costs up to one full pass, and adding N elements this way approaches N² comparisons. How much time does that take?

C#
"Testing to add each element, but only if it does not exist".ToConsole();
sw.Start();
foreach (T data in arr)
{
    if (compareMethod == CompareMethod.LinqAnyBothEquals)
    {
        if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
            list.Add(data);
    }
    else if (compareMethod == CompareMethod.Contains)
    {
        if (!list.Contains(data))
            list.Add(data);
    }
    else
    {
        throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
    }
}
sw.Stop();

When you work on data with fewer than 1,000 elements, you really shouldn't worry too much about using a lambda. Go ahead if it makes your life easier; I know it often makes mine.

Rows     Duration
1,000    10 ms
10,000   1,107 ms
100,000  69,143 ms

As you can see, these existence checks become costly as lists grow. But hey, a list has a .Contains method!

The frailty of Contains

As you probably know, the generic List&lt;T&gt; type implements a Contains method, and it is tempting to believe you can use it to check whether a list, well, contains a given object.

C#
for (int i = 0; i < 3; i++)
{
    Data1 objX = GetNewData<Data1>(rnd, pickLength);
    list.Add(objX);
}
var objValueCopy = new Data1 { ID = list[2].ID, Name = list[2].Name };
if (list.Contains(objValueCopy))
{
    // Jump around!
}
else
{
    // No: I'm a reference to a new object with identical values!
}

As you probably suspect from knowledge or intuition, no jumping around occurs in the code above. Left to its own devices, Contains checks whether that exact object reference is already contained, which is only sometimes useful; in our case, it is not.

Contains is clay

But if your data class defines how it should be compared, Contains can compare by value rather than by reference. Here, Data2 implements IEqualityComparer to define how to test for equality with this class.

C#
public class TestData
{
    public int ID { get; set; }
    public string Name { get; set; }
}

// Plain vanilla data class with nothing special to help
public class Data1 : TestData
{
}

// Implementing IEqualityComparer to help.
// Note: List<T>.Contains actually consults EqualityComparer<T>.Default,
// which picks up IEquatable<T>, so that interface is implemented as well.
public class Data2 : Data1, IEqualityComparer, IEqualityComparer<Data2>, IEquatable<Data2>
{
    public bool Equals(Data2 x, Data2 y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }
    public bool Equals(Data2 other)
    {
        return other != null && Equals(this, other);
    }
    public new bool Equals(object x, object y)
    {
        if (x is Data2 && y is Data2)
        {
            return Equals((Data2)x, (Data2)y);
        }
        return false;
    }
    public int GetHashCode(Data2 obj)
    {
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }
    public int GetHashCode(object obj)
    {
        if (obj is Data2)
        {
            return GetHashCode((Data2)obj);
        }
        return obj.GetHashCode();
    }
}

Now we can get results using Contains instead:

Rows     Duration
1,000    1 ms
10,000   186 ms
100,000  18,330 ms

Part Conclusions

Take-home one: Implementing IEqualityComparer can SIGNIFICANTLY improve the performance of existence checks in lists!

Take-home two: The improvements show up already with small lists and are proportionally largest there; the sheer number of comparisons ultimately remains a major bottleneck with 100K+ datasets.

Take-home three: Lists do not perform well with large amounts of data when inclusion checks have to be made. If your data grows, consider alternatives to lists.

IEqualityComparer is King

Many CLR classes honour and use IEqualityComparer. Another very high-performing data structure is the HashSet&lt;T&gt;.

C#
sw.Start();
foreach (T data in arr)
{
    if (compareMethod == CompareMethod.LinqAnyBothEquals)
    {
        if (!hashSet.Any(s => s.ID == data.ID && s.Name == data.Name))
            hashSet.Add(data);
    }
    else if (compareMethod == CompareMethod.Contains)
    {
        if (!hashSet.Contains(data))
            hashSet.Add(data);
    }
    else
    {
        throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
    }
}
sw.Stop();

Rows     Duration
1,000    0 ms
10,000   0 ms
100,000  5 ms

Pretty much warp speed, isn't it?
But only on a set of Data2 objects; Data1 types still suffer:

Rows     Duration
1,000    12 ms
10,000   754 ms
100,000  73,748 ms

HashSets have another thing up their sleeve

So are we stuck if we cannot control the data class?
No: you can specify an equality comparer when you instantiate the HashSet, and then the pipe has a different sound.

C#
public class TestDataEqualityComparer : IEqualityComparer<TestData>
{
    public bool Equals(TestData x, TestData y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }
    public int GetHashCode(TestData obj)
    {
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }
}

Passing an instance of this comparer to the constructor and operating on Data1 objects:

Rows     Duration
1,000    0 ms
10,000   1 ms
100,000  28 ms

This is nearly as good!
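For reference, wiring the comparer up is a one-liner. Here is a minimal sketch, assuming the Data1 and TestDataEqualityComparer types shown earlier:

```csharp
// The comparer passed to the constructor is used for all Add/Contains
// checks, so Data1 needs no equality code of its own.
var hashSet = new HashSet<Data1>(new TestDataEqualityComparer());

var a = new Data1 { ID = 1, Name = "Alpha" };
var b = new Data1 { ID = 1, Name = "Alpha" }; // different reference, same values

hashSet.Add(a);
bool added = hashSet.Add(b); // false: b counts as a duplicate of a
```

As a bonus, HashSet&lt;T&gt;.Add itself returns false for duplicates, so the explicit Contains check before adding can be skipped entirely.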

Part conclusions

Take-home four: It doesn't matter that you use a HashSet if you compare with a LINQ lambda; Any enumerates the set linearly, so you're just as damned.

Take-home five: Consider using HashSets for large amounts of data, and provide an equality comparer to the constructor if you cannot change the data classes to supply one themselves.

A little word of advice

In this article I've addressed the range-inclusion check. It is worth considering that if value equality is not actually what you want to check, altering the data classes is not advisable. Moving a class from reference equality to value equality can be defended; but say you only want to know whether the ID is already there, for instance because you intend to update the objects, or because their value properties may change and that should not decide your program flow. In that case, write a separate comparer with a meaningful name instead: passing in something called TestDataIdentityEqualityComparer keeps your code easy to read and avoids unintended accidents when people use the data classes in other contexts. Providing custom comparers that way avoids confusion.
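Such an ID-only comparer might look like the sketch below; the name TestDataIdentityEqualityComparer is only the suggestion from above, not a class in the sample project:

```csharp
// Compares TestData objects by ID alone; Name differences do not
// affect identity, so it suits update-if-present scenarios.
public class TestDataIdentityEqualityComparer : IEqualityComparer<TestData>
{
    public bool Equals(TestData x, TestData y)
    {
        return x != null && y != null && x.ID == y.ID;
    }
    public int GetHashCode(TestData obj)
    {
        return obj.ID.GetHashCode();
    }
}
```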

Conclusion

If you are using generic lists with more than 100 elements and lambda expressions to test for an existing object with identical values when adding, consider changing to a HashSet, and remember to provide an equality comparer if your data class does not implement one itself; your code will run faster.


It is debatable where the pain barrier for the change lies, at 100 or at 1,000 elements, but I hope to have demonstrated that you can achieve significant improvements by letting your data classes implement IEqualityComparer even if, for ease of use or because of the amount of data involved, you stick with lists.

History

This is the first version of this article.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) getCore.dk
Denmark
Thomas mostly works as staff in multi-year engagements with companies, but at intervals is available as a software contractor.

Thomas has been a professional programmer since 1998 and has been programming .NET since v1.1 in 2002.

His specialities are C# high-throughput systems, data processing and transformation, as well as MSSQL systems from small to VLDB size, and the optimization of code on .NET and T-SQL.

Full-stack Microsoft developer with several pillars and quite some years on each of them.

He is married, a father of two boys, lives in Copenhagen, Denmark.

Comments and Discussions

 
Question: As Is So Often the Case, There Is Another Way (David A. Gray, 22-Sep-15)
I use generic Lists a lot, but I do two things that you didn't mention, both of which almost certainly outperform Lambdas.

First, I make sure that any object that I intend to store in a List implements the IComparable interface. If necessary, and the class is mine, I make it implement IComparable. If the class isn't mine, and doesn't implement IComparable, I derive my own class from it, add an IComparable interface to it, and use my derived class. Although it has never yet happened, should I need to store instances of a sealed class that doesn't implement IComparable, I would create an Adapter around it that does.

Second, since the Items in my List implement IComparable, when I am done loading the List, I call the Sort method on it, which then allows me to use its BinarySearch method with impunity.
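The approach described in this comment can be sketched like so; this is my illustration of the idea, not the commenter's code, and Data3 is a hypothetical name:

```csharp
// Sketch of the commenter's approach: implement IComparable<T>,
// sort once, then every lookup is an O(log n) binary search.
public class Data3 : TestData, IComparable<Data3>
{
    public int CompareTo(Data3 other)
    {
        int byId = ID.CompareTo(other.ID);
        return byId != 0 ? byId : string.Compare(Name, other.Name, StringComparison.Ordinal);
    }
}

// Usage: load everything, sort once, then search.
var list = new List<Data3>();
// ... load items ...
list.Sort(); // uses Comparer<Data3>.Default, which picks up IComparable<Data3>
var candidate = new Data3 { ID = 42, Name = "SOMENAME01" };
int index = list.BinarySearch(candidate); // >= 0 if an equal element exists
```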