When Lists and data grow up

21 Sep 2015 · CPOL · 6 min read

Introduction

This article takes a dive into one of the reasons why code sometimes silts up: the use of the versatile and quick-to-implement List&lt;T&gt;, what happens when the elements to iterate become plentiful, which simple steps can yield a manyfold improvement while you stay with lists, and what to do when a list is no longer enough. You are presented with a sample project that will actually execute the benchmark tests on your own machine, so you can inspect the implementation if you like.

Background

I have often come across time thieves where lists are guarded by existence checks before adding, done with simple lambdas. But just how bad is that, anyway?

C#
if (compareMethod == CompareMethod.LinqAnyBothEquals)
{
    if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
        list.Add(data);
}

And what are our options in terms of low-hanging fruit and beyond?
I decided to measure, and to provide a couple of suggestions: what you should do when list operations suddenly begin to take time, an indication of when you have outgrown the list data structure, and the next step if your data really is allowed to grow in numbers and you want to regain performance.

Using the code

The example code is a console application for VS 2015. It contains a program.cs file with all the party inside, plus an extension class with a small utility directed at the console window.
Use the project if you find my explanations too skinny, or if you simply want to see the performance test run yourself; it takes a few minutes to complete.

The test plan

These tests are based on sample data of our imaginary TestData derivatives, Data1 and Data2. The base class contains two properties: an ID and a Name.
Objects of these types are instantiated with an incremental ID number and a random 10-character name.
Lists are built containing 1,000, 10,000, and 100,000 elements. Access is simulated by taking 25% of the elements at random and, for each of them, deciding at random whether or not to remove it from the main list before adding it to the test set; this ensures there will be some full scans in between. The runs are deliberately not identical, but comparable and realistic: the numbers will differ a little each time the test is run, but the distribution remains the same as a trend.
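To make the setup concrete, here is a minimal sketch of how such an element might be generated. GetNewData and pickLength are names taken from the sample project, but this body is my illustration, not the project's exact code:

```csharp
// Illustrative sketch: build a TestData derivative with an incremental ID
// and a random uppercase name of pickLength (here 10) characters.
private static int _nextId;

private static T GetNewData<T>(Random rnd, int pickLength) where T : TestData, new()
{
    var chars = new char[pickLength];
    for (int i = 0; i < pickLength; i++)
        chars[i] = (char)rnd.Next('A', 'Z' + 1); // random uppercase letter

    return new T { ID = _nextId++, Name = new string(chars) };
}
```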

C#
var sw = new Stopwatch();
("Compare method " + compareMethod).ToConsole();
foreach (var amt in amountsOfData)
{
    int manyData = amt;
    "".ToConsole();
    GetMeAlistWithManyData<T>(rnd, list, pickLength, manyData);
    "Get 25 % of data, each element delete or not at random from list".ToConsole();
    int aboutTwentyFivePct = (int)(manyData * 0.25);
    var arr = new T[aboutTwentyFivePct];
    for (int i = 0; i < aboutTwentyFivePct; i++)
    {
        int pos = rnd.Next(0, list.Count);
        arr[i] = list[pos];
        // Random.Next(min, max) excludes max, so the upper bound
        // must be 2 to get a 50/50 coin flip.
        if (rnd.Next(0, 2) == 1)
        {
            list.Remove(arr[i]);
        }
    }


Each element in the test set is then added back to the list structure, checking first for existence.
Experiments are made with enhancing the Data classes and exploiting the capabilities of other data structures.

The Lambda culprit

Lambdas are wonderful; they let you do almost everything and quickly produce working code. But an existence check like Any is a linear scan of the whole list, so every add costs up to one full pass, and adding N elements this way approaches N² comparisons. How much time does that take?

C#
"Testing to add each element, but only if it does not exist".ToConsole();
sw.Start();
foreach (T data in arr)
{
    if (compareMethod == CompareMethod.LinqAnyBothEquals)
    {
        if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
            list.Add(data);
    }
    else if (compareMethod == CompareMethod.Contains)
    {
        if (!list.Contains(data))
            list.Add(data);
    }
    else
    {
        throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
    }
}
sw.Stop();

When you work on data with fewer than 1,000 elements, you really shouldn't worry too much about using a lambda. Go ahead if it makes your life easier; I know it often makes mine.

Rows     Duration
1,000    10 ms
10,000   1,107 ms
100,000  69,143 ms

As you can see, these existence checks become costly as lists grow. But hey, a list has a .Contains method!

The frailty of Contains

As you probably know, the generic List&lt;T&gt; type implements a Contains method, and it is tempting to believe you can use it to check whether a list, well, contains a given object.

C#
for (int i = 0; i < 3; i++)
{
    Data1 objX = GetNewData<Data1>(rnd, pickLength);
    list.Add(objX);
}
var objValueCopy = new Data1 { ID = list[2].ID, Name = list[2].Name };
if (list.Contains(objValueCopy))
{
    // Jump around!
}
else
{
    // No: I'm a reference to a new object with identical values!
}

As you probably suspect from knowledge or intuition, no jumping around occurs in the code above. Left to its own devices, Contains checks whether that exact object reference is already contained, which is only sometimes useful; in our case, it is not.

Contains is clay

But if your data class defines how it should be compared, Contains can compare by value rather than by reference. Here, Data2 implements IEqualityComparer to define how to test for equality with this class.

C#
public class TestData
{
    public int ID { get; set; }
    public string Name { get; set; }
}

// Plain vanilla data class with nothing special to help
public class Data1 : TestData
{
}

// Implementing IEqualityComparer to help.
// Note: List<T>.Contains actually consults EqualityComparer<T>.Default,
// which picks up IEquatable<T>, so that interface is implemented as well.
public class Data2 : Data1, IEqualityComparer, IEqualityComparer<Data2>, IEquatable<Data2>
{
    public bool Equals(Data2 x, Data2 y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }
    public bool Equals(Data2 other)
    {
        return other != null && Equals(this, other);
    }
    public new bool Equals(object x, object y)
    {
        if (x is Data2 && y is Data2)
        {
            return Equals((Data2)x, (Data2)y);
        }
        return false;
    }
    public int GetHashCode(Data2 obj)
    {
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }
    public int GetHashCode(object obj)
    {
        if (obj is Data2)
        {
            return GetHashCode((Data2)obj);
        }
        return obj.GetHashCode();
    }
}

Now we can get results using Contains instead:

Rows     Duration
1,000    1 ms
10,000   186 ms
100,000  18,330 ms

Part Conclusions

Take-home one: Implementing IEqualityComparer can SIGNIFICANTLY improve the performance of existence checks in lists!

Take-home two: The improvements show up already with small lists and are proportionally largest there; the sheer number of comparisons ultimately remains a major bottleneck with 100K+ datasets.

Take-home three: Lists do not perform well with large amounts of data when inclusion checks have to be made. If your data grows, consider alternatives to lists.

IEqualityComparer is King

Many CLR classes honour and use IEqualityComparer. Another very high-performing data structure is the HashSet&lt;T&gt;.

C#
sw.Start();
foreach (T data in arr)
{
    if (compareMethod == CompareMethod.LinqAnyBothEquals)
    {
        if (!hashSet.Any(s => s.ID == data.ID && s.Name == data.Name))
            hashSet.Add(data);
    }
    else if (compareMethod == CompareMethod.Contains)
    {
        if (!hashSet.Contains(data))
            hashSet.Add(data);
    }
    else
    {
        throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
    }
}
sw.Stop();

Rows     Duration
1,000    0 ms
10,000   0 ms
100,000  5 ms

Pretty much warp speed, isn't it?
But only on a set of Data2 objects; Data1 types still suffer:

Rows     Duration
1,000    12 ms
10,000   754 ms
100,000  73,748 ms

HashSets have another thing up their sleeve

So are we stuck if we cannot control the data class?
No: you can specify an equality comparer when you instantiate the HashSet, and then the pipe has a different sound.

C#
public class TestDataEqualityComparer : IEqualityComparer<TestData>
{
    public bool Equals(TestData x, TestData y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }
    public int GetHashCode(TestData obj)
    {
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }
}

Passing an instance of this comparer to the constructor and operating on Data1 objects:

Rows     Duration
1,000    0 ms
10,000   1 ms
100,000  28 ms

This is nearly as good!
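For reference, wiring the comparer up is a one-liner. Here is a minimal sketch, assuming the Data1 and TestDataEqualityComparer types shown earlier:

```csharp
// The comparer passed to the constructor is used for all Add/Contains
// checks, so Data1 needs no equality code of its own.
var hashSet = new HashSet<Data1>(new TestDataEqualityComparer());

var a = new Data1 { ID = 1, Name = "Alpha" };
var b = new Data1 { ID = 1, Name = "Alpha" }; // different reference, same values

hashSet.Add(a);
bool added = hashSet.Add(b); // false: b counts as a duplicate of a
```

As a bonus, HashSet&lt;T&gt;.Add itself returns false for duplicates, so the explicit Contains check before adding can be skipped entirely.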

Part conclusions

Take-home four: It doesn't matter that you use a HashSet if you compare with a LINQ lambda; Any enumerates the set linearly, so you're just as damned.

Take-home five: Consider using HashSets for large amounts of data, and provide an equality comparer to the constructor if you cannot change the data classes to supply one themselves.

A little word of advice

In this article I've addressed the range-inclusion check. It is worth considering that if value equality is not actually what you want to check, altering the data classes is not advisable. Moving a class from reference equality to value equality can be defended; but say you only want to know whether the ID is already there, for instance because you intend to update the objects, or because their value properties may change and that should not decide your program flow. In that case, write a separate comparer with a meaningful name instead: passing in something called TestDataIdentityEqualityComparer keeps your code easy to read and avoids unintended accidents when people use the data classes in other contexts. Providing custom comparers that way avoids confusion.
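Such an ID-only comparer might look like the sketch below; the name TestDataIdentityEqualityComparer is only the suggestion from above, not a class in the sample project:

```csharp
// Compares TestData objects by ID alone; Name differences do not
// affect identity, so it suits update-if-present scenarios.
public class TestDataIdentityEqualityComparer : IEqualityComparer<TestData>
{
    public bool Equals(TestData x, TestData y)
    {
        return x != null && y != null && x.ID == y.ID;
    }
    public int GetHashCode(TestData obj)
    {
        return obj.ID.GetHashCode();
    }
}
```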

Conclusion

If you are using generic lists with more than 100 elements and lambda expressions to test for an existing object with identical values when adding, consider changing to a HashSet, and remember to provide an equality comparer if your data class does not implement one itself; your code will run faster.


It is debatable where the pain barrier for the change lies, at 100 or at 1,000 elements, but I hope to have demonstrated that you can achieve significant improvements by letting your data classes implement IEqualityComparer even if, for ease of use or because of the amount of data involved, you stick with lists.

History

This is the first version of this article.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) getCore.dk
Denmark
Thomas mostly works as staff in multi-year engagements with companies, but at intervals is available as a software contractor.

Thomas has been a professional programmer since 1998 and has been programming .NET since v1.1 in 2002.

His specialities are C# high-throughput systems, data processing and transformation, as well as MSSQL systems from small to VLDB size, and the optimization of code on .NET and T-SQL.

Full-stack Microsoft developer with several pillars and quite some years on each of them.

He is married, a father of two boys, lives in Copenhagen, Denmark.

Comments and Discussions

 
Question: As Is So Often the Case, There Is Another Way (David A. Gray, 22-Sep-15)
I use generic Lists a lot, but I do two things that you didn't mention, both of which almost certainly outperform Lambdas.

First, I make sure that any object that I intend to store in a List implements the IComparable interface. If necessary, and the class is mine, I make it implement IComparable. If the class isn't mine, and doesn't implement IComparable, I derive my own class from it, add an IComparable interface to it, and use my derived class. Although it has never yet happened, should I need to store instances of a sealed class that doesn't implement IComparable, I would create an Adapter around it that does.

Second, since the Items in my List implement IComparable, when I am done loading the List, I call the Sort method on it, which then allows me to use its BinarySearch method with impunity.
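The approach described in this comment can be sketched like so; this is my illustration of the idea, not the commenter's code, and Data3 is a hypothetical name:

```csharp
// Sketch of the commenter's approach: implement IComparable<T>,
// sort once, then every lookup is an O(log n) binary search.
public class Data3 : TestData, IComparable<Data3>
{
    public int CompareTo(Data3 other)
    {
        int byId = ID.CompareTo(other.ID);
        return byId != 0 ? byId : string.Compare(Name, other.Name, StringComparison.Ordinal);
    }
}

// Usage: load everything, sort once, then search.
var list = new List<Data3>();
// ... load items ...
list.Sort(); // uses Comparer<Data3>.Default, which picks up IComparable<Data3>
var candidate = new Data3 { ID = 42, Name = "SOMENAME01" };
int index = list.BinarySearch(candidate); // >= 0 if an equal element exists
```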