Introduction
This article takes a dive into one of the reasons why code sometimes silts up: the use of the versatile and quick-to-implement List, what happens when the elements of your iterations become plentiful, which simple steps can achieve a manyfold improvement while you are still using lists, and what you can do when a list is no longer enough to perform. You are presented with a sample project that will execute the benchmark tests on your own machine, so you can inspect the code implementation if you like.
Background
I have often come across time thieves where a list gets an existence check before every add, done with a simple lambda. But just how bad is that anyway?
if (compareMethod == CompareMethod.LinqAnyBothEquals)
{
    if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
        list.Add(data);
}
And what are our options, in terms of low-hanging fruit and of bigger changes on the loom?
I decided to measure, and to provide a couple of suggestions for what you should do when you encounter list operations suddenly beginning to take time, an indication of when you have outgrown the list data structure, and what the next step is if your data really is allowed to grow in numbers and you want to regain performance.
Using the code
The example code is a console application for VS 2015. It contains a program.cs file with all the party inside, and an extension class with a small utility directed at the console window.
You should mostly use the project if you find my written explanations too skinny, or if you simply want to see the performance tests run yourself; a full run takes a few minutes to complete.
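For reference, the ToConsole() extension used throughout the excerpts below boils down to a one-liner; this is a minimal sketch, the class name is mine, and the utility class in the project may do a little more:

using System;

// Sketch of the ToConsole() string extension used in the code excerpts.
public static class ConsoleExtensions
{
    public static void ToConsole(this string text)
    {
        Console.WriteLine(text);
    }
}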
The test plan
These tests are based on sample data of our imaginary TestData derivatives, Data1 and Data2. The base class contains two properties: an ID and a Name.
Objects of these types are instantiated with an incremental ID number and a random ten-character name.
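To illustrate, a generator along these lines would do; this is a minimal sketch under the assumption that the project's GetNewData works roughly like this, and the nextId field is my own device:

static int nextId;

// Sketch: creates a test object with an incremental ID and a random
// name of pickLength uppercase characters.
static T GetNewData<T>(Random rnd, int pickLength) where T : TestData, new()
{
    const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    var name = new char[pickLength];
    for (int i = 0; i < pickLength; i++)
        name[i] = chars[rnd.Next(chars.Length)];
    return new T { ID = ++nextId, Name = new string(name) };
}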
Lists are made that contain 1000, 10000, and 100000 elements. Access is simulated by taking 25% of the elements at random and, for each of them, also at random, either removing it from the main list or not, and adding it to the test set, ensuring that there will be some full scans in between. This is deliberately not intended to give identical runs, but comparable, realistic ones. The numbers will differ a bit each time the test is run, but the distribution remains the same as a trend.
var sw = new Stopwatch();
("Compare method " + compareMethod).ToConsole();
foreach (var amt in amountsOfData)
{
    int manyData = amt;
    "".ToConsole();
    GetMeAlistWithManyData<T>(rnd, list, pickLength, manyData);
    "Get 25 % of data, each element delete or not at random from list".ToConsole();
    int aboutTwentyFivePct = (int)(manyData * 0.25);
    var arr = new T[aboutTwentyFivePct];
    for (int i = 0; i < aboutTwentyFivePct; i++)
    {
        int pos = rnd.Next(0, list.Count);
        arr[i] = list[pos];
        if (rnd.Next(0, 2) == 1) // coin flip; Next's upper bound is exclusive
        {
            list.Remove(arr[i]);
        }
    }
Each element in the test set will then be added back to the list structure, but only after first checking that it does not already exist.
Experiments will then be made with enhancing the Data classes and with data structures that offer better lookup capabilities.
The Lambda culprit
Lambda is so wonderful, it lets you do almost anything and quickly produce working code. But a check like the Any above scans the whole list, invoking your delegate for every element, and that takes a wee bit of time. How much?
"Testing to add each element, but only if it does not exist".ToConsole();
sw.Start();
foreach (T data in arr)
{
if (compareMethod == CompareMethod.LinqAnyBothEquals)
{
if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
list.Add(data);
}
else if (compareMethod == CompareMethod.Contains)
{
if (!list.Contains(data))
list.Add(data);
}
else
{
throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
}
}
sw.Stop();
When you work on data with fewer than 1000 elements, you really shouldn't worry too much about using a lambda; go ahead if it makes your life easier. I know it often makes mine.
rows   | duration
1000   | 10 ms
10000  | 1107 ms
100000 | 69143 ms
As you can see, these existence checks become costly as lists grow. But hey, a list has a .Contains method!
The frailty of Contains
As you probably know, the generic List<T> type implements a Contains method, and it is tempting to believe you can use it to check whether a list, well, contains that object:
for (int i = 0; i < 3; i++)
{
    Data1 objX = GetNewData<Data1>(rnd, pickLength);
    list.Add(objX);
}
var objValueCopy = new Data1 { ID = list[2].ID, Name = list[2].Name };
if (list.Contains(objValueCopy))
{
    // never reached: objValueCopy is a different object reference
}
else
{
    // this branch is taken; Contains compared references, not values
}
As you probably suspect, from knowledge or intuition, no jumping around occurs in the code above: if not helped out, Contains will look to see if that exact object reference is already contained, which is only sometimes useful, and in our case is not.
Contains is clay
But if your Data class defines how it should be compared, you can use Contains to compare in ways other than by reference. Here, Data2 implements IEqualityComparer to define how to test for equality with this class:
public class TestData
{
    public int ID { get; set; }
    public string Name { get; set; }
}

public class Data1 : TestData
{
}

public class Data2 : Data1, IEqualityComparer, IEqualityComparer<Data2>
{
    public bool Equals(Data2 x, Data2 y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }

    public new bool Equals(object x, object y)
    {
        if (x is Data2)
        {
            if (y is Data2)
            {
                return Equals((Data2)x, (Data2)y);
            }
        }
        return false;
    }

    public int GetHashCode(Data2 obj)
    {
        // hash the values of the object passed in, not of 'this'
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }

    public int GetHashCode(object obj)
    {
        if (obj is Data2)
        {
            return GetHashCode((Data2)obj);
        }
        return obj.GetHashCode();
    }
}
Now we can get results using Contains instead:
rows   | duration
1000   | 1 ms
10000  | 186 ms
100000 | 18330 ms
Part Conclusions
Takehome one: Implementing IEqualityComparer can SIGNIFICANTLY increase the performance of existence checks in lists!
Takehome two: The improvements manifest with small lists already and are largest in the beginning, as the sheer number of comparisons ultimately remains a major bottleneck with 100K+ datasets.
Takehome three: Lists do not perform well with big amounts of data when inclusion checks have to be made; if your data grows, consider alternatives to lists.
IEqualityComparer is King
Many different CLR classes honour and use IEqualityComparer. Another, very high performing, data structure is the HashSet:
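The timed loop below operates on a hashSet that has been populated up front; a minimal sketch of that setup, assuming it simply copies the current list contents, could be:

// Assumed setup for the loop below: copy the remaining list elements
// into a HashSet<T> before the timed section starts.
var hashSet = new HashSet<T>(list);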
sw.Start();
foreach (T data in arr)
{
    if (compareMethod == CompareMethod.LinqAnyBothEquals)
    {
        if (!hashSet.Any(s => s.ID == data.ID && s.Name == data.Name))
            hashSet.Add(data);
    }
    else if (compareMethod == CompareMethod.Contains)
    {
        if (!hashSet.Contains(data))
            hashSet.Add(data);
    }
    else
    {
        throw new NotImplementedException("This method is not implemented, comparemethod " + compareMethod);
    }
}
sw.Stop();
rows   | duration
1000   | 0 ms
10000  | 0 ms
100000 | 5 ms
Pretty much warp speed actually, isn't it?
But only on a set of Data2 type objects; Data1 types still suffer:
rows   | duration
1000   | 12 ms
10000  | 754 ms
100000 | 73748 ms
HashSets have another thing up their sleeve
So are we f***ed if we cannot control the data class?
No. You can specify an EqualityComparer when you instantiate the HashSet, and then it is a different tune:
public class TestDataEqualityComparer : IEqualityComparer<TestData>
{
    public bool Equals(TestData x, TestData y)
    {
        return x.ID == y.ID && x.Name == y.Name;
    }

    public int GetHashCode(TestData obj)
    {
        return (obj.ID.ToString() + obj.Name).GetHashCode();
    }
}
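Wiring it in is a single constructor argument; a minimal sketch (the variable name is mine):

// The comparer instance is handed to the HashSet<T> constructor,
// so the Data1 class itself needs no changes.
var hashSet = new HashSet<Data1>(new TestDataEqualityComparer());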
Operating on Data1 objects with an instance of this comparer passed to the constructor gives:
rows   | duration
1000   | 0 ms
10000  | 1 ms
100000 | 28 ms
This is nearly as good!
Part conclusions
Takehome four: It doesn't matter that you use a HashSet; if you compare with a Linq lambda, you're just as damned.
Takehome five: Consider using HashSets for big amounts of data, and provide an EqualityComparer to the constructor if you cannot control the data classes enough to implement one there.
A little word of advice
In this article I have addressed the range inclusion check. It is worth considering that if you do not want to check for full value equality, altering the data classes is not advisable. You can defend moving from reference equality to value equality, but say, for instance, you only want to see whether the ID is already there, because you want to update the objects, or because their value properties could have changed and that is not what decides your program flow. Then you should make a separate comparer with a meaningful name instead; passing in something called TestDataIdentityEqualityComparer keeps your code easy to read and avoids unintended accidents when people use the Data classes in other contexts. It is better to provide custom comparers that way and avoid confusion.
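A sketch of such an ID-only comparer, using the name suggested above (the body is my illustration, not code from the project):

public class TestDataIdentityEqualityComparer : IEqualityComparer<TestData>
{
    // Identity: two elements count as the same when their IDs match,
    // regardless of the current value of Name.
    public bool Equals(TestData x, TestData y)
    {
        return x.ID == y.ID;
    }

    public int GetHashCode(TestData obj)
    {
        return obj.ID.GetHashCode();
    }
}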
Conclusion
If you are using generic lists with more than 100 elements and lambda expressions to test for existing objects with identical values when adding, consider changing to a HashSet, and remember to provide an EqualityComparer if your data class does not implement IEqualityComparer; your code will run faster.
It is debatable where the pain barrier for the change lies, whether at 100 or at 1000 elements, but I hope to have demonstrated that you can implement significant improvements by letting your Data classes implement IEqualityComparer, even if you stick with lists because of their ease of use and the amount of data involved.
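To sum the recipe up in code, as a sketch assembled from the pieces above:

// Before: lambda existence check against a List<TestData>, a linear scan per add.
if (!list.Any(s => s.ID == data.ID && s.Name == data.Name))
    list.Add(data);

// After: a HashSet with a custom comparer; Add already refuses duplicates
// under that comparer, so no separate existence check is needed.
var hashSet = new HashSet<TestData>(new TestDataEqualityComparer());
hashSet.Add(data);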
History
This is version one of this article.
Thomas mostly works as staff in multi-year engagements with companies, but at intervals is available as a software contractor.
Thomas has been a professional programmer since 1998 and has been programming .NET since v1.1 in 2002.
His specialities are C# high throughput systems, data processing and transformation, as well as MSSQL systems from small to VLDB size, and the optimization of code on .NET and T-SQL.
He is a full-stack Microsoft developer with several pillars and quite some years on each of them.
He is married, a father of two boys, and lives in Copenhagen, Denmark.