Another option, similar to Griff's, would be to process line by line
without loading the whole file into memory*:
Use File.ReadLines(path) to get an IEnumerable<string> for the input, then pass that through .Distinct(IEqualityComparer<string>), which gives another IEnumerable<string> for the output. Then you can use File.WriteAllLines(path, IEnumerable<string>) to make an output file, or use a foreach loop to write all the lines to the Console.
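As a rough sketch of that pipeline (not a definitive implementation): the class name DedupeSketch, the method name WriteDistinctLines and the command-line handling are made up for illustration, and the comparer is passed in so any IEqualityComparer<string> can be used.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper class; the names here are illustrative only.
static class DedupeSketch
{
    // Reads inputPath lazily, drops lines the comparer considers duplicates,
    // and writes the remaining lines to outputPath.
    // outputPath must be a different file from inputPath, because the input
    // is still being read while the output is being written.
    public static void WriteDistinctLines(
        string inputPath, string outputPath, IEqualityComparer<string> comparer)
    {
        IEnumerable<string> input = File.ReadLines(inputPath);   // nothing is read yet
        IEnumerable<string> output = input.Distinct(comparer);   // still lazy
        File.WriteAllLines(outputPath, output);                  // pulls lines through one at a time
    }

    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: DedupeSketch <input> <output>");
            return;
        }
        // A built-in comparer stands in here for the custom one.
        WriteDistinctLines(args[0], args[1], StringComparer.OrdinalIgnoreCase);
    }
}
```

To write to the Console instead, replace the File.WriteAllLines call with a foreach over output that calls Console.WriteLine on each line.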
So now the exercise is to write a small class that implements IEqualityComparer<string>. Its Equals method can split each line into its parts and use whatever a priori information you have about them to decide whether two lines match. Its GetHashCode method must return the same value for any two lines that Equals treats as matching, otherwise .Distinct() will not detect them as duplicates.
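A minimal sketch of such a comparer follows. The line format is an assumption made purely for illustration: it treats each line as comma-separated and compares only the first field, ignoring case and surrounding whitespace. The class name FirstFieldComparer and the helper Key are hypothetical; substitute whatever rule fits your actual data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical comparer: two lines are "equal" when their first
// comma-separated field matches (assumed format, not from the question).
sealed class FirstFieldComparer : IEqualityComparer<string>
{
    // Extracts the part of the line used for comparison.
    // A null line is treated like an empty one to keep the sketch simple.
    private static string Key(string line)
    {
        if (line == null) return string.Empty;
        int comma = line.IndexOf(',');
        string field = comma < 0 ? line : line.Substring(0, comma);
        return field.Trim();
    }

    public bool Equals(string x, string y)
    {
        return StringComparer.OrdinalIgnoreCase.Equals(Key(x), Key(y));
    }

    // Hashes the same key that Equals compares, with the same comparer,
    // so matching lines always produce the same hash code.
    public int GetHashCode(string line)
    {
        return StringComparer.OrdinalIgnoreCase.GetHashCode(Key(line));
    }
}
```

It would then be used as File.ReadLines(path).Distinct(new FirstFieldComparer()). Deriving both Equals and GetHashCode from one Key function is a simple way to keep the two consistent.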
There are a couple of other optimizations I can think of, but I'll leave those as "exercises".
* .Distinct() does internally build a representation that collects one entry for each unique string, but this is (potentially) much smaller than the whole file, and definitely smaller than holding both the whole-file collection and the .Distinct() internal representation at the same time.