Removing duplicates from a notepad file

Question


Let us say you have a Notepad file containing the following lines. I have to remove the duplicates, and I have partially achieved it: my program below works and prints the result to the console. Notice that "user1, user2" appears twice; the repeat should be removed, and it is. However, I also have to handle another scenario: "user2, user1" should be removed as well, and my program does not do that.

user1, user2
user3, user1
user1, user2
user5, user6
user2, user1

Below is the program:

C#

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ex
{
    class Program
    {
        static void Main(string[] args)
        {
            string path = @"C:\Users\Documents\Visual Studio 2010\Friends.txt";
            List<string> lines = new List<string>();

            // Read the file line by line; the using block closes the reader.
            using (StreamReader sr = new StreamReader(path))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    lines.Add(line);
                }
            }

            // Distinct() compares whole strings, so it removes the repeated
            // "user1, user2" but still treats "user2, user1" as a different line.
            List<string> removingduplicates = lines.Distinct().ToList();

            foreach (string item in removingduplicates)
            {
                Console.WriteLine(item);
            }
        }
    }
}

Posted 21-Aug-15 3:00am

ShaHam11


Comments

Sergey Alexandrovich Kryukov 21-Aug-15 12:02pm

"Notepad file" is nonsense. This is the same as saying: "Microsoft digit 7".
—SA

2 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

OriginalGriff · Answer 1 · 2015-08-21T04:11:00

If you have to handle "user1, user2" as matching "user2, user1", then you will have to be a bit more creative.

But...this is your homework, so no code!

Start by reading your lines, and using Split to "break" them into a left-of-the-comma and a right-of-the-comma part. Use Trim to remove any miscellaneous spaces.
Sort the parts so they are always in the same order.
Rebuild your strings, using string.Join to add the comma and space back in.
Now you can remove your duplicates.
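
For readers who want to see those four steps spelled out, here is a minimal sketch. The class and method names (PairNormalizer, Normalize, DistinctPairs) are made up for illustration, and the sample lines are inlined rather than read from the file. It assumes every line is a comma-separated list of names and that an ordinal sort is an acceptable way to fix the order.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class PairNormalizer
{
    // Split on the comma, trim each part, sort the parts, and rejoin them,
    // so "user2, user1" and "user1, user2" end up as the same string.
    public static string Normalize(string line)
    {
        string[] parts = line.Split(',')
                             .Select(p => p.Trim())
                             .OrderBy(p => p, StringComparer.Ordinal)
                             .ToArray();
        return string.Join(", ", parts);
    }

    // Normalize every non-blank line, then drop the repeats.
    public static List<string> DistinctPairs(IEnumerable<string> lines)
    {
        return lines.Where(l => !string.IsNullOrWhiteSpace(l))
                    .Select(l => Normalize(l))
                    .Distinct()
                    .ToList();
    }
}

public static class NormalizeDemo
{
    public static void Main()
    {
        // Sample data from the question, inlined instead of read from a file.
        string[] lines =
        {
            "user1, user2", "user3, user1", "user1, user2",
            "user5, user6", "user2, user1"
        };

        foreach (string pair in PairNormalizer.DistinctPairs(lines))
        {
            Console.WriteLine(pair);
        }
    }
}
```

One consequence of this approach is that the output shows the normalized form, so "user3, user1" is printed as "user1, user3". If the original spelling of each line has to be preserved, a comparer-based approach is a better fit.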

Matt T Heffron · Answer 2 · 2015-08-21T08:26:00

Another option, similar to Griff's, would be to process the file line by line without loading the whole thing into memory*:
Use File.ReadLines(path) to get an IEnumerable<string> for the input.
Pass that through .Distinct(IEqualityComparer<string>), which gives another IEnumerable<string> for the output.
Then you can use File.WriteAllLines(path, IEnumerable<string>) to write an output file (a different file from the input, since the input is still being read lazily), or use a foreach loop to write the lines to the Console.
So now the exercise is to write a small class that implements IEqualityComparer<string>. It can split each string into its parts and use whatever a priori information you have about them to check whether they match. It must also ensure that matching inputs return the same hash code from GetHashCode.
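
One possible shape for such a comparer is sketched below. The class name UnorderedPairComparer and its private Key helper are hypothetical, and the sketch assumes the only a priori knowledge is that each line is a comma-separated list of names whose order and spacing do not matter. The demo uses an in-memory array so it runs without a file; File.ReadLines(path) could be substituted as the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical comparer: two lines are equal when they contain the same
// comma-separated names, regardless of order or surrounding spaces.
public sealed class UnorderedPairComparer : IEqualityComparer<string>
{
    // Build an order-independent key: trim the parts, sort them, rejoin.
    private static string Key(string line)
    {
        IEnumerable<string> parts = line.Split(',')
                                        .Select(p => p.Trim())
                                        .OrderBy(p => p, StringComparer.Ordinal);
        return string.Join(",", parts);
    }

    public bool Equals(string x, string y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return Key(x) == Key(y);
    }

    // Hash the same key that Equals compares, so lines that are equal
    // under this comparer always produce the same hash code.
    public int GetHashCode(string line)
    {
        return line == null ? 0 : Key(line).GetHashCode();
    }
}

public static class ComparerDemo
{
    public static void Main()
    {
        // Stand-in for File.ReadLines(path); the data is from the question.
        IEnumerable<string> input = new[]
        {
            "user1, user2", "user3, user1", "user1, user2",
            "user5, user6", "user2, user1"
        };

        foreach (string line in input.Distinct(new UnorderedPairComparer()))
        {
            Console.WriteLine(line);
        }
    }
}
```

Unlike rebuilding the strings, this leaves each surviving line exactly as it was written in the input. The key is recomputed on every comparison, which is simple but not the cheapest option.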

There are a couple of other optimizations I can think of, but I'll leave those as "exercises".

* .Distinct() does internally build a collection holding one entry for each unique string, but that is (potentially) much smaller than the whole file, and definitely smaller than holding both the whole-file list and the .Distinct() internal collection at the same time.
