Click here to Skip to main content
15,888,113 members
Please Sign up or sign in to vote.
3.00/5 (1 vote)
Hello, everybody. The code below, works very slowly, about 10 seconds per iteration. How to improve him?

C#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

namespace Compare
{
    public class Comparer
    {
        public static List<string> CompareLength(string file, List<string> arr)
        {
            List<string> result = new List<string>();
            foreach (string item in arr)
                if (new FileInfo(item).Length == new FileInfo(file).Length)
                    result.Add(item);
            return result;
        }
        public static void CompareFilesRec(List<string> array)
        {
            List<string> outp = new List<string>();

            int i = 0;
            while (i < array.Count)
            {
                List<string> fromAll = Comparer.CompareLength(array[i], array);
                string chosen = array[i];
                if (fromAll.Count != 0)
                    foreach (var f2 in fromAll)
                    {
                        if (Checksum.CompareHash(chosen, f2))
                        {
                            outp.Add(f2);
                            array.Remove(f2);
                        }
                        else { i++; }

                    }
                Comparer.ShowResults(outp);
                outp.Clear();
            }
        }
        public static void ShowResults(List<string> results)
        {
            if (results.Count >= 2)
            {
                foreach (var element in results)
                {
                    Console.WriteLine(element);
                }
                Console.WriteLine();
            }
        }
    }
}


Checksum.CompareHash
C#
static public bool CompareHash(string file1, string file2)
{
    long FirstSize = new FileInfo(file1).Length;
    long SecondSize = new FileInfo(file2).Length;

    byte[] buffer1 = new byte[FirstSize];
    byte[] buffer2 = new byte[SecondSize];

    try
    {
        using (var mmf1 = MemoryMappedFile.CreateFromFile(file1, FileMode.OpenOrCreate, null, 0))
        using (var mmf2 = MemoryMappedFile.CreateFromFile(file2, FileMode.OpenOrCreate, null, 0))
        using (var reader1 = mmf1.CreateViewStream())
        using (var reader2 = mmf2.CreateViewStream())
        {
            try
            {
                reader1.Read(buffer1, 0, (int)FirstSize);
                reader2.Read(buffer2, 0, (int)SecondSize);
            }
            catch (Exception ex)
            {
                Console.WriteLine("Error: {0}", ex.Message);
            }


            if (buffer1.Length != buffer2.Length || Crc8.ComputeChecksum(buffer1) != Crc8.ComputeChecksum(buffer2))
                return false;
        }
    }
    catch (IOException) { }
    catch (UnauthorizedAccessException) { }
    return true;
}



So, I realized that solution, check it:
C#
namespace Compare
{
    public class Comparer
    {
        public static List<string> CompareLength(string file, List<string> arr)
        {
            List<string> result = new List<string>();
            foreach (string item in arr)
                if (new FileInfo(item).Length == new FileInfo(file).Length && Checksum.CompareHash(file,item))//here I added hash check. To avoid another loop or something.
                    result.Add(item);
            return result;
        }
        public static void CompareFilesRec(List<string> array)
        {
            Dictionary<long, List<string>> di = new Dictionary<long,List<string>>();

            int i = 0;
            while (i < array.Count)
            {
                long size = new FileInfo(array[i]).Length;

                List<string> fromAll = Comparer.CompareLength(array[i], array);

                if (di.ContainsKey(size))
                {
                    fromAll.Clear();
                    array.Remove(array[i]);
                }
                else
                {
                    di.Add(size, fromAll);
                    i++;
                }
            }
            ShowResults(di);
        }
        public static void ShowResults(Dictionary<long, List<string>> dic)
        {
            foreach(var element in dic)
            {
                foreach (var outer in element.Value)
                {
                    Console.WriteLine(outer);
                }
                Console.WriteLine();
            }
        }
    }
}


Also i changed the hashing function:
C#
namespace Compare
{
    public class Checksum
    {
        static public bool CompareHash(string file1, string file2)
        {
            long FirstSize = new FileInfo(file1).Length;
            long SecondSize = new FileInfo(file2).Length;

            SHA256 sha1 = SHA256Managed.Create();
            SHA256 sha2 = SHA256Managed.Create();

            byte[] buffer1 = new byte[FirstSize];
            byte[] buffer2 = new byte[SecondSize];

            try
            {
                using (var mmf1 = MemoryMappedFile.CreateFromFile(file1, FileMode.OpenOrCreate, null, 0))
                using (var mmf2 = MemoryMappedFile.CreateFromFile(file2, FileMode.OpenOrCreate, null, 0))
                using (var reader1 = mmf1.CreateViewStream())
                using (var reader2 = mmf2.CreateViewStream())
                {
                    try
                    {
                        reader1.Read(buffer1, 0, (int)FirstSize);
                        reader2.Read(buffer2, 0, (int)SecondSize);
                    }
                    catch (Exception ex)
                    {
                        Console.WriteLine("Error: {0}", ex.Message);
                    }


                    if (buffer1.Length != buffer2.Length && sha1.ComputeHash(buffer1) != sha2.ComputeHash(buffer2))
                        return false;
                }
            }
            catch (IOException) { }
            catch (UnauthorizedAccessException) { }
            return true;
        }
    }
}


Please, tell me my errors in code above...
Posted
Updated 6-Apr-13 4:55am
v4
Comments
Sergey Alexandrovich Kryukov 3-Apr-13 15:43pm    
Please, what's the purpose of comparison? Same exact content or not? Or something else?

Of course, there are some problems already apparent. Why comparing hash? Why passing file names to length comparison? You are wasting some time on it.

—SA
Je7 3-Apr-13 15:47pm    
Yes, same exact content is the purpose of comparison.

The logic: we receive simple list of files -> get element from list -> compare chosen's file length with another files, with the same length -> collect similar files and output them.

What you can advice to do?
Sergey Alexandrovich Kryukov 3-Apr-13 16:03pm    
No, "similar" is not "exact same content". Can you strictly formulate it. Do you want to sort the list into sets of identical files... It's hard to understand, what exactly.
Anyway, in your code, there is an huge amount of repeated operations: getting file info again and again, passing file names instead of some data. Where is Checksum.CompareHash?..
—SA
Je7 3-Apr-13 16:23pm    
Ok. I need "exact same content". Checksum.CompareHash added above.
Matt T Heffron 3-Apr-13 16:25pm    
You didn't complete the edit of the question above. we don't see the Checksum.CompareHash...
[EDIT] Now it's there.[/EDIT]

1 solution

No code, just suggestions/hints.
Ok, first remove redundant work. (There's plenty of it.)

Create a Dictionary<ulong, List<string>>
Make one pass through the list of all filenames.
For each filename, get the file length.
Use the .TryGetValue to get the list of files with that length.
If there is no such list, then create a new List<string> and insert it into the Dictionary.
Add the filename to the list.

After all of the filenames have been "processed".
Go through all of the .Values in the Dictionary.
If the length is >1 then perform a similar process on each filename in the list, with the checksum/hash as the Dictionary key instead of the length (use a different Dictionary).
All of the lists in this Dictionary with more than one entry are sets of identical files.

Length checking is done only once per file and the checksum/hash is computed only once per file (and only for the files that could possibly be matched).
 
Share this answer
 
v4
Comments
Sergey Alexandrovich Kryukov 3-Apr-13 16:45pm    
Yes, there is too much of redundancy, but finding an near-optimum implementation requires good thinking and work. Your suggestions make sense, my 5.
—SA
Matt T Heffron 3-Apr-13 16:48pm    
Thanks.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900