Timings for Four Directory Traversal Algorithms

gggustafson

4.42/5 (6 votes)

Nov 16, 2012

CPOL

4 min read

23033

200

This article presents the results of timing four directory traversal algorithms.

Download source - 18.9 KB

Introduction

I am currently writing a tool (Transfer) that will copy all files from a specified source directory (and its subdirectory) to a specified target directory, maintaining the source directory structure. The tool will be published on CodeProject when it is completed.

The tool differs significantly from XCOPY in that:

Whether or not to overwrite an existing target file is based upon:
- the source and target file lengths, and, if equal,
- the contents of the files.
Subdirectories can be excluded, either en masse (like /bin) or by specific selection.
Copying files can be limited to specific extensions.

During the research for the way in which to most efficiently traverse the source directory, I performed tests of four algorithms. These algorithms, along with their testing methods, were:

Algorithm       Test Method

GetFiles        retrieve_GetFiles_files
FileSystemInfo  retrieve_FileSystemInfo_files
Stacks          retrieve_Stack_files
Recursion       retrieve_Recursive_files

The Project

Although there may be some valid objections made regarding my choice of algorithm, I limited myself to well-known implementations. For the purpose of timing, I tried to limit each method to the minimum code required.

Before I go too far, the reader should know that this project was developed only to determine which algorithm, based on execution time, I should use in the Transfer tool project. That tool requires the collection of directory and subdirectory names and the collection of all of the names of files that are at or below a specified topmost directory.

Timing Template

Each of the four algorithms was executed a user-specified number of times. The template that I used to time each algorithm was:

using System.Diagnostics; // need for Stopwatch

TimeSpan        elapsed;
Stopwatch       stopwatch = new Stopwatch ( );
int             time_ms;

stopwatch.Start();
for ( int i = 0; ( i < iterations ); i++ )
    {
    if ( [ target file ].Count > 0 )
        {
        [ target file ].Clear ( );
        }
    [ execute an algorithm from the topmost_directory ]
    }
stopwatch.Stop();
elapsed = stopwatch.Elapsed;
time_ms = elapsed.Hours * 3600000 +
          elapsed.Minutes * 60000 +
          elapsed.Seconds * 1000 +
          elapsed.Milliseconds;

Because two algorithms had a recursive component, all of the algorithms required:

use of a List < string > structure, declared at the global level,
preconditioning of their output files.

So, even though a specific algorithm did not require it, all target files were cleared at the beginning of each iteration.

GetFiles Test Method

The output of each algorithm was required to be saved in a List < string > structure. This requirement increased the length of some algorithms whose returned values were in a string [ ] format. For instance, the GetFiles algorithm:

// *********************************** retrieve_GetFiles_files

void retrieve_GetFiles_files ( string root )
    {
    string [ ]      entries;
    
    entries = Directory.GetFiles (
                         root,
                         "*.*",
                         SearchOption.AllDirectories );
                                // convert string [ ] to 
                                // List < string >
    foreach ( string entry in entries )
        {
        FileAttributes attributes = File.GetAttributes ( 
                                                     entry );
                                                     
        if ( ( attributes & FileAttributes.Directory ) ==
             FileAttributes.Directory )
            {
                                // directory, not a file
            }
        else
            {
            GetFiles_files.Add ( entry );
            }
        }
    }

All of the work is performed in the GetFiles invocation, but because the output was required to be a List < string > of files, the GetFiles test method required a change to:

eliminate directories,
convert the output from string [ ] to List < string >.

This conversion added time to the GetFiles algorithm.

FileSystemInfo Test Method

// ***************************** retrieve_FileSystemInfo_files

void retrieve_FileSystemInfo_files ( string root )
    {
    DirectoryInfo       directory_info = null;
    FileSystemInfo [ ]  directory_entries = null;

    directory_info = new DirectoryInfo ( root );

    directory_entries = directory_info.
                                GetFileSystemInfos ( );

    foreach ( FileSystemInfo entry in directory_entries )
        {
        if ( entry is DirectoryInfo )
            {
            retrieve_FileSystemInfo_files ( entry.FullName );
            }
        else if ( entry is FileInfo )
            {
            FileSystemInfo_files.Add ( entry.FullName );
            }
        else
            {

            }
        }
    }

As with the GetFiles invocation, the output from the FileSystemInfo invocation is composed of both directories and files. In addition, it is necessary to recurse through the directories. As a result, it is impossible to record filenames within the retrieve_FileSystemInfo_files method. This is one of the methods that forced the clearing of the output List < string > before invoking the method.

Stacking and Recursion

In searching for algorithms, I came across the Microsoft Developers Network (MSDN) article How to: Iterate Through a Directory Tree (C# Programming Guide).

What I found particularly interesting was an implicit disavowal of the GetFiles and FileSystemInfo methods. For within the article appeared the following (paraphrased) warning:

The weakness in this approach (root.GetDirectories("*.*", System.IO.SearchOption.AllDirectories);) is that if any one of the subdirectories under the specified root causes a DirectoryNotFoundException or UnauthorizedAccessException, the whole method fails and returns no directories. The same is true when you use the GetFiles method. If you have to handle these exceptions on specific subfolders, you must manually walk the directory tree ...

With this warning in mind, I implemented the Stack and Recursive methods, as provided in the MSDN article.

Stack Test Method

I think stacks are the forgotten data structure. I was implementing them in C in the 1970's; used them in Pascal during the 1980's; and then, for some reason, stopped using them. But here, the stack proves its worth.

// ************************************** retrieve_Stack_files

void retrieve_Stack_files ( string root )
    {
                                // names of directories to be 
                                // examined for files
    Stack < string > directories = new Stack < string > ( 20 );

    directories.Push ( root );

    while ( directories.Count > 0 )
        {
        string      current_directory = directories.Pop ( );
        string [ ]  subdirectories;
        
        try
            {
            subdirectories = Directory.GetDirectories ( 
                                        current_directory );
            }
                                // UnauthorizedAccessException 
                                // will be thrown if the pro-
                                // cess does not have dis-
                                // covery permission on a 
                                // directory or file; this 
                                // exception will be ignored
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // DirectoryNotFound may be 
                                // raised, although unlikely;
                                // this occurs if current_dir-
                                // ectory has been deleted by 
                                // another application; this 
                                // exception will be ignored
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
            
        try
            {
            foreach ( string file in Directory.GetFiles ( 
                                        current_directory ) )
                {
                Stack_files.Add ( file );
                }
            }
                                // earlier comments apply
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // earlier comments apply
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
                                // push the subdirectories 
                                // onto the stack for further
                                // traversal
        try
            {
            foreach ( string subdirectory in subdirectories )
                {
                directories.Push ( subdirectory );
                }
            }
                                // earlier comments apply
        catch ( UnauthorizedAccessException e )
            {
            continue;
            }
                                // earlier comments apply
        catch ( DirectoryNotFoundException e )
            {
            continue;
            }
        }
    }

Yes, the source code is somewhat longer than the two preceding algorithms, but I think it may be more readable. But far more important is its robustness. If all that we can retrieve is the first list of subdirectories, then we at least get all of the files beneath each one - we do not fail without returning something. If, however, the reader prefers to throw an exception, that code can be easily added to each of the catch blocks.

Although it is tempting to collapse the try-catch blocks, experimentation shows that the execution time increases. I have no idea why this is the case.

Recursive Test Method

// ********************************** retrieve_Recursive_files
        
void retrieve_Recursive_files ( DirectoryInfo root  )
{
    FileInfo [ ]        files = null;
    DirectoryInfo [ ]   subdirectories = null;
                                // process all the files 
                                // directly under this folder 
    try
    {
        files = root.GetFiles ( "*.*" );
    }
                                // UnauthorizedAccessException 
                                // will be thrown if the pro-
                                // cess does not have dis-
                                // covery permission on a 
                                // directory or file; this 
                                // exception will be ignored
    catch  ( UnauthorizedAccessException e )
    {

    }
                            // DirectoryNotFound may be 
                                // raised, although unlikely;
                                // this occurs if current_dir-
                                // ectory has been deleted by 
                                // another application; this 
                                // exception will be ignored
    catch  ( DirectoryNotFoundException e )
    {
    
    }

    if  ( files != null )
    {
        foreach  ( FileInfo fi in files )
        {
                                // only access the existing 
                                // FileInfo object; if it is 
                                // required to open, delete, 
                                // or modify the file, then a 
                                // try-catch block is required 
                                // to handle the case where 
                                // the file has been deleted 
                                // since the invocation of 
                                // this method
            Recursive_files.Add ( fi.FullName );
        }

                                // Now find all the subdirect-
                                // ories under this directory
        subdirectories = root.GetDirectories ( );

        foreach  ( DirectoryInfo dirInfo in subdirectories )
        {
                                // Recursive call for each 
                                // subdirectory.
            retrieve_Recursive_files ( dirInfo );
        }
    }
}

Timing Results

I ran the ComparingDirectoryTraversalMethods project tool for one and ten iterations, traversing a chosen directory using each of the four algorithms. The results are:

Iterations   GetFiles   FileSystemInfo   Stack   Recursive
    1           3062        4158          5139      6164
   10          31098       41895         51453     61823

Conclusion

This paper has presented timings for four popular directory traversing algorithms.

Considering the warning of the earlier referenced MSDN article and the fact that there is not an order of magnitude difference in execution times, I believe that serious consideration should be given to the Stack and the Recursive algorithms for traversing directory structures.

I have chosen the Stack implementation for the Transfer tool. It will be executed in a thread, separate from the UI thread.

History

11/9/2012 - Original article.