Sorting Variable Length Strings in O(N) Time

Roger Doss

5.00/5 (2 votes)

Nov 7, 2012

CPOL

7 min read

32071

554

The algorithm described herein is designed to implement sorting of variable length strings in O(n) time.

Download source - 126.3 KB

Introduction

The algorithm described herein is designed to implement sorting of variable length strings in O(n) time. The implementation sorts an input file of strings. When processing the input file, this program uses two linked lists, one for the strings, and another for the attributes about the strings. This allows a more elegant implementation for processing strings of variable length, as opposed to static or heap-dynamic arrays. Once the file has been read, the data is transitioned into the desired format, that is, a char array holds all the strings, and another array contains the start and size information for each string.

Algorithm Description

Sorting strings in O(n) time, requires a little thought before it can be accomplished successfully. First, we note that a popular and effective method for sorting integral keys is Radix sort[1]. And we also note that Radix sort can be customized in a number of ways. In this case, we are being asked to sort variable length strings, with possible values being A-Z. That is, varying length keys, with Radix 26. The character '.' is not being considered a potentially value, and is used only to aid in the processing of the strings from the input file.

To accomplish this particular sorting problem, we need to first sort the strings based on their length, then build an array of indexes corresponding to sub-lists of strings of a given length. This allows us to sort alphabetically, considering those strings of length > w, were w is going to range from k-1 down to 0. In this way, we explicitly consider only those strings that are big enough to be considered with respect to size at each pass. This directly implies that we will retain the O(N) property of Radix sort for the task at hand.

Looking at the code, we see a properly coded radix sort for sorting the strings based on their length. This routine explicitly does not use linked lists for its buckets, as that may incur overhead of run-time memory allocation for each bucket. We opt to dynamically allocate a single array, and work from there. What we are doing is first initializing an array, called counts to zero, then mapping the occurrence of the hexadecimal digit i=k-1 down to 0, to its appropriate bucket and incrementing the number of occurrences of a particular digit. We then pass through, and accumulate the appropriate indexes, corresponding to correct positions in the array tmp[], based on the w'th digit. We then map the elements to their respective buckets, using the indexes we created. And when we are done, we concatenate the strings placing them from tmp[] into A[]. When w == 0, we will have performed a Radix sort based on size.

Once we have the list sorted by size, we need to obtain the starting indexes for each string of size > k-1 down to 0. This is trivial, as we have a sorted list, we simply apply the counting mechanism employed in the bucket sorting phase of Radix sort, and then calculate the indexes. These indexes, stored in sizes[], correspond to sublists, of strings that can be sorted at each pass of the alphabetical Radix sort, for any given value of w, in the range of k -1 down to 0.

We then call a specialized Radix sort to sort alphabetically, this is done by considering ONLY elements of size > w at any given time, the statement:

lower = sizes[w];

allows us to only consider the w'th character in that sublist for strings large enough to be considered, that is, strings that actually have a w'th character. When we are done, at each pass, we concatenate everything back into A[], considering only those slots that range from lower to n -1, as those are the ones we have sorted. When w ==0, the strings are now sorted in alphabetical order. We also note that we have not changed the ordering of size, therefore, the string ABS will always come before ABSOLUTE, as desired.

NOTE: The obvious, and incorrect answers for this problem are the following:

Explicitly padding the strings to be the size of the maximal length string results in O(n^2) runtime.
Considering all the strings for the w'th character when some strings are not large enough. We could have tried to sort based on this, and for example considered a zero character when strings are too short; however, this is incorrect, as when w is large, we may inadvertently sort strings which do not need to be sorted. Consider when n-1 strings are size,1, and the n'th string is size n. This approach would make the algorithm O(n^2) as it effectively is the same as padding, without the overhead of explicitly increasing size of each string with zero characters.

Algorithm Analysis

Time Complexity

First, the radix sort based on string size, takes O(n + k) time. In this particular implementation, k is 8 (8 * 4 => 32 bit int when radix is 16). This is a constant independent of n, hence sorting the strings based on their size via Radix sort is O(n).

Second, we use a variation of bucket sort, to obtain an array of starting addresses of elements of length 1,2,3,... k. This is O(n).

Third, when we do the radix sort of the strings based on alphabetical order, we have a list ordered by size, and an array of starting addresses for each set of strings that happen to be > w in length, thus, we proceed for w = k -1 (k -1 being the maximal length string, LSD first) and only consider the set of strings > w, that is, lower = sizes[w]; When we are done, we have sorted the list. This routine is O(n + k). Even though k is going to vary in length due to the fact that the strings are of varying length, k is treated as constant with respect to doing the sort on the k'th character during each pass as only individual sublists large enough to be considered are sorted at any given value of k. We do not consider elements too small, that is, when A[i].T < w, nor do we explicitly pad the strings to be of a fixed size. When we are done at each pass, we concatenate the results into A[] for only those strings that we have sorted. With this and the fact that k is independent of n, the total time complexity for this routine is going to be O(n).

Thus we have, O(n) + O(n) + O(n) for the size radix sort, for the counting of the indexes, and for the alphabetical radix sort respectively. This algorithm is O(n).

NOTE: The strings themselves are not touched except for the consideration of the w'th character, which takes O(1) time to accomplish, so it play no part in the sorting. To be more explicit, during the sorting, we only access the strings by doing the operation CH(i,w) == C[A[i].S + w], which takes O(1) time. We do not move the strings from their location in C[], only their attributes A[] are moved, as described in the handout.

Space Complexity

The algorithm uses O(n) space to store the strings in C[]. O(n) space to store the attributes of the strings in A[]( starting indexes and size ) and during each step of the algorithm in execution, that is, for the steps described above, we use array's of either O(k), O(n), O(RADIX +1) space for sizes[], tmp[], and count[], respectively. The total space complexity is O(n).

Using the Code

The algorithm uses a Radix-Sort[1] to sort the strings based on size, with a Radix of 16, ie, Hexadecimal. The reason is, we can binary shift the bits by a factor of four and bitwise mask with the number 15 ( binary all ones in positions 3,2,1,0 respectively ) to obtain the next digit in Hex ( see ./include/sort.h). This is considerably faster than using 10 as a Radix. But is machine-dependent, if you are using a BIG-ENDIAN or LITTLE-ENDIAN machine, kindly define the appropriate included macro. If you don't, YOU WILL NOT BE ABLE TO COMPILE.

Files:

            Makefile         -- Makefile to build on Linux/Unix
            f                     -- Sample input file
            include            -- Include directory containing headers
            include/sort.h  -- Header for sorting routines
            include/list.h   -- Header for linked list code
            include/defs.h -- Defines  used by this program
            include/types.h-- Typedefs used by this program
            list.c                -- Linked list code for processing input file f
            main.c             -- function main
            sort.c               -- The Algorithm, and supporting code

References:

[1] Sedgewick, R. 1997. Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching. 3^rd Edition. Addison-Wesley