This article demonstrates the implementation of a smart cross-platform textdiff lexicographical analyzer utility which can be used to find and locate the differences between two files.
Introduction
It’s more than 40 years since in 1970s Douglas McIlroy and James W. Hunt, developers in AT&T Bell Labs (Murrey Hill, New Jersey), introduced the first release of diff utility as part of the 5th-gen Unix V distribution. diff is a small file comparison utility that performs the lexicographic analysis to find and locate the differences between two files as the result of changes being made to the contents of these files by a program during its execution.
Initially, the implementation of the diff utility was mainly based on using the either heuristic or Levinshtein distance probability characters-based algorithms. The main disadvantage of these algorithms is that the following algorithm actually perform a prediction of the characters to be inserted, deleted or modified at a certain position in a string based on each character’s probability computation, which makes it impossible to use the following algorithm to analyze and compare short strings since that to compute the distance between two predictable characters by using the following algorithms, we must have the information enough about the probability of each preceding character to be transformed into another one by performing those character’s insertion, deletion or modification, so that as many characters have been previously predicted with a certain probability as the most precise value of the edit-distance can be computed for the current character by using the Levinshtein distance algorithm. That’s actually why, the Levinshtein and heuristic algorithms are considered to be the least efficient to compare the short character’s sequences.
In this article, we’ll discuss about the algorithm that, unlike the Levinshtein distance and heuristic algorithms, allows to find the differences between two files based on the performing of context search to locate similar substrings persisting in both files using offset-based context search algorithm.
The following article also contains the ANSI C implementation of the alternative textdiff utility using the algorithm being discussed. The textdiff utility being introduced in the following article was developed and tested on the Windows platform, but its source code can also be used to maintain the following utility for the other platforms such as Linux or iMac. Also the following source code uses the 64-bit data types such as __int64
to provide the support for comparing and analyzing files having size of more than 1GB.
Since the development of file comparison and lexicographic analysis tools such as textdiff is still relevant and being a point of interest, there’re the large number of the similar products released by the different vendors including IDM UltraCompare, WinMerge as well as the DiffNow and DiffChecker online tools. Each of the mentioned above products has been implemented by using different algorithms providing the different results of files comparison and analysis.
The main goal of the following article is to demonstrate the algorithm which is capable of providing the optimal results of the file lexicographical comparison and analysis compared to the other existing text comparison tools and utilities.
As well as the basic algorithm discussion, in this article, we’ll perform the analysis of the source code of the textdiff utility being introduced by providing code snippets written in C to each phase of the algorithm being discussed throughout in this article.
Background
In this section, we’ll discuss more about the algorithm that allows to compare and analyze the contents of two files by performing a context search to find and compute the position of the all non-identical fragments that might persist as the result of the changes being made to the contents of these two files. The following algorithm can be used to perform the comparison of either plain-text or binary files.
Algorithm of Finding and Locating the Differences Between Two Files
The algorithm being discussed in this article is intended to lexicographically compare the contents of two files stored into the two previously allocated string buffers str1[]
and str2[]
at the beginning of code execution. To demonstrate the process of files comparison by using the following algorithm, we’ll use two plain-text files, each one containing a short text string that consists of the number of either identical or non-identical fragments. The non-identical fragments are typically those fragments contained in the first str1 []
buffer that are actually missing from the second buffer str2 []
.
Before we begin, let’s go over some considerations that further will be used during the following algorithm discussion. Suppose we need to identify the all non-identical fragments contained in the first str1[]
buffer. For that purpose, we’ll be using the second str2 []
buffer to find the fragments persisting in both str1[]
and str2 []
buffers by performing the context search and actually exclude them to locate the starting and final positions of those non-identical fragments from str1 []
not contained in str2 []
. Since that, we consider the second str2[]
buffer to be a “pattern” of search. And vice versa, if we need to locate the same non-identical fragments from the str2[]
buffer we’re treating the first str1[]
buffer as the “pattern” of search.
Finding and Locating Non-Identical Fragments
Now, it’s time to take a closer look at the algorithm itself being discussed. Since we have two string buffers string []
and pattern []
which have not been previously analyzed, to find the first non-identical fragment, according to the following algorithm, we should perform loop iteration over the two string buffers, either string []
or pattern []
, starting at the position index == 0
of the first character of these two buffers, lexicographically comparing each character in string []
buffer located at the current position index with a corresponding character from the pattern []
buffer until we’ve found a pair of unmatched characters located at the same position. During each loop iteration, the loop counter variable index is incremented and at the end of loop execution, the following variable is assigned to the position value of the first occurrence of a pair of characters from either string []
or pattern []
buffer that don’t actually match. The position value of index variable is actually an absolute value of the starting position for current non-identical fragment being found. Fig. 3. Illustrates the process described above:
In this case, the two characters “T
” from string []
buffer and “D
” from pattern []
located at the position index == 6
don’t match which indicates that the current non-identical fragment will start at the position index of the string []
buffer. The following code snippet implements the linear search algorithm that aims to find the first occurrence of a pair of characters that don’t actually match starting at the beginning of either the string []
and pattern []
buffers:
size_t index = 0;
while (string[index] == pattern[index] &&
string[index] != '\0' && pattern[index] != '\0') index++;
At this point, since we’ve obtained the position value at which the current non-identical fragment starts, to locate this fragment, we also need to compute the value of its final position by performing the offset-based context search to find the starting position of the last longest occurrence of substring obtained from the pattern []
buffer starting at the position index, located at a certain position of the string []
buffer.
Let’s recall that, we basically deal with two special cases in which the non-identical fragments of these two string buffers could be found and located. In the first case, we’re performing the offset-based context search to find non-identical fragments that start at a certain position and can only be found and located in the string []
buffer. The non-identical fragments of this type are typically called – “insertions”. Similarly, we will be performing the similar offset-based context search to find non-identical “insertion” fragments contained only in the pattern []
buffer starting at the same position as in the previous case. Also, there’s another case in which a pair of two non-identical fragments starting at a certain position in either the string []
or pattern []
buffers causing each buffer to have different characters at the same positions, and at the same time, having different lengths. The fragments of this particular type are also called “overwrites” or “dual insertions”.
Now, let’s take a closer look at the algorithm being discussed. At this point, we initially start finding the final position of the first non-identical fragment by performing offset-based context search in the string []
buffer, which is quite similar to the trivial context search but with only one difference that the offset-bases context search is mainly based on performing the number of trivial context searches for each value of the offset loop counter variable.
Now it’s time to demonstrate how to find the final position of the current non-identical fragment from the previous example assuming that the following fragment is the “insertion” fragment persisting only in string []
buffer by performing the mentioned above offset-based context search.
In the following example, to demonstrate the process of the trivial context search performed by the offset-based context search algorithm for each offset value, we’ll initially assume that the final position of the first non-identical fragment can be obviously found by locating the starting position of a substring extracted from the pattern []
buffer having offset variable value equal to 0
. In this particular case, using the offset-based context search is equivalent to performing the trivial context search, since we need to execute just one iteration of the outermost loop, for a single value of the variable offset == 0
.
At this point, the first thing that we should do is to extract all characters from the substring in the pattern []
buffer starting at the position index == 6
and copy it into previously allocated temporary buffer. For example, if the current non-identical fragment starts at the position index == 6
, then finally the temporary buffer will contain the substring “DOG JUMPS OVER A NICE FOX”.
By performing the trivial context search, we need to iterate over the string []
buffer starting at the position index and for each position pos
in the string []
buffer assign the values of the current position pos
in the string []
buffer and the position of the first character in the temporary buffer to the loop counter variables pos_s
and pos_p
respectively.
Then, we have to perform loop iteration over the string []
and the temporary buffer comparing each succeeding character in string []
buffer at the position pos_s
with a corresponding character from the temporary buffer located at the position pos_p
until we’ve found a character located at the position pos_s
in the string []
buffer which doesn’t match a character located at the position pos_p
in the temporary buffer. During the loop execution, we’re incrementing the values of pos_s
and pos_p
variables so that at the end of the following loop execution, the variables pos_s
and pos_p
will be assigned to the values of the positions of characters that don’t match in either the string []
or the temporary buffer.
During the context search described above, we don’t actually perform a search for the entire substring from the pattern []
buffer that exactly matches the substring starting at a certain position pos of the string []
buffer. Instead, we’re performing the context search by a partial match to find the position of last longest occurrence of a character’s sequence matching in both the pattern []
buffer and the string []
buffer starting at certain positions. After we’ve computed the value of longest matching substring length for each position pos
in the string []
buffer, we have to determine the position value pos
of the substring in string []
buffer for which the value of the substring length is the maximum.
For that purpose, iterating over the string []
buffer for each position pos
for which the longest substring’s occurrence position is being found, we’re performing a check if the computed value of the current substring length is greater than the value of the maximum length so far. If so, we assign the length value of the substring at the current position to that variable used to store the currently maximum length value. Similarly, if the length of the current substring being found is greater than the maximum length value, we assign the value of the starting position pos for the current substring to max_pos
variable. So that, finally at the end of loop execution, the following variable will be assigned to a value of the position pos
of the last longest substring found in string []
buffer matching the substring from the pattern []
buffer starting at the position index. Fig. 4. Illustrates the process discussed above:
As an alternative, we might not copy the sequence of characters starting at the position index into a previously allocated temporary buffer. Instead, since the value of the position index has been obtained, we can treat the sequence of characters in pattern []
buffer located at the positions which are greater or equal to the position index as the substring for which we’ll be finding the occurrence in the string []
buffer. In this case, actually we need to perform the same context search as it has been already discussed above, but assuming that the first character of the temporary buffer is actually located at the position index in the pattern []
buffer. Then, we’re comparing each character in pattern []
buffer starting at the position index with a corresponding character at string []
buffer located at the position pos_s
as it has already been discussed.
For example, for each character starting at the character “T” in the string []
buffer we’re performing a check if it’s equal to the character “D” in the substring contained within the pattern []
buffer. During each iteration, we’re similarly comparing the characters “T” and “D”, and then “I” and “D” until we encounter that the character “D” at the position pos_s == 8
in the string []
buffer exactly matches the same character “D” located at the position index in the pattern []
buffer. At this point, we’re comparing succeeding characters starting at the positions pos_s = 8
and pos_p = 6
in both string []
and pattern []
buffers respectively ending up at the position pos_s = 9
and pos_p = 7
since that the two characters “E” and “O” are not the equal characters. Obviously that the following substring contains only one character and has the length of 1. As it has already been discussed above since this is the first value of length, we normally are assigning the value of length of the current substring to the variable max_len = 1
. Respectively, we’re assigning the value of the starting position pos of the longest occurrence of the substring to the max_pos
variable so that max_pos = 8
.
Now we’re proceeding to compare each next character from the string []
buffer with the character located at the position index == 6 at the pattern []
buffer. We’re normally comparing the characters “D” and “E”, “D” and “<space>” until we’ve found that the two characters at the positions pos == 11
and position index == 6
are identical. At this point, we’re similarly comparing each pair of characters starting at the positions pos_s = 10
and pos_e == 6
until we’ve found two characters “L” and “N” that don’t match at the position pos_s == 28
in the string []
buffer and pos_p == 23
in the pattern []
buffer respectively. Notice that all pairs of characters located between the position index and the pos_p
in the pattern []
buffer exactly match. Similarly, as has been discussed above, at this point, we need to compute the length of the substring obtained by subtracting the value of index variable from the value of the pos_p
variable: pos_p – index = 23 – 6 = 17
. As we can see, the length value of the current substring is much greater than the value of the max_len
variable being previously obtained. In this case, we assign the value of difference pos_p – index
to the max_len
variable so that max_len = 17
. Also, we assign the value of the position pos
of the substring being found to the max_pos
variable so that max_pos = pos = 11
.
After we have obtained the final position of the first non-identical fragment by locating the last longest occurrence of the substring extracted from the pattern []
buffer in the string []
buffer starting at the position index + offset
, now we proceed the iterations by performing the discussed above trivial context search for each succeeding value of the offset variable.
By performing the offset-based context search, for each value of the offset variable during each loop iteration, we’re actually extracting a particular substring from the pattern []
buffer starting at the position index + offset and for each particular string, at the end of each iteration, we’re performing a check if the value of length of the last longest occurrence of the substring being extracted max_len
for a specific value of the offset is greater than the length of the substring having the maximum possible length so far. If so, we’re assigning the values of max_len
to the chunk_info[0].max_len
variable and proceed with another substring for the next value of the offset variable, so that, finally, the chunk_info[0].max_len
variable will be assigned to the maximum possible length value of the last occurrence of the substring being extracted from the pattern []
buffer, found in the string []
buffer during the performing of the trivial context search for a specific value of the offset variable. Note that, unlike finding the “insertions” discussed above, in this particular case, by performing the offset-based context search, we need to find the last longest occurrence of the substring located starting at the minimal position, having the minimal distance to the position index. That’s actually why, during each iteration, for each value of the offset variable, we also compare the value of the position for the current substring with the value of the currently minimal position so far. In the other words, if a certain substring has the greater length than the current maximum length so far and at the same time start at the position closer to the position index, the variables chunk_info[0].max_pos
and chunk_info[0].max_len
are assigned to the values of the position and length of the current substring for the current value of the offset variable.
Since the final position value for the current non-identical fragment has been obtained and assigned to the chunk_info[0].max_pos
variable, then we’re performing the check if the current non-identical fragment is actually the “insertion” fragment by comparing the value of the position index with the value of pos_p
variable. If these values are equal, then the following non-identical fragment is actually the “insertion” fragment and persists only in the first string []
buffer.
Now at this point, we have to extract the non-identical fragment being found using the values of the position index and the value of final position assigned to the chunk_info[0].max_pos
variable by copying the following fragment from the string []
buffer into the separate buffer diff. In this case, the non-identical fragment being found is “TIDE”.
char* diff = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[0].max_pos - index,
(const void*)&string[index], chunk_info[0].max_pos - index);
char* diff = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[1].max_pos - index,
(const void*)&pattern[index], chunk_info[1].max_pos - index);
Similarly, at this point, we’ll use the same trivial context algorithm to obtain the value of the final position of the current no-identical fragment contained within the pattern []
buffer starting at the same position index == 6
. As we already know, for that purpose, we’ll consider that the string []
buffer is actually a pattern of search. To demonstrate the following process, let’s now use a bit modified “pattern” string that contains the character “T” at the position 28 of the pattern []
buffer.
By performing the context search in the pattern []
buffer, we’re finally locating the last longest occurrence of the substring that is starting at the position pos_p = 28
and consists of just one character “T” within. In this case, the value of the pos_p
variable is assigned to the chunk_info[1].max_pos = 28
variable while the value of the substring length is assigned to the variable chunk_info[1].max_len
equal to 1.
The code fragment that implements the offset-based context search algorithm discussed above that performs the location of the longest occurrence of the substring:
for (size_t offset = 0; offset < strlen(pattern); offset++)
{
long long max_pos = -1, max_len = -1;
for (size_t pos = index; string[pos] != '\0'; pos++)
{
size_t pos_s = pos, pos_p = index + offset;
while (pos_s < size1 && pos_p < size2 &&
string[pos_s] == pattern[pos_p] &&
string[pos_s] != '\0' && pattern[pos_p] != '\0')
{
pos_s++; pos_p++;
}
if (pos_s - pos >= max_len || max_len == -1)
{
max_pos = pos;
max_len = pos_s - pos;
}
}
if (max_len > 1 && max_len >= chunk_info[0].max_len ||
chunk_info[0].max_len == -1)
{
chunk_info[0].max_len = max_len;
if (max_pos < chunk_info[0].max_pos || chunk_info[0].max_pos == -1)
chunk_info[0].max_pos = max_pos;
}
}
size_t pos_p1 = index;
while (pattern[pos_p1] != string[chunk_info[0].max_pos] &&
pattern[pos_p1] != '\0' && chunk_info[0].max_pos > -1) pos_p1++;
Since we’ve finally found the first non-identical fragment of the string []
buffer “TIDE”, now we have to indent both string []
and pattern []
buffers to the position index at which the following fragment has been previously found. For that purpose, we have to copy the contents of the string []
buffer starting at the final position of the non-identical fragment being found to the position index. Otherwise, if the current non-identical fragment is found in the pattern []
buffer, we have to copy the contents of the pattern []
buffer starting at the final position of the non-identical fragment to the same position index in the pattern []
buffer.
memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
strlen(diff1)], strlen(&string[index + strlen(diff1)]));
memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
strlen(diff2)], strlen(&pattern[index + strlen(diff2)]));
Let’s now spend more time to discuss another example of finding and locating the “overwrite” non-identical fragments. Suppose we’ve already found a certain non-identical fragment “TIDE” located at the position index == 6
of the string []
buffer. Obviously that, this is an “insertion” fragment. After locating the following fragment, the two string buffers, string []
and pattern []
are indented to the position 0 so that both of them will contain particular strings starting at the positions 6 and 11 respectively:
After those two buffers string []
and pattern []
have been indented, we’re actually performing the same iteration over those two string buffers similarly comparing each pair of characters contained in both string []
and pattern []
buffer until we incur the pair of characters that don’t actually match. In this case, the first pair of the unmatched characters “L” and “N” is obviously located at the position index == 0
. So that during the first iteration of the loop at which the value of the offset variable equal to 0
, we’re actually extracting the substring “NICE FOX” and copy it into the temporary buffer. Then, we’re performing the trivial context search to find the last longest occurrence of that substring the string []
buffer. Unfortunately, in this particular case, there’s no suitable substring contained in the string []
buffer that would match the substring contained in the temporary buffer. Then we’re proceeding to perform the next loop iteration for the value of variable offset == 1
. Similarly, we’re extracting the substring “ICE FOX” from the pattern []
buffer and perform the same trivial context search. In this case, the following substring also cannot be found in the string buffer. After that, we’re proceeding with the substring “CE FOX”, “E FOX” until we incur the substring “ FOX”, for which could be found at the position pos == 5
in the string []
buffer and has the length of max_len == 3
. Fig. 6 illustrates the process of finding the two non-identical “overwrite” fragment which could be found in string []
and pattern []
buffers:
Since the position of the current substring has been obtained, we’re treating this position as the position of the final character of that non-identical fragment which we’re actually searching for. In this case, the following character is “<space>”. Then we’re performing the obvious linear search of the following character in pattern []
buffer and locate it at the position pos_p == 4
assuming that the following position is the final position of another non-identical fragment found in the pattern []
buffer. Now, having the starting and final position for both non-identical “overwrite” fragments from either string []
or pattern []
buffers, let’s extract those fragments so that from the string []
buffer we would extract the substring “LARGE” and from the pattern []
buffer – the substring “NICE”.
char* diff1 = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff1, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff1, chunk_info[0].max_pos - index,
(const void*)&string[index], chunk_info[0].max_pos - index);
char* diff2 = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff2, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff2, chunk_info[1].max_pos - index,
(const void*)&pattern[index], chunk_info[1].max_pos - index);
We’re performing the loop iteration, proceeding to find the next current non-identical buffers until we incur the entire contents of the string []
and pattern []
buffers become identical as the result of performing the indenting after each non-identical fragment is being found and located. At this point, we similarly start from the beginning of the string []
and pattern []
buffers. Normally, we perform the iterations of the outermost controlling loop until we incur that both string []
and pattern []
buffers are identical or empty having no characters to be compared.
ullong diff_pos[2] = { 0 };
bool parsed = false; int nbuf = 0;
while (strcmp(string, pattern) != 0 && parsed == false)
{
}
Also there’re special cases at which the either string []
or pattern []
buffers doesn’t actually contain any non-identical fragments but are still considered to be different. This normally occurs when string []
contains a fragment missing from pattern []
buffer located at the position beyond of the last character in pattern []
buffer and vice versa. In this case, we should simply extract the fragment located beyond the last position in the pattern []
buffer from the string []
buffer assuming that the following fragment to be non-identical:
if (index >= strlen(pattern) && pattern[index] == '\0')
{
char* diff = new char[strlen(string) - index + 1];
memset((void*)diff, 0x00, strlen(string) - index + 1);
memcpy_s((void*)diff, strlen(string) - index,
(const void*)&string[index], strlen(string) - index);
parsed = true;
if (strcmp("\0", diff) < 0)
df(diff, diff_pos[0] + index, strlen(string) - index, 0);
diff_pos[0] += strlen(string) - index;
}
else if (index >= strlen(string) && string[index] == '\0')
{
char* diff = new char[strlen(pattern) - index + 1];
memset((void*)diff, 0x00, strlen(pattern) - index + 1);
memcpy_s((void*)diff, strlen(pattern) - index,
(const void*)&pattern[index], strlen(pattern) - index);
parsed = true;
if (strcmp("\0", diff) < 0)
df(diff, diff_pos[1] + index, strlen(pattern) - index, 1);
diff_pos[1] += strlen(pattern) - index;
}
Another important task that we need to perform while finding and locating each non-identical fragment is the computation of the absolute position for each non-identical fragment being found. Depending on what type of non-identical fragments either the “insertion” or “overwrite” being found using different methods of the position value computation.
Let’s remind that the value of absolute position of a non-identical fragment being found is not the same for either string []
and pattern []
buffer. To compute that value, we use the array diff_pos[2]
in which the first item diff_pos[0]
will be assigned to the position for each non-identical fragment being found in string []
buffer and the second item diff_pos[1]
will be used to store the position of a fragment from pattern []
. To compute the position of each “overwrite” fragments found in string []
and pattern []
, we simply add the value of the length of each fragment to either diff_pos[0]
or diff_pos[1]
counter variables respectively, so that the values of diff_pos[0]
and diff_pos[1]
will contain the position for the next non-identical fragment being found. To determine the position of non-identical “insertion” fragments, first we need to determine in which string buffer either string []
or pattern []
the non-identical “insertion” fragment has been found. If the non-identical “insertion” fragment is found in the string buffer, we increment the value of diff_pos[0]
data item by the value of the position index and also the length of the chunks[0]
buffer from the array chunks in which the following fragment has been copied. Then we need to increment the value of diff_pos[1]
data item by the value of the position index. Similarly, if the non-identical “insertion” fragment was found in the pattern []
buffer, then we’re incrementing the value of diff_pos[1]
by the value of the position index and the length of the chunks[1]
buffer, and the value of diff_pos[0]
is incremented by the value of the position index:
diff_pos[0] += index;
diff_pos[1] += strlen(diff) + index;
diff_pos[0] += strlen(diff1) + index;
diff_pos[1] += strlen(diff2) + index;
Using the Code
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <string.h>
#define N_CHUNKS 3
#pragma warning(disable:4018)
typedef __int64 ullong;
typedef void(*diffproc)(const char* data, ullong pos, ullong len, unsigned nbuf);
static char** filenames = NULL;
typedef struct tagChunkInfo
{
ullong max_pos;
ullong max_len;
} CHUNK_INFO;
template<typename ctype>
void swap(ctype& str1, ctype& str2)
{ ctype temp = str1; str1 = str2; str2 = temp; }
bool loadfromfile(const char* filename, char*& buffer)
{
FILE* fp = NULL;
if (fopen_s(&fp, filename, "rb+") != 0)
return 0;
fseek(fp, 0, SEEK_END);
size_t fsize = ftell(fp);
fseek(fp, 0, SEEK_SET);
if (buffer == NULL)
{
buffer = new char[fsize + sizeof(char)];
memset((void*)buffer, 0x00, fsize + sizeof(char));
}
char ch = '\0'; ullong index = 0;
while ((ch = fgetc(fp)) != EOF)
buffer[index++] = ch;
buffer[index] = '\0';
fclose(fp);
return 1;
}
void textdiff(const char* str1, const char* str2, diffproc df)
{
size_t size1 = strlen(str1);
size_t size2 = strlen(str2);
char* string = const_cast<char*>(str1);
char* pattern = const_cast<char*>(str2);
ullong diff_pos[2] = { 0 };
bool parsed = false; int nbuf = 0;
while (strcmp(string, pattern) != 0 && parsed == false)
{
size_t index = 0;
while (string[index] == pattern[index] &&
string[index] != '\0' && pattern[index] != '\0') index++;
if (index == strlen(string) - 1 && index == strlen(pattern) - 1)
{
char* diff1 = new char[sizeof(char) + 1];
memset((void*)diff1, 0x00, sizeof(char) + 1);
char* diff2 = new char[sizeof(char) + 1];
memset((void*)diff2, 0x00, sizeof(char) + 1);
parsed = true;
diff1[0] = string[index]; diff2[0] = pattern[index];
df(diff1, diff_pos[0] + index, strlen(diff1), 0);
df(diff2, diff_pos[1] + index, strlen(diff2), 1);
delete[] diff1; diff1 = NULL;
delete[] diff2; diff2 = NULL;
}
else if (index >= strlen(pattern) && pattern[index] == '\0')
{
char* diff = new char[strlen(string) - index + 1];
memset((void*)diff, 0x00, strlen(string) - index + 1);
memcpy_s((void*)diff, strlen(string) - index,
(const void*)&string[index], strlen(string) - index);
parsed = true;
if (strcmp("\0", diff) < 0)
df(diff, diff_pos[0] + index, strlen(string) - index, 0);
diff_pos[0] += strlen(string) - index;
delete[] diff; diff = NULL;
}
else if (index >= strlen(string) && string[index] == '\0')
{
char* diff = new char[strlen(pattern) - index + 1];
memset((void*)diff, 0x00, strlen(pattern) - index + 1);
memcpy_s((void*)diff, strlen(pattern) - index,
(const void*)&pattern[index], strlen(pattern) - index);
parsed = true;
if (strcmp("\0", diff) < 0)
df(diff, diff_pos[1] + index, strlen(pattern) - index, 1);
diff_pos[1] += strlen(pattern) - index;
delete[] diff; diff = NULL;
}
else
{
CHUNK_INFO* chunk_info = new CHUNK_INFO[N_CHUNKS];
memset((void*)chunk_info, 0x00, sizeof(CHUNK_INFO) * N_CHUNKS);
for (int ci = 0; ci <= N_CHUNKS - 1; ci++)
{
chunk_info[ci].max_pos = -1;
chunk_info[ci].max_len = chunk_info[ci].max_pos;
}
for (size_t offset = 0; offset < strlen(pattern); offset++)
{
long long max_pos = -1, max_len = -1;
for (size_t pos = index; string[pos] != '\0'; pos++)
{
size_t pos_s = pos, pos_p = index + offset;
while (pos_s < size1 && pos_p < size2 &&
string[pos_s] == pattern[pos_p] &&
string[pos_s] != '\0' && pattern[pos_p] != '\0')
{
pos_s++; pos_p++;
}
if (pos_s - pos >= max_len || max_len == -1)
{
max_pos = pos;
max_len = pos_s - pos;
}
}
if (max_len > 1 && max_len >= chunk_info[0].max_len ||
chunk_info[0].max_len == -1)
{
chunk_info[0].max_len = max_len;
if (max_pos < chunk_info[0].max_pos || chunk_info[0].max_pos == -1)
chunk_info[0].max_pos = max_pos;
}
}
size_t pos_p1 = index;
while (pattern[pos_p1] != string[chunk_info[0].max_pos] &&
pattern[pos_p1] != '\0' && chunk_info[0].max_pos > -1) pos_p1++;
for (size_t offset = 0; offset < strlen(string); offset++)
{
long long max_pos = -1, max_len = -1;
for (size_t pos = index; pattern[pos] != '\0'; pos++)
{
size_t pos_s = pos, pos_p = index + offset;
while (pos_s < size2 && pos_p < size1 &&
pattern[pos_s] == string[pos_p] &&
pattern[pos_s] != '\0' && string[pos_p] != '\0')
{
pos_s++; pos_p++;
}
if (pos_s - pos >= max_len || max_len == -1)
{
max_pos = pos;
max_len = pos_s - pos;
}
}
if (max_len > 1 && max_len >= chunk_info[1].max_len ||
chunk_info[1].max_len == -1)
{
chunk_info[1].max_len = max_len;
if (max_pos < chunk_info[1].max_pos || chunk_info[1].max_pos == -1)
chunk_info[1].max_pos = max_pos;
}
}
size_t pos_p2 = index;
while (string[pos_p2] != pattern[chunk_info[1].max_pos] &&
pattern[pos_p2] != '\0' && chunk_info[1].max_pos > -1) pos_p2++;
if (pos_p1 == index)
{
char* diff = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[0].max_pos - index,
(const void*)&string[index], chunk_info[0].max_pos - index);
df(diff, diff_pos[0] + index, strlen(diff), 0);
char* temp_s = new char[size1 + 1];
memset((void*)temp_s, 0x00, size1 + 1);
memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
(chunk_info[0].max_pos - index)],
strlen(&string[index + (chunk_info[0].max_pos - index)]));
char* temp_p = new char[size2 + 1];
memset((void*)temp_p, 0x00, size2 + 1);
memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
(pos_p1 - index)], strlen(&pattern[index + (pos_p1 - index)]));
if (temp_s != NULL && temp_p != NULL)
{
strcpy_s(string, size1 + 1, temp_s);
strcpy_s(pattern, size2 + 1, temp_p);
delete[] temp_s; temp_s = NULL;
delete[] temp_p; temp_p = NULL;
}
diff_pos[0] += strlen(diff) + index;
diff_pos[1] += index;
delete[] diff; diff = NULL;
}
else if (pos_p2 == index)
{
char* diff = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[1].max_pos - index,
(const void*)&pattern[index], chunk_info[1].max_pos - index);
df(diff, diff_pos[1] + index, strlen(diff), 1);
char* temp_s = new char[size1 + 1];
memset((void*)temp_s, 0x00, size1 + 1);
memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
(pos_p2 - index)], strlen(&string[index + (pos_p2 - index)]));
char* temp_p = new char[size2 + 1];
memset((void*)temp_p, 0x00, size2 + 1);
memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
(chunk_info[1].max_pos - index)],
strlen(&pattern[index + (chunk_info[1].max_pos - index)]));
if (temp_s != NULL && temp_p != NULL)
{
strcpy_s(string, size1 + 1, temp_s);
strcpy_s(pattern, size2 + 1, temp_p);
delete[] temp_s; temp_s = NULL;
delete[] temp_p; temp_p = NULL;
}
diff_pos[0] += index;
diff_pos[1] += strlen(diff) + index;
delete[] diff; diff = NULL;
}
else
{
char* diff1 = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff1, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff1, chunk_info[0].max_pos - index,
(const void*)&string[index], chunk_info[0].max_pos - index);
df(diff1, diff_pos[0] + index, strlen(diff1), 0);
char* diff2 = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff2, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff2, chunk_info[1].max_pos - index,
(const void*)&pattern[index], chunk_info[1].max_pos - index);
df(diff2, diff_pos[1] + index, strlen(diff2), 1);
char* temp_s = new char[size1 + 1];
memset((void*)temp_s, 0x00, size1 + 1);
memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
strlen(diff1)], strlen(&string[index + strlen(diff1)]));
char* temp_p = new char[size2 + 1];
memset((void*)temp_p, 0x00, size2 + 1);
memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
strlen(diff2)], strlen(&pattern[index + strlen(diff2)]));
if (temp_s != NULL && temp_p != NULL)
{
strcpy_s(string, size1 + 1, temp_s);
strcpy_s(pattern, size2 + 1, temp_p);
delete[] temp_s; temp_s = NULL;
delete[] temp_p; temp_p = NULL;
}
diff_pos[0] += strlen(diff1) + index;
diff_pos[1] += strlen(diff2) + index;
delete[] diff1; diff1 = NULL;
delete[] diff2; diff2 = NULL;
}
delete[] chunk_info; chunk_info = NULL;
}
}
}
void df(const char* data, ullong pos, ullong len, unsigned nbuf)
{
printf("\nfile: %s pos = %llu len = %llu\n", filenames[nbuf], pos, len);
printf("=======================================================================\n");
printf("%s\n", data);
}
int main(int argc, char* argv[])
{
if ((argc < 3) || (*argv[1] == '\0') || (*argv[2] == '\0'))
{
printf("Usage: textdiff <inp_file1> <inp_file2>\n");
return 0;
}
if (filenames == NULL)
{
filenames = new char*[2];
memset((void*)filenames, 0x00, sizeof(char*) * 2);
}
int arg_i = 1;
for (int fi = 0; fi < 2; fi++)
{
filenames[fi] = new char[266];
strcpy_s(filenames[fi], 266, argv[arg_i++]);
}
char *string = NULL, *pattern = NULL;
if (!loadfromfile(argv[1], string))
{
printf("Unable to read from file %s\n", "");
return -1;
}
if (!loadfromfile(argv[2], pattern))
{
printf("Unable to read from file %s\n", "");
return -1;
}
textdiff(string, pattern, df);
return 0;
}
Points of Interest
The smart textdiff utility algorithm being discussed can be an alternative of the linux/unix diff utility for Windows platform.
History
- 30th April, 2016 – First version of article published