Hi all,

I want to extract all 3-grams (each 3-gram is 3 consecutive bytes, with the window shifted by 1 byte each time) from the files in a directory and count how often each 3-gram occurs. I have written a simple C++ program that walks the directory recursively, extracts the 3-grams of each binary file, and stores them as keys in a hash table.
Before adding a key, I look it up. If it is already in the table, I do not add it again; I just increase its member value (the frequency of that 3-gram).

The program runs, but it stops with an error message saying: "Windows has triggered a breakpoint in my program. This may be due to a corruption of the heap, which indicates a bug in the program or the DLLs that it loads."

I would appreciate it if somebody could help me.

Thanks,

C++
#include "hash_table.h"
#include <string>
#include <windows.h>
#include <fstream>
#include <stdio.h>
#include <iostream>

#define MAX_BUFFER_SIZE 256
typedef CHashTable<int> CLongHashT;

using namespace std;

void makeVocabHash(string dir, CLongHashT HashTperAll, int N) {
	/* N -> N-gram! */
	HANDLE hFindFile;
	WIN32_FIND_DATAA Win32FindData;
	CHAR Directory[MAX_PATH];
	int counter;
	int countNgram;
	int i;
	string tmp;

	fstream fileOpen;

	// copying path to directory
	sprintf(Directory, "%s\\*.*", &dir[0]);
	if ((hFindFile = FindFirstFileA(Directory, &Win32FindData)) == INVALID_HANDLE_VALUE) { // if directory not found (finding first file of directory)
		return; // error, directory not found
	}

	do {
		if (strcmp(Win32FindData.cFileName, ".") != 0 && strcmp(Win32FindData.cFileName, "..") != 0) {
			sprintf(Directory, "%s\\%s", &dir[0], Win32FindData.cFileName);

			// if found a file
			if (!(Win32FindData.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)) {
				// is a file
				fileOpen.open(Directory, ios::in | ios::binary | ios::ate);

				// size of file
				int end = fileOpen.tellg();
				fileOpen.seekg(0, ios::beg);
				int begin = fileOpen.tellg();
				int size = end - begin;
				char data[MAX_BUFFER_SIZE];

				fileOpen.read(&data[0], size);
				fileOpen.close();

				counter = 0;
				// reading data with N bytes, construct 3 grams and insert to hashT
				while( (counter != ((size - N) + 1) ) ) {
					for (i = 0; i != N; ++i)
						tmp += data[i+counter];

					// insert to hashT
					if (HashTperAll.GetMember(tmp))
						countNgram = *(HashTperAll.GetMember(tmp)) + 1;
					else
						countNgram = 1;

					HashTperAll.AddKey(tmp, &countNgram);
					tmp = "";
					counter++;
				}
			}
			else {
				// is a directory
				makeVocabHash(Directory, HashTperAll, N);
			}
		}
	} while (FindNextFileA(hFindFile, &Win32FindData)); // finding next file in directory

	// closing handles
	FindClose(hFindFile);
}
//---------------------------------------------------------------------
void main()
{
   CLongHashT HashTDocs;

   cout<< "enter a path";
   string dir;
   cin>> dir;
   makeVocabHash(dir,HashTDocs, 3);

}
Updated 1-Oct-13 23:14pm
v3

1 solution

The problem may be located here:
C++
char data[MAX_BUFFER_SIZE];

fileOpen.read(&data[0], size);

This results in a buffer overflow when your file size is greater than MAX_BUFFER_SIZE.
To avoid this, allocate the buffer on the heap using new or malloc:
C++
char *data = new char[size];   // pointer to a heap buffer sized to the file
fileOpen.read(&data[0], size);
// Do something with data here
delete[] data;                 // array form of delete to match new[]
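
As an alternative to a raw new/delete pair, the buffer can be held in a std::vector so that it is released automatically. The following is only a minimal sketch: readWholeFile is a hypothetical helper name, not part of the original program, and it assumes the file is small enough to fit in memory.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not in the original code): reads a whole file into a
// buffer sized to fit it. Returns an empty vector if the file cannot be
// opened or is empty.
std::vector<char> readWholeFile(const std::string& path) {
    // Open at the end so tellg() reports the file size.
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary | std::ios::ate);
    if (!in) return std::vector<char>();

    const std::streamoff size = in.tellg();
    if (size <= 0) return std::vector<char>();

    // The buffer is sized from the file, so read() cannot overflow it.
    std::vector<char> data(static_cast<std::size_t>(size));
    in.seekg(0, std::ios::beg);
    in.read(&data[0], size);

    // Keep only the bytes that were actually read.
    const std::streamsize got = in.gcount();
    assert(got <= size);
    data.resize(static_cast<std::size_t>(got));
    return data;
}
```

The caller can then use data.size() as the byte count instead of a separately computed size, and no delete is needed.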
 
Comments
khorshidpour 7-Oct-13 2:35am    
Thanks, I modified my code but I still have the same problem. I also tested it with a synthetic file containing a single line, and the error still appears.
Jochen Arndt 7-Oct-13 3:25am    
Then you should run your program in debug mode and set some breakpoints to check your data.

You should also change the line
while( (counter != ((size - N) + 1) ) )
to
while( (counter < ((size - N) + 1) ) )
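I don't think that this is the source of your problem. But the original condition becomes a problem when size < N - 1: size - N + 1 is then negative, counter never equals it, and the loop runs past the end of the buffer.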
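
To make the loop bound explicit, the number of N-grams can be computed once before looping. This is a small sketch with a hypothetical helper name (ngramCount); it uses unsigned sizes, so the "file shorter than N" case has to be handled before subtracting.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: how many N-grams a buffer of `size` bytes contains
// when the window moves one byte at a time.
std::size_t ngramCount(std::size_t size, std::size_t n) {
    assert(n > 0);
    // A buffer shorter than n holds no complete N-gram. Checking this first
    // also avoids unsigned wrap-around in size - n.
    return size >= n ? size - n + 1 : 0;
}
```

A loop written as for (std::size_t counter = 0; counter < ngramCount(size, N); ++counter) then simply does not execute for files shorter than N bytes.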

You told us that your file is some kind of binary file. You are copying bytes to std::string tmp. If your data are binary and not characters, this will not work as expected.
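
One way binary keys can go wrong: a std::string can hold zero bytes, but anything that treats the key as a C string (for example via c_str() and strlen or strcmp) stops at the first zero byte. Whether CHashTable does this is not shown in the question, so that part is an assumption. The sketch below swaps in std::map from the standard library in place of CHashTable and builds each key with an explicit length; countNgrams and NgramCounts are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// std::map stands in for the CHashTable from the question.
typedef std::map<std::string, int> NgramCounts;

// Counts every n-byte window of `data`. `counts` is taken by reference so
// the caller sees the updated frequencies.
void countNgrams(const std::vector<char>& data, std::size_t n, NgramCounts& counts) {
    assert(n > 0);
    if (data.size() < n) return;  // no complete n-gram in this buffer

    for (std::size_t i = 0; i + n <= data.size(); ++i) {
        // The (pointer, length) constructor copies exactly n bytes,
        // including any zero bytes.
        const std::string key(&data[i], n);
        ++counts[key];  // operator[] starts a missing key at 0
    }
}
```

With this approach a 3-gram such as the bytes 'a', 0, 'b' stays a 3-byte key, whereas std::string("a\0b") built from a plain C string literal would contain only the single byte 'a'.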

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


