Click here to Skip to main content
15,887,027 members
Please Sign up or sign in to vote.
3.00/5 (3 votes)
See more:
I have a large file i want to load, where in each line of the file i have a "KEY".
I used hash maps, but the size was really big using hash maps.

What other way to do this ?
Posted
Comments
Richard MacCutchan 1-Jan-16 11:33am    
To do what?
Frankie-C 1-Jan-16 12:11pm    
Try using mapping views of file like mmap( ).
Sergey Alexandrovich Kryukov 1-Jan-16 15:17pm    
Would your whole set of keys alone, without any more data, fit in memory at once? How much memory it will take?
—SA

Without specifying what you mean by "really big" and what your actual problem is, it is difficult to help you. Try to answer the following questions:

(1) What is the size of your virtual address space? (In a typical 32-bit system it's about 2 GB; in 64-bit systems it is much larger.)

(2) Does the entire file fit into the virtual memory of your process?

(3) If not, does at least the key set fit into the virtual memory? (Assuming your average key length is 12 bytes, plus an 8 byte pointer to the data, plus 3 pointers for linkage in a tree, gives about 32 bytes per key. In typical 32-bit system you could store about 50 million keys using a B-tree like structure)

If key set and file contents all fit into your virtual memory, just load them and build the key data structure. A hash map is not the most economical method for that, but very fast. If you want to keep the size to a minimum try using a red-black tree. For example you might want to try QMap, which since Qt 5.0 is based on a red-black tree.

If even the red-black tree with your keys doesn't fit into virtual memory, things get a little more complicated. In that case you can resort to structures like a B-tree, which stores multiple nodes per memory block and links those memory blocks together to form a complete index. You only need to have a couple of those blocks in memory, the rest may reside on a disk file. That is how most databases store their indices. Google for B-tree.

The pointers you store in your index may either point directly to a memory location (if index and file fit into virtual memory) or may point to a location relative to the begin of the data file. In the latter case you need an addition disk access to grab the desired line.
 
Share this answer
 
Solution 1: Using text data file to store hash maps:

C++
#include <fstream>
#include <iostream>
#include <string>
#include <cstdio>

using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
	std::ifstream file_in;
	std::string filename = "inp.txt";
	char line[1024] = "\0"; std::streamsize size = 1024;
        // Open file
	file_in.open(filename.c_str(), std::ifstream::in);
        // Fetch each line from the file into the buffer
	while (file_in.good() && file_in.getline(line, size, '\n'))
	{
		char key[256] = "\0", value[256] = "\0";
                // Parse each line according to specific format
		sscanf_s(line, "KEY: %s VALUE: %s", key, 256, value, 256);
                // Output the values of key and value
		std::cout << key << " " << value << endl;
	}

	file_in.close();

	std::cin.get();

	return 0;
}


Solution 2: Using binary files and mmap function:

C++
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#define NUMITEMS 10

typedef struct tagHash
{
     char key[256];
     char value[256];
}HASH, *PHASH;

int main(int argc, char** argv)
{
    const char fname[256] = "inp.bin";

    int fd = -1;
    if ((fd = open(fname, O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600)) == -1) 
    {
	perror("Error opening file for writing");
	exit(EXIT_FAILURE);
    }    

    PHASH hash_array = new HASH[NUMITEMS];
    memset((void*)hash_array, 0x00, sizeof(HASH) * NUMITEMS);

    for (int i = 0; i < NUMITEMS; i++)
    {
         sprintf(hash_array[i].key,"KEY: %d",i);
         sprintf(hash_array[i].value,"VALUE: %d",i);
    }

    write(fd, (void*)hash_array, sizeof(HASH) * NUMITEMS);

    long cur_pos = 0L, file_size = 0L;
    cur_pos = lseek(fd, 0, SEEK_CUR);
    file_size = lseek(fd, 0, SEEK_END); 
    lseek(fd, cur_pos, SEEK_CUR);

    HASH* hash_ptr = NULL;
    if ((hash_ptr = reinterpret_cast<HASH*>(mmap(0, file_size, 
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0))) == MAP_FAILED) 
    {
	close(fd);
	perror("Error mmapping the file");
	exit(EXIT_FAILURE);
    }

    while (hash_ptr != NULL)
    {
         printf("key = %s value = %s\n",hash_ptr->key, hash_ptr->value);
         hash_ptr++;
    }

    close(fd);

    return 0;
}


Solution 3: Filling QtHash with data item from a file

C++
#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#include <QHash>
#include <QString>

#define NUMITEMS 10

typedef struct tagHash
{
     char key[256];
     char value[256];
}HASH, *PHASH;

int main(int argc, char** argv)
{
    const char fname[256] = "inp.bin";

    int fd = -1;
    if ((fd = open(fname, O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600)) == -1) 
    {
	perror("Error opening file for writing");
	exit(EXIT_FAILURE);
    }    

    PHASH hash_array = new HASH[NUMITEMS];
    memset((void*)hash_array, 0x00, sizeof(HASH) * NUMITEMS);

    for (int i = 0; i < NUMITEMS; i++)
    {
         sprintf(hash_array[i].key,"KEY: %d",i);
         sprintf(hash_array[i].value,"VALUE: %d",i);
    }

    write(fd, (void*)hash_array, sizeof(HASH) * NUMITEMS);

    long cur_pos = 0L, file_size = 0L;
    cur_pos = lseek(fd, 0, SEEK_CUR);
    file_size = lseek(fd, 0, SEEK_END); 
    lseek(fd, cur_pos, SEEK_CUR);

    HASH* hash_ptr = NULL;
    if ((hash_ptr = reinterpret_cast<HASH*>(mmap(0, file_size, 
        PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0))) == MAP_FAILED) 
    {
	close(fd);
	perror("Error mmapping the file");
	exit(EXIT_FAILURE);
    }

    QHash<QString, QString> hashmap;
    while (hash_ptr != NULL)
    {
         hashmap.insert(QString(hash_ptr->key), QString(hash_ptr->value));
         hash_ptr++;
    }

    close(fd);

    return 0;
}


P.S. If you still want to experiment with mmap, here's the LINK[^] to the online resources explaining how to use the mmap function.

That's it.
 
Share this answer
 
v4
Comments
Aescleal 2-Jan-16 1:15am    
This doesn't add to the conversation at all. Advocating using a Microsoft specific function (sscan_f) to do something the original poster can already do (he says in the question he can read his file into a QT hashmap) under Linux is not a lot of use.

As an aside your file handling is way overcomplicated. All the guff with the ifstream at the beginning could be replaced by a single line and there's no need to close a stream unless you're going to reattach it to a different file at some point.
Arthur V. Ratz 2-Jan-16 1:34am    
I really use a Microsoft specific sscanf_s function, because I've compiled the program in VSS2013 not in Linux. You may try using std::sscanf in Linux.

Whatever you use QT hashmap or anything else, you should implement your own mechanism to retrive the data from a file and insert each data item into QT Hashmap or something else like simple arrays, etc.

Of course, that I've done in my code can be easily replaced by a single line using STL templates. But, the main goal of my post is to demonstrate how it works and not to wrap the code.

In my opinion, to avoid pumping up the memory consumed by the application, the enquirer should not use any arrays or hashmaps. Instead he should process each line of a data file immediately after its retrieval. As you might notice, in my sample code posted, I've not implemented storing of each line of a data file into any data structures. I've limited just each line retrival and output which can be replaced with more complex data processing.

At last, read my Solution 2 code snippet in which I'm using mmap function. I've compiled this code in Cygwin using GNU C compiler.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900