I have a project where I am asked to sort a 100 GB file using only 1 GB of memory. I'm new to computer science and I can't understand the concept of sorting 100 GB of data in 1 GB of memory, and I would like your help with solving this specific problem.
Thank you for your time.

What I have tried:

Merge sort is among the fastest sorting methods, with O(n log n) complexity, and B+ trees are good for handling large amounts of data, so I am thinking the answer has to be one of these two methods. However, I can't understand the concept I explained above.

The concept you need for this task is called External sorting[^].
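
To make the idea concrete, here is a minimal, illustrative sketch in Python. The function name, the line-count memory budget, and the assumption that rows are newline-terminated and compared as whole strings are all my own choices for the example, not anything the linked article prescribes; a real solution would size the chunks in bytes against the 1 GB limit. The idea: split the input into runs that fit in memory, sort each run, write it to a temporary file, then do a k-way merge of all the runs.

import heapq
import itertools
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1_000_000):
    """Sort a newline-delimited text file while keeping only one chunk in memory."""
    run_paths = []

    # Phase 1: read the input in chunks that fit in memory, sort each chunk,
    # and write it out as a temporary sorted "run" file.
    with open(input_path, "r") as infile:
        while True:
            chunk = list(itertools.islice(infile, max_lines_in_memory))
            if not chunk:
                break
            chunk.sort()
            fd, run_path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(chunk)
            run_paths.append(run_path)

    # Phase 2: k-way merge of the sorted runs; heapq.merge keeps only one
    # line per run in memory at any time.
    run_files = [open(path, "r") for path in run_paths]
    try:
        with open(output_path, "w") as outfile:
            outfile.writelines(heapq.merge(*run_files))
    finally:
        for run_file in run_files:
            run_file.close()
        for path in run_paths:
            os.remove(path)

With a 1 GB budget and a 100 GB input this produces on the order of a hundred or more runs, so the memory used during the merge is one buffered line per run, well within the limit.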
 
 
The first thing to do is look at what you have to sort: what does the file contain that needs sorting, and what does it need sorting by?

Think about it: if you have a stack of mixed banknotes on your desk, there are many ways to sort them: by denomination, by colour, by size, by serial number. Which one of those is important?
When you select a "sorting property" for the banknotes, you can effectively ignore all the other properties: denomination means you don't care about the size or colour for example.

Your file is probably organised in much the same way: it will contain information in rows which are to be sorted by a particular column (or a set of columns perhaps) and you can - for the purposes of the sort - ignore all the other columns.
When you know what to sort by, you can start considering how to sort it - but do note that if you start actually splitting the input data into separate files, you're going to be looking at a lot of big files: probably far in excess of 100GB will be needed.

What I would do is build an "index file" which holds the offset of the start of each row in the main file, and sort that using whatever method you prefer. When that is sorted, then build a new file and copy the data rows across to it in the new order.
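
As a rough sketch of that index-file idea, everything below is an assumption for illustration: rows are newline-terminated, the sort key is the first comma-separated column, and the index of (key, offset) pairs itself fits in the 1 GB budget - which, for a 100 GB file with very short rows, it may not.

def sort_via_index(input_path, output_path):
    index = []  # (key, byte offset of the start of the row)

    # Pass 1: record where every row starts and what its key is.
    with open(input_path, "rb") as f:
        offset = 0
        for line in f:
            key = line.split(b",", 1)[0]  # assumed: key is the first column
            index.append((key, offset))
            offset += len(line)

    # Sort the comparatively small index in memory, not the 100 GB of rows.
    index.sort()

    # Pass 2: copy the rows to a new file in sorted order.
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        for _, offset in index:
            src.seek(offset)
            dst.write(src.readline())

Note that the copy phase does one random read per row, so it is a sketch of the idea rather than a tuned implementation; in practice you would batch the reads or combine this with an external merge.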
 
 
You can find examples in different languages here: external-sorting · GitHub Topics · GitHub[^]
 
 
Quote:
I'm new to computer science and I can't understand the concept of sorting 100 GB in 1 GB and I would like your help with solving this specific problem.

Probably because there is no such concept.
HDD storage and memory are essentially the same thing; the differences are that memory is smaller and much faster.
Because memory is the processor's natural workspace and a file is not, the means of reading and writing differ in code, but they are similar in principle.
The main difference I see is that sequential reads/writes in a file are far more efficient than random reads/writes.
Quote:
In sorting mergesort is the fastest sorting method with O(nlogn) complexity and b+ trees are good for sorting big amounts of data so I am thinking it has to be one of these 2 methods.

There are many more sorting methods, and each has its advantages.
Note that a sorting method is not tied to the size of the data; some are just more efficient than others in particular situations.
So merge sort is a good option. Remember that you compare only two values at a time.
You can use a hybrid method: sort small chunks (less than half the memory size) in memory, and merge the larger sorted chunks on file, as in the sketch below.
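
To make the "merge larger chunks on file" step concrete, here is a small sketch; the file names and the assumption that every row ends with a newline are mine. It merges two already-sorted run files into one larger sorted run while holding only the two lines currently being compared in memory:

def merge_runs(run_a_path, run_b_path, merged_path):
    """Merge two sorted text files into one sorted file, line by line."""
    with open(run_a_path) as a, open(run_b_path) as b, open(merged_path, "w") as out:
        line_a, line_b = a.readline(), b.readline()
        while line_a and line_b:
            # Compare only the two current lines, exactly as in merge sort.
            if line_a <= line_b:
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # One run is exhausted; copy the remainder of the other.
        out.write(line_a)
        out.writelines(a)
        out.write(line_b)
        out.writelines(b)

Repeating such merge passes doubles the run size each time until one fully sorted file remains; a k-way merge does the same job in a single pass over all the runs.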

Advice: study sorting algorithms
Sorting algorithm - Wikipedia[^]
Sorting Algorithms - GeeksforGeeks[^]
 
 
