Thanks chandu004,
The size of the data is tens of GB.
Your solution should work, thanks. I am wondering whether your method could be improved by reading more data into memory, for example by maintaining 10000 items in the linked list. Would that be better? I do not know whether you have better ways to improve it (I mean performance).
regards,
George
|
If anything, I think maintaining 10000 values would be worse, as the average and worst-case insertion times would increase, while I can see no real tangible benefit.
|
Thanks Cyrilix,
I also agree.
regards,
George
|
No, using 10000 nodes would in no way increase the performance; it would only increase the complexity, because insertions and traversal depth would grow.
Anyway, give me three days and I shall get back to you with my algorithm, which we can post here and discuss improving.
Or, if you have already sketched your algorithm, please post it and I shall give my ideas on it.
|
Thanks chandu004,
I have a new idea. I want to use the STL nth_element algorithm to select the largest 1000 elements from N elements, dividing the several GB of elements into tens of groups of N elements each. What do you think of this solution?
I have heard the time complexity of nth_element is O(n), but frankly I do not believe it. Do you know how nth_element could be implemented in O(n)?
regards,
George
|
George_George wrote: I think sorting the several M data in memory is not feasible.
Sure it is, and in only a few seconds. For example, using qsort() I sorted an array of 1,000,000 integers in ~2 seconds, and 2,000,000 integers took ~4 seconds.
So, by "not feasible", did you mean "not possible," or "not in an efficient manner?"
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
It seems to me like it is very feasible as well. Say you have 10 million integers: that's 10,000,000 * 4 bytes, so 40 MB in an array. That's nothing!
|
Hi Cyrilix,
Your calculation of my current (preliminary) situation is correct. A little more background: I am working with biological data samples, which are very large due to the nature of the experimental process. Currently several million data samples arrive every day, and by the end of the experiment there will be tens of GB in total. My physical memory cannot hold that much data.
Any comments or ideas?
regards,
George
|
Hmm... chandu's idea might be better than this one, but you could keep a std::map of 1000 objects, plus an int smallest that tracks the lowest value kept (this is easy to maintain, because for every number you read you can check whether it is larger or smaller than this value). With a map<int, int>, the first int is your value and the second int is the index of the number in the array. For every number, if the value is lower than or equal to smallest, discard it and keep going; otherwise, remove the element with the lowest value and insert a new element with the new value. To update the lowest-value "counter", you can iterate through the map again to find it.

There is only one real difference between my algorithm and chandu's. With his algorithm everything is sorted, and you iterate through the linked list on every insertion with (I'm not entirely sure, but I think) an average case of O(n/2) time given unbounded dataset values. Please correct me if I'm wrong. With my algorithm, every insertion is done in O(1) time, but searching for the new lowest value in a dataset of 1000 objects takes O(n) time... however, this can be optimized. What you can do is keep a buffer of the x lowest values and access that buffer in O(1), so as not to have to search for the new lowest value every time you perform an insertion.
-- modified at 0:42 Thursday 8th November, 2007
Edit: I just realized you may not need to use a map at all, if there's no association required. You might as well use any other data structure that fits the purpose.
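A rough, untested sketch of that idea (per the edit above, a plain vector in place of the map; the class name TopK and the O(k) rescan are just illustrative):

#include <algorithm>
#include <cstddef>
#include <vector>

// Keep the k largest values seen so far in an unsorted buffer,
// tracking the index of the current smallest so that most incoming
// values are rejected with a single comparison.
class TopK
{
public:
    explicit TopK(std::size_t k) : k_(k), smallest_(0) {}

    void add(int value)
    {
        if (buf_.size() < k_)                       // still filling the buffer
        {
            buf_.push_back(value);
            if (buf_.size() == k_)
                smallest_ = static_cast<std::size_t>(
                    std::min_element(buf_.begin(), buf_.end()) - buf_.begin());
            return;
        }
        if (value <= buf_[smallest_])               // common case: O(1) discard
            return;
        buf_[smallest_] = value;                    // O(1) replacement...
        smallest_ = static_cast<std::size_t>(       // ...then O(k) rescan for the new minimum
            std::min_element(buf_.begin(), buf_.end()) - buf_.begin());
    }

    const std::vector<int>& values() const { return buf_; }

private:
    std::size_t k_;
    std::vector<int> buf_;
    std::size_t smallest_;
};

Feed every number through add() and, at the end of the scan, values() holds the 1000 largest.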
|
Thanks Cyrilix,
I agree with most of your analysis. I want to confirm that,
Cyrilix wrote: so as not to have to search for the new lowest value every time you perform an insertion.
I think if the new value is larger than the lowest value in the 1000-element set, then you replace the lowest value with the new value, and the second lowest value becomes the lowest value for the next round, right? Then finding that second lowest value in your data structure takes O(n/2), right?
Your calculation above does not include the time to find the second lowest value.
regards,
George
|
Yes, that's correct -- the second lowest value becomes the new tracked lowest. The calculation to find it is O(n), not O(n/2), since you have no idea which value is the lowest now, therefore you must iterate through all of them.
To quote myself "but searching for the new lowest value in a dataset of 1000 objects takes O(n) time".
I should also mention that there was a flaw in my calculation. I said that my inserts are O(1), but if you use a node-based data structure, "removing" the previous value from an unsorted set is likely not O(1) unless the structure has more complex features. So let me change what I said to "insertions are only O(1) if you use an array", which is entirely possible.
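On that note, a min-heap would avoid the O(n) rescan altogether: the smallest kept value is always at the top, so rejection is O(1) and a replacement is O(log k). A sketch of that refinement (not what I described above, just a possible improvement):

#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Min-heap refinement: heap.top() is the smallest of the kept values.
typedef std::priority_queue<int, std::vector<int>, std::greater<int> > MinHeap;

void keep_largest(MinHeap& heap, std::size_t k, int value)
{
    if (heap.size() < k)
        heap.push(value);               // still filling up
    else if (value > heap.top())        // beats the smallest kept value
    {
        heap.pop();                     // drop the smallest
        heap.push(value);               // keep the newcomer
    }
}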
-- modified at 1:34 Thursday 8th November, 2007
|
Thanks Cyrilix,
I have a new idea and I would like to hear your comments. I think the STL nth_element algorithm is suitable for my solution (I can pass the start iterator, the position 1000, and the end iterator to the function). What do you think of it?
I am not sure about the time complexity of nth_element; do you know it?
http://msdn2.microsoft.com/en-us/library/7s2yb954(VS.80).aspx
regards,
George
|
The MSDN documentation seems to say that the average time complexity is O(n) with respect to last - first, which is better than any general sorting algorithm (at best O(n log n)). Because the entire dataset does not need to be fully sorted, there may be some trick in the implementation that improves the performance. However, I don't see in which context you are going to use this. If you call something like nth_element(start, 1000, end), I assume your data structure has to hold every single element (the example in the documentation uses a vector). Wouldn't that mean storing every single value in memory, which, from what you said, would be too large to fit?
I thought about calling it on 1000 datasets of 1 million elements each (a total dataset size of 1 billion), and then calling it again on the 1000 * 1000 filtered results; however, this does not guarantee that you get the lowest values if your dataset is highly randomized. You may be able to implement a more complex algorithm built on nth_element() and some statistical analysis that does a "best guess" estimation of what the nth element should be for each 1-million-element dataset, but from this point on I am only speculating about the performance. Also, I don't know what the requirements are... whether or not a best guess is good enough.
|
Hi Cyrilix,
My idea is still similar to yours: each time, read only a part of the data, but instead of a precise sort use nth_element to find the largest 1000 elements.
Where did you find the MSDN documentation describing the time complexity of nth_element as O(n)? Could you send me the link, please?
I do not know how nth_element could be implemented in O(n) time; do you have any ideas?
regards,
George
|
The MSDN page says: "The average of a sort complexity is linear with respect to _Last - _First."
It's the same link as the one you gave me.
Also, now that I think of it again, you're probably right. It does work. I was trying to think of cases where the collected data would not be accurate, but as long as the "middle" parameter of nth_element() for the final filter (filtering the already-filtered data) is less than or equal to the "middle" parameter of nth_element() for the initial filtering, there is no problem.
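Concretely, the two-pass filtering could look something like this (a sketch only; how you read each chunk from disk is up to you, and top1000 is just an illustrative name):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Reduce one in-memory chunk to its 1000 largest values.  Any element
// of the global top 1000 must also be in its own chunk's top 1000, so
// running this once per chunk, and once more over the concatenated
// survivors, yields the exact answer.
std::vector<int> top1000(std::vector<int> chunk)
{
    const std::size_t k = std::min<std::size_t>(1000, chunk.size());
    // Partition so the k largest values occupy the front of the vector.
    std::nth_element(chunk.begin(), chunk.begin() + k, chunk.end(),
                     std::greater<int>());
    chunk.resize(k);
    return chunk;
}

Collect the survivors from every chunk into one vector and call it once more to get the final 1000.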
|
Thanks Cyrilix,
How do you think nth_element could be implemented with linear time complexity, O(n)? I cannot imagine an O(n) implementation.
Any ideas?
regards,
George
|
Isn't it part of a standard C++ library in Visual C++? Why does it have to be re-implemented on your end?
|
Thanks Cyrilix,
I am not going to re-implement it; I am just interested in learning how nth_element is implemented, to improve my algorithm skills.
Any ideas to share?
regards,
George
|
I'm probably the worst person to ask, but I can't think of anything off the top of my head.
|
Never mind, Cyrilix.
I will start another thread to discuss this.
regards,
George
|
Hi DavidCrow,
I agree that sorting with qsort in memory is fast. The issue for me is that the data size is larger than the physical memory size: I cannot read tens of GB of data into memory to sort, since my physical memory is only 2 GB.
Currently I have several million data samples, but in the near future this will grow to tens of GB of data samples.
Any comments or ideas?
regards,
George
|
George_George wrote: I cannot read tens of GB of data into memory to sort.
And I would not have suggested such had you mentioned this requirement up front.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
Sorry DavidCrow,
It was my mistake to cause confusion earlier. When I asked the question I thought the data would be several MB, but after some experiments it turns out the data will be several GB.
Do you have any new ideas or comments given the new situation?
regards,
George
|
George_George wrote: Do you have any new ideas or comments given the new situation?
Yes, and it would only require that you keep 1000 of those numbers in memory. If you are familiar with a binary tree or a linked list, I would go that route.
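For example, a sketch using std::multiset to play the role of the binary tree (untested; the function name is my own):

#include <set>

// Keep only the 1000 largest values seen so far.  *kept.begin() is
// always the smallest of the kept values, so each incoming number
// costs at most O(log 1000).
void keep_top_1000(std::multiset<int>& kept, int value)
{
    if (kept.size() < 1000)
        kept.insert(value);
    else if (value > *kept.begin())
    {
        kept.erase(kept.begin());       // drop the current smallest
        kept.insert(value);
    }
}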
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
Hi DavidCrow,
I have a new idea. I want to use the STL nth_element algorithm to select the largest 1000 elements from N elements, dividing the several GB of elements into tens of groups of N elements each. What do you think of this solution?
I have heard the time complexity of nth_element is O(n), but frankly I do not believe it. Do you know how nth_element could be implemented in O(n)?
regards,
George