|
The MSDN documentation seems to say that the average time complexity is O(n) with respect to last - first, which is better than any general-purpose sorting algorithm (at best O(n log n)). Because the entire dataset does not need to be fully sorted, there may be some trick in the implementation that improves performance. However, I don't see which context you are going to use this in. If you do something like nth_element(start, 1000, end), I assume your data structure has to hold every single element (the example in the documentation uses a vector). Wouldn't that mean you'd have to read every value into memory, which, from what you said, would be too large to fit?
I thought about calling it on 1000 datasets of 1 million elements each (a total dataset size of 1 billion) and then calling it again on the 1000 * 1000 filtered results, but that does not guarantee you get the lowest values if your dataset is highly randomized. You may be able to build a more complex algorithm on top of nth_element() and some statistical analysis that makes a "best guess" at what the nth element should be for each 1-million-element dataset, but from this point on I'm only speculating about what the performance would look like. Also, I don't know what the requirements are, i.e. whether or not a best guess is good enough.
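For reference, a single in-memory call would look something like the sketch below. It assumes the values fit in a std::vector and uses std::greater so that the largest 1000 end up at the front; the function name is only illustrative.

#include <algorithm>
#include <functional>
#include <vector>

// Move the 1000 largest values to the front of the vector, in no particular
// order. nth_element averages O(n) in (last - first) because nothing else
// gets fully sorted.
void SelectTop1000(std::vector<int>& values)
{
    if (values.size() > 1000)
    {
        std::nth_element(values.begin(),
                         values.begin() + 1000,
                         values.end(),
                         std::greater<int>());
    }
    // values[0..999] now hold the 1000 largest elements.
}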
|
Hi Cyrilix,
My idea is similar to yours: each time, read only a part of the data, but instead of doing a precise sort, use nth_element to find the largest 1000 elements.
Where did you find the MSDN documentation describing the time complexity of nth_element as O(n)? Could you send me the link, please?
I do not know how nth_element could be implemented in O(n) time; do you have any ideas?
regards,
George
|
The MSDN page states: "The average of a sort complexity is linear with respect to _Last - _First."
It's the same link as the one you gave me.
Also, now that I think about it again, you're probably right: it does work. I was trying to think of cases where the collected data would not be accurate, but as long as the "middle" parameter of nth_element() for the final filter (filtering the already-filtered data) is less than or equal to the "middle" parameter used for the initial filtering, there is no problem.
|
Thanks Cyrilix,
How do you think nth_element could be implemented with linear time complexity, O(n)? I cannot imagine an O(n) implementation.
Any ideas?
regards,
George
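For context, implementations typically reach the average O(n) bound with a quickselect-style algorithm: partition the range around a pivot, then recurse into only the side that contains the requested position. The sketch below is a simplified, self-contained version of that idea, not the actual Visual C++ implementation; all names are made up for illustration.

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Lomuto-style partition around v[hi]; returns the pivot's final index.
static std::size_t Partition(std::vector<int>& v, std::size_t lo, std::size_t hi)
{
    int pivot = v[hi];
    std::size_t store = lo;
    for (std::size_t i = lo; i < hi; ++i)
        if (v[i] < pivot)
            std::swap(v[i], v[store++]);
    std::swap(v[store], v[hi]);
    return store;
}

// After the call, v[k] holds the element that would be at index k if v were
// fully sorted (requires a non-empty v and 0 <= k < v.size()).  Each round
// partitions once and then keeps only one side, so the expected cost is about
// n + n/2 + n/4 + ..., i.e. O(n) on average.
void QuickSelect(std::vector<int>& v, std::size_t k)
{
    std::size_t lo = 0, hi = v.size() - 1;            // inclusive range
    while (lo < hi)
    {
        // Move a randomly chosen pivot to the end, then partition.
        std::swap(v[lo + std::rand() % (hi - lo + 1)], v[hi]);
        std::size_t p = Partition(v, lo, hi);
        if (p == k)
            return;                                   // the k-th element is in place
        else if (p < k)
            lo = p + 1;                               // answer lies to the right
        else
            hi = p - 1;                               // answer lies to the left
    }
}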
|
Isn't it part of the standard C++ library that ships with Visual C++? Why does it have to be re-implemented on your end?
|
Thanks Cyrilix,
I am not going to re-implement it; I am just interested in learning how nth_element is implemented, to improve my algorithm skills.
Any ideas to share?
regards,
George
|
I'm probably the worst person to ask, but I can't think of anything off the top of my head.
|
Never mind, Cyrilix.
I will start another thread to discuss this.
regards,
George
|
Hi DavidCrow,
I agree that sorting with qsort in memory is fast. The issue for me is that the data is larger than the physical memory: I cannot read tens of GB of data into memory to sort, and my physical memory is only 2 GB.
Currently I have several million data samples, but in the near future this will grow to tens of G data samples.
Any comments or ideas?
regards,
George
|
George_George wrote: I cannot read tens of GB of data into memory to sort.
And I would not have suggested such had you mentioned this requirement up front.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
|
|
|
|
Sorry DavidCrow,
It was my mistake that caused the confusion earlier. When I asked the question I thought the data would be several M, but after some experiments it turns out to be several G.
Do you have any new ideas or comments given this new situation?
regards,
George
|
George_George wrote: Do you have any new ideas or comments given this new situation?
Yes, and it would only require that you keep 1000 of those numbers in memory. If you are familiar with a binary tree or a linked list, I would go that route.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
Hi DavidCrow,
I have a new idea: I want to use STL nth_element to select the largest 1000 elements from N elements, and I will divide the several G elements into tens of groups of N elements each. What do you think of this solution?
I have heard that the time complexity of nth_element is O(n), but I find that hard to believe. Do you know how nth_element could be implemented in O(n)?
regards,
George
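A sketch of that grouped approach might look like the following. It reads groupSize samples at a time, keeps only each group's largest 1000 via nth_element, and runs a final nth_element over the surviving candidates; the input format (whitespace-separated ints on a stream) and the function name are assumptions for illustration.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

std::vector<int> LargestThousand(std::istream& in, std::size_t groupSize)
{
    const std::size_t kKeep = 1000;
    std::vector<int> candidates;            // at most kKeep survivors per group
    std::vector<int> group;
    group.reserve(groupSize);

    int value;
    while (in >> value)
    {
        group.push_back(value);
        if (group.size() == groupSize)
        {
            if (group.size() > kKeep)
            {
                std::nth_element(group.begin(), group.begin() + kKeep,
                                 group.end(), std::greater<int>());
                group.resize(kKeep);        // keep only this group's largest 1000
            }
            candidates.insert(candidates.end(), group.begin(), group.end());
            group.clear();
        }
    }
    // Whatever is left of the final, partial group also joins the pool.
    candidates.insert(candidates.end(), group.begin(), group.end());

    // Final selection over all surviving candidates.
    if (candidates.size() > kKeep)
    {
        std::nth_element(candidates.begin(), candidates.begin() + kKeep,
                         candidates.end(), std::greater<int>());
        candidates.resize(kKeep);
    }
    return candidates;                      // the overall largest 1000, unsorted
}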
|
George_George wrote: What do you think of this solution?
I tested it with 16,000,000 integers and it took ~1 second.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
Hi DavidCrow,
I can hardly believe your system is so fast; it sounds like a supercomputer. What machine are you using?
regards,
George
|
George_George wrote: What machine are you using?
It is a Dell Optiplex 170L with a 2.4GHz CPU and 1GB of RAM. I also tried it on an IBM ThinkPad A21m with an 800MHz CPU and 512MB of RAM. It took ~4 seconds.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
|
|
|
|
Thanks DavidCrow,
I would appreciate it if we could get back to the topic: the data (tens of G, not just a few) cannot be held in physical memory all at once. Any good ideas are appreciated.
regards,
George
|
George_George wrote: Any good ideas are appreciated.
One has already been offered. See here.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
|
|
|
|
Thanks DavidCrow,
I am still looking for more efficient solutions. What do you think about using nth_element? It would save the time spent sorting, since I only want the 1000 largest elements and do not need them sorted.
regards,
George
|
George_George wrote: I am still looking for more efficient solutions...
Unless you have actual metrics from each solution, how do you know what is efficient and what is not?
George_George wrote: What do you think about using nth_element?
How do you plan on using that with multiple files? At some point in time, you'll need the largest 1,000 from each file in memory, or in a separate file, so that the largest 1,000 from everything can be selected.
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
|
|
|
|
Thanks DavidCrow,
I appreciate your comments above. I will have to get my hands dirty and evaluate this further.
regards,
George
|
So what have you discovered with this?
"Normal is getting dressed in clothes that you buy for work and driving through traffic in a car that you are still paying for, in order to get to the job you need to pay for the clothes and the car and the house you leave vacant all day so you can afford to live in it." - Ellen Goodman
"To have a respect for ourselves guides our morals; to have deference for others governs our manners." - Laurence Sterne
|
|
|
|
|
Hi,
I didn't check the correctness of the answers above, but you could use the automatic sorting of std::map.
#include <iostream>
#include <map>

int main()
{
    std::map<int, int> theMap;                     // keys are kept sorted, smallest first
    int data[10] = { 1, 7, 3, 8, 1, 9, 12, 11, 13, 6 };

    for (int iIndex = 0; iIndex < 10; iIndex++)
    {
        if (theMap.size() < 4)
        {
            theMap[data[iIndex]] = data[iIndex];   // not enough candidates yet
        }
        else if (data[iIndex] > theMap.begin()->first)
        {
            // Larger than the current smallest candidate: insert it, then
            // drop the smallest so that only the 4 largest values remain.
            theMap[data[iIndex]] = data[iIndex];
            if (theMap.size() > 4)                 // a duplicate key does not grow the map
                theMap.erase(theMap.begin());
        }
    }

    // theMap now holds (up to) the 4 largest distinct values, in ascending order.
    for (std::map<int, int>::const_iterator it = theMap.begin(); it != theMap.end(); ++it)
        std::cout << it->first << '\n';
    return 0;
}
I don't know how well this performs on millions of data points, but it's a simple, straightforward algorithm.
|
Thanks BadKarma,
Your solution works if we could hold all of the data in memory (a Set instance), but in my situation the data cannot fit in memory. It would be appreciated if you could share some insights with us.
regards,
George
|
In my solution I only keep 4 integers in memory.
The example uses an int array, but that is just to provide input data; the data could also be read from a file or any other input source.
|