Please forgive me, because I honestly do not know exactly what to ask or what I am looking for. I guess I am just stuck in a math dilemma, but here it goes anyway...

I have a large set of decimal numbers, e.g. 50k or 100k of them, stored in an array.
They are not necessarily distinct; values may or may not repeat, and there are no restrictions.

Since the set is large, I need to summarize it, somewhat like an average does. But with an average I only get one value for the whole array, and I need 10 or 20 averages; in other words, the 10 most significant averages across the whole set of numbers.

Is there such an operation, and if so, what is it called so I can look for more information?
Of course, I would also need to be able to count the hits for each average or summary number.


---

To give this a bit more sense and context: I am trying to summarize a datalog from a car. Each "frame" or "entry" comes with an RPM value, which of course varies from 0 to 8000. I get thousands of those records, and I need to represent them in an RPM table together with the number of hits each "fixed" index has received.

As a practical example, let's assume we have the following values to process:

{10,50,90,50,10,400,450,300,550,900,950,1100,1200,1000,900}

rpm | hits
-----|---------
100 | 5 hits
500 | 4 hits
1000 | 6 hits
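
The table above can be reproduced with a short sketch that assigns each value to its nearest index. This is only an illustration in Python: the function name `count_hits` is hypothetical, and it assumes that a value lying exactly halfway between two indexes (300 in this data) goes to the higher one, which is what the 4 hits for 500 imply.

```python
from bisect import bisect_right


def count_hits(values, indexes):
    """Count how many values are closest to each index (indexes must be sorted)."""
    # The boundary between two neighbouring indexes is their midpoint.
    midpoints = [(a + b) / 2 for a, b in zip(indexes, indexes[1:])]
    hits = {index: 0 for index in indexes}
    for value in values:
        # bisect_right sends a value equal to a midpoint to the higher index
        # (an assumption made here to match the table).
        hits[indexes[bisect_right(midpoints, value)]] += 1
    return hits


rpm_values = [10, 50, 90, 50, 10, 400, 450, 300, 550,
              900, 950, 1100, 1200, 1000, 900]
print(count_hits(rpm_values, [100, 500, 1000]))
# prints {100: 5, 500: 4, 1000: 6}
```
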

In this example I have grouped similar numbers for the sake of simplicity. I know how to calculate the hits and how to find out which "index" each value should go to, but what I need to find out first is what the best indexes for the table are.
I have fixed those 3 indexes (100, 500, 1000), but I don't know if they are the best indexes to split my numbers; it might be 500 or 400 or 474, who knows.

That is what I am trying to work out: how do you find the best indexes? Their number can vary too; there could be just 3, or 10, or N, since the user will choose how many indexes to "subdivide" the range into.

I hope this makes a little more sense now.

What I have tried:

One of the ideas I had is the following, but I am not sure whether it makes sense at all.

Take Array.Max - Array.Min and divide the result by the number of summaries I want, in this case 10. Then create 10 different arrays, each holding the numbers in one range, and compute the average of each. For example:

Array.Min = 0
Array.Max = 400
Needed summaries = 10

Create 10 arrays, the first with numbers from 0 to 40, the second from 40 to 80, the third from 80 to 120, and so on, and then calculate the average of each array.

The problem I see with this is that I could potentially have no numbers at all in some range, say 200 to 300, so some arrays would be empty and their averages wouldn't make sense?
Updated 7-May-18 17:33pm
Comments
creizlein 7-May-18 18:42pm    
Thanks, everyone, for your input. I have been reading about statistics, but I am still not certain which mathematical approach I should use.

Maciej Los 8-May-18 2:55am    
I don't think anyone can guess how to get the rpm table from the above set of numbers.
I'd strongly suggest asking the car provider.

Based on your groups of 0-40, 40-80 etc, I expect that either the max value is less than 400 or there are 11 groups. In my example I'll use a maximum value smaller than 400, but it's easy to adjust if you want 11 groups or if you want to e.g. include 400 in the last group.

You could do something like this:

' Requires Imports System.Linq and Imports System.Collections.Generic.
' Sample data: every value is below 400, and nothing falls in the 40-80 range.
Dim array = New Decimal() {10, 20, 30, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 260, 270, 280, 290, 300, 310, 320, 330, 340, 350, 360, 370, 380, 390}

' Build 10 fixed-width groups: 0-40, 40-80, ..., 360-400 (upper bound excluded).
Dim groups = New List(Of IEnumerable(Of Decimal))
For index As Integer = 0 To 9
    Dim lowerBound As Decimal = index * 40D
    Dim upperBound As Decimal = (index + 1) * 40D
    groups.Add(array.Where(Function(n) n >= lowerBound AndAlso n < upperBound))
Next

' Average of each group, or -1 as a marker when the group is empty.
Dim averages = groups.Select(Function(g) If(g.Any(), g.Average(), -1D))


"array" holds the values, all of which are at least 0 and below 400.
"groups" holds the groups of values from 0 to 40, 40 to 80, etc.
"averages" holds the average of each of these groups, or -1 if the group is empty.

If you don't care about the empty groups (40-80 in my code example), you could do this more elegantly:

Dim groups = array.GroupBy(Function(n) Math.Floor(n / 40))
Dim averages = groups.Select(Function(g) g.Average())
 
Comments
creizlein 7-May-18 18:43pm    
I have updated the question with a little more context; I hope it makes more sense now.
Quote:
so some arrays will be empty and their average wont make sense?
Technically, you cannot compute the average of 0 samples.

In statistics, one usually reports the number of samples that fall within each range, not their average. Have a look at the Wikipedia article on frequency distributions.
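
As a rough illustration of such a frequency distribution, here is a minimal Python sketch using equal-width ranges between the smallest and largest value. The function name `frequency_distribution` and the choice to put the maximum value into the last range are assumptions made for this example, not something prescribed by the article.

```python
def frequency_distribution(values, bin_count):
    """Split [min, max] into bin_count equal-width ranges and count values in each.

    Returns a list of (lower_bound, upper_bound, count) tuples.
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    if width == 0:
        # All values are identical, so a single range holds everything.
        return [(low, high, len(values))]
    counts = [0] * bin_count
    for value in values:
        # The maximum would land one past the end, so clamp it into the last range.
        position = min(int((value - low) / width), bin_count - 1)
        counts[position] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(bin_count)]


for lower, upper, count in frequency_distribution([0, 10, 390, 400], 10):
    print(f"{lower:6.1f} to {upper:6.1f}: {count} hits")
```

An empty range simply reports 0 hits, so the "average of nothing" problem does not arise when counts are reported instead of averages.
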
 
 
Comments
creizlein 7-May-18 18:43pm    
I have updated the question with a little more context; I hope it makes more sense now.
First of all, I have no idea why you need to get 10 or more averages from an array...

It seems you are talking about statistics, and in particular about the concept of an average. There are at least a few types of average: arithmetic mean, median, geometric median, mode, and a few more.
Each of them provides very specific information about your data set, so you need to choose the method that matches the statistical analysis you want to make.
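
To make the difference concrete, here is a small Python sketch that computes three of these averages for the sample values from the question, using the standard `statistics` module. The helper name `summarize` is just an illustrative choice.

```python
import statistics


def summarize(values):
    """Return the arithmetic mean, median and mode(s) of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # multimode (Python 3.8+) lists every value that shares the highest count.
        "modes": statistics.multimode(values),
    }


rpm_values = [10, 50, 90, 50, 10, 400, 450, 300, 550,
              900, 950, 1100, 1200, 1000, 900]
print(summarize(rpm_values))
```

For this data the mean is about 530.67, the median is 450, and three values (10, 50 and 900) tie as modes, which shows how differently each measure describes the same set.
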
Imagine that your array of decimal numbers represents product sales over time (weeks, months, quarters, years). You may want to split the array into subsets (by product or by time) to get more information about the sales. Sometimes, in big data analysis (for market sales), a moving (or running) average is used too.
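
A simple moving average can be sketched as follows in Python. The function name `moving_average` and the fixed-size window are assumptions for this illustration; other variants, such as weighted or exponential moving averages, also exist.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of every run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    running_sum = 0.0
    for value in values:
        if len(recent) == window:
            # Subtract the oldest value, which the deque is about to discard.
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        if len(recent) == window:
            yield running_sum / window


print(list(moving_average([1, 2, 3, 4, 5], 3)))
# prints [2.0, 3.0, 4.0]
```
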

Due to my weak English, I can't explain it more... ;(
 
 
Comments
creizlein 7-May-18 18:43pm    
I have updated the question with a little more context; I hope it makes more sense now.
Quote:
I am trying to summarize a datalog from a car, each "frame" or "entry" comes with a RPM value , which of course varies from 0 to 8000, i get thousands of those records and I need to represent them in a rpm table and the amount of hits each "fixed" index has received.

I agree with the other solutions: it is not about averages, it is more about bins and frequencies.
So you have to define bins (categories of values) and count the number of values in each bin.
Quote:
I have created those 3 indexes (100,500,1000) fixed, but I don't know if those are the best indexes to split my numbers, it might be 500 or 400 or 474, who knows.

I don't know of any rule for choosing what goes in each bin; the only technique I can think of is trial and error.
You may find that extreme values are useless and that dummy bins can absorb them. For example, a car engine is not often below 500 RPM.
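
One way to automate that trial and error, which is not mentioned in this thread and is offered only as a possible approach, is one-dimensional k-means clustering: start from some guessed indexes, assign every value to its nearest index, move each index to the average of its values, and repeat until nothing changes. The Python sketch below assumes a simple deterministic choice of starting indexes (evenly spaced positions in the sorted data); the function name `kmeans_1d` is hypothetical, and a different starting choice can lead to a different result.

```python
def kmeans_1d(values, k, max_iterations=100):
    """Find k indexes for 1-D data with Lloyd's k-means; returns [(index, hits), ...]."""
    ordered = sorted(values)
    if not 1 <= k <= len(ordered):
        raise ValueError("k must be between 1 and the number of values")

    def assign(centers):
        # Put each value into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in ordered:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        return clusters

    # Assumed starting points: k evenly spaced positions in the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        clusters = assign(centers)
        # Move each center to the mean of its cluster; an empty cluster stays put.
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return [(center, len(cluster)) for center, cluster in zip(centers, assign(centers))]


rpm_values = [10, 50, 90, 50, 10, 400, 450, 300, 550,
              900, 950, 1100, 1200, 1000, 900]
for index, hits in kmeans_1d(rpm_values, 3):
    print(f"{index:8.1f} | {hits} hits")
```

For the sample values from the question and 3 indexes, this settles on roughly 42, 425 and 1008 with 5, 4 and 6 hits, the same grouping as the hand-made table. The number of indexes `k` stays under the user's control, as the question requires.
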
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


