15,959,039 members
Articles / Programming Languages / Visual C++ 10.0

# Precise and safe calculation method for the average of large integer arrays

Rate me:
13 Aug 2014CPOL4 min read 27.1K   11   15
How to avoid overflow and still retain precision on large integer calculations

## Introduction

Integer calculations can be tricky, if done excessively. Even with values in an unproblematic range, repeated additions and other operations can lead to integer overflow. When this problem arises, even trivial algorithms can be difficult to implement.

The usual solution is to do the calculations using a higher precision data type for the intermediate values, such as `long long` or `double`.

This article offers an alternate algorithm to calculate an average for a large number of integer values under the premise that the calculation needs to be exact, and, therefore, using `double`s for the intermediate values is not an option. It is inspired by a question posted here: http://www.codeproject.com/Messages/4878957/Calculute-average-value-of-large-array.aspx.

## Background

A link posted by user k5054 in the thread mentioned points to the Cumulative Moving Average method: http://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average. This method provides an alternate way to calculate the average of a series of numbers. While the method is not concerned with the aspect of integer overflow, it offers a way to incrementally determine averages of a part of the series, and thus obviates the need for an intermediary value holding a considerably larger value, which may by subject to an overflow.

## Modified Algorithm

The algorithm incrementally calculates the cumulative moving average (CMA) of all values xi up to a point. To avoid having to revisit previous values for each iteration, the calculation is using an alternate formula based on the previous average:

CMAn+1 = CMAn + (xn+1 - CMAn) / (n+1)

In this equation, all values and intermediate results are on the same order of magnitude as the original values. The largest intermediate value you can get is the difference between the largest and smallest input value: e. g. if these two values are stored in x1 and x2, then the above formula will calculate the difference x2 - x1 on the second iteration.

This formula does assume an exact calculation however: if the division of the second summand is an integer division, some fractions will be cut off, and these can add up to a significant amount. Therefore we will need to keep track of the cumulated remainder (CMR) of the division:

CMAn+1 = CMAn + (xn+1 - CMAn + CMRn) div (n+1)

CMRn+1 = (xn+1 - CMAn + CMRn) mod (n+1)

The biggest intermediate value here is the divisor used in both formulas, (xn+1 - CMAn + CMRn) . The subtraction will at most deliver the difference between the smallest and largest value of the xi, and CMR can be any value between -n and n (the remainder can be negative if the divisor is). Thus there will be no overflow as long as the range of input values doesn't cover most of the integer range to start with, or the number of elements gets close to the maximum integer value of the integer type being used.

## Using the code

The code uses the C++11 features `decltype` and `std::remove_reference`. I have implemented and tested it in VS2010. Note that I deliberately did not use a size type for `n_values`, even though it can't be negative. The reason is that the division and modulo operators will not always provide the expected results for mixed signed/unsigned arguments. Besides, `n_values` must be in the base type range to avoid overflow.

C++
```#include <utility>

template <class container_iterator>
auto average(const container_iterator start, container_iterator end)
-> typename std::remove_reference<decltype(*start)>::type
{
typedef std::remove_reference<decltype(*start)>::type basetype;

basetype cumulated_average = 0;
basetype cumulated_remainder = 0;
basetype n_values = 0;
for (auto pvalue = start; pvalue != end; ++pvalue) {
++n_values;
addendum = *pvalue - cumulated_average + cumulated_remainder;
}
return cumulated_average;
}```

The code provides a single function that can be used like many of the algorithms of the STL library. It takes two arguments that define a range of values within a container. The function will automatically derive the type of the values inside the container from those arguments. If your compiler does not support the C++11 features `decltype` and `std::remove_reference`, you can add the value type as an argument to the template argument list instead.

You can call this function by simply passing it two iterators or pointers that point to the first and beyond the last elements of the container, that you wish to calculate the average of. Here is an example:

C++
```void test_average() {
char cvalues[] = { 13,7,-27, 34, -3, 22, 33, -1, 18, 29,
13,7,-27, 34, -3, 22, 33, -1, 18, 29,
13,7,-27, 34, -3, 22, 33, -1, 18, 29,
13,7,-27, 34, -3, 22, 33, -1, 18, 29,
13,7,-27, 34, -3, 22, 33, -1, 18, 29
};
auto cavg = average(cvalues, cvalues+50);
}```

While debugging this example I've found the range of values to be:

• `cvalues`: -27 to +34
• `cumulated_remainder`: -28 to +25
• `cumulated_average`: -2 to +13

All values were in the predicted range. This is only one test case, but the original code does contain more debug code that helps assert that the values of `cumulated_average` and `cumulated_remainder` were always accurate at all times.

## Points of Interest

This is the first time I've actually used `decltype`, and, while trying to figure out how to use it correctly, the first time I've even heard of `std::remove_reference`. These new language features can be quite useful, and it's well worth reading up on them, if you haven't done so already. In this example, using them saved me one template argument. but, more importantly, it saves the caller the effort to provide it, and me the effort to verify that the iterators passed to the function actually point to values of the exact same type.

I'd like to point out as well that some other algorithms relevant to statistics also exist in iterative form, e. g. for determining the standard deviation. However, I can't imagine a reason why anyone would want to restrict himself on integer arithmetic for these. But the iterative formulas can still be useful when you're facing huge amounts of data.

## History

2014-08-13 Article first version

Written By
Software Developer (Senior)
Switzerland

Programming and designing applications in C++ in the areas AI, real-time programming, client-server applications and CAD/CAM since 1985.

Personal interests: AI, computer graphics, games, reading

 First Prev Next
 My vote of 5 Shao Voon Wong3-Jan-20 17:09 Shao Voon Wong 3-Jan-20 17:09
 Re: My vote of 1 Stefan_Lang15-Sep-14 20:01 Stefan_Lang 15-Sep-14 20:01
 Re: My vote of 1 Yoldas Askan15-Sep-14 20:12 Yoldas Askan 15-Sep-14 20:12
 Re: My vote of 1 Stefan_Lang15-Sep-14 20:25 Stefan_Lang 15-Sep-14 20:25
 Message Closed 15-Sep-14 20:30 Yoldas Askan 15-Sep-14 20:30
 Re: My vote of 1 Stefan_Lang15-Sep-14 21:10 Stefan_Lang 15-Sep-14 21:10
 Re: My vote of 1 Yoldas Askan15-Sep-14 21:16 Yoldas Askan 15-Sep-14 21:16
 Re: My vote of 1 Yoldas Askan15-Sep-14 21:27 Yoldas Askan 15-Sep-14 21:27
 Re: My vote of 1 Yoldas Askan15-Sep-14 21:41 Yoldas Askan 15-Sep-14 21:41
 Re: My vote of 1 Stefan_Lang16-Sep-14 0:33 Stefan_Lang 16-Sep-14 0:33
 Message Closed 16-Sep-14 17:36 Yoldas Askan 16-Sep-14 17:36
 Re: My vote of 1 Yoldas Askan16-Sep-14 22:19 Yoldas Askan 16-Sep-14 22:19