Precise and safe calculation method for the average of large integer arrays

Stefan_Lang

4.91/5 (11 votes)

Aug 13, 2014

CPOL

4 min read

28103

How to avoid overflow and still retain precision on large integer calculations

Introduction

Integer calculations can be tricky, if done excessively. Even with values in an unproblematic range, repeated additions and other operations can lead to integer overflow. When this problem arises, even trivial algorithms can be difficult to implement.

The usual solution is to do the calculations using a higher precision data type for the intermediate values, such as long long or double.

This article offers an alternate algorithm to calculate an average for a large number of integer values under the premise that the calculation needs to be exact, and, therefore, using doubles for the intermediate values is not an option. It is inspired by a question posted here: http://www.codeproject.com/Messages/4878957/Calculute-average-value-of-large-array.aspx.

Background

A link posted by user k5054 in the thread mentioned points to the Cumulative Moving Average method: http://en.wikipedia.org/wiki/Moving_average#Cumulative_moving_average. This method provides an alternate way to calculate the average of a series of numbers. While the method is not concerned with the aspect of integer overflow, it offers a way to incrementally determine averages of a part of the series, and thus obviates the need for an intermediary value holding a considerably larger value, which may by subject to an overflow.

Modified Algorithm

The algorithm incrementally calculates the cumulative moving average (CMA) of all values x_i up to a point. To avoid having to revisit previous values for each iteration, the calculation is using an alternate formula based on the previous average:

CMA_n+1 = CMA_n + (x_n+1 - CMA_n) / (n+1)

In this equation, all values and intermediate results are on the same order of magnitude as the original values. The largest intermediate value you can get is the difference between the largest and smallest input value: e. g. if these two values are stored in x₁ and x₂, then the above formula will calculate the difference x₂ - x₁ on the second iteration.

This formula does assume an exact calculation however: if the division of the second summand is an integer division, some fractions will be cut off, and these can add up to a significant amount. Therefore we will need to keep track of the cumulated remainder (CMR) of the division:

CMA_n+1 = CMA_n + (x_n+1 - CMA_n + CMR_n) div (n+1)

CMR_n+1 = (x_n+1 - CMA_n + CMR_n) mod (n+1)

The biggest intermediate value here is the divisor used in both formulas, (x_n+1 - CMA_n + CMR_n) . The subtraction will at most deliver the difference between the smallest and largest value of the x_i, and CMR can be any value between -n and n (the remainder can be negative if the divisor is). Thus there will be no overflow as long as the range of input values doesn't cover most of the integer range to start with, or the number of elements gets close to the maximum integer value of the integer type being used.

Using the code

The code uses the C++11 features decltype and std::remove_reference. I have implemented and tested it in VS2010. Note that I deliberately did not use a size type for n_values, even though it can't be negative. The reason is that the division and modulo operators will not always provide the expected results for mixed signed/unsigned arguments. Besides, n_values must be in the base type range to avoid overflow.

#include <utility>

template <class container_iterator>
auto average(const container_iterator start, container_iterator end)
   -> typename std::remove_reference<decltype(*start)>::type
{
   typedef std::remove_reference<decltype(*start)>::type basetype;

   basetype cumulated_average = 0;
   basetype cumulated_remainder = 0;
   basetype addendum = 0;
   basetype n_values = 0;
   for (auto pvalue = start; pvalue != end; ++pvalue) {
      ++n_values;
      addendum = *pvalue - cumulated_average + cumulated_remainder;
      cumulated_average += addendum / n_values;
      cumulated_remainder = addendum % n_values;
   }
   return cumulated_average;
}

The code provides a single function that can be used like many of the algorithms of the STL library. It takes two arguments that define a range of values within a container. The function will automatically derive the type of the values inside the container from those arguments. If your compiler does not support the C++11 features decltype and std::remove_reference, you can add the value type as an argument to the template argument list instead.

You can call this function by simply passing it two iterators or pointers that point to the first and beyond the last elements of the container, that you wish to calculate the average of. Here is an example:

void test_average() {
   char cvalues[] = { 13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29,
                      13,7,-27, 34, -3, 22, 33, -1, 18, 29
                    };
   auto cavg = average(cvalues, cvalues+50);
}

While debugging this example I've found the range of values to be:

cvalues: -27 to +34
cumulated_remainder: -28 to +25
cumulated_average: -2 to +13

All values were in the predicted range. This is only one test case, but the original code does contain more debug code that helps assert that the values of cumulated_average and cumulated_remainder were always accurate at all times.

Points of Interest

This is the first time I've actually used decltype, and, while trying to figure out how to use it correctly, the first time I've even heard of std::remove_reference. These new language features can be quite useful, and it's well worth reading up on them, if you haven't done so already. In this example, using them saved me one template argument. but, more importantly, it saves the caller the effort to provide it, and me the effort to verify that the iterators passed to the function actually point to values of the exact same type.

I'd like to point out as well that some other algorithms relevant to statistics also exist in iterative form, e. g. for determining the standard deviation. However, I can't imagine a reason why anyone would want to restrict himself on integer arithmetic for these. But the iterative formulas can still be useful when you're facing huge amounts of data.

History

2014-08-13 Article first version