Click here to Skip to main content
15,867,308 members
Articles / Programming Languages / Visual C++ 9.0

Fixed Point Class

Rate me:
Please Sign up or sign in to vote.
4.99/5 (57 votes)
5 Nov 2009BSD23 min read 137K   3K   97   32
A C++ template class for fixed point mathematics.

Introduction

I was working on algorithms for color space conversion, as the need for a fixed-point implementation of real numbers arose. Earlier, I have occasionally used fixed-point mathematics here and then, but coded in an ad-hoc manner, quick and dirty.

Therefore, I decided to start working on a reusable class for fixed-point mathematics. Another motivating reason was that I wanted to know whether one really could write a class in C++ whose objects would behave similar to a built-in numeric type. The goals were that this fixed-point class:

  • should be usable as a drop-in replacement for the existing floating point types,
  • should be faster than floating point emulation on non floating point enabled hardware,
  • should be generically usable in many circumstances,
  • should be inherently beautiful, nicely written, and documented for future reference.

After a very short introduction to fixed-point mathematics, the text explains how to use the fixed_point class. Then it explains the implementation of the class.

The latest version of this document can always be found here. The latest version of the code can always be found at http://fpmath.googlecode.com.

Short Introduction to Fixed-Point Mathematics

A fixed-point number can be split into an optional sign bit, an integer, and a fractional part.

fpmath-1.png

The integer part consists of i bits, the fractional part consists of f bits.

The total value of the fixed point number V is:

fpmath-2.png

A fixed point number as defined above is said to be in i.f format.

The range of an unsigned fixed point number is between 0 and 2^i-1. The range of a signed fixed point number is between -2^i and 2^i-1. The precision of a fixed point number is 2^(-f).

An important text to consult for more information is the paper, What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys. This text is easy to find on the internet, there are many download locations.

Here is an example: assume that we have a 16 bit fixed point number, in 7.8 format, with an additional sign bit, where the bits are set to 0000 0110 1010 0000 (that is, 0x06A0 in hexadecimal notation). The value of this fixed point number is +6.625.

If you have such a bit pattern, regardless of the fixed point format, you can treat it as an integer and perform integer mathematics on it. Only one little complication arises when you do multiplication or division: you need to make sure that the decimal point is at the right position. This is roughly comparable to the way you learned to multiply by hand in school: at the end of the multiplication, you had to count positions after the decimal point, and then set the decimal point at the right position.

The implication for the machine is that fast integer commands can be used to carry out the calculations. The downside is that range and precision of fixed point numbers are usually smaller than those of floating point numbers.

Installation of the Project

If you just want to use the fixed_point class in your code, you only need to download the project. You will find fixed_point.h in the include/fpml folder.

However, the project also contains test code including unit tests and benchmarks. If you want to run these, you also need CMake (version 2.6 or higher) which can be downloaded at http://www.cmake.org. Run CMake, and point it to the directory where you’ve downloaded the project, and use it to generate the Visual Studio solutions and projects. You can then load FPMATH.sln, which will contain the test projects.

Usage of the fixed_point Class

The first thing you should do is to include the header file fixed_point.h:

C++
#include <fixed_point.h>

The fixed_point<B, I, F> class is defined inside the namespace fmpl. In order to use it, you can either prefix with fpml:: everywhere it is necessary, or you can use a using statement:

C++
using namespace fmpl;

There are basically two different use-cases for the fixed_point<B, I, F> class. You can either use it in newly written code, where you know that you want fixed-point mathematics, or where you control the behavior, i.e., determine the number of integer and fractional bits. An extremely simple code sample of this first use case could look like this:

C++
#include <fixed_point.h>
using namespace fpml;

main()
{
    fixed_point<int, 16> a = 256;
    fixed_point<int, 16> b = sqrt(a);
}

Of course, you can use all the other operators and functions as well. The fixed_point<B, I, F> class has implementations for all the important operators and functions you would take for granted when working with floating point numbers.

The second use case is the scenario of porting, when the code already exists and has been written with either the float or double types. When you have carefully checked the code for range and precision issues, and have decided that the code will still work when the floating point calculations are replaced with fixed point calculations, you can use #define to port the code, with minor changes made to the original code:

C++
#include <fixed_point.h>
#define double fpml::fixed_point<int, 16>

… original code here, unchanged …

#undef double

You should be very careful though, because both range and precision of the fixed-point numbers will differ from those of the original floating-point numbers, and this may introduce bugs into the original code. Especially when functions like sqrt, log, or sin are used, the error propagation of the fixed-point numbers is no longer linear, and surprises may be encountered.

Implementation of the fixed_point Class

One of the goals is that the code should be as generic as possible. One way to achieve this is to use templates. I decided to use three template parameters:

C++
template<typename B, unsigned char I, unsigned char F
        = std::numeric_limits<B>::digits - I>
class fixed_point
{
    BOOST_CONCEPT_ASSERT((boost::Integer<B>));
    BOOST_STATIC_ASSERT(I + F == std::numeric_limits<B>::digits);
    ...
private:
    B value_;
}

B is the base type. This must be an integer type. This is the type used for most of the calculations. Examples are unsigned char, signed short, or int. This type can be chosen according to the size, precision, and performance requirements. If an unsigned type is chosen, the behavior of the the fixed_point class is unsigned; otherwise, it is signed. Signed behavior more closely matches the behavior of the built-in floating point types.

The statement BOOST_CONCEPT_ASSERT((boost::Integer<B>)) in the class body assures that only integer types can be used as the base type. Please note that the base type is not used in the sense of base class here. In fact, the fixed_point class is not derived from any base class, but stands on its own feet.

I is the number of bits of the integer part, not counting the sign bit. This determines the range of the numbers that can be represented.

F is the number of bits of the fractional part. This determines the precision of the numbers that can be represented. There is no need to specify F when the template is instantiated, since it can always be deferred automatically from the base type B and the number of integer bits I. However, if it is specified, then it needs to be correct. The statement BOOST_STATIC_ASSERT(I + F == std::numeric_limits<B>::digits) in the class body assures that this condition is enforced.

The number of bits of the base type must satisfy the following formula: #(B)=S+I+F, where:

  • #(B) is the number of bits of the base type,
  • S is 1 for signed types and 0 for unsigend types,
  • I is the number of integer bits,
  • F is the number of fractional bits.

The table below shows the usable types and the requirements for I and F:

B

#(B)

S

I

F

signed char

8

1

7…1

0…6

unsigned char

8

0

8…1

0…7

short

16

1

15…1

0…14

unsigned short

16

0

16…1

0…15

int

32

1

31…1

0…30

unsigned int

32

0

32…1

0…31

Objects of type fixed_point<B, I, F> are of the same size as the underlying type B. The class has been carefully designed not to impose additional size requirements.

By using all available bits for the integer part, you achieve integer behavior. However, you can call the defined functions such as sqrt or log etc., for integers then. I have more to say about these functions later.

Types larger than 32 bit are not yet supported. One reason for this is that some functions need internal results twice as big. If 64 bit types would have been allowed, these internal results would be 128 bit wide. The second reason is that if you could afford 64 bits for the fixed-point type, you could as well use the type double in many cases.

Construction and Conversion

If an object of type fixed_point<B, I, F> should be usable, it first needs to be constructed. The class provides a set of constructors. First, there is the parameter-less default constructor:

C++
fixed_point()
{ }

Just as with built-in types, no initialization is done. The value is undetermined after executing this constructor.

Second, there is a constructor which allows construction from integer values. This is implemented with a template:

C++
template<typename T> fixed_point(T value) : value_((B)value << F)
{
    BOOST_CONCEPT_ASSERT((boost::Integer<T>));
}

This constructor takes an integer value of type T and converts it to the fixed_point<B, I, F> type. The conversion from the integer format to the fixed point format is done by shifting F positions to the left, so that the position of the binary point is correct.

Third, there is a constructor which allows construction from a boolean value:

C++
fixed_point(bool value) : value_((B)(value * power2<F>::value))
{ }

A fixed_point<B, I, F> object constructed from false has a value of 0.0, a fixed_point<B, I, F> object constructed from true has a value of 1.0.

Fourth, there are some constructors which allow construction from floating point values:

C++
fixed_point(float value) : value_((B)(value * power2<F>::value))
{ }

fixed_point(double value) : value_((B)(value * power2<F>::value))
{ }

fixed_point(long double value) : value_((B)(value * power2<F>::value))
{ }

These constructors take a floating point value of type float, double, or long double, and convert it to the fixed_point<B, I, F> type. The conversion is done by multiplying with an appropriate power of 2 and casting to the result to base type B. The appropriate power of 2 is determined by the number of bits F of the fractional part. This corresponds to the shift operation performed by the integer constructors.

All the constructors so far with one parameter also serve as implicit conversion operators, and convert from the type of the constructor to the fixed_point<B, I, F> type. Therefore, you can initialize fixed_point<B, I, F> variables with numeric values, as follows:

C++
fixed_point<int, 16> a(0);
fixed_point<int, 16> b = -1.5;

Finally, a copy constructor is implemented:

C++
fixed_point(fixed_point<B, I, F> const& rhs) : value_(rhs.value_) 
{ }

The copy constructor simply copies the content of the fixed_point<B, I, F>::value_ member.

Strictly seen, this constructor is not necessary, since the compiler would automatically synthesize a similar copy constructor. However, I like to make things explicit.

As mentioned before, the constructors with one parameter doubly serve as conversions from numeric values to the fixed_point<B, I, F> type. The other direction of the conversion is also needed, and is implemented with conversion operators, for the same numeric types:

C++
template<typename T> operator T() const
{
    BOOST_CONCEPT_ASSERT((boost::Integer<T>));
    return value_ >> F;
}

operator float() const
{
    return (float)value_ / power2<F>::value;
}

operator double() const
{
    return (double)value_ / power2<F>::value;
}

operator long double() const
{
    return (long double)value_ / power2<F>::value;
}

The conversions from type fixed_point<B, I, F> are symmetric to the constructors. As such, the integer conversions shift to the right, and the floating point conversions divide by an appropriate power of 2.

There are a few other remarks to be made here.

Sadly, C++ does not specify the exact behaviour of the shift operators, but leaves it implementation defined. Any implementation is free to do an arithmetic shift (correctly handling the sign bit for negative numbers) or a logic shift (not handling the leftmost bit specially). This has the potential that the code may not work as intended. However, the implementations I tried (Visual Studio 2005, Visual Studio 2008) do the right thing: they do an arithmetic shift for signed numbers, and they do a logic shift for unsigned numbers.

The floating point conversions need to calculate 2 to the power of F. I could have used the runtime function pow, but I did not want to defer a calculation to runtime that could equally well be done at compile time. Unfortunately, it is not so straightforward. I used a template metaprogramming technique I’ve seen somewhere (but I don’t remember where) in order to calculate at compile time:

C++
template<int F> struct power2 
{
    static const long long value = 2 * power2<P-1,T>::value;
};

template <> struct power2<0> 
{
    static const long long value = 1;
};

The power2 template works by template recursion. For example, if F == 2, the following steps are carried out:

  1. power2<2> is instantiated, power2<2>::value is set to 2 * power2<1>::value. Since power2<1>::value is yet unknown, the compiler needs to instantiate power2<1>.
  2. power2<1> is instantiated, power2<1>::value is set to 2 * power2<0>::value. Since power2<0>::value is yet unknown, the compiler needs to instantiate power2<0>.
  3. power2<0> is instantiated, power2<0>::value is 1. Now, the recursion can wind the whole way back, effectively calculating 2 * 2 * 1, which is 2^2, equaling 4.

Operators

Of course, to be able to do something useful with objects of type fixed_point<B, I, F>, some operators are needed.

Assignment and Conversion

There is a simple assignment operator, which is implemented in terms of the copy constructor and the swap operation.

C++
fixed_point<B, I, F> & operator =(fixed_point<B, I, F> const& rhs)
{
    fixed_point<B, I, F> temp(rhs);
    swap(temp);
    return *this;
}

The swap operation itself delegates to the swap of the C++ standard library.

C++
void swap(fixed_point<B, I, F> & rhs)
{ 
    std::swap(value_, rhs.value_); 
}

There is also a version of the assignment which can convert between different fixed-point formats, and consequently also a converting copy constructor is needed. Conversion between different fixed-point formats can be done by shifting the representation left or right as needed.

C++
template<unsigned char I2, unsigned char F2>
fixed_point<B, I, F> & operator =(fixed_point<B, I2, F2> const& rhs)
{
    fixed_point<B, I, F> temp(rhs);
    swap(temp);
    return *this;
}

template<unsigned char I2, unsigned char F2>
fixed_point(fixed_point<B, I2, F2> const& rhs)
    : value_(rhs.value_)
{ 
    if (I-I2 > 0)
        value_ >>= I-I2;
    if (I2<I > 0)
        value_ <<= I2-I;
}

Comparison

For comparison, operator < and operator == are implemented.

C++
bool operator <(fixed_point<B, I, F> const& rhs) const
{
    return value_ < rhs.value_; 
}

bool operator ==(fixed_point<B, I, F> const& rhs) const
{
    return value_ == rhs.value_; 
}

The boost::ordered_field_operators class automatically implements operator >, operator >=, and operator <= in terms of operator <, as well as operator != in terms of operator ==.

In pseudo-code notation, this automatic provision of operators looks like this:

boost::ordered_field_operators

operator <(a, b)

operator >(a, b):

return operator <(b, a);

operator >=(a, b):

return ! operator <(a, b);

operator <=(a, b):

return ! operator <(b, a);

operator ==(a, b)

operator !=(a, b):

return ! operator ==(a, b);

Conversion to bool

Floating point types float and double support conversion to bool. A floating-point value converts to true when it is != 0, and it converts to false otherwise. Thus, I have implemented a conversion to bool, and operator !, which returns just the inverse.

C++
operator bool() const
{
    return (bool)value_; 
}

bool operator !() const
{
    return value_ == 0; 
}

Unary operator –

For signed fixed-point types, you can apply the unary minus operator to get the additive inverse. For unsigned fixed-point types, this operation is undefined. Also, shared with the integer base type B, the minimum value representable by the type cannot be inverted, since it would yield a positive value that is out of range and cannot be represented.

C++
fixed_point<B, I, F> operator -() const
{
    fixed_point<B, I, F> result;
    result.value_ = -value_;
    return result;
}

Increment and Decrement

Floating point types can be incremented and decremented by 1.

C++
fixed_point<B, I, F> & operator ++()
{
    value_ += power2<F>::value;
    return *this;
}

fixed_point<B, I, F> & operator --()
{
    value_ -= power2<F>::value;
    return *this;
}

The boost::unit_steppable class automatically implements operator ++(int) (postincrement) and operator -–(int) (postdecrement) in terms of operator ++ and operator --.

In pseudo-code notation, this automatic provision of operators looks like this:

boost::unit_steppable

operator ++()

operator ++(int):

tmp(*this); ++tmp; return tmp;

operator --()

operator --(int):

tmp(*this); --tmp; return tmp;

Addition and Subtraction

Addition and subtraction are implemented for fixed_point<B, I, F> with operator += and operator -=.

C++
fixed_point<B, I, F> & operator +=(fixed_point<B, I, F> const& summand)
{
    value_ += summand.value_;
    return *this;
}

fixed_point<B, I, F> & operator -=(fixed_point<B, I, F> const& diminuend)
{
    value_ -= diminuend.value_;
    return *this;
}

The boost::ordered_field_operators class automatically implements operator + and operator - in terms of operator += and operator -=.

In pseudo-code notation, this automatic provision of operators looks like this:

boost:: ordered_field_operators

operator +=(s)

operator +(s):

tmp(*this); tmp += s; return tmp;

operator -=(d)

operator -(d):

tmp(*this); tmp -= d; return tmp;

You should be careful with addition and subtraction, since – because of their integer implementation heritage – they can overflow.

Multiplication and Division

Multiplication and division are implemented for fixed_point<B, I, F> with operator *= and operator /=.

C++
fixed_point<B, I, F> & operator *=(fixed_point<B, I, F> const& factor)
{
    value_ = (static_cast<fpml::fixed_point<B, I, F>::promote_type<B>::type>
    (value_) * factor.value_) >> F;
    return *this;
}

fixed_point<B, I, F> & operator /=(fixed_point<B, I, F> const& divisor)
{
    value_ = (static_cast<fpml::fixed_point<B, I, F>::promote_type<B>::type>
    (value_) << F) / divisor.value_;
    return *this;
}

The boost::ordered_field_operators class automatically implements operator * and operator / in terms of operator *= and operator /=.

In pseudo-code notation, this automatic provision of operators looks like this:

boost:: ordered_field_operators

operator *=(s)

operator *(s):

tmp(*this); tmp *= s; return tmp;

operator /=(d)

operator /(d):

tmp(*this); tmp /= d; return tmp;

The implementation of the multiplication and division presents a little challenge. You probably remember: the fixed_point<B, I, F> class has a value_ member of type B, which is an integer type.

Now consider what happens when you multiply two integers, let’s say each of length 8 bits. You will get a result that needs 16 bits. The corner case is when you square the biggest representable value, so let’s do this for our example:

a = 255 = 0xFF
a*a = 65025 = 0xFE01

You clearly see that in order to be able to keep every possible result of the multiplication, you will need 16 bits, that is twice the number of bits than the original factors. In other words: two factors of bits yield a result of bits when multiplied together.

The number of bits of an integer is a property of its type, i.e., an unsigned char has a length of 8 bits, an unsigned short has a length of 16 bits etc. Fortunately, we can do a little type manipulation with templates, in order to find the right bit sizes and types for our multiplication results.

For each type usable as a base class B (a.k.a. small type), we have to find a corresponding type with twice the number of bits that can be used for the result (a.k.a. big type).

Small type

Big type

signed char

signed short

unsigned char

unsigned short

signed short

signed int

unsigned short

unsigned int

signed int

signed long long

unsigned int

unsigned long long

I have provided a set of private template structs that provide the necessary information at compile time:

C++
template<>
struct promote_type<signed char>
{
    typedef signed short type;
};

template<>
struct promote_type<unsigned char>
{
    typedef unsigned short type;
};

template<>
struct promote_type<signed short>
{
    typedef signed int type;
};

template<>
struct promote_type<unsigned short>
{
    typedef unsigned int type;
};

template<>
struct promote_type<signed int>
{
    typedef signed long long type;
};

template<>
struct promote_type<unsigned int>
{
    typedef unsigned long long type;
};

This bigger result is only used temporarily. Fixing the decimal point after the multiplication is done with a right shift of the result of exactly F bits. After the shift, the result is cast back to the original type. And here is where you have to be careful, because the multiplication can overflow, in pretty much the same way as it can for integer multiplication.

The division is similar, in that a dividend with bits is divided through a divisor with bits to yield a result with bits.

Left Shift and Right Shift

The left shift operator << and the right shift operator >> are not defined for floating point types. However, fixed_point<B, I, F> defines them, since it can also be used to emulate an integer type, when F is set to 0. In this case, fixed_point<B, I, F> behaves like an integer, but gives access to the elementary mathematical functions (i.e., useful if you want to calculate the square root of an integer number).

C++
fixed_point<B, I, F> & operator >>=(size_t shift)
{
    value_ >>= shift;
    return *this;
}

fixed_point<B, I, F> & operator <<=(size_t shift)
{
    value_ <<= shift;
    return *this;
}

The boost::shiftable class automatically implements operator >> and operator << in terms of operator >>= and operator <<=.

In pseudo-code notation, this automatic provision of operators looks like this:

boost::shiftable

operator >>=(n)

operator >>(n):

tmp(*this); tmp >>= n; return tmp;

operator <<=(n)

operator <<(n):

tmp(*this); tmp <<= n; return tmp;

Input and Output

For convenience, stream input and output operators have been implemented. I have used conversion to and from double in order to implement these operators.

C++
template<typename S, typename B, unsigned char I, unsigned char F>
S & operator>>(S & s, fpml::fixed_point<B, I, F> & v)
{
    double value=0.;
    s >> value;
    if (s)
        v = value;
    return s;
}

The input stream S is a template parameter. This allows you to use just about any stream, be it a file or string stream, be it an ANSI or wide character stream.

C++
template<typename S, typename B, unsigned char I, unsigned char F>
S & operator<<(S & s, fpml::fixed_point<B, I, F> const& v)
{
    double value = v;
    s << value;
    return s;
}

Again, using the template parameter S for the output stream allows you to use any stream, regardless of whether it is a file or a string stream, or whether it is an ANSI or wide stream.

The streaming operators delegate to the double type. I have not seen any big benefit to implement these operators from scratch for the fixed_point<B, I, F> type. Input and output is inherently slow, so it should be affordable to resort to floating point math in these cases.

Functions

With these operators, a lot of mathematical calculations can be readily carried out with the fixed_point<B, I, F> class, such as those needed in matrix multiplication, etc. However, I felt that the class would not be complete without the other functions that the C++ standard library provides for floating point types, such as sqrt, or sin, or log, and more. While we could convert a fixed_point<B, I, F> value to floating point and then call the respective function from the C++ standard library, this somehow would defeat the purpose of the fixed_point<B, I, F> class in the first place.

Therefore, I have also implemented the mathematical functions for the fixed_point<B, I, F> class. These functions are rather hard to implement correctly. I have gone to some length to choose correct algorithms, but I doubt that the quality is as high as that of some C++ standard library implementations.

fabs

The fabs function computes the absolute value of its argument.

C++
friend fixed_point<B, I, F> fabs(fixed_point<B, I, F> x)
{
    return x < fixed_point<B, I, F>(0) ? -x : x;
}

ceil

The ceil function computes the smallest integral value not less than its argument.

C++
friend fixed_point<B, I, F> ceil(fixed_point<B, I, F> x)
{
    fixed_point<B, I, F> result;
    result.value_ = x.value_ & ~(power2<F>::value-1);
    return result + fixed_point<B, I, F>(
        x.value_ & (power2<F>::value-1) ? 1 : 0);
}

floor

The floor function computes the largest integral value not greater than its argument.

C++
friend fixed_point<B, I, F> floor(fixed_point<B, I, F> x)
{
    fixed_point<B, I, F> result;
    result.value_ = x.value_ & ~(power2<F>::value-1);
    return result;
}

fmod

The fmod function computes the fixed point remainder of x/y.

C++
friend fixed_point<B, I, F> fmod(fixed_point<B, I, F> x, fixed_point<B, I, F> y)
{
    fixed_point<B, I, F> result;
    result.value_ = x.value_ % y.value_;
    return result;
}

modf

The modf function breaks the argument into integer and fraction parts, each of which has the same sign as the argument. It stores the integer part in the object pointed to by ptr, and returns the signed fractional part of x/y.

C++
friend fixed_point<B, I, F> modf(fixed_point<B, I, F> x, fixed_point<B, I, F> * ptr)
{
    fixed_point<B, I, F> integer;
    integer.value_ = x.value_ & ~(power2<F>::value-1);
    *ptr = x < fixed_point<B, I, F>(0) ? 
        integer + fixed_point<B, I, F>(1) : integer;
        
    fixed_point<B, I, F> fraction;
    fraction.value_ = x.value_ & (power2<F>::value-1);
    
    return x < fixed_point<B, I, F>(0) ? -fraction : fraction;
}

exp

The exp function computes the exponential function of x.

C++
friend fixed_point<B, I, F> exp(fixed_point<B, I, F> x)
{
    fixed_point<B, I, F> a[] = {
        1.64872127070012814684865078781, 
        …
        1.00000000046566128741615947508 };
 
    fixed_point<B, I, F> e(2.718281828459045);

    fixed_point<B, I, F> y(1);
    for (int i=F-1; i>=0; --i)
    {
        if (!(x.value_ & 1<<i))
            y *= a[F-i-1];
    }

    int x_int = (int)(floor(x));
    if (x_int<0)
    {
        for (int i=1; i<=-x_int; ++i)
            y /= e;
    }
    else
    {
        for (int i=1; i<=x_int; ++i)
            y *= e;
    }
 
    return y;
}

cos

Calculates the cosine.

The algorithm uses a MacLaurin series expansion.

First, the argument is reduced to be within the range -Pi to Pi. Then, the MacLaurin series is expanded. The argument reduction is problematic, since Pi cannot be represented exactly. The more rounds are reduced, the less significant is the argument (every reduction round makes a slight error), to the extent that the reduced argument and, consequently, the result are meaningless.

The argument reduction uses one division. The series expansion uses 3 additions and 4 multiplications.

C++
friend fixed_point<B, I, F> cos(fixed_point<B, I, F> x)
{
    fixed_point<B, I, F> x_ = fmod(x, fixed_point<B, I, F>(M_PI * 2));
    if (x_ > fixed_point<B, I, F>(M_PI))
        x_ -= fixed_point<B, I, F>(M_PI * 2);

    fixed_point<B, I, F> xx = x_ * x_;

    fixed_point<B, I, F> y = - xx * 
        fixed_point<B, I, F>(1. / (2 * 3 * 4 * 5 * 6));
    y += fixed_point<B, I, F>(1. / (2 * 3 * 4));
    y *= xx;
    y -= fixed_point<B, I, F>(1. / (2));
    y *= xx;
    y += fixed_point<B, I, F>(1);

    return y;
}

sin

Calculates the sine.

The algorithm uses a MacLaurin series expansion.

First, the argument is reduced to be within the range -Pi to Pi. Then the MacLaurin series is expanded. The argument reduction is problematic, since Pi cannot be represented exactly. The more rounds are reduced, the less significant is the argument (every reduction round makes a slight error), to the extent that the reduced argument and, consequently, the result are meaningless.

The argument reduction uses one division. The series expansion uses 3 additions and 5 multiplications.

C++
friend fixed_point<B, I, F> sin(fixed_point<B, I, F> x)
{
    fixed_point<B, I, F> x_ = fmod(x, fixed_point<B, I, F>(M_PI * 2));
    if (x_ > fixed_point<B, I, F>(M_PI))
        x_ -= fixed_point<B, I, F>(M_PI * 2);

    fixed_point<B, I, F> xx = x_ * x_;

    fixed_point<B, I, F> y = - xx * 
        fixed_point<B, I, F>(1. / (2 * 3 * 4 * 5 * 6 * 7));
    y += fixed_point<B, I, F>(1. / (2 * 3 * 4 * 5));
    y *= xx;
    y -= fixed_point<B, I, F>(1. / (2 * 3));
    y *= xx;
    y += fixed_point<B, I, F>(1);
    y *= x_;

    return y;
}

sqrt

The sqrt function computes the nonnegative square root of its argument.

It calculates an approximation of the square root using an integer algorithm. The algorithm is described in Wikipedia: http://en.wikipedia.org/wiki/Methods_of_computing_square_roots.

The algorithm seems to have originated in a book on programming abaci by Mr. C. Woo.

The function returns the square root of the argument. If the argument is negative, the function returns 0.

C++
friend fixed_point<B, I, F> sqrt(fixed_point<B, I, F> x)
{
    if (x < fixed_point<B, I, F>(0))
    {
        errno = EDOM;
        return 0;
    }
    fixed_point<B, I, F>::promote_type<B>::type op = 
        static_cast<fixed_point<B, I, F>::promote_type<B>::type>(
            x.value_) << (I - 1);
    fixed_point<B, I, F>::promote_type<B>::type res = 0;
    fixed_point<B, I, F>::promote_type<B>::type one = 
        (fixed_point<B, I, F>::promote_type<B>::type)1 << 
            (std::numeric_limits<fixed_point<B, I, F>::promote_type<B>
                ::type>::digits - 1);
    while (one > op)
        one >>= 2;
    while (one != 0)
    {
        if (op >= res + one)
        {
            op = op - (res + one);
            res = res + (one << 1);
        }
        res >>= 1;
        one >>= 2;
    }
    fixed_point<B, I, F> root;
    root.value_ = static_cast<B>(res);
    return root;
}

Traits Class std::numeric_limits<>

The fixed_point<B, I, F> class would not be complete without a specialization of the std::numeric_limits<> template. The std::numeric_limits<> template allows you to inquire information about any numeric type, such as its minimum and maximum values, and much more. Traits are indispensable to write truly generic code, code that is agnostic of type and magically works for different types.

Trait

type

value

has_denorm

const float_denorm_style

denorm_absent

has_denorm_loss

const bool

false

has_infinity

const bool

false

has_quiet_NaN

const bool

false

has_signaling_NaN

const bool

false

is_bounded

const bool

true

is_exact

const bool

true

is_iec559

const bool

false

is_integer

const bool

false

is_modulo

const bool

false

is_signed

const bool

numeric_limits<B> (1)

is_specialized

const bool

true

tinyness_before

const bool

false

traps

const bool

false

round_style

const float_round_style

round_toward_zero

digits

const int

I

digits10

const int

digits

max_exponent

const int

0

max_exponent10

const int

0

min_exponent

const int

0

min_exponent10

const int

0

radix

const int

0

min()

fixed_point<B, I, F>

(2)

max()

fixed_point<B, I, F>

(2)

epsilon()

fixed_point<B, I, F>

(2)

round_error()

fixed_point<B, I, F>

(2)

denorm_min()

fixed_point<B, I, F>

(3)

infinity()

fixed_point<B, I, F>

(3)

quiet_NaN()

fixed_point<B, I, F>

(3)

signaling_NaN()

fixed_point<B, I, F>

(3)

  1. Depends on the base type. If the base type B is signed, fixed_point<B, I, F> is signed as well. Otherwise, it is not signed.
  2. These values are calculated based on the template parameters.
  3. These values are meaningless for the fixed-point type, and are set to zero.

Debugging Helpers

Visual Studio supports debugger visualizers, which help the debugger to display variables nicely. Without specialized visualizers, the default display of variables of type fixed_point<B, I, F> is useless first, unless you do the proper conversion to floating point by hand.

fpmath-3.png

The displayed value -49152 doesn’t tell you anything meaningful, unless you happen to know that in order to get at the encoded value, you need to divide the value -49152.

However, this is something a debugger visualizer can automatically do for you. With this visualizer, the value is displayed properly (it doesn’t get the number of digits after the decimal point correct, but this is not a major issue).

fpmath-4.png

Visual Studio saves debugger visualizers are in the text file called autoexp.dat, in the [Visualizer] section. This file can be found in the folder %VSINSTALLDIR%\Common7\Packages\Debugger. Here is the definition of the debugger visualizer for the fixed_point<B, I, F> type:

C++
;------------------------------------------------------------------------------
; fpml::fixed_point
;------------------------------------------------------------------------------
fpml::fixed_point<*,*,*>{
    preview (
        #if ($T3 == 32)( #( $e.value_ / 4294967296., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 31)( #( $e.value_ / 2147483648., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 30)( #( $e.value_ / 1073741824., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 29)( #( $e.value_ / 536870912., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 28)( #( $e.value_ / 268435456., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 27)( #( $e.value_ / 134217728., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 26)( #( $e.value_ / 67108864., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 25)( #( $e.value_ / 33554432., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 24)( #( $e.value_ / 16777216., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 23)( #( $e.value_ / 8388608., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 22)( #( $e.value_ / 4194304., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 21)( #( $e.value_ / 2097152., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 20)( #( $e.value_ / 1048576., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 19)( #( $e.value_ / 524288., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 18)( #( $e.value_ / 262144., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 17)( #( $e.value_ / 131072., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 16)( #( $e.value_ / 65536., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 15)( #( $e.value_ / 32768., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 14)( #( $e.value_ / 16384., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 13)( #( $e.value_ / 8192., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 12)( #( $e.value_ / 4096., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 11)( #( $e.value_ / 2048., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 10)( #( $e.value_ / 1024., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 9)( #( $e.value_ / 512., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 8)( #( $e.value_ / 256., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 7)( #( $e.value_ / 128., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 6)( #( $e.value_ / 64., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 5)( #( $e.value_ / 32., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 4)( #( $e.value_ / 16., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 3)( #( $e.value_ / 8., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 2)( #( $e.value_ / 4., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 1)( #( $e.value_ / 2., " fixed_point ", $T2,".",$T3 )) 
        #elif ($T3 == 0)( #( $e.value_ / 1., " fixed_point ", $T2,".",$T3 )) 
    )
}

If you want to learn more about debugger visualizers for native code, there is a nice documentation available at https://svn.boost.org/trac/boost/wiki/DebuggerVisualizers.

Usage Requirements

The fixed_point<B, I, F> type comes in the form of a header-only library. There is no need to link anything. Just include fixed_point.h where needed, and things should work.

The source code requires a recent version of boost (http://www.boost.org), and assumes that this is installed and can be found on the include path. You may need to make sure that the boost installation can be found. Boost is used for static assertions (assertions at compile time) and concept checking, to make sure that types and values are used as intended. In addition, the boost operators library is used in order to automatically provide a set of operators.

All these boost libraries are header only, and do require the boost files, but do not require that you go through the lengthy process of building any boost libraries.

For now, I’ve tested with Visual Studio 2005 and 2008, but other standard conforming compilers should work as well.

Test

Together with the code, I have provided a test program that tests the class and all functions. Not all possible combinations of types, integer and fractional bit sizes are tested, but a reasonable subset of parameters is.

If your requirements vary, you can easily modify the test program to make sure that your requirements are properly tested.

Benchmarks

While the primary focus of the library is the usage on embedded systems with no floating-point hardware, I couldn’t resist and did some timing measurements on a PC platform. I timed various operations, and compared timings with float and double types. The results are presented as clocks per operation. The benchmark program performs a large number of operations in a loop and counts the clock cycles. The total number of clock cycles is then divided by the number of repetitions.

Function

float

double

15.16

Addition

1.00367

1.00365

1.00373

Multiplication

1.00368

1.00371

1.00376

Division

1.00369

1.0037

1.0037

Sqrt

1.00371

1.00371

473.881

Sine

1.00364

1.00375

162.533

You can see that the basic operations like addition, multiplication, and division are fast, but the functions still need work.

Conclusion

Developing this class was quite an endeavor, and at the beginning, I wouldn’t have imagined how difficult it could be to write elementary functions for a fixed-point class. I have not thought about std::numeric_limits<> and I have not thought about debugger visualizers.

I have read a few books (a click on the thumbnail will take you to the books page on Amazon), and I have learned a lot along the way.

fpmath-5.pngfpmath-6.pngfpmath-7.png

Finally, after a time much longer than anticipated, I got many of the things together, and all in all, I’m partly satisfied with the code now.

Which does not mean anything, because ultimately, it is you who need to be happy as well. I’m very much interested in your feedback, which hopefully will help me and give me enough incentive to further improve the code and its usefulness. Thanks for reading thus far and good luck.

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
Software Developer (Senior)
Germany Germany
I have been occupied with software development in the field of image processing and image analysis since 1986.

I still like programming in general and image processing and analysis in particular.

Comments and Discussions

 
QuestionWhy did you use that fixed point format instead of one based on 2's complement? Pin
Member 146117733-Oct-19 7:31
Member 146117733-Oct-19 7:31 
QuestionCan this class be easily modified to use it for e.g. 24 bits? Pin
Member 146117733-Oct-19 6:14
Member 146117733-Oct-19 6:14 
Questioncan we handle overflow? Pin
mesreenu8-Apr-18 6:38
mesreenu8-Apr-18 6:38 
SuggestionNegative width of integer or fractional part Pin
Andreas Schoenle19-Mar-16 5:39
Andreas Schoenle19-Mar-16 5:39 
BugProblem with the sin/cos and exp functions Pin
gyula.gubacsi4-Dec-15 3:45
gyula.gubacsi4-Dec-15 3:45 
Suggestionpower2 solution Pin
_gl22-May-14 6:58
_gl22-May-14 6:58 
SuggestionStatement Incorrect Pin
Geoffrey Hunter8-May-13 11:20
Geoffrey Hunter8-May-13 11:20 
GeneralMy vote of 5 Pin
YvesDaoust1-Oct-12 3:51
YvesDaoust1-Oct-12 3:51 
GeneralMy vote of 5 Pin
Andreas Gieriet26-Sep-12 3:17
professionalAndreas Gieriet26-Sep-12 3:17 
GeneralMy vote of 5 Pin
SoMad21-Jul-12 14:22
professionalSoMad21-Jul-12 14:22 
QuestionCan you improve your rounding errors. Pin
Charlie Beattie5-Apr-12 6:21
Charlie Beattie5-Apr-12 6:21 
QuestionBug in exp Pin
imxiaozhu20-Feb-12 20:30
imxiaozhu20-Feb-12 20:30 
QuestionBug in sqrt Pin
imxiaozhu20-Feb-12 19:13
imxiaozhu20-Feb-12 19:13 
Questionproblems with the default constructor Pin
Werner Salomon3-Jan-12 23:56
Werner Salomon3-Jan-12 23:56 
Questioncompilation errors Pin
kostocka12-Oct-11 21:01
kostocka12-Oct-11 21:01 
AnswerRe: compilation errors Pin
Member 93959432-Sep-12 19:34
Member 93959432-Sep-12 19:34 
GeneralRe: compilation errors Pin
Zafarbek Ibrohimov29-Oct-12 5:31
Zafarbek Ibrohimov29-Oct-12 5:31 
GeneralVery Nice! Pin
Mohinda14-Oct-09 7:59
Mohinda14-Oct-09 7:59 
GeneralRe: Very Nice! Pin
Peter Schregle14-Oct-09 10:02
Peter Schregle14-Oct-09 10:02 
GeneralRe: Very Nice! Pin
Mohinda15-Oct-09 1:43
Mohinda15-Oct-09 1:43 
Hello Peter,

thanks for your quick reply!

I've written a simple console application that demonstrates the problem. I have to correct what I said yesterday. Actually, the values are not negated when converting to fixed_point back and forth. It seems to be the msvc visualizer (I've copied your code into the specified file) that produces this negation. When I look at the value in debug mode (through the visualizer), I see the negated value. I would be curious if you see the same effect on your machine.

Anyways, the quantization error I was pointing out yesterday is still there. When I convert a double to fixed point and back to double, the obtained quantized value (called fout) is not correct. It is different by one LSB with regard to the value I would expect (called hout). I believe that hout is the correct value since it is closer to the original value, i.e. its quantization error (errh) is smaller than compared to the fixed point quantized value (errf).

I have commented the values I obtain on my machine directly in the code snippet below.


Thanks for your help and good day!

Mohinda


#include <<iostream>>
#include <<iomanip>>
#include "fixed_point.h"


#define myType fixed_point<long,0,31>
#define round(x) ((x)>=0?(long long)((x)+0.5):(long long)((x)-0.5))

using namespace fpml;
using namespace std;

int main(int argc, char* argv[])
{
	double in,fout,hout,err,errf,errh;
	double shft = 2147483648; // shft = 2^31;
	myType fin;
	int x;
	
    cout << endl << setprecision(30);


	in = 0.9435345422234343242;
	cout << "Original value                     = "  << in << endl; // = 0.94353454222343436

	fin = ( myType ) (in);
	cout << "Fixed point value (fixed_point)    = " << fin << endl;	// = 0.94353454187512398

	fout = ( double ) fin;
	cout << "Quantized value (fpml)              = " << fout << endl;// = 0.94353454187512398

	hout = 1/shft*(double) (round(shft*in));
	cout << "Quantized value (here)             = " << hout << endl;// = 0.94353454234078526

	errf = in - fout;
	cout << "Quantization error (fpml)          = " << errf << endl;// = 3.4831038e-010

	errh = hout - in;
	cout << "Quantization error (here)          = " << errh << endl;// = 1.1735090e-010

	err = hout-fout;
	cout << "Distance between fpml and here     = " << err << endl;	//  = 4.65666e-010

	cout << "Quantization step (1/shft)         = " << 1/shft << endl;//= 4.65666e-010


	cin >> x;
	return 0;
}

GeneralRe: Very Nice! Pin
PeterSchregle15-Oct-09 6:36
PeterSchregle15-Oct-09 6:36 
GeneralRe: Very Nice! Pin
Mohinda16-Oct-09 0:05
Mohinda16-Oct-09 0:05 
GeneralRe: Very Nice! Pin
PeterSchregle16-Oct-09 9:01
PeterSchregle16-Oct-09 9:01 
GeneralRe: Very Nice! Pin
Mohinda19-Oct-09 3:45
Mohinda19-Oct-09 3:45 
GeneralRe: Very Nice! Pin
PeterSchregle19-Oct-09 8:13
PeterSchregle19-Oct-09 8:13 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.