Quadruple precision floating-point calculations in C/C++/FORTRAN

Shmuel Safonov

5.00/5 (2 votes)

Sep 18, 2017

CPOL

6 min read

32473

1027

Creation a numeric library that calculates with quadruple floating-point precision and used from MSVC C/C++ code

Introduction

The article describes how to build a numeric library that performs calculations with quadruple floating-point precision and how to access the library from MSVC C/C++ code. The technique is illustrated by an example. This example demonstrates a dramatic increase in precision of the calculation compared to those performed with thestandard double precision.

Background

Most of existing commercial compilers support the double precision calculation as the highest precision possible. This is due to the hardware restrictions of CPU. Extending the precision demands to emulate the precise calculations on the existing hardware thus leading to vast increase of the computation time.

The multi-precision libraries such as GNU MPFR (http://www.mpfr.org/) can perform the calculations with arbitrary precision. However, the usage of these libraries requires the user to rewrite the existing C or FORTRAN code in large extent. The quadruple floating point method is a good compromise between the double precision and the multi-precision calculations since it does not require the rewriting of the existing code (assuming it is supported by the compiler).

It is suggested to implement the quadruple floating-point calculations in C/C++ or FORTRAN dynamic or static library. The advantages of the precision increase is shown by the example of the polynomial roots calculation based on the LAPACK library usage. For this purpose the LAPACK library is compiled with the quadruple floating-point precision.

Floating-point number formats and variable types

The floating point numbers are presented in the format defined by IEEE Standard for Floating-Point Arithmetic (IEEE 754), see Wiki (https://en.wikipedia.org/wiki/IEEE_754). A floating-point number is presented as following:

| sign bit (S) |exponent bits (exp) | mantissa bits (man) |

The presented number is (-1)^S * 1.mantissa * 2^(exp) where 'mantissa' consists of binary digits and 'exp' is considered as a singed number.

We shall examine the following basic formats:

Format	Description	Mantissa bits	Exponent bits	Decimal digits	Decimal exponent range
binary32	Single precision	24	8	7.22	38.23
binary64	Double precision	53	11	15.95	307.95
binary128	Quadruple precision	113	15	34.02	4931.77

The corresponding variable types in C++/FORTRAN are shown in the following table:

	binary32	binary64	binary128
C/C++	`float`	`double`	`_Quad` (Intel),`__float128` (GNU quadmath)
FORTRAN (real)	REAL*4	REAL*8	REAL*16
FORTRAN (complex)	COMPLEX*8	COMPLEX*16	COMPLEX*32

The popular compilers that support the quadruple floating point emulation on x86 and x86-64 CPU are Intel C/C++ and Fortran and GNU GCC and GFORTRAN with quadlib library. The Microsoft VC compiler does not support this type and is not seem to support it in the nearest future!

LAPACK library quadruple floating-point compilation

The LAPACK library is a standard numerical package (http://www.netlib.org/lapack/) that implements the linear algebra algorithms. The library is used in this article to calculate the roots of a polynomial. The library is FORTRAN written but is has it's C-port CLAPACK. The standard version of this library provides only the double precision functionality. The lack of higher precision implementation is most likely due to the fact that the double precision arithmetic is enough for most applications.

The two ways of the LAPACK modification for quadruple precision are proposed:

Modification of LAPACK FORTRAN source
Modification of CLAPACK C source

In the first case type declarations for variables REAL*8, REAL, DOUBLE PRECISION, COMPLEX should be replaced by REAL*16, COMPLEX*32.

If the FORTRAN90 standard is used, the types should be replaced by real (kind=selected_real_kind (32)) and complex (kind=selected_real_kind (32)) correspondingly.

Replace REAL*8 and COMPLEX*16 versions of in-build functions by their REAL*16and COMPLEX*32versions. This procedure yet seems to require substantial modification of the code but it is rather straightforward.

In the second case, the include file f2c.h should be modified in order to redefine the basic types. The prototypes of the standard functions also should be replaced by their quadruple versions.

Polynomial roots calculation

The polynomial roots are calculated using the method of companion matrix. The companion matrix eigenvalues are the roots of the original polynomial. The companion matrix is calculated from the polynomial coefficients and the matrix eigenvalues are calculated using LAPACK function zgeevx.

Using the Code

Here is a brief description of how to use the article for coding.

The example code consists of the following blocks:
test_qroot.exe : Win32 executable, build by MSVC
qroot.dll : DLL, quadruple precision utilities,
build by GNU C++ with QuadMath or Intel C/C++ or Intel Visual Fortran

lapack_qd_w32.dll : DLL, quadruple precision LAPACK library,
build by GNU C++ with QuadMath or Intel C/C++ or Intel Visual Fortran

The quadruple floating-point type in qroot.dll is declared as follows

// File: clapack_types.h
#ifdef __INTEL_COMPILER
typedef _Quad real;
typedef _Quad doublereal;
#else
#include <quadmath.h>
typedef __float128 real;
typedef __float128 doublereal;
#endif

The prototypes of exported functions in qroot.dll are defined as follows:

//
// Calculate polynomial roots
//
extern "C"
{
__declspec(dllexport) void qroots(doublecomplex *r, doublecomplex *p, long *degP);
}
...

The same types in the caller application are declared as follows

// File: MyQuad.h
struct CMyQuad
{
long a1;
long a2;
long a3;
long a4;
};
typedef struct CMyQuad MyQuad;
typedef MyQuad* PQuad;

struct CMyCQuad
{
 long ra1;
 long ra2;
 long ra3;
 long ra4;
 long ia5;
 long ia6;
 long ia7;
 long ia8;
};
typedef struct CMyCQuad MyCQuad;
typedef MyCQuad* PCQuad;
...

The prototypes of the same imported functions in test_qroot.exe are defined as:

//
// Calculate polynomial roots
//
extern "C"
{
 void  qroots(PCQuad r, PCQuad p, long *degP);
}
...

It is critical here to preserve the same type size! The functions parameters should be passed by pointer, this also preserves the compatibility with FORTRAN.

Now, you must set addition compilation options.

For Intel C/C++ compiler the options are below. They enable the usage of the _Quad type.

-Qoption,cpp,--extended_float_type

For Intel Visual Fortran compiler the options as following. They interpret default real, complex, double precision, and double complex variable types and constants as 64-bit real, 128-bit complex, 128-bit real, and 256-bit complex. Note that the definite types should be replaced manually. Also it is recommended to increase the allowed line length for the LAPACK compilation.

/real-size:128 /Qimf-precision:high /extend_source:132

For the GNU C++ compiler you should only command to link with QuadMath library:

-lquadmath

For the GNU Gfortran compiler the options are:

-fdefault-real-8 -ffixed-line-length-132

The intermediate results of PCQuad type may be passed to the other utility functions. Finally a utility function may convert the results to standard double type.

A tip: If DLLs are built by GCC, then when linking the example by MSVC set /SAFESEH:NO in Advanced MSVC Linker Options

Numeric results

The calculation with standard double precision is performed in MATLAB by the following script:

% Create polynom P(x) = ((x-x00)(x-conj(x00)) )^9
% Calculate the roots and compare with the original root x00.
ar = 0.00001;
ai  = 10;

x0 = ar + ai*j;
x1 = ar - ai*j;

va = [1, -2*ar,ar^2+ai^2];
vb = va;
vb2 = conv(va,va);
vb4 = conv(vb2,vb2);
vb8 = conv(vb4,vb4);
vb9 = conv(vb8,va);
vb9r = real(vb9);
aa = vb9r;

xx =  roots( aa);
 disp('Roots=');
disp(xx);
for cnt=1:length(xx)
    yy(cnt) = polyval(aa,xx(cnt));   
end
disp('P(roots)=');
disp(yy.');

The results are

% Roots=
%   0.112215172144653 +10.131442757995252i
%   0.112215172144653 -10.131442757995252i
%   0.000010235951691 +10.173357532550744i
%   0.000010235951691 -10.173357532550744i
%  -0.112194813184584 +10.131443062600157i
%  -0.112194813184584 -10.131443062600157i
%  -0.169232753684938 +10.027588249761465i
%  -0.169232753684938 -10.027588249761465i
%   0.169252831553145 +10.027587788028235i
%   0.169252831553145 -10.027587788028235i
%  -0.146212177557002 + 9.913350066605950i
%  -0.146212177557002 - 9.913350066605950i
%   0.146231942318667 + 9.913349664139163i
%   0.146231942318667 - 9.913349664139163i
%  -0.057077609506728 + 9.840940518344002i
%  -0.057077609506728 - 9.840940518344002i
%   0.057097171965100 + 9.840940359975059i
%   0.057097171965100 - 9.840940359975059i

% P(roots)=
%   1.0e+004 *
%
%  -3.584000000000000 + 0.515000000000000i
%  -3.584000000000000 - 0.515000000000000i
%  -7.155200000000000 + 0.000062243652344i
%  -7.155200000000000 - 0.000062243652344i
%  -5.324800000000000 - 0.705800000000000i
%  -5.324800000000000 + 0.705800000000000i
%  -4.300800000000000 - 1.023400000000000i
%  -4.300800000000000 + 1.023400000000000i
%  -4.838400000000000 + 0.984000000000000i
%  -4.838400000000000 - 0.984000000000000i
%  -6.412800000000000 - 1.089000000000000i
%  -6.412800000000000 + 1.089000000000000i
%  -6.464000000000000 + 1.140400000000000i
%  -6.464000000000000 - 1.140400000000000i
%  -3.520000000000000 - 0.276900000000000i
%  -3.520000000000000 + 0.276900000000000i
%  -5.030400000000000 + 0.330800000000000i
%  -5.030400000000000 - 0.330800000000000i
%

The same calculations in the present example give roots

/// Roots:
A(0,0)=(1.00761e-005,-10.0016)
A(1,0)=(-0.000986753,-10.0012)
A(2,0)=(0.00100687,-10.0012)
A(3,0)=(0.00153725,-10.0003)
A(4,0)=(-0.00151723,-10.0003)
A(5,0)=(-0.0013331,-9.99922)
A(6,0)=(0.00135303,-9.99922)
A(7,0)=(-0.000520498,-9.99854)
A(8,0)=(0.000540355,-9.99854)
A(9,0)=(1.08464e-005,10.0014)
A(10,0)=(0.000940125,10.0011)
A(11,0)=(-0.000918832,10.0011)
A(12,0)=(0.00143429,10.0003)
A(13,0)=(-0.001414,10.0003)
A(14,0)=(0.00126206,9.99928)
A(15,0)=(-0.0012429,9.99928)
A(16,0)=(0.000503879,9.99864)
A(17,0)=(-0.00048546,9.9986)

and there evaluation by the original polynomial P(roots):

polyval(0.000010,-10.001551)=-5.75096e-014+ i-1.21989e-017
polyval(-0.000987,-10.001188)=-4.41869e-014+ i1.55312e-017
polyval(0.001007,-10.001188)=-5.32907e-014+ i-4.62412e-017
polyval(0.001537,-10.000269)=-5.00711e-014+ i-5.98751e-017
polyval(-0.001517,-10.000269)=-5.63993e-014+ i5.07407e-017
polyval(-0.001333,-9.999225)=-3.36398e-014+ i1.62088e-017
polyval(0.001353,-9.999225)=-3.90799e-014+ i-4.17824e-017
polyval(-0.000520,-9.998543)=-5.80647e-014+ i8.72783e-018
polyval(0.000540,-9.998543)=-5.973e-014+ i-3.54263e-017
polyval(0.000011,10.001446)=-2.81997e-014+ i7.48016e-017
polyval(0.000940,10.001107)=-3.66374e-014+ i9.91367e-017
polyval(-0.000919,10.001108)=-1.12133e-014+ i6.8237e-017
polyval(0.001434,10.000250)=-4.25215e-014+ i1.1563e-016
polyval(-0.001414,10.000252)=-3.06422e-014+ i4.59973e-017
polyval(0.001262,9.999276)=-4.70735e-014+ i1.17243e-016
polyval(-0.001243,9.999278)=-3.43059e-014+ i4.35578e-017
polyval(0.000504,9.998641)=-4.07452e-014+ i8.93112e-017
polyval(-0.000485,9.998641)=-3.9968e-014+ i5.99157e-017

We can see that absolute root error of 0.16 decreased to 0.002; the polynomial evaluation of 20000 decreased to 1e-14.

Points of Interest

Before I chose the described technique, I tried the multi-precision LAPACK library (MLAPACK). I did not succeed since function zgeevx was not implemented and only some functions that deals with symmetrical matrices worked.

I also tried to use new object-oriented FORTRAN standards and incapsulate standart types by the implementation that uses the MFPR library. I also failed due to incorrect functionality of destructors in GNU GFORTRAN implementation.

It turned out that Intel Compiler has an advantage over MSVC by supporting the _Quad variable type. Moreover, the numeric programming in Intel Visual Fortran is much simpler than that of Intel C/C++. It supports debugging, viewing and printing the quadruple variables while C++ environment has no these debugging features. The correct code high-lighting for _Quad type variables is missing also.

But the main force of FORTRAN is direct operations on the complex numbers and working with matrices and arrays. It seems to be a great heritage of the nostalgic DEC compiler but the most of developers barf from FORTRAN now :).

And if you do not want to pay for a commercial compiler, use GNU C/C++/GFORTRAN.

P.S. It is interesting that the precision of the computation of multiple
polynomial roots may be increased with double precision calculations using
the algebraic technique developed by ZHONGGANG ZEN and implemented
in his MultRoot package:

[1] ZHONGGANG ZENG, "COMPUTING MULTIPLE ROOTS OF INEXACT POLYNOMIALS",
MATHEMATICS OF COMPUTATION,
Volume 74, Number 250, Pages 869-903
S 0025-5718(04)01692-8
Article electronically published on July 22, 2004

[2] ZHONGGANG ZENG, "MultRoot - A Matlab package computing polynomial roots and multiplicities"
Journal ACM Transactions on Mathematical Software (TOMS), Volume 30 Issue 2, June 2004 , Pages 218-236