## Introduction

The article describes how to build a numeric library that performs calculations with quadruple floating-point precision and how to access the library from MSVC C/C++ code. The technique is illustrated by an example. This example demonstrates a dramatic increase in precision of the calculation compared to those performed with thestandard double precision.

## Background

Most of existing commercial compilers support the double precision calculation as the highest precision possible. This is due to the hardware restrictions of CPU. Extending the precision demands to emulate the precise calculations on the existing hardware thus leading to vast increase of the computation time.

The multi-precision libraries such as GNU MPFR (http://www.mpfr.org/) can perform the calculations with arbitrary precision. However, the usage of these libraries requires the user to rewrite the existing C or FORTRAN code in large extent. The quadruple floating point method is a good compromise between the double precision and the multi-precision calculations since it does not require the rewriting of the existing code (assuming it is supported by the compiler).

It is suggested to implement the quadruple floating-point calculations in C/C++ or FORTRAN dynamic or static library. The advantages of the precision increase is shown by the example of the polynomial roots calculation based on the LAPACK library usage. For this purpose the LAPACK library is compiled with the quadruple floating-point precision.

### Floating-point number formats and variable types

The floating point numbers are presented in the format defined by IEEE Standard for Floating-Point Arithmetic (IEEE 754), see Wiki (https://en.wikipedia.org/wiki/IEEE_754). A floating-point number is presented as following:

| sign bit (S) |exponent bits (exp) | mantissa bits (man) |

The presented number is (-1)^{S} * 1.mantissa * 2^{(exp)} where 'mantissa' consists of binary digits and 'exp' is considered as a singed number.

We shall examine the following basic formats:

Format | Description | Mantissa bits | Exponent bits | Decimal digits | Decimal exponent range |

binary32 | Single precision | 24 | 8 | 7.22 | 38.23 |

binary64 | Double precision | 53 | 11 | 15.95 | 307.95 |

binary128 | Quadruple precision | 113 | 15 | 34.02 | 4931.77 |

The corresponding variable types in C++/FORTRAN are shown in the following table:

| binary32 | binary64 | binary128 |

C/C++ | `float` | `double` | `_Quad` (Intel),`__float128` (GNU quadmath) |

FORTRAN (real) | REAL*4 | REAL*8 | REAL*16 | |

FORTRAN (complex) | COMPLEX*8 | COMPLEX*16 | COMPLEX*32 | |

The popular compilers that support the quadruple floating point emulation on x86 and x86-64 CPU are Intel C/C++ and Fortran and GNU GCC and GFORTRAN with quadlib library. The Microsoft VC compiler does not support this type and is not seem to support it in the nearest future!

### LAPACK library quadruple floating-point compilation

The LAPACK library is a standard numerical package (http://www.netlib.org/lapack/) that implements the linear algebra algorithms. The library is used in this article to calculate the roots of a polynomial. The library is FORTRAN written but is has it's C-port CLAPACK. The standard version of this library provides only the double precision functionality. The lack of higher precision implementation is most likely due to the fact that the double precision arithmetic is enough for most applications.

The two ways of the LAPACK modification for quadruple precision are proposed:

- Modification of LAPACK FORTRAN source
- Modification of CLAPACK C source

In the first case type declarations for variables `REAL*8, REAL, DOUBLE PRECISION, COMPLEX`

should be replaced by `REAL*16, COMPLEX*32`

.

If the FORTRAN90 standard is used, the types should be replaced by `real (kind=selected_real_kind (32))`

and `complex (kind=selected_real_kind (32))`

correspondingly.

Replace `REAL*8`

and `COMPLEX*16`

versions of in-build functions by their `REAL*16`

and `COMPLEX*32`

versions. This procedure yet seems to require substantial modification of the code but it is rather straightforward.

In the second case, the include file f2c.h should be modified in order to redefine the basic types. The prototypes of the standard functions also should be replaced by their quadruple versions.

### Polynomial roots calculation

The polynomial roots are calculated using the method of companion matrix. The companion matrix eigenvalues are the roots of the original polynomial. The companion matrix is calculated from the polynomial coefficients and the matrix eigenvalues are calculated using LAPACK function `zgeevx`

.

## Using the Code

Here is a brief description of how to use the article for coding.

The example code consists of the following blocks:

test_qroot.exe : Win32 executable, build by MSVC

qroot.dll : DLL, quadruple precision utilities,

build by GNU C++ with QuadMath or Intel C/C++ or Intel Visual Fortran

lapack_qd_w32.dll : DLL, quadruple precision LAPACK library,

build by GNU C++ with QuadMath or Intel C/C++ or Intel Visual Fortran

The quadruple floating-point type in qroot.dll is declared as follows

#ifdef __INTEL_COMPILER
typedef _Quad real;
typedef _Quad doublereal;
#else
#include <quadmath.h>
typedef __float128 real;
typedef __float128 doublereal;
#endif

The prototypes of exported functions in qroot.dll are defined as follows:

extern "C"
{
__declspec(dllexport) void qroots(doublecomplex *r, doublecomplex *p, long *degP);
}
...

The same types in the caller application are declared as follows

struct CMyQuad
{
long a1;
long a2;
long a3;
long a4;
};
typedef struct CMyQuad MyQuad;
typedef MyQuad* PQuad;
struct CMyCQuad
{
long ra1;
long ra2;
long ra3;
long ra4;
long ia5;
long ia6;
long ia7;
long ia8;
};
typedef struct CMyCQuad MyCQuad;
typedef MyCQuad* PCQuad;
...

The prototypes of the same imported functions in test_qroot.exe are defined as:

extern "C"
{
void qroots(PCQuad r, PCQuad p, long *degP);
}
...

It is critical here to preserve the same type size! The functions parameters should be passed by pointer, this also preserves the compatibility with FORTRAN.

Now, you must set addition compilation options.

For Intel C/C++ compiler the options are below. They enable the usage of the `_Quad`

type.

-Qoption,cpp,--extended_float_type

For Intel Visual Fortran compiler the options as following. They interpret default real, complex, double precision, and double complex variable types and constants as 64-bit real, 128-bit complex, 128-bit real, and 256-bit complex. Note that the definite types should be replaced manually. Also it is recommended to increase the allowed line length for the LAPACK compilation.

/real-size:128 /Qimf-precision:high /extend_source:132

For the GNU C++ compiler you should only command to link with QuadMath library:

-lquadmath

For the GNU Gfortran compiler the options are:

-fdefault-real-8 -ffixed-line-length-132

The intermediate results of `PCQuad`

type may be passed to the other utility functions. Finally a utility function may convert the results to standard `double`

type.

A tip: If DLLs are built by GCC, then when linking the example by MSVC set /SAFESEH:NO in Advanced MSVC Linker Options

### Numeric results

The calculation with standard double precision is performed in MATLAB by the following script:

% Create polynom P(x) = ((x-x00)(x-conj(x00)) )^9
% Calculate the roots and compare with the original root x00.
ar = 0.00001;
ai = 10;
x0 = ar + ai*j;
x1 = ar - ai*j;
va = [1, -2*ar,ar^2+ai^2];
vb = va;
vb2 = conv(va,va);
vb4 = conv(vb2,vb2);
vb8 = conv(vb4,vb4);
vb9 = conv(vb8,va);
vb9r = real(vb9);
aa = vb9r;
xx = roots( aa);
disp('Roots=');
disp(xx);
for cnt=1:length(xx)
yy(cnt) = polyval(aa,xx(cnt));
end
disp('P(roots)=');
disp(yy.');

The results are

% Roots=
% 0.112215172144653 +10.131442757995252i
% 0.112215172144653 -10.131442757995252i
% 0.000010235951691 +10.173357532550744i
% 0.000010235951691 -10.173357532550744i
% -0.112194813184584 +10.131443062600157i
% -0.112194813184584 -10.131443062600157i
% -0.169232753684938 +10.027588249761465i
% -0.169232753684938 -10.027588249761465i
% 0.169252831553145 +10.027587788028235i
% 0.169252831553145 -10.027587788028235i
% -0.146212177557002 + 9.913350066605950i
% -0.146212177557002 - 9.913350066605950i
% 0.146231942318667 + 9.913349664139163i
% 0.146231942318667 - 9.913349664139163i
% -0.057077609506728 + 9.840940518344002i
% -0.057077609506728 - 9.840940518344002i
% 0.057097171965100 + 9.840940359975059i
% 0.057097171965100 - 9.840940359975059i
% P(roots)=
% 1.0e+004 *
%
% -3.584000000000000 + 0.515000000000000i
% -3.584000000000000 - 0.515000000000000i
% -7.155200000000000 + 0.000062243652344i
% -7.155200000000000 - 0.000062243652344i
% -5.324800000000000 - 0.705800000000000i
% -5.324800000000000 + 0.705800000000000i
% -4.300800000000000 - 1.023400000000000i
% -4.300800000000000 + 1.023400000000000i
% -4.838400000000000 + 0.984000000000000i
% -4.838400000000000 - 0.984000000000000i
% -6.412800000000000 - 1.089000000000000i
% -6.412800000000000 + 1.089000000000000i
% -6.464000000000000 + 1.140400000000000i
% -6.464000000000000 - 1.140400000000000i
% -3.520000000000000 - 0.276900000000000i
% -3.520000000000000 + 0.276900000000000i
% -5.030400000000000 + 0.330800000000000i
% -5.030400000000000 - 0.330800000000000i
%

The same calculations in the present example give roots

/// Roots:
A(0,0)=(1.00761e-005,-10.0016)
A(1,0)=(-0.000986753,-10.0012)
A(2,0)=(0.00100687,-10.0012)
A(3,0)=(0.00153725,-10.0003)
A(4,0)=(-0.00151723,-10.0003)
A(5,0)=(-0.0013331,-9.99922)
A(6,0)=(0.00135303,-9.99922)
A(7,0)=(-0.000520498,-9.99854)
A(8,0)=(0.000540355,-9.99854)
A(9,0)=(1.08464e-005,10.0014)
A(10,0)=(0.000940125,10.0011)
A(11,0)=(-0.000918832,10.0011)
A(12,0)=(0.00143429,10.0003)
A(13,0)=(-0.001414,10.0003)
A(14,0)=(0.00126206,9.99928)
A(15,0)=(-0.0012429,9.99928)
A(16,0)=(0.000503879,9.99864)
A(17,0)=(-0.00048546,9.9986)

and there evaluation by the original polynomial P(roots):

polyval(0.000010,-10.001551)=-5.75096e-014+ i-1.21989e-017
polyval(-0.000987,-10.001188)=-4.41869e-014+ i1.55312e-017
polyval(0.001007,-10.001188)=-5.32907e-014+ i-4.62412e-017
polyval(0.001537,-10.000269)=-5.00711e-014+ i-5.98751e-017
polyval(-0.001517,-10.000269)=-5.63993e-014+ i5.07407e-017
polyval(-0.001333,-9.999225)=-3.36398e-014+ i1.62088e-017
polyval(0.001353,-9.999225)=-3.90799e-014+ i-4.17824e-017
polyval(-0.000520,-9.998543)=-5.80647e-014+ i8.72783e-018
polyval(0.000540,-9.998543)=-5.973e-014+ i-3.54263e-017
polyval(0.000011,10.001446)=-2.81997e-014+ i7.48016e-017
polyval(0.000940,10.001107)=-3.66374e-014+ i9.91367e-017
polyval(-0.000919,10.001108)=-1.12133e-014+ i6.8237e-017
polyval(0.001434,10.000250)=-4.25215e-014+ i1.1563e-016
polyval(-0.001414,10.000252)=-3.06422e-014+ i4.59973e-017
polyval(0.001262,9.999276)=-4.70735e-014+ i1.17243e-016
polyval(-0.001243,9.999278)=-3.43059e-014+ i4.35578e-017
polyval(0.000504,9.998641)=-4.07452e-014+ i8.93112e-017
polyval(-0.000485,9.998641)=-3.9968e-014+ i5.99157e-017

We can see that absolute root error of 0.16 decreased to 0.002; the polynomial evaluation of 20000 decreased to 1e-14.

## Points of Interest

Before I chose the described technique, I tried the multi-precision LAPACK library (MLAPACK). I did not succeed since function `zgeevx `

was not implemented and only some functions that deals with symmetrical matrices worked.

I also tried to use new object-oriented FORTRAN standards and incapsulate standart types by the implementation that uses the MFPR library. I also failed due to incorrect functionality of destructors in GNU GFORTRAN implementation.

It turned out that Intel Compiler has an advantage over MSVC by supporting the `_Quad`

variable type. Moreover, the numeric programming in Intel Visual Fortran is much simpler than that of Intel C/C++. It supports debugging, viewing and printing the quadruple variables while C++ environment has no these debugging features. The correct code high-lighting for `_Quad `

type variables is missing also.

But the main force of FORTRAN is direct operations on the complex numbers and working with matrices and arrays. It seems to be a great heritage of the nostalgic DEC compiler but the most of developers barf from FORTRAN now :).

And if you do not want to pay for a commercial compiler, use GNU C/C++/GFORTRAN.

P.S. It is interesting that the precision of the computation of multiple

polynomial roots may be increased with double precision calculations using

the algebraic technique developed by ZHONGGANG ZEN and implemented

in his MultRoot package:

[1] ZHONGGANG ZENG, "COMPUTING MULTIPLE ROOTS OF INEXACT POLYNOMIALS",

MATHEMATICS OF COMPUTATION,

Volume 74, Number 250, Pages 869-903

S 0025-5718(04)01692-8

Article electronically published on July 22, 2004

[2] ZHONGGANG ZENG, "MultRoot - A Matlab package computing polynomial roots and multiplicities"

Journal ACM Transactions on Mathematical Software (TOMS), Volume 30 Issue 2, June 2004 , Pages 218-236

Software Engineer and Control Systems Expert at Digital Feedback Technologies Ltd.

PhD, Tel-Aviv University, 2009

M.S. , Faculty of Mechanics and Mathematics of Lomonosov Moscow State University,1994

Blog page: http://csafonov.blogspot.co.il/

Personal Open-source project: TMC Compiler, MATLAB to C code converter (https://sourceforge.net/projects/tmc-m-files-to-c-compiler)