Then parallelizing the extended-precision arithmetic is pointless.
I recommend using the Machin formula:
pi = 16 arctg(1/5) - 4 arctg(1/239),
where the arctg are evaluated using the Maclaurin expansion
arctg(x) = Sum over i >= 0 of (-1)^i x^(2i+1)/(2i+1)
(compute the powers of x by recurrence, multiplying the previous power by x^2).
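20 digits can nearly be handled by a 64-bit integer (fixed-point), which holds about 19 decimal digits, but you'll need a bit more. With three 32-bit integers (about 28 digits), you are on the safe side. For convenience, you can work in arithmetic base 10000000000 (note that a limb in this base needs more than 32 bits; 1000000000 is the largest power of ten that fits in a 32-bit word).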
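As a rough illustration of the fixed-point idea, here is a minimal sketch in which Python's built-in arbitrary-precision integers stand in for the multi-word integers. The names `DIGITS`, `GUARD`, `SCALE` and `arctan_inv` are made up for this example, and the five guard digits are an assumed safety margin against truncation error, not a figure from the text above.

```python
# Fixed-point sketch: every value is an integer scaled by SCALE.
DIGITS = 20
GUARD = 5                       # assumed extra digits to absorb truncation error
SCALE = 10 ** (DIGITS + GUARD)

def arctan_inv(q):
    """Return arctg(1/q) * SCALE as an integer, for an integer q > 1."""
    q2 = q * q
    power = SCALE // q          # x^(2i+1) in fixed point, starting at i = 0
    total = power
    i = 1
    sign = -1
    while power:                # stop once the power underflows to zero
        power //= q2            # recurrence: next odd power of x = 1/q
        total += sign * (power // (2 * i + 1))
        sign = -sign
        i += 1
    return total

pi_fixed = 16 * arctan_inv(5) - 4 * arctan_inv(239)
pi_digits = str(pi_fixed // 10 ** GUARD)    # drop the guard digits
print(pi_digits[0] + "." + pi_digits[1:])
```

Note that with x = 1/q the whole computation only needs additions, subtractions and divisions by small integers (q^2 and 2i+1).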
You'll need to implement long addition, multiplication, and division by a small integer. Given the small operand length, it is probably not worthwhile to use efficient multiplication algorithms (like Karatsuba).
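The long operations can be sketched as follows. This is only an illustration under assumptions of mine: numbers are lists of limbs in base 10^9, most significant first, with the first limb holding the integer part; the function names (`add`, `sub`, `mul_small`, `div_small`, `arctan_inv`) and the choice of four limbs are hypothetical. Subtraction is included because the series alternates in sign.

```python
BASE = 10 ** 9   # assumed limb base: fits in a 32-bit word

def add(a, b):
    """Add two equal-length limb lists; a carry out of the top limb is dropped."""
    result, carry = [0] * len(a), 0
    for i in reversed(range(len(a))):
        carry, result[i] = divmod(a[i] + b[i] + carry, BASE)
    return result

def sub(a, b):
    """Subtract b from a (equal lengths, assumes a >= b)."""
    result, borrow = [0] * len(a), 0
    for i in reversed(range(len(a))):
        diff = a[i] - b[i] - borrow
        borrow = 1 if diff < 0 else 0
        result[i] = diff + borrow * BASE
    return result

def mul_small(a, m):
    """Multiply a limb list by a small non-negative integer m."""
    result, carry = [0] * len(a), 0
    for i in reversed(range(len(a))):
        carry, result[i] = divmod(a[i] * m + carry, BASE)
    return result

def div_small(a, d):
    """Divide a limb list by a small positive integer d, truncating."""
    result, remainder = [0] * len(a), 0
    for i in range(len(a)):
        # in C, remainder * BASE + a[i] would need a 64-bit intermediate
        result[i], remainder = divmod(remainder * BASE + a[i], d)
    return result

def arctan_inv(q, limbs=4):
    """arctg(1/q) as limbs: one integer limb plus (limbs - 1) * 9 decimals."""
    power = div_small([1] + [0] * (limbs - 1), q)
    total = power
    i = 1
    while any(power):
        power = div_small(power, q * q)
        term = div_small(power, 2 * i + 1)
        total = sub(total, term) if i % 2 else add(total, term)
        i += 1
    return total

pi = sub(mul_small(arctan_inv(5), 16), mul_small(arctan_inv(239), 4))
print(str(pi[0]) + "." + "".join("%09d" % limb for limb in pi[1:]))
```

The last few printed digits are affected by truncation in the divisions, which is why the sketch carries more limbs than 20 digits strictly require.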
Some hints for parallelization:
- let every processor compute a range of consecutive terms; every processor then needs to start at some power of x (with M terms per processor, processor k starts at powers 2kM+1, 2kM+3, 2kM+5...), hence the need for fast power computation (by repeated squaring) to initialize.
- alternatively, a processor accumulates every N-th term (powers 2k+1, 2N+2k+1, 4N+2k+1, 6N+2k+1... for processor k among N), multiplying each time by x^(2N).
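The two ways of splitting the series can be sketched like this. It is a minimal illustration, not a tuned implementation: the workers are simulated by plain function calls (each call is independent, so it could be handed to a separate process), exact `Fraction` arithmetic replaces the fixed-point numbers so that the partial sums can be compared exactly, and the names `partial_block`, `partial_strided`, `serial`, `N` and `M` are chosen for this example.

```python
from fractions import Fraction

def partial_block(x, k, m):
    """Worker k sums the m consecutive terms i = k*m .. k*m + m - 1."""
    x2 = x * x
    start = k * m
    power = x ** (2 * start + 1)        # starting power for this worker
    sign = -1 if start % 2 else 1
    total = Fraction(0)
    for i in range(start, start + m):
        total += sign * power / (2 * i + 1)
        power *= x2                     # next odd power, by recurrence
        sign = -sign
    return total

def partial_strided(x, k, n, count):
    """Worker k of n sums the terms i = k, k + n, k + 2n, ... (count terms)."""
    step = x ** (2 * n)                 # jump between this worker's powers
    power = x ** (2 * k + 1)
    total = Fraction(0)
    for j in range(count):
        i = k + j * n
        sign = -1 if i % 2 else 1
        total += sign * power / (2 * i + 1)
        power *= step
    return total

def serial(x, terms):
    """Reference: the first `terms` terms of the series, summed in order."""
    return sum((-1) ** i * x ** (2 * i + 1) / (2 * i + 1) for i in range(terms))

x = Fraction(1, 5)
N, M = 4, 5                             # hypothetical: 4 workers, 5 terms each
blocks = sum(partial_block(x, k, M) for k in range(N))
strided = sum(partial_strided(x, k, N, M) for k in range(N))
print(blocks == strided == serial(x, N * M))
```

In both variants the only communication is the final sum of the N partial results.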
Below is a very crude implementation of Machin's formula in Python, using floating point:
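def ArcTg(X):
    # Maclaurin series of arctg(X), truncated after the X^15 term
    Sum = X
    Term = X
    Y = -X * X  # ratio between consecutive odd powers, sign included
    for I in range(3, 17, 2):
        Term *= Y  # next signed odd power, by recurrence
        Sum += Term / I
    return Sum

print(16 * ArcTg(1. / 5) - 4 * ArcTg(1. / 239))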