Multiple Precision Arithmetic: A Multiplication Algorithm

Andrea Simonassi

1.00/5 (1 vote)

Jun 11, 2022

CPOL

14 min read

10407

The first and simplest of the multiplication algorithms, fast enough to multiply not too large numbers

This article is part of a series of articles:

Multiple Precision Arithmetic: A recreational project

Teaching the Long Multiplication Algorithm to a Digital Computer

This is the third part, we will teach our computer the long multiplication. In this article, we will focus on naive elementary school algorithm.

Many different algorithms exist for long multiplication, the simplest of them all is the elementary school algorithm, which is very important to do right on the computer, because despite big O tells us it's a O(n²) algorithm, each of its primitive parts are very fast, does not require extra memory allocation, so it has advantage over all "fastest" algorithms for small values of n (let's say less than hundred of 64-bit words):

Better algorithm exists, but there is a threshold for which the elementary school algorithm is faster than any other. Using another perspective to say the same: any faster algorithm is going to be slower than this when n below a given threshold: as a consequence, it is useful to hybrid asymptotically faster algorithms with "elementary school" when n below an accurately selected threshold; on source code, you will find that Karatsuba algorithm is faster when used in conjunction with this algorithms.

So this algorithm besides being the humblest, is worth the highest grade of optimization, you are probably going to use it at leaf levels of any divide and conquer algorithm, that means it is going to be used a lot.

I hope to be able to write something on faster algorithms soon.

Ok, said that, let's remember how to do it manually, then port it to computer. This chapter is divided into three parts: on the first, we will find a primitive operation, then we'll analyze the algorithm, finally we will discuss something about implementation on hardware.

1 . Simple Multiplication Using Positional Notation, Base 10

Let’s start with “simple multiplication” (or primitive multiplication, you name it), the multiplication of pair of numbers A and B, where A and B have exactly one digit, this is going to be the primitive upon which we will build our long multiplication.

To compute simple multiplication, you could repeat sums, for example 3x4 can be computed as 3+3+3+3 (4 times).

…or just lookup pre-calculated b-base Pythagorean table.

Example: 3x4 using Pythagorean table, base 10: search intersection between row 3 and col 4, the result is 12.

Multiplying a single digit number by a single digit number may return 2 digits numbers (never more).

Now we have a primitive operation to build the long multiplication, we call it "simple multiplication", we will use Pythagorean table when doing manually and CPU MUL instruction (or equivalent) on computer hardware.

2 . Long Integer Multiplication

The long multiplication algorithm is the following, you have a string of digits A, a string of digits B, foreach position i, j of strings A and B, multiply A[i] * B[j] then write the partial result at position i+j, finally sum partial results.

3 . Long Integer Multiplication and the Digital Computer

We revised the algorithm using pencil and paper with decimal digits. The CPU version is exactly the same. We use the analogy that a CPU register is like a decimal digit and that instead of Pythagorean table lookup, we use an equivalent hardware instruction, which is the multiplication instruction.

There are implementations using base 2 (because 1-bit Pythagorean table is quite simple) but my conjecture is that approach is inefficient (unless implemented in hardware).

I try now to show why using biggest CPU registry is better than using smaller registers or worse single bits: for example, I'll compare multiplication on base 4 and on base 16.

Consider the multiplication of hex E * D (decimal result is 182).

On base-16, we do single simple multiplication: ExD=decimal 182

On base-4, the same multiplication is 32*31, thus we have to compute: (1*2),(1*3),(3*2),(3*3), get their respective results, put partial results in a matrix then compute the long sum.

Conjecture: The moral of the story is: doubling the bits, we can reduce by a factor 4 (at least) the algorithm work for bigger numbers.

Corollary: 64-bit CPU makes our algorithm do at least 4 times less work than a 32-bit CPU (on large numbers).

Corollary: Even if each single 64-bit multiplication costs 4 times more than 32-bit (not, it costs more electricity but not 4x the time) we should still have big gain using 64-bits, because less long sums are needed and less inner loops iterations, not to say about base 2.

Corollary: To multiply 32-bit numbers, better to use 32-bit operation, because single 32-bit operations cost less than 64-bit operation.

Corollary: Any base-2 implementation cannot compete with any n-bit software long multiplication.

Disclaimer: The above conjecture is not evidence based.

The Algorithm

The basic algorithm is the same we learned at school, it is applicable both to computer and to pencil/paper, as usual here I provide a high level overview of the algorithm. It have a big advantage in comparison to other faster algorithms: it's memory requirements are easily predictable, requires no extra allocation than the result array.

Some Consideration Before You Turn This Into Software

To turn the algorithm into a program, we have to think a little about how to effectively implement it.

On sum and subtraction things were easy, but now need more support by our high level language to efficiently use hardware. My conjecture is that the bigger the registry we use, the faster the multiplication is going to be. To falsify that hypothesis, we need to check 2 variants of the code, one using all the CPU power, the other working on what the language we use could provide.

Let me clarify a bit: suppose, to be working on a 64-bit computer: not all high-level languages have a multiplication function that returns two 64-bit values from a single multiplication (as 64 bit CPUs usually do) nor they all have a 128-bit integer type.

We must choose a strategy between two extremes:

Sacrifice portability, write machine specific to use fastest instruction for each specific machine
Sacrifice performance and use whatever your higher-level language provides

My initial intention was to go for option 2: we just want to know how it works, MP libraries created by real experts do exist if you want better throughput, after I decided to use just a little bit of assembler because of some restriction on int size with my compiler, I'm not really using "fastest machine instructions" but neither a pure high level language solution.

Let’s start with an algorithm that is portable, I’ll use constructs that are easily available on a high level language (the C Language, for instance) and try to keep the algorithm as simple as possible. There is only a tiny part of the code that needs to access "secret" hardware feature (multiply two register sized ints and return 2 register ints) because my compiler doesn't let me use 128 bit integer types.

Using the code Understanding the Code

Note to the reader: Here, I explain how I decided to implement the thing, this is not the only way.

If we had a compiler supporting 128 bit integers, I'd do something like the following: I'd be using two unsigned integer datatypes provided by the language: one we’ll call “word type”, then we need another unsigned type, twice as big, that we’ll call “double word”. The ideal, as per the conjecture I made on previous paragraphs, is that the word type has maximum CPU register size.

Some languages provide union type, using union make things simpler, otherwise we use bit shift to get the upper word of the multiplication result.

On this article, I will use union, a decent starting point could be something like (pay attention to endianness):

/*
 We must provide this header for each target platform, or use conservative settings,
 the rest of the algorithm will not change.
 A configuration script could figure out what native type to use for the target machine
*/
typedef uint32_t _word_t;  /*Suppose hour CPU is 32 bit: 
                             ensure we use CPU sized words, in case we have a 64 bit CPU

*/
typedef uint64_t _dword_t; /*machine double words*/

typedef union
{
    _dword_t dword;
    struct
    {
        _word_t L; /*little endian*/
        _word_t H;
    } Pair;
} _word_struct;

But, I have to fight against the fact that my VC++ compiler doesn't (or did not) have a 128 bit type, but my machine is x64 machines, that forces me to use 32 bit word type (on multiplication), and that's not ideal.

As a compromise, I came up with this solution: write a single algorithm, then have the Pythagorean lookup be delegated to another function, that can be written in C or maybe in assembler (the assembler routine will not be called inline probably, but it's probably faster than the downgraded multiplication of half-words).

The result is the following:

#include "BigIntSimple.h"
#include <assert.h>

/* here is the pythagorean function callback declaration, 
   the portable algorithm will call this without knowing 
   which is the actual implementation
 */
typedef reg_t (*_mul_func)(reg_t A, reg_t B,  reg_t* Hiword) ;

/* here i have a small x64 assembler routine implementation 
   of the above callback, which unfortunately is not going to be 
   inlined though, so you might replace it with platform specific intrinsics 
*/
EXTERN reg_t cpu_multiply(reg_t A, reg_t B, reg_t * high);

#ifdef NO_DWORD_INTS
/* 
   this version is a C version for compilers with no dword ints, as mine,
   so this will software emulate the 64 bit multiplication using 32 bit ints
   (but please don't think "64 bit" and "32 bit" as absolute values, it depends
    on underlying hardware, it could be "128 bit word" and "256 bit dword")

   the definition / declaration are private for this module
*/
static reg_t c_multiply(reg_t A, reg_t B, reg_t * high){ ... }
#else
/* BEWARE: this version is for compilers with dword ints which I did not test...
   below is the definition of a version working for compilers that support
   a "dword" integer type (dword i mean an int type twice as big as the register size)

   the definition / declaration are private for this module
*/
static reg_t c_multiply(reg_t A, reg_t B, reg_t * high){ ... }
#endif

/*
   Here we have a private multiplication function that can receive an additional 
   parameter containing the callback function for the multiplication
*/
static numsize_t LongMultiplicationPortable
(reg_t *A, numsize_t m, reg_t *B, numsize_t n, reg_t * R, _mul_func mfunc)
{...}

/* these are the public interface implementation */

/* first we define a version using the assembler version 
   "pythagorean lookup" as callback
*/
numsize_t LongMultiplication(reg_t *A, numsize_t m, reg_t *B, numsize_t n, reg_t * R)
{
    return LongMultiplicationPortable(A, m, B, n, R, cpu_multiply);
}

/* then we define a version using one of the 2 "C" version callbacks
*/
numsize_t LongMultiplicationNoAssembly
(reg_t *A, numsize_t m, reg_t *B, numsize_t n, reg_t * R)
{
    return LongMultiplicationPortable(A, m, B, n, R, c_multiply);
}

and this is the small x64 assembler version of the Pythagorean lookup, which (as we'll see later) make the C code runs 2,5 times faster.

.code
cpu_multiply proc
    mov rax, rdx    
    mul rcx
    mov qword ptr[r8], rdx
    ret
cpu_multiply endp
end

Here is the complete C listing for one of the LongMultiplicationPortable function, on source you find different versions.

static numsize_t LongMultiplicationPortableV2
(reg_t *A, numsize_t m, reg_t *B, numsize_t n, reg_t * R, _mul_func mfunc)
{
    numsize_t outBuffSize = m + n, k;
    reg_t hiword, loword;
    int carry;
    
    for (k = 0; k < outBuffSize; ++k)
        R[k] = 0; /*reset output array*/
    
    for (numsize_t j = 0; j < m; j++)/* read  left number*/            
        for (numsize_t i = 0; i < n; i++) /* read right number*/
        {
            k = i + j;

            loword = mfunc(A[j], B[i], &hiword);

            R[k] += loword;
            carry = R[k] < loword;    /* detect the carry */
            ++k; /* next digit */
            R[k] += hiword + carry;
            carry = (R[k] < hiword) | (carry & (R[k] == hiword)); /* ugly carry 
                                                                     detection */

            while (carry) /* that is unlikely to loop too much */
            {
                R[++k] += 1;     /*sum the previous carry to Result[i+j+k]    */
                carry = R[k] == _R(0);
            }
        }    
        
    while (outBuffSize > 0 && R[outBuffSize - 1] == 0)
        outBuffSize--;

    return outBuffSize;
}

Some more notes to let you understand the above code: A and B are input long numbers, the reg_t type is sized to the machine register size, m is size of A and n the size of B, also we receive a pre-allocated R number (the caller should allocate it by considering that it could contain up to n+m digits). No input checking, no precondition checking, no clean data structure, plain and raw, this must not be an API, this is a primitive that an API will wrap and will also be used by the long division algorithm, so we don't want to do costly stuff at that level.

Remember that A and B numbers have least significant digit at index 0 and most significant digit at index A[m-1] and B[n-1].

Most significant digit of A and B is supposed to not be 0 but the algorithm works anyway, it’s just a waste of time and memory.

I also created in a rush an assembler x64 version, which I somewhat tested and looks like it's working, then I was not happy and created a new version that is a little bit better and I pasted it here below, any x64 expert will do better than that, just for reference:

.data

.code
    ;int LongMulAsm2(
    ;     qword  * A,         RCX
    ;     int ASizeInQWords,     eDX
    ;     qword * B,             R8
    ;     int BSizeInQWords,     R9d
    ;     qword * R            ON STACK
    ; );            
; RAX, RCX, RDX, R8, R9, R10, R11  VOLATILE

LongMulAsmVariant_1 proc
    
    ; mul by zero ret 0
    xor rax, rax
    test rdx, rdx
    jz immediate_ret     
    test r9,r9
    jz immediate_ret

    push rbp
    mov rbp, rsp
    
    push r12
    push r13
    push r14
    push r15

    mov r10, rdx
    
    mov r11, [rbp + 60o]  ; r11 <- R

    ;reset R array    
    add rdx, r9    
    xor rax,rax
_reset_loop:
    mov  [r11+rdx*8], rax           
    sub  edx, 1  
    jns  _reset_loop    

    xor r12, r12
    
_outer_loop:
    cmp r12, r10 ; r12 = j
    je _outer_loop_end

        ; it is unlikely that one digit is zero so... don't optimize that case
        xor r13,r13
        
        ; k= i + j

        mov r15, [rcx +  r12 * 8]
        
        _inner_loop:
            cmp r13, r9
            je _inner_loop_end        
        
            lea r14, [r11 + r12 * 8]

            ; pointer to result
            ; multiply *A * *B
            mov rax, [r8 + r13 * 8]

            lea r14, [r14 + r13*8]   ;randomly mixing instructions 
                                     ;in the hope to achieve better results

            mul r15                
        
            add rax, [r14]        ; R[i+j] += loword (first part)
            adc rdx, [r14+8]      ; R[i+j+1] += highword + carry (first part)
            mov [r14], rax        ; R[i+j] += loword (second part)
            mov [r14+8], rdx      ; R[i+j+1] += highword + carry (second part)
            jnc _inner_loop_continue
                lea r14, [r14+8]
                _carry_loop:                
                    lea r14, [r14+8]
                    mov rax, [r14]
                    add rax, 1
                    mov [r14], rax
                    jnc _carry_loop_ends
                    jmp _carry_loop
                _carry_loop_ends:

            _inner_loop_continue:
            
            inc r13
            jmp _inner_loop
        _inner_loop_end:

    _outer_loop_continue:
    add r12,1
    jmp _outer_loop
_outer_loop_end:
    
    ;compute size of result, the number of iteration will be 1 or 2 if
    ;input numbers have no leading zeroes
        
    lea rax, [r11 + r10 * 8]
    lea rax, [rax + r9 * 8]

        _compute_size_loop:
            cmp r11, rax ;    pointer to begin of array == pointer to end of array ?
            jg  _end_compute_size_loop    
            sub rax, 8    ; pointer to end --
            mov r10, [rax]
            test r10, r10  ; *pointer_to_end == 0 ?
            jz _compute_size_loop
        _end_compute_size_loop:
    
    add rax, 8; need size not index
    sub rax, r11 ; distance in bytes    
    shr eax, 3 ; distance in words , this is the return value    
    
; resume stack frame & non volatile

    pop r15    
    pop r14
    pop r13
    pop r12

    mov rsp, rbp
    pop rbp

    immediate_ret:
ret
LongMulAsmVariant_1 endp
end

Testing the Implementation

Without having to dig too much into details, we basically need some tests to verify correctness of any implementation, in source code, you will find some of those tests. Here I list some possibilities:

Checkout that A * B = B * A, commutative property
Check that A * 1 = A. (one is identity element)
Check that A * 0 = 0. (that's trivial)
If you already implemented SUM, checkout that the series An = (A_n-1+A_n-1) = A₁*2ⁿ. I give you an example algorithm for that:
1. start with A₁ = 5, set also A_n=5.
2. repeat, say 7, times A_n=A_n+A_n, successive values for An are (5,10,20,40,80,160,320,640) at the end, A_n value is 640.
3. then execute Temp=A₁*2⁷ which is 5*128 = 640, finally check that Temp= A_n
Check associative property: (A*B)*C=A*(B*C)
Checkout that (A*B MOD m) = (A MOD m) * (B MOD m), if you choose modulo 9 on base 10 this is called "casting out nine test", on 32 bits number, an easy one is 2³²-1. Casting out nines (in our case casting out 2³²-1s) is easy to do since you just need to sum all the single digits to compute modulo: an example on base 10.
```
53 * 31 = 1643 -> 1+6+4+4 = 14 -> 1+ 4 = 5, so 53*31 MOD 9 = 5.
	53 -> 5 + 3 -> 8, so 53 MOD 9 = 8
	31 -> 3 + 1 -> 4, so 31 MOD 9 = 4
	8 * 4 = 32 -> 3 + 2 = 5, so (53 MOD 9) * (31 MOD 9) = 5
```
The 2 moduli on bold text have the same value, so far so good.

On source code, I provide algorithm for compute A MOD 2⁶⁴-1, this test is not bulletproof but if it says "your algorithm is wrong", then the algorithm is definitely wrong.
Compare implementation against each other, if one of them is wrong, you will notice it, if both are wrong, you have a good chance to notice it unless they have the same error.
Do tests on peculiar input numbers for which the output is well known.
Also do tests on some manually crafted numbers so that you are sure to trigger some part of your code: for example, if you have optimization for when you get a zero digit, you better create a number having some zeroes inside otherwise your branch will have a low chance to be ever executed on random tests.
We should have some tests for handling the possibility that an implementor will mess up the carry detection algorithm, that's difficult to do so I didn't do it, just be extremely careful and test the carry detection part thoroughly in a separate project, because this is a kind of error very unlikely to be caught doing random number tests.

By having a routine passing all of the above tests, we may have a good confidence on implementation correctness. Of course, we have to test it against some well crafted sumbers and then run some tests on randomly chosen numbers having any possible different length.

Proofing Correctness Using Math

If you want to achieve better confidence, besides adding more and more tests, you can always proof your implementation using mathematical techniques which are beyond what I am able to do, but more or less it is something like: for each line of code, build a proof that the program state after execution of instruction is correct (respects assertions), for example.

carry = ((operand + operand2 + carry) < (operand + carry)

To proof correctness of this instruction, first you need to have some assertions about the state of the program like: after execution, carry value will be 1 if operand + operand 2 + carry overflows, then you should proof that for each possible input state (result, operand, operand2, carry) _{that you know because you had a proof for all the previous instructions}, the output state respects your assertion.

For example in this case, suppose that your computer works on base 10, so result, operand and operand2 are base 10 digits and carry can be only 1 or 0.

We can then say that the instruction is not respecting assertion if we have operand = 9, operand2 = 0 and carry = 1:

carry = (9 + 0 + 1) <  (9 + 1)

so:

carry = 10 < 10

and since we working modulo 10:

carry = 0 < 0

thus:

carry = 0

But we had overflow because 9 + 0 + 1 does not fit a base 10 digit.

So did I prove that the instruction is not correct? Depends on what we proved about input state, if we can prove that operand will never be 9, then the instruction is correct...

That's just the idea, things are more profound but maybe you liked the idea and will dig into it.

Aftermath

Note that the proposed algorithm requires much more work than the Long sum.

There are three nested loops, the second of those is using multiplication, a relatively expensive op, but on modern hardware, it is expensive on terms of power consumption and transistors rather than time.

In general, a multiplication is many times more expensive than a sum, but less expensive than a division... because the division runs many multiplications so multiplication algorithm performance will affect directly division performance.

The division will be the last part of the series. Since the algorithm complexity is O(N^2), the test I made is on 4KB*2KB numbers (10K iterations per run) if I keep 512KB numbers as I did for sum it runs for too long.

The above results were produced by VS 2017 C++ on i7 processor yr2011. (The algorithm is not parallel so using 1 core.)

Anyway, I think I demonstrated that the 32 bit version of the algorithm (the one on the yellow bar, which corresponds to the LongMultiplicationNoAssembly func) runs over two times slower than the version using native 64 bits CPU MUL, as it should be in theory, would be nice to have a good C version able to use 64 bit registers.

I re run the tests after removing the supposed "optimization" of skip loop when we have one of the digit operands be zero: I started thinking about the probability to get a full 64 operand at zero... well in base 10 could be smart to improve a case that is going to happen 1 in 10 times, but in base 64 is a really bad idea to write some code to handle a case that is going to happen 1 in billion-billion times, the above gray assembler procedure already ignores that "optimization" and it's a lot faster than the other, let's make the other algorithm do the same so we can do a fair comparison.

Here are the updated stats, improvements are small but measurable:

The above results were produced by VS 2017 C++ on i7 processor yr2011. (The algorithm is not parallel so using 1 core.)

A final thought, we should get a better idea of which is the speed limit we could aim in finding better implementations:

What I achieved here on best implementation is to run 180 million of word multiplications per seconds per core, if I'm not wrong, check out my calculation:

My 3GHZ CPU spent then, if Windows performance counter is correct, average 16 cycles of clock per word multiplication, it is maybe possible to do better using this algorithm.

Below, some new results were produced by VS 2022 C++ on i9-10900 processor. (the algorithm is not parallel so using 1 core).

History

2017: Started writing
6^th February, 2019: Almost ready to publish
20^th March, 2019: Added new implementation to try to demonstrate my conjecture
9^th June, 2020: Added suggestion to test for implementation correctness
11^th June, 2022: Online