## Introduction

The partition scheduling problem asks: given a set S of integer values representing process running times, can S be subdivided into two subsets S' and S'' such that the sum of the running times in S' equals the sum of the running times in S''? Equivalently, the total running time of either subset equals a bound B, where B is exactly half of the total running time of all processes in S. The problem is NP-Complete, which can be proven via a reduction from three-dimensional matching [1].

To date, no NP-Complete problem has a known polynomial-time algorithm [1], and it is conjectured that none exists. However, approximation algorithms have been developed for many NP-Complete problems. An approximation algorithm does not solve the problem optimally in all cases; it trades accuracy for performance. Not every NP-Complete problem admits a good approximation [1], but the partition scheduling problem has several. This article describes a technique by which an approximation for the partition scheduling problem can be generalized to schedule multiprocessor machines where the number of processors is a power of two, i.e., 2, 4, 8, 16, and so on.

## Background

The partition scheduling problem may be approximated using the following heuristic:

**heuristic1**:

1. Sort the input values in descending order.
2. Select values from the largest element down to the smallest element.
3. Assign each value to the processor that currently has the least load.
4. Continue until all elements are assigned.

Thus, a processor receives the next assignment only if it currently has the lowest work load. Such a schedule is greedy in that it selects the most favorable assignment at every iteration, and it produces a good approximation: in the two-processor case, the greedy approximation yields a schedule within 4/3 * OPT [2].
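To make the behavior of heuristic1 concrete, here is a minimal self-contained sketch (the helper name `greedy_partition` is illustrative; the article's own implementation appears later as the function `p`):

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Minimal sketch of heuristic1: greedy two-way partition.
// Returns the resulting load sums of the two processors.
std::pair<long, long> greedy_partition(std::vector<int> s) {
    std::sort(s.begin(), s.end(), std::greater<int>()); // descending order
    long sum_a = 0, sum_b = 0;
    for (int v : s) {
        if (sum_a <= sum_b) sum_a += v; // assign to the least-loaded side
        else                sum_b += v;
    }
    return std::make_pair(sum_a, sum_b);
}
```

For example, on the set {8, 7, 6, 5, 4} the greedy rule yields loads 17 and 13, while the optimal split {8, 7} versus {6, 5, 4} balances at 15 and 15; the greedy result is within the 4/3 bound but not optimal.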

The idea presented in this article is to extend the two processor schedule to multiprocessor schedules where the number of processors is a power of two. This can be done by the following heuristic:

**heuristic2**:

1. Let S be a set of elements.
2. Apply heuristic1 to obtain a schedule for two processors: schedule A and schedule B.
3. Take schedule A, set it to be S, and apply heuristic1, producing schedule A' and schedule A''.
4. Take schedule B, set it to be S, and apply heuristic1, producing schedule B' and schedule B''.
5. Continue, setting each sub-schedule to be S and re-applying heuristic1 as needed.
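The subdivision in heuristic2 can also be expressed recursively. The following is a minimal sketch of that recursive view (not the article's implementation, which uses a queue-based loop; the function name `schedule` is illustrative):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Recursive sketch of heuristic2: split S with heuristic1, then recurse
// on each half until `cpus` schedules remain. `cpus` must be a power of two.
void schedule(std::vector<int> s, int cpus,
              std::vector<std::vector<int>> &out) {
    if (cpus == 1) { out.push_back(s); return; }
    std::sort(s.begin(), s.end(), std::greater<int>());
    std::vector<int> a, b;
    long sum_a = 0, sum_b = 0;
    for (int v : s) {                    // heuristic1: greedy assignment
        if (sum_a <= sum_b) { a.push_back(v); sum_a += v; }
        else                { b.push_back(v); sum_b += v; }
    }
    schedule(a, cpus / 2, out);          // schedule A -> A', A''
    schedule(b, cpus / 2, out);          // schedule B -> B', B''
}
```

Calling `schedule(S, 4, out)` leaves one per-processor schedule in each element of `out`, and every input element appears in exactly one of them.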

One will note that heuristic2 repeatedly subdivides the schedules, each time finding an assignment for a power-of-two number of processors. For example, if there are only two processors, then only schedule A and schedule B are produced by heuristic1. If there are four processors, schedule A and schedule B are each fed back as input set S and subdivided again by the heuristic. The end result is that schedule A is subdivided into schedules A' and A'' for assignment to processors 1 and 2, and, similarly, schedule B is subdivided into schedules B' and B'' for assignment to processors 3 and 4. This assignment requires log2(N) passes, and at each pass a call to the heuristic is made. The code can be implemented within a loop using a queue to hold the schedules. The implementation is as follows:

```cpp
work_q.push(new work(a, size, 0));
for (int i = 0; i < iter; ++i) {
    // std::queue::front() is undefined on an empty queue, so guard it
    j = work_q.empty() ? NULL : work_q.front();
    if (!j) {
        break;
    }
    while (j && j->pass == i) {
        work_q.pop();
        // split the current schedule into two sub-schedules b and c
        p(j->a, j->size, &b, &bsize, &c, &csize);
        if (bsize) work_q.push(new work(b, bsize, j->pass + 1));
        if (csize) work_q.push(new work(c, csize, j->pass + 1));
        free((void *)j->a);
        delete j;
        j = work_q.empty() ? NULL : work_q.front();
    }
}
```

Where `work` is a struct of the following form:

```cpp
struct work {
    work(int *_a, int _size, int _pass)
        : a(_a), size(_size), pass(_pass) { }
    int *a;
    int size;
    int pass;
};
```

And the queue is defined as:

```cpp
queue<work *> work_q;
```

Note that in the above implementation, the initial work pushed into the queue logically represents the initial set S of integer elements, which is stored in the variable `int *a`. The for loop iterates log2(N) times, where N is the number of processors; this value is calculated and stored in the variable `iter`. The while loop pops the work from the queue and processes it. Each unit of work inserted into the queue has its pass value incremented by one, and each time the while loop is entered, all work with pass equal to the current iteration is processed. This ensures that the work is split correctly. For example, for four processors, the number of iterations is 2. The initial pass is 0. When the while loop is first encountered, there is only one unit of work with pass 0, and it is processed by a function called `p`, which contains the approximation logic for the partition scheduling problem. The processing produces two schedules, stored in variables b and c, with their lengths stored in bsize and csize. If these sizes are non-zero, new work is pushed into the queue, this time with pass equal to pass + 1. In the next iteration, there are two schedules to process, since both now have pass equal to 1. In this step, the while loop processes both schedules, which in turn produces four new schedules, as each of the two is split in two. Thus, the loops produce 2^iter schedules, which is equal to N where N is a power of two.

The logic for approximating partition is contained in a function called `p`, and its implementation is:

```cpp
int compare(const void *_a, const void *_b)
{
    int a = *(int *)_a;
    int b = *(int *)_b;
    if (a == b) {
        return 0;
    } else if (a < b) {
        return 1;   /* reversed comparison: sorts in descending order */
    } else {
        return -1;
    }
}

void p(int *a, int size, int **b, int *bsize, int **c, int *csize)
{
    int i = 0, j = 0, k = 0, bsum = 0, csum = 0;
    if (!a || !size) {
        (*b) = (*c) = NULL;
        (*bsize) = (*csize) = 0;
        return;
    }
    qsort(a, size, sizeof(int), compare);
    (*b) = (int *)calloc(1, size * sizeof(int));
    (*c) = (int *)calloc(1, size * sizeof(int));
    (*bsize) = 0;
    (*csize) = 0;
    for (i = 0, j = 0, k = 0; i < size; i++) {
        if (bsum <= csum) {          /* b has the lesser (or equal) load */
            (*b)[j++] = a[i];
            (*bsize)++;
            bsum += a[i];
        } else {                     /* c has the lesser load */
            (*c)[k++] = a[i];
            (*csize)++;
            csum += a[i];
        }
    }
}
```

Before the approximation runs, the input set of elements stored in array `a` (referred to in the heuristic as set S) must be sorted in descending order. The approximation then runs, selecting the processor with the least processing time and assigning it the next element in descending order. Notice that the function produces two new arrays, b and c, and their respective sizes, bsize and csize. The approximation implemented is the aforementioned greedy approximation for partition scheduling. It uses two variables, bsum and csum, to keep track of which processor has the least load at every pass; maintaining running sums is more efficient than recalculating the sums at every pass.

## Using the Code

The code is available in a C++ file called multischedule.cc. It has been compiled and tested on Linux using g++; build it with the command `g++ multischedule.cc -o ms`. Then, to test the algorithm, run:

**./ms data.dat**

Where data.dat is a file containing integer values in the following format:

```
Nr_Elements Nr_CPUs elem1 elem2 ... elemN
```

All integer values are separated by a single white space. Nr_Elements is the number of elements (process running times) given in the file. Nr_CPUs is the number of CPUs, which must be a power of two, for which the algorithm is to find a schedule.

An example instance may be:

```
20 4 24 27 12 54 6 4 7 10 54 55 26 10 9 43 54 2 9 6 38 34
```

where there are 20 elements and 4 processors followed by 20 elements (processor times) from 24 to 34.

The algorithm then produces the following schedule:

```
size [20] cpus [4] log2 [2] queue size [4]
cpu = [0] len = [5] { 55 34 24 6 2 } sum = [121]
cpu = [1] len = [5] { 54 38 10 10 9 } sum = [121]
cpu = [2] len = [5] { 54 43 12 7 6 } sum = [122]
cpu = [3] len = [5] { 54 27 26 9 4 } sum = [120]
```

Notice that there are four schedules, one for each processor, and the total cost of running each assigned schedule is printed. These times are 121, 121, 122, and 120 for processors 0 to 3, respectively. The times are close to optimal, if not optimal, in this instance: the maximum difference between any two schedules is just two time units, and for two of the processors (0 and 1) the times are equal. At each pass, a 4/3 * OPT approximation algorithm [2] is run to produce a schedule, and the resulting final schedules are efficient.
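One quick way to judge the output is to compare the largest per-CPU sum against the trivial lower bound ceil(total / Nr_CPUs), since no schedule can finish faster than a perfectly balanced one. A small sketch of that check (the helper `lower_bound_makespan` is illustrative, not part of multischedule.cc):

```cpp
#include <numeric>
#include <vector>

// The optimal makespan can never be smaller than ceil(total / cpus),
// so this bound shows how close a heuristic schedule came to optimal.
int lower_bound_makespan(const std::vector<int> &elems, int cpus) {
    int total = std::accumulate(elems.begin(), elems.end(), 0);
    return (total + cpus - 1) / cpus;   // integer ceil(total / cpus)
}
```

For the instance above, the element total is 484, so the bound is 484 / 4 = 121; the schedule's worst per-CPU sum is 122, i.e., within one time unit of any possible schedule.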

## Points of Interest

It is intended that heuristic2 be usable interchangeably with any approximation of the partition scheduling problem. However, not all approximations may yield a good schedule for the multiprocessor assignment. The approximation suggested in heuristic1 worked well in the cases for which it was tried. To change the algorithm to use a different approximation, one implements the function `p` in the code using a different approach.

## References

- [1] Garey, M. R., & Johnson, D. S. (2000). *Computers and Intractability*. New York, NY: W. H. Freeman and Company.
- [2] Mertens, S. (2003). The Easiest Hard Problem: Number Partitioning. Retrieved March 17, 2012 from http://www.arxiv.org/pdf/cond-mat/0310317.