Introduction
The partition scheduling problem asks: given
a set S of integer values representing process running times, can the set be subdivided
into two subsets S' and S'' such that the sum of the process running times in
S' equals the sum of the process running times in S''? Equivalently, the total
sum of running times in either subset equals a bound B, where B
is exactly half of the total running time of all processes in S. This problem
is NP-Complete, which can be proven via a reduction from three-dimensional
matching [1].
To date, no NP-Complete problem has a known polynomial-time
algorithm [1], and it is conjectured that no such algorithm exists. However,
approximation algorithms have been developed for many NP-Complete problems. An
approximation algorithm does not solve the problem optimally in all cases; it trades
accuracy for performance. Not all NP-Complete problems can be approximated [1];
the partition scheduling problem, however, has several approximations. This article
describes a technique by which an approximation for the partition scheduling problem
can be generalized to approximate scheduling for multiprocessor machines where the
number of processors is a power of two, i.e., 2, 4, 8, 16, and so on.
Background
The partition scheduling problem may be approximated
using the following heuristic:
heuristic1:
    sort the input values in descending order
    select values from the largest element down to the smallest
    assign each value to the processor that currently has the least load
    continue until all elements are assigned
Thus, a processor receives its next assignment whenever it has the
lowest workload. Such a schedule is greedy in that it selects the most favorable
assignment at every iteration, and it produces a good approximation for the problem.
In the two processor case, the greedy approximation yields a schedule within
4/3 * OPT [2].
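To make the greedy rule concrete, here is a minimal, self-contained sketch of heuristic1. It is illustrative only and is not taken from the article's multischedule.cc; the function name greedy_partition is hypothetical. It returns only the two subset sums, which is enough to see how the rule behaves.

```cpp
// Minimal sketch of heuristic1: greedy two-way partition.
// greedy_partition is an illustrative name, not part of multischedule.cc.
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Returns the two subset sums produced by the greedy rule.
std::pair<int, int> greedy_partition(std::vector<int> s)
{
    // sort input values in descending order
    std::sort(s.begin(), s.end(), std::greater<int>());
    int sum_a = 0, sum_b = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // assign the value to the processor with the least load
        if (sum_a <= sum_b)
            sum_a += s[i];
        else
            sum_b += s[i];
    }
    return std::make_pair(sum_a, sum_b);
}
```

On the set {8, 7, 6, 5, 4}, the greedy rule yields sums 17 and 13, while the optimal split {8, 7} / {6, 5, 4} achieves 15 and 15; this illustrates why the heuristic is an approximation rather than an exact method.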
The idea presented in this article is to extend the
two processor schedule to multiprocessor schedules where the number of processors
is a power of two. This can be done by the following heuristic:
heuristic2:
    let S be a set of elements
    apply heuristic1 to S to obtain a two processor schedule: schedule A and schedule B
    set S = schedule A and apply heuristic1, producing schedule A' and schedule A''
    set S = schedule B and apply heuristic1, producing schedule B' and schedule B''
    continue, setting each sub-schedule to be S and re-applying heuristic1 as needed
Note that heuristic2 repeatedly subdivides the schedules,
each pass finding an assignment for a power-of-two number of processors. For example,
if there are only two processors, heuristic1 produces just schedule A and schedule
B. If there are four processors, schedule A and schedule B are each fed back as
the input set S and subdivided again by the heuristic. The end result is that
schedule A is subdivided into schedules A' and A'' for assignment to processors
1 and 2, and similarly, schedule B is subdivided into schedules B' and B'' for
assignment to processors 3 and 4. This assignment requires log2 N passes, where
N is the number of processors, and during each pass the heuristic is called once
for every schedule produced in the previous pass. The code can be implemented
as a loop using a queue to hold the schedules. The implementation is as follows:
// seed the queue with the full input set at pass 0
work_q.push(new work(a, size, 0));
for(int i = 0; i < iter; ++i) {
    // std::queue::front is undefined on an empty queue, so guard it
    j = work_q.empty() ? NULL : work_q.front();
    if(!j) {
        break;
    }
    // process every schedule that belongs to the current pass
    while(j && j->pass == i) {
        work_q.pop();
        // split j into two sub-schedules b and c via the approximation
        p(j->a, j->size, &b, &bsize, &c, &csize);
        if(bsize)
            work_q.push(new work(b, bsize, j->pass + 1));
        if(csize)
            work_q.push(new work(c, csize, j->pass + 1));
        free((void *)j->a);
        delete j;
        j = work_q.empty() ? NULL : work_q.front();
    }
}
Where work is a struct of the following form:
struct work {
    work(int *_a, int _size, int _pass)
        : a(_a), size(_size), pass(_pass)
    {
    }
    int *a;    // schedule elements
    int size;  // number of elements
    int pass;  // subdivision pass this work belongs to
};
And the queue is defined as:
queue<work *> work_q;
Note that in the above implementation, the initial work pushed
into the queue logically represents the initial set S of integer elements, which
is stored in the variable int *a. The for loop iterates log2 N times, where N is
the number of processors; this value is calculated and stored in the variable iter.
The while loop pops work from the queue and processes it. Each new unit of work
inserted into the queue has its pass value incremented by one, and each time the
while loop is entered, all work with pass equal to the current iteration is
processed. This ensures that the work is split correctly. For example, for four
processors, the number of iterations is 2. The initial pass is 0. When the while
loop is first encountered, there is only one unit of work with pass 0, and it is
processed by a function called p, which contains the approximation logic for the
partition scheduling problem. The processing produces two schedules, stored in
the variables b and c, with their lengths stored in bsize and csize. If these
sizes are non-zero, new work is pushed into the queue, this time with pass equal
to pass + 1. In the next iteration, there are two schedules with pass equal to 1,
and the while loop processes both; this in turn produces four new schedules as
each of the two schedules is split in two. Thus, the loops produce 2^iter
schedules, which is equal to N where N is a power of two.
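The same pass structure can also be expressed recursively: each level of recursion halves the processor count until one schedule per processor remains. The sketch below is illustrative and independent of multischedule.cc; split and schedule_rec are hypothetical names, and split is just heuristic1 restated.

```cpp
// Recursive sketch of heuristic2: subdivide a set into cpus schedules,
// where cpus is a power of two. Names here are illustrative only.
#include <algorithm>
#include <functional>
#include <vector>

// One application of heuristic1: greedy split of s into lo and hi.
void split(std::vector<int> s, std::vector<int> &lo, std::vector<int> &hi)
{
    std::sort(s.begin(), s.end(), std::greater<int>());
    long sum_lo = 0, sum_hi = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (sum_lo <= sum_hi) { lo.push_back(s[i]); sum_lo += s[i]; }
        else                  { hi.push_back(s[i]); sum_hi += s[i]; }
    }
}

// Recursively subdivide until one schedule per processor remains.
void schedule_rec(const std::vector<int> &s, int cpus,
                  std::vector<std::vector<int> > &out)
{
    if (cpus == 1) { out.push_back(s); return; }
    std::vector<int> a, b;
    split(s, a, b);
    schedule_rec(a, cpus / 2, out);  // yields A', A'', ...
    schedule_rec(b, cpus / 2, out);  // yields B', B'', ...
}
```

For four processors, schedule_rec calls split once at the top level and twice at the next, producing 2^2 = 4 schedules, which mirrors the two passes of the queue-based loop above.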
The logic for approximating partition is contained in a method
called p
and its implementation is:
// comparator for qsort: sorts in descending order
int compare(const void *_a, const void *_b)
{
    int a = *(int *)_a;
    int b = *(int *)_b;
    if(a == b) {
        return 0;
    } else if(a < b) {
        return 1;
    } else {
        return -1;
    }
}
void p(int *a, int size, int **b, int *bsize, int **c, int *csize)
{
    int i = 0, j = 0, k = 0, bsum = 0, csum = 0;
    if(!a || !size) {
        (*b) = (*c) = NULL;
        (*bsize) = (*csize) = 0;
        return;
    }
    // sort the input in descending order
    qsort(a, size, sizeof(int), compare);
    (*b) = (int *)calloc(size, sizeof(int));
    (*c) = (int *)calloc(size, sizeof(int));
    (*bsize) = 0;
    (*csize) = 0;
    // greedily assign each element to the schedule with the smaller sum
    for(i = 0, j = 0, k = 0; i < size; i++) {
        if(bsum <= csum) {
            (*b)[j++] = a[i];
            (*bsize)++;
            bsum += a[i];
        } else {
            (*c)[k++] = a[i];
            (*csize)++;
            csum += a[i];
        }
    }
}
Before the
approximation runs, the input array a (referred to in the heuristic as set S)
must be sorted in descending order. The approximation then walks the elements
in descending order, at each step assigning the current element to the
processor with the least total processing time. Notice that the function
produces two new arrays, b and c, and their respective sizes, bsize and csize.
The approximation implemented is the aforementioned greedy approximation for
partition scheduling. It uses two variables, bsum and csum, to keep track of
which processor has the least load at every step; this is more efficient than
repeatedly summing each schedule at every pass.
Using the Code
The code is available in a C++ file called multischedule.cc.
The code has been compiled and tested on Linux using g++. Run the command g++ multischedule.cc
-o ms. Then, to test the algorithm, run:
./ms data.dat
where data.dat is a file containing integer values in the following format:
Nr_Elements Nr_CPUs elem1 elem2 ... elemN
All integer values are separated by a single space. Nr_Elements
is the number of elements (process running times) given in the file. Nr_CPUs is
the number of CPUs, which must be a power of two, for which the algorithm is to
find a schedule.
An example instance may be:
20 4 24 27 12 54 6 4 7 10 54 55 26 10 9 43 54 2 9
6 38 34
where the header states that there are 20
elements and 4 processors, followed by the 20 process running times from 24
through 34.
The algorithm then produces
the following schedule:
size [20] cpus [4]
log2 [2]
queue size [4]
cpu = [0]
len = [5] { 55 34 24 6 2 }
sum = [121]
cpu = [1]
len = [5] { 54 38 10 10 9 }
sum = [121]
cpu = [2]
len = [5] { 54 43 12 7 6 }
sum = [122]
cpu = [3]
len = [5] { 54 27 26 9 4 }
sum = [120]
Notice that there are
four schedules, one for each processor, and the total cost of running each
assigned schedule on its processor is printed. These times are 121, 121,
122, and 120 for processors 0 to 3, respectively. The times are close to
optimal, if not optimal, in this instance: the maximum difference between any
two schedules is just two, and for two of the processors (0 and 1) the times
are equal. At each pass, a 4/3 * OPT approximation algorithm [2] is run to
produce a schedule, and the resulting final schedules are efficient.
Points of Interest
It is intended that heuristic2 be used interchangeably
with any approximation of the partition scheduling problem. However, not all approximations
may yield a good schedule for the multiprocessor assignment. The approximation suggested
in heuristic1 worked well in the cases for which it was tried. To change
the algorithm to use a different approximation, implement the function p
in the code using a different approach.
References
- [1] Garey, M. R., & Johnson, D. S. (2000). Computers and Intractability. New York, NY: W. H. Freeman and Company.
- [2] Mertens, S. (2003). The Easiest Hard Problem: Number Partitioning. Retrieved March 17, 2012 from http://www.arxiv.org/pdf/cond-mat/0310317.
Dr. Doss holds a PhD in Applied Computer Science. His research interests include Operating Systems, Networking, Theory of NP-Completeness, and Software Engineering.
https://www.rdoss.com