UPDATE: This article has now been integrated into the GAF documentation. The documentation can be found here.
Introduction
A question often asked of me is how to determine the encoding of data variables using a binary string, when they are all different units and different ranges of values.
This question is often followed by a question relating to how to ensure that when decoding a binary string, you only have sensible, valid, values.
The questions above can be answered by using something I refer to as normalisation.
Encoding and Decoding
The first point about binary based GAs is that you only need to decode the binary elements to the real values. There is no concept of encoding. If you are not used to GAs, this seems rather odd. However, the GA creates all of the solutions based on fitness or some other random effect and is therefore responsible for creating the binary strings. The GA has no knowledge of the values or how they are encoded. All that we have to do is tell the GA how long the binary strings should be. We then, at each generation, take the GA generated binary strings and decode them into real domain values. Once we have the real values, we can evaluate the fitness of each solution and leave the GA to do its work until the next generation.
What normalisation does is simplify the decoding process.
Let's say we have three variables, x
, y
and z
, with the following ranges:
x: -2.5 + 9.5 (range is 12 with an offset of -2.5)
y: 0 to 1000 (range is 1000 with an offset of 0)
x:-50 to 0 (range is 50 with an offset of -50)
Assuming that we are not working with integers, we need to determine the resolution, i.e., how many values we need between the range. For example, if we are happy with say a million values between the upper and lower values in the ranges, 20 bits per variable will suffice (2^20 = 1048576). Based on this, we can determine how many bits each variable will use.
Each variable can have any appropriate number of bits, however, let’s say we use the same bit length of 20 bits for x, y and z. This gives us three values in our chromosome, each with a bit length of 20
, therefore, we will need a chromosome that is 60 bits long.
The GA will continually generate solutions (chromosomes) 60 bits long. In order to retrieve the real domain values from our chromosome, we use a range constant.
This is calculated as follows:
rangeConstant = range /(System.Math.Pow(2, numberOfBits) - 1);
Here is an example of the range constant approach for x, y and z.
var rcx =12/(System.Math.Pow(2,20)-1);
var rcy =1000/(System.Math.Pow(2,20)-1);
var rcz =50/(System.Math.Pow(2,20)-1);
var x1 =Convert.ToInt32(chromosome.ToBinaryString(0, 20), 2);
var y1 =Convert.ToInt32(chromosome.ToBinaryString(20, 20), 2);
var z1 =Convert.ToInt32(chromosome.ToBinaryString(40, 20), 2);
double x = (x1 * rcx) - 2.5;
double y = (y1 * rcy);
double y = (y1 * rcz) - 50;
This simple approach ensures that any given binary string will always decode to valid, within range, values for each of our variables. Once we have the real values, we can test for fitness and the GA will do the rest. If you are not used to GAs, it can seem a little like magic, to be honest, it probably is.
Enjoy!