Click here to Skip to main content
16,002,992 members
Articles / Programming Languages / C#

Word Aligned Hybrid (WAH) Compression for BitArrays

Rate me:
Please Sign up or sign in to vote.
4.89/5 (35 votes)
28 Feb 2015CPOL6 min read 137.6K   2.8K   75   50
Word Aligned Hybrid (WAH) compression for BitArrays

Introduction

After searching the internet for a .NET implementation for WAH compressed BitArray and not finding one, I decided to write and publish an article here so the .NET community are not left out as all the available implementations are in the Java language. This ties into a pretty advanced topic of bitmap indexing techniques for databases and I needed this for my RaptorDB Document data-store database engine.

What Is It? 

A BitArray is a data structure in the .NET library for storing true/false or bits of data in a compact form within an array of Int32 objects. A WAH or word-aligned hybrid BitArray is a special run length compressed version of a BitArray which saves a lot of space and memory. All the implementations that exist in Java essentially duplicate the functionality of a BitSet, that is the AND, OR, NOT, and XOR operations with the compressed internal storage format.

In my implementation, I defer the functionality of the BitArray to itself and just add compression and decompression routines. This is much faster than the Java way at the expense of memory usage. To overcome this, I have also added a FreeMemory method to release the BitArray contents and keep the compressed contents. Arguably, if you are using 100 million bits, then a full implementation is more performant than my implementation, but for most of our Use Cases, we are at most in the 10 millions of bits range.

This original method was invented at the Berkeley Labs of US Department of Energy; it is a project named FastBit and is used for high energy physics department experiments; you can see it here: FastBit site.

Why Should I Care?

So what?, you ask. Well, as mentioned before, BitArrays are used in an indexing technique called bitmap indexes (Wiki) and column based databases which store data in columns instead of rows. An example which you might know is Microsoft's PowerPivot for Excel which can process millions of rows in seconds. Interestingly, Microsoft has only recently announced the implementation of bitmap indexes in the upcoming SQL Server platform, post 2008 R2. It has long been in use by other RDBM vendors like Oracle.

How it Works

The WAH compression algorithm is as follows:

  1. Take 31 bits from the array.
  2. If all zeros, then increment the zero count by 31.
    1. if ones count > 0, then output 32 bits with bit 32 =1 and bit 31=1 and 0-30 = ones count.
  3. If all ones, then increment the ones count by 31.
    1. If zeros count >0, then output 32 bits with bit 32=1 and bit 31=0 and 0-30 = zeros count.
  4. Else output the 31 bits as 32 bits with bit 32 = 0.
    1. If zeros or ones count >0, then output as above.

From the above, in the worst case, you will get N/31 more bits encoded or about 3% increase in size to the original.

What You Get

WAHBitArray is essentially the same as the standard BitArray in the .NET Framework, with the following additions:

  • FreeMemory(): This will first compress the internal BitArray then free the memory it uses.
  • GetCompressed(): This will compress the current BitArray then return it as an array of uint.
  • CountOnes(), CountZeros(): will count the respective bits in the array.
  • GetBitIndexes(bool): will return an enumeration using yield of the respective bit position; for example, if the bit array contains 10001... this will return integers 0,4,... if the bool parameter was true, and 1,2,3,... if bool was false.
  • Get(), Set(): Methods implemented with auto resizing and no exceptions.
  • DebugPrint(): generates a string of 1 and 0 values.

Using the Code

To create a WAHBitArray, you can do the following:

C#
WAHBitArray ba1 = new WAHBitArray(1000000); // 1 million bits
// 2 million bits from another bitarray
WAHBitArray ba2 = new WAHBitArray(new BitArray(2000000));
WAHBitArray ba3 = new WAHBitArray(1000000, new uint[] { /* compressed values here */});
// from a compressed list of uint

To perform operations, you can do the following:

C#
WAHBitArray result1 = ba1.And( ba2);
WAHBitArray result2 = ba1.Or( ba3);
WAHBitArray result3 = ba1.Xor( ba2);

long count = result1.CountOnes(); // will count the true values

result2.Set(2000000, true); // will resize as needed
bool b = result2.Get(3000000); // will resize as needed

foreach(int pos in result2.GetBitIndexes(true))
{
    ReadDatabaseRecordNumber(pos); // pos is the index of true values
}

Using the Code v1.3

As of version 1.3, you don't need to specify the size of the BitArray and all operations will auto resize as needed.

C#
WAHBitArray ba1 = new WAHBitArray(); // no need to specify the size
WAHBitArray ba3 = new WAHBitArray(new uint[] { /* compressed values here */});
// from a compressed list of uint

Points of Interest

  • BitArray class is sealed by Microsoft so inheriting from it was not possible.
  • BitArray throws an exception if the length of two BitArrays are not equal on bit operations, WAHBitArray makes them the same as the largest before operations.
  • BitArray must be resized in 32 increments, otherwise it mangles the compression bits.

Version 2.0

For extra speed in compressing and uncompressing the bits, and the fact that the .NET Framework implementation does not give access to the internal data structures in the BitArray, I had to rewrite all the BitArray functionality in WAHBitArray.

Using Reflector to see the internal implementation of the BCL BitArray one can see the following snippets:

C#
// AND operation
for (int i = 0; i < array.Length; i++)    
    array[i] &= val[i];

// OR operation
for (int i = 0; i < array.Length; i++)
    array[i] |= val[i];

// XOR operation
for (int i = 0; i < array.Length; i++)
    array[i] ^= val[i];

As you can see, the bit operations are done on the int32 values.

Now with access to the internal uint[] bits, the compression method gets 31 bit blocks of data instead of one by one. This is done in the Take31Bits() method, which finds the two adjacent uint values in the _uncompressed list and does some bit manipulations as follows:

C#
public WAHBitArray And(WAHBitArray op)
{
    this.CheckBitArray(); // check the bit array is uncompressed

    uint[] ints = op.GetUncompressed(); // get the values

    FixSizes(ints, _uncompressed); // make the sizes the same

    for (int i = 0; i < ints.Length; i++)
        ints[i] &= _uncompressed[i]; // do the AND operation

    return new WAHBitArray(false, ints); // return a new object
}

The compression and uncompression routines were rewritten to operate in the uint[] arrays as follows:

C#
private void Compress()
{
    _compressed = new List<uint>();
    uint zeros = 0;
    uint ones = 0;
    int count = _uncompressed.Count << 5;
    for (int i = 0; i < count; )
    {
        uint num = Take31Bits(i);
        i += 31;
        if (num == 0)
        {
            zeros += 31;
            FlushOnes(ref ones);
        }
        else if (num == 0x7fffffff)
        {
            ones += 31;
            FlushZeros(ref zeros);
        }
        else
        {
            FlushOnes(ref ones);
            FlushZeros(ref zeros);
            _compressed.Add(num);
        }
    }
    FlushOnes(ref ones);
    FlushZeros(ref zeros);
}

private void Uncompress()
{
    int index = 0;
    List<uint> list = new List<uint>();
    if (_compressed == null)
        return;

    foreach (uint ci in _compressed)
    {
        if ((ci & 0x80000000) == 0) // literal
        {
            this.Write31Bits(list, index, ci & 0x7fffffff);
            index += 31;
        }
        else
        {
            uint c = ci & 0x3ffffff;
            if ((ci & 0x40000000) > 0) // ones count
                this.WriteBits(list, index, c);

            index += (int)c;
        }
    }
    this.ResizeAsNeeded(list, index);
    _uncompressed = list;
}

Because taking or updating 31 bits of data can overlap two adjacent uint values, some bit massage is necessary as follows:

C#
private uint Take31Bits(int index)
{
    long l1 = 0;
    long l2 = 0;
    long l = 0;
    long ret = 0;
    int off = (index % 32);
    int pointer = index >> 5;

    l1 = _uncompressed[pointer];
    pointer++;
    if (pointer < _uncompressed.Count)
        l2 = _uncompressed[pointer];

    l = (l1 << 32) + l2;
    ret = (l >> (32 - off)) & 0x07fffffff;

    return (uint)ret;
}

Version 2.5

This version is a post back from the work done in RaptorDB, which is an overhaul of all the routines focusing on multi thread access and concurrency issues. A very important lesson learned was locking at the source of the resource being used as opposed to a higher level, which takes care of concurrency issues at the lowest levels, and saves a lot of headaches and debugging.

Much of the code has been reformatted and optimized. I am very confident in the correctness of this version as it is being tested to it's maximum in RaptorDB.

History

  • Initial release v1.0: 22 June 2011
  • Update v1.1: 24 June 2011
    • Bit operations now return a WAHBitArray instead of BitArray
    • A bit operation will take either a WAHBitArray or a BitArray as the input
    • CountZeros(), CountOnes() methods added
    • GetBitIndexes() method added
  • Update v1.2: 28 June 2011
    • Get, set methods with auto resizing
  • Update v1.3: 23 July 2011
    • Removed the need to specify the initial size
    • Bug fix resize in 32 increments
    • Bug fix in BitArray arithmetic
    • DebugPrint() method implemented
  • Update v2.0
    • Complete rewrite
    • Optimized compression and decompression ~9x speed increase
    • All bit operations are done internally without BCL BitArray
  • Update v2.5: 10 May 2012
    • Thread safe access to bits
    • Optimized Count() method
    • Bits set in int[] in high order
    • Bug fix getting uncompressed bits
    • GetBitIndexes() will return the one bits only
  • Update v2.6 : 15th June 2013
    • bug fix edge case decompress code (thanks to andrey269 )
  • Update v2.6.1 : 22nd June 2013  
    • bug fix a logic mistake
    • added a unit test project
  • Update v2.6.2 : 6th October 2013
    • bug fix last remaining bits output in WriteOnes()
  • Update v2.7.0 : 1st March 2015
    • post back fixes and changes from RaptorDB

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect -
United Kingdom United Kingdom
Mehdi first started programming when he was 8 on BBC+128k machine in 6512 processor language, after various hardware and software changes he eventually came across .net and c# which he has been using since v1.0.
He is formally educated as a system analyst Industrial engineer, but his programming passion continues.

* Mehdi is the 5th person to get 6 out of 7 Platinum's on Code-Project (13th Jan'12)
* Mehdi is the 3rd person to get 7 out of 7 Platinum's on Code-Project (26th Aug'16)

Comments and Discussions

 
GeneralRe: My vote of 5 Pin
Mehdi Gholam16-Jul-11 13:33
Mehdi Gholam16-Jul-11 13:33 
SuggestionAn alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
SoftwareBotanyDan29-Jun-11 9:17
SoftwareBotanyDan29-Jun-11 9:17 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
Mehdi Gholam29-Jun-11 9:28
Mehdi Gholam29-Jun-11 9:28 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
SoftwareBotanyDan30-Jun-11 3:15
SoftwareBotanyDan30-Jun-11 3:15 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
Mehdi Gholam30-Jun-11 3:25
Mehdi Gholam30-Jun-11 3:25 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
supercat927-Jul-11 10:00
supercat927-Jul-11 10:00 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
Mehdi Gholam27-Jul-11 20:58
Mehdi Gholam27-Jul-11 20:58 
GeneralRe: An alternative .NET 4.0 C# WAH implementation: http://softwarebotanysun.codeplex.com Pin
supercat928-Jul-11 7:23
supercat928-Jul-11 7:23 
Mehdi Gholam wrote:
Interesting, what about runs of ones and zeros?


If word[5] of the array is supposed to represent 32 bits of literal data, then bit 5 of word[31] would be zero. If word[5] of the array is supposed to represent something else, then bit 5 of word[31] would be set, and the bits of word[5] would indicate what exactly it's supposed to be. A really simple approach would be to use bit 31 to indicate whether the word represents a run one of 0x00000000 or 0xFFFFFFFF, and the bottom 31 bits to indicate how many such words it represents. Since most runs aren't going to be anywhere near 2 billion words long, it may be helpful to allow some bits in the word to say something about data that follows the run. For example, one could use bits 31-30 to select one of four types of run:

  1. A run of 0 to a billion words of zero
  2. A run of 0 to a billion words of ones (0xFFFFFFFF)
  3. A run of 0-255 BITS of zeroes, 0-127 BITS of ones, 0-255 BITS of zeroes, 0-127 BITS of ones, and enough zeroes to pad out any partial word.
  4. A run of 0-255 BITS of ones, 0-127 BITS of zeroes, 0-255 BITS of ones, 0-127 BITS of zeroes, and enough ones to pad out any partial word.

Note that's just a simple example of how things could be done. Note that any four runs comprising 130 bits or less (and many combinations of four runs totaling 256 bits or more) could be stored in a single word.
QuestionMy 5 Pin
Kiran Sonawane28-Jun-11 0:19
Kiran Sonawane28-Jun-11 0:19 
AnswerRe: My 5 Pin
Mehdi Gholam29-Jun-11 9:21
Mehdi Gholam29-Jun-11 9:21 
QuestionExcellent but do you have any performance info Pin
KenJohnson22-Jun-11 11:36
KenJohnson22-Jun-11 11:36 
AnswerRe: Excellent but do you have any performance info Pin
Mehdi Gholam22-Jun-11 15:40
Mehdi Gholam22-Jun-11 15:40 
AnswerRe: Excellent but do you have any performance info Pin
Mehdi Gholam19-Jul-11 18:42
Mehdi Gholam19-Jul-11 18:42 
QuestionExcellent Pin
RugbyLeague22-Jun-11 5:09
RugbyLeague22-Jun-11 5:09 
AnswerRe: Excellent Pin
Mehdi Gholam22-Jun-11 5:26
Mehdi Gholam22-Jun-11 5:26 
GeneralRe: Excellent Pin
RugbyLeague22-Jun-11 5:30
RugbyLeague22-Jun-11 5:30 
GeneralRe: Excellent Pin
Mehdi Gholam22-Jun-11 5:36
Mehdi Gholam22-Jun-11 5:36 
AnswerRe: Excellent Pin
Mehdi Gholam22-Jun-11 5:29
Mehdi Gholam22-Jun-11 5:29 
GeneralRe: Excellent Pin
RugbyLeague22-Jun-11 5:48
RugbyLeague22-Jun-11 5:48 
GeneralRe: Excellent Pin
Mehdi Gholam22-Jun-11 6:00
Mehdi Gholam22-Jun-11 6:00 
GeneralRe: Excellent Pin
RugbyLeague22-Jun-11 6:05
RugbyLeague22-Jun-11 6:05 
GeneralRe: Excellent Pin
Mehdi Gholam22-Jun-11 6:13
Mehdi Gholam22-Jun-11 6:13 
GeneralRe: Excellent Pin
RugbyLeague22-Jun-11 6:23
RugbyLeague22-Jun-11 6:23 
AnswerRe: Excellent Pin
Mehdi Gholam19-Jul-11 18:39
Mehdi Gholam19-Jul-11 18:39 
QuestionNice work Pin
JV999922-Jun-11 3:44
professionalJV999922-Jun-11 3:44 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.