Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles
(untagged)

RunLength Compression

0.00/5 (No votes)
21 Jan 2005 1  
Fast Run-Length coding with variable runs sizes.

Sample Image - RunLength.gif

Introduction

Run-Length is a lossless compression technique, its performance depends heavily on the input data statistics. It depends on counting sequence of repeated runs and save the run plus its count instead of the sequence of runs. So it depends heavily on the repetition in the input data. I have used it to compress array of bits ('1's and '0's), and each time I use it, I found my self rewrite the code to handle the new data format, and the type of the run (one bit, byte, word, ...). So I preferred to write a general code that has the runs size and count as an input to the compression function. You can skip the Background section if you have good experience about Run-Length algorithm.

Background

The main idea of the Run-Length is to replace runs of the same data (like 'aaaaaa' or 'ccccc') with a counter saying how many repetitions we have. If we have 'aaaaaa' we just output 'a' and a byte to save the count, so we replaced 6 characters with only 2, and that can be done for all repeated runs in the input data, and the other unrepeated runs will be saved as it is. Compressed data format can be of any thing, just the decompression algorithm should follow this format structure in the decompression.

Example

Input data: aaaaaabcdddddcccccrrrrrrrrrfdde (31 bytes)
Output data: a,5,b,0,c,0,d,4,c,4,r,8,f,0,d,1,e,0 (17 bytes)

Run-Length algorithm

Compression Algorithm can summarized as follows:.

// for input b with length l
index = 0
while( index < l )
{
    run = b[index]
    length = 0
    if run = b[++index]
        while run = b[index+length]
            index++, length++;
    output (run, length);
}

The algorithm scans one character (run) at a time. If the a character is equal to its next character, then it checks for repetition, then outputs the character with the repetition that follow it. If the a character is not equal to its next character, then it outputs it and a repetition of zero.

Decompression Algorithm is as follows:

// for input b with length l 
index = 0
while( index < l )
{
    run = b[index++]
    length = b[index++]
    output (run, length+1)
}

The decompression is so simple, just outputs the run repeated with its length plus one.

Code Usage

Code usage is so simple, just the input buffer and its length, and the output buffer and its length. The compress function includes additional parameters:

  • nBitsPerSample: to determine number of bits to save run length value, so maximum run length will be up to 2^nBitsPerSample-1.
  • nRuns: to determine an optional dictionary of runs, with a default runs of the ASCII characters (0...255).
  • nRunCount: to determine number of runs in the previous parameter, with a default value of 256 (256 characters).
  • nRunSize: to determine the number of bytes in the each run, with a default value of 1 (1 byte).
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes, 
       int &nDesLen, int nBitsPerSample, void* nRuns, 
       int nRunCount, int nRunSize);
bool DecompressRunLength(BYTE *pSrc, int nSrcLen, 
       BYTE *&pDes, int &nDesLen);

You can use the parameters (..., void* nRuns, int nRunCount, int nRunSize) to input any dictionary for runs and you can decide run size in bytes, see the next example to compress a file contains a runs with size 3:

BYTE by[] = { 0x0,0x0,0x0,  0x84,0x82,0x84,  
              0xce,0xd3,0xd6,  0xff,0xff,0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen, 
              12/*nBitsPerSample*/, by, 4/*nRunCount*/, 
              3/*nRunSize*/);

Note: Input runs must be sorted as I am using binary search to search in runs array buffer.

And if you want to compress only for zeros and ones, you can use that:

BYTE by[] = { 0x0, 0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen, 
    12/*nBitsPerSample*/, by, 2/*nRunCount*/, 1/*nRunSize*/);
// (^_^)

Code Description

CompressRunLength function

  • Scan the input buffer pSrc and check if the run is repeated or not using binary search.
  • If run is repeated then,
    • set next bit in the destination buffer:
      *(pDes+(nDesLen>>3)) |= 1 << (nDesLen++&7)
    • add run index to the destination buffer:
      *(DWORD*)(pDes+(++nDesLen>>3)) |= nRunIndex << (nDesLen&7)
    • loop to get run length
      while(...) nRunLength++, ...
    • add run length to the destination buffer:
      *(DWORD*)(pDes+(nDesLen>>3)) |= nRunLength << (nDesLen&7)
  • If run is not repeated, then add next byte to the destination buffer:
    *(WORD*)(pDes+(++nDesLen>>3)) |= *(pSrc+nByte++) << (nDesLen&7)
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
                         int &nDesLen, int nBitsPerSample, 
                         void* nRuns, int nRunCount, int nRunSize)
{
    ...
    // loop in the source buffer
    while(nByte < nSrcLen)
        if((... (nRunIndex =
          BinarySearch(pSrc+nByte, nRunSize, nRuns, nRunCount))!= -1 &&
          memcmp(pSrc+nByte+nRunSize, 
          (BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) ||
            ...)
        {    // set bit to 1 to indicate type found
            *(pDes+(nDesLen>>3)) |= 1 << (nDesLen++&7);
            *(DWORD*)(pDes+(++nDesLen>>3)) |= 
                      nRunIndex << (nDesLen&7);
            nDesLen += nBitsPerTypeIndex;
            // skip the two repeated runs
            nByte += nRunSize*2;
            // get run length - 2 (without the two repeated runs)
            nRunLength = 0;
            while(nRunLength < nMaxRunLength && nByte < nSrcLen && 
                ((... memcmp(pSrc+nByte, 
                (BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) || ...))
                nRunLength++, nByte += nRunSize;
            // save run length and increment destination length 
            //by bits per sample
            *(DWORD*)(pDes+(nDesLen>>3)) |=
                      nRunLength << (nDesLen&7);
            nDesLen += nBitsPerSample;
        }
        else    // copy one byte
            *(WORD*)(pDes+(++nDesLen>>3)) |=
                      *(pSrc+nByte++) << (nDesLen&7), nDesLen += 8;
    nDesLen = (nDesLen+7)/8;    // bits to bytes
    ...
}
  • (nDesLen>>3): >>3 to divide by 8 to reach the right byte to start with.
  • |=: to avoid writing over the previous code.
  • (nDesLen&7): &7 to get the reminder of dividing by 8, to get the start bit.
  • nRunLength << (nDesLen&7): to shift the value to left to match the starting bit at the destination buffer.

DecompressRunLength function

  • If next bit in the source buffer is on, then it is a run with its length:
    if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
    • read run index:
      nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex)
    • read run length:
      nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2
    • loop to out the repeated runs.
      for(...)
  • If next bit in the source buffer is off, then it is a byte that needs to out:
    *(WORD*)(pDes+(nDesIndex>>3)) |= GetWORD(pSrc, nSrcIndex, 0xff)
bool DecompressRunLength(BYTE *pSrc, int nSrcLen,
                                BYTE *&pDes, int &nDesLen)
{
    ...
    nSrcLen <<= 3; // bytes to bits
    while(nSrcIndex < nSrcLen-8)
        if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
        {
            nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex), 
                                           nSrcIndex += nBitsPerTypeIndex;
            nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2,
                                           nSrcIndex += nBitsPerSample;
            for(nRun = 0; nRun < nRunLength; nRun++)
                for(nByte = 0; nByte < nRunSize; nByte++, nDesIndex += 8)
                    *(WORD*)(pDes+(nDesIndex>>3)) |= 
                                    nRuns ? GetWORD(nRuns+nRunSize*nRunIndex, 
                                    nByte<<3, 0xff) : (BYTE)nRunIndex;
        }
        else    // copy one byte
            *(WORD*)(pDes+(nDesIndex>>3)) |=  
                                        GetWORD(pSrc, nSrcIndex, 0xff), 
                                        nDesIndex += 8, nSrcIndex += 8;
    ...
}
  • (nDesIndex>>3): >>3 to divide by 8 to reach the right byte to start with.
  • (nDesIndex&7): &7 to get the reminder of dividing by 8, to get the start bit.
  • >>(nDesIndex&7): to shift right to the start bit to take the code at bit 0 (closer the wall).

Points of Interest

  1. At the compression function, I am using a binary search to search for the run that matches the current run in the input data:
    int BinarySearch(void* pValue, int nVlaueSize, 
                               void* pArray, int nCount)
    {
        int nIndex, nResult, nStart = 0, nEnd = nCount-1;
        while(nStart <= nEnd)
        {
            nIndex = (nEnd+nStart)/2;
            if((nResult = memcmp((BYTE*)pArray+nIndex*nVlaueSize,
                                      pValue, nVlaueSize)) == 0)
                return nIndex;
            if(nResult > 0)
                nEnd = nIndex-1;
            else
                nStart = nIndex+1;
        }
        return -1;
    }

    So, as I noted before, input runs must be sorted.

  2. Also, I am using a very good macros to extract values starting from any bit in the input buffer:
    #define GetDWORD(buf,bit,mask) (
                 (*(DWORD*)(((BYTE*)buf)+((bit)>>3)))>>((bit)&7)&(mask))
    #define GetWORD(buf,bit,mask) (
                 (*(WORD*)(((BYTE*)buf)+((bit)>>3)))>>((bit)&7)&(mask))
    

    mask is used to reset any extra indeed bits.

  3. My code here in this compression is using bits to save values, not bytes, so I am saving any value starting from any bit, so I didn't waste any bits. Or I can say, I use every bit in the compressed data.

Source code files

  • RunLength.cpp
  • RunLength.h

Thanks to...

I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK)

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here