Introduction
Run-Length is a lossless compression technique, its performance depends heavily on the input data statistics. It depends on counting sequence of repeated runs and save the run plus its count instead of the sequence of runs. So it depends heavily on the repetition in the input data. I have used it to compress array of bits ('1's and '0's), and each time I use it, I found my self rewrite the code to handle the new data format, and the type of the run (one bit, byte, word, ...). So I preferred to write a general code that has the runs size and count as an input to the compression function. You can skip the Background section if you have good experience about Run-Length algorithm.
Background
The main idea of the Run-Length is to replace runs of the same data (like 'aaaaaa' or 'ccccc') with a counter saying how many repetitions we have. If we have 'aaaaaa' we just output 'a' and a byte to save the count, so we replaced 6 characters with only 2, and that can be done for all repeated runs in the input data, and the other unrepeated runs will be saved as it is. Compressed data format can be of any thing, just the decompression algorithm should follow this format structure in the decompression.
Example
Input data: aaaaaabcdddddcccccrrrrrrrrrfdde (31 bytes)
Output data: a,5,b,0,c,0,d,4,c,4,r,8,f,0,d,1,e,0 (17 bytes)
Run-Length algorithm
Compression Algorithm can summarized as follows:.
index = 0
while( index < l )
{
run = b[index]
length = 0
if run = b[++index]
while run = b[index+length]
index++, length++;
output (run, length);
}
The algorithm scans one character (run
) at a time. If the a character is equal to its next character, then it checks for repetition, then outputs the character with the repetition that follow it. If the a character is not equal to its next character, then it outputs it and a repetition of zero.
Decompression Algorithm is as follows:
index = 0
while( index < l )
{
run = b[index++]
length = b[index++]
output (run, length+1)
}
The decompression is so simple, just outputs the run
repeated with its length
plus one.
Code Usage
Code usage is so simple, just the input buffer and its length, and the output buffer and its length. The compress function includes additional parameters:
nBitsPerSample
: to determine number of bits to save run length value, so maximum run length will be up to 2^nBitsPerSample
-1.
nRuns
: to determine an optional dictionary of runs, with a default runs of the ASCII characters (0...255).
nRunCount
: to determine number of runs in the previous parameter, with a default value of 256 (256 characters).
nRunSize
: to determine the number of bytes in the each run, with a default value of 1 (1 byte).
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample, void* nRuns,
int nRunCount, int nRunSize);
bool DecompressRunLength(BYTE *pSrc, int nSrcLen,
BYTE *&pDes, int &nDesLen);
You can use the parameters (..., void* nRuns, int nRunCount, int nRunSize)
to input any dictionary for runs and you can decide run size in bytes, see the next example to compress a file contains a runs with size 3:
BYTE by[] = { 0x0,0x0,0x0, 0x84,0x82,0x84,
0xce,0xd3,0xd6, 0xff,0xff,0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen,
12, by, 4,
3);
Note: Input runs must be sorted as I am using binary search to search in runs array buffer.
And if you want to compress only for zeros and ones, you can use that:
BYTE by[] = { 0x0, 0xff };
CompressRunLength(pSrc, nSrcLen, pDes, nDesLen,
12, by, 2, 1);
Code Description
CompressRunLength function
bool CompressRunLength(BYTE *pSrc, int nSrcLen, BYTE *&pDes,
int &nDesLen, int nBitsPerSample,
void* nRuns, int nRunCount, int nRunSize)
{
...
while(nByte < nSrcLen)
if((... (nRunIndex =
BinarySearch(pSrc+nByte, nRunSize, nRuns, nRunCount))!= -1 &&
memcmp(pSrc+nByte+nRunSize,
(BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) ||
...)
{
*(pDes+(nDesLen>>3)) |= 1 << (nDesLen++&7);
*(DWORD*)(pDes+(++nDesLen>>3)) |=
nRunIndex << (nDesLen&7);
nDesLen += nBitsPerTypeIndex;
nByte += nRunSize*2;
nRunLength = 0;
while(nRunLength < nMaxRunLength && nByte < nSrcLen &&
((... memcmp(pSrc+nByte,
(BYTE*)nRuns+nRunIndex*nRunSize, nRunSize) == 0) || ...))
nRunLength++, nByte += nRunSize;
*(DWORD*)(pDes+(nDesLen>>3)) |=
nRunLength << (nDesLen&7);
nDesLen += nBitsPerSample;
}
else
*(WORD*)(pDes+(++nDesLen>>3)) |=
*(pSrc+nByte++) << (nDesLen&7), nDesLen += 8;
nDesLen = (nDesLen+7)/8;
...
}
(nDesLen>>3)
: >>3
to divide by 8 to reach the right byte to start with.
|=
: to avoid writing over the previous code.
(nDesLen&7)
: &7
to get the reminder of dividing by 8, to get the start bit.
nRunLength << (nDesLen&7)
: to shift the value to left to match the starting bit at the destination buffer.
DecompressRunLength function
- If next bit in the source buffer is on, then it is a run with its length:
if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
- read run index:
nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex)
- read run length:
nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2
- loop to out the repeated runs.
for(...)
- If next bit in the source buffer is off, then it is a byte that needs to out:
*(WORD*)(pDes+(nDesIndex>>3)) |= GetWORD(pSrc, nSrcIndex, 0xff)
bool DecompressRunLength(BYTE *pSrc, int nSrcLen,
BYTE *&pDes, int &nDesLen)
{
...
nSrcLen <<= 3;
while(nSrcIndex < nSrcLen-8)
if((*(pSrc+(nSrcIndex>>3)) >> (nSrcIndex++&7)) & 1)
{
nRunIndex = GetDWORD(pSrc, nSrcIndex, nMaxTypeIndex),
nSrcIndex += nBitsPerTypeIndex;
nRunLength = GetDWORD(pSrc, nSrcIndex, nMaxRunLength)+2,
nSrcIndex += nBitsPerSample;
for(nRun = 0; nRun < nRunLength; nRun++)
for(nByte = 0; nByte < nRunSize; nByte++, nDesIndex += 8)
*(WORD*)(pDes+(nDesIndex>>3)) |=
nRuns ? GetWORD(nRuns+nRunSize*nRunIndex,
nByte<<3, 0xff) : (BYTE)nRunIndex;
}
else
*(WORD*)(pDes+(nDesIndex>>3)) |=
GetWORD(pSrc, nSrcIndex, 0xff),
nDesIndex += 8, nSrcIndex += 8;
...
}
(nDesIndex>>3)
: >>3
to divide by 8 to reach the right byte to start with.
(nDesIndex&7)
: &7
to get the reminder of dividing by 8, to get the start bit.
>>(nDesIndex&7)
: to shift right to the start bit to take the code at bit 0 (closer the wall).
Points of Interest
- At the compression function, I am using a binary search to search for the
run
that matches the current run
in the input data:
int BinarySearch(void* pValue, int nVlaueSize,
void* pArray, int nCount)
{
int nIndex, nResult, nStart = 0, nEnd = nCount-1;
while(nStart <= nEnd)
{
nIndex = (nEnd+nStart)/2;
if((nResult = memcmp((BYTE*)pArray+nIndex*nVlaueSize,
pValue, nVlaueSize)) == 0)
return nIndex;
if(nResult > 0)
nEnd = nIndex-1;
else
nStart = nIndex+1;
}
return -1;
}
So, as I noted before, input runs must be sorted.
- Also, I am using a very good macros to extract values starting from any bit in the input buffer:
#define GetDWORD(buf,bit,mask) (
(*(DWORD*)(((BYTE*)buf)+((bit)>>3)))>>((bit)&7)&(mask))
#define GetWORD(buf,bit,mask) (
(*(WORD*)(((BYTE*)buf)+((bit)>>3)))>>((bit)&7)&(mask))
mask
is used to reset any extra indeed bits.
- My code here in this compression is using bits to save values, not bytes, so I am saving any value starting from any bit, so I didn't waste any bits. Or I can say, I use every bit in the compressed data.
Source code files
- RunLength.cpp
- RunLength.h
Thanks to...
I owe a lot to my colleagues for helping me in implementing and testing this code. (JAK)