I would make the code simple and correct first, and only then worry about speed. Speed means nothing if the code doesn't work in the first place. Try something like this:
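const int width = 4096;      // loop bound; the loop below treats it as a byte count

typedef union
{
    UCHAR byte;
    struct nibble
    {
        UCHAR first  : 4;    // one half of the byte
        UCHAR second : 4;    // the other half
    } nibs;
} bytenib;

bytenib byte2;
UCHAR   byte1;
UCHAR   byte3;
USHORT  pixel1;
USHORT  pixel2;

// row and image come from the surrounding code
bool oddrow = row % 2;       // true when the row index is odd

int i = 0;
while( i + 2 < width )       // three bytes are read per iteration
{
    byte1      = image[row][ i ];
    byte2.byte = image[row][ i + 1 ];
    byte3      = image[row][ i + 2 ];    // third byte of the group

    pixel1  = byte1 << 4;                // upper eight bits of pixel 1
    pixel1 |= byte2.nibs.second;         // lower four bits of pixel 1
    pixel2  = byte3;                     // lower eight bits of pixel 2
    pixel2 |= ( byte2.nibs.first << 8 ); // upper four bits of pixel 2

    i += 3;
}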
This is partial code that extracts two twelve-bit pixels from every three bytes. It doesn't take the odd/even aspect of things into account, but it does show a very easy way to determine whether a row is odd or even: row % 2.
The key thing there is the union of a byte and two bit fields of one nibble (four bits, half a byte) each. This is a very easy way to split a byte in half.
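One thing to watch with the union: the order in which a compiler lays out bit fields inside a byte is implementation-defined, so which of first and second ends up as the high nibble can differ between compilers. If that matters, the same split can be done with shifts and masks. The following is only a sketch under an assumed byte layout (spelled out in the comment); unpack_pair and unpack_row are hypothetical helper names, and the fixed-width types from stdint.h stand in for UCHAR and USHORT.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack two 12-bit pixels from three packed bytes.
   Assumed layout (an assumption, check it against your data):
     b[0] = pixel1 bits 11..4
     b[1] = pixel1 bits 3..0 in the high nibble,
            pixel2 bits 11..8 in the low nibble
     b[2] = pixel2 bits 7..0 */
static void unpack_pair(const uint8_t b[3], uint16_t *pixel1, uint16_t *pixel2)
{
    *pixel1 = (uint16_t)((b[0] << 4) | (b[1] >> 4));
    *pixel2 = (uint16_t)(((b[1] & 0x0F) << 8) | b[2]);
}

/* Unpack one row of packed bytes into 16-bit pixels.
   pixels must have room for 2 * (packed_len / 3) entries.
   Returns the number of pixels written; trailing bytes that do not
   form a complete group of three are ignored. */
static size_t unpack_row(const uint8_t *packed, size_t packed_len,
                         uint16_t *pixels)
{
    size_t n = 0;
    for (size_t i = 0; i + 2 < packed_len; i += 3) {
        unpack_pair(&packed[i], &pixels[n], &pixels[n + 1]);
        n += 2;
    }
    return n;
}
```

With this layout, the bytes 0xAB, 0xCD, 0xEF unpack to pixel1 = 0xABC and pixel2 = 0xDEF. Like the code above, it ignores the odd/even row handling.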