
Redux: Hex Strings to Raw Data and Back

5 Aug 2010 · BSD · 4 min read
Revisiting an old function to convert hexadecimal strings into integer byte arrays

During the writing of my last post, I did the due diligence thing and considered alternative implementations and algorithms to solve the problem at hand (converting a string representation of an 8-bit hexadecimal value to an unsigned 8-bit integer value). Because I was, in effect, documenting code written some years ago, I can't recall exactly what other options, if any, I tried at the time.

I think I first tried using a std::stringstream, but gave up on that as being too slow, and went with strtoul instead. I might also have played around with using a std::map lookup table, with all the headaches that brought in terms of storage and initialization, and decided against it.
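For reference, the strtoul route I went with looks roughly like this — a sketch, not the original code, with an illustrative name: copy the two hex characters into a small NUL-terminated buffer and let strtoul parse it as base 16.

```cpp
#include <cstdlib>

// Hypothetical helper sketching the strtoul approach: build a
// two-character string and parse it as a base-16 number.
unsigned char hex_pair_to_byte(char high, char low)
{
  const char buf[3] = { high, low, '\0' };
  return static_cast<unsigned char>(std::strtoul(buf, nullptr, 16));
}
```

Note the copying into a temporary buffer — that extra step is part of why this approach loses out to a direct lookup.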

What I didn't try was a straight, non-clever switch-based lookup table to find the integer value of a hexadecimal character digit:

C++
inline unsigned char hex_digit_to_nybble(char ch)
{
  switch (ch)
  {
    case '0': return 0x0;
    case '1': return 0x1;
    case '2': return 0x2;
...
    case 'f': return 0xf;
    case 'F': return 0xf;
    default: throw std::invalid_argument("Invalid character in hex string");
  }
}

I did when writing that post, though, and found that this was twice as fast as using strtoul. That in itself isn’t surprising, as it’s a much simpler functional requirement to fulfill (the standard strtoul is a very capable parser).

As it happens, using this digit-to-nybble function is not only faster, but also simplifies the main conversion function calling it. For one thing, this function handles all sanity checking and validation of the characters given, and for another, it removes the need to copy the two characters into a hex string before doing the conversion.

Faster, neater, tidier code – brilliant!

However, I was troubled by one question: what character set does C++ use?
See, I was using literal characters, and we've all been warned about making assumptions about character sets. I even referred to that kind of dangerous assumption in the post I wrote, without, if I'm honest, thinking much about it.

Looking at the code again, I realized that I was already assuming things about the character set, since I use an array of characters when encoding integer values into hex strings:

C++
static const std::string hex_char = "0123456789abcdef";
...
char high_nybble_char = hex_char[(data_char >> 4) & 0x0f];
char low_nybble_char = hex_char[data_char & 0x0f];
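In context, the encoding side looks something like this — the signature here is an assumption, sketched around the lookup string above rather than copied from the original bytes_to_hex:

```cpp
#include <string>
#include <vector>

// Lookup table mapping a nybble value (0..15) to its hex character.
static const std::string hex_char = "0123456789abcdef";

// Sketch of an encoder built on the lookup string; the real
// bytes_to_hex may have a different signature.
std::string bytes_to_hex(const std::vector<unsigned char>& data)
{
  std::string str;
  str.reserve(data.size() * 2);
  for (unsigned char data_char : data)
  {
    str += hex_char[(data_char >> 4) & 0x0f]; // high nybble first
    str += hex_char[data_char & 0x0f];        // then low nybble
  }
  return str;
}
```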

In other words, if this was safe, the switch was safe. If it wasn't, neither was the switch, and while I already had a portable, safe, and working version of the decoding (using strtoul), I might have to come up with an alternative for the encoding.

Essentially, for this code to be guaranteed to be portable, standards-compliant, and safe, three potentially different character sets need to agree that the sixteen characters used to represent hexadecimal numbers are the same. Those three are:

  1. The character set used to store the source code files
  2. The character set used by the compiler when creating executable code
  3. The character set used by the data sent into the decoding function, and accepted as output by the caller of the encoding function

The problem is that a character is only a character when printed or written or displayed. It’s a visual shape, a representation. This A (probably) looks like a character to you, but it isn’t stored as one. It’s stored in some sort of binary representation, using some defined rule set and look-up that tells your computer (or mobile, iPad, text-to-speech system, etc.) that it should be rendered in a manner that represents the concept of the letter “A” in one of the many Latin alphabets. Probably.

So the compiler needs to be able to read the source file, which is a stream of bytes (which, of course, may be of any width, although 8 bits is the most common), and interpret those into a representation where, for instance, the bytes {0x69, 0x66, 0x20, 0x28} are parsed as “if (” and not “%ö”.

Of course, this is what a compiler is required to do, and regardless of how the source code is stored (well, provided it’s in a form the compiler supports), the compiler will read it and convert it to “basic source character set”. This character set includes all hexadecimal digits, so we're safe there. (It actually includes most of your basic printable 7-bit ASCII. There are a couple of good explanations of this on StackOverflow.)

Next, the compiler has to worry about the “basic execution character set”, which is what character set – at minimum – is used during execution. This is defined as the “basic source character set” plus a few extra characters, so we're good there, too.
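One related guarantee is worth noting: in both the basic source and basic execution character sets, the standard requires the decimal digits '0' through '9' to be contiguous and ascending, so their values can be computed arithmetically. No such guarantee exists for the letters (EBCDIC, for one, has gaps in its alphabet), which is one reason the explicit switch is the portable way to handle 'a' through 'f'.

```cpp
// '0'..'9' are guaranteed contiguous and ascending in the
// execution character set, so this is portable:
int digit_value(char ch)
{
  return ch - '0';
}
// 'a'..'f' carry no such contiguity guarantee, hence the
// explicit switch for the hex letters.
```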

Finally, the data sent into the function is safe, too, because the “basic source character set” covers the calling functions as well; in the case of data read from disk or otherwise externally sourced, it is the responsibility of the calling function to provide it in that form.

For more information on these character sets, see the C++ standard, or the two articles that were very helpful to me while writing this.

So, anyway, it’s safe to use character literals to represent hexadecimal digits, which means that the bytes_to_hex function can remain unchanged, and the hex_to_bytes can be rewritten to use the much faster switch lookup:

C++
#include <climits>    // CHAR_BIT
#include <stdexcept>  // std::invalid_argument
#include <string>
#include <vector>

inline unsigned char hex_digit_to_nybble(char ch)
{
  switch (ch)
  {
    case '0': return 0x0;
    case '1': return 0x1;
...
    case 'f': return 0xf;
    case 'F': return 0xf;
    default:
    {
      std::string e = "Invalid character in hex string: '";
      e += ch;
      e += "'";
      throw std::invalid_argument(e);
    }
  }
}

void hex_to_bytes2(const std::string& str,
  std::vector<unsigned char>& data)
{
  // Sanity check: this code assumes 8-bit bytes
  static_assert(8 == CHAR_BIT, "code assumes 8-bit bytes");

  // Clear output
  data.clear();

  // No data? Then we're done
  if (str.empty())
    return;

  // Must be prepared that string can have odd number of
  // nybbles, in which case the first is treated like the low
  // nybble of the first byte
  size_t lengthOverflow = str.length() % 2;

  // This also affects the length of the data buffer we
  // allocate (need a full byte for the lone nybble)
  const size_t length = lengthOverflow + str.length() / 2;
  data.resize(length);

  // Iterators for input and output
  size_t i = 0;
  size_t c = 0;

  // If the first nybble is a low, we'll do it separately
  if (1 == lengthOverflow)
  {
    data[i++] = hex_digit_to_nybble(str[c++]);
  }

  // For each output byte, we use two input characters for
  // high and low nybble, respectively. The two reads are kept
  // in separate statements: the evaluation order of the
  // operands of | is unspecified, so two c++ increments in a
  // single expression could assign the nybbles the wrong way
  // round.
  while (i < length)
  {
    const unsigned char high = hex_digit_to_nybble(str[c++]);
    const unsigned char low = hex_digit_to_nybble(str[c++]);
    data[i++] = static_cast<unsigned char>((high << 4) | low);
  }
}
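The odd-length rule means a lone leading digit becomes the first byte on its own. A compact, self-contained restatement of the decoder (condensed from the functions above, with fall-through cases and a return value instead of an output parameter, so it compiles standalone) makes that behaviour easy to check:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Condensed version of the switch lookup above.
inline unsigned char hex_digit_to_nybble(char ch)
{
  switch (ch)
  {
    case '0': return 0x0;  case '1': return 0x1;
    case '2': return 0x2;  case '3': return 0x3;
    case '4': return 0x4;  case '5': return 0x5;
    case '6': return 0x6;  case '7': return 0x7;
    case '8': return 0x8;  case '9': return 0x9;
    case 'a': case 'A': return 0xa;  case 'b': case 'B': return 0xb;
    case 'c': case 'C': return 0xc;  case 'd': case 'D': return 0xd;
    case 'e': case 'E': return 0xe;  case 'f': case 'F': return 0xf;
    default: throw std::invalid_argument("Invalid character in hex string");
  }
}

// Condensed decoder mirroring hex_to_bytes2, including the
// odd-length rule: a lone leading digit is the low nybble of
// the first byte.
std::vector<unsigned char> decode_hex(const std::string& str)
{
  std::vector<unsigned char> data;
  std::size_t c = 0;
  if (str.length() % 2 == 1)
    data.push_back(hex_digit_to_nybble(str[c++]));
  while (c < str.length())
  {
    const unsigned char high = hex_digit_to_nybble(str[c++]);
    const unsigned char low = hex_digit_to_nybble(str[c++]);
    data.push_back(static_cast<unsigned char>((high << 4) | low));
  }
  return data;
}
```

So "fff" decodes to { 0x0f, 0xff }: the lone leading 'f' becomes a byte of its own, and the remaining pair forms the second byte.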

So there we are. I spent an evening learning a bit more about C++, improving some old code, and writing this post saying “You can always improve”.

Tagged: C++, string

License

This article, along with any associated source code and files, is licensed under The BSD License


Written By
Software Developer (Senior)
Sweden
Orjan has worked as a professional developer - in Sweden and England - since 1993, using a range of languages (C++, Pascal, Delphi, C, C#, Visual Basic, Python and assemblers), but tends to return to C++.
