Splitting Strings Again – strtok Redeemed

Orjan Westin

0/5 (0 vote)

Oct 18, 2012

BSD

4 min read

22595

Splitting strings again - strtok redeemed

The C++ source files for the string tokenisers discussed in this post and the Splitting strings post, plus the code for Removing whitespace and Static assert in C++, can be found here.

One of the more curious omissions from the C++ standard library is a string splitter, e.g., a function that can take a string and split it up into its constituent parts, or tokens, based on some delimiter. There is one in other popular languages ((C# – String.Split, Java – String.split, Python – string.split, etc.), but C++ programmers are left to roll their own, or use one from a third-party library like the boost::tokenizer (or the one I presented in Splitting strings).

There are many ways of doing this; the Stack Overflow question How do I tokenize a string in C++? has 23 answers at the time of writing, and those contain 20 different solutions (boost::tokenizer and strtok are suggested multiple times).

The strtok recommendations, however, all have comments pointing out the problems with this function – it’s destructive, and not reentrant (it can’t be nested or run in parallel on multiple strings). As functions go, strtok has a rather poor reputation – there’s even a popular reentrant version, strtok_r, available in many C library implementations, though it’s not a standard function.

So it’s a good thing that there are so many other ways of splitting strings, isn’t it? Well, yes, but the clever thing with strtok, which you don’t get from any of the string splitters, not even the flexibility-obsessed Boost, is that you can change the token separator as you go along. The splitters often give the option to provide a selection of possible delimiters like this:

  char punct = " ,.?!;:-";
  std::string text = "Silly, right? This is - obviously - an example: Boo!";
  
  std::vector<std::string> tokens;
  std::vector<char> separators(punct, punct + strlen(punct));
  
  tokenise_string(text, separators, tokens);
  // tokens now contains the eight words of the string

But what if there are characters that are both legal inside some tokens, and separators for others? What if you have to parse strings with the format [Last name],[First name] – [Profession] – [Age] like this table:

Bradshawe, Adam - Colonel, retired - 73
Burton-West,Jenny - Surgeon - 37
Smith,Ben - Taxi driver - 56

While a string splitter would stumble over this – since both space, hyphen and comma can be both delimiters and valid content – old strtok doesn’t even raise an eyebrow:

  const char* table = "Jones, Adam - Colonel, retired - 73\n"
    "Burton-West,Jenny - Surgeon - 37\n"
    "Smith,Ben - Taxi driver - 56";
  char* changeable = (char*) malloc(strlen(table)); // cast not needed in C
  strcpy(changeable, table);

  char *last, *first, *profession, *age;
  // Start it off with the first search
  last = strtok(changeable, " ,");
  while (last)
  {
    first = strtok(NULL, " -");
    profession = strtok(NULL, "-");
    // Only snag - this probably has a trailing space we need to trim
    int proflen = strlen(profession);
    if (profession[proflen - 1] == ' ')
      profession[proflen - 1] = '\0';
      
    age = strtok(NULL, " -\n");
    // Have whole row, take care of data
    // ...
    // Start next row, if any
    last = strtok(NULL, " ,");
  }
  free(changeable);

In other words, strtok offers unique functionality not found in the string splitters. So let’s design a C++ version that is efficient, non-destructive, and reentrant. Thanks to the object-orientation support of C++, we can let each tokeniser have a const reference to the string we’re tokenising. This gives us both reentrancy and non-destructiveness. And while we’re at it, let’s have a flag to decide whether we should include empty tokens or not – something quite useful that’s missing from strtok.

  class string_tokeniser
  {
    // The string we're searching in
    const std::string& source_;
    // Flag indicating whether to include empty strings
    bool empty_;
    // Current location in string
    std::string::size_type current_;
    // Length of current string
    std::string::difference_type length_;
    // Location to start next search
    std::string::size_type next_;
    
  public:
    // Constructor, setting the string to work on
    string_tokeniser(const std::string& source, bool empty = false);
    ...

We’ll also need variables to keep track of where we are, where to start looking for next token, and so on, which I’ve also included above.

Now, what do we want to do with this? Well, we actually want to do two distinct tasks – advance to the next token, and extract a token. In strtok, those are done at the same step, but since we’re not constrained by the limitations of C, it’s better to keep it tidy. Like I did in my string splitter, I’ll overload the advancing function to allow the user to give a single character, a selection of characters, or a whole string as a delimiter.

    ...
    // Advance to next token, by given character separator
    bool next(const std::string::value_type& separator);

    // Advance to next token, by any of given character separators
    bool next(const std::vector<char>& separators);

    // Advance to next token, by given string separator
    bool next(const std::string& separator);
    ...

These all return true if a new token was found in the string. We’ll also need a way of accessing the tokens found, and some housekeeping:

    ...
    // Get current token, if any, safely
    bool get_token(std::string& token) const;

    // Get current token, if any, otherwise an empty token
    std::string get_token() const;

    // Reset search
    void reset();

    // Check token availability
    bool has_token() const;

    // Check if search is at end
    bool at_end() const;

    // Get source string
    const std::string& get_source() const;
  };

That’s the interface, let’s do some implementation. The way this will work is that we keep hold of a current location and token length, which are used to retrieve the token using std::string::substr, while also keeping the next location in which to start the search for a token. This will have to be next_ = current_ + length_ + length of delimiter, so the next search does not pick up the last delimiter.

When the string_tokeniser is first created, we have no search results, so need to initialize appropriately. The same values are set on a reset, and used to get the current token and check status:

  // Constructor
  string_tokeniser::string_tokeniser(const std::string& source, bool empty /*= false*/)
    : source_(source)
    , current_(std::string::npos)
    , length_(0)
    , next_(source.empty() ? std::string::npos : 0)
    , empty_(empty)
  {}

  // Reset search so it can be restarted
  void string_tokeniser::reset()
  {
    current_ = std::string::npos;
    next_ = source_.empty() ? std::string::npos : 0;
    length_ = 0;
  }

  // Return true if there is a current token
  bool string_tokeniser::has_token() const
  {
    // Not worried about length here, as it might be an empty token
    return (std::string::npos != current_);
  }

  // Return true if no further searches can be done
  bool string_tokeniser::at_end() const
  { 
    return (std::string::npos == next_);
  }

  // Get source string
  const std::string& string_tokeniser::get_source() const
  {
    return source_;
  }

  // Get current token, if any, safely
  bool string_tokeniser::get_token(std::string& token) const
  {
    if (!has_token())
      return false;
    if (0 == length_)
      token.clear();
    else
      token = source_.substr(current_, length_);
    return true;
  }

  // Get current token, if any, otherwise an empty token
  std::string string_tokeniser::get_token() const
  {
    if (!has_token() || (0 == length_))
      return std::string();
    return source_.substr(current_, length_);
  }

Right, that just leaves the implementation of the key function: next(). Since there are three overloads, with almost identical implementation, the sensible thing is to break out most of the common stuff into a helper function. Unfortunately, we can’t easily do that, since if we do not care about empty tokens, we have to recurse and try to find the next, in the case of repeated delimiters, which means it will have to be aware of which overload to chose. Instead, we’ll break out the common handling into a template function:

In class declaration:
    ...
    // Helper - handle the result of a search, advancing to prepare for next
    template <typename T>
    bool handle_next(size_t advance, const T& separator);
  public:
    // Constructor, setting the string to work on
    ...

Implementation:
  // Advance to next token
  bool string_tokeniser::next(const std::string::value_type& separator)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find(separator, current_);
    // Deal with result of search
    return handle_next(1, separator);
  }

  // Advance to next token
  bool string_tokeniser::next(const std::string& separator)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find(separator, current_);
    // Deal with result of search
    return handle_next(separator.size(), separator);
  }

  // Advance to next token
  bool string_tokeniser::next(const std::vector<char>& separators)
  {
    // Store the start
    current_ = next_;
    if (at_end())
      return false;
    // Find next
    next_ = source_.find_first_of(&separators[0], current_, separators.size());
    // Deal with result of search
    return handle_next(1, separators);
  }

  // Handle the result of a search, advancing to prepare for next search
  template <typename T>
  bool string_tokeniser::handle_next(size_t advance, const T& separator)
  {
    if (std::string::npos == next_)
    {
      // Separator not found, but there might still be data, at the end 
      length_ = source_.size() - current_;
    }
    else
    {
      // Store the length of the current token
      length_ = next_ - current_;
      // and move next starting point to beyond the one we found
      next_ += advance;
      // In the case of double separators (e.g. | in "a|b||d"), this gives an 
      // empty token. If empties aren't accepted, we'll recurse
      if ((0 == length_) && !empty_)
      {
        return next(separator);
      }
    }
    // Do we have a token?
    if (0 < length_)
      return true;
    // Even if empties are accepted, dismiss an empty token at the end of a 
    // string (e.g. "a|b|" gives "a" and "b" only)
    if (!empty_ || (std::string::npos == next_))
    {
      // Invalidate current, so extraction isn't valid
      current_ = next_;
      return false;
    }
    return true;    
  }

There, all done. And because we can use both single characters, a selection of characters, and strings as delimiters, the equivalent of the strtok example avoids the need to trim spaces:

  std::string table = "Jones, Adam - Colonel, retired - 73\n"
    "Burton-West,Jenny - Surgeon - 37\n"
    "Smith,Ben - Taxi driver - 56";

  std::string last, first, profession, age;
  // Prepare delimiters to use
  std::vector<char> comma_space;
  comma_space.push_back(' ');
  comma_space.push_back(',');
  std::string sp_dash_sp(" - ");
  char endl('\n');

  string_tokeniser tok(table);
  while (!tok.at_end())
  {
    tok.next(comma_space);
    tok.get_token(last);
    tok.next(sp_dash_sp);
    tok.get_token(first);
    tok.next(sp_dash_sp);
    tok.get_token(profession);
    tok.next(endl);
    tok.get_token(age);
  }

Because it is non-destructive, it is by necessity less efficient than strtok, since the tokens have to be copied. On the other hand, if you have to copy the const char* to a char* buffer to use strtok, maybe the efficiency loss isn’t that bad.

As always, if you found this interesting or useful, or have suggestions for improvements, please let me know.

Update: Jens Ayton has informed me that C11 introduced strtok_s, which is even safer, but not, I believe, incorporated into C++11 .