Introduction
Often we need to parse a string and store the fragments in an array or a list. For example, we might need to parse a line from a comma-separated values (CSV) file or an NMEA string. MFC provides the CStringArray and CStringList classes for handling arrays and lists of strings, respectively. The idea of this submission is simple: a tokenizer class that inherits publicly from either CStringArray or CStringList, depending on a template parameter. Once the string is tokenized, the calling code can access the tokens through direct calls to the methods of the parent collection class.
Parameterized inheritance
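Since both of these collection classes inherit from CObject, they support MFC's run-time type identification (RTTI). The constructor uses it to check that the CStringTokenizer class has not been instantiated with any other parent class. The check is an ASSERT, so it only fires in debug builds.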
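template <class T> class CStringTokenizer : public T
{
public:
    enum OPTIONS
    {
        IGNORE_EMPTY_TOKENS = 0x01,
        TERMINATING_STRING  = 0x02
    };
    CStringTokenizer();
    // Appends the tokens found in strSrc to the parent collection.
    // The default terminator is a null pointer (the original '\0' literal
    // only converts to a null pointer on older compilers).
    UINT Tokenize(CString& strSrc, LPCTSTR pStrDelimit,
                  LPCTSTR pStrTerminate = NULL, UINT iOffset = 0);
    virtual ~CStringTokenizer() {}
    void AddOptions(OPTIONS iOptions)    { m_iFlags |= iOptions; }
    void RemoveOptions(OPTIONS iOptions) { m_iFlags &= ~iOptions; }
protected:
    void Add(LPCTSTR pStrNew);   // specialized for each parent class
    UINT m_iFlags;               // combination of OPTIONS bits
    CMutex m_mtxTokenize;        // guards Tokenize() against re-entrance
};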
template <class T>
CStringTokenizer<T>::CStringTokenizer()
    : m_mtxTokenize(FALSE)
{
    m_iFlags = IGNORE_EMPTY_TOKENS;
    CRuntimeClass* pRTC = T::GetRuntimeClass();
    if (RUNTIME_CLASS(CStringArray) == pRTC) return;
    if (RUNTIME_CLASS(CStringList) == pRTC) return;
    ASSERT(FALSE);
}
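For readers without MFC at hand, the same pattern can be sketched in standard C++11. This is only an illustration under stated assumptions: std::vector<std::string> and std::list<std::string> stand in for CStringArray and CStringList, the names Tokenizer and CountAfterTwoPushes are hypothetical, and a compile-time static_assert takes the place of the run-time ASSERT used in the constructor above.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <type_traits>
#include <vector>

// Assumed stand-ins for CStringArray and CStringList.
typedef std::vector<std::string> StringArray;
typedef std::list<std::string>   StringList;

// Parameterized inheritance: the base class is the template parameter.
template <class T>
class Tokenizer : public T
{
    // Compile-time counterpart of the run-time ASSERT: any other parent
    // type is rejected when the template is instantiated.
    static_assert(std::is_same<T, StringArray>::value ||
                  std::is_same<T, StringList>::value,
                  "Tokenizer must derive from StringArray or StringList");
public:
    Tokenizer() : m_flags(0) {}
    unsigned flags() const { return m_flags; }
private:
    unsigned m_flags;
};

// The calling code uses the parent's own interface directly.
inline std::size_t CountAfterTwoPushes()
{
    Tokenizer<StringArray> tok;
    assert(tok.empty());      // a fresh tokenizer holds no tokens
    tok.push_back("alpha");   // std::vector method, inherited publicly
    tok.push_back("beta");
    return tok.size();
}
```

With this variant, an unsupported parent such as Tokenizer<std::string> fails at compile time instead of asserting in a debug build.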
Template specialization helps to smooth out the difference between CStringArray and CStringList
The addition of a new token to a collection is the only place where the tokenizer code has to interact with the parent collection class. Unfortunately, CStringArray and CStringList have no commonly named method for adding a new member to a collection: CStringList has AddHead() and AddTail(), while CStringArray has Add().
At first, I tried to solve this problem with the RTTI built into the MFC framework, writing code that would choose the appropriate method at run time. This approach failed to compile, presumably because both calls are compiled for every parent class and each parent lacks one of the two methods. Then it was suggested that I try template specialization, and it worked! I declared my own Add() method and provided two separate implementations, one for the case where CStringArray is the parent and one for CStringList.
template <>
void CStringTokenizer<CStringArray>::Add(LPCTSTR pStrNew)
{
    TRY
    {
        CStringArray::Add(pStrNew);
    }
    CATCH(CMemoryException, pExc)
    {
        THROW_LAST();   // rethrow the caught exception to the caller
    }
    END_CATCH
}

template <>
void CStringTokenizer<CStringList>::Add(LPCTSTR pStrNew)
{
    TRY
    {
        CStringList::AddTail(pStrNew);
    }
    CATCH(CMemoryException, pExc)
    {
        THROW_LAST();   // rethrow the caught exception to the caller
    }
    END_CATCH
}
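The mechanism can also be tried outside MFC. The following is a minimal sketch, not the article's code: StringArray and StringList are hypothetical stand-ins that mimic only the differently named Add() and AddTail() methods, and Tokenizer, AddToken, and SpecializationDemo are illustrative names.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-ins that copy only the naming difference.
struct StringArray
{
    std::vector<std::string> items;
    void Add(const std::string& s) { items.push_back(s); }
};

struct StringList
{
    std::list<std::string> items;
    void AddHead(const std::string& s) { items.push_front(s); }
    void AddTail(const std::string& s) { items.push_back(s); }
};

template <class T>
class Tokenizer : public T
{
public:
    // Declared once in the primary template; the bodies are supplied
    // only by the explicit specializations below.
    void AddToken(const std::string& token);
};

// A run-time branch that calls Add() in one arm and AddTail() in the other
// cannot compile here: both arms are compiled for every T, and each parent
// lacks one of the two methods. Specialization selects the body at
// compile time instead.
template <>
inline void Tokenizer<StringArray>::AddToken(const std::string& token)
{
    StringArray::Add(token);
}

template <>
inline void Tokenizer<StringList>::AddToken(const std::string& token)
{
    StringList::AddTail(token);
}

inline void SpecializationDemo()
{
    Tokenizer<StringArray> arr;
    arr.AddToken("one");
    assert(arr.items.size() == 1);

    Tokenizer<StringList> lst;
    lst.AddToken("one");
    lst.AddToken("two");
    assert(lst.items.back() == "two");   // appended at the tail
}
```

The rest of the tokenizer can then call AddToken() without knowing which parent it was instantiated with.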
Tokenization
Call the Tokenize(...) function to tokenize a string. After this call, you can work with the tokens through the methods of CStringArray or CStringList. Note that the new tokens are appended to the collection; Tokenize(...) doesn't remove the old tokens.
template <class T>
UINT CStringTokenizer<T>::Tokenize(CString& strSrc, LPCTSTR pStrDelimit, LPCTSTR pStrTerminate, UINT iOffset)
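The body of Tokenize(...) is not reproduced in this section, so the following is only a sketch of the behaviour described above, written in standard C++ rather than MFC. It rests on several assumptions: the delimiters and terminators are treated as character sets, empty tokens are skipped (the default), and the return value is taken to be the offset just past the terminator, which is how the demo below appears to use it. The free function and its std::vector output parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Appends tokens from src, starting at offset, to out. Splits on any
// character in delims and stops at the first character in terminators.
// Returns the offset just past the terminator, or src.size() if no
// terminator was found. The source string is left unchanged.
inline std::size_t Tokenize(const std::string& src,
                            const std::string& delims,
                            const std::string& terminators,
                            std::vector<std::string>& out,
                            std::size_t offset = 0)
{
    if (offset > src.size()) offset = src.size();

    // Limit the scan to the text before the first terminator character.
    const std::size_t stop = terminators.empty()
                           ? std::string::npos
                           : src.find_first_of(terminators, offset);
    const std::size_t end = (stop == std::string::npos) ? src.size() : stop;

    std::size_t pos = offset;
    while (pos <= end) {
        std::size_t next = src.find_first_of(delims, pos);
        if (next == std::string::npos || next > end) next = end;
        if (next > pos)                         // skip empty tokens
            out.push_back(src.substr(pos, next - pos));
        pos = next + 1;
    }
    return (stop == std::string::npos) ? src.size() : stop + 1;
}

inline void TokenizeDemo()
{
    const std::string text =
        "She sells sea shells on a sea shore. \nShells shine.";
    std::vector<std::string> tokens;

    const std::size_t next = Tokenize(text, ". ", "\n", tokens);  // first line
    assert(tokens.size() == 8 && tokens.back() == "shore");

    Tokenize(text, ". ", "\n", tokens, next);                     // second line
    assert(tokens.size() == 10 && tokens.back() == "shine");      // appended
}
```

The second call shows the append semantics: the eight tokens from the first line are still in the collection when the two tokens from the second line arrive.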
Options
IGNORE_EMPTY_TOKENS
If there are two delimiters in a row, the token between them is an empty string. By default, this token is ignored. If RemoveOptions() is called with IGNORE_EMPTY_TOKENS, these tokens are added to the collection instead of being ignored. Clearing the option is useful for parsing CSV files and NMEA strings, where empty fields must keep their positions.
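The effect of the flag can be illustrated with a small stand-alone sketch in standard C++. The Split helper and the sample record are hypothetical; only the flag value 0x01 and the bit operations mirror the class declaration above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum Options { IGNORE_EMPTY_TOKENS = 0x01 };   // same value as in the class

// Hypothetical helper: splits src on any character in delims and honours
// the IGNORE_EMPTY_TOKENS bit in flags.
inline std::vector<std::string> Split(const std::string& src,
                                      const std::string& delims,
                                      unsigned flags)
{
    std::vector<std::string> out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t next = src.find_first_of(delims, pos);
        if (next == std::string::npos) next = src.size();
        const bool empty = (next == pos);
        if (!empty || !(flags & IGNORE_EMPTY_TOKENS))
            out.push_back(src.substr(pos, next - pos));
        if (next == src.size()) break;
        pos = next + 1;
    }
    return out;
}

inline void EmptyTokenDemo()
{
    // Made-up NMEA-like record with one empty field.
    const std::string line = "GPGLL,4916.45,N,,W";

    unsigned flags = IGNORE_EMPTY_TOKENS;              // the default
    assert(Split(line, ",", flags).size() == 4);       // empty field dropped

    flags &= ~IGNORE_EMPTY_TOKENS;                     // what RemoveOptions() does
    const std::vector<std::string> fields = Split(line, ",", flags);
    assert(fields.size() == 5 && fields[3].empty());   // positions preserved
}
```

With the flag cleared, the fifth field stays at index 4, which is what a fixed-layout record format needs.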
TERMINATING_STRING
If this option is set, tokenization stops when the terminating substring is encountered: Tokenize(...) treats pStrTerminate as an ordered substring. If this option is not set, tokenization stops when any character from a set of terminating characters is encountered: Tokenize(...) treats pStrTerminate as an unordered set of characters.
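The difference between the two modes corresponds to the difference between a substring search and a character-set search. The sketch below uses standard C++ string functions to show it; FindTerminator and its bool parameter are hypothetical and are not part of the class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset at which tokenization would stop,
// or std::string::npos if no terminator is found.
inline std::size_t FindTerminator(const std::string& src,
                                  const std::string& terminate,
                                  bool terminatingString)
{
    return terminatingString
         ? src.find(terminate)            // ordered substring, matched whole
         : src.find_first_of(terminate);  // any one character of the set
}

inline void TerminatorDemo()
{
    const std::string text = "Mary had a little lamb... for dinner.";

    // As a substring, "dinner" is only found at the word itself.
    assert(FindTerminator(text, "dinner", true) == 30);

    // As the character set {d, i, n, e, r}, the 'r' in "Mary" already matches.
    assert(FindTerminator(text, "dinner", false) == 2);
}
```

The same terminator argument therefore stops parsing near the end of the sentence in one mode and inside the first word in the other.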
Thread safety notes
Even though the Tokenize(...) method is protected from re-entrance with a mutex, the CStringTokenizer class is only partially thread-safe. MFC documents its classes, including the parent collection classes CStringArray and CStringList, as thread-safe at the class level only: different threads may work on different objects, but access to a single shared object is not synchronized. If a producer thread writes tokens to a CStringTokenizer object by calling Tokenize(...) while a consumer thread reads them through the accessor methods of the parent collection class, the consumer may see a mixture of the old data and the new data.
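One common way to avoid the mixed view is to guard the writer and the readers with the same lock and to publish a finished token list in a single step. The sketch below shows that idea with C++11 std::mutex; it is not part of CStringTokenizer, and SharedTokens, Publish, and Snapshot are hypothetical names. An MFC version would presumably use CSingleLock with a CMutex or CCriticalSection instead.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical wrapper: one mutex guards both sides, so a reader never
// observes a half-updated token list.
class SharedTokens
{
public:
    // Producer side: the new tokens are built privately by the caller and
    // swapped in as a whole while the lock is held.
    void Publish(std::vector<std::string> fresh)
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_tokens.swap(fresh);
    }

    // Consumer side: copies the whole list under the same lock.
    std::vector<std::string> Snapshot() const
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_tokens;
    }

private:
    mutable std::mutex m_mutex;
    std::vector<std::string> m_tokens;
};

inline void SharedTokensDemo()
{
    SharedTokens shared;
    std::vector<std::string> fresh;
    fresh.push_back("She");
    fresh.push_back("sells");
    shared.Publish(fresh);
    assert(shared.Snapshot().size() == 2);
}
```

The cost of this design is a copy per read; it trades some speed for a consumer that always sees either the old list or the new one.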
Demo application / Test bed
void TestTokenizer()
{
    TRACE(_T("Beginning of template string tokenizer demo\n"));
    CString str1 = _T("She sells sea shells on a sea shore. \nShells shine.");

    // Tokenize up to the line break into an array, keeping empty tokens.
    CStringTokenizer<CStringArray> strTokArray;
    strTokArray.RemoveOptions(CStringTokenizer<CStringArray>::IGNORE_EMPTY_TOKENS);
    UINT iStartOffset = strTokArray.Tokenize(str1, _T(". "), _T("\n"));
    TRACE(_T("Tokens in the Array:\n"));
    for (int i = 0; i < strTokArray.GetSize(); ++i)
        TRACE(_T("\t%s\n"), (LPCTSTR)strTokArray[i]);

    // Tokenize again from the returned offset, this time into a list.
    CStringTokenizer<CStringList> strTokList;
    strTokList.Tokenize(str1, _T(". "), _T("\n"), iStartOffset);
    TRACE(_T("Tokens in the List:\n"));
    for (POSITION pos = strTokList.GetHeadPosition(); pos != NULL; )
        TRACE(_T("\t%s\n"), (LPCTSTR)strTokList.GetNext(pos));

    // Stop at a terminating substring instead of a set of characters.
    str1 = _T("Mary had a little lamb... for dinner.");
    strTokList.RemoveAll();
    strTokList.AddOptions(CStringTokenizer<CStringList>::TERMINATING_STRING);
    strTokList.Tokenize(str1, _T(". "), _T("dinner"));
    TRACE(_T("Tokens in the List:\n"));
    // pos is declared again: the first loop variable is out of scope here.
    for (POSITION pos = strTokList.GetHeadPosition(); pos != NULL; )
        TRACE(_T("\t%s\n"), (LPCTSTR)strTokList.GetNext(pos));
    TRACE(_T("End of template string tokenizer demo\n"));
}
Conclusion
This idea seems very obvious, so perhaps I simply didn't look hard enough. However, Googling for 'parser tokenizer CStringArray CStringList template' didn't produce anything similar.
Of course, there are loads of string tokenizers out there on the web. Most of them have an interface similar to Java's StringTokenizer. I didn't follow this de facto standard; maybe I should have. On the other hand, my class preserves the original string.
As usual, suggestions, bug notes, comments etc., are most welcome!
References
- http://www.codeproject.com/cpp/strtok.asp: Another string tokenizer class on CodeProject.
- http://www.codeguru.com/cpp/cpp/cpp_mfc/parsing/article.php/c781/: Yet another string tokenizer class (derived from CObject) on CodeGuru.
- http://www.codeproject.com/string/cstringparser.asp
- http://www.c-plusplus.de/forum/viewtopic-var-p-is-18971.html: String parser in German.
History
- 0.1: Initial submission: December 4, 2006.
- 0.2: Added a mutex to prevent re-entrance; added thread-safety notes: December 29, 2006.
- 0.3: Changed the tokenization algorithm code slightly; added the TERMINATING_STRING option and updated the demo app to exercise this option; added notes about the options: January 5, 2007.