Introduction
From time to time, you just need to open a comma-separated values (*.csv) text file and roll through the data. Unfortunately, once you have a line of text, you cannot simply split it on the commas, because a field may itself contain commas inside quoted text. Here, I present a relatively simple lexer method that parses a line character by character (instead of using the Regular Expression engine) and may get the job done for you.
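To see the problem concretely, here is a minimal sketch of the naive approach. The class and method names (NaiveSplitDemo, NaiveSplit) are made up for illustration; the only library call is string.Split.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Hypothetical helper: splits on every comma and ignores quotation marks.
    // Shown only to illustrate why this is not enough for real CSV data.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }
}
```

Calling NaiveSplitDemo.NaiveSplit("1,\"Smith, John\",42") cuts the quoted name in two, so a line with three fields comes back as four pieces.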
Background
I was looking for a quick copy/paste code snippet that would solve this problem. I found numerous Regular Expressions (which, by conventional wisdom with regard to style and practice, are generally the best way to solve this kind of problem); but none of the ones I tried seemed to parse everything in my file correctly. I found some other potential solutions online, but they wanted me to download something (... and I just wanted to copy/paste and move on, remember?). After fiddling with some expressions for a while, I figured I might actually get to my goal more quickly if I were to write a little lexer method that did the job.
Once I had the method written, I thought it might be helpful to someone else in two ways. First, it solves a particular, common problem. Second, and perhaps more important, it could serve as a working example of how to do this kind of string parsing, and as a starting point for someone who has a similar parsing task at hand. Otherwise, you can simply think of it as an exercise in do-it-yourself parsing. I have done my best to keep the code snippet simple and explicit (sometimes sacrificing generality for clarity) so that if you want to use it and modify it, it should be reasonably easy to do so.
One advantage to parsing in this way (rather than using an expression) is that if you need to modify the logic, you can step through it in serial fashion and examine the states of the variables character-by-character, rather than passing the task off to the Regular Expression engine.
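If you would rather not single-step in a debugger, a small trace can show the same thing. The following is only a sketch under simplifying assumptions: it tracks the "inside quoted text" state alone, ignores backslash escapes, and the names LexerTrace and Trace are invented for this example.

```csharp
using System.Collections.Generic;

public static class LexerTrace
{
    // Hypothetical helper: records, for each character, whether the lexer
    // is inside quoted text after reading it. Simplified: no escape handling.
    public static List<string> Trace(string line)
    {
        var log = new List<string>();
        bool inText = false;
        int fieldStart = 0; // index at which the current field begins

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (!inText && c == ',')
            {
                fieldStart = i + 1;  // a comma outside quotes starts a new field
            }
            else if (c == '"')
            {
                if (i == fieldStart)
                    inText = true;   // a quote at the start of a field opens it
                else if (i == line.Length - 1 || line[i + 1] == ',')
                    inText = false;  // a quote before a comma or end of line closes it
            }
            log.Add(c + " inText=" + inText);
        }
        return log;
    }
}
```

For the line a,"b,c" the trace has seven entries: the comma between a and the opening quote is logged with inText=False, while the comma between b and c is logged with inText=True, which is exactly the distinction a plain split cannot make.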
But use Regular Expressions when you can.
Using the Code
The code is just a single method which I include here as a snippet. Pass a line of comma-separated values from your *.csv file to the method, and you should receive an array of the individual "fields" in the line.
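    // Requires: using System.Collections.Generic; using System.Text;
    public string[] SplitCsv(string s)
    {
        List<string> tokens = new List<string>();
        StringBuilder buffer = new StringBuilder();
        bool inText = false;   // true while inside a quoted field
        bool escaped = false;  // true when the previous character was an escaping backslash

        for (int i = 0; i < s.Length; i++)
        {
            char c = s[i];

            // A comma outside quoted text ends the current field.
            if (!inText && c == ',')
            {
                tokens.Add(buffer.ToString());
                buffer.Length = 0;
                continue;
            }

            // Inside quoted text, a backslash escapes the next character;
            // a doubled backslash yields one literal backslash.
            if (c == '\\' && inText)
            {
                if (!escaped)
                {
                    escaped = true;
                }
                else
                {
                    buffer.Append(c);
                    escaped = false;
                }
                continue;
            }

            // An escaped quote is literal text and must not open or close a field.
            if (c == '"' && !escaped)
            {
                if (buffer.Length == 0)
                {
                    inText = true;   // opening quote at the start of a field
                }
                else if (i == s.Length - 1 || s[i + 1] == ',')
                {
                    inText = false;  // closing quote just before a comma or the end of the line
                }
            }

            // Quotation marks are kept in the token, like every other character.
            buffer.Append(c);
            escaped = false;
        }

        // Always add the last field, so that a trailing comma yields a final
        // empty field (and an empty line yields a single empty field).
        tokens.Add(buffer.ToString());
        return tokens.ToArray();
    }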
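Note that the method appends the quotation marks to the buffer along with everything else, so a quoted field comes back with its surrounding quotes still attached. If you want the bare value, a small post-processing step can remove them. The helper below is a sketch, not part of the original snippet; the names CsvFieldHelper and TrimQuotes are invented for this example.

```csharp
public static class CsvFieldHelper
{
    // Hypothetical helper: removes one pair of surrounding quotation marks,
    // if present, and returns every other field unchanged.
    public static string TrimQuotes(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2);
        }
        return field;
    }
}
```

You would typically apply it to each element of the array returned by SplitCsv, for example CsvFieldHelper.TrimQuotes(fields[1]). Fields without a matching pair of outer quotes pass through untouched.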
I hope this lets you get on with parsing that file so you can get back to the business at hand, or maybe gives you some new ideas about different ways to tackle your string parsing problem.
History
- 8th September, 2009: Initial post
- 9th September, 2009: Added a couple of lines to the code snippet to handle problems with the "escape" mode, and to correct the fact that the last field in the line was not included in the list of tokens