Splitting a Line of Comma-Separated Text

11 Sep 2009 | BSD | 2 min read
A quick and simple method for splitting up the lines in your .csv file.

Introduction

From time to time, you just need to open a comma-separated values (*.csv) text file and roll through the data. Unfortunately, once you have a line of text, you cannot simply split it on the commas, because a field may itself contain commas inside quoted text. Here, I present a relatively simple lexer method that parses a line character-by-character (instead of using the Regular Expression engine) and may get the job done for you.
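To make the problem concrete, here is a minimal sketch of what goes wrong with a plain split. The sample line and the names `NaiveSplitDemo` and `NaiveSplit` are invented for illustration and are not part of the original article.

```csharp
using System;

public static class NaiveSplitDemo
{
    // Splits on every comma, with no awareness of quoted text.
    public static string[] NaiveSplit(string line)
    {
        return line.Split(',');
    }

    public static void Main()
    {
        // Hypothetical sample line: the second field holds a comma inside quotes.
        string line = "1,\"Smith, John\",Boston";

        string[] pieces = NaiveSplit(line);

        // The line has three fields, but the naive split yields four pieces,
        // because the quoted name is cut in two at its embedded comma.
        Console.WriteLine(pieces.Length);
        foreach (string piece in pieces)
        {
            Console.WriteLine("[" + piece + "]");
        }
    }
}
```

The quoted name comes back as two fragments, `"Smith` and ` John"`, which is exactly the case the method in this article is meant to handle.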

Background

I was looking for a quick copy/paste code snippet that would solve this problem. I found numerous Regular Expressions (which, by conventional wisdom with regard to style and practice, are generally the best way to solve this kind of problem); but none of the ones I tried seemed to parse everything in my file correctly. I found some other potential solutions online, but they wanted me to download something (... and I just wanted to copy/paste and move on, remember?). After fiddling with some expressions for a while, I figured I might actually get to my goal more quickly if I were to write a little lexer method that did the job.

Once I had the method written, I thought it might be helpful to someone else in two ways. First, it solves a particular, common problem. Second, and perhaps more important, it could serve as a working example of how to do this kind of string parsing, and as a starting point for someone who has a similar parsing task at hand. Otherwise, you can simply think of it as an exercise in do-it-yourself parsing. I have done my best to keep the code snippet simple and explicit (sometimes sacrificing generality for clarity) so that if you want to use it and modify it, it should be reasonably easy to do so.

One advantage to parsing in this way (rather than using an expression) is that if you need to modify the logic, you can step through it in serial fashion and examine the states of the variables character-by-character, rather than passing the task off to the Regular Expression engine.

But use Regular Expressions when you can.
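For comparison, here is a sketch of one commonly used Regular Expression approach. It is not one of the expressions tried for this article (those are not listed here), and the names `RegexSplitDemo` and `SplitWithRegex` are hypothetical. The pattern matches a comma only when an even number of quotes follows it on the line, which means the comma sits outside quoted text. It assumes quotes are balanced and it knows nothing about backslash escapes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // A comma followed, up to the end of the line, by an even number of
    // quote characters. The lookahead consumes nothing, so only the comma
    // itself is treated as the separator.
    private static readonly Regex Separator =
        new Regex(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    public static string[] SplitWithRegex(string line)
    {
        return Separator.Split(line);
    }

    public static void Main()
    {
        // Hypothetical sample line with a comma inside a quoted field.
        string line = "1,\"Smith, John\",Boston";

        foreach (string field in SplitWithRegex(line))
        {
            Console.WriteLine("[" + field + "]");
        }
    }
}
```

Like the method in this article, this sketch leaves the surrounding quotes on the quoted field. When it misbehaves on an odd line, the only thing you can inspect is the pattern itself, which is the trade-off described above.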

Using the Code

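The code is just a single method, which I include here as a snippet. Pass a line of comma-separated values from your *.csv file to the method, and you should receive an array of the individual "fields" in the line. Note that a quoted field comes back with its surrounding quotes still attached, so strip them afterward if you need the bare text.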

C#
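// This snippet needs the following directives at the top of your file:
//   using System.Collections.Generic;
//   using System.Text;
public string[] SplitCsv(string s)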
{
    // Create a list to hold the tokens as we find them.
    List<string> tokens = new List<string>();
    // We'll need a "buffer" object to build up the tokens character-
    // by-character.
    StringBuilder buffer = new StringBuilder();

    // Convert the string to an array of characters.
    char[] chars = s.ToCharArray();
    // Declare a variable to hold the current character in the loop below.
    // We'll just re-use this variable each time.
    char c = char.MinValue;

    // We'll keep a couple of flags to manage state while we parse...
    // At any given moment, we'll want to know if we think we're
    // inside delimited text.
    bool inText = false;
    // And, as we evaluate one character, we'll want to know if the
    // one before it "escaped" it.
    bool escaped = false;

    // Now, let's look at each character...
    for (int i = 0; i < chars.Length; i++)
    {
        // Get the character at this index.
        c = chars[i];

        // If we are not currently within a block of text, and we've
        // hit the field delimiter (,)...
        if (!inText && c == ',')
        {
            // ...the contents of the buffer are a new token.
            tokens.Add(buffer.ToString());
            // Now clear the buffer.
            buffer.Length = 0;
            // And move along.
            continue;
        }

        // If this character is the "escape" character and we are
        // presently within text...
        if (c == '\\' && inText)
        {
            // If we weren't already in the escape mode, we are now.
            if (!escaped)
            {
                escaped = true;
            }
            else
            {
                // Otherwise, the previous character escaped this
                // one.
                buffer.Append(c);
                // And we're no longer in the escaped mode.
                escaped = false;
            }

            // This character is handled, so move along.
            continue;
        }

        // If we see a text delimiter, i.e. a quote (")...
        if (c == '"')
        {
            // But if this is the very first character we've seen
            // since the last field delimiter (,)...
            if (buffer.Length == 0)
            {
                // ...this is our signal that this field is delimited
                // with quotes.
                inText = true;
            }
            // Otherwise, if this quote was not escaped, and it is the last
            // character in the string or the very next character is the
            // field delimiter...
            else if (!escaped && (i == chars.Length - 1 || chars[i + 1] == ','))
            {
                // ...that means that text delimiting is at an end.
                inText = false;
            }
        }

        // If none of the blocks above handled this character,
        // simply add it to the buffer.
        buffer.Append(c);
        // Since this character was not the "escape" character (\),
        // we are not, at this point, in an escape mode.
        escaped = false;
    }

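    // Place the remaining buffer contents as the final token. The
    // tokens.Count check keeps a trailing empty field (as in "a,b,")
    // while still returning an empty array for an empty line.
    if (buffer.Length > 0 || tokens.Count > 0)
        tokens.Add(buffer.ToString());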

    // Convert the tokens to an array and return them.
    return tokens.ToArray();
}
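As a quick way to see the kind of output to expect, here is a self-contained sketch. So that it compiles on its own, it uses a condensed variant of the same character-by-character idea rather than the method above: the hypothetical `SplitSimple` tracks only the `inText` flag, omits the backslash escape handling, and always adds the final buffer as a field. The sample line is invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLexerDemo
{
    // Hypothetical condensed variant: no backslash escape handling.
    public static string[] SplitSimple(string line)
    {
        var tokens = new List<string>();
        var buffer = new StringBuilder();
        bool inText = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];

            // A comma outside quoted text ends the current field.
            if (!inText && c == ',')
            {
                tokens.Add(buffer.ToString());
                buffer.Length = 0;
                continue;
            }

            if (c == '"')
            {
                // A quote at the start of a field opens quoted text; a quote
                // just before a comma or the end of the line closes it.
                if (buffer.Length == 0)
                    inText = true;
                else if (i == line.Length - 1 || line[i + 1] == ',')
                    inText = false;
            }

            buffer.Append(c);
        }

        tokens.Add(buffer.ToString());
        return tokens.ToArray();
    }

    public static void Main()
    {
        string line = "1,\"Smith, John\",Boston";

        foreach (string field in SplitSimple(line))
        {
            // The quotes are kept in the field, so trim them for display.
            Console.WriteLine(field.Trim('"'));
        }
    }
}
```

Trimming with `Trim('"')` is the simplest way to drop the delimiting quotes, but it also removes any quote characters that legitimately sit at the very start or end of the text, so treat it as a convenience rather than a complete solution.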

I hope this lets you get on with parsing that file so you can get back to the business at hand, or maybe gives you some new ideas about different ways to tackle your string parsing problem.

History

  • 8th September, 2009: Initial post
  • 9th September, 2009: Added a couple of lines to the code snippet to handle problems with the "escape" mode, and to correct the fact that the last field in the line was not included in the list of tokens

License

This article, along with any associated source code and files, is licensed under The BSD License.


Written By
Software Developer (Senior)
United States

Comments and Discussions

 
Re: What about using RegEx (voloda2, 8-Sep-09 23:13)
I would also suggest the RegEx approach. :)

---
Voloda
