Introduction
Given the large number of free and open source CSV reader components and classes, I was surprised that I could not pick one to use. The reasons varied: some readers need additional work because they do not cover the formats I am interested in, some are not simple to use, some implement their logic in a way that is hard to analyze, and so on.
I wanted to resolve the “I need a CSV reader” issue once and for all, so I decided to write one more solution. Eventually, the main requirements for the reader crystallized into the following:
- Ability to handle the majority, if not all, of the existing variations of CSV formats, including "TAB separated", etc.
- Extremely intuitive and simple to use. Ideally, you get the idea of how to use it just by looking at the list of public methods and properties.
- Expandable with regard to sources of CSV data
- Fast, straightforward and clean parser with minimum conditional logic
I believe that the base class presented here, together with its descendants, satisfies the above requirements.
CSVFileReader and CSVStringReader are lightweight, fast classes that resemble a unidirectional data set. They are very simple to use and have properties that allow handling a number of existing variations of CSV and "CSV-like" formats.
The classes are derived from the abstract CSVReader class, which does not specify a data source and instead works with an instance of the TextReader class.
CSVFileReader and CSVStringReader accept a file and a string as data sources, respectively. They introduce additional “CSV source” related properties and override the abstract method that returns an instance of a specific TextReader descendant:
protected abstract TextReader CreateDataSourceReader();
Classes for other CSV data sources can be created in a similar way.
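As a rough illustration of that extension point, the sketch below derives a reader that takes its data from an arbitrary Stream. The names CSVStreamReader, Source and SketchCsvReaderBase are hypothetical and do not come from the downloadable source. The stand-in base class models only the one abstract method quoted above so that the snippet compiles on its own; a real descendant would inherit from CSVReader instead.

```csharp
using System;
using System.IO;
using System.Text;

// Stand-in for the article's abstract CSVReader (assumption: only the
// factory method quoted above is modeled; the real class does far more).
public abstract class SketchCsvReaderBase
{
    protected abstract TextReader CreateDataSourceReader();

    // Helper for this sketch only: returns the whole source as text.
    public string ReadAllText()
    {
        using (TextReader reader = CreateDataSourceReader())
        {
            return reader.ReadToEnd();
        }
    }
}

// Hypothetical descendant that reads CSV data from any Stream.
public class CSVStreamReader : SketchCsvReaderBase
{
    // "CSV source" related property, analogous to FileName or DataString.
    public Stream Source { get; set; }

    protected override TextReader CreateDataSourceReader()
    {
        if (Source == null)
            throw new InvalidOperationException("Source is not set.");
        return new StreamReader(Source, Encoding.UTF8);
    }
}
```

The only source-specific work is choosing which TextReader descendant to return; everything else would stay in the base class.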
Input Data Format (CSV Format)
According to Wikipedia: “A general standard for the CSV file format does not exist, but RFC 4180 provides a de facto standard for some aspects of it.”
While this CSVReader is RFC 4180-compliant, it provides lots of “extras” (see Appendix below for a summary of RFC 4180).
CSVReader Features
- Supports three kinds of line delimiters: <CR>, <CR><LF> and <LF>, all of which can be present in the same CSV file simultaneously. Consequently, the pair <LF><CR> results in an empty line, which can still be handled by setting the property IgnoreEmptyLines to true.
- Presence of a header in the very first record of the file is controlled by the bool property HeaderPresent.
- Empty lines can be ignored (by default, they are not ignored).
- The number of fields is auto-detected (by default) based on the first record, or must be set explicitly if auto-detection is off.
- The field separator is a comma (0x2C) by default, but virtually any (Unicode) character can be used, for example, TAB.
- Field quoting allows multi-line field values and the presence of quote and field separator characters within a field. By default, it is assumed that a field may or may not be enclosed in quotes, but the reader can be instructed not to use field quoting.
- The quote character is the double quote (0x22) by default, but virtually any (Unicode) character can be used. It is assumed that the quote character is also used as the escape character.
- The Unicode range of character codes is assumed by default, but it can be limited to ASCII only by setting the corresponding property to true.
- Characters with codes below 0x20 are considered “special characters” and by default must not appear in the file. This requirement does not apply to line delimiters, nor to the field separator or quote character if they are from that range. Optionally, the reader can be instructed to simply ignore special characters.
- The reader itself does not use buffering. It uses just enough memory to store the field names and the field values of the current record. If any buffering takes place, standard .NET classes like StreamReader and StringReader are responsible for it.
- The reader is expected to be fast, since it reads each character directly from the TextReader and analyzes each character just once, i.e., it does one-pass parsing. The parser also uses minimum conditional logic.
Public Class Members
Constructors
Each class has a single parameterless constructor.
Input Properties
Attempting to change the values of these properties while the reader is in the Active (open) state causes an exception.
Common (CSVReader) properties
bool HeaderPresent
bool FieldCount_AutoDetect
int FieldCount
int FieldSeparatorCharCode
bool UseFieldQuoting
int QuoteCharCode
bool IgnoreEmptyLines
bool ASCIIonly
bool IgnoreSpecialCharacters
CSVFileReader specific properties
string FileName
CSVStringReader specific properties
string DataString
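The rule that input properties cannot be changed while the reader is open can be pictured with a small guard in each setter. The sketch below is only an illustration of that pattern: the class GuardedSettingsSketch and the helper EnsureInactive are hypothetical, and the exception type (InvalidOperationException) is an assumption, since the text above does not say which exception the real classes throw.

```csharp
using System;

public class GuardedSettingsSketch
{
    private bool active;
    private bool headerPresent;

    public bool Active { get { return active; } }

    // Input property: may be changed only while the reader is closed.
    public bool HeaderPresent
    {
        get { return headerPresent; }
        set
        {
            EnsureInactive();
            headerPresent = value;
        }
    }

    public void Open() { active = true; }
    public void Close() { active = false; }

    // Assumption: the exception type is chosen for this sketch only.
    private void EnsureInactive()
    {
        if (active)
            throw new InvalidOperationException(
                "Property cannot be changed while the reader is open.");
    }
}
```

Rejecting changes in the open state keeps parsing settings from shifting in the middle of a file, which is presumably why the input properties are locked.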
Other Common Properties
bool Active
bool Bof
bool Eof
CSVFields Fields
int RecordCountProcessedSoFar
Methods
void Open()
void Close()
void Next()
Events
event EventHandler FieldCountAutoDetectCompleted
Using the Code
Usage is straightforward: create an instance of the corresponding class, specify the source of CSV data, modify properties if necessary, call Open() and iterate through the records by calling Next(). Within each record, iterate through the field values. Call Close() when done.
Using CSVFileReader Class
using Nvv.Components.CSV;

using (CSVFileReader csvReader = new CSVFileReader())
{
    // Input properties must be set before Open().
    csvReader.FileName = "CSVFilePath";
    csvReader.HeaderPresent = true;
    csvReader.Open();

    // Print the field names taken from the header record.
    if (csvReader.HeaderPresent)
    {
        for (int i = 0; i < csvReader.FieldCount; i++)
        {
            Console.WriteLine("Name{0}={1}", i, csvReader.Fields[i].Name);
        }
    }

    // Print the field values of every record, moving forward only.
    while (!csvReader.Eof)
    {
        for (int i = 0; i < csvReader.FieldCount; i++)
        {
            Console.WriteLine("Value{0}={1}", i, csvReader.Fields[i].Value);
        }
        csvReader.Next();
    }
    csvReader.Close();
}
Using CSVStringReader Class
using Nvv.Components.CSV;

using (CSVStringReader csvReader = new CSVStringReader())
{
    // Input properties must be set before Open().
    csvReader.DataString = "1,2,3";
    // With HeaderPresent = true, the first record is read as the header.
    csvReader.HeaderPresent = true;
    csvReader.Open();

    // Print the field names taken from the header record.
    if (csvReader.HeaderPresent)
    {
        for (int i = 0; i < csvReader.FieldCount; i++)
        {
            Console.WriteLine("Name{0}={1}", i, csvReader.Fields[i].Name);
        }
    }

    // Print the field values of every record, moving forward only.
    while (!csvReader.Eof)
    {
        for (int i = 0; i < csvReader.FieldCount; i++)
        {
            Console.WriteLine("Value{0}={1}", i, csvReader.Fields[i].Value);
        }
        csvReader.Next();
    }
    csvReader.Close();
}
Downloading Source Code
The following source code, which was prepared in Visual C# 2010 Express, is available for download:
- The C# solution and project of the assembly containing the classes CSVReader, CSVFileReader and CSVStringReader is in the CSVClasses folder.
- The C# solution of the application that tests both the CSVFileReader and CSVStringReader classes is in the CSVTest folder. This solution also includes and references the above CSVClasses assembly project. If you are interested in this application, it probably makes sense to unzip everything together, exactly as it is in the zip file, to avoid breaking the reference.
Both solutions target .NET 4.0, though at least CSVClasses can most likely be “retargeted” to other versions as well.
Appendix
Brief summary of the definition of the CSV format from RFC 4180 (http://tools.ietf.org/html/rfc4180):
- Each record is on separate line(s) delimited by a line break (<CR><LF> = 0x0D 0x0A), except for the last record, where the line break is optional.
- An optional header (with field names) can be present as the first record.
- Each record should contain the same number of fields throughout the file. This effectively disallows empty lines, except in a CSV file with a single field, where an empty line holds a single “empty” value.
- The field separator is a comma (0x2C).
- A field may or may not be enclosed in double quotes (0x22). Enclosing a field allows line breaks, double quotes and commas within it. The double quote is also used as the escape character.
- Spaces are considered part of a field and should not be ignored.
- Text data that can appear in a field is limited to the code range 0x20 – 0x7E (which limits it to ASCII codes).
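The quoting rules summarized above can be sketched as a small field splitter for a single record. This is a simplified illustration, not the parser from the downloadable source: the class Rfc4180Sketch and the method ParseRecord are hypothetical, the separator and quote characters are hard-coded to comma and double quote, and the record is assumed to be already separated from its neighbors.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Rfc4180Sketch
{
    // Splits one record into fields. Inside a quoted field, a doubled quote
    // stands for one literal quote; commas and line breaks are kept as data.
    public static List<string> ParseRecord(string record)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;
        for (int i = 0; i < record.Length; i++)
        {
            char c = record[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    field.Append(c);
                }
                else if (i + 1 < record.Length && record[i + 1] == '"')
                {
                    field.Append('"'); // escaped quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == ',')
            {
                fields.Add(field.ToString());
                field.Length = 0;
            }
            else
            {
                field.Append(c); // spaces are part of the field
            }
        }
        fields.Add(field.ToString());
        return fields;
    }
}
```

For example, the record a,"b,""c"""," d" without the outer quoting on the last field, i.e. the C# string "a,\"b,\"\"c\"\"\", d", splits into three fields: a, then b,"c", then a field that keeps its leading space before d.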
History
Version 1.1 (2014-09-03)
1. Namespace changed
2. Significant performance improvement:
- Use of StringBuilder where it is appropriate.
- Merged frequently called methods into one large procedure at the expense of code structure and readability. Apparently, the overhead of a method call is significant.
Version 1.0 (2014-05-20)