Click here to Skip to main content
16,020,714 members
Articles / Programming Languages / C# 7.0

Rolling Your Own? - A Simple CSV Parser Example

Rate me:
Please Sign up or sign in to vote.
5.00/5 (13 votes)
13 Jan 2020CPOL8 min read 14.5K   202   10   4
Implementations in C# and F#

Table of Contents

Introduction

I was having this conversation with a coworker about using a third-party library vs. rolling your own, specifically regarding parsing a comma separated value (CSV) data, or more generally, any text data separated by a known delimiter.

This subject has been pretty much beaten to death and there definitely are many libraries in every language under the sun to parse CSV files. My coworker brought this particular article to my attention: CSV Parsing in .NET Core which is a useful example (which I ignore here) of some edge cases -- certainly not all of them. The article references a couple projects, CsvHelper and TinyCsvParser and one can easily find several CSV parsers on Code Project:

That last link is quite comprehensive in parsers that Tomas Takac evaluated and has a great graph of the finite state machine to parse CSV data. Even more interesting (to me at least), he references RFC-4180 which documents Common Format and MIME Type for Comma-Separated Values (CSV) Files. Pretty cool.

That said, having looked at a few myself and having some very basic requirements (really no edge cases), the question again came up as to whether to roll my own or use a library. Some of the discussion points were:

  • What if there's a bug in the roll-your-own code?
  • What if it needs to be extended?
  • What if the file specification changes?

My answers here are biased of course, but they still might be of some value:

What if there's a bug in the roll-your-own code?

Well, given that the actual parser is 80 lines long (not including extension methods and helper functions), how hard is it to have bugs? And there's unit tests, which actually are 3x more lines of code! My first reaction to rolling my own is always a question to complexity: how complex is the problem to be solved, and how complex are the third party libraries out there that solve that problem? The fewer lines of code, the less bugs and the fewer unit tests one has to write. It cannot be guaranteed that a third-party library is bug free or has implemented decent unit tests. Granted, this is all rather moot for non-edge cases anyways. Still, looking at the code for some of these libraries, in which the implementation is in the thousands of lines of code, I really really can't justify adding that to my code base -- the npm install, nuget install, and so forth are so easy that they preclude any thinking about what you're about to do. This is really a dangerous situation, especially when packages have hidden dependencies and you're suddenly installing 10s or 100s of additional, breakable, dependencies.

What if it needs to be extended?

Extensibility is easily implemented with virtual methods, assuming the implementation breaks the process apart in small steps. There are of course other mechanisms. I find this question however to be a specious argument as it poses a future "what-if" scenario that is unrealistic given the requirements scope. Certainly if the requirements say that all sorts of edge cases have to be handled, then yes, let's go for a third-party library. But I disagree with the argument that edge-cases will suddenly crop up in data files that are produced by other programs and in such a way as to break the parser.

What if the file specification changes?

This is definitely plausible but is not actually an issue of the parser but the class in which the data is parsed and the mapping of properties to fields. So this is not a valid argument in my opinion.

Implementation Time!

So having just said all that, I put together a proof of concept. It's actually a good example to see how far programming languages have evolved, as this code is basically just a set of map, reduce, and filter operations. It took me about 30 minutes to write this code (compared to my writing the F# example, which took over 5 hours, as F# is not something I'm particularly well-skilled at!

The Dataset

I decided to use a simplified version of the data in that article on CSV Parsing in .NET Core referenced earlier:

C#
string data =
@"Make,Model,Type,Year,Cost,Comment
Toyota,Corolla,Car,1990,2000.99,A Comment";

You'll note here that I'm assuming there's a header.

The Class it Goes Into

At a minimum, I wanted the ability to rename properties that are mapped to fields in the CSV:

C#
public enum AutomobileType
    {
        None,
        Car,
        Truck,
        Motorbike
    }

public class Automobile : IPopulatedFromCsv
{
    public string Make { get; set; }
    public string Model { get; set; }
    public AutomobileType Type { get; set; }
    public int Year { get; set; }

    [ColumnMap(Header = "Cost")]
    public decimal Price { get; set; }

    public string Comment { get; set; }
}

Note the ColumnMap attribute:

C#
public class ColumnMapAttribute : Attribute
{
    public string Header { get; set; }
}

and for giggles, I require that any class that is going to be populated by the CSV parser implements IPopulatedFromCsv as a useful hint to the user.

The Code

Here's the core parser:

C#
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

using Clifton.Core.Utils;

namespace SimpleCsvParser
{
    public class CsvParser
    {
        public virtual List<T> Parse<T>(string data, string delimiter) 
               where T : IPopulatedFromCsv, new()
        {
            List<T> items = new List<T>();
            var lines = GetLines(data);
            var headerFields = GetFieldsFromHeader(lines.First(), delimiter);
            var propertyList = MapHeaderToProperties<T>(headerFields);
            lines.Skip(1).ForEach(l => items.Add(Populate<T>(l, delimiter, propertyList)));

            return items;
        }

        // Get all non-blank lines
        protected virtual IEnumerable<string> GetLines(string data)
        {
            var lines = data.Split(new char[] { '\r', '\n' }).Where
                                (l => !String.IsNullOrWhiteSpace(l));

            return lines;
        }

        protected virtual T Create<T>() where T : IPopulatedFromCsv, new()
        {
            return new T();
        }

        protected virtual string[] GetFieldsFromHeader(string headerLine, string delimiter)
        {
            // The || allows blank headers if at least one header or delimiter exists.
            var fields = headerLine.Split(delimiter)
               .Select(f => f.Trim())
               .Where(f => 
                   !string.IsNullOrWhiteSpace(f) || 
                   !String.IsNullOrEmpty(headerLine)).ToArray();

            return fields;
        }

        protected virtual List<PropertyInfo> MapHeaderToProperties<T>(string[] headerFields)
        {
            var map = new List<PropertyInfo>();
            Type t = typeof(T);

            // Include null properties so these are skipped when parsing the data lines.
            headerFields
                .Select(f =>
                    (
                        f,
                        t.GetProperty(f, 
                           BindingFlags.Instance | 
                           BindingFlags.Public | 
                           BindingFlags.IgnoreCase | 
                           BindingFlags.FlattenHierarchy) ?? AliasedProperty(t, f)
                    )
                )
                .ForEach(fp => map.Add(fp.Item2));

            return map;
        }

        protected virtual PropertyInfo AliasedProperty(Type t, string fieldName)
        {
            var pi = t.GetProperties()
                .Where(p => p.GetCustomAttribute<ColumnMapAttribute>()
                ?.Header
                ?.CaseInsensitiveEquals(fieldName) 
                ?? false);
            Assert.That<CsvParserDuplicateAliasException>(pi.Count() <= 1,
                $"{fieldName} is aliased more than once.");

            return pi.FirstOrDefault();
        }

        protected virtual T Populate<T>(
            string line, 
            string delimiter, 
            List<PropertyInfo> props) where T : IPopulatedFromCsv, new()
        {
            T t = Create<T>();
            var fieldValues = line.Split(delimiter);
            // Unmapped fields will have a null property, and we also skip empty fields, 
            // and trim the field value before doing the type conversion.
            props.ForEach(fieldValues, 
               (p, v) => p != null && !String.IsNullOrWhiteSpace(v), 
               (p, v) => p.SetValue(t, Converter.Convert(v.Trim(), p.PropertyType)));

            return t;
        }
    }
}

There's copious use of LINQ, null continuation and null coalescence operators to map fields to properties, and properties with names overridden with the ColumnMap attribute, and handle cases where there is no property that maps to a specific field.

Also note that all methods, even Create are virtual so that they can be overridden for custom behaviors.

The AliasedProperty Method

This piece of code is not entirely obvious. What it does is:

  • Get all the public instance properties of the class we're populating
  • Try and get the custom attribute
  • Using null continuation, get the aliased header
  • Again using null continuation, compare the aliased header with the CSV header
  • Using null coalescence, return either the result of the comparison or return false to the where condition
  • Assert that 0 or 1 properties have that alias
  • Finally, FirstOrDefault is used because we now know we have only 0 or 1 items in the resulting enumeration, and we want to return null as a placeholder if after all this, there is no property that is aliased to the CSV header field value.

The Populate Method

The piece of code uses an extension method ForEach:

  • that iterates both collections in lock step and..
  • executes the action only when the condition is met.

Thus the code:

C#
props.ForEach(fieldValues, 
   (p, v) => p != null && !String.IsNullOrWhiteSpace(v), 
   (p, v) => p.SetValue(t, Converter.Convert(v.Trim(), p.PropertyType)));

does the following:

  • The two collections are props and fieldValues.
  • The property must exist (remember, we use null as a placeholder).
  • The value is not null or whitespace.
  • When that condition is met, we use reflection to set the property value, using the Converter helper.

Using the Parser

C#
class Program
    {
        static void Main(string[] args)
        {
            string data =
@"Make,Model,Type,Year,Cost,Comment
Toyota,Corolla,Car,1990,2000.99,A Comment";

            CsvParser parser = new CsvParser();
            var recs = parser.Parse<Automobile>(data, ",");
            var result = JsonConvert.SerializeObject(recs);
            Console.WriteLine(result);
        }
    }

And the output:

[{"Make":"Toyota","Model":"Corolla","Type":1,"Year":1990,"Price":2000.99,"Comment":"A Comment"}]

The JsonConvert is my anti-example of using a huge third party library simply so as to write out an array of instances values! Because I was lazy! See? I validate my own argument on the dangers of reaching for a third-party library when something much simpler will do!

Supporting Code

So there's a bunch of supporting code. We have extension methods:

C#
public static class ExtensionMethods
{
    public static void ForEach<T>(this IEnumerable<T> collection, Action<T> action)
    {
        foreach (var item in collection)
        {
            action(item);
        }
    }

    // Double iteration -- both collections are iterated over 
    // and are assumed to be of equal length
    public static void ForEach<T, U>(
        this IEnumerable<T> collection1, 
        IList<U> collection2, 
        Func<T, U, 
        bool> where, 
        Action<T, U> action)
    {
        int n = 0;

        foreach (var item in collection1)
        {
            U v2 = collection2[n++];

            if (where(item, v2))
            {
                action(item, v2);
            }
        }
    }

    public static string[] Split(this string str, string splitter)
    {
        return str.Split(new[] { splitter }, StringSplitOptions.None);
    }

    public static bool CaseInsensitiveEquals(this string a, string b)
    {
        return String.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }
}

And we have a simple Assert static class:

C#
public static class Assert
{
    public static void That<T>(bool condition, string msg) where T : Exception, new()
    {
        if (!condition)
        {
            var ex = Activator.CreateInstance(typeof(T), new object[] { msg }) as T;
            throw ex;
        }
    }
}

And we have this Converter class I wrote ages ago:

C#
public class Converter
{
    public static object Convert(object src, Type destType)
    {
        object ret = src;

        if ((src != null) && (src != DBNull.Value))
        {
            Type srcType = src.GetType();

            if ((srcType.FullName == "System.Object") || 
                            (destType.FullName == "System.Object"))
            {
                ret = src;
            }
            else
            {
                if (srcType != destType)
                {
                    TypeConverter tcSrc = TypeDescriptor.GetConverter(srcType);
                    TypeConverter tcDest = TypeDescriptor.GetConverter(destType);

                    if (tcSrc.CanConvertTo(destType))
                    {
                        ret = tcSrc.ConvertTo(src, destType);
                    }
                    else if (tcDest.CanConvertFrom(srcType))
                    {
                        if (srcType.FullName == "System.String")
                        {
                            ret = tcDest.ConvertFromInvariantString((string)src);
                        }
                        else
                        {
                            ret = tcDest.ConvertFrom(src);
                        }
                    }
                    else
                    {
                        // If the target type is a base class of the source type, 
                        // then we don't need to do any conversion.
                        if (destType.IsAssignableFrom(srcType))
                        {
                            ret = src;
                        }
                        else
                        {
                            // If no conversion exists, throw an exception.
                            throw new ConverterException("Can't convert from " + 
                               src.GetType().FullName + 
                               " to " + 
                               destType.FullName);
                        }
                    }
                }
            }
        }
        else if (src == DBNull.Value)
        {
            if (destType.FullName == "System.String")
            {
                // convert DBNull.Value to null for strings.
                ret = null;
            }
        }

        return ret;
    }
}

Really, the last commit was Nov 26, 2015! It's so old that inline string parsing $"{}" didn't exist (or I wasn't aware of it, haha.)

Unit Tests

So then I thought, well, let's write some unit tests, since this code is actually unit-testable:

C#
public class NoAliases : IPopulatedFromCsv
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }
}

public class WithAlias : IPopulatedFromCsv
{
    public string A { get; set; }
    public string B { get; set; }

    [ColumnMap(Header = "C")]
    public string TheCField { get; set; }
}

public class WithDuplicateAlias : IPopulatedFromCsv
{
    public string A { get; set; }
    public string B { get; set; }

    [ColumnMap(Header = "C")]
    public string TheCField { get; set; }

    [ColumnMap(Header = "C")]
    public string Oops { get; set; }
}

public class NoMatchingField : IPopulatedFromCsv
{
    public string A { get; set; }
    public string B { get; set; }
    public string C { get; set; }

    public string D { get; set; }
}

public class NoMatchingProperty : IPopulatedFromCsv
{
    public string A { get; set; }
    public string C { get; set; }
}

public class Disordered : IPopulatedFromCsv
{
    public string C { get; set; }
    public string B { get; set; }
    public string A { get; set; }
}

// public class NoMatching

/// <summary>
/// By inheriting from CsvParser, we have access to the protected methods.
/// </summary>
[TestClass]
public class CsvParserTests : CsvParser
{
    [TestMethod]
    public void BlankAndWhitespaceLinesAreSkippedTest()
    {
        GetLines("").Count().Should().Be(0);
        GetLines(" ").Count().Should().Be(0);
        GetLines("\r").Count().Should().Be(0);
        GetLines("\n").Count().Should().Be(0);
        GetLines("\r\n").Count().Should().Be(0);
    }

    [TestMethod]
    public void CRLFBothCreateNewLineTest()
    {
        GetLines("a\rb").Count().Should().Be(2);
        GetLines("a\nb").Count().Should().Be(2);
        GetLines("a\r\nb").Count().Should().Be(2);
    }

    [TestMethod]
    public void HeaderSplitByDelimiterTest()
    {
        GetFieldsFromHeader("a,b,c", ",").Count().Should().Be(3);
    }

    [TestMethod]
    public void SingleHeaderTest()
    {
        GetFieldsFromHeader("a", ",").Count().Should().Be(1);
    }

    [TestMethod]
    public void NoHeaderTest()
    {
        GetFieldsFromHeader("", ",").Count().Should().Be(0);
    }

    [TestMethod]
    public void EmptyHeaderFieldIsAllowedTest()
    {
        GetFieldsFromHeader("a,,c", ",").Count().Should().Be(3);
    }

    [TestMethod]
    public void BlankHeaderFieldsAreAllowedTest()
    {
        GetFieldsFromHeader(",,", ",").Count().Should().Be(3);
    }

    [TestMethod]
    public void HeaderIsTrimmedTest()
    {
        GetFieldsFromHeader("a ", ",")[0].Should().Be("a");
        GetFieldsFromHeader(" a", ",")[0].Should().Be("a");
    }

    [TestMethod]
    public void DirectMappingTest()
    {
        MapHeaderToProperties<NoAliases>
                 (new string[] { "A", "B", "C" }).Count().Should().Be(3);
    }

    [TestMethod]
    public void AliasTest()
    {
        MapHeaderToProperties<WithAlias>
                 (new string[] { "A", "B", "C" }).Count().Should().Be(3);
    }

    [TestMethod]
    public void CaseInsensitiveTest()
    {
        MapHeaderToProperties<NoAliases>
                 (new string[] { "a", "b", "c" }).Count().Should().Be(3);
    }

    [TestMethod]
    public void AdditionalPropertyTest()
    {
        MapHeaderToProperties<NoMatchingField>
                 (new string[] { "A", "B", "C" }).Count().Should().Be(3);
    }

    [TestMethod]
    public void MissingPropertyTest()
    {
        MapHeaderToProperties<NoMatchingProperty>
                 (new string[] { "A", "B", "C" }).Count().Should().Be(3);
    }

    [TestMethod]
    public void FieldNotMappedTest()
    {
        var props = MapHeaderToProperties<NoMatchingProperty>(new string[] { "A", "B", "C" });
        props.Count().Should().Be(3);
        props[0].Should().NotBeNull();
        props[1].Should().BeNull();
        props[2].Should().NotBeNull();
    }

    [TestMethod]
    public void CaseInsensitiveAliasTest()
    {
        AliasedProperty(typeof(WithAlias), "c").Should().NotBeNull();
    }

    [TestMethod]
    public void PropertyNotFoundTest()
    {
        AliasedProperty(typeof(NoAliases), "D").Should().BeNull();
    }

    [TestMethod]
    public void DuplicateAliasTest()
    {
        this.Invoking((_) => AliasedProperty(typeof(WithDuplicateAlias), "C")).Should()
            .Throw<CsvParserDuplicateAliasException>()
            .WithMessage("C is aliased more than once.");
    }

    [TestMethod]
    public void PopulateFieldsTest()
    {
        var props = MapHeaderToProperties<NoAliases>(new string[] { "a", "b", "c" });
        var t = Populate<NoAliases>("1,2,3", ",", props);
        t.A.Should().Be("1");
        t.B.Should().Be("2");
        t.C.Should().Be("3");
    }

    [TestMethod]
    public void PopulatedFieldsAreTrimmedTest()
    {
        var props = MapHeaderToProperties<NoAliases>(new string[] { "a", "b", "c" });
        var t = Populate<NoAliases>("1 , 2, 3 ", ",", props);
        t.A.Should().Be("1");
        t.B.Should().Be("2");
        t.C.Should().Be("3");
    }

    [TestMethod]
    public void PopulateFieldNotMappedTest()
    {
        var props = MapHeaderToProperties<NoMatchingProperty>(new string[] { "a", "b", "c" });
        var t = Populate<NoMatchingProperty>("1,2,3", ",", props);
        t.A.Should().Be("1");
        t.C.Should().Be("3");
    }

    [TestMethod]
    public void OrderedTest()
    {
        var map = MapHeaderToProperties<NoAliases>(new string[] { "A", "B", "C" });
        map[0].Name.Should().Be("A");
        map[1].Name.Should().Be("B");
        map[2].Name.Should().Be("C");
    }

    [TestMethod]
    public void DisorderedTest()
    {
        // Field mapping should be independent of the order of the fields.
        var map = MapHeaderToProperties<Disordered>(new string[] { "A", "B", "C" });
        map[0].Name.Should().Be("A");
        map[1].Name.Should().Be("B");
        map[2].Name.Should().Be("C");
    }
}

Woohoo!

F# Implementation Without Generic Class

And then I decided to flog myself by rewriting this in F# and learn some things along the way. I wanted to try this because I really like the forward pipe |> operator in F#, as it lets you chain functions together using the output of one function as the input of another function. I also "made the mistake" of not implementing a CsvParser class -- this led to discovering that generics apply only to members, not functions, but I discovered a really cool workaround. So here's my crazy F# implementation:

F#
open System
open System.Reflection
open System.ComponentModel

type AutomobileType = 
    None = 0 
    | Car = 1 
    | Truck = 2 
    | Motorbike = 3

// Regarding AllowNullLiteral: https://sergeytihon.com/2013/04/10/f-null-trick/
[<AllowNullLiteral>]
type ColumnMapAttribute(headerName) =
    inherit System.Attribute()
    let mutable header : string = headerName
    member x.Header with get() = header

type Automobile() =
    let mutable make : string = null
    let mutable model : string = null
    let mutable year : int = 0
    let mutable price : decimal = 0M
    let mutable comment : string = null
    let mutable atype : AutomobileType = AutomobileType.None

    member x.Make with get() = make and set(v) = make <- v
    member x.Model with get() = model and set(v) = model <- v
    member x.Type with get() = atype and set(v) = atype <- v
    member x.Year with get() = year and set(v) = year <- v
    member x.Comment with get() = comment and set(v) = comment <- v

    [<ColumnMap("Cost")>]
    member x.Price with get() = price and set(v) = price <- v

module String = 
    let notNullOrWhiteSpace = not << System.String.IsNullOrWhiteSpace

// I would never have been able to figure this out.
// https://stackoverflow.com/a/32345373/2276361
type TypeParameter<'a> = TP

[<EntryPoint>]
let main _ = 
    let mapHeaderToProperties(_ : TypeParameter<'a>) fields = 
        let otype = typeof<'a>
        fields |> 
            Array.map (fun f -> 
                match otype.GetProperty(f, 
                    BindingFlags.Instance ||| 
                    BindingFlags.Public ||| 
                    BindingFlags.IgnoreCase ||| 
                    BindingFlags.FlattenHierarchy) with
                | null -> match otype.GetProperties() |> Array.filter 
                     (fun p -> 
                              match p.GetCustomAttribute<ColumnMapAttribute>() with 
                              | null -> false
                                   // head::tail requires a list, not an array!
                              | a -> a.Header = f) |> Array.toList with                               
                          | head::_ -> head
                          | [] -> null
                | a -> a
            )

    let convert(v, destType) = 
        let srcType = v.GetType();

        let tcSrc = TypeDescriptor.GetConverter(srcType)
        let tcDest = TypeDescriptor.GetConverter(destType)

        if tcSrc.CanConvertTo(destType) then
            tcSrc.ConvertTo(v, destType)
        else 
            if tcDest.CanConvertFrom(srcType) then
                if srcType.FullName = "System.String" then
                    tcDest.ConvertFromInvariantString(v)
                else
                    tcDest.ConvertFrom(v);
             else
                if destType.IsAssignableFrom(srcType) then
                    v :> obj
                else
                    null

    let populate(_ : TypeParameter<'a>) 
        (line:string) (delimiter:char) (props: PropertyInfo[]) =
        let t = Activator.CreateInstance<'a>()
        let vals = line.Split(delimiter)
        
        for i in 0..vals.Length-1 do
            match (props.[i], vals.[i]) with
            | (p, _) when p = null -> ()
            | (_, v) when String.isNullOrWhiteSpace v -> ()
            | (p, v) -> p.SetValue(t, convert(v.Trim(), p.PropertyType))

        t 

    let data ="Make,Model,Type,Year,Cost,Comment\r\nToyota,Corolla,Car,1990,2000.99,A Comment";
    let delimiter = ',';
    let lines = data.Split [|'\r'; '\n'|] |> Array.filter 
                        (fun l -> String.notNullOrWhiteSpace l)
    let headerLine = lines.[0]
    let fields = headerLine.Split(delimiter) |> Array.map (fun l -> l.Trim()) |> 
        Array.filter (fun l -> String.notNullOrWhiteSpace l || 
                               String.notNullOrWhiteSpace headerLine)
    let props = mapHeaderToProperties(TP:TypeParameter<Automobile>) fields
    let recs = lines |> Seq.skip 1 |> Seq.map (fun l ->
               populate(TP:TypeParameter<Automobile>) l delimiter props) |> Seq.toList
    
    printf "fields: %A\r\n" fields
    printfn "lines: %A\r\n" lines
    printfn "props: %A\r\n" props
    recs |> Seq.iter 
    (fun r -> printfn "%A %A %A %A %A %A\r\n" r.Make r.Model r.Year r.Type r.Price r.Comment)

    0

And the output is:

fields: [|"Make"; "Model"; "Type"; "Year"; "Cost"; "Comment"|]
lines: [|"Make,Model,Type,Year,Cost,Comment";
  "Toyota,Corolla,Car,1990,2000.99,A Comment"|]

props: [|System.String Make; System.String Model; AutomobileType Type; Int32 Year;
  System.Decimal Price; System.String Comment|]

"Toyota" "Corolla" 1990 Car 2000.99M "A Comment"

It was a struggle not to use FirstOrDefault:

F#
// Using FirstOrDefault:
fields |> 
    Array.map (fun f -> 
        match otype.GetProperty(f, 
            BindingFlags.Instance ||| 
            BindingFlags.Public ||| 
            BindingFlags.IgnoreCase ||| 
            BindingFlags.FlattenHierarchy) with
        | null -> otype.GetProperties() |> 
                    Array.filter (fun p -> match p.GetCustomAttribute<ColumnMapAttribute>() with
                                           | null -> false
                                           | a -> a.Header = f)
                |> Enumerable.FirstOrDefault
        | a -> a
    )

But I persevered! A few things I learned:

  • match is touchy, I spend a while figuring out why an array [] wouldn't work. My F# ignorance rears its ugly head.
  • F# doesn't like nullable types, and C# is full of them, so I learned about the [<AllowNullLiteral>] attribute and its use!
  • This type TypeParameter<'a> = TP was one gnarly piece of code to spoof functions into being "generic!" I don't take the credit for that -- see the code comments.

F# Implementation with SimpleCsvParser<'a>

So I decided to rework the implementation into a true generic class -- very easily done:

F#
type SimpleCsvParser<'a>() =
    member __.mapHeaderToProperties fields = 
        let otype = typeof<'a>
        fields |> 
            Array.map (fun f -> 
                match otype.GetProperty(f, 
                    BindingFlags.Instance ||| 
                    BindingFlags.Public ||| 
                    BindingFlags.IgnoreCase ||| 
                    BindingFlags.FlattenHierarchy) with
                | null -> match otype.GetProperties() |> Array.filter (fun p -> 
                              match p.GetCustomAttribute<ColumnMapAttribute>() with 
                              | null -> false
                              | a -> a.Header = f) |> Array.toList with
                          | head::_ -> head
                          | [] -> null
                | a -> a
            )

    member __.populate (line:string) (delimiter:char) (props: PropertyInfo[]) =
        let t = Activator.CreateInstance<'a>()
        let vals = line.Split(delimiter)
        
        for i in 0..vals.Length-1 do
            match (props.[i], vals.[i]) with
            | (p, _) when p = null -> ()
            | (_, v) when String.isNullOrWhiteSpace v -> ()
            | (p, v) -> p.SetValue(t, convert(v.Trim(), p.PropertyType))

        t 

    member this.parse (data:string) (delimiter:char) = 
        let lines = data.Split [|'\r'; '\n'|] |> Array.filter 
                           (fun l -> String.notNullOrWhiteSpace l)
        let headerLine = lines.[0]
        let fields = headerLine.Split(delimiter) |> Array.map 
                           (fun l -> l.Trim()) |> Array.filter 
                     (fun l -> String.notNullOrWhiteSpace l || 
                     String.notNullOrWhiteSpace headerLine)
        let props = this.mapHeaderToProperties fields
        lines |> Seq.skip 1 |> Seq.map 
             (fun l -> this.populate l delimiter props) |> Seq.toList

Note that convert, being a general purpose method, is not implemented in the class. Also note that the bizarre type TypeParameter<'a> = TP type definition isn't needed, as we use the class like this:

F#
let parser = new SimpleCsvParser<Automobile>();
let recs2 = parser.parse data delimiter
recs2 |> Seq.iter
(fun r -> printfn "%A %A %A %A %A %A\r\n" r.Make r.Model r.Year r.Type r.Price r.Comment)

Conclusion

As I mentioned, the C# code (excluding unit tests) took 30 minutes to write. The F# code, about 5 hours. A great learning experience!

But more to the point is really still the unanswered question as to whether you want to roll your own or use a third party library. My view is that if the requirement is simple and unlikely to change, rolling your own should at least be considered before reaching for a third party library, especially if the library has one or more of the following qualities:

  • Has a significantly larger code base because it's trying to handle all sorts of scenarios
  • Has many dependencies on other libraries
  • Does not have good test coverage
  • Does not have good documentation
  • Is not extensible
  • It takes more time to evaluate libraries than to roll the code!

History

  • 13th January, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect Interacx
United States United States
Blog: https://marcclifton.wordpress.com/
Home Page: http://www.marcclifton.com
Research: http://www.higherorderprogramming.com/
GitHub: https://github.com/cliftonm

All my life I have been passionate about architecture / software design, as this is the cornerstone to a maintainable and extensible application. As such, I have enjoyed exploring some crazy ideas and discovering that they are not so crazy after all. I also love writing about my ideas and seeing the community response. As a consultant, I've enjoyed working in a wide range of industries such as aerospace, boatyard management, remote sensing, emergency services / data management, and casino operations. I've done a variety of pro-bono work non-profit organizations related to nature conservancy, drug recovery and women's health.

Comments and Discussions

 
SuggestionGood Article Pin
Jonathan Wood17-Feb-20 4:15
Jonathan Wood17-Feb-20 4:15 
QuestionAn Excellent Example Pin
George Swan14-Jan-20 21:49
mveGeorge Swan14-Jan-20 21:49 
QuestionI am pretty much for rolling your own too Pin
asiwel14-Jan-20 11:02
professionalasiwel14-Jan-20 11:02 
QuestionYou Walk a Fine Line Pin
David A. Gray14-Jan-20 7:32
David A. Gray14-Jan-20 7:32 
Having many times found existing libraries wanting, I am absolutely with you with respect to the advantages of rolling your own. With that said, however, there are occasions, IMO, when finding and using a solid library is the most conservative approach.

Having worked with CSV files off and on for decades, I have become keenly aware of the many edge cases that can trip a CSV parser. The example that prompted me to create A Robust CSV Reader is a string that I encountered when I attempted to parse the standard identity fields of the X.509 digital certificates in the Trusted Root Certificates store of a Microsoft Windows installation. Subsequent to its initial publication in 2015, I published the source code in an open GitHub repository and as a NuGet package. I suspect this package meets your tests.
  1. Brevity: There are 3 classes, ICSVParser (224 lines), Parser (354 lines), and CSVParseEngine (1879 lines). Those line counts include lines occupied by copious comments and XML islands that drive the DocFX parser that generates the documentation included in the GitHub repository and afford the class with robust Intellisense help.
  2. Independence: The library has no dependencies other than system libraries, and its target is version 2 of the Microsoft .NET Framework.
  3. Test Coverage: The repository includes a stand-alone executable program that doubles as a unit test stand and a usage demonstration program.
Though the editors, in their infinite wisdom, classified the program as a tip, I see it as much more significant than that classification suggests, although its classification is not sufficiently important to me to justify the effort that I would probably have to exert to petition the editors on its behalf. Likewise, I've put very little effort into promoting it. Nevertheless, the NuGet package has been downloaded hundreds of times without any real effort on my part. I don't know how many times the repository has been cloned, since I am unaware of any such statistics kept by GitHub, although I suspect they exist somewhere.
David A. Gray
Delivering Solutions for the Ages, One Problem at a Time
Interpreting the Fundamental Principle of Tabular Reporting

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.