A Look at LINQ

David Stone


20 Sep 2005, 10 min read


An overview of the new Language Integrated Query (LINQ) framework.

Download source files - 5.03 Kb

Introduction

For most programmers today, our jobs require us to integrate some sort of data into our applications. Often, we have to take data from multiple sources, be they in-memory collections, a database like SQL Server or Access, an XML file, Active Directory, the file system, etc. With today's languages and technologies, getting to this data is often tedious. For databases, using ADO.NET is just a bunch of plumbing code that gets really boring really fast. The story for dealing with XML is even worse, as the System.Xml namespace is very cumbersome to use. On top of that, every data source has its own means of querying the data in it: SQL for databases, XQuery for XML, LDAP queries for Active Directory, and so on. In short, today's data access story is a mess.

Enter LINQ

The folks at Microsoft aren't oblivious to the problems of today's data access story. And so, now that C# 2.0 is about to be released, they've given us a look at C# 3.0 in the form of the LINQ project. LINQ stands for Language Integrated Query. The LINQ project's stated goal is to add "general purpose query facilities to the .NET Framework that apply to all sources of information, not just relational or XML data". The beauty of the LINQ project is twofold.

First, LINQ is integrated directly into your favorite language. Because the underlying API is just a set of .NET classes that operate like any other .NET class, language designers can integrate the functionality these classes expose directly into the language. Second, and perhaps most importantly, the query functionality in LINQ extends to more than just SQL or XML data. Any class that implements IEnumerable<T> can be queried using LINQ. That should elicit a feeling of absolute joy in you. Or maybe I'm just weird.

Let's look at LINQ

In this article, I only want to focus on the language features of LINQ. Too many people confuse LINQ with DLinq and XLinq. They are not one and the same. LINQ is the set of language features and class patterns that DLinq and XLinq follow. That being said, in this article, we will only work with in-memory collections of data. So, without further ado, let's take a look at a basic LINQ program that serves absolutely no practical purpose (those are the best types, aren't they?):

using System;
using System.Query;

namespace LinqArticle
{
    public static class GratuitousLinqExample
    {
        public static void Main()
        {
            // The most active list on CP
            var mostActive = new string[] { 
                "Christian Graus",
                "Paul Watson",
                "Nishant Sivakumar",
                "Roger Wright",
                "Jörgen Sigvardsson",
                "David Wulff",
                "ColinDavies",
                "Chris Losinger",
                "peterchen",
                "Shog9" };

            // Get only the people whose name begins with D
            var namesWithD = from poster in mostActive
                       where poster.StartsWith("D")
                       select poster;

            // Print each person out
            foreach(var individual in namesWithD)
            {
                Console.WriteLine(individual);
            }
        }
    }
}

There we go. Now, at this point, you're probably looking at that saying, "That serves no practical purpose!" I would like to remind you: That's the point. So what we have here is a list of the most active CPians. Then we write this funky query in what looks kinda like SQL to get only the CPians whose names start with the letter D. And we write their names to the console. In this case, we've got everybody's favorite Tivertonian, David Wulff. There's not a whole lot special here. You could be sitting there thinking that you could just replace the entire thing with code that looks like this:

// Print each person out
foreach(var someone in mostActive)
{
    if(someone.StartsWith("D"))
    {
        Console.WriteLine(someone);
    }
}

And you'd be right; you could replace it with that. But that wouldn't be cool, because it wouldn't have the gratuitous LINQ usage in the previous example.

Looking closer

Now let's take a longer and closer look at LINQ. We'll also be looking at the new language features in C# 3.0. I'd like to point out first that these new language features run on the .NET 2.0 CLR. This is key because, unlike C# 2.0's generics, which needed changes to the runtime itself, there isn't any deep plumbing work that goes into making these 3.0 features possible; the compiler and a few library classes do all the work. This will, most likely, mean a shorter release cycle for C# 3.0 and LINQ. (At least, that's how the rumor goes.) Anyway, now that we've got that out of the way, let's take a closer look at our LINQ code. First, we've got the cool new System.Query namespace reference:

using System.Query;

This namespace is all you'll need to get started with LINQ. Contained within it are vast troves of treasure, innumerable tomes of knowledge, and power beyond your wildest imagination…or maybe just a few classes and delegates.

Type inference

Then we've got this weird var keyword that keeps popping up all over the place. The JavaScript people in the audience should feel right at home now. var is a new keyword introduced in C# 3.0 that has a special meaning. First, let's talk about what var is not. var is not a variant datatype (the JavaScript people are no longer at home now) nor another keyword for object. var is used to signal the compiler that we're using the new Local Variable Type Inference in C# 3.0. So, unlike in JavaScript, where var means that this variable can hold absolutely anything we want, this var keyword tells the C# compiler to infer the type from the assignment. What do I mean by that? Well, let's look at a little snippet of code to demonstrate:

var myInt = 5;
var myString = "This is pretty stringy";
var myGuid = new System.Guid();

In the above example, the compiler sees the var keyword, looks at the assignment to myInt, and determines that it should be an Int32, then assigns 5 to it. When it sees that we assign our string to the myString variable, it determines that myString should be of type System.String. Same goes for myGuid. Pretty cool, huh? If you then try to do something stupid like:

myInt = "Haha, let's see if we can trick the compiler!";

We're going to get a nice compiler error message telling us how foolish we are: Cannot implicitly convert type 'string' to 'int'. (I'm sure if the compiler had seen Napoleon Dynamite, it would be saying, "Gosh! Freakin' idiot!" right about now.)
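If you want to convince yourself that var is resolved at compile time and isn't just object in disguise, here's a minimal sketch. StaticTypeOf is a hypothetical helper of my own, not part of LINQ; it leans on generic type inference to report the type the compiler picked for a variable, rather than the runtime type of whatever value is stored in it.

```csharp
using System;

public static class VarSketch
{
    // Hypothetical helper: T is inferred from the argument's compile-time type.
    public static string StaticTypeOf<T>(T value)
    {
        return typeof(T).FullName;
    }

    public static void Main()
    {
        var myInt = 5;            // the compiler infers System.Int32
        object boxed = 5;         // explicitly declared as object
        var myString = "stringy"; // the compiler infers System.String

        Console.WriteLine(StaticTypeOf(myInt));    // System.Int32
        Console.WriteLine(StaticTypeOf(boxed));    // System.Object
        Console.WriteLine(StaticTypeOf(myString)); // System.String
    }
}
```

The boxed line is the interesting one: it holds the same value as myInt, but its static type is object, which is exactly what var does not give you.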

Standard query operators

Now, moving on, we can see that after we have our string array, we have this funky piece of code:

// Get only the people whose name begins with D
var namesWithD = from poster in mostActive
                  where poster.StartsWith("D")
                  select poster;

This is where the real fun begins. What we see here is a variable being assigned to something that looks a lot like a SQL query. It's got the select (albeit in the wrong place), the from, the where; it's SQL, right? No. These keywords are the C# 3.0 syntax for some of LINQ's Standard Query Operators. When the C# compiler sees these keywords, it maps them to calls to a set of methods that perform the appropriate operations. In other words, you could also have written the query given above like this:

// Get only the people whose name begins with D
var namesWithD = mostActive
    .Where(person => person.StartsWith("D"))
    .Select(person => person);

This is what the LINQ people call Explicit Dot Notation. It's the same exact query and you can write your queries either way. And this, of course, leads to the side question of what those funny "=>" marks are.

Lambda Expressions

Those are another C# 3.0 feature: Lambda Expressions. Lambda Expressions are the natural evolution of C# 2.0's Anonymous Methods. Essentially, a Lambda Expression is a convenient syntax we use to assign a chunk of code (the anonymous method) to a variable (the delegate). In this case, the delegates we use in the above query are defined in the System.Query namespace as such:

public delegate T Func<T>();
public delegate T Func<A0, T>(A0 arg0);

So this code snippet:

person => person.StartsWith("D")

Could be written as:

Func<string, bool> startsWithD = delegate (string person) {
                        return person.StartsWith("D"); 
                    };

The lambda is a lot more compact than the anonymous method, isn't it? Lambda Expressions are basically just syntactic sugar around Anonymous Methods, and you can use either of them or even regular named methods when creating filters for these query operators. Lambda Expressions, though, have the benefit of being compiled either to IL or to an Expression Tree, depending on how they're used. That stuff's a bit too much for the current discussion though. Suffice it to say that Lambda Expressions are way cool. Next subject!
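Before moving on, here's a small sketch of the three ways to supply the same filter, plus a peek at the Expression Tree option. It assumes a released C# compiler (3.0 or later), where Func<T, TResult> lives in the System namespace and Expression<T> in System.Linq.Expressions rather than in the preview's System.Query. The class and method names are mine, purely for illustration.

```csharp
using System;
using System.Linq.Expressions;

public static class LambdaSketch
{
    // A regular named method whose signature matches Func<string, bool>.
    public static bool StartsWithD(string s)
    {
        return s.StartsWith("D");
    }

    // Builds the same filter from an expression tree and compiles it.
    public static Func<string, bool> FromExpressionTree()
    {
        // Assigning a lambda to Expression<...> makes the compiler emit
        // a data structure describing the code instead of IL for it.
        Expression<Func<string, bool>> tree = s => s.StartsWith("D");
        return tree.Compile();
    }

    public static void Main()
    {
        Func<string, bool> asLambda = s => s.StartsWith("D");
        Func<string, bool> asAnonymousMethod =
            delegate (string s) { return s.StartsWith("D"); };
        Func<string, bool> asNamedMethod = StartsWithD;
        Func<string, bool> fromTree = FromExpressionTree();

        // All four delegates behave identically.
        Console.WriteLine(asLambda("David Wulff"));          // True
        Console.WriteLine(asAnonymousMethod("David Wulff")); // True
        Console.WriteLine(asNamedMethod("Shog9"));           // False
        Console.WriteLine(fromTree("Shog9"));                // False
    }
}
```

The expression tree version is the hook that lets a library inspect your filter and translate it into something else (a SQL WHERE clause, say) instead of running it as ordinary code.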

Extension Methods

The astute reader will notice that, till now, there's been no discussion as to where these methods come from that the standard query operators map to. I mentioned before that LINQ worked on anything that implemented IEnumerable<T>. One could reasonably assume, therefore, that these methods reside in the new C# 3.0 definition of the IEnumerable<T> interface. That assumption, however, would be wrong. These methods, which reside in the System.Query.Sequence class (whose source is available in the LINQ Preview install, by the way), are part of a new feature in C# 3.0 called Extension Methods.

Extension Methods are a new way of extending existing types. Basically, this works by adding a "this" modifier to the first parameter of a static method, like so: (Example code shamelessly stolen from the Sequence class.)

public static IEnumerable<T> Where<T>(
  this IEnumerable<T> source, Func<T, bool> predicate) {
      foreach (T element in source) {
        if (predicate(element)) yield return element;
      }
}

There's nothing really special here, except for that "this" modifier on the first argument. The compiler sees this and treats it as a new method on the specified type. So now IEnumerable<T> gets the Where() method. Pretty cool, huh? Something to remember is that "real" methods get first priority. If you call Where() on an object, then the compiler goes to find Where() on the object itself first. If Where() doesn't exist, then it goes off to find an Extension Method. Clearly, while this feature is cool and really powerful, extension methods should be used extremely sparingly. Anders Hejlsberg warned those of us in the LINQ talk at the PDC not to add our favorite 10 methods to System.Object. This feature is probably the one that gives you the most potential to shoot yourself in the foot.
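To make the lookup order concrete, here's a minimal sketch with names I made up for the purpose: EveryOther is a hypothetical extension on IEnumerable<T> (it is not one of the Standard Query Operators), and Greeter is a throwaway class that has an instance method with the same name as an extension method. It assumes a released C# compiler, where ToArray comes from System.Linq.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SketchExtensions
{
    // Hypothetical extension: yields the 1st, 3rd, 5th... element.
    public static IEnumerable<T> EveryOther<T>(this IEnumerable<T> source)
    {
        bool take = true;
        foreach (T element in source)
        {
            if (take) yield return element;
            take = !take;
        }
    }

    // Same name as Greeter's instance method, so it loses the lookup.
    public static string Describe(this Greeter greeter)
    {
        return "extension";
    }
}

public class Greeter
{
    public string Describe()
    {
        return "instance";
    }
}

public static class ExtensionSketch
{
    public static void Main()
    {
        var letters = new[] { "a", "b", "c", "d", "e" };

        // Called as if string[] had always had an EveryOther() method.
        Console.WriteLine(string.Join(",", letters.EveryOther().ToArray())); // a,c,e

        // The "real" method wins over the extension.
        Console.WriteLine(new Greeter().Describe()); // instance

        // The extension is still an ordinary static method underneath.
        Console.WriteLine(SketchExtensions.Describe(new Greeter())); // extension
    }
}
```

Note that the shadowed extension compiles without complaint, which is part of why piling extensions onto widely used types is risky: a later version of the type can add a real method and silently change which code runs.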

A more interesting LINQ example

Now that we've seen the basics of LINQ and C# 3.0, let's look at a slightly more interesting example. First, let's define a new Poster class for ourselves:

public class Poster
{
    public string name;
    public int numberOfPosts;
    public int numberOfArticles;
    
    public Poster(string name, 
               int numberOfPosts, int numberOfArticles)
    {
        this.name = name;
        this.numberOfPosts = numberOfPosts;
        this.numberOfArticles = numberOfArticles;
    }
}

Now let's modify the previous example to utilize our new Poster class (clearly these values are going to change):

public static void Main()
{
    // The most active list on CP, with 
    // names, post counts, and article counts
    var mostActive = new Poster[] {
        new Poster("Christian Graus", 22215, 32),
        new Poster("Paul Watson", 20185, 7),
        new Poster("Nishant Sivakumar", 18608, 99),
        new Poster("Roger Wright", 16790, 1),
        new Poster("Jörgen Sigvardsson", 14118, 7),
        new Poster("David Wulff", 13748, 4),
        new Poster("ColinDavies", 12919, 0),
        new Poster("Chris Losinger", 11970, 18),
        new Poster("peterchen", 11163, 9),
        new Poster("Shog9", 10605, 3)
    };

    // Get only the people who have ridiculously 
    // large post counts
    var peopleWithoutLives = from poster in mostActive
             where poster.numberOfPosts > 15000
             select new {poster.name, poster.numberOfPosts};


    // Print each person out
    foreach(var individual in peopleWithoutLives)
    {
        Console.WriteLine("{0} has posted {1} messages",
            individual.name,
            individual.numberOfPosts);
    }
}

Anonymous Types

Now, we've got an array of the most active CPians by message count and their articles. In our query, we specify that we only want those CPians with more than 15000 posts…but the select clause is different. Since we only want their name and their message count, not the number of articles they've posted, we just specify those two fields. This is a new feature of C# 3.0 called Anonymous Types (what's with all the anonymity in .NET now? Good grief!). Usually we only want certain fields from the collections we query, so this is a nice, easy way to query out just those fields. But, you say, what is that type called? Well, the compiler generates a name for it. It's probably something horribly unpronounceable too. But just accept the fact that it's a new type and has just those fields you asked for.
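If you're curious about that generated name, here's a minimal sketch that prints it. It assumes a released C# compiler (hence System.Linq instead of the preview's System.Query), and the Describe method and its sample data are mine. The exact name you see is an implementation detail of whichever compiler you use, so don't write code that depends on it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class AnonymousTypeSketch
{
    // Returns one line per poster above the threshold, followed by the
    // compiler-generated name of the anonymous type used in the select.
    public static List<string> Describe(int minimumPosts)
    {
        // Anonymous types work for the source data too.
        var posters = new[] {
            new { Name = "Christian Graus", Posts = 22215, Articles = 32 },
            new { Name = "David Wulff", Posts = 13748, Articles = 4 }
        };

        // The projection keeps two of the three properties.
        var busy = from p in posters
                   where p.Posts > minimumPosts
                   select new { p.Name, p.Posts };

        var lines = new List<string>();
        foreach (var item in busy)
        {
            lines.Add(string.Format("{0} has posted {1} messages",
                item.Name, item.Posts));
            // The type has a name; we just never get to type it ourselves.
            lines.Add("type name: " + item.GetType().Name);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe(15000))
        {
            Console.WriteLine(line);
        }
    }
}
```

This is also why var matters so much for LINQ: without it there would be no way to declare a variable of a type whose name you can't write down.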

A look at some more advanced features

Let's jazz up the sample a little and include some new operators: group by and orderby.

// Group the people with really large post counts
var peopleWithoutLives = from poster in mostActive
    group poster by (poster.numberOfPosts / 5000) into postGroup
    orderby postGroup.Key descending
    select postGroup;

// Print each person out in their respective group
Console.WriteLine("Posters by group");
foreach(var group in peopleWithoutLives)
{
    // Print the block's range, lower bound first
    Console.WriteLine("{0}-{1}",
        group.Key * 5000,
        (group.Key + 1) * 5000 - 1);
    foreach(var person in group.Group)
    {
        Console.WriteLine("\t{0}", person.name);
    }
}

So we see here that we've got the ability to group people into categories by a certain criterion: the Key. In this case, our criterion is the number of posts they've made divided by 5000 (integer division, so the remainder is thrown away), which lets us see who fits into each 5000-post block. The value of the criterion expression is then stored in the group's Key field. The difference between this query and the others is the return value. This query returns groups, which then contain the Poster items. Pretty nifty, eh?
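For anyone trying this on a released version of LINQ rather than the preview, here's a hedged sketch of the same grouping. It assumes the shipped API, where the namespace is System.Linq and each group is an IGrouping<TKey, TElement> that you enumerate directly instead of going through a Group property. PostCount and Summarize are names I made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the article's Poster class.
public class PostCount
{
    public string Name;
    public int Posts;

    public PostCount(string name, int posts)
    {
        Name = name;
        Posts = posts;
    }
}

public static class GroupingSketch
{
    // Returns one "lower-upper: names" line per 5000-post block,
    // highest block first.
    public static List<string> Summarize(IEnumerable<PostCount> posters)
    {
        var blocks = from poster in posters
                     group poster by poster.Posts / 5000 into postGroup
                     orderby postGroup.Key descending
                     select postGroup;

        var lines = new List<string>();
        foreach (var block in blocks)
        {
            // block carries the Key and is itself the sequence of members.
            string[] names = block.Select(p => p.Name).ToArray();
            lines.Add(string.Format("{0}-{1}: {2}",
                block.Key * 5000,
                (block.Key + 1) * 5000 - 1,
                string.Join(", ", names)));
        }
        return lines;
    }

    public static void Main()
    {
        var posters = new[] {
            new PostCount("Christian Graus", 22215),
            new PostCount("Paul Watson", 20185),
            new PostCount("Nishant Sivakumar", 18608),
            new PostCount("David Wulff", 13748)
        };

        foreach (string line in Summarize(posters))
        {
            Console.WriteLine(line);
        }
        // 20000-24999: Christian Graus, Paul Watson
        // 15000-19999: Nishant Sivakumar
        // 10000-14999: David Wulff
    }
}
```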

Last letters on LINQ

Well, there's a quick look at LINQ. In summary, we looked at the current, rather sad, state of today's data access story. Then we looked at how LINQ and the new language features in C# 3.0 solve these issues by giving us a consistent set of Standard Query Operators that we can use to query any collection that implements IEnumerable<T>. In this article, we only focused on in-memory collections of data in order to avoid the confusion that most people have when mixing LINQ with DLinq and XLinq, but rest assured that there's a way to access relational and XML data with LINQ. Otherwise there wouldn't be much point, now, would there? Furthermore, because LINQ is just a set of methods that adhere to the naming conventions for the Standard Query Operators, anybody can implement their own LINQ-based collections for accessing any other type of data. For instance, the WinFS team is going to be making their product LINQ-enabled.

If you're as totally stoked about LINQ as I am, and want to read more about it, I'd recommend heading over to the LINQ Preview Site. There, you can download the LINQ preview package which integrates into Visual Studio 2005 Beta 2 to provide the new LINQ features and you can read more about DLinq and XLinq and the new C# 3.0 specifications.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
