Introduction
For most programmers today, our jobs require us to integrate some sort of data into our applications. Often, we have to take data from multiple sources, be they in-memory collections, a database like SQL Server or Access, an XML file, Active Directory, the file system, etc. With today's languages and technologies, getting to this data is often tedious. For databases, using ADO.NET means writing a bunch of plumbing code that gets really boring really fast. The story for dealing with XML is even worse, as the System.Xml namespace is very cumbersome to use. Also, every data source has its own means of querying the data in it: SQL for databases, XQuery for XML, LDAP queries for Active Directory, and so on. In short, today's data access story is a mess.
Enter LINQ
The folks at Microsoft aren't oblivious to the problems of today's data access story. And so, now that C# 2.0 is about to be released, they've given us a look at C# 3.0 in the form of the LINQ project. LINQ stands for Language Integrated Query. The LINQ project's stated goal is to add "general purpose query facilities to the .NET Framework that apply to all sources of information, not just relational or XML data". The beauty of the LINQ project is twofold.
First, LINQ is integrated directly into your favorite language. Because the underlying API is just a set of .NET classes that operate like any other .NET class, language designers can integrate the functionality these classes expose directly into the language. Second, and perhaps more importantly, the query functionality in LINQ extends to more than just SQL or XML data. Any class that implements IEnumerable<T> can be queried using LINQ. That should elicit a feeling of absolute joy in you. Or maybe I'm just weird.
Let's look at LINQ
In this article, I only want to focus on the language features of LINQ. Too many people confuse LINQ with DLinq and XLinq. They are not one and the same. LINQ is the set of language features and class patterns that DLinq and XLinq follow. That being said, in this article, we will only work with in-memory collections of data. So, without further ado, let's take a look at a basic LINQ program that serves absolutely no practical purpose (those are the best types, aren't they?):
using System;
using System.Query; // the LINQ preview namespace
namespace LinqArticle
{
    public static class GratuitousLinqExample
    {
        public static void Main()
        {
            // The data source: a plain in-memory array of strings.
            var mostActive = new string[] {
                "Christian Graus",
                "Paul Watson",
                "Nishant Sivakumar",
                "Roger Wright",
                "Jörgen Sigvardsson",
                "David Wulff",
                "ColinDavies",
                "Chris Losinger",
                "peterchen",
                "Shog9" };
            // The query: keep only the names that start with "D".
            var namesWithD = from poster in mostActive
                             where poster.StartsWith("D")
                             select poster;
            // Enumerate the results and print each one.
            foreach(var individual in namesWithD)
            {
                Console.WriteLine(individual);
            }
        }
    }
}
There we go. Now, at this point, you're probably looking at that saying, "That serves no practical purpose!" I would like to remind you: That's the point. So what we have here is a list of the most active CPians. Then we write this funky query in what looks kinda like SQL to get only the CPians whose names start with the letter D. And we write their names to the console. In this case, we've got everybody's favorite Tivertonian, David Wulff. There's not a whole lot special here. You could be sitting there thinking that you could just replace the entire thing with code that looks like this:
foreach(var someone in mostActive)
{
    if(someone.StartsWith("D"))
    {
        Console.WriteLine(someone);
    }
}
And you'd be right; you could replace it with that. But that wouldn't be cool, because it wouldn't have the gratuitous LINQ usage in the previous example.
Looking closer
Now let's take a longer and closer look at LINQ. We'll also be looking at the new language features in C# 3.0. I'd like to point out first that these new language features run on the .NET 2.0 CLR. This is key because, unlike C# 2.0, whose generics required changes to the runtime itself, there isn't any deep plumbing work that goes into making these 3.0 features possible. This will, most likely, mean a shorter release cycle for C# 3.0 and LINQ. (At least, that's how the rumor goes.) Anyway, now that we've got that out of the way, let's take a closer look at our LINQ code. First, we've got the cool new System.Query namespace reference:
using System.Query;
This namespace is all you'll need to get started with LINQ. Contained within it are vast troves of treasure, innumerable tomes of knowledge, and power beyond your wildest imagination…or maybe just a few classes and delegates.
Type inference
Then we've got this weird var keyword that keeps popping up all over the place. The JavaScript people in the audience should feel right at home now. var is a new keyword introduced in C# 3.0 that has a special meaning. First, let's talk about what var is not. var is not a variant datatype (the JavaScript people are no longer at home now) nor another keyword for object. var is used to signal the compiler that we're using the new Local Variable Type Inference in C# 3.0. So, unlike in JavaScript, where var means that this variable can hold absolutely anything we want, this var keyword tells the C# compiler to infer the type from the assignment. What do I mean by that? Well, let's look at a little snippet of code to demonstrate:
var myInt = 5;
var myString = "This is pretty stringy";
var myGuid = new System.Guid();
In the above example, the compiler sees the var keyword, looks at the assignment to myInt, and determines that it should be an Int32, then assigns 5 to it. When it sees that we assign our string to the myString variable, it determines that myString should be of type System.String. Same goes for myGuid. Pretty cool, huh? If you then try to do something stupid like:
myInt = "Haha, let's see if we can trick the compiler!";
We're going to get a nice compiler error message telling us how foolish we are: Cannot implicitly convert type 'string' to 'int'. (I'm sure if the compiler had seen Napoleon Dynamite, it would be saying, "Gosh! Freakin' idiot!" right about now.)
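To see the inference for yourself, here's a minimal sketch that prints the types behind each var. The class and method names (VarInferenceSketch, InferredTypeNames) are made up for illustration, and it only assumes a compiler that understands var.

```csharp
using System;

public static class VarInferenceSketch
{
    // Returns the runtime type names of three var locals. For these values,
    // the runtime type is the same as the type the compiler inferred.
    public static string[] InferredTypeNames()
    {
        var myInt = 5;                            // inferred as System.Int32
        var myString = "This is pretty stringy";  // inferred as System.String
        var myGuid = new Guid();                  // inferred as System.Guid

        // Uncommenting the next line stops the build, because myInt is an int:
        // myInt = "Haha, let's see if we can trick the compiler!";

        return new[]
        {
            myInt.GetType().FullName,
            myString.GetType().FullName,
            myGuid.GetType().FullName
        };
    }

    public static void Main()
    {
        foreach (var typeName in InferredTypeNames())
        {
            Console.WriteLine(typeName);
        }
    }
}
```

The point to take away is that the types are fixed at compile time; var just saves you from typing them out.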
Standard query operators
Now, moving on, we can see that after we have our string array, we have this funky piece of code:
var namesWithD = from poster in mostActive
                 where poster.StartsWith("D")
                 select poster;
This is where the real fun begins. What we see here is a variable being assigned something that looks a lot like a SQL query. It's got the select (albeit in the wrong place), the from, the where; it's SQL, right? No. These keywords are the query syntax for some of LINQ's Standard Query Operators. When the C# compiler sees these keywords, it maps them to a set of methods that perform the appropriate operations. Alternatively, you could also have written the query given above like this:
var namesWithD = mostActive
    .Where(person => person.StartsWith("D"))
    .Select(person => person);
This is what the LINQ people call Explicit Dot Notation. It's the same exact query and you can write your queries either way. And this, of course, leads to the side question of what those funny "=>" marks are.
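If you want to convince yourself that the two spellings really are the same query, here's a small sketch that runs both over a shortened list. It assumes you're compiling against the released framework, where the operators live in System.Linq rather than the preview's System.Query; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the preview used System.Query

public static class QuerySyntaxSketch
{
    static readonly string[] MostActive =
    {
        "Christian Graus", "Paul Watson", "David Wulff", "Shog9"
    };

    // The query written with the from/where/select keywords.
    public static List<string> WithQueryExpression()
    {
        var namesWithD = from poster in MostActive
                         where poster.StartsWith("D")
                         select poster;
        return namesWithD.ToList();
    }

    // The same query written as explicit method calls.
    public static List<string> WithDotNotation()
    {
        return MostActive
            .Where(person => person.StartsWith("D"))
            .Select(person => person)
            .ToList();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", WithQueryExpression()));
        Console.WriteLine(string.Join(", ", WithDotNotation()));
    }
}
```

Both lines print David Wulff. Which style you pick is mostly a matter of taste.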
Lambda Expressions
Those are another C# 3.0 feature: Lambda Expressions. Lambda Expressions are the natural evolution of C# 2.0's Anonymous Methods. Essentially, a Lambda Expression is a convenient syntax we use to assign a chunk of code (the anonymous method) to a variable (the delegate). In this case, the delegates we use in the above query are defined in the System.Query namespace as such:
public delegate T Func<T>();
public delegate T Func<A0, T>(A0 arg0);
So this code snippet:
person => person.StartsWith("D")
Could be written as:
Func<string, bool> startsWithD = delegate (string person) {
    return person.StartsWith("D");
};
The lambda is a lot more compact than the anonymous method, isn't it? Lambda Expressions are basically just syntactic sugar around Anonymous Methods, and you can use either of them or even regular named methods when creating filters for these query operators. Lambda Expressions, though, have the benefit of being compiled either to IL or to an Expression Tree, depending on how they're used. That stuff's a bit too much for the current discussion though. Suffice it to say that Lambda Expressions are way cool. Next subject!
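Here's a rough sketch of the same filter written every way mentioned above: a named method, an anonymous method, a lambda, and a lambda captured as an expression tree. It assumes the released framework, where Func lives in System and Expression in System.Linq.Expressions (the preview put its delegates in System.Query). LambdaSketch and EvaluateAll are made-up names.

```csharp
using System;
using System.Linq.Expressions; // assumption: released home of Expression<T>

public static class LambdaSketch
{
    // A regular named method with the same shape as the filter.
    static bool StartsWithD(string person)
    {
        return person.StartsWith("D");
    }

    // Runs the same test four ways and returns the four answers in order.
    public static bool[] EvaluateAll(string name)
    {
        Func<string, bool> viaNamedMethod = StartsWithD;
        Func<string, bool> viaAnonymousMethod = delegate (string person)
        {
            return person.StartsWith("D");
        };
        Func<string, bool> viaLambda = person => person.StartsWith("D");

        // Assigned to Expression<...>, the same lambda is stored as data
        // (an expression tree); Compile() turns it into a callable delegate.
        Expression<Func<string, bool>> asTree = person => person.StartsWith("D");
        Func<string, bool> viaCompiledTree = asTree.Compile();

        return new[]
        {
            viaNamedMethod(name),
            viaAnonymousMethod(name),
            viaLambda(name),
            viaCompiledTree(name)
        };
    }

    public static void Main()
    {
        foreach (bool answer in EvaluateAll("David Wulff"))
        {
            Console.WriteLine(answer);
        }
    }
}
```

All four agree on every input, which is the whole point: they're different spellings of the same predicate.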
Extension Methods
The astute reader will notice that, till now, there's been no discussion as to where these methods come from that the standard query operators map to. I mentioned before that LINQ worked on anything that implemented IEnumerable<T>. One could reasonably assume, therefore, that these methods reside in the new C# 3.0 definition of the IEnumerable<T> interface. That assumption, however, would be wrong. These methods, which reside in the System.Query.Sequence class (whose source is available in the LINQ Preview install, by the way), are part of a new feature in C# 3.0 called Extension Methods.
Extension Methods are a new way of extending existing types. Basically, this works by writing a static method in a static class and adding a "this" modifier on the first argument, like so: (Example code shamelessly stolen from the Sequence class.)
public static IEnumerable<T> Where<T>(
    this IEnumerable<T> source, Func<T, bool> predicate) {
    foreach (T element in source) {
        if (predicate(element)) yield return element;
    }
}
There's nothing really special here, except for that "this" modifier on the first argument. The compiler sees this and treats it as a new method on the specified type. So now IEnumerable<T> gets the Where() method. Pretty cool, huh? Something to remember is that "real" methods get first priority. If you call Where() on an object, then the compiler goes to find Where() on the object itself first. If Where() doesn't exist, then it goes off to find an Extension Method. Clearly, while this feature is cool and really powerful, extension methods should be used extremely sparingly. Anders Hejlsberg warned those of us in the LINQ talk at the PDC not to add our favorite 10 methods to System.Object. This feature is probably the one that gives you the most potential to shoot yourself in the foot.
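The "real methods win" rule is easy to check with a sketch. Everything here (Greeter, SketchExtensions, EveryOther) is hypothetical and exists only to show the lookup order and the shape of a home-grown sequence operator.

```csharp
using System;
using System.Collections.Generic;

public class Greeter
{
    // A "real" instance method.
    public string Greet() { return "instance method"; }
}

public static class SketchExtensions
{
    // Same name and shape as the instance method, so greeter.Greet() never binds here.
    public static string Greet(this Greeter greeter) { return "extension method"; }

    // Greeter has no Shout() of its own, so the compiler falls back to this one.
    // The inner greeter.Greet() call still resolves to the instance method.
    public static string Shout(this Greeter greeter)
    {
        return greeter.Greet().ToUpperInvariant();
    }

    // A made-up operator for any sequence: yields elements 0, 2, 4, ...
    public static IEnumerable<T> EveryOther<T>(this IEnumerable<T> source)
    {
        bool take = true;
        foreach (T element in source)
        {
            if (take) yield return element;
            take = !take;
        }
    }
}

public static class ExtensionMethodSketch
{
    public static void Main()
    {
        var greeter = new Greeter();
        Console.WriteLine(greeter.Greet());                 // instance method
        Console.WriteLine(greeter.Shout());                 // INSTANCE METHOD
        Console.WriteLine(SketchExtensions.Greet(greeter)); // extension method

        var names = new[] { "Christian Graus", "Paul Watson", "David Wulff" };
        foreach (var name in names.EveryOther())
        {
            Console.WriteLine(name); // Christian Graus, then David Wulff
        }
    }
}
```

Notice that the shadowed extension is still reachable if you call it as a plain static method, which is really all an extension method is.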
A more interesting LINQ example
Now that we've seen the basics of LINQ and C# 3.0, let's look at a slightly more interesting example. First, let's define a new Poster class for ourselves:
public class Poster
{
    public string name;
    public int numberOfPosts;
    public int numberOfArticles;
    public Poster(string name,
        int numberOfPosts, int numberOfArticles)
    {
        this.name = name;
        this.numberOfPosts = numberOfPosts;
        this.numberOfArticles = numberOfArticles;
    }
}
Now let's modify the previous example to utilize our new Poster class (clearly these values are going to change):
public static void Main()
{
    var mostActive = new Poster[] {
        new Poster("Christian Graus", 22215, 32),
        new Poster("Paul Watson", 20185, 7),
        new Poster("Nishant Sivakumar", 18608, 99),
        new Poster("Roger Wright", 16790, 1),
        new Poster("Jörgen Sigvardsson", 14118, 7),
        new Poster("David Wulff", 13748, 4),
        new Poster("ColinDavies", 12919, 0),
        new Poster("Chris Losinger", 11970, 18),
        new Poster("peterchen", 11163, 9),
        new Poster("Shog9", 10605, 3)
    };
    var peopleWithoutLives = from poster in mostActive
                             where poster.numberOfPosts > 15000
                             select new {poster.name, poster.numberOfPosts};
    foreach(var individual in peopleWithoutLives)
    {
        Console.WriteLine("{0} has posted {1} messages",
            individual.name,
            individual.numberOfPosts);
    }
}
Anonymous Types
Now, we've got an array of the most active CPians by message count and their articles. In our query, we specify that we only want those CPians with more than 15000 posts…but the select clause is different. Since we only want their name and their message count, not the number of articles they've posted, we just specify those two fields. This is a new feature of C# 3.0 called Anonymous Types (what's with all the anonymity in .NET now? Good grief!). Usually we only want certain fields from the collections we query, so this is a nice, easy way to query out just those fields. But, you say, what is that type called? Well, the compiler generates a name for it. It's probably something horribly unpronounceable too. But just accept the fact that it's a new type and has just those fields you asked for.
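If you're curious what that unpronounceable name looks like, this sketch asks the projected object for it at run time. It assumes the released System.Linq namespace (the preview used System.Query), and AnonymousTypeSketch and Describe are hypothetical names. The exact type name depends on your compiler, so the sketch just prints whatever it finds.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the preview used System.Query

public class Poster
{
    public string name;
    public int numberOfPosts;
    public int numberOfArticles;

    public Poster(string name, int numberOfPosts, int numberOfArticles)
    {
        this.name = name;
        this.numberOfPosts = numberOfPosts;
        this.numberOfArticles = numberOfArticles;
    }
}

public static class AnonymousTypeSketch
{
    // Projects each matching poster into an anonymous type, then reports
    // the projected values followed by the compiler-generated type name.
    public static List<string> Describe()
    {
        var mostActive = new[]
        {
            new Poster("Christian Graus", 22215, 32),
            new Poster("David Wulff", 13748, 4)
        };

        var projected = from poster in mostActive
                        where poster.numberOfPosts > 15000
                        select new { poster.name, poster.numberOfPosts };

        var lines = new List<string>();
        foreach (var item in projected)
        {
            lines.Add(item.name + " has posted " + item.numberOfPosts + " messages");
            lines.Add("type name: " + item.GetType().Name);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (var line in Describe())
        {
            Console.WriteLine(line);
        }
    }
}
```

The anonymous type only has name and numberOfPosts; trying to read item.numberOfArticles inside the loop would not compile.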
A look at some more advanced features
Let's jazz up the sample a little and include some new operators: group by and orderby.
var peopleWithoutLives = from poster in mostActive
                         group poster by (poster.numberOfPosts / 5000) into postGroup
                         orderby postGroup.Key descending
                         select postGroup;
Console.WriteLine("Posters by group");
foreach(var group in peopleWithoutLives)
{
    Console.WriteLine("{0}-{1}",
        (group.Key + 1) * 5000,
        group.Key * 5000);
    foreach(var person in group.Group)
    {
        Console.WriteLine("\t{0}", person.name);
    }
}
So we see here that we've got the ability to group people into categories by a certain criterion: the Key. In this case, our criterion is the number of posts they've made divided by 5000 (integer division, so the remainder is thrown away), so we can see who fits into each 5000 post block. The value of the criterion expression is then stored in the group's Key field. The difference between this query and the others is the return value. This query returns groups, which then contain the Poster items. Pretty nifty, eh?
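For anyone trying the grouping query outside the preview, here's a hedged sketch of how it might look against the released framework. Two things differ and both are assumptions about your setup, not something from the preview: the namespace is System.Linq, and each group is enumerated directly instead of through a Group property. GroupingSketch and Summarize are made-up names, and the data is a shortened copy of the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq; // assumption: released namespace; the preview used System.Query

public class Poster
{
    public string name;
    public int numberOfPosts;

    public Poster(string name, int numberOfPosts)
    {
        this.name = name;
        this.numberOfPosts = numberOfPosts;
    }
}

public static class GroupingSketch
{
    // Buckets posters into blocks of 5000 posts, highest block first,
    // and returns one summary line per block.
    public static List<string> Summarize(IEnumerable<Poster> posters)
    {
        var byBlock = from poster in posters
                      group poster by poster.numberOfPosts / 5000 into postGroup
                      orderby postGroup.Key descending
                      select postGroup;

        var lines = new List<string>();
        foreach (var bucket in byBlock)
        {
            int low = bucket.Key * 5000; // integer division made Key the block index
            string[] names = bucket.Select(p => p.name).ToArray();
            lines.Add(low + "-" + (low + 4999) + ": " + string.Join(", ", names));
        }
        return lines;
    }

    public static void Main()
    {
        var mostActive = new[]
        {
            new Poster("Christian Graus", 22215),
            new Poster("Paul Watson", 20185),
            new Poster("Nishant Sivakumar", 18608),
            new Poster("David Wulff", 13748)
        };

        foreach (var line in Summarize(mostActive))
        {
            Console.WriteLine(line);
        }
    }
}
```

With this data you get three blocks: 20000-24999 with Christian Graus and Paul Watson, 15000-19999 with Nishant Sivakumar, and 10000-14999 with David Wulff.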
Last letters on LINQ
Well, there's a quick look at LINQ. In summary, we looked at the current, rather sad, state of today's data access story. Then we looked at how LINQ and the new language features in C# 3.0 solve these issues by giving us a consistent set of Standard Query Operators that we can use to query any collection that implements IEnumerable<T>. In this article, we only focused on in-memory collections of data in order to avoid the confusion that most people have when mixing LINQ with DLinq and XLinq, but rest assured that there's a way to access relational and XML data with LINQ. Otherwise there wouldn't be much point, now, would there? Furthermore, because LINQ is just a set of methods that adhere to the naming conventions for the Standard Query Operators, anybody can implement their own LINQ-based collections for accessing any other type of data. For instance, the WinFS team is going to be making their product LINQ-enabled.
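To make the "anybody can implement their own" claim a bit more concrete, here's a minimal sketch of a type that supplies its own Where and Select methods and can therefore be used with the query keywords. NumberBag is entirely hypothetical, it deliberately does not implement IEnumerable<T>, and the sketch assumes the Func delegate from the released System namespace.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical collection: no IEnumerable<T>, just methods with the right names.
public class NumberBag
{
    private readonly List<int> items;

    public NumberBag(IEnumerable<int> source)
    {
        items = new List<int>(source);
    }

    // The where clause binds to this method.
    public NumberBag Where(Func<int, bool> predicate)
    {
        var kept = new List<int>();
        foreach (int item in items)
        {
            if (predicate(item)) kept.Add(item);
        }
        return new NumberBag(kept);
    }

    // The select clause binds to this method.
    public List<TResult> Select<TResult>(Func<int, TResult> selector)
    {
        var results = new List<TResult>();
        foreach (int item in items)
        {
            results.Add(selector(item));
        }
        return results;
    }
}

public static class QueryPatternSketch
{
    public static List<int> EvensTimesTen(NumberBag bag)
    {
        // The compiler rewrites this as bag.Where(...).Select(...),
        // so it ends up calling NumberBag's own methods.
        return from n in bag
               where n % 2 == 0
               select n * 10;
    }

    public static void Main()
    {
        var bag = new NumberBag(new[] { 1, 2, 3, 4 });
        foreach (int value in EvensTimesTen(bag))
        {
            Console.WriteLine(value); // 20, then 40
        }
    }
}
```

A real LINQ-enabled data source would do something smarter inside those methods, such as building up a query to send elsewhere, but the compiler only cares that the names and shapes line up.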
If you're as totally stoked about LINQ as I am, and want to read more about it, I'd recommend heading over to the LINQ Preview Site. There, you can download the LINQ preview package which integrates into Visual Studio 2005 Beta 2 to provide the new LINQ features and you can read more about DLinq and XLinq and the new C# 3.0 specifications.