Demo project : Lucene_VS2012_Demo_App.zip
I have just left a job and am about to start a new one, but just before I
left, one of the other guys on my team was tasked with writing something using
the Lucene search engine for .NET. We had to search across 300,000 or so
objects, and it appeared to be pretty quick. We were running the search in
response to the user typing each character, and no delay was noticeable at all,
even though the request was going through loads of Rx and loads of different
application layers before finally hitting Lucene to search for results.
This piqued my interest a bit, and I decided to give Lucene a try and see if I
could come up with a simple demo that I could share.
So that is what I did, and these are the results.
Lucene.Net is a port of the Lucene search engine library, written in C# and
targeted at .NET runtime users.
The general idea is that you build an Index of .NET objects that are stored
within a specialized Lucene Document with searchable fields. You are then able
to run queries against these stored Documents and rehydrate them back into .NET
objects.
Index build is the phase where you take your .NET objects, create a
Document for each one and add certain fields (you do not have to store/add
fields for all of the .NET object's properties, that is up to you to decide), and
then save these Document(s) to a physical directory on disk, which will later be
searched.
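To make the build / query / rehydrate cycle concrete without involving Lucene itself, here is a toy sketch in plain C#. It is not how Lucene is implemented (Lucene's index is far more sophisticated and lives on disk); the Book and ToyIndex names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical domain object; not part of the demo app.
public class Book
{
    public int Id { get; set; }
    public string Title { get; set; }
}

// Toy stand-in for an index: build, query and rehydrate in miniature.
public class ToyIndex
{
    // The "stored fields" of each document, keyed by document id.
    private readonly Dictionary<int, Dictionary<string, string>> documents =
        new Dictionary<int, Dictionary<string, string>>();

    // term -> ids of the documents whose Title contains that term.
    private readonly Dictionary<string, HashSet<int>> postings =
        new Dictionary<string, HashSet<int>>();

    public void Add(Book book)
    {
        // Only the properties we choose to store end up in the document.
        documents[book.Id] = new Dictionary<string, string>
        {
            { "Id", book.Id.ToString() },
            { "Title", book.Title }
        };

        // Split the title on spaces and record which document each term came from.
        foreach (string term in book.Title.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            HashSet<int> ids;
            if (!postings.TryGetValue(term, out ids))
            {
                ids = new HashSet<int>();
                postings[term] = ids;
            }
            ids.Add(book.Id);
        }
    }

    // Exact, case-sensitive term lookup; results are ordered by id.
    public List<Book> Search(string term)
    {
        HashSet<int> ids;
        if (!postings.TryGetValue(term, out ids))
        {
            return new List<Book>();
        }

        // Rehydrate: turn the stored fields back into .NET objects.
        return ids.OrderBy(id => id)
                  .Select(id => new Book
                  {
                      Id = int.Parse(documents[id]["Id"]),
                      Title = documents[id]["Title"]
                  })
                  .ToList();
    }
}
```

The sketch has no scoring, no updates or deletes, and no persistence; those are exactly the things Lucene takes care of for you.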
Querying data is obviously one of the main reasons that you would want to use
Lucene.NET, and it should come as no surprise that it has good querying
facilities.
I think one of the nicer resources on the query syntax that Lucene.NET uses
can be found here:
http://www.lucenetutorial.com/lucene-query-syntax.html
Some simple examples might be:
Query | Meaning
title:foo | Search for the word "foo" in the title field.
title:"foo bar" | Search for the phrase "foo bar" in the title field.
title:"foo bar" AND body:"quick fox" | Search for the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field.
(title:"foo bar" AND body:"quick fox") OR title:fox | Search for either the phrase "foo bar" in the title field AND the phrase "quick fox" in the body field, or the word "fox" in the title field.
title:foo -title:bar | Search for the word "foo" and not "bar" in the title field.
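A quick way to see how such query strings are interpreted is to hand them to a QueryParser and print the resulting Query. The sketch below assumes the same older Lucene.Net 2.x-style API that the demo code in this article uses (a QueryParser built from a default field name and an Analyzer). The QuerySyntaxSketch class and the choice of "title" as the default field are mine, purely for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class QuerySyntaxSketch
{
    // Parses a query string. Terms without a field prefix fall back to
    // the default field, which this sketch assumes to be "title".
    public static Query Parse(string queryText)
    {
        QueryParser parser = new QueryParser("title", new WhitespaceAnalyzer());
        return parser.Parse(queryText);
    }

    // Parses each of the example queries and prints what the parser made of it.
    public static void PrintExamples()
    {
        string[] examples =
        {
            "title:foo",
            "title:\"foo bar\"",
            "title:\"foo bar\" AND body:\"quick fox\"",
            "(title:\"foo bar\" AND body:\"quick fox\") OR title:fox",
            "title:foo -title:bar"
        };

        foreach (string text in examples)
        {
            Query query = Parse(text);
            // ToString() shows the parsed form; the type name shows the kind of query built.
            Console.WriteLine("{0}  =>  {1} ({2})", text, query, query.GetType().Name);
        }
    }
}
```

Printing the parsed form is a handy debugging aid when a search does not return what you expect, since it shows which field each term ended up in.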
There are actually lots of different Analyzer types in Lucene.NET, such as
(there are many more than this, these are just a few):
- SimpleAnalyzer
- StandardAnalyzer
- StopAnalyzer
- WhitespaceAnalyzer
Choosing the correct one will depend on what you are trying to achieve and
what your requirements dictate.
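To give a feel for why the choice matters, here is a plain C# approximation of what three of these analyzers do to a piece of text. These are not the Lucene.Net classes, just a rough sketch of their behaviour: the ToyAnalyzers name is hypothetical, and the stop word list is a small illustrative subset rather than Lucene's own list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Rough imitations of analyzer behaviour; NOT the Lucene.Net implementations.
public static class ToyAnalyzers
{
    // A tiny illustrative stop word list (an assumption; Lucene's list is longer).
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "a", "an", "and", "the", "is", "in", "of", "to" };

    // Like WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    public static List<string> Whitespace(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).ToList();
    }

    // Like SimpleAnalyzer: split at anything that is not a letter, then lower-case.
    public static List<string> Simple(string text)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetter(c))
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
        }
        return tokens;
    }

    // Like StopAnalyzer: the Simple behaviour, with stop words removed.
    public static List<string> Stop(string text)
    {
        return Simple(text).Where(t => !StopWords.Contains(t)).ToList();
    }
}
```

For the input "The Owl and the Pussy-Cat", Whitespace yields The, Owl, and, the, Pussy-Cat; Simple yields the, owl, and, the, pussy, cat; and Stop yields owl, pussy, cat. One practical consequence: an analyzer that keeps case, as WhitespaceAnalyzer does, will generally treat "When" and "when" as different terms.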
This section will talk about the attached demo app, and should give you
enough information to start building your own Lucene.NET powered search should
you wish to use it in your own applications.
The demo app is pretty simple really, here is what it does:
- There is a static text file (in my case a poem) that is available to
index
- On startup the text file is indexed and added to the overall Lucene
Index directory (which in my case is hardcoded to C:\Temp\LuceneIndex)
- There is a UI (I used WPF, but that is irrelevant) which:
- Allows a user to enter a search key word that is used to search the
indexed Lucene data
- Will show all the lines from the text file that was originally used
to create the Lucene Index data
- Will show the matching lines in the poem when the user conducts a
search.
I think the best bet is to see an example. So this is what the UI looks like
when it first loads:
Then we type a search term in, say the word "when", and we would see this:
And that is all the demo does, but I think that is enough to demonstrate how
Lucene works.
So what gets stored? Well, that is pretty simple. Recall I stated that we had
a static text file (a poem). We start by reading that static text file,
using the simple utility class shown below, into actual SampleDataFileRow
objects that can be added to the Lucene index:
using System.Collections.Generic;
using System.IO;
using System.Reflection;

public class SampleDataFileReader : ISampleDataFileReader
{
    public IEnumerable<SampleDataFileRow> ReadAllRows()
    {
        // The sample file is expected in a "Lucene" folder next to the executing assembly
        FileInfo assFile = new FileInfo(Assembly.GetExecutingAssembly().Location);
        string file = string.Format(@"{0}\Lucene\SampleDataFile.txt", assFile.Directory.FullName);
        string[] lines = File.ReadAllLines(file);

        // One row per line of the file, with 1-based line numbers
        for (int i = 0; i < lines.Length; i++)
        {
            yield return new SampleDataFileRow
            {
                LineNumber = i + 1,
                LineText = lines[i]
            };
        }
    }
}
Where the SampleDataFileRow objects look like this:
public class SampleDataFileRow
{
    public int LineNumber { get; set; }
    public string LineText { get; set; }
    public float Score { get; set; }
}
And then from there we build the Lucene Index, which is done as follows:
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public class LuceneService : ILuceneService
{
    // The same analyzer is used for indexing and, later, for parsing queries
    private Analyzer analyzer = new WhitespaceAnalyzer();
    private Directory luceneIndexDirectory;
    private IndexWriter writer;
    private string indexPath = @"c:\temp\LuceneIndex";

    public LuceneService()
    {
        InitialiseLucene();
    }

    private void InitialiseLucene()
    {
        // Start from a clean index directory on every run
        if (System.IO.Directory.Exists(indexPath))
        {
            System.IO.Directory.Delete(indexPath, true);
        }

        luceneIndexDirectory = FSDirectory.GetDirectory(indexPath);
        // true = create a new index rather than appending to an existing one
        writer = new IndexWriter(luceneIndexDirectory, analyzer, true);
    }

    public void BuildIndex(IEnumerable<SampleDataFileRow> dataToIndex)
    {
        foreach (var sampleDataFileRow in dataToIndex)
        {
            Document doc = new Document();

            // Stored, but indexed as a single un-analyzed value
            doc.Add(new Field("LineNumber",
                sampleDataFileRow.LineNumber.ToString(),
                Field.Store.YES,
                Field.Index.UN_TOKENIZED));

            // Stored, and run through the analyzer so it can be searched
            doc.Add(new Field("LineText",
                sampleDataFileRow.LineText,
                Field.Store.YES,
                Field.Index.TOKENIZED));

            writer.AddDocument(doc);
        }

        // Write everything out to disk and release the writer
        writer.Optimize();
        writer.Flush();
        writer.Close();
        luceneIndexDirectory.Close();
    }

    ....
    ....
    ....
    ....
    ....
}
I think that code is fairly simple and easy to follow. We essentially just do
this:
- Create a new Lucene index directory
- Create a Lucene writer
- Create a new Lucene Document for each source object
- Add the fields to the Lucene Document
- Write the Lucene Document to disk
One thing that may be of interest is that if you are dealing with vast
quantities of data, you may want to create static Field fields and reuse them
rather than creating new ones each time you rebuild the index.
Obviously, for this demo the Lucene index is only created once per application
run, but a production application may rebuild the index every 5 minutes or
so, in which case I would recommend reusing the Field objects by making them
static fields.
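A minimal sketch of that reuse idea is shown below. It assumes your Lucene.Net version provides Field.SetValue(string); I believe that is true of the later 2.x releases, but please check against the version you are using. The FieldReuseSketch class and its AddRows helper are hypothetical, and for simplicity it holds the Field objects in locals and takes plain (line number, text) pairs rather than the demo's own types.

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class FieldReuseSketch
{
    // Adds one document per (line number, text) pair, reusing a single
    // Document and its two Field instances. Returns how many were added.
    public static int AddRows(IndexWriter writer, IEnumerable<KeyValuePair<int, string>> rows)
    {
        // Created once, up front, with placeholder values
        Field lineNumberField = new Field("LineNumber", "", Field.Store.YES, Field.Index.UN_TOKENIZED);
        Field lineTextField = new Field("LineText", "", Field.Store.YES, Field.Index.TOKENIZED);

        Document doc = new Document();
        doc.Add(lineNumberField);
        doc.Add(lineTextField);

        int added = 0;
        foreach (KeyValuePair<int, string> row in rows)
        {
            // Assumption: Field.SetValue(string) exists in your Lucene.Net version
            lineNumberField.SetValue(row.Key.ToString());
            lineTextField.SetValue(row.Value);

            // The writer reads the current field values at this point
            writer.AddDocument(doc);
            added++;
        }
        return added;
    }
}
```

Be aware that shared Field instances like these are not something you would want to mutate from several threads at once, so if you do make them static, keep the index build on a single thread or guard it with a lock.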
So in terms of searching the indexed data this is really easy and all you
need to do is something like this:
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public class LuceneService : ILuceneService
{
    private Analyzer analyzer = new WhitespaceAnalyzer();
    private Directory luceneIndexDirectory;
    private IndexWriter writer;
    private string indexPath = @"c:\temp\LuceneIndex";

    public LuceneService()
    {
        InitialiseLucene();
    }

    ....
    ....

    public IEnumerable<SampleDataFileRow> Search(string searchTerm)
    {
        IndexSearcher searcher = new IndexSearcher(luceneIndexDirectory);

        // "LineText" is the default field; the analyzer matches the one used at index time
        QueryParser parser = new QueryParser("LineText", analyzer);
        Query query = parser.Parse(searchTerm);
        Hits hitsFound = searcher.Search(query);

        List<SampleDataFileRow> results = new List<SampleDataFileRow>();
        SampleDataFileRow sampleDataFileRow = null;

        // Rehydrate each matching Document back into a SampleDataFileRow
        for (int i = 0; i < hitsFound.Length(); i++)
        {
            sampleDataFileRow = new SampleDataFileRow();
            Document doc = hitsFound.Doc(i);
            sampleDataFileRow.LineNumber = int.Parse(doc.Get("LineNumber"));
            sampleDataFileRow.LineText = doc.Get("LineText");
            float score = hitsFound.Score(i);
            sampleDataFileRow.Score = score;
            results.Add(sampleDataFileRow);
        }

        // Highest-scoring matches first
        return results.OrderByDescending(x => x.Score).ToList();
    }
}
There is not much to that, to be honest, and I think the code explains all you need to know.
There is also a pretty cool GUI for examining your stored Lucene data, which
is called "Luke.NET", and it is freely available from CodePlex using the following
link:
http://luke.codeplex.com/releases/view/82033
When you run this tool you will need to enter the path to the index directory
for the Lucene index that was created. For this demo app that is
C:\Temp\LuceneIndex
Once you enter that, you click "Ok" and you will be presented with a UI that
allows you to examine all the indexed data that Lucene stored, and also run
searches should you wish to.
It's a nice tool and worth a look.
Anyway, that is all I have to say for now. I do have a few articles done, but
they just need writing up and I am struggling to find time of late. I'll get
there when I get there, I guess. As always, if you enjoyed this, a
vote/comment is most welcome.