Anatomy of a relevant search engine (part 1)

Adrian Pirvu

4.88/5 (18 votes)

Mar 7, 2009

CPOL

23 min read

204010

1395

Information retrieval, semantic search relevance and ranking. About anatomy of a search engine. The simplest search engine source code.

Download SimplestSearchEngine.zip - 26.16 KB

Introduction

My intention is to set a new base for the relevance of semantic searches on internet and to bring a practical approach in the information retrieval from large scale text-based databases. Actually what I am about to say it's not really new, since it works very well in the major internet search engine nowadays. It's just that the details have been ignored or not observed, even if they were available all the time, and that's why today we have very few truly relevant search engines.

Intended audience

People who deal with information retrieval, semantic search, everybody using internet. There are a lot to say and complexity may grow but I'm sure people involved in search engines or who want to know the internal anatomy of a search engine will have the patience to read it.

Where we are today

There are a lot of problems when searching inside large sites and particulary in ones having active forums or discussion boards. Most of these sites use to perform exact searches in the topics table, sometimes using the full text search engine provided by the database with a very low relevance. In the picture bellow, I've been searching for search engines on the Codeproject website, although the internal search engine is far better than engines provided by common sites we can notice at least one irrelevant result on the first page. When this happen, a method prove to be useful, I type in browser something like site:website.com words to find and a major search engine reveals far better results than the search provided by the website itself. In this particular case, we can use the query site:www.codeproject.com search engines

Relevance in Information Retrieval

Pretty much the same issue with the online newspapers, where searching archives provide very poor results. There are far more cases when the local search is really bad and you better use a internet search engine for a better relevance. But sometimes you want to index your data and the information is sensitive - you can't just use a public search engine to index it.
If you work at computer and perform several searches a day using a major search engine you quickly find out that there are not really many choices. Try to change it and you become disappointed by the new results. You will still find bulk text documents without titles on the first results page...

And all these happen after more than 20 years of information retrieval theories. Isn't it something wrong in all these? Why some engines provide relevant results while the rest of them not?

A bit of history

Everything started by trying to optimize one of my websites for crawlers and by noticing techniques for categorizing a document as relevant. While digging into details I had to read more about information retrieval and ended up with a set of new conclusions and with my own search engine. When this one proved to be relevant, I knew that all the assumptions I made were true. In the meantime my site got first on some really hard keywords since most of the software companies here were using them and of course knowing the internals of a search engine helped a lot.
Things are pretty much settled in internet today, you can hardly make something to overpass a big name. But maybe this theory will improve generation of small local relevant search engines thus improving site searches.

Definitions

I will start with a set of definitions that will help us see things in a bit different light. Some of them are classic, some others will improve existing ones.

Relevance: According to wikipedia, relevance denotes how well a retrieved set of documents (or a single document) meets the information need of the user. This would be good enough for us, we can stick to it. Or we can consider relevance as being the capacity of a search engine to retrieve the proper information.
Document: Also from wikipedia, a bounded physical representation of body of information designed with the capacity (and usually intent) to communicate. We will refer to the text documents only, since a recorded voice mail message may be considered a document too. So any document means text document in this... document...
Document summary (or subject): Ideas coming out from a document, usually reflected in title or subject.

IMPORTANT: When the document is human made and it has a summary, the summary reflects the document relevance very well (maybe best possible) since we have to agree the human decides what is relevant, machines only search for the documents that the humans may find relevant.
Well formed document: A human language based relevant document having a title (summary). These are most common forms of documents, made by people with the intention to communicate something. Well formed documents refer to a single subject unlike compound documents (see bellow). All of them have a title, since people tend to summarize the content of the document into a title/subject. Of course they can have additional information like subtitles, dates, descriptions, external keywords, paragraphs, noises etc. "Well formed" does not refer to the internal representation of the document like html formatting or similar. Basically, most of the human made documents are well formed documents.
Examples of well formed documents: a word document, an email, a blog post, a good web page etc. All of them have titles and represent small pieces of relevant information. We can also consider a well formed documents a web page named "Links to tourism sites" and no phrases but links, since the information is relevant and summarized into a title.
Bulk text document: Text information without title. A bulk text can range from a forum post (without title) to a machine generated log file. There is no human-made summary for it.

IMPORTANT: There is a huge difference between extracting information from well formed documents and bulk texts. While in the first case there is for sure an idea/subject/summary coming out of them, usually reflected through title, in the second case trying to extract relevant information may be just useless. Some of them may just be irrelevant bulk texts like log files and in this case only basic search is justified.
In some other cases they are relevant bulk texts, like let's say, forum posts, and they may have a idea / summary, though they may have none (see forum posts like "wow looool" or similar). These should be kept separate as much as possible. For the internet user usually the well formed documents are important although the search should be performed in bulk texts too.
Still, we can't talk about a semantic search when the user tries to find the number of occurrences of an IP in a log file, but a good search engine will do this too. Sometimes a group of bulk texts can have a title and they may form a well formed document, a forum thread for example. If the analysis takes place at the database level, then it's about bulk texts, if it is done at the page level, then we have a well formed document. A search engine should prioritize well formed documents, without ignoring bulk texts.
Compound documents: This category has been added because sometimes we can find mixed content in the same file. For example one part of the document has a summary, another part of the document has a completely different idea. Like the first page of a news website containing headlines, there is not really a summary for that page, its just a sum of documents references. They should be split in well formed documents if possible, or rather be treated like "since they talk about everything, they talk about nothing". Sometimes they may have relevant information but its hard to believe that someone able to generate relevant content will mix apples with eggs so statistically they may not be as relevant is possible, related to other documents.

So what is this all about? Well, basically we will try to find relevant documents with summary matching our search criteria. We prefer well formed documents but they can be bulk texts or compound documents. For example, if we look for "search engine crawler" we expect to find documents containing software or source code for internet crawlers / spirders. Notice that those documents providing information about "search engines" only are not quite relevant to my search.

I won't define the terms like query, term, weight, density(frequency) as they are well known. If you want to read further and don't know these terms yet, you can perform a search on internet for them :).

Relevance in Information Retrieval

A search engine providing documents with irrelevant summary on the first results page among billions of available pages.

Inverse document frequency

I will start with the most common theory today mostly used in academic area but not only. In fact so many search engines are based on it that it is hard to ignore it. Of course these search engines provide some results, sometimes even relevant results, but not the most relevant possible. I won't be long on it either since I consider everything about it based on a wrong foundation.

The basics of the inverse document frequency theory is:

Inverse document frequency

Given a query Q, containing keywords q1,...,qn, the BM25 score of a document D is calculated as above, where f(qi,D) is qi's term frequency in the document D, | D | is the length of the document D in words, and avgdl is the average document length in the text collection from which documents are drawn. k1 and b are free parameters, usually chosen as k1 = 2.0 and b = 0.75. IDF(qi) is the IDF (inverse document frequency) weight of the query term qi. IDF is usually computed as:

tf-idft = log (N / n(q)) where N is number of documents in collection and N(q) is the number of documents containing q.

Suppose a query term q appears in n(q) documents. Then a randomly picked document D will contain the term with probability (where N is again the set of documents in the collection). Therefore, the information content of the message "D contains q" is (see above). This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tfidf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

Well, I am going to stop here with quotations and present the first conclusions.

the weight of a word in a document is higher if it occurs many times within a small number of documents (thus lending high discriminating power to those documents)
the weight of a word is lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal)
the weight of a word is lowest when the term occurs in virtually all documents.

Apparently well so far but... Relevance is not really an exact science, we could agree that some formulas may work well but there are some really serious issues with this one. It wouldn't be that bad if hundreds of people wouldn't take it as is and made whole theories based on it then tried to make search engines afterwards. So, some really visible things about this state-of-the-art retrieval functions used in document retrieval, such as Web search...(wikipedia) next:

the weight of a word in a document is higher then it occurs many times within a small number of documents - sure, but only inside those documents. If I search for that word, those documents are good for query but this doesn't tell me anything about which one of them is more relevant related to others. Also IDF supposes to provide a general method of extraction even from bulk texts, when the real problem is IF we need to search for relevant information inside bulk texts.
The probability of picking a document containing two terms has nothing to do with the intrinsic relevance of those two terms inside a document thus to the relevance of the document to other documents in collection, related to the query terms. The computed score is just a some kind of a probability score that has nothing to do with the relevance of a document among others, related to a query. Misleading probability for relevance is the first big error.
The formula was extended to a database with virtually an infinite number of documents - internet. It also refers to words related to other words, not to documents related to other documents. The length of documents is important indeed but I will come back to this later.
Not every word counts the same indeed but it really does not depends on the number of documents in which the term appear. I'm not even mentioning the most searched 3 letters word in internet today, there are a lot of documents around containing it and looks like most people consider it important... If "cars" occurs less than "phones" is because people buy phones more often than cars because of lower price. Also the companies selling phones are more for the same reason. So the models. Nothing justify considering "car" less important than "phone", especially if you live two hours far from office (oh yes, another joke). Anyway, this is just like saying that the "v" letter is more relevant because it occurs in more small words then letter r, it just has nothing to do with the relevancy, its just the way the human language is.

So much about inverse document frequency, but indeed, not every word counts the same, we will see a bit later the real reason. There is a whole foundation of theories built on top of IDF, it won't be easy to shutter such a system, thousands of people worked on it for years and spent many millions. Well, they have something but not the real thing.

We can check for example the algorithm used in a known database full text search engine here. A word that appear almost nowhere scored 100. Looks like someone just gone too far...

Concepts

Human brain is a huge neural network with enormous capabilities. Patterns identification is done at incredible speed, so the storage and retrieval of information. Whenever we meet a friend our brain perform a fast search to locate his image and associate it with some other memories.

When we read a document, the same neural network analyses the written language, extracts concepts from it, makes a set of summary concepts and store it as a memory. This represents our understanding about that document. If someone is asking you what was about in this document so far, you will say that something like "search engines, relevance, internet" etc. ("no big deal, the guy is wasting his time" is also possible)

The brain extracted the summary concepts from the document ignoring the noise around them. The same process takes place while evaluating a document relevance. The brain tries to match the concepts to find with the concepts inside the document. In case all the concepts are found and they are highlighted enough among other concepts inside the document, then the document is considered relevant.

Sample:

When searching for "cars insurance" will try to mach the concepts "car" and "insurance" with the documents where we are looking for. A document containing "cars" alone won't be enough, it should contain also the other concept, "insurance". Same for "insurance" it could be "real estate insurance" or another type, so both concepts are required every time. So all the documents found containing "cars" and "insurances" may be good but some may be better than others, depending on how much the concepts are highlighted / emphasized inside them. Those are considered relevant documents.

IMPORTANT: Each document is a set of concepts. We are trying to match the concepts we look for with the concepts inside documents. The documents that contain the set of concepts we look for among their concepts are good for us. If those concepts are highlighted, those documents are relevant.

Each document is a set of concepts.

document containing concepts in query -> good documents for the query
document with highlighted concepts -> relevant documents (inside the set of good documents)

I suggest to look at a list containing what people search on internet, there is one here or here. Now, you can count the verbs and adjectives in lists; verbs are very few or at all - nobody searches "walks" or "bring", also few adjectives.

People are looking for concepts. In human language, concepts are usually transmitted through nouns. A text can be understood (or evaluated) even if all words but nouns are missing (this also happen when we are reading fast). Here is how the first phrases in this document look like, by ignoring everything else but nouns:

"intention base relevance searches internet approach information scale text databases internet search engine"

Now the same phrases without nouns:

"my is to set a new for the of semantic on and to bring a practical in the from large based"

Obviously, in the first sample, someone knowing nothing about this document can estimate that this document is about search engines and internet. You can hardly say what is this document about from the second sample... Nouns have around them a lot of estimative words, usually adjectives, that describe the concept. That's why we all search nouns, and expect the nouns found to match the nouns we looked for.
They are in fact concepts, for the moment let's just associate the concepts with the nouns in the document. I am sure that future studies will get close to human language and find better ways to extract concepts, even the concepts are not directly expressed by a noun. Semantic analysis of the phrase may lead to better results, don't know how far the progress went in some "laboratories".

IMPORTANT Nouns can be used to extract concepts from a document. Therefore nouns are more important then other words.

There are several good theories about retrieving nouns, still there is a long way to go. Sometimes concepts are indirect, not expressed through direct nouns and we need to learn to deal with them.

At a practical level, let's consider that someone is searching for "best cars". Actually the concept to find is "cars", "best" is only descriptive. Suppose we found two similar documents, one containing highlighted (or more occurrences) of "best", the other of "cars". For a simple occurence ranking, let's say the first document has 5 best and 7 cars occurrences, the other has 7 best 5 cars occurrences, for the same document length. Obviously, statistically the one with highlighted "cars" is more important. The other one could talk about "best phones", and "cars" may be only a secondary concept.

Words don't have the same weight inside a document but not because they occur more or less often in others documents but because they are or they are not concepts.

Someone can say that the word "best" occurs more often than "car" along internet, I will tell that "hypanthial" occurs really more rarely than "car". (I don't know what that word is, I just found it in Webster as an adjective) So, if we need to build a "word-weight" function for a document, somehow we need to give concepts / nouns a bit more importance (or "best car 5-7" document will have the same score as "best car 7-5" document). But there are a lot of other more important things in text information retrieval, that's because people already search mostly nouns / concepts. I did this more like to fix the inverse document frequency issue and to bring on scene the "concept" issue.

To summarize:
- a document is a set of concepts, mostly expressed through nouns but these concepts can be indirect too. When searching, we try to match the query concepts with the concepts in documents.

Believe it or not, even Britney Spears is a concept, and a lot of people look for it.

Titles and concepts

Going further, titles, which we already decided that represent ideas or summary from the document, are most of the time composed of concepts. You rarely find titles like "German cars are good", this belong rather to content, but you can find content summarizes into a title like "German cars" therefore "car" concept. When searching for "car", a document having the title "German cars" may at least in the set of good documents if not relevant.
Ex. Amazon's main title, well made: Amazon.com: Online Shopping for Electronics, Apparel, Computers, Books, DVDs - pretty much full of concepts. When you build a query to search something, always remember you search for concepts, and try to visualize these. You will help search engines to provide better results.

IMPORTANT It may not be the right place here but I'm going to mention it. Common HTML documents on internet ussually have two main titles, the html page title in "title" tag and a main "H1" title. Both are basic sources of document concepts. We can consider that if a title is not relevant to the content then the document is not well made, therefore not so relevant.

Checking the relevance of the title into the content is challenging and sometimes you have to choose between selecting the concepts from title or from text if you suspect spamming or document bad made. But relying on the automatic retrieval of the concepts from content alone won't provide better results, since statistically most of the well formed documents on internet are made with good intention therefore title is relevant to the content and guess what - is human made.

I will also give a tricky sample. Let's search for the query "How products are made?", we will find a fairy large number of hits. Even more, they are matching the exact words (all words in query) and content is very relevant to the title. Searching for "products" will provide completely irrelevant results for the query "how products are made?". So why the "products" concepts doesn't provide relevant results for "how products are made" ? Well, the answer is that I haven't properly extracted the right concepts from the query before performing the search. The concepts from "How products are made?" are "products manufacturing" not "products" alone, this will provide better results.

Since is hard to extract indirect concepts from a query, the search engines prefer to perform exact searches and look for words ignored in the past, like stop words. The first result for "how products are made" is a internet forum archive, where the first phrase is like "How Products Are Made explains and details the manufacturing process of a wide variety of products, from daily household items to complicated electronic equipment and heavy machinery." So the concepts they talk about in document are "products manufacturing" but they summarized it into a title with an indirect concept (manufacturing). This hit also probably occurs in the results of a query like "manufacturing process of products" too, since these concepts are explicitly contained into the document. The document itself is not really relevant, but it contains a link to a real relevant document (I rather find directly that one). If I would reduce both the documents and the query to concepts, it would be easier to match "products manufacturing" query in all "products manufacturing" documents.

The exact search we use today may be only a temporary method towards understanding the human language and the ability to extract proper concepts from interrogations and documents.

Categorization

Since we decided that most of the times the problem of searching inside bulk texts is a matter of IF rather than relevance, and the user usually look for structured information (well formed documents), I will discuss one more factor involved in the relevance calculation. I will come back later at full text search in bulk texts and present how today's database full text search engines can be improved. Rankings based on simple occurrences of the words found or on formulas based on IDF (sum of weights - occurrences products) may not be the best possible.

First I will present a simple sample, observed in the search queries list I gathered in time. People are looking for actual events and lot of times this event has just being emphasized lately on media. For example, news about Angelina Jolie or some other star - people will start looking for that name just when the news occur and they expect to find details. For these people, the character of news of the document they found is important, therefore relevant. So the date of the document play a major role in relevance calculation in this particular case.

We can find much more relevant older documents from the point of view of the content, still, statistically they won't be as relevant as the actual documents for the searches in a specific day, even if they have good content.
So the news sites documents should have a different approach than the rest of the sites from the point of view of document creation date. A balance between a recent creation date and relevant content should be done in order to have a good results list. The first conclusions is that classification of the documents must play a major role in ranking. The same document may have a different relevance when coming from different sources, therefore the need of categorizing sources and computing pages ranks. But this applies on top of other calculations, the documents can be ranked very well even without a source relevance system (page ranking) and we will stay focused for the moment over the relevance due to the concepts emphasizing inside documents.

Conclusions (part 1)

Documents are sets of concepts.
Searching means matching the concepts within the query with concepts within documents.
Nouns are more important than other words because they usually represents concepts.
Titles and headers summarize concepts within document.
Well formed documents are more relevant than bulk text documents.
Ranking is also a matter of classification.

The simplest search engine

In order to avoid some negative comments of the one of you who use to look at the code first and then read comments (sometimes I do this myself too), I have attached a model of the simplest search engine. No optimizations, no overhead, it just parse several text files, build and inverted index and perform searches. The first thing it can be improved is to unify the character case into the inverted index (yes, use ToLower()).

This occurs when the index button is clicked. Just choose several .txt files with a decent length.

        // index files
        private void buttonIndex_Click(object sender, EventArgs e)
        {

            if (openFileDialog.ShowDialog() != DialogResult.OK)
                return;
            
            // parse each file and build the inverted index. 
            foreach (string fileName in openFileDialog.FileNames)
            {
                string text = File.ReadAllText(fileName).Replace("\r", " ").Replace("\n", " ");
                string[] terms = text.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

                // parse each term (word) in text file and put it in dictionary
                foreach (string term in terms)
                {
                    if (!InvertedIndex.ContainsKey(term))
                        InvertedIndex.Add(term, new List());

                    if (!InvertedIndex[term].Contains(fileName))
                        InvertedIndex[term].Add(fileName);
                }
            }

            // update label
            labelIndex.Text = openFileDialog.FileNames.Length.ToString() + " files indexed";
        }

When the files are indexed, we fill the search textbox and perform a search on multiple terms. The results are not ranked, we only show the documents containing terms in query.

        // perform a seach over the inverted index built earlier
        private void buttonSearch_Click(object sender, EventArgs e)
        {

            // split query in terms
            string query = textBoxQuery.Text;
            string[] terms = query.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);


            // see documents for each term and merge them
            List commonDocuments = null;
            foreach (string term in terms)
            {
                // if one term not in index, end search
                if (!InvertedIndex.ContainsKey(term))
                {
                    if (commonDocuments != null)
                        commonDocuments.Clear();
                    break;
                }

                // find documents containing all terms
                if (commonDocuments == null)
                    commonDocuments = InvertedIndex[term];
                else
                    commonDocuments = GetCommonDocuments(commonDocuments, InvertedIndex[term]);
            }


            // display results
            if (commonDocuments != null)
            {
                listBoxResults.Items.Clear();
                foreach (string fileName in commonDocuments)
                    listBoxResults.Items.Add(fileName);
            }
        }




        // merge two arrays of file indexes
        private List GetCommonDocuments(List documentList, List newDocumentsList)
        {
            List common = new List();
            foreach (string file in documentList)
                if (newDocumentsList.Contains(file))
                    common.Add(file);

            return common;
        }

I really believe that the code above doesn't need any more comments. That's all, not much use for it, only a model, but we have source code in the first part. Just feel free to email me at my hotmail address, adrian_pirvu

History

March 07, 2009 - Part 1

References

http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
http://www.texaswebdevelopers.com/docs/pagerank.pdf
http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)
http://xapian.org/docs/bm25.html
http://queue.acm.org/detail.cfm?id=988407

anatomy of a search engine