Tag Archives: Search

Introduction to Indexing and Search

This article was originally posted as “Indexing and Search (C#)” on 15th June 2013 at Programmer’s Ranch. It has been slightly updated here. The source code is available at the Gigi Labs BitBucket repository.

This article is about indexing and search: what is it that allows you to search through thousands of documents in less than a second?

In order to understand how indexing works, we will use an example where we have the following three documents:

doc1.txt contains:
The Three Little Pigs
doc2.txt contains:
The Little Red Riding Hood
doc3.txt contains:
Beauty And The Beast

Now, let’s say you want to find out which of these documents contain a particular word, e.g. “Little”. The easiest way to do this would be to go through each word in each document, one by one, and see if the word “Little” is in that document. Conceptually, we’re talking about this:

indexingsearch-donkeyway

Doing this in C# is very easy:

            string[] documents = { "doc1.txt", "doc2.txt", "doc3.txt" };

            string keyword = Console.ReadLine();

            foreach (string document in documents)
            {
                if (File.ReadAllText(document).Contains(keyword))
                {
                    Console.WriteLine(document);
                }
            }

            Console.ReadKey(true);

Remember to put in a using System.IO; at the top, to be able to use File. If you press F5 and test this program, you’ll see that it works:

indexingsearch-donkeyout

However, this method isn’t good because it will take longer as the documents get larger (more words) and more numerous (more documents).

The proper way to process search requests quickly is to build an index. This would look something like this:

indexingsearch-index

The index stores a list of all words, each with a list of documents that contain it. If you compare it with the first diagram, you’ll notice that we reversed the mapping of words and documents; this is why we call this an inverted index.

We can do this in C# by first building the index (remember to add using System.Collections.Generic; at the top):

            // Build the index

            string[] documents = { "doc1.txt", "doc2.txt", "doc3.txt" };
            Dictionary<string, List<string>> index = new Dictionary<string, List<string>>();

            foreach (string document in documents)
            {
                string documentStr = File.ReadAllText(document);
                string[] words = documentStr.Split();

                foreach (string word in words)
                {
                    if (!index.ContainsKey(word))
                        index[word] = new List<string>();

                    index[word].Add(document);
                }
            }

…and then using the index to search the documents quickly and efficiently:

            // Query the index

            string keyword = Console.ReadLine();

            if (index.ContainsKey(keyword))
            {
                foreach (string document in index[keyword])
                {
                    Console.WriteLine(document);
                }
            }
            else
                Console.WriteLine("Not found!");

            Console.ReadKey(true);

In this way, there is no need to search every document for the keyword each time the user wants to search for a word. The keyword is simply located in the index (if it exists), and a list of documents that contain it is immediately available:

indexingsearch-indexout

This was a simple proof of concept of how indexing and search works, but here are a few additional notes:

  • The index is usually built as documents are added to it, and then stored in one or more files (unlike in this program, where the index is rebuilt every time the program is run – that’s just to make the illustration easier).
  • Words such as “and” and “the” which are very common are called stop words and are normally excluded from the index.
  • It is common practice to make searches case insensitive, e.g. by converting indexed words and query keywords to lowercase.

This article presented the concept of indexing and how it is used to search for a single keyword. Although there are other techniques used in text search, indexing is definitely one of the most important, and has many applications including databases and text retrieval (e.g. search engines). A fundamental concept to remember is that the whole point of indexing is to make search fast!

Diacritic-insensitive search in C#

When you’re dealing with multiple languages, searching for text can be a little tricky. Using normal string comparison techniques, a search for “Malmo” will not match “Malmö”. Technically it shouldn’t, because the characters are actually different, but it’s a great usability feature to allow people to search for text regardless of diacritics (accents and such).

The Normalization Method

The first idea I had was to strip off the diacritics and simply compare the simplified version of both the query and the text being searched. Using the same example, “Malmö” would become “Malmo” in the text, and so the query would match, since RemoveDiacritics(query) == RemoveDiacritics(text).

The RemoveDiacritics() method is defined in this StackOverflow answer:

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

As I pointed out in my followup question, this approach doesn’t work very well for search. If we run a simple test using the same words from my question…

            var words = new List<string>()
            {
                "Malmö",
                "München",
                "Åge",
                "Strømsgodset",
                "Kulħadd"
            };
            var simplifiedWords = words.Select(word => RemoveDiacritics(word)).ToList();

…you’ll notice that it works for basic accents that seem to be external to the base character, but not for others where it is embedded. Below is the output I got in the immediate window (since the Console can’t handle some of the characters with the default encoding):

simplifiedWords
Count = 5
    [0]: "Malmo"
    [1]: "Munchen"
    [2]: "Age"
    [3]: "Strømsgodset"
    [4]: "Kulħadd"

Apart from this, there is no way to simplify combined characters such as æ into a graphically similar ae.

This all makes sense, because technically æ and ae are different characters, as are ħ and h. But from a user’s perspective, it feels pretty natural to be able to interchange them when searching.

The Collation Method

The answer to my question shows that it is actually pretty easy to have diacritic-insensitive search in C#, even without doing any stripping operations. It is necessary only to specify CompareOptions.IgnoreNonSpace in string comparison methods. Here’s an example from that same answer:

int ix = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
    "Ad aeternitatem", 
    "æter", 
    CompareOptions.IgnoreNonSpace); // 3

Here’s the same thing applied to one of my original examples:

            int ix = CultureInfo.CurrentCulture.CompareInfo.IndexOf(
                "Kulħadd",
                "hadd",
                CompareOptions.IgnoreNonSpace); // returns 3

This other answer shows the string.Compare() being used instead, using the same flag:

string s1 = "hello";
string s2 = "héllo";

if (String.Compare(s1, s2, CultureInfo.CurrentCulture,
    CompareOptions.IgnoreNonSpace) == 0)
{
    // both strings are equal
}

In either case, just add the CompareOptions.IgnoreCase flag to make it case insensitive as well.