Search System using CloudSearch in Django

nitish kansal

Oct 6, 2014

CPOL

Implementing a search system using CloudSearch

Introduction

This article is about my first experience with CloudSearch in Django. It impressed me enough that I wanted to write it up and share it with others. For those who are not aware of it, CloudSearch is a search service in the AWS cloud: you upload your data in a defined format, and your search system is ready and very fast. I will use Django as the reference here and assume you know it. (Don't worry if you don't; this article is mostly about CloudSearch.)

Why did I move to CloudSearch?

A search system is an important and critical part of a website. It should be fast and accurate, and it should return precise results.

We had a Django website whose search system used Haystack as the search interface and Whoosh as the search backend. It was fine initially, when there was little data, but as we grew we had far more data to make searchable: profiles, posted jobs, shared news and so on. At around 100,000 searchable objects we started getting complaints from users that results were no longer arriving as fast as they were used to, and that the results were not precise. This made us sit together and rethink our search architecture. (At this point a single piece of logic handled every kind of search.) We decided to refactor the search system into separate pieces of logic, each handling a specific kind of search. Unfortunately, the Haystack version we were using did not offer this option with the Whoosh backend. That forced us to use a different search backend, so we gave Elasticsearch a try. It was good and covered all our needs. But we were already on a mission to improve the search system, so we were still not satisfied. The main bottleneck was Ajax search, where we need to show relevant results within a second. That is when CloudSearch came into the picture.

Before implementing it on the site, we built a test search page with minimal functionality and all our data on CloudSearch. The page carried no other functionality or scripts, so that the benchmark would be accurate, and the results were awesome: we were getting results within a second.

How did we proceed?

First, here is the flow we used in implementing the search system:

User Request --> Web Server --> CloudSearch --> Web Server --> User Response

We followed the steps below to implement it, but before that you need to have the following ready:

1. An AWS account with your access key ID and secret access key.

2. A search domain in CloudSearch.

3. aws-cli configured (optional).

4. Index fields added to the search domain via the CloudSearch interface.

After these four steps you will have your search-service endpoint and document-service endpoint, which you will need for interacting with CloudSearch.

Now suppose you have the following Django model that you want to make searchable:

from django.db import models

class foo(models.Model):
    # Each field is uploaded under the same name in the sample document below
    search_field_1 = models.CharField(max_length=255)
    search_field_2 = models.CharField(max_length=255)

One extra step applies if you already have data that you want to include in CloudSearch (as in our case): you need to upload it. There are several options for uploading data, such as XML, JSON or DynamoDB. We chose JSON because our JavaScript can also easily understand it.

CloudSearch calls every object a document, and you can do batch uploads if your AWS account supports bigger upload limits. For every document, CloudSearch expects a unique id, its type ("add" if you are adding a new document, "delete" if you are deleting an existing one) and your search data.

A sample document batch looks like this:

[
 {"type": "add",
  "id":   "obj-1",
  "fields": {
    "search_field_1": "I am child",
    "search_field_2": "I am daddy"
  }
 },
 {"type": "delete",
  "id":   "obj-2"
 }
]
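As a rough illustration, a batch like the one above could be generated from Python rather than written by hand. This is only a sketch: the helper names (add_operation, delete_operation, to_batches) and the max_bytes value are my own assumptions, not part of CloudSearch, so check the upload limits that apply to your account.

```python
import json


def add_operation(doc_id, fields):
    """Build an "add" operation for one searchable object."""
    return {"type": "add", "id": str(doc_id), "fields": dict(fields)}


def delete_operation(doc_id):
    """Build a "delete" operation for an existing document."""
    return {"type": "delete", "id": str(doc_id)}


def to_batches(operations, max_bytes):
    """Yield JSON strings, each a list of operations at most max_bytes long.

    json.dumps escapes non-ASCII characters by default, so the string
    length equals the byte length. A single operation larger than
    max_bytes is still emitted, alone in its own batch.
    """
    batch, size = [], 2  # 2 accounts for the enclosing "[" and "]"
    for op in operations:
        encoded = len(json.dumps(op))
        extra = encoded + (2 if batch else 0)  # ", " between items
        if batch and size + extra > max_bytes:
            yield json.dumps(batch)
            batch, size, extra = [], 2, encoded
        batch.append(op)
        size += extra
    if batch:
        yield json.dumps(batch)


operations = [
    add_operation("obj-1", {"search_field_1": "I am child",
                            "search_field_2": "I am daddy"}),
    delete_operation("obj-2"),
]
# The 5 MB figure is an assumption; use the limit that applies to you.
batches = list(to_batches(operations, max_bytes=5 * 1024 * 1024))
```

Each string in batches can be written to a file and uploaded as one batch.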

You can upload multiple documents in a batch as long as it is under your account's upload limit. You can upload directly via the CloudSearch interface or with aws-cli (I suggest aws-cli if you have a lot of data or want automation). For data added in the future, you can write Django signals that create a document whenever an object is created or deleted and upload it to CloudSearch.

aws cloudsearchdomain --endpoint-url DOCUMENT_SERVICE_ENDPOINT upload-documents --content-type application/json --documents FILE_PATH
{
    "status": "success", 
    "adds": COUNT_OF_DOCUMENT_UPLOADED, 
    "deletes": 0
}
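For objects created or deleted after the initial upload, the Django signals idea mentioned above might look roughly like the sketch below. The id scheme ("foo-42"), the make_handlers helper and the upload callable are assumptions of mine for illustration; the commented lines show how the receivers could be connected in a real project, and the boto3 call there assumes you use boto3's cloudsearchdomain client.

```python
import json


def document_id(instance):
    """Derive a document id such as "foo-42" (this naming scheme is an assumption)."""
    return "%s-%s" % (type(instance).__name__.lower(), instance.pk)


def make_handlers(upload):
    """Return (on_save, on_delete) receivers.

    upload is any callable that takes a JSON string holding a
    one-document batch and sends it to the document-service endpoint.
    """
    def on_save(sender, instance, **kwargs):
        doc = {
            "type": "add",
            "id": document_id(instance),
            "fields": {
                "search_field_1": instance.search_field_1,
                "search_field_2": instance.search_field_2,
            },
        }
        upload(json.dumps([doc]))

    def on_delete(sender, instance, **kwargs):
        upload(json.dumps([{"type": "delete", "id": document_id(instance)}]))

    return on_save, on_delete


# Wiring inside a Django project (not executed here):
# import boto3
# from django.db.models.signals import post_save, post_delete
# client = boto3.client("cloudsearchdomain",
#                       endpoint_url="https://DOCUMENT_SERVICE_ENDPOINT")
# upload = lambda body: client.upload_documents(
#     documents=body.encode("utf-8"), contentType="application/json")
# on_save, on_delete = make_handlers(upload)
# post_save.connect(on_save, sender=foo, weak=False)
# post_delete.connect(on_delete, sender=foo, weak=False)
```

The receivers are connected with weak=False because they are closures; with Django's default weak references they could be garbage-collected and silently stop firing.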

Once the upload is done, your data is indexed and ready to be searched. You can make a sample search request with the following command:

aws cloudsearchdomain --endpoint-url SEARCH_SERVICE_ENDPOINT search --search-query child
{
    "status": {
        "rid": "/rnE+e4oCAqfEEs=", 
        "time-ms": 6
    }, 
    "hits": {
        "found": 1, 
        "hit": [
            {
                "id": "obj-1"
            }
        ], 
        "start": 0
    }
}
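On the web server, a response like this has to be turned back into database objects before rendering. The sketch below shows one way that step might look; the function names are hypothetical, and it assumes document ids follow the "obj-<primary key>" pattern used in the sample document.

```python
def hit_ids(response):
    """Return the document ids from a search response, in ranked order."""
    return [hit["id"] for hit in response.get("hits", {}).get("hit", [])]


def primary_keys(ids, prefix="obj-"):
    """Strip the id prefix to get database keys (assumes ids like "obj-1")."""
    return [int(doc_id[len(prefix):]) for doc_id in ids
            if doc_id.startswith(prefix)]


def in_ranked_order(pks, objects_by_pk):
    """Order fetched objects the way the search ranked them, skipping
    ids that no longer exist in the database."""
    return [objects_by_pk[pk] for pk in pks if pk in objects_by_pk]


response = {
    "status": {"rid": "/rnE+e4oCAqfEEs=", "time-ms": 6},
    "hits": {"found": 1, "hit": [{"id": "obj-1"}], "start": 0},
}
ids = hit_ids(response)
pks = primary_keys(ids)
# In Django the objects could then be loaded in one query, for example:
# results = in_ranked_order(pks, foo.objects.in_bulk(pks))
```

Fetching by primary key keeps the database as the source of truth, with CloudSearch deciding only which objects match and in what order.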

That was our search system implementation, but there is more. You may want to add suggesters (typeahead: as the user starts typing, they see suggested phrases, and the more precise those are, the sooner the user finds their phrase). CloudSearch has an option to configure suggesters. A suggester is given a field to draw its suggestions from. Let's say we give "search_field_1" to our suggester "Suggester_1". You can then get suggestions by requesting this URL:

http://SEARCH_SERVICE_ENDPOINT/API_VERSION/suggest/?q=QUERY&suggester=Suggester_1
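When the suggest request is issued from code, the user's input has to be URL-encoded. A minimal sketch, assuming the URL shape shown above; suggest_url is a hypothetical helper, and the optional size parameter (number of suggestions to return) is included on the assumption that you want to cap the list:

```python
from urllib.parse import urlencode


def suggest_url(endpoint, api_version, query, suggester, size=None):
    """Build a suggest URL, URL-encoding whatever the user typed."""
    params = {"q": query, "suggester": suggester}
    if size is not None:
        params["size"] = size
    return "http://%s/%s/suggest/?%s" % (endpoint, api_version,
                                         urlencode(params))


# Placeholders are kept as in the article; substitute your own values.
url = suggest_url("SEARCH_SERVICE_ENDPOINT", "API_VERSION",
                  "chi ld", "Suggester_1")
```

Building the query string with urlencode avoids broken requests when the typed text contains spaces, ampersands or non-ASCII characters.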

There is also a workaround if you have multiple types of objects but want only one search domain. Alongside your other index fields, add one extra, generic index field that is present in every type of document and holds the object type. You can then pass this generic field along when querying CloudSearch. The query below uses the structured query syntax, so the structured parser has to be selected and the expression quoted for the shell:

aws cloudsearchdomain --endpoint-url SEARCH_SERVICE_ENDPOINT search --query-parser structured --search-query "(and search_field_1:'child' generic_field:'OBJ_TYPE')"
{
    "status": {
        "rid": "/rnE+e4oCAqfEEs=", 
        "time-ms": 6
    }, 
    "hits": {
        "found": 1, 
        "hit": [
            {
                "id": "obj-1"
            }
        ], 
        "start": 0
    }
}
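If the query is assembled from user input, the values need escaping before they go into the structured expression. The following is a sketch under that assumption; quote_value and and_query are hypothetical helpers, and the commented call assumes boto3's cloudsearchdomain client.

```python
def quote_value(value):
    """Wrap a value in single quotes, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'%s'" % escaped


def and_query(**terms):
    """Combine field:value terms into one structured "and" expression."""
    parts = ["%s:%s" % (field, quote_value(value))
             for field, value in terms.items()]
    return "(and %s)" % " ".join(parts)


query = and_query(search_field_1="child", generic_field="OBJ_TYPE")
# With boto3 the query could be sent roughly like this (not executed here):
# client = boto3.client("cloudsearchdomain",
#                       endpoint_url="https://SEARCH_SERVICE_ENDPOINT")
# response = client.search(query=query, queryParser="structured")
```

Escaping matters because an unescaped single quote in the user's text would end the string early and make the whole expression invalid.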

So this was all I did to implement CloudSearch in my Django application. Please add your comments below if you see ways to improve our architecture. I would be very happy to answer any questions about this.

Please feel free to contact me on Skype (mfsi_nitishk) or by mail at nitish@mindfiresolutions.com.

Thanks again for taking the time to go through this.

References: http://aws.amazon.com/documentation/cloudsearch/