Add Full-Text Search to your Django project with Whoosh

Whoosh is a pure-Python full-text indexing and search library. It was open-sourced recently and makes it easy to add full-text search to your site without external services such as Lucene or Solr.

Whoosh is pretty flexible, but to keep it simple let's assume that the index is stored in settings.WHOOSH_INDEX (which should be a path on the filesystem) and that our application is a Wiki and we want to search the Wiki pages.
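As a minimal sketch, the setting could look like the excerpt below. The directory shown is purely hypothetical; any path works as long as the process running Django is allowed to create and write to it, because the index is updated from inside that process.

```python
# settings.py (excerpt)

# Directory in which Whoosh keeps its index files.
# The path is a hypothetical example; pick one that your Django
# process can create and write to.
WHOOSH_INDEX = "/var/lib/mysite/whoosh_index"
```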

Indexing Documents

Before we can query the index we have to make sure that the documents are indexed properly and automatically. It doesn't matter whether you put the following code into a new app or an existing one. You only have to make sure that it lives in a file which Django loads when the process starts. Reasonable places are the __init__.py or the models.py file of any app (or any file imported by those, of course).

The following code listing is interrupted by short explanations but should be saved into one file:

import os
from django.db.models import signals
from django.conf import settings
from whoosh import store, fields, index
from rcs.wiki.models import WikiPage

WHOOSH_SCHEMA = fields.Schema(title=fields.TEXT(stored=True),
                              content=fields.TEXT,
                              url=fields.ID(stored=True, unique=True))

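At the top of the file a Schema is defined. The Schema tells Whoosh what data should go into the index and how it should be organized. In this example Whoosh handles every indexed document as three fields: title (indexed and stored), content (indexed but not stored) and url (a stored, unique ID). Only stored fields are returned with the search results, so a hit carries the title and the URL of a page but not its content.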

def create_index(sender=None, **kwargs):
    if not os.path.exists(settings.WHOOSH_INDEX):
        os.mkdir(settings.WHOOSH_INDEX)
        storage = store.FileStorage(settings.WHOOSH_INDEX)
        ix = index.Index(storage, schema=WHOOSH_SCHEMA, create=True)

signals.post_syncdb.connect(create_index)

The index is stored on the filesystem. To make sure it is available, the function create_index is connected to Django's post_syncdb signal and creates the index if it is not already present. This function uses the Schema defined earlier.

def update_index(sender, instance, created, **kwargs):
    storage = store.FileStorage(settings.WHOOSH_INDEX)
    ix = index.Index(storage, schema=WHOOSH_SCHEMA)
    writer = ix.writer()
    if created:
        # A new page: add it to the index.
        writer.add_document(title=unicode(instance), content=instance.content,
                            url=unicode(instance.get_absolute_url()))
        writer.commit()
    else:
        # An existing page: replace the entry with the same unique url.
        writer.update_document(title=unicode(instance), content=instance.content,
                               url=unicode(instance.get_absolute_url()))
        writer.commit()

signals.post_save.connect(update_index, sender=WikiPage)

To make sure the index is automatically updated every time a page on the Wiki changes, the function update_index is called whenever a WikiPage object sends the post_save signal via Django's signal framework.

If the instance was newly created, it is added to the index as a new document. If it was edited (but existed before), its entry in the index is updated. The document is identified in the index by its unique URL.
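The difference between adding and updating is easiest to see in a toy model. The sketch below is not Whoosh: ToyIndex is a hypothetical in-memory stand-in (written for Python 3) that only illustrates the idea of replacing a document by the value of its unique field, here the URL.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index with one unique field."""

    def __init__(self, unique_field):
        self.unique_field = unique_field
        self.docs = []

    def add_document(self, **fields):
        # Blindly append: calling this twice for one URL leaves duplicates.
        self.docs.append(dict(fields))

    def update_document(self, **fields):
        # Drop any document with the same unique value, then add the new one.
        key = fields[self.unique_field]
        self.docs = [d for d in self.docs if d[self.unique_field] != key]
        self.docs.append(dict(fields))


ix = ToyIndex(unique_field="url")
ix.add_document(title="Start", content="first draft", url="/wiki/start/")
ix.update_document(title="Start", content="second draft", url="/wiki/start/")
ix.update_document(title="Help", content="how to edit", url="/wiki/help/")

for doc in ix.docs:
    print(doc["url"], "->", doc["content"])
```

After these three calls the toy index holds exactly one entry per URL, and the entry for /wiki/start/ carries the second draft. This is why the url field has to be unique: it is the key that decides which entry gets replaced.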

Query the Index

At this point we have made sure that Whoosh will always keep an up-to-date index of our WikiPage objects. The next step is to create a view which allows querying the index.

A single view and a template are all we need to let users search the index. The template contains a simple form:

<form action="" method="get">
    <input type="text" id="id_q" name="q" value="{{ query|default_if_none:"" }}" />
    <input type="submit" value="{% trans "Search" %}"/>
</form>

By setting method to GET and action to an empty string, we tell the browser to submit the form to the current URL, with the value of the input field (named q) appended as a query string. A search for the term "Django" will result in a request like this:

http://server/somwhere/?q=Django
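To see what the browser does with the form value, the following sketch builds the same kind of URL with Python 3's standard library. The helper name search_url is made up for this illustration; the base URL is the placeholder from above.

```python
from urllib.parse import urlencode, parse_qs


def search_url(base, term):
    """Append the search term as the q parameter, like a GET form does."""
    return base + "?" + urlencode({"q": term})


url = search_url("http://server/somwhere/", "full text search")
print(url)

# Decoding the query string again yields the original term.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["q"][0])

# A literal plus sign typed by the user is escaped as %2B.
print(urlencode({"q": "django +search"}))
```

Note that spaces travel as "+" in the query string and are turned back into spaces on decoding, while a "+" typed by the user is escaped as %2B and therefore arrives in the view as a real plus sign.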

I also pass the query back to the search form when displaying the results. This improves the user experience further, because the user can easily edit the query and submit it again.

If you have a dedicated search page (instead of a search box on every page) you might also consider giving focus to the input field to save the user an extra click. If you don't use a JavaScript framework, a very simple solution is to use the onload attribute of the body tag:

<body onload="document.getElementById('id_q').focus();">

Now let's have a look at the view code which handles the requests:

from django.conf import settings
from django.views.generic.simple import direct_to_template
from whoosh import index, store, fields
from whoosh.qparser import QueryParser
from somwhere import WHOOSH_SCHEMA


def search(request):
    """
    Simple search view which accepts search queries via the URL, like Google.
    Use something like ?q=this+is+the+search+term

    """
    storage = store.FileStorage(settings.WHOOSH_INDEX)
    ix = index.Index(storage, schema=WHOOSH_SCHEMA)
    hits = []
    query = request.GET.get('q', None)
    if query is not None and query != u"":
        # Whoosh doesn't understand '+' or '-', but we can replace
        # them with 'AND' and 'NOT'.
        query = query.replace('+', ' AND ').replace(' -', ' NOT ')
        parser = QueryParser("content", schema=ix.schema)
        try:
            qry = parser.parse(query)
        except Exception:
            # Don't show the user weird errors just because we don't
            # understand the query.
            # parser.parse("") would return None
            qry = None
        if qry is not None:
            searcher = ix.searcher()
            hits = searcher.search(qry)

    return direct_to_template(request, 'search.html',
                              {'query': query, 'hits': hits})

The view imports the previously defined WHOOSH_SCHEMA and gets the index location from the settings. Most of the remaining code is only there to improve the user experience: it transforms some characters commonly found in search queries into their Whoosh equivalents and catches the exceptions raised by the Whoosh QueryParser.
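The character replacement can be tried in isolation. The sketch below (Python 3) copies the replace chain from the view into a standalone function; the name to_whoosh_syntax is made up for this illustration.

```python
def to_whoosh_syntax(query):
    """Rewrite '+' and ' -' the same way the search view does."""
    return query.replace("+", " AND ").replace(" -", " NOT ")


examples = ["django +search", "django -rails", "full-text search", "-rails"]
for raw in examples:
    print(repr(raw), "->", repr(to_whoosh_syntax(raw)))
```

Because only a minus sign that follows a space is replaced, hyphenated words such as "full-text" pass through unchanged. The same rule means that a minus sign at the very start of the query is not rewritten either.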

Displaying the search results in the template is pretty straightforward:

{% if hits %}
<ul>
    {% for hit in hits %}
    <li><a href="{{ hit.url }}">{{ hit.title }}</a></li>
    {% endfor %}
</ul>
{% endif %}

Conclusion

With Whoosh and no more than 100 lines of code (including the template) it is possible to add full-text search capabilities to your Django project. I've already added the code above to two projects and I'm pretty impressed by the ease of use and the performance of Whoosh.

The result is that I can now make my Django-powered sites a bit more awesome by adding full-text search (where applicable), and best of all: at ~100 LOC it comes almost for free.

Related Projects

For a different approach to adding Whoosh to your Django project you might also want to have a look at django-whoosh by Eric Florenzano, which is available on GitHub. Django-Whoosh is basically a Manager which is added to your objects, takes care of indexing and lets you fetch objects by querying the Whoosh index. The idea is clever but only works if you are willing to edit the Model classes to add the manager. My approach is based entirely on signals and will therefore work with any reusable app without editing the app itself.

Another app which combines Whoosh and Django is djoosh, also available on GitHub, but it doesn't seem to be finished at the moment. Djoosh aims to provide a mechanism for registering Models with the indexing infrastructure, similar to the way contrib.admin does it.

