Hi, I'm Harlin and welcome to my blog. I write about Python, Alfresco and other cheesy comestibles.

How to Get Django and ElasticSearch to Play Together

Ok, I like the idea of NoSQL -- even if there are very few use-cases that really demand it. Honestly, you can use an RDBMS like Postgresql to handle most use-cases instead of going the NoSQL route. Reasoning aside, there is a part of me that really likes it.

I like being able to write my own "ORM" and just add fields, rows or "documents" -- whichever terminology you prefer -- without regard to constraints or schema structure. It's a lot of fun and even freeing. I can focus on the code itself and handle nearly any kind of input, or metadata attached to objects by users, with very little stress.

I know I'm not the only one who likes this kind of flexibility. A lot of the *js frameworks available now seem to make shameless use of NoSQL databases for all persisted data, even data that used to be entrusted to RDBMSes.

Something I was excited about working with in the not so distant past was Django-NonRel (a now seemingly defunct project to support Django on non-relational, 'NoSQL' databases). It was a lot of fun to fiddle around with, though at the time it had a lot of issues and didn't support some things very well, like foreign keys or m2m relationships. Early on, I didn't understand schema migrations in Django as well as I do now, and the lazy part of me just wanted to overlook them.

This brings me to where I am at the moment. I'm currently interested in writing a Python app that handles content management similar to how Alfresco does. This won't be a full rewrite of Alfresco, mind you, but more something for the fun of it and to get an idea of how content management software is written under the hood. In case you're not familiar, Alfresco is Enterprise Content Management software written in Java. And if you're not familiar with me, I work there as a technical support engineer.

Alfresco is open source and as far as its repository goes, is somewhat easy for me to understand. Since it's written in Java, I believe I should be able to reasonably write something like it in Python. At its core, Alfresco is a document repository that has a few external features like:

  • A web UI client - called Share, it communicates with the repository via REST calls.
  • File servers - the repository exposes itself over common file server protocols like FTP, WebDAV, CIFS and others.
  • Transformation server - the repository has built-in transformation functionality (leaning on LibreOffice for most of it), and there is also a Windows-based transformation server for better translations of the Microsoft formats.
  • Search - uses Solr (based on Lucene)
  • And so much much more. Probably too much more.

Alfresco is very much document-centric, as you would expect. As such, I was thinking that Django-NonRel with a NoSQL db like MongoDB might be a natural fit for building a document repository. One of the selling points of a document repository is the ability to add properties/tags on the fly and to make a few minor functions and rules available. NoSQL databases certainly support adding new columns (with data) or ignoring column data that may exist for another row/document.

But, after thinking this over a bit and probably being influenced a bit too much by other Django developers, it seems like this project would be better served by using Postgresql for the more structured pieces of data (like the authentication and permission bits) and something like ElasticSearch for the search parts.

Postgresql can handle full text searches and dynamic JSON fields where needed, and ElasticSearch can take care of the search-engine side. ElasticSearch is based on Lucene (so is Solr, which is what Alfresco uses for its search engine functionality). I was a little nervous about using the Haystack module to work with Solr -- I don't think it's ready for Solr 6, though it does work in some fashion with Solr 4 and older.
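As a quick aside on the Postgres side of that split, here's a rough sketch of what you get from Django itself, assuming 'django.contrib.postgres' is in INSTALLED_APPS (the Page model and its fields are hypothetical, just for illustration):

from django.contrib.postgres.fields import JSONField
from django.contrib.postgres.search import SearchVector
from django.db import models

class Page(models.Model):
    title = models.CharField(max_length=200)
    body = models.TextField()
    # Schemaless, per-row properties, much like a NoSQL document
    metadata = JSONField(default=dict)

# Full text search without an external engine:
# Page.objects.annotate(search=SearchVector('title', 'body')).filter(search='alfresco')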

I was also a little shy about looking further into ElasticSearch. Since it's based on Lucene, I was afraid it would be something of a Solr clone and just as difficult to work with in spots. Mostly, I was afraid that there would not be much support for it on the Django side.

I was pleasantly surprised. After looking through the interwebs, I was able to find some very good information on tying Django and ElasticSearch together. It seems to me that Python and ES get along a little better than Python and Solr do.

Like Solr, ES uses a REST API to handle search, and using the elasticsearch-dsl module with Django is very simple. In addition to being very good at search, I discovered that there is a lot of functionality available for keeping the index in sync with your data. That's one of the trickier parts of dealing with a search engine like Solr or Endeca: it's not uncommon to have to write your own scripts for import/export and data update handling to make sure the index stays consistent with the database. Alfresco certainly has some extra scripting and bits to ensure its indexes stay "fresh", and its implementation of Solr is very much home-spun -- versions 1.4 and 4 will not be familiar to generic Solr users (Alfresco's Solr 6 is something of a different story, though). With ElasticSearch, I've found, it's a snap to write some code that indexes a model's fields as they're updated in the database.

So, what I wanted to do in this post was to show how simple it is to get this set up and running.

The first thing we want to do is go get ElasticSearch, install it and get it running.

You can download it from here:

https://www.elastic.co/downloads/elasticsearch

The version I tested here was 5.6.1. Once you download it, decompress the package, move it to a parent directory (I always prefer something like ~hseritt/apps/...), change to the bin directory and run:

$ ./elasticsearch

So far, there's not an admin console out of the box, but you can check that it's running by using curl (any other decent HTTP client should work as well):

$ curl -XGET http://localhost:9200
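If ElasticSearch is up, you'll get back a small JSON blob describing the node. The exact values will differ on your machine, but it should look roughly like this:

{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "5.6.1",
    ...
  },
  "tagline" : "You Know, for Search"
}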

Now, let's build a project directory and get a Django project set up:

(btw, if you're not using pyenv to manage your Python installs, you should be -- follow these instructions to install it)

Create the top project directory (I've called mine "elastic_demo"):

$ mkdir elastic_demo
$ cd elastic_demo
$ pyenv global 3.6.2
$ pyenv virtualenv elastic_demo
$ pyenv local elastic_demo

Now that we have pyenv set up, we can use the active Python version's pip to install all needed packages:

Note that if you want to use MySQL instead of Postgresql, you can install mysqlclient instead of psycopg2.

$ pip install django psycopg2 elasticsearch-dsl

For Postgresql (other databases should have similar client tools):

  • Start Postgresql server.
  • Start up pgadmin4 client.
  • Create a database called elastic_demo.
  • Create a role called 'admin' and assign it superuser status. I've used 'admin' as my password. (If you'd rather use the command line than pgadmin4, there's a sketch just after these setup steps.)
  • Set up the documentsapp app:
$ django-admin.py startproject elastic_demo
$ cd elastic_demo
$ ./manage.py startapp documentsapp
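
By the way, if you'd rather create the database and role from the command line instead of pgadmin4, something along these lines should do it (a rough sketch, assuming the default 'postgres' superuser):

$ psql -U postgres -c "CREATE ROLE admin WITH LOGIN SUPERUSER PASSWORD 'admin';"
$ psql -U postgres -c "CREATE DATABASE elastic_demo OWNER admin;"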

In documentsapp/models.py add:

from django.db import models
from django.utils import timezone
from django.contrib.auth.models import User

class Document(models.Model):
    author = models.ForeignKey(User, on_delete=models.CASCADE, related_name='document')
    posted_date = models.DateTimeField()
    title = models.CharField(max_length=200)
    text = models.TextField(max_length=1000)

Make sure that you add 'documentsapp' to INSTALLED_APPS in settings.py
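
For reference, INSTALLED_APPS might end up looking something like this (the other entries are Django's defaults):

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'documentsapp',
]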

Also, in settings.py, add the following database config:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'elastic_demo',
        'USER': 'admin',
        'PASSWORD': 'admin',
        'HOST': '127.0.0.1',
        'PORT': '5432',
    }
}

If you use MySQL, your database configs might look something more like this:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'elastic_demo',
        'USER': 'admin',
        'PASSWORD': 'admin',
        'HOST': '127.0.0.1',
        'PORT': '3306',
    }
}

Now, add the Document model to admin.py so it shows up in the admin console. Your admin.py should have this code:

from django.contrib import admin

from .models import Document

admin.site.register(Document)

Now, let's sync up the database with models and start up our Django project:

$ ./manage.py makemigrations
$ ./manage.py migrate
$ ./manage.py createsuperuser
$ ./manage.py runserver

Go to http://localhost:8000/admin and login. You should see the Document model. Go ahead and create a document.
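
If you'd rather not click through the admin, a quick way to create a test document from the Django shell would be something like this (assuming your superuser is named 'admin'; the title and text are just placeholders):

$ ./manage.py shell
>>> from django.contrib.auth.models import User
>>> from django.utils import timezone
>>> from documentsapp.models import Document
>>> me = User.objects.get(username='admin')
>>> Document.objects.create(author=me, posted_date=timezone.now(),
...     title='First document', text='Hello from the repository.')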

In documentsapp/search.py add:

from elasticsearch_dsl.connections import connections
from elasticsearch_dsl import DocType, Text, Date
from elasticsearch.helpers import bulk
from elasticsearch import Elasticsearch
from . import models

connections.create_connection()  # default connection to localhost:9200

class DocumentIndex(DocType):
    author = Text()
    posted_date = Date()
    title = Text()
    text = Text()

    class Meta:
        index = 'document-index'


def bulk_indexing():
    # Create the index and mapping if they don't exist yet,
    # then index every Document in a single bulk request.
    DocumentIndex.init()
    es = Elasticsearch()
    bulk(client=es, actions=(b.indexing() for b in models.Document.objects.all().iterator()))

Now, let's add an indexing() method to the Document model in models.py so that our app knows what to send to ElasticSearch when indexing() is called:

...
from .search import DocumentIndex
...

class Document(models.Model):
    ...

    # Add indexing method to Document
    def indexing(self):
        obj = DocumentIndex(
            meta={'id': self.id},
            author=self.author.username,
            posted_date=self.posted_date,
            title=self.title,
            text=self.text
        )
        obj.save()
        return obj.to_dict(include_meta=True)

From the Django shell, run the bulk_indexing() function to make sure it works as expected:

$ ./manage.py shell
>>> from documentsapp.search import *
>>> bulk_indexing()

We can verify it with:

$ curl -XGET 'localhost:9200/document-index/document_index/1?pretty'

To make sure our model's data gets re-indexed on any update or save, we can add a post_save signal. In the documentsapp directory, create a file called signals.py:

from .models import Document
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Document)
def index_post(sender, instance, **kwargs):
    instance.indexing()

We have just a little bit more configuring to do. In the documentsapp directory, open apps.py and add:

from django.apps import AppConfig

class DocumentsappConfig(AppConfig):
    name = 'documentsapp'

    def ready(self):
        import documentsapp.signals

In the __init__.py file in the documentsapp directory, add:

default_app_config = 'documentsapp.apps.DocumentsappConfig'

Lastly (I promise), add our search function to search.py:

...
from elasticsearch_dsl import DocType, Text, Date, Search
...
...
def search(author):
    # Filter the document-index by author (exact term match on the username)
    s = Search(index='document-index').filter('term', author=author)
    response = s.execute()
    return response

In the shell, let's test:

$ ./manage.py shell

>>> from documentsapp.search import *
>>> print(search(author="admin"))

We should now see the search results.
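
print() on the Response object is fairly terse. To pull out individual hits, iterate over the response; the attributes match the fields we defined on DocumentIndex:

>>> for hit in search(author="admin"):
...     print(hit.title, hit.posted_date)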

Any Comments, Always Welcome!