How to Index and Query Data With Haystack and Elasticsearch in Python

Haystack

Haystack is a Python library that provides modular search for Django. It features an API that provides support for different search back ends such as Elasticsearch, Whoosh, Xapian, and Solr.

Elasticsearch

Elasticsearch is a popular search engine built on Apache Lucene. It is capable of full-text search and is developed in Java.

Google search uses a similar approach to index its data, which is why it can retrieve relevant information from just a few keywords.
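The indexing approach mentioned here is commonly implemented as an inverted index, the structure Lucene is built around: each term maps to the documents that contain it, so a keyword lookup never has to scan every document. The following is a minimal plain-Python sketch of that idea, not how Elasticsearch is implemented internally; the sample documents and function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical sample documents, keyed by id.
documents = {
    1: "Albert Einstein physicist",
    2: "Albert Camus writer",
    3: "Marie Curie physicist",
}
index = build_inverted_index(documents)
print(search(index, "albert"))            # {1, 2}
print(search(index, "albert physicist"))  # {1}
```

A real engine adds analysis steps such as stemming and relevance scoring on top of this lookup.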

Install Django Haystack and Elasticsearch

The first step is to get Elasticsearch up and running locally on your machine. Elasticsearch requires Java, so you need to have Java installed on your machine.

We are going to follow the instructions from the Elasticsearch site.

Download the Elasticsearch 1.4.5 tar as follows:

curl -L -O https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.4.5.tar.gz

Extract it as follows:

tar -xvf elasticsearch-1.4.5.tar.gz

This creates an elasticsearch-1.4.5 directory in your current directory. Go into its bin directory as follows:

cd elasticsearch-1.4.5/bin

Start Elasticsearch as follows.

./elasticsearch

To confirm that it is running, go to http://127.0.0.1:9200/. You should see a response similar to the one below. The version fields reflect the release you are running; this sample was captured from a 5.6.3 node.

{
  "name" : "W3nGEDa",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "ygpVDczbR4OI5sx5lzo0-w",
  "version" : {
    "number" : "5.6.3",
    "build_hash" : "1a2f265",
    "build_date" : "2017-10-06T20:33:39.012Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
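The same check can be scripted. The sketch below assumes the node answers on the URL used in this tutorial and that the response has the shape of the sample above; the function names are hypothetical. The parsing step is kept separate so it can be tried without a running node.

```python
import json
from urllib.request import urlopen


def summarize_node(payload):
    """Extract the fields worth checking from the root endpoint's JSON."""
    info = json.loads(payload)
    return {
        "cluster_name": info.get("cluster_name"),
        "version": info["version"]["number"],
        "lucene_version": info["version"]["lucene_version"],
    }


def fetch_node_summary(url="http://127.0.0.1:9200/"):
    """Request the root endpoint of a running node and summarize the reply."""
    with urlopen(url, timeout=5) as response:
        return summarize_node(response.read().decode("utf-8"))


# Offline demonstration with a trimmed copy of the sample response.
sample = (
    '{"cluster_name": "elasticsearch", '
    '"version": {"number": "5.6.3", "lucene_version": "6.6.1"}}'
)
print(summarize_node(sample))
```

Call fetch_node_summary() once Elasticsearch is running to perform the live check.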

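Ensure you also have Haystack installed.

pip install django-haystack

The Elasticsearch back end also needs the elasticsearch Python client (choose a release that matches your server's major version), and the API view later in this tutorial uses Django REST framework, which is distributed as the djangorestframework package. Install both with pip as well.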

Let’s create our Django project. Our project will be able to index all the customers in a bank, making it easy to search and retrieve data using just a few search terms.

django-admin startproject Bank

This command creates files that provide configurations for Django projects.

Let’s create an app for customers.

cd Bank

python manage.py startapp customers

`settings.py` Configurations

In order to use Elasticsearch to index our searchable content, we’ll need to define a back-end setting for haystack in our project’s settings.py file. We are going to use Elasticsearch as our back end.

HAYSTACK_CONNECTIONS is a required setting and should look like this:

HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}
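If the Elasticsearch address differs between machines, the same setting can be built from environment variables. This is only a sketch: the variable names ELASTICSEARCH_URL and HAYSTACK_INDEX_NAME and the helper function are assumptions made for illustration, not names Haystack itself reads.

```python
import os

ENGINE = "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine"


def build_haystack_connections(environ):
    """Build the HAYSTACK_CONNECTIONS dict, falling back to the tutorial's values."""
    return {
        "default": {
            "ENGINE": ENGINE,
            # Both environment variable names below are hypothetical.
            "URL": environ.get("ELASTICSEARCH_URL", "http://127.0.0.1:9200/"),
            "INDEX_NAME": environ.get("HAYSTACK_INDEX_NAME", "haystack"),
        },
    }


# In settings.py, this would replace the literal dictionary.
HAYSTACK_CONNECTIONS = build_haystack_connections(os.environ)
```

With neither variable set, the result matches the literal configuration shown in this section.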

Within settings.py, we are also going to add rest_framework, haystack, and customers to the list of installed apps.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'rest_framework',
    'haystack',
    'customers',
]

Create Models

Let’s create a model for Customers. In customers/models.py, add the following code.

from __future__ import unicode_literals

from django.db import models


# Create your models here.
customer_type = (
    ("Active", "Active"),
    ("Inactive", "Inactive")
)


class Customer(models.Model):
    id = models.IntegerField(primary_key=True)
    first_name = models.CharField(max_length=50, null=False, blank=True)
    last_name = models.CharField(
        max_length=50, null=False, blank=True)
    other_names = models.CharField(max_length=50, default=" ")
    email = models.EmailField(max_length=100, null=True, blank=True)
    phone = models.CharField(max_length=30, null=False, blank=True)
    balance = models.IntegerField(default=0)
    customer_status = models.CharField(
        max_length=100, choices=customer_type, default="Active")
    address = models.CharField(
        max_length=50, null=False, blank=False)

    def save(self, *args, **kwargs):
        return super(Customer, self).save(*args, **kwargs)

    # __unicode__ is used by Python 2; on Python 3, name this method __str__.
    def __unicode__(self):
        return "{}:{}".format(self.first_name, self.last_name)

Next, register the model with the admin site. In customers/admin.py, add the following code.

from django.contrib import admin
from .models import Customer

# Register your models here.

admin.site.register(Customer)

Create Database and Super User

Create the migration for the Customer model, apply your migrations, and create an admin account.

python manage.py makemigrations customers
python manage.py migrate
python manage.py createsuperuser

Run your server and navigate to http://localhost:8000/admin/. You should now be able to see your Customer model there. Go ahead and add new customers in the admin.

Indexing Data

To index our models, we begin by creating a SearchIndex. SearchIndex objects are the way Haystack determines what data should be placed in the search index, and they handle the flow of data into it. Each type of model must have its own unique SearchIndex.

To build a SearchIndex, we inherit from indexes.SearchIndex and indexes.Indexable, define the fields we want to store our data in, and define a get_model method.

Let’s create the CustomerIndex to correspond to our Customer model. Create a file search_indexes.py in the customers app directory, and add the following code.

from .models import Customer
from haystack import indexes


class CustomerIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.EdgeNgramField(document=True, use_template=True)
    first_name = indexes.CharField(model_attr='first_name')
    last_name = indexes.CharField(model_attr='last_name')
    other_names = indexes.CharField(model_attr='other_names')
    email = indexes.CharField(model_attr='email', default=" ")
    phone = indexes.CharField(model_attr='phone', default=" ")
    balance = indexes.IntegerField(model_attr='balance', default=0)
    customer_status = indexes.CharField(model_attr='customer_status')
    address = indexes.CharField(model_attr='address', default=" ")

    def get_model(self):
        return Customer

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

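The EdgeNgramField is a Haystack SearchIndex field that indexes the leading fragments (edge n-grams) of each whitespace-separated word. Because the fragments never span two words, it prevents incorrect matches when parts of two different words are mashed together.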

It allows us to use the autocomplete feature to conduct queries. We will use autocomplete when we start querying our data.
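To make the idea of edge n-grams concrete, here is a small plain-Python sketch of the tokenization. It illustrates the concept only and is not Haystack's or Elasticsearch's code; the function name and the min_gram and max_gram values are assumptions chosen for the example.

```python
def edge_ngrams(text, min_gram=2, max_gram=15):
    """Return the leading fragments of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        upper = min(len(word), max_gram)
        for size in range(min_gram, upper + 1):
            grams.append(word[:size])
    return grams


print(edge_ngrams("Albert Ali"))
# ['al', 'alb', 'albe', 'alber', 'albert', 'al', 'ali']
```

Typing "alb" matches the indexed fragment "alb" of Albert, which is what makes autocomplete work, while a fragment such as "bert" is never indexed because it does not start a word.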

document=True indicates the primary field for searching within. Additionally, the use_template=True in the text field allows us to use a data template to build the document that will be indexed.

Let’s create the template inside the templates directory of our customers app. Haystack looks for a file named after the model, so inside templates/search/indexes/customers/customer_text.txt, add the following:

{{object.first_name}}
{{object.last_name}}
{{object.other_names}}
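The data template decides which text ends up in the document field. The sketch below imitates the rendering in plain Python, without Django's template engine, to show what would be indexed for one customer; the helper function and the sample record are hypothetical.

```python
def render_customer_document(customer):
    """Mimic the data template: one name field per line."""
    fields = ("first_name", "last_name", "other_names")
    return "\n".join(customer[field] for field in fields)


# Hypothetical customer record, for illustration only.
customer = {
    "first_name": "Albert",
    "last_name": "Mwangi",
    "other_names": "Kim",
    "email": "albert@example.com",
}
document = render_customer_document(customer)
print(document)
print("albert@example.com" in document)  # False: email is not in the template
```

Only the three name fields are part of the searchable document, so a search term has to match one of them.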

Reindex Data

Now that our data is in the database, it’s time to put it in our search index. To do this, simply run ./manage.py rebuild_index. You’ll get totals of how many models were processed and placed in the index.

Indexing 20 customers

Alternatively, you can use RealtimeSignalProcessor, which automatically handles updates/deletes for you. To use it, add the following in the settings.py file.

HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'

Querying Data

We are going to use a search template and the Haystack API to query data.

Search Template

Add the Haystack URLs to your URLconf. Both url and include are imported from django.conf.urls.

url(r'^search/', include('haystack.urls')),

Let’s create our search template. Haystack's default search view renders search/search.html, so in templates/search/search.html, add the following code.

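{# Minimal markup; adapt the HTML to your own layout. #}
{% block head %}
{% endblock %}
{% block navbar %}
<nav>HOME</nav>
{% endblock %}
{% block content %}
<form method="get" action=".">
  {{ form.non_field_errors }}
  {{ form.as_p }}
  <input type="submit" value="Search">
</form>

{% if query %}
  <h3>Results</h3>
  <ul>
    {% for result in page.object_list %}
      <li>
        <p>First name : {{ result.first_name }}</p>
        <p>Last name : {{ result.last_name }}</p>
        <p>Balance : {{ result.balance }}</p>
        <p>Email : {{ result.email }}</p>
        <p>Status : {{ result.customer_status }}</p>
      </li>
    {% empty %}
      <li>No results found.</li>
    {% endfor %}
  </ul>
{% endif %}
{% endblock %}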

page.object_list is a list of SearchResult objects. Each result exposes the fields stored in the index as attributes, for example result.first_name, and gives access to the underlying model instance through result.object.

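Your complete project should now contain the Bank project package, the customers app (including search_indexes.py), and the templates directory that holds the search templates.

Now run the server, go to http://127.0.0.1:8000/search/, and run a search.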

A search of Albert will give results of all customers with the name Albert. If no customer has the name Albert, then the query will give empty results. Feel free to play around with your own data.

Haystack API

Haystack has a SearchQuerySet class that is designed to make it easy and consistent to perform searches and iterate over results. Much of the SearchQuerySet API will be familiar from Django’s ORM QuerySet.

In customers/views.py, add the following code:

from haystack.query import SearchQuerySet
from rest_framework.decorators import api_view
from rest_framework.response import Response

from .models import Customer


@api_view(['POST'])
def search_customer(request):
    # The search term is expected in the POST body, e.g. {"name": "Al"}.
    name = request.data['name']
    # Limit the search to Customer documents and match on the first name.
    customers = SearchQuerySet().models(Customer).autocomplete(
        first_name__startswith=name)

    # Build a JSON-serializable list from the stored fields of each result.
    searched_data = []
    for result in customers:
        searched_data.append({
            "first_name": result.first_name,
            "last_name": result.last_name,
            "balance": result.balance,
            "status": result.customer_status,
        })

    return Response(searched_data)
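Once this view is wired into the URLconf, it can be exercised with a small client. The article does not show a route for search_customer, so the /api/search/ path below is purely a hypothetical placeholder, as is the helper name. The sketch only builds the request so it can be inspected without a running server.

```python
import json
from urllib.request import Request


def build_search_request(name, endpoint="http://127.0.0.1:8000/api/search/"):
    """Build a JSON POST request carrying the search term in a "name" key."""
    body = json.dumps({"name": name}).encode("utf-8")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("Al")
print(request.get_method(), request.full_url)
print(request.data)
```

Passing the request to urllib.request.urlopen sends it; the view should answer with a JSON list of matching customers.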

autocomplete is a shortcut method to perform an autocomplete search. It must be run against fields that are either EdgeNgramField or NgramField.

In the above SearchQuerySet, we are using the startswith lookup to restrict the search to customers whose first_name begins with the given characters. For example, Al will only retrieve the details of customers whose first name starts with Al. Note that queries against the text document field only match content from the fields listed in the customer_text.txt file.

Haystack provides the following field lookups for performing queries:

content
contains
exact
gt
gte
lt
lte
in
startswith
endswith
range
fuzzy
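Field lookups are written Django-style, as field__lookup=value keyword arguments. The plain-Python analogy below shows roughly what several of them mean when applied to a single record. It is only an illustration: a real search back end analyzes text before matching, and the content and fuzzy lookups are left out because they depend on that analysis. The record and the helper names are hypothetical.

```python
LOOKUPS = {
    "exact": lambda value, term: value == term,
    "contains": lambda value, term: term in value,
    "startswith": lambda value, term: value.startswith(term),
    "endswith": lambda value, term: value.endswith(term),
    "gt": lambda value, term: value > term,
    "gte": lambda value, term: value >= term,
    "lt": lambda value, term: value < term,
    "lte": lambda value, term: value <= term,
    "in": lambda value, term: value in term,
    "range": lambda value, term: term[0] <= value <= term[1],
}


def matches(record, **conditions):
    """Check a record against field__lookup=value conditions."""
    for key, term in conditions.items():
        field, _, lookup = key.partition("__")
        # A bare field name is treated as an exact match in this sketch.
        if not LOOKUPS[lookup or "exact"](record[field], term):
            return False
    return True


# Hypothetical customer record.
customer = {"first_name": "Albert", "balance": 1500, "customer_status": "Active"}
print(matches(customer, first_name__startswith="Al"))   # True
print(matches(customer, balance__range=(1000, 2000)))   # True
print(matches(customer, balance__lt=1000))              # False
```

With Haystack itself, the same keyword arguments would be passed to SearchQuerySet().filter(...).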

Conclusion

A huge amount of data is produced at any given moment in social media, health, shopping, and other sectors. Much of this data is unstructured and scattered. Elasticsearch can be used to process and analyze this data into a form that can be understood and consumed.

Elasticsearch has also been used extensively for content search, data analysis, and queries. For more information, visit the Haystack and Elasticsearch sites.