Populating Elastic With Your Database


The Problem

Recently I was tasked with providing fast full-text search, across multiple tables, for a reasonably large database (~2m records).

Immediately I reached for some sort of Solr or Elastic solution. Being the lazy developer that I am, I took the path of least resistance (the most online samples on how to get started) and chose Elastic [edit: I would seriously consider Algolia for my next project, if I had fewer than 10k records to index].

Getting Elastic set up was relatively simple. Having access to a few AWS credits, I used the AWS Elasticsearch Service because I didn’t want the responsibility of maintaining the server myself.

Now to populate it with some data.

Taking data from a database and putting it in an Elastic index must be a common scenario with lots of examples and help, right?

Well, not quite as common as I thought, and not as easy as I was hoping.

I was “hoping” for a nice, simple user interface: point it at my database, select the data I want, import it into Elastic, done. I thought, given the relative maturity of Elastic, that this would be baked into the product, or that someone would have created an add-on for it already. At the very least, I thought, there would be a plethora of examples showing how to get data from a database into Elastic.

What I found was lacking, and it’s the reason I’m making this post: to help others looking to do something similar.

Logstash

The currently recommended solution for indexing database data is Logstash (up until 2015 it was Rivers).

[Image: Logstash pipeline diagram. Caption: “Logstash. Looks great from the diagram”]

With Logstash you can connect to your data, filter it, transform it, and then add it to Elastic. Great! Just what I was looking for.
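
Every Logstash configuration follows the same three-section shape, which is worth keeping in mind before the full example later on. A minimal sketch, using the stdin and stdout plugins that ship with Logstash:

# Every Logstash configuration has the same shape: inputs, optional filters, outputs.
# This minimal example just echoes stdin back to stdout.
input { stdin { } }
# filter { ... } (optional) transformations would go here
output { stdout { } }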

Logstash takes quite a bit of work to configure, though. Installation itself is fairly straightforward: just download and install on Windows or Linux. After installation there is no UI; all you get is a command-line tool, to which you supply configuration files.

Unfortunately, Logstash cannot connect to databases out of the box. To connect Logstash to your database you will first need to jump through a couple of hoops (sketched in the commands after this list):

  1. You will need to install the JDBC input plugin for Logstash; the instructions on how to do this can be found here.
  2. The JDBC plugin does not include any database drivers, so you will need to find the driver you need and download it. I am mostly connecting to Postgres, so I needed to download the Postgres JDBC driver. Make a note of where you download this driver to; you will need to add its path to your configuration later.
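
As a rough sketch, those two hoops look something like this on a Linux box (the plugin name and driver version here are assumptions; check the Logstash documentation and the Postgres JDBC download page for your versions):

# 1. Install the JDBC input plugin (older Logstash versions used bin/plugin instead)
bin/logstash-plugin install logstash-input-jdbc

# 2. Download the Postgres JDBC driver; note the path, the config needs it later
wget https://jdbc.postgresql.org/download/postgresql-9.4-1201.jdbc41.jar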

Provided you have no issues with your Java installation, you should now be ready to go.

Logstash Configurations

Most of your time will be spent fine-tuning configurations. Simple configurations are straightforward, i.e. connect to a single table with basic field types and import into Elastic. Anything outside these parameters can be challenging.

A basic configuration to connect to a Postgres database and import into a new Elastic index could look something like this:


# file: simple-out.conf
input {
    jdbc {
        # Postgres jdbc connection string to our database, mydb
        jdbc_connection_string => "jdbc:postgresql://localhost:5432/mydb"
        # The user we wish to execute our statement as
        jdbc_user => "postgres"
        # The user's password
        jdbc_password => "mypassword"
        # The path to our downloaded jdbc driver
        jdbc_driver_library => "/path/to/postgresql-9.4-1201.jdbc41.jar"
        # The name of the driver class for Postgresql
        jdbc_driver_class => "org.postgresql.Driver"
        # Our query
        statement => "SELECT * from contacts"
    }
}
output {
    elasticsearch {
        protocol => "http"
        index => "contacts"
        document_type => "contact"
        # Use the contacts table's uid column as the Elastic document id
        document_id => "%{uid}"
        host => "https://urltomyelasticinstance"
    }
}

The configuration file above can be run with the following Logstash command (shown here on Windows):

"\pathtologstashbindirectory\logstash.bat" -f simple-out.conf

The configuration, when run, will:

  1. Connect to the Postgres database.
  2. Get all the data from the ‘contacts’ table.
  3. Connect to your Elastic instance.
  4. Create a new index called ‘contacts’.
  5. Create a new document type called ‘contact’.
  6. Automatically create mappings for the fields in the contacts table.
  7. Populate the index with the data.

Easy, right?
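
To sanity-check the import, you can query the new index and its auto-generated mappings directly (a sketch; substitute your own Elastic endpoint):

# Fetch one document from the new index
curl -XGET "https://urltomyelasticinstance/contacts/_search?q=*&size=1"
# Inspect the mappings Elastic created automatically
curl -XGET "https://urltomyelasticinstance/contacts/_mapping"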

The challenges come when your scenario deviates from this very simple example.

In the next post I’ll detail some of the issues that you might come across when it does.