
Log Shipping with Filebeat and Elasticsearch

Introduction

Aside from being a powerful search engine, Elasticsearch has in recent years become very popular as a special-purpose logging storage and analysis solution. Logstash and Beats were eventually introduced to help Elasticsearch cope better with the volume of logs being ingested.

In this article, we’ll see how to use Filebeat to ship existing logfiles into Elasticsearch, so that they can be viewed and analysed in Kibana.

Since it’s not possible to cover all scenarios exhaustively and keep this article concise and at a reasonable length, we’ll make a few assumptions here:

  1. We’ll use Filebeat on Windows.
  2. We’ll ship logs directly into Elasticsearch, i.e. no Logstash. This is good if the scale of logging is not so big as to require Logstash, or if it is just not an option (e.g. using Elasticsearch as a managed service in AWS).
  3. We’re running on-premises, and already have log files we want to ship. If we were running managed services within the cloud, then logging to file would often not be an option, and in that case we should use whatever logging mechanism is available from the cloud provider.

Motivation

Logging is ubiquitous. You’ll find it in virtually every application out there. As such, it’s a problem that has been solved to death. There are so many logging frameworks out there, it’s just crazy.

And despite this, it baffles me why so many companies today still opt to write their own logging libraries, either from scratch or as abstractions of other logging libraries. They could just use one of the myriad existing solutions out there, which are probably far more robust and performant than theirs will ever be.

In order to realise just how stupid reinventing the wheel is, let’s take an example scenario. You have your big software monolith that’s writing to one or more log files. You begin to break up the monolith into microservices, and realise that you now have log files everywhere: in multiple applications across different servers. So… you need to write a logging library that all your microservices can use to write the logs to a central data store (could be any kind of relational or NoSQL database).

 

That’s great! Your logs are now in one place and you can search through them effortlessly! And your code is even DRY because you wrote another common library (hey, you only need like 35 of them now to write a new microservice).

But wait, having applications write directly to a log store is wrong on so many levels. Here are a few:

  1. Logs buffered in memory can be permanently lost if the application terminates unexpectedly.
  2. The application must take the performance hit of communicating with the remote endpoint.
  3. Through the logging library, the application must depend on a client library for that logging store. This is a form of coupling that doesn’t work very well with microservices. Even worse, if the logging library isn’t designed properly, it may carry dependencies on multiple logging stores.

These practical issues don’t even take into consideration the effort and complexity involved in creating a fully-featured logging library.

So what is the alternative? Simply keep writing to log files, and have a separate application (a log shipper) send those logs to a centralised store. Again, you don’t have to write the log shipper yourself. There are more than enough out there that you can just pick up and use.

 

This approach has a number of advantages:

  1. The log shipper is a separate, offline process, and will not directly impact the performance of your applications.
  2. Files are about as fast as it gets for an application to write logs.
  3. If there is a problem sending logs to the store, the original log files are still there as a single source of truth.
  4. The log shipper can send logs to the store in bulk. There is no need to dangerously buffer them in memory. They are already there on disk.
  5. If the original logger (to file) is configured to flush on each write, then it’s virtually impossible that logs will be lost.
  6. There are no additional dependencies for the application. Just the original logging library.
  7. Developers can leverage their knowledge of existing libraries, and don’t have to learn to use a new one every time they start a new job.
  8. Developers can focus on solving real problems, rather than reinventing the wheel.

“But wait!” I can already hear the skeptics. “Existing logging libraries are not fast enough!” goes one of them. To this chap, I say:

  • Have you really tried all existing logging libraries? (Only Chuck Norris has done that, as far as I can tell. Twice.)
  • Is it possible that you’re simply not using a library correctly? (Maybe tweak some configuration settings?)
  • Even if you really could write something faster, it’s likely that the benefit will be negligible, and that it will only be faster under certain conditions. Surely you have more important performance considerations than how many logs you can write per second.

“But wait!” goes another skeptic. “We might need to change the logging library later.” This is the same tired old excuse we’ve heard about data-access-layer code (“We might have to change our database!”) for some forty years now.

This is a very common over-engineering scenario in which we create an abstraction of an abstraction. NLog and other logging libraries can already plug into a variety of output destinations, so it’s very unlikely that you’ll ever need to change them. Actually, it’s more likely that you’ll run into limitations with abstractions such as Common.Logging, where you end up with a lowest common denominator and can’t make use of the advanced features that a specific logging library might offer.

Changing a logging library should be mostly a matter of changing packages, and updating code via search and replace. So if you need to change it, just change it. That’s way cheaper than the complexity introduced by an extra layer of unnecessary abstraction for no other reason than “just in case”. Especially if you’re doing microservices (properly) – you should be able to change your logging library and redeploy in a matter of minutes.

Beats and Filebeat

A beat is a lightweight agent that can siphon data from a source and send it to Logstash or Elasticsearch. There are several beats that can gather network data, Windows event logs, log files and more, but the one we’re concerned with here is Filebeat.

After you download Filebeat and extract the zip file, you should find a configuration file called filebeat.yml. For a quick start, look for filebeat.prospectors, and under it:

  • Change the value of enabled from false to true.
  • Under paths, comment out the existing entry for /var/log/*.log, and instead put in a path for whatever log you’ll test against.

This part of filebeat.yml should now look something like this:

filebeat.prospectors:

# Each - is a prospector. Most options can be set at the prospector level, so
# you can use different prospectors for various configurations.
# Below are the prospector specific configurations.

- type: log

  # Change to true to enable this prospector configuration.
  enabled: true

  # Paths that should be crawled and fetched. Glob based paths.
  paths:
    #- /var/log/*.log
    #- c:\programdata\elasticsearch\logs\*
    - C:\ConsoleApp1\*.log

Also, if your Elasticsearch server isn’t at the default localhost:9200, be sure to change that further down in the file.

In that ConsoleApp1, I have a file called Debug.log which contains the following log entries:

2018-03-18 15:43:40.7914 - INFO: Tick
2018-03-18 15:43:42.8215 - INFO: Tock
2018-03-18 15:43:42.8683 - ERROR: Error doing TickTock!
EXCEPTION: System.DivideByZeroException: Attempted to divide by zero.
   at ConsoleApp1.Program.Main(String[] args) in C:\ConsoleApp1\Program.cs:line 18

I’ll be using this simple (silly) example to show how to work with Filebeat.

Next, we can invoke filebeat.exe. When you do this, two folders get created. One is logs, where you can check Filebeat’s own logs and see if it has run into any problems. The other is data, and I believe this is where Filebeat keeps track of its position in each log file it’s tracking. If you delete this folder, it will go through the log files and ship them again from scratch.

Go into Kibana, and then into Management and Index Patterns. If all went well, Kibana will find the index that was created by Filebeat. You can create the index pattern filebeat-* to capture all Filebeat data:

For the time filter field, choose @timestamp, which is created and populated automatically by Filebeat.

In Kibana, you can now go back to Discover and see the log data (you may need to extend the time range):

As you can see, Filebeat successfully shipped the logs into Elasticsearch, but the logs haven’t been meaningfully parsed:

  • The message field contains everything, including timestamp, log level and actual message.
  • The exception stack trace was split into different entries per line.
  • The Time field shown in Kibana is actually the time when the log was shipped, not the timestamp of the log entry itself.

We’ll deal with these issues in the next sections.

Elasticsearch Pipeline

One way to properly parse the logs when they are sent to Elasticsearch is to create an ingest pipeline in Elasticsearch itself. There’s a good article by James Huang showing how to use this to ship logs from Filebeat to managed Elasticsearch in AWS.

By adapting the example in that article, we can create a pipeline for our sample log file. Run the following in Kibana’s Dev Tools:

PUT /_ingest/pipeline/logpipeline
{
  "description" : "Pipeline for logs from filebeat",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{TIMESTAMP_ISO8601:timestamp} - %{WORD:logLevel}: %{GREEDYDATA:message}"]
      }
    }
  ]
}

Now, getting that pattern right is a pain in the ass. The Grok Debugger is a great help, and there’s also a list of data types you can use.
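If you want to check a pattern against a sample log line without involving Filebeat at all, Elasticsearch’s simulate API is also handy. As a quick sketch (run in Dev Tools, assuming the logpipeline above has already been created):

POST /_ingest/pipeline/logpipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "2018-03-18 15:43:40.7914 - INFO: Tick"
      }
    }
  ]
}

If the grok pattern matches, the response shows the document broken up into separate timestamp, logLevel and message fields.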

In filebeat.yml, we now need to configure Filebeat to use this Elasticsearch pipeline:

output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["localhost:9200"]
  pipeline: logpipeline

We can now try indexing the logs again. First, let’s delete the Filebeat index:

DELETE filebeat-*

Next, delete Filebeat’s data folder, and run filebeat.exe again.

In Discover, we now see that we get separate fields for timestamp, log level and message:

If you get warnings on the new fields (as above), just go into Management, then Index Patterns, and refresh the filebeat-* index pattern.

Now, you’ll see that for the error entry, we did not get the full exception stack trace. If we go into the Filebeat logs, we can see something like this:

2018-03-18T23:16:26.614Z	ERROR	pipeline/output.go:92	Failed to publish events: temporary bulk send failure
2018-03-18T23:16:26.616Z	INFO	elasticsearch/client.go:690	Connected to Elasticsearch version 6.1.2
2018-03-18T23:16:26.620Z	INFO	template/load.go:73	Template already exists and will not be overwritten.
2018-03-18T23:16:27.627Z	ERROR	pipeline/output.go:92	Failed to publish events: temporary bulk send failure
2018-03-18T23:16:27.629Z	INFO	elasticsearch/client.go:690	Connected to Elasticsearch version 6.1.2
2018-03-18T23:16:27.635Z	INFO	template/load.go:73	Template already exists and will not be overwritten.

Correspondingly, in Elasticsearch we can see several errors such as the following accumulating:

[2018-03-18T23:16:25,610][DEBUG][o.e.a.b.TransportBulkAction] [8vLF54_] failed to execute pipeline [logpipeline] for document [filebeat-6.2.2-2018.03.18/doc/null]
org.elasticsearch.ElasticsearchException: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [   at ConsoleApp1.Program.Main(String[] args) in C:\ConsoleApp1\Program.cs:line 18]
	at org.elasticsearch.ingest.CompoundProcessor.newCompoundProcessorException(CompoundProcessor.java:156) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:107) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.ingest.Pipeline.execute(Pipeline.java:58) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.ingest.PipelineExecutionService.innerExecute(PipelineExecutionService.java:169) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.ingest.PipelineExecutionService.access$000(PipelineExecutionService.java:42) ~[elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.ingest.PipelineExecutionService$2.doRun(PipelineExecutionService.java:94) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:637) [elasticsearch-6.1.2.jar:6.1.2]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.1.2.jar:6.1.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [   at ConsoleApp1.Program.Main(String[] args) in C:\ConsoleApp1\Program.cs:line 18]
	... 11 more
Caused by: java.lang.IllegalArgumentException: Provided Grok expressions do not match field value: [   at ConsoleApp1.Program.Main(String[] args) in C:\ConsoleApp1\Program.cs:line 18]
	at org.elasticsearch.ingest.common.GrokProcessor.execute(GrokProcessor.java:67) ~[?:?]
	at org.elasticsearch.ingest.CompoundProcessor.execute(CompoundProcessor.java:100) ~[elasticsearch-6.1.2.jar:6.1.2]
	... 9 more

Elasticsearch is making a fuss because it can’t parse the lines from the exception. This is a problem because if Elasticsearch can’t parse the logs, Filebeat will keep trying to send them and never make progress. We’ll have to deal with that exception stack trace now.

Multiline log entries

In order to log the exception correctly, we have to enable multiline processing in Filebeat. In filebeat.yml, there are some multiline settings that are commented out. We need to enable them and change them a little, such that any line not starting with a date is appended to the previous line:

  ### Multiline options

  # Multiline can be used for log messages spanning multiple lines. This is common
  # for Java Stack Traces or C-Line Continuation

  # The regexp Pattern that has to be matched. The example pattern matches all lines starting with [
  multiline.pattern: '^\d{4}-\d{2}-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{4}'

  # Defines if the pattern set under pattern should be negated or not. Default is false.
  multiline.negate: true

  # Match can be set to "after" or "before". It is used to define if lines should be appended to a pattern
  # that was (not) matched before or after, or as long as a pattern is not matched, based on negate.
  # Note: After is the equivalent to previous and before is the equivalent to next in Logstash
  multiline.match: after

Configuring Filebeat to support multiline log entries is not enough, though. We also need to update the pipeline in Elasticsearch to apply the grok filter on multiple lines ((?m)) and to separate the exception into a field of its own. I’ve had to split the two cases (with and without exception) into separate patterns in order to make it work.

PUT /_ingest/pipeline/logpipeline
{
  "description" : "Pipeline for logs from filebeat",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["(?m)%{TIMESTAMP_ISO8601:timestamp} - %{WORD:logLevel}: (?<message>.*?)\n(%{GREEDYDATA:exception})?",
            "(?m)%{TIMESTAMP_ISO8601:timestamp} - %{WORD:logLevel}: %{GREEDYDATA:message}"]
      }
    }
  ]
}

After deleting the index and the Filebeat data folder, and re-running Filebeat, we now get a perfect multiline exception stack trace in its own field!

Fixing the Timestamp

We now have one last issue to fix: the logs being ordered by when they were inserted into the index, rather than the log timestamp. This is actually a pretty serious problem from a usability perspective, because it means people troubleshooting production issues won’t be able to use Kibana’s time filter (e.g. last 15 minutes) to home in on the most relevant logs.

In order to fix this, we need to augment our pipeline with a date processor:

PUT /_ingest/pipeline/logpipeline
{
  "description" : "Pipeline for logs from filebeat",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["(?m)%{TIMESTAMP_ISO8601:timestamp} - %{WORD:logLevel}: (?<message>.*?)\n(%{GREEDYDATA:exception})?",
        "(?m)%{TIMESTAMP_ISO8601:timestamp} - %{WORD:logLevel}: %{GREEDYDATA:message}"]
      },
      "date" : {
        "field" : "timestamp",
        "target_field" : "@timestamp",
        "formats" : ["yyyy-MM-dd HH:mm:ss.SSSS"]
      }
    }
  ]
}

The names of the fields in the date section are important. We’re basically telling it to take whatever is in the timestamp field (based on one of the earlier patterns) and apply it to @timestamp. As it happens, @timestamp is what is being used as the time-series field, which gives us exactly the result we want after reshipping the logs (be sure to extend the time window in Kibana accordingly to see the logs):

Summary

In this article, we’ve explored log shipping as a way to augment regular file logging with purpose-built tools, rather than reinventing the wheel and writing yet another logging library. The latter approach is not only a tremendous waste of time; it also carries reliability, performance and maintainability implications.

We have specifically looked at using Filebeat to ship logs directly into Elasticsearch, which is a good approach when Logstash is either not necessary or not possible to have. In order to get our log data nicely structured so that we can analyse it in Kibana, we’ve had to set up an ingest pipeline in Elasticsearch.

We progressively refined both our Filebeat configuration and this pipeline in order to split up our logs into separate fields, process multiline exception stack traces, and use the original timestamp in the logs as the time series field.

There is a lot more that Filebeat can do. For instance, it may be configured with multiple prospectors, meaning it can read log files from different places and apply different options accordingly. One useful example of this is to add a custom field indicating the origin of the logs – handy when the log data itself does not include the application name, for instance – an example of this follows below.
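This is only a rough sketch (the second path and the field name are invented for illustration), but in filebeat.yml, two prospectors each tagging their logs with a custom application field might look something like this:

filebeat.prospectors:

- type: log
  enabled: true
  paths:
    - C:\ConsoleApp1\*.log
  # Custom field identifying which application these logs came from
  fields:
    application: ConsoleApp1
  fields_under_root: true

- type: log
  enabled: true
  paths:
    - C:\OtherApp\*.log
  fields:
    application: OtherApp
  fields_under_root: true

With fields_under_root set to true, the application field ends up as a top-level field in the indexed documents, so it can be used directly for filtering in Kibana.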

Indexing and Searching Geopolygons using Elasticsearch

Elasticsearch is great for indexing and searching text, but it also has a lot of functionality related to searching points and regions on the world map. In this article, we’ll learn how to index polygons corresponding to territories in the world, and find whether a point is in any indexed polygon.

Building Polygons with Geocoordinates

Back in school, we (hopefully) learned that a point in 2D space can be represented as an (x, y) pair of coordinates. A point in the world can similarly be identified by a (latitude, longitude) pair of geocoordinates. We can obtain geocoordinates for a location by clicking on the map in Google Maps or similar tools.

The analogy is not perfect though; because of the curvature of the Earth, geocoordinates don’t behave like coordinates on a flat, linear grid. This is not really important for us; the point is that we can represent any given point on the Earth’s surface by means of latitude and longitude.

Once we can identify points, it’s natural to extend the concept to 2D geometry. By taking several points, we can create polygons that mark the boundaries of a given territory, such as a country or state. Jeremy Hawes’ Google Maps Polygon Coordinates Tool is great for building such polygons.

Using this tool, we can very easily construct a rough polygon representing the state of Wyoming in the US. Wyoming is great to use as a simple example because it’s roughly rectangular, so we only need four points for a workable approximation.

Below the map in this polygon tool, you’ll get the coordinates of the points along with some extra JavaScript (which you could later paste directly into your own code). In this case, we’ve got the following coordinates in (latitude, longitude) format:

45.01967,-104.04405
44.99904,-111.03084
41.011,-111.04131
41.00193,-104.03375

Once we have the points that make up the polygon, we can feed them into Elasticsearch.

Indexing Geopolygons in Elasticsearch

Before we can index anything, we need to create a mapping that defines the structure of an index, including any fields and their data types. The Mapping Geo Shapes page in the Elasticsearch documentation provides a starting point. However, the documentation is crap, and if you follow the example in the docs closely, you’ll get an error:

After a quick search, this Stack Overflow answer reveals the cause of the problem: Elasticsearch no longer likes the string data type, and expects you to use text instead. This wouldn’t have been a problem if they bothered to update their documentation once in a while. Anyhow, our mapping request for this example will be as follows:

PUT /regions
{
  "mappings": {
    "region": {
      "properties": {
        "name": {
          "type": "text"
        },
        "location": {
          "type": "geo_shape"
        }
      }
    }
  }
}

This essentially means that each region item in the regions index will have a name and a location, the latter being the polygon itself. While we will be focusing exclusively on polygons in this article, it is worth noting that the geo_shape data type supports a lot of other geometric constructs – refer to the Geo-Shape documentation for more information.

Once our mapping is in place, we can proceed to index our polygons. The Indexing Geo Shapes documentation page shows how to do this. There’s a catch though: Elasticsearch expects to receive coordinates in (longitude, latitude) format, which is the reverse of what we’ve been using so far. We can use a simple regular expression (e.g. in Notepad++) to swap our coordinates:

(\-?\d+\.?\d*),(\-?\d+\.?\d*)
\2,\1

The first line shows the regular expression used to match coordinates, and the second line shows what each match should be replaced with, i.e. the swapped coordinates.
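For example, running this find-and-replace over the first Wyoming point:

45.01967,-104.04405

…turns it into:

-104.04405,45.01967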

Let’s use the following query to try to index our Wyoming polygon:

PUT /regions/region/wyoming
{
    "name" : "Wyoming",
    "location" : {
        "type" : "polygon", 
        "coordinates" : [[ 
        [ -104.04405,45.01967 ],
        [ -111.03084,44.99904 ],
        [ -111.04131,41.011   ],
        [ -104.03375,41.00193 ]
        ]]
    }
}

This actually fails with an error:

This is because Elasticsearch expects the polygon to be closed, i.e. it must return to the starting point. Another thing to watch out for is any polygons that have self-intersections, which Elasticsearch doesn’t allow either.

We can fix our error by simply repeating the first coordinate at the end:

PUT /regions/region/wyoming
{
    "name" : "Wyoming",
    "location" : {
        "type" : "polygon", 
        "coordinates" : [[ 
        [ -104.04405,45.01967 ],
        [ -111.03084,44.99904 ],
        [ -111.04131,41.011   ],
        [ -104.03375,41.00193 ],
        [ -104.04405,45.01967 ]
        ]]
    }
}

It should work now:

Great! Our Wyoming polygon is now in Elasticsearch.

Querying Geopolygons in Elasticsearch

We can again turn to the Elasticsearch documentation for examples of how to query our geopolygon. We can do this by taking a circle with a given radius and seeing whether it intersects the polygon, as shown in Querying Geo Shapes. Don’t confuse this with the Geo Polygon Query documentation, which is actually the opposite of our situation (i.e. having a point in Elasticsearch, and providing the polygon to test against at query time).

To test this, we’ll pick a point somewhere in Wyoming. I used Google Maps to pick a point within Yellowstone National Park, which for all we know might just be where Yogi Bear lives:

Having obtained the coordinates, we can hit Elasticsearch with a query:

GET /regions/region/_search
{
  "query": {
    "geo_shape": {
      "location": { 
        "shape": { 
          "type":   "circle", 
          "radius": "25m",
          "coordinates": [ 
            -109.874838, 44.439550
          ]
        }
      }
    }
  }
}

And you’ll see that Wyoming is actually returned in the results:

You’ll also notice that Elasticsearch gave us back all the coordinate data which we don’t really care about in this case. This can be pretty inefficient if you’re using very large and detailed polygons. We can filter that out by specifying the _source property:

GET /regions/region/_search
{
  "_source": "name", 
  "query": {
    "geo_shape": {
      "location": { 
        "shape": { 
          "type":   "circle", 
          "radius": "25m",
          "coordinates": [ 
            -109.874838, 44.439550
          ]
        }
      }
    }
  }
}

The results are now nice and clean:

Next, we’ll take a point in Texas and see that we don’t get results for that:
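For example, using rough coordinates for a point in Austin, Texas (around latitude 30.27, longitude -97.74 – and remember that Elasticsearch wants longitude first):

GET /regions/region/_search
{
  "_source": "name",
  "query": {
    "geo_shape": {
      "location": {
        "shape": {
          "type":   "circle",
          "radius": "25m",
          "coordinates": [
            -97.74, 30.27
          ]
        }
      }
    }
  }
}

This time, the hits in the response should come back empty.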

Geopolygons with Holes

Some territories aren’t simple polygons; they contain other territories inside them, and so the polygon has a hole. Examples include:

  • Rome (Vatican City is a hole within it)
  • New South Wales (Australian Capital Territory is a hole within it)
  • South Africa (Lesotho is a hole within it)

The Indexing Geo Shapes documentation page (which we’ve referred to earlier) explains how to account for holes in polygons you index. Let’s see how this works using a practical example.

The above image shows what New South Wales, Australia looks like in Google Maps. Notice the Australian Capital Territory state inside it. Using Jeremy Hawes’ aforementioned polygon tool, we can draw a very rough polygon for New South Wales:

This gives us the following coordinates (lat, lon) for New South Wales:

-28.92704,141.04445
-33.97411,141.00841
-37.51381,149.94544
-34.98252,150.7789
-32.70393,152.18365
-28.24141,153.49901
-28.98426,148.87874 

We will also need a polygon for Australian Capital Territory. Again, this will be a really rough approximation just for the sake of example:

Our coordinates for Australian Capital Territory are:

-35.91185,149.05898
-35.36119,149.14473
-35.31932,149.40076
-35.11429,149.09984
-35.3126,148.80286
-35.71989,148.81557 

Next, we’ll index Australian Capital Territory. This is nothing new, but remember that we must take care to swap the coordinates so that they become (lon, lat), and close the polygon by repeating the first coordinate pair at the end.

PUT /regions/region/act
{
    "name" : "Australian Capital Territory",
    "location" : {
        "type" : "polygon", 
        "coordinates" : [[ 
            [ 149.05898,-35.91185 ],
            [ 149.14473,-35.36119 ],
            [ 149.40076,-35.31932 ],
            [ 149.09984,-35.11429 ],
            [ 148.80286,-35.3126  ],
            [ 148.81557,-35.71989 ],
            [ 149.05898,-35.91185 ]
        ]]
    }
}

For New South Wales, we do something special: we give it two polygons.

PUT /regions/region/nsw
{
    "name" : "New South Wales",
    "location" : {
        "type" : "polygon", 
        "coordinates" : [
            [
                [ 141.04445,-28.92704 ],
                [ 141.00841,-33.97411 ],
                [ 149.94544,-37.51381 ],
                [ 150.7789, -34.98252 ],
                [ 152.18365,-32.70393 ],
                [ 153.49901,-28.24141 ],
                [ 148.87874,-28.98426 ],
                [ 141.04445,-28.92704 ]              
            ],
            [ 
                [ 149.05898,-35.91185 ],
                [ 149.14473,-35.36119 ],
                [ 149.40076,-35.31932 ],
                [ 149.09984,-35.11429 ],
                [ 148.80286,-35.3126  ],
                [ 148.81557,-35.71989 ],
                [ 149.05898,-35.91185 ]
            ]
        ]
    }
}

The first polygon is the New South Wales polygon. The second is the one for Australian Capital Territory. The way Elasticsearch interprets this is that the first polygon is the main one; all subsequent ones are holes in the main polygon.

Once this has also been indexed, we can test this. Remember to swap your coordinates – Google Maps uses (lat, lon) whereas Elasticsearch uses (lon, lat). Let’s take a point in New South Wales – somewhere in Sydney for instance:
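As a sketch, with rough coordinates for central Sydney (around latitude -33.87, longitude 151.21), the query looks just like the ones we ran for Wyoming:

GET /regions/region/_search
{
  "_source": "name",
  "query": {
    "geo_shape": {
      "location": {
        "shape": {
          "type":   "circle",
          "radius": "25m",
          "coordinates": [
            151.21, -33.87
          ]
        }
      }
    }
  }
}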

Our point was correctly identified as being in New South Wales. Now, let’s take a point in Canberra so that we can test out Australian Capital Territory:
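Again as a rough sketch, with approximate coordinates for Canberra (around latitude -35.28, longitude 149.13):

GET /regions/region/_search
{
  "_source": "name",
  "query": {
    "geo_shape": {
      "location": {
        "shape": {
          "type":   "circle",
          "radius": "25m",
          "coordinates": [
            149.13, -35.28
          ]
        }
      }
    }
  }
}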

Elasticsearch correctly returned Australian Capital Territory in the results. What is even more significant is that it did not return New South Wales, which it would otherwise have done had we not specified the hole when we indexed it.

Summary

After a brief introduction to geocoordinates and geopolygons, we saw how we can index geopolygons in Elasticsearch and then run queries to find out in which polygon(s) a point belongs. In a slightly more advanced scenario, we saw how to deal with polygons that have holes.

Setting Up Elasticsearch on Linux Ubuntu

Elasticsearch is a lightning-fast and highly scalable search engine built on top of Apache Lucene. In this article, we’re going to see how we can quickly set it up on an Ubuntu Linux environment (using Ubuntu 16.10 here) to be able to play around with it. We do not cover configuring Elasticsearch or setting up a cluster. To set up Elasticsearch on Windows, see “Setting Up Elasticsearch and Kibana on Windows” instead.

Before we can set up Elasticsearch itself, we need Java. We can follow these instructions to set up Java on Ubuntu. Before proceeding, verify that the JAVA_HOME environment variable is set:

echo $JAVA_HOME

It is likely that you won’t see anything as a result of this command. That’s because while the Java setup instructions do set this environment variable, it does not get applied to your current session. Try opening a new terminal window or reboot the machine, and chances are that your JAVA_HOME will be set correctly. If not, you may have to set JAVA_HOME manually.
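One way to set it manually – just a sketch, and the path below is an assumption that depends on how and where you installed Java – is to export it from your shell profile:

# Adjust the path to wherever your Java installation actually lives
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle' >> ~/.bashrc
source ~/.bashrc
echo $JAVA_HOME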

Once Java is correctly set up (complete with the JAVA_HOME environment variable), we can proceed to set up Elasticsearch. By going to the Elasticsearch downloads page, we can download (among other things) the Debian package containing Elasticsearch:

We can now install the Debian package using dpkg. At the time of writing this article, the latest version of Elasticsearch is 5.4, so after opening a Terminal window based in the Downloads folder, we can use the following command to install Elasticsearch:

sudo dpkg -i elasticsearch-5.4.0.deb

Elasticsearch is now installed, but it is not yet running! So first, we’ll enable the Elasticsearch service so that it will start automatically when the machine is rebooted:

sudo systemctl enable elasticsearch.service

We can now start the Elasticsearch service.

sudo systemctl start elasticsearch.service

The Elasticsearch HTTP endpoint will need a few seconds before it is reachable. After that, we can verify that Elasticsearch is running either by going to localhost:9200 from a web browser, or by hitting that same endpoint using curl in the command line:

curl -X GET http://localhost:9200/

In either case, you should get a response with some JSON data about the Elasticsearch instance you’re running:
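The exact values will differ, but the response should look something along these lines (abridged):

{
  "name" : "<node name>",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "5.4.0",
    ...
  },
  "tagline" : "You Know, for Search"
}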

We are now all set up to play around with Elasticsearch! Since we didn’t configure anything, we have a single instance with all default settings. If you’re planning to use Elasticsearch in a production environment, you will of course want to read up on configuring it properly and setting up a cluster to ensure that it can handle the use cases you need and that it can survive failure scenarios.

Setting Up Elasticsearch and Kibana on Windows

Elasticsearch is fantastic to index your data so that it can be searched by its lightning-fast search engine. With Kibana, you also get the ability to analyse and visualise that data. Both of these products are provided for free by Elastic.

Installing Java Runtime Environment

Elastic products are developed in Java, so you’ll need the Java Runtime Environment (JRE) to run them. Get the latest JRE from the relevant ugly Oracle downloads page. Either use the .exe installer, or download the .zip file and then extract the folder inside.

Either way, take note of the JRE folder location and add it as an environment variable. To do this, hit the start menu and type “environment variables”:

In the window that comes up, go on Environment Variables…:

You will now see the user and system environment variables. Hit New… under the System variables:

Name it JAVA_HOME, and in the value put in the path to the JRE folder (not its bin folder):

You can now OK out of the various dialog windows.
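Alternatively, if you prefer the command line, the same thing can be done from an elevated command prompt using setx – the path below is just an example, so use your actual JRE folder:

setx JAVA_HOME "C:\Program Files\Java\jre1.8.0_131" /M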

Setting up Elasticsearch

Go to the Elasticsearch product page, and hit Download:

In the next page, download the ZIP file:

Extract the folder in the .zip file somewhere.

You can now run elasticsearch.bat. If you get “The syntax of the command is incorrect”, you probably didn’t set the JAVA_HOME environment variable as explained in the previous section.

elasticsearch.bat

Running this command, you should see a bunch of initialisation output:

…and if you browse to localhost:9200, you should see some JSON returned:

Now that we know it’s working, we can install it as a Windows service. So press Ctrl+C to kill the instance of Elasticsearch you just ran, and instead run:

elasticsearch-service.bat install

This should install it as a service:

This installs it as Manual startup type, and does not start it. You probably want to change that to Automatic (Delayed Start), from the Services window in Microsoft Windows, and also Start it. Once you have done that, give it a few seconds to start, and then verify again that you get a response from localhost:9200.
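If you’d rather do that from an elevated command prompt than from the Services window, something along these lines should work – note that the service name is an assumption here (it is typically elasticsearch-service-x64, but check what elasticsearch-service.bat actually registered on your machine):

sc config elasticsearch-service-x64 start= delayed-auto
sc start elasticsearch-service-x64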

Setting up Kibana

Next, we will set up Kibana. Grab the Windows .zip file from the Kibana downloads page:

Extract it wherever your heart desires.

Make sure Elasticsearch is running. Then, in Kibana’s bin folder, run kibana.bat:

kibana.bat

Some text will be written to the console as Kibana is initialised, and then you should be able to go to localhost:5601 and actually get a webpage:

Now we know that it works. Let’s set it up as a service. Kill the instance we just ran using Ctrl+C first.

Oh crap, Kibana does not come with a service installer! What are we gonna do?

Enter NSSM, the Non-Sucking Service Manager, which we can use to install just about any application as a Windows service, using either the command line or an interactive GUI. After downloading NSSM, we can install Kibana as a Windows service with a command like the following from NSSM’s win64 folder:

nssm install "Kibana 5.2.2" C:\[...]\kibana-5.2.2-windows-x86\bin\kibana.bat

With an elevated command prompt, we can also configure the Windows service, such as setting the startup type and the description:

nssm set "Kibana 5.2.2" Start "SERVICE_DELAYED_AUTO_START"
nssm set "Kibana 5.2.2" Description "Kibana lets you visualize your Elasticsearch data"

Finally, we start the service:

nssm start "Kibana 5.2.2"

If all goes well:

…then we can go back to localhost:5601 and verify that it’s really running.

With that, it’s all set up. All that’s needed is an index with some data that you can use Kibana to visualise, but that’s beyond the scope of this article.