By Niren Shah
In cybersecurity, knowing the history of an asset is absolutely critical. It lets you understand how your environment is evolving over time: which new assets showed up, which ones disappeared, who is talking to whom from a communication standpoint, etc. Consider the following simple problem: for every asset (say, a network interface identified by its IP address), you want to keep its current state (the services it exposes) along with the last "N" days of history of that state.
Most storage technologies will let you model a data structure to do that. The problem is when you're dealing with millions of assets that generate hundreds of thousands of events every second. At that kind of scale, you want to be as efficient as possible and you need to avoid as many transactions as possible - especially the "get-and-check-and-write" problem. By that I mean:
- get the document for the asset from the store
- check whether it exists and inspect its current state
- modify it client-side (update the current state, append to the history)
- write it back, hoping nothing else modified it in the meantime
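Here's a minimal sketch of that anti-pattern, written against the elasticsearch Python client and the index layout used later in this post; the helper name and retry strategy are just for illustration:

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError, NotFoundError

es = Elasticsearch("http://localhost:9200")

def get_check_and_write(ip, dt, services):
    # 1. GET: one round trip to fetch the document, if it exists
    try:
        doc = es.get(index="interface", doc_type="_doc", id=ip)
        source, version = doc["_source"], doc["_version"]
    except NotFoundError:
        source, version = {"history": {}}, None

    # 2. CHECK and modify the state client-side
    source["current"] = {"services": services}
    source["history"][dt] = {"services": services}
    while len(source["history"]) > 2:  # keep N = 2 days
        source["history"].pop(min(source["history"]))

    # 3. WRITE: a second round trip, guarded by the version we read;
    #    if anything else touched the document in between, this
    #    fails with a conflict and the whole dance starts over
    try:
        es.index(index="interface", doc_type="_doc", id=ip,
                 body=source, version=version)
    except ConflictError:
        get_check_and_write(ip, dt, services)

Two round trips per event, plus retries under contention - that's a lot of overhead at hundreds of thousands of events per second.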
Ideally, you want to create idempotent "upsert" transactions to make the system as fast as possible. That means:
- a single request per event: create the document if it doesn't exist, update it if it does
- no client-side read-modify-write cycle, and therefore no version conflicts to retry
- replaying the same event for the same date leaves the document in the same state
We use Elasticsearch extensively across a wide variety of products. If you're not familiar with Elasticsearch, the official getting-started guide is a good place to begin, and the official Docker image makes it super easy to spin up an instance and play around.
Elasticsearch is awesome. It is a distributed full-text indexing technology that doubles as a perfectly reasonable document storage engine. It scales to hundreds of servers handling millions of transactions per minute. It has some really elegant aggregation features and a well-designed JSON-based API. It also has pluggable scripting engines that you can use to perform on-the-fly manipulation of requests. Now, generally, I'm not a big fan of writing business logic inside storage technologies (stored procedures and such), as it can get out of hand if you're not careful. But selectively using these features can make for an elegant architecture.
For this example, we're using a couple of features:
- stored scripts, written in Painless, Elasticsearch's built-in scripting language
- scripted upserts via the update API, which run the stored script whether the document is being created or updated
The idea is pretty simple:
- every observation of an asset is an upsert keyed by the asset's ID (the interface's IP address here)
- a stored script runs on each upsert: it sets the "current" state, records the same state in a date-keyed "history" map, and trims that map to the last "N" days
Here's the script, annotated:
// automatically create a new history section if it doesn't exist,
// using a TreeMap, which keeps the history sorted by date
// (note: once the document has round-tripped through Elasticsearch,
// this map comes back as a plain map, so the trim below assumes
// the dates arrive in ascending order)
if (ctx._source.history == null) {
  ctx._source.history = new TreeMap();
}
// set the incoming data as the current state and as the history
// entry for the supplied date
ctx._source.history[params.dt] = params.current;
ctx._source.current = params.current;
// trim the map to keep only N (2 in our example) days of history by
// dropping the oldest key (each upsert adds at most one new day)
if (ctx._source.history.size() > 2) {
  ctx._source.history.remove(ctx._source.history.keySet().iterator().next());
}
So, long story short, here's what the document looks like after the first upsert (for 20180101):
"hits": [
{
"_index": "interface", "_type": "_doc",
"_id": "192.168.1.1", "_score": 1.0,
"_source": {
"current": {
"services": [{ "port": "22", "proto": "TCP" } ]
},
"history": {
"20180101": { "services": [ { "port": "22", "proto": "TCP" } ] }
}
}
}
]
"hits": [
{
"_index": "interface", "_type": "_doc",
"_id": "192.168.1.1", "_score": 1.0,
"_source": {
"current": {
"services": [ { "port": "22", "proto": "TCP" }, { "port": "12345", "proto": "TCP" } ]
},
"history": {
"20180101": { "services": [ { "port": "22", "proto": "TCP" } ] },
"20180102": { "services": [ { "port": "22", "proto": "TCP" }, { "port": "12345", "proto": "TCP" } ] }
}
}
}
]
There you go! You can upsert as many assets as you wish without worrying whether an asset already exists, the script does the dirty work of keeping the latest "N" days of history, and you always have the "current" data. You've also removed the whole optimistic-locking problem, since each upsert is a single idempotent transaction! And trust me, it is blindingly fast: 100K upserts a second on a reasonably sized cluster is very possible :).
What a helpful little Painless script!!!
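To make that throughput claim concrete, here's a minimal sketch of driving these scripted upserts in batches from Python with the elasticsearch client's bulk helper. It assumes the "interface" index and the "keep_history" stored script from the HTTP requests below already exist; the events list is made up for illustration.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")

def history_actions(events):
    # each event: (ip, "YYYYMMDD" date, list of service dicts)
    for ip, dt, services in events:
        yield {
            "_op_type": "update",
            "_index": "interface",
            "_type": "_doc",
            "_id": ip,
            "_retry_on_conflict": 3,
            "scripted_upsert": True,
            "script": {
                "id": "keep_history",
                "params": {"current": {"services": services}, "dt": dt},
            },
            "upsert": {},
        }

events = [
    ("192.168.1.1", "20180102", [{"port": "22", "proto": "TCP"},
                                 {"port": "12345", "proto": "TCP"}]),
    ("192.168.1.2", "20180102", [{"port": "443", "proto": "TCP"}]),
]

# send thousands of upserts per HTTP round trip; each one is a
# self-contained, idempotent operation - no get-and-check-and-write
for ok, item in streaming_bulk(es, history_actions(events), chunk_size=1000):
    if not ok:
        print("failed:", item)

Because each action is independent, you can fan these out across as many writer processes as you like without any coordination.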
HTTP Requests
# once you have an Elasticsearch instance running, you can just "curl" the following HTTP requests to play around
# create an index with the following mapping to hold the current and historical data
PUT http://localhost:9200/interface
Content-Type: application/json
{
"settings": {
"number_of_replicas": 0
},
"mappings": {
"_doc": {
"properties": {
"current": {"type": "nested"},
"history": {"type": "nested"}
}
}
}
}
# create a stored script to execute at every upsert to keep "N" (2 in this example) days of history
POST http://localhost:9200/_scripts/keep_history
Content-Type: application/json
{
"script": {
"lang": "painless",
"source":
"if (ctx._source.history == null) ctx._source.history = new TreeMap();
ctx._source.history[params.dt]=params.current;
ctx._source.current = params.current;
if (ctx._source.history.size() > 2)
ctx._source.history.remove(ctx._source.history.keySet().iterator().next());
"
}
}
# upsert a document for a date (20180101)
POST http://localhost:9200/interface/_doc/192.168.1.1/_update
Content-Type: application/json
{
"scripted_upsert":true,
"script": {
"id": "keep_history",
"params": {
"current": {
"services": [{"port": "22","proto": "TCP"}]
},
"dt":"20180101"
}
},
"upsert": {}
}
# upsert a document for another date (20180102)
POST http://localhost:9200/interface/_doc/192.168.1.1/_update
Content-Type: application/json
{
"scripted_upsert":true,
"script": {
"id": "keep_history",
"params": {
"current": {
"services": [
{"port": "22","proto": "TCP"},
{"port": "12345","proto": "TCP"}
]
},
"dt":"20180102"
}
},
"upsert": {}
}
# look at the data
POST http://localhost:9200/interface/_search
Content-Type: application/json
{
"query": {
"match_all": {}
}
}
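And to see the trimming in action, upsert a third day; with N = 2, 20180101 should fall out of the history. A quick sketch with the Python client (same assumptions as above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# upsert a third day (20180103); the stored script keeps only 2 days
es.update(
    index="interface", doc_type="_doc", id="192.168.1.1",
    body={
        "scripted_upsert": True,
        "script": {
            "id": "keep_history",
            "params": {
                "current": {"services": [{"port": "22", "proto": "TCP"}]},
                "dt": "20180103",
            },
        },
        "upsert": {},
    },
)

# the oldest day (20180101) has been trimmed away
doc = es.get(index="interface", doc_type="_doc", id="192.168.1.1")
print(sorted(doc["_source"]["history"]))  # ['20180102', '20180103']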