Best practice tips for configuring Elasticsearch

At its core, Elasticsearch is a collection of Lucene indexes that communicate with each other. Using an Elasticsearch-backed case allows Nuix to spread a single case across many Lucene indexes and, by extension, across many servers.

The following information offers basic best-practice suggestions that apply to most deployments for configuring Elasticsearch in a production environment.

Tips on configuring hardware

Do not share nodes on a physical server, for example as VMs or in a cluster. Provide each Elasticsearch node with its own physical server.

Avoid running multiple instances of Elasticsearch on a single VM.

If you use a single physical server, only use standard Nuix Derby or Lucene cases.

It is better to have more low-specification machines than a few high-specification machines in an Elasticsearch cluster. In other words, scaling horizontally is generally preferable to scaling vertically.

For example, it is preferable to have 6 Elasticsearch nodes with 4-8 CPU cores, 64 GB memory, and a 4 TB data drive than to have 2 nodes with 12 CPU cores, 256 GB memory, and a 12 TB data drive.

Fast disks are essential. Use SSDs to store Elasticsearch indexes. Use local disks, not network shares. Network latency or network interruptions can disrupt performance.

Tips on configuring memory

Set an Elasticsearch JVM heap to no more than:

30 GB or 31 GB for each node. Beyond this allocation the JVM can no longer use compressed object pointers, so performance degrades; you would need to increase the heap to around 64 GB before a larger allocation improves performance.

50% of the available physical RAM. You want to leave enough available memory for the filesystem cache.

It is fine to run data nodes with 16 GB JVM max heap, and 32 GB total installed memory. If running coordinating nodes, it is better to run those with a maximum heap size of 31 GB.
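For example, a data node with 32 GB of installed memory could use the following heap settings in `jvm.options` (the values are illustrative; always set the minimum and maximum heap to the same value):

```
## jvm.options (illustrative heap settings for a data node with 32 GB RAM)
## Xms and Xmx must match so the heap never resizes at runtime
-Xms16g
-Xmx16g
```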

Elasticsearch does not need redundant data stores such as RAID 1, because replica shards protect the index more efficiently. You can, however, use RAID 0 to improve disk access times.

There is no need to enable swap; disable it with the command `sudo swapoff -a`.

Usually, Elasticsearch is the only service running on a box, and its memory usage is controlled by the JVM options.

For more details, see: https://www.elastic.co/guide/en/elasticsearch/reference/7.8/setup-configuration-memory.html#disable-swap-files

Configure “Swappiness” to reduce the kernel’s tendency to swap (which should not happen under normal circumstances), while still allowing the whole system to swap in emergency conditions.

For more details, see: https://www.elastic.co/guide/en/elasticsearch/reference/7.8/setup-configuration-memory.html#swappiness
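As a sketch, the swappiness value of 1 suggested in the Elasticsearch documentation can be set persistently like this (the file path is one common convention, not a requirement):

```
# /etc/sysctl.d/99-elasticsearch.conf
# Keep swapping possible only as a last resort under memory pressure
vm.swappiness = 1
```

Apply the change with `sudo sysctl --system` or reboot.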

Configure the JVM to disable swapping for Elasticsearch memory with either `mlockall` (Linux) or `VirtualLock` (Windows). The caveat: `mlockall` might cause the JVM or shell session to exit if it tries to allocate more memory than is available. For more details, see: https://www.elastic.co/guide/en/elasticsearch/reference/7.8/setup-configuration-memory.html#bootstrap-memory_lock
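A minimal sketch of the corresponding setting, assuming a Linux host:

```yaml
# elasticsearch.yml: request that the JVM lock its memory via mlockall
bootstrap.memory_lock: true
```

If Elasticsearch runs under systemd, the service also needs `LimitMEMLOCK=infinity` (for example in a systemd override file) for the lock to succeed.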

Tips on allocating shards

Shard allocation is very important for good performance, but also hard to get right. Each Elasticsearch shard is a full Lucene index, with all the trappings and overhead that implies.

Allocating too few shards can mean that each shard, and therefore each underlying Lucene index, grows too large for a single node to handle efficiently.

Allocating too many shards means that each ES node must handle dozens of Lucene indexes. I/O and network bandwidth become a problem, if memory does not. But memory probably will become problematic because all shards will be sharing the same 16-30 GB memory space.

Do not allocate more shards than there are CPU cores on a single node.

So, if there are six cores, do not allocate more than six active shards to that node.

The optimal shard size, from an Elasticsearch perspective, is between 10 GB and 50 GB.

Remember when allocating shards, that you are not just allocating shards for your current case, but for all future ingestions to that case as well.

Without replica shards, any corruption of shard data can lead to a loss of case data.

Any index in production should have a minimum of one replica to help scale query load across more nodes. Elasticsearch recommends two replicas per shard in production.
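As a rough planning aid (an illustrative sketch, not a Nuix or Elastic formula; the 40 GB target is an assumption drawn from the 10 GB to 50 GB range above), a primary-shard count can be estimated from the expected index size:

```python
# Hedged sketch: pick a primary-shard count so that each shard stays
# inside the 10-50 GB range. The 40 GB target value is an assumption.
import math


def suggest_primary_shards(expected_index_gb: float,
                           target_shard_gb: float = 40.0) -> int:
    """Smallest shard count keeping each shard at or under the target size."""
    return max(1, math.ceil(expected_index_gb / target_shard_gb))


# A case expected to grow to 200 GB would get 5 primary shards (~40 GB each).
print(suggest_primary_shards(200))
```

Remember that the result must also respect the per-node guideline above: no more active shards on a node than that node has CPU cores.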

Tips on configuring nodes

For production clusters, configure nodes with dedicated roles: dedicated data nodes, dedicated master nodes, and dedicated coordinating nodes.

For a large Elasticsearch cluster, use coordinating-only nodes to relieve pressure on data nodes by handling the shuttling of data through the cluster. Having at least three means that if one goes down (which can happen when large aggregation operations such as deduplication or date-range queries are run), the cluster is not impacted.

Give a coordinating node a high CPU count (~12-16) and allocate the maximum heap for an Elasticsearch node, 31 GB.

Disable Elasticsearch's sniffer feature to ensure that Nuix Workstation only connects to the nodes specified in `nuix.http.hosts`/`nuix.transport.hosts`. To do so, use the switch `-Dnuix.elasticsearch.sniffer.interval=0`.

About the Elasticsearch sniffer feature

The sniffer feature, on first connecting to a cluster, automatically reaches out and maps all nodes within the cluster, and then adds those nodes to the list of connection pools for the REST client (for example, Nuix Workstation). Ideally you want to avoid this for various traffic shaping reasons. By default, if at least one of the hosts nuix.http.hosts is reachable, the Nuix Engine automatically detects nodes that are added and removed from the cluster. Requests are then routed to each detected node in a round-robin fashion, and if one fails to respond, it is blacklisted for a period of time before being added to the roster again.

The Nuix Engine sends requests to all nodes detected in the cluster, regardless of their role (coordinating, data, ingestion, master, and so on). In some topologies, this may not be desired behavior. For instance, you may want all requests to be sent to the master and/or coordinating nodes only, but never to data nodes. In this case, you need to disable the sniffer by setting the switch `-Dnuix.elasticsearch.sniffer.interval=0` on all running instances of the Nuix Engine. If using remote Workers, you must set the switch on them as well.

With the switch set, the Nuix Engine only sends requests to the hosts listed in the Advanced setting `nuix.http.hosts` (which you can only set at ingestion time, but can later manually change in the .fbi file, if needed). The Nuix Engine then does not automatically detect nodes that are added to and removed from the cluster, and the retry logic only uses the listed nodes, failing the operation if no node is ready to receive the request.

For further details on the sniffer feature, visit: https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.8/sniffer.html and https://www.elastic.co/blog/elasticsearch-sniffing-best-practices-what-when-why-how.

Tips on monitoring Elasticsearch

Use some type of Elasticsearch monitoring for log collection and resource usage monitoring. To simplify debugging issues, configure Kibana to point to a production Elasticsearch cluster. See Use Kibana with Elasticsearch for details.

For simple resource monitoring and index monitoring, go to: http://www.elastichq.org

Tips on managing config files

Use a configuration management tool like Ansible to manage the configuration of Elasticsearch. Managing all config files by hand is tedious and error-prone.

Elasticsearch uses a lot of file descriptors or file handles. Running out of file descriptors can be disastrous and in all probability lead to data loss. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/7.8/file-descriptors.html
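On Linux, the limit can be raised for the user running Elasticsearch in `/etc/security/limits.conf` (the `elasticsearch` user name is an assumption about your install; 65535 is the value the Elasticsearch documentation suggests):

```
# /etc/security/limits.conf
# Raise the open-file limit for the user running Elasticsearch
elasticsearch  -  nofile  65535
```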

Elasticsearch challenges

Elasticsearch is very good at searching. What it is not good at is the following:

Large aggregations. Try to limit aggregations to just the first n items whenever possible.

If you want an accurate aggregation in Elasticsearch, you must grab metadata or properties for every item in every shard. That is very expensive.

Nested data types or the “join” data type

Regular expressions

Modifying existing documents or items.

This means most “updates” effectively create a new document and delete the old one.
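The aggregation tip above can be sketched as a terms aggregation capped at the first n buckets (the index name and field name here are hypothetical):

```
POST /case-index/_search
{
  "size": 0,
  "aggs": {
    "top_custodians": {
      "terms": { "field": "custodian", "size": 10 }
    }
  }
}
```

Limiting `size` on the aggregation keeps Elasticsearch from having to return every bucket from every shard.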