Applying Elasticsearch on Kubernetes

Big data, AI, machine learning, and numerous others are all buzzwords we seem to throw around lightly in recent years. Even though they are hugely different from one another, they all have one thing in common. Data! Huge amounts of data that need to be managed.

The downside of that is that the more data you have the more of a headache it is to store, query, and make sense of.

However, running Elasticsearch on Kubernetes can save you a lot of trouble. Elasticsearch handles storing and querying data, while Kubernetes handles the underlying infrastructure. By the end of this tutorial, you will have a running Elasticsearch cluster on Kubernetes, know the best practices for getting the most out of both platforms, and have some tips about memory requirements and storage.

What is Elasticsearch?

Elasticsearch is a datastore that stores data in indices. It’s also a real-time, distributed, and scalable search engine which allows for full-text and structured search, as well as analytics. It’s great for storing and searching through large volumes of textual data, like logs, but can also be used to search many different kinds of documents.

We at Sematext are running a huge Elasticsearch cluster on Kubernetes that handles millions of data points per minute from ingested logs, metrics, events, traces, etc.

To learn more about Elasticsearch, check out this Elasticsearch guide.

What is Kubernetes?

Kubernetes is the de-facto standard container orchestrator and by far the easiest way to run and manage clusters in the cloud or on-premises. But what is a container orchestrator? To understand Kubernetes, you first need to understand Docker.

Docker is a container engine that lets you create ephemeral containers to run your applications. These containers run isolated from the rest of your system and don’t keep state unless you attach storage to them. Running Docker containers is the same across any operating system, as long as the host has Docker installed. You don’t have to worry about the underlying infrastructure at all. This makes packaging and shipping apps to production simple.

However, containers are hard to run in production without a cluster and an orchestrator to manage them. Kubernetes manages all of this and does the heavy lifting so you don’t have to. What you have to do is tell Kubernetes what to do through the kubectl command-line tool and with YAML resource files.

Why Run Elasticsearch on Kubernetes

Elasticsearch can store huge amounts of textual data with the ability to quickly search through it when needed. It’s deployed in clusters consisting of at least three nodes. Over the years, these nodes have often been VMs that you would spin up yourself and then wire together. That’s tiresome and hard to manage.

Kubernetes has stepped in to solve that issue. It has become the de-facto standard for running high-uptime and reliable systems in the cloud and on-premises. Even though Kubernetes is designed to run ephemeral, stateless, apps and not databases, there are upsides of running an Elasticsearch cluster on Kubernetes. You should generally not be running databases on Kubernetes, but you can. Handling persistent data is simple by using persistent volume claims and stateful sets.

With Kubernetes, you get a cluster that’s easier to configure, manage and scale. Once you configure your Elasticsearch cluster on Kubernetes, the process of deploying it to another cloud provider or on-premises is incredibly simple.

Kubernetes is also very developer-friendly. You rely on infrastructure as code configurations and not manually setting up and configuring infrastructure. For many, this may be the only way they know how to deploy a large cluster. Seeing as many teams don’t have dedicated DevOps engineers and they have to rely on their developers to handle the infrastructure, you may be saving yourself a huge headache by letting Kubernetes manage the cluster.

Let’s check out the architecture behind running Kubernetes and Elasticsearch.

Kubernetes Architecture: Basic Concepts

Kubernetes manages your application with several different resource types. First, your application is built and packaged into a Container. This containerized application is deployed to Kubernetes and runs within a Pod.

Kubernetes Pods are grouped in a Deployment. A Deployment is a key concept in Kubernetes that manages Pods and their properties, like how many replicas of each Pod to run.

A Service is then used to expose the Pods of a Deployment. Simply put, a Service creates a single IP address which is used to access the Containers, and it spreads requests across all the Pods behind it. If it is of type LoadBalancer, it also gets an external IP so the Deployment is reachable from the Internet. Services can also make Pods accessible to other Pods within the Kubernetes cluster.
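To make these resource types concrete, here’s a minimal sketch of a Deployment and a Service written out as one manifest. Everything in it (the web-demo name, the nginx image, port 80) is a made-up example and has nothing to do with Elasticsearch yet.

```shell
# Write a minimal Deployment and Service to a file.
# All names, the image, and the port are hypothetical examples.
cat > web-demo.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-demo
spec:
  replicas: 2                # how many copies of the Pod to run
  selector:
    matchLabels:
      app: web-demo
  template:
    metadata:
      labels:
        app: web-demo
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web-demo
spec:
  type: LoadBalancer         # asks the cloud provider for an external IP
  selector:
    app: web-demo            # routes to every Pod carrying this label
  ports:
    - port: 80
      targetPort: 80
EOF
echo "wrote web-demo.yaml"
```

Running kubectl apply -f web-demo.yaml would create both resources, and kubectl get service web-demo shows the external IP once your provider assigns one.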

Kubernetes Nodes are the machines, usually virtual machines, on which the Kubernetes cluster is running, including all Pods. By default, the Kubernetes scheduler decides which Node each Pod lands on, based on the resources available. You can use Affinity and Anti-Affinity rules to tell Kubernetes how to spread the running Pods across the Nodes. Maybe you want Elasticsearch Pods to only run on certain Kubernetes Nodes.
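As a sketch of what such a rule looks like, the fragment below is a Pod anti-affinity rule that keeps two Pods with the same label off the same Node. The app: es-data label and the file name are assumptions for illustration. The fragment belongs under spec.template.spec of a Deployment or StatefulSet.

```shell
# Write a Pod anti-affinity fragment to a file.
# The label app: es-data is a hypothetical example.
cat > anti-affinity.yaml <<'EOF'
affinity:
  podAntiAffinity:
    # "required" makes this a hard rule: a Pod stays Pending rather than
    # share a Node with another Pod that carries the same label.
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: es-data
        topologyKey: kubernetes.io/hostname   # one matching Pod per Node
EOF
echo "wrote anti-affinity.yaml"
```

With a hard rule like this you need at least as many Nodes as Pods in the group. If that is too strict, preferredDuringSchedulingIgnoredDuringExecution is the softer variant.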

Deployments do not keep state in their Pods. It’s assumed the application is stateless. If you need your application to maintain state, like in our case with Elasticsearch, then you need to use a StatefulSet.

A StatefulSet is like a Deployment, but it can maintain state: each Pod gets a stable name and its own storage. Makes sense from the name, right?

When using StatefulSets you also need to use PersistentVolumes and PersistentVolumeClaims. A StatefulSet will ensure the same PersistentVolumeClaim stays bound to the same Pod throughout its lifetime. In a Deployment, by contrast, all the Pods in the group share the same PersistentVolumeClaim.

PersistentVolume (PV) is a Kubernetes abstraction for storage on the provided hardware. This can be AWS EBS, DigitalOcean Volumes, etc.

A PersistentVolumeClaim (PVC), on the other hand, is a way for a Deployment or StatefulSet to request some storage space from a PersistentVolume. This allocated storage persists even if Pods and Nodes restart.
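Here’s a minimal sketch of how a StatefulSet asks for storage. It only illustrates the volumeClaimTemplates part and is not a complete Elasticsearch setup; the es-data names, the image tag, and the 30Gi size are assumptions for the example.

```shell
# Write a StatefulSet that requests one PVC per Pod.
# Names, image tag, and storage size are hypothetical examples.
cat > es-data-statefulset.yaml <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es-data
spec:
  serviceName: es-data-headless   # headless Service used for Pod DNS names
  replicas: 2
  selector:
    matchLabels:
      app: es-data
  template:
    metadata:
      labels:
        app: es-data
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:           # one PVC is created from this per Pod
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
EOF
echo "wrote es-data-statefulset.yaml"
```

Kubernetes names each claim after the template and the Pod, so this would produce data-es-data-0 and data-es-data-1. A restarted es-data-0 gets data-es-data-0 back, which is how the state survives.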

Alongside StatefulSets you have Headless Services that are used for discovery of StatefulSet Pods.

A Headless Service is a Service for cases where you don’t need load-balancing or a single Service IP. Instead of load-balancing, a DNS lookup for it returns the IPs of the associated Pods. Headless Services do not have a Cluster IP allocated. They are not proxied by kube-proxy; instead, Elasticsearch handles the service discovery.
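What makes a Service headless is a single field, clusterIP: None. A minimal sketch, assuming Pods labeled app: es-data and a Service named es-data-headless (both hypothetical):

```shell
# Write a headless Service manifest to a file.
# The Service name and the app: es-data label are hypothetical examples.
cat > es-data-headless.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: es-data-headless
spec:
  clusterIP: None        # no virtual IP; DNS returns the Pod IPs directly
  selector:
    app: es-data
  ports:
    - name: transport
      port: 9300         # Elasticsearch node-to-node port
EOF
echo "wrote es-data-headless.yaml"
```

When a StatefulSet names this Service in its serviceName field, each Pod also gets its own DNS entry, such as es-data-0.es-data-headless, which the Pods can use to find each other.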

Elasticsearch Deployment: Cluster Topology

Elasticsearch should always be deployed in clusters. Every instance of Elasticsearch running in the cluster is called a node. In Kubernetes an Elasticsearch node would be equivalent to an Elasticsearch Pod. Don’t get it confused with a Kubernetes Node, which is one of the virtual machines Kubernetes is running on. For the rest of this Elasticsearch Kubernetes tutorial I’ll use the term Elasticsearch Pod to minimize confusion between the two.

By default, when you deploy an Elasticsearch cluster, all Elasticsearch Pods have all roles. The roles can be master, data, and client. The client is often also called coordinator. Master Pods are responsible for managing the cluster, managing indices, and electing a new master if needed. Data Pods are dedicated to store data, while client Pods have no role whatsoever except for funneling incoming traffic to the rest of the Pods.

You need a minimum of three master-eligible Pods to avoid split-brain when a new master needs to be appointed. You set this role for a node by having this combination of roles.

roles:
  master: "true"
  ingest: "false"
  data: "false"
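The three-Pod minimum comes from majority voting: a master can only be elected when more than half of the master-eligible Pods agree. Here’s a small sketch of that arithmetic, assuming the usual majority rule of floor(n/2) + 1; the quorum function is just a name made up for this example.

```shell
# Majority needed among n master-eligible Pods: floor(n / 2) + 1.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Show how many Pods can fail before the cluster can no longer elect a master.
for n in 1 2 3 5; do
  q=$(quorum "$n")
  echo "$n master-eligible Pod(s): quorum $q, survives $(( n - q )) failure(s)"
done
```

With two Pods the quorum is still two, so losing either one stops elections. Three is the smallest count that survives a failure.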

Regarding data Pods, you need at least two. They will persist data, receive queries, and index requests. Basically, they do all the heavy lifting. You set this role like this.

roles:
  master: "false"
  ingest: "false"
  data: "true"

Client Pods are also known as Coordinating Pods. You should have two of these as well. These Pods are exposed to consumers of the cluster data and serve as HTTP proxies. If they are not deployed, Data Pods will serve as coordinating Pods. Avoid this on larger clusters. You set a Pod to be a client by having all roles false.

roles:
  master: "false"
  ingest: "false"
  data: "false"

This setup is considered best practice and scaling up would be needed only when the current node count is insufficient. Luckily, scaling up an Elasticsearch cluster on Kubernetes is as simple as running one command.
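That one command is kubectl scale. The sketch below wraps it in a small helper. Both scale_statefulset and the KUBECTL variable are names made up for this example, and elasticsearch-data is the StatefulSet name that the 7-Pod setup later in this tutorial produces.

```shell
# Allow the kubectl binary to be swapped out, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Scale a StatefulSet to the given number of replicas.
# Arguments: <statefulset-name> <replicas>
scale_statefulset() {
  "$KUBECTL" scale statefulset "$1" --replicas="$2"
}

# Example call (not executed here):
# scale_statefulset elasticsearch-data 3
```

Since the StatefulSets are managed by Helm, a later helm upgrade may reset the count to the replicas value in your config file, so it’s worth changing it there as well.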

The final cluster topology looks like this.

Data Pods are deployed as StatefulSets with PersistentVolumes and PersistentVolumeClaims. They will persist data between restarts, which is what you want.

Master Pods can be deployed as either Deployments or StatefulSets.

A headless service for each StatefulSet is created and used for discovery within the cluster.

Client Pods are completely stateless and can be deployed as a simple Kubernetes Deployment.

A Kubernetes LoadBalancer Service is used to forward inbound traffic to the client Pods. All of your apps, as well as Kibana will be configured to go through the LoadBalancer service.

Deploying Elasticsearch on Kubernetes: Memory Requirements

If you are setting up an Elasticsearch cluster on Kubernetes for yourself, keep in mind to allocate at least 4GB of memory to your Kubernetes Nodes. You will need at least 7 Nodes to run this setup without any hiccups. The default size of the PersistentVolumeClaims for each Elasticsearch Pod will be 30GB. This will help determine how much block storage you will need.

The Pods are inside of a StatefulSet, so when creating new Pods you need to make sure you have 30GB of storage per additional Pod you want to create. Working with PVCs is complicated because they are not deleted along with the Pods; you need to delete them yourself. It gets even more complicated when you are not using a cloud service and you have to configure your own StorageClasses. Often Pods won’t start, and it’s most likely due to lack of storage space or old PVCs still persisting even though you don’t need them.
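To see how much block storage that adds up to, here’s a quick sketch of the arithmetic, assuming every Pod gets one 30GB PVC. The storage_gb function is a name made up for this example.

```shell
# Total block storage in GB = number of Pods * PVC size per Pod.
storage_gb() {
  echo $(( $1 * $2 ))
}

echo "3-Pod cluster: $(storage_gb 3 30) GB"
echo "7-Pod cluster: $(storage_gb 7 30) GB"
```

When Pods refuse to start, kubectl get pvc lists the claims that still exist, and kubectl delete pvc <PVC_NAME> removes one you no longer need. Depending on the reclaim policy, the data on it is gone after that.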

In the next section I’ll show you how to get up and running quickly with a 3-Pod setup where each of the Pods has all roles, and then how to configure a 7-Pod production setup with Helm.

How to Deploy Elasticsearch on Kubernetes

Deploying Elasticsearch on Kubernetes can be a hassle if you choose to do it yourself with custom resource files and kubectl. It’s much easier to use Helm, the Kubernetes package manager. With the help of Helm, you can install a prebuilt chart that’ll configure all required resources by running one simple command. Let’s get our hands dirty and start creating the Elasticsearch cluster on Kubernetes.

Prerequisites

To follow along with this tutorial you’ll need a few things first:

  • A Kubernetes cluster with role-based access control (RBAC) enabled.
    • Ensure your cluster has enough resources available, and if not scale your cluster by adding more Kubernetes Nodes. You’ll deploy a 3-Pod Elasticsearch cluster with 3 master Pods, and a 7-Pod Elasticsearch cluster with 3 master Pods, 2 data Pods, and 2 client Pods. I’d suggest you have 7 Kubernetes Nodes with at least 4GB of RAM and 50GB of storage.
  • The kubectl command-line tool installed on your local machine, configured to connect to your cluster. You can read more about how to install kubectl in the official documentation.
  • The Kubernetes package manager Helm installed. You can learn how to install Helm in the official documentation.

Deploying a 3-Pod Elasticsearch cluster on Kubernetes with Helm: Examples and Best Practices

First and foremost you need to initialize Helm on your Kubernetes cluster. It’s done with the init command.

helm init

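Note: Helm 2 needs Tiller, its in-cluster component, installed. If the helm init command does not work, run these commands to install and configure Tiller. Helm 3 removed both Tiller and the init command, so on Helm 3 you can skip this step and pass the release name as the first argument to helm install instead of using --name.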

kubectl create serviceaccount -n kube-system tiller
kubectl create clusterrolebinding tiller-cluster-admin \
  --clusterrole=cluster-admin \
  --serviceaccount=kube-system:tiller
helm init --service-account tiller \
  --override spec.selector.matchLabels.'name'='tiller',spec.selector.matchLabels.'app'='helm' \
  --output yaml | sed 's@apiVersion: extensions/v1beta1@apiVersion: apps/v1@' | kubectl apply -f -

Once you have Helm initialized you can begin adding charts. Start by adding the elastic repo and installing the elasticsearch chart.

helm repo add elastic https://helm.elastic.co
helm install --name elasticsearch elastic/elasticsearch \
  --set service.type=LoadBalancer

You’re adding the --set service.type=LoadBalancer parameter to indicate you want the service to expose a LoadBalancer IP to the Internet. Check to see that the resources are running.

kubectl get all

This will list all the resources the chart created.

[Output]

NAME                         READY   STATUS     RESTARTS   AGE
pod/elasticsearch-master-0   1/1     Running    0          2m8s
pod/elasticsearch-master-1   1/1     Running    0          2m8s
pod/elasticsearch-master-2   1/1     Running    0          2m8s

NAME                                    TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                         AGE
service/elasticsearch-master            LoadBalancer   10.98.90.94   <YOUR_IP>     9200:31812/TCP,9300:31635/TCP   2m8s
service/elasticsearch-master-headless   ClusterIP      None          <none>        9200/TCP,9300/TCP               2m9s
service/kubernetes                      ClusterIP      10.96.0.1     <none>        443/TCP                         5d5h

NAME                                    READY   AGE
statefulset.apps/elasticsearch-master   3/3     2m8s

You now have three Elasticsearch master Pods running on your Kubernetes cluster. These Pods now have all three available roles. To keep them healthy, make sure you have enough resources allocated. If you need to scale up, you can configure a Pod autoscaler. To check if everything is running like it should, hit the Elasticsearch state endpoint with curl.

curl "http://<YOUR_IP>:9200/_cluster/state?pretty"

This setup will work great for smaller clusters where you don’t have huge amounts of data. One issue you may run into is out-of-memory exceptions when your indices start growing. In that case you should increase the vm.max_map_count sysctl setting on your Kubernetes Nodes. Here’s a nice thread explaining it.
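A quick way to check the setting on a Linux Node is sketched below. The 262144 threshold is the minimum that Elasticsearch’s documentation asks for. The script only reads the current value and prints the command that would raise it; it doesn’t change anything itself.

```shell
# Minimum vm.max_map_count recommended by the Elasticsearch documentation.
required=262144

# Read the current value; fall back to 0 if sysctl or the key is unavailable.
current=$(sysctl -n vm.max_map_count 2>/dev/null || echo 0)

if [ "$current" -lt "$required" ]; then
  echo "vm.max_map_count is $current; raise it with: sudo sysctl -w vm.max_map_count=$required"
else
  echo "vm.max_map_count is $current; no change needed"
fi
```

Run it on the Kubernetes Node itself rather than inside a Pod, since this is a kernel setting of the host.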

But, if you want to follow Elasticsearch best practices you should also configure dedicated data and client Pods apart from master Pods. That’s exactly what we’re doing in the next section.

Deploying a 7-Pod Elasticsearch cluster on Kubernetes with Helm

Let’s get serious for a moment, and configure the cluster with best practices in mind. The 7 Pods will consist of 3 master Pods, 2 data Pods, and 2 client Pods.

This preferred setup is installed in a similar way. First, run the Helm install command, but this time without any additional parameters.

helm install --name elasticsearch elastic/elasticsearch

Now you need to run the upgrade command to update the Elasticsearch pods. You want to upgrade the number of Pods but also assign custom roles to them.

To do this create three yaml config files. First the master.yaml to configure the master-eligible Pods.

# master.yaml
---
clusterName: "elasticsearch"
nodeGroup: "master"
roles:
  master: "true"
  ingest: "false"
  data: "false"
replicas: 3

Then the data.yaml for the data Pods.

# data.yaml
---
clusterName: "elasticsearch"
nodeGroup: "data"
roles:
  master: "false"
  ingest: "true"
  data: "true"
replicas: 2

Finally, the client.yaml for the client Pods.

# client.yaml
---
clusterName: "elasticsearch"
nodeGroup: "client"
roles:
  master: "false"
  ingest: "false"
  data: "false"
replicas: 2
service:
  type: "LoadBalancer"

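Now you can run the upgrade command three times, once for each yaml config file, in the directory where you created the files. Give each node group its own release name. If you reuse one release name for all three, every upgrade replaces the node group from the previous one instead of adding to it. The elasticsearch-data and elasticsearch-client release names below are just examples.

helm upgrade --wait --timeout=600 --install \
  --values ./master.yaml elasticsearch elastic/elasticsearch

helm upgrade --wait --timeout=600 --install \
  --values ./data.yaml elasticsearch-data elastic/elasticsearch

helm upgrade --wait --timeout=600 --install \
  --values ./client.yaml elasticsearch-client elastic/elasticsearch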

It’ll take a while for the upgrades to complete. When they are all finished, you can check if your resources are updated.

kubectl get all

Here’s the output you’re looking for.

[Output]

NAME                                    READY   STATUS    RESTARTS   AGE
pod/elasticsearch-client-0              1/1     Running   0          10m
pod/elasticsearch-client-1              1/1     Running   0          10m
pod/elasticsearch-data-0                1/1     Running   0          11m
pod/elasticsearch-data-1                1/1     Running   0          11m
pod/elasticsearch-master-0              1/1     Running   0          8m27s
pod/elasticsearch-master-1              1/1     Running   0          8m27s
pod/elasticsearch-master-2              1/1     Running   0          8m27s

NAME                                    TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                         AGE
service/elasticsearch-client            LoadBalancer   10.245.114.89    <YOUR_IP>       9200:32366/TCP,9300:31427/TCP   10m
service/elasticsearch-client-headless   ClusterIP      None             <none>          9200/TCP,9300/TCP               10m
service/elasticsearch-data              ClusterIP      10.245.116.115   <none>          9200/TCP,9300/TCP               11m
service/elasticsearch-data-headless     ClusterIP      None             <none>          9200/TCP,9300/TCP               11m
service/elasticsearch-master            ClusterIP      10.245.220.94    <none>          9200/TCP,9300/TCP               8m27s
service/elasticsearch-master-headless   ClusterIP      None             <none>          9200/TCP,9300/TCP               8m27s
service/kubernetes                      ClusterIP      10.245.0.1       <none>          443/TCP                         4h5m

NAME                                    READY   AGE
statefulset.apps/elasticsearch-client   2/2     10m
statefulset.apps/elasticsearch-data     2/2     11m
statefulset.apps/elasticsearch-master   3/3     8m28s

Run curl against the Elasticsearch endpoint once again to check if it works.

curl "http://<YOUR_IP>:9200/_cluster/state?pretty"

There ya go! Ready to rock!
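If you also want to confirm that each Pod ended up with the roles you assigned, the _cat/nodes API lists them. A minimal sketch, assuming the client Service is reachable on port 9200 at the IP you pass in; es_nodes is a helper name made up for this example.

```shell
# List every Elasticsearch node with its roles and whether it is the
# elected master. Argument: the external IP or hostname of the client Service.
es_nodes() {
  curl -s "http://$1:9200/_cat/nodes?v&h=name,node.role,master"
}

# Example call (not executed here), using your LoadBalancer IP:
# es_nodes <YOUR_IP>
```

The node.role column uses one letter per role, and the master column marks the currently elected master with an asterisk.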

Note: If you’re having issues with configuring larger clusters, you might need to check out setting up readiness probes. They can check whether your Elasticsearch Pods are ready to accept traffic.
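For reference, a readiness probe on an Elasticsearch container could look roughly like the fragment below. This is a generic Kubernetes probe and not necessarily the one the Helm chart ships; the file name and the timing values are assumptions you’d tune for your own cluster.

```shell
# Write a generic readiness probe fragment to a file.
# It belongs under a container entry in a Pod template.
cat > readiness-probe.yaml <<'EOF'
readinessProbe:
  httpGet:
    path: /_cluster/health?local=true   # answered by the Pod itself
    port: 9200
  initialDelaySeconds: 10   # give Elasticsearch time to boot
  periodSeconds: 10         # how often to check
  failureThreshold: 3       # mark the Pod not ready after 3 failed checks
EOF
echo "wrote readiness-probe.yaml"
```

A Pod that fails its readiness probe stays out of the Service endpoints, so it receives no traffic until it recovers. You can see this as 0/1 in the READY column of kubectl get pods.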

Bonus: Prebuilt Elasticsearch Helm chart with best practices in mind

The peeps over at Bitnami have created a great chart with preconfigured settings for Elasticsearch master, data, and client Pods. All you need to do is run two commands.

helm repo add bitnami 
helm install --name elasticsearch --set \
  name=elasticsearch,master.replicas=3,coordinating.service.type=LoadBalancer bitnami/elasticsearch

Check the kubectl get all output once again to make sure everything is in order.

[Output]

NAME                                                                 READY   STATUS    RESTARTS   AGE
pod/elasticsearch-elasticsearch-coordinating-only-694b5f94f8-896k5   1/1     Running   0          3m55s
pod/elasticsearch-elasticsearch-coordinating-only-694b5f94f8-jvdrn   1/1     Running   0          3m55s
pod/elasticsearch-elasticsearch-data-0                               1/1     Running   0          3m55s
pod/elasticsearch-elasticsearch-data-1                               1/1     Running   0          3m27s
pod/elasticsearch-elasticsearch-master-0                             1/1     Running   0          3m55s
pod/elasticsearch-elasticsearch-master-1                             1/1     Running   0          3m35s
pod/elasticsearch-elasticsearch-master-2                             1/1     Running   0          3m16s

NAME                                                    TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)          AGE
service/elasticsearch-elasticsearch-coordinating-only   LoadBalancer   10.245.13.251   <YOUR_IP>        9200:32270/TCP   3m56s
service/elasticsearch-elasticsearch-discovery           ClusterIP      None            <none>           9300/TCP         3m56s
service/elasticsearch-elasticsearch-master              ClusterIP      10.245.0.78     <none>           9300/TCP         3m56s
service/kubernetes                                      ClusterIP      10.245.0.1      <none>           443/TCP          30m

NAME                                                            READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/elasticsearch-elasticsearch-coordinating-only   2/2     2            2           3m55s

NAME                                                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/elasticsearch-elasticsearch-coordinating-only-694b5f94f8   2         2         2       3m55s

NAME                                                  READY   AGE
statefulset.apps/elasticsearch-elasticsearch-data     2/2     3m56s
statefulset.apps/elasticsearch-elasticsearch-master   3/3     3m56s

All that’s left now is to deploy Kibana on the Kubernetes cluster to visualise your data.

How to Deploy Kibana on Kubernetes

Once you have your Elasticsearch cluster up and running on Kubernetes, you can use Kibana to manage and monitor it.

Kibana is a simple tool to visualize Elasticsearch data. To run Kibana you need to provide the name of the Elasticsearch client Service as an environment variable so the Kibana Pod knows where to connect to.

You’ll use a LoadBalancer Service to access the Kibana deployment. If you wish, you can only expose it internally instead.

To add Kibana you use the official Helm chart. Go ahead and run the Helm install command.

Make sure to replace the placeholder with the Service name of your client. The default would be elasticsearch-master if you followed the 3-Pod guide, elasticsearch-client if you followed the 7-Pod guide, or elasticsearch-elasticsearch-coordinating-only if you installed the Bitnami Helm chart.

helm install --name kibana elastic/kibana \
  --set elasticsearchHosts=http://<CLIENT_SERVICE_NAME>:9200 \
  --set service.type=LoadBalancer

Like always, check to make sure Kibana is running after installing the Helm chart.

kubectl get all

[Output]

NAME                                      READY   STATUS              RESTARTS   AGE
...
pod/kibana-kibana-74bf9fc5f5-sxx4g        1/1     Running             0          1m12s

NAME                                      TYPE           CLUSTER-IP        EXTERNAL-IP       PORT(S)                         AGE
...
service/kibana-kibana                     LoadBalancer   10.245.195.198    <YOUR_KIBANA_IP>         5601:31362/TCP                  20s
service/kubernetes                        ClusterIP      10.245.0.1        <none>            443/TCP                         69m

NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kibana-kibana             1/1     1            1           1m12s

NAME                                       DESIRED   CURRENT   READY   AGE
replicaset.apps/kibana-kibana-74bf9fc5f5   1         1         0       1m12s

...

With that, you’re done! Open up http://<YOUR_KIBANA_IP>:5601 and you can see Kibana running.
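If you’d rather keep Kibana internal, one option is to install it with service.type=ClusterIP and reach it through a port-forward from your machine. A minimal sketch, assuming the kibana-kibana Service name from the output above. The kibana_forward helper and the KUBECTL variable are names made up for this example; the variable is only there so you can preview the command by setting it to echo.

```shell
# Allow the kubectl binary to be swapped out, e.g. KUBECTL=echo for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Forward local port 5601 to port 5601 of the Kibana Service.
# The command keeps running until you stop it with Ctrl+C.
kibana_forward() {
  "$KUBECTL" port-forward service/kibana-kibana 5601:5601
}

# Example call (not executed here):
# kibana_forward
```

While the forward is running, Kibana is available at http://localhost:5601 without being exposed to the Internet.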

Wrapping Up

In this tutorial you learned about Elasticsearch and Kubernetes clusters, and how to run and deploy Elasticsearch on Kubernetes. Now you know about best practices, hardware requirements, and tips and tricks on how to maintain a stateful Elasticsearch cluster on Kubernetes.

You’ve created three setups with different numbers of Pods with different roles, while managing state with persistent volumes. By now you know the architectural overview of both how to create a solid Elasticsearch cluster but also how to organize resources in a Kubernetes cluster.

You’ve also installed Kibana so you can interact with the data stored in Elasticsearch, and interacted with the Elasticsearch REST API using curl.

This article has been published from a wire agency feed without modifications to the text. Only the headline has been changed.
