How to Deploy Druid using Kubernetes Druid Operator

As cloud-native architectures continue to evolve, the management of large-scale, distributed databases and analytics platforms within Kubernetes clusters has become increasingly complex. While Kubernetes excels at automating the deployment, scaling, and operations of containerized applications, effectively managing distributed data systems in such dynamic environments requires a more specialized approach.

Druid, a high-performance, distributed analytics database, is designed to handle real-time data ingestion and interactive analytical queries on large datasets. It is particularly well-suited for use cases involving fast, ad-hoc analytics on high volumes of time-series or event-based data. However, deploying and managing Druid in Kubernetes presents unique challenges, such as ensuring optimal resource allocation, maintaining high availability, and scaling across many nodes.

This article covers the importance of Druid in the context of database management within a Kubernetes environment. We will explore the fundamentals of Druid, highlighting its significance in real-time analytics and effective data handling, while also offering best practices for deploying and configuring Druid on Kubernetes. By the end of this article, you will have a thorough understanding of how to deploy Druid in Kubernetes using the Kubernetes Druid Operator, as well as how Druid’s integration with Kubernetes strengthens your approach to managing distributed analytics platforms in a dynamic ecosystem.

Advantages of Druid in Kubernetes

Deploying Druid in Kubernetes presents a unique set of advantages that align with the platform’s strengths in managing distributed systems at scale. Kubernetes offers powerful automation and orchestration capabilities, enabling organizations to seamlessly manage the complexities of large-scale, distributed data systems like Druid. As a real-time analytics database, Druid requires a robust environment that can handle high data ingestion rates and provide quick responses to complex queries over large datasets. Kubernetes, with its dynamic scaling, self-healing, and efficient resource management, offers an ideal foundation for deploying and operating Druid at scale.

The ability of Kubernetes to scale applications both horizontally and vertically makes it ideal for managing Druid’s distributed architecture, which includes components like coordinator, overlord, historical, and broker nodes. Kubernetes automates scaling based on demand, ensuring Druid can handle spikes in data volume or query load without manual intervention. Whether adding more pods or adjusting resource allocations, Kubernetes simplifies and optimizes the scaling process.

Kubernetes also provides high availability and fault tolerance, crucial for distributed systems like Druid. If a pod or node fails, Kubernetes automatically redeploys and redistributes the workload, minimizing disruptions. Its built-in load balancing and self-healing mechanisms further ensure that Druid maintains data availability and consistent query performance, even in the face of failures.

Resource management is another key advantage of using Kubernetes with Druid. By allowing administrators to define CPU and memory requests and limits for each Druid component, Kubernetes ensures that resources are allocated efficiently, preventing resource contention and optimizing performance. This fine-grained control over resources is essential in maintaining the performance of resource-intensive analytics workloads.
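
As an illustration, the generic shape of a Kubernetes resources stanza of the kind attached to a Druid component’s pod template looks like this; the exact field path depends on the operator’s CRD schema, and the values are placeholders rather than tuning advice:

# Requests are what the scheduler reserves for the container;
# limits cap what it may actually consume.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi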

Additionally, Kubernetes provides declarative management capabilities, which streamline the deployment and management of Druid’s distributed architecture. Through Custom Resource Definitions (CRDs), Druid components can be defined declaratively, making it easier to manage the lifecycle of the system, including configuration updates, scaling, and rolling deployments. This reduces the operational complexity and allows for more predictable and reliable deployments.

Deploy Druid on Kubernetes

Prerequisites

We have to set up the environment to deploy Druid on Kubernetes using a Kubernetes Druid operator. You need a running Kubernetes cluster and a basic understanding of Druid. Here we are using Kind to create our Kubernetes cluster. Additionally, you should have Helm installed.
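
If you do not already have a cluster, you can create a local one with Kind (the cluster name below is arbitrary):

$ kind create cluster --name druid-demo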

In this tutorial, we will use the Kubernetes Druid operator KubeDB to deploy Druid on Kubernetes. We must install KubeDB in our Kubernetes cluster, and to set it up we need a license. We can get a free license from the AppsCode License Server by providing our Kubernetes cluster ID. Run the following command to get the cluster ID:

$ kubectl get ns kube-system -o jsonpath='{.metadata.uid}'
4ed26615-99dc-4ae8-2413-2e7ec357b05

After we provide the necessary data, the license server will email us a license.txt file. Run the following command to install KubeDB:

$ helm install kubedb oci://ghcr.io/appscode-charts/kubedb \
  --version v2024.8.21 \
  --namespace kubedb --create-namespace \
  --set-file global.license=/path/to/the/license.txt \
  --set global.featureGates.Druid=true \
  --set global.featureGates.ZooKeeper=true \
  --wait --burst-limit=10000 --debug

Verify the installation with the following command:

$ kubectl get pods --all-namespaces -l "app.kubernetes.io/instance=kubedb"
NAMESPACE   NAME                                            READY   STATUS    RESTARTS   AGE
kubedb      kubedb-kubedb-autoscaler-67747584cf-t767j       1/1     Running   0          3m37s
kubedb      kubedb-kubedb-ops-manager-68d7f85f64-ks8b9      1/1     Running   0          3m37s
kubedb      kubedb-kubedb-provisioner-5c4669498f-8l7bd      1/1     Running   0          3m37s
kubedb      kubedb-kubedb-webhook-server-66b46b464-xmkrx    1/1     Running   0          3m37s
kubedb      kubedb-petset-operator-77b6b9897f-p4zr8         1/1     Running   0          3m37s
kubedb      kubedb-petset-webhook-server-7d6c8dc854-vfzc5   2/2     Running   0          3m37s
kubedb      kubedb-sidekick-c898cff4c-5ljcj                 1/1     Running   0          3m37s

If all of the pods are in the Running state, we can move on to the next step.

Create a Namespace

To keep resources isolated, we’ll use a separate namespace called demo throughout this tutorial. Run the following command to create the namespace:

$ kubectl create namespace demo
namespace/demo created

Create Deep Storage

One of Druid’s external dependencies is deep storage, where the segments are stored. It is a storage mechanism that Apache Druid does not provide itself. Amazon S3, Google Cloud Storage, Azure Blob Storage, S3-compatible storage (such as MinIO), and HDFS are generally convenient options for deep storage.

In this tutorial, we will run a MinIO server as deep storage in our cluster using the MinIO operator, and create a bucket named druid in it for the deployed Druid cluster to use.

$ helm repo add minio https://operator.min.io/
$ helm repo update minio
$ helm upgrade --install --namespace "minio-operator" --create-namespace "minio-operator" minio/operator --set operator.replicaCount=1
$ helm upgrade --install --namespace "demo" --create-namespace druid-minio minio/tenant \
--set tenant.pools[0].servers=1 \
--set tenant.pools[0].volumesPerServer=1 \
--set tenant.pools[0].size=1Gi \
--set tenant.certificate.requestAutoCert=false \
--set tenant.buckets[0].name="druid" \
--set tenant.pools[0].name="default"

Now we need to create a Secret named deep-storage-config. It contains the connection information that the Druid cluster will use to connect to the deep storage.

apiVersion: v1
kind: Secret
metadata:
  name: deep-storage-config
  namespace: demo
stringData:
  druid.storage.type: "s3"
  druid.storage.bucket: "druid"
  druid.storage.baseKey: "druid/segments"
  druid.s3.accessKey: "minio"
  druid.s3.secretKey: "minio123"
  druid.s3.protocol: "http"
  druid.s3.enablePathStyleAccess: "true"
  druid.s3.endpoint.signingRegion: "us-east-1"
  druid.s3.endpoint.url: "http://myminio-hl.demo.svc.cluster.local:9000/"

Let’s create the deep-storage-config Secret shown above:

$ kubectl apply -f deep-storage-config.yaml
secret/deep-storage-config created

You can also use options like Amazon S3, Google Cloud Storage, Azure Blob Storage, or HDFS by creating a connection-information Secret like this, and you are good to go.
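
For example, a hypothetical variant of the same Secret pointing at Amazon S3 instead of MinIO might look like the following. The key names mirror the MinIO example above, the bucket name and credential values are placeholders, and the MinIO-specific endpoint and path-style settings are dropped:

apiVersion: v1
kind: Secret
metadata:
  name: deep-storage-config
  namespace: demo
stringData:
  druid.storage.type: "s3"
  druid.storage.bucket: "my-druid-bucket"
  druid.storage.baseKey: "druid/segments"
  druid.s3.accessKey: "<aws-access-key-id>"
  druid.s3.secretKey: "<aws-secret-access-key>"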

Metadata Storage

Druid uses the metadata store to house various metadata about the system, but not to store the actual data. The metadata store retains all metadata essential for a Druid cluster to work. Apache Derby is the default metadata store for Druid; however, it is not suitable for production. MySQL and PostgreSQL are better suited as production metadata stores.

Luckily, both PostgreSQL and MySQL are readily available in KubeDB as CRDs, and by default the KubeDB operator will automatically create a MySQL cluster with a database named druid in it.

If you choose to use PostgreSQL as the metadata storage, you can simply mention that in the spec.metadataStorage.type field of the Druid CR, and the KubeDB operator will deploy a PostgreSQL cluster for Druid to use.
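
As a sketch, the relevant portion of the Druid CR would look something like this (the field path comes from the text above; check the KubeDB Druid documentation for the exact accepted values):

spec:
  metadataStorage:
    type: PostgreSQL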

ZooKeeper

Apache Druid uses Apache ZooKeeper (ZK) for managing the current cluster state, i.e. internal service discovery, coordination, and leader election. Fortunately, KubeDB also supports ZooKeeper, and the KubeDB operator will automatically create a ZooKeeper cluster for Druid to use.

Deploy Druid using Kubernetes Druid operator

Here is the YAML of the Druid object we are going to use:

apiVersion: kubedb.com/v1alpha2
kind: Druid
metadata:
  name: druid-cluster
  namespace: demo
spec:
  deepStorage:
    configSecret:
      name: deep-storage-config
    type: s3
  topology:
    routers:
      replicas: 1
  version: 30.0.0
  deletionPolicy: "WipeOut"

Let’s save this YAML configuration as druid-cluster.yaml, then apply it:

$ kubectl apply -f druid-cluster.yaml
druid.kubedb.com/druid-cluster created
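
Provisioning can take a few minutes. You can watch the object’s status change with kubectl’s standard watch flag:

$ kubectl get druid -n demo druid-cluster -w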

You can see the detailed yaml specifications in the Kubernetes Druid documentation.

Once the Druid object is deployed and reconciled, you will see that the following resources have been created:

$ kubectl get all -n demo
NAME                                 READY   STATUS    RESTARTS   AGE
pod/druid-cluster-brokers-0          1/1     Running   0          54s
pod/druid-cluster-coordinators-0     1/1     Running   0          60s
pod/druid-cluster-historicals-0      1/1     Running   0          58s
pod/druid-cluster-middlemanagers-0   1/1     Running   0          56s
pod/druid-cluster-mysql-metadata-0   2/2     Running   0          2m29s
pod/druid-cluster-mysql-metadata-1   2/2     Running   0          2m23s
pod/druid-cluster-mysql-metadata-2   2/2     Running   0          2m17s
pod/druid-cluster-routers-0          1/1     Running   0          52s
pod/druid-cluster-zk-0               1/1     Running   0          2m28s
pod/druid-cluster-zk-1               1/1     Running   0          2m22s
pod/druid-cluster-zk-2               1/1     Running   0          2m16s
pod/myminio-default-0                2/2     Running   0          12m

NAME                                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                 AGE
service/druid-cluster-brokers                  ClusterIP   10.96.150.199   <none>        8082/TCP                                                62s
service/druid-cluster-coordinators             ClusterIP   10.96.145.130   <none>        8081/TCP                                                62s
service/druid-cluster-mysql-metadata           ClusterIP   10.96.176.254   <none>        3306/TCP                                                2m32s
service/druid-cluster-mysql-metadata-pods      ClusterIP   None            <none>        3306/TCP                                                2m32s
service/druid-cluster-mysql-metadata-standby   ClusterIP   10.96.139.10    <none>        3306/TCP                                                2m32s
service/druid-cluster-pods                     ClusterIP   None            <none>        8081/TCP,8090/TCP,8083/TCP,8091/TCP,8082/TCP,8888/TCP   62s
service/druid-cluster-routers                  ClusterIP   10.96.148.73    <none>        8888/TCP                                                62s
service/druid-cluster-zk                       ClusterIP   10.96.76.149    <none>        2181/TCP                                                2m32s
service/druid-cluster-zk-admin-server          ClusterIP   10.96.139.96    <none>        8080/TCP                                                2m32s
service/druid-cluster-zk-pods                  ClusterIP   None            <none>        2181/TCP,2888/TCP,3888/TCP                              2m32s
service/minio                                  ClusterIP   10.96.31.66     <none>        80/TCP                                                  12m
service/myminio-console                        ClusterIP   10.96.193.192   <none>        9090/TCP                                                12m
service/myminio-hl                             ClusterIP   None            <none>        9000/TCP                                                12m

NAME                               READY   AGE
statefulset.apps/myminio-default   1/1     12m

NAME                                                              TYPE                   VERSION   AGE
appbinding.appcatalog.appscode.com/druid-cluster                  kubedb.com/druid       30.0.0    52s
appbinding.appcatalog.appscode.com/druid-cluster-mysql-metadata   kubedb.com/mysql       8.0.35    2m29s
appbinding.appcatalog.appscode.com/druid-cluster-zk               kubedb.com/zookeeper   3.7.2     2m32s

NAME                                            VERSION   STATUS   AGE
mysql.kubedb.com/druid-cluster-mysql-metadata   8.0.35    Ready    2m32s

NAME                             TYPE                  VERSION   STATUS         AGE
druid.kubedb.com/druid-cluster   kubedb.com/v1alpha2   30.0.0    Provisioning   2m32s

NAME                                    TYPE                  VERSION   STATUS   AGE
zookeeper.kubedb.com/druid-cluster-zk   kubedb.com/v1alpha2   3.7.2     Ready    2m32s

Let’s check if the database is ready to use,

$ kubectl get druid -n demo druid-cluster
NAME            TYPE                  VERSION   STATUS   AGE
druid-cluster   kubedb.com/v1alpha2   30.0.0    Ready    4m37s

Connect with the Druid Database

Now, we will access the Druid web UI using port forwarding, which allows you to manage the Druid cluster via a web interface. It is also possible to access Druid nodes via an external IP by setting the corresponding Service’s type to LoadBalancer.
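
For example, the routers Service could be switched with a generic kubectl patch like the one below. This is illustrative only: since the operator manages these Services, it may reconcile manual changes, so prefer setting the Service type through the Druid CR if your setup supports it.

$ kubectl patch service -n demo druid-cluster-routers \
  -p '{"spec": {"type": "LoadBalancer"}}'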

Port-forward the Service

KubeDB creates a few Services for connecting to the database. Let’s check the Services with the following command:

$ kubectl get service -n demo
NAME                                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                 AGE
druid-cluster-brokers                  ClusterIP   10.96.150.199   <none>        8082/TCP                                                5m1s
druid-cluster-coordinators             ClusterIP   10.96.145.130   <none>        8081/TCP                                                5m1s
druid-cluster-mysql-metadata           ClusterIP   10.96.176.254   <none>        3306/TCP                                                6m31s
druid-cluster-mysql-metadata-pods      ClusterIP   None            <none>        3306/TCP                                                6m31s
druid-cluster-mysql-metadata-standby   ClusterIP   10.96.139.10    <none>        3306/TCP                                                6m31s
druid-cluster-pods                     ClusterIP   None            <none>        8081/TCP,8090/TCP,8083/TCP,8091/TCP,8082/TCP,8888/TCP   5m1s
druid-cluster-routers                  ClusterIP   10.96.148.73    <none>        8888/TCP                                                5m1s
druid-cluster-zk                       ClusterIP   10.96.76.149    <none>        2181/TCP                                                6m31s
druid-cluster-zk-admin-server          ClusterIP   10.96.139.96    <none>        8080/TCP                                                6m31s
druid-cluster-zk-pods                  ClusterIP   None            <none>        2181/TCP,2888/TCP,3888/TCP                              6m31s
minio                                  ClusterIP   10.96.31.66     <none>        80/TCP                                                  16m
myminio-console                        ClusterIP   10.96.193.192   <none>        9090/TCP                                                16m
myminio-hl                             ClusterIP   None            <none>        9000/TCP                                                16m

To connect to the Druid database, we will use the druid-cluster-routers service. First, we need to port-forward the druid-cluster-routers service to port 8888 on the local machine:

$ kubectl port-forward -n demo svc/druid-cluster-routers 8888
Forwarding from 127.0.0.1:8888 -> 8888
Forwarding from [::1]:8888 -> 8888

Now, the Druid cluster is accessible at localhost:8888. Open a web browser and navigate to localhost:8888. You will see the Druid login panel.

Retrieve the Credentials

To access the database through the Druid web UI, we first need the login credentials. KubeDB creates several Secrets for managing the database. To view the Secrets created for druid-cluster, run the following command:

$ kubectl get secret -n demo | grep druid-cluster
druid-cluster-admin-cred            kubernetes.io/basic-auth   2      8m19s
druid-cluster-config                Opaque                     11     8m19s
druid-cluster-mysql-metadata-auth   kubernetes.io/basic-auth   2      9m49s
druid-cluster-zk-auth               kubernetes.io/basic-auth   2      9m49s
druid-cluster-zk-config             Opaque                     3      9m49s

From the above list, the druid-cluster-admin-cred Secret contains the admin-level credentials needed to connect to the database.

Use the following commands to obtain the username and password:

$ kubectl get secret -n demo druid-cluster-admin-cred -o jsonpath='{.data.username}' | base64 -d
admin
$ kubectl get secret -n demo druid-cluster-admin-cred -o jsonpath='{.data.password}' | base64 -d
9eKho82VIuiLt*oe
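
With these values, you can optionally sanity-check the router from the command line before opening the browser. Druid processes expose a /status endpoint; substitute the password retrieved above (the exact JSON returned varies by version):

$ curl -u admin:<password> http://localhost:8888/status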

Access the Web UI

With the credentials in hand, log in to the Druid web UI.

After logging in with the credentials, you’ll see the Druid web console.

First, click on Load Data in the top menu, then choose Batch - classic. This will bring up the data loading screen.

Next, select Example data and click the Load example button.

Proceed through the sequential steps by clicking Next at the bottom-right corner of each screen: Parse data, Parse time, Transform, Filter, Configure schema, Partition, Tune, Publish, Edit spec, and finally Submit. Once you have submitted the task, watch the Tasks list; the status will initially show as RUNNING and eventually change to SUCCESS.

After the task is successfully completed, navigate to the Datasources section from the top menu. You should find your new datasource, such as wikipedia, listed there. Click on it to open the Query option where you can run queries on your data.
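
For instance, a simple Druid SQL query against the sample data might look like the following. The channel column is part of the standard wikipedia example dataset; adjust the names if your ingested schema differs.

SELECT channel, COUNT(*) AS edits
FROM wikipedia
GROUP BY channel
ORDER BY edits DESC
LIMIT 10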

Congratulations! We’ve successfully deployed Druid on Kubernetes using the Kubernetes Druid operator KubeDB. Additionally, we’ve added sample data to the Druid database using the web UI. This interface provides a convenient way to monitor and manage the Druid cluster.

Druid on Kubernetes: Best Practices

To ensure the stability of your application while using Druid on Kubernetes, there are some best practices that you should follow:

  • Security Configurations: Protect sensitive data and Druid clusters by implementing security best practices. Use Kubernetes Secrets to manage credentials, and consider securing communication between Druid services with TLS encryption. Additionally, enable role-based access control (RBAC) to enforce strict access policies for users and services interacting with Druid.

  • Monitoring and Logging: Monitoring the performance of your Druid cluster is critical for detecting issues and optimizing resource usage. Integrate Prometheus and Grafana to visualize Druid metrics like query performance, data ingestion rates, and resource consumption. Additionally, enable centralized logging (using tools like Fluentd or Elasticsearch) to collect logs from all Druid nodes for easier debugging and analysis. A minimal monitoring sketch follows this list.

  • Data Replication and Backup: To ensure high availability and fault tolerance, implement data replication across your Druid cluster. Regular backups of Druid’s metadata store and deep storage are essential for disaster recovery. Configure persistent volumes (PVs) for deep storage in Kubernetes, and backup your metadata database (such as MySQL or PostgreSQL).

  • Using the Kubernetes Druid Operator: The Kubernetes Druid Operator simplifies the management of Druid clusters on Kubernetes. It automates tasks such as scaling, upgrading, and managing Druid components. By utilizing the operator, you reduce the complexity of running Druid in production and benefit from automation that handles many operational tasks.
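
As a sketch of the monitoring point above: KubeDB resources can typically request a Prometheus exporter through a monitor stanza in the CR. The snippet below follows the common KubeDB pattern, but treat the exact field names as an assumption and verify them against the KubeDB Druid documentation for your version:

# Assumed KubeDB-style monitoring stanza; field names follow the
# common KubeDB CRD pattern and may differ for Druid.
spec:
  monitor:
    agent: prometheus.io/operator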

Conclusion

Apache Druid, known for its high-performance analytics and real-time data processing capabilities, proves invaluable in modern data architectures. Running Druid on Kubernetes introduces significant operational benefits, such as seamless deployment, automated scaling, and centralized management. By adopting best practices like resource management, data security, monitoring, and the Kubernetes Druid Operator, you can ensure the stability, availability, and scalability of your Druid-based data pipelines. As Kubernetes continues to power the modern cloud-native ecosystem, leveraging its full potential with Druid ensures that your data infrastructure is both resilient and scalable. For more detailed information on Druid, refer to the official Druid documentation. Combined with best-in-class practices and tools such as KubeDB, organizations can ensure efficient database management across complex environments, keeping performance and availability at the highest standards.
