Installing couchbase operator using helm fails readiness probes

Hello Couchbase,

I am using the Helm chart v2.3.001 (the latest as of now), which contains the Couchbase operator v2.3.0.

I created a fresh k8s cluster v1.23.6 (deployed via kind or kubespray, with Calico as the CNI) and installed the chart with the following command: helm install default couchbase/couchbase-operator.
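For completeness, the full sequence I ran was roughly the following (the repository URL is the one from the public Couchbase Helm chart docs; adjust the release name if yours differs):

    # Add the Couchbase chart repository and refresh the local index
    helm repo add couchbase https://couchbase-partners.github.io/helm-charts
    helm repo update

    # Install the operator chart under the release name "default"
    helm install default couchbase/couchbase-operator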

kubectl get pods:

jupiter-0000 0/1 Running 0 8m
jupiter-couchbase-admission-controller-5d5ff4d897-ldk5b 1/1 Running 0 25h
jupiter-couchbase-operator-7bf5b8556-98khq 1/1 Running 0 25h

kubectl describe pods jupiter-0000:

Events:
Type     Reason     Age                  From               Message
----     ------     ----                 ----               -------
Normal   Scheduled  3m39s                default-scheduler  Successfully assigned default/jupiter-0000 to node2
Normal   Pulled     3m39s                kubelet            Container image "couchbase/server:7.0.2" already present on machine
Normal   Created    3m39s                kubelet            Created container couchbase-server-init
Normal   Started    3m39s                kubelet            Started container couchbase-server-init
Normal   Pulled     3m39s                kubelet            Container image "couchbase/server:7.0.2" already present on machine
Normal   Created    3m39s                kubelet            Created container couchbase-server
Normal   Started    3m39s                kubelet            Started container couchbase-server
Warning  Unhealthy  5s (x13 over 3m20s)  kubelet            Readiness probe failed: dial tcp 10.233.96.212:8091: connect: connection refused

kubectl logs of the operator:
{"level":"info","ts":1652886554.1603956,"logger":"cluster","msg":"Pod deleted","cluster":"default/jupiter","name":"jupiter-0000"}
{"level":"info","ts":1652886554.169,"logger":"cluster","msg":"Reconciliation failed","cluster":"default/jupiter","error":"fail to create member's pod (jupiter-0000): dial tcp 10.233.90.54:8091: connect: connection refused","stack":"github.com/couchbase/couchbase-operator/pkg/util/netutil.WaitForHostPort\n\tgithub.com/couchbase/couchbase-operator/pkg/util/netutil/netutil.go:37\ngithub.com/couchbase/couchbase-operator/pkg/util/k8sutil.WaitForPod\n\tgithub.com/couchbase/couchbase-operator/pkg/util/k8sutil/k8sutil.go:289\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).waitForCreatePod\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/pod.go:108\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createPod\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/pod.go:41\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createMember\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:168\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createInitialMember\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:310\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).create\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:325\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:148\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:481\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:524\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"}

kubectl logs for the server:

Starting Couchbase Server -- Web UI available at http://<ip>:8091
and logs available in /opt/couchbase/var/lib/couchbase/logs
chown: changing ownership of 'var/lib/couchbase': Operation not permitted

Can someone please help me find the root cause of this issue?

I have tried getting inside the Couchbase operator pod to run cbopinfo, but I was not able to: the Docker image appears to be stripped of all binaries such as sh or bash. I would love to know the steps needed to run cbopinfo against a k8s cluster so I can provide you with more logs.

Cheers.

Hi!

Could you share the CRDs for the Couchbase cluster that the operator tried to apply? Are you using persistent volume claims?

Thank you!

Hello Dmitrii,

Here are the CRDs that I applied before installing the operator:

Yes, I am using PVC with storage classes:

        services:
        - data
        - index
        - query
        - search
        size: 1
        volumeMounts:
          default: couchbase
          data: couchbase
          index: couchbase
    volumeClaimTemplates:
      - metadata:
          name: couchbase
        spec:
          storageClassName: default
          resources:
            requests:
              storage: 50Gi

My default storage class is the local path provisioner (rancher/local-path-provisioner on GitHub: dynamically provisioning persistent local storage with Kubernetes).
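In case it helps, these are the generic commands I am using to confirm which StorageClass is the default and whether the pod's PVC actually binds (nothing chart-specific here):

    # Show storage classes; the default one is marked "(default)"
    kubectl get storageclass

    # Check that the PVC created for jupiter-0000 is Bound and has a matching PV
    kubectl get pvc,pv -n default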

Also, in case it helps, I am using the security context provided by default in the Helm chart:

 securityContext:
    fsGroup: 1000
    # -- Indicates that the container must run as a non-root user. If true, the
    # Kubelet will validate the image at runtime to ensure that it does not run
    # as UID 0 (root) and fail to start the container if it does. If unset or
    # false, no such validation will be performed. May also be set in
    # SecurityContext.  If set in both SecurityContext and PodSecurityContext,
    # the value specified in SecurityContext takes precedence.
    runAsNonRoot: true
    runAsUser: 1000
    sysctls: []
    # -- The Windows specific settings applied to all containers. If
    # unspecified, the options within a container's SecurityContext will be
    # used. If set in both SecurityContext and PodSecurityContext, the value
    # specified in SecurityContext takes precedence. Note that this field cannot
    # be set when spec.os.name is linux.
    windowsOptions: {}

Could you try setting couchbaseclusters.spec.securityContext.fsGroup to 1000 as described here? You also may need to set runAsUser to 1000, according to this.
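For reference, a minimal sketch of what that would look like on the cluster resource itself (the cluster name is just taken from your output above; please double-check the field path against the CRD reference for your operator version):

    apiVersion: couchbase.com/v2
    kind: CouchbaseCluster
    metadata:
      name: jupiter
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000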

Thank you

This is exactly what I am doing. By default, it is set like you described, thanks.

Most likely this behavior is caused by the configuration of the Calico CNI networking stack, so I suggest trying the default kindnet CNI to see whether it resolves the readiness error.

The logs of the Calico pods may also uncover some interesting information for debugging. I've personally spun up a kind cluster and done a successful install, even with the permission-denied error present.
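If it helps, a minimal way to reproduce with kind's default kindnet CNI (the node image tag simply mirrors the k8s version you mentioned; any recent kind release should provide it):

    # Throwaway cluster using kind's built-in CNI (kindnet)
    kind create cluster --name cb-test --image kindest/node:v1.23.6

    # Install the chart the same way as before
    helm install default couchbase/couchbase-operator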

Thanks @tommie

I just installed it in kind and it works fine. Kubespray does not support kindnet; I tried different CNI plugins and got even more errors, with many pods crashing :confused:

I uploaded the cao logs collected by running ./cao collect-logs --log-level=1:
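For anyone who hit the same wall I did with cbopinfo: cao ships in the operator tools package and runs from your workstation against the current kubeconfig context, so no shell is needed inside the operator pod. Roughly what I ran (the directory is simply wherever I unpacked the package; --help lists the remaining options):

    # From the unpacked operator tools package
    ./cao collect-logs --log-level=1

    # Show all available flags
    ./cao collect-logs --help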

I run my k8s cluster on bare metal with L2 connectivity, and I kept the default values for Calico that are set by kubespray (kubespray/k8s-net-calico.yml in the kubernetes-sigs/kubespray repository on GitHub).
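In case it is useful, a quick way I test whether the failing address is reachable from another pod (IP and port are taken from the readiness-probe events above; the curl image is just a convenient choice):

    # One-off pod that tries to reach the Couchbase admin port from inside the pod network
    kubectl run netcheck --rm -it --restart=Never --image=curlimages/curl --command -- \
      curl -v --connect-timeout 5 http://10.233.96.212:8091/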

I can’t see any errors in the calico pods.

Cheers.

Thanks for the logs! From an Operator perspective all I can tell is that there is a failure to communicate with the Pods over this network.

message: 'fail to create member''s pod (default-0000): dial tcp 10.233.96.4:8091: connect: connection refused'

At this point it's probably best to reach out to someone supporting Calico to help resolve the issue, as I'm not sure what the root cause is.
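Before doing that, a couple of generic checks that sometimes surface Calico problems (the k8s-app=calico-node label matches the upstream manifests; kubespray may label things slightly differently):

    # Confirm every node runs a healthy calico-node pod, then inspect the one on node2
    kubectl -n kube-system get pods -o wide -l k8s-app=calico-node
    kubectl -n kube-system logs <calico-node-pod-on-node2>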