Couchbase OpenShift operator problem - Couchbase cluster will not come up

  1. We recently applied network policies to our OpenShift project to enable multi-tenancy per the OpenShift documentation Configuring multitenant network policy - Network policy | Networking | OpenShift Container Platform 4.6.
  2. After doing this, when we created a new Couchbase instance with 4 pods, only one pod was getting created.
  3. We opened a ticket with Red Hat to diagnose this issue further, as we were not seeing any errors.
  4. While working with Red Hat, we noticed that the Couchbase Operator was installed in the openshift-operator project instead of the inf-auto project where we created the Couchbase cluster instance. I remember selecting inf-auto when I installed it the first time, so this was unexpected.
  5. We removed the operator and re-installed it in the inf-auto project.
  6. When we tried to create a new Couchbase cluster instance, no pods were getting created and we saw the following error:

{"level":"info","ts":1619812059.4318697,"logger":"cluster","msg":"Cluster does not exist so the operator is attempting to create it","cluster":"a-couchbase-test/cb-example-test4"}

{"level":"info","ts":1619812059.4931834,"logger":"cluster","msg":"Creating pod","cluster":"a-couchbase-test/cb-example-test4","name":"cb-example-test4-0000","image":"registry.connect.redhat.com/couchbase/server@sha256:fd6d9c0ef033009e76d60dc36f55ce7f3aaa942a7be9c2b66c335eabc8f5b11e"}

{"level":"info","ts":1619812059.515399,"logger":"cluster","msg":"Member creation failed","cluster":"a-couchbase-test/cb-example-test4","name":"cb-example-test4-0000","resource":""}

{"level":"info","ts":1619812059.5357425,"logger":"cluster","msg":"Pod deleted","cluster":"a-couchbase-test/cb-example-test4","name":"cb-example-test4-0000"}

{"level":"info","ts":1619812059.5357752,"logger":"cluster","msg":"Reconciliation failed","cluster":"a-couchbase-test/cb-example-test4","error":"fail to create member's pod (cb-example-test4-0000): pods \"cb-example-test4-0000\" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.fsGroup: Invalid value: int64{1000}: 1000 is not an allowed group]","stack":"github.com/couchbase/couchbase-operator/pkg/util/k8sutil.CreatePod\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/util/k8sutil/k8sutil.go:246\ngithub.com/couchbase/couchbase-operator/pkg/util/k8sutil.CreateCouchbasePod\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/util/k8sutil/pod_util.go:104\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createPod\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/cluster.go:489\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createMember\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/reconcile.go:299\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).create\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/cluster.go:289\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/reconcile.go:117\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/cluster.go:398\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/cluster/cluster.go:429\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\t/home/couchbase/jenkins/workspace/couchbase-k8s-microservice-build/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/couchbase/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/couchbase/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/home/couchbase/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.0/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/home/couchbase/go/pkg/mod/k8s.io/apimachinery@v0.17.5-beta.0/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/home/couchbase/go/pkg/mod/k8s.io/apimachinery@v0.17.5-beta.0/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/home/couchbase/go/pkg/mod/k8s.io/apimachinery@v0.17.5-beta.0/pkg/util/wait/wait.go:88"}

Our OpenShift admin installed the operator for us, since we do not have the permissions. The above is their summary, but we still do not have a working Couchbase cluster. We really do not understand how the network policy change could affect the Couchbase cluster.

That's the hint we need… let me explain. In the distant past, you needed to fill in the fsGroup correctly or persistent volumes wouldn't work. Because most users didn't fill this in, we decided to try to do it for you with the dynamic admission controller (DAC). On OCP the DAC interrogates the namespace that the cluster lives in and extracts the fsGroup from its annotations, which makes me suspect that the dynamic admission controller isn't working correctly. You can manually set the fsGroup using these instructions: Persistent Volumes | Couchbase Docs
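For anyone else hitting this, here is a minimal sketch of what manually setting the fsGroup can look like on the CouchbaseCluster resource, assuming an Operator 2.x CRD that exposes spec.securityContext; the admin secret name and the value 1000700000 are placeholders, and the real value must fall inside the range shown in your namespace's openshift.io/sa.scc.supplemental-groups annotation:

apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-example-test4
  namespace: a-couchbase-test
spec:
  image: registry.connect.redhat.com/couchbase/server@sha256:fd6d9c0ef033009e76d60dc36f55ce7f3aaa942a7be9c2b66c335eabc8f5b11e
  security:
    adminSecret: cb-example-auth          # placeholder secret name
  # Normally the DAC fills this in from the namespace annotation; with no DAC
  # running you have to set it yourself so the restricted SCC accepts the pods.
  securityContext:
    fsGroup: 1000700000                   # placeholder; use your namespace's range
  servers:
    - size: 4
      name: all_services
      services:
        - data
        - index
        - query

You can read the allowed range with oc describe namespace a-couchbase-test and look at the openshift.io/sa.scc.supplemental-groups annotation.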

I manually set the fsGroup and all the pods came up. Thanks for your help.

I sent the following to our OpenShift admin, questioning why the DAC is not there.
Check the Status of the Operator
You can use the following command to check on the status of the deployments:
$ oc get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
couchbase-operator 1/1 1 1 8s
couchbase-operator-admission 1/1 1 1 8s
The Operator is ready to deploy CouchbaseCluster resources when both the DAC and Operator deployments are fully ready and available.

root@usapprshilt100:/Automation/projects/openshift #oc project a-couchbase-test
Now using project "a-couchbase-test" on server "https://api.ivz-ocp-poc.ops.invesco.net:6443".
root@usapprshilt100:/Automation/projects/openshift #
root@usapprshilt100:/Automation/projects/openshift #oc get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
couchbase-operator 1/1 1 1 3d3h
root@usapprshilt100:/Automation/projects/openshift #

In the POC, I did not see the Couchbase DAC running; couchbase-operator-admission is missing.
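One way to confirm whether the DAC exists anywhere in the cluster is to look for its deployment and its admission webhooks (a sketch; the names below assume a default install and may differ):

$ oc get deployments --all-namespaces | grep couchbase-operator-admission
$ oc get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i couchbase

If both commands come back empty, the DAC is not installed anywhere in the cluster.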

But our admin mentioned that in our dev environment, when they installed Couchbase for all namespaces, they did not see the DAC running, yet everything was working fine. Now they want to install the Couchbase Operator only for our namespace, and this is where the problem of pods not coming up appears. So when the operator is installed for all namespaces, you do not need the DAC?

No, the DAC always needs to be installed. We recommend running it in the default cluster-wide mode, and therefore you only need one installed, in any namespace.

When we install it from the GUI, according to our OpenShift admin, the DAC is not installed after he clicks Install. I could try to install it using the YAML file following the instructions in the operator documentation; however, I think my permissions as admin of the namespace are not good enough to finish the installation. Will it still need the OpenShift cluster-admin role to install it?

That's correct: you need to install the DAC manually; it is not installed alongside the operator from the OpenShift UI.
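For reference, this is roughly what the manual install looks like with the cbopcfg tool that ships in the Couchbase Autonomous Operator package (a sketch only; flags and resource names can differ between operator versions, and it has to be run by a cluster admin):

$ bin/cbopcfg generate admission --namespace openshift-operators | oc create -f -
$ oc get deployment couchbase-operator-admission -n openshift-operators

The first command generates and creates the DAC deployment, service, TLS secret and webhook configurations; the second just verifies that the deployment came up.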

Hello guys, I feel I had the same issue as @lukq.
After I installed the operator, only one instance (pod) of the cluster started running, and it kept restarting. Checking the events in descending order, this is what we had:

94s         Warning   MemberCreationFailed   couchbasecluster/couchbase-cluster   New member couchbase-cluster-0000 creation failed
89s         Normal    ServiceCreated         couchbasecluster/couchbase-cluster   Service for admin console `couchbase-cluster-ui` was created
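(For context, that listing is roughly what a plain event query produces, something like the following; the namespace is assumed from the cluster YAML further down:)

$ oc get events -n databases-ntr-dev --sort-by='.lastTimestamp'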

Also I noticed that there is no Deployment or StatefulSet; it seems the operator manages the pods directly.
Maybe it is because we are running a very recent version of the operator:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  labels:
    operators.coreos.com/couchbase-enterprise-certified.openshift-operators: ""
  name: couchbase-enterprise-certified
  namespace: openshift-operators
  annotations:
    argocd.argoproj.io/sync-wave: "4"
spec:
  channel: 2.3.2
  installPlanApproval: Automatic
  name: couchbase-enterprise-certified
  source: certified-operators
  sourceNamespace: openshift-marketplace
  startingCSV: couchbase-operator.v2.3.2-1

The cluster YAML comes by default with the Operator UI in the OCP console; we just copied it and renamed the resource … and this is what we had:

kind: CouchbaseCluster
apiVersion: couchbase.com/v2
metadata:
  name: couchbase-cluster
  namespace: databases-ntr-dev
spec:
  image: registry.connect.redhat.com/couchbase/server@sha256:05aad0f1d3a373b60dece893a9c185dcb0e0630aa6f0c0f310ad8767918fd2af
  cluster:
    clusterName: couchbase-cluster
    dataServiceMemoryQuota: 256Mi
    indexServiceMemoryQuota: 256Mi
    searchServiceMemoryQuota: 256Mi
    eventingServiceMemoryQuota: 256Mi
    analyticsServiceMemoryQuota: 1Gi
    indexStorageSetting: memory_optimized
    autoFailoverTimeout: 120s
    autoFailoverMaxCount: 3
    autoFailoverOnDataDiskIssues: true
    autoFailoverOnDataDiskIssuesTimePeriod: 120s
    autoFailoverServerGroup: false
  upgradeStrategy: RollingUpgrade
  hibernate: false
  hibernationStrategy: Immediate
  recoveryPolicy: PrioritizeDataIntegrity
  security:
    adminSecret: couchbase-cluster-auth
    rbac:
      managed: true
      selector:
        matchLabels:
          cluster: couchbase-cluster
  xdcr:
    managed: false
    selector:
      matchLabels:
        cluster: couchbase-cluster
  backup:
    image: >-
      registry.connect.redhat.com/couchbase/operator-backup@sha256:c0ab51854294d117c4ecf867b541ed6dc67410294d72f560cc33b038d98e4b76
    managed: false
    serviceAccountName: couchbase-backup
    selector:
      matchLabels:
        cluster: couchbase-cluster
  monitoring:
    prometheus:
      enabled: false
      # registry.connect.redhat.com
      image: registry.connect.redhat.com/couchbase/exporter@sha256:d392e6c902f784abfc083c9bf5ce11895d0183347b6c21b259678fd85f312cd4
  networking:
    exposeAdminConsole: true
    adminConsoleServices:
      - data
    exposedFeatures:
      - xdcr
    exposedFeatureServiceType: NodePort
    # adminConsoleServiceType: NodePort
    adminConsoleServiceTemplate:
      spec:
        type: ClusterIP
  buckets:
    managed: true
    selector:
      matchLabels:
        cluster: couchbase-cluster
  logRetentionTime: 604800s
  logRetentionCount: 20
  enablePreviewScaling: false
  servers:
    - size: 3
      name: all_services
      services:
        - data
        - index
        - query
        - search
        - eventing
        - analytics

Is it the same root cause? I mean: the missing admission controller, i.e. the DAC?