Operator not scaling up cluster #267

Open
yywandb opened this issue Jan 29, 2021 · 1 comment
yywandb commented Jan 29, 2021

Thanks for opening an issue for the M3DB Operator! We'd love to help you, but we need the following information included
with any issue:

  • What version of the operator are you running? Please include the docker tag. If using master, please include the git
    SHA logged when the operator first starts.

v0.10.0

  • What version of Kubernetes are you running? Please include the output of kubectl version.
❯ kubectl version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.5", GitCommit:"e6503f8d8f769ace2f338794c914a96fc335df0f", GitTreeState:"clean", BuildDate:"2020-06-27T00:38:11Z", GoVersion:"go1.14.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.12", GitCommit:"17c50ce2d686f4346924935063e3a431360e0db7", GitTreeState:"clean", BuildDate:"2020-06-26T03:33:27Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
  • What are you trying to do?

Increase the number of instances per isolation group of our m3db cluster by 1, i.e., add 3 nodes to the cluster, one per isolation group (one for each replica).
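
For concreteness, this is roughly the change we apply to the M3DBCluster spec (a sketch only: the instance counts and group names are placeholders, and the field names follow the operator's isolationGroups configuration as we understand it):

apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: m3db
  namespace: m3
spec:
  replicationFactor: 3
  isolationGroups:
    - name: rep0            # placeholder group names
      numInstances: 6       # placeholder; our real count, bumped by 1
    - name: rep1
      numInstances: 6
    - name: rep2
      numInstances: 6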

  • What did you expect to happen?

The operator to detect the spec change and begin adding the new nodes.

  • What happened?

The operator doesn't scale up the cluster. We see logs that look like this:

{"level":"info","ts":"2021-01-29T19:05:39.751Z","msg":"statefulset already exists","controller":"m3db-cluster-controller","name":"m3db-rep0"}
{"level":"info","ts":"2021-01-29T19:05:39.751Z","msg":"successfully synced item","controller":"m3db-cluster-controller","key":"m3/m3db"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-1"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep0-2"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-7"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep0-16"}
{"level":"info","ts":"2021-01-29T19:05:40.254Z","msg":"processing pod","controller":"m3db-cluster-controller","pod.namespace":"m3","pod.name":"m3db-rep2-4"}

We previously saw the same issue when using v0.7.0 of the operator, with these logs:

{"level":"error","ts":"2021-01-28T21:18:58.717Z","msg":"statefulsets.apps \"m3db-rep0\" already exists","controller":"m3db-cluster-controller"}
E0128 21:18:58.717342       1 controller.go:319] error syncing cluster 'm3/m3db': statefulsets.apps "m3db-rep0" already exists

At that time, we chatted with @robskillington, who suggested upgrading to 0.8.0 or newer, which has better state syncing in large k8s clusters and should reduce issues caused by stale views of objects, such as statefulsets not being seen as existing.

We hoped upgrading to v0.10.0 would resolve it, but the same issue appears to persist, although the "statefulset already exists" log is now at info level rather than error.

We're trying to understand how "statefulset already exists" might relate to the operator not beginning to scale up the cluster. We're still unsure whether this is an issue on our k8s cluster side or a bug in the operator.
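
For reference, the way we've been checking whether the operator picked up the spec change is to compare the desired counts on the M3DBCluster against the StatefulSets' replica counts (plain kubectl; the jsonpath uses the isolationGroups field names from the sketch above):

kubectl -n m3 get m3dbcluster m3db -o jsonpath='{.spec.isolationGroups[*].numInstances}'
kubectl -n m3 get statefulsets | grep m3db-rep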

Other things we've tried:

  • [didn't work] Edit the m3dbcluster back to the original number of instances, restart the operator, then edit the m3dbcluster back up to the desired number of instances.
  • [worked] Delete the m3db-rep0 statefulset (the operator didn't recreate the sts yet), then restart the operator; after that we saw the operator create the new statefulset with the desired number of instances and start scaling up the cluster (rough commands below).
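
Roughly the commands for the workaround that worked (a sketch, assuming the operator runs as a pod we can delete to force a restart; the operator's namespace and pod name below are placeholders):

kubectl -n m3 delete statefulset m3db-rep0
kubectl -n <operator-namespace> delete pod <m3db-operator-pod>   # force the operator to restart
# after restarting, the operator recreated m3db-rep0 with the desired
# instance count and began scaling up the cluster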
@gibbscullen gibbscullen self-assigned this Mar 25, 2021
@jeromefroe
Collaborator

Hi @yywandb! Sorry for the delay in following up on this issue. Based on your description, it seems like the operator doesn't become aware that the cluster spec has been updated unless it's restarted. Does that sound right? If so, this might be similar to a previous issue we ran into, #268, where the operator would update a StatefulSet without waiting for the previous StatefulSet it had just updated to become healthy. The root cause of that issue was that the operator was working with stale copies of the StatefulSets in the cluster, and it was addressed in #271. That change was included in the most recent release, v0.13.0. While it deals with StatefulSets rather than m3db cluster CRDs like this issue, it would be interesting to see whether the issue still occurs with the latest release. To that end, would it be possible to update your operator to v0.13.0? One thing to be aware of before upgrading is that v0.12.0 contained breaking changes to the default ConfigMap the operator uses for M3DB, so if you rely on the default ConfigMap you'll need to provide the old default as a custom ConfigMap.
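
For anyone following along, a rough sketch of what providing the old default as a custom ConfigMap could look like before upgrading (assuming the spec.configMapName field and the m3.yml data key; the actual pre-v0.12.0 config contents aren't reproduced here):

apiVersion: v1
kind: ConfigMap
metadata:
  name: m3db-old-default-config   # hypothetical name
  namespace: m3
data:
  m3.yml: |
    # paste the pre-v0.12.0 default operator config here
---
# then reference it from the cluster spec:
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: m3db
  namespace: m3
spec:
  configMapName: m3db-old-default-config   # assumed field for pointing at a custom ConfigMap
  # ...rest of the spec unchanged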
