SPIRE Controller Manager nightly enters a crash loop when the ClusterStaticEntry CRD is missing. #177

Open
v0lkan opened this issue Jul 9, 2023 · 3 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

v0lkan commented Jul 9, 2023

The component was working as expected about 5 days ago (today is Jul 9, 2023).

The YAML files used to deploy SPIRE can be found at this snapshot:

https://github.com/shieldworks/aegis/tree/fbeb28f97761a768498aa9f03ca7521f41b641d6/k8s/spire

What happens:

The SPIRE Server pod crashes: its spire-controller-manager container keeps restarting. Here are the pod events and the SPIRE Controller Manager logs:

Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  36m                 default-scheduler  Successfully assigned spire-system/spire-server-6fb4f57c8-6dcpc to minikube
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-server:1.6.3"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-server:1.6.3" in 2.192539709s (3.846152012s including waiting)
  Normal   Created    36m                 kubelet            Created container spire-server
  Normal   Started    36m                 kubelet            Started container spire-server
  Normal   Pulling    36m                 kubelet            Pulling image "ghcr.io/spiffe/spire-controller-manager:nightly"
  Normal   Pulled     36m                 kubelet            Successfully pulled image "ghcr.io/spiffe/spire-controller-manager:nightly" in 2.192111448s (2.963491192s including waiting)
  Normal   Created    26m (x5 over 36m)   kubelet            Created container spire-controller-manager
  Normal   Started    26m (x5 over 36m)   kubelet            Started container spire-controller-manager
  Normal   Pulled     26m (x4 over 34m)   kubelet            Container image "ghcr.io/spiffe/spire-controller-manager:nightly" already present on machine
  Warning  BackOff    23s (x75 over 32m)  kubelet            Back-off restarting failed container spire-controller-manager in pod spire-server-6fb4f57c8-6dcpc_spire-system(ed1688e0-1e49-4beb-9585-dbcedebd4af3)
~/WORKSPACE/aegis (main) 🐢⚡️ k logs spire-server-6fb4f57c8-6dcpc -n spire-system -c spire-controller-manager
2023-07-09T21:47:26Z	INFO	setup	Config loaded	{"cluster name": "aegis-cluster", "cluster domain": "cluster.local", "trust domain": "aegis.ist", "ignore namespaces": ["kube-system", "kube-public", "spire-system", "local-path-storage", "kube-node-lease", "kube-public", "kubernetes-dashboard"], "gc interval": "10s", "spire server socket path": "/spire-server/api.sock"}
2023-07-09T21:47:26Z	INFO	setup	Dialing SPIRE Server socket
2023-07-09T21:47:26Z	INFO	controller-runtime.metrics	Metrics server is starting to listen	{"addr": "127.0.0.1:8082"}
2023-07-09T21:47:26Z	INFO	webhook-manager	Minting webhook certificate	{"reason": "initializing", "dnsNames": ["spire-controller-manager-webhook-service.spire-system.svc"]}
2023-07-09T21:47:26Z	INFO	webhook-manager	Minted webhook certificate
2023-07-09T21:47:26Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterFederatedTrustDomain"}
2023-07-09T21:47:26Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterFederatedTrustDomain", "path": "/validate-spire-spiffe-io-v1alpha1-clusterfederatedtrustdomain"}
2023-07-09T21:47:26Z	INFO	controller-runtime.webhook	Registering webhook	{"path": "/validate-spire-spiffe-io-v1alpha1-clusterfederatedtrustdomain"}
2023-07-09T21:47:26Z	INFO	controller-runtime.builder	skip registering a mutating webhook, object does not implement admission.Defaulter or WithDefaulter wasn't called	{"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterSPIFFEID"}
2023-07-09T21:47:26Z	INFO	controller-runtime.builder	Registering a validating webhook	{"GVK": "spire.spiffe.io/v1alpha1, Kind=ClusterSPIFFEID", "path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2023-07-09T21:47:26Z	INFO	controller-runtime.webhook	Registering webhook	{"path": "/validate-spire-spiffe-io-v1alpha1-clusterspiffeid"}
2023-07-09T21:47:26Z	INFO	setup	starting manager
2023-07-09T21:47:26Z	INFO	controller-runtime.webhook.webhooks	Starting webhook server
2023-07-09T21:47:26Z	INFO	starting server	{"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8082"}
2023-07-09T21:47:26Z	INFO	controller-runtime.certwatcher	Updated current TLS certificate
I0709 21:47:26.492611     229 leaderelection.go:245] attempting to acquire leader lease spire-system/98c9c988.spiffe.io...
2023-07-09T21:47:26Z	INFO	controller-runtime.certwatcher	Starting certificate watcher
2023-07-09T21:47:26Z	INFO	controller-runtime.webhook	Serving webhook server	{"host": "", "port": 9443}
I0709 21:47:44.050693     229 leaderelection.go:255] successfully acquired lease spire-system/98c9c988.spiffe.io
2023-07-09T21:47:44Z	INFO	Starting EventSource	{"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "source": "kind source: *v1alpha1.ClusterSPIFFEID"}
2023-07-09T21:47:44Z	INFO	Starting Controller	{"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID"}
2023-07-09T21:47:44Z	INFO	Starting EventSource	{"controller": "clusterstaticentry", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterStaticEntry", "source": "kind source: *v1alpha1.ClusterStaticEntry"}
2023-07-09T21:47:44Z	INFO	Starting Controller	{"controller": "clusterstaticentry", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterStaticEntry"}
2023-07-09T21:47:44Z	INFO	Starting EventSource	{"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain", "source": "kind source: *v1alpha1.ClusterFederatedTrustDomain"}
2023-07-09T21:47:44Z	INFO	Starting Controller	{"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain"}
2023-07-09T21:47:44Z	INFO	Starting EventSource	{"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "source": "kind source: *v1.Pod"}
2023-07-09T21:47:44Z	INFO	Starting Controller	{"controller": "pod", "controllerGroup": "", "controllerKind": "Pod"}
2023-07-09T21:47:44Z	DEBUG	events	spire-server-6fb4f57c8-6dcpc_111c5818-baeb-4a17-a464-921151f83677 became leader	{"type": "Normal", "object": {"kind":"Lease","namespace":"spire-system","name":"98c9c988.spiffe.io","uid":"5e3e9d0f-e23b-4970-8709-ef5dc1a4a9a5","apiVersion":"coordination.k8s.io/v1","resourceVersion":"7484"}, "reason": "LeaderElection"}
2023-07-09T21:47:44Z	INFO	webhook-manager	Received webhook added event
2023-07-09T21:47:44Z	ERROR	controller-runtime.source.EventHandler	if kind is a CRD, it should be installed before calling Start	{"kind": "ClusterStaticEntry.spire.spiffe.io", "error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1.1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:63
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:62
k8s.io/apimachinery/pkg/util/wait.loopConditionUntilContext
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/loop.go:63
k8s.io/apimachinery/pkg/util/wait.PollUntilContextCancel
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/poll.go:33
sigs.k8s.io/controller-runtime/pkg/internal/source.(*Kind).Start.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/source/kind.go:56
2023-07-09T21:47:44Z	ERROR	entry-reconciler	Failed to list ClusterStaticEntries	{"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
	/workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
	/workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219
2023-07-09T21:47:44Z	INFO	Starting workers	{"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "worker count": 1}
2023-07-09T21:47:44Z	DEBUG	Triggering reconciliation	{"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "ClusterSPIFFEID": {"name":"aegis-safe"}, "namespace": "", "name": "aegis-safe", "reconcileID": "24ae242b-1917-45f6-9533-86ed7f4310ab"}
2023-07-09T21:47:44Z	DEBUG	Triggering reconciliation	{"controller": "clusterspiffeid", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterSPIFFEID", "ClusterSPIFFEID": {"name":"aegis-sentinel"}, "namespace": "", "name": "aegis-sentinel", "reconcileID": "e509004d-66f0-40ef-8fc4-7eb675f8b6d0"}
2023-07-09T21:47:44Z	INFO	Starting workers	{"controller": "clusterfederatedtrustdomain", "controllerGroup": "spire.spiffe.io", "controllerKind": "ClusterFederatedTrustDomain", "worker count": 1}
2023-07-09T21:47:44Z	ERROR	entry-reconciler	Failed to list ClusterStaticEntries	{"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
	/workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
	/workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219
2023-07-09T21:47:44Z	INFO	Starting workers	{"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "worker count": 1}
2023-07-09T21:47:44Z	DEBUG	Triggering reconciliation	{"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"aegis-sentinel-547bc8f7f6-84nj9","namespace":"aegis-system"}, "namespace": "aegis-system", "name": "aegis-sentinel-547bc8f7f6-84nj9", "reconcileID": "6efc20f1-a69a-4e78-9b24-782715247a1f"}
2023-07-09T21:47:44Z	DEBUG	Triggering reconciliation	{"controller": "pod", "controllerGroup": "", "controllerKind": "Pod", "Pod": {"name":"aegis-safe-6b4bc89c78-7gpl5","namespace":"aegis-system"}, "namespace": "aegis-system", "name": "aegis-safe-6b4bc89c78-7gpl5", "reconcileID": "801f2c0b-f03e-452b-a397-c6b44dd9361b"}
2023-07-09T21:47:44Z	ERROR	entry-reconciler	Failed to list ClusterStaticEntries	{"error": "no matches for kind \"ClusterStaticEntry\" in version \"spire.spiffe.io/v1alpha1\""}
github.com/spiffe/spire-controller-manager/pkg/spireentry.(*entryReconciler).reconcile
	/workspace/pkg/spireentry/reconciler.go:89
github.com/spiffe/spire-controller-manager/pkg/reconciler.(*reconciler).Run
	/workspace/pkg/reconciler/reconciler.go:84
sigs.k8s.io/controller-runtime/pkg/manager.RunnableFunc.Start
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/manager.go:382
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:219

Expectation:

SPIRE Server should have given a warning along the lines of “ClusterStaticEntry CRD is missing; please download and install it from {URL}.”

Or the SPIRE Controller Manager container should have run a self-diagnosis and exited with a reason.

Or both. Or something along those lines.

Other Notes and Resolutions:

Author

v0lkan commented Jul 9, 2023

Also, this is a breaking change (understandable, since it’s a nightly build). I’m not sure of the best way to handle it, though, since it is up to the user to install that CRD in the first place.

@azdagron
Member

This should hopefully be as easy as detecting this particular failure reason when listing the CRDs during reconciliation and treating it as "no CRDs present".

@azdagron azdagron added good first issue Good for newcomers help wanted Extra attention is needed labels Jul 11, 2023
@MarcosDY
Collaborator

We initially released it without this feature and then added documentation to ensure that users always upgrade CRDs when upgrading versions.

@MarcosDY MarcosDY removed this from the 0.3.0 milestone Sep 21, 2023