[v2] Multiple crashes of the router pod before network stabilizes #1716

Open
ted-ross opened this issue Oct 11, 2024 · 2 comments
ted-ross commented Oct 11, 2024

Describe the bug
Using a scripted two-site demonstration setup (script included below), the sites initialize, but the router pods go into CrashLoopBackOff and restart twice before the network finally stabilizes.

How To Reproduce
On a cluster, create namespaces demo-dmz-a and demo-dmz-b and run the provided setup script. Watch the pods and sites to observe crashes prior to network stabilization.

Expected behavior
I expect the network to stabilize in an orderly fashion without seeing crash indications.

Environment details

  • Skupper CLI: None used
  • Skupper Operator (if applicable): head of the v2 branch
  • Platform: OpenShift

Additional context
The error seen in the router log prior to crashing:

2024-10-11 16:02:24.121816 +0000 ROUTER (critical) Router start-up failed: Python: CError: Configuration: Failed to configure TLS caCertFile '/etc/skupper-router-certs/skupper-site-server/ca.crt' from sslProfile 'skupper-site-server'
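As a quick triage aid (not part of the original report), the failing sslProfile and certificate file can be pulled out of a router log with a small grep/sed sketch. The pattern below assumes the log format shown in the excerpt above:

```shell
# Sketch: scan a router log for the fatal TLS configuration error and report
# which sslProfile and certificate file failed. The message format is taken
# from the log excerpt above; adjust the pattern if the format differs.
find_tls_failure() {
    # $1: path to a router log file; prints "profile=<name> file=<path>"
    grep 'Failed to configure TLS' "$1" |
        sed -n "s/.*caCertFile '\([^']*\)' from sslProfile '\([^']*\)'.*/profile=\2 file=\1/p"
}
```

Running this against a captured pod log (e.g. from `kubectl logs`) identifies the profile whose files were missing at start-up.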

The script used to reproduce:

NS_PREFIX=demo

for i in dmz-a dmz-b; do
echo Create site $i in namespace ${NS_PREFIX}-$i
cat >temp <<EOF
apiVersion: skupper.io/v1alpha1
kind: Site
metadata:
  name: $i
spec:
  routerMode: "interior"
  ha: false
---
apiVersion: skupper.io/v1alpha1
kind: RouterAccess
metadata:
  name: $i-peer
spec:
  generateTlsCredentials: true
  issuer: skupper-site-ca
  roles:
  - name: inter-router
    port: 55671
  tlsCredentials: skupper-site-server
---
apiVersion: skupper.io/v1alpha1
kind: AccessGrant
metadata:
  name: $i-grant
spec:
  redemptionsAllowed: 10
  expirationWindow: 1h
EOF
kubectl apply -n ${NS_PREFIX}-$i -f temp

echo Generating access token token-$i.yaml
kubectl wait --for=condition=ready accessgrant/$i-grant -n ${NS_PREFIX}-$i
kubectl get accessgrant $i-grant -n ${NS_PREFIX}-$i -o yaml > temp
URL=$(yq '.status.url' temp)
CODE=$(yq '.status.code' temp)
CA=$(yq -r '.status.ca' temp | awk '{ print "    " $0 }')
cat >token-$i.yaml <<EOF
apiVersion: skupper.io/v1alpha1
kind: AccessToken
metadata:
  name: $i-token
spec:
  url: ${URL}
  code: ${CODE}
  ca: |
${CA}
EOF

rm temp
echo
done

echo Link dmz-b to dmz-a
kubectl apply -f token-dmz-a.yaml -n ${NS_PREFIX}-dmz-b

ted-ross commented Oct 14, 2024

For convenience, here is the cleanup script to undo the above reproducer:

NS_PREFIX=demo

for i in dmz-a dmz-b; do
kubectl delete accessgrant $i-grant -n ${NS_PREFIX}-$i
kubectl delete routeraccess $i-peer -n ${NS_PREFIX}-$i
kubectl delete site $i -n ${NS_PREFIX}-$i
rm token-$i.yaml
done

kubectl delete accesstoken dmz-a-token -n ${NS_PREFIX}-dmz-b

@ted-ross

Possible root cause:

The router has new behavior post-3.0.0 (I was running the latest in this test): sslProfiles now load the referenced certificate files immediately upon configuration. The old behavior was to load the certificates at connection start-up for every new connection.

This means that before an sslProfile is created, all of its referenced files must already exist in the file system.

In Skupper, the config-sync module stores the current configuration, including sslProfiles, in the router's config-map. When the router starts up, it reads the configuration mounted from that config-map as its initial configuration. The certificate files, however, are not mounted into the router container; they are copied at run-time into a shared file system by the config-sync container.

This means there is a race condition at pod start-up: if the router reads its initial configuration before config-sync has stored the certificate files, the router shuts down due to the incomplete configuration.
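One conceivable mitigation (a sketch only, not an agreed fix) is to have the router container, or a wrapper around it, block until the referenced certificate files appear before starting. The directory path, timeout, and launch command below are assumptions based on the log excerpt above:

```shell
# Hypothetical start-up wrapper sketch: wait for the certificate files that the
# initial sslProfile configuration references before launching the router.
# CERT_DIR and TIMEOUT defaults are assumptions, not Skupper's actual config.
CERT_DIR="${CERT_DIR:-/etc/skupper-router-certs/skupper-site-server}"
TIMEOUT="${TIMEOUT:-60}"

wait_for_certs() {
    elapsed=0
    # Poll once per second until the CA cert copied by config-sync shows up.
    while [ ! -f "$CERT_DIR/ca.crt" ]; do
        if [ "$elapsed" -ge "$TIMEOUT" ]; then
            echo "timed out waiting for $CERT_DIR/ca.crt" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "certificates present; starting router"
}

# wait_for_certs && exec skupper-router -c "$CONFIG"   # hypothetical launch line
```

This only papers over the race from the router side; having config-sync signal readiness (or mounting the certificates directly) would be a cleaner ordering guarantee.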

@grs grs self-assigned this Oct 17, 2024