Cert renewal errors and fails to recover if signing cert took too long. #170

seandilda · 2022-04-18T15:03:29Z

When the ACME server takes more than 60 seconds to sign a cert, I've noticed openshift-acme gets into an odd state that it can't recover from without manual intervention. We have this happen somewhat regularly.

What happened:

Cert renewal was automatically started. Authorizations were validated, order moved to ready state, openshift-acme submitted the signing request. One minute later, openshift-acme errors out:

I0320 12:18:15.052950       1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "ready" state 
I0320 12:18:15.053020       1 route.go:1070] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" successfully validated
I0320 12:19:14.934962       1 route.go:498] Finished syncing Route "webapp/webapp.example.com"
E0320 12:19:14.935082       1 route.go:1308] webapp/webapp.example.com failed with : can't create cert order: context deadline exceeded
I0320 12:19:14.940581       1 route.go:496] Started syncing Route "webapp/webapp.example.com"
I0320 12:19:14.941364       1 route.go:563] Route "webapp/webapp.example.com" needs new certificate: Proactive renewal
I0320 12:19:14.941877       1 route.go:607] Using ACME client with DirectoryURL "https://acme-server.example.com/acme/v2/directory"
I0320 12:19:15.045235       1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "valid" state 
I0320 12:19:15.045285       1 route.go:498] Finished syncing Route "webapp/webapp.example.com"
I0320 12:19:24.631587       1 reflector.go:432] k8s.io/[email protected]/tools/cache/reflector.go:108: Watch close - *v1.Service total 0 items received
I0320 12:19:56.720177       1 reflector.go:432] k8s.io/[email protected]/tools/cache/reflector.go:108: Watch close - *v1.ReplicaSet total 0 items received
I0320 12:19:56.796697       1 reflector.go:338] k8s.io/[email protected]/tools/cache/reflector.go:108: watch of *v1.ReplicaSet ended with: The resourceVersion for the provided watch is too old.
I0320 12:19:57.796935       1 reflector.go:188] Listing and watching *v1.ReplicaSet from k8s.io/[email protected]/tools/cache/reflector.go:108

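The cert is usually signed shortly after this and the order is set to valid on the ACME server. However, the acme.openshift.io/status annotation on the route still lists the order status as pending, and each later sync ends right after logging the "valid" state without fetching the cert.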

After future restarts of the pod, we'll see log lines like:

I0418 14:56:36.907743       1 route.go:563] Route "webapp/webapp.example.com" needs new certificate: In renewal period
I0418 14:56:36.907960       1 route.go:559] Route "webapp/www.webapp.example.com" doesn't need new certificate.
I0418 14:56:36.908105       1 route.go:607] Using ACME client with DirectoryURL "https://acme-server.example.com/acme/v2/directory"
I0418 14:56:36.908137       1 route.go:498] Finished syncing Route "webapp/www.webapp.example.com"
I0418 14:56:37.047642       1 route.go:650] Route "webapp/webapp.example.com": Order "https://acme-server.example.com/acme/v2/orders/uesdjZX9XbLRLiyaAv3UyA" is in "valid" state 
I0418 14:56:37.047704       1 route.go:498] Finished syncing Route "webapp/webapp.example.com"

What you expected to happen:

I expected openshift-acme to retry after the timeout. It could either try to download the signed cert or even throw out the previous order and start over. Either option would be preferable to the current situation.
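To make the expected behavior concrete, here is a minimal sketch of how a stored order could be mapped to a recovery step on every sync, based on the order statuses defined in RFC 8555 (pending, ready, processing, valid, invalid). The `Action` type and `NextAction` function are hypothetical names for illustration only; this is not the existing openshift-acme code.

```go
package main

import "fmt"

// Action is a hypothetical name for the step a controller takes next
// for an order it has previously recorded on a route.
type Action string

const (
	ActionAuthorize Action = "authorize"  // authorizations still need to be completed
	ActionFinalize  Action = "finalize"   // submit the signing request
	ActionWait      Action = "wait"       // server is still signing, poll again later
	ActionFetchCert Action = "fetch-cert" // cert is issued, download it
	ActionNewOrder  Action = "new-order"  // order is unusable, start over
)

// NextAction maps an ACME order status (RFC 8555) to a recovery step.
// An unknown status is reported as an error so the caller can decide
// to discard the order instead of silently doing nothing.
func NextAction(status string) (Action, error) {
	switch status {
	case "pending":
		return ActionAuthorize, nil
	case "ready":
		return ActionFinalize, nil
	case "processing":
		return ActionWait, nil
	case "valid":
		return ActionFetchCert, nil
	case "invalid":
		return ActionNewOrder, nil
	default:
		return "", fmt.Errorf("unknown order status %q", status)
	}
}

func main() {
	for _, s := range []string{"ready", "processing", "valid", "invalid"} {
		a, err := NextAction(s)
		fmt.Println(s, a, err)
	}
}
```

With a mapping like this, a sync that finds the order already "valid" (as in the logs above) would go on to download the cert rather than finishing with no further action.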

How to reproduce it (as minimally and precisely as possible):

Have a cert signing take longer than one minute.
Unfortunately I can't make our ACME server publicly available. You may need to add a sleep or an artificially short timeout to test this against Let's Encrypt.

Anything else we need to know?:

Whenever this happens, we can resolve the issue by removing the acme.openshift.io/status annotation from the affected route. It'd be nice to not have to take that manual step.

We have a 3rd-party ACME server that occasionally takes multiple minutes to sign a cert. The duration of the signing process is outside of our control.

@tnozicka


seandilda added the kind/bug label Apr 18, 2022
