
Need to handle KATPortalClient timeout errors better #14

Open
david-macmahon opened this issue May 25, 2020 · 2 comments
Labels: bug Something isn't working

@david-macmahon

Occasionally, the KATPortalClient's connection to the KATPortal server times out. When this happens, manual intervention is required to get the system back into an operational state. The reason for these timeouts is not understood and may be outside our code base, but regardless of the underlying cause, KATPortalClient should handle this situation more gracefully so that the backend remains in an operational state (to whatever extent that's possible).

david-macmahon added the bug label on May 25, 2020
@danielczech
Collaborator

This bug is not as easy to track down as the others, since the particular timeout error never occurs when testing on the CAM development system. As mentioned above, so far it has only occurred (intermittently) during live observations.
In the earlier katportal_server version, when this error occurred, the current observation would be lost (but the katportal_server would restart, allowing subsequent observations to continue).

I have tracked the problem down to the schedule_blocks sensor, and have made two changes (see 2205214) to try to handle this particular timeout more gracefully.

Firstly, I have manually specified a timeout duration for run_sync, which will hopefully be sufficient. This raises the question: should a timeout duration be specified for all run_sync calls? So far, the error has not been observed for any of the other "once-off" sensors. A sketch of this change follows below.
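
For context, Tornado's IOLoop.run_sync accepts an optional timeout keyword (in seconds) and raises a TimeoutError if the wrapped coroutine has not completed in time. A minimal sketch of what an explicit timeout on a once-off schedule-block query might look like; the portal URL, sensor name, and the 30-second value are illustrative assumptions, not the values from commit 2205214:

```python
import tornado.ioloop
from katportalclient import KATPortalClient

# Portal URL and sensor name below are placeholder assumptions,
# not the values used in the actual deployment.
portal_client = KATPortalClient('http://portal.example/api/client/1',
                                on_update_callback=None)
io_loop = tornado.ioloop.IOLoop.current()

# run_sync's timeout is given in seconds. If the coroutine has not
# completed by then, Tornado cancels it and raises a TimeoutError.
schedule_blocks = io_loop.run_sync(
    lambda: portal_client.sensor_value('sched_observation_schedule_1'),
    timeout=30)
```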

Secondly, I have wrapped the call in a try block, which will facilitate debugging during the next testing session and will at least permit the current observation to continue without intervention (minus the schedule-block information). A sketch of this follows below.
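
A sketch of that shape, assuming Tornado 5+ (where tornado.util.TimeoutError is an alias of the builtin TimeoutError); the logger name and the None fallback are assumptions for illustration:

```python
import logging
import tornado.ioloop
import tornado.util
from katportalclient import KATPortalClient

log = logging.getLogger('katportal_server')  # logger name is illustrative

portal_client = KATPortalClient('http://portal.example/api/client/1',
                                on_update_callback=None)
io_loop = tornado.ioloop.IOLoop.current()

try:
    schedule_blocks = io_loop.run_sync(
        lambda: portal_client.sensor_value('sched_observation_schedule_1'),
        timeout=30)
except tornado.util.TimeoutError:
    # Record enough context to debug the timeout after the fact, then
    # carry on without the schedule-block information so the current
    # observation is not lost.
    log.exception('Timed out fetching the schedule_blocks sensor')
    schedule_blocks = None
```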

I hope to test these improvements during the next testing session (likely 2020-05-28), as I have been unable to replicate the error on the development system.

@danielczech
Collaborator

Following the testing session, it appears that explicitly extending the timeout duration has prevented this error from occurring for the schedule_blocks sensor.
However, we observed the error occurring again for a different run_sync call; it therefore seems likely that these measures will be needed for every run_sync call.
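
If that turns out to be the case, one option would be a small wrapper that centralises the timeout and the exception handling instead of repeating both at every call site. A possible shape, with the helper name, default timeout, and fallback behaviour all being assumptions rather than code from this repository:

```python
import logging
import tornado.util

log = logging.getLogger('katportal_server')  # illustrative logger name

def run_sync_guarded(io_loop, coro_func, timeout=30, default=None):
    """Run a once-off portal request synchronously, returning `default`
    instead of crashing if the request times out.

    Hypothetical helper: name, default timeout, and fallback behaviour
    are illustrative only.
    """
    try:
        return io_loop.run_sync(coro_func, timeout=timeout)
    except tornado.util.TimeoutError:
        log.exception('katportal request timed out after %s s', timeout)
        return default

# Usage (names as in the sketches above; the sensor name is illustrative):
# value = run_sync_guarded(
#     io_loop, lambda: portal_client.sensor_value('anc_air_temperature'))
```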
