
Need to handle KATPortalClient timeout errors better #14

Open
david-macmahon opened this issue May 25, 2020 · 2 comments
Labels: bug Something isn't working

@david-macmahon

Occasionally, the KATPortalClient's connection to the KATPortal server times out. When this happens, manual intervention is required to get the system back into an operational state. The reason for these timeouts is not understood and may be outside our code base, but regardless of the underlying cause, KATPortalClient should handle this situation more gracefully so that the backend remains in an operational state (to whatever extent that's possible).

david-macmahon added the bug label on May 25, 2020
@danielczech
Collaborator

This bug is not as easy to track down as the others, since the particular timeout error never occurs when testing on the CAM development system. As mentioned above, so far it has only occurred (intermittently) during live observations.
In the earlier katportal_server version, when this error occurred, the current observation would be lost (but the katportal_server would restart, allowing subsequent observations to continue).

I have tracked the problem down to the schedule_blocks sensor, and have made two changes (see 2205214) to try to handle this particular timeout more gracefully.

Firstly, I have manually specified a timeout duration for run_sync, which will hopefully be sufficient. This raises the question: should a timeout duration be specified for all run_sync calls? So far, the error has not been observed for any of the other "once-off" sensors. A sketch of this change follows below.
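
For context, Tornado's IOLoop.run_sync accepts an optional timeout keyword (in seconds) and raises a TimeoutError if the wrapped coroutine has not completed in time. A minimal sketch of what an explicit timeout on a once-off schedule-block query might look like; the portal URL, sensor name, and the 30-second value are illustrative assumptions, not the values from commit 2205214:

```python
import tornado.ioloop
from katportalclient import KATPortalClient

# Portal URL and sensor name below are placeholder assumptions,
# not the values used in the actual deployment.
portal_client = KATPortalClient('http://portal.example/api/client/1',
                                on_update_callback=None)
io_loop = tornado.ioloop.IOLoop.current()

# run_sync's timeout is given in seconds. If the coroutine has not
# completed by then, Tornado cancels it and raises a TimeoutError.
schedule_blocks = io_loop.run_sync(
    lambda: portal_client.sensor_value('sched_observation_schedule_1'),
    timeout=30)
```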

Secondly, I have wrapped the call in a try block, which will facilitate debugging during the next testing session and will at least permit the current observation to continue without intervention (minus the schedule-block information). A sketch of this follows below.
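
A sketch of that shape, assuming Tornado 5+ (where tornado.util.TimeoutError is an alias of the builtin TimeoutError); the logger name and the None fallback are assumptions for illustration:

```python
import logging
import tornado.ioloop
import tornado.util
from katportalclient import KATPortalClient

log = logging.getLogger('katportal_server')  # logger name is illustrative

portal_client = KATPortalClient('http://portal.example/api/client/1',
                                on_update_callback=None)
io_loop = tornado.ioloop.IOLoop.current()

try:
    schedule_blocks = io_loop.run_sync(
        lambda: portal_client.sensor_value('sched_observation_schedule_1'),
        timeout=30)
except tornado.util.TimeoutError:
    # Record enough context to debug the timeout after the fact, then
    # carry on without the schedule-block information so the current
    # observation is not lost.
    log.exception('Timed out fetching the schedule_blocks sensor')
    schedule_blocks = None
```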

I hope to test these improvements during the next testing session (likely 2020-05-28), as I have been unable to replicate the error on the development system.

@danielczech
Collaborator

Following the testing session, it appears that explicitly extending the timeout duration has prevented this error from occurring for the schedule_blocks sensor.
However, we observed the error occurring again for a different run_sync call; it therefore seems likely that these measures will be needed for every run_sync call.
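
If that turns out to be the case, one option would be a small wrapper that centralises the timeout and the exception handling instead of repeating both at every call site. A possible shape, with the helper name, default timeout, and fallback behaviour all being assumptions rather than code from this repository:

```python
import logging
import tornado.util

log = logging.getLogger('katportal_server')  # illustrative logger name

def run_sync_guarded(io_loop, coro_func, timeout=30, default=None):
    """Run a once-off portal request synchronously, returning `default`
    instead of crashing if the request times out.

    Hypothetical helper: name, default timeout, and fallback behaviour
    are illustrative only.
    """
    try:
        return io_loop.run_sync(coro_func, timeout=timeout)
    except tornado.util.TimeoutError:
        log.exception('katportal request timed out after %s s', timeout)
        return default

# Usage (names as in the sketches above; the sensor name is illustrative):
# value = run_sync_guarded(
#     io_loop, lambda: portal_client.sensor_value('anc_air_temperature'))
```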
