Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log Voter Timeout Disconnects #334

Open
tsawyer opened this issue Jun 19, 2024 · 13 comments
Open

Log Voter Timeout Disconnects #334

tsawyer opened this issue Jun 19, 2024 · 13 comments

Comments

@tsawyer
Copy link
Member

tsawyer commented Jun 19, 2024

Log Voter Client -- Voter client OatMtn disconnect (timeout) message with a NOTICE status.

@Mason10198
Copy link

Mason10198 commented Aug 2, 2024

Suggest also logging out-of-bounds packets (packet took longer than buflen to get to host), similar to out-of-bounds packet logging in VOTER client console.

Question - is there any circumstance (packet OOB, packet OOO, corrupt, etc) that causes a chan_voter instance to drop a client (remove client from VOTER display) other than a keepalive timeout, or does the loss of a client from the host perspective ALWAYS signify a keepalive timeout?

@tsawyer
Copy link
Member Author

tsawyer commented Aug 3, 2024

To my knowledge only a timeout (five missed pings) cause a RTCM (voter or missed client) to disconnect. Disconnects are indicated by a RTCM display dropout and a lost link CW message if a transmitter is connected.

Having said that, I've long had my suspicions that dropouts occur when there is no apparent network problem. So maybe you're on to something here.

@Mason10198
Copy link

I have noticed at times one of my TX/RX clients will flash on/off rapidly (a few times per second) in the VOTER display, Allmon, and Supermon. When this "flashing" occurs, there are gaps in the TX audio for that site.

The VOTER clients console shows no messages, and neither does the host Asterisk console whenever this "flashing" occurs.

The client in question is two 5GHz Ubiquiti hops away from the host, so it is difficult to determine the cause of these "drops". TX and RX buffers are set well above the actual latency of the network.

More logging in chan_voter would greatly help troubleshooting this.

@InterLinked1
Copy link
Member

More logging in chan_voter would greatly help troubleshooting this.

Have you enabled debug logging for the module on your system? (e.g. core set debug 10 chan_voter)

@Mason10198
Copy link

Mason10198 commented Aug 4, 2024

Have you enabled debug logging for the module on your system? (e.g. core set debug 10 chan_voter)

Yes, I have enabled debug logging, but it generates a significant amount of output (a few hundred lines per second), which is quite difficult to sift through:

[2024-08-03 19:07:42.802] DEBUG[687]: chan_voter.c:3850 voter_reader: Got rx packet, len 185 payload 1 challenge 574376273 digest 3ea42dd5
[2024-08-03 19:07:42.802] DEBUG[694]: chan_voter.c:2345 voter_xmit: Voter 1998: Sending KEEPALIVE (GPS) packet to client Client1 digest 2b8c5883
[2024-08-03 19:07:42.802] DEBUG[694]: chan_voter.c:2345 voter_xmit: Voter 1998: Sending KEEPALIVE (GPS) packet to client Client2 digest 59d79563
[2024-08-03 19:07:42.802] DEBUG[694]: chan_voter.c:2345 voter_xmit: Voter 1998: Sending KEEPALIVE (GPS) packet to client Client3 digest 764429f6
[2024-08-03 19:07:42.803] DEBUG[687]: chan_voter.c:3850 voter_reader: Got rx packet, len 185 payload 1 challenge 22652796 digest 008b685e
...

Out-of-bounds packets are logged as an error message on the client side in the VOTER/RTCM firmware but are not logged on the host side in a manner that's easily detectable.

To address this, I suggest logging these packets as actual warnings or errors, at the appropriate warning/error levels. This way, they will appear in the console/logs without requiring debug-level logging. This change would make it much easier to identify and address issues without having to manually parse through extensive debug logs.

Example: packet received by TallMountainSite 39ms too late for the RX buffer to accept

[2024-08-02 13:14:16.067] WARNING[687] chan_voter.c: Received [PAYLOAD 1] from TallMountainSite out of bounds: +39ms

@tsawyer
Copy link
Member Author

tsawyer commented Aug 4, 2024

To address this, I suggest logging these packets as actual warnings or errors, at the appropriate warning/error levels. This way, they will appear in the console/logs without requiring debug-level logging. This change would make it much easier to identify and address issues without having to manually parse through extensive debug logs.

No, please. The Asterisk console is too noisy already. Sending debug type messages as warnings or errors to the console is wholly inappropriate. Instead search the log or grep it in realtime... tail -f /var/log/asterisk/messages.log | grep "out of bounds".

@Mason10198
Copy link

No, please. The Asterisk console is too noisy already. Sending debug type messages as warnings or errors to the console is wholly inappropriate. Instead search the log or grep it in realtime... tail -f /var/log/asterisk/messages.log | grep "out of bounds".

While I agree that sending messages to the default level console should be used sparingly, I would argue that an out-of-bounds RX packet is an ERROR in an otherwise working system and should most certainly be presented to the console as an error (or at the very least a warning). The debug level information is just status type messages, not errors or warnings. It seems rather illogical to bury an actual ERROR message beneath heaps of mostly useless debugging information.

I am aware that the messages could be grep'd from the console/log, but they shouldn't have to be. Only systems with an active VOTER implementation would see these messages. As an administrator of one of these systems, this is error level information that I would want logged at ALL times... not just when I'm manually logged into the CLI and have cranked up the levels.

@InterLinked1
Copy link
Member

No, please. The Asterisk console is too noisy already. Sending debug type messages as warnings or errors to the console is wholly inappropriate. Instead search the log or grep it in realtime... tail -f /var/log/asterisk/messages.log | grep "out of bounds".

While I agree that sending messages to the default level console should be used sparingly, I would argue that an out-of-bounds RX packet is an ERROR in an otherwise working system and should most certainly be presented to the console as an error (or at the very least a warning). The debug level information is just status type messages, not errors or warnings. It seems rather illogical to bury an actual ERROR message beneath heaps of mostly useless debugging information.

I am aware that the messages could be grep'd from the console/log, but they shouldn't have to be. Only systems with an active VOTER implementation would see these messages. As an administrator of one of these systems, this is error level information that I would want logged at ALL times... not just when I'm manually logged into the CLI and have cranked up the levels.

I'm not familiar with the context of what actually causes this warning (I'm an Asterisk developer, not an ASL user), so I don't want to commit to a particular approach, but I agree if it actually means something is wrong, it should be a warning, and vice versa.

Error in Asterisk is generally reserved for specific cases and I'm not sure would be appropriate here but a warning seems logical. I think the main question is whether it would be possible in a correctly configuration system (on both sides) to have this happen. If the user can't control it, generally warning/error is a bit extreme.

As an administrator of one of these systems, this is error level information that I would want logged at ALL times... not just when I'm manually logged into the CLI and have cranked up the levels.

FYI, for reference, you can also control which log levels are logged in logger.conf, including verbose messages or debug messages up to a certain level, for example.

@tsawyer
Copy link
Member Author

tsawyer commented Aug 4, 2024

If the user can't control it, generally warning/error is a bit extreme.

This appears to be a network error happening on one link (a Wifi link) to a RTCM. The other RTCM links are fine if I read the OP correctly. Therefore this is not an Asterisk issue. That's not to say there isn't something with chan_voter misbehaving with excessive OOB packets.

@InterLinked1 How do other channel drivers handle poor network performance in terms of warnings or errors? I don't recall seeing IAX or SIP report OOB issues. I believe the correct approach is something like iax2 show netstats.

@tsawyer
Copy link
Member Author

tsawyer commented Aug 4, 2024

@Mason10198 What do you get for voter ping sitexx 100?

A good link will repeatedly give:

PING (SantaYnez): Packets tx: 100, rx: 100, oos: 0, Avg.: 9.540 ms
PING (SantaYnez):  Worst: 13 ms, Best: 8 ms, 100.0% Packets successfully received (0.0% loss)

If that's ok, I'm thinking this could be a GPS/timing issue.

@InterLinked1
Copy link
Member

If the user can't control it, generally warning/error is a bit extreme.

This appears to be a network error happening on one link (a Wifi link) to a RTCM. The other RTCM links are fine if I read the OP correctly. Therefore this is not an Asterisk issue. That's not to say there isn't something with chan_voter misbehaving with excessive OOB packets.

Okay, thanks for the background. I was thinking "out of bounds" meant something like the packet itself was malformed. If this can happen purely due to network conditions then that's a different matter.

Even if it's not something the user can influence, if it fundamentally causes major undesired behavior (something not working) a warning might still be appropriate. Can someone clarify what the actual impact of this issue happening is? Does the node become unusable or something like that? Is audio dropped completely?

@InterLinked1 How do other channel drivers handle poor network performance in terms of warnings or errors? I don't recall seeing IAX or SIP report OOB issues. I believe the correct approach is something like iax2 show netstats.

If it's during a call, I don't think there's much reporting for those sorts of things. The RTP stats might reflect some of those issues.

@Mason10198
Copy link

Mason10198 commented Aug 5, 2024

This appears to be a network error happening on one link (a Wifi link) to a RTCM. The other RTCM links are fine if I read the OP correctly. Therefore this is not an Asterisk issue. That's not to say there isn't something with chan_voter misbehaving with excessive OOB packets.

@tsawyer Correct. I never meant to imply that this was an issue with Asterisk or it's modules - I was only suggesting the presentation of information that is vital to system administrators.

What do you get for voter ping sitexx 100?

PING (VoterClient): Packets tx: 500, rx: 500, oos: 0, Avg.: 10.648 ms
PING (VoterClient):  Worst: 27 ms, Best: 7 ms, 100.0% Packets successfully received (0.0% loss)

and I have buflen = 480 in my voter.conf

If that's ok, I'm thinking this could be a GPS/timing issue.

I am aware of the possible causes of the issue I am experiencing. I am not trying to report or solve any issues here in this repo, I'm just suggesting better logging to provide vital information to system administrators that might encounter this (or similar) issues in order to make tracking down and resolving network and timing issues significantly easier.

I think the main question is whether it would be possible in a correctly configuration system (on both sides) to have this happen. If the user can't control it, generally warning/error is a bit extreme.

@InterLinked1 This has nothing to do with errors in Asterisk itself, I apologize if I did not make that clear. You are correct that ERROR would be extreme. WARNING should be appropriate.

I was thinking "out of bounds" meant something like the packet itself was malformed. If this can happen purely due to network conditions then that's a different matter.

Correct, "out of bounds" just means the packet took longer than buflen (voter.conf) to arrive at the host. The problem is that there is no sort of message whatsoever when one of the out of bounds packets is received by chan_voter.

Even if it's not something the user can influence, if it fundamentally causes major undesired behavior (something not working) a warning might still be appropriate. Can someone clarify what the actual impact of this issue happening is? Does the node become unusable or something like that? Is audio dropped completely?

Lets ignore the TX chain of the VOTER system for a second and only focus on the RX chain.

  • The VOTER host (Asterisk server) receives packets on its listening port from one or more VOTER clients.
  • The packets contain samples of audio and timing information.
  • Each packet is timestamped with high-accuracy GPS time of when it was created.
  • When the host receives packets from the clients, they go into the RX buffer which is manually set in ms via buflen in voter.conf.
  • If a packet is received where (current_time - packet_timestamp) > buflen, then the packet arrived too late and is "out of bounds"
  • Since out of bounds packets cannot be added to the audio buffer for processing and this results in audio gaps in the VOTER receiver
  • Might also cause "timeout" (full drop) of a client if bad enough??

My suggestion is simply to have some kind of warning message when these "out of bounds" packets are detected, that does not require manually increasing the debug levels. My opinion is that this should not be "debugging" information and should be levelled higher than debugging info.

@tsawyer
Copy link
Member Author

tsawyer commented Aug 7, 2024

My suggestion is simply to have some kind of warning message when these "out of bounds" packets are detected, that does not require manually increasing the debug levels. My opinion is that this should not be "debugging" information and should be levelled higher than debugging info.

All of your points are well stated but I'm not convinced poor network performance should be logged to the console. Nor am I convinced setting a debug level is an undue burden. I opened this issue to clean up the very annoying and inappropriate voter timeout messages. Adding what would be another annoying message is going the wrong direction.

I believe a better approach would be to have an Asterisk CLI command similar to iax2 show netstats, perhaps voter show netstats. That command could provide ping times, out of order, out of band, and other useful historical and realtime stats without cluttering the Asterisk CLI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants