
Handle Event Queue Overflow #1048

Open
wants to merge 3 commits into master
Conversation

@Jamidd Jamidd (Contributor) commented Aug 13, 2024

This PR aims to prevent overflow in the event queue. When the number of free slots in the queue drops below a fifth of the original capacity, ticker events are discarded. If more than 30 ticker events have been discarded, the OVMS is restarted, as this indicates that too much time has passed since the last ticker event was processed.
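In outline, the intended behaviour is roughly the following (a sketch only, assuming a FreeRTOS event queue; the identifiers `s_event_queue`, `QUEUE_CAPACITY` and `QueueTickerEvent` are illustrative, not the actual OVMS names):

```cpp
// Sketch of the queue overflow guard described above. All identifiers are
// illustrative, not the actual OVMS names.
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "esp_system.h"

static const UBaseType_t QUEUE_CAPACITY      = 40; // assumed original queue capacity
static const int         MAX_DROPPED_TICKERS = 30; // reboot threshold from the PR

static QueueHandle_t s_event_queue;
static int s_dropped_tickers = 0;

bool QueueTickerEvent(void* event)
  {
  // Discard ticker events while less than a fifth of the slots are free.
  if (uxQueueSpacesAvailable(s_event_queue) < QUEUE_CAPACITY / 5)
    {
    if (++s_dropped_tickers > MAX_DROPPED_TICKERS)
      {
      // Too many discarded tickers: the event task has stalled for too long,
      // so restart the module.
      esp_restart();
      }
    return false; // ticker event dropped
    }
  s_dropped_tickers = 0; // queue recovered, reset the counter
  return xQueueSend(s_event_queue, &event, 0) == pdTRUE;
  }
```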

@dexterbg (Member)

Jaime,

the PR again mixes two completely independent changes: besides the queue checking, it includes an undocumented and potentially hazardous change to the event task stack size, raising it from 8K to 12K.

Any substantial new stack requirement needs to be discussed first, ideally on the developer list. Stack is allocated from the internal 8-bit RAM segment, which is already rather tight. Depending on the vehicle, additional tasks will need space, and less free 8-bit RAM can make the system less stable when multiple command tasks execute concurrently (e.g. via API calls).

So there must be a good & valid reason to increase a task stack size in general. The event task runs all the event handling and thus already has a rather large stack of 8K. What caused a stack overflow for you? Did you verify it was in the event task, and how can we reproduce it?

If the situation is setup/config specific (e.g. only occurring when an MQTT server is used), we should make the event stack size configurable instead of taking a general RAM hit.
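That could be as simple as reading the stack size from config at task creation, along these lines (a rough sketch only; the config key, the `GetConfigInt` accessor and the bounds are assumptions, not the actual OVMS API):

```cpp
// Illustrative sketch only: configurable event task stack size with the
// current 8K as default and lower bound. GetConfigInt stands in for the
// actual OVMS config accessor.
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

extern void EventTask(void* param);                    // existing event task entry (assumed)
extern int  GetConfigInt(const char* key, int defval); // config accessor stand-in

void StartEventTask()
  {
  int stacksize = GetConfigInt("events.stacksize", 8192);
  if (stacksize < 8192)  stacksize = 8192;  // never below the current default
  if (stacksize > 16384) stacksize = 16384; // cap to protect the tight 8-bit RAM
  xTaskCreatePinnedToCore(EventTask, "OVMS Events", stacksize, NULL, 8, NULL, 1);
  }
```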

On the actual PR topic:

Please try to avoid typos in system messages ("droped"), and be very precise about what is happening, as this is info for developers: "Timer service / ticker timer has died" isn't actually correct. The timer service & tickers are still working perfectly here; the issue is the event task needing too much time to process events.

And that leads to a potentially much more useful idea for an addition: how about also collecting statistics on which event handlers are the culprits? The event task could log suspiciously long event handler runs, and it could gather average and maximum handling times, so developers can see which handlers need optimization / fixing.
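As a sketch of what I mean (names, the handler signature and the 100 ms "suspicious" threshold are assumptions, not existing OVMS code):

```cpp
// Sketch of per-handler timing statistics. The event task would route each
// handler call through this wrapper.
#include <cstdint>
#include <map>
#include <string>
#include "esp_timer.h"
#include "esp_log.h"

struct HandlerStats
  {
  uint32_t count    = 0;
  int64_t  total_us = 0;
  int64_t  max_us   = 0;
  };

static std::map<std::string, HandlerStats> s_handler_stats;
static const int64_t SLOW_HANDLER_US = 100000; // flag runs longer than 100 ms (assumed)

void RunHandlerTimed(const std::string& name, void (*handler)(void*), void* data)
  {
  int64_t start = esp_timer_get_time();
  handler(data);
  int64_t elapsed = esp_timer_get_time() - start;

  HandlerStats& st = s_handler_stats[name];
  st.count++;
  st.total_us += elapsed;
  if (elapsed > st.max_us) st.max_us = elapsed;

  // Immediately flag suspiciously long runs; averages (total_us / count)
  // and max_us could be dumped via a status command for offline analysis.
  if (elapsed > SLOW_HANDLER_US)
    ESP_LOGW("events", "slow handler %s: %lld us", name.c_str(), (long long)elapsed);
  }
```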

I added some info on the event currently being handled for the crash aftermath (see m_current_started), but that will only tell us about the last handler before the crash. General performance statistics could be very helpful in tracking down issues before they become fatal. Take @frogonwheels' approach for the OBD performance stats as an example.

Regards,
Michael
