-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RBMC: Start looking at sibling BMC #60
base: 1110
Are you sure you want to change the base?
Conversation
co_await sdbusplus::async::sleep_for(ctx, 500ms); | ||
} | ||
|
||
lg2::info( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the sibling BMC is not pop-up on the DBus within the timeout, we should disable redundancy and switch to non-redundant BMC system mode, correct? Are you planning to handle this error handling separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea. I'm going to handle enabling redundancy kinda like determining the role, where it'll pass in all of the relevant parameters to a function that will spit out if redundancy can be enabled and the reasons why it can't be if not.
ctx.spawn(doHeartBeat()); | ||
|
||
if (sibling->isBMCPresent()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we need to wait for the sibling BMC to be detected if it's not present? I believe both BMCs initialize in parallel, correct? Are you planning to handle this error separately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's going to be a GPIO for hardware presence that says if the BMC card is even plugged in. So if the card isn't plugged in, there's no point in calling waitForSiblingUp.
If that other sled is plugged in and the sibling does eventually come up before we power on, the plan is to enable redundancy at that point.
struct RoleInfo | ||
{ | ||
Role role; | ||
ErrorCase error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think adding an error message will be helpful when creating the PEL in the future? Or would capturing the last N journal traces from the RBMC and CFAM manager be more useful for error logging?
(I noticed you have some good traces from the role determination function, and since this is related to error handling, we could consider it in the future to assist with debugging).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an upcoming commit to do that, it just didn't get pushed up in this PR. It'll look like:
if (!result && !input.siblingProvisioned)
{
result = {Role::Active, ErrorCase::noError,
"Sibling is not provisioned"};
}
and then it will be saved in the journal and also in a persistent data file for debug. I also have an upcoming 'rbmctool' which will print it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
those debug data is not planned to capture in the PEL if we are planning to create for some error case? Just in journal and a separate persistent file which will be collected as part of the BMC dump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have notes that we also have to create a PEL if we can't enable redundancy, so I may as well put the reasons in there too.
Create a Sibling class for reading data from the xyz.openbmc_project.State.BMC.Redundancy.Sibling D-Bus interface. Technically, Sibling is a base class and SiblingImpl is the implementation class, so that D-Bus access is kept out of the business logic and testcases have a way to provide sibling data when there is no D-Bus. This class will watch the InterfacesAdded, InterfacesRemoved, NameOwnerChanged, and PropertiesChanged signals so that it can always have the latest values on that interface. When the source interface isn't on D-Bus, all of the accessor methods will return std::nullopt so that the RBMC management code can't use stale values. It also returns std::nullopt when the sibling's heartbeat property isn't true for the same reason. This commit then uses this class to wait for up to six minutes for the sibling's heartbeat to be active going into role determination. Eventually, it will use the sibling information to help it determine the role. Tested: Traces on startup: ``` phosphor-rbmc-state-manager[640]: In Sibling init, interface present is True phosphor-rbmc-state-manager[640]: Waiting for sibling interface and/or heartbeat: Present = True, Heartbeat = False phosphor-rbmc-state-manager[640]: Starting heartbeat phosphor-rbmc-state-manager[640]: Sibling property Heartbeat changed phosphor-rbmc-state-manager[640]: Done waiting for sibling. Interface present = True, heartbeat = True ``` Traces on startup when no sibling D-Bus interface at first: ``` phosphor-rbmc-state-manager[1408]: In Sibling init, interface present is False phosphor-rbmc-state-manager[1408]: Waiting for sibling interface and/or heartbeat: Present = False, Heartbeat = False ... phosphor-rbmc-state-manager[1408]: Sibling D-Bus interface added phosphor-rbmc-state-manager[1408]: After interfacesAdded, sibling service is xyz.openbmc_project.State.BMC.Redundancy.Sibling phosphor-rbmc-state-manager[1408]: Done waiting for sibling. Interface present = True, heartbeat = True ``` NameOwnerChanged: ``` phosphor-rbmc-state-manager[640]: Sibling D-Bus name lost ``` InterfacesAdded: ``` phosphor-rbmc-state-manager[640]: Sibling D-Bus interface added phosphor-rbmc-state-manager[640]: Sibling property Heartbeat changed ``` PropertiesChanged: ``` phosphor-rbmc-state-manager[640]: Sibling property BMCState changed ``` Signed-off-by: Matt Spinler <[email protected]>
Now that the sibling information is available, use it in role determination. At a high level the new rules are: 1. If something is wrong with the sibling, choose active. 2. If the sibling already has a role, choose the opposite. 3. Otherwise, choose based on position. There will be cases that that would require the local BMC to be passive, such as if it isn't provisioned. These would be checked before even looking at any sibling fields. The role determination function now also returns if there was an error case detected. This is needed so that in the future: 1. An error log can be created based on that specific case. 2. Eventually, the previous role will be used in role determination, before defaulting to the role based on position. However, we won't want to do that if the role was only passive because of an error. (Meaning the fact that there was an error will need to be persisted.) Tested: New unit tests pass. And in simulation: ``` phosphor-rbmc-state-manager[956]: Role = active due to the sibling already being passive phosphor-rbmc-state-manager[1005]: Role = passive due to the sibling already being active ``` Signed-off-by: Matt Spinler <[email protected]>
The first commit starts looking at the xyz.openbmc_project.State.BMC.Redundancy.Sibling D-Bus interface. It creates a class around it that can provide access to the properties.
The second commit starts using the sibling fields in active vs passive role determination.