Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RBMC: Start looking at sibling BMC #60

Open
wants to merge 2 commits into
base: 1110
Choose a base branch
from

Conversation

spinler
Copy link
Contributor

@spinler spinler commented Oct 10, 2024

The first commit starts looking at the xyz.openbmc_project.State.BMC.Redundancy.Sibling D-Bus interface. It creates a class around it that can provide access to the properties.

The second commit starts using the sibling fields in active vs passive role determination.

redundant-bmc/README.md Show resolved Hide resolved
redundant-bmc/src/manager.cpp Show resolved Hide resolved
redundant-bmc/src/sibling_impl.cpp Show resolved Hide resolved
redundant-bmc/src/sibling_impl.cpp Outdated Show resolved Hide resolved
redundant-bmc/src/manager.cpp Show resolved Hide resolved
co_await sdbusplus::async::sleep_for(ctx, 500ms);
}

lg2::info(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the sibling BMC is not pop-up on the DBus within the timeout, we should disable redundancy and switch to non-redundant BMC system mode, correct? Are you planning to handle this error handling separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea. I'm going to handle enabling redundancy kinda like determining the role, where it'll pass in all of the relevant parameters to a function that will spit out if redundancy can be enabled and the reasons why it can't be if not.

ctx.spawn(doHeartBeat());

if (sibling->isBMCPresent())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we need to wait for the sibling BMC to be detected if it's not present? I believe both BMCs initialize in parallel, correct? Are you planning to handle this error separately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's going to be a GPIO for hardware presence that says if the BMC card is even plugged in. So if the card isn't plugged in, there's no point in calling waitForSiblingUp.

If that other sled is plugged in and the sibling does eventually come up before we power on, the plan is to enable redundancy at that point.

redundant-bmc/src/manager.cpp Outdated Show resolved Hide resolved
struct RoleInfo
{
Role role;
ErrorCase error;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think adding an error message will be helpful when creating the PEL in the future? Or would capturing the last N journal traces from the RBMC and CFAM manager be more useful for error logging?

(I noticed you have some good traces from the role determination function, and since this is related to error handling, we could consider it in the future to assist with debugging).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an upcoming commit to do that, it just didn't get pushed up in this PR. It'll look like:

  if (!result && !input.siblingProvisioned)
    {
        result = {Role::Active, ErrorCase::noError,
                  "Sibling is not provisioned"};
    }

and then it will be saved in the journal and also in a persistent data file for debug. I also have an upcoming 'rbmctool' which will print it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those debug data is not planned to capture in the PEL if we are planning to create for some error case? Just in journal and a separate persistent file which will be collected as part of the BMC dump?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have notes that we also have to create a PEL if we can't enable redundancy, so I may as well put the reasons in there too.

Create a Sibling class for reading data from the
xyz.openbmc_project.State.BMC.Redundancy.Sibling D-Bus interface.

Technically, Sibling is a base class and SiblingImpl is the
implementation class, so that D-Bus access is kept out of the business
logic and testcases have a way to provide sibling data when there is no
D-Bus.

This class will watch the InterfacesAdded, InterfacesRemoved,
NameOwnerChanged, and PropertiesChanged signals so that it can always
have the latest values on that interface.  When the source interface
isn't on D-Bus, all of the accessor methods will return std::nullopt so
that the RBMC management code can't use stale values.  It also returns
std::nullopt when the sibling's heartbeat property isn't true for the
same reason.

This commit then uses this class to wait for up to six minutes for the
sibling's heartbeat to be active going into role determination.
Eventually, it will use the sibling information to help it determine the
role.

Tested:

Traces on startup:
```
phosphor-rbmc-state-manager[640]: In Sibling init, interface present is True
phosphor-rbmc-state-manager[640]: Waiting for sibling interface and/or heartbeat: Present = True, Heartbeat = False
phosphor-rbmc-state-manager[640]: Starting heartbeat
phosphor-rbmc-state-manager[640]: Sibling property Heartbeat changed
phosphor-rbmc-state-manager[640]: Done waiting for sibling. Interface present = True, heartbeat = True
```

Traces on startup when no sibling D-Bus interface at first:
```
phosphor-rbmc-state-manager[1408]: In Sibling init, interface present is False
phosphor-rbmc-state-manager[1408]: Waiting for sibling interface and/or heartbeat: Present = False, Heartbeat = False
...
phosphor-rbmc-state-manager[1408]: Sibling D-Bus interface added
phosphor-rbmc-state-manager[1408]: After interfacesAdded, sibling service is xyz.openbmc_project.State.BMC.Redundancy.Sibling
phosphor-rbmc-state-manager[1408]: Done waiting for sibling. Interface present = True, heartbeat = True
```

NameOwnerChanged:
```
phosphor-rbmc-state-manager[640]: Sibling D-Bus name lost
```

InterfacesAdded:
```
phosphor-rbmc-state-manager[640]: Sibling D-Bus interface added
phosphor-rbmc-state-manager[640]: Sibling property Heartbeat changed
```

PropertiesChanged:
```
phosphor-rbmc-state-manager[640]: Sibling property BMCState changed
```

Signed-off-by: Matt Spinler <[email protected]>
Now that the sibling information is available, use it in role
determination.

At a high level the new rules are:
1. If something is wrong with the sibling, choose active.
2. If the sibling already has a role, choose the opposite.
3. Otherwise, choose based on position.

There will be cases that that would require the local BMC to be passive,
such as if it isn't provisioned.  These would be checked before even
looking at any sibling fields.

The role determination function now also returns if there was an error
case detected.  This is needed so that in the future:

1. An error log can be created based on that specific case.

2. Eventually, the previous role will be used in role determination,
   before defaulting to the role based on position.  However, we won't
   want to do that if the role was only passive because of an error.
   (Meaning the fact that there was an error will need to be
   persisted.)

Tested:
New unit tests pass.

And in simulation:
```
phosphor-rbmc-state-manager[956]: Role = active due to the sibling already being passive

phosphor-rbmc-state-manager[1005]: Role = passive due to the sibling already being active
```

Signed-off-by: Matt Spinler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants