Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental verify checkpoints #4487

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

ThomasBrady
Copy link
Contributor

@ThomasBrady ThomasBrady commented Sep 25, 2024

Resolves #4454

Description

Adds --trusted-hash-file argument to the verify-checkpoints command to support appending new verified checkpoints starting from the last checkpoint in the trusted hash file.

Adds --from-ledger to support generating a verified checkpoint hash file starting from a specific ledger to LCL/specified end ledger.

Design doc: https://docs.google.com/document/d/1GRzHAO4_YrfanXqoVc1UDIMhUV10PFqIMQyOxlPOW_s/edit

Usage example:

--from-ledger :

% src/stellar-core verify-checkpoints --from-ledger=53736369 --output-file=out.json --conf=../stellar-core.cfg
Result:

% cat out.json 
[
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]

Append to existing file:

src/stellar-core verify-checkpoints --trusted-hash-file=out.json --output-file=out2.json --conf=../stellar-core.cfg
Result:

cat out2.json 
[
[53736959, "4b1900cb4bbaa77e86e3c8abb33be966e24a84098acdbda3d57977f237c5b13e"],
[53736895, "a163415903fa39efb53e4c79198fa2857cdbb12f92cc64f0ac3bcd0e6a7f2cce"],
[53736831, "2977e0c5653960a11359552dd74508a17982a5ca422db961f809fc335cd17901"],
[53736767, "ff7d80daad82981c1512c0f296a9ff9902f7b9d1ffa8ec8ad02e588cca16a9fd"],
[53736703, "0fb92338560bfac48ebd78dac530735ca988009132846fd93e42c061caa8cc5f"],
[53736639, "ba407b9b13e077cf9fb0a1c277416e12c6ff6857a42beef62f5805a9fdeec8ce"],
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]

Usage of both --from-ledger and --trusted-hash-file -> ERROR

 % src/stellar-core verify-checkpoints --trusted-hash-file=out2.json --output-file=out3.json --from-ledger=9999 --conf=../stellar-core.cfg --ll trace 
Warning: running non-release version v22.0.0rc1-3-ge94e61395-dirty of stellar-core
2024-09-30T15:56:36.748 [default ERROR] Cannot specify both --from-ledger and --trusted-hash-file

Performance

Time for verification of checkpoints --from-ledger=53737040 to LCL=53739327
Output: hashes for checkpoints 53737023 to 53739327, total of 2304 ledgers = 2287 ledgers (from --from-ledger=53737040 to LCL=53739327) + 13 ledgers (from checkpoint 53737023 to --from-ledger=53737040):

time src/stellar-core verify-checkpoints --output-file=out4.json --from-ledger=53737040 --conf=../stellar-core.cfg

src/stellar-core verify-checkpoints --output-file=out4.json    15.22s user 1.25s system 8% cpu 3:25.09 total
  0.80s user 0.31s system 18% cpu 5.825 total

205 seconds / 2304 ledgers = 0.09 seconds, 90 milliseconds / ledger

Caveat: There is an overhead as the LCL is obtained from the network. On average we will wait 1/2 a checkpoint (32 ledgers) to find a checkpoint boundary LCL (32 ledgers * 5 seconds = 160 seconds).

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document

@ThomasBrady ThomasBrady changed the title WIP: Incremental verify checkpoints Incremental verify checkpoints Sep 30, 2024
Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular

stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist  

crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have

stellar-core verify-checkpoints –conf=core.cfg –trusted-hash-file=path/to/verified.json 

which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that is contains hashes to lcl. However, it looks like the interface has changed in this PR, where we take in

stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json   --output-file=path/to/verified2.json 

where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operations, as the --output-file must not be the same as trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443

This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with

stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out

which crashed. I doubt that Horizon operators will want to manager a collection of files, so we probably do want a truly append operation.

While I found a couple issues, I think it would be helpful to

  1. Validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
  2. Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

@@ -28,17 +28,24 @@ class WriteVerifiedCheckpointHashesWork : public BatchWork
WriteVerifiedCheckpointHashesWork(
Application& app, LedgerNumHashPair rangeEnd,
std::string const& outputFile,
std::optional<std::string> const& trustedHashFile,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Prefer std::filesystem::path to std::string for file paths. We still have a bunch of strings around since path is a C++17 feature that we only recently upgraded to.

std::optional<std::string> const mTrustedHashFileName;
std::string const mOutputFileName;
std::optional<LedgerNumHashPair> mLatestTrustedHashPair;
std::optional<uint32_t> const& mFromLedger;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potential dangling reference

}
mLatestTrustedHashPair =
loadLatestHashPairFromJsonOutput(*mTrustedHashFileName);
CLOG_INFO(History, "trusted hash from {}: {}", *mTrustedHashFileName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If mTrustedHashFileName does not exist, this crashes.

@ThomasBrady
Copy link
Contributor Author

Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular

stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist  

crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

I agree that the error reporting should happen earlier. I thought that calling .required() on the clara parser for --output-file would have raised an error immediately if that flag isn't provided, but that doesn't seem to be the case. I'll raise an error before connecting to the network if output-file isn't specified. If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.

I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have

stellar-core verify-checkpoints –conf=core.cfg –trusted-hash-file=path/to/verified.json 

which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that is contains hashes to lcl. However, it looks like the interface has changed in this PR, where we take in

stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json   --output-file=path/to/verified2.json 

where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operations, as the --output-file must not be the same as trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

Correct, the design was updated not to append to the trusted-hash-file implicitly. An output-file must be explicitly specified with all invocations. I'll modify the file output logic to write to a temporary file if the specified --output-file is equal to the --trusted-hash-file to support the append use case.

stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443

This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with

stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out

which crashed. I doubt that Horizon operators will want to manager a collection of files, so we probably do want a truly append operation.

While I found a couple issues, I think it would be helpful to

  1. Validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
  2. Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

I've spotted a typo in commands.md (--trusted-checkpoint-hashes should be --trusted-checkpoint-file), and there was the example invocations in the design doc that erroneously included both --trusted-checkpoint-file and --from-ledger and excluded the mandatory --output-file argument. I've updated those in the relevant parts. Is that all you were referring to or are there other issues with the interface differing?

@SirTyson
Copy link
Contributor

SirTyson commented Oct 3, 2024

Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

Ya the error I was referring to was that one, with no output-file.

If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.

Sounds like a good idea!

Is that all you were referring to or are there other issues with the interface differing?

That definitely cleans up most of it, but I think there's still an issue in the command help message for "--trusted-hash-file":

        "file containing trusted hashes, generated by a previous call to "
        "verify-checkpoints or a non-existent file to generate a new one");

I don't think a non-existent file should be valid, and we should probably just crash immediately on startup in this case.

std::filesystem::path mOutputPath;
// If true, mOutputPath == mTrustedHashPath, and output
// will be written to a temporary file before being renamed to
// mOutputPath when verificaiton is complete.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo

{
mRangeEndPromise.set_value(mRangeEnd);
if (mArchive)
{
CLOG_INFO(History, "selected archive {}", mArchive->getName());
}
startOutputFile();
maybeParseTrustedHashFile();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to make this parsing function static and call it from CommandLine.cpp before entering the work. Currently we throw a not super user readable execption, but more importantly we throw only after syncing with the network. Parsing the input immdediatialy on startup instead of inside this work would provide faster error checking of command inputs.

// If true, mOutputPath == mTrustedHashPath, and output
// will be written to a temporary file before being renamed to
// mOutputPath when verificaiton is complete.
bool mAppendToFile = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should always write to a temp file. That way, in the case of a crash or a failed verification, we don't output a broken trusted hash file. This also allows you to remove the flag and additional logic around mAppendToFile

WriteVerifiedCheckpointHashesWork::loadLatestHashPairFromJsonOutput(
std::filesystem::path const& path)
{
if (!std::filesystem::exists(path))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We have a cross platofrm filesystem library util/Fs.h

}
// The output file was written to a temporary file, so rename it to
// the trusted hash file name.
std::filesystem::rename(mOutputPath, *mTrustedHashPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer fs::durableRename here.

{
first = hm.lastLedgerBeforeCheckpointContaining(*mFromLedger);
}
releaseAssert(first <= *mFromLedger);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Change these to releaseAssertOrThrow so the work manager can catch it and gracefully fail.

Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall working much better! A few small issues regarding graceful failure and making sure we don't corrupt output files.

Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a few final cleanups and one edge case question.

}
else if (mLatestTrustedHashPair)
{
return mCurrCheckpoint > mLatestTrustedHashPair->first;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be return mCurrCheckpoint >= mLatestTrustedHashPair->first;? Currently I don't think we actually check the hash consistency from mLatestTrustedHashPair.ledgerSeq to mLatestTrustedHashPair.ledgerSeq + 1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, when validating down to mLatestTrustedHashPair, this won't check the hash consistency between mLatestTrustedHashPair.ledgerSeq and mLatestTrustedHashPair.ledgerSeq + 1 if mLatestTrustedHashPair.ledgerSeq is a checkpoint boundary (which it ought to be).

I think this is fine, as the user trusts the input already and the output will contain all the subsequent hashes based in the trust obtained from the network. In the case of fromLedger, we do always need to verify down to that ledger as we do not have a trusted hash for its corresponding checkpoint.

That said, I can widen the bounds if we want to be extra sure that the hashes match.

}
else
{
CLOG_WARNING(History, "failed to open trusted hash file {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an assert or a throw. If we made it to this point, we've had to open the trustedHashFile before starting verification. If for some reason we can't open that file at the end of verification, we can't output a proper appended file and should throw.

std::optional<LedgerNumHashPair> latestTrustedHashPair;
if (trustedHashFile)
{
if (!std::filesystem::exists(*trustedHashFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant, as we check for file existence in loadLatestHashPairFromJsonOutput

mOutputFile->close();
mOutputFile.reset();

if (!fs::exists(mTmpOutputPath.string()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to be this defensive, given that we just wrote to and closed this temp file on the line above.

@@ -182,13 +262,51 @@ WriteVerifiedCheckpointHashesWork::endOutputFile()
{
if (mOutputFile && mOutputFile->is_open())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably throw if mOutputFile && mOutputFile->is_open() is false, instead of silently suceeding without writing an output file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

endOutputFile is called by resetIter (which then calls start output file, so no problem there), ~ WriteVerifiedCheckpointHashesWork, and onSuccess. When the work finishes, onSuccess executes and the file is written to and closed. Then later when the shared ptr containing the work is reset ~ WriteVerifiedCheckpointHashesWork is called and mOutputFile is not open. (in this diff, an spurious warning will be emitted). I think instead we should not be emitting a warning or raising an error in that case. I agree that we shouldn't silently succeed without writing an output file, but Considering there is logic to report errors opening the file in startOutputFile, I don't think we need to warn or throw here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

allow running verify-checkpoints incrementally
2 participants