-
Notifications
You must be signed in to change notification settings - Fork 968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incremental verify checkpoints #4487
base: master
Are you sure you want to change the base?
Incremental verify checkpoints #4487
Conversation
5abf2eb
to
c4810de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular
stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist
crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file
. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.
I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have
stellar-core verify-checkpoints –conf=core.cfg –trusted-hash-file=path/to/verified.json
which takes in a previous file called path/to/verified.json
, and at the end of the call updates path/to/verified.json
such that is contains hashes to lcl. However, it looks like the interface has changed in this PR, where we take in
stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json --output-file=path/to/verified2.json
where the output file is a new file which contains the hashes from path/to/verified.json
. The issue is, this doesn't actually work as an append operations, as the --output-file
must not be the same as trusted-hash-file
. To demonstrate this, I ran the following commands on testnet:
stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443
This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to lcl with
stellar-core ---conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out
which crashed. I doubt that Horizon operators will want to manager a collection of files, so we probably do want a truly append operation.
While I found a couple issues, I think it would be helpful to
- Validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
- Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc,
commands.md
doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.
@@ -28,17 +28,24 @@ class WriteVerifiedCheckpointHashesWork : public BatchWork | |||
WriteVerifiedCheckpointHashesWork( | |||
Application& app, LedgerNumHashPair rangeEnd, | |||
std::string const& outputFile, | |||
std::optional<std::string> const& trustedHashFile, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Prefer std::filesystem::path
to std::string
for file paths. We still have a bunch of strings around since path is a C++17 feature that we only recently upgraded to.
std::optional<std::string> const mTrustedHashFileName; | ||
std::string const mOutputFileName; | ||
std::optional<LedgerNumHashPair> mLatestTrustedHashPair; | ||
std::optional<uint32_t> const& mFromLedger; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential dangling reference
} | ||
mLatestTrustedHashPair = | ||
loadLatestHashPairFromJsonOutput(*mTrustedHashFileName); | ||
CLOG_INFO(History, "trusted hash from {}: {}", *mTrustedHashFileName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If mTrustedHashFileName
does not exist, this crashes.
Do you know what error was printed when you ran this? For me I get I agree that the error reporting should happen earlier. I thought that calling
Correct, the design was updated not to append to the
I've spotted a typo in commands.md ( |
Ya the error I was referring to was that one, with no output-file.
Sounds like a good idea!
That definitely cleans up most of it, but I think there's still an issue in the command help message for "--trusted-hash-file":
I don't think a non-existent file should be valid, and we should probably just crash immediately on startup in this case. |
std::filesystem::path mOutputPath; | ||
// If true, mOutputPath == mTrustedHashPath, and output | ||
// will be written to a temporary file before being renamed to | ||
// mOutputPath when verificaiton is complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo
{ | ||
mRangeEndPromise.set_value(mRangeEnd); | ||
if (mArchive) | ||
{ | ||
CLOG_INFO(History, "selected archive {}", mArchive->getName()); | ||
} | ||
startOutputFile(); | ||
maybeParseTrustedHashFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to make this parsing function static and call it from CommandLine.cpp
before entering the work. Currently we throw a not super user readable execption, but more importantly we throw only after syncing with the network. Parsing the input immdediatialy on startup instead of inside this work would provide faster error checking of command inputs.
// If true, mOutputPath == mTrustedHashPath, and output | ||
// will be written to a temporary file before being renamed to | ||
// mOutputPath when verificaiton is complete. | ||
bool mAppendToFile = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should always write to a temp file. That way, in the case of a crash or a failed verification, we don't output a broken trusted hash file. This also allows you to remove the flag and additional logic around mAppendToFile
WriteVerifiedCheckpointHashesWork::loadLatestHashPairFromJsonOutput( | ||
std::filesystem::path const& path) | ||
{ | ||
if (!std::filesystem::exists(path)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: We have a cross platofrm filesystem library util/Fs.h
} | ||
// The output file was written to a temporary file, so rename it to | ||
// the trusted hash file name. | ||
std::filesystem::rename(mOutputPath, *mTrustedHashPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer fs::durableRename
here.
{ | ||
first = hm.lastLedgerBeforeCheckpointContaining(*mFromLedger); | ||
} | ||
releaseAssert(first <= *mFromLedger); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Change these to releaseAssertOrThrow
so the work manager can catch it and gracefully fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall working much better! A few small issues regarding graceful failure and making sure we don't corrupt output files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few final cleanups and one edge case question.
} | ||
else if (mLatestTrustedHashPair) | ||
{ | ||
return mCurrCheckpoint > mLatestTrustedHashPair->first; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be return mCurrCheckpoint >= mLatestTrustedHashPair->first;
? Currently I don't think we actually check the hash consistency from mLatestTrustedHashPair.ledgerSeq
to mLatestTrustedHashPair.ledgerSeq + 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, when validating down to mLatestTrustedHashPair, this won't check the hash consistency between mLatestTrustedHashPair.ledgerSeq
and mLatestTrustedHashPair.ledgerSeq + 1
if mLatestTrustedHashPair.ledgerSeq is a checkpoint boundary (which it ought to be).
I think this is fine, as the user trusts the input already and the output will contain all the subsequent hashes based in the trust obtained from the network. In the case of fromLedger
, we do always need to verify down to that ledger as we do not have a trusted hash for its corresponding checkpoint.
That said, I can widen the bounds if we want to be extra sure that the hashes match.
} | ||
else | ||
{ | ||
CLOG_WARNING(History, "failed to open trusted hash file {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be an assert or a throw. If we made it to this point, we've had to open the trustedHashFile before starting verification. If for some reason we can't open that file at the end of verification, we can't output a proper appended file and should throw.
src/main/CommandLine.cpp
Outdated
std::optional<LedgerNumHashPair> latestTrustedHashPair; | ||
if (trustedHashFile) | ||
{ | ||
if (!std::filesystem::exists(*trustedHashFile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is redundant, as we check for file existence in loadLatestHashPairFromJsonOutput
mOutputFile->close(); | ||
mOutputFile.reset(); | ||
|
||
if (!fs::exists(mTmpOutputPath.string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to be this defensive, given that we just wrote to and closed this temp file on the line above.
@@ -182,13 +262,51 @@ WriteVerifiedCheckpointHashesWork::endOutputFile() | |||
{ | |||
if (mOutputFile && mOutputFile->is_open()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably throw if mOutputFile && mOutputFile->is_open()
is false, instead of silently suceeding without writing an output file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
endOutputFile
is called by resetIter
(which then calls start output file, so no problem there), ~ WriteVerifiedCheckpointHashesWork
, and onSuccess
. When the work finishes, onSuccess
executes and the file is written to and closed. Then later when the shared ptr containing the work is reset ~ WriteVerifiedCheckpointHashesWork
is called and mOutputFile
is not open. (in this diff, an spurious warning will be emitted). I think instead we should not be emitting a warning or raising an error in that case. I agree that we shouldn't silently succeed without writing an output file, but Considering there is logic to report errors opening the file in startOutputFile
, I don't think we need to warn or throw here.
Resolves #4454
Description
Adds
--trusted-hash-file
argument to theverify-checkpoints
command to support appending new verified checkpoints starting from the last checkpoint in the trusted hash file.Adds
--from-ledger
to support generating a verified checkpoint hash file starting from a specific ledger to LCL/specified end ledger.Design doc: https://docs.google.com/document/d/1GRzHAO4_YrfanXqoVc1UDIMhUV10PFqIMQyOxlPOW_s/edit
Usage example:
--from-ledger
:% src/stellar-core verify-checkpoints --from-ledger=53736369 --output-file=out.json --conf=../stellar-core.cfg
Result:
Append to existing file:
src/stellar-core verify-checkpoints --trusted-hash-file=out.json --output-file=out2.json --conf=../stellar-core.cfg
Result:
Usage of both
--from-ledger
and--trusted-hash-file
-> ERRORPerformance
Time for verification of checkpoints
--from-ledger=53737040
to LCL=53739327Output: hashes for checkpoints 53737023 to 53739327, total of 2304 ledgers = 2287 ledgers (from
--from-ledger=53737040
to LCL=53739327) + 13 ledgers (from checkpoint 53737023 to --from-ledger=53737040):205 seconds / 2304 ledgers = 0.09 seconds, 90 milliseconds / ledger
Caveat: There is an overhead as the LCL is obtained from the network. On average we will wait 1/2 a checkpoint (32 ledgers) to find a checkpoint boundary LCL (32 ledgers * 5 seconds = 160 seconds).
Checklist
clang-format
v8.0.0 (viamake format
or the Visual Studio extension)