Incremental verify checkpoints #4487

ThomasBrady · 2024-09-25T21:13:04Z

Resolves #4454

Description

Adds --trusted-hash-file argument to the verify-checkpoints command to support appending new verified checkpoints starting from the last checkpoint in the trusted hash file.

Adds --from-ledger to support generating a verified checkpoint hash file starting from a specific ledger to LCL/specified end ledger.

Design doc: https://docs.google.com/document/d/1GRzHAO4_YrfanXqoVc1UDIMhUV10PFqIMQyOxlPOW_s/edit

Usage example:

`--from-ledger` :

% src/stellar-core verify-checkpoints --from-ledger=53736369 --output-file=out.json --conf=../stellar-core.cfg
Result:

% cat out.json 
[
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]

Append to existing file:

src/stellar-core verify-checkpoints --trusted-hash-file=out.json --output-file=out2.json --conf=../stellar-core.cfg
Result:

cat out2.json 
[
[53736959, "4b1900cb4bbaa77e86e3c8abb33be966e24a84098acdbda3d57977f237c5b13e"],
[53736895, "a163415903fa39efb53e4c79198fa2857cdbb12f92cc64f0ac3bcd0e6a7f2cce"],
[53736831, "2977e0c5653960a11359552dd74508a17982a5ca422db961f809fc335cd17901"],
[53736767, "ff7d80daad82981c1512c0f296a9ff9902f7b9d1ffa8ec8ad02e588cca16a9fd"],
[53736703, "0fb92338560bfac48ebd78dac530735ca988009132846fd93e42c061caa8cc5f"],
[53736639, "ba407b9b13e077cf9fb0a1c277416e12c6ff6857a42beef62f5805a9fdeec8ce"],
[53736575, "1de4bfa30f8af81716d2295b7c9f077afea250ddb88839345c13176de7b75e36"],
[53736511, "9f1bd24f21facc606b49216853c0e2162d55d2e3e898da96dd910ddd1ede784f"],
[53736447, "80a3083ea9e987b48949c2ad33006a5e750f06c6836c4814d5a853cab6bac1e3"],
[53736383, "2363bc49669667aa28da768588b5be7f09dc8c69c5e20416d870748b3739509b"],
[0, ""]
]

Usage of both `--from-ledger` and `--trusted-hash-file` -> ERROR

 % src/stellar-core verify-checkpoints --trusted-hash-file=out2.json --output-file=out3.json --from-ledger=9999 --conf=../stellar-core.cfg --ll trace 
Warning: running non-release version v22.0.0rc1-3-ge94e61395-dirty of stellar-core
2024-09-30T15:56:36.748 [default ERROR] Cannot specify both --from-ledger and --trusted-hash-file

Performance

Time to verify checkpoints from --from-ledger=53737040 to LCL=53739327.
Output: hashes for checkpoints 53737023 to 53739327, a total of 2304 ledgers = 2287 ledgers (from --from-ledger=53737040 to LCL=53739327) + 17 ledgers (from checkpoint 53737023 to --from-ledger=53737040):

time src/stellar-core verify-checkpoints --output-file=out4.json --from-ledger=53737040 --conf=../stellar-core.cfg

src/stellar-core verify-checkpoints --output-file=out4.json    15.22s user 1.25s system 8% cpu 3:25.09 total
  0.80s user 0.31s system 18% cpu 5.825 total

205 seconds / 2304 ledgers ≈ 0.09 seconds (about 90 milliseconds) per ledger

Caveat: There is an overhead because the LCL is obtained from the network. On average we will wait half a checkpoint (32 ledgers) to reach a checkpoint-boundary LCL (32 ledgers * 5 seconds = 160 seconds).
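
The checkpoint arithmetic used in this section can be sketched as follows. This is a minimal illustration, not stellar-core's HistoryManager code: it assumes 64 ledgers per checkpoint (consistent with the 64-ledger spacing of the hashes in the sample output) and 5 seconds per ledger (the figure used in the caveat), and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 64 ledgers per checkpoint, matching the spacing of the hashes
// in the sample output. All names in this sketch are hypothetical.
constexpr uint32_t kCheckpointFrequency = 64;

// Assumption: 5 seconds per ledger, the figure used in the caveat.
constexpr uint32_t kSecondsPerLedger = 5;

// Last ledger of the checkpoint that contains `ledger`.
constexpr uint32_t
checkpointContaining(uint32_t ledger)
{
    return (ledger / kCheckpointFrequency + 1) * kCheckpointFrequency - 1;
}

// Last ledger of the checkpoint just before the one containing `ledger`.
// Only meaningful for ledger >= kCheckpointFrequency.
constexpr uint32_t
previousCheckpointBoundary(uint32_t ledger)
{
    return (ledger / kCheckpointFrequency) * kCheckpointFrequency - 1;
}

// Number of ledgers between the checkpoint boundary preceding `fromLedger`
// and `lcl`, which is the span used in the per-ledger timing estimate.
constexpr uint32_t
ledgersCovered(uint32_t fromLedger, uint32_t lcl)
{
    return lcl - previousCheckpointBoundary(fromLedger);
}

// Average wait for the network to reach a checkpoint boundary:
// half a checkpoint's worth of ledgers.
constexpr uint32_t
expectedBoundaryWaitSeconds()
{
    return (kCheckpointFrequency / 2) * kSecondsPerLedger;
}
```

With the numbers from this section, `ledgersCovered(53737040, 53739327)` gives 2304 and `expectedBoundaryWaitSeconds()` gives 160.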

Checklist

Reviewed the contributing document
Rebased on top of master (no merge commits)
Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
Compiles
Ran all tests
If change impacts performance, include supporting evidence per the performance document

SirTyson

Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular

stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist

crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have

stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json

which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes up to LCL. However, it looks like the interface has changed in this PR, where we take in

stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json   --output-file=path/to/verified2.json

where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as the --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443

This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to LCL with

stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out

which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.

While I found a couple of issues, I think it would be helpful to:

Add validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.
Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

SirTyson · 2024-10-02T18:28:52Z

src/historywork/WriteVerifiedCheckpointHashesWork.h

@@ -28,17 +28,24 @@ class WriteVerifiedCheckpointHashesWork : public BatchWork
    WriteVerifiedCheckpointHashesWork(
        Application& app, LedgerNumHashPair rangeEnd,
        std::string const& outputFile,
+        std::optional<std::string> const& trustedHashFile,


Nit: Prefer std::filesystem::path to std::string for file paths. We still have a bunch of strings around since path is a C++17 feature that we only recently upgraded to.

SirTyson · 2024-10-02T18:38:21Z

src/historywork/WriteVerifiedCheckpointHashesWork.h

+    std::optional<std::string> const mTrustedHashFileName;
+    std::string const mOutputFileName;
+    std::optional<LedgerNumHashPair> mLatestTrustedHashPair;
+    std::optional<uint32_t> const& mFromLedger;


Potential dangling reference
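
A minimal sketch of the concern, using a hypothetical stand-in class rather than the real WriteVerifiedCheckpointHashesWork: a member declared as `std::optional<uint32_t> const&` stays bound to the constructor argument, which may be a temporary that is destroyed once the constructor call finishes. Storing the optional by value avoids this.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the work class; only the member of interest.
class VerifyWorkSketch
{
  public:
    // The parameter is taken by const reference, but it is copied into the
    // member, so the object never depends on the argument's lifetime.
    explicit VerifyWorkSketch(std::optional<uint32_t> const& fromLedger)
        : mFromLedger(fromLedger)
    {
    }

    std::optional<uint32_t> const&
    fromLedger() const
    {
        return mFromLedger;
    }

  private:
    // Stored by value. If this were declared as
    //     std::optional<uint32_t> const& mFromLedger;
    // it would refer to the caller's argument, and reading it after a
    // temporary argument has been destroyed is undefined behavior.
    std::optional<uint32_t> const mFromLedger;
};

// Builds the sketch from a temporary optional, the case that would dangle
// with a reference member.
inline VerifyWorkSketch
makeWork(uint32_t ledger)
{
    return VerifyWorkSketch(std::optional<uint32_t>(ledger));
}
```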

SirTyson · 2024-10-02T21:20:20Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+    }
+    mLatestTrustedHashPair =
+        loadLatestHashPairFromJsonOutput(*mTrustedHashFileName);
+    CLOG_INFO(History, "trusted hash from {}: {}", *mTrustedHashFileName,


If mTrustedHashFileName does not exist, this crashes.

ThomasBrady · 2024-10-02T23:19:55Z

> Thanks for this change, sorry we kept going back and forth so much in the design phase :(. I did a quick pass, but I think there's a couple of issues with the interface that need to be fixed, then I'll do another pass once things are working a bit better. In particular
> stellar-core --conf test.cfg verify-checkpoints --trusted-hash-file does-not-exist
> crashes after syncing with the network, but it looks like this should work based on the help comment from --trusted-hash-file. Either the comment should be changed and this error check should happen on startup if this is intended behavior, or it should be addressed.

Do you know what error was printed when you ran this? For me I get `2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file`. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

I agree that the error reporting should happen earlier. I thought that calling .required() on the clara parser for --output-file would have raised an error immediately if that flag isn't provided, but that doesn't seem to be the case. I'll raise an error before connecting to the network if output-file isn't specified. If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis, so I'll report that too.

> I'm also not quite sure what the intended interface for this is. It looks like in the doc, we have
> stellar-core verify-checkpoints --conf=core.cfg --trusted-hash-file=path/to/verified.json
> which takes in a previous file called path/to/verified.json, and at the end of the call updates path/to/verified.json such that it contains hashes up to LCL. However, it looks like the interface has changed in this PR, where we take in
> stellar-core verify-checkpoints --trusted-hash-file=path/to/verified.json   --output-file=path/to/verified2.json
> where the output file is a new file which contains the hashes from path/to/verified.json. The issue is, this doesn't actually work as an append operation, as the --output-file must not be the same as the --trusted-hash-file. To demonstrate this, I ran the following commands on testnet:

Correct, the design was updated not to append to the trusted-hash-file implicitly. An output-file must be explicitly specified with all invocations. I'll modify the file output logic to write to a temporary file if the specified --output-file is equal to the --trusted-hash-file to support the append use case.

> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --from-ledger 249443
> This command succeeded. After a few checkpoints passed, I then attempted to append to the file to catch up to LCL with
> stellar-core --conf testnet.cfg verify-checkpoints --output-file out --trusted-hash-file out
> which crashed. I doubt that Horizon operators will want to manage a collection of files, so we probably do want a true append operation.

> While I found a couple of issues, I think it would be helpful to

> Validity checking on startup. If we crash due to a file not existing that's fine, but this should happen immediately on startup and not after waiting for the network's next checkpoint ledger.

> Take a step back and solidify what the interface should be. I know we've had some irl conversations back and forth and the expectations have been changing a lot throughout, but currently the design doc, commands.md doc, and command line "help" output all define different, mutually exclusive interfaces. I think this is making review and implementation a bit tricky.

I've spotted a typo in commands.md (--trusted-checkpoint-hashes should be --trusted-checkpoint-file), and there were example invocations in the design doc that erroneously included both --trusted-checkpoint-file and --from-ledger and excluded the mandatory --output-file argument. I've updated those in the relevant parts. Is that all you were referring to, or are there other issues with the interface differing?

SirTyson · 2024-10-03T01:11:05Z

> Do you know what error was printed when you ran this? For me I get 2024-10-02T15:43:40.210 GAL3A [default FATAL] Got an exception: error opening output file. If I specify a non-existent trusted hash file (with an output-file to write to), it verifies to genesis without raising an error.

Ya, the error I was referring to was that one, with no output-file.

> If --trusted-hash-file does not exist, I think it should also result in an error being reported rather than silently verifying from genesis so I'll report that too.

Sounds like a good idea!

> Is that all you were referring to or are there other issues with the interface differing?

That definitely cleans up most of it, but I think there's still an issue in the command help message for "--trusted-hash-file":

        "file containing trusted hashes, generated by a previous call to "
        "verify-checkpoints or a non-existent file to generate a new one");

I don't think a non-existent file should be valid, and we should probably just crash immediately on startup in this case.

SirTyson · 2024-10-08T18:11:32Z

src/historywork/WriteVerifiedCheckpointHashesWork.h

+    std::filesystem::path mOutputPath;
+    // If true, mOutputPath == mTrustedHashPath, and output
+    // will be written to a temporary file before being renamed to
+    // mOutputPath when verificaiton is complete.


SirTyson · 2024-10-08T18:23:50Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

 {
    mRangeEndPromise.set_value(mRangeEnd);
    if (mArchive)
    {
        CLOG_INFO(History, "selected archive {}", mArchive->getName());
    }
    startOutputFile();
+    maybeParseTrustedHashFile();


I think it would be better to make this parsing function static and call it from CommandLine.cpp before entering the work. Currently we throw an exception that is not very readable for users, but more importantly we throw only after syncing with the network. Parsing the input immediately on startup instead of inside this work would provide faster error checking of command inputs.
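
A minimal sketch of what checking the command inputs before any network activity could look like. The struct, the function name, and two of the three messages are hypothetical and are not taken from CommandLine.cpp. Only the "Cannot specify both" text comes from the error output shown in the PR description, and plain std::filesystem stands in for the project's own filesystem helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical container for the parsed verify-checkpoints options.
struct VerifyArgsSketch
{
    std::optional<uint32_t> fromLedger;
    std::optional<std::filesystem::path> trustedHashFile;
    std::optional<std::filesystem::path> outputFile;
};

// Returns an error message if the options are unusable, or std::nullopt if
// verification may proceed. Intended to run before connecting to the network.
inline std::optional<std::string>
validateVerifyArgs(VerifyArgsSketch const& args)
{
    if (!args.outputFile)
    {
        // Hypothetical message.
        return "Must specify --output-file";
    }
    if (args.fromLedger && args.trustedHashFile)
    {
        // Message as shown in the PR description.
        return "Cannot specify both --from-ledger and --trusted-hash-file";
    }
    if (args.trustedHashFile &&
        !std::filesystem::exists(*args.trustedHashFile))
    {
        // Hypothetical message.
        return "Trusted hash file does not exist: " +
               args.trustedHashFile->string();
    }
    return std::nullopt;
}
```

Returning the message instead of throwing keeps the sketch easy to call from a command handler, which can print it and exit before any work is scheduled.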

SirTyson · 2024-10-08T18:25:36Z

src/historywork/WriteVerifiedCheckpointHashesWork.h

+    // If true, mOutputPath == mTrustedHashPath, and output
+    // will be written to a temporary file before being renamed to
+    // mOutputPath when verificaiton is complete.
+    bool mAppendToFile = false;


I think we should always write to a temp file. That way, in the case of a crash or a failed verification, we don't output a broken trusted hash file. This also allows you to remove the flag and additional logic around mAppendToFile
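
A minimal sketch of the write-to-temp-then-rename pattern suggested here, under stated assumptions: the function names and the ".tmp" suffix are hypothetical, and std::filesystem::rename is used as a stand-in for the project's own durable rename helper, which this sketch does not reproduce.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Writes `contents` to "<target>.tmp" and renames it over `target` only
// after the write has completed. A crash or failure before the rename
// leaves any existing `target` untouched.
inline void
writeViaTempFile(std::filesystem::path const& target,
                 std::string const& contents)
{
    std::filesystem::path tmp = target;
    tmp += ".tmp"; // hypothetical suffix
    std::ofstream out(tmp, std::ios::trunc);
    if (!out)
    {
        throw std::runtime_error("error opening temporary output file");
    }
    out << contents;
    out.close();
    if (out.fail())
    {
        std::filesystem::remove(tmp);
        throw std::runtime_error("error writing temporary output file");
    }
    // Replaces an existing target file, as POSIX rename does.
    std::filesystem::rename(tmp, target);
}

// Helper for exercising the sketch: writes via the temp file, then reads
// the target back in full.
inline std::string
roundTrip(std::filesystem::path const& target, std::string const& contents)
{
    writeViaTempFile(target, contents);
    std::ifstream in(target);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Because the final file only appears through the rename, the same code path serves both the fresh-output case and the case where the output path equals the trusted hash file, so no separate append flag is needed.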

SirTyson · 2024-10-08T18:33:33Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+WriteVerifiedCheckpointHashesWork::loadLatestHashPairFromJsonOutput(
+    std::filesystem::path const& path)
+{
+    if (!std::filesystem::exists(path))


Nit: We have a cross-platform filesystem library, util/Fs.h.

SirTyson · 2024-10-08T18:35:18Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+            }
+            // The output file was written to a temporary file, so rename it to
+            // the trusted hash file name.
+            std::filesystem::rename(mOutputPath, *mTrustedHashPath);


Prefer fs::durableRename here.

SirTyson · 2024-10-08T18:37:09Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+        {
+            first = hm.lastLedgerBeforeCheckpointContaining(*mFromLedger);
+        }
+        releaseAssert(first <= *mFromLedger);


Nit: Change these to releaseAssertOrThrow so the work manager can catch it and gracefully fail.

SirTyson

Overall working much better! A few small issues regarding graceful failure and making sure we don't corrupt output files.

SirTyson

Looks good! Just a few final cleanups and one edge case question.

SirTyson · 2024-10-11T18:23:36Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+    }
+    else if (mLatestTrustedHashPair)
+    {
+        return mCurrCheckpoint > mLatestTrustedHashPair->first;


Should this be `return mCurrCheckpoint >= mLatestTrustedHashPair->first;`? Currently I don't think we actually check the hash consistency from mLatestTrustedHashPair.ledgerSeq to mLatestTrustedHashPair.ledgerSeq + 1.

ThomasBrady · 2024-10-11

Correct, when validating down to mLatestTrustedHashPair, this won't check the hash consistency between mLatestTrustedHashPair.ledgerSeq and mLatestTrustedHashPair.ledgerSeq + 1 if mLatestTrustedHashPair.ledgerSeq is a checkpoint boundary (which it ought to be).

I think this is fine, as the user trusts the input already and the output will contain all the subsequent hashes based on the trust obtained from the network. In the case of fromLedger, we do always need to verify down to that ledger as we do not have a trusted hash for its corresponding checkpoint.

That said, I can widen the bounds if we want to be extra sure that the hashes match.
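
The effect of `>` versus `>=` on the set of checkpoints visited can be sketched as below. This models only the loop bound under discussion, assuming 64-ledger checkpoints. The function and its parameters are hypothetical and do not reproduce the batch work in WriteVerifiedCheckpointHashesWork.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumption: 64 ledgers per checkpoint.
constexpr uint32_t kCheckpointSpacing = 64;

// Lists the checkpoints visited when walking down from `rangeEnd` towards
// the latest trusted checkpoint. With includeTrusted == false the condition
// is `curr > trusted`; with includeTrusted == true it is `curr >= trusted`,
// so the trusted checkpoint itself is visited as well.
inline std::vector<uint32_t>
checkpointsToVerify(uint32_t rangeEnd, uint32_t trusted, bool includeTrusted)
{
    std::vector<uint32_t> visited;
    for (uint32_t curr = rangeEnd;
         includeTrusted ? curr >= trusted : curr > trusted;
         curr -= kCheckpointSpacing)
    {
        visited.push_back(curr);
        if (curr < kCheckpointSpacing)
        {
            // Stop before the unsigned subtraction would wrap around.
            break;
        }
    }
    return visited;
}
```

Using the append example from the description (trusted checkpoint 53736575, range end 53736959), the exclusive bound visits the six checkpoints from 53736959 down to 53736639, which are the six entries added in out2.json. The inclusive bound would visit 53736575 too.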

SirTyson · 2024-10-11T18:27:01Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

+            }
+            else
+            {
+                CLOG_WARNING(History, "failed to open trusted hash file {}",


This should be an assert or a throw. If we made it to this point, we've had to open the trustedHashFile before starting verification. If for some reason we can't open that file at the end of verification, we can't output a proper appended file and should throw.

SirTyson · 2024-10-11T18:32:52Z

src/main/CommandLine.cpp

+            std::optional<LedgerNumHashPair> latestTrustedHashPair;
+            if (trustedHashFile)
+            {
+                if (!std::filesystem::exists(*trustedHashFile))


This is redundant, as we check for file existence in loadLatestHashPairFromJsonOutput.

SirTyson · 2024-10-11T18:35:30Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

        mOutputFile->close();
        mOutputFile.reset();
+
+        if (!fs::exists(mTmpOutputPath.string()))


I don't think we need to be this defensive, given that we just wrote to and closed this temp file on the line above.

SirTyson · 2024-10-11T18:36:46Z

src/historywork/WriteVerifiedCheckpointHashesWork.cpp

@@ -182,13 +262,51 @@ WriteVerifiedCheckpointHashesWork::endOutputFile()
 {
    if (mOutputFile && mOutputFile->is_open())


We should probably throw if mOutputFile && mOutputFile->is_open() is false, instead of silently succeeding without writing an output file.

ThomasBrady · 2024-10-11

endOutputFile is called by resetIter (which then calls startOutputFile, so no problem there), ~WriteVerifiedCheckpointHashesWork, and onSuccess. When the work finishes, onSuccess executes and the file is written to and closed. Then later, when the shared ptr containing the work is reset, ~WriteVerifiedCheckpointHashesWork is called and mOutputFile is not open (in this diff, a spurious warning will be emitted). I think we should not be emitting a warning or raising an error in that case. I agree that we shouldn't silently succeed without writing an output file, but considering there is logic to report errors opening the file in startOutputFile, I don't think we need to warn or throw here.

Adds --trusted-hash-file and --from-file to verify-checkpoints

e94e613

ThomasBrady changed the title ~~WIP: Incremental verify checkpoints~~ Incremental verify checkpoints Sep 30, 2024

fix bounds on --from-ledger output

c4810de

ThomasBrady force-pushed the incremental-verify-checkpoints branch from 5abf2eb to c4810de Compare October 1, 2024 01:05

ThomasBrady requested review from marta-lokhova and SirTyson October 1, 2024 01:17

ThomasBrady added 2 commits October 1, 2024 10:36

Update docs for verify-checkpoints

45f3a65

fix docs

a619942

ThomasBrady mentioned this pull request Oct 2, 2024

Horizon: missing information passed to captive-core when configured to run "on disk" stellar/go#4538

Open

SirTyson requested changes Oct 2, 2024

ThomasBrady added 2 commits October 3, 2024 15:31

allow the same file for input and output (append), use std filesystem, fix docs

4de2273

include filesystem?

fc48916

ThomasBrady commented Oct 3, 2024

format

6f77a40

SirTyson reviewed Oct 8, 2024

SirTyson reviewed Oct 8, 2024

SirTyson requested changes Oct 8, 2024

ThomasBrady added 2 commits October 8, 2024 14:56

Use FS, report errors early, always write to tmp file

1009960

typo

7f419c1

SirTyson reviewed Oct 11, 2024

ThomasBrady added 2 commits October 11, 2024 14:26

Cleanup

fa98355

cleanup

7eb1104
