Horizon: missing information passed to captive-core when configured to run "on disk" #4538

Open
MonsieurNicolas opened this issue Aug 11, 2022 · 18 comments


@MonsieurNicolas
Contributor

What version are you using?

v19.02.19.0

What did you do?

We had a testnet outage that revealed that captive-core reconstructed a ledger and replayed transactions from an untrusted archive.

What did you expect to see?

We should always verify data before replaying. In this case we should always pass down proofs that were acquired from the network.

What did you see instead?

captive-core replayed transactions from a bad archive; Horizon ingested this data and corrupted itself. Recovering required clearing both core's and Horizon's state.

Additional information

The root cause of the issue is that:

  • when running with the --in-memory flag, core uses --start-at-ledger and --start-at-hash to bootstrap trust (the ledger information comes from the latest ledger that Horizon ingested). This works as expected.
  • when running "on disk", Horizon invokes core in two steps, and the first one is the problematic one: when running catchup (offline catchup), core needs to "anchor" trust somewhere. The way to do this is to store trusted hashes in a file and pass it down to core using the --trusted-checkpoint-hashes option. Note that unlike --start-at-ledger, the hashes passed to core this way must be for "checkpoint ledgers", and catchup has to be invoked to reconstruct that checkpoint (i.e. the "to" ledger must be a checkpoint ledger). A sketch of both invocation shapes follows below.
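To make the contrast concrete, here is a minimal Go sketch of the two invocation shapes; the binary path, ledger numbers, hash, and file name are placeholders, and the exact argument layout is illustrative rather than Horizon's actual wrapper code.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// In-memory mode: trust is anchored in the latest ledger Horizon ingested,
	// handed to core via --start-at-ledger / --start-at-hash.
	inMemory := exec.Command("stellar-core", "run",
		"--in-memory",
		"--start-at-ledger", "1234567",
		"--start-at-hash", "<hex hash of ledger 1234567 from Horizon's DB>",
	)

	// On-disk mode: the offline catchup step needs its own trust anchor,
	// a file of checkpoint hashes passed via --trusted-checkpoint-hashes.
	// The "to" ledger (1234559 here) must itself be a checkpoint ledger.
	onDisk := exec.Command("stellar-core", "catchup", "1234559/0",
		"--trusted-checkpoint-hashes", "trusted-hashes.json",
	)

	fmt.Println(inMemory.String())
	fmt.Println(onDisk.String())
}
```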

Related core issue: stellar/stellar-core#3503, to surface this better in the future, as it's definitely a footgun.

@MonsieurNicolas
Contributor Author

Not sure if you discussed this during your release triage, @ire-and-curses, but as this has already caused an outage, we are looking at prioritizing the linked issue for core's January release.

@MonsieurNicolas
Contributor Author

@ire-and-curses can you confirm that this is getting (or was) prioritized for your next release? Core is currently blocked on this one to merge stellar/stellar-core#3615

@mollykarcher
Contributor

@MonsieurNicolas this just came to my attention and we unfortunately did not prioritize this for our upcoming sprint/release. We don't have much capacity leading up to the next soroban release, but will prioritize getting it in the queue for our next sprint.

@MonsieurNicolas
Contributor Author

no problem - when would that next sprint ship roughly?

@tsachiherman
Contributor

@MonsieurNicolas given that the first week of January will be devoted to the upcoming release, and the next week or two will be devoted to the hackathon, I'd realistically plan on this issue being reviewed, analyzed, and handled by the end of January.

@bartekn
Contributor

bartekn commented Jan 10, 2023

The information in the "Additional information" section is a little bit wrong. Currently, Horizon in on-disk mode will run the catchup command only when the Horizon DB is empty. This way it initializes Stellar-Core state at the starting ledger, which is the latest checkpoint. When restarting, Horizon runs the run command, which simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).
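A rough Go-style sketch of that decision flow (the function and its arguments are hypothetical, for illustration only, not the actual Horizon code):

```go
package ingest // hypothetical package, for illustration only

import "fmt"

// captiveCoreArgs sketches the three paths described above: an empty Horizon DB
// triggers an offline catchup to the latest checkpoint, a restart simply runs
// core so it continues where it left off, and reingestion (possibly parallel)
// runs catchup over the requested historical range.
func captiveCoreArgs(dbEmpty, reingesting bool, latestCheckpoint, from, to uint32) []string {
	switch {
	case reingesting:
		return []string{"catchup", fmt.Sprintf("%d/%d", to, to-from+1)}
	case dbEmpty:
		return []string{"catchup", fmt.Sprintf("%d/0", latestCheckpoint)}
	default:
		return []string{"run"}
	}
}
```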

Additionally, in in-memory mode, Horizon with a clean DB takes the values for --start-at-ledger and --start-at-hash from history archives. In this case, there is no command Horizon can use to start Stellar-Core safely.

This all makes me think that maybe we should take all these cases into account when working on this. How long does it take to generate the trusted hashes file? Maybe Stellar-Core could generate the file if it's not passed and store it in the storage directory?

@MonsieurNicolas
Contributor Author

The information in the "Additional information" section is a little bit wrong. Currently, Horizon in on-disk mode will run the catchup command only when the Horizon DB is empty. This way it initializes Stellar-Core state at the starting ledger, which is the latest checkpoint. When restarting, Horizon runs the run command, which simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).

This is not true: there are situations (as we experienced during the testnet outage) where Horizon resets the node and runs "catchup" without any trusted anchor.

Additionally, in in-memory mode, Horizon with a clean DB takes the values for --start-at-ledger and --start-at-hash from history archives. In this case, there is no command Horizon can use to start Stellar-Core safely.

This is confusing to me:

  • either Horizon has already ingested the checkpoint (the same case as when it needs to reset core), or
  • it needs to bootstrap trust, which can be done simply by asking core to generate the trusted hashes (a command that was added as part of the design of captive-core for that purpose).

This all makes me think that maybe we should take all these cases into account when working on this. How long does it take to generate the trusted hashes file? Maybe Stellar-Core could generate the file if it's not passed and store it in the storage directory?

The reason it's a separate command is that it allows Horizon to verify archives as well, so it does not make sense to let Horizon rebuild state without having that information (and if it rebuilt state, it should already have that trust information). This was part of the original design for captive-core 2+ years ago.
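For illustration, a rough sketch of that bootstrap flow, assuming the trusted hashes are produced by core's verify-checkpoints command and then handed to offline catchup via --trusted-checkpoint-hashes; the output flag name, file name, and ledger range below are assumptions made for the sketch, not the actual Horizon integration.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Step 1: have core itself walk the archive chain and write out a file of
	// trusted checkpoint hashes (output flag name assumed here).
	verify := exec.Command("stellar-core", "verify-checkpoints",
		"--output-filename", "trusted-hashes.json")
	if err := verify.Run(); err != nil {
		log.Fatalf("verify-checkpoints failed: %v", err)
	}

	// Step 2: run offline catchup anchored to those hashes, so the replay
	// never has to trust the archive contents blindly.
	catchup := exec.Command("stellar-core", "catchup", "1234559/0",
		"--trusted-checkpoint-hashes", "trusted-hashes.json")
	if err := catchup.Run(); err != nil {
		log.Fatalf("catchup failed: %v", err)
	}
}
```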

I do agree that at some point we probably need to simplify all this: there is a bunch of weird code in Horizon that was added to make "captive-core" work without "in-memory", but this ended up re-creating issues that are solved in core already. That should be a separate issue on how to clean things up after we've migrated to "buckets db". @marta-lokhova FYI

@bartekn
Contributor

bartekn commented Jan 20, 2023

The information in the "Additional information" section is a little bit wrong. Currently, Horizon in on-disk mode will run the catchup command only when the Horizon DB is empty. This way it initializes Stellar-Core state at the starting ledger, which is the latest checkpoint. When restarting, Horizon runs the run command, which simply tells Stellar-Core to start where it left off. Horizon will also run catchup when reingesting (which can be a parallel reingest).

This is not true: there are situations (as we experienced during the testnet outage) where Horizon resets the node and runs "catchup" without any trusted anchor.

I was describing the current code, but yes, it's possible that during that outage the scenario you described (clearing the storage dir when the Horizon DB is not empty) occurred. We fixed the issue in 621d634.

This is confusing to me:

  • either Horizon has already ingested the checkpoint (the same case as when it needs to reset core), or
  • it needs to bootstrap trust, which can be done simply by asking core to generate the trusted hashes (a command that was added as part of the design of captive-core for that purpose).

Horizon currently doesn't pass trusted hashes to Stellar-Core anywhere in the code. This is part of the problem I described in the previous comment.

The reason it's a separate command is that it allows Horizon to verify archives as well, so it does not make sense to let Horizon rebuild state without having that information (and if it rebuilt state, it should already have that trust information). This was part of the original design for captive-core 2+ years ago.

Horizon doesn't need trusted hashes. It depends 100% on Stellar-Core for tx meta, and for buckets it just checks the bucket hash in the ledger header it gets from Stellar-Core - if the hashes don't match, the DB transaction that inserts entries into the Horizon DB is cancelled.
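Roughly, that safeguard looks like this (hypothetical names, not the actual Horizon ingestion code): entries are written inside a DB transaction that is only committed if the bucket hash computed from the downloaded buckets matches the one in the ledger header received from Stellar-Core.

```go
package ingest // hypothetical package, for illustration only

import (
	"database/sql"
	"fmt"
)

// commitLedger writes a ledger's entries inside a transaction and refuses to
// commit if the bucket hash doesn't match the ledger header from Stellar-Core.
func commitLedger(db *sql.DB, headerBucketHash, computedBucketHash string, write func(*sql.Tx) error) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	if headerBucketHash != computedBucketHash {
		_ = tx.Rollback() // mismatch: cancel the whole ingestion transaction
		return fmt.Errorf("bucket hash mismatch: header=%s, computed=%s", headerBucketHash, computedBucketHash)
	}
	if err := write(tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
```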

@mollykarcher
Contributor

To summarize some offline conversation about this issue:

  • RPC had the same issue, but that was resolved here
  • Horizon live/forward ingestion could not replay from an untrusted archive because of how it verifies bucket hashes on download
  • Horizon reingestion could replay from an untrusted archive, so we want to adjust reingest to pass trusted checkpoint hashes, pulled from core, into catchup

@chowbao chowbao added this to the platform sprint 49 milestone Jul 2, 2024
@sreuland sreuland self-assigned this Jul 23, 2024
sreuland added commits to sreuland/go referencing this issue between Aug 19 and Aug 29, 2024.
@sreuland
Contributor

sreuland commented Sep 3, 2024

Moving this to blocked for now. After an initial solution was proposed in #5431, review highlighted the need to take a step back and first capture/evaluate design options for obtaining trusted hashes at reingest time in a doc; once those are reviewed, development can proceed with the best option.

@sreuland
Contributor

We had a design review with the team on 9/18 of the options outlined in the trusted hash design options doc. The meeting wrapped up with an action item for @ThomasBrady to investigate whether --verify-checkpoints could use skip lists to optimize its runtime. We will have another meeting to review the outcome of that investigation and determine the best usage of --verify-checkpoints, i.e. whether its invocation can be automated internally in the captive core SDK wrapper or whether it will need to be run out-of-band by the user.

@marta-lokhova
Contributor

@sreuland unfortunately, the skip list in core has been broken since genesis (it will be fixed in protocol 22, but only for new ledgers).

Is the concern that even after the initial file generation from genesis, it'll be too expensive to generate newer hashes?

@sreuland
Contributor

sreuland commented Oct 1, 2024

@marta-lokhova, yes, it was related to the runtime duration of verify-checkpoints. However, in recent days we've had further discussion on the skip list, and a suggestion was made to repair it, which would in turn enable catchup to emit ledger metadata on the pipe with network-anchored hashes; this appears to have gained consensus. I've summarized it as another option in our investigation of trusted hashes.

Given the skip list repair, there would be no need for us to change the captive core wrapper SDK: it already invokes catchup, and it would transparently obtain trusted hashes in the ledger metadata emitted on the meta pipe once the underlying core binary has the repair. Has a core ticket been created for the skip list repair? I would like to refer to it here before closing this as a no-op.

@ThomasBrady

We have this PR for the skip list repair: stellar/stellar-core#4470. The PR says p22, but it is not going to be in the next release.

@sreuland
Contributor

sreuland commented Oct 1, 2024

@ThomasBrady, thanks! Will stellar/stellar-core#4470 propagate the repair to existing history archives, or is that mechanism anticipated as a separate ticket/PR?

This bug covers reingestion of history, which could potentially come from already-published checkpoint archives, so I would like to include a reference to any other efforts that may be required before catchup emits anchored hashes on ledgers in the tx-meta pipe by default.

@sreuland
Contributor

sreuland commented Oct 2, 2024

@ThomasBrady, just to confirm: we don't anticipate changing the captive core wrapper SDK to use verify-checkpoints in the interim while the skip list repairs are performed. You were investigating enhancing verify-checkpoints performance, and I just want to make sure that wasn't contingent on this use case. Thanks!

@ThomasBrady

Hey, yeah, the verify-checkpoints changes are largely motivated by this use case. The work is already implemented and in review (stellar/stellar-core#4487), and it will most likely be included in v22.1.

AIUI there are a few components of the skip list repair, including a correct skip list in each ledger header (the PR linked above), which will probably not be fixed until p23, and also designing and implementing a solution to backfill/distribute the pre-p23 skip lists, which has no current timeline. I can check whether we have a more concrete timeline, but I imagine those components won't be available until Q1 2025 at the earliest.

@sreuland
Contributor

sreuland commented Oct 2, 2024

@ThomasBrady, thanks for the additional insight.

@mollykarcher, does this kind of tentative timeline into next year for the skip list rollout sound reasonable as far as holding off on changing the captive core wrapper SDK to use verify-checkpoints in the interim?
