
Fix the way compute-coverage finds the hpc dir in dist-newstyle #2116

Merged: 2 commits from 2114-coverage-version-bump into master on Sep 18, 2024

Conversation

@sauclovian-g (Contributor)

Previously it searched blindly and would get confused if there were build results from multiple saw-script versions. Now we use cabal list-bin to find the proper version and look there.

Fixes #2114.

@sauclovian-g added the bug and CI (Continuous integration) labels on Sep 5, 2024
@sauclovian-g (Contributor, Author)

In order to actually check that this fixes the problem, should we create an additional pull request to merge a version bump into this branch and see if the CI works there (and then close it)? Or is that not worthwhile? I did check that it locates the correct directory in my own checkout, which currently has two versions' worth of build results in it.

@RyanGlScott (Contributor) left a comment

> In order to actually check that this fixes the problem, should we create an additional pull request to merge a version bump into this branch and see if the CI works there?

I'd be OK with that. Alternatively, we could just bump the version in a temporary commit in this PR, and if that works as expected, we can undo that commit before landing the PR.

# The hpc dir we want is parallel to the build dir, so drop the end bits off
# and replace them.
SAW=$(cabal list-bin -v0 exe:saw)
HPC_ROOT=$(echo "$SAW" | sed 's,build/saw/saw$,hpc,')
@RyanGlScott (Contributor)

Out of curiosity, I asked the folks on the #ghc Matrix channel if this particular subdirectory was liable to change in future cabal versions. I'm glad I asked, because the location actually changed in cabal-install-3.12.0.0. In that version, cabal now puts generated hpc-related files under a subdirectory that looks something like:

/home/ryanscott/Documents/Hacking/Haskell/saw-script/dist-newstyle/build/x86_64-linux/ghc-9.4.8/saw-script-1.0.0.99/build/extra-compilation-artifacts/hpc/vanilla/mix/saw-script-1.0.0.99-inplace

That is to say, they're now located under saw-script-<ver>/build rather than saw-script-<ver>/hpc. It might be worth calling find on saw-script-<ver> instead to be forward-compatible with cabal-install-3.12.0.0 or later.
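
As a concrete sketch of that find-based suggestion (illustrative only: PKG_DIR is a hypothetical variable name, not necessarily what compute-coverage.sh ends up using, and the path layout assumed is the one cabal list-bin reports above):

# The saw binary sits at saw-script-<ver>/build/saw/saw, so going up
# three directories yields the per-package directory; then search that
# whole tree for the hpc directory, wherever this cabal version put it.
SAW=$(cabal list-bin -v0 exe:saw)
PKG_DIR=$(dirname "$(dirname "$(dirname "$SAW")")")
HPC_ROOT=$(find "$PKG_DIR" -type d -name hpc | head -n 1)

With cabal-install older than 3.12.0.0 this finds saw-script-<ver>/hpc; with 3.12.0.0 or later it finds the directory under build/extra-compilation-artifacts.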

@sauclovian-g (Contributor, Author)

Hmm, yes indeed.

@sauclovian-g (Contributor, Author)

Ok, it now does that.

@kquick (Member) left a comment

LGTM, pending testing.

@sauclovian-g (Contributor, Author)

Hm, it failed after reverting. So you can bump the version up, but not down. Is that good enough? The script appears to be doing what we intended; I'm guessing the problem is that reverting the version doesn't recompile/rerun enough things (whereas bumping it creates a new tree)...?

@kquick (Member) commented Sep 7, 2024

Not quite enough information in the logs to determine if it's a .tix or .mix file, but I did notice that the "Setup test environment" step prior to that unpacks an hpc.tar.gz that it just downloaded as an artifact in the previous step. That tar unpack ends up with mix files from both your versions... and it also contains /bin/saw. I wonder if the /bin/saw is still from the new version and for whatever reason didn't get updated on your version rollback?

@sauclovian-g (Contributor, Author)

I will defer to you to decide what to do (my inclination would be to merge what we've got since it seems to be at least a partial step forward)... I don't know enough about how the tool works to have any useful input.

@RyanGlScott (Contributor)

In a different context (#2118), I realized that we are inadvertently uploading the compiled saw-script binary to the same name on both the hpc and non-hpc jobs, and depending on the order in which this happens, one could potentially clobber the other. I'm not sure if this could explain the oddities observed here, but I wonder if we should try again after landing #2118 (which will include a fix for this latent clobbering issue).

@sauclovian-g (Contributor, Author)

Worth a try, I think.

@sauclovian-g (Contributor, Author)

OK, that worked; now rolling it back again.

@sauclovian-g (Contributor, Author)

Nope, same result.

@sauclovian-g (Contributor, Author)

Two of the intTests timed out, too; not sure what to make of that.

@RyanGlScott (Contributor)

> Two of the intTests timed out, too; not sure what to make of that.

Something is very wrong here. The fact that the test_comp_bisim test case in particular times out strikes me as suspicious, since that is a test that is expected to time out on the HPC configuration (see here), but not in the non-HPC configuration. And indeed, if I download the ubuntu-22.04-bins artifact (which allegedly contains the non-HPC build of saw) and run a simple command like saw --help, it produces a .tix file, as though it were built with HPC!
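
(That check is easy to reproduce. The following is a hypothetical snippet, not part of the PR, assuming saw has been unpacked from the artifact into the current directory: a binary compiled with HPC writes a .tix file into its working directory when it exits.)

SAW_BIN=$PWD/saw   # hypothetical path to the saw unpacked from the artifact
scratch=$(mktemp -d)
(cd "$scratch" && "$SAW_BIN" --help >/dev/null)
ls "$scratch"/*.tix 2>/dev/null && echo "this saw was built with HPC"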

I'm really quite baffled as to how this happened, given that I haven't observed anything like what is going on in this PR happening in any other jobs (the nightly job in particular is passing without issues), and this PR doesn't change anything related to how artifacts are uploaded or downloaded.

@sauclovian-g (Contributor, Author)

Eeeenteresting, as they say... it must be something about this particular run and the state that's been left behind here, because this is the third run with the same tree, and the fifth if you don't count the version changes as different (and they should not be able to cause "real" problems...)

Sometimes I think every test run should both (a) clone and build a fresh tree and (b) pull/update and rebuild the already-built tree from the previous run, then diff the two and fail if they aren't identical.

I am starting to suspect a cabal problem...

@RyanGlScott (Contributor)

I would be somewhat surprised if state had something to do with this, given that GitHub Actions artifacts are tied to particular workflows. That being said, I am out of other plausible explanations. Perhaps we should try nuking the cache and trying again just to be extra sure.

@sauclovian-g (Contributor, Author)

Well, we know it's reusing the build dir or we wouldn't have seen the original behavior with two versions at once in dist-newstyle. That's definitely enough state for things to go off the rails if there's a problem in the rebuild logic somewhere.

@RyanGlScott (Contributor)

I have nuked the cache and rebuilt from the current state of d283404. Let's see what happens when we try a round of version back-and-forth now...

@sauclovian-g (Contributor, Author) commented Sep 13, 2024

Another thing that got suggested was to try bumping back to a version that already got built once (that is, 1.2.0.99 -> 1.2.1.50 -> 1.2.0.99 and then back to 1.2.1.50), and see if that gives other behavior that might be illuminating.

@RyanGlScott (Contributor) left a comment

The CI now passes this round of back-and-forth after nuking the cache. Personally, I'm satisfied with this as evidence that this works, so I'll go ahead and approve this. (If you want to try decrementing the version, that's fine too, although I don't think it's strictly necessary to demonstrate that this works as intended.)

@kquick (Member) left a comment

CI is always so damn fiddly. I agree that this PR seems valid and the cache-based issues are not likely to be a behavioral regression, although we should add this behavior and its cache-nuke resolution to RyanM's documentation in case it is encountered in the future.

@sauclovian-g (Contributor, Author)

I find it puzzling that nuking the cache would make it able to revert a version bump. I think there must be something bad going on, but it's likely a cabal bug, and unless we want to go chasing after it for upstream's benefit, there's little to be gained by mucking around further here. (And if we do, the first step is to try to replicate the behavior outside the CI environment, which I'm not in a position to do right now.) So I'll merge.

@sauclovian-g (Contributor, Author)

However, I'm going to squash all the bump/unbump commits since we don't want those on the trunk forever.

Commits:
- Previously it searched blindly and would get confused if there were build results from multiple saw-script versions. Now we use cabal list-bin to find the proper version and look there.
- This brings in the recent CI changes and particularly #2118.
@sauclovian-g (Contributor, Author)

(I also squashed the original review adjustments; there was no real reason to keep those separate. But I kept the original branch point and the merge with master, as those might be relevant if someone comes back later.)

@sauclovian-g merged commit 9251017 into master on Sep 18, 2024. 10 checks passed.
@sauclovian-g deleted the 2114-coverage-version-bump branch on September 18, 2024 at 00:43.
Labels: bug, CI (Continuous integration)
Projects: none yet
Closes: Compute Coverage CI failing (#2114)
Participants: 3