Allow benchmarks with runners beyond conbench CLI #99

Merged: 27 commits into main from edward/direct-running on Mar 1, 2023

Conversation

@alistaire47 (Contributor) commented Feb 14, 2023

Closes #97.

Relaxes the assumption that all benchmarks will be run with the conbench CLI. Also relaxes the assumption that a benchmarkable repo will always have only one benchmarks repo. Adds changes to allow running via arrowbench or adapters.

Notably, this does not yet turn on arrowbench (or adapter) benchmarks by changing config.py, but that change is all that should be required to do so. There are some differences between running via arrowbench instead of labs/benchmarks, though: cache clearing (which voltrondata-labs/arrowbench#126 will fix) and tags["cpu_count"], which labs/benchmarks does not populate but arrowbench does, so we'll need to clean up the db to unbreak histories. Full details: voltrondata-labs/benchmarks#130

Generally tries to change as little as possible so existing things keep working.

Specific changes:

  1. Changes run.py:
    1. Separates an abstract BenchmarkGroupsRunner class out of the Run class (see the sketch after this list). Moves the existing functionality to a ConbenchBenchmarkGroupsRunner, but also implements ArrowbenchBenchmarkGroupsRunner and AdapterBenchmarkGroupsRunner.
      1. arrowbench running notably calls arrowbench once for all benchmarks instead of once each, which allows arrowbench to manage the run. This has two implications:
        1. The memory monitoring in BenchmarkGroup will not represent each benchmark neatly in the db anymore.
        2. arrowbench benchmarks cannot be run in the same run as other benchmarks: they need their own new run_id because they open and close the run, so other results attempting to submit to the same run would fail to post.
    2. Allows runner to be switched by benchmark JSON (the one that lives in the benchmark repo). Also loosens assumptions about command and adds optional name so it does not need to be parsed out of command. Documents schema. All new fields are optional and everything defaults to the conbench runner, so existing metadata will continue to function without changes.
    3. Separates CommandExecutor class out of Run so it can be injected into the groups runner class.
    4. Adds arrowbench and the new adapters dir (see 2) to repos_with_benchmark_groups.
  2. Adds a top-level adapters directory with requirements.txt, benchmarks.json metadata, and mock-adapter.py. For now, this is all just the stuff we'll need to run real adapters here, but no real benchmarks, as all of those will require a lot of alterations to run here. The new stuff here will likely need some light tweaks (e.g. ensuring dependencies get installed correctly), but this structure shows we can run adapters from here.
  3. Adds tests for arrowbench and adapter running. The adapter one will need small changes to run from main instead of this branch, but is currently set up so CI here should work.
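
A minimal sketch of the class separation in 1.i. The class names, run_benchmark_groups, and execute_command appear in this PR; the method bodies and the BenchmarkGroup/CommandExecutor interfaces below are illustrative assumptions, not the actual run.py code.

```python
from abc import ABC, abstractmethod
from typing import List


class BenchmarkGroupsRunner(ABC):
    """Abstract runner: executes a list of BenchmarkGroups and reports results."""

    def __init__(self, executor: "CommandExecutor") -> None:
        self.executor = executor  # injected so tests can substitute a mock executor

    @abstractmethod
    def run_benchmark_groups(self, benchmark_groups: List["BenchmarkGroup"]) -> None:
        ...


class ConbenchBenchmarkGroupsRunner(BenchmarkGroupsRunner):
    def run_benchmark_groups(self, benchmark_groups):
        # existing behavior: invoke the conbench CLI once per benchmark group
        for bg in benchmark_groups:
            self.executor.execute_command(bg.command)


class ArrowbenchBenchmarkGroupsRunner(BenchmarkGroupsRunner):
    def run_benchmark_groups(self, benchmark_groups):
        # call arrowbench once for the whole list so arrowbench manages the run itself
        bm_names = [bg.command for bg in benchmark_groups]
        name_vector = str(bm_names)[1:-1]  # "'file-read', 'file-write'" style quoting
        r_command = (
            "bm_df <- arrowbench::get_package_benchmarks(); "
            f"arrowbench::run(bm_df[bm_df$name %in% c({name_vector}), ])"
        )
        self.executor.execute_command(r_command)


class AdapterBenchmarkGroupsRunner(BenchmarkGroupsRunner):
    def run_benchmark_groups(self, benchmark_groups):
        # each adapter is an arbitrary executable that posts its own results
        for bg in benchmark_groups:
            self.executor.execute_command(bg.command)
```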

@austin3dickey (Contributor) left a comment


very cool!

buildkite/benchmark/run.py (resolved review thread)
r_command = f"""
bm_df <- arrowbench::get_package_benchmarks();
arrowbench::run(
bm_df[bm_df$name %in% c({str(bm_names)[1:-1]}), ],
Contributor

haha, wow

Contributor Author

Yeah, this is a janky way to turn a list of strings into a comma-space-separated string with each initial string quoted (e.g. ['a', 'b'] -> "'a', 'b'"). It could be ", ".join([f"'{x}'" for x in ['a', 'b']]), but I'm not sure that's more readable. I miss R's toString().
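
For illustration, both spellings produce the same quoted, comma-separated string (a standalone example, not code from the PR):

```python
bm_names = ["file-read", "file-write"]

# slice the brackets off the list's repr:
# "['file-read', 'file-write']" -> "'file-read', 'file-write'"
via_repr = str(bm_names)[1:-1]

# explicit join; equivalent, though arguably not more readable
via_join = ", ".join([f"'{x}'" for x in bm_names])

assert via_repr == via_join == "'file-read', 'file-write'"
```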

exit_on_failure=False,
)

# leave tempfile for validation if mock run
Contributor

wait, doesn't this delete the tempfile if it's a mock run?

Contributor Author

Wait, yeah, I'm missing a not...how did my tests pass?

Contributor Author

Ok this went a bit weird and I can't fully explain why, but I'm switching to not all(bg.mock run ...) because it works.
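
A minimal sketch of the cleanup logic being discussed, assuming a hypothetical mock flag on BenchmarkGroup and a tempfile path argument; this illustrates the fix, not the actual run.py code.

```python
import os

def clean_up_results_tempfile(path: str, benchmark_groups) -> None:
    # leave the tempfile in place for validation when everything in the run is a mock;
    # otherwise remove it after the results have been handled
    if not all(bg.mock for bg in benchmark_groups):  # `mock` attribute is an assumption
        os.remove(path)
```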

tests/buildkite/benchmark/test_run_benchmarks.py (outdated, resolved review thread)
@jonkeane (Contributor)

> For now, this is all just the stuff we'll need to run real adapters here, but no real benchmarks, as all of those will require a lot of alterations to run here

Do you mean "will require a lot of alterations to run here" for the current set of python macrobenchmarks? Or that we'll still have a large chunk of work ahead of us to be able to run our arrowbench-based benchmarks this way too?

@jonkeane (Contributor) left a comment

Looking good — a few minor comments

],
},
"env_vars": {
"PYTHONFAULTHANDLER": "1", # makes it easy to debug segmentation faults
Contributor

Do we need this here? Or does it get passed all the way down to the adapter and we want those seg faults boiled up?

Contributor Author

This is left over from when I copied it over from the labs/benchmarks one. I think it might get passed through all the way to adapters, but if it matters at that level I'd rather set it at that level (as doing so is easy). I'll remove this.

Comment on lines 73 to 72
"BENCHMARKS_DATA_DIR": f"{os.getenv('HOME')}/data", # allows to avoid loading Python and R benchmarks input data from s3 for every build
"ARROWBENCH_DATA_DIR": f"{os.getenv('HOME')}/data", # allows to avoid loading R benchmarks input data from s3 for every build
Contributor

If this is for arrowbench we only need one of these (ARROWBENCH_DATA_DIR), right?

Contributor Author

Yep, removing

Comment on lines 98 to 99
"BENCHMARKS_DATA_DIR": f"{os.getenv('HOME')}/data", # allows to avoid loading Python and R benchmarks input data from s3 for every build
"ARROWBENCH_DATA_DIR": f"{os.getenv('HOME')}/data", # allows to avoid loading R benchmarks input data from s3 for every build
Contributor

This chunk runs all of the benchmarks together, so we need them together, yeah? Does this run the arrowbench ones a second time?

Or I guess taking a step back: what's this code chunk helpful for / doing here?

Contributor Author

The env vars here are vestigial and unneeded (as of now).

This dict is here because adapters will live in this repo (see root on line 80, which is arrow-benchmarks-ci/adapters). We could stick them in a different repo if we want; it doesn't really matter. As of now the only one there is a mock for testing, but eventually at least the C++ and Go ones should live there.

Comment on lines +409 to +405
# NOTE: `bm.command` is the raw arrowbench name; `bm.name` is f"arrowbench/{bm.command}"
# to disambiguate from labs/benchmarks versions when filtering.
Contributor

Is this distinction something we plan on coming back and getting rid of once we've transitioned this fully? If so a TODO might be nice.

Contributor Author

At first I thought so, but after implementation, I'd rather keep it namespaced. Even if we want to refactor the filters to include the repo, we really do want benchmarks to be namespaced by repo in some way, as a lot of languages are going to have a "read file" benchmark (etc.), and relying on all the repos in question to avoid name clashes with each other is not a good way to manage that.

@alistaire47 (Contributor Author)

> For now, this is all just the stuff we'll need to run real adapters here, but no real benchmarks, as all of those will require a lot of alterations to run here

> Do you mean "will require a lot of alterations to run here" for the current set of python macrobenchmarks? Or that we'll still have a large chunk of work ahead of us to be able to run our arrowbench-based benchmarks this way too?

It was intended to refer to adapter benchmarks like C++ (mostly; it's still getting metadata through the legacy runner) and Go (a bit), but not arrowbench or Python benchmarks. arrowbench should be (🤞) good to go shortly (though we will have to clean up cpu_count). The Python ones will be a good bit more work, yes; we'll have to either extract the runner from Conbench or port everything to benchrun (and finish the datalogistik integration in Python).

@alistaire47 (Contributor Author) commented Feb 23, 2023

Ok! I got a working test build running: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2224#01867bb0-78ba-4337-bbd3-b4815be4a7c1 that runs file-read through labs/benchmarks, arrowbench/file-read, and adapters/mock-adapter, all successfully. It took some tweaks to get everything running. Most were minor, but I'll comment on the places where I made judgement calls that weren't already reviewed.

One note on that build is that running through arrowbench directly does log a lot less; in particular it doesn't log the results. If we want these there, I think the right way to do so is to change the log level for benchclients in benchconnect to DEBUG, which will show all the JSON sent in requests.
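
If we go that route, bumping the logger should be a one-liner somewhere in benchconnect's setup; the logger name "benchclients" below is an assumption, not verified against the benchconnect code.

```python
import logging

# assumed logger name: raising it to DEBUG would make the JSON payload of each
# posted result show up in the buildkite logs
logging.getLogger("benchclients").setLevel(logging.DEBUG)
```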

TODO: Revert config.py to the original set of benchmarks instead of my testing set. If we want to add more of the arrowbench ones to any particular machine, it's a good time to do that too, though I want to be cautious and not add lots of running time to busy machines. Opinions welcome.

Comment on lines 408 to 409
arrowbench::install_pipx();
arrowbench::install_benchconnect();
Contributor Author

Added. I'm not sure if this is the best way to install benchconnect in this context—there are already a ton of environments in this repo, and we could just add it to the relevant one, if we could figure out which that is—but this works.

If we care about speed, it's probably worth switching from conda to mamba (or at least the mamba solver); we're currently repeatedly solving environments, and it's not fast.

Contributor

> If we care about speed, it's probably worth switching from conda to mamba (or at least the mamba solver); we're currently repeatedly solving environments, and it's not fast.

Would you mind making an issue in this repo for that? Agreed that's probably the right way (but we also probably shouldn't do this just yet unless it's super quick + easy)


Comment on lines +477 to +482
if not Path(self.root).exists():
    self.executor.execute_command(f"git clone {self.repo}")

self.executor.execute_command(
    f"git fetch && git checkout {self.branch}", self.root
)
Contributor Author

Made cloning conditional, because it was failing when labs/benchmarks and arrowbench both cloned arrowbench, specifically by complaining that the directory already exists. We could rm -rf it and re-clone, but given that it then fetches anyway, I'm pretty sure that would be duplicative.

@jonkeane (Contributor)

> One note on that build is that running through arrowbench directly does log a lot less; in particular it doesn't log the results. If we want these there, I think the right way to do so is to change the log level for benchclients in benchconnect to DEBUG, which will show all the JSON sent in requests.

IMO we should log those requests — I'm fine bumping that logging level like you suggest. Normally I would say something like that is overkill, but for a variety of other systems reasons I actually end up going to those logged output values pretty frequently (e.g. when we were hitting a bunch of 50xs recently, I went to the buildkite logs to get some ids from there instead of needing to get them through the list api that was 50xing). At some point we will probably get to a point where it's not needed anymore, but for now as we're iterating on this system (and in other places) it's quite nice to have them

@jonkeane (Contributor)

This looks good — I'm fine to merge it.

@ElenaHenderson if you had a sec to look through I would be super grateful as always (or if you're busy or would rather: we can merge and then post-hoc review).

@alistaire47 (Contributor Author)

> One note on that build is that running through arrowbench directly does log a lot less; in particular it doesn't log the results. If we want these there, I think the right way to do so is to change the log level for benchclients in benchconnect to DEBUG, which will show all the JSON sent in requests.

> IMO we should log those requests — I'm fine bumping that logging level like you suggest. Normally I would say something like that is overkill, but for a variety of other systems reasons I actually end up going to those logged output values pretty frequently (e.g. when we were hitting a bunch of 50xs recently, I went to the buildkite logs to get some ids from there instead of needing to get them through the list api that was 50xing). At some point we will probably get to a point where it's not needed anymore, but for now as we're iterating on this system (and in other places) it's quite nice to have them

conbench/conbench#775 (will be quick)

@@ -429,7 +429,6 @@ def run_benchmark_groups(self, benchmark_groups: List[BenchmarkGroup]) -> None:
n_iter = 3L,
drop_caches = TRUE,
publish = TRUE,
run_id = '{os.getenv('RUN_ID')}',
Contributor Author

So unfortunately this isn't going to work. The passing is fine, but the run ID it generates isn't, because abci generates run IDs per buildkite build, not per Repo the benchmarks are in, so arrowbench fails because benchconnect tries to start and finish a run that other Repos are using. We could generate an ID here and pass it through, though that's no different from letting arrowbench (really benchconnect) handle it. I've made #104 to come back and change this; this is going to keep biting us if we want to, say, close a run for an adapter with a note on why it failed.

@alistaire47 (Contributor Author)

Ok! So! Things are working for real now; before it looked like there was a logging problem, but it was really installation and run ID uniqueness problems (which are now fixed). Build with full logging running labs/benchmarks, arrowbench, and adapters benchmarks: https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-arm64-m6g-linux-compute/builds/2265#01869a5a-db70-4c2e-94a2-f6048538d346

@alistaire47 merged commit 1cf2922 into main on Mar 1, 2023
@alistaire47 deleted the edward/direct-running branch on March 1, 2023 at 17:38
Labels: none yet
Projects: none yet
Development: Successfully merging this pull request may close these issues: Support alternate runners
3 participants