Add option to drop disk caches before running a benchmark #134
Conversation
R/run.R
@@ -441,10 +451,12 @@ global_setup <- function(lib_path = NULL, cpu_count = NULL, mem_alloc = NULL,
 #' @param name Benchmark name, i.e. `bm$name`
 #' @param params Named list of parameters for the individual run, i.e. the case
 #' @param cpu_count Number of CPUs allocated
+#' @param drop_caches Attempt to drop the disk cache before each iteration.
I have the [very weak] opinion that we should provide the multiple choice of not dropping, dropping between cases, or dropping between iterations. I really don't know enough about most benchmarks to know which one would produce the most insightful results, and at least then we'll be able to try something for a while and react later.
Unless this would take a lot of time to implement (in which case we could add support when it's necessary).
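For concreteness, a minimal sketch of what that multiple-choice interface might look like in R. The function name and the TRUE-means-iteration fallback here are illustrative assumptions, not this PR's actual API:

```r
# Sketch: normalize a three-way drop_caches setting.
# NULL/FALSE mean "don't drop"; otherwise it must be "case" or "iteration".
validate_drop_caches <- function(drop_caches = NULL) {
  if (is.null(drop_caches) || isFALSE(drop_caches)) {
    return(NULL)
  }
  if (isTRUE(drop_caches)) {
    # Illustrative back-compat choice: treat bare TRUE as per-iteration dropping
    return("iteration")
  }
  match.arg(drop_caches, choices = c("case", "iteration"))
}
```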
@@ -222,7 +223,7 @@ run_one <- function(bm,
   all_params <- list(...)

   # separate the global parameters, and make sure only those that are specified remain
-  global_param_names <- c("lib_path", "cpu_count", "mem_alloc")
+  global_param_names <- c("lib_path", "cpu_count", "mem_alloc", "drop_caches")
> it doesn't seem like it's worth breaking [histories] now

Yeah, once we do conbench/conbench#565 then we can be a little more liberal with breaking histories.
Yeah, this kind of use case is one of the biggest for that ticket. It sounds ridiculously liberal, but in cases like this it will help us get out of the trap of "Well we can't update this code cause it'll break history and we need a transition there and and and"
I think we should implement this at both the case and the iteration level (though it can be exclusive to one layer or the other, or off entirely; I can't foresee a circumstance where we'll need to clear the cache both at the case level and at the iteration level!). For today we want to mimic what we are already running (not just for history, but also because there is benefit in matching what Python is doing, and it defers having to figure out the answer to warmup-run differences).

In the (hopefully near-term!) future, I would love to toggle this (in both R and Python) to the case level, experiment with a few runs, and investigate what would be needed to support running these like this generally. I suspect this will include some expansion of what we're measuring (mean, min, max, median, cf. #640). That's great, and we should do it at some point; but having the option to run a real experiment with cache clearing only at the case level, analyze the data, and make test cases for changes or expansion to our metrics would be really fantastic (as opposed to needing to implement all of those metrics right now as quickly as possible to unblock this).
I was under the impression that we did drop caches for both Python and R. Looking at labs/benchmarks and the JSON there (https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks.json#L9), that should trigger for Python here: https://github.com/conbench/conbench/blob/175bc404b2f39f1518efef8e33a20848b4c4bac5/conbench/runner.py#L350 and for R here: https://github.com/voltrondata-labs/benchmarks/blob/5ea34d76951be9a323683344c5233310eb867908/benchmarks/_benchmark.py#L253. That this is confusing (and spans as much of our stack as it does!) is exactly the kind of cleanup I'm looking forward to with our use-the-arrowbench-runner-to-run-arrowbench project, which this is part of 😄
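All three of those paths bottom out in the same Linux mechanism: `sync`, then writing `3` to `/proc/sys/vm/drop_caches`. A hedged R translation of the attempt-once-and-remember-failure behavior this PR describes might look like the following; the option name is made up here, and this assumes passwordless sudo on Linux:

```r
# Sketch: try to drop the Linux page/dentry/inode caches; on failure,
# record that in an option so subsequent calls are no-ops rather than retries.
try_drop_caches <- function() {
  if (isTRUE(getOption("arrowbench.drop_caches_failed"))) {
    return(invisible(FALSE))
  }
  status <- system("sync && echo 3 | sudo -n tee /proc/sys/vm/drop_caches > /dev/null")
  if (!identical(status, 0L)) {
    # Don't error, and don't try again this session
    options(arrowbench.drop_caches_failed = TRUE)
    return(invisible(FALSE))
  }
  invisible(TRUE)
}
```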
@@ -17,7 +17,7 @@
 #' a range of parameter combinations is handled by the runner, not the functions
 #' in the benchmark object.
 #'
-#' @section Parametrizing benchmarks:
+#' @section Parameterizing benchmarks:
❤️
Apparently "parametrizing" is allowable but more British? And I'd already written "parameterizing" elsewhere, so I changed this for consistency.
LOL I just assumed I misspelled it 🙈
> Yeah, this kind of use case is one of the biggest for that ticket. It sounds ridiculously liberal, but in cases like this it will help us get out of the trap of "Well we can't update this code cause it'll break history and we need a transition there and and and"
Oh interesting. And that makes me realize that even though we're using …
I think for now I can switch it from logical to categorical to cover all three cases (no clearing, at the case level, at the iteration level) reasonably easily. Longer-term, I'm increasingly convinced cache clearing should be part of the benchmark, i.e. if you want it, call (parameterized, if you like) …
Yeah, that sounds right. In an ideal world we would supply a micro-binary that just does this cleaning, has no install drama, and just works (so I think it simply cannot be Python, unfortunately). But for now copy/pasting these shell commands isn't so bad. Parameterizing this will allow us to do the testing we need to gain confidence in what the right setting is (for our benchmarks). My heart + gut says it should be dropped only per case, and that we should do good + smart things about the inter-iteration differences that might cause.
To be clear, I'm advocating still keeping a …
*nods* yeah, if that's the case let's make something that doesn't have the drama needed to get it working (anything we've tried so far with Python ends up having installation scripts/functions that are longer than the body of the copy-pastaed version!). I'm absolutely no fan of copy-pasta, but we can't keep bashing our heads on install drama that blocks us from getting done what we need to. But also, once we have this here, we'll be fine for a little while with Python and R sharing the same thing. We can defer building the cache dropper in Go or Rust or something that gives us better isolation + easy distributability for a bit (and possibly forever, depending on whether what we're benchmarking needs that, or whether it would even be helpful at all).
Ok, refactored so …
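A sketch of how the refactored three-way setting could dispatch inside a simplified run loop (function names are placeholders, reusing the `try_drop_caches()` sketch above; the real runner's loop is more involved):

```r
# Sketch: "case" drops once before all iterations (including warmups);
# "iteration" drops before every repetition; NULL never drops.
run_case <- function(case_fun, n_iter = 3L, drop_caches = NULL) {
  if (identical(drop_caches, "case")) {
    try_drop_caches()
  }
  lapply(seq_len(n_iter), function(i) {
    if (identical(drop_caches, "iteration")) {
      try_drop_caches()
    }
    system.time(case_fun())
  })
}
```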
LGTM
@@ -441,10 +460,13 @@ global_setup <- function(lib_path = NULL, cpu_count = NULL, mem_alloc = NULL,
 #' @param name Benchmark name, i.e. `bm$name`
 #' @param params Named list of parameters for the individual run, i.e. the case
 #' @param cpu_count Number of CPUs allocated
+#' @param drop_caches Attempt to drop the disk cache before each case or iteration.
Is it worth inheriting this from `measure`? Or the other way around?
I ended up making the `measure` one a bit more precise, since the values have different implications there (i.e. `"iteration"` drops once, and both `"case"` and `NULL` result in no dropping).
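In other words, at the single-measurement level the semantics collapse to this (a sketch of the behavior described, not the actual `measure()` internals):

```r
# Sketch: at the single-measurement level, only "iteration" triggers a drop;
# "case" is handled one level up in the runner, so here it behaves like NULL.
measure_one <- function(fun, drop_caches = NULL) {
  if (identical(drop_caches, "iteration")) {
    try_drop_caches()
  }
  system.time(fun())
}
```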
Alright, ensured global options like … Also discovered and patched a related bug breaking …
Closes #126. Implementation is based on that from the conbench runner, meaning it will attempt to drop disk caches, but if it fails it will (a) not error and (b) not try again (in Python, by setting an attribute, which I translated to R as an `option()`).

Resetting is in `run_one`, so if `drop_caches` is set to `TRUE`, they will be cleared once per case (not per benchmark-code or iteration). This is different from the conbench runner, which drops caches in each iteration. I can change this here if we like (though it will look more like how we handle profiling and less like other global options), but from previous discussion this seems like a thing we haven't quite figured out the best way to handle yet, and per-case is very much on the table so we could ignore warmup iterations. If we want to enable both and move beyond boolean, we could do that too. Opinions welcome!

The `drop_caches` global option is inserted in `result.optional_benchmark_info.options`, but not in `result.tags`, so this will not break any histories. Arguably it should, but presently `cpu_count` is the only global option that is (although the way we're running right now it's always null anyway); `lib_path` and `mem_alloc` are not. Because we're not actually dropping caches anywhere (Python, R), it doesn't seem like it's worth breaking this now. If we want to set any of these global options differently (or run more than one value), we should make a separate story to refactor. Or maybe just make them not-global parameters of the particular benchmarks where they matter; `options` and `global_options` are only runner things, not Conbench things.
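A usage sketch of the behavior described above, assuming the `run_one` signature shown in the hunks earlier (the `placebo` benchmark name is illustrative; consult the package for the actual benchmark objects):

```r
# Sketch: run a single benchmark with cache dropping enabled. With the
# boolean interface described in this PR, caches are cleared once per case.
library(arrowbench)
run_one(placebo, drop_caches = TRUE)
```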