add helper for bridging causal fits #955

Open · simonpcouch wants to merge 3 commits into main
Conversation

simonpcouch (Contributor)

PR 1/3 to address tidymodels/tune#652. This PR implements the generic and the `model_fit` method; note that this method isn't dispatched to from the workflow method (and thus the `tune_results` method).


if (rlang::is_missing(data) || !is.data.frame(data)) {
  abort("`data` must be the data supplied as the data argument to `fit()`.")
}
simonpcouch (Contributor, Author)

The only check this PR makes on the inputted `data` is that it's a data frame. This leaves a window for folks to supply different data to `fit()` and `weight_propensity()`, likely resulting in uninformative errors from `weight_propensity()`. We could implement some functionality to "fingerprint" the training data, noting dims/column names or some other coarse set of identifying features, as a full hash would be too expensive to justify. Note that this is not an issue for the `tune_results` method (and its use of the workflow method), which accesses the training data via the underlying split and does not take a `data` argument.
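
For reference, a minimal sketch of what such a coarse fingerprint check could look like; `fingerprint_data()`, `check_same_data()`, and `object$fingerprint` are hypothetical names, not part of this PR:

  # Hypothetical sketch: record a coarse fingerprint of the training data at
  # `fit()` time, then compare against the data passed to `weight_propensity()`.
  fingerprint_data <- function(data) {
    list(
      dims = dim(data),
      col_names = names(data),
      col_classes = vapply(data, function(x) class(x)[1], character(1))
    )
  }

  check_same_data <- function(object, data) {
    # `object$fingerprint` is an assumed field, not something parsnip stores today.
    if (!identical(object$fingerprint, fingerprint_data(data))) {
      rlang::abort("`data` doesn't appear to be the data supplied to `fit()`.")
    }
    invisible(data)
  }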

#' @export
weight_propensity.model_fit <- function(object,
                                        wt_fn,
                                        .treated = object$lvl[2],
simonpcouch (Contributor, Author) · Apr 28, 2023

We need to pass `.treated` explicitly because the predicted probabilities depend on the treatment level and must match the argument supplied as `.treated` to propensity. This argument is roughly analogous to our `event_level` argument, where the event level is specified as either `"first"` or `"second"` (rather than as the actual level of the factor). propensity not only parameterizes the argument differently, but its default is also the second level, while ours is the first.

I've tried out a few different interfaces to this argument and don't feel strongly about how we best handle it. We could alternatively add an `event_level` argument with the usual parameterization and then translate it to the right `.treated` level when interfacing with propensity. We could also ask that propensity change the default/parameterization, though there is some information loss in the multi-level setting, and we need to translate `"first"`/`"second"` back to the level anyway at `predict()` (see L81-82).

Note that the current form of that argument is not checked / tested, pending a decision on how we want it to feel.
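
For illustration, a minimal sketch of the `event_level`-style alternative floated above; the helper name is hypothetical, and `object$lvl` is taken from the default shown in the signature:

  # Hypothetical helper: translate an `event_level`-style argument into the
  # factor level that gets passed on to propensity as `.treated`.
  treated_level <- function(object, event_level = c("first", "second")) {
    event_level <- rlang::arg_match(event_level)
    if (event_level == "first") object$lvl[1] else object$lvl[2]
  }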

Member

I think that's worth discussing with Lucy and Malcolm

simonpcouch (Contributor, Author)

@LucyMcGowan and @malcolmbarrett:

We're currently working on a set of PRs to better accommodate causal workflows in tidymodels via a helper, weight_propensity(), that bridges the propensity and outcome models during resampling. Some description of the bigger picture is here. The interface feels something like this:

  fit_resamples(
    propensity_workflow,
    resamples = bootstraps(data),
    control = control_resamples(extract = identity)
  ) %>%
  weight_propensity(wt_ate, ...) %>%
  fit_resamples(
    outcome_workflow,
    resamples = .
  )

where the second argument to `weight_propensity()` is a propensity weighting function and further arguments are passed on to that function; the helper supplies the `.propensity` and `.exposure` arguments internally. The result of `weight_propensity()` in the above case looks like the output of `bootstraps(data %>% mutate(.wts = wt_ate(...)))`, with `.propensity` and `.exposure` handled for the user.
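
Spelled out as a sketch, this is the conceptual shape of the output rather than the helper's implementation; `.propensity` and `.exposure` stand in for the columns the helper manages internally:

  # Conceptually, the piped `weight_propensity(wt_ate, ...)` call above yields
  # resamples of data that already carries the case-weights column:
  bootstraps(
    data %>%
      mutate(.wts = wt_ate(.propensity, .exposure, ...))
  )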

There's surely lots to digest here, but do you have opinions on how we should open up the interface to the .treated argument? Feel free to give me a holler if you'd appreciate additional context. :)

malcolmbarrett (Contributor) · May 26, 2023

This is all awesome!

A few comments:

  1. We noticed that in propensity we also use the `.treated` terminology, but we now think that's a poor idea because not everything is a treatment. So, we're going to change it to `.exposed`, and I think it should probably be that here, too.
  2. One issue with that language, and with the current default value for `.treated`, is that it only applies to binary and multiclass variables; for continuous variables there won't be a default level. If that really becomes awkward to accommodate, note that we're actually moving away from PS models for continuous exposures because of some mathematical issues with them.
  3. As for the default value, we do pick the second level because we assume 0 is unexposed and 1 is exposed, but that's mostly to do with what a common logistic model spec looks like. I'm going back to propensity soon and would be happy to work with you all to make this all consistent.

#' [workflow][workflows::workflow()], or
#' tuning results (`?tune::fit_resamples`) object. If a tuning result, the
#' object must have been generated with the control argument
#' (`?tune::control_resamples`) `extract = identity`.
simonpcouch (Contributor, Author)

Another option would be to instead require `save_pred = TRUE`, but we couldn't make use of `weight_propensity.workflow()` in that case. This approach is a bit more DRY.

Member

This feels like a little bit of a rough edge to me. I'm not sure we need to sand over it right now in terms of the interface but I would add more documentation, especially on the "main" doc page in tune, which currently only mentions this in an example. What about adding a sentence to the Details section, explaining why this needs to be set like this? I think that would help people remember.

hfrick (Member) left a comment

Nice! Most of this interface/bridge now feels buttery-smooth, well done!

Comment on lines +15 to +18
#' @param wt_fn A function used to calculate the propensity weights. The first
#' argument gives the predicted probability of exposure, the true value for
#' which is provided in the second argument. See `?propensity::wt_ate()` for
#' an example.
hfrick (Member)

I didn't find that second sentence the easiest to read with the "the true value for which". Is the following correct?

Suggested change
- #' @param wt_fn A function used to calculate the propensity weights. The first
- #' argument gives the predicted probability of exposure, the true value for
- #' which is provided in the second argument. See `?propensity::wt_ate()` for
- #' an example.
+ #' @param wt_fn A function used to calculate the propensity weights. The first
+ #' argument gives the predicted probability of exposure, the second argument
+ #' gives the true value of exposure. See `?propensity::wt_ate()` for
+ #' an example.
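
To make that contract concrete, a toy example following the documented argument order; the toy function is illustrative only and is not propensity's implementation:

  # A toy `wt_fn`: first argument is the predicted probability of exposure,
  # second is the observed exposure; here, classic inverse-probability weights.
  toy_wt_fn <- function(propensity, exposure) {
    ifelse(exposure == 1, 1 / propensity, 1 / (1 - propensity))
  }
  toy_wt_fn(c(0.8, 0.3), c(1, 0))
  #> [1] 1.250000 1.428571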

Comment on lines +77 to +78
# TODO: I'm not sure we have a way to identify `y` via a model
# spec fitted with `fit_xy()`---this will error in that case.
hfrick (Member)

When we started out with the "censored regression" mode, we required models to be fit via the formula interface, i.e., fit(), and had fit_xy() throw an error saying to use fit() for that mode.

In that spirit, you could add an error here to point people towards fit().
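
A minimal sketch of that suggestion; `fitted_via_fit_xy()` is a placeholder for however the `fit_xy()` case ends up being detected, not parsnip's actual API:

  # Placeholder check: once a `fit_xy()` fit can be detected, error early and
  # point people toward `fit()`.
  if (fitted_via_fit_xy(object)) {
    rlang::abort(
      c(
        "`weight_propensity()` doesn't support models fitted with `fit_xy()`.",
        i = "Please refit the model with `fit()`."
      )
    )
  }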
