tidyups: input / output information #1158

Open
EmilHvitfeldt opened this issue Jun 16, 2023 · 3 comments
Labels
feature a feature request or enhancement

Comments

@EmilHvitfeldt
Member

Tidyup: variable input/output information in {recipes}

Champion: Emil

Co-Champion: Max

Status: Draft

Abstract

The recipes package provides a pipeable and flexible way of processing data. Operations are applied sequentially, and variables are chosen using tidyselect and recipes-specific selectors such as all_numeric_predictors() and all_outcomes().

We want a robust record of which variables are passed into each step, which are passed out, and how the two are related. This information will be valuable to the user and will let us determine the minimum set of required inputs, calculate feature importance, generate graphs, and more.

Motivation

It is not known beforehand which variables are used within each step. Take the recipe below:

library(recipes)

data(ames, package = "modeldata")

rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_nzv(all_predictors()) |> # remove Near Zero Variance columns
  step_pca(all_predictors())

We would have to run the code to determine which variables are selected by step_pca(). Beyond the variables in ames, some are created by step_dummy() and some are removed by step_nzv(), including some of those created by step_dummy().

Knowing the input and output of step_nzv() lets the user know which variables are removed.

Using the same recipe, we are given the variables "PC1", "PC2", "PC3", "PC4", "PC5", which are not that useful for the end user if they are interested in variable importance. If we had the input/output information, we would be able to work backward, seeing which dummy variables are created and how much each of them contributes to each component.

Lastly, depending on which variables were removed by step_nzv(), we might be able to deduce that some variables are not needed as input at all, because they are fully removed.
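The backward deduction described above can be sketched in base R. This is only an illustration: the variable names (area, color, street), the steps_io structure, and the trace_back() helper are all hypothetical and are not part of recipes or taken from ames.

```r
# Hypothetical per-step maps: each element is named after an output
# variable and holds the input variables it was computed from.
steps_io <- list(
  dummy = list(area = "area", color_red = "color",
               color_blue = "color", street_pave = "street"),
  # street_pave is dropped here, so it has no entry in the nzv map
  nzv   = list(area = "area", color_red = "color_red",
               color_blue = "color_blue"),
  pca   = list(PC1 = c("area", "color_red", "color_blue"),
               PC2 = c("area", "color_red", "color_blue"))
)

# Walk the steps from last to first, replacing each variable by its inputs.
trace_back <- function(outputs, steps) {
  vars <- outputs
  for (step in rev(steps)) {
    vars <- unique(unlist(step[vars], use.names = FALSE))
  }
  vars
}

# Original variables that feed into the first component
trace_back("PC1", steps_io)

# Original variables that no final output depends on
required <- trace_back(c("PC1", "PC2"), steps_io)
setdiff(c("area", "color", "street"), required)
```

In this toy setup, street never reaches a final output, which is the kind of "not needed as input" conclusion the paragraph above refers to.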

On the dev side, we will be able to use this information to refactor some of the selecting and name-creating code that happens in many steps.

Solution

Each recipe step is essentially a list of information. The input/output information would be added as another field of that list.
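As a rough sketch of what "another field" could look like, the example below uses a simplified stand-in for a step. The field name io, the helper set_io(), and the toy step itself are assumptions made for illustration; real recipes steps carry more fields and nothing here is an existing recipes API.

```r
# A stand-in for a trained step: a plain list with a class attribute.
# The fields are simplified; real recipes steps hold more information.
toy_step <- structure(
  list(terms = "all_nominal_predictors()", trained = TRUE, id = "dummy_abc"),
  class = c("step_toy", "step")
)

# Hypothetical helper: store the input/output map in a new `io` field.
set_io <- function(step, io) {
  step$io <- io
  step
}

toy_step <- set_io(toy_step, list(color_red = "color", color_blue = "color"))

names(toy_step)  # the existing fields, followed by "io"
class(toy_step)  # the class attribute is untouched
```

Because the existing fields and the class are left alone, code that only reads those fields would keep working under this design.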

Implementation

This is where we need help!

I'm thinking that this information could be represented as a list of character vectors or as a sparse matrix. Keep in mind that you will need one for each step and that we will want to "combine" these to get an inference of what happens to each variable.
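One possible reading of the "list of character vectors" option, including how two steps might be combined, is sketched below in base R. The maps and the compose_io() helper are hypothetical and only show the idea.

```r
# Each map: names are output variables, values are the inputs they depend on.
dummy_io <- list(area = "area", color_red = "color", color_blue = "color")
pca_io   <- list(PC1 = c("area", "color_red"), PC2 = c("area", "color_blue"))

# Compose two consecutive maps: for each output of `second`, look up
# the inputs of its inputs in `first`.
compose_io <- function(first, second) {
  lapply(second, function(inputs) {
    unique(unlist(first[inputs], use.names = FALSE))
  })
}

# Dependencies of the final outputs on the original variables
compose_io(dummy_io, pca_io)
```

Applying compose_io() repeatedly, for example with Reduce(), would collapse a whole recipe into a single map from final outputs to original variables.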

These are tasks I'm not very well versed in, and I wouldn't be surprised if there were an igraph function that would do what we need with ease.
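Short of a graph library, the matrix option can be sketched with base R alone. In the example below, each step is a 0/1 matrix with inputs as rows and outputs as columns, and steps are combined with a matrix product. The variable names and both matrices are made up for illustration; dense base matrices stand in for the sparse ones mentioned above, which a package such as Matrix could provide.

```r
# Rows are inputs, columns are outputs; 1 marks "output depends on input".
first_mat <- matrix(
  c(1, 0, 0,
    0, 1, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("area", "color"),
                  c("area", "color_red", "color_blue"))
)
second_mat <- matrix(
  c(1, 0,
    1, 0,
    0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("area", "color_red", "color_blue"),
                  c("f1", "f2"))
)

# The product counts paths from each original variable to each final
# output; any positive count means a dependency exists.
combined <- (first_mat %*% second_mat) > 0
combined
```

The row names of the result are the original variables and the column names are the final outputs, so chaining more steps is just a longer product.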

Backwards compatibility

Should be trivial, as we are "just" adding another field for each step.

Types of steps

  • one to one
  • one to many
  • many to many
  • removing
  • adding
  • none to none
  • all of above

In the above definition, "many" means any non-negative number of variables, so zero is included.
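To make a few of these categories concrete, here is one way their input/output maps might look. The draft does not define the categories further, so the maps below are my assumed reading of them, and the io_diff() helper and all variable names are hypothetical.

```r
# Hypothetical maps (names: outputs, values: inputs) for some step types.
one_to_one   <- list(area = "area")
one_to_many  <- list(color_red = "color", color_blue = "color")
many_to_many <- list(PC1 = c("a", "b"), PC2 = c("a", "b"))
removing     <- structure(list(), names = character(0))  # no outputs at all

# Variables created and removed by a step, given the variables it received.
io_diff <- function(io, inputs) {
  list(
    created = setdiff(names(io), inputs),
    removed = setdiff(inputs, names(io))
  )
}

io_diff(one_to_many, inputs = "color")
io_diff(many_to_many, inputs = c("a", "b"))
io_diff(removing, inputs = "zero_var")
```

A one-to-one step such as the first map creates and removes nothing under this definition, since its output keeps the name of its input.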

One to one

One to many

Many to many

Removing

Adding

None to none

All of above

@EmilHvitfeldt
Member Author

This problem can be solved much more nicely with #1199

EmilHvitfeldt added the "feature" label (a feature request or enhancement) on Sep 28, 2023
@EmilHvitfeldt
Member Author

This is related to #1137 as well

@EmilHvitfeldt
Member Author

Having ptype information is going to make this issue much nicer to solve; see #1329
