Step_integer() documentation and use with unseen data #1316

Open
nhward opened this issue May 28, 2024 · 2 comments

Comments

@nhward

nhward commented May 28, 2024

The step_integer documentation states:

Description

step_integer() creates a specification of a recipe step that will convert new data into a set of integers based on the original data values.

Naively, I thought that meant each value would be replaced by its integer truncation, as I was looking for some data-type conversion steps. I made this mistake because I did not read (more correctly, had forgotten) the Details section, which explains things fully. A (much) better description would be:

step_integer() creates a specification of a recipe step that will convert data into a set of ascending integers based on the ascending order of the original data values.
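
To make the difference between the two readings concrete, here is a minimal base R sketch. It is only a conceptual illustration of "ascending integers based on the ascending order of the values", not the recipes source code, and the variable names are made up for this example.

```r
# Conceptual sketch only; not how recipes implements the step internally.
train_values <- c(5.1, 4.9, 6.3, 4.9)

# The "key" is the sorted set of distinct training values.
key <- sort(unique(train_values))

# Reading 1 (what the step does): replace each value by its position in the key.
encoded <- match(train_values, key)

# Reading 2 (the naive one): replace each value by its integer truncation.
truncated <- trunc(train_values)

print(data.frame(train_values, encoded, truncated))
```

Here 4.9 maps to position 1 in the key but truncates to 4, so the two readings give different columns.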

Once the true nature of the recipe step is made clear, users can ask themselves whether unseen observations can ever be truly passed through this step sensibly.

The code below shows that observations whose Sepal.Length does not occur among the 50 training cases are given the integer value of zero.

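library(magrittr)  # provides the %>% pipe used below

# Draw 50 of the 150 iris rows at random as the training set
train <- iris %>%
  nrow() %>%
  sample.int(size = 50) %>%
  iris[., ]

# Fit step_integer() on the training rows, then bake all 150 rows
train %>%
  recipes::recipe() %>%
  recipes::step_integer(Sepal.Length) %>%
  recipes::prep(strings_as_factors = FALSE) %>%
  recipes::bake(new_data = iris) %>%
  View()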

Unless I misunderstand something, this recipe step is fundamentally flawed as a step that can process unseen data. The zero_based parameter does not address this problem either. If the goal is to replace the variable with its rank order, then new observations can never be processed sensibly as things stand, since neither 0 nor max+1 is a sensible value for a new observation.
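
A smaller, deterministic version of the same experiment may be easier to inspect than a random sample of iris. The sketch below assumes the recipes package is installed. The data frames and the helper `encode_new()` are invented for illustration. If the step behaves as described in this thread, the unseen values 5.0 and 7.0 should come back as 0 by default and as max+1 when `zero_based = TRUE`.

```r
library(recipes)

# Three distinct training values; the new data adds two unseen ones (5.0, 7.0).
train <- data.frame(x = c(5.1, 4.9, 6.3))
new   <- data.frame(x = c(4.9, 5.0, 6.3, 7.0))

# Hypothetical helper for this example: fit on `train`, then encode `new`.
encode_new <- function(zero_based) {
  rec <- recipe(~ x, data = train) %>%
    step_integer(x, zero_based = zero_based) %>%
    prep()
  bake(rec, new_data = new)$x
}

default_codes    <- encode_new(zero_based = FALSE)
zero_based_codes <- encode_new(zero_based = TRUE)

print(data.frame(x = new$x, default_codes, zero_based_codes))
```

In both settings the two unseen values share one code, so 5.0 (inside the training range) and 7.0 (above it) become indistinguishable after baking.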

Perhaps the description should read:

step_integer() creates a specification of a recipe step that will convert data into a set of ascending integers based on the ascending order of the original data values. Its strict validity is limited to its training data alone.

I hope I am not being a moaner by raising this. I use recipe() all the time and appreciate the work of others. I do fear this recipe step will cause more harm than good as it is so easy to misuse.

@EmilHvitfeldt
Member

EmilHvitfeldt commented May 28, 2024

Hello @nhward 👋 Thanks for bringing up this issue. I will respond to the 3 points I see:

clarity in documentation

Yes, I agree: the documentation is a little unclear as to what is actually done.

input types

I was surprised to see that this method works with numeric input. I was expecting (and we are only testing) that it works for character and factor input, which makes the most sense. I kinda want to deprecate the use of this step on non-character/factor input, but don't have a clear idea of how to do that right now.

validity

This step implements what is commonly called integer encoding (another ref). It is a well-defined method for dealing with categorical variables, although the performance is often not the best.
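
For the categorical case, a minimal sketch of integer encoding with this step might look as follows. It assumes the recipes package is installed, and the `colour` column and its values are made up for illustration. The training levels sort alphabetically as blue, green, red, so the expectation is that "purple", which never appears in training, receives the code 0.

```r
library(recipes)

# Toy categorical data; "purple" appears only in the new data.
train <- data.frame(colour = c("red", "green", "blue"))
new   <- data.frame(colour = c("green", "purple", "blue"))

rec <- recipe(~ colour, data = train) %>%
  step_integer(colour) %>%
  prep()

codes <- bake(rec, new_data = new)$colour

print(data.frame(colour = new$colour, codes))
```

Here the code 0 acts as an explicit "unknown level" bucket, which is easier to defend for nominal data than for numeric data with a natural order.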

@nhward
Author

nhward commented Jun 2, 2024

Great reply. I am familiar with label encoding but I did not immediately see this as label encoding until you pointed this out as I was wearing a numeric variate hat.

Personally, I avoid label encoding because, for methods that depend on distance calculations, it makes observation distances less meaningful for nominal variables.

Documentation

I noticed that a Google search for "R Recipe step label encoding" does not return a reference to step_integer() so I suspect I am not alone in my misunderstanding. I see that step_ordinalscore() is designed for label encoding ordinal variables.

The wording employed in the documentation of step_integer() leads me to speculate that it was designed to generalize label encoding to deal with all data types. In contrast, proper label encoding is restricted to categorical data only (i.e. nominal & ordinal). It is curious that the documentation makes no reference to "label-encoding" - perhaps this can be fixed easily.

Data types

Ordinal data should be handled with step_ordinalscore() or step_dummy(), which throw errors if a new ordinal level is introduced in unseen data. step_integer() does not.

Numeric data should only be allowed in step_integer() if its cardinality during training is low, say, < 15. Even so, the assignment of new numbers to 0 or max+1 is dubious for numeric data, which has an obvious rank, as does ordinal data.

Alternatively (and most preferably), numeric data could throw an error if a new numeric value is introduced in unseen data, as per step_ordinalscore(). Ordinal variables (if permitted) should do the same.
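
As a rough illustration of the "throw an error on unseen values" idea, the base R sketch below runs a guard before encoding. `assert_seen_values()` is a hypothetical helper written for this example. It is not part of recipes, and it is not a proposal for how the package should implement the check.

```r
# Hypothetical helper, not part of recipes: stop if `new_values` contains
# any non-missing value that was absent from `train_values`.
assert_seen_values <- function(new_values, train_values, name = "x") {
  unseen <- setdiff(unique(new_values[!is.na(new_values)]), train_values)
  if (length(unseen) > 0) {
    stop(
      sprintf(
        "Column '%s' has %d value(s) not seen in training: %s",
        name, length(unseen), paste(unseen, collapse = ", ")
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

train_values <- c(4.9, 5.1, 6.3)

# All values were seen in training, so this passes silently.
assert_seen_values(c(4.9, 6.3), train_values)

# 7.0 was never seen, so the guard raises an error; capture its message.
result <- tryCatch(
  assert_seen_values(c(4.9, 7.0), train_values),
  error = function(e) conditionMessage(e)
)
print(result)
```

A guard like this fails loudly instead of silently mapping new values to 0 or max+1, at the cost of making the pipeline unusable on any new numeric value.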

I seem to be introducing more problems than solutions. I will leave it to you to assess the value of these suggestions.
