Standardize names for steps that create dummy variables #918

Closed
EmilHvitfeldt opened this issue Feb 24, 2022 · 5 comments

Comments

@EmilHvitfeldt
Member

The use of dummy in step names has led to some confusion, especially with the addition of step_dummy_multi_choice() and step_dummy_extract(), which have dummy as part of their name, while other steps such as step_regex(), step_count(), step_indicate_na(), and step_holiday(), which do produce dummies, do not.

Before I go any further I'm going to lay down some terminology.

  • Dummy variable: A numeric variable that only takes the value 0 or 1 and indicates a categorical effect (also known as an indicator variable).
  • Count variable: A numeric variable that indicates the number of occurrences. Can be any whole number.

Using the above definitions, I would say that:

  • step_dummy() produces a set of dummy variables.
  • step_dummy_multi_choice() produces a set of dummy variables.
  • step_holiday() produces a set of dummy variables.
  • step_dummy_extract() produces a set of count variables.
  • step_indicate_na() produces a single dummy variable.
  • step_regex() produces a single dummy variable.
  • step_count() produces a single count variable (when normalize = FALSE)
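
To make the dummy/count distinction concrete, here is a minimal base R sketch. It is not how recipes implements any of these steps; the `text` vector and the variable names are made up for illustration.

```r
# Hypothetical example data (an assumption, not taken from recipes).
text <- c("rock and rock", "rock", "jazz")

# Dummy variable: 1 if the pattern occurs at least once, 0 otherwise.
dummy_rock <- as.integer(grepl("rock", text))

# Count variable: number of non-overlapping matches in each element.
# gregexpr() returns -1 when there is no match, so only positive
# match positions are counted.
matches <- gregexpr("rock", text)
count_rock <- vapply(matches, function(m) sum(m > 0), integer(1))

# Side by side: the dummy only says "present or not",
# the count says "how many times".
data.frame(text, dummy_rock, count_rock)
```

The first row is where the two differ: the pattern appears twice, so the count carries information the dummy discards.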

One way to standardize the naming would be to rename step_holiday() to step_dummy_holiday(), step_regex() to step_dummy_regex(), and so on.

Not all dummy steps can have a related count step, but all count steps can have a related dummy step.

I'm not sure what to do naming-wise for steps that produce counts, as there are only step_count() and step_dummy_extract(). step_dummy_extract() could in theory be changed to return dummies instead of counts, with another step called step_count_extract() created to do what step_dummy_extract() does now.
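
As a rough illustration of that split, the base R sketch below extracts labels from delimited strings and builds both a count version and a dummy version. The data, the comma separator, and all object names are assumptions for the example; it does not reproduce the actual behavior or interface of step_dummy_extract().

```r
# Hypothetical multi-label strings; labels are separated by commas.
genres <- c("rock,pop,rock", "jazz", "pop")

# Split each string into labels and fix a shared set of levels.
labels <- strsplit(genres, ",", fixed = TRUE)
lvls <- sort(unique(unlist(labels)))

# Count version: how many times each label occurs in each row.
counts <- t(vapply(
  labels,
  function(x) as.integer(table(factor(x, levels = lvls))),
  integer(length(lvls))
))
colnames(counts) <- lvls

# Dummy version: collapse the counts to 0/1 presence indicators.
dummies <- (counts > 0) * 1L
```

The dummy matrix is derived from the count matrix by thresholding at zero, which is why every count step can have a dummy counterpart but not the other way around.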

All of the above uses a somewhat loose definition of categorical effect.

@juliasilge
Member

Two that are especially confusingly named right now are step_dummy_extract() and step_regex(). One option could be:

  • step_regex() ➡️ step_detect_regex()
  • step_dummy_extract() ➡️ step_extract_regex()
  • step_count() ➡️ step_count_regex()

I feel like step_dummy_multi_choice() could be better. One option would be step_dummy_coalesce()?

I'd lean toward keeping step_holiday() as is and then adjusting the title to say it makes dummy variables:

Generate dummy variables for holidays

@EmilHvitfeldt
Member Author

Another verb we have going on here is indicate, which in my mind is very similar to detect. It would be nice to avoid keeping both around.

@EmilHvitfeldt
Member Author

> I feel like step_dummy_multi_choice() could be better. One option would be step_dummy_coalesce()?

I like coalesce!

@EmilHvitfeldt
Member Author

Overall a good idea, but I don't think the benefit of unifying these function names outweighs the annoyance we would cause by changing them.

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 22, 2024