step_pca + prep changes non-predictor column names when the names end with ... followed by a number #1347

Open
ilaria-kode opened this issue Jul 9, 2024 · 1 comment · May be fixed by #1348
Labels
bug an unexpected problem or unintended behavior

Comments

@ilaria-kode

The problem

When prepping a recipe that includes a step_pca() step, columns that are not used by the PCA but whose names end with the "...[:digit:]" pattern (which is what you typically get when a dataset is loaded with .name_repair = "unique") are renamed by the call to prep().

I have noticed that this only happens when the number in an affected column's name is not "aligned" with the column's position in the data frame (i.e. the column is named foo...6 but is in position 1 in the resulting data frame, see example).

Is there a way to change this behaviour and force the recipe to keep the column names untouched?
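
To make the "aligned" condition concrete, here is a minimal base R sketch. The helper name `misaligned_suffix()` is hypothetical and not part of recipes or vctrs. It assumes "aligned" means that the numeric suffix after `...` equals the column's position in the vector of names being checked.

```r
# Hypothetical helper (not part of recipes or vctrs): flags names that end in
# `...<number>` where the number differs from the name's position.
misaligned_suffix <- function(nms) {
  has_suffix <- grepl("\\.\\.\\.[0-9]+$", nms)
  suffix <- rep(NA_integer_, length(nms))
  suffix[has_suffix] <- as.integer(
    sub("^.*\\.\\.\\.([0-9]+)$", "\\1", nms[has_suffix])
  )
  # FALSE & NA is FALSE, so names without a suffix are never flagged
  has_suffix & suffix != seq_along(nms)
}

# Column names from the first example: suffixes match positions 5 and 6
aligned <- misaligned_suffix(c("x1", "x2", "x3", "x4", "foo...5", "foo...6"))

# Column names from the second example: suffixes 10 and 11 sit at positions 1 and 2
misaligned <- misaligned_suffix(c("foo...10", "foo...11", "x1", "x2", "x3", "x4"))
```

Only the columns flagged in `misaligned` correspond to the ones renamed in the example below.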

Reproducible example

library(tidyverse)

# when the columns affected by name repair have names that are aligned with
# their position in the dataset, the names are kept the same
sample_data <- tibble::tibble(
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10),
  foo = runif(10),
  foo = runif(10),
  .name_repair = "unique"
)
#> New names:
#> • `foo` -> `foo...5`
#> • `foo` -> `foo...6`
sample_data
#> # A tibble: 10 × 6
#>        x1    x2     x3    x4 foo...5  foo...6
#>     <dbl> <dbl>  <dbl> <dbl>   <dbl>    <dbl>
#>  1 0.0328 0.796 0.0740 0.543   0.812 0.000651
#>  2 0.313  0.752 0.837  0.428   0.803 0.942   
#>  3 0.451  0.758 0.864  0.991   0.737 0.403   
#>  4 0.0677 0.636 0.937  0.758   0.826 0.787   
#>  5 0.103  0.852 0.682  0.801   0.314 0.530   
#>  6 0.522  0.120 0.233  0.708   0.650 0.266   
#>  7 0.110  0.737 0.605  0.389   0.617 0.356   
#>  8 0.199  0.471 0.684  0.735   0.664 0.324   
#>  9 0.659  0.106 0.536  0.555   0.818 0.347   
#> 10 0.830  0.996 0.669  0.366   0.881 0.315

# expected behaviour, colnames are retained
rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)

rec$template
#> # A tibble: 10 × 4
#>    foo...5  foo...6    PC1     PC2
#>      <dbl>    <dbl>  <dbl>   <dbl>
#>  1   0.812 0.000651 -0.792  0.344 
#>  2   0.803 0.942    -1.21   0.104 
#>  3   0.737 0.403    -1.57  -0.108 
#>  4   0.826 0.787    -1.31   0.160 
#>  5   0.314 0.530    -1.32   0.260 
#>  6   0.650 0.266    -0.728 -0.475 
#>  7   0.617 0.356    -0.994  0.266 
#>  8   0.664 0.324    -1.10  -0.0298
#>  9   0.818 0.347    -0.845 -0.569 
#> 10   0.881 0.315    -1.36  -0.138

# when the columns affected by name repair have names that are not
# aligned with their position in the dataset,
# the names are changed after step_pca + prep
sample_data <- tibble::tibble(
  foo...10 = runif(10), # forcing different numbering
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
sample_data
#> # A tibble: 10 × 6
#>    foo...10 foo...11     x1      x2      x3     x4
#>       <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
#>  1    0.438    0.846 0.258  0.768   0.131   0.885 
#>  2    0.816    0.255 0.0259 0.492   0.784   0.0304
#>  3    0.271    0.760 0.861  0.00659 0.00730 0.838 
#>  4    0.788    0.696 0.601  0.845   0.283   0.587 
#>  5    0.286    0.531 0.500  0.676   0.582   0.0797
#>  6    0.289    0.553 0.417  0.258   0.682   0.922 
#>  7    0.239    0.322 0.423  0.937   0.338   0.245 
#>  8    0.531    0.210 0.826  0.139   0.162   0.522 
#>  9    0.227    0.136 0.199  0.922   0.515   0.434 
#> 10    0.217    0.107 0.0141 0.0988  0.116   0.0111

rec <- recipes::recipe(sample_data, formula = ~.) %>%
  recipes::update_role(contains("foo"), new_role = "info") %>%
  recipes::step_pca(
    num_comp = 2,
    recipes::all_numeric_predictors()
  ) %>%
  recipes::prep(strings_as_factors = FALSE)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
rec
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs 
#> Number of variables by role
#> predictor: 4
#> info:      2
#> 
#> ── Training information 
#> Training data contained 10 data points and no incomplete rows.
#> 
#> ── Operations 
#> • PCA extraction with: x1, x2, x3, x4 | Trained
rec$template
#> # A tibble: 10 × 4
#>    foo...1 foo...2    PC1     PC2
#>      <dbl>   <dbl>  <dbl>   <dbl>
#>  1   0.438   0.846 -1.10  -0.0898
#>  2   0.816   0.255 -0.619  0.575 
#>  3   0.271   0.760 -0.854 -0.838 
#>  4   0.788   0.696 -1.20   0.0149
#>  5   0.286   0.531 -0.897  0.354 
#>  6   0.289   0.553 -1.10  -0.257 
#>  7   0.239   0.322 -1.01   0.355 
#>  8   0.531   0.210 -0.806 -0.514 
#>  9   0.227   0.136 -1.07   0.422 
#> 10   0.217   0.107 -0.115  0.0920

Created on 2024-07-09 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language EN
#>  collate  Italian_Italy.utf8
#>  ctype    Italian_Italy.utf8
#>  tz       Europe/Rome
#>  date     2024-07-09
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  class          7.3-22     2023-05-03 [2] CRAN (R 4.3.1)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.3.1)
#>  codetools      0.2-19     2023-02-01 [2] CRAN (R 4.3.1)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.3.1)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.3.1)
#>  digest         0.6.33     2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr        * 1.1.3      2023-09-03 [1] CRAN (R 4.3.1)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.3.1)
#>  evaluate       0.22       2023-09-29 [1] CRAN (R 4.3.1)
#>  fansi          1.0.5      2023-10-08 [1] CRAN (R 4.3.1)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.3.1)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.3.1)
#>  fs             1.6.3      2023-07-20 [1] CRAN (R 4.3.1)
#>  future         1.33.0     2023-07-01 [1] CRAN (R 4.3.1)
#>  future.apply   1.11.0     2023-05-21 [1] CRAN (R 4.3.1)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.3.1)
#>  ggplot2      * 3.4.4      2023-10-12 [1] CRAN (R 4.3.1)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.3.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.3.1)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.3.0)
#>  gtable         0.3.4      2023-08-21 [1] CRAN (R 4.3.1)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.3.1)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.3.1)
#>  htmltools      0.5.6.1    2023-10-06 [1] CRAN (R 4.3.1)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.3.1)
#>  knitr          1.44       2023-09-11 [1] CRAN (R 4.3.1)
#>  lattice        0.21-8     2023-04-05 [2] CRAN (R 4.3.1)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.3.1)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.3.1)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.3.1)
#>  lubridate    * 1.9.3      2023-09-27 [1] CRAN (R 4.3.1)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.3.1)
#>  MASS           7.3-60     2023-05-04 [2] CRAN (R 4.3.1)
#>  Matrix         1.5-4.1    2023-05-18 [2] CRAN (R 4.3.1)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.3.1)
#>  nnet           7.3-19     2023-05-03 [2] CRAN (R 4.3.1)
#>  parallelly     1.36.0     2023-05-26 [1] CRAN (R 4.3.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.3.1)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.3.1)
#>  prodlim        2023.08.28 2023-08-28 [1] CRAN (R 4.3.1)
#>  purrr        * 1.0.2      2023-08-10 [1] CRAN (R 4.3.1)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.3.3)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.3.3)
#>  R.oo           1.26.0     2024-01-24 [1] CRAN (R 4.3.3)
#>  R.utils        2.12.3     2023-11-18 [1] CRAN (R 4.3.3)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.3.1)
#>  Rcpp           1.0.11     2023-07-06 [1] CRAN (R 4.3.1)
#>  readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.3.1)
#>  recipes        1.0.8      2023-08-25 [1] CRAN (R 4.3.1)
#>  reprex         2.1.1      2024-07-06 [1] CRAN (R 4.3.3)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.3.1)
#>  rmarkdown      2.25       2023-09-18 [1] CRAN (R 4.3.1)
#>  rpart          4.1.19     2022-10-21 [2] CRAN (R 4.3.1)
#>  rstudioapi     0.15.0     2023-07-07 [1] CRAN (R 4.3.1)
#>  scales         1.2.1      2022-08-20 [1] CRAN (R 4.3.1)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.3.1)
#>  stringi        1.7.12     2023-01-11 [1] CRAN (R 4.3.0)
#>  stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.3.1)
#>  styler         1.10.3     2024-04-07 [1] CRAN (R 4.3.3)
#>  survival       3.5-5      2023-03-12 [2] CRAN (R 4.3.1)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.3.1)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.3.1)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.3.1)
#>  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.3.1)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.3.1)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.3.0)
#>  tzdb           0.4.0      2023-05-12 [1] CRAN (R 4.3.1)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.3.1)
#>  vctrs          0.6.4      2023-10-12 [1] CRAN (R 4.3.1)
#>  withr          2.5.1      2023-09-26 [1] CRAN (R 4.3.1)
#>  xfun           0.40       2023-08-09 [1] CRAN (R 4.3.1)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.3.0)
#> 
#>  [1] C:/Users/ilari/AppData/Local/R/win-library/4.3
#>  [2] C:/Program Files/R/R-4.3.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@EmilHvitfeldt EmilHvitfeldt added the bug an unexpected problem or unintended behavior label Jul 9, 2024
@EmilHvitfeldt
Member

Hello @ilaria-kode 👋

Thanks for filing this bug report. This does appear to be a bug.

In step_pca() we call vctrs::vec_cbind() on the data, which is where this issue comes from.

sample_data <- tibble::tibble(
  foo...10 = runif(10),
  foo...11 = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  x3 = runif(10),
  x4 = runif(10)
)
vctrs::vec_cbind(sample_data)
#> New names:
#> • `foo...10` -> `foo...1`
#> • `foo...11` -> `foo...2`
#> # A tibble: 10 × 6
#>    foo...1  foo...2     x1     x2      x3    x4
#>      <dbl>    <dbl>  <dbl>  <dbl>   <dbl> <dbl>
#>  1   0.845 0.394    0.820  0.0637 0.00586 0.165
#>  2   0.409 0.367    0.0105 0.492  0.411   0.757
#>  3   0.409 0.702    0.0760 0.829  0.917   0.780
#>  4   0.435 0.331    0.335  0.940  0.202   0.334
#>  5   0.953 0.0558   0.0857 0.395  0.434   0.129
#>  6   0.130 0.340    0.258  0.161  0.793   0.939
#>  7   0.507 0.000829 0.296  0.547  0.318   0.115
#>  8   0.293 0.00540  0.733  0.860  0.739   0.374
#>  9   0.113 0.0957   0.153  0.684  0.894   0.397
#> 10   0.590 0.737    0.724  0.955  0.329   0.301

Created on 2024-07-09 with reprex v2.1.0
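
For comparison, a minimal sketch of the same call with name repair switched off. It uses the documented `.name_repair` argument of `vctrs::vec_cbind()`, whose default is `"unique"`. Whether the linked pull request takes this approach is not stated in this thread, so treat it as an illustration of the mechanism rather than the actual fix.

```r
# Smaller version of the data above, with suffixes that do not match positions
sample_data <- tibble::tibble(
  foo...10 = runif(10),
  foo...11 = runif(10),
  x1 = runif(10)
)

# "minimal" performs no name repair, so the input names pass through unchanged
kept <- vctrs::vec_cbind(sample_data, .name_repair = "minimal")
names(kept)
```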

More reading on why this is happening: r-lib/vctrs#685
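
Until a fix is released, one possible workaround is to rename the affected columns yourself before building the recipe, so that no name ends in the `...<number>` pattern that name repair rewrites. This is a base R sketch under that assumption. The `_` replacement is an arbitrary choice, and it does change the names, only in a way you control.

```r
# Stand-in for the data in the report, built with base R only
sample_data <- data.frame(
  "foo...10" = runif(10),
  "foo...11" = runif(10),
  x1 = runif(10),
  x2 = runif(10),
  check.names = FALSE
)

# Replace a trailing `...<number>` with `_<number>`
names(sample_data) <- sub("\\.\\.\\.([0-9]+)$", "_\\1", names(sample_data))
names(sample_data)
```

The renamed data can then be passed to recipe() as in the reproducible example, with update_role(contains("foo"), ...) still matching the renamed columns.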

@EmilHvitfeldt EmilHvitfeldt linked a pull request Jul 9, 2024 that will close this issue