Encoding new levels for factors #508

Closed
statist-bhfz opened this issue Sep 20, 2020 · 2 comments

statist-bhfz commented Sep 20, 2020

My question is somewhat related to #71.
I'm not able to implement a very simple approach for handling, at the prediction stage, factor levels that were unseen during training:

library(data.table)
library(mlr3verse)

# Training data: fct only has levels 1 and 2
dt_train <- data.table(fct = factor(c(1, 1, 1, 2, 2, 2)),
                       target_result = factor(1:2))
# Test data: fct contains level 3, which is unseen during training
dt_test <- data.table(fct = factor(c(2, 3)),
                      target_result = factor(1:2))
task <- TaskClassif$new(id = "id",
                        backend = dt_train,
                        target = "target_result")
task_test <- TaskClassif$new(id = "id",
                             backend = dt_test,
                             target = "target_result")

# Pipeline: fix factor levels, add missing indicators next to the
# unchanged features, impute out-of-range, then one-hot encode
gr <- po("fixfactors") %>>%
  list(
    po("nop"),
    po("missind")
    ) %>>%
  po("featureunion") %>>%
  po("imputeoor") %>>%
  po("encode", method = "one-hot")

gr$train(task)

res <- gr$predict(task_test)

# Actual output on the test task
res$encode.output$data()

#    target_result fct.1 fct.2
# 1:             1     0     1
# 2:             2    NA    NA

Desired output:

#    target_result fct.1 fct.2 .MISSING
# 1:             1     0     1        0
# 2:             2     0     0        1

The NAs can be replaced with 0 using po("imputeconstant") (mlr_pipeops_imputeconstant), which is sufficient in most cases, but how can I add a column for the new factor level(s)? This looks like an issue with the interaction between po("fixfactors") and po("missind") when they are applied sequentially.


pfistfl commented Sep 23, 2020

Hey,

If I understand your question correctly, your goal is to add a .MISSING dummy indicator at prediction time when a level was not available during training.
This is not sensible, as this column would be constant during training and thus contain virtually no information.
To me it is also unclear how any algorithm could do something meaningful with such new information during predict.
Feel free to provide references or hints towards situations where this is dealt with differently; happy to learn there.

po("fixfactors") thus simply recodes the new factor level to NA, and several imputation strategies can then be employed to impute a different level.
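
To illustrate the recoding step described above, here is a minimal base-R sketch. It is not the mlr3pipelines implementation; the helper name `fix_levels` is hypothetical and only mimics the idea of restricting new data to the levels seen during training.

```r
# Factor columns taken from the example in this thread.
train_fct <- factor(c(1, 1, 1, 2, 2, 2))
test_fct  <- factor(c(2, 3))

# Hypothetical helper: keep only the levels seen during training.
# Any value outside `train_levels` becomes NA.
fix_levels <- function(x, train_levels) {
  factor(as.character(x), levels = train_levels)
}

fixed <- fix_levels(test_fct, levels(train_fct))

levels(fixed)  # only the training levels "1" and "2" remain
is.na(fixed)   # the unseen level 3 has been recoded to NA
```

What happens to that NA afterwards depends on the imputation step that follows, which is the part the rest of the thread is about.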


statist-bhfz commented Sep 23, 2020

My question was more about unexpected behaviour than about practical usage.
I slightly changed my example by adding an NA to the training set:

dt_train <- data.table(fct = factor(c(1, 1, NA, 2, 2, 2)),
                       target_result = factor(1:2))

and got the desired output (new factor levels are mapped to the same fct..MISSING level as NAs):

   target_result fct.1 fct.2 fct..MISSING
1:             1     0     1            0
2:             2     0     0            1

So everything works fine: the additional column is produced only when it makes sense.
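
The behaviour observed in the two runs can be reproduced with a small base-R sketch. This is only an illustration under the assumption that the missing-level column is created when, and only when, the training column contained NAs. The functions `train_encoder` and `apply_encoder` are hypothetical and are not part of mlr3pipelines.

```r
# Hypothetical "train" step: remember the levels and whether NAs occurred.
train_encoder <- function(x) {
  list(levels = levels(x), had_na = anyNA(x))
}

# Hypothetical "predict" step: one-hot encode new data with the stored state.
apply_encoder <- function(x, state, prefix = "fct") {
  # Unseen levels become NA, as in the recoding step of the pipeline.
  x <- factor(as.character(x), levels = state$levels)
  onehot <- outer(as.character(x), state$levels, FUN = "==")
  storage.mode(onehot) <- "integer"
  colnames(onehot) <- paste0(prefix, ".", state$levels)
  if (state$had_na) {
    # NAs were seen in training: map missing values to their own column.
    missing <- is.na(x)
    onehot[missing, ] <- 0L
    onehot <- cbind(onehot, as.integer(missing))
    colnames(onehot)[ncol(onehot)] <- paste0(prefix, "..MISSING")
  }
  onehot
}

test_fct <- factor(c(2, 3))

# First run: no NA in training, so the unseen level stays NA.
state_no_na <- train_encoder(factor(c(1, 1, 1, 2, 2, 2)))
res_no_na   <- apply_encoder(test_fct, state_no_na)

# Second run: one NA in training, so the unseen level maps to fct..MISSING.
state_na <- train_encoder(factor(c(1, 1, NA, 2, 2, 2)))
res_na   <- apply_encoder(test_fct, state_na)

print(res_no_na)
print(res_na)
```

With the data from this thread, `res_no_na` has the columns fct.1 and fct.2 with NA in the second row, and `res_na` has the extra fct..MISSING column, matching the two outputs posted above.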
