Encoding new levels for factors #508

statist-bhfz · 2020-09-20T13:23:23Z

My question is somewhat related to #71.
I can't implement a very simple approach to handling, at the prediction stage, factor levels that were unseen during training:

library(data.table)
library(mlr3verse)

dt_train <- data.table(fct = factor(c(1, 1, 1, 2, 2, 2)),
                       target_result = factor(1:2))
dt_test <- data.table(fct = factor(c(2, 3)),
                      target_result = factor(1:2))
task <- TaskClassif$new(id = "id", 
                        backend = dt_train, 
                        target = "target_result")
task_test <- TaskClassif$new(id = "id", 
                        backend = dt_test, 
                        target = "target_result")

gr <- po("fixfactors") %>>%
  list(
    po("nop"),
    po("missind")
    ) %>>% 
  po("featureunion") %>>%
  po("imputeoor") %>>%
  po("encode", method = "one-hot")

gr$train(task)

res <- gr$predict(task_test)

res$encode.output$data()

#    target_result fct.1 fct.2
# 1:             1     0     1
# 2:             2    NA    NA

Desired output:

#    target_result fct.1 fct.2 .MISSING
# 1:             1     0     1        0
# 2:             2     0     0        1

NA can be replaced with 0 using mlr_pipeops_imputeconstant() (which is sufficient in most cases), but how can I add a column for the new factor level(s)? It looks like an issue with the interaction between po("fixfactors") and po("missind") applied sequentially.


pfistfl · 2020-09-23T07:05:25Z

Hey,

If I understand your question correctly, your goal is to add a .MISSING dummy indicator at test time when a level was not available during training.
This is not sensible, as this column would be constant during training and thus contain virtually no information.
It is also unclear to me how any algorithm could do something meaningful with such new information during prediction.
Feel free to provide references or hints towards situations where this is dealt with differently; I'm happy to learn there.

po("fixfactors") thus simply recodes the new factor level to NA, and several imputation strategies can then be employed to impute a different level.
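
To make the recode-then-impute idea concrete, here is a minimal base-R sketch. It does not use mlr3pipelines and is not its implementation; the variable names and the choice of mode imputation are assumptions made for illustration only.

```r
# Factor values from the example above.
train_fct <- factor(c(1, 1, 1, 2, 2, 2))
test_fct  <- factor(c(2, 3))

# Rebuild the test factor using only the training levels;
# values outside those levels (here "3") become NA.
test_fixed <- factor(as.character(test_fct), levels = levels(train_fct))

# One possible imputation strategy (an assumption, not the only option):
# replace NA with the most frequent training level.
# On ties, which.max() picks the first level.
mode_level <- names(which.max(table(train_fct)))
test_imputed <- test_fixed
test_imputed[is.na(test_imputed)] <- mode_level

print(test_fixed)
print(test_imputed)
```

Any other imputation (a constant level, sampling from the training distribution) could replace the mode step; the key point is that the unseen level is first mapped to NA.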

statist-bhfz · 2020-09-23T11:08:30Z

My question was more about unexpected behaviour than about practical usage.
I slightly changed my example by adding NAs to the training set:

dt_train <- data.table(fct = factor(c(1, 1, NA, 2, 2, 2)),
                       target_result = factor(1:2))

and got the desired output (mapping new factor levels to the same fct..MISSING level as NA's):

   target_result fct.1 fct.2 fct..MISSING
1:             1     0     1            0
2:             2     0     0            1

So everything works fine: the additional column is produced only when it makes sense.
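
The behaviour observed in both variants can be mimicked with a short base-R sketch. This only reproduces the outputs shown in this thread; it is not how mlr3pipelines is implemented, and the helpers learn_levels() and encode_one_hot() are hypothetical names introduced here.

```r
# Hypothetical helper: learn the encoding levels from training data, adding
# an out-of-range ".MISSING" level only if the training data contain NA.
learn_levels <- function(x) {
  lv <- levels(x)
  if (anyNA(x)) c(lv, ".MISSING") else lv
}

# Hypothetical helper: one-hot encode x against the learned levels.
encode_one_hot <- function(x, lv) {
  x <- as.character(x)
  # Unseen or missing values map to ".MISSING" if that level was learned;
  # otherwise they stay NA.
  fill <- if (".MISSING" %in% lv) ".MISSING" else NA_character_
  x[is.na(x) | !(x %in% lv)] <- fill
  out <- sapply(lv, function(l) as.integer(x == l))
  matrix(out, nrow = length(x), dimnames = list(NULL, paste0("fct.", lv)))
}

train_no_na <- factor(c(1, 1, 1, 2, 2, 2))
train_na    <- factor(c(1, 1, NA, 2, 2, 2))
test        <- factor(c(2, 3))

# No NA in training: the unseen level yields NA in every dummy column.
enc_no_na <- encode_one_hot(test, learn_levels(train_no_na))
# NA in training: the unseen level is mapped to the fct..MISSING column.
enc_na <- encode_one_hot(test, learn_levels(train_na))

print(enc_no_na)
print(enc_na)
```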

statist-bhfz closed this as completed Sep 23, 2020
