Skip to content

preserve task attributes when creating reduced_task by modifying data using mlr3pipelines#11

Open
mikoontz wants to merge 2 commits intobips-hb:masterfrom
mikoontz:reduced-task-creation-as-mlr3-pipeline
Open

preserve task attributes when creating reduced_task by modifying data using mlr3pipelines#11
mikoontz wants to merge 2 commits intobips-hb:masterfrom
mikoontz:reduced-task-creation-as-mlr3-pipeline

Conversation

@mikoontz
Copy link
Copy Markdown

@mikoontz mikoontz commented Dec 7, 2022

This PR fixes what I think could be a problematic approach to creating a new mlr3 task with knockoff data.

Currently, the reduced task is created by creating a new data.frame with knockoff data substituted for a feature (or group of features), then creating a new task using mlr3::as_task_regr() or mlr3::as_task_classif(), then specifying the new data.frame as the backend, then specifying the same target as the original task as the target.

This approach doesn't respect other settings of the Task, such as observation-level weights or coordinates (as in the case when a user is setting up a spatial cross validation). It also requires logic that queries whether a task is a regression or classification task in order to call the correct function (mlr3::as_task_regr() or mlr3::as_task_classif()). This logic break with other possible task types that should still work (e.g., classif_st in the case where a user wants to set up spatial cross validation using coordinates associated with each observation with mlr3spatiotempcv::as_task_classif_st().

From here, the preferred way to modify the data in a task is to use mlr3pipelines. This PR implements this approach and adds a "suggests" dependency (mlr3pipelines package) in order to use the PipeOpMutate method. It works for both individual features and grouped features. An ancillary benefit is that we can reduce the logic complexity because we no longer need to query whether the task is regression or classification. It also allows for other task types that aren't strictly "classif" or "regr" (e.g., "classif_st" for spatial cross validation).

All checks pass when building and NAMESPACE and DESCRIPTION documentation was updated using {roxygen2::roxygenize()` and manual editing, respectively.

…koff data substituted for one feature or a group of features) using the mlr3pipelines mutate operation rather than creating a new task with the same target and a new data backend. This reduces the need to query what kind of task (regression or classification), and also preserves other task attributes (e.g., weights, coordinates) in the reduced task.
@mikoontz
Copy link
Copy Markdown
Author

mikoontz commented Dec 8, 2022

I just noticed that the logic of deciding whether the task_type is "regr" or "classif" does exist at least one other place in {cpi}. In the cpi() function, this block:

  if (is.null(measure)) {
    if (task$task_type == "regr") {
      measure <- msr("regr.mse")
    } else if (task$task_type == "classif") {
      measure <- msr("classif.logloss")
    } else {
      stop("Unknown task type.")
    }
  }

So this will break if a user has a "classif_st" task_type, even though the code block that would execute for the "classif" task_type should work fine.

Just noting that the "ancillary benefit" I point out above doesn't fully realize that benefit if a user doesn't specify a value for the measure argument!

Using grepl could work, but maybe that overcorrects and is too anti-conservative (happy to make a separate PR if helpful):

  if (is.null(measure)) {
    if (grepl(x = task$task_type, pattern = "regr")) {
      measure <- msr("regr.mse")
    } else if (grepl(x = task$task_type, pattern = "classif")) {
      measure <- msr("classif.logloss")
    } else {
      stop("Unknown task type.")
    }
  }

@mikoontz mikoontz changed the title preserve task features when creating reduced_task by modifying data using mlr3pipelines preserve task attributes when creating reduced_task by modifying data using mlr3pipelines Dec 8, 2022
@mnwright
Copy link
Copy Markdown
Member

Thanks for the PR! I think using pipelines for this makes perfect sense and I wonder why I didn't come up with that myself?

I will look more into this later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants