Permutation Importance for Random Forests
permimp.Rd
Standard and partial/conditional permutation importance for random forest
objects fit using the party or randomForest packages, following the
permutation principle of the 'mean decrease in accuracy' importance in
randomForest. The partial/conditional permutation importance is implemented
differently from the original varimp function in party: the predictors to
condition on are selected in each tree using Pearson chi-squared tests applied
to the predictors categorized by the split points. In general, the new
implementation gives results similar to those of the original varimp function.
With asParty = TRUE, the partial/conditional permutation importance is fully
backward-compatible with, but faster than, the original varimp function in
party.
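The permutation principle mentioned above can be illustrated with a minimal base R sketch. It is not the permimp or randomForest implementation: a linear model stands in for a tree, a hold-out set stands in for the out-of-bag sample, and all names (`dat`, `fit`, `perm_importance`) are made up for illustration.

```r
# Simplified illustration of the permutation principle; NOT the package code.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 * dat$x1 + rnorm(n, sd = 0.5)  # only x1 carries signal

train <- dat[1:100, ]
test  <- dat[101:200, ]                   # stand-in for the out-of-bag sample
fit   <- lm(y ~ x1 + x2, data = train)    # stand-in for a single tree

# Mean squared prediction error of a model on new data.
mse <- function(model, newdata) {
  mean((newdata$y - predict(model, newdata))^2)
}

# Importance of one predictor: increase in error after permuting it.
perm_importance <- function(model, newdata, var) {
  permuted <- newdata
  permuted[[var]] <- sample(permuted[[var]])  # break the link between var and y
  mse(model, permuted) - mse(model, newdata)
}

imp <- sapply(c("x1", "x2"), function(v) perm_importance(fit, test, v))
print(imp)
```

In a forest, this difference would be computed per tree and then averaged over the trees; the sketch shows a single model only.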
Usage
permimp(object, ...)
# S3 method for class 'randomForest'
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
conditional = FALSE, threshold = .95, whichxnames = NULL,
thresholdDiagnostics = FALSE, progressBar = interactive(), do_check = TRUE,
oldSeedSelection = FALSE, cl = NULL, ...)
# S3 method for class 'RandomForest'
permimp(object, nperm = 1, OOB = TRUE, scaled = FALSE,
conditional = FALSE, threshold = .95, whichxnames = NULL,
thresholdDiagnostics = FALSE, progressBar = interactive(),
pre1.0_0 = conditional, AUC = FALSE, asParty = FALSE, mincriterion = 0,
oldSeedSelection = FALSE, cl = NULL, ...)
Arguments
- object
an object as returned by cforest or randomForest.
- mincriterion
the value of the test statistic or 1 - p-value that must be exceeded in order to include a split in the computation of the importance. The default mincriterion = 0 guarantees that all splits are included.
- conditional
a logical that determines whether unconditional or conditional permutation is performed.
- threshold
the threshold value for (1 - p-value) of the association between the predictor of interest and another predictor, which must be exceeded in order to include the other predictor in the conditioning scheme for the predictor of interest (only relevant if conditional = TRUE). A threshold value of zero includes all other predictors.
- nperm
the number of permutations performed.
- OOB
a logical that determines whether the importance is computed from the out-of-bag sample or the learning sample (not recommended).
- pre1.0_0
Prior to party version 1.0-0, the actual data values were permuted according to the original permutation importance suggested by Breiman (2001). Now the assignments to child nodes of splits in the variable of interest are permuted as described by Hapfelmeier et al. (2012), which allows for missing values in the predictors and is more efficient with respect to memory consumption and computing time. This method does not apply to the conditional permutation importance, nor to random forests that were not fit using the party package.
- scaled
a logical that determines whether the differences in prediction accuracy should be scaled by the total (null-model) error.
- AUC
a logical that determines whether the Area Under the Curve (AUC) instead of the accuracy is used to compute the permutation importance (cf. Janitza et al., 2013). The AUC-based permutation importance is more robust towards class imbalance, but it is only applicable to binary classification.
- asParty
a logical that determines whether or not exactly the same values as the original varimp function in party should be obtained.
- whichxnames
a character vector containing the predictor variable names for which the permutation importance should be computed. Only use when aware of the implications, see section 'Details'.
- thresholdDiagnostics
a logical that specifies whether diagnostics with respect to the threshold value should be issued as warnings.
- progressBar
a logical that determines whether a progress bar should be displayed.
- do_check
a logical that determines whether a check requiring user input should be included.
- oldSeedSelection
a logical that determines whether the selection of random numbers should be the same as in version 1.1 of the package. The default is FALSE, so that seeds are generated for each tree and the results are reproducible, also when parallel processing is used.
- cl
a cluster object created by makeCluster, or an integer to indicate the number of child processes (integer values are ignored on Windows) for parallel evaluation (see Details on parallel computing). NULL (default) refers to sequential evaluation.
- ...
additional arguments to be passed to the methods.
Details
Function permimp is highly comparable to varimp in party, but the
partial/conditional variable importance has a different, more efficient
implementation. Compared to the original varimp in party, permimp applies a
different strategy to select the predictors to condition on (see Debeer and
Strobl, 2020).
With asParty = TRUE, permimp returns exactly the same values as varimp in
party, but the computation is done more efficiently.
If conditional = TRUE, the importance of each variable is computed by
permuting within a grid defined by the predictors that are associated with the
variable of interest (with 1 - p-value greater than threshold). The threshold
can be interpreted as a parameter that moves the permutation importance across
a dimension from fully conditional (threshold = 0) to completely unconditional
(threshold = 1), see Debeer and Strobl (2020).
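The two steps described above (selecting the predictors to condition on, then permuting within the resulting grid) can be sketched in base R. This is a simplified, hypothetical illustration and not the permimp implementation: the data, the variable names, and the split points are invented, and the sketch works on a single set of split points, whereas the package applies the selection in each tree.

```r
# Simplified, hypothetical illustration; not the permimp implementation.
set.seed(2)
n <- 500
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)  # predictor of interest, strongly related to z
w <- rnorm(n)                # generated independently of x

# Categorize each predictor at assumed split points (in a forest these
# would come from the splits of one tree).
x_cat <- cut(x, breaks = c(-Inf, 0, Inf))
others <- list(
  z = cut(z, breaks = c(-Inf, -0.5, 0.5, Inf)),
  w = cut(w, breaks = c(-Inf, 0, Inf))
)

# 1 - p-value of the Pearson chi-squared test between x and each other predictor.
assoc <- sapply(others, function(f) {
  1 - chisq.test(table(x_cat, f), correct = FALSE)$p.value
})

# Keep the predictors whose association with x exceeds the threshold.
threshold <- 0.95
cond_on <- names(assoc)[assoc > threshold]

# Grid cells: all observed combinations of the selected categorized predictors.
grid <- if (length(cond_on) > 0) {
  interaction(others[cond_on], drop = TRUE)
} else {
  factor(rep("all", n))  # nothing selected: one cell, i.e. unconditional
}

# Permute x within each grid cell only.
x_perm <- ave(x, grid, FUN = function(v) v[sample.int(length(v))])

print(round(assoc, 3))
print(cond_on)
```

Raising `threshold` towards 1 leaves fewer predictors in `cond_on`, so the permutation approaches the unconditional case; lowering it towards 0 adds more predictors to the grid.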
Using the whichxnames argument, the computation of the permutation importance
can be limited to a smaller number of specified predictors. Note, however,
that when conditional = TRUE, the (other) predictors to condition on are also
limited to this selection of predictors. Only use when fully aware of the
implications.
For parallel processing, the pbapply package, a wrapper around the parallel
package, is used. Parallel processing can be enabled through the cl argument:
parLapply is called when cl is a 'cluster' object, and mclapply is called when
cl is an integer.
When doing parallel processing, other objects might need to be pushed to the workers, and random numbers must be handled with care (see the Examples of the pbapply package).
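As a rough illustration of these two points, the sketch below uses the parallel package directly to push an object to the workers and to set up reproducible random number streams. It only shows the general mechanism and is not how permimp handles seeds internally; `n_obs`, `draw_mean`, and `run_once` are names invented for this example.

```r
library(parallel)

n_obs <- 100  # an object the workers need; the name is illustrative

# Worker task: uses n_obs, which must exist on each worker.
draw_mean <- function(i) mean(rnorm(n_obs))

run_once <- function(seed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # Push the required object to every worker.
  clusterExport(cl, "n_obs", envir = environment())
  # Give each worker its own reproducible random number stream.
  clusterSetRNGStream(cl, iseed = seed)
  unlist(parLapply(cl, 1:4, draw_mean))
}

res1 <- run_once(290875)
res2 <- run_once(290875)
print(identical(res1, res2))
```

Calling `set.seed()` in the main process alone does not control the random numbers drawn on the workers, which is why the stream is set on the cluster itself in this sketch.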
When using parallel processing, showing the progress bar increases the
communication overhead between the main process and nodes / child processes
compared to the parallel equivalents of the functions without the progress bar.
The functions fall back to their original equivalents when progressBar = FALSE.
This is the default when interactive() is FALSE (i.e. when called from a
command-line R script).
For further details, please refer to the documentation of varimp.
Value
An object of class varimp, with the mean decrease in accuracy as its $values.
References
Leo Breiman (2001). Random Forests. Machine Learning, 45(1), 5–32.
Alexander Hapfelmeier, Torsten Hothorn, Kurt Ulm, and Carolin Strobl (2012). A New Variable Importance Measure for Random Forests with Missing Data. Statistics and Computing, https://link.springer.com/article/10.1007/s11222-012-9349-1
Torsten Hothorn, Kurt Hornik, and Achim Zeileis (2006). Unbiased Recursive Partitioning: A Conditional Inference Framework. Journal of Computational and Graphical Statistics, 15(3), 651–674. Preprint available from https://www.zeileis.org/papers/Hothorn+Hornik+Zeileis-2006.pdf
Silke Janitza, Carolin Strobl, and Anne-Laure Boulesteix (2013). An AUC-based Permutation Variable Importance Measure for Random Forests. BMC Bioinformatics, 14, 119. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-119
Carolin Strobl, Anne-Laure Boulesteix, Thomas Kneib, Thomas Augustin, and Achim Zeileis (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307
Dries Debeer and Carolin Strobl (2020). Conditional Permutation Importance Revisited. BMC Bioinformatics, 21, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03622-2
Examples
### for RandomForest-objects, by party::cforest()
set.seed(290875)
readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills,
control = party::cforest_unbiased(mtry = 2, ntree = 25))
### conditional importance, may take a while...
# party implementation:
set.seed(290875)
party::varimp(readingSkills.cf, conditional = TRUE)
#> nativeSpeaker age shoeSize
#> 11.060828 47.984469 1.457061
# faster implementation but same results
set.seed(290875)
permimp(readingSkills.cf, conditional = TRUE, asParty = TRUE)
#> nativeSpeaker age shoeSize
#> 11.060828 47.984469 1.457061
# different implementation with similar results
set.seed(290875)
permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE)
#> nativeSpeaker age shoeSize
#> 12.640102 50.529059 1.532613
### standard (unconditional) importance is unchanged
set.seed(290875)
party::varimp(readingSkills.cf)
#> nativeSpeaker age shoeSize
#> 12.40733 73.35346 20.63118
set.seed(290875)
permimp(readingSkills.cf, oldSeedSelection = TRUE)
#> nativeSpeaker age shoeSize
#> 12.40733 73.35346 20.63118
### for randomForest-objects, by randomForest::randomForest()
set.seed(290875)
readingSkills.rf <- randomForest::randomForest(score ~ ., data = party::readingSkills,
mtry = 2, ntree = 25, importance = TRUE,
keep.forest = TRUE, keep.inbag = TRUE)
### (unconditional) Permutation Importance
set.seed(290875)
permimp(readingSkills.rf, do_check = FALSE)
#> nativeSpeaker age shoeSize
#> 17.75154 85.57181 12.78995
# very close to
readingSkills.rf$importance[,1]
#> nativeSpeaker age shoeSize
#> 17.65385 82.11491 13.86879
### Conditional Permutation Importance
set.seed(290875)
permimp(readingSkills.rf, conditional = TRUE, threshold = .8, do_check = FALSE)
#> nativeSpeaker age shoeSize
#> 14.19811819 15.16896923 -0.01609468
if (FALSE) { # \dontrun{
### Parallel processing - Windows
# Only worthwhile for large forests; for small ones there may be no
# 'speed up' at all, or even a 'slow down'
# Make a larger forest
set.seed(290875)
readingSkills.cf <- party::cforest(score ~ ., data = party::readingSkills,
control = party::cforest_unbiased(mtry = 2,
ntree = 200))
# sequential processing
set.seed(290875)
system.time(print(permimp(readingSkills.cf, conditional = TRUE, asParty = FALSE)))
# parallel processing
# note that the results are reproducible despite using multiple cores
cluster <- parallel::makeCluster(2)
set.seed(290875)
system.time(print(permimp(readingSkills.cf, conditional = TRUE,
asParty = FALSE, cl = cluster, progressBar = FALSE)))
parallel::stopCluster(cluster)
} # }