MDI+: A Flexible Feature Importance Framework for Random Forests

MDI+ is a novel feature importance framework that generalizes the popular mean decrease in impurity (MDI) importance score for random forests. At its core, MDI+ expands upon a recently discovered connection between linear regression and decision trees. In doing so, MDI+ enables practitioners to (1) tailor the feature importance computation to the data/problem structure and (2) incorporate additional features or knowledge to mitigate known biases of decision trees. In both real data case studies and extensive real-data-inspired simulations, MDI+ outperforms commonly used feature importance measures (e.g., MDI, permutation-based scores, and TreeSHAP) by substantial margins.

For further details, we refer to Agarwal et al. (2023).

Regression Example Usage:

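from sklearn.datasets import make_regression
from imodels.importance import RandomForestPlusRegressor

# Synthetic data for illustration only; substitute your own X (features) and y (response).
X, y = make_regression(n_samples=200, n_features=10, random_state=0)

rf_plus_model = RandomForestPlusRegressor()
rf_plus_model.fit(X, y)
mdi_plus_scores = rf_plus_model.get_mdi_plus_scores(X, y)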

Classification Example Usage:

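from sklearn.datasets import make_classification
from imodels.importance import RandomForestPlusClassifier

# Synthetic data for illustration only; substitute your own X (features) and y (class labels).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rf_plus_model = RandomForestPlusClassifier()
rf_plus_model.fit(X, y)
mdi_plus_scores = rf_plus_model.get_mdi_plus_scores(X, y)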

Demo notebooks

MDI+ demo

- Shows how to compute MDI+ importance scores for different tasks (regression and classification) and configurations (with flexible GLMs, scoring metrics, and custom transformations).
- Provides starter code on how to choose the GLM and scoring metric within MDI+ via a stability metric and/or combine these fits in an ensemble.

Overview of MDI+

Input: a collection of fitted trees (e.g., a random forest), data X and y

For each fitted tree in the forest:

1. Transform X using the learnt "stump" features from the fitted tree, and append any additional engineered features (e.g., the raw X features) to this transformed dataset.
2. Fit a prediction model on this augmented transformed dataset to predict y. Here, we recommend fitting a generalized linear model (GLM) to leverage computational speed-ups.
3. Use this fitted prediction model to make partial model predictions for each feature k. That is, for each feature k, get the model's predictions if the contribution of all other features (except feature k) were zeroed out. Put differently, the kth partial model predictions are the predictions we get when using only the engineered features that are related to feature k.
4. For each feature k, evaluate the similarity between the observed y and the kth partial model predictions using any user-defined similarity metric (i.e., a larger value should indicate greater feature importance).

This gives the MDI+ scores for a single tree. To get the MDI+ scores for the forest, these scores are averaged across all trees in the forest.
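
The four steps above can be sketched with scikit-learn alone. This is a rough illustration under stated assumptions, not the imodels implementation: the helper names `stump_features` and `mdi_plus_sketch` are hypothetical, the stump encoding is a simplified unnormalized signed indicator (+1 in the left child, -1 in the right child, 0 outside the node), the GLM is a plain `Ridge`, and the partial predictions are evaluated in-sample rather than with leave-one-out or out-of-bag splitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score


def stump_features(tree, X):
    """Simplified stump columns (assumption: unnormalized signed indicators).

    Returns a list of columns, one per internal node, and the index of the
    original feature that each node splits on.
    """
    t = tree.tree_
    in_node = tree.decision_path(X).toarray()  # (n_samples, n_nodes) 0/1 matrix
    cols, owner = [], []
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, so no stump feature
            continue
        # +1 in the left child, -1 in the right child, 0 outside the node.
        cols.append(in_node[:, left].astype(float) - in_node[:, right])
        owner.append(int(t.feature[node]))
    return cols, owner


def mdi_plus_sketch(forest, X, y, alpha=1.0):
    """In-sample sketch of the four MDI+ steps, averaged over the forest."""
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    for tree in forest.estimators_:
        # Step 1: stump features plus the raw features, each tagged with
        # the original feature it is derived from.
        cols, owner = stump_features(tree, X)
        Z = np.column_stack(cols + [X])
        owner_all = np.array(owner + list(range(n_features)))
        # Center the columns so that "zeroing out" a block means setting it
        # to its mean contribution.
        Z = Z - Z.mean(axis=0)
        # Step 2: fit a GLM (here ridge regression) on the augmented data.
        glm = Ridge(alpha=alpha).fit(Z, y)
        for k in range(n_features):
            # Step 3: partial prediction using only the columns tied to feature k.
            mask = owner_all == k
            partial = Z[:, mask] @ glm.coef_[mask] + glm.intercept_
            # Step 4: similarity between y and the partial prediction.
            scores[k] += r2_score(y, partial)
    return scores / len(forest.estimators_)


# Toy data: only feature 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

forest = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0).fit(X, y)
scores = mdi_plus_sketch(forest, X, y)
print(np.round(scores, 3))
```

On this toy data, feature 0 should receive the largest score. Because everything is evaluated in-sample, the scores of pure-noise features are not guaranteed to be exactly zero.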

Practical Considerations

We show in Agarwal et al. (2023) that this framework is indeed a proper generalization of the popular MDI feature importance score. However, as a result of the increased flexibility provided by MDI+, there are several choices that must be made by the analyst to run MDI+ in practice. In particular,

In Step 1: What feature engineering/transformations to include?
- We recommend including the raw features (i.e., X) in this transformed dataset. This is done by default via RandomForestPlus*(include_raw=True). To include additional transformations, create custom BlockTransformerBase object(s) and use the add_transformers argument in RandomForestPlus*().
In Step 2: Which GLM?
- We recommend using RidgeRegressorPPM() for regression tasks and LogisticClassifierPPM() for classification tasks and thus set these to be the defaults. To use a custom prediction model, use the prediction_model argument in RandomForestPlus*().
In Step 3: Which sample splitting strategy (if any) to use when making the partial model predictions?
- We recommend using a leave-one-out ("loo") sample splitting strategy as it overcomes the known correlation and entropy biases suffered by MDI. Out-of-bag ("oob") can also be used to overcome these biases but tends to be more unstable than leave-one-out across different random forest fits. MDI uses an in-bag sample splitting scheme and is not recommended. The sample splitting strategy is set to "loo" by default but can be changed via the sample_split argument in RandomForestPlus*().
In Step 4: Which similarity metric to use?
- We recommend using r-squared for regression tasks and log-loss for classification tasks, which are the defaults. To use a custom metric, use the scoring_fns argument in the get_mdi_plus_scores() method.
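
To illustrate why pairing a GLM (Step 2) with leave-one-out splitting (Step 3) can be computationally cheap, here is a minimal NumPy sketch for ridge regression. It relies on the standard hat-matrix identity for ridge with a fixed penalty and no intercept, and checks the result against brute-force refitting. The function names are hypothetical, and this readme does not describe how imodels computes its leave-one-out predictions internally, so treat this as background rather than as the library's method.

```python
import numpy as np


def ridge_loo_predictions(Z, y, alpha=1.0):
    """Leave-one-out predictions for ridge regression without refitting.

    Assumes no intercept (center Z and y beforehand if one is needed).
    """
    n_cols = Z.shape[1]
    # Hat matrix H = Z (Z^T Z + alpha I)^{-1} Z^T, so in-sample fits are H @ y.
    gram_inv = np.linalg.inv(Z.T @ Z + alpha * np.eye(n_cols))
    H = Z @ gram_inv @ Z.T
    fitted = H @ y
    leverage = np.diag(H)
    # Identity: y_i - loo_i = (y_i - fitted_i) / (1 - h_ii), rearranged for loo_i.
    return (fitted - leverage * y) / (1.0 - leverage)


def ridge_loo_brute_force(Z, y, alpha=1.0):
    """Reference implementation: refit ridge n times, once per held-out row."""
    n, p = Z.shape
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Zi, yi = Z[keep], y[keep]
        coef = np.linalg.solve(Zi.T @ Zi + alpha * np.eye(p), Zi.T @ yi)
        preds[i] = Z[i] @ coef
    return preds


rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))
y = Z @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

fast = ridge_loo_predictions(Z, y)
slow = ridge_loo_brute_force(Z, y)
print(np.allclose(fast, slow))
```

The shortcut needs a single fit instead of one fit per sample, which is the kind of saving that makes leave-one-out practical when it has to be repeated for every tree in a forest.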

These recommendations are based on extensive simulations across a wide variety of data-generating processes, data sets, noise levels, and misspecifications.

Nevertheless, different choices may be better for different problems. For examples of how to implement some of these custom options, see the MDI+ Demo notebook. This demo also includes examples of how to aggregate feature importances from multiple MDI+ configurations in an ensemble, as well as how to choose the "best" GLM and metric in a data-driven manner based on a stability score.

csinva / imodels mirror of https://github.com/csinva/imodels
