Advancing Model Understanding: Human-Readable Pipeline Explanations
From exploring the impact features have on the predictive power of a trained pipeline to digging into how a model made an individual prediction, EvalML offers a wide variety of ways to understand the machine learning models that you train using our software. However, these methods produce values that can be hard to interpret. Why are some of the permutation importance values negative? What does it mean if a feature has an “importance” of 0.0231, anyway?
To help answer these questions, we introduce a new tool for understanding how features impact a model as a whole: human-readable pipeline explanations. This tool takes the features used to train a pipeline or model and distills them into a sentence or two summarizing the most important takeaways. It can highlight which features have the heaviest impact on a model's performance, and call out any that actively detract from it.
The readable explanations are generated by calculating the importance values for each feature and splitting them into a few categories based on the percentage of the total importance they represent. With default arguments, any feature that contributes to more than 20% of the total importance is considered “heavily influential,” and anything that contributes between 5% and 20% is considered “somewhat influential.” These selected features are then added to a sentence template, listing out the top n most important features. The value of n is controlled by the user but defaults to 5.
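That bucketing logic can be sketched in plain Python. This is a simplified illustration with made-up importance values, not EvalML's actual implementation, and the function and threshold names here are our own:

```python
def bucket_importances(importances, min_threshold=0.05, heavy_threshold=0.20, top_n=5):
    """Bucket features by their share of total importance.

    Thresholds mirror the defaults described above: more than 20% of total
    importance is "heavily influential", 5% to 20% is "somewhat influential".
    """
    # Features with negative importance hurt performance and are flagged for removal.
    detrimental = sorted(f for f, v in importances.items() if v < 0)
    total = sum(v for v in importances.values() if v > 0)
    shares = {f: v / total for f, v in importances.items() if v > 0}
    # Only the top_n features are reported, most important first.
    ranked = sorted(shares, key=shares.get, reverse=True)[:top_n]
    heavy = [f for f in ranked if shares[f] > heavy_threshold]
    somewhat = [f for f in ranked if min_threshold <= shares[f] <= heavy_threshold]
    return heavy, somewhat, detrimental

heavy, somewhat, detrimental = bucket_importances({
    "Contract": 0.40, "tenure": 0.30, "MonthlyCharges": 0.15,
    "PaymentMethod": 0.10, "gender": 0.06, "TotalCharges": -0.02,
})
print(heavy)        # ['Contract', 'tenure']
print(somewhat)     # ['MonthlyCharges', 'PaymentMethod', 'gender']
print(detrimental)  # ['TotalCharges']
```

The same thresholds apply whichever importance method is used; only the importance values fed in differ.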
Let us look at how you can use these readable explanations with one of EvalML’s demo datasets. In this case, we explain predictions about customer churn for a company. This dataset includes useful features such as user demographics, average usage and usage types, and what sort of contract the user has.
Here, there are two calls to readable_explanation, highlighting the importance_method argument, which determines how the impact of each feature is calculated. This can be either the feature importance calculated directly by a trained model or the permutation importance computed on a pipeline with respect to a given objective. These two methods give you different, more in-depth breakdowns of which features have an impact in which situations, as we detail below.
Feature importance is directly computed by the final estimator in your pipeline. This is the model itself highlighting which features it used most heavily when making predictions. Because of this, feature importance is computed based on the features that make it through the earlier parts of the pipeline to the final estimator, i.e., the transformed features. By the time the final estimator predicts on your data, categorical features are already one-hot encoded, so you can see if there is a specific category that affected the model predictions more than any other. Text features are already broken down into their numeric descriptors, so you can check if the number of characters per word had more of an impact than the diversity of those words.
Examining the output of the feature importance explanation, we see that the relevant features are indeed the transformed features, rather than the original input. The features “Contract_Month-to-month,” “PaymentMethod_Electronic,” and “OnlineSecurity_No” are all products of the One Hot Encoder component, having been separated out from their original “Contract,” “PaymentMethod,” and “OnlineSecurity” columns. This tells us that within the “Contract” feature, whether a customer is on a month-to-month contract has more impact on whether they will churn than whether they are on a one-year contract.
Permutation importance, on the other hand, is computed with respect to the entire pipeline, so it reports on your original input features. It is also calculated against specific holdout data and a desired objective, so you can get a better idea of how certain features affect performance on specific data, as measured by whatever performance metric you would like. Permutation importance is calculated by shuffling the values of each input feature one at a time, which breaks that feature's relationship with the target, and then measuring how much model performance drops with that feature scrambled. In some cases, scrambling a feature actually improves prediction performance as measured by the given metric. In those cases, the human-readable explanations will recommend removing the feature, if you would like your model to perform better on that objective.
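Permutation importance can be sketched with scikit-learn's permutation_importance. EvalML computes it against a whole pipeline and objective; here we use a bare estimator, synthetic data, and accuracy for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# The target depends only on column 0; columns 1 and 2 are pure noise.
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn on holdout data and measure the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

# The informative column shows a large accuracy drop when shuffled. The noise
# columns score near zero, and occasionally slightly negative: a negative value
# means performance improved with the feature scrambled, flagging it for removal.
print(result.importances_mean)
```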
The readable explanation from the permutation importance example therefore looks a little different from the feature importance one:
Notice here that the listed features are tied to a specified objective. In this case, the objective is the default for binary classification problems, binary log loss. To change this result, pass in a different objective, as below (note that the importance method is omitted because permutation is the default value):
The features in each category are slightly different when evaluated with gini instead of binary log loss, but the general trend of which features play the largest role in prediction still shines through.
Comparing the two
In these permutation importance examples, as in the earlier feature importance example, “Contract” is listed as an important feature. However, with permutation importance it appears as the original input feature rather than the one-hot encoded month-to-month contract. The features you see in the permutation importance explanations are exactly those that were passed into AutoMLSearch, while the features in the feature importance explanations are the transformed versions.
The explanations generated using permutation importance list several features that could be removed to improve performance. In one of our examples, the feature "MonthlyCharges" was one of these detrimental features. This may be confusing, because in our example explanation generated using feature importance, "MonthlyCharges" was listed as "somewhat important." What gives? While permutation importance measures features by how they affect performance on a given objective, feature importance simply reports how much weight the model gives a feature when making predictions, ignoring its impact on measured performance entirely. Feature importance is thus a relative measure of prediction impact, rather than an absolute measure of performance.
By leveraging readable_explanation and the different approaches offered by feature importance and permutation importance, you can gain important insights into your data and the kinds of predictions you can make with EvalML.