Efficient Statistical Inference for Population Variable Importance Using Shapley Values

Jul 12, 2020



The true importance of a variable in a prediction task provides useful knowledge about the underlying data-generating mechanism and can help in deciding which measurements to collect in subsequent experiments. Existing approaches often define population variable importance as the difference between the oracle prediction risk with and without the feature. However, these measures are difficult to interpret for correlated features, which can be assigned low importance even if they are highly predictive. To address this, we propose defining population variable importance using the Shapley value instead, which averages the predictive value of a feature relative to all possible feature subsets. Given n training observations, we present a computationally tractable statistical inference procedure that estimates the Shapley Population Variable Importance Measure (SPVIM) at an asymptotically optimal rate using only m = Θ(n) randomly sampled feature subsets. We derive the asymptotic distribution of this estimator to construct valid confidence intervals and hypothesis tests. Finally, we analyze the importance of lab measurements for predicting in-hospital mortality and find that our procedure i) is significantly faster than existing sampling-based approaches and ii) gives more consistent estimates across different modeling procedures.
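
To illustrate why a Shapley-based definition can behave differently from a drop-one-feature definition, the following minimal sketch computes exact Shapley values for a small set function. It is not the paper's estimator: it enumerates all feature subsets (feasible only for a handful of features) rather than sampling m = Θ(n) of them, and the names `shapley_values` and `toy_value`, as well as the numbers 0.6 and 0.3, are hypothetical choices made only for this example. The toy predictiveness function treats features 0 and 1 as perfectly redundant, standing in for highly correlated features.

```python
from itertools import combinations
from math import comb


def shapley_values(value, p):
    """Exact Shapley values for features 0..p-1 of a set function `value`.

    `value` maps a frozenset of feature indices to a number (here, a
    stand-in for the oracle predictiveness of that feature subset).
    """
    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        total = 0.0
        for size in range(p):
            # Standard Shapley weight for subsets of this size.
            weight = 1.0 / (p * comb(p - 1, size))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal gain from adding feature j to subset s.
                total += weight * (value(s | {j}) - value(s))
        phi.append(total)
    return phi


def toy_value(s):
    """Hypothetical predictiveness: features 0 and 1 carry the same signal."""
    v = 0.0
    if 0 in s or 1 in s:
        v += 0.6  # assumed contribution of the shared signal
    if 2 in s:
        v += 0.3  # assumed contribution of the independent feature
    return v


p = 3
full = frozenset(range(p))

# Drop-one importance: predictiveness lost when a single feature is removed.
drop_one = [toy_value(full) - toy_value(full - {j}) for j in range(p)]

# Shapley importance: marginal gain averaged over all subsets.
phi = shapley_values(toy_value, p)

print("drop-one:", drop_one)
print("shapley: ", phi)
```

In this toy setting the drop-one measure assigns zero importance to each of the two redundant features, because removing either one leaves the other to supply the same signal, whereas the Shapley values split the shared signal between them and sum to the predictiveness of the full feature set.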



