Addressing Fairness in Prediction Models by Improving Subpopulation Calibration

Dec 14, 2019

About

Background: The use of prediction models in medicine is becoming increasingly common, and there is an essential need to ensure that these models produce predictions that are fair to minorities. Of the many performance measures for risk prediction models, calibration (the agreement between predicted and observed risks) is of specific importance, as therapeutic decisions are often made based on absolute risk thresholds. Calibration tends to be poor for subpopulations that were under-represented in a model's development set, resulting in reduced performance for these subpopulations. In this work we empirically evaluated an adapted version of the fairness algorithm designed by Hébert-Johnson et al. (2017) to improve model calibration in subpopulations, which should lead to greater accuracy in medical decision-making and improved fairness for minority groups.

Methods: This is a retrospective cohort study using the electronic health records of a large sick fund. Predictions of cardiovascular risk based on the Pooled Cohort Equations (PCE) and predictions of osteoporotic fracture risk based on the FRAX model were calculated as of a retrospective index date. We then evaluated the calibration of these models by comparing the predictions to events documented during a follow-up period, both in the overall population and in subpopulations. The subpopulations were defined by the intersection of five protected variables: age, sex, ethnicity, socioeconomic status and immigration history, resulting in hundreds of combinations. We next applied the fairness algorithm as a post-processing step to the PCE and FRAX predictions and evaluated whether calibration in the subpopulations improved, using the metrics of calibration-in-the-large (CITL) and calibration slope. To evaluate whether the process had a negative effect on overall discrimination, we measured the area under the Receiver Operating Characteristic curve (AUROC).
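The two calibration metrics can be sketched in a few lines of Python. This is an illustrative sketch, not the study's code; in particular, it assumes the abstract's CITL, which is centered at 1, is the observed-to-expected event ratio (CITL is also often reported as a logistic intercept centered at 0):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def citl_oe_ratio(y, p):
    """Observed/expected event ratio; 1.0 indicates perfect
    calibration-in-the-large. (Assumption: the abstract's CITL,
    centered at 1, is this ratio rather than a logistic intercept.)"""
    return y.mean() / p.mean()

def calibration_slope(y, p, eps=1e-12):
    """Slope of a logistic regression of outcomes on the logit of the
    predicted risks; 1.0 indicates the risk estimates are neither
    over- nor under-dispersed."""
    p = np.clip(p, eps, 1 - eps)
    logit = np.log(p / (1 - p))
    # Very large C effectively disables regularization, so the fitted
    # coefficient is the plain maximum-likelihood calibration slope.
    lr = LogisticRegression(C=1e9).fit(logit.reshape(-1, 1), y)
    return lr.coef_[0, 0]
```

Applied per subpopulation, these two functions reproduce the kind of per-group calibration audit described above: a group with an O/E ratio far from 1 is systematically over- or under-predicted.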
Results: 1,021,041 patients aged 40-79 were included in the PCE population and 1,116,324 patients aged 50-90 were included in the FRAX population. After local adjustment, baseline overall calibration of the two tested models was good (CITL was 1.01 and 0.99 for PCE and FRAX, respectively). However, calibration in a substantial portion of the subpopulations was poor: 20% had CITL values greater than 1.49 and 1.25 for PCE and FRAX, respectively, and 20% had CITL values less than 0.81 and 0.87 for PCE and FRAX, respectively. After applying the fairness algorithm, subpopulation calibration statistics were greatly improved, with the 20th and 80th percentiles moving to 0.97 and 1.07 in the PCE model and to 0.95 and 1.03 in the FRAX model. In addition, the variance of the CITL values across all subpopulations was reduced by 98.8% and 95.7% in the PCE and FRAX models, respectively. The AUROC was essentially unchanged (+0.12% and +0.31% in PCE and FRAX, respectively).

Conclusions: A post-processing, model-independent fairness algorithm for recalibration of predictive models greatly improved subpopulation calibration, and thus fairness and equality, without harming overall model discrimination.
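Because the recalibration is a post-processing step, its core loop can be sketched independently of the underlying model. The sketch below is a simplified multicalibration-style procedure in the spirit of Hébert-Johnson et al., not the authors' exact implementation: it repeatedly finds the subpopulation with the largest gap between observed event rate and mean predicted risk and shifts that group's predictions to close the gap, until every group is calibrated-in-the-large within a tolerance `alpha` (a hypothetical parameter name):

```python
import numpy as np

def multicalibrate(p, groups, y, alpha=0.01, max_iter=100):
    """Simplified multicalibration-style post-processing (illustrative
    sketch). `p` holds predicted risks, `y` binary outcomes, and
    `groups` is a list of boolean masks, one per (possibly
    overlapping) subpopulation. Overlapping groups are handled by
    iterating: fixing one group may perturb another, so we keep
    correcting the worst violation until all gaps are within alpha."""
    p = p.copy()
    for _ in range(max_iter):
        worst_gap, worst = 0.0, None
        for g in groups:
            gap = y[g].mean() - p[g].mean()
            if abs(gap) > worst_gap:
                worst_gap, worst = abs(gap), (g, gap)
        if worst is None or worst_gap <= alpha:
            break  # every subpopulation is calibrated within tolerance
        g, gap = worst
        # Additively shift the group's predictions so its mean matches
        # its observed event rate, keeping risks in [0, 1].
        p[g] = np.clip(p[g] + gap, 0.0, 1.0)
    return p
```

Because each correction only shifts predictions within a group, the overall ranking of patients is largely preserved, which is consistent with the abstract's finding that discrimination (AUROC) was essentially unchanged.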

About NIPS 2019

Neural Information Processing Systems (NeurIPS) is a multi-track machine learning and computational neuroscience conference that includes invited talks, demonstrations, symposia, and oral and poster presentations of refereed papers. The conference is followed by workshops that provide a less formal setting.

