Dec 2, 2022
Averaging the predictions of a deep ensemble of networks is a popular and effective method to improve predictive performance and calibration in various benchmarks and Kaggle competitions. However, the runtime and training cost of deep ensembles grow linearly with the size of the ensemble, making them unsuitable for many applications. Averaging ensemble weights instead of predictions circumvents this disadvantage during inference and is typically applied to intermediate checkpoints of a model to reduce training cost. Albeit effective, only a few works have improved the understanding and the performance of weight averaging. Here, we revisit this approach and show that a simple weight fusion (WF) strategy can lead to significantly improved predictive performance and calibration. We describe the prerequisites the weights must meet in terms of weight space, functional space and loss. Furthermore, we present a new test method (called oracle test) to measure the distance between weights in functional space. We demonstrate the versatility of our WF strategy across state-of-the-art segmentation CNNs and Transformers as well as real-world datasets such as BDD100K and Cityscapes. We compare WF with similar approaches and show its superiority for in- and out-of-distribution data in terms of predictive performance and calibration.
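As a rough illustration of the underlying idea, the sketch below uniformly averages the parameters of several checkpoints into a single model, so inference requires only one forward pass instead of one per ensemble member. This is a minimal, generic weight-averaging sketch, not necessarily the WF strategy presented in the talk; the checkpoint file names and the `average_weights` helper are hypothetical, and it assumes all checkpoints share the same architecture.

```python
# Minimal sketch of uniform weight averaging across checkpoints (illustrative only).
# Assumes each file stores a plain state_dict with identical keys and shapes.
import torch

def average_weights(checkpoint_paths):
    """Average the parameters of several checkpoints into one state_dict."""
    avg_state = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            # Start from the first checkpoint; cast to float for safe accumulation.
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg_state:
                avg_state[k] += state[k].float()
    n = len(checkpoint_paths)
    return {k: v / n for k, v in avg_state.items()}

# Usage (hypothetical file names): load the fused weights into one model and
# run a single forward pass, instead of averaging n separate predictions.
# model.load_state_dict(average_weights(["ckpt_epoch_10.pt", "ckpt_epoch_20.pt"]))
```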