Detecting test-time distribution shift has emerged as a key capability for safely deployed machine learning models, with the question being tackled under various guises in recent years. In this paper, we aim to provide a consolidated view of the two largest sub-fields within the community: out-of-distribution (OOD) detection and open-set recognition (OSR). In particular, we aim to provide rigorous empirical analysis of different methods across settings and provide actionable takeaways for practitioners and researchers. Concretely, we make the following contributions: (i) We perform rigorous cross-evaluation between state-of-the-art methods in the OOD detection and OSR settings and identify a strong correlation between the performances of methods for them; (ii) We propose a new, large-scale benchmark setting which we suggest better disentangles the problem tackled by OOD detection and OSR, re-evaluating state-of-the-art OOD detection and OSR methods in this setting; (iii) We surprisingly find that the best performing method on standard benchmarks (Outlier Exposure) struggles when tested at scale, while scoring rules which are sensitive to the deep feature magnitude consistently show promise; and (iv) We conduct empirical analysis to explain these phenomena and highlight directions for future research.
We benchmark the OOD detection and OSR tasks across nine common datasets, with different training strategies and scoring rules. The results are averaged over five independent runs. Although there is not always a single clear winner in terms of methodology, we make three main observations.
Firstly, MLS (maximum logit score) and Energy tend to perform best across both the OOD detection and OSR benchmarks. This is because both scoring rules are sensitive to the magnitude of the feature vector before the network's classification layer.
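Both scores can be computed directly from the classifier's logits. Below is a minimal sketch (tensor shapes and the thresholding step are illustrative assumptions); note that, unlike the maximum softmax probability, both quantities grow with the magnitude of the pre-softmax activations.

import torch

def mls_score(logits: torch.Tensor) -> torch.Tensor:
    # Maximum Logit Score (MLS): the largest raw (pre-softmax) logit per sample.
    # Unlike softmax probabilities, it is not normalized away, so it retains
    # information about the magnitude of the final-layer activations.
    return logits.max(dim=-1).values

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Energy-based score: T * logsumexp(logits / T). Larger values indicate
    # samples that look more in-distribution.
    return temperature * torch.logsumexp(logits / temperature, dim=-1)

# Usage: larger score => treated as ID, smaller score => flagged as OOD.
logits = torch.randn(8, 1000)       # e.g. a batch of ImageNet logits
scores = energy_score(logits)
is_ood = scores < scores.median()   # in practice the threshold is tuned on held-out data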
Secondly, Outlier Exposure (OE) provides excellent performance on the OOD detection benchmarks, often nearly saturating them.
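For context, Outlier Exposure fine-tunes the classifier with an auxiliary outlier set, adding a term that pushes predictions on auxiliary samples towards the uniform distribution. The following is a minimal sketch of one training step under the standard OE objective; the model, the batches, and the weight lambda_oe are placeholders.

import torch
import torch.nn.functional as F

def outlier_exposure_step(model, x_id, y_id, x_aux, lambda_oe=0.5):
    # Standard cross-entropy on labeled in-distribution data.
    logits_id = model(x_id)
    loss_id = F.cross_entropy(logits_id, y_id)

    # Outlier Exposure term: cross-entropy between the predictive distribution
    # on auxiliary outliers and the uniform distribution, encouraging
    # low-confidence predictions on anything resembling the auxiliary set.
    log_probs_aux = F.log_softmax(model(x_aux), dim=-1)
    loss_oe = -log_probs_aux.mean()  # CE to the uniform distribution over classes

    return loss_id + lambda_oe * loss_oe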
Thirdly, on small-scale datasets, OOD detection performance is positively correlated with ID accuracy, whereas an inverse correlation is observed on large-scale datasets.
We also evaluate a selection of the previously discussed methods on our large-scale benchmark for both OOD detection and OSR. In this large-scale evaluation, we find no clear winner among the training methods (CE, ARPL (+CS), and OE) across the board. Surprisingly, OE, the best performer on the earlier small-scale benchmarks, struggles when scaled up.
To analyze this further, we plot the distribution of the maximum activation of the output features (from the last layer) for samples from different data sources: ID data, OOD data, and auxiliary data. The results are shown below. Notably, OOD detection performance is strongly (negatively) correlated with the size of the overlapping region between the ID and OOD curves. Additionally, when the 300K random images are used as auxiliary OOD data (as shown in the second row), their distribution closely tracks that of the actual OOD data, resulting in excellent performance.
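A minimal sketch of how such a plot can be produced is shown below; the feature arrays for the three sources are assumed to be precomputed, and the binning and styling are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

def plot_max_activation(feats_id, feats_ood, feats_aux):
    # feats_*: (N, D) arrays of last-layer output features for each data source.
    # The per-sample maximum activation is histogrammed; the overlap between the
    # ID and OOD histograms tracks (inversely) the OOD detection performance.
    for feats, name in [(feats_id, "ID"), (feats_ood, "OOD"), (feats_aux, "Auxiliary")]:
        plt.hist(np.asarray(feats).max(axis=1), bins=50, density=True, alpha=0.5, label=name)
    plt.xlabel("Maximum activation")
    plt.ylabel("Density")
    plt.legend()
    plt.show()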
We also provide results on a large-scale dataset. In this plot, the large overlap between the ID and OOD curves is consistent with the poor performance of OE on the large-scale benchmark.
We further retrieve nearest neighbors for given samples on both the small-scale (e.g., Textures and Places365) and large-scale (e.g., ImageNet-C and ImageNet-R) benchmarks, using models trained with OE. Retrieving nearest neighbors from the union of the ID data and the auxiliary data, we observe that in the small-scale setting the retrieved neighbors come from the auxiliary data. However, this does not occur consistently on the large-scale benchmarks. This observation aligns with the correlation between the distance from the OOD data to the auxiliary data and the OOD detection performance of models trained with OE.
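A sketch of this retrieval step is given below; the feature arrays are assumed to be extracted beforehand with the OE-trained model, and the choice of k and the Euclidean metric are illustrative assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def retrieve_neighbors(query_feats, id_feats, aux_feats, k=5):
    # Gallery = union of ID and auxiliary features; keep a flag recording which
    # source each gallery item came from.
    gallery = np.concatenate([id_feats, aux_feats], axis=0)
    source = np.array(["ID"] * len(id_feats) + ["AUX"] * len(aux_feats))

    nn = NearestNeighbors(n_neighbors=k).fit(gallery)
    dists, idx = nn.kneighbors(query_feats)   # (N, k) distances and gallery indices
    return dists, idx, source[idx]            # source[idx]: which pool each neighbor lives in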
We compute the distance between the OOD data and the auxiliary data (denoted the OOD-AUX data distance) for both the small-scale and large-scale OOD sets, which further validates that closer proximity between the auxiliary and OOD data leads to better performance from OE-trained models.
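One plausible instantiation of this distance, sketched below, averages the distance from each OOD feature to its nearest auxiliary feature; the exact definition used for the reported numbers may differ.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def ood_aux_distance(ood_feats, aux_feats):
    # Average distance from each OOD feature to its closest auxiliary feature.
    # Smaller values mean the auxiliary data lies close to the OOD data, which
    # is the regime in which OE-trained models detect OOD samples well.
    nn = NearestNeighbors(n_neighbors=1).fit(aux_feats)
    dists, _ = nn.kneighbors(ood_feats)
    return float(dists.mean())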
Finally, we introduce a new metric to reconcile the problems of detecting covariate shift and being robust to it. AUROC does not capture a model's ability to reliably classify test samples in the presence of distribution shift. To analyze the relationship between covariate-shift detection and robustness, we therefore propose Outlier-Aware Accuracy (OAA). OAA measures the frequency of 'correct' predictions made by the model at a given threshold, where a prediction is counted as correct if: (1) the sample is predicted as ID and its semantic class label is correctly predicted; or (2) the sample is truly OOD and its semantic class prediction is incorrect.
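A minimal sketch of OAA, implementing the definition above literally, is given below; the score convention (higher score means more ID-like) and the array layout are assumptions.

import numpy as np

def outlier_aware_accuracy(scores, preds, labels, is_true_ood, threshold):
    # scores:      (N,) OOD scores; score >= threshold => sample predicted as ID.
    # preds:       (N,) predicted semantic class labels.
    # labels:      (N,) ground-truth semantic class labels (covariate-shifted
    #              samples keep their ID class labels).
    # is_true_ood: (N,) boolean mask, True for genuinely shifted (OOD) samples.
    pred_id = scores >= threshold

    # (1) Samples predicted as ID whose semantic class is correctly predicted.
    correct_id = pred_id & (preds == labels)
    # (2) True OOD samples whose semantic class prediction is incorrect.
    correct_ood = is_true_ood & (preds != labels)

    return float((correct_id | correct_ood).mean())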
@article{wang2024dissect,
  author  = {Wang, Hongjun and Vaze, Sagar and Han, Kai},
  title   = {Dissecting Out-of-Distribution Detection and Open-Set Recognition: A Critical Analysis of Methods and Benchmarks},
  journal = {International Journal of Computer Vision (IJCV)},
  year    = {2024}
}