Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning
Published in ML4H 2025 Symposium, Findings Track, 2025
This paper evaluates how well current multiple instance learning (MIL) models capture critical inter-instance correlations in medical imaging. By generating our own synthetic dataset, we construct a Bayes estimator that serves as an upper bound on achievable model performance. We benchmark both correlated MIL methods (such as attention-based and transformer architectures) and non-correlated approaches, revealing a generalization gap between correlated and non-correlated MIL architectures and showing that closing this gap requires far more labeled data than is typically available in the medical domain.
Recommended citation: Ethan Harvey, Dennis Johan Loevlie, and Michael C. Hughes. (2025). "Synthetic Data Reveals Generalization Gaps in Correlated Multiple Instance Learning." ML4H 2025 Symposium, Findings Track.
Download Paper
