Researchers propose a new undersampling approach for imbalanced data classification that avoids both the pitfalls of synthetic data and under-fitting.
Their method selects datapoints based on their potential to improve model loss rather than randomly undersampling majority data.
The approach aims to identify an optimal subset of the majority-class training data, rejecting redundant datapoints by formulating the selection as a bilevel optimization problem.
Experimental results demonstrate F1 scores up to 10% higher than those of existing state-of-the-art methods.
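
The idea of selecting majority datapoints by their effect on the loss can be illustrated with a small sketch. This is not the researchers' algorithm, whose details are not given here. It is a simple greedy stand-in, assuming a two-level structure: an inner step fits a model on the currently kept subset, and an outer step keeps a candidate majority point only if it lowers a class-balanced validation loss. The function names (`fit_logreg`, `balanced_log_loss`, `loss_guided_undersample`), the logistic-regression model, the seed size, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def fit_logreg(X, y, steps=200, lr=0.5):
    """Inner problem (assumed): plain gradient-descent logistic regression."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def balanced_log_loss(w, X, y):
    """Average of the per-class mean log losses, so both classes weigh equally."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    p = np.clip(1.0 / (1.0 + np.exp(-Xb @ w)), 1e-9, 1 - 1e-9)
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return 0.5 * (losses[y == 1].mean() + losses[y == 0].mean())

def loss_guided_undersample(X_tr, y_tr, X_val, y_val, seed_size=5, rng=None):
    """Greedy outer loop (hypothetical): keep a majority point only if
    refitting with it lowers the balanced validation loss."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y_tr == 1)
    majority = rng.permutation(np.flatnonzero(y_tr == 0))
    kept = list(majority[:seed_size])  # small random seed set of majority points

    def val_loss(maj_idx):
        idx = np.concatenate([minority, np.asarray(maj_idx, dtype=int)])
        w = fit_logreg(X_tr[idx], y_tr[idx])
        return balanced_log_loss(w, X_val, y_val)

    best = val_loss(kept)
    for i in majority[seed_size:]:
        trial = val_loss(kept + [i])
        if trial < best:  # the point helps: keep it
            kept.append(i)
            best = trial
        # otherwise the point is treated as redundant and rejected
    return np.array(kept), best

# Synthetic imbalanced data (illustrative only): label 0 is the majority class.
data_rng = np.random.default_rng(0)
X_tr = np.vstack([data_rng.normal(0, 1, (120, 2)), data_rng.normal(2, 1, (15, 2))])
y_tr = np.array([0] * 120 + [1] * 15)
X_val = np.vstack([data_rng.normal(0, 1, (60, 2)), data_rng.normal(2, 1, (10, 2))])
y_val = np.array([0] * 60 + [1] * 10)

kept, final_loss = loss_guided_undersample(X_tr, y_tr, X_val, y_val, rng=0)
print(f"kept {len(kept)} of 120 majority points, balanced val loss {final_loss:.3f}")
```

The greedy loop refits the model once per candidate, which is only practical on small datasets. A real bilevel solver would typically avoid this cost, but the sketch shows the contrast with random undersampling: whether a point is kept depends on its measured effect on the loss, not on chance.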