For scalable machine learning on large data sets, importance sampling is commonly used to subsample a representative subset on which a model can be trained efficiently.
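As a minimal sketch of the subsampling step described above (the scores, distribution, and statistic below are illustrative assumptions, not the paper's construction): points are drawn with probabilities proportional to importance scores and then reweighted by the inverse sampling probability, so weighted sums over the subsample are unbiased estimates of sums over the full data set.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 1000, 100                       # full data set size, subsample size
scores = rng.uniform(0.1, 1.0, size=n)  # hypothetical per-point importance scores
probs = scores / scores.sum()           # sampling distribution q_i

# Draw the subsample with replacement and attach inverse-probability
# weights 1 / (m * q_i), the standard importance-sampling correction.
idx = rng.choice(n, size=m, p=probs)
weights = 1.0 / (m * probs[idx])

data = rng.uniform(0.0, 1.0, size=n)    # some per-point statistic
full_sum = data.sum()
estimate = (weights * data[idx]).sum()  # unbiased estimate of full_sum
```

In expectation over the random subsample, `estimate` equals `full_sum`; a single draw fluctuates around it with variance controlled by how well the scores match the statistic.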
This paper examines the privacy properties of importance sampling, with a focus on individualized (per-data-point) privacy analysis.
The analysis finds that, under importance sampling, privacy is aligned with utility but at odds with sample size.
The paper proposes two approaches for constructing sampling distributions: one that optimizes the privacy-efficiency trade-off and one that provides utility guarantees through coresets.