Differentially private (DP) machine learning often relies on the availability of public data for tasks like privacy-utility trade-off estimation, hyperparameter tuning, and pretraining.
For tabular data, the assumption that suitable public data is available may not hold, because schemas and distributions are heterogeneous across domains.
To address this, we propose generating surrogate public data from schema-level specifications alone, without accessing sensitive records.
Experiments demonstrate that surrogate public tabular data can effectively replace traditional public data for tasks such as pretraining differentially private tabular classifiers.
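
As a minimal illustration of generating data from a schema-level specification, the sketch below samples each column independently and uniformly from its declared domain. It is not the generation method evaluated in this work: the schema `SCHEMA`, its column names and ranges, and the helpers `sample_value` and `generate_surrogate` are all hypothetical, and independent uniform sampling is an assumed baseline. The point it shows is that the generator reads only column names, types, and domains, and never touches a sensitive record.

```python
import random

# Hypothetical schema: column names, types, and domains are illustrative only.
# Nothing here is derived from sensitive records.
SCHEMA = {
    "age": {"type": "int", "min": 18, "max": 90},
    "income": {"type": "float", "min": 0.0, "max": 200000.0},
    "education": {
        "type": "categorical",
        "values": ["primary", "secondary", "tertiary"],
    },
}


def sample_value(spec, rng):
    """Draw one value uniformly from the domain declared in a column spec."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])  # inclusive on both ends
    if kind == "float":
        return rng.uniform(spec["min"], spec["max"])
    if kind == "categorical":
        return rng.choice(spec["values"])
    raise ValueError(f"unsupported column type: {kind!r}")


def generate_surrogate(schema, n_rows, seed=0):
    """Return n_rows surrogate records, sampling each column independently.

    Assumption: columns are treated as independent and uniform, which is
    only a baseline and ignores any cross-column structure.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    return [
        {name: sample_value(spec, rng) for name, spec in schema.items()}
        for _ in range(n_rows)
    ]


rows = generate_surrogate(SCHEMA, n_rows=5)
```

Because the generator consumes no private data, its output can be used for non-private steps such as pretraining or hyperparameter tuning without spending any privacy budget; how well such data transfers to the sensitive task is an empirical question.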