A novel token-aware data imputation method that leverages large language models has been developed for generating synthetic tabular data in class-imbalanced settings.
The method combines a structured, group-wise CSV-style prompting technique with the elimination of irrelevant contextual information from the input prompt.
Experimental results show that the approach reduces input prompt size while maintaining or improving imputation quality compared to the baseline prompt, particularly for smaller datasets.
This work emphasizes the importance of prompt design in leveraging large language models for synthetic data generation and provides a practical solution for data imputation in class-imbalanced datasets with missing data.
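
To make the prompting idea concrete, the following is a minimal sketch of what a group-wise CSV-style prompt could look like next to a more verbose baseline. The exact prompt layout is not specified here, so everything in the sketch is an assumption for illustration: the function names `build_groupwise_csv_prompt` and `build_verbose_prompt`, the `?` marker for missing values, the toy records, and the sentence-style baseline are all hypothetical. Character count is used only as a crude stand-in for token count.

```python
from collections import defaultdict


def build_groupwise_csv_prompt(rows, columns, label_key, missing="?"):
    """Render records as one compact CSV block per class label (hypothetical layout)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)

    feature_cols = [c for c in columns if c != label_key]
    lines = []
    for label in sorted(groups):
        # The class label is stated once per group instead of on every row.
        lines.append(f"{label_key}={label}")
        lines.append(",".join(feature_cols))
        for row in groups[label]:
            cells = [missing if row.get(c) is None else str(row[c]) for c in feature_cols]
            lines.append(",".join(cells))
    return "\n".join(lines)


def build_verbose_prompt(rows, columns, missing="?"):
    """Assumed baseline for comparison: each row restates every column name."""
    out = []
    for row in rows:
        parts = [f"{c} is {missing if row.get(c) is None else row[c]}" for c in columns]
        out.append(", ".join(parts) + ".")
    return "\n".join(out)


# Toy records (invented for illustration); None marks a value to impute.
columns = ["age", "income", "label"]
rows = [
    {"age": 34, "income": None, "label": "minority"},
    {"age": 51, "income": 72000, "label": "majority"},
    {"age": None, "income": 58000, "label": "majority"},
]

compact = build_groupwise_csv_prompt(rows, columns, "label")
verbose = build_verbose_prompt(rows, columns)

print(compact)
print(len(compact), "characters (group-wise CSV) vs", len(verbose), "characters (verbose)")
```

In this sketch the saving comes from two places: column names appear once per group rather than once per cell, and the class label is factored out of the rows. A real evaluation would measure length with the target model's tokenizer rather than with character counts.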