This study examines Large Language Models (LLMs) used to generate tabular data via in-context learning.
LLMs have become a valuable tool for data augmentation in settings where data is scarce.
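As a concrete illustration of such a pipeline (a minimal sketch under assumed details, not the study's exact setup), the in-context examples can be serialized into a prompt and the model's completion parsed back into rows; `llm_complete`, `seed_rows`, and the column names below are hypothetical placeholders:

```python
import csv
import io

def build_prompt(example_rows, columns, n_new=5):
    """Serialize the in-context example rows as CSV and request new synthetic rows."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns)
    writer.writeheader()
    writer.writerows(example_rows)
    return (
        "You are a tabular data generator. Here are example records:\n"
        f"{buf.getvalue()}\n"
        f"Generate {n_new} additional records in the same CSV format (no header), "
        "matching the statistical patterns of the examples."
    )

def parse_rows(llm_output, columns):
    """Parse the model's CSV-formatted completion back into row dictionaries.

    Assumes the model returns data rows only, with no header line."""
    reader = csv.DictReader(io.StringIO(llm_output.strip()), fieldnames=columns)
    return [row for row in reader if None not in row.values()]

# Usage with a hypothetical `llm_complete(prompt: str) -> str` callable:
# columns = ["age", "income", "gender", "label"]
# prompt = build_prompt(seed_rows, columns)
# synthetic_rows = parse_rows(llm_complete(prompt), columns)
```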
Previous research has shown that LLMs can improve downstream task performance by augmenting underrepresented groups.
However, these gains typically assume access to unbiased in-context examples.
Real-world data, by contrast, is often noisy and skewed, so this assumption rarely holds in practice.
The research investigates how biases within the in-context examples affect the distribution of the synthetic tabular data.
Even subtle biases in the in-context examples can propagate into significant global statistical distortions in the generated data.
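One simple way to quantify this kind of distortion (a generic sketch, not necessarily the metric used in the study) is to compare the marginal distribution of a sensitive attribute in the synthetic data against a reference sample:

```python
from collections import Counter

def group_proportions(rows, key):
    """Empirical distribution of a (sensitive) attribute over a set of rows."""
    counts = Counter(row[key] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# Example: quantify how far the synthetic marginal of a sensitive attribute
# drifts from a reference sample when the in-context examples are skewed.
# drift = total_variation(group_proportions(reference_rows, "gender"),
#                         group_proportions(synthetic_rows, "gender"))
```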
An adversarial scenario is introduced in which a malicious contributor injects bias through the in-context examples, degrading classifier fairness for a specific subgroup.
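A hypothetical sketch of such an injection and a downstream fairness audit is shown below; the label-flipping strategy, the `demographic_parity_gap` helper, and the field names are illustrative assumptions rather than the study's actual method:

```python
def biased_in_context_selection(rows, sensitive_key, target_group, k=8, forced_label=0):
    """Adversarial selection: take k examples from the target subgroup and
    overwrite their labels so the injected association is mirrored by the generator."""
    poisoned = []
    for row in rows:
        if row[sensitive_key] == target_group:
            poisoned.append({**row, "label": forced_label})
            if len(poisoned) == k:
                break
    return poisoned

def demographic_parity_gap(rows, sensitive_key, group_a, group_b):
    """Absolute difference in positive-label rates between two subgroups."""
    def positive_rate(group):
        members = [r for r in rows if r[sensitive_key] == group]
        return sum(int(r["label"]) for r in members) / max(len(members), 1)
    return abs(positive_rate(group_a) - positive_rate(group_b))

# A classifier trained on synthetic data generated from the poisoned examples
# can then be audited by computing demographic_parity_gap on its predictions.
```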
These findings expose a vulnerability of LLM-based data generation pipelines that rely on in-context prompting, particularly in sensitive domains.