Data contamination in LLM-driven Verilog code generation raises concerns about evaluation validity and industrial adoption.
Despite its importance, the risk of data contamination in LLM-based hardware coding has received limited attention.
First analysis of Verilog code generation evaluation frameworks (VerilogEval and RTLLM) for data contamination, using CCD and Min-K% Prob detection methods (see the sketch after this list).
The study evaluates commercial and open-source LLMs (CodeGen2.5, Minitron 4b, Mistral 7b, phi-4 mini, LLaMA-{1,2,3.1}, GPT-{2,3.5,4o}, Deepseek-Coder, and CodeQwen 1.5), in both baseline and fine-tuned variants (RTLCoder and Verigen).
The findings confirm that data contamination is a critical concern in Verilog code generation.
The analysis explores mitigation strategies and the trade-offs between code quality and evaluation fairness, toward unbiased benchmarking.
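For concreteness, below is a minimal sketch of the Min-K% Prob scoring referenced above: the text's token log-probabilities are computed under the candidate model, and the k% least-likely tokens are averaged, with higher (less negative) scores suggesting the text may have been seen during pretraining. The model name, the value of k, and the Verilog snippet are placeholders; the paper's actual CCD/Min-K% Prob pipeline and thresholds may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Min-K% Prob score: mean log-probability of the k% least-likely tokens.

    Higher (less negative) scores hint that the text may have appeared in
    the model's training data.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability of each actual token given its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the k% lowest-probability tokens.
    n = max(1, int(k * token_lp.numel()))
    return torch.topk(token_lp, n, largest=False).values.mean().item()

# Usage: score a suspected benchmark problem against a candidate model
# (model name and Verilog snippet are illustrative placeholders).
tok = AutoTokenizer.from_pretrained("gpt2")
mdl = AutoModelForCausalLM.from_pretrained("gpt2")
score = min_k_prob(
    "module mux2(input a, b, sel, output y); assign y = sel ? b : a; endmodule",
    mdl, tok,
)
print(f"Min-K% Prob score: {score:.3f}")
```

In practice, scores are compared across benchmark problems (or against a calibration set of provably unseen Verilog) rather than read in isolation, since absolute log-probabilities vary widely across models.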