This study evaluates 13 state-of-the-art large language models (LLMs) on their ability to generate technically relevant humor for software developers.
It tests a range of temperature settings and prompt variations, finding that 73% of the models achieve peak performance at lower temperatures, i.e., with reduced output stochasticity.
The analysis reveals significant performance variation across models, with some architectures outperforming baseline systems by 21.8%.
The study concludes with practical guidelines for model selection and configuration, highlighting how temperature tuning and architectural choices affect humor generation effectiveness.
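To illustrate the kind of temperature sweep the study describes, the sketch below queries a single model with one humor prompt at several temperature values. It assumes an OpenAI-style chat-completions client; the model name, prompt, and temperature grid are placeholders, not the study's actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical settings: the model name, prompt, and temperature grid
# are illustrative placeholders, not values taken from the study.
MODEL = "gpt-4o-mini"
PROMPT = "Write a short, original joke about debugging race conditions."
TEMPERATURES = [0.2, 0.5, 0.8, 1.1]

def generate_joke(temperature: float) -> str:
    """Request one humor sample at the given sampling temperature."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
    )
    return response.choices[0].message.content

for t in TEMPERATURES:
    # Lower temperatures yield more deterministic (less stochastic) output.
    print(f"--- temperature={t} ---")
    print(generate_joke(t))
```

In practice, each generated sample would then be scored (e.g., by human raters or an automated judge) so that per-temperature performance can be compared across models.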