Large Language Models (LLMs) have shown impressive capabilities, but ensuring their outputs adhere to strict structural or grammatical constraints remains a challenge.
Constrained decoding with context-free grammars ensures that LLMs produce outputs in a specified format by dynamically computing, at each decoding step, a mask over the token logits that permits only grammar-valid continuations.
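The core masking mechanism can be sketched as follows. This is a minimal illustration in plain Python, not Formatron's actual API: `mask_logits` is a hypothetical helper, and the set of allowed token ids stands in for whatever the grammar engine computes at each step.

```python
import math

def mask_logits(logits, allowed_token_ids):
    # Hypothetical helper: tokens the grammar disallows get -inf logits,
    # so they receive zero probability after softmax.
    return [x if i in allowed_token_ids else float("-inf")
            for i, x in enumerate(logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary of 5 tokens; suppose the grammar permits only ids 1 and 3 next.
logits = [2.0, 0.5, 1.0, 1.5, -0.3]
probs = softmax(mask_logits(logits, {1, 3}))
```

After masking, sampling (or greedy argmax) can only ever select a token the grammar allows, which is what guarantees structural compliance regardless of the model's raw preferences.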
A novel dynamic pruning strategy called ZapFormat, based on the Earley algorithm, has been proposed to eliminate invalid or redundant Earley states in real time, reducing memory usage and improving decoding speed.
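For context, the states in question are the items of an Earley chart. The sketch below is a minimal textbook Earley recognizer, not Formatron's implementation: using a set per chart position already deduplicates identical Earley states, which is the baseline that more aggressive pruning strategies like ZapFormat build on by also discarding states that can no longer contribute to a valid parse. The grammar here is assumed to have no epsilon productions.

```python
def earley_recognize(grammar, start, tokens):
    # grammar: dict mapping nonterminal -> list of productions (tuples of symbols).
    # An Earley item is (lhs, rhs, dot, origin). Storing items in per-position
    # sets deduplicates identical states as they are generated.
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    for prod in grammar[start]:
        chart[0].add((start, prod, 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:  # predict: expand a nonterminal
                    for prod in grammar[sym]:
                        item = (sym, prod, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
                elif i < n and tokens[i] == sym:  # scan: consume a terminal
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:  # complete: advance items waiting on this nonterminal
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        item = (l2, r2, d2 + 1, o2)
                        if item not in chart[i]:
                            chart[i].add(item)
                            agenda.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

# Toy grammar: S -> "a" S | "a"  (one or more "a" tokens).
g = {"S": [("a", "S"), ("a",)]}
```

Because constrained decoding must advance such a chart once per generated token, keeping the live state sets small is exactly where real-time pruning pays off in both memory and speed.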
Experiments show that Formatron, a new constrained decoding engine incorporating ZapFormat, maintains high-precision grammar-compliant outputs while achieving significant speedups over existing implementations.