A new technique has been developed to reconstruct the exact input that produced a language model's output, aiding post-incident analysis and the detection of fabricated outputs.
The technique, called SODA, is a gradient-based algorithm that outperforms existing methods at recovering short, out-of-distribution inputs from language models.
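SODA's exact objective is not spelled out here, but the general idea behind gradient-based input reconstruction can be sketched on a toy model: relax the discrete input to a softmax over the vocabulary, run gradient descent so the model's output matches the observed output, then discretize with argmax. Everything below (the one-layer "model", sizes, learning rate, step count) is illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 16
E = rng.normal(size=(VOCAB, DIM))   # input embedding matrix
W = rng.normal(size=(DIM, VOCAB))   # output projection

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def forward(x_soft):
    # x_soft: relaxed (soft) one-hot vector over the vocabulary
    h = x_soft @ E                  # soft embedding lookup
    return softmax(h @ W)           # next-token distribution

# The "observed" output, produced by a secret input token we try to recover.
secret = 3
target = forward(np.eye(VOCAB)[secret])

# Optimize logits over the input vocabulary so the model's output
# matches the observed output (cross-entropy loss, manual gradients).
logits = np.zeros(VOCAB)
for _ in range(500):
    x = softmax(logits)
    p = forward(x)
    dh = (p - target) @ W.T         # dLoss/dh for softmax + cross-entropy
    dx = dh @ E.T                   # dLoss/dx_soft
    grad = x * (dx - x @ dx)        # chain rule through softmax(logits)
    logits -= 0.5 * grad

recovered = int(np.argmax(logits))  # discretize the relaxed input
print(recovered)
```

In this toy setup the relaxed problem is convex in the soft input, so descent reliably lands on the hidden token; on a real multi-token LLM input the landscape is far harder, which is consistent with the reported drop-off on longer sequences.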
In experiments on LLMs ranging from 33M to 3B parameters, SODA exactly recovered 79.5% of shorter inputs but struggled with longer input sequences.
The study suggests that standard deployment practices may currently offer sufficient protection against misuse of such reconstruction methods.