A new paper discusses the limitations of popular foundation models in rendering accurate user-requested content.
Examples show OpenAI's Sora model and the Adobe Firefly generative diffusion engine struggling to depict certain concepts accurately.
Researchers introduce a new dataset methodology named VideoUFO, aiming to align data collections better with user expectations.
The VideoUFO dataset includes over 1.09 million video clips on user-focused topics and overlaps with popular existing datasets by only 0.29%.
The approach involves filtering YouTube videos with Creative Commons licenses based on pre-estimated user needs, ensuring novel content selection.
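As a rough illustration of this kind of filtering step, the sketch below keeps only Creative Commons videos that match a wanted topic and have not been collected before. The record layout (`id`, `license`, `topic`), the `filter_candidates` helper, and the sample data are hypothetical and not taken from the paper; the one borrowed detail is `creativeCommon`, the value the YouTube Data API uses to mark Creative Commons videos.

```python
def filter_candidates(videos, wanted_topics, seen_ids):
    """Keep Creative Commons videos on wanted topics that are not already collected."""
    kept = []
    for video in videos:
        if video["license"] != "creativeCommon":
            continue  # standard YouTube license: skip
        if video["topic"] not in wanted_topics:
            continue  # not one of the pre-estimated user topics
        if video["id"] in seen_ids:
            continue  # already present in an existing collection
        kept.append(video)
    return kept


# Hypothetical metadata records, for illustration only.
videos = [
    {"id": "a1", "license": "creativeCommon", "topic": "giant squid"},
    {"id": "b2", "license": "youtube", "topic": "giant squid"},
    {"id": "c3", "license": "creativeCommon", "topic": "cooking"},
    {"id": "d4", "license": "creativeCommon", "topic": "Van Gogh"},
]
wanted_topics = {"giant squid", "Van Gogh"}
seen_ids = {"d4"}

selected = filter_candidates(videos, wanted_topics, seen_ids)
selected_ids = [video["id"] for video in selected]
```

In practice the license and topic information would come from the video platform's metadata rather than from hand-written records.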
Emphasis is placed on data curation around user demand to counter the biased distribution of internet content in generative video systems.
The researchers analyze topics using SentenceTransformers and K-means clustering, then use GPT-4o to refine the dataset topics.
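To make the clustering step concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not the authors' pipeline: the two-dimensional vectors are hand-made stand-ins for the sentence embeddings a model such as SentenceTransformers would produce, the sample texts are invented, and the centroids are simply initialized from the first `k` points to keep the run deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm; centroids start at the first k points (an assumption)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Toy stand-ins for embedded text; real embeddings have hundreds of dimensions.
texts = ["giant squid swimming", "squid in the deep sea",
         "Van Gogh painting", "starry night brushwork"]
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

labels, centroids = kmeans(embeddings, k=2)
clusters = {}
for text, label in zip(texts, labels):
    clusters.setdefault(label, []).append(text)
```

Each resulting cluster groups texts about the same subject; a language model could then be asked to name or refine the topic each cluster represents.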
Videos are scraped according to topic criteria, and each entry carries both a brief and a detailed caption; video quality is assessed with methods from the VBench project.
The paper evaluates generative models with the BenchUFO benchmark, which shows that success rates on user-focused topics vary across architectures.
Current text-to-video models perform inconsistently on user-focused topics such as 'giant squid' or 'Van Gogh' due to insufficient training on those topics.