Salesforce Research, in collaboration with researchers from The University of Hong Kong, has introduced AGUVIS, a unified pure-vision framework for autonomous GUI agents that operates across web, mobile, and desktop platforms.
Current GUI automation tools struggle with the mismatch between natural-language instructions and the visual representation of GUIs.
Existing tools also tend to depend on closed-source models for reasoning and planning, and on textual representations of the interface, which introduces information loss.
AGUVIS instead takes a pure-vision approach, operating directly on screenshot images, which improves decision-making accuracy and reduces token costs.
The AGUVIS Collection unifies existing datasets and augments them with synthetic data to train robust, adaptable models.
AGUVIS is trained in two stages, first on grounding and then on planning, enabling the model to perform both single-step and multi-step tasks effectively.
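The two-stage flow can be pictured as a minimal sketch: a grounding step maps an instruction to screen coordinates, and a planning step turns those coordinates into a platform action. All names here (`ground`, `to_command`, the coordinate lookup) are hypothetical illustrations, not AGUVIS's actual interface; the pyautogui-style command strings are an assumption about the action format.

```python
# Illustrative sketch of a two-stage grounding-then-planning loop.
# The function names and the fixed coordinate lookup are hypothetical;
# a trained vision model would perform the grounding. Coordinates are
# normalized to [0, 1].

def ground(instruction, screenshot):
    """Stage 1 (grounding): map an instruction plus a screenshot to
    normalized (x, y) coordinates. Stubbed here with a fixed lookup."""
    targets = {"search box": (0.50, 0.12), "submit button": (0.85, 0.90)}
    return targets[instruction]

def to_command(platform, x, y):
    """Stage 2 (planning): translate grounded coordinates into a
    platform-specific action, expressed as a command string."""
    if platform == "mobile":
        # Platform-specific gesture: swipe up from the target point.
        return f"swipe({x:.2f}, {y:.2f}, {x:.2f}, {max(y - 0.3, 0.0):.2f})"
    return f"pyautogui.click({x:.2f}, {y:.2f})"

x, y = ground("search box", screenshot=None)
print(to_command("web", x, y))     # pyautogui.click(0.50, 0.12)
print(to_command("mobile", x, y))  # swipe(0.50, 0.12, 0.50, 0.00)
```

The same grounded coordinates feed different action backends, which is how a single vision model can serve multiple platforms.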
AGUVIS posted strong results on GUI grounding benchmarks, with accuracies of 88.3% on web, 85.7% on mobile, and 81.8% on desktop platforms.
The system can generalize across platforms and handle platform-specific actions, such as swiping on mobile devices.
AGUVIS also proved efficient, cutting inference costs (in USD) by 93% compared to existing models.
The vision-based AGUVIS framework provides an efficient and capable solution for autonomous GUI tasks.