Details

  • Google has introduced Gemini 2.5 Computer Use, now available in limited preview via the Gemini API.
  • The capability, delivered as a specialized model built on Gemini 2.5 Pro’s multimodal reasoning, lets agent code read screenshots, recognize on-screen elements, and issue clicks, keystrokes, and scrolls the way a human user would (see the loop sketch after this list).
  • On Google’s reported benchmarks, Gemini 2.5 Computer Use delivers lower latency and higher success rates than competing models on standard web and mobile control tasks.
  • The preview focuses on browser workflows such as form filling, navigation, and data extraction, with early promise for Android UI automation.
  • Developers can try the model in Google AI Studio, review sample code on GitHub, and follow hands-on guides in the public API documentation.
  • Because the model drives the UI visually rather than through page internals, Google aims to reduce dependence on brittle CSS selectors and hard-coded RPA scripts, making autonomous agents more adaptable when interfaces change.
  • Access to the API is currently limited as Google collects feedback before a wider rollout, following its pattern with previous Gemini releases.
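
For orientation, the loop referenced above can be sketched with the google-genai Python SDK. This is a minimal sketch under assumptions: the preview model name, the ComputerUse tool fields, and the shape of the echoed function response follow Google’s preview documentation as published and may differ in practice, while execute_action() is a hypothetical stand-in for the developer’s own browser driver (e.g. Playwright). The GitHub samples remain the authoritative reference.

```python
# Hedged sketch of the Computer Use agent loop. Assumes the
# google-genai Python SDK; the model name and tool fields mirror
# Google's preview docs but should be verified against the samples.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

config = types.GenerateContentConfig(
    tools=[types.Tool(computer_use=types.ComputerUse(
        environment=types.Environment.ENVIRONMENT_BROWSER))])


def execute_action(call) -> bytes:
    """Hypothetical dispatcher: apply the model's action (a click,
    keystroke, or scroll at model-supplied coordinates) to a real
    browser, e.g. via Playwright, and return a fresh PNG screenshot."""
    raise NotImplementedError("wire this to your browser automation")


contents = [types.Content(role="user", parts=[
    types.Part(text="Fill out the signup form on the open page.")])]

while True:
    response = client.models.generate_content(
        model="gemini-2.5-computer-use-preview-10-2025",
        contents=contents,
        config=config)
    if not response.function_calls:
        break  # no further action requested: the task is done
    call = response.function_calls[0]
    screenshot = execute_action(call)
    # Echo the model's turn, then report the action's outcome along
    # with the new screen state so the loop can continue.
    contents.append(response.candidates[0].content)
    contents.append(types.Content(role="user", parts=[
        types.Part.from_function_response(
            name=call.name, response={"status": "ok"}),
        types.Part.from_bytes(data=screenshot, mime_type="image/png")]))
```

The design point worth noting is that the model sees only pixels and emits coordinate-level actions, which is precisely what removes the dependence on CSS selectors described above.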

Impact

With this launch, Google intensifies competition with OpenAI’s Operator and Anthropic’s Claude computer use in workflow automation. Real-time, vision-driven UI control could reset expectations and put pressure on both LLM vendors and traditional RPA vendors. As Google extends support to Android, it may strengthen its mobile ecosystem while raising new privacy and regulatory considerations.