Details
- Microsoft released Phi-Ground-Any, a 4-billion-parameter vision model fine-tuned from Phi-3.5-vision-instruct, available on Hugging Face for GUI grounding and AI agent applications.
- The model achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters, with 95.2% accuracy on real Windows app interactions and end-to-end scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision.
- Phi-Ground-Any takes user instructions directly and outputs click coordinates as relative values on a normalized canvas, enabling precise screen-element selection for autonomous agent navigation (see the coordinate-conversion sketch after this list).
- Even the best current end-to-end grounding models score below 65% accuracy on challenging benchmarks, making them unreliable for real-world deployment; Phi-Ground addresses this gap through improved training methodology and data-collection practices.
- The model uses a fixed input resolution of 1680x1008 and requires specific preprocessing; inference is supported via Hugging Face Transformers or vLLM, though no hosted inference provider offers it yet (a minimal loading sketch follows).
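A minimal local-inference sketch following the usual Phi-3.5-vision loading pattern, since Phi-Ground-Any is fine-tuned from that base. The repo id, prompt template, and resize-only preprocessing are assumptions here; check the model card for the model's actual "specific preprocessing" and output format before relying on this.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Phi-Ground-Any"  # assumed Hugging Face repo id

# Phi-3.5-vision-style loading; trust_remote_code pulls in the custom model code.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# The model expects a fixed 1680x1008 input; a plain resize is assumed here.
screenshot = Image.open("screenshot.png").resize((1680, 1008))

# Hypothetical prompt; Phi-3.5-vision uses <|image_1|> placeholders in chat turns.
prompt = "<|user|>\n<|image_1|>\nClick the Save button.<|end|>\n<|assistant|>\n"
inputs = processor(prompt, images=[screenshot], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens: the predicted normalized coordinates.
answer = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```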
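Because the model emits coordinates on a normalized canvas rather than in pixels, an agent must rescale them to the target display before clicking. A small sketch, assuming the output is an (x, y) pair in [0, 1] relative to the full screen:

```python
def to_pixels(norm_x: float, norm_y: float, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Map a normalized [0, 1] click position onto actual screen pixels."""
    return round(norm_x * screen_w), round(norm_y * screen_h)

# A predicted click at (0.42, 0.87) on a 1920x1080 display:
print(to_pixels(0.42, 0.87, 1920, 1080))  # -> (806, 940)
```

Normalized output keeps the model independent of the user's display resolution: the same prediction maps cleanly to any screen size at click time.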
Impact
Phi-Ground-Any narrows the accuracy gap that has hindered reliable autonomous-agent deployment. At 95.2% accuracy on Windows interactions within a sub-10B-parameter footprint, the model improves on prior grounding approaches while staying computationally efficient, an advantage over larger competitors from OpenAI or Google. The public release on Hugging Face makes the model directly accessible to developers building AI agents, potentially lowering barriers to enterprise adoption of autonomous UI automation.
