V-Droid: Advancing Mobile GUI Agent Through Generative Verifiers

Gaole Dai1 *, Shiqi Jiang2, Ting Cao2, Yuanchun Li3, Yuqing Yang2, Rui Tan1, Mo Li4, Lili Qiu2
1 Nanyang Technological University   2 Microsoft Research   3 Tsinghua University   4 Hong Kong University of Science and Technology  
*The work is done during internship at Microsoft Research.

We introduce V-Droid – the first mobile GUI agent with near-real-time, high-quality decision-making ability. Unlike conventional agents that rely on large language models (LLMs) to generate actions from scratch at every step, V-Droid employs LLMs as verifiers that evaluate candidate actions before committing to a decision. We introduce a comprehensive design for the verifier-driven GUI agent, comprising: (1) a discretized action space with a prefilling-only workflow; (2) pairwise progress preference training to enhance the verifier's decision-making and self-correction abilities; and (3) scalable human-agent joint annotation.

Abstract

We propose V-Droid, a mobile GUI task automation agent. Unlike previous mobile agents that utilize Large Language Models (LLMs) as generators to directly generate actions at each step, V-Droid employs LLMs as verifiers to evaluate candidate actions before making final decisions. To realize this novel paradigm, we introduce a comprehensive framework for constructing verifier-driven mobile agents: discretized action space construction coupled with a prefilling-only workflow to accelerate verification, pairwise progress preference training to significantly enhance the verifier's decision-making capabilities, and a scalable human-agent joint annotation scheme to efficiently collect the necessary data at scale. V-Droid sets a new state-of-the-art task success rate across several public mobile task automation benchmarks: 59.5% on AndroidWorld, 38.3% on AndroidLab, and 49% on MobileAgentBench, surpassing existing agents by 9.5%, 2.1%, and 9%, respectively. Furthermore, V-Droid achieves an impressively low latency of 0.7 seconds per step, making it the first mobile agent capable of delivering near-real-time, effective decision-making.
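The verifier-driven paradigm described above can be illustrated with a minimal sketch: enumerate a discrete set of candidate actions from the current screen, score each one with a verifier, and execute the highest-scoring candidate. The function names and the toy keyword-matching verifier here are hypothetical stand-ins; in V-Droid the scorer is an LLM verifier run in prefilling-only mode.

```python
# Hedged sketch of verifier-driven action selection (names are illustrative,
# not V-Droid's actual API). The verifier scores candidates; it never
# generates actions from scratch.

def extract_candidate_actions(ui_elements):
    """Discretize the action space: one candidate per actionable UI
    element, plus a few default navigation actions (illustrative set)."""
    actions = [("click", e) for e in ui_elements]
    actions.append(("scroll", None))
    actions.append(("back", None))
    return actions

def select_action(task, ui_elements, score_action):
    """Score every candidate with the verifier and return the best one."""
    candidates = extract_candidate_actions(ui_elements)
    scores = [score_action(task, a) for a in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]

def toy_verifier(task, action):
    """Stand-in for the LLM verifier: prefers an element whose label
    appears in the task description."""
    _, elem = action
    if elem is not None and elem.lower() in task.lower():
        return 1.0
    return 0.0

chosen = select_action("Open Wi-Fi for me",
                       ["Bluetooth", "Wi-Fi", "Airplane mode"],
                       toy_verifier)
print(chosen)  # → ('click', 'Wi-Fi')
```

Because every candidate is a fixed string appended to a shared prompt prefix, the verifier only needs to prefill each candidate rather than decode a full action, which is what makes per-step scoring fast.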

Video Demos

V-Droid in the following demos is hosted on 2× Nvidia 4090 GPUs; the videos are presented without acceleration.

Delete the recipes from the Broccoli app: Chicken Alfredo Pasta, Tomato Basil Bruschetta, Grilled Cheese with Tomato and Basil.

Open Wi-Fi for me.



Send a text message to +16597910719 with message: Beauty is in the eye of the beholder.

Experiment Results


Task success rate and decision-making latency per step of current SOTA mobile agents and V-Droid, evaluated on the AndroidWorld benchmark. The latency of 2B, 7B, and 8B agents is measured on 2× Nvidia 4090; for 72B or MoE agents, latency is measured on 4× Nvidia A100 80GB.

BibTeX

@article{dai2025advancingmobileguiagents,
      title={Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment}, 
      author={Gaole Dai and Shiqi Jiang and Ting Cao and Yuanchun Li and Yuqing Yang and Rui Tan and Mo Li and Lili Qiu},
      year={2025},
      eprint={2503.15937},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.15937}, 
}