This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗 Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act. Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection. Project page: https://t.co/O7JMe8tzFM
NVIDIA LocateAnything Uses Parallel Decoding to Speed Up Visual Grounding
- Throughput: 12.7 BPS (Hybrid Mode)
- Training data: 138M language queries
- Bounding boxes: 785M
- Speedup: 10x over Qwen3-VL
- Architecture: Moon-ViT and Qwen2.5
LocateAnything offers a high-speed alternative to Google's Magic Pointer interaction model for robotics, where agents must localize objects quickly enough to act on them. While Zhipu AI's GLM-5V-Turbo multimodal integration prioritizes task variety, LocateAnything optimizes for raw spatial grounding speed. In Hybrid Mode it reaches 12.7 boxes per second (BPS), surpassing OpenRouter's Perceptron Mk1 spatial grounding in throughput.
You can use the model for GUI navigation, document analysis, and dense object detection. Its Hybrid Mode defaults to high-speed parallel decoding and falls back to sequential decoding when it detects spatial ambiguity. The model was trained on 138 million language queries and is available on Hugging Face for experimentation.
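To make the Hybrid Mode idea concrete, here is a minimal Python sketch of a parallel-first decoder with a sequential fallback. It is not LocateAnything's actual code or API: the function names, the per-coordinate confidence scores, and the `min_confidence` threshold are all assumptions made for illustration, since the article does not describe how spatial ambiguity is detected.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Decoded:
    box: Box
    mode: str   # "parallel" or "sequential"
    steps: int  # number of decoder passes used


def hybrid_decode(
    parallel_step: Callable[[], Tuple[Box, Sequence[float]]],
    sequential_step: Callable[[Sequence[float]], float],
    min_confidence: float = 0.5,  # hypothetical ambiguity threshold
) -> Decoded:
    """Try one parallel pass; fall back to coordinate-by-coordinate decoding."""
    # A single pass proposes all four coordinates plus one confidence each.
    box, confidences = parallel_step()
    if min(confidences) >= min_confidence:
        return Decoded(box, "parallel", steps=1)

    # Assumed fallback: decode one coordinate at a time, each conditioned
    # on the coordinates already produced.
    coords: List[float] = []
    for _ in range(4):
        coords.append(sequential_step(coords))
    return Decoded((coords[0], coords[1], coords[2], coords[3]), "sequential", steps=5)


# Toy stand-ins for a real model, used only to exercise the control flow.
def confident_parallel() -> Tuple[Box, Sequence[float]]:
    return (10.0, 20.0, 110.0, 220.0), [0.90, 0.95, 0.90, 0.92]


def ambiguous_parallel() -> Tuple[Box, Sequence[float]]:
    return (10.0, 20.0, 110.0, 220.0), [0.90, 0.30, 0.90, 0.92]


def toy_sequential(prefix: Sequence[float]) -> float:
    return [12.0, 18.0, 108.0, 215.0][len(prefix)]


fast = hybrid_decode(confident_parallel, toy_sequential)
slow = hybrid_decode(ambiguous_parallel, toy_sequential)
print(fast.mode, fast.steps)
print(slow.mode, slow.steps)
```

In this toy setup the confident case finishes in one decoder pass, while the ambiguous case pays for the failed parallel pass plus four sequential passes. That asymmetry is why a parallel-first default can raise average throughput when most boxes are unambiguous.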