HeadsUpAI

NVIDIA LocateAnything Uses Parallel Decoding to Speed Up Visual Grounding

NVIDIA researchers introduced LocateAnything, a vision-language model (VLM) for visual grounding and object detection. It replaces standard autoregressive coordinate decoding with Parallel Box Decoding, which predicts all of a bounding box's coordinates simultaneously in one forward pass. This preserves the geometric coherence of each box and removes the bottleneck of generating coordinates one token at a time.
Throughput: 12.7 boxes per second (BPS), Hybrid Mode
Training data: 138M language queries
Bounding boxes: 785M
Speedup: 10x over Qwen3-VL
Architecture: Moon-ViT and Qwen2.5

For robotics, where agents must act in real time, LocateAnything offers a high-speed alternative to Google's Magic Pointer interaction model. While Zhipu AI's GLM-5V-Turbo multimodal integration prioritizes task variety, LocateAnything optimizes for raw spatial grounding speed. It achieves 12.7 boxes per second, surpassing OpenRouter's Perceptron Mk1 spatial grounding in throughput.

You can use the model for GUI navigation, document analysis, and dense object detection. Its Hybrid Mode defaults to high-speed parallel decoding and falls back to sequential, token-by-token decoding when it detects spatial ambiguity. The model was trained on 138 million language queries and is available on Hugging Face for experimentation.

NVIDIA AI (@NVIDIAAI) on X:

This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗 Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act. Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection. Project page: https://t.co/O7JMe8tzFM


Still wondering? A few quick answers below.

LocateAnything is a vision-language model designed for high-speed object detection and visual grounding. It allows AI agents and robots to pinpoint specific items, text, or user interface elements within an image based on natural language descriptions. The model is specifically optimized to provide the fast spatial reasoning required for real-time interaction in physical and digital environments.

Traditional models predict bounding box coordinates one by one as a sequence of text tokens, which is slow and computationally expensive. Parallel Box Decoding treats a bounding box as a single atomic unit, allowing the model to predict all four spatial coordinates simultaneously in one forward pass. This approach significantly increases throughput while maintaining the geometric relationship between coordinates.
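The difference in decoding steps described above can be illustrated with a toy counter. This is only a sketch under stated assumptions, not NVIDIA's implementation: `ToyDecoder`, `decode_sequential`, and `decode_parallel` are hypothetical names, the "model" simply echoes known boxes, and the sequential path is assumed to spend exactly one forward pass per coordinate (real text-based models typically need several tokens per coordinate).

```python
from dataclasses import dataclass


@dataclass
class ToyDecoder:
    """Hypothetical stand-in for a model; it only counts forward passes."""
    forward_passes: int = 0

    def next_coordinate(self, box, index):
        # Sequential style: one forward pass yields a single coordinate.
        self.forward_passes += 1
        return box[index]

    def next_box(self, box):
        # Parallel style: one forward pass yields all four coordinates.
        self.forward_passes += 1
        return tuple(box)


def decode_sequential(decoder, boxes):
    """Emit each box coordinate by coordinate (four passes per box)."""
    return [tuple(decoder.next_coordinate(b, i) for i in range(4)) for b in boxes]


def decode_parallel(decoder, boxes):
    """Emit each box as one atomic unit (one pass per box)."""
    return [decoder.next_box(b) for b in boxes]


boxes = [(10, 20, 110, 220), (30, 40, 90, 160), (5, 5, 50, 50)]
seq, par = ToyDecoder(), ToyDecoder()
same_output = decode_sequential(seq, boxes) == decode_parallel(par, boxes)

print(same_output)                            # True
print(seq.forward_passes, par.forward_passes)  # 12 3
```

Both paths return the same boxes in this toy setup; only the number of passes differs, which is the source of the throughput gain the paragraph describes.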

LocateAnything achieves a throughput of 12.7 boxes per second on a single NVIDIA H100 GPU. This makes it approximately 10 times faster than text-based models like Qwen3-VL and 2.5 times faster than quantized coordinate models like Rex-Omni. The speed gains are most significant in dense scenes where the model must identify hundreds of objects at once.
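The ratios quoted above imply rough baseline throughputs, which the short calculation below works out. The baseline figures are derived from the stated speedups, not measured values from the paper, and the 200-box scene is a hypothetical example. The calculation also assumes a constant per-box rate, which is a simplification given that the gains are said to vary with scene density.

```python
LOCATE_ANYTHING_BPS = 12.7   # boxes per second, as reported in the text
SPEEDUP_VS_QWEN3_VL = 10.0   # "approximately 10 times faster"
SPEEDUP_VS_REX_OMNI = 2.5    # "2.5 times faster"
DENSE_SCENE_BOXES = 200      # hypothetical scene size, chosen for illustration


def implied_baseline_bps(bps: float, speedup: float) -> float:
    """Throughput a baseline would have if `bps` is `speedup` times faster."""
    return bps / speedup


def seconds_for_boxes(num_boxes: int, bps: float) -> float:
    """Time to emit `num_boxes` boxes at a constant rate of `bps`."""
    return num_boxes / bps


rates = {
    "LocateAnything": LOCATE_ANYTHING_BPS,
    "Qwen3-VL (implied)": implied_baseline_bps(LOCATE_ANYTHING_BPS, SPEEDUP_VS_QWEN3_VL),
    "Rex-Omni (implied)": implied_baseline_bps(LOCATE_ANYTHING_BPS, SPEEDUP_VS_REX_OMNI),
}

for name, bps in rates.items():
    seconds = seconds_for_boxes(DENSE_SCENE_BOXES, bps)
    print(f"{name}: {bps:.2f} boxes/s, {seconds:.1f} s for {DENSE_SCENE_BOXES} boxes")
```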

Hybrid Mode is a flexible inference strategy that balances speed and reliability. It uses Fast Mode by default to predict bounding boxes in parallel for maximum throughput. If the model detects spatial ambiguity or malformed syntax, it automatically falls back to a slower, token-by-token decoding method to ensure the final coordinate predictions remain accurate and robust.
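The fallback logic described above can be sketched as a small wrapper. This is a minimal illustration, not the model's actual mechanism: the text does not say how ambiguity or malformed output is detected, so the validity check here (four values, ordered corners, inside the image) is an assumption, and `hybrid_decode`, `is_well_formed`, and the stub decoders are hypothetical names.

```python
from typing import Callable, Optional, Sequence, Tuple

Decoder = Callable[[str], Optional[Sequence[float]]]


def is_well_formed(box: Optional[Sequence[float]], width: float, height: float) -> bool:
    """Assumed check: four values, x1 < x2, y1 < y2, all inside the image."""
    if box is None or len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def hybrid_decode(fast: Decoder, slow: Decoder, query: str,
                  width: float, height: float) -> Tuple[Optional[Sequence[float]], str]:
    """Try the fast (parallel) path first; fall back to the slow path if needed."""
    box = fast(query)
    if is_well_formed(box, width, height):
        return box, "fast"
    return slow(query), "fallback"


# Stub decoders standing in for the two inference paths.
def fast_stub(query: str):
    return (120.0, 80.0, 60.0, 200.0)   # malformed on purpose: x2 < x1


def slow_stub(query: str):
    return (60.0, 80.0, 120.0, 200.0)


box, path = hybrid_decode(fast_stub, slow_stub, "the red mug", 640, 480)
print(box, path)  # (60.0, 80.0, 120.0, 200.0) fallback
```

The design point is that the slower path only runs when the fast result fails the check, so typical queries keep the parallel throughput.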

The model was trained on LocateAnything-Data, a massive corpus containing 138 million language queries and 785 million bounding boxes. This dataset covers a diverse range of tasks, including general object detection, graphical user interface grounding, optical character recognition, and referring expression comprehension, where the model must link complex human intent to specific image regions.
