NVIDIA LocateAnything Uses Parallel Decoding to Speed Up Visual Grounding

NVIDIA

May 28, 2026 · Updated Jun 7, 2026

NVIDIA released LocateAnything, a vision-language model that uses parallel box decoding to pinpoint objects up to 10 times faster than traditional text-based models. By treating bounding boxes as single atomic units rather than sequences of tokens, the model achieves state-of-the-art accuracy for AI agents navigating graphical interfaces and physical environments.

NVIDIA researchers introduced LocateAnything, a vision-language model (VLM) that changes how bounding boxes are decoded. It replaces standard autoregressive decoding with Parallel Box Decoding, which predicts all of a box's spatial coordinates simultaneously in one pass. This shift preserves geometric coherence and removes the computational bottleneck of sequential generation.

Throughput: 12.7 boxes per second (BPS) in Hybrid Mode
Training data: 138M language queries
Bounding boxes: 785M
Speedup: 10x over Qwen3-VL
Architecture: Moon-ViT and Qwen2.5

This update provides a high-speed alternative to Google's Magic Pointer interaction model for robotics, where agents must act instantly. While Zhipu AI's GLM-5V-Turbo multimodal integration prioritizes task variety, LocateAnything optimizes for raw spatial grounding speed. It achieves 12.7 boxes per second, surpassing OpenRouter's Perceptron Mk1 spatial grounding in throughput.

You can use the model for GUI navigation, document analysis, and dense object detection. It features a Hybrid Mode that defaults to high-speed parallel decoding but falls back to sequential decoding if it detects spatial ambiguity. The model was trained on 138 million language queries and is available on Hugging Face for experimentation.

View the full update on research.nvidia.com

NVIDIA AI

@NVIDIAAI · May 28

This #CVPR2026 paper from our research team is trending #1 on @HuggingFace 🤗 Meet LocateAnything: a vision-language detection model that rethinks bounding box prediction. For AI agents and robots, “seeing” is only useful if a model can pinpoint where something is fast enough to act. Trained on 138M high-quality samples, LocateAnything decodes bounding boxes in parallel instead of one coordinate at a time, improving localization accuracy while dramatically increasing throughput for visual grounding and detection. Project page: https://t.co/O7JMe8tzFM

Still wondering? A few quick answers below.

What is NVIDIA LocateAnything?

LocateAnything is a vision-language model designed for high-speed object detection and visual grounding. It allows AI agents and robots to pinpoint specific items, text, or user interface elements within an image based on natural language descriptions. The model is specifically optimized to provide the fast spatial reasoning required for real-time interaction in physical and digital environments.

How does Parallel Box Decoding work?

Traditional models predict bounding box coordinates one by one as a sequence of text tokens, which is slow and computationally expensive. Parallel Box Decoding treats a bounding box as a single atomic unit, allowing the model to predict all four spatial coordinates simultaneously in one forward pass. This approach significantly increases throughput while maintaining the geometric relationship between coordinates.
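To make the difference concrete, here is a minimal sketch that only counts decoding steps. It is not NVIDIA's implementation: the function names, the one-token-per-coordinate default, and the one-step-per-box assumption are all hypothetical simplifications of the description above.

```python
# Toy step count: sequential coordinate decoding vs. parallel box decoding.
# All names and defaults here are illustrative assumptions, not the model's API.

COORDS_PER_BOX = 4  # x1, y1, x2, y2


def sequential_steps(num_boxes: int, tokens_per_coordinate: int = 1) -> int:
    """Steps when every coordinate is emitted token by token.

    Delimiter and formatting tokens are ignored, so this is a lower bound
    for a text-based decoder.
    """
    return num_boxes * COORDS_PER_BOX * tokens_per_coordinate


def parallel_steps(num_boxes: int) -> int:
    """Steps when each box is emitted as one atomic unit (assumed: one step per box)."""
    return num_boxes


if __name__ == "__main__":
    for n in (1, 20, 200):
        print(n, "boxes:", sequential_steps(n), "sequential steps vs",
              parallel_steps(n), "parallel steps")
```

The step ratio is not the same thing as the measured speedup, which also depends on per-step cost, hardware, and how many tokens a text-based model really spends per coordinate.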

How fast is LocateAnything compared to other models?

LocateAnything achieves a throughput of 12.7 boxes per second on a single NVIDIA H100 GPU. This makes it approximately 10 times faster than text-based models like Qwen3-VL and 2.5 times faster than quantized coordinate models like Rex-Omni. The speed gains are most significant in dense scenes where the model must identify hundreds of objects at once.
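The ratios quoted above imply rough baseline throughputs, which the short calculation below works out. The baseline figures are derived only from the stated 12.7 boxes per second and the approximate 10x and 2.5x speedups; they are not measurements reported in this article, and the helper names are made up for illustration.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# Derived baselines are approximations, not reported benchmark results.

LOCATE_ANYTHING_BPS = 12.7    # boxes per second, single NVIDIA H100 (from the article)
SPEEDUP_OVER_QWEN3_VL = 10.0  # "approximately 10 times faster"
SPEEDUP_OVER_REX_OMNI = 2.5   # "2.5 times faster"


def implied_baseline_bps(bps: float, speedup: float) -> float:
    """Throughput a baseline would need for the stated speedup to hold."""
    return bps / speedup


def seconds_for_boxes(num_boxes: int, bps: float) -> float:
    """Time to emit num_boxes at a constant throughput (a simplifying assumption)."""
    return num_boxes / bps


if __name__ == "__main__":
    qwen = implied_baseline_bps(LOCATE_ANYTHING_BPS, SPEEDUP_OVER_QWEN3_VL)
    rex = implied_baseline_bps(LOCATE_ANYTHING_BPS, SPEEDUP_OVER_REX_OMNI)
    print(f"Implied Qwen3-VL throughput: about {qwen:.2f} boxes/s")
    print(f"Implied Rex-Omni throughput: about {rex:.2f} boxes/s")
    print(f"100 boxes at 12.7 boxes/s: about "
          f"{seconds_for_boxes(100, LOCATE_ANYTHING_BPS):.1f} s")
```

Constant throughput is assumed for simplicity; since the gains are reported to be largest in dense scenes, real timings will vary with scene content.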

What is the LocateAnything Hybrid Inference Mode?

Hybrid Mode is a flexible inference strategy that balances speed and reliability. It uses Fast Mode by default to predict bounding boxes in parallel for maximum throughput. If the model detects spatial ambiguity or malformed syntax, it automatically falls back to a slower, token-by-token decoding method to ensure the final coordinate predictions remain accurate and robust.
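The fast-path-with-fallback pattern can be sketched as follows. This is an illustration under stated assumptions, not the model's actual logic: the article does not say how ambiguity or malformed output is detected, so `is_well_formed` is a hypothetical stand-in check, and `fast_decode` and `slow_decode` are placeholder callables.

```python
# Sketch of a hybrid decode: try the fast parallel path, fall back if the
# result looks invalid. The validity check is an assumed stand-in.
from typing import Callable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def is_well_formed(box: Optional[Box], width: float, height: float) -> bool:
    """Hypothetical check: four values, inside the image, positive area."""
    if box is None or len(box) != 4:
        return False
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def hybrid_decode(
    fast_decode: Callable[[], Optional[Box]],
    slow_decode: Callable[[], Box],
    width: float,
    height: float,
) -> Tuple[Box, str]:
    """Return the box plus which path produced it ('fast' or 'fallback')."""
    box = fast_decode()
    if is_well_formed(box, width, height):
        return box, "fast"
    # Fast result rejected: use the slower token-by-token path instead.
    return slow_decode(), "fallback"


if __name__ == "__main__":
    good = hybrid_decode(lambda: (10, 20, 110, 220), lambda: (0, 0, 1, 1), 640, 480)
    bad = hybrid_decode(lambda: (300, 50, 100, 40), lambda: (100, 40, 300, 50), 640, 480)
    print(good)
    print(bad)
```

Keeping the check cheap matters in a design like this, because it runs on every box and any cost it adds comes straight out of the fast path's throughput advantage.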

What data was used to train LocateAnything?

The model was trained on LocateAnything-Data, a massive corpus containing 138 million language queries and 785 million bounding boxes. This dataset covers a diverse range of tasks, including general object detection, graphical user interface grounding, optical character recognition, and referring expression comprehension, where the model must link complex human intent to specific image regions.

Keep reading

NVIDIA Releases Nemotron-Labs-Diffusion for 6x Faster Parallel Token Generation

NVIDIA released Nemotron-Labs-Diffusion, a family of open-weight models that unify standard autoregressive decoding with parallel diffusion-based generation. By switching attention patterns within a single model, these 3B to 14B parameter models achieve up to 4x higher throughput on modern hardware compared to traditional sequential generation.

LangChain Adds NVIDIA Nemotron 3 Ultra for Faster AI Agents

LangChain · Jun 7

LangChain announced immediate support for NVIDIA Nemotron 3 Ultra, an open frontier model designed for long-running AI agents. This integration makes the model's 5x faster inference and up to 30% lower cost for complex agentic tasks directly available to developers using the LangChain framework.
