ElevenLabs Adds Multimodal Input to ElevenAgents for End-to-End Task Resolution

ElevenLabs

May 7, 2026 · Updated May 15, 2026

ElevenLabs expanded ElevenAgents to process images, PDFs, audio notes, and locations across WhatsApp and web widgets. By maintaining context across channels, these agents can now handle complex workflows like insurance claims or banking onboarding without human handoffs.

ElevenLabs, an AI platform for voice synthesis and conversational agents, added multimodal support to ElevenAgents, meaning agents can take in more than one type of input. They can now process images, PDFs, audio messages, and location pins. This lets agents read documents directly within a conversation instead of being limited to voice and text.

New input modalities: Images, PDFs, Audio notes, and more
Supported channels: WhatsApp, Web Widget, In-app, and more
Context management: Cross-channel persistence
Availability: Available now
Integration options: WhatsApp docs and widget docs

This update addresses the handoff bottleneck in which AI agents needed a human to verify documents. By combining the new input types with ElevenLabs' business workflow templates, companies can automate end-to-end processes in which proof of address or medical records are mandatory, such as the banking support workflows ElevenLabs recently deployed.

You can deploy these capabilities now through the ElevenAgents dashboard for web widgets and WhatsApp. The system preserves context across channels, enabling an agent to start a voice call and transition to WhatsApp to process a signed PDF. These features are available to all users currently building with the platform.
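The cross-channel behavior described above can be pictured as a single thread per customer that every channel appends to. The following is a minimal sketch of that idea only, not ElevenLabs' implementation or API: `Event`, `Conversation`, `ConversationStore`, and the channel and kind labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    channel: str   # hypothetical label, e.g. "voice" or "whatsapp"
    kind: str      # hypothetical label, e.g. "speech", "text", "pdf"
    content: str   # transcript text or a file reference


@dataclass
class Conversation:
    customer_id: str
    events: list[Event] = field(default_factory=list)

    def channels(self) -> list[str]:
        """Channels used so far, in order of first appearance."""
        seen: list[str] = []
        for event in self.events:
            if event.channel not in seen:
                seen.append(event.channel)
        return seen


class ConversationStore:
    """Keeps one thread per customer, regardless of channel."""

    def __init__(self) -> None:
        self._threads: dict[str, Conversation] = {}

    def record(self, customer_id: str, event: Event) -> Conversation:
        # Reuse the existing thread if there is one, so history carries over.
        thread = self._threads.setdefault(customer_id, Conversation(customer_id))
        thread.events.append(event)
        return thread


store = ConversationStore()
store.record("cust-1", Event("voice", "speech", "I need to update my address."))
thread = store.record("cust-1", Event("whatsapp", "pdf", "signed_form.pdf"))

print(thread.channels())   # ['voice', 'whatsapp']
print(len(thread.events))  # 2
```

Keying the thread on the customer rather than on the channel is what lets the signed PDF arrive on WhatsApp with the voice call's history still attached.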

View the full update on elevenlabs.io

ElevenLabs (@ElevenLabs) · May 6

Introducing new modalities for ElevenAgents

Your customers don't just talk or type. They send photos, files, voice notes, and locations, and reach out across channels. Now your agents handle all of it. https://t.co/dx4B4GchPu

Still wondering? A few quick answers below.

What are the new modalities for ElevenAgents?

ElevenAgents now supports multimodal inputs, meaning the AI can process more than just voice and text. The agents can now understand and interact with images, PDF files, audio notes, contact cards, and location pins. This allows the agents to resolve complex customer requests that require visual evidence or document verification without needing human assistance.

How does ElevenAgents handle cross-channel conversations?

ElevenAgents maintains full context across different communication channels and formats. A single interaction can transition between a voice call and a messaging platform like WhatsApp without the agent losing the history of the conversation. This allows an agent to send a document mid-call and process the customer's response in one continuous, unified thread.

Which platforms support the new ElevenAgents multimodal features?

The new multimodal capabilities are primarily available through the ElevenAgents web widget and WhatsApp integration. On WhatsApp, agents can process images, files, audio, contacts, and locations. In the web widget, users can directly upload images and PDFs into the chat interface, allowing the agent to analyze documents alongside voice and text interactions.
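The channel breakdown above can be written down as a small lookup table, which is useful when deciding where to ask a customer to send a given attachment. This is an illustrative sketch only: the keys, labels, and the helper functions `accepts` and `channels_for` are hypothetical names, not identifiers from ElevenLabs, and the table covers only the attachment types the article lists (the widget's existing voice and text input is left out).

```python
# Input types per channel, transcribed from the article's description.
# Keys and labels are illustrative, not identifiers from ElevenLabs.
CHANNEL_INPUTS: dict[str, frozenset[str]] = {
    "whatsapp": frozenset({"image", "file", "audio", "contact", "location"}),
    "web_widget": frozenset({"image", "pdf"}),
}


def accepts(channel: str, input_type: str) -> bool:
    """Return True if the article lists this input type for the channel."""
    return input_type in CHANNEL_INPUTS.get(channel, frozenset())


def channels_for(input_type: str) -> list[str]:
    """List the channels the article says handle a given input type."""
    return sorted(c for c, kinds in CHANNEL_INPUTS.items() if input_type in kinds)


print(accepts("whatsapp", "location"))    # True
print(accepts("web_widget", "location"))  # False
print(channels_for("image"))              # ['web_widget', 'whatsapp']
```

The article describes WhatsApp attachments as "files" and widget uploads as "PDFs", so the two are kept as separate labels here rather than assumed to be equivalent.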

What specific file types can ElevenAgents process?

ElevenAgents is now capable of processing a variety of non-text formats to complete end-to-end tasks. This includes visual data like images and photos, document formats such as PDFs, and communication-specific data like audio voice notes, contact information, and location pins. These capabilities enable agents to handle specialized workflows like insurance claims or medical record reviews.

Who can access the new multimodal features in ElevenAgents?

These new multimodal features are available now for all users of the ElevenAgents platform. Developers and businesses can implement these capabilities on top of their existing voice and chat experiences across web, in-app, and telephony channels. Detailed documentation is provided for both the WhatsApp integration and the customizable web widget to help users get started.

Every HeadsUpAI update is written based on its original source and reviewed before it's published. Read our editorial standards →

Keep reading

ElevenLabs Launches 50 Agent Templates to Automate Business Workflows

ElevenLabs released a library of 50+ pre-configured templates for ElevenAgents, covering sales, support, and internal operations. These turnkey blueprints allow teams to deploy voice-first agents with predefined prompts and integrations rather than building agent logic from scratch.

ElevenCreative · Mar 18

ElevenLabs Launches Flows to Chain Image Video and Audio Models

ElevenLabs launched Flows, a node-based canvas inside ElevenCreative for chaining 35+ image, video, voice, and music models into reusable creative pipelines. Batch-execute a flow with swapped inputs — different products, avatars, or voices — to produce campaign variants at scale.

Weaviate AI Database · Apr 8

Weaviate Adds PDF Support to Agent Skills for Autonomous Document Ingestion

Weaviate added PDF import capabilities to its Agent Skills framework, allowing AI agents to autonomously configure schemas and ingest document libraries. By combining multimodal embeddings with the MUVERA algorithm, the system enables high-accuracy multi-vector retrieval without the typical memory and cost overhead.

Google DeepMind · May 1

Google Previews AI co-clinician Agents With Real Time Multimodal Senses

Google announced the AI co-clinician research initiative, a system of multimodal agents designed to assist doctors and patients through real-time audio and video. By moving beyond text-based chat to eyes, ears, and a voice, the system can guide physical exams and medication reasoning.
