HeadsUpAI

ElevenLabs Adds Multimodal Input to ElevenAgents for End-to-End Task Resolution


ElevenLabs, an AI platform for voice synthesis and conversational agents, has added multimodal support to ElevenAgents, meaning agents can handle more input types than voice and text. Agents can now process images, PDFs, audio messages, and location pins, so they can view images and read documents directly within a conversation.
New input modalities: Images, PDFs, audio notes, and more
Supported channels: WhatsApp, web widget, in-app, and more
Context management: Cross-channel persistence
Availability: Available now
Integration options: WhatsApp docs and widget docs

This update addresses the handoff bottleneck in which AI agents needed a human to step in and verify documents. With these input types available in ElevenLabs' business workflow templates, companies can automate end-to-end workflows that cannot complete without proof of address or medical records, such as the recently deployed ElevenLabs banking support workflows.

You can deploy these capabilities now through the ElevenAgents dashboard for web widgets and WhatsApp. The system preserves context across channels, enabling an agent to start a voice call and transition to WhatsApp to process a signed PDF. These features are available to all users currently building with the platform.

ElevenLabs (@ElevenLabs) on X:

"Introducing new modalities for ElevenAgents. Your customers don't just talk or type. They send photos, files, voice notes, and locations, and reach out across channels. Now your agents handle all of it. https://t.co/dx4B4GchPu"


Still wondering? A few quick answers below.

ElevenAgents now supports multimodal inputs, meaning the AI can process more than just voice and text. The agents can now understand and interact with images, PDF files, audio notes, contact cards, and location pins. This allows the agents to resolve complex customer requests that require visual evidence or document verification without needing human assistance.

ElevenAgents maintains full context across different communication channels and formats. A single interaction can transition between a voice call and a messaging platform like WhatsApp without the agent losing the history of the conversation. This allows an agent to send a document mid-call and process the customer's response in one continuous, unified thread.
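To make the idea of one thread spanning several channels concrete, here is a minimal sketch of how such a history could be modeled. It is an illustration only and does not use or represent the ElevenAgents API: the `Event` and `Thread` classes, their fields, the customer ID, and the sample messages are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One customer or agent message (hypothetical model, not an ElevenLabs type)."""
    channel: str  # e.g. "voice" or "whatsapp"
    kind: str     # e.g. "speech", "pdf", "image"
    content: str  # transcript text or a file reference


@dataclass
class Thread:
    """A single conversation history shared by every channel."""
    customer_id: str
    events: List[Event] = field(default_factory=list)

    def add(self, channel: str, kind: str, content: str) -> None:
        # Every channel appends to the same list, so no history is lost
        # when the conversation moves from one channel to another.
        self.events.append(Event(channel, kind, content))

    def history(self) -> List[str]:
        # Render the full history in arrival order.
        return [f"[{e.channel}/{e.kind}] {e.content}" for e in self.events]

    def channels(self) -> List[str]:
        # Channels in the order they first appeared in the thread.
        seen: List[str] = []
        for e in self.events:
            if e.channel not in seen:
                seen.append(e.channel)
        return seen


# Hypothetical example: a call that continues on WhatsApp.
thread = Thread("customer-123")
thread.add("voice", "speech", "I need to update my address.")
thread.add("whatsapp", "pdf", "signed_form.pdf")
```

Keying the history by customer rather than by channel is the design choice that lets a later message, such as the signed PDF here, be read in the context of the earlier call.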

The new multimodal capabilities are primarily available through the ElevenAgents web widget and WhatsApp integration. On WhatsApp, agents can process images, files, audio, contacts, and locations. In the web widget, users can directly upload images and PDFs into the chat interface, allowing the agent to analyze documents alongside voice and text interactions.
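The channel breakdown above can be captured as a simple lookup table. The sketch below is a hedged illustration rather than an ElevenLabs configuration format: the dictionary keys, type labels, and the `can_handle` helper are hypothetical names, and only the channel-to-input pairings come from the description above.

```python
# Upload types per channel, as described in the text above.
# Key and label names are illustrative assumptions.
SUPPORTED_INPUTS = {
    "whatsapp": {"image", "file", "audio", "contact", "location"},
    "web_widget": {"image", "pdf"},
}


def can_handle(channel: str, input_type: str) -> bool:
    """Return True if the given channel is listed as accepting this input type."""
    # Unknown channels fall back to an empty set, so they accept nothing.
    return input_type in SUPPORTED_INPUTS.get(channel, set())
```

A check like `can_handle("web_widget", "location")` returns `False` under this table, which is the kind of guard an integration might use before asking a customer to share a location pin on a channel that does not list it.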

ElevenAgents can now process a variety of non-text formats to complete end-to-end tasks. These include visual data such as images and photos, document formats such as PDFs, and messaging-specific data such as voice notes, contact information, and location pins. These capabilities enable agents to handle specialized workflows like insurance claims or medical record reviews.

These new multimodal features are available now for all users of the ElevenAgents platform. Developers and businesses can implement these capabilities on top of their existing voice and chat experiences across web, in-app, and telephony channels. Detailed documentation is provided for both the WhatsApp integration and the customizable web widget to help users get started.
