3 new models from @xai's Grok creative stack are live on OpenRouter:
• Grok Imagine Image Quality: photoreal image generation and editing
• Grok Imagine Video: short clips from text, image, or reference
• Grok Voice TTS 1.0: 5 voices across 20+ languages
More on each below 🧵
OpenRouter Adds xAI Creative Stack for Unified Video and Voice Generation
OpenRouter, a platform providing a unified API for accessing hundreds of models, integrated the xAI creative stack. The suite includes Grok Imagine Image Quality for photorealistic visuals, Grok Imagine Video for short clips, and Grok Voice TTS 1.0 for text-to-speech. This launch expands the platform's multimodal capabilities.
- Video length: 1 to 15 seconds
- Video resolution: 480p or 720p
- Voice options: 5 built-in voices
- Language support: 20+ languages
- Video pricing: From $0.05 per second
- Voice pricing: $15 per million characters
This integration follows the launch of OpenRouter's unified video generation API, signaling a shift toward standardized access for generative media. By hosting these models alongside Grok 4.3's reasoning capabilities, the platform enables developers to build end-to-end creative agents that can reason about a task and execute final asset production.
You can now programmatically generate videos of up to 15 seconds at 720p from text or from up to seven reference images for character consistency. The speech model supports five voices across more than 20 languages, with inline tags for pitch and pacing. Access is available via API, with video starting at $0.05 per second and voice at $15 per million characters.
OpenRouter
@OpenRouter
Still wondering? A few quick answers below.
The Grok creative stack is a suite of multimodal models from xAI now available through the OpenRouter API. It includes Grok Imagine Image Quality for photorealistic image generation, Grok Imagine Video for creating short video clips, and Grok Voice TTS 1.0 for high-quality text-to-speech audio generation across multiple languages.
Grok Imagine Video generates clips between 1 and 15 seconds at 24 frames per second. It supports three distinct modes: text-to-video from prompts, image-to-video to animate still inputs, and reference-to-video. The reference mode allows users to ground the output in up to seven images to maintain consistent characters, styles, and settings.
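As a rough illustration of these constraints, the Python sketch below checks a video job locally before anything is sent to an API. The `VideoJob` class, its field names, and the mode strings are hypothetical and are not OpenRouter's or xAI's request schema. Only the limits (1 to 15 seconds, 24 frames per second, up to seven reference images) come from the description above, and the rule that image-to-video takes exactly one still is an assumption.

```python
from dataclasses import dataclass, field

# Limits taken from the article text.
MIN_SECONDS = 1
MAX_SECONDS = 15
FRAMES_PER_SECOND = 24
MAX_REFERENCE_IMAGES = 7

# Hypothetical mode labels; a real API may name these differently.
MODES = ("text-to-video", "image-to-video", "reference-to-video")


@dataclass
class VideoJob:
    """Hypothetical local description of a video request (not a real schema)."""

    mode: str
    prompt: str = ""
    images: list[str] = field(default_factory=list)
    seconds: int = 5

    def validate(self) -> None:
        """Raise ValueError if the job breaks one of the documented limits."""
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if not MIN_SECONDS <= self.seconds <= MAX_SECONDS:
            raise ValueError("duration must be between 1 and 15 seconds")
        if self.mode == "text-to-video" and not self.prompt:
            raise ValueError("text-to-video needs a prompt")
        # Assumption: image-to-video animates exactly one still image.
        if self.mode == "image-to-video" and len(self.images) != 1:
            raise ValueError("image-to-video expects one input image")
        if self.mode == "reference-to-video":
            if not 1 <= len(self.images) <= MAX_REFERENCE_IMAGES:
                raise ValueError("reference-to-video takes 1 to 7 images")

    def frame_count(self) -> int:
        """Number of frames in the clip at 24 frames per second."""
        return self.seconds * FRAMES_PER_SECOND


job = VideoJob(
    mode="reference-to-video",
    prompt="the same character walking through a market",
    images=[f"ref_{i}.png" for i in range(3)],
    seconds=10,
)
job.validate()
print(job.frame_count())
```

Validating on the client side like this catches out-of-range durations or too many reference images before a billable request is made.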
Grok Voice TTS 1.0 is a text-to-speech model that supports over 20 languages with automatic language detection. It features five distinct voices named Eve, Ara, Rex, Sal, and Leo. Developers can use inline speech tags to precisely control audio parameters such as pitch, speed, emphasis, pauses, and overall vocal style.
Pricing for the xAI creative suite varies by modality. Grok Imagine Video starts at $0.05 per second for 480p resolution and $0.07 per second for 720p. Grok Voice TTS 1.0 is priced at $15 per million characters. Grok Imagine Image Quality charges $0.002 per image input for reference tasks.
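To make the per-second and per-character rates concrete, here is a small Python estimator. It is only a sketch: the function names are illustrative, the rates are the ones quoted above, and counting characters with `len()` is an assumption, since the article does not say how billable characters are measured.

```python
from decimal import Decimal

# Rates quoted in the article, in US dollars.
VIDEO_RATE_PER_SECOND = {"480p": Decimal("0.05"), "720p": Decimal("0.07")}
TTS_RATE_PER_MILLION_CHARS = Decimal("15")


def video_cost(seconds: int, resolution: str) -> Decimal:
    """Estimated cost of one clip: seconds times the per-second rate."""
    if resolution not in VIDEO_RATE_PER_SECOND:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if not 1 <= seconds <= 15:
        raise ValueError("clips run from 1 to 15 seconds")
    return VIDEO_RATE_PER_SECOND[resolution] * seconds


def tts_cost(text: str) -> Decimal:
    """Estimated cost of synthesizing `text`.

    Assumption: billable characters equal len(text); the real
    accounting may differ.
    """
    return TTS_RATE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)


print(video_cost(15, "720p"))
print(tts_cost("Hello, world! " * 100))
```

By this arithmetic a maximum-length 15-second clip at 720p comes to $1.05 (15 × $0.07), and 2,000 characters of speech come to $0.03. `Decimal` is used instead of floats so the cents do not pick up binary rounding noise.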
The model produces video at 480p or 720p resolutions across seven different aspect ratios, including 1:1, 16:9, 9:16, 4:3, 3:4, 3:2, and 2:3. This flexibility allows developers to generate content optimized for various platforms, from traditional widescreen displays to vertical social media formats, all through a single API integration.
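One practical consequence is that source material in an arbitrary size has to be mapped onto one of the seven ratios. The helper below is a hypothetical Python sketch, not part of any OpenRouter or xAI SDK, that picks the closest supported ratio for a given width and height. It compares ratios on a logarithmic scale, a design choice made here so that landscape and portrait shapes are treated symmetrically.

```python
import math

# The seven aspect ratios listed in the article.
SUPPORTED_RATIOS = ("1:1", "16:9", "9:16", "4:3", "3:4", "3:2", "2:3")


def closest_aspect_ratio(width: int, height: int) -> str:
    """Return the supported ratio label nearest to width:height."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    target = math.log(width / height)

    def distance(label: str) -> float:
        w, h = label.split(":")
        # Distance in log space, so 2:1 and 1:2 are equally far from 1:1.
        return abs(math.log(int(w) / int(h)) - target)

    return min(SUPPORTED_RATIOS, key=distance)


for size in [(1920, 1080), (1080, 1920), (1000, 1000), (1200, 800)]:
    print(size, closest_aspect_ratio(*size))
```

A 1920×1080 source maps to 16:9 and a 1080×1920 source to 9:16, matching the widescreen and vertical cases mentioned above.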


