Qwen 3.5 Series Releases GPTQ-Int4 Weights for Limited-GPU Inference

Qwen

Mar 3, 2026 · Updated Apr 25, 2026

Alibaba's open-weight Qwen 3.5 Series now has GPTQ-Int4 quantized weights for its larger model sizes. The checkpoints are supported natively by vLLM and SGLang, and the 4-bit format cuts VRAM needs so teams can run larger Qwen 3.5 models on constrained GPU setups.

The GPTQ-Int4 release covers models from 35B to 397B parameters. The weights ship with native support for vLLM and SGLang, two popular open-source inference engines, so they run without custom serving infrastructure. GPTQ-Int4 stores model weights as 4-bit integers, which takes roughly a quarter of the space of 16-bit weights, before counting the quantization scales and any layers left unquantized.
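To make the 4-bit packing idea concrete, here is a minimal sketch in plain Python that packs eight 4-bit values into one 32-bit word and unpacks them again. The function names are hypothetical, and the nibble ordering is an assumption chosen for illustration. The on-disk layout of the released checkpoints (ordering, group scales, zero points) is not described in the announcement and may differ.

```python
# Illustrative only: shows how eight 4-bit integers fit in one 32-bit word.
# The nibble order used here is an assumption, not the checkpoint format.

def pack_int4(values):
    """Pack eight unsigned 4-bit integers (0..15) into a single 32-bit word."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError(f"value {v} does not fit in 4 bits")
        word |= v << (4 * i)  # value i occupies bits 4*i .. 4*i+3
    return word


def unpack_int4(word):
    """Recover the eight 4-bit integers from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


packed = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])
print(hex(packed))
print(unpack_int4(packed))
```

Real quantized checkpoints also store per-group scale factors so the integers can be mapped back to approximate floating-point weights. That step is omitted here.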

This brings the larger Qwen 3.5 models within reach of teams that don't have access to high-end GPU clusters. Running a 35B-parameter model, for instance, becomes feasible on hardware that previously couldn't accommodate it, which expands who can self-host frontier-class open-weight models.
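A back-of-the-envelope estimate shows why the 4-bit format matters for the model sizes mentioned above. The sketch below counts weight storage only and assumes a 16-bit baseline. It ignores the KV cache, activations, quantization scales, and any layers kept at higher precision, so real VRAM use will be higher. The helper name is hypothetical.

```python
# Rough weight-only memory estimate. Assumes every parameter is stored at the
# given bit width, which is a simplification of any real checkpoint.

def weight_memory_gb(num_params, bits_per_weight):
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


for name, params in [("35B", 35e9), ("397B", 397e9)]:
    bf16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{bf16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Under these assumptions, a 35B-parameter model needs about 70 GB for 16-bit weights and about 17.5 GB at 4 bits.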

Grab the weights from Hugging Face or ModelScope. Both vLLM and SGLang users can load the GPTQ-Int4 checkpoints with the same configuration they use for standard models.
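As a rough illustration of what "the same configuration" could look like, the sketch below builds the launch commands for each engine without running them. The model ID is a hypothetical placeholder, not a real repository name, and the helper functions are invented for this example. The `vllm serve` and `python -m sglang.launch_server` entry points and the flags shown exist in recent releases of those projects, but check the documentation for your installed version.

```python
import shlex

# Hypothetical placeholder: substitute the actual checkpoint name from
# Hugging Face or ModelScope.
MODEL_ID = "your-org/your-gptq-int4-checkpoint"


def vllm_command(model_id, tensor_parallel=1, port=8000):
    """Build a `vllm serve` command line as a list of arguments."""
    return [
        "vllm", "serve", model_id,
        "--tensor-parallel-size", str(tensor_parallel),
        "--port", str(port),
    ]


def sglang_command(model_id, tensor_parallel=1, port=30000):
    """Build an SGLang server launch command as a list of arguments."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_id,
        "--tp-size", str(tensor_parallel),
        "--port", str(port),
    ]


# Print the commands so they can be reviewed before being run in a shell.
print(shlex.join(vllm_command(MODEL_ID, tensor_parallel=2)))
print(shlex.join(sglang_command(MODEL_ID, tensor_parallel=2)))
```

Neither command passes an explicit quantization flag, on the assumption that the engine reads the quantization settings from the checkpoint's own config.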


Qwen

@Alibaba_Qwen · Mar 3

🔥 Qwen 3.5 Series GPTQ-Int4 weights are live. Native vLLM & SGLang support. ⚡️ Less VRAM. Faster inference. Run powerful models on limited-GPU setups. 👇 Grab the weights + example code: Hugging Face: https://t.co/3MSb7miq68 ModelScope: https://t.co/LGHruBHP6Q

