Details
- Qwen has adapted its new multimodal model, Qwen3-VL, to run on the open-source llama.cpp runtime as of November 1, 2025.
- The model can be executed locally on CPUs or GPUs using CUDA, Apple Metal, Vulkan, OpenCL, and more, removing dependence on cloud computing.
- GGUF weight files are available for every model size—2B, 7B, 14B, 70B, 110B, and the high-end 235B—with 8-, 4-, and 2-bit quantization options.
- The release bundle includes sample commands and configuration files, making it easy for developers to fine-tune or deploy the model directly on laptops, smartphones, or edge devices.
- Weights are shared under Qwen’s permissive open-source license, which permits commercial use as long as its conditions are met, consistent with the earlier Qwen2 and Qwen-VL releases.
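When choosing among the 8-, 4-, and 2-bit quantization options above, a useful rule of thumb is that on-disk (and roughly in-memory) weight size scales with parameter count times bits per weight. A minimal sketch of that estimate—approximate only, since it ignores per-block scale overhead and any layers kept at higher precision:

```python
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough quantized-weight size in GB (1 GB = 1e9 bytes).

    Ignores GGUF metadata, per-block quantization scales, and
    non-quantized tensors, so real files run slightly larger.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Estimate footprints for a couple of the listed model sizes.
for params in (7, 70):
    for bits in (8, 4, 2):
        size = approx_gguf_size_gb(params, bits)
        print(f"{params}B @ {bits}-bit ≈ {size:.1f} GB")
```

By this estimate a 7B model at 4-bit fits in roughly 3.5 GB, which is why mid-size quantized variants are practical on laptops and even phones, while the 235B flagship remains workstation-class hardware territory at any bit width.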
Impact
By joining the llama.cpp ecosystem, Qwen3-VL raises the competitive bar against popular open multimodal models like LLaVA-Next and Ferret. Its support for efficient local deployment cuts latency and cost while appealing to privacy-sensitive industries subject to data-sovereignty regulations. The move is likely to accelerate investment and experimentation in real-time multimodal AI, especially for AR and robotics, as developers gain easier, more affordable access.
