Magistral 1.2: The 24B Reasoning Model You Can Run Locally
A guide to deploying Mistral’s 24B multimodal reasoning LLM locally using vLLM — open-source, fast, and built for deep thinking.
How to run Magistral 1.2 on a Local GPU
Earlier this summer, Mistral AI dropped their groundbreaking Magistral-Small-2506 model. This wasn’t just another checkpoint: it was one of the first usable open-weight “reasoning” models that could reasonably run on a local NVIDIA GPU, great for agentic tool use, chain-of-thought (CoT) reasoning, and more, all running on your local AI PC.
What’s new in Magistral 1.2?
Some would consider this a “minor” upgrade, but in reality 1.2 is what 1.1 was intended to be at its release back in June. The official release name for the Magistral 1.2 model we’re covering today on local GPUs is Magistral-Small-2509.
Multimodality:
Now equipped with a vision encoder, these models handle both text and images seamlessly. This is incredible: previously we thought of Magistral as just a new approach to CoT or “thinking” models, not a fully featured vision-language model (VLM).
Performance Boost:
15% improvements on math and coding benchmarks such as AIME 24/25 and LiveCodeBench v5/v6. A 15% jump should merit a full release of its own; bumping the version from 1.1 to 1.2 seems like an understatement. This model is so good that its “medium” variant (Magistral Medium 2509) is already the new default for
How to Run Locally with NVIDIA GPUs
The full open-weights Magistral 1.2 model is quite large; however, our friends over at Unsloth AI have already released same-day quants that enable smaller GPUs to run this model locally right now! (It’s almost like Mistral gave them early access or something…)
Unfortunately, the smallest quants available today (day of launch) that retain good performance are quantizations of Magistral 1.2 Small. The model still excels at coding and math, and the entire quantized 24B model can fit in 32GB of VRAM.
This means the following GPUs are already a great option to run this model right now:
NVIDIA RTX 5090 32GB - details
NVIDIA V100 32GB SXM2 (see modded build on llamabuilds.ai)
NVIDIA RTX 4090D - details
2x NVIDIA RTX 3090 with NVLink bridge (see build on llamabuilds.ai)
You can also use a handful of GPUs with less than 32GB of VRAM each if you already have vLLM set up for multi-GPU inference (see the --tensor-parallel-size flag in the serve command below), but your performance may vary when spreading inference across multiple GPUs. New models sometimes don’t behave well with early quants.
If you don’t have NVIDIA GPUs, this model also fits nicely on 32GB Macs! Hopefully we’ll soon see a dedicated MLX port!
Inference
vllm (recommended): See below
transformers: See below
llama.cpp: See https://huggingface.co/mistralai/Magistral-Small-2509-GGUF (a quick serving sketch follows this list)
Unsloth GGUFs: See https://huggingface.co/unsloth/Magistral-Small-2509-GGUF
Kaggle: See https://www.kaggle.com/models/mistral-ai/magistral-small-2509
LM Studio: See https://lmstudio.ai/models/mistralai/magistral-small-2509
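If you go the GGUF route, a minimal llama.cpp sketch looks like the following. The repo comes from the Unsloth link above; the Q4_K_M quant tag, context size, and GPU layer count are assumptions to adjust for your hardware, and a recent llama.cpp build is assumed so the -hf download shortcut is available.

# Download a quantized GGUF from Hugging Face and serve it locally
# (Q4_K_M and the 16k context are assumptions - pick the quant that fits your VRAM)
llama-server -hf unsloth/Magistral-Small-2509-GGUF:Q4_K_M \
  --ctx-size 16384 \
  --n-gpu-layers 99 \
  --temp 0.7 --top-p 0.95 \
  --port 8080

llama-server exposes the same OpenAI-compatible /v1/chat/completions endpoint as the vLLM command further down, so the client example at the end of this guide should work against either server.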
Fine-tuning
Axolotl: See https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples/magistral
Unsloth: See https://docs.unsloth.ai/models/tutorials-how-to-fine-tune-and-run-llms/magistral-how-to-run-and-fine-tune
How to run Magistral 1.2 Small with vLLM (recommended)
We recommend using this model with the vLLM library to implement production-ready inference pipelines.
Installation (full guide here)
Make sure you install the latest vLLM code:
pip install --upgrade vllm
Doing so should automatically install mistral_common >= 1.8.5.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
You can also make use of a ready-to-go Docker image on Docker Hub.
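If you’d rather skip the local Python environment entirely, the upstream vLLM OpenAI-compatible image can run the same server. A minimal sketch, assuming the vllm/vllm-openai:latest image and the default Hugging Face cache location (adjust both to your setup):

# Serve Magistral from the upstream vLLM Docker image (OpenAI-compatible API on port 8000)
# (image tag and cache mount path are assumptions - adjust to your environment)
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Magistral-Small-2509 \
  --tokenizer_mode mistral --config_format mistral --load_format mistral \
  --reasoning-parser mistral --tool-call-parser mistral --enable-auto-tool-choice

The arguments after the image name mirror the native vllm serve command below.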
Or, with the pip install above, serve the model natively as follows:
vllm serve mistralai/Magistral-Small-2509 \
--reasoning-parser mistral \
--tokenizer_mode mistral --config_format mistral \
--load_format mistral --tool-call-parser mistral \
--enable-auto-tool-choice --limit-mm-per-prompt '{"image":10}' \
--tensor-parallel-size 2
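Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch with curl, assuming vLLM’s default port 8000 and the sampling settings Mistral recommends for Magistral (temperature 0.7, top_p 0.95); the prompt is just a placeholder.

# Ask the model a question through the OpenAI-compatible chat endpoint
# (localhost:8000 is the vLLM default; change it if you serve on another port)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Magistral-Small-2509",
    "messages": [
      {"role": "user", "content": "A train leaves at 9:12 and arrives at 11:47. How long was the trip? Explain your reasoning."}
    ],
    "temperature": 0.7,
    "top_p": 0.95
  }'

Because the server was started with --reasoning-parser mistral, the response should separate the model’s thinking trace (reasoning_content) from the final answer, which makes it easy to collapse or hide the chain of thought in a UI.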
Conclusion:
Magistral 1.2 (Magistral-Small-2509) proves that size isn’t everything; purposeful fine-tuning can deliver big wins in performance and capability. With Mistral’s latest training pipeline, an Apache 2.0 license, and performance-focused deployment options like vLLM, it’s not just a research release, it’s an invitation to build.
Whether you’re a dev deploying AI locally, a startup testing multilingual reasoning, or an academic exploring chain-of-thought architectures, this model belongs in your stack.
Recommended Builds:
Hardigg Dual 3090
see full build details here