Running AI Models Locally with Apple’s MLX
If you want to run an AI model on your own machine without using llama.cpp or Ollama, Apple’s free, open-source MLX framework is a powerful alternative. In this guide, we will walk through running Qwen2.5-Math-1.5B-Instruct, an open-source 1.5-billion-parameter math model, locally on a MacBook Pro without modifying its weights.
What is MLX?
MLX is Apple’s open-source machine learning framework, designed to run AI models efficiently on Apple Silicon (M1/M2/M3 chips and later). MLX provides a lightweight, NumPy-like interface that enables high-performance machine learning on macOS devices.
With MLX, you can:
- Run AI models locally without needing cloud-based inference.
- Leverage Apple’s Metal API for optimized GPU acceleration.
- Fine-tune and deploy models without complex configurations.
Running the Qwen2.5-Math-1.5B-Instruct Model on MLX
To demonstrate MLX’s capabilities, we will run Qwen2.5-Math-1.5B-Instruct, an open-source model specialized for math-related tasks. The model can be downloaded directly from Hugging Face, a popular platform for sharing, training, and deploying machine learning models, and executed locally on your MacBook Pro.
Installation Steps
1. Install MLX and Required Dependencies
First, install the MLX framework, the mlx-lm package (MLX’s language-model toolkit), and Hugging Face Transformers:
pip install mlx mlx-lm transformers
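If you want to confirm the installation before going further, this one-liner prints the installed MLX version (it assumes a recent MLX release that exposes a version attribute):

```shell
python -c "import mlx.core as mx; print(mx.__version__)"
```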
2. Install Git Large File Storage (LFS)
Since the model files are large, you need Git LFS to handle them properly. Install it with Homebrew, then enable it for your Git configuration:
brew install git-lfs
git lfs install
3. Install Hugging Face CLI
To interact with the Hugging Face model hub, install the huggingface_hub package (which also provides the huggingface-cli tool):
pip install huggingface_hub
4. Download the Qwen Model from Hugging Face
Clone the model repository using Git LFS:
git clone https://huggingface.co/Qwen/Qwen2.5-Math-1.5B-Instruct
This will download all required model weights and files to your local machine.
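Alternatively, since the huggingface_hub package from step 3 is installed, you can download the model from Python instead of cloning with git; a sketch using its snapshot_download function, which fetches the same files into a local directory:

```python
# Alternative download sketch using huggingface_hub (step 3) instead of git.
from huggingface_hub import snapshot_download

# Fetch all model files into a local folder matching the path used later.
local_dir = snapshot_download(
    repo_id="Qwen/Qwen2.5-Math-1.5B-Instruct",
    local_dir="./Qwen2.5-Math-1.5B-Instruct",
)
print("Model downloaded to:", local_dir)
```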
Running the Model Locally
Once the model is downloaded, create a Python script (e.g., run_qwen.py) and add the following code:
from mlx_lm import load, generate

# Define the local path where the model is stored
MODEL_PATH = "./Qwen2.5-Math-1.5B-Instruct"

# Load the model and tokenizer
model, tokenizer = load(MODEL_PATH)
print("Qwen Model loaded successfully!")

# Provide a text prompt
prompt = (
    "Create 5 math questions for 5-year-old kids, a mix of addition "
    "and subtraction that is suitable for that age group."
)

# If the tokenizer supports chat templates, use one to format the prompt
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

# Generate output
response = generate(model, tokenizer, prompt=prompt, verbose=True)

# Print the response
print("\nQwen Model Response:\n", response)
Running the Script
Execute the script in the terminal, from the same directory that contains the downloaded model folder:
python run_qwen.py
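As a quick alternative to writing a script, mlx-lm also ships a command-line generator; a sketch, assuming the mlx_lm.generate module installed with mlx-lm above:

```shell
python -m mlx_lm.generate --model ./Qwen2.5-Math-1.5B-Instruct \
  --prompt "Create 5 math questions for 5-year-old kids."
```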
Expected Output
Once the model runs successfully, it should generate five math questions suitable for a 5-year-old, for example:
Qwen Model Response:
1. What is 3 + 2?
2. If you have 5 apples and give 2 away, how many do you have left?
3. What is 1 + 4?
4. If you take away 3 from 7, how many remain?
5. What is 2 + 3?
Conclusion
By using MLX, you can efficiently run AI models like Qwen2.5-Math-1.5B-Instruct locally on your MacBook Pro. This setup provides a fast, private, and cost-effective alternative to cloud-based inference while leveraging Apple’s optimized hardware acceleration.
Try experimenting with different models and fine-tuning them using MLX to explore the full potential of running AI on your local machine!
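One experiment worth trying: mlx-lm includes a conversion tool that can quantize a Hugging Face model to 4-bit, substantially cutting memory use at a small accuracy cost. A hedged sketch of the command (flags assumed from the mlx-lm tooling installed earlier):

```shell
python -m mlx_lm.convert --hf-path Qwen/Qwen2.5-Math-1.5B-Instruct -q
```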