Running Mistral's 8x7B Mixture of Experts Model on a MacBook

Below are the steps I used to get Mistral's 8x7B Mixture of Experts (MoE) model running locally on my MacBook (with its Apple M2 chip and 24 GB of memory). Here’s a great overview of the model for anyone interested in learning more. Short version:

The Mistral “Mixtral” 8x7B 32k model, developed by Mistral AI, is a Mixture of Experts (MoE) model designed to improve machine understanding and generation of text. Similar to GPT-4, Mixtral 8x7B uses a Mixture of Experts architecture with 8 experts, each having 7 billion parameters.

Mixtral 8x7B has roughly 47 billion parameters in total (the experts share attention layers, so the total is less than a literal 8 × 7B), supports a 32k-token context window, and outperforms both Meta Llama 2 and OpenAI GPT-3.5 in 4 out of 7 leading LLM benchmarks.

The steps below were inspired by those shared by Einar Vollset on X, but they’re specific to the 8x7B MoE model rather than the Mistral-7B-Instruct-v0.2 model, and they take into account recent changes to the llama.cpp repo that add support for this model.

Note that Einar’s 16 GB MacBook generated 10 tokens/second with the Instruct model, but my 24 GB MacBook absolutely crawled running this MoE model, generating more like 10 tokens/hour and becoming unusable in the process. Here’s my command line output if anyone can help me figure out why it’s so slow, though it’s likely that the model is simply too much for this hardware. Unless you have a very powerful MacBook, I’d recommend running the Mistral 7B Instruct model instead of this 8x7B MoE model.

1) Clone llama.cpp and install the necessary packages

gh repo clone ggerganov/llama.cpp

This uses the GitHub CLI, though it isn’t strictly necessary.
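If you don’t have the GitHub CLI installed, a plain git clone does the same thing:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp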

After you have it cloned, change into the llama.cpp directory and install the Python dependencies:

python3 -m pip install -r requirements.txt
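It’s also worth building the project now, since the quantize, main, and server binaries used in the later steps don’t exist until you compile them. On an Apple Silicon MacBook, a plain make should build with Metal support enabled by default (at least with recent versions of the repo):

make -j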

2) Download the model torrent

I use µTorrent, though any other torrent app will do.

Here’s a direct link to the torrent.

3) Move the model directory into llama.cpp/models
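Assuming the torrent landed in your Downloads folder (adjust the path if yours went somewhere else), from inside the llama.cpp directory that’s just:

mv ~/Downloads/mixtral-8x7b-32kseqlen ./models/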

4) Convert the model to F16

python3 convert.py ./models/mixtral-8x7b-32kseqlen --outfile ./models/mixtral-8x7b-32kseqlen/ggml-model-f16.gguf --outtype f16

This converts the raw model weights into llama.cpp’s GGUF format at 16-bit floating-point precision, which is the input the quantization step below expects.
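As a rough sanity check: if the ~47 billion parameter figure above is right, the f16 file should come out to roughly 90 GB at 2 bytes per parameter, so make sure you have the disk space. You can confirm the size afterwards with:

ls -lh ./models/mixtral-8x7b-32kseqlen/ggml-model-f16.gguf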

5) Quantize it

./quantize ./models/mixtral-8x7b-32kseqlen/ggml-model-f16.gguf ./models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf q4_0

Quantizing the model reduces the precision of the numbers it stores, which shrinks the file and speeds up inference at the cost of some accuracy. The “q4_0” variant uses a 4-bit representation for each weight, plus a small per-block scale factor.
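A back-of-the-envelope estimate, based on my understanding that q4_0 stores weights in blocks of 32 with a shared 16-bit scale (about 4.5 bits, or ~0.56 bytes, per weight): ~47 billion parameters × ~0.56 bytes ≈ 26 GB for the quantized file. That’s more than the 24 GB of unified memory on my machine, which probably goes a long way toward explaining the crawling speed I mentioned above.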

6) Use it

Either via the command line:

./main -m ./models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -ngl 999 -s 1 -n 128 -t 8
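For reference: -m is the model path, -p the prompt, -ngl the number of layers to offload to the GPU (999 effectively means “as many as possible”), -s the random seed, -n the number of tokens to generate, and -t the number of CPU threads.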

Or the built-in web app:

make -j && ./server -m models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf -c 4096
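Once the server is up (it listens on 127.0.0.1:8080 by default), you can open it in a browser or hit its completion endpoint directly; something like the following should work, though the exact JSON fields can vary between llama.cpp versions:

curl http://127.0.0.1:8080/completion -H "Content-Type: application/json" -d '{"prompt": "I believe the meaning of life is", "n_predict": 128}'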

Enjoy!
