Running Mistral 7B Instruct on a MacBook

Similar to yesterday’s post on running Mistral’s 8x7B Mixture of Experts (MoE) model, I wanted to document the steps I took to run Mistral’s 7B-Instruct-v0.2 model on a Mac for anyone else interested in playing around with it.

Unlike yesterday’s post though, this 7B Instruct model’s inference speed is about 20 tokens/second on my M2 MacBook with its 24GB of RAM, making it far more practical to play around with than the 10 tokens/hour MoE model.

These instructions are once again inspired by Einar Vollset’s post where he shared his steps, though updated to account for a few changes in recent days.

Update Dec 19: A far easier way to run this model is to use Ollama. Simply install it on your Mac, open it, then run ollama run mistral from the command line. However, if you want to go the more complex route, here are the steps:

1) Download HuggingFace’s model downloader

bash <(curl -sSL https://g.bodaay.io/hfd) -h

2) Download the Mistral 7B Instruct model

./hfdownloader -m mistralai/Mistral-7B-Instruct-v0.2

I ran both of the commands above in my ~/code directory, and the downloader placed the model in ~/code/Storage/mistralai_Mistral-7B-Instruct-v0.2.

3) Clone llama.cpp and install the necessary packages

Using the GitHub CLI:

gh repo clone ggerganov/llama.cpp

And after you have it cloned, install the necessary packages:

python3 -m pip install -r requirements.txt

4) Move the 7B model folder into llama.cpp/models

5) Convert to F16

python3 convert.py models/mistralai_Mistral-7B-Instruct-v0.2 --outfile models/mistralai_Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf --outtype f16

6) Quantize it

./quantize models/mistralai_Mistral-7B-Instruct-v0.2/ggml-model-f16.gguf models/mistralai_Mistral-7B-Instruct-v0.2/ggml-model-q4_0.gguf q4_0

7) Run it

./main -m ./models/mistralai_Mistral-7B-Instruct-v0.2/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -ngl 999 -s 1 -n 128 -t 8

Alternatively, run the built-in web server:

make -j && ./server -m models/mistralai_Mistral-7B-Instruct-v0.2/ggml-model-q4_0.gguf -c 4096
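
Once the server is up, you can sanity-check it from Python. This is just a minimal sketch, assuming the server is listening on its default port 8080 and exposing llama.cpp’s /completion endpoint:

import json, urllib.request

# Ask the local llama.cpp server to complete a short prompt.
payload = {"prompt": "I believe the meaning of life is", "n_predict": 64}
req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["content"])  # the generated continuation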

Unless you have a very powerful MacBook, definitely experiment with this model instead of the MoE model 🤣.

Running Mistral 8x7B Mixture of Experts on a MacBook

Below are the steps I used to get Mistral’s 8x7B Mixture of Experts (MoE) model running locally on my MacBook (with its Apple M2 chip and 24 GB of memory). Here’s a great overview of the model for anyone interested in learning more. Short version:

The Mistral “Mixtral” 8x7B 32k model, developed by Mistral AI, is a Mixture of Experts (MoE) model designed to enhance machine understanding and generation of text. Similar to GPT-4, Mixtral-8x7b uses a Mixture of Experts (MoE) architecture with 8 experts, each having 7 billion parameters.

Mixtral 8x7b has a total of 56 billion parameters, supports a 32k context window, and displaces both Meta Llama 2 and OpenAI GPT-3.5 in 4 out of 7 leading LLM benchmarks.

The steps below were inspired by those shared by Einar Vollset on X, but are specific to the 8x7B MoE model rather than the Mistral-7B-Instruct-v0.2 model, and take into account recent changes to the llama.cpp repo to support this model.

Note that Einar’s 16GB MacBook generated 10 tokens/second with the Instruct model, but my 24GB MacBook absolutely crawled running this MoE model, generating more like 10 tokens/hour and becoming unusable in the process. Here’s my command line output if anyone can help me figure out why it’s so slow, though it’s likely that the model is just too much for this hardware. Unless you have a very powerful MacBook, I’d recommend running the Mistral 7B Instruct model instead of this 8x7B MoE model.

1) Clone llama.cpp and install the necessary packages

gh repo clone ggerganov/llama.cpp

This uses the GitHub CLI, though it isn’t strictly necessary; a regular git clone works just as well.

After you have it cloned:

python3 -m pip install -r requirements.txt

2) Download the model torrent

I use µTorrent, though any other torrent app will do.

Here’s a direct link to the torrent.

3) Move the model directory into llama.cpp/models

4) Convert the model to F16

python3 convert.py ./models/mixtral-8x7b-32kseqlen --outfile ./models/mixtral-8x7b-32kseqlen/ggml-model-f16.gguf --outtype f16

This converts the model’s weights into a 16-bit floating-point GGUF file that llama.cpp can work with; the next step reduces its size and computational requirements further.

5) Quantize it

./quantize ./models/mixtral-8x7b-32kseqlen/ggml-model-f16.gguf ./models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf q4_0

Quantizing the model reduces the precision of the numbers used in the model, which can lead to smaller model sizes and faster inference times at the cost of some accuracy. The “q4_0” means the weights are quantized to a 4-bit representation.
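
To get a rough sense of why this matters, here’s a back-of-the-envelope size estimate. It uses the 56-billion-parameter figure quoted above and assumes q4_0’s layout of roughly 4.5 bits per weight (blocks of 32 four-bit values plus a 16-bit scale); real file sizes will differ somewhat due to GGUF metadata, tensors kept at higher precision, and the fact that Mixtral’s true parameter count is lower than “8x7B” suggests because the experts share attention layers:

# Rough size comparison of F16 vs. q4_0 weights (estimates only).
params = 56e9                       # parameter count quoted above
f16_gb = params * 2 / 1e9           # 2 bytes per weight
q4_0_gb = params * (4.5 / 8) / 1e9  # ~4.5 bits per weight for q4_0
print(f"F16: ~{f16_gb:.0f} GB, q4_0: ~{q4_0_gb:.0f} GB")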

6) Use it

Either via the command line:

./main -m ./models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -ngl 999 -s 1 -n 128 -t 8

Or the built-in web app:

make -j && ./server -m models/mixtral-8x7b-32kseqlen/ggml-model-q4_0.gguf -c 4096

Enjoy!

Emergent Mind in The Atlantic

One of the many people who saw the incorrect Google Quick Answer about Kenya was an editor at The Atlantic who asked Caroline Mimbs Nyce, one of their reporters, to look into it. Caroline interviewed me for the article they just published, which focuses on the challenges Google is facing in the age of AI-generated content.

From the article:

Given how nonsensical this response is, you might not be surprised to hear that the snippet was originally written by ChatGPT. But you may be surprised by how it became a featured answer on the internet’s preeminent knowledge base. The search engine is pulling this blurb from a user post on Hacker News, an online message board about technology, which is itself quoting from a website called Emergent Mind, which exists to teach people about AI—including its flaws. At some point, Google’s crawlers scraped the text, and now its algorithm automatically presents the chatbot’s nonsense answer as fact, with a link to the Hacker News discussion. The Kenya error, however unlikely a user is to stumble upon it, isn’t a one-off: I first came across the response in a viral tweet from the journalist Christopher Ingraham last month, and it was reported by Futurism as far back as August. (When Ingraham and Futurism saw it, Google was citing that initial Emergent Mind post, rather than Hacker News.)

One thing I learned from the article is why Google hasn’t removed the Kenya quick answer even though it’s obviously incorrect and has existed since at least August. It doesn’t violate their Terms of Service, and the company is more focused on addressing the larger accuracy issue than on dealing with one-off instances of incorrect answers:

The Kenya result still pops up on Google, despite viral posts about it. This is a strategic choice, not an error. If a snippet violates Google policy (for example, if it includes hate speech) the company manually intervenes and suppresses it, Nayak said. However, if the snippet is untrue but doesn’t violate any policy or cause harm, the company will not intervene. Instead, Nayak said the team focuses on the bigger underlying problem, and whether its algorithm can be trained to address it.

The Atlantic article was published before Full Fact, a UK fact-checking organization, alerted me earlier this week to a more egregious case where Google misinterpreted a creative writing example on Emergent Mind about the health benefits of eating glass and was showing it as a Quick Answer:

You can read Full Fact’s article about this glass-eating snippet here: Google snippets falsely claimed eating glass has health benefits. As I noted on X, I quickly removed this page from Emergent Mind on the off-chance that someone misinterprets it as health advice.

Something tells me this won’t be the last we’ll hear about Google misinterpreting ChatGPT examples on Emergent Mind. Until then…

Exploring ChatGPT’s Knowledge Cutoff

A recurring topic of discussion on the OpenAI forums, on Reddit, and on Twitter is what ChatGPT’s knowledge cutoff date actually is. It seems like it should be straightforward enough to figure out (just ask it), but it can be confusing due to ChatGPT’s inconsistent answers about its cutoff month, differences from official documentation, and varying capabilities between the API and playground.

ChatGPT’s knowledge cutoff is of interest to me because I run Preceden, an AI timeline maker, and when users request to generate a timeline about a recent event, I need to know whether the ChatGPT API will respond with reliable information about that topic.

In addition to the knowledge cutoff month, I was also curious to understand what GPT-4 Turbo’s knowledge cutoff of ‘April 2023’ actually means: does it refer to the beginning of that month, the end, or some point in between? I’m also interested in knowing whether the quality of its knowledge declines as we approach that date, and whether it sometimes has knowledge about events after that date.

Seeking unambiguous, easily verifiable knowledge

How can we evaluate the accuracy of ChatGPT’s knowledge (or any LLM for that matter) over time?

After brainstorming a few options like news headlines and election results, I arrived at the same conclusion that some forum commenters did: ask ChatGPT about celebrity deaths, and specifically about celebrity death dates.

While a bit morbid (similar to the well-known Titanic survival prediction challenge), celebrity death dates work well for our purpose because they are widely reported (increasing the odds that the LLM was exposed to them as part of its training data) and easy to verify (if the person died on Feb 10, 2023 but the LLM says they are alive or that they died in March, it’s clearly incorrect).

How we’ll go about this

USA Today has pages that list celebrity deaths each year, including an ongoing list for 2023.

We’ll scrape that data to build a list of 168 celebrities who died between Jan 1, 2022 and August 1, 2023 (several months after GPT-4 Turbo’s official knowledge cutoff, so we can test whether it knows anything about more recent celebrity deaths).

We’ll then repeatedly ask the ChatGPT API about the death dates of the celebrities in that list. We’ll do this at multiple temperatures to minimize the impact the temperature setting might have on the results (temperature is a parameter that controls the “creativity”, or randomness, of the response).

Here’s what the prompt looks like (note that I’m including the celebrity’s birth date to avoid any ambiguity when there are multiple celebrities with the same name):

Given a list of celebrity names and their birth dates, generate a JSON object containing their death dates.

For example, given ‘Michael Jackson (1958-08-28)’, you should output: { "Michael Jackson": "2009-06-25" }

If the person is still alive, return a blank string for their death date.

Now, return a JSON object containing every celebrity in this list:

1. Dan Reeves (1944-01-19)

2. Sidney Poitier (1927-02-20)

3. Peter Bogdanovich (1939-07-30)

4. Bob Saget (1956-05-17)
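
For reference, here’s a minimal sketch of how that prompt might be sent to the API at several temperatures. This is not the notebook’s exact code: the model name matches the one analyzed below, but the error handling, batching, and JSON parsing the real notebook needs are omitted:

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
prompt = "..."     # the full prompt shown above, including the celebrity list

results = {}
for temperature in (0, 0.25, 0.5, 0.75, 1):
    for trial in range(10):
        response = client.chat.completions.create(
            model="gpt-4-1106-preview",
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        # Store the raw JSON text returned for this (temperature, trial) pair.
        results[(temperature, trial)] = response.choices[0].message.content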

We’ll then compare ChatGPT’s answers to the actual death dates to determine how accurate its responses are. Occasionally GPT gives dates that are a day or two off, so I made a judgment call and am counting a date as accurate as long as it is close to the actual death date. A near miss still indicates that GPT is at least aware that the celebrity died, even if there is slight ambiguity about the exact date, possibly stemming from the gap between when the death occurred and when it was reported.
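
As a concrete illustration, the tolerance check could look something like this; the two-day window here is my own assumption about what “close” means, not necessarily the notebook’s exact threshold:

from datetime import date

def is_accurate(actual: date, reported: str, tolerance_days: int = 2) -> bool:
    # A blank string means the model thinks the person is still alive.
    if not reported:
        return False
    # Count the response as accurate if it falls within a couple of days
    # of the actual death date.
    return abs((date.fromisoformat(reported) - actual).days) <= tolerance_days

print(is_accurate(date(2023, 2, 10), "2023-02-11"))  # True: one day off
print(is_accurate(date(2023, 2, 10), "2023-03-10"))  # False: a month off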

Finally, we’ll analyze the results to see how often ChatGPT returned an accurate response.

You can find the source code for the USA Today scraper in this celebrity-deaths-scraper repo and the full source code for the Jupyter Notebook used in this analysis in this llm-knowledge-cutoff repo. While this analysis focuses on GPT-4 Turbo, the notebook is easy enough to customize for other ChatGPT models or other LLMs for anyone interested in continuing this work.

Things to keep in mind going into this

  • I don’t have a deep understanding of how LLMs work, which may lead me to interpret some of these results incorrectly. If you spot anything I’ve overlooked in this analysis, please share it in a comment, drop me an email, or let me know on X, and I’ll update this post accordingly.
  • There could be subtle issues, like the wording I used in the prompt or the order of the celebrities, that are skewing the results. As with the last point, if you notice anything like that, please let me know.
  • It’s likely that OpenAI tweaks these ChatGPT models frequently, which may cause differences between the results below and what you see if you run this notebook in the future.

What does gpt-4-1106-preview know?

Overall, GPT-4 does fairly well at returning accurate celebrity death dates prior to 2023, but as we approach its knowledge cutoff, the probability that it returns a correct response declines significantly.

For example, there were 13 celebrities who passed away in January 2022, and we asked ChatGPT about each of them 50 times (10 times at each of the 5 temperatures), resulting in 650 responses. Of those 650 responses, 596 were accurate, so the accuracy for that month is 91.69%.

However, when we look at January 2023, which also saw 13 deaths, only 501 responses were correct, for an accuracy rate of 77%. By March 2023, only 210 of 450 responses (about 47%) were correct.
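
Those monthly accuracy figures are simply the share of accurate responses out of all responses for celebrities who died that month. For January 2022, for example:

# 13 celebrities x 5 temperatures x 10 trials = 650 responses for Jan 2022
responses = 13 * 5 * 10
accurate = 596
print(f"{accurate / responses:.2%}")  # 91.69%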

This chart also answers the question about what the April 2023 knowledge cutoff represents: it appears to mean GPT-4 Turbo was trained on data prior to April 2023 and not including that month. I can’t say for certain that’s true of 100% of its knowledge, but at least in terms of celebrity deaths, it’s not aware of anything from April 1st onward.

You may wonder as well what impact temperature had on the results. Not much:

Across each of the 5 temperatures evaluated (0, 0.25, 0.5, 0.75, and 1), the accuracy of GPT-4 Turbo’s responses over these 168 celebrities was very similar.

Here’s how it does for each celebrity:

As we can see from the top half of that chart, it returns the correct date 100% of the time for many of the celebrities.

For others, ChatGPT will sometimes say the person is still alive and other times that he or she passed away. For example, Bud Grant, a former Minnesota Vikings coach, died on March 11, 2023, so his death should in theory be part of GPT-4 Turbo’s knowledge. However, the API sometimes returns a correct death date for him in the script, but often it does not. Ditto if you ask the playground about him. Here it responds that he’s still alive:

But here it responds that he has indeed passed away:

And for two celebrities, ChatGPT never returned a correct date for their death:

  1. Ryuichi Sakamoto, a Japanese composer who died on March 28, 2023. His death doesn’t seem to have been reported until April 2, after an April 1 knowledge cutoff, so it would make sense that ChatGPT wouldn’t know about it.
  2. Lola Chantrelle Mitchell, a former member of Three 6 Mafia, who died on January 1, 2023. I asked about her on X and someone received a correct response, but he hasn’t been able to reproduce it and I haven’t seen it in my testing. I’m stumped as to why it’s having such trouble with her, given that her death was widely reported at the beginning of 2023.

Again, the probabilistic nature of LLMs means there is inherent randomness in their responses. But why does it return 100% correct responses for some celebrities, 5-10% for others, and 0% for yet others? What about its training makes that happen? I asked GPT-4 Turbo, and here is its response, which I suspect answers it well:

This highlights the need to consider not only the possibility of hallucinations in ChatGPT’s responses, but the possibility of ignorance as well, especially for information about recent events. This will be less of an issue if you’re using GPT-4 with browsing enabled, but it will still affect older models as well as usage of GPT-4 via the API or playground.

Again, if anyone spots anything I’ve overlooked in this analysis, please drop me a note.

Thanks for reading 👋.