Fine-Tuning on an M1 Mac With Mistral, Ollama, and Together.ai
2024-02-08
Est. 8m read
OpenAI’s gpt-3.5-turbo-1106 is good enough for me most of the time, but I’m wary of the costs. Currently, executing a fine-tune job with ~220k tokens is about $5! And I’m only at 100 examples. Luckily, every API “completion” after that is a fraction of a cent.
Until OpenAI came around, I’d never really paid for API usage. It’s not something I’m used to. This has pushed me to try cheaper ways of training and using LLMs.
Overview
1. Mistral
Mistral1 is one of the many “open models” out there. It performs better than Llama2 and is commonly recommended as a GPT-3.5 replacement.
There are different variants of Mistral; in this guide we’ll be using Mistral-7B-Instruct-v0.2. One limitation to know about this variant is that it only supports 8k tokens in the context window, whereas OpenAI’s gpt-3.5-turbo-1106 supports 16k. Other variants likely require a different format for fine-tuning. Variants like yarn-mistral offer increases in Mistral’s context window, but I haven’t tested them yet.
Now that we’ve decided on Mistral-7B-Instruct-v0.2, how do we fine-tune it? There’s a specific format that Mistral expects, which is different from the OpenAI JSONL format. This is likely where you’ll need to start: converting your existing data into the Mistral format, which we’ll go over next.
Understanding the Mistral Format
Here’s the OpenAI format you may be used to for fine-tuning. It typically starts with a system message, followed by a user message, and then the assistant’s response:
{"messages": [{"role": "system", "content": "You are an expert in world capitals. Reply to this with only the answer:"}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}
{"messages": [{"role": "system", "content": "You are an expert in world capitals. Reply to this with only the answer:"}, {"role": "user", "content": "What is the capital of Germany?"}, {"role": "assistant", "content": "Berlin"}]}
With Mistral, the format is described in the Mistral documentation.
<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]
Here’s an example of the Mistral format that performs the same task:
{"text": "<s> [INST] You are an expert in world capitals. You will only reply to users with the capital requested. For example 'What is the capital of France?' you should respond with: [/INST] Paris</s>"}
{"text": "<s> [INST] You are an expert in world capitals. You will only reply to users with the capital requested. For example 'What is the capital of Germany?' you should respond with: [/INST] Berlin</s>"}
Please note that I could not find many examples of Mistral fine-tuning data to base this on, but I believe this is a valid example.
I haven’t tested this yet, but I think it could reduce cost/noise for training. If you’ve got some experience with this, please leave a comment!
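For completeness, a multi-turn example following the template above would presumably chain each exchange like this (also untested on my part, so treat it as a sketch):
{"text": "<s> [INST] You are an expert in world capitals. Reply to this with only the answer: What is the capital of France? [/INST] Paris</s> [INST] What is the capital of Germany? [/INST] Berlin</s>"}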
Automating the Conversion
Luckily, there’s a Python package called transformers2 that can help us convert between formats for most models. I hacked together this script to help me convert from the OpenAI format to Mistral’s format. Note that your messages must alternate between user/assistant/user/assistant/…
If you’d like to know more about the transformers package, I highly recommend checking out these HuggingFace docs.
import json
from transformers import AutoTokenizer

checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"  # Can be changed to other models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

messages = [
    [
        # Notice there's no system message here, the system role doesn't work for this model
        {"role": "user", "content": "You are an expert in world capitals. Reply to this with only the answer: What is the capital of France?"},
        {"role": "assistant", "content": "Paris"}
    ],
    [
        {"role": "user", "content": "You are an expert in world capitals. Reply to this with only the answer: What is the capital of Germany?"},
        {"role": "assistant", "content": "Berlin"}
    ]
]

def create_mistral_data():
    lines = []
    for message in messages:
        # apply_chat_template wraps each conversation in Mistral's [INST] ... [/INST] markup
        tokenized_chat = tokenizer.apply_chat_template(
            message, tokenize=True, add_generation_prompt=True, return_tensors="pt")
        lines.append(json.dumps({
            "text": tokenizer.decode(tokenized_chat[0])
        }))
    # Write to out.jsonl
    with open('out.jsonl', 'w') as f:
        f.write('\n'.join(lines))

create_mistral_data()
After you’ve added 100+ (yes, 100+) examples, you can run the script like so:
pip install transformers
python instructions.py
An out.jsonl file will be created in the same directory. This file will be uploaded to Together.ai in the next step; they require at least 100 examples for fine-tuning.
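If your examples already live in an OpenAI-style .jsonl file rather than a hard-coded list, a small loader can feed them into the script above. This is just a sketch I haven’t run against Together.ai’s validator: openai_data.jsonl is a hypothetical filename, and it folds the system message into the first user turn since Mistral’s chat template rejects the system role.
import json

def load_openai_jsonl(path="openai_data.jsonl"):
    """Read OpenAI-format fine-tune data and fold each system message
    into the first user message (Mistral has no system role)."""
    conversations = []
    with open(path) as f:
        for line in f:
            msgs = json.loads(line)["messages"]
            system = next((m["content"] for m in msgs if m["role"] == "system"), None)
            chat = [m for m in msgs if m["role"] != "system"]
            if system and chat and chat[0]["role"] == "user":
                chat[0] = {"role": "user", "content": f"{system} {chat[0]['content']}"}
            conversations.append(chat)
    return conversations

# messages = load_openai_jsonl()  # then call create_mistral_data() as before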
2. Together.ai
Once you’ve got your fine-tuning data prepared (congrats btw), you can use any number of services to fine-tune your model. All you really need is a nice GPU. I’ve been using Together.ai for this.
It still costs ~$5 to fine-tune my model, but we have the option to download the .safetensors so that we can run the model locally! Together.ai gave me $15 for free when I signed up, so I’ve been using that for now. I have seen people recommend RunPod and Google Colab for this as well; they require a bit more work but may come out cheaper.
Start a Fine-Tune Job
Together.ai does not have a nice UI for fine-tuning like OpenAI does. You’ll
need to upload your .jsonl
file through the command line tool. Here’s a quick
example of how to do that:
pip install together # install CLI tool
export TOGETHER_API_KEY=your-api-key # https://api.together.xyz/settings/api-keys
together files upload out.jsonl # out.jsonl is our Mistral fine-tune data
Next, start the fine-tune job using the file ID from the upload command:
together finetune create --training-file FILE_ID_FROM_UPLOAD \
--model mistralai/Mistral-7B-Instruct-v0.2
Once that starts, you can check the status from their website here.
Once it processes and uploads, you have the option to host it for $1.40/hr… or you can download the .safetensors and run it locally. We’ll cover the latter.
![TogetherAI Downloads](images/togetherai-downloads.png)
Prepare Files for Ollama
Once the download finishes, you’ll get a *.tar.zst
file. This is a compressed
file that you’ll need to extract with tar
. I recommend extracting it in a new
directory as there are many files and it can get messy.
tar --use-compress-program=unzstd -xvf ft-*.tar.zst
After a bit, you’ll have a directory with a bunch of files. Our next
step is to convert these files into a format that Ollama can understand. Typically
you’ll need a GGML or GGUF file (ending in .bin
).
3. Convert .safetensors for Ollama
The steps for this can all be found here; however, I’ll break it down for completeness.
First, within the directory containing your .safetensors
files, you need to
clone this repository:
git clone git@github.com:ollama/ollama.git ollama
cd ollama
git submodule init
git submodule update llm/llama.cpp
Then, setup the dependencies. It should be this simple:
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
And then, build the quantize
tool. We’ll see what that’s used for later.
make -C llm/llama.cpp quantize
Now comes the conversion process. We’ll need to point the convert.py
tool
to the entire directory where the .safetensors
are located.
# ../ should refer to the directory where the .safetensors are located
python llm/llama.cpp/convert.py ../ --outtype f16 --outfile converted.bin
Once that’s done, you’ll have a converted.bin file.
4. Quantize the Model
The quantize tool is pretty neat. People compare it to compression; however, it’s not exactly like compression. Regardless, we need to quantize the converted model to be sure that it will run on our local machine. Experiment with skipping this step or trying other quantization types on your own. I’ve had the best results with Q5_K_M and Q4_0 on my MacBook Air.
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
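If you’re curious how much smaller the model got (q4_0 should land at roughly a quarter of the f16 conversion, give or take), a quick size check does the trick. The filenames are the ones from the commands above:
import os

# Compare the f16 conversion against the q4_0 quantized output.
for name in ("converted.bin", "quantized.bin"):
    size_gb = os.path.getsize(name) / 1e9
    print(f"{name}: {size_gb:.1f} GB")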
5. Install Ollama
Ollama.ai lets you run LLMs locally. Think of it like Docker for AI models. It’s a simple command line tool that can run Mistral, among the many other models it supports. It supports macOS and Linux right now; Windows users may be able to use WSL3.
6. Create a Modelfile
You can create your Modelfile4 anywhere; I created it in the same directory as the .safetensors from previous steps. All you need is the following:
FROM ./ollama/quantized.bin
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""
SYSTEM """REPLACE_WITH_YOUR_SYSTEM_PROMPT"""
The only things you may need to change are the FROM
path and the SYSTEM
prompt.
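One optional tweak: if you see stray [INST]/[/INST] tokens in the output (see the open questions at the end), Ollama Modelfiles support stop parameters that cut generation at those markers. I haven’t verified this fixes it for this particular fine-tune, but it would look like:
FROM ./ollama/quantized.bin
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""
SYSTEM """REPLACE_WITH_YOUR_SYSTEM_PROMPT"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"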
7. Run the Model
Now you can run the model with the following command:
ollama create my-ft-model -f Modelfile
ollama run my-ft-model # starts an inference session in the terminal
>> What is the capital of France?
Paris
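If you’d rather call the model from code than the terminal, Ollama also exposes a local HTTP API (port 11434 by default). Here’s a minimal sketch using the model name from the create command above:
import requests

# Ask the locally served fine-tuned model a question via Ollama's REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-ft-model",
        "prompt": "What is the capital of France?",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])  # e.g. "Paris"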
And that’s it! You’ve got a fine-tuned Mistral model running locally. Hopefully this helps you save some money on your AI projects. I’m still not sure if this is worth it compared to paying OpenAI, but I certainly learned a few things so I’ll call it a win. Cheers!
8. Bonus: Setup Ollama WebUI
If you’d like to chat with your model in a UI like ChatGPT, Ollama WebUI is the answer. It’s very straightforward if run in a Docker container. Here’s a quick example of how to do that:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama-webui/ollama-webui:main
Once that’s up, you can visit http://localhost:3000/ and create a local account. Assuming you’ve followed this guide, your model should be available to select and it should just work.
?. Open Questions
- Is my understanding of the Mistral fine-tuning format correct?
- There are some [/INST] artifacts in my own outputs that make me wonder…
- What’s the best way to write your fine-tuning data once and convert it to
the format required by Mistral, OpenAI, and others?
- NVIDIA/Megatron-LM has a tool for this, but is it more capable than HF transformers?
- How does fine-tuning a multimodal model compare to just text?
- From wandb.ai it looks pretty easy! Very similar process.