Fine-Tuning on an M1 Mac With Mistral, Ollama, and Together.ai
2024-02-08
Est. 8m read
OpenAI’s gpt-3.5-turbo-1106 is good enough for me most of the time, but I’m wary of the costs. Currently, executing a fine-tune job with ~220k tokens is about $5! And I’m only at 100 examples. Luckily, every API “completion” after that is a fraction of a cent.
Until OpenAI came around, I’d never really paid for API usage. It’s not something I’m used to. This has pushed me to try cheaper ways of training and using LLMs.
Overview
1. Mistral
Mistral1 is one of the many “open models” out there. It performs better than Llama2 and is commonly recommended as a GPT-3.5 replacement.
There are different variants of Mistral; in this guide we’ll be using Mistral-7B-Instruct-v0.2. One limitation to know about this variant is that it only supports 8k tokens in the context window, whereas OpenAI’s gpt-3.5-turbo-1106 supports 16k. Other variants likely require a different format for fine-tuning. Variants like yarn-mistral offer increases in Mistral’s context window, but I haven’t tested them yet.
Now that we’ve decided on Mistral-7B-Instruct-v0.2, how do we fine-tune it? There’s a specific format that Mistral expects, which is different from the OpenAI JSONL format. This is likely where you’ll need to start: converting your existing data into the Mistral format, which we’ll go over next.
Understanding the Mistral Format
Here’s the OpenAI format you may be used to for fine-tuning. It typically starts with a system message, followed by a user message, and then the assistant’s response:
{"messages": [{"role": "system", "content": "You are an expert in world capitals. Reply to this with only the answer:"}, {"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "Paris"}]}
{"messages": [{"role": "system", "content": "You are an expert in world capitals. Reply to this with only the answer:"}, {"role": "user", "content": "What is the capital of Germany?"}, {"role": "assistant", "content": "Berlin"}]}
With Mistral, the format is described in the Mistral documentation.
<s>[INST] Instruction [/INST] Model answer</s>[INST] Follow-up instruction [/INST]
Here’s an example of the Mistral format that performs the same task:
{"text": "<s> [INST] You are an expert in world capitals. You will only reply to users with the capital requested. For example 'What is the capital of France?' you should respond with: [/INST] Paris</s>"}
{"text": "<s> [INST] You are an expert in world capitals. You will only reply to users with the capital requested. For example 'What is the capital of Germany?' you should respond with: [/INST] Berlin</s>"}
Please note that I could not find many examples of Mistral fine-tuning data to base this on, but I believe this is a valid example.
I haven’t tested this yet, but I think it could reduce cost/noise for training. If you’ve got some experience with this, please leave a comment!
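For completeness, a multi-turn example following the template above would presumably chain each exchange like this (also untested on my part, so treat it as a sketch):
{"text": "<s> [INST] You are an expert in world capitals. Reply to this with only the answer: What is the capital of France? [/INST] Paris</s> [INST] What is the capital of Germany? [/INST] Berlin</s>"}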
Automating the Conversion
Luckily, there’s a Python package called transformers2 that can help us convert between formats for most models. I hacked together this script to help me convert from the OpenAI format to Mistral’s format. Note that your messages must alternate between user/assistant/user/assistant/…
If you’d like to know more about the transformers package, I highly recommend checking out these HuggingFace docs.
import json
from transformers import AutoTokenizer

checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"  # Can be changed to other models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

messages = [
    [
        # Notice there's no system message here, the system role doesn't work for this model
        {"role": "user", "content": "You are an expert in world capitals. Reply to this with only the answer: What is the capital of France?"},
        {"role": "assistant", "content": "Paris"}
    ],
    [
        {"role": "user", "content": "You are an expert in world capitals. Reply to this with only the answer: What is the capital of Germany?"},
        {"role": "assistant", "content": "Berlin"}
    ]
]

def create_mistral_data():
    lines = []
    for message in messages:
        # apply_chat_template wraps each conversation in Mistral's [INST] ... [/INST] markup
        tokenized_chat = tokenizer.apply_chat_template(
            message, tokenize=True, add_generation_prompt=True, return_tensors="pt")
        lines.append(json.dumps({
            "text": tokenizer.decode(tokenized_chat[0])
        }))
    # Write to out.jsonl
    with open('out.jsonl', 'w') as f:
        f.write('\n'.join(lines))

create_mistral_data()
After you’ve added 100+ (yes, 100+) examples, you can run the script like so:
pip install transformers
python instructions.py
An out.jsonl file will be created in the same directory. This file will be uploaded to Together.ai in the next step; they require at least 100 examples for fine-tuning.
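If your examples already live in an OpenAI-style .jsonl file rather than a hard-coded list, a small loader can feed them into the script above. This is just a sketch I haven’t run against Together.ai’s validator: openai_data.jsonl is a hypothetical filename, and it folds the system message into the first user turn since Mistral’s chat template rejects the system role.
import json

def load_openai_jsonl(path="openai_data.jsonl"):
    """Read OpenAI-format fine-tune data and fold each system message
    into the first user message (Mistral has no system role)."""
    conversations = []
    with open(path) as f:
        for line in f:
            msgs = json.loads(line)["messages"]
            system = next((m["content"] for m in msgs if m["role"] == "system"), None)
            chat = [m for m in msgs if m["role"] != "system"]
            if system and chat and chat[0]["role"] == "user":
                chat[0] = {"role": "user", "content": f"{system} {chat[0]['content']}"}
            conversations.append(chat)
    return conversations

# messages = load_openai_jsonl()  # then call create_mistral_data() as before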
2. Together.ai
Once you’ve got your fine-tuning data prepared (congrats btw), you can use any number of services to fine-tune your model. All you really need is a nice GPU. I’ve been using Together.ai for this.
It still costs ~$5 to fine-tune my model, but we have the option to download the .safetensors so that we can run the model locally! Together.ai gave me $15 for free when I signed up, so I’ve been using that for now. I have seen people recommend RunPod and Google Colab for this as well; they require a bit more work but may come out cheaper.
Start a Fine-Tune Job
Together.ai does not have a nice UI for fine-tuning like OpenAI does. You’ll
need to upload your .jsonl
file through the command line tool. Here’s a quick
example of how to do that:
pip install together # install CLI tool
export TOGETHER_API_KEY=your-api-key # https://api.together.xyz/settings/api-keys
together files upload out.jsonl # out.jsonl is our Mistral fine-tune data
Next, start the fine-tune job using the file ID from the upload command:
together finetune create --training-file FILE_ID_FROM_UPLOAD \
--model mistralai/Mistral-7B-Instruct-v0.2
Once that starts, you can check the status from their website here.
Once it processes and uploads, you have the option to host it for $1.40/hr… or you can download the .safetensors and run it locally. We’ll cover the latter.
![TogetherAI Downloads](images/togetherai-downloads.png)
Prepare Files for Ollama
Once the download finishes, you’ll get a *.tar.zst
file. This is a compressed
file that you’ll need to extract with tar
. I recommend extracting it in a new
directory as there are many files and it can get messy.
tar --use-compress-program=unzstd -xvf ft-*.tar.zst
After a bit, you’ll have a directory with a bunch of files. Our next
step is to convert these files into a format that Ollama can understand. Typically
you’ll need a GGML or GGUF file (ending in .bin
).
3. Convert .safetensors for Ollama
The steps for this can all be found here; however, I’ll break it down for completeness.
First, within the directory containing your .safetensors
files, you need to
clone this repository:
git clone git@github.com:ollama/ollama.git ollama
cd ollama
git submodule init
git submodule update llm/llama.cpp
Then, setup the dependencies. It should be this simple:
python3 -m venv llm/llama.cpp/.venv
source llm/llama.cpp/.venv/bin/activate
pip install -r llm/llama.cpp/requirements.txt
And then, build the quantize
tool. We’ll see what that’s used for later.
make -C llm/llama.cpp quantize
Now comes the conversion process. We’ll need to point the convert.py
tool
to the entire directory where the .safetensors
are located.
# ../ should refer to the directory where the .safetensors are located
python llm/llama.cpp/convert.py ../ --outtype f16 --outfile converted.bin
Once that’s done, you’ll have a converted.bin file.
4. Quantize the Model
The quantize tool is pretty neat. People compare it to compression; however, it’s not exactly like compression. Regardless, we need to quantize the converted model to be sure that it will run on our local machine. Experiment with skipping this step or trying other quantization types on your own. I’ve had the best results with Q5_K_M and Q4_0 on my MacBook Air.
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
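If you’re curious how much smaller the model got (q4_0 should land at roughly a quarter of the f16 conversion, give or take), a quick size check does the trick. The filenames are the ones from the commands above:
import os

# Compare the f16 conversion against the q4_0 quantized output.
for name in ("converted.bin", "quantized.bin"):
    size_gb = os.path.getsize(name) / 1e9
    print(f"{name}: {size_gb:.1f} GB")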
5. Install Ollama
Ollama.ai lets you run LLMs locally. Think of it like Docker for AI models. It’s a simple command line tool that can run Mistral, among the many other models it supports. It supports macOS and Linux right now; Windows users may be able to use WSL3.
6. Create a Modelfile
You can create your Modelfile4 anywhere; I created it in the same directory as the .safetensors from previous steps. All you need is the following:
FROM ./ollama/quantized.bin
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""
SYSTEM """REPLACE_WITH_YOUR_SYSTEM_PROMPT"""
The only things you may need to change are the FROM
path and the SYSTEM
prompt.
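One optional tweak: if you see stray [INST]/[/INST] tokens in the output (see the open questions at the end), Ollama Modelfiles support stop parameters that cut generation at those markers. I haven’t verified this fixes it for this particular fine-tune, but it would look like:
FROM ./ollama/quantized.bin
TEMPLATE """[INST] {{ .System }} {{ .Prompt }} [/INST]"""
SYSTEM """REPLACE_WITH_YOUR_SYSTEM_PROMPT"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"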
7. Run the Model
Now you can run the model with the following command:
ollama create my-ft-model -f Modelfile
ollama run my-ft-model # starts an inference session in the terminal
>> What is the capital of France?
Paris
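If you’d rather call the model from code than the terminal, Ollama also exposes a local HTTP API (port 11434 by default). Here’s a minimal sketch using the model name from the create command above:
import requests

# Ask the locally served fine-tuned model a question via Ollama's REST API.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "my-ft-model",
        "prompt": "What is the capital of France?",
        "stream": False,  # return one JSON object instead of a token stream
    },
)
print(resp.json()["response"])  # e.g. "Paris"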
And that’s it! You’ve got a fine-tuned Mistral model running locally. Hopefully this helps you save some money on your AI projects. I’m still not sure if this is worth it compared to paying OpenAI, but I certainly learned a few things so I’ll call it a win. Cheers!
8. Bonus: Setup Ollama WebUI
If you’d like to chat with your model in a UI like ChatGPT, Ollama WebUI is the answer. It’s very straightforward if run in a Docker container. Here’s a quick example of how to do that:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v ollama-webui:/app/backend/data --name ollama-webui --restart always ghcr.io/ollama-webui/ollama-webui:main
Once that’s up, you can visit http://localhost:3000/ and create a local account. Assuming you’ve followed this guide, your model should be available to select and it should just work.
?. Open Questions
- Is my understanding of the Mistral fine-tuning format correct?
- There are some [/INST] artifacts in my own outputs that make me wonder…
- What’s the best way to write your fine-tuning data once and convert it to
the format required by Mistral, OpenAI, and others?
- NVIDIA/Megatron-LM has a tool for this, but is it more capable than HF transformers?
- How does fine-tuning a multimodal model compare to just text?
- From wandb.ai it looks pretty easy! Very similar process.