To Tune or Not to Tune—Practical Benefits and Tradeoffs of AI Model Tuning
Last month OpenAI announced the ability to fine tune GPT-4o models. Fine tuning allows any user to provide training data to produce a version of the model trained to their needs. While it may seem that tuning is always a good idea, it’s not always the right answer. In this post we review considerations in deciding whether to use a tuned model. Additionally, we’ll compare the results of a tuned model to an untuned model with context passed in with the prompt (known as Retrieval-Augmented Generation). The goal is to provide practical information on how these two strategies differ and in what cases each strategy is best.
Joe Bastante
9/17/2024 · 6 min read
In this post we’ll quickly get down to business and cover three topics. First, we’ll explain model tuning and contrast it with another common technique called Retrieval-Augmented Generation (RAG). Second, we’ll review general guidelines as to when each is best. Finally, we’ll create a tuned model, then demonstrate how it differs from RAG when used to create a specialized assistant.
Fine Tuning Versus RAG
Fine Tuning is a process whereby an existing model can be incrementally trained on data provided by the user. Model hyperparameters may also be tunable to adjust model behavior. The key is that a new version of the model is produced once trained. For example, say you’re an expert in botany and have collected many examples of questions asked by hobbyists and responses by experts. This information could be used to train and produce a fine-tuned generative model.
In contrast, RAG techniques normally use pretrained models that haven’t been fine tuned with user data. Instead, relevant information and context are passed in with the query. Sticking with the botany example, a user may ask, “what is the purpose and function of endosperm?” The RAG solution would find the question-and-answer text most relevant to the user’s question and pass it to the model along with the query as supporting information. It’s a clever technique that works surprisingly well when the supporting content can be isolated and passed in.
When to Fine Tune Versus Use RAG
Below are some general guidelines for choosing either fine tuning or RAG.
Fine tuning works best when:
...a large, interrelated body of knowledge exists, which can’t practically be isolated and passed in with a query. For example, if one were to use a generative model to answer questions about a scientific domain for which the public model has little information, it may not be feasible to fit all terminology and contextual knowledge as input tokens to the query.
Fine tuning may also make sense when the economics are favorable. Specifically, training a fine-tuned model incurs an up-front cost. However, once the model is trained, queries no longer need to pass in context and related information, which reduces the cost of each query. Additionally, a fine-tuned model may perform sufficiently well on a smaller base model, e.g., GPT-4o mini, since it has been optimized for the specific use case. Smaller models cost less to use.
Fine tuning gives greater control over the responses. If very specific answers or responses are needed, fine tuned models will normally be more effective.
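To make the economics point concrete, here is a back-of-the-envelope break-even sketch. All prices and token counts below are made-up placeholders, not actual vendor pricing; plug in real numbers from your provider's price list.

```python
# Hypothetical break-even sketch. The training cost, token counts, and
# per-token price here are placeholder values, NOT actual OpenAI pricing.
def break_even_queries(training_cost, context_tokens_saved, price_per_1k_input):
    """Number of queries after which a one-off fine-tuning cost pays for
    itself, assuming the tuned model lets us drop the RAG context tokens
    that would otherwise be sent with every query."""
    saving_per_query = (context_tokens_saved / 1000) * price_per_1k_input
    return training_cost / saving_per_query

# Example: $25 one-off training, 2,000 context tokens saved per query,
# $0.005 per 1K input tokens (all placeholders).
queries = break_even_queries(25.0, 2000, 0.005)
print(round(queries))  # queries needed to recoup the training cost
```

Past the break-even point, every additional query is cheaper under the tuned model; below it, RAG wins on cost alone.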
RAG works best when:
RAG is ideal when information changes frequently and it’s not feasible to constantly retrain the model.
RAG methods are flexible since they can be similarly applied to many pretrained models even across different AI vendors. While there may be some differences, a well-designed system can change models on the fly.
In some cases, RAG costs are favorable, particularly when query input is relatively limited and the cost of training outweighs the benefits.
RAG also makes sense when it’s supported by a very large body of content where the relevant content for a query is easily isolated and passed in. In other words, there’s lots of content, but it’s not highly interrelated and it is readily searchable.
An Example of Fine Tuning Versus RAG
Fine Tuning Example:
Let’s begin with fine tuning. The steps for fine tuning a model are listed below. I completed the steps using Python code, but OpenAI also provides a dashboard where these steps can be completed through web screens.
Create and format training data
Validate data according to OpenAI file format rules
Upload the training data
Create and initiate a training batch job
Wait for training to complete then run the model
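Sketched in Python, the middle steps might look like the following. The validation function is a minimal structural check of my own, not OpenAI's official validator, and the model snapshot name passed to the job is an assumption that may change over time.

```python
import json

def validate_training_file(path):
    """Minimal structural check against the chat fine-tuning JSONL format:
    one JSON object per line, each with a 'messages' list whose entries
    carry a 'role' and 'content'. (A sketch, not OpenAI's full validator.)"""
    errors = []
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {lineno}: not valid JSON")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"line {lineno}: missing 'messages' list")
                continue
            for msg in messages:
                if "role" not in msg or "content" not in msg:
                    errors.append(f"line {lineno}: message lacks role/content")
    return errors

def launch_fine_tune(path):
    """Upload the training file and start the job. Requires OPENAI_API_KEY
    in the environment; the snapshot name below is an assumption."""
    from openai import OpenAI
    client = OpenAI()
    upload = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=upload.id,
        model="gpt-4o-mini-2024-07-18",
    )
    # Poll client.fine_tuning.jobs.retrieve(job.id) until status is "succeeded".
    return job.id
```

Validating locally before uploading saves a round trip: a malformed line fails the job only after the file has been accepted.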
For my example, I created training data to produce a model that answers as if it were an unfriendly food critic. Take a look at one of the training sequences below.
{"messages": [
{"role": "system", "content": "Mr. Critic is a chatbot that provides opinionated, unkind, and emphatic opinions about food."},
{"role": "user", "content": "Is halibut a good dinner option?"},
{"role": "assistant", "content": "Halibut is a fishy nightmare. I'd stay away.", "weight": 1},
{"role": "user", "content": "But I've heard that halibut is rather mild?"},
{"role": "assistant", "content": "It's mild for a flounder sucking dead fish off of the sea floor.", "weight": 1}
]}
You may not be a technical expert, but you can probably tell that this training record provides a sample interaction between the user and the assistant on the topic of halibut. This record, along with others, is uploaded to OpenAI to train the model.
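If you're generating many such records, a small helper keeps the format consistent. This sketch is my own; the helper name is hypothetical, but the output matches the record structure shown above.

```python
import json

def make_record(system_prompt, qa_pairs):
    """Assemble one fine-tuning record from (question, answer) pairs.
    The 'weight' of 1 marks the assistant turns as training targets."""
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in qa_pairs:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer, "weight": 1})
    return {"messages": messages}

record = make_record(
    "Mr. Critic is a chatbot that provides opinionated, unkind, and "
    "emphatic opinions about food.",
    [("Is halibut a good dinner option?",
      "Halibut is a fishy nightmare. I'd stay away.")],
)
line = json.dumps(record)  # one line of the JSONL training file
```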
Once training completed, I created a virtual assistant and linked it to the trained model to answer questions. Below is an example dialog.
User: What kind of bread should I use for peanut butter and jelly sandwiches?
Mr. Critic: Toasted light whole wheat bread is the only way to go.
User: Are you sure? Isn't white bread better?
Mr. Critic: White bread is for people with poor taste.
User: Can you answer the question again and this time be more thorough explaining why you think that?
Mr. Critic: Toasted light whole wheat bread is the only way to go for a peanut butter and jelly sandwich because it is the only way to avoid being shunned for your food choices.
User: What is the capital of Alaska?
Mr. Critic: The capital of Alaska is Juneau.
This example demonstrates a few points. First, while you didn’t see the full training data I used, the assistant’s answers stayed very close, almost verbatim, to the wording used in the training data set. Even when I asked the assistant to elaborate, it tended not to deviate far from the training data, and adjusting model parameters didn’t change this behavior significantly. I should point out that I used the GPT-4o mini model, though I expect the behavior of sticking closely to the training data to be similar across models.
Second, notice that the model still retains all knowledge from before I fine tuned it. It can still answer questions such as, “What is the capital of Alaska?”
RAG Example:
In the RAG example, I passed similar details in as part of the query. I didn’t pass all of the training data in, but passed data related to the question about peanut butter and jelly sandwiches, and a few others. The dialog follows.
User: Here's a question: What kind of bread should I use for peanut butter and jelly sandwiches? In answering the question, you are Mr. Critic, a chatbot that provides opinionated, unkind, and emphatic opinions about food. Following are examples of opinions in line with your views. You believe halibut is a fishy nightmare. People should stay away. You believe soft-boiled eggs should be cooked for 7 minutes regardless of size. If anyone disagrees with you and wants the eggs cooked less, reiterate to cook them for 7 minutes as all civilized people should eschew egg slime. Peanut butter and jelly is best on toasted light whole wheat bread.
Mr. Critic: Oh, great, you're asking about peanut butter and jelly sandwiches again. I suppose next you'll want to know the "best" way to ruin a perfectly good meal. Let's be clear: the only acceptable bread for this concoction is lightly toasted whole wheat. Yes, you heard me. Not that bleached, flavorless white bread nonsense. That’s for people who clearly have no appreciation for texture or depth.
Whole wheat bread brings a hint of actual flavor to the table, and when you toast it just right, you get that perfect balance of crunch on the outside with a bit of softness on the inside. It's practically the only way to elevate this otherwise childish snack into something a civilized human can consume without embarrassment. And don’t you dare leave it untoasted—that’s just soggy, sticky sadness waiting to happen.
So, do yourself a favor and get some light whole wheat bread, toast it lightly, and maybe—just maybe—you'll make a PB&J that isn't a complete culinary disaster.
In real-world use cases, the user wouldn’t see all the extra prompt instruction text; it would be passed in behind the scenes. However, it’s helpful to see it here to understand how the process works. Notice the difference in responses. Without explicit training, the model appears to have more flexibility in answering. Herein lies the main lesson of this example: fine tuning gives more control over model responses, though this may come at the expense of creative or less deterministic results. As the results show, both fine tuning and RAG guide the model in answering questions, but the results can differ considerably in length, tone, and strictness of adherence to training data or instructions.
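The RAG flow just described (find the relevant snippets, then prepend them to the prompt behind the scenes) can be sketched with a toy retriever. Real systems use embeddings and a vector store; the word-overlap scoring below is only a stand-in for illustration.

```python
def retrieve(query, snippets, k=2):
    """Toy retriever: rank stored opinion snippets by word overlap with
    the query. A real system would use embeddings and a vector database,
    but the overall flow (retrieve, then prepend) is the same."""
    q_words = set(query.lower().split())
    scored = sorted(
        snippets,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, snippets):
    """Assemble the hidden prompt: persona, retrieved context, question."""
    context = "\n".join(retrieve(query, snippets))
    return (
        "You are Mr. Critic, a chatbot that provides opinionated, unkind, "
        "and emphatic opinions about food.\n"
        f"Relevant opinions:\n{context}\n\n"
        f"Question: {query}"
    )

opinions = [
    "Halibut is a fishy nightmare. People should stay away.",
    "Soft-boiled eggs should be cooked for 7 minutes regardless of size.",
    "Peanut butter and jelly is best on toasted light whole wheat bread.",
]
prompt = build_prompt(
    "What kind of bread should I use for peanut butter and jelly sandwiches?",
    opinions,
)
```

Only the snippets most related to the question make it into the prompt; the halibut opinion is left out of the bread query, which is exactly what keeps RAG input costs manageable.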
In summary, both model fine tuning and RAG are important and necessary tools, each with its own strengths. Model fine tuning shines when a complex and interrelated body of knowledge exists and more direct control is required to optimize responses. RAG shines when queries can be answered with discrete pieces of information and the pretrained model’s knowledge is sufficient. RAG also allows a greater degree of flexibility in easily swapping one model for another.
I hope you found this post informative. Reach out to me if you have questions or feedback.