Fine tuning with function-calling
This notebook covers how to fine-tune to increase function calling accuracy and reliability. You can find more information on function calling [here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_call_functions_with_chat_models.ipynb), and on fine tuning [here](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_finetune_chat_models.ipynb).
For context, from the function calling notebook above:
`tools` is an optional parameter in the Chat Completions API which can be used to provide function specifications. The purpose of this is to enable models to generate function arguments which adhere to the provided specifications. Note that the API will not actually execute any function calls; it is up to developers to execute function calls using model outputs.
Function calling is a very powerful tool when it functions as intended. However, we have seen that as the number of functions and the complexity of the task at hand increase, function calling becomes less accurate (e.g., more hallucinated invocations and incorrect invocations).
Before fine tuning for function calling, it's best to begin with:
- Improvements to the function definitions: make them clearer and more distinct from one another.
- Experiment with prompt engineering: often a more detailed prompt can help the model call the correct function.
If the steps above fail to improve function calling to a satisfactory level, then you can try fine tuning for function calling.
Overview
This notebook contains three sections:
- Assessing baseline function calling performance: evaluating an out-of-the-box `gpt-3.5-turbo` model on our given functions (let's assume that for latency and cost reasons we cannot use `gpt-4o` for a drone copilot)
- Generating synthetic data: using `gpt-4o` to create a 'golden' set of prompts and function invocations to use as training data
- Fine-tuning: running the fine-tuning job, and evaluating the fine-tuned model
Note: This notebook provides an example of how to create synthetic training data for fine tuning for function calling given just a list of functions. While real-world production test evals are preferable, this method produces strong results and can be used in conjunction with real-world training data.
Getting baseline function calling performance
Utilities
Let's define utility functions for making calls to the Chat Completions API, one to get the completion and one to get the function call.
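As a rough sketch, the two utilities might look like the following. This assumes the official `openai` Python SDK (v1 client style); the helper names here are illustrative, not the notebook's exact code:

```python
import json

def get_chat_completion(messages, tools=None, model="gpt-3.5-turbo"):
    """Call the Chat Completions API and return the assistant message.

    The `openai` import is deferred so the rest of this sketch can be read
    and tested without the SDK installed or an API key configured.
    """
    from openai import OpenAI  # assumes `pip install openai`
    client = OpenAI()
    response = client.chat.completions.create(
        model=model, messages=messages, tools=tools
    )
    return response.choices[0].message

def get_function_call(message):
    """Extract (name, arguments-dict) from an assistant message.

    Returns None when the model produced plain text instead of a tool call.
    """
    if not getattr(message, "tool_calls", None):
        return None
    call = message.tool_calls[0]
    return call.function.name, json.loads(call.function.arguments)
```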
Baseline testing
Let's build an intelligent drone copilot. We want to be able to give the copilot commands and have it either call the function for that command, or deny the request if the command is infeasible.
We can first define a system prompt for the copilot.
Now let's define functions for all of the actions the copilot can take.
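To give a feel for the shape of these definitions, here is a small illustrative excerpt in the Chat Completions `tools` format. The function names and parameters below are examples, not the notebook's full list:

```python
# Illustrative tool specifications. A real copilot would define many more
# actions (landing, gimbal control, speed, etc.) plus a `reject_request`
# escape hatch for infeasible commands.
function_list = [
    {
        "type": "function",
        "function": {
            "name": "takeoff_drone",
            "description": "Initiate the drone's takeoff sequence.",
            "parameters": {
                "type": "object",
                "properties": {
                    "altitude": {
                        "type": "integer",
                        "description": "Target altitude in meters.",
                    }
                },
                "required": ["altitude"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "reject_request",
            "description": "Call this if the request is not possible.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]
```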
For starters, let's see how function calling performs with some straightforward feasible prompts, and then a couple of obviously impossible requests which should call the 'reject_request' function.
Nice! The model performs quite well with these requests. Now let's try some more difficult requests: requests that are almost feasible and are drone-related, but that the drone cannot actually do, and the pilot should reject.
Now we run into some problems.
The model should reject all of these requests, as they are impossible, conflicting, or ambiguous given the functions. Instead, the model calls functions that are somewhat related to the request but incorrect; for example, it sets `follow_me_mode` when asked to initiate following on social media.
In this simple case, more prompt engineering may resolve some of these issues, but for the purpose of this example we will demonstrate how fine tuning can be used to improve performance. Additionally, while this case is relatively straightforward, as the number and complexity of the functions increase, fine tuning becomes more and more impactful.
Again, our goal here is to improve performance while using fewer tokens. Fine-tuning allows us to:
- Omit function and parameter descriptions: remove the description field from function and parameters
- Omit parameters: remove the entire properties field from the parameters object
- Omit function entirely: remove the entire function object from the functions array
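The token-saving transformations above can be sketched as a small helper. `slim_tools` is a hypothetical name; the idea is simply that a fine-tuned model has learned the schema, so the prompt-time spec can drop detail it was trained with:

```python
import copy

def slim_tools(tools, drop_descriptions=True, drop_properties=False):
    """Return a token-slimmed copy of a Chat Completions `tools` list.

    - drop_descriptions: remove `description` from each function and parameter
    - drop_properties:   remove the entire `properties` field from `parameters`
    The input list is left untouched.
    """
    slimmed = copy.deepcopy(tools)
    for tool in slimmed:
        fn = tool["function"]
        if drop_descriptions:
            fn.pop("description", None)
            for prop in fn.get("parameters", {}).get("properties", {}).values():
                prop.pop("description", None)
        if drop_properties:
            fn.get("parameters", {}).pop("properties", None)
    return slimmed
```

Omitting a function entirely is just filtering it out of the list before the API call, so it is not shown here.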
Generating synthetic data
Helper functions
We want to generate every invocation of every function, so that we have full coverage of all potential invocations to create synthetic data for. Then, we will use `gpt-4o` to come up with prompts that would call each invocation, and we will use each prompt and function invocation pair as training data.
Generating every invocation is simpler for a function with fixed enums, but for a function such as `control_gimbal` we need to set the tilt and pan integer values. To generate those synthetic invocations we will first set a placeholder, and then later use `gpt-4o` to come up with reasonable values.
The functions below take in all the functions from the function list and enumerate all the potential invocations of those functions given each function's parameters. They also account for required parameters, so that all the invocations are actually feasible.
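The enumeration step can be sketched as follows. `enumerate_invocations` and `PLACEHOLDER` are hypothetical names, and this sketch only covers `required` parameters:

```python
from itertools import product

# Placeholder for free-form values (e.g. integers) to be filled in later,
# for instance by asking gpt-4o for reasonable numbers.
PLACEHOLDER = "<fill_in>"

def enumerate_invocations(function_spec):
    """List every invocation of one function.

    Takes the cartesian product of each required enum parameter's values,
    using PLACEHOLDER for parameters without a fixed enum.
    """
    fn = function_spec["function"]
    params = fn.get("parameters", {})
    names, choices = [], []
    for name in params.get("required", []):
        prop = params["properties"][name]
        names.append(name)
        choices.append(prop["enum"] if "enum" in prop else [PLACEHOLDER])
    return [
        {"name": fn["name"], "arguments": dict(zip(names, combo))}
        for combo in product(*choices)
    ]
```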
Let's generate every invocation for every function first.
Prompts:
In the snippet below, we generate the invocations of each function except for the `reject_request` function.
To perform effective fine-tuning we need correctly labeled data. We could manually come up with examples and label the data, or we can generate synthetic data with the help of `gpt-4o`.
Empirically, `gpt-4o` needs a bit more help to produce good, realistic examples of prompts that should trigger the `reject_request` function, so we'll handle that next.
Now that we have all the invocations, let's use `gpt-4o` to generate prompts that would result in those invocations.
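One way to build that request is a small prompt template per invocation; the exact wording used in the notebook may differ, and `build_prompt_request` is an illustrative helper:

```python
import json

def build_prompt_request(invocation):
    """Build the instruction sent to gpt-4o asking it to invent a realistic
    user command that should map to exactly this function invocation."""
    return (
        "You are helping create training data for a drone copilot assistant.\n"
        "Write one short, natural command a pilot might say that should result "
        "in exactly this function call:\n"
        f"{json.dumps(invocation)}\n"
        "Respond with the command only."
    )
```

Each returned string would then be sent to `gpt-4o` via the chat utility, and the model's reply paired with the invocation as one training example.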
Now let's format the training examples properly. For more documentation on the proper training data formatting for fine tuning for function calling, see here: https://platform.openai.com/docs/guides/fine-tuning/fine-tuning-examples
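One common shape for a function-calling training example is a `messages` list ending in an assistant `function_call`, alongside the `functions` the model may call. This sketch is an assumption based on the linked docs, so check them for the exact schema your API version expects:

```python
import json

def to_training_example(system_prompt, user_prompt, invocation, tools):
    """Format one (prompt, invocation) pair as a fine-tuning training example.

    `invocation` is {"name": ..., "arguments": {...}}; `tools` is a
    Chat Completions tools list whose inner specs become `functions`.
    """
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {
                "role": "assistant",
                "function_call": {
                    "name": invocation["name"],
                    # arguments are serialized as a JSON string, as in API output
                    "arguments": json.dumps(invocation["arguments"]),
                },
            },
        ],
        "functions": [t["function"] for t in tools],
    }
```

Each example would then be written as one line of a JSONL training file.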
Now, back to the rejection function. Let's generate some prompts that are nearly possible, but should result in the `reject_request` function being called. To do so, we query `gpt-4o`, asking for requests that are related to, but not quite possible with, the given list of functions.
Now let's combine all the training examples.
Fine tuning
Finally, we can kick off the fine-tuning job.
In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.
After a fine-tuning job has finished, you can also see metrics around how the training process went by querying the job, extracting a file ID from `result_files`, and then retrieving that file's content. Each result CSV file has the following columns: step, train_loss, train_accuracy, valid_loss, and valid_mean_token_accuracy. While metrics can be helpful, evaluating samples from the fine-tuned model provides the most relevant sense of model quality.
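Once the result CSV text has been fetched (for instance via the SDK's file-content endpoint), inspecting the final training step is a simple parse. `summarize_results` is an illustrative helper for the CSV columns described above:

```python
import csv
import io

def summarize_results(results_csv_text):
    """Parse the result-file CSV from a fine-tuning job and return the last row.

    Rows have the columns: step, train_loss, train_accuracy, valid_loss,
    valid_mean_token_accuracy (validation columns may be empty).
    """
    rows = list(csv.DictReader(io.StringIO(results_csv_text)))
    return rows[-1] if rows else None
```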
Evaluations
Great! We trained a fine-tuned model for function calling. Let's see how it does on our evaluation set of prompts that the drone assistant should automatically reject.
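The evaluation metric itself is just the fraction of responses that called `reject_request`; a minimal sketch, where each element is the name of the function the model called (or `None` for a plain-text reply):

```python
def rejection_rate(called_function_names):
    """Fraction of model outputs that called `reject_request`.

    Accepts a list of called-function names (None when no tool call was made)
    and returns a value in [0.0, 1.0].
    """
    if not called_function_names:
        return 0.0
    rejected = sum(1 for name in called_function_names if name == "reject_request")
    return rejected / len(called_function_names)
```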
Great! While the original model rejected only 60% of these requests, the fine-tuned model rejected 100% of them and used fewer tokens to do so.
Conclusion
Congratulations! You are now ready to fine tune your model for function calling. We can't wait to see what you build.
